NOTE: Audio file is at the bottom of this page.
Will Roberts (WR): Hi, my name's Will Roberts. I'm a data scientist and evangelist at IBM and with me is Mike Tamir. Mike, how are you?
Mike Tamir (MT): Great. Thanks for having me today.
WR: Of course! Last time we spoke, we promised that we'd follow up with you again and talk about LSTMs, and since then we've been reading a lot about sequence modeling generally. Where should we start in this conversation?
MT: So sequence modeling is a relatively broad topic. We did talk about, you know, word embeddings and how we kind of broke through from individual word tokens, or categorical representations of tokens as one-hot vectors, to embeddings. When we think about sequence modeling for words, words are given in the form of sentences, which are sequences of words. And so modeling that sequence becomes very valuable if you want to get the full context of the meaning of each term as it's coming out. Now, the first place to start is probably with recurrent neural networks. As techniques for using those words in order and modeling them, recurrent neural networks are designed to look at data in order and then pass along relevant information, or pass along some information. And more advanced recurrent neural networks, like LSTMs or gated recurrent units, learn not just to pass on information, but also to keep track of what's important to pass on and what's not so important to pass on over a longer scope of time steps or sequence steps.
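To make "look at data in order and pass along information" concrete, here is a minimal sketch of a plain recurrent step in standard-library Python. It is not an LSTM (there are no gates or cell state), and the toy embeddings, the weight matrices `W_X` and `W_H`, and the function names are all assumptions chosen for illustration, not anything from the conversation.

```python
import math

def rnn_step(x, h, w_x, w_h):
    """One recurrent step: combine the current input x with the previous state h."""
    return [
        math.tanh(
            sum(wx * xj for wx, xj in zip(w_x[i], x))
            + sum(wh * hk for wh, hk in zip(w_h[i], h))
        )
        for i in range(len(h))
    ]

def encode(sequence, w_x, w_h, hidden_size):
    """Read the vectors in order, passing the hidden state from step to step."""
    h = [0.0] * hidden_size
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h)
    return h

# Hypothetical 2-dimensional word embeddings and arbitrary fixed weights.
EMBED = {"river": [1.0, 0.0], "bank": [0.0, 1.0]}
W_X = [[0.5, -0.3], [0.2, 0.8]]
W_H = [[0.1, 0.4], [-0.6, 0.3]]

forward = encode([EMBED["river"], EMBED["bank"]], W_X, W_H, 2)
backward = encode([EMBED["bank"], EMBED["river"]], W_X, W_H, 2)
```

Because the state is threaded through the loop, reading the same two words in the opposite order produces a different final vector, which is the sense in which the encoding depends on the sequence and not just on the set of words.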
MT: If you think about how an LSTM, a long short-term memory unit, is going to go through words in order, what it's doing is it's looking at those terms. There are different components to an LSTM. There's a forget gate. There's a remember gate. There's an internal cell state that it's keeping track of as it's going along. And by the time you get to the end of the sentence, the hope is that if this model is learning, it's going to capture the relevant encoding of the entire meaning of that sentence that's relevant to the task that you're training your neural network to accomplish.
Often you don't just read the sentence from front-to-back, but you also read it the other way, and that's called a bi-LSTM, or bidirectional LSTM. What these do is they'll run through every word in order, and then you get the context of the entire sentence by the time you get to the end of the sentence front-to-back. They'll do the other direction as well. So you go through every word in order from back-to-front, and you'll get the full context of the sentence from that direction as well. Now, if you look at those two vectors together, that's going to be double the context, a doubled encoding of the meaning, and hopefully, if you concatenate those two vectors together, that's going to capture the entire meaning of the word.
Bringing that home to your question: when you talk about attention mechanisms, at every step along the way, for every word in the longer sentence, you're getting front-to-back and back-to-front. You're getting a partial context of the sentence up until that word. And if it's a bidirectional LSTM, you're getting the encoding of that word, pivoting around that word: from one of the bi-LSTM sides, front-to-back, and from the other bi-LSTM side, the remainder of the sentence. And what that does is it captures the importance of the sentence, or rather the importance of the word relative to that entire sentence. It makes a pivot point.
And attention mechanisms are looking at all of those outputs of the sentence from the perspective of that word, so to speak, and then figuring out a way of scoring the importance of each of those words - to which of those words should the most attention be paid in order to get the most information out of that sentence.
This is very useful for text classification, and it's very useful for text summarization. And more broadly, in the past 18-20 months, we've also found that attention mechanisms on their own can be very useful even without the LSTM, if you really push hard on just using attention mechanisms themselves. You can kind of kick away the ladder and not even encode the starting point of a sentence with an LSTM.
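The scoring idea described above can be sketched in a few lines of standard-library Python. This is a simplified illustration under stated assumptions: each word already has an encoding vector (for example from a bi-LSTM), the score is a plain dot product against a hypothetical `query` vector, and the numbers are made up. Real attention layers learn their scoring function during training.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(word_vectors, query):
    """Score every word against the query, then build a weighted summary vector."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in word_vectors]
    weights = softmax(scores)
    dim = len(word_vectors[0])
    summary = [
        sum(w * vec[d] for w, vec in zip(weights, word_vectors))
        for d in range(dim)
    ]
    return weights, summary

# Hypothetical per-word encodings; in practice these would come from the encoder.
words = ["the", "movie", "was", "wonderful"]
encodings = [[0.1, 0.0], [0.4, 0.2], [0.0, 0.1], [0.9, 1.2]]
query = [1.0, 1.0]  # assumed task-specific vector

weights, summary = attend(encodings, query)
most_attended = words[weights.index(max(weights))]
```

The weights say how much attention each word receives, and the summary vector is what a downstream classifier or summarizer would consume.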
MT: Right, so maybe it's worthwhile - I hinted that attention mechanisms have been a part of a lot of the advances we've seen over the past year, year and a half - to maybe elaborate on that. Starting at the end of 2017, a bunch of researchers at Google decided to see if they could push that to the limit, and what that looked like was actually coming up with what are called multi-headed attention mechanisms. So you take several (8, 16, etc.) attention mechanisms. Each of them is looking at all of the words in the sentence, and trying to encode the meaning of a word in a sentence in the context of all the other words in that sentence. And by doing this with so many different avenues of looking at the attention from different aspects, what you end up doing is getting a very effective encoding of the words in a sentence.
And then they use a similar multi-headed attention network structure in order to then decode. So you're taking the words in order for a sentence that you want to translate, encoding it into a sequence of vectors, and then decoding that sequence of vectors into a new sentence. And that new sentence can be in a different language. It could be for generating entirely new text from what you feed it. So that's something that we saw with GPT-1, which was the predecessor of GPT-2, and with all of these methods. In particular, one of the biggest winners now, one that people are taking a lot of advantage of, is the BERT algorithm, which is again using these multi-headed attention network structures and stacking them together in order to represent the meanings of words in context.
What that's done is allowed us to really open up the floodgates on solving not just sequence-to-sequence tasks, like translation from one sentence to another sentence, but a whole host of tasks in natural language processing.
To be a little bit more specific, let's take a step back and think about what was happening with, like, word2vec and GloVe and some of these single-term encodings. With those, it's a shallow network. It's not a deep neural network, and it's finding the best matrix representation that projects every word in the language down to a low-rank subspace, usually a few hundred dimensions, and that captures the semantics of that term. The big problem with that is in encoding a token, so an instance of where you've spelled B-A-N-K, bank: if you take that token and map it to a certain vector, the different instances of that token are going to have vastly different meanings, including different, unrelated meanings, depending on the context of the different sentences. An example is "I crossed the river to get to the bank" versus "I crossed the street to get to the bank." And by looking at the encoding of the term "bank" in context with the other words, as it appears in those sentences, you can now have a vector representation of the word "bank" in the first sentence that is distinguishable from the vector representation of the word "bank" in the second sentence. You can peel apart those different meanings of the homonyms (or, since homonyms are sometimes spelled differently, of tokens that are spelled identically), and this has been very powerful just at the word level, just at the level of being able to disambiguate the sense of a term in the sentence spoken.
But that's just where it starts. With the BERT algorithm, by playing a game where it's strategically masking certain words, then training the multi-headed attention units that are stacked together in order to fill in those blanks, they found out that they could have it solve all sorts of tasks.
And in particular, it broke records late last year on GLUE, the General Language Understanding Evaluation tasks; it beat all the benchmarks. So this is good for filling in the missing context. It's good for disambiguating different sentences. It's also good for doing a sequence-to-sequence mapping. But it's also useful for being able to do classification tasks themselves.
And what they do there is they have this special token, this CLS token, that they put at the beginning of the sentence - CLS for classification. Then that CLS token gets vectorized based on all the other words in their context, and then they feed the output for that CLS token into solving different classification tasks. Some of those classification tasks can be of the form, "what's the sentiment of this sentence or this text?" Or you could use two sentences, and ask "does the first sentence look like a translation of the second sentence?", or "does the second sentence look like an answer to the question posed by the first sentence?", "what's the natural language inference relationship between the first sentence and the second sentence?", or "is the second sentence entailed by the first sentence?" And so it really breaks open a lot of the different tasks you want to do. Not just doing sequence-to-sequence, which is the original use case for these multi-headed attention encoders, but also solving all sorts of problems. In fact, most of the standard ones that are used in the NLP world for benchmarking.
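The masking game and the classification token can be illustrated at the input-preparation level. The sketch below is a simplification under stated assumptions: it splits on whitespace rather than using BERT's subword tokenizer, it uses a `[SEP]` separator between sentences (not mentioned in the conversation, but part of BERT's input format), and it always substitutes `[MASK]`, whereas the published recipe masks about 15% of tokens and sometimes keeps or swaps the word instead. The function names and the 0.3 masking rate are illustrative only.

```python
import random

CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"

def build_input(first, second=None):
    """Prepend the classification token and separate the sentence(s)."""
    tokens = [CLS] + first.split() + [SEP]
    if second is not None:
        tokens += second.split() + [SEP]
    return tokens

def mask_tokens(tokens, mask_prob, rng):
    """Hide some ordinary words; remember what was hidden so a model can be trained to fill it in."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok not in (CLS, SEP) and rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # position -> original word to predict
        else:
            masked.append(tok)
    return masked, targets

tokens = build_input("I crossed the river", "I reached the bank")
masked, targets = mask_tokens(tokens, 0.3, random.Random(0))
```

During pre-training the model would be asked to recover `targets` from `masked`; for classification, the output vector at position 0 (the classification token) would be passed to a small task-specific classifier.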
MT: Well, no doubt ... you might suspect at least that the reason why this sort of technology has been developed so well, primarily out of Google research, is they want to sell more TPU hours, right? That's a little bit tongue in cheek, but it turns out that LSTMs are inherently sequential. Recurrent neural networks in general are inherently sequential, which means that acceleration from things like GPUs and TPUs is limited. One of the really big advantages, even if it wasn't also giving advantages in evaluation, is the advantage in performance. That is to say, the speed with which these calculations can be made in a multi-headed attention context versus in an LSTM context. Fundamentally, what's happening is it's just a bunch of dot products and matrix multiplications. It's primarily that. And because it doesn't need to store hidden states, pass them on to the next step, and so forth, over and over again sequentially like a recurrent neural network, there are no recurrent elements to that. If you parallelize the major computations, you can actually execute them in a distributed way on GPUs or TPUs much more quickly. So actually you will save money and time if you switch over to these different kinds of architectures.
Now, having said that, regardless of whether LSTMs could beat this on their own, earlier this year they came up with Transformer-XL. So the entire sequence-to-sequence architecture, using multi-headed attention networks to encode and then decode sentences as a sequence-to-sequence process, is called a transformer architecture: transforming one sequence to the next. Transformer-XL seems to be bringing back some of that fundamental, inherently sequential processing, mostly because it's useful for, or designed for, tasks where you want to detect long-range relationships that may not be captured by a traditional transformer.
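The point about parallelism can be seen in a small standard-library sketch of self-attention. It is a single head with no learned projections, and the input vectors and function names are assumptions for illustration. The thing to notice is that the output at each position is computed from the inputs alone, never from another position's output, so the positions can be processed in any order (or all at once on parallel hardware), unlike a recurrent loop where step t must wait for step t-1.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_row(i, X):
    """Output for position i: dot products against every input, then a weighted sum."""
    d = len(X[0])
    scores = [
        sum(a * b for a, b in zip(X[i], X[j])) / math.sqrt(d)
        for j in range(len(X))
    ]
    weights = softmax(scores)
    return [sum(w * X[j][k] for j, w in enumerate(weights)) for k in range(d)]

def self_attention(X, order=None):
    """Compute all positions; the visiting order is irrelevant because rows are independent."""
    order = range(len(X)) if order is None else order
    out = [None] * len(X)
    for i in order:
        out[i] = attention_row(i, X)
    return out

# Hypothetical input vectors for a three-token sequence.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
in_order = self_attention(X)
shuffled = self_attention(X, order=[2, 0, 1])
```

Here `in_order` and `shuffled` hold identical results, which is exactly the property that lets GPUs and TPUs compute every position as one batch of matrix multiplications.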