
Ask a Data Scientist: Intro to Modern Natural Language Understanding

By Michael Tamir posted Thu April 11, 2019 12:37 PM


Isaiah Brown (IB): Hello Data Science Community. My name is Isaiah Brown, Community Marketing Associate with IBM and manager of the Data Science and Business Analytics Communities, and I'm joined by Will Roberts, IBM Data Science Technical Product Marketer. This is part three of the Ask a Data Scientist series with Mike Tamir, former Head of Data Science at Uber ATG and now Chief ML Scientist and Head of AI at Susquehanna International Group. How are you doing, Mike?


Mike Tamir (MT): I'm doing well. Good to be here.


IB: For this installment of the Ask a Data Scientist series, we're going to do an intro to modern natural language understanding. Mike, while I was exploring some of your work and research in this area, your project FakerFact.org stood out, especially in our current political climate. Ultimately, that's a natural language processing problem you're trying to solve and a tough one at that. Can you tell me about the project and help me understand natural language processing a little bit more?


MT: Absolutely. First, it’s important to say that natural language processing has really gone through a series of revolutions over the last six years. The initial game changer came with understanding how we can embed concepts (words) in a mathematically salient representation - this opened up the ability to do everything that we've seen since. 2013 specifically stands out as one of the first breakthrough years because that was the year that Word2Vec was first published, giving the community its initial exposure to neural word embedding algorithms. That sort of technology - and we can talk more about why it was such a game changer at the time - has enabled us to start to leverage a lot of really powerful algorithms for looking at the words in a text to answer questions that we couldn't answer before. And this really ranges across the entire gamut of natural language use cases: question answering systems, neural machine translation, text summarization, and text classification and analysis, which is fundamentally what something like a fake news detection AI is doing.


Will Roberts (WR): And it sort of sounds like ultimately with FakerFact.org, you're trying to categorize documents, right? You mentioned that word2vec was a starting point in being able to do this. Can you compare and contrast word2vec with a traditional natural language processing technique like TFIDF?


MT: Sure - TFIDF, which stands for Term Frequency-Inverse Document Frequency, is a scaling method that is part of the old regime of ways we've worked with text. There are still, I should say, some use cases where TFIDF alone is just as good as and faster than more modern techniques, but for the most part, when working with text, you want to start with something like word2vec or another embedding algorithm.


WR: And why is that?  


MT: When you think about how a machine is going to analyze text, or how it's going to turn text into the sort of thing it can run an algorithm on, you need to be able to represent those words as numbers that the machine will understand. In other words, you need to represent each word as a vector (an ordered set of numbers) that it can process. Historically, the way we did that was something called one hot encoding. And if you haven't heard of one hot encoding, it's relatively simple. You take every word in your vocabulary and assign it its own “slot” or dimension in the vector. Modern languages have about a million terms in their vocabulary. So if each one of the slots represents an individual word, then we have vectors that are a million dimensions long. It's called one hot encoding because the way you indicate an individual word is you have this string of a million slots or numbers, and all of them are zero except for the one slot (dimension) assigned to that particular word. For example, if we want to represent the word “a”, it would get a 1 in the slot assigned to that word (presumably the first slot if we are going alphabetically) and a 0 in all the other slots (words) to which it is not assigned. Now TFIDF comes in when we observe that some words are so common that they should get less weight, because they carry less information about the particular text you’re looking at than a word that only occurs in certain contexts. Roughly speaking, TFIDF gives higher weight to the rarer words.
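A minimal sketch of what one hot encoding looks like in practice, using a toy six-word vocabulary instead of the million-term vocabulary Mike describes (the words here are purely illustrative):

```python
import numpy as np

# Toy vocabulary; each word gets its own "slot" (dimension), ordered alphabetically.
vocab = sorted(["a", "cat", "mat", "on", "sat", "the"])
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector that is all zeros except for a 1 in the slot assigned to `word`."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

print(vocab)         # ['a', 'cat', 'mat', 'on', 'sat', 'the']
print(one_hot("a"))  # [1. 0. 0. 0. 0. 0.] -> "a" occupies the first slot
```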


IB: So what are the problems with one hot encoding then?


MT: There are two main problems. Problem number one has to do with the geometry: with one hot encoding, you can’t really get any significance out of doing mathematical operations on terms (adding or subtracting different terms, etc.). More generally, if we just think about two different terms and the distance between them in one hot encoding, they are all, in a sense, equally far away from each other: every word is 90° away from every other word, whether they're synonyms, they're antonyms, they mean related things, or they're completely unrelated to one another. Another way of saying this is that the meaning of the words has nothing to do with how they're represented mathematically, so you don’t capture any semantics in the representations of the words. The second major problem is that even if you keep only the important words - say you reduce from a million dimensions to 100,000, or even to 10,000 dimensions for the 10,000 most signal-rich words - you end up with what's called the curse of dimensionality. Geometry, in particular the way volume works, can be a little counterintuitive when you have so many dimensions (compared to our human intuitions about volume in a measly 3 dimensions). When you have too many dimensions, there are too many possible ways the words can be spread out across them for you to get a clear signal. This is a “sparsity” problem: we may have rich data, but it is spread too sparsely across so many dimensions to find a generalizable pattern. One way to see why this is a problem is to consider what happens when you add a dimension to your features. In order to maintain the same density of data - the same richness of data that allows you to run a machine learning algorithm, make an inference, and find a pattern that's repeatable - you need a multiple of the number of data points you had before. Best case scenario, if you add one more binary dimension, you need twice as many data points to maintain the same density. So if you add 10 dimensions, or 1,000 dimensions, very quickly you need far more data than your text can typically support. Finding a way to control the number of dimensions is critical, and the difficulty of doing this with one hot encoding was a major drawback.
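To make the geometry point concrete, here is a small sketch (again with a made-up six-word vocabulary and arbitrary slot assignments) showing that any two distinct one hot vectors are orthogonal, i.e. 90° apart, no matter how related the words are:

```python
import numpy as np

# Three one hot vectors in a 6-word toy vocabulary (slot assignments are arbitrary).
cat = np.eye(6)[1]   # "cat"
mat = np.eye(6)[2]   # "mat" -- a related word
the = np.eye(6)[5]   # "the" -- an unrelated word

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 means a 90° angle."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(cat, mat))  # 0.0 -> related words are "maximally far apart"
print(cosine(cat, the))  # 0.0 -> so are unrelated ones: meaning never enters the geometry
```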


WR: Can you tell us more about TFIDF? You mentioned it was a scaling process?


MT: Yes, TFIDF is really just a way of taking that one hot encoding of the terms and giving them an importance weighting. Some words are used all the time. They're used so often that they're kind of meaningless as far as making inferences from them goes. There are some exceptions to this, but we call those words “stop words”. Some other words are still common enough that they carry less importance compared to the words that are more rare. One concrete example we saw last year has to do with an anonymous document that was in the news for a couple of days and supposedly came from the Trump White House. All the pundits were talking about specific words to uncover the mystery of who wrote the document. For example, the word “lodestar” was pointed to as a reason for suspecting Mike Pence, and news organizations were playing clips of him using the uncommon term in different situations to suggest that he is more likely to say the word than the average person. We can't make any inference about that specific example, of course, but that's kind of the idea behind TFIDF. If you use a word that you don't really see very often, then maybe it's going to have more significance for making inferences about the text than words that you do see more often. And so TFIDF is just a way of using a term's frequency relative to its occurrence across documents to weight that importance. TFIDF can certainly be a big improvement. It helps to give significance to individual terms, but it still doesn't do anything to manage those two fundamental problems: the semantics are still not captured in the geometry of the mathematical representation of your words, and your dimensionality is still proportional to your vocabulary - which is typically too big for the volume of data you're going to be able to process.
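As a rough illustration of that weighting, here is a minimal scikit-learn sketch on a made-up three-document corpus; a rare, distinctive word like “lodestar” ends up with a larger weight than a word that shows up in every document:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration only.
docs = [
    "the senator spoke to the press about the budget",
    "the senator called the policy our lodestar in the speech",
    "the press asked the senator about the speech",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # rows: documents, columns: vocabulary terms

# TFIDF weights for the second document.
weights = dict(zip(vectorizer.get_feature_names_out(), X.toarray()[1]))
print(round(weights["lodestar"], 3))  # higher: appears in only one document
print(round(weights["senator"], 3))   # lower: appears in every document
```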


IB: You just described 2 big problems with one hot encoding that TFIDF doesn’t solve.  How does this get solved by word2vec?


MT: Neural word embeddings like word2vec solved both of these problems by letting the algorithm learn how to represent the “too many dimensions” one hot vectors in a space of vectors with far fewer dimensions. In practice, this tends to be in the several hundreds of dimensions rather than the tens to hundreds of thousands of dimensions you see with one hot encoding. Better yet, word2vec and its successors were able to take full advantage of those dimensions. Different words no longer have to be at 90° angles from each other in word2vec space. In fact, words with similar meanings end up relatively close to each other in the embedded word2vec space. And we can even do arithmetic on these words that makes sense, like the famous examples: the king-vector minus the man-vector plus the woman-vector equals the queen-vector, or the London-vector minus the England-vector plus the France-vector equals the Paris-vector.
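For readers who want to try the analogy arithmetic themselves, a quick sketch using gensim's pretrained GoogleNews word2vec vectors (note this triggers a download of roughly 1.6 GB):

```python
import gensim.downloader as api

# Pretrained 300-dimensional word2vec vectors trained on Google News.
vectors = api.load("word2vec-google-news-300")

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# London - England + France ≈ Paris
print(vectors.most_similar(positive=["London", "France"], negative=["England"], topn=1))

# Words with similar meanings land close together in the embedded space.
print(vectors.similarity("good", "great"))
```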


WR: How does it do this?


MT: It does this by exploiting what is almost a side effect of solving a simpler problem, namely, the “predict the missing word” problem. You train an algorithm to, for example, predict a missing word in a sentence. And by training a shallow neural net to find that missing word, what ends up happening is you get a representation of the one hot vectors remapped to the lower dimensional space that the neural net used to solve the simple problem. It re-represents these very long one hot encoding vectors so it can solve the missing word prediction task, but this new representation turns out to be helpful elsewhere as well.
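A minimal sketch of that training step with gensim's Word2Vec on a toy corpus (a real corpus would need to be far larger for the learned vectors to be meaningful):

```python
from gensim.models import Word2Vec

# Tiny, made-up corpus: a list of tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # embed each word into 50 dimensions instead of one slot per word
    window=2,        # how many context words to look at around each target word
    min_count=1,     # keep every word in this tiny corpus
    sg=0,            # CBOW: train the shallow net to predict a word from its context
)

print(model.wv["cat"].shape)                 # (50,)
print(model.wv.most_similar("cat", topn=2))  # neighbors learned from co-occurrence
```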


IB: When I look at FakerFact, it seems like it's classifying entire documents. Can you tell us how TFIDF vs. neural net based NLP implementations are relevant for different projects?


MT: Yes, so far we've just been focusing on individual words. There are all sorts of use cases for drawing inferences from the presence of individual words. But what about whole documents? If we were using some of these old style methodologies like TFIDF, what we would do is count up the frequency of different words. We would then use TFIDF to weight the importance of the presence or absence of those individual words. And then we would go ahead and feed that information - that very long vector - into an algorithm that can draw conclusions like “this sounds like satire” vs. “this doesn't sound like satire,” or “this sounds like an op-ed piece” vs. “this doesn't sound like an op-ed piece.” TFIDF works all right, but there are several things that are really important to take into account, especially when you're solving a problem like fake news detection. One of them is that word order matters. So what we really want to do is not just take individual terms, but look at them in the order in which they're composed to create full sentences, look at the sentences in the order in which they're composed to create a full paragraph, and then paragraphs in order to create full documents, and so forth.

To do that, word2vec was really the starting point. With word2vec you've got the semantics of the individual terms, and using other neural networks like LSTMs and Transformers you can combine these sequences of semantically rich vectors and start embedding entire sentences, paragraphs, or documents. Specifically, LSTMs - Long Short Term Memory units - can look at words in order and represent them mathematically as sequences. Recurrent neural networks like LSTMs have the ability to “read” in those words in order and then come up with an encoding of that sequence of vectors that represents the entire string most effectively for the task at hand. Sometimes the task at hand is to produce a numerical encoding - a vector encoding of the entire sentence - because you want to translate from English into that numerical encoding and then, in a sense, run the process in reverse to turn the numerical encoding back into words, but this time in German. Another use case is text summarization, where you do that same process but go from English to English, and you add other elements to the neural network architecture that find the most important word or words in each sentence, because that then influences how you unravel the longer text into a shorter summary. Over the past year, we’ve also seen a lot of traction with Transformers, which can similarly take in sequences of vectors and output another sequence (or an individual encoding) that captures what is most important about the sequence for the task at hand.

For the use case of fake news, some things are going to stand out in a bit of text more than others. In particular, when you're in the business of trying to detect a hidden agenda, you’re looking for the subtle kinds of manipulation and methods that fake news might use to trigger an emotional reaction. While that seems like a very complex thing to try to detect, it is possible with these sorts of algorithms to pick up on those patterns if you feed your algorithm enough examples of biased text on the one hand vs. text that is just trying to share the facts on the other.
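To ground the “old style” pipeline Mike describes, here is a rough sketch of TFIDF-based document classification with scikit-learn. The texts and labels are invented for illustration (this is not the FakerFact model), and a sequence model such as an LSTM or Transformer would replace the bag-of-words step in order to take word order into account:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training examples for illustration only.
texts = [
    "Local council approves budget after routine vote",
    "SHOCKING: you won't believe what they are hiding from you",
    "Quarterly report shows modest growth in exports",
    "Wake up! The truth they don't want you to see",
]
labels = ["journalism", "clickbait", "journalism", "clickbait"]

# Count words, weight them by TFIDF, then fit a simple linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["You won't believe this one hidden secret"]))
```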


IB: Mike, that's all the time we have for today, but thank you so much for joining Will and me for this conversation. We look forward to part two of this conversation, with a deeper dive into modern natural language processing, how LSTMs work, and other more advanced context-aware algorithms.
#GlobalAIandDataScience
#GlobalDataScience