Methods of semi-supervised and weakly supervised learning in NLP

By TAKESHI INAGAKI posted Sat June 19, 2021 08:46 AM


Natural language processing (NLP) is one of the most promising application areas of machine learning technology. As in other application areas, preparing labeled training data demands a large amount of human work and consumes a lot of time. Semi-supervised learning and weakly supervised learning are methods that are expected to reduce this workload: they combine unlabeled or imperfectly labeled data with correctly labeled data to train an NLP model. In this article, we explain two examples of these techniques used in Watson Discovery and Watson Knowledge Studio.


A typical implementation of semi-supervised learning trains the lower layers of a neural network with an unsupervised learning algorithm and the upper layers with a supervised learning algorithm. For NLP, a benefit of having unsupervised lower layers is that the model becomes able to recognize words that are not included in the supervised training data. The expectation here is that similar words form clusters, and those clusters are used as input features for the supervised model in the upper layers. This technique is called word embedding.

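As a rough sketch of this pattern, the example below trains word embeddings on a toy unlabeled corpus (the unsupervised lower layer) and feeds averaged word vectors into a small classifier (the supervised upper layer). The corpus, the labels, and the choice of gensim and scikit-learn are assumptions made for illustration, not details of Watson's implementation.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Unsupervised "lower layer": word embeddings trained on unlabeled text.
unlabeled_corpus = [
    "watson answers questions in natural language".split(),
    "the engineer asked watson a question".split(),
    "machine learning models require training data".split(),
    "labeled data is expensive to prepare".split(),
]
w2v = Word2Vec(sentences=unlabeled_corpus, vector_size=50,
               window=3, min_count=1, epochs=50)

def embed(text):
    """Average the word vectors of a sentence; unseen words are skipped."""
    vecs = [w2v.wv[w] for w in text.split() if w in w2v.wv.key_to_index]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

# Supervised "upper layer": a classifier trained on a small labeled set.
labeled_texts = ["watson answers questions", "labeled data is expensive"]
labels = ["ai", "data"]  # hypothetical labels for illustration
clf = LogisticRegression().fit([embed(t) for t in labeled_texts], labels)

# Words that never appeared with a label can still contribute to a
# prediction through their embeddings.
print(clf.predict([embed("the engineer asked a question")]))
```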

One variation of this semi-supervised learning is the dictionary suggestion feature of Watson Discovery, which uses an unsupervised learning model to help create a dictionary. A dictionary is the simplest type of supervised learning model. In this process, users selectively put words suggested by the unsupervised model into the dictionary instead of embedding all of them in it. However, it uses the same information as word embedding to extend the set of words the model can recognize. In that sense, a process of dictionary creation with suggestions can be regarded as a kind of semi-supervised learning.

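The mechanism can be sketched in the same setting: nearest neighbors of a seed term in embedding space become candidate dictionary entries for the user to review. Again, the toy corpus and the library choice are assumptions for illustration; this is not Watson Discovery's actual code.

```python
from gensim.models import Word2Vec

# Unsupervised embeddings trained on unlabeled text, as in the previous sketch.
corpus = [
    "watson answers questions in natural language".split(),
    "siri answers spoken questions on the phone".split(),
    "alexa answers questions at home".split(),
    "the report was printed on plain paper".split(),
]
w2v = Word2Vec(sentences=corpus, vector_size=50, window=3,
               min_count=1, epochs=100)

seed_terms = ["watson"]  # terms the user has already put in the dictionary
for term in seed_terms:
    # Nearest neighbors in embedding space become dictionary suggestions;
    # the user accepts them selectively instead of adding all of them.
    for candidate, similarity in w2v.wv.most_similar(term, topn=3):
        print(f"suggest: {candidate} (similarity {similarity:.2f})")
```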

A related but different approach to reducing the effort of creating training data is weakly supervised learning. Instead of using information gained from unsupervised learning, it creates training data with less effort by allowing errors in it. The bulk annotation feature of Watson Knowledge Studio is an example. This capability annotates all mentions of a word in the training data at once by searching for them. To know whether a word mentioned in a text document really represents a given entity, users need to confirm the context of the word by reading the entire sentence. For example, we know the word “Watson” is used to represent various entities: in one case it is a “person”, in another it is “AI technology”. If we annotate all mentions of “Watson” in the training data with the entity label “AI technology”, some of them may not be correct. Whether this error in the training data is acceptable depends on the case. However, in some cases, having a large enough amount of training data is more important than having data that is 100% correct. That is the idea of weakly supervised learning.

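A minimal sketch of this idea, assuming bulk annotation can be modeled as a plain text search: every match of the word receives the same entity label, so some annotations are knowingly wrong. The documents and entity label are hypothetical.

```python
import re

# Hypothetical training documents; the second mention of "Watson" is a
# person, so bulk-labeling it "AI technology" introduces a deliberate error.
documents = [
    "Watson beat human champions on the quiz show.",
    "Dr. Watson assisted Sherlock Holmes.",
]

def bulk_annotate(docs, word, label):
    """Annotate every occurrence of `word` with `label`, errors included."""
    annotations = []
    for doc_id, text in enumerate(docs):
        for m in re.finditer(re.escape(word), text):
            annotations.append({"doc": doc_id, "start": m.start(),
                                "end": m.end(), "label": label})
    return annotations

# Fast to produce and mostly correct; weak supervision trades this noise
# for a larger volume of training data.
for a in bulk_annotate(documents, "Watson", "AI technology"):
    print(a)
```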

Customization of NLP models can be done with less effort if we understand the nature of these techniques correctly and use them in an appropriate manner.

#BuildwithWatsonApps
#EmbeddableAI