Global AI and Data Science

 View Only

Topic Modelling

By Moloy De posted Fri December 23, 2022 10:54 PM

Topic modeling is an unsupervised machine learning approach that can scan a series of documents, find word and phrase patterns within them, and automatically cluster the word groupings and related expressions that best represent the set. This type of machine learning is known as 'unsupervised' because it requires neither a preexisting list of tags nor training data previously categorized by humans. It should not be confused with topic classification models, which are 'supervised' machine learning techniques: there, the topics must be known before the texts are analyzed, and data is manually labeled with those topics so that a topic classifier can learn and make predictions on its own. We are not going to talk about topic classification here.

To infer topics from unstructured data, topic modeling counts words and groups similar word patterns. Suppose we are a software firm that wants to learn what consumers have to say about specific elements of our product. Instead of spending hours trying to figure out which messages mention our topics of interest, we can use a topic modeling algorithm to examine the comments for us.

A topic model groups comparable feedback, together with the phrases and expressions that appear most frequently, by recognizing patterns such as word frequency and the distance between words. Using this information, we can rapidly infer what each group of texts is about. Five algorithms are particularly common in topic modeling.
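Before any of these algorithms run, the corpus is typically reduced to a document-term matrix of word counts. A minimal sketch of that counting step in plain Python, using a few made-up feedback messages as illustrative data:

```python
from collections import Counter

# Toy corpus of short feedback messages (illustrative data only).
docs = [
    "the app crashes on startup",
    "great interface and great design",
    "startup time is slow and the app crashes",
]

# Build a sorted vocabulary, then count each word per document.
vocab = sorted({w for d in docs for w in d.split()})
dtm = [[Counter(d.split())[w] for w in vocab] for d in docs]

# Each row of the document-term matrix holds one document's word counts;
# topic-modeling algorithms factor or cluster this matrix.
print(vocab)
print(dtm)
```

Real pipelines would also lowercase, remove stopwords, and stem or lemmatize before counting, but the resulting matrix has the same shape: one row per document, one column per vocabulary term.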

Latent Dirichlet Allocation (LDA):

The statistical and graphical model of Latent Dirichlet Allocation is used to find relationships between the documents in a corpus. The maximum likelihood estimate over the entire corpus of text is obtained using the Variational Expectation Maximization (VEM) technique.

Traditionally, one might simply select the top few words from a bag-of-words representation, but such a list carries little meaning on its own. Under LDA, each document is represented by a probabilistic distribution over topics, and each topic is defined by a probabilistic distribution over words. As a result, we get a much better picture of how the topics are related.

Consider the following scenario: you have a corpus of 1,000 documents, and after preprocessing the bag of words contains 1,000 common words. Using LDA, we can determine the topics that are relevant to each document, which makes extracting information from the corpus straightforward. Viewed as a three-level model, the upper level represents the documents, the middle level the generated topics, and the bottom level the words. The rule, then, is that a text is represented as a distribution over topics, and each topic is described as a distribution over words.
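The document-over-topics and topic-over-words distributions can be estimated in several ways. The post mentions VEM; the sketch below instead uses collapsed Gibbs sampling, another standard estimation method for LDA, on a tiny hand-made corpus. All data and parameter choices here are illustrative assumptions, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01):
    """Collapsed Gibbs sampler for LDA over tokenized documents."""
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables: document-topic, topic-word, and topic totals.
    ndk = np.zeros((len(docs), n_topics))
    nkw = np.zeros((n_topics, V))
    nk = np.zeros(n_topics)
    z = []  # topic assignment of every token
    for d, doc in enumerate(docs):
        zs = rng.integers(n_topics, size=len(doc))
        z.append(zs)
        for w, k in zip(doc, zs):
            ndk[d, k] += 1; nkw[k, w2i[w]] += 1; nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = z[d][i], w2i[w]
                # Remove the token's current assignment, then resample
                # from the conditional p(topic | all other assignments).
                ndk[d, k] -= 1; nkw[k, wi] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, wi] + beta) / (nk + V * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, wi] += 1; nk[k] += 1

    # Per-document topic distribution (rows sum to 1).
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    return vocab, theta, nkw

docs = [s.split() for s in [
    "cats dogs pets cats", "dogs cats animals",
    "python code bugs", "code python software bugs",
]]
vocab, theta, nkw = lda_gibbs(docs, n_topics=2)
print(np.round(theta, 2))  # each row: one document's mix of the 2 topics
```

With a clearly separated corpus like this, the animal documents and the software documents end up dominated by different topics, matching the "document as a distribution over topics" picture above.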

Non Negative Matrix Factorization (NMF):

NMF is a matrix factorization method that constrains the elements of the factorized matrices to be non-negative. Consider the document-term matrix produced after deleting stopwords from a corpus. It can be factored into two matrices: a term-topic matrix and a topic-document matrix. The factorization may be accomplished using a variety of optimization methods; Hierarchical Alternating Least Squares (HALS) makes NMF faster and more effective by updating one column at a time while leaving the other columns unchanged.
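A minimal numpy sketch of NMF on a toy document-term matrix. It uses the classic Lee–Seung multiplicative updates rather than the HALS scheme mentioned above (a simpler but slower alternative); the matrix and rank are illustrative assumptions:

```python
import numpy as np

def nmf(X, rank, n_iter=500, eps=1e-9):
    """Factor X ≈ W @ H with non-negative W, H via Lee–Seung
    multiplicative updates (not the faster HALS variant)."""
    rng = np.random.default_rng(1)
    n, m = X.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-term matrix: rows are documents, columns are term counts,
# with two visible "topic" blocks.
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [0, 0, 4, 3],
    [0, 1, 3, 4],
], dtype=float)

W, H = nmf(X, rank=2)  # W: document-topic weights, H: topic-term weights
err = np.linalg.norm(X - W @ H)
print(round(err, 3))   # small reconstruction error on this near-rank-2 matrix
```

Because both factors are non-negative, each row of H can be read directly as a topic's term weights, which is why NMF topics are often considered easy to interpret.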

Latent Semantic Analysis (LSA):

Latent Semantic Analysis is another unsupervised learning approach for extracting relationships between words across a large number of documents, which helps us select the appropriate documents. It essentially serves as a dimensionality-reduction tool for a massive corpus of text data, since the extraneous dimensions add noise to the process of extracting the right insights from the data.
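The dimensionality reduction in LSA is a truncated singular value decomposition of the document-term matrix. A short numpy sketch on assumed toy data, keeping only the top k singular directions and comparing documents in the reduced space:

```python
import numpy as np

# Toy document-term matrix (rows: documents, columns: term counts).
# Docs 0-1 share terms 0-1; docs 2-3 share terms 3-4.
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 1, 1, 2],
], dtype=float)

# LSA = truncated SVD: keep the k largest singular values/vectors.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vecs = U[:, :k] * s[:k]   # documents in the k-dimensional latent space
term_vecs = Vt[:k].T          # terms in the same latent space

def cos(a, b):
    """Cosine similarity between two latent-space vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cos(doc_vecs[0], doc_vecs[1]), 3))  # same-theme documents
print(round(cos(doc_vecs[0], doc_vecs[2]), 3))  # different-theme documents
```

Documents that share vocabulary land close together in the latent space even when they use no identical words, which is the noise-reducing effect the paragraph above describes.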

Partially Labeled Dirichlet Allocation (PLDA):

Sometimes also called Parallel Latent Dirichlet Allocation, this model assumes there are a total of n labels, each associated with a separate topic in the corpus. Then, as in LDA, the individual topics are represented as probability distributions over the entire corpus. Optionally, each document may also be allocated a global topic, resulting in l global topics, where l is the number of individual documents in the corpus.
The technique also assumes that every topic in the corpus has just one label. Because the labels are supplied before the model is built, this procedure is very fast and precise in comparison to the other approaches.

Pachinko Allocation Model (PAM):

The Pachinko Allocation Model (PAM) is a more advanced version of Latent Dirichlet Allocation. LDA identifies topics from thematic correlations between words in the corpus, bringing out the correlations between words; PAM additionally models the correlations between the generated topics themselves. Because it also considers the links between topics, this model is better able to capture semantic relationships precisely. The model is named after Pachinko, a popular Japanese game, and it uses Directed Acyclic Graphs (DAGs) to explore the associations between topics.

An early topic model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998. Another one, called probabilistic latent semantic analysis (PLSA), was created by Thomas Hofmann in 1999. Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSA. Developed by David Blei, Andrew Ng, and Michael I. Jordan in 2003, LDA introduces sparse Dirichlet prior distributions over document-topic and topic-word distributions, encoding the intuition that documents cover a small number of topics and that topics often use a small number of words. Other topic models are generally extensions of LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Hierarchical latent tree analysis (HLTA) is an alternative to LDA that models word co-occurrence using a tree of latent variables; the states of the latent variables, which correspond to soft clusters of documents, are interpreted as topics.

In practice, researchers attempt to fit appropriate model parameters to the data corpus using one of several heuristics for maximum likelihood fit. A recent survey by Blei describes this suite of algorithms. Several groups of researchers, starting with Papadimitriou et al., have attempted to design algorithms with provable guarantees: assuming that the data were actually generated by the model in question, they try to design algorithms that provably find the model that was used to create the data. Techniques used here include singular value decomposition (SVD) and the method of moments. In 2012 an algorithm based upon non-negative matrix factorization (NMF) was introduced that also generalizes to topic models with correlations among topics. In 2018 a new approach to topic models was proposed, based on the stochastic block model.

QUESTION I: What is topic classification and how is it different from topic modelling?
QUESTION II: What are the other types of unstructured data besides text where topic modelling is used?

REFERENCE: Topic Modelling Wikipedia, What is Topic Modelling in NLP?