(Migrated from old blog)
As the name suggests, topic modeling refers to determining the topic (or topics) of a text. Imagine building a news platform where incoming articles are automatically categorized by the type of news they contain: technology, politics, weather, and so on.
The input to the algorithm is a document-term matrix, where each row is a document and each column counts how often a term occurs in it. Documents and topics are treated as sets of words whose order doesn’t matter, so this is a bag-of-words representation.
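To make the bag-of-words idea concrete, here is a minimal sketch of building a document-term matrix by hand. The three toy documents and the helper name `to_row` are made up for illustration; a real pipeline would also handle punctuation, stop words, and so on.

```python
from collections import Counter

# Toy corpus (hypothetical). The first two documents contain the same
# words in a different order.
docs = ["dog bites man", "man bites dog", "stocks fall as rates rise"]

# Naive tokenization: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary fixes the column order of the matrix.
vocab = sorted({word for tokens in tokenized for word in tokens})

def to_row(tokens):
    """Return the count of every vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# One row per document, one column per vocabulary word.
doc_term = [to_row(tokens) for tokens in tokenized]

for doc, row in zip(docs, doc_term):
    print(f"{doc!r:30} {row}")
```

Because only counts are kept, "dog bites man" and "man bites dog" end up with identical rows, which is exactly what "order doesn’t matter" means here.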
The topic modeling algorithms that we are going to discuss share two assumptions:
- each document consists of a mixture of topics
- each topic consists of a collection of words
LSA: Latent Semantic Analysis
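In this, we generate a document-term matrix $$ A $$ of size $$ m \times n $$ and approximate it with a truncated singular value decomposition (SVD):
$$ A \approx U_t S_t V_t^T $$
Where $$ m $$ is the number of documents, $$ n $$ is the number of words in the vocabulary, and $$ t $$ is the number of topics we keep. $$ U_t \in \mathbb{R}^{m \times t} $$ is the document-topic matrix, $$ V_t \in \mathbb{R}^{n \times t} $$ is the term-topic matrix, and $$ S_t $$ is the diagonal matrix of the $$ t $$ largest singular values.
So, each row of $$ U_t $$ is a vector representing a document in terms of topics, and each row of $$ V_t $$ is a vector representing a word in terms of topics.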
With these document and term vectors, we can easily apply measures such as cosine similarity to evaluate:
- The similarity of different documents
- The similarity of different words
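As a rough sketch of LSA, the snippet below runs a truncated SVD on a tiny document-term matrix with NumPy and compares documents and words by cosine similarity. The matrix values, the choice of `t = 2` topics, and the convention of scaling the vectors by the singular values are all assumptions made for illustration.

```python
import numpy as np

# Toy document-term matrix (rows: documents, columns: terms).
# Values are made up: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

t = 2  # number of topics to keep (assumed)

# Full SVD, then keep only the t largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = U[:, :t] * s[:t]       # one row per document, in topic space
term_vecs = Vt[:t, :].T * s[:t]   # one row per term, in topic space

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("doc 0 vs doc 1:", round(cosine(doc_vecs[0], doc_vecs[1]), 3))
print("doc 0 vs doc 2:", round(cosine(doc_vecs[0], doc_vecs[2]), 3))
print("term 0 vs term 1:", round(cosine(term_vecs[0], term_vecs[1]), 3))
print("term 0 vs term 2:", round(cosine(term_vecs[0], term_vecs[2]), 3))
```

Documents that share vocabulary end up close together in the reduced space, while documents with disjoint vocabulary end up far apart. For large sparse matrices, a dedicated truncated SVD routine would likely be a better fit than the full decomposition used here.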
LSA is quick and easy to use but has a few drawbacks:
- Lack of interpretable embeddings: we don’t know what the topics are, and their components can be arbitrarily positive or negative.
- It needs huge datasets to perform well.
- The representation is less efficient.
pLSA: Probabilistic Latent Semantic Analysis
pLSA is a probabilistic version of LSA. Instead of SVD, it models the document-topic and word-topic relationships as probability distributions. The assumed generative process is this: for each word in a document, you think of a topic, then pick a word from that topic's word distribution and write it down. The goal is to choose the distributions such that the probability of generating the observed collection is maximal.
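Okay, so there is a good amount of math in this. Let us fix the notation first: $$ d $$ is a document, $$ w $$ is a word, $$ t $$ is a topic, and $$ n_{dw} $$ is the number of times word $$ w $$ occurs in document $$ d $$.
According to the law of total probability,
$$ P(w) = \sum_{t} P(w|t) P(t) $$
Similarly, conditioning on the document and assuming that a word depends on the document only through its topic, that is $$ P(w|t,d) = P(w|t) $$,
$$ P(w|d) = \sum_{t} P(w|t,d) P(t|d) = \sum_{t} P(w|t) P(t|d) $$
We now will calculate the log likelihood and maximize it over its arguments $$ P(w|t) $$ and $$ P(t|d) $$:
$$ \log \prod_{d} P(d) \prod_{w} P(w|d)^{n_{dw}} \to \max $$
Which, up to terms that do not depend on those arguments, is equal to
$$ \sum_{d} \sum_{w} n_{dw} \log \sum_{t} P(w|t) P(t|d) \to \max $$
We have normalization and non-negativity constraints, which means
$$ P(w|t) \ge 0, \quad \sum_{w} P(w|t) = 1, \qquad P(t|d) \ge 0, \quad \sum_{t} P(t|d) = 1 $$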
So how does it really work? The optimization generally used is E-M (expectation-maximization). We start with Bayes' rule: $$ P(t|d,w) = \frac{P(w|t,d)P(t|d)}{P(w|d)} = \frac{P(w|t)P(t|d)}{P(w|d)} $$
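The E step: using the current estimates of $$ P(w|t) $$ and $$ P(t|d) $$, compute the posterior probability of each topic for every document-word pair:
$$ P(t|d,w) = \frac{P(w|t)P(t|d)}{\sum_{s} P(w|s)P(s|d)} $$
The M step: re-estimate the distributions from the expected counts, where $$ n_{dw} $$ is the number of times word $$ w $$ occurs in document $$ d $$:
$$ P(w|t) = \frac{n_{wt}}{\sum_{w'} n_{w't}}, \quad n_{wt} = \sum_{d} n_{dw} P(t|d,w) $$
$$ P(t|d) = \frac{n_{td}}{\sum_{t'} n_{t'd}}, \quad n_{td} = \sum_{w} n_{dw} P(t|d,w) $$
The Algorithm looks like this: initialize $$ P(w|t) $$ and $$ P(t|d) $$ (for example randomly, then normalize), and alternate the E step and the M step until the values stop changing.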
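Here is a minimal sketch of that E-M loop in NumPy, assuming random initialization and a fixed number of iterations. The function names `plsa` and `log_likelihood`, the toy count matrix, and the small `1e-12` guard against division by zero are illustrative choices, not part of any reference implementation.

```python
import numpy as np

def plsa(n_dw, n_topics, n_iter=50, seed=0):
    """Fit pLSA with E-M. n_dw[d, w] is the count of word w in document d.

    Returns phi[w, t] = P(w|t) and theta[t, d] = P(t|d).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = n_dw.shape

    # Random initialization, normalized so every column sums to 1.
    phi = rng.random((n_words, n_topics))
    phi /= phi.sum(axis=0, keepdims=True)
    theta = rng.random((n_topics, n_docs))
    theta /= theta.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # E step: P(t|d,w) is proportional to P(w|t) * P(t|d).
        joint = theta.T[:, None, :] * phi[None, :, :]   # shape (D, W, T)
        p_tdw = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)

        # M step: expected counts n_dw * P(t|d,w), then renormalize.
        weighted = n_dw[:, :, None] * p_tdw             # shape (D, W, T)
        n_wt = weighted.sum(axis=0)                     # shape (W, T)
        n_td = weighted.sum(axis=1).T                   # shape (T, D)
        phi = n_wt / np.maximum(n_wt.sum(axis=0, keepdims=True), 1e-12)
        theta = n_td / np.maximum(n_td.sum(axis=0, keepdims=True), 1e-12)

    return phi, theta

def log_likelihood(n_dw, phi, theta):
    """Sum over d, w of n_dw * log sum_t P(w|t) P(t|d)."""
    p_wd = (phi @ theta).T                              # shape (D, W)
    return float((n_dw * np.log(np.maximum(p_wd, 1e-12))).sum())

# Toy counts (made up): documents 0-1 and documents 2-3 use different words.
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

phi, theta = plsa(counts, n_topics=2)
print("P(w|t):\n", phi.round(3))
print("P(t|d):\n", theta.round(3))
print("log likelihood:", round(log_likelihood(counts, phi, theta), 3))
```

The E step stores a full documents × words × topics array, which keeps the code close to the formulas but would be too memory-hungry for a real corpus; a practical version would loop over the nonzero counts instead.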
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf