# Unsupervised Topic Modelling
## Background and Motivation
- Product: provide analytic insights from customer conversations
- Classify conversations into topics in an unsupervised way
- Discover new topics not in labelled intents
- For conversations without a classified intent (i.e. the predicted probability falls below the threshold for all defined intents), we want to use a topic model to identify the main topic and relay that information to stakeholders
## Challenges
- Long-tail problem
- Reliant on qualitative evaluations
- Hard to extract main topic, especially given a long conversation
## Previously Implemented Method
- Encode text into vectors using a pre-trained language model
- Use [Sentence-Transformers](https://arxiv.org/pdf/1908.10084.pdf)
- Pre-trained models available [here](https://www.sbert.net/)

- PCA for dim reduction
- HDBSCAN for clustering
- Too sensitive to slight changes in text. E.g. text like `cancel my account` and `i would like to cancel my current account with x` may be grouped under different topics
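
A minimal sketch of this pipeline (SBERT encoding, PCA, HDBSCAN); the model name, `n_components`, and `min_cluster_size` below are illustrative assumptions, not the settings used in production:
```
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import hdbscan

texts = [
    "cancel my account",
    "i would like to cancel my current account with x",
    "how do i view my bill",
]

# 1. Encode text into vectors with a pre-trained Sentence-Transformers model
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

# 2. Reduce dimensionality with PCA
reduced = PCA(n_components=2).fit_transform(embeddings)

# 3. Cluster with HDBSCAN; each cluster is treated as a topic
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(reduced)
print(labels)  # -1 marks noise points that receive no topic
```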
## Non-Negative Matrix Factorization

- Non-Negative Matrix Factorization (NMF) is a technique that decomposes (factorizes) a non-negative matrix `A` into two (or more) non-negative matrices. In machine learning, NMF is often used as an unsupervised learning technique, and it is well suited to text mining applications.
- In this process, a document-term matrix `A` is constructed with the weights of various terms (typically weighted word-frequency information) from a set of documents. This matrix is factored into a term-feature matrix and a feature-document matrix. The features are derived from the contents of the documents, and the feature-document matrix describes clusters of related documents.
- Given a matrix `A`, NMF decomposes it into two matrices, `W` and `H`, where `W` contains the topics found and `H` the weights corresponding to those topics. Compared to Latent Dirichlet Allocation (LDA), another popular unsupervised topic modelling method, NMF tends to produce more coherent topics. A minimal usage sketch follows below.
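
A minimal sketch of NMF topic modelling on a TF-IDF document-term matrix with scikit-learn (the example texts and number of topics are illustrative; note that in scikit-learn's convention `W` holds document-topic weights and `H` holds topic-term weights):
```
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "i would like to cancel my account",
    "please cancel my service",
    "how do i view my latest bill",
    "can i get a copy of my bill",
]

# Document-term matrix A with TF-IDF weights
vectorizer = TfidfVectorizer(stop_words="english")
A = vectorizer.fit_transform(docs)

# Factorize A ~ W @ H with k topics
k = 2
nmf = NMF(n_components=k, random_state=0)
W = nmf.fit_transform(A)   # document-topic weights
H = nmf.components_        # topic-term weights

# Top weighted words per topic
terms = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    top_words = [terms[i] for i in topic.argsort()[::-1][:3]]
    print(f"topic {topic_idx}: {top_words}")

# Dominant topic per document
print(W.argmax(axis=1))
```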
### Drawbacks
- Using a word-frequency based method to encode the text, such as TF-IDF, can be problematic.
- A frequency-based method to encode text is required due to the non-negativity constraint
- Can group various messages containing common keywords into one topic, but that topic can contain many distinct actual intents (e.g. cancelling a service might have many different underlying intents grouped together)
- Proposed workaround: use a 1-to-many topic mapping with a confidence threshold, and group a list of topics as the main topic
- Still not good enough
- Choosing a larger n for n-grams may help capture more context (this benefits conversation-level modelling more, since the text is longer), but such n-grams occur with low frequency and may not contribute much to the overall result (see the `ngram_range` sketch below)
- NMF can be sensitive to how the text is preprocessed to a BOW representation
- Doesn't guarantee every document has a topic (or multiple topics)
- BOW-based methods don't capture the semantic/positional context of a document, e.g. that the main intent/topic is typically found at the beginning of a conversation, or that different words may refer to the same thing
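
For reference, the n-gram range and other BOW preprocessing choices mentioned above are just vectorizer parameters, e.g. in scikit-learn (values are illustrative):
```
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "i would like to cancel my current account",
    "cancel my account",
]

# Include unigrams and bigrams; bigrams capture more context but occur
# less frequently, so their TF-IDF weights may barely affect the factorization
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
A = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
```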
## GPT for topic modelling
Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive pre-trained language model, the latest in a series of massive pre-trained language models created by OpenAI. At the time of writing, it is the 2nd largest trained language model, with 175B trainable parameters. GPT-3 is capable of generating long-form, human-like text, and it can also perform few-shot learning on downstream tasks such as intent classification without any additional model training. Few-shot learning with GPT-3 is done purely via text interactions with the model, where the user provides examples of the downstream task; this process is typically called prompt engineering.
### Prompt engineering
As previously stated, using GPT-3 for a specific task does not require fine-tuning (i.e. training the model on that task). Rather, GPT-3 performs the task by treating every problem as a language modelling problem. The user provides a prompt to the model, which consists of examples of the task followed by a final input with no output. For example, for an English to French translation task, the user might provide the following prompt:
```
English: I do not speak French.
French: Je ne parle pas français.
English: See you later!
French: À tout à l’heure!
English: Where is a good restaurant?
French: Où est un bon restaurant?
English: What rooms do you have available?
French: Quelles chambres avez-vous de disponible?
English: We’ll cross that bridge when we come to it
French:
```
By providing examples of English to French translation, then an untranslated English phrase in the same prompt, the model will attempt to complete the text by generating the French translation of `We’ll cross that bridge when we come to it`, which comes out to `On verra quand on y sera.`
### Using GPT-3 API
Using Python
```
import os
import openai

# Read the API key from the environment rather than hard-coding it
openai.api_key = os.getenv("OPENAI_API_KEY")

# Few-shot prompt: labelled examples followed by the text to classify
prompt = """I would like to speak to a live representative:Escalate
I would like to cancel my service:Cancellation
View my bills:Billing
Can I speak to an agent:"""

response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    temperature=0,
    stop="\n",
)
print(response["choices"][0]["text"])  # the generated topic/intent label
```
The same prompt engineering can also be done via curl. The equivalent example would be
```
curl -s https://api.openai.com/v1/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "davinci",
    "temperature": 0,
    "stop": "\n",
    "prompt": <prompts>
  }' | jq
```
### GPT-J
https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b
- A 6B parameter open source GPT model trained by Eleuther AI on the [Pile](https://pile.eleuther.ai/) dataset
- Performance similar to the `curie` (2nd largest) GPT-3 model despite being smaller
Running inference using Huggingface
- https://huggingface.co/transformers/master/model_doc/gptj.html
- `GPTJForCausalLM`
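
A minimal sketch of running inference with `GPTJForCausalLM` via Hugging Face transformers (the checkpoint name and generation parameters below are illustrative):
```
from transformers import AutoTokenizer, GPTJForCausalLM

# Load the GPT-J 6B checkpoint published by EleutherAI
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# Same few-shot topic prompt as in the GPT-3 example above
prompt = """I would like to speak to a live representative:Escalate
I would like to cancel my service:Cancellation
View my bills:Billing
Can I speak to an agent:"""

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=False)

# Decode only the newly generated tokens (the predicted topic)
completion = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:])
print(completion)
```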
Fine-tuning GPT-J on target data
- https://github.com/kingoflolz/mesh-transformer-jax/blob/master/howto_finetune.md
- Requires TPU V3 instance
### Results
Overall accuracy of the predicted topics improved, but the predicted topics require normalization, for example:
```
text: i am looking to cancel the plan on the account so i don't keep getting billed
topic: account cancellation
text: how do i cancel my current plan plan
topic: cancel plan
```
Similar issue to SBERT + clustering: generated topics are too granular.
- Example: given 100 conversations, GPT returns 80 unique topics
No way to specify the number of topics to predict
### Next steps with GPT
Experiment with a 2-step approach:
1. Generate topic with GPT
2. Normalize topics
- can use GPT-3 to normalize the topics
- or use a clustering approach to group topics together (a minimal sketch follows after this list)
- Sentence-Transformers + K-means/HDBSCAN etc.
- NMF
Other ideas?
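
A minimal sketch of step 2 using Sentence-Transformers + K-means to normalize GPT-generated topic strings (the model name and number of clusters are illustrative assumptions):
```
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Topic strings generated by GPT in step 1 (illustrative examples)
gpt_topics = ["account cancellation", "cancel plan", "billing question", "view my bill"]

# Embed the topic strings and cluster them into a fixed number of normalized topics
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(gpt_topics)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)

# Topics sharing a cluster label are treated as the same normalized topic
for topic, label in zip(gpt_topics, labels):
    print(label, topic)
```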
## Contextualized Topic Models
- https://arxiv.org/abs/2004.07737v2
- https://github.com/MilaNLProc/contextualized-topic-models
- Builds off of [Neuro-ProdLDA](https://openreview.net/pdf?id=BybtVK9lg)
- Using an autoencoder approach
- Reconstruct a BOW representation of the text using a VAE
- The original Neuro-ProdLDA uses the BOW vector as the input; CTM replaces it with (or combines it with) a transformer embedding vector (e.g. SBERT)
- To get the predicted topic, the decoder network samples `n` times from a Gaussian distribution with parameters `(mu, sigma)` to get an `(n, K)` matrix `theta`
```
def get_theta(self, x, x_bert, labels=None):
    with torch.no_grad():
        # batch_size x n_components (n_components == K)
        posterior_mu, posterior_log_sigma = self.inf_net(x, x_bert, labels)
        # posterior_sigma = torch.exp(posterior_log_sigma)
        # generate samples from theta
        theta = F.softmax(
            self.reparameterize(posterior_mu, posterior_log_sigma), dim=1)
        return theta
```
- Then average `theta` over the samples and take the argmax over topics, e.g. `np.argmax(np.average(theta, axis=0), axis=1)` when `theta` is stacked over a batch of documents, to get the predicted topic per document
- To get associated topic words, use the reconstructed BOW associated with that topic, and take the top weighted words
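
A minimal sketch following the library's documented `CombinedTM` workflow (the encoder name, `n_components`, and toy documents are placeholders; a real run needs a much larger corpus):
```
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

documents = [
    "i am looking to cancel the plan on the account so i don't keep getting billed",
    "how do i view my latest bill",
]
preprocessed = ["cancel plan account billed", "view latest bill"]  # cleaned text for the BOW

# Build BOW + contextualized (SBERT) representations
tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v2")
training_dataset = tp.fit(text_for_contextual=documents, text_for_bow=preprocessed)

# Train the combined topic model with K topics
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=5)
ctm.fit(training_dataset)

# Top weighted words per topic, and the per-document topic distribution (theta)
print(ctm.get_topics(5))
theta = ctm.get_doc_topic_distribution(training_dataset, n_samples=20)
```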

### Pros:
- Balances the BOW approach with contextualized embeddings. Potentially more informative than either one alone
- Shown to have good results on benchmark data
- Better for cross-lingual topic modelling (just use a multilingual encoder)
### Cons:
- Works better with <2000 terms in BOW construction
- Tested by the authors only with the first 200 tokens of each document. Does not seem to perform very well with shorter documents in our own experiments
### Results
- Initial experiments seemed favorable. So far the best of what's been tried, based on qualitative evaluation
- Performance still needs to be validated through more experiments
- Still room for improvement
### Combining CTM with previous experiments
- Possible extension: use GPT (or some summarization model) as a summarizer (to reduce the length of the text and make it easier to learn from), then apply CTM
- Inject the GPT topic as another context vector (either concatenate it as another vector, or add the GPT topic to the original text; a small sketch of the latter follows below)
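
A small sketch of the "add GPT topic to original text" idea; the variable names are illustrative:
```
# Prepend the GPT-generated topic to each document before computing the
# contextual (SBERT) embedding used by CTM, so the embedding carries the GPT signal
documents = ["i am looking to cancel the plan on the account"]
gpt_topics = ["account cancellation"]  # output of the GPT step

augmented = [f"{topic}. {doc}" for topic, doc in zip(gpt_topics, documents)]
# `augmented` would then be passed as `text_for_contextual` to
# TopicModelDataPreparation.fit(...) in the CTM sketch above
print(augmented)
```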
## Conclusion
- Topic modelling is hard
- No perfect out of the box solution
- Improvements will likely come from combining existing methods