# Creative Classification Label / Detection

## Outline

- What we have so far
- What we have tried
- Issue: understand the problem fundamentally
- Potential solutions
- Solution: control the embedding space using cosine loss

## Creative Classification

The goal of this project is to classify creative images into a category. For example, given an input image with the text **Honda offers expiring Parts service**, we want to train a model that outputs something like **Service/Vehicles/Maintenance/Honda**.

The current problems are that we do not have enough labeled data and we do not have enough labels. For a given unlabeled text, such as *Doing business gets you a great price. > SHOP NOW Mercedes-Benz of LA.*, we want to determine whether it falls into our existing labels or lies outside of them. In the former case, we can simply label the text using our existing labels. In the latter case, we need to create a new label for it.

We divide the problem into two parts. The first part is **detection**, in which we determine whether the text should be assigned a new label. The second part is **assignment**, in which, if detection is triggered, we assign a new label to the text.

To address these issues, we fine-tune embedding models with contrastive learning so that labels, texts, and images belonging to the same group end up close together, while different groups end up far apart. We can then use the embedding model to calculate the distance between a text or image and its predicted label; in this sense, the embedding model acts as an outlier detector. We use both a Vision Transformer and a sentence transformer to encode images, texts, and labels into the same embedding space, and we perform the contrastive training in that shared space.

## What we have so far

The goal of this project is to classify creative text or summary text into a category. For example, given the input text **Honda offers expiring Parts service**, we want to train a model that outputs something like **Service/Vehicles/Maintenance/Honda**. So far, we have experimented with the approaches shown in the figure below.

![](https://i.imgur.com/h5PjkJp.png)

- **Left: Chase Model 1.** We use a pretrained Bert model from Hugging Face as the base model, together with a fully connected neural network as the classifier. The Bert model takes the text as input and transforms it into a vector, $Emb_{bert} \in \mathbb{R}^{768}$. The FC classifier then takes $Emb_{bert}$ as input and outputs an index, say 36, which we map to the actual label through a lookup table. For the case shown in the figure, the result is $36: Product/Cars/Mercedes$. (A minimal sketch of this pipeline follows after this list.)
- **Middle: Chase Model 2.** We also try the OpenAI Embedding API to obtain an embedding, $Emb_{openAI} \in \mathbb{R}^{1536}$. The OpenAI API takes the text as input and transforms it into the vector $Emb_{openAI}$; the rest of the pipeline is the same as in **Chase Model 1**. The only difference between the two models is how we obtain the embedding vector. Both models produce promising results: so far, the evaluation accuracy of both reaches $97\%$.
- **Right: GPT 3.5.** In addition, we try the GPT 3.5 API to produce labels. Since it is a generative language model, the labels can be very flexible, but most of the time we observe that it generates labels close to ours.
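As referenced in the Chase Model 1 bullet, the sketch below shows roughly what that pipeline looks like. It is a minimal illustration, not our actual code: the checkpoint is the public `bert-base-uncased`, the label count and lookup table are hypothetical, and the FC head shown here is randomly initialized (in practice it is trained on our labeled data).

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

NUM_LABELS = 50                           # assumption: size of our label set
ID2LABEL = {36: "Product/Cars/Mercedes"}  # partial, hypothetical lookup table

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(768, NUM_LABELS)   # FC head over Emb_bert

def predict(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        emb_bert = bert(**inputs).last_hidden_state[:, 0, :]  # 768-d [CLS] vector
        logits = classifier(emb_bert)
    idx = int(logits.argmax(dim=-1))
    return ID2LABEL.get(idx, f"label_{idx}")  # map the index back to a label string

print(predict("Honda offers expiring Parts service"))
```

Chase Model 2 would differ only in the first step: `emb_bert` is replaced by the 1536-d OpenAI embedding, and the FC head stays the same.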
As an example of GPT 3.5's output, in the figure it generates **Product/Vehicles & Cars/Mercedes-Benz**, which is close to our label **Product/Cars/Mercedes**.

The current problems are:

1. We do not have enough labeled data.
2. We do not have enough labels.

For a given unlabeled text, such as *Doing business gets you a great price. > SHOP NOW Mercedes-Benz of LA.*, we want to determine whether it falls into our existing labels or lies outside of them. In the former case, we can simply label the text using our existing labels. In the latter case, we need to create a new label for it.

We divide the problem into two parts. The first part is **detection**, in which we determine whether the text should be assigned a new label. The second part is **assignment**, in which, if detection is triggered, we assign a new label to the text.

- Detection
  - New Label
  - Wrong Label
- Assignment
  - Existing
  - New Label

## What we have tried

We have tried using the embeddings of our labels, the GPT 3.5 labels, and the texts to solve the detection problem. As shown in the figure below, we use an encoder to encode the text, our label, and the GPT 3.5 label, and obtain their embeddings. The encoder can be a sentence transformer, the OpenAI embedding model, etc. We then compute a distance metric such as cosine similarity to solve the detection problem. For example, if the similarity score between the GPT-label embedding and the predicted-label embedding is lower than some threshold, we conclude that the text is either mislabeled or outside our existing labels. We can compute a similar metric between the text embedding and the predicted-label embedding.

![](https://i.imgur.com/6r4sbsm.png)

## Issue: understand the problem fundamentally

The potential issue with the above methods is that the embedding space might be mismatched. Let's take a closer look at what we have tried. Our underlying assumption is that the encoder (sentence transformer, OpenAI model, etc.) is able to **understand** the creative text, our label, and the GPT 3.5 label, so that it captures their meaning and transforms them into embeddings that are suitable for computing distance metrics.

However, does this assumption hold? Do these encoders **understand** the text and the labels? For the text input, perhaps, because most encoders were trained on large amounts of text, although the text extracted from creative images can differ somewhat from natural language. For now, let's assume the encoders do understand the text. For a label such as "Product/Cars/Mercedes", however, the encoders are much less likely to fully understand it, because the label looks quite different from the text the encoders were originally trained on. We are therefore using an ill-suited embedding space to compute distance metrics, trying to determine whether a text is close enough to a label, or whether a label is close enough to a GPT label.

## Potential solutions

We need an encoder that **aligns the text and the labels**, so that they speak the same language. In other words, the encoder should **understand** both text and label, as shown in the figure below. In mathematical terms, the encoder should transform the text and the label into the same embedding (vector) space **properly**, so that the distance metrics we compute there are meaningful.

![](https://i.imgur.com/kyfUiko.png)

Let's take a closer look at **Chase Model 1**:

![](https://i.imgur.com/e87cH7k.png)

The solution lies in the vector $Emb_{bert}$.
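To make this concrete, here is a minimal sketch of pulling $Emb_{bert}$ out of Chase Model 1 and comparing it to a label embedding with cosine similarity, in the spirit of the detection check described above. The fine-tuned checkpoint path is hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Hypothetical path to the Bert encoder fine-tuned as part of Chase Model 1.
bert = AutoModel.from_pretrained("path/to/chase-model-1-bert")

def emb_bert(text: str) -> torch.Tensor:
    """Return the 768-d [CLS] vector used as Emb_bert."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return bert(**inputs).last_hidden_state[:, 0, :]

e_text = emb_bert("Doing business gets you a great price. > SHOP NOW Mercedes-Benz of LA.")
e_label = emb_bert("Product/Cars/Mercedes")
similarity = torch.nn.functional.cosine_similarity(e_text, e_label).item()
print(similarity)  # without label-aware training, this score is not very meaningful
```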
We do not need help from other embedding models, because our fine-tuned Bert model has seen our creative texts and is able to understand them. However, one might ask: does it understand our labels? The answer is no. If we take a close look at the output of the model, it is a softmax over some indices, from which one index is selected as the output. There is no interaction between the label *Product/Cars/Mercedes* and the model.

**Proposed solutions**: Again, we are hoping to find an encoder that understands both the creative text and the label. We already have an encoder that understands the creative text (Bert is an encoder-only language model); we only need some encoder that understands our labels. That **some** encoder can be the same Bert.

**Solution one:** One easy solution is shown in the figure below, which I think can be done within one hour. The dataset looks like:

- text: *Where the lower cost of doing business gets you a great price. > SHOP NOW walter's Mercedes-Benz of Riverside*
  - label: 36
- text: *Product/Cars/Mercedes*
  - label: 36

![](https://i.imgur.com/vYE9sR6.png)

**Solution two:** We use a two-head neural network on top of the Bert model. The first head is the classifier, and the second is a decoder that decodes the embedding back into the label text. This setup is very common in self-driving and planning problems, where there are multiple objectives. In this way, we achieve the classification task and, at the same time, let the Bert model understand the label. The decoder can potentially be used for suggesting new labels.

![](https://i.imgur.com/qRwBvk7.png)

The dataset looks like:

- FC classifier
  - text: *Where the lower cost of doing business gets you a great price. > SHOP NOW walter's Mercedes-Benz of Riverside*
  - label: 36
- Decoder
  - text: *Where the lower cost of doing business gets you a great price. > SHOP NOW walter's Mercedes-Benz of Riverside*
  - label: *Product/Cars/Mercedes*

In conclusion, it is all about data; how we use the data is crucial.

## Use distance metric to trigger

![](https://i.imgur.com/jBFZanN.png)

Let $E_t$ denote the embedding vector of the input text and $E_l$ the embedding vector of the label, and define $d_{tl} = |E_t - E_l|$.

1. If $d_{tl} > threshold$: trigger a wrong/new label.
   - Test whether the distances $d_{tl}$ are consistent: for every $(E_t, E_l)$ pair, calculate $d_{tl} = |E_t - E_l|$ and plot the values, then draw a threshold line in the plot.
2. If the label is not among the top-$k$ closest by distance, trigger a wrong label (equivalently, if it is in the top $k$, the label is considered correct).

![](https://i.imgur.com/9KUY74r.png)

## Solution: Control the embedding space using cosine loss

Our goal is to control the embedding space: we want the embeddings of texts that belong to the same category to be close together, and the embeddings of texts that belong to different categories to be far apart. This is related to [Contrastive Representation Learning](https://lilianweng.github.io/posts/2021-05-31-contrastive/), a blog post that summarizes contrastive learning, the different loss functions, and their use cases.

We follow the paper [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/pdf/1908.10084.pdf) and use the loss shown in the figure below.

![](https://i.imgur.com/2V8Fm41.png)

The paper describes three objectives: classification, regression, and a triplet objective. So far, we have experimented with the regression objective.
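A minimal sketch of this regression objective with the `sentence-transformers` library is shown below. The example texts are short stand-ins for our data, and the pair construction is simplified; the 0.9/0.1 targets mirror the scheme described in the next paragraph.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Pairs from the same category get a high target score; pairs from different
# categories get a low one. (Illustrative texts, not our real dataset.)
train_examples = [
    InputExample(texts=["Mercedes-Benz winter service special",
                        "Service/Vehicles/Maintenance/Mercedes"], label=0.9),
    InputExample(texts=["Mercedes-Benz winter service special",
                        "Audi quattro lease offer"], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)  # MSE between cosine(u, v) and the target

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
```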
Concretely, as shown in the figure below, we feed texts about *Mercedes-Benz* (colored green) to the encoder and obtain their embeddings, denoted $E_{pi}$. We want the embeddings $E_{pi}$ to be close together so that we can use a distance metric such as cosine similarity for downstream tasks. With that in mind, we can use the cosine similarity directly to compute the loss. For example, we calculate a similarity score within *Mercedes-Benz*: $score_p = Cosine(E_{p2}, E_{p3})$. Since we want this score to be as high as possible, we assign a scalar target, say 0.9, and compute the mean squared error $loss = MSE(score_p, 0.9)$. Conversely, for the score between the embeddings of *Mercedes-Benz* (colored green) and *Audi* (colored red), $score_n = Cosine(E_{p3}, E_{n1})$, we want the score to be low, so we assign a scalar target, say 0.1, and again compute the mean squared error loss.

![](https://i.imgur.com/PF5yYQs.png)

We first use the pretrained model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) from [sentence-transformers](https://huggingface.co/sentence-transformers), feed it our texts, and obtain the embeddings. The 2-D visualization is shown below. Red dots are the embeddings of 30 Acura texts, blue dots are 30 Honda texts, and green dots are the embeddings of the Acura and Honda labels plus some other random labels. The original model does a decent job of grouping the Acura texts (red) and the Honda texts (blue). However, the embeddings of the two label texts, *Service/Vehicles/Maintenance/Acura* and *Service/Vehicles/Maintenance/Honda*, shown as the two green points in the middle, are not close to their own groups of texts. We want the label text *Service/Vehicles/Maintenance/Acura* to be close to the red Acura texts, and *Service/Vehicles/Maintenance/Honda* to be close to the blue Honda texts.

![](https://i.imgur.com/pm2eQYB.png)

We apply the cosine loss to fine-tune the model using the same data as in the visualization above, obtaining a new version of the model. Feeding the same texts to the new model, the visualization becomes:

![](https://i.imgur.com/8Y9OAb7.png)

Now the label text *Service/Vehicles/Maintenance/Acura* is very close to the red Acura texts, *Service/Vehicles/Maintenance/Honda* is close to the blue Honda texts, and the two groups of texts are separated further. To summarize, we can manipulate the embedding space using the cosine loss. This loss function can serve as the human-model interface, connecting human intent with the model.

## Other broader use cases

More generally, for any task involving **semantic search** in an embedding space, we can use this kind of technique to shape the embedding space into what we want it to be, and then perform a more accurate search. Another potential use case is combining image embeddings with text embeddings, as in [CLIP: Connecting text and images](https://openai.com/research/clip) by OpenAI (Chase previously talked about CLIP).

![](https://i.imgur.com/M5TvgH4.png)

In our case, we want an embedding space in which the embeddings of creative images and creative texts that belong to the same category are close together, and those belonging to different categories are far apart. We can use exactly the same cosine-loss method to control this embedding space. For example, as shown in the figure below, we feed the *Mercedes image* and the *Mercedes text* to the image encoder and the text encoder, respectively, and we want the two resulting embeddings to be close.

![](https://i.imgur.com/kR7Ho8t.png)
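As a rough sketch of this cross-modal setup, the snippet below encodes an image and a text with the same sentence-transformers CLIP checkpoint discussed in the proof of concept later on; the image file name is hypothetical.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("sentence-transformers/clip-ViT-L-14")

# The CLIP wrapper routes PIL images to the image encoder and strings to the
# text encoder, so both land in the same embedding space.
img_emb = clip.encode(Image.open("mercedes_creative.png"))   # hypothetical file
txt_emb = clip.encode("Mercedes-Benz year-end sales event")

score = util.cos_sim(img_emb, txt_emb)
print(float(score))  # we want matched image-text pairs to score high
```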
As before, we assign a scalar target, say 0.9, to pairs we want to be close; conversely, for texts or images we want to be far apart, we assign a scalar target, say 0.1.

This embedding space is potentially useful in two cases:

- Semantic search when there is **limited text** or no text at all.
- Using both embeddings (text and image) in downstream tasks such as classification or a CTR model, providing more features to the model.

## The limitation of cosine similarity

![](https://i.imgur.com/PHuVjpV.png)
![](https://i.imgur.com/iFcH67x.png)
![](https://i.imgur.com/qEqDyIM.png)

[Euclidean vs. Cosine Distance](https://cmry.github.io/notes/euclidean-v-cosine)

## Euclidean space vs Hyperbolic space

The Euclidean assumption may be incorrect for hierarchical labels like ours. New loss: [Poincaré Embeddings for Learning Hierarchical Representations](https://arxiv.org/pdf/1705.08039.pdf) by [Maximilian Nickel](https://mnick.github.io/).

![](https://i.imgur.com/JMwqzbr.png)
![](https://i.imgur.com/NE31Sao.png)

[A Simple Framework for Contrastive Learning of Visual Representations](https://arxiv.org/abs/2002.05709)

# Multi-modal Embedding

As described in the [CLIP blog](https://www.pinecone.io/learn/clip/): the multi-modal nature of CLIP is powered by two encoder models trained to “speak the same language”. Text inputs are passed to a text encoder, and image inputs to an image encoder; these models then create a vector representation of the respective input. Both models “speak the same language” by encoding similar concepts in text and images into similar vectors. That means the text “two dogs running across a frosty field” would produce a vector similar to that of an image of two dogs running across a frosty field.

![](https://i.imgur.com/ApBJKmm.png)

## Proof of concept

To find out whether we can use a similar multi-modal approach for ads, specifically creative images and creative texts, we investigated CLIP. In particular, we use the open-source pretrained [CLIP model](https://huggingface.co/sentence-transformers/clip-ViT-L-14) from Hugging Face to encode our creative images and texts. As shown in the figure below, we feed the model Acura creative images, Acura creative texts, and the text "Service/Vehicles/Maintenance/Acura", and obtain an embedding for each input. The 2-D visualization of the embeddings is shown below: blue dots are the embeddings of Acura creative images, red dots are the embeddings of Acura creative texts, and the green dot marked *Service/Vehicles/Maintenance/Acura* is the embedding of that label text.

The original CLIP model does a good job on Acura creative images and Acura creative texts separately: the creative images look quite different from one another, yet the model still recognizes that they are close or similar. This is not surprising if we look at the [original CLIP paper](https://arxiv.org/pdf/2103.00020.pdf); Section 3 mentions training related to OCR and performing OCR tasks (extracting text from images). However, the model fails when we consider the creative images and texts jointly, and it also fails to understand the label text "Service/Vehicles/Maintenance/Acura". In other words, we want the embeddings of Acura images, texts, and label texts to be close to each other, but the pretrained model does not achieve that.

![](https://i.imgur.com/yrKq3kn.png)

To make use of the CLIP model, we fine-tuned it using a contrastive loss.
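The sketch below illustrates one way such a cosine-target fine-tuning step could look in plain PyTorch. The encoder modules, batch contents, and optimizer are placeholders for illustration only; our actual training setup is summarized next.

```python
import torch
import torch.nn.functional as F

def cosine_target_step(image_encoder, text_encoder, optimizer,
                       images, texts, targets):
    """One fine-tuning step: push cosine(image_emb, text_emb) toward its target
    (e.g. 0.9 for matching Acura pairs, 0.1 for non-matching pairs)."""
    img_emb = image_encoder(images)        # (batch, dim)
    txt_emb = text_encoder(texts)          # (batch, dim)
    scores = F.cosine_similarity(img_emb, txt_emb, dim=-1)
    loss = F.mse_loss(scores, targets)     # regression objective on cosine scores
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```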
We did not use the exact method from the [original CLIP paper](https://arxiv.org/pdf/2103.00020.pdf), but the high-level idea is the same: push the embeddings of Acura texts, images, and label texts together, while pushing Acura far away from other ads. The training dataset consists of the label text "Service/Vehicles/Maintenance/Acura", 20 randomly sampled Acura images, and 20 Acura texts. The behavior of the fine-tuned CLIP model is shown in the figure below: now the model knows that Acura texts, images, and label texts are similar to each other.

![](https://i.imgur.com/iEKZ5ko.png)

In conclusion, the use cases are:

- Semantic search
  - image-image
  - image-text
  - text-text
  - ...
- Downstream tasks
  - Classification
  - CTR
  - ...

## Active Learning

![image](https://hackmd.io/_uploads/r1HSl5uC1x.png)

## Deployment on GCP

![image](https://hackmd.io/_uploads/ByI1-qdAyl.png)

## Domain Knowledge

- Multi-class classification
- Contrastive learning
- Bert
- CLIP