Thompson Sampling on Searching Engine

Overview

In the last article, Embedding Searching Articles and Corresponding Thompson Sampling, I explained how to build an embedding vector search engine with an OpenAI embedding model. However, OpenAI offers three embedding models: text-embedding-3-small, text-embedding-3-large, and ada v2. Although I could rule out text-embedding-3-large because of its poor performance on Mandarin, I could not tell the difference between ada v2 and text-embedding-3-small in a short time. Therefore, I decided to run a Bayesian A/B test, more precisely Thompson Sampling, to decide which embedding model is more suitable for this project.

Thompson Sampling

Since we only record clicks here, Thompson Sampling uses a Beta-Bernoulli model. The model looks like:

$p \sim \mathrm{Beta}(α + x, β + (1 - x))$,

where $p$ is the click probability, $x \in \{0, 1\}$ indicates whether the user clicked, and $α$ and $β$ are the parameters of the beta model.
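As a small worked example of this update, assume a uniform $\mathrm{Beta}(1, 1)$ starting point. A single observation moves the distribution as follows:

$$
\mathrm{Beta}(1, 1) \xrightarrow{\,x = 1\,} \mathrm{Beta}(2, 1), \qquad \mathrm{Beta}(1, 1) \xrightarrow{\,x = 0\,} \mathrm{Beta}(1, 2)
$$

The mean of $\mathrm{Beta}(α, β)$ is $\frac{α}{α + β}$, so one click raises the estimated click probability from $\frac{1}{2}$ to $\frac{2}{3}$, and one non-click lowers it to $\frac{1}{3}$.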

The steps to run the experiment with Thompson Sampling:

1. Set $\mathrm{Beta}(1, 1)$ for each embedding model: $\mathrm{Beta}_{\mathrm{ada}}$ and $\mathrm{Beta}_{\mathrm{small}}$.
2. When a user asks a question, each beta model draws a probability, $p_{\mathrm{ada}}$ and $p_{\mathrm{small}}$, and the question is assigned to the model whose probability is higher.
3. After the search results are sent back to the user, the $β$ parameter of the beta model corresponding to the assigned embedding model is increased by one. That is, $β_{\mathrm{assigned}} = β_{\mathrm{assigned}} + 1$.
4. Once the user clicks a search result, the Record function does two things to the assigned model's parameters: $β - 1$ and $α + 1$. This turns the result that was provisionally counted as a non-click into a click.
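The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in the project: the in-memory `params` dictionary and the function names `assign_model` and `record_click` are assumptions made for the example.

```python
import random

# Hypothetical in-memory state: one [alpha, beta] pair per embedding model,
# both starting from Beta(1, 1).
params = {"ada": [1, 1], "small": [1, 1]}


def assign_model(rng=random):
    """Draw one probability per model and serve the model with the higher draw."""
    draws = {name: rng.betavariate(a, b) for name, (a, b) in params.items()}
    assigned = max(draws, key=draws.get)
    # Count the served results as "no click" for now: beta += 1.
    params[assigned][1] += 1
    return assigned


def record_click(assigned):
    """Turn the provisional "no click" into a click: beta -= 1, alpha += 1."""
    params[assigned][1] -= 1
    params[assigned][0] += 1
```

Incrementing $β$ at serving time means a search that never receives a click needs no further bookkeeping: it simply stays counted as a failure.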

The following sections describe how the experiment handles search results and clicks.

Record

The Record structure is only for recording the results of the experiment.

Clicking

When a user clicks the URL of a search result, the URL sends an HTTP request to a Google Cloud Function (GCF).

User Database

In GCF, the function checks whether the click came right after a search. Since we only want to know which embedding model produced the more suitable results that we sent to users, we only record clicks that happen right after a search.
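One possible way to implement this check is sketched below. The article does not describe the real storage or schema, so the `user_db` dictionary, the event records, and the function names are all hypothetical.

```python
# Hypothetical user database: user_id -> the user's most recent event.
# The real storage and schema are not described in this article.
user_db = {}


def log_search(user_id, model):
    """Remember that this user was just served results from `model`."""
    user_db[user_id] = {"event": "search", "model": model}


def click_follows_search(user_id):
    """Return the assigned model if the click comes right after a search, else None."""
    last = user_db.get(user_id)
    if last is None or last["event"] != "search":
        return None
    # Mark the search as consumed so later clicks on the same results are ignored.
    user_db[user_id] = {"event": "click", "model": last["model"]}
    return last["model"]
```

With this sketch, only the first click after each search counts toward the experiment; repeated clicks on the same result list return `None` and leave the parameters untouched.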

Update Parameters and Redirect

Once the click has been checked, we update the parameters of the experiment. After recording, we redirect the user to the actual article.
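A rough sketch of this step follows, assuming the click has already been validated. The function `handle_click` is hypothetical and returns a plain `(status, headers)` pair instead of relying on any particular web framework; the URL is a placeholder.

```python
def handle_click(params, assigned, target_url):
    """Apply the click update, then build a redirect to the clicked article."""
    alpha, beta = params[assigned]
    # The served results were provisionally counted in beta; move them to alpha.
    params[assigned] = (alpha + 1, beta - 1)
    # A 302 response asks the browser to follow the Location header.
    return 302, {"Location": target_url}


# Example: "small" was served once (beta already incremented), then clicked.
params = {"ada": (1, 1), "small": (1, 2)}
status, headers = handle_click(params, "small", "https://example.com/article")
```

Updating before redirecting keeps the recording invisible to the user, who only sees the article open.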

Experiment Results

I ended the experiment on July 3, 2024. The resulting parameters of the two beta models are

| Model | $α$ | $β$ | Successes | Number of Trials |
| --- | --- | --- | --- | --- |
| ada v2 | 63 | 35 | 62 | 96 |
| text-embedding-3-small | 160 | 80 | 159 | 238 |
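To make these parameters easier to read, the sketch below computes the mean $\frac{α}{α + β}$ of each beta distribution and a Monte Carlo estimate of a credible interval, using only the Python standard library. The 95% level, the number of draws, and the function name `summarize` are assumptions for illustration; the article does not state which interval was plotted.

```python
import random

# (alpha, beta) taken from the results table above.
posteriors = {"ada v2": (63, 35), "text-embedding-3-small": (160, 80)}


def summarize(alpha, beta, n_draws=20000, seed=0):
    """Return the exact mean and a Monte Carlo 95% credible interval."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(n_draws))
    lower = draws[int(0.025 * n_draws)]
    upper = draws[int(0.975 * n_draws)]
    return alpha / (alpha + beta), lower, upper


for name, (a, b) in posteriors.items():
    mean, lower, upper = summarize(a, b)
    print(f"{name}: mean={mean:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

The means are $63 / 98 \approx 0.643$ for ada v2 and $160 / 240 \approx 0.667$ for text-embedding-3-small.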

The following plot shows the mean probability and the corresponding credible interval for each model. The x-axis is the number of clicks, and the y-axis is the probability.

(Figure not available: mean probability and credible interval for each model.)

You can see that, most of the time, Small is higher than Ada, and that both became stable after around 300 clicks. The next GIF shows how the beta distributions changed over time.

(GIF not available: change of the two beta distributions during the experiment.)

Since the experiment period had ended and both models had stabilized, I decided on text-embedding-3-small as the embedding model for the article search engine.

Conclusion

This time, I used a different method, Thompson Sampling, to run the A/B test. The advantage of this method is that I do not need to worry about UX during the experiment period, since the algorithm automatically increases the probability of serving the better model. However, with this method it is hard to do the analysis after the experiment. I cannot make a straightforward statement with quantified results, for instance, "A is better than B since its click rate is significantly larger than B's by xx%", so if the results of the experiment need to be reported to others, it might not be the best choice.
