
Thompson Sampling on a Search Engine

Overview

In the last article, Embedding Searching Articles and Corresponding Thompson Sampling, I explained how to build an embedding-vector search engine with an OpenAI embedding model. However, OpenAI offers three embedding models: text-embedding-3-small, text-embedding-3-large, and ada v2. Although I could rule out text-embedding-3-large because of its poor performance on Mandarin, I could not tell the difference between ada v2 and text-embedding-3-small in a short time. Therefore, I decided to run a Bayesian A/B test, more precisely Thompson Sampling, to decide which embedding model is more suitable for this project.

Thompson Sampling

Since we only record clicks here, we use the Beta-Bernoulli model in Thompson Sampling. The model is:

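$$p \sim \mathrm{Beta}(\alpha + x,\ \beta + (1 - x)),$$

where $p$ is the click probability, $x \in \{0, 1\}$ indicates whether the user clicked or not, and $\alpha$ and $\beta$ are the parameters of the beta distribution.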
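
As a small worked example of this update, assume a uniform prior $\mathrm{Beta}(1, 1)$ and a single observation. The posterior mean used here is the standard $\alpha / (\alpha + \beta)$ of a beta distribution:

```latex
\begin{aligned}
x = 1 \text{ (click)}:    &\quad \mathrm{Beta}(1, 1) \to \mathrm{Beta}(2, 1), &\quad \mathbb{E}[p] &= \tfrac{2}{2 + 1} = \tfrac{2}{3} \\
x = 0 \text{ (no click)}: &\quad \mathrm{Beta}(1, 1) \to \mathrm{Beta}(1, 2), &\quad \mathbb{E}[p] &= \tfrac{1}{1 + 2} = \tfrac{1}{3}
\end{aligned}
```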

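The steps of the Thompson Sampling experiment are:

  1. Set a $\mathrm{Beta}(1, 1)$ prior for each embedding model: $\mathrm{Beta}_{\text{ada}}$ and $\mathrm{Beta}_{\text{small}}$.
  2. When a user asks a question, draw a probability from each beta model, $p_{\text{ada}}$ and $p_{\text{small}}$, and assign the embedding model whose sampled probability is higher.
  3. After sending the search results back to the user, add one to the $\beta$ parameter of the beta model that corresponds to the assigned embedding model, that is, $\beta_{\text{assigned}} = \beta_{\text{assigned}} + 1$. This provisionally counts the search as a non-click.
  4. Once the user clicks a search result, the Record function does two things: $\beta - 1$ and $\alpha + 1$. This turns the provisional non-click into a click.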
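
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the experiment: the class `BetaArm`, the functions `assign`, `on_results_sent`, and `on_click`, and the simulated click rates in `true_rate` are all hypothetical.

```python
import random


class BetaArm:
    """Beta posterior over the click probability of one embedding model."""

    def __init__(self, alpha=1.0, beta=1.0):
        # Step 1: every model starts from Beta(1, 1).
        self.alpha = alpha
        self.beta = beta

    def sample(self, rng):
        # Draw one plausible click probability from the current posterior.
        return rng.betavariate(self.alpha, self.beta)


def assign(arms, rng):
    """Step 2: sample each arm once and pick the arm with the highest draw."""
    draws = {name: arm.sample(rng) for name, arm in arms.items()}
    return max(draws, key=draws.get)


def on_results_sent(arms, name):
    """Step 3: provisionally count the search as a non-click (beta + 1)."""
    arms[name].beta += 1


def on_click(arms, name):
    """Step 4: turn the provisional non-click into a click (beta - 1, alpha + 1)."""
    arms[name].beta -= 1
    arms[name].alpha += 1


# Simulated traffic with made-up click rates, only to exercise the functions.
rng = random.Random(0)
arms = {"ada": BetaArm(), "small": BetaArm()}
true_rate = {"ada": 0.5, "small": 0.7}  # hypothetical, not measured values

for _ in range(500):
    chosen = assign(arms, rng)
    on_results_sent(arms, chosen)
    if rng.random() < true_rate[chosen]:
        on_click(arms, chosen)

for name, arm in arms.items():
    print(name, arm.alpha, arm.beta)
```

Because the model with the higher sampled probability is assigned each time, the arm that collects more clicks tends to be shown more often as the experiment goes on.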

The following diagram shows the flow of the experiment, from searching to clicking.

[Flowchart. Nodes: User, Search, Record, Testing Parameters, Target Url. Edge labels: Query, Searching Results, Assign Embedding Models, beta+1, Click, beta-1 and alpha+1, Redirect.]

Record

The Record structure is only for recording the results of the experiment.

[Flowchart. Nodes: User, Record, User Database, Testing Parameters, Target Article. Edge labels: Click, Check if Search Again, Update, Redirect, Article Urls.]

Clicking

When a user clicks a URL in the search results, the URL sends an HTTP request to a Google Cloud Function (GCF).

User Database

In the GCF, the function checks whether the click comes right after a search. Since we only want to know which embedding model produced the more suitable results for users, we only record clicks that come right after a search.

Update Parameters and Redirect

After checking the click, we update the testing parameters. After recording, we redirect the user to the actual result article.
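
The check, update, and redirect described in this section can be sketched as follows. This is a minimal in-memory illustration, not the actual Cloud Function: the names `record_search`, `record_click`, `params`, and `pending` are hypothetical, plain dictionaries stand in for the user database and the testing parameters, and it assumes that "right after searching" means the first click that follows a user's latest search.

```python
def record_search(user_id, model, params, pending):
    """Called when search results are sent: beta + 1 and remember the model."""
    params[model]["beta"] += 1
    # Remember which model served this user's latest search (assumed schema).
    pending[user_id] = model


def record_click(user_id, target_url, params, pending):
    """Called on click: update only if the click follows a search, then redirect."""
    # pop() clears the entry, so later clicks on the same results are ignored.
    model = pending.pop(user_id, None)
    if model is not None:
        params[model]["beta"] -= 1
        params[model]["alpha"] += 1
    # The caller is expected to redirect the user to this URL.
    return target_url


params = {
    "ada": {"alpha": 1, "beta": 1},
    "small": {"alpha": 1, "beta": 1},
}
pending = {}

record_search("user-1", "small", params, pending)
record_click("user-1", "https://example.com/article-a", params, pending)  # counted
record_click("user-1", "https://example.com/article-b", params, pending)  # ignored
print(params)
```

In a real deployment the two dictionaries would be replaced by persistent storage, since a Cloud Function does not keep state between invocations.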

Experiment Results

I ended the experiment on July 3, 2024. The final parameters of the two beta models are:

| Model | α | β | Successes | Number of Trials |
|---|---|---|---|---|
| ada v2 | 63 | 35 | 62 | 96 |
| text-embedding-3-small | 160 | 80 | 159 | 238 |
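
As a rough way to read these parameters, the posterior mean and an approximate 95% credible interval can be computed from α and β alone. The sketch below uses only the Python standard library and estimates the interval by sampling, so the bounds are approximate; the function name `summarize`, the number of draws, and the seed are assumptions made for illustration.

```python
import random


def summarize(alpha, beta, n=20000, seed=0):
    """Return (mean, lower, upper) for a Beta(alpha, beta) posterior.

    The mean is exact; the 95% interval is a Monte Carlo estimate.
    """
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(n))
    mean = alpha / (alpha + beta)
    lower = draws[n // 40]           # about the 2.5th percentile
    upper = draws[n - n // 40 - 1]   # about the 97.5th percentile
    return mean, lower, upper


# Final parameters taken from the table above.
results = {
    "ada v2": (63, 35),
    "text-embedding-3-small": (160, 80),
}

for model, (alpha, beta) in results.items():
    mean, lower, upper = summarize(alpha, beta)
    print(f"{model}: mean={mean:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```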

The following figure shows the mean probability and the corresponding credible interval for each model. The x-axis is the number of clicks, and the y-axis is the probability.


Most of the time, the probability for Small is higher than that for Ada, and both became stable after around 300 clicks. The next GIF shows how the beta distributions change over time.


Since the experiment period had ended and both models had become stable, I decided to use text-embedding-3-small as the embedding model for the article search engine.

Conclusion

This time, I used a different method, Thompson Sampling, to do the A/B testing. The advantage of this method is that I do not need to worry about UX during the experiment period, since the algorithm automatically increases the probability of assigning the better model. However, with this method it is hard to do the analysis after the experiment. I cannot make a straightforward statement with quantified results, for instance, "A is better than B since its click rate is significantly larger than B's by xx%", so if the results of the experiment need to be reported to others, it might not be the best choice.