In the last article, Embedding Searching Articles and Corresponding Thompson Sampling, I explained how to build an embedding vector search engine with an OpenAI embedding model. However, OpenAI offers three embedding models: text-embedding-3-small, text-embedding-3-large, and ada v2. I could rule out text-embedding-3-large because of its poor performance on Mandarin, but I could not tell the difference between ada v2 and text-embedding-3-small in a short time. Therefore, I decided to run a Bayesian A/B test, more precisely Thompson Sampling, to decide which embedding model is more suitable for this project.
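Since we only record clicks here, we use the Beta-Bernoulli model for Thompson Sampling. The model looks like:
$$\theta \sim \mathrm{Beta}(\alpha, \beta), \qquad x \mid \theta \sim \mathrm{Bernoulli}(\theta)$$
where $\theta$ means the click probability, $x$ means click or not, and $\alpha$ and $\beta$ are the parameters of the Beta distribution.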
The steps to run the experiment with Thompson Sampling:
The following is the procedure of the experiment, from serving the search results to recording the clicks.
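As a rough illustration, a standard Beta-Bernoulli Thompson Sampling loop can be sketched as below. This is a minimal simulation and not the code used in the experiment: it assumes a uniform Beta(1, 1) prior, the names `BetaArm`, `choose_model`, and `simulate` are hypothetical, and the click rates passed to `simulate` are made-up numbers for demonstration only.

```python
import random


class BetaArm:
    """Beta posterior over one model's click probability (assumed Beta(1, 1) prior)."""

    def __init__(self, alpha=1, beta=1):
        self.alpha = alpha
        self.beta = beta

    def sample(self, rng):
        # Draw one plausible click probability from the current posterior.
        return rng.betavariate(self.alpha, self.beta)

    def update(self, clicked):
        # Bernoulli likelihood: a click raises alpha, a non-click raises beta.
        if clicked:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def trials(self):
        # Number of observations seen so far (prior pseudo-counts removed).
        return self.alpha + self.beta - 2


def choose_model(arms, rng):
    """Sample from every posterior and serve the model with the largest draw."""
    draws = {name: arm.sample(rng) for name, arm in arms.items()}
    return max(draws, key=draws.get)


def simulate(true_rates, rounds, seed=0):
    """Run the loop against made-up click rates (demonstration only)."""
    rng = random.Random(seed)
    arms = {name: BetaArm() for name in true_rates}
    for _ in range(rounds):
        name = choose_model(arms, rng)
        clicked = rng.random() < true_rates[name]
        arms[name].update(clicked)
    return arms


if __name__ == "__main__":
    # Hypothetical click rates, not measured values.
    result = simulate({"model_a": 0.2, "model_b": 0.8}, rounds=500)
    for name, arm in result.items():
        print(name, "alpha =", arm.alpha, "beta =", arm.beta, "trials =", arm.trials)
```

Because the model with the larger draw is served each round, traffic shifts toward the better-performing model as its posterior sharpens, while the weaker model is still tried occasionally.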
The Record structure is only for recording the results of the experiment.
When users click the URLs in the search results, the URLs send an HTTP request to a Google Cloud Function (GCF).
In GCF, the function checks whether the click came right after a search. Since we only want to know which embedding model produced the more suitable results for the customers we served, we only record clicks that come right after a search.
Once the click is verified, we update the parameters of the tested model. After recording, we redirect the user to the actual result article.
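The click-handling logic described above might look roughly like the following sketch. It is written as plain Python functions rather than an actual Cloud Function, and everything in it is an assumption for illustration: the in-memory dictionaries stand in for whatever storage the function uses, the names `new_state`, `record_search`, and `handle_click` are hypothetical, and "right after searching" is interpreted here as "the first click within a fixed time window of a known search".

```python
# Assumed interpretation of "right after searching": within this many seconds.
CLICK_WINDOW_SECONDS = 300


def new_state(models):
    """Start every model at an assumed uniform Beta(1, 1) prior."""
    return {model: {"alpha": 1, "beta": 1} for model in models}


def record_search(searches, search_id, model, now):
    """Remember which model served a search and when (hypothetical storage)."""
    searches[search_id] = {"model": model, "time": now, "clicked": False}


def handle_click(state, searches, search_id, target_url, now):
    """Count the click only if it directly follows a search, then redirect.

    Returns the URL the caller should redirect the user to.
    """
    search = searches.get(search_id)
    is_fresh = (
        search is not None
        and not search["clicked"]
        and now - search["time"] <= CLICK_WINDOW_SECONDS
    )
    if is_fresh:
        search["clicked"] = True
        # A click is a success for the model that produced the results.
        state[search["model"]]["alpha"] += 1
    # The user is always sent on to the article, counted or not.
    return target_url
```

Keeping the redirect unconditional means a click that is not counted (for example, one arriving long after the search) still takes the user to the article.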
I ended the experiment on July 3, 2024. The resulting parameters of the two Beta models are:
Model/Parameters | $\alpha$ | $\beta$ | Successes | Number of Trials |
---|---|---|---|---|
ada v2 | 63 | 35 | 62 | 96 |
text-embedding-3-small | 160 | 80 | 159 | 238 |
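To make the table concrete, the sketch below computes a posterior mean and an approximate credible interval for each model. It takes the first two numbers in each row as the Beta parameters (they equal successes + 1 and failures + 1). The 95% level, the Monte Carlo approach using only the standard library, and the function names are assumptions made for this example, not details from the experiment.

```python
import random


def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def credible_interval(alpha, beta, level=0.95, draws=20000, seed=0):
    """Approximate equal-tailed credible interval from sorted posterior draws."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[round(tail * draws)]
    upper = samples[round((1.0 - tail) * draws) - 1]
    return lower, upper


# Final Beta parameters taken from the table above.
posteriors = {
    "ada v2": (63, 35),
    "text-embedding-3-small": (160, 80),
}

if __name__ == "__main__":
    for name, (alpha, beta) in posteriors.items():
        low, high = credible_interval(alpha, beta)
        mean = posterior_mean(alpha, beta)
        print(f"{name}: mean={mean:.3f}, interval=({low:.3f}, {high:.3f})")
```

The posterior means come out to about 0.643 for ada v2 and 0.667 for text-embedding-3-small; the interval endpoints depend on the random draws, so they are approximate.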
The following shows the mean probability and the corresponding credible interval for each model. The x-axis is the number of clicks, and the y-axis is the probability.
You can see that most of the time Small is higher than Ada, and both became stable after around 300 clicks. You can see how the Beta distributions change in the next GIF.
Since the experiment period was ending and both models had stabilized, I decided to use text-embedding-3-small as the embedding model for the article search engine.
This time, I used a different method, Thompson Sampling, to do the A/B testing. The advantage of this method is that I do not need to worry about UX during the experiment period, since the algorithm automatically increases the probability of serving the better model. However, with this method it is hard to do the analysis after the experiment. I cannot make a straightforward statement with quantified results, for instance, "A is better than B since its click rate is significantly larger than B's by xx%", so if the results of the experiment need to be reported to others, it might not be the best choice.