# Datasets
ZEDataset: A pydantic `BaseModel`, identified by `dataset_id`, referring to a ZeroEntropy training dataset. It requires the local directory `./data/datasets/{dataset_id}` to exist, containing two files (both sketched in the code below):
- `ze_results.jsonl`
  - Each line is a (query, top ~100 documents) record, where the documents are retrieved using hybrid search (BM25 + vector search, fused with reciprocal rank fusion).
  - Our final dataset has ~20k queries, i.e. ~2M (q, d) pairs.
- `ai_scores.json`
  - For each query, two of the ~100 documents are selected. The (q, d1, d2) triplet is then judged by 3 LLMs (OpenAI, Anthropic, Gemini), and the three scores are stored as a single datapoint. `ai_scores.json` contains all such datapoints.
  - Usually, the "two out of the ~100" are picked at random. However, "_rl" datasets use an RL-style idea: we intelligently pick (d1, d2) pairs that our current model is likely to get wrong.
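A minimal sketch of these structures; the field names are illustrative assumptions, not the exact on-disk schema:

```python
from pathlib import Path

from pydantic import BaseModel


class ZEResult(BaseModel):
    """One line of ze_results.jsonl. Field names are illustrative."""
    query: str
    documents: list[str]  # top ~100 docs from hybrid search (BM25 + vector, RRF)


class AIScoreDatapoint(BaseModel):
    """One datapoint of ai_scores.json: a (q, d1, d2) judged by the ensemble."""
    query: str
    d1_index: int  # index of d1 within the query's retrieved documents
    d2_index: int  # index of d2
    scores: tuple[float, float, float]  # (OpenAI, Anthropic, Gemini) judgements


class ZEDataset(BaseModel):
    """Identified by dataset_id; expects ./data/datasets/{dataset_id} to exist."""
    dataset_id: str

    @property
    def root(self) -> Path:
        return Path("./data/datasets") / self.dataset_id
```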
# Scripts
./ml/training/pairwise/train.py
- From `ai_scores.json` and a list of dataset ids, we train a 4B Qwen model to mimic the pairwise judgements of the (OpenAI, Anthropic, Gemini) ensemble. 1 query : 1 pair.
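A minimal sketch of the objective, assuming the model emits a single logit for P(d1 > d2) and that the three ensemble scores are averaged into a soft target; both are assumptions, not confirmed details of `train.py`:

```python
import torch
import torch.nn.functional as F


def ensemble_label(scores: tuple[float, float, float]) -> float:
    """Collapse the (OpenAI, Anthropic, Gemini) scores into one soft target,
    assuming each score is already a preference in [0, 1] for d1 over d2.
    Plain averaging is an assumption; the real aggregation may differ."""
    return sum(scores) / len(scores)


def pairwise_loss(logit: torch.Tensor, scores: tuple[float, float, float]) -> torch.Tensor:
    """Binary cross-entropy between the model's P(d1 > d2) and the ensemble label.
    `logit` is the scalar the model outputs for one (q, d1, d2) prompt."""
    target = torch.tensor(ensemble_label(scores), dtype=logit.dtype)
    return F.binary_cross_entropy_with_logits(logit, target)
```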
./ml/training/pairwise/sft_generate.py
- This runs the pairwise AI model 400 times for each query, so each of the ~100 documents appears in ~8 comparisons (400 pairs × 2 documents / ~100 documents). The purpose is to later use the comparisons for ELO scoring. 1 query : 400 pairwise comparisons.
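A minimal sketch of the pair sampling, assuming uniform random pairs; the real script may instead balance how often each document appears:

```python
import random


def sample_pairs(n_docs: int = 100, n_pairs: int = 400) -> list[tuple[int, int]]:
    """Sample (d1, d2) index pairs to feed the pairwise model for one query.
    Uniform sampling without self-pairs; order matters, since the model
    may have a positional bias between d1 and d2."""
    pairs = []
    for _ in range(n_pairs):
        d1, d2 = random.sample(range(n_docs), 2)
        pairs.append((d1, d2))
    return pairs
```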
./ml/training/elos/sft_elo.py
- Takes the 400 pairwise comparisons per query and outputs ~100 ELO scores, one per document. These go into e.g. `v0.3_elo.jsonl`. No AI involved, just math (a Bradley-Terry fit).
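A minimal sketch of the Bradley-Terry fit on an Elo-style scale (base-10 logistic, 400-point spread); the scale, the L2 regularizer, and hard win/loss outcomes are all assumptions:

```python
def fit_elos(
    comparisons: list[tuple[int, int]],  # (winner_index, loser_index) pairs
    n_docs: int = 100,
    lr: float = 8.0,
    iters: int = 2000,
    l2: float = 1e-4,
) -> list[float]:
    """Fit ELO scores by gradient ascent on the Bradley-Terry log-likelihood,
    parameterized on the usual Elo scale:
        P(i beats j) = 1 / (1 + 10 ** ((s_j - s_i) / 400)).
    The constant ln(10)/400 from the true gradient is absorbed into `lr`; the
    small L2 pull toward 0 pins down the otherwise shift-invariant scores."""
    s = [0.0] * n_docs
    for _ in range(iters):
        grad = [-l2 * x for x in s]
        for w, l in comparisons:
            p_win = 1.0 / (1.0 + 10.0 ** ((s[l] - s[w]) / 400.0))
            grad[w] += 1.0 - p_win  # push the winner up...
            grad[l] -= 1.0 - p_win  # ...and the loser down, by the surprise
        for i in range(n_docs):
            s[i] += lr * grad[i]
    return s
```

With ~8 comparisons per document the data is sparse, so the regularizer also keeps unbeaten or winless documents from drifting to extreme scores.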
./ml/training/pointwise/train_elo.py
- Takes `v0.3_elo.jsonl` and trains a pointwise model that takes (q, d) and outputs the ELO score.
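A minimal sketch of the pointwise objective, assuming plain MSE regression against the stored ELO scores; `model(queries, documents)` returning one scalar per pair is a hypothetical interface:

```python
import torch.nn.functional as F


def pointwise_step(model, optimizer, batch) -> float:
    """One training step: the model reads (query, document) pairs and predicts
    ELO scores directly; we regress against the scores from `v0.3_elo.jsonl`."""
    queries, documents, target_elos = batch  # target_elos: tensor of shape (batch,)
    pred_elos = model(queries, documents)    # hypothetical interface, shape (batch,)
    loss = F.mse_loss(pred_elos, target_elos)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```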