# Datasets
ZEDataset: A pydantic `BaseModel`, identified by `dataset_id`, referring to a ZeroEntropy training dataset. It requires the local directory `./data/datasets/{dataset_id}` to exist, containing two files (both sketched in the code below):
- `ze_results.jsonl`
  - Each line is a (query, top ~100 documents) record, where the documents are retrieved using hybrid search (BM25 + vector search, fused with reciprocal rank fusion).
  - Our final dataset has ~20k queries, i.e. ~2M (q, d) pairs.
- `ai_scores.json`
  - For each query, two of the ~100 documents are selected. The (q, d1, d2) triplet is then judged by 3 LLMs (OpenAI, Anthropic, Gemini), and the three scores are stored as a single datapoint. `ai_scores.json` contains all such datapoints.
  - Usually, the "two out of the ~100" are picked at random. However, "_rl" datasets use an RL-style idea: we intelligently pick (d1, d2) pairs that our current model is likely to get wrong.
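A minimal sketch of these structures; the field names are illustrative assumptions, not the exact on-disk schema:

```python
from pathlib import Path

from pydantic import BaseModel


class ZEResult(BaseModel):
    """One line of ze_results.jsonl. Field names are illustrative."""
    query: str
    documents: list[str]  # top ~100 docs from hybrid search (BM25 + vector, RRF)


class AIScoreDatapoint(BaseModel):
    """One datapoint of ai_scores.json: a (q, d1, d2) judged by the ensemble."""
    query: str
    d1_index: int  # index of d1 within the query's retrieved documents
    d2_index: int  # index of d2
    scores: tuple[float, float, float]  # (OpenAI, Anthropic, Gemini) judgements


class ZEDataset(BaseModel):
    """Identified by dataset_id; expects ./data/datasets/{dataset_id} to exist."""
    dataset_id: str

    @property
    def root(self) -> Path:
        return Path("./data/datasets") / self.dataset_id
```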
# Scripts
./ml/training/pairwise/train.py
- From `ai_scores.json` and a list of dataset ids, we train a 4B Qwen model to mimic the pairwise judgements of the (OpenAI, Anthropic, Gemini) ensemble. 1 query : 1 pair.
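A minimal sketch of the objective, assuming the model emits a single logit for P(d1 > d2) and that the three ensemble scores are averaged into a soft target; both are assumptions, not confirmed details of `train.py`:

```python
import torch
import torch.nn.functional as F


def ensemble_label(scores: tuple[float, float, float]) -> float:
    """Collapse the (OpenAI, Anthropic, Gemini) scores into one soft target,
    assuming each score is already a preference in [0, 1] for d1 over d2.
    Plain averaging is an assumption; the real aggregation may differ."""
    return sum(scores) / len(scores)


def pairwise_loss(logit: torch.Tensor, scores: tuple[float, float, float]) -> torch.Tensor:
    """Binary cross-entropy between the model's P(d1 > d2) and the ensemble label.
    `logit` is the scalar the model outputs for one (q, d1, d2) prompt."""
    target = torch.tensor(ensemble_label(scores), dtype=logit.dtype)
    return F.binary_cross_entropy_with_logits(logit, target)
```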
./ml/training/pairwise/sft_generate.py
- This runs the pairwise AI model 400 times for each query, so each of the ~100 documents appears in ~8 comparisons (400 pairs × 2 documents / ~100 documents). The purpose is to later use the comparisons for ELO scoring. 1 query : 400 pairwise comparisons.
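A minimal sketch of the pair sampling, assuming uniform random pairs; the real script may instead balance how often each document appears:

```python
import random


def sample_pairs(n_docs: int = 100, n_pairs: int = 400) -> list[tuple[int, int]]:
    """Sample (d1, d2) index pairs to feed the pairwise model for one query.
    Uniform sampling without self-pairs; order matters, since the model
    may have a positional bias between d1 and d2."""
    pairs = []
    for _ in range(n_pairs):
        d1, d2 = random.sample(range(n_docs), 2)
        pairs.append((d1, d2))
    return pairs
```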
./ml/training/elos/sft_elo.py
- Takes the 400 pairwise comparisons per query and outputs ~100 ELO scores, one per document. These go into e.g. `v0.3_elo.jsonl`. No AI involved, just math (a Bradley-Terry fit).
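A minimal sketch of the Bradley-Terry fit on an Elo-style scale (base-10 logistic, 400-point spread); the scale, the L2 regularizer, and hard win/loss outcomes are all assumptions:

```python
def fit_elos(
    comparisons: list[tuple[int, int]],  # (winner_index, loser_index) pairs
    n_docs: int = 100,
    lr: float = 8.0,
    iters: int = 2000,
    l2: float = 1e-4,
) -> list[float]:
    """Fit ELO scores by gradient ascent on the Bradley-Terry log-likelihood,
    parameterized on the usual Elo scale:
        P(i beats j) = 1 / (1 + 10 ** ((s_j - s_i) / 400)).
    The constant ln(10)/400 from the true gradient is absorbed into `lr`; the
    small L2 pull toward 0 pins down the otherwise shift-invariant scores."""
    s = [0.0] * n_docs
    for _ in range(iters):
        grad = [-l2 * x for x in s]
        for w, l in comparisons:
            p_win = 1.0 / (1.0 + 10.0 ** ((s[l] - s[w]) / 400.0))
            grad[w] += 1.0 - p_win  # push the winner up...
            grad[l] -= 1.0 - p_win  # ...and the loser down, by the surprise
        for i in range(n_docs):
            s[i] += lr * grad[i]
    return s
```

With ~8 comparisons per document the data is sparse, so the regularizer also keeps unbeaten or winless documents from drifting to extreme scores.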
./ml/training/pointwise/train_elo.py
- Takes `v0.3_elo.jsonl` and trains a pointwise model that takes (q, d) and outputs the ELO score.
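A minimal sketch of the pointwise objective, assuming plain MSE regression against the stored ELO scores; `model(queries, documents)` returning one scalar per pair is a hypothetical interface:

```python
import torch.nn.functional as F


def pointwise_step(model, optimizer, batch) -> float:
    """One training step: the model reads (query, document) pairs and predicts
    ELO scores directly; we regress against the scores from `v0.3_elo.jsonl`."""
    queries, documents, target_elos = batch  # target_elos: tensor of shape (batch,)
    pred_elos = model(queries, documents)    # hypothetical interface, shape (batch,)
    loss = F.mse_loss(pred_elos, target_elos)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```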