# OpenAI Evals

:::danger
**GitHub:** https://github.com/openai/evals
**YouTube:** https://www.youtube.com/watch?v=XGJNo8TpuVA&t=1078s (17:20~25:28)
:::

- Evaluating Performance
  1. Create eval suites
  2. Use model-graded evals

## 1. Create eval suites

- Lack of evaluations has been a key challenge for deploying LLMs to production
- Model evals are unit tests for the LLM

![image](https://hackmd.io/_uploads/rkAEap3KT.png)

- Types of mistakes to build evals for:
  - Bad output formatting
  - Inaccurate responses/actions
  - Going off the rails
  - Bad tone
  - Hallucinations

![image](https://hackmd.io/_uploads/S1BmCphtp.png)
![image](https://hackmd.io/_uploads/rkcH0a3Y6.png)

- When human feedback is impractical or costly, automated evaluations let developers monitor progress and detect regressions

## 2. Use model-graded evals

- GPT-4 is smart enough to grade eval outputs for you (see the sketch after the summary below)

![image](https://hackmd.io/_uploads/Bk921AhtT.png)
![image](https://hackmd.io/_uploads/r18zg0hFT.png)
![image](https://hackmd.io/_uploads/S1FrxChFT.png)
![image](https://hackmd.io/_uploads/H1BteAhFT.png)

## Summary

**Good Evals = Correlated with outcomes + High coverage + Scalable to compute**
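Below is a minimal sketch of a model-graded eval, assuming the OpenAI Python SDK (`openai>=1.0`) and an `OPENAI_API_KEY` in the environment. The eval cases, grading rubric, and helper names are illustrative placeholders, not from the talk or the openai/evals registry; they only show the pattern of a cheaper model being tested and GPT-4 acting as the grader.

```python
# Minimal sketch: a stronger model (GPT-4) grades a weaker model's answers.
# EVAL_CASES and GRADER_TEMPLATE are hypothetical examples, not from openai/evals.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EVAL_CASES = [
    {"prompt": "What year did Apollo 11 land on the Moon?", "ideal": "1969"},
    {"prompt": "Name the largest planet in the Solar System.", "ideal": "Jupiter"},
]

GRADER_TEMPLATE = """You are grading a model's answer against an ideal answer.
Question: {prompt}
Ideal answer: {ideal}
Model answer: {answer}
Reply with exactly one word: PASS if the model answer is factually consistent
with the ideal answer, otherwise FAIL."""


def get_answer(prompt: str) -> str:
    """Call the model under test."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def grade(prompt: str, ideal: str, answer: str) -> bool:
    """Ask GPT-4 to grade the answer; temperature=0 keeps grading deterministic."""
    grading_prompt = GRADER_TEMPLATE.format(prompt=prompt, ideal=ideal, answer=answer)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": grading_prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")


if __name__ == "__main__":
    passed = 0
    for case in EVAL_CASES:
        answer = get_answer(case["prompt"])
        if grade(case["prompt"], case["ideal"], answer):
            passed += 1
    print(f"{passed}/{len(EVAL_CASES)} eval cases passed")
```

Run as a script (or from CI) after each prompt or model change; the pass rate acts like a unit-test suite for catching regressions without human review.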