# OpenAI Evals
:::danger
**GitHub:** https://github.com/openai/evals
**YouTube:** https://www.youtube.com/watch?v=XGJNo8TpuVA&t=1078s (17:20~25:28)
:::
- Evaluating Performance
    1. Create eval suites
    2. Use model-graded evals
## 1. Create eval suites
- A lack of evaluations has been a key challenge in deploying LLM applications to production
- Model evals are unit tests for the LLM

- Types of mistakes to build evals for (a concrete sketch follows this list):
    - Bad output formatting
    - Inaccurate responses/actions
    - Going off the rails
    - Bad tone
    - Hallucinations
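To make the "unit tests for the LLM" idea concrete, below is a minimal sketch in Python of eval cases targeting two of the mistake types above (bad output formatting and inaccurate responses). The checkers and test cases are illustrative assumptions, not part of the openai/evals framework.

```python
import json

def check_json_format(output: str) -> bool:
    # Mistake type: bad output formatting.
    # Passes only if the model returned valid JSON.
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def check_exact_match(output: str, expected: str) -> bool:
    # Mistake type: inaccurate responses.
    # Passes only if the answer matches the known-correct value.
    return output.strip().lower() == expected.strip().lower()

# Hypothetical eval suite: (prompt, checker) pairs.
EVAL_SUITE = [
    ("Return the user's name and age as JSON: Alice, 30",
     check_json_format),
    ("What is the capital of France? Answer with one word.",
     lambda out: check_exact_match(out, "Paris")),
]
```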


- When human feedback is impractical or costly, automated evaluations allow developers to monitor progress and detect regressions
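Continuing the sketch above, a small runner can score the whole suite automatically and report a single pass rate to compare across model or prompt versions; a drop between runs signals a regression. This assumes the official `openai` Python SDK and reuses the hypothetical `EVAL_SUITE` from the previous block; the model name is an arbitrary choice.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete(prompt: str) -> str:
    # Call the model under evaluation; gpt-3.5-turbo is an arbitrary choice here.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def run_suite(suite) -> float:
    # Run every eval case and return the fraction that passed.
    passed = sum(checker(complete(prompt)) for prompt, checker in suite)
    return passed / len(suite)

if __name__ == "__main__":
    print(f"pass rate: {run_suite(EVAL_SUITE):.0%}")
```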
## 2. Use model-graded evals
- GPT-4 is capable enough to grade eval outputs for you, so a strong model can serve as an automated judge of another model's responses
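Below is a minimal sketch of a model-graded eval, again assuming the official `openai` Python SDK: GPT-4 receives the question, the answer under test, and a rubric, and replies PASS or FAIL. The rubric wording and the `grade()` helper are illustrative assumptions, not the grading templates used in openai/evals.

```python
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rubric: The answer must be factually correct and directly address the question.
Reply with exactly one word: PASS or FAIL."""

def grade(question: str, answer: str) -> bool:
    # Ask GPT-4 to judge the answer against the rubric.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": GRADER_PROMPT.format(question=question,
                                                   answer=answer)}],
        temperature=0,  # deterministic grading for reproducible verdicts
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")

# Example: grade an answer produced by the model under test.
print(grade("What is the capital of France?", "Paris"))
```

Pinning `temperature` to 0 keeps the grader's verdicts stable across repeated runs, which matters when you track pass rates over time.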
## Summary
**Good Evals = Correlated with outcomes + High coverage + Scalable to compute**