# LLM evaluations
---
## Definition
1. The definition of an LLM.
2. The scope today is HELM: Holistic Evaluation of Language Models (NLP ;))
- 
- [Ref: screenshot from Rishi Bommasani, Holistic Evaluation of Language Models (February 15th, 2023)](https://www.youtube.com/watch?v=A0kD00WdlKY)
3. Last time we quickly covered a fairly broad range of topics; this time we focus only on HELM.
- 
---
## Intentions
### Why build LLM applications
Before we get lost among the metrics, remember why we build LLM applications: we are trying to get the best answer (text response) generator.
## Why LLM evaluation
- (1) How to **quantify** the evaluation (low / average / high).
- (2) 
- So that you can take follow-up actions instead of stopping there.
- [screenshot from Rishi Bommasani, Holistic Evaluation of Language Models (February 15th, 2023)](https://www.youtube.com/watch?v=A0kD00WdlKY)
---
## Motivation of HELM
- The motivation behind doing model evaluation at Stanford: societal impact.
- 
- 
- [screenshot from Rishi Bommasani, Holistic Evaluation of Language Models (February 15th, 2023)](https://www.youtube.com/watch?v=A0kD00WdlKY)
---
## HELM
- https://crfm.stanford.edu/helm/latest/
- HELM design principles.
- 
- The multi-metric design helps you prioritise based on your use case. Benchmarking has also evolved from ImageNet to the latest HELM (now multi-dataset x multi-metric).
- 
## HELM: accuracy, robustness, calibration, efficiency
- Robustness > e.g. a perturbed prompt such as "explain me pythonn as like I am 5 years old"; the model should still answer well despite the typos.
- Calibration > if the model does not know the answer, it should not hand random information to your customer (see the calibration sketch after this list).
- (i) Instruction fine-tuned: "Answer the questions below. If you don't know the answer, just say so."
- (ii) No instruction fine-tuning (e.g. Jurassic-2 Jumbo).
- The original function (experiment flow).
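
To make the calibration point above concrete, here is a minimal sketch of expected calibration error (ECE), one common way to quantify whether a model's stated confidence matches its accuracy. The confidences and correctness flags are made-up illustration values, not HELM output, and HELM's own calibration implementation may differ.

```python
# Minimal sketch of expected calibration error (ECE).
# The confidences/correctness below are made-up illustration data, not HELM output.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average of |accuracy - mean confidence| per confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in the bin
    return ece

# An overconfident model says "I'm ~90% sure" while being right far less often.
stated_confidence = [0.95, 0.90, 0.92, 0.60, 0.55]
answer_was_correct = [1, 0, 0, 1, 0]
print(f"ECE = {expected_calibration_error(stated_confidence, answer_was_correct):.3f}")
```

A lower ECE means the model's confidence is more trustworthy, which is exactly the "don't hand random information to the customer" property.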
6. Let's look at common scenarios and common metrics.

[screenshot from https://www.youtube.com/watch?v=1jReCwBgM84]
- [Hugging Face Course](https://www.youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o).
- [BLEU metrics](https://www.youtube.com/watch?v=M05L1DhFqcw&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=62).
- [ROUGE metrics](https://www.youtube.com/watch?v=TMshhnrEXlg&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=64) (a minimal sketch of computing both follows below).
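
Following the metric links above, a minimal sketch of computing BLEU and ROUGE with the Hugging Face `evaluate` library (assumes `pip install evaluate rouge_score` plus their dependencies; the candidate and reference sentences are made up):

```python
# Minimal sketch: scoring generated text with the Hugging Face `evaluate` library.
# Requires: pip install evaluate rouge_score (plus their dependencies).
# The candidate/reference sentences are made-up illustration data.
import evaluate

predictions = ["the cat sat on the mat"]
references = [["a cat was sitting on the mat"]]  # BLEU allows multiple references per prediction

bleu = evaluate.load("bleu")    # n-gram precision; common for translation-style tasks
rouge = evaluate.load("rouge")  # n-gram / longest-common-subsequence overlap; common for summarization

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=[refs[0] for refs in references]))
```

Both are n-gram overlap scores, so they only approximate quality; HELM pairs this kind of automatic metric with robustness, calibration, and efficiency measurements.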
7. Run HELM from the GitHub repo ([installation guide](https://crfm-helm.readthedocs.io/en/latest/installation/)).
8. 
9. What's the best way to tune the parameters? Read the paper, check which parameters matter, and try prompt-tuning.
10. Customize.
11. Pull models that are not on Hugging Face; an adapter is probably needed.
12. Add dataset sources: `helm > benchmark > scenarios > mmlu_scenario`.
13. After HELM, there is still some work to be done on your side (see the sketch below).
- 
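
As noted above, a hedged sketch of that follow-up work: gathering HELM's per-run statistics into one place so you can prioritise by your own use case. The `benchmark_output/runs/<suite>/<run>/stats.json` layout is an assumption based on the HELM docs; check what your `helm-run` actually produced.

```python
# Hedged sketch: collect HELM per-run stats so you can rank runs by your own criteria.
# The "benchmark_output/runs/<suite>/<run>/stats.json" layout is an assumption from
# the HELM docs; adjust the path to whatever your helm-run produced.
import json
from pathlib import Path

suite_dir = Path("benchmark_output/runs/my-suite")  # hypothetical suite name

all_stats = {}
for stats_file in suite_dir.glob("*/stats.json"):
    run_name = stats_file.parent.name
    with stats_file.open() as f:
        all_stats[run_name] = json.load(f)

for run_name, stats in sorted(all_stats.items()):
    print(f"{run_name}: {len(stats)} stat entries")
```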
14. References
15. Evaluation at scale; the cost of experiments (see the cost sketch below).
16. 
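
To make the experiment-cost point concrete, a back-of-the-envelope sketch. Every number (model count, scenario count, instances, tokens, price per 1K tokens) is an assumption for illustration, not a real price list or a HELM default.

```python
# Back-of-the-envelope cost of an evaluation sweep.
# Every number here is an illustrative assumption, not a real price or HELM default.
n_models = 4
n_scenarios = 16
instances_per_scenario = 1_000
tokens_per_instance = 1_500        # assumed average prompt + completion length
price_per_1k_tokens_usd = 0.002    # assumed API price

total_tokens = n_models * n_scenarios * instances_per_scenario * tokens_per_instance
estimated_cost_usd = total_tokens / 1_000 * price_per_1k_tokens_usd

print(f"total tokens: {total_tokens:,}")              # 96,000,000 with these assumptions
print(f"estimated cost: ${estimated_cost_usd:,.2f}")  # $192.00 with these assumptions
```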
---
## Appendix
- AWS basic knowledge
- [HELM paper](https://arxiv.org/pdf/2211.09110.pdf)
- [A Survey on Evaluation of Large Language Models](https://arxiv.org/pdf/2307.03109.pdf); includes an exhaustive list as a mind map.
- GenAI at MIT channel: 
---
10. HITL (human in the loop).

[screenshot from https://www.youtube.com/watch?v=1jReCwBgM84]
11. Compare the SageMaker Python SDK and the boto3 SDK (see the sketch after this list).
- SageMaker Python SDK: better for data scientists.
- boto3 SDK: metadata management and managing services.
- Sometimes you can definitely see overlap between the two SDKs.
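
A hedged sketch of that overlap: the same training-job metadata reached through the lower-level boto3 client and the higher-level SageMaker Python SDK. The job name is hypothetical, and both calls assume AWS credentials with SageMaker permissions.

```python
# Hedged sketch: the same training-job metadata via both SDKs.
# Requires AWS credentials; "my-training-job" is a hypothetical job name.
import boto3
import sagemaker

JOB_NAME = "my-training-job"

# boto3: low-level client, 1:1 with the SageMaker API.
# Natural fit for metadata management and automating service operations.
sm_client = boto3.client("sagemaker")
print(sm_client.describe_training_job(TrainingJobName=JOB_NAME)["TrainingJobStatus"])

# SageMaker Python SDK: higher-level abstractions (sessions, estimators),
# friendlier for data scientists; it calls the same API underneath.
session = sagemaker.Session()
print(session.describe_training_job(JOB_NAME)["TrainingJobStatus"])
```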
12. Skipped: (Rishi) experiment details.
- Case 1: MMLU
- 
- 
- 
- Case 2: RAFT
- 
13. Skipped: (Rishi) scenario taxonomy/classification.
- 
- 
14. Skipped: (Rishi) metric selection.
- 
- Sometimes a new use case is one that NLP researchers have not seen before, so the corresponding metrics have not been developed yet.
- 
15. Skipped: (Rishi) metric: calibration.
- 
16. Skipped: (Rishi) the whole story of HELM is about scenarios x metrics; sometimes the limitations come from the datasets.
- 
17. Skipped: (Rishi) selected research topics:
- 
18. Skipped: (Rishi) selected models:
- 
- The access level of these models (API / closed source).
19. Skipped: (Rishi) several pages of result tables.
20. Further reading.
- [Evaluation: Bias & Toxicity (YouTube)](https://www.youtube.com/watch?v=Qm--_1M_Uvk)
- [Evaluation: Factuality and Hallucination (YouTube)](https://www.youtube.com/watch?v=OWD54-RVYjc)
- [Deep Dive into LLM Evaluation with Weights & Biases (YouTube)](https://www.youtube.com/watch?v=7EcznH0-of8)
- [Language Model Evaluation and Perplexity (YouTube)](https://www.youtube.com/watch?v=gHHy2w2agEo)
- [Evaluating LLM-based Applications // Josh Tobin // LLMs in Prod Conference Part 2 (YouTube)](https://www.youtube.com/watch?v=r-HUnht-Gns)
- [LLM Evaluation Basics: Datasets & Metrics (YouTube)](https://www.youtube.com/watch?v=1jReCwBgM84)
- [Large Language Model Evaluation in 2023: 5 Methods (AIMultiple)](https://research.aimultiple.com/large-language-model-evaluation/)
- [The Importance of Evaluating Large Language Models (Minhajul Hoque, Medium)](https://medium.com/@minh.hoque/the-importance-of-evaluating-large-language-models-6eff1f68de5f)
- [How to Evaluate a Large Language Model (Analytics Vidhya)](https://www.analyticsvidhya.com/blog/2023/05/how-to-evaluate-a-large-language-model-llm/)
- [How to Evaluate, Compare, and Optimize LLM Systems (Weights & Biases)](https://wandb.ai/ayush-thakur/llm-eval-sweep/reports/How-to-Evaluate-Compare-and-Optimize-LLM-Systems--Vmlldzo0NzgyMTQz)
- [How to evaluate an LLM on your data? (Radek Osmulski)](https://radekosmulski.com/how-to-evaluate-an-llm-on-your-data/)
- [What is LLM Evaluation (Deepchecks)](https://deepchecks.com/glossary/llm-evaluation/)
- [stanford-crfm/helm: Holistic Evaluation of Language Models (HELM)](https://github.com/stanford-crfm/helm)
- [Fine-tuning with Instruction prompts, Model Evaluation Metrics, and Evaluation Benchmarks for LLMs (kanika adik, GoPenAI)](https://blog.gopenai.com/fine-tuning-with-instruction-prompts-model-evaluation-metrics-and-evaluation-benchmarks-for-llms-7dc49e8dade9)
- [Evaluation on LLMs (Sheldon L., Medium)](https://medium.com/@sheldon88/evaluation-on-llms-4914215459d7)
- [Fine-Tuning and Evaluating Large Language Models (LLMs) (Tarapong Sreenuch, Medium)](https://sreent.medium.com/fine-tuning-and-evaluating-large-language-models-llms-f38f245f87f9)
- [Beware of Unreliable Data in Model Evaluation: A LLM Prompt Selection case study with Flan-T5 (Chris Mauck, Towards Data Science)](https://towardsdatascience.com/beware-of-unreliable-data-in-model-evaluation-a-llm-prompt-selection-case-study-with-flan-t5-88cfd469d058)
- [4 Crucial Factors for Evaluating Large Language Models in Industry Applications (Skanda Vivek, Towards Data Science)](https://towardsdatascience.com/4-crucial-factors-for-evaluating-large-language-models-in-industry-applications-f0ec8f6d4e9e)
- [aws-samples/llm-evaluation-methodology](https://github.com/aws-samples/llm-evaluation-methodology/tree/main)
- [Generative AI Dive Deep (Highspot)](https://aws.highspot.com/items/648a3874a558e6026053cf10?lfrm=srp.1#66)
- [Generative AI customer-facing deck with industry slides (Highspot)](https://aws.highspot.com/items/632e4909931439bbe94f8408?lfrm=srp.4#9)
- [Holistic Evaluation of Language Models (HELM) leaderboard](https://crfm.stanford.edu/helm/latest/?scenarios=1)
- [helm/src/helm/benchmark/test_data_preprocessor.py (stanford-crfm/helm)](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/test_data_preprocessor.py)
- [Tutorial - CRFM HELM](https://crfm-helm.readthedocs.io/en/latest/tutorial/#using-helm-run)
- [HELM paper (arXiv:2211.09110)](https://arxiv.org/pdf/2211.09110.pdf)
- [Evaluation: LLM robustness and self-consistency (YouTube)](https://www.youtube.com/watch?v=XNlfvnEyB8Y)
- [Mia Chang - YouTube channel](https://www.youtube.com/channel/UCVyyGRg_74MbwVq3yTPkznQ)
- [淺談 LLM 應用開發 Roadmap (A brief look at the LLM application development roadmap)](https://gamma.app/public/-LLM-Roadmap-0pv5lh3kbogyoae?mode=doc)