# LLM Evaluations

---

## Definition

1. The definition of LLM evaluation.
2. The scope today is HELM: Holistic Evaluation of Language Models (NLP ;))
    - ![](https://hackmd.io/_uploads/HJ4TU1d6h.png)
    - [Screenshot from Rishi Bommasani, "Holistic Evaluation of Language Models" (February 15, 2023)](https://www.youtube.com/watch?v=A0kD00WdlKY)
3. Last time we went over a fairly broad set of topics at speed; this time we focus only on HELM.
    - ![](https://hackmd.io/_uploads/SyPyRvOTh.png)

---

## Intentions

### Why build LLM applications

Before we get lost among the metrics, remember why we build LLM applications in the first place: we are trying to get the best answer (text response) generator.

### Why LLM evaluation

- (1) To **quantify** the evaluation (low / average / high),
- (2) so that you can take follow-up actions instead of stopping there.
    - ![](https://hackmd.io/_uploads/r14CBJdT2.png)
    - [Screenshot from Rishi Bommasani, "Holistic Evaluation of Language Models" (February 15, 2023)](https://www.youtube.com/watch?v=A0kD00WdlKY)

---

## Motivation of HELM

- The motivation behind model evaluation at Stanford: societal impact.
- ![](https://hackmd.io/_uploads/r15w41O6n.png)
- ![](https://hackmd.io/_uploads/BkPaE1d6n.png)
- [Screenshots from Rishi Bommasani, "Holistic Evaluation of Language Models" (February 15, 2023)](https://www.youtube.com/watch?v=A0kD00WdlKY)

---

## HELM

- https://crfm.stanford.edu/helm/latest/
- HELM design principles:
    - ![](https://hackmd.io/_uploads/SJOjukOan.png)
    - ![](https://hackmd.io/_uploads/B1s6ukuT3.png)
    - The multi-metric view helps you prioritize based on your use case. Evaluation has also evolved from the ImageNet era (one dataset, one metric) to the latest HELM (multi-dataset × multi-metric).
    - ![](https://hackmd.io/_uploads/S16EFyd6n.png)

## HELM: accuracy, robustness, calibration, efficiency

- Robustness: a perturbed prompt such as "explain me pythonn as like I am 5 years old" (note the typo) should still get a sensible, consistent answer.
- Calibration: if the model does not know the answer, it should not hand random information to your customer. (A toy calibration sketch appears after the metrics list below.) Compare:
    - (i) Instruction fine-tuned: "Answer the following questions. If you don't know the answer, just say so."
    - (ii) Not instruction fine-tuned (Jurassic-2 Jumbo).
- The original setup (experiment flow): ![](https://hackmd.io/_uploads/ryfQj1dT3.png)

4. Let's look at common scenarios and common metrics.
    - ![](https://hackmd.io/_uploads/BkNuRAP6n.png)
    - [Screenshot from "LLM Evaluation Basics: Datasets & Metrics"](https://www.youtube.com/watch?v=1jReCwBgM84)
    - [Hugging Face Course](https://www.youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o)
    - [BLEU metrics](https://www.youtube.com/watch?v=M05L1DhFqcw&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=62)
    - [ROUGE metrics](https://www.youtube.com/watch?v=TMshhnrEXlg&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=64)
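To make the two metrics above concrete, here is a minimal sketch using the Hugging Face `evaluate` library (my own example, not something from the talk; assumes `pip install evaluate rouge_score`, and the sentences are made-up toy strings rather than benchmark data):

```python
# pip install evaluate rouge_score
import evaluate

# Toy prediction/reference pair -- not from any benchmark.
predictions = ["the cat sat on the mat"]
bleu_references = [["the cat is sitting on the mat"]]  # BLEU allows several references per prediction
rouge_references = ["the cat is sitting on the mat"]

bleu = evaluate.load("bleu")
print(bleu.compute(predictions=predictions, references=bleu_references))
# {'bleu': ..., 'precisions': [...], 'brevity_penalty': ..., ...}

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=rouge_references))
# {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```

BLEU is precision-oriented (machine translation); ROUGE is recall-oriented (summarization). For open-ended LLM answers, treat both as weak signals rather than ground truth.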
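The calibration bullet from the HELM slide above can also be turned into a number. One common summary is the expected calibration error (ECE): bucket the model's answers by its stated confidence, then compare average confidence against accuracy inside each bucket. A toy NumPy sketch, assuming you already have per-question confidences and correctness flags from your eval run (this is an illustrative formula, not HELM's implementation):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy, per confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# A model that answers with 90% confidence but is right only 60% of the time:
print(expected_calibration_error([0.9] * 5, [1, 1, 1, 0, 0]))  # -> 0.3 (badly calibrated)
```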
5. Run HELM from the GitHub repo ([installation guide](https://crfm-helm.readthedocs.io/en/latest/installation/)); a quickstart sketch follows below.
    - ![](https://hackmd.io/_uploads/BJDsq0DTh.png)
6. What is the best way to choose the run parameters? Read the paper, check which parameters it uses, and look into prompt tuning.
7. Customize:
    - Pulling models that are not on Hugging Face: an adapter probably needs to be written.
    - Adding dataset sources: see `helm/benchmark/scenarios/mmlu_scenario.py` (a scenario sketch follows below).
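The quickstart promised in step 5, following the installation and tutorial pages linked on this page. Command names and flags have changed across HELM releases (for example, `--run-specs` was later renamed `--run-entries`), so treat this as a sketch and check the docs for your installed version:

```bash
# Install HELM (package name from the official installation guide).
pip install crfm-helm

# Evaluate one model on 10 MMLU instances (run spec from the CRFM HELM tutorial).
helm-run --run-specs mmlu:subject=philosophy,model=huggingface/gpt2 \
    --suite v1 --max-eval-instances 10

# Aggregate the raw run outputs into leaderboard-style tables.
helm-summarize --suite v1

# Browse the results locally in the HELM web UI.
helm-server
```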
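For the "add dataset sources" route, `mmlu_scenario.py` shows the pattern: subclass `Scenario` and return labelled `Instance` objects from `get_instances`. The sketch below imitates that pattern from memory; the exact import paths, field names, and `get_instances` signature vary between HELM versions, so verify against the repo (the dataset rows here are made up):

```python
# A rough sketch of a custom HELM scenario, modeled on mmlu_scenario.py.
# NOTE: the import path, field names, and get_instances() signature differ
# between HELM versions -- verify against your checkout of stanford-crfm/helm.
from typing import List

from helm.benchmark.scenarios.scenario import (
    CORRECT_TAG, TEST_SPLIT, Input, Instance, Output, Reference, Scenario,
)


class MyQAScenario(Scenario):
    """Wraps a hypothetical in-house QA dataset as a HELM scenario."""

    name = "my_qa"
    description = "Hypothetical internal question-answering dataset."
    tags = ["question_answering"]

    def get_instances(self, output_path: str) -> List[Instance]:
        # Real scenarios download and parse their dataset here; we inline two rows.
        rows = [
            ("What is the capital of France?", "Paris"),
            ("How many legs does a spider have?", "8"),
        ]
        return [
            Instance(
                input=Input(text=question),
                references=[Reference(Output(text=answer), tags=[CORRECT_TAG])],
                split=TEST_SPLIT,
            )
            for question, answer in rows
        ]
```

A new scenario also needs a matching run spec so `helm-run` can reference it; grep for `mmlu` in the repo to find the pieces to copy.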
8. After HELM, there is still some work to be done on your side:
    - ![](https://hackmd.io/_uploads/SyxKFydT2.png)
9. Reference: see the appendix below and the further-reading list at the end of this page.
10. Evaluation at scale: mind the cost of the experiments.
    - ![](https://hackmd.io/_uploads/HksW9kd6n.png)

---

## Appendix

- AWS basic knowledge.
- [HELM paper](https://arxiv.org/pdf/2211.09110.pdf)
- [A Survey on Evaluation of Large Language Models](https://arxiv.org/pdf/2307.03109.pdf): includes an exhaustive list as a mind map.
- GenAI at MIT channel: ![](https://hackmd.io/_uploads/H196fedp2.png)

---

11. Human in the loop (HITL): ![](https://hackmd.io/_uploads/H1RJ1yuph.png)
    - [Screenshot from "LLM Evaluation Basics: Datasets & Metrics"](https://www.youtube.com/watch?v=1jReCwBgM84)
12. Compare the SageMaker Python SDK and the boto3 SDK (a side-by-side sketch appears at the end of this page):
    - SageMaker Python SDK: better suited to data scientists.
    - boto3 SDK: metadata management; managing services.
    - You can definitely see some overlap between the two SDKs.
13. Skipped: (Rishi) experiment details.
    - Case 1: MMLU
        - ![](https://hackmd.io/_uploads/ryO3syup3.png)
        - ![](https://hackmd.io/_uploads/Bkgaoydph.png)
        - ![](https://hackmd.io/_uploads/Bk20iyOa3.png)
    - Case 2: RAFT
        - ![](https://hackmd.io/_uploads/Sk7OTy_ah.png)
14. Skipped: (Rishi) scenario taxonomy/classification.
    - ![](https://hackmd.io/_uploads/r1BEnydah.png)
    - ![](https://hackmd.io/_uploads/H1cY3ydah.png)
15. Skipped: (Rishi) metric selection.
    - ![](https://hackmd.io/_uploads/BJ1c61Oa2.png)
    - Sometimes a new use case is one that NLP researchers have not seen before, so the corresponding metrics have not been invented yet.
    - ![](https://hackmd.io/_uploads/HyP_Ckdp3.png)
16. Skipped: (Rishi) metric: calibration.
    - ![](https://hackmd.io/_uploads/Skee1xOa3.png)
17. Skipped: (Rishi) the whole story of HELM is scenarios × metrics; sometimes the limitations come from the datasets.
    - ![](https://hackmd.io/_uploads/Hy_EyxOa2.png)
18. Skipped: (Rishi) selected research topics:
    - ![](https://hackmd.io/_uploads/BkX6kgu6n.png)
19. Skipped: (Rishi) selected models:
    - ![](https://hackmd.io/_uploads/HJjC1ldTn.png)
    - The access level of each model (API / closed source).
20. Skipped: (Rishi) several pages of result tables.

---

## Further reading

URL list exported Saturday, August 26, 2023:

- [Evaluation: Bias & Toxicity](https://www.youtube.com/watch?v=Qm--_1M_Uvk)
- [Evaluation: Factuality and Hallucination](https://www.youtube.com/watch?v=OWD54-RVYjc)
- [Deep Dive into LLM Evaluation with Weights & Biases](https://www.youtube.com/watch?v=7EcznH0-of8)
- [Language Model Evaluation and Perplexity](https://www.youtube.com/watch?v=gHHy2w2agEo)
- [Evaluating LLM-based Applications // Josh Tobin // LLMs in Prod Conference Part 2](https://www.youtube.com/watch?v=r-HUnht-Gns)
- [LLM Evaluation Basics: Datasets & Metrics](https://www.youtube.com/watch?v=1jReCwBgM84)
- [Evaluation: LLM robustness and self-consistency](https://www.youtube.com/watch?v=XNlfvnEyB8Y)
- [Large Language Model Evaluation in 2023: 5 Methods](https://research.aimultiple.com/large-language-model-evaluation/)
- [The Importance of Evaluating Large Language Models (Minhajul Hoque)](https://medium.com/@minh.hoque/the-importance-of-evaluating-large-language-models-6eff1f68de5f)
- [How to Evaluate a Large Language Model (Analytics Vidhya)](https://www.analyticsvidhya.com/blog/2023/05/how-to-evaluate-a-large-language-model-llm/)
- [How to Evaluate, Compare, and Optimize LLM Systems (Weights & Biases)](https://wandb.ai/ayush-thakur/llm-eval-sweep/reports/How-to-Evaluate-Compare-and-Optimize-LLM-Systems--Vmlldzo0NzgyMTQz)
- [How to evaluate an LLM on your data?](https://radekosmulski.com/how-to-evaluate-an-llm-on-your-data/)
- [What is LLM Evaluation (Deepchecks)](https://deepchecks.com/glossary/llm-evaluation/)
- [stanford-crfm/helm: Holistic Evaluation of Language Models (HELM)](https://github.com/stanford-crfm/helm)
- [Fine-tuning with Instruction prompts, Model Evaluation Metrics, and Evaluation Benchmarks for LLMs (GoPenAI)](https://blog.gopenai.com/fine-tuning-with-instruction-prompts-model-evaluation-metrics-and-evaluation-benchmarks-for-llms-7dc49e8dade9)
- [Evaluation on LLMs (Sheldon L.)](https://medium.com/@sheldon88/evaluation-on-llms-4914215459d7)
- [Fine-Tuning and Evaluating Large Language Models (LLMs) (Tarapong Sreenuch)](https://sreent.medium.com/fine-tuning-and-evaluating-large-language-models-llms-f38f245f87f9)
- [Beware of Unreliable Data in Model Evaluation: A LLM Prompt Selection Case Study with Flan-T5 (Towards Data Science)](https://towardsdatascience.com/beware-of-unreliable-data-in-model-evaluation-a-llm-prompt-selection-case-study-with-flan-t5-88cfd469d058)
- [4 Crucial Factors for Evaluating Large Language Models in Industry Applications (Towards Data Science)](https://towardsdatascience.com/4-crucial-factors-for-evaluating-large-language-models-in-industry-applications-f0ec8f6d4e9e)
- [aws-samples/llm-evaluation-methodology](https://github.com/aws-samples/llm-evaluation-methodology/tree/main)
- [Generative AI Dive Deep (Highspot)](https://aws.highspot.com/items/648a3874a558e6026053cf10?lfrm=srp.1#66)
- [Generative AI customer-facing deck with industry slides (Highspot)](https://aws.highspot.com/items/632e4909931439bbe94f8408?lfrm=srp.4#9)
- [Holistic Evaluation of Language Models (HELM) leaderboard](https://crfm.stanford.edu/helm/latest/?scenarios=1)
- [helm/src/helm/benchmark/test_data_preprocessor.py](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/test_data_preprocessor.py)
- [Tutorial - CRFM HELM](https://crfm-helm.readthedocs.io/en/latest/tutorial/#using-helm-run)
- [HELM paper (arXiv:2211.09110)](https://arxiv.org/pdf/2211.09110.pdf)
- [淺談 LLM 應用開發 Roadmap (A brief look at the LLM application development roadmap)](https://gamma.app/public/-LLM-Roadmap-0pv5lh3kbogyoae?mode=doc)
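One last appendix item: a side-by-side sketch of the SDK overlap mentioned in item 12 above, invoking an already-deployed SageMaker endpoint both ways. The endpoint name is hypothetical, and the JSON payload shape depends on the model container you deployed:

```python
import json

# --- boto3: low-level, mirrors the AWS service APIs; good for managing ---
# --- services and metadata (list/describe endpoints, etc.)             ---
import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="my-llm-endpoint",  # hypothetical, already-deployed endpoint
    ContentType="application/json",
    Body=json.dumps({"inputs": "Explain Python to me like I am 5 years old."}),
)
print(response["Body"].read().decode("utf-8"))

# Management-plane call via boto3: metadata about all endpoints in the account.
print(boto3.client("sagemaker").list_endpoints()["Endpoints"])

# --- SageMaker Python SDK: higher-level objects; friendlier for data scientists ---
from sagemaker.deserializers import JSONDeserializer
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

predictor = Predictor(
    endpoint_name="my-llm-endpoint",  # the same hypothetical endpoint
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
print(predictor.predict({"inputs": "Explain Python to me like I am 5 years old."}))
```

Both snippets hit the same `InvokeEndpoint` API underneath, which is where the overlap comes from; a common split is the SageMaker SDK in notebooks and boto3 in application or infrastructure code.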