# GTC2025 - Insights and Lessons Learned From Building LLM-Powered Applications
Tanay Varshney (Host) 00:00 - 00:32
Dustin. Hi, thank you all for making it. I know the keynote just ended and I'm sure you all rushed here. So without further ado, since I know everyone wants to talk to Chip and I'm probably going to be a roadblock in your way, let me just open this up. Chip, Eugene, what is the most interesting project you've been working on? Tell us, we want to know more.
Chip Huyen 00:32 - 01:00
Hey everyone, my name is Chip. I'm very glad to be here. I think all the problems I'm working on are interesting, but one of the things that really makes me think a lot is benchmark design. I do think that evaluation is very important; it's the number one bottleneck for AI adoption. So I'm working on two benchmarks: one is a reasoning benchmark and the other is a benchmark for creative writing.
Eugene Yan 01:02 - 01:52
Hi everyone, I'm Eugene and I sell books at a bookstore, like literally Amazon. The hardest, most interesting problem I'm working on right now, the one that's really top of mind, is dealing with documents with very, very long context. I think the reason it's so challenging is that some of the previous paradigms may no longer be relevant. Previously, when we were working with LLMs that had context lengths of 4K or 8K or 32K, you had to use RAG. But now it's 128K, you know, with Gemini having 1 or 2 million. It becomes more challenging to figure out whether what we're building now may be obsolete in 3 to 6 months' time. So navigating that, figuring out where the puck is heading and how to skate towards the puck, has been quite different, and there's a lot we need to learn right now.
Tanay Varshney (Host) 01:53 - 02:04
Awesome. So where are you in these challenges? Where are you shedding your blood, sweat, and tears? What's the pain point you're facing?
Chip Huyen 02:04 - 02:08
How much time do we have?
Tanay Varshney (Host) 02:08 - 02:12
Take as much time as you want. I'm pretty sure everyone wants to know.
Chip Huyen 02:12 - 05:03
Yeah, so, okay. I'm working on several projects. For example, I'm working with a company that is using AI agents in virtual worlds. Say you're playing a video game, right? You can tell a character, an NPC, "Hey, pick up the axe." Even though that sounds simple, the character has to understand: okay, where is the axe? What is the axe? It has to pick it up and then return it to the player. It's a multi-step sequence. A lot of multi-step sequences nowadays can be done by reasoning models; the model can take a long time to run and it works. But that doesn't quite work when you're playing video games, because you don't want super-laggy gameplay. We're actually talking about it tomorrow at GDC in San Francisco, if anyone is interested. So yeah, latency is a huge problem there. I'm also working with an author on using AI to help her write a novel. That's a very long context, and it's quite painful.
And we found out, for example, that it's really bad at creative writing, but it's pretty good at brainstorming. So you build a character, and you can ask, "Hey, what would this character do in this situation?" It's pretty good at that, but it has to understand a very long context, like what you said. And we found out that after about 10,000 words, models go crazy. They don't really understand; they keep messing up the plot and the timeline. The model doesn't know that, oh, this thing happened in this year and not the other year. Claude, for example, worked a lot better with that. So I do think being able to handle long context is very important.
Yeah, and I think we've also found issues with prompt engineering. For example, I also work with another company that is building agents to help hotels with customer support. Trying to get an agent to be consistent, with the tone first of all: okay, you're a concierge for a high-end hotel, right? You should talk like a high-end concierge and not just randomly. Keeping it consistent is really, really hard. And of course there's multimodal output. For example, you ask, "Hey, show me the inside of the room." You don't just describe the room; you have to fetch the right picture for it.
And it's a user-experience question: how do you do that? It's really hard. And then intent classification. If a guest asks, "Hey, how about breakfast?", the AI may say, "Hey, we found good breakfast places near this hotel," but that's not what the user was asking. The user wants to know about the breakfast in the hotel. So yeah, there are a lot of different issues depending on the project you're working on.
Tanay Varshney (Host) 05:03 - 05:08
Sounds like there's a lot to do before you can ship something.
Eugene Yan 05:08 - 06:30
Yeah. So Chip mentioned a lot of technical challenges; maybe I'll touch on the non-technical challenges, which I've actually found to be harder. So for example, imagine you're trying to summarize a very long-context document, maybe a movie. You try to get different people to write the same summarization, and you try to get different people to evaluate the same summarization. You might find that these people cannot agree. Now, if humans, extremely high-judgment humans, can't even agree on which one is better... Like, what is more relevant? The fact that you covered the climax, or the fact that it's a happy ending, things like that.
Or whether or not the dog died. Humans can't even agree. That makes it so much harder to build evals for it, even if your eval is powered by an LLM, an LLM-based eval. That's the hardest thing: getting everyone to figure out what the right criteria are to evaluate whether it's successful. Once you have the right criteria, it's really just: do whatever you can, prompt engineering, RAG, reasoning models, to try to meet those criteria and that level of quality. Now, there's something Chip said that was quite interesting. I want to ask you a question about that.
Prompt engineering. Do we still need to do it with reasoning models? What's your take on that?
Chip Huyen 06:31 - 07:13
Yeah. Actually, I've spent a lot of time trying to prompt reasoning models. The reason is that I work a lot with agentic use cases, where you first need to specify the list of actions the model has access to. For example, we're working on a trading task, right? You might have actions like sell, buy, or trade, where trade means you can swap two items, one item for another. And as you go through the process, you realize that different models have different quirks. So, actually, is anyone here using function calling?
Eugene Yan 07:13 - 07:20
Who here is using reasoning models? Who has tried them? Yeah, so a lot of you.
Chip Huyen 07:20 - 08:56
Yeah. So when you ask it to output the actions, it might produce something weird. First of all, it has the function and the argument names, but sometimes some of them will be missing a quotation mark, or some will use a single quotation mark instead of a double one. You just need to look at a lot of the output to understand what it looks like. Or sometimes it just doesn't get the tags out correctly. Or sometimes you just need to nudge it; I think incorporating human domain expertise there is very important. For example, if you give it a task that's impossible, ideally the model should understand that it's not possible and stop wasting resources on it.
But you need to nudge it: hey, if the task has a certain characteristic, then maybe you should understand there's a chance it won't work out. Another thing I think is quite important: a lot of the time, you can get the model to output an action, say a SQL query, and you query a database and get a value back, but then the model doesn't understand what that value means. For example, a query like "Hey, what is the average price for the product?" returns five. Is that $5, or $5,000? What does that mean? So I would have to go back to the prompt and say: okay, if the value is within this range, this is what it means.
So a lot of the time it's just continually giving it a lot of knowledge so it can do better at planning.
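(A minimal sketch of the defensive parsing Chip describes, assuming the model emits JSON-style tool calls; the function, the tag name, and the test string are illustrative, not any particular framework's API.)

```python
import json
import re

def parse_tool_call(raw: str):
    """Best-effort extraction of a {"name": ..., "arguments": ...} tool call."""
    # Strip tag wrappers the model may fail to open or close correctly.
    raw = re.sub(r"</?tool_call>", "", raw).strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Common quirk: single quotes where JSON requires double quotes.
    try:
        return json.loads(raw.replace("'", '"'))
    except json.JSONDecodeError:
        return None  # Log the raw output, then retry or fall back.

call = parse_tool_call("{'name': 'pick_up', 'arguments': {'item': 'axe'}}")
assert call == {"name": "pick_up", "arguments": {"item": "axe"}}
```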
Eugene Yan 08:56 - 10:15
Yeah, I can chime in on that. I want to share my experience: I started using reasoning models, calling the APIs, trying to migrate to them. One thing I found quite interesting is that the prompts I was using no longer work. Things like few-shot prompts or chain of thought, a lot of those are actually detrimental when you're using a reasoning model. I don't fully know why. I think maybe the few-shot examples interfere with letting the model reason by itself, along with very structured instructions like "first do this, then do that, then do this, then do that." A lot of times I find that the things we had written before, for, say, Claude 3.5, are now actually bad for 3.7 with extended thinking. So that's one way the paradigm may have changed. I'm trying to learn how to step away from being hands-on and providing so much guidance to these models, and to just let them reason on their own.
Honestly, I don't quite fully understand it yet. That's why I was asking Chip how she deals with it. And if any of you have experience with migrating prompts from non-reasoning models to reasoning models, please reach out. I would love to chat with you about that.
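(For illustration, a sketch of the prompt migration Eugene describes: dropping few-shot examples and step-by-step scaffolding when moving to a reasoning model. Both prompts are made up.)

```python
# Before: a prompt written for a non-reasoning model, with few-shot
# examples and rigid step-by-step scaffolding.
NON_REASONING_PROMPT = """Summarize the document step by step.
First, list the main characters. Then, outline the plot.
Finally, write a 3-sentence summary.

Example 1: <document>...</document> -> <summary>...</summary>
Example 2: <document>...</document> -> <summary>...</summary>

Document: {document}"""

# After: for a reasoning model, drop the examples and the scaffolding;
# state the goal and constraints, and let the model plan its own steps.
REASONING_PROMPT = """Summarize the document below in 3 sentences,
covering the main characters and the plot.

Document: {document}"""
```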
Tanay Varshney (Host) 10:15 - 10:22
Have you tried fine-tuning or reward modeling the reasoning models yet? Or are we still at the prompt-tuning stage?
Eugene Yan 10:25 - 12:44
I can go first, and this is maybe a bit of an unpopular opinion. I think that for 90% of use cases, you should just be calling an API. The reason I say this is that fine-tuning (and I've done fine-tuning) is extremely expensive. It's not expensive in terms of compute; it's extremely expensive in terms of collecting the data, making sure that data is high quality, and running evals while you're fine-tuning. You have to put in a lot of effort, and that takes a team of, I don't know, two or three people. Now imagine you take the compensation of that two-to-three-person team and instead have a one-person team just experimenting with LLM APIs. And the thing is, the LLM APIs just keep getting better. Today is the worst they will ever be.
They will get better at instruction following. They will get better at longer context. They will get better at reasoning. But the model that you fine-tuned, that you sunk this investment into, does not get better on its own, unless you build a fine-tuning pipeline to continuously sample live production data and figure out what's good and what's not. So I think 90% of use cases probably shouldn't do fine-tuning. Now, when does fine-tuning matter? Fine-tuning matters when you have extremely large scale, or you need to work on data that's out of domain. So for example, imagine you're using an LLM as a classifier, right?
Let's say GPT-4o mini or even Claude 3.7, and you may be returning only a single token: either one or zero. Is this fraud? Is this not fraud? Is this a good job recommendation or a bad one? Just one or zero. And then you just take the logprobs. Now, when you try to do that across millions of queries, even if you're not thinking about latency, it suddenly becomes very unfeasible, right?
Firstly, you may not be able to get the instances to finish in time. Secondly, the cost might be prohibitive. In that case, it's better to just collect a couple thousand samples, fine-tune and distill a very small model that fits on a 16- or 24-gigabyte GPU, and then scale that horizontally. You get better performance out of that, and that single model, that single artifact, never changes. It meets your needs, right? It's very cheap, and it's perfect for what you need.
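(A sketch of the single-token classifier pattern Eugene describes, assuming an OpenAI-style client; the model name and prompt are placeholders. At millions of queries, this is the call you would replace with a distilled small model.)

```python
import math
from openai import OpenAI

client = OpenAI()

def classify_fraud(text: str) -> float:
    """Return P(fraud) from the logprobs of a one-token '1'/'0' answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Is this transaction fraud? Answer 1 or 0 only.\n\n{text}",
        }],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = response.choices[0].logprobs.content[0].top_logprobs
    probs = {t.token: math.exp(t.logprob) for t in top}
    # Normalize over the two labels we care about.
    p1, p0 = probs.get("1", 0.0), probs.get("0", 0.0)
    return p1 / (p1 + p0) if (p1 + p0) > 0 else 0.5
```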
Chip Huyen 12:44 - 12:50
Yeah, I think Eugene brought up a very interesting point about...
Eugene Yan 12:50 - 12:51
Do you agree, Chip? Or disagree?
Chip Huyen 12:51 - 13:22
No, no, no. About transferring prompts to different models: I have never worked on a project where I could use the same prompt across different models. And it makes evaluation really, really hard, because I've noticed that a lot of people get used to working with one model and then don't spend enough time tweaking the prompt for the other model. And they very quickly conclude that, oh, the model I'm used to is better than the other one. It does take a lot of time just to get started.
Eugene Yan 13:22 - 13:34
But you can shortcut that if you have evals: just migrate over and check with your evals. Then you don't even have to do a vibe check; you don't have to look at the data manually. Yeah.
Tanay Varshney (Host) 13:34 - 13:54
I guess let's start there, right? Someone has an idea: okay, I have a problem, and now I want to try to solve it with LLMs. Let's say that's where they've arrived. How do you go from an idea to a POC? Is your first instinct to build evaluation sets? Or is it, okay, let me do some system architecture and smash this out?
Chip Huyen 13:57 - 15:55
So the first thing I do is just vibe check it. I do think that having a good idea is really underrated. A lot of times you think an idea is good, and then you start playing around with it and realize, oh no, this is bad. So I'm just curious: who here has a good idea for an AI application? Can we get them a mic? Yeah, okay. So I do think that for an AI application, you do need a good idea. And a lot of the time, between hobbyists and enterprises, I think there's a difference.
For example, one mistake I see a lot of companies making is that they crowdsource the ideas for their AI applications. A lot of the time, leadership says, okay, we need to invest in AI, but leadership doesn't quite understand what exactly GenAI can do, so it's like, let's ask the smart people on the team. So the ideas get crowdsourced, and we end up with thousands of Slack bots and a thousand note-taking apps, a lot of the classic apps. And then people ask, okay, so what is the return on investment for those apps? And of course it's not a lot, because a lot of the time, engineers can be very smart, right?
But a lot of the time, people are only exposed to a very small problem. So I do think it matters to have a strategy in the company, thinking deeply about which use cases can be the most important for the company. At the other end of the spectrum, we have hobbyists who build something like, okay, let me think of some small app that I would use, and then turn those small ideas into apps. I think that's wonderful. I do think it's a lot of fun to build small apps like that, but... sorry.
Tanay Varshney (Host) 15:55 - 16:08
No, no, no. I just wanted to ask, from an enterprise perspective, right: what advice would you give for identifying problems that you can solve with GenAI?
Eugene Yan 16:08 - 18:07
Yeah, I can start with that. I think what Chip mentioned, understanding the business, is really important. First, work backwards from the customer: what's the pain point the customer has? Now, for what Tanay mentioned, how do you define problems that GenAI can solve? Well, I take a very simple approach. I just take the problem. So imagine, maybe I'm just making up a problem right now: summarizing movies. Let's say we're going to summarize movies and then make translations of them, right?
Translate them to Mandarin or Spanish or German. I would just take that movie, or just take some YouTube video clip, drop it into whatever (maybe Gemini can do that right now), and just test if it can summarize it, or translate it and create a translated transcript. And I'll just vibe check it: completely unsystematic, but extremely cheap, just playing in a prompt playground. Just try it first. You'll be surprised; there are still a lot of things you may be trying to do that may not be possible. Let me give you an example. Maybe a customer is browsing your website. They have a customer journey.
They are looking at this webpage, that webpage, this webpage, that webpage, and they are lost. They're confused. You take this entire session of webpages, you paste it into Claude or OpenAI or whatever model, and you ask: what is the next best webpage for this user? Or where is the user getting lost? With this extremely long context, and because it's so incoherent, the LLM won't be able to return a very good recommendation. So personally, I would take the cheapest way, which is to paste it into whatever internally deployed LLM your enterprise has, one you know you can paste private and confidential info into. Just try that. Just experiment with it. Just get a sense of whether it's possible for the LLM to do it.
After you've done that, then let's look at more data. Let's collect evals to do a more systematic check. That's what I would do.
Chip Huyen 18:07 - 19:48
That's a good answer. Yeah, I want to jump on the eval point with Eugene: it's actually really hard to build a reliable eval system. For example, a lot of the time I see some results and it's like, okay, how reliable is this result? Say this application got 70% accuracy; how reliable is that? And a lot of people don't understand some simple concepts like bootstrapping or error bars. For example, I would ask them how big the eval set is, and they'd say, okay, maybe 20 samples, 50 samples. Right.
And I'd ask: what is the chance that model A is better than model B just by chance? It's really hard for people to answer that question; that's basically a lot of statistics. So I do think you need to first create a benchmark that's representative of your use case. Second, it has to be, how to say, persistent: it shouldn't get saturated very quickly. A lot of people create a benchmark and then forget about it, and then a new model comes out and they keep using the benchmark without realizing that the new model has completely saturated it. And you need to understand the new kinds of mistakes and errors a model makes, so that you can create examples that represent each new failure mode. So I do think it's really, really, really hard to do.
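(A minimal bootstrap sketch of the question Chip raises: with a 50-sample eval set, how often would model A's lead over model B appear by chance? The scores below are made up.)

```python
import random

random.seed(0)
n = 50
scores_a = [1] * 38 + [0] * 12  # model A: 76% accuracy on 50 samples
scores_b = [1] * 35 + [0] * 15  # model B: 70% accuracy on 50 samples

# Paired bootstrap: resample the same indices for both models.
diffs = []
for _ in range(10_000):
    idx = [random.randrange(n) for _ in range(n)]
    diffs.append(
        sum(scores_a[i] for i in idx) / n - sum(scores_b[i] for i in idx) / n
    )

diffs.sort()
lo, hi = diffs[250], diffs[9750]  # roughly a 95% confidence interval
print(f"A - B accuracy gap: 95% CI [{lo:.2f}, {hi:.2f}]")
# If the interval spans 0, A's lead may just be noise at this sample size.
```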
Tanay Varshney (Host) 19:48 - 20:32
So you brought up an interesting point: you need data that is representative of your use case. Oftentimes people have multiple use cases, right? They just want to, as you said, vibe check which kinds of LLMs work best. So have you thought about building a proxy benchmark, where you take a bunch of your use cases, or maybe pick a representative example that covers multiple use cases, and then benchmark against those? Or do you prefer laying out all the diverse use cases directly in the benchmark? Is there a meta-evaluation in your mind?
Eugene Yan 20:32 - 21:35
Personally, my work is very customer-facing and very product-driven. So every use case is a special snowflake, as your PMs would like you to believe, and I think to a certain extent that's true. For example, summarizing a movie from the entire transcript is a little bit different from summarizing an actor's bio, and a movie synopsis is different from the one-liner that Prime Video or Netflix shows you to get you to click and watch the movie. Different aspects may become more important: relevancy, faithfulness, comprehensiveness, pithiness, whether it gets you to click on it or not. And all of it is different. Yes, there are meta benchmarks, like meta summarization benchmarks, but I found that they don't transfer very well to my own use cases. So unfortunately I haven't been able to get that to succeed for me.
Chip Huyen 21:35 - 22:55
Yeah. On grouping use cases: something I've found is that some use cases are orthogonal to each other, meaning one model can be very good at one subset of use cases but really bad at another. Two use cases I've found to be in tension are retrieval and reasoning. When DeepSeek came out, you saw that it does really well on a bunch of math and coding, which is reasoning. But then we tested DeepSeek R1 on my retrieval benchmark, and it did really, really poorly. It hallucinated a lot, and it couldn't fetch the correct information. I think the reason is that reasoning models need a lot of output tokens, so they're output-heavy.
Whereas retrieval is context-heavy, input-heavy. I've seen cases of this; I won't say exactly which company, but you can guess it's one of those big large-language-model companies. They told me they had to deploy a model even though it performed less well on all the math and coding benchmarks, because it did much better on retrieval benchmarks, and RAG right now is still one of the most popular patterns. So I think maybe because of this orthogonality, it's really hard to get one meta score that represents all use cases.
Tanay Varshney (Host) 22:56 - 23:12
So let's switch gears. Let's say we have a proof of concept built, right? Now, how do we go from a proof of concept to actually shipping? Do you have a playbook for it? What's your first instinct?
Eugene Yan 23:12 - 24:56
Yeah, I can go first. I would say the biggest challenge when shipping, at least the biggest roadblock that comes to mind, is latency, throughput, and cost. Unfortunately, for latency, you get what Jensen Huang gives you; you can't beat that. For throughput, you get as many machines as Jensen Huang is willing to sell you, so you can't get away from that either. Cost, fortunately, gets cheaper by an order of magnitude every year, so I think that's more viable. So I would try to see: okay, we have this, let's run a benchmark with 10,000 samples, a million samples.
Let's estimate how long it'll take and how much it'll cost. Do we have the machines to support it? If we can support it via some AWS Bedrock API, let's just do that. If we can't, we have to figure out how to distill a smaller model and deploy it on cheap instances, like those 16-gigabyte GPUs. That's the way to scale it. That's top of mind right now, at least for me. Oh, the other thing that's top of mind when deploying is this thing called the "play with" session. How many people here know what a "play with" session is? Okay, a "play with" session is when you build an app and you put it in front of your senior leaders, and they try to play with it and try to break it.
You never know what's going to happen there, so that's quite scary. And you're going to find all the edge cases. That's where you have to react fast to address their concerns and fix things. Those are the two things I can think of right now. Yeah.
Chip Huyen 24:56 - 27:33
So I think one issue with operations is that scaling an AI application is not that different from scaling a lot of software applications; you have to think about a lot of the same issues, like scaling and latency. But I think another issue with AI applications is incorporating human feedback, user feedback, into how you iterate on the application. We all know user feedback is important, right? Traditionally we have explicit feedback, like thumbs up and thumbs down, and implicit feedback, like whether a person clicks on the recommendations. But with a lot of conversational AI applications, the feedback is embedded in the conversation. For example, you ask the model, "Hey, summarize this," it gives you a summarization, and you say, "No, no, no, make it shorter." Right?
So "make it shorter" here is feedback, and you need to incorporate that. You need to track a lot of those conversations and say, okay, a lot of users prefer shorter, right? Sometimes people rephrase their requests. A lot of companies I work with spend a lot of time digging into user feedback, and it's tricky. First, privacy is a big one. Second, sometimes a user just clicks "report," but you don't get the full signal from that; you need to look back at what happened, say, ten turns ago. So you need to get users to opt in; some of you may be familiar with the term "data donation flow." If you report something, some apps, like Google's, will ask, "Hey, are you okay with sharing your data from this session so we can look into it?" So you actually need to ask for explicit permission from users so that you can look at what happened before.
And that's something I'm really upset with ChatGPT about, because by default they can look at your data. Are you familiar with the ChatGPT data-controls issue? Yeah. By default, ChatGPT can use the data, the content, the prompts, even uploaded data, to train the model. And you have to go in and uncheck that box. So I'm a bit upset about it. Sorry, it's just crazy to me that companies do this and get away with it.
Anyway.
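(A sketch of mining the implicit conversational feedback Chip describes; the signal patterns and the rephrase heuristic are illustrative only.)

```python
import re

# Signal patterns are illustrative, not exhaustive.
FEEDBACK_PATTERNS = {
    "wants_shorter": re.compile(r"\b(shorter|too long|condense)\b", re.I),
    "wants_retry": re.compile(r"\b(no[,.]? |that's not|try again|wrong)\b", re.I),
}

def extract_feedback(user_turns):
    """Tag consecutive user turns with implicit feedback signals."""
    signals = []
    for prev, curr in zip(user_turns, user_turns[1:]):
        for name, pattern in FEEDBACK_PATTERNS.items():
            if pattern.search(curr):
                signals.append(name)
        # Crude rephrase check: high word overlap with the previous turn.
        a, b = set(prev.lower().split()), set(curr.lower().split())
        if a and len(a & b) / len(a | b) > 0.6:
            signals.append("rephrased")
    return signals

print(extract_feedback(["Summarize this doc", "No, make it shorter"]))
# ['wants_shorter', 'wants_retry']
```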
Tanay Varshney (Host) 27:33 - 27:45
So let's open the floor a bit. Let's pass around the mic. Any questions? Okay, we'll start from the back, I guess.
Audience 1 27:48 - 28:07
My name is Vijay. One question I have is: when you put an LLM-powered application in front of end users, how do you measure success metrics? What kinds of metrics do you build on top of your application? Do you use any frameworks for that?
Eugene Yan 28:07 - 28:12
May I clarify: is it internal end users or external end users? It could be either.
Audience 1 28:12 - 28:17
Both.
Eugene Yan 28:17 - 30:09
Yeah, I can go first. Oh, Chip, did you want to go first? Okay, I'll go first. If it's internal end users, we often think about measuring the time saved. And you know, it's not going to be as precise as having a time tracker: how long were they taking previously? A lot of it is self-reported. So for example, let's say you work at Netflix and you have someone writing movie synopses or movie summarizations. Previously they used to hand-write all of these, right? Maybe they take 30 minutes to an hour.
It's actually really hard to write a movie synopsis. So now you just drop the whole thing into Gemini or whatever; the LLM handles the multimodality, it generates an initial draft, and you just tweak it. So it goes from an hour to, I don't know, five minutes. It's a 10x time saving. So that's the main thing: saving time. If it's internal, it's almost always about reducing costs. Now, if it's external: for example, Amazon has this feature I love called review summaries.
Nowadays, instead of browsing all the reviews yourself, you can just look at a summary of the reviews. And it actually has those little green and orange and red check marks to tell you what's good and what's bad. And people can actually provide feedback on that. That's how I look at it: people will say, "I don't know how I lived without this review summary." It's indispensable now. That's how we get the feedback that this is something customers love. So, to answer your question, it's really hard, and I'm still trying to figure it out.
For the external case, figuring out whether customers love it or not is hard, especially if they can't give an explicit thumbs up or thumbs down. You could try to measure it in terms of clicks or conversions; I work in e-commerce, so that's how I think about it. But I think a lot of the time, the anecdotal evidence is actually quite strong.
Chip Huyen 30:09 - 31:11
So I work with a lot of startups, and you just look at whether AI adoption helps them reach their goal: how many people are willing to pay for subscriptions, or some specific metric. For example, one company is using an AI customer-support chatbot for booking, not just hotels but other services too, and their metric is what percentage of conversations end in an actual booking; they compare that to the human agents' closing rate. Other things, like Eugene mentioned, are productivity boosts. For example, if it used to take people three minutes to write a BI query, how much faster can they do it now? And engagement as a metric can be quite tricky, because a lot of... yeah, you want to jump in?
Eugene Yan 31:11 - 31:45
I fully agree. I think a lot of times AI can return very clickbaity stuff that gets people to do something, but then they regret doing it. So the short-term metric says it's great, but the long-term metric is bad. For example, just a random example, imagine you're using AI to summarize a product description. It's amazing: conversions go up, but returns also go up. And you know returns are really expensive, right? So measuring that attribution flow is extremely tricky. That's one of the challenges I deal with.
Chip Huyen 31:45 - 32:18
Yeah. So I think engagement is important but very tricky to interpret. Another thing that's also tricky is what you mentioned earlier about human preference: when you show users two different responses from models and let them pick, there are a lot of biases in that selection as well. For example, does anyone here know about the Chatbot Arena? LMSYS? So some of you know about it. Do you know how to game the LMSYS Chatbot Arena?
Eugene Yan 32:18 - 32:18
Bullet points?
Chip Huyen 32:18 - 32:59
Adjectives? So yeah. One thing people found out is that the formatting, how you structure the text and so on, can really influence how users perceive the quality of the answer. Another thing is that users really hate it when the model refuses. People who try to game LMSYS, the model developers, understand that if you ask a model to do something and the model says, "Sorry, I'm not allowed to do this because it violates our policy," then users are likely to downvote that. But as an app developer, you sometimes want the model to refuse. So yes, there's a mismatch, a lot of biases in evaluation.
Tanay Varshney (Host) 32:59 - 33:05
The gentleman here.
Audience 2 33:05 - 34:20
So my question is more around strategy: how do you create internal benchmarks? I find that public benchmarks tend to be completely misleading for any real-world application, and more specifically for applications where LLM-as-a-judge doesn't work. For example, I've been working quite a bit with agentic applications where you need to do planning before solving a problem. The planning is a conceptual task, but then it's translated into actions and so on. And then you basically need to score this plan, but LLM-as-a-judge wouldn't work, because otherwise your planner would already be perfect. So I'm struggling to understand how to build a benchmark for that. The engineers don't have time to annotate data, and 10 data points might not be enough to score such a complex application. So I'm curious: how do you strategize about creating these internal benchmarks, iterating fast, and testing your applications in these kinds of scenarios?
Chip Huyen 34:23 - 36:10
That's a heavy question; I get this question a lot. I do think building an internal benchmark is hard, and that's a good thing. The reason is, okay, we already have AI, right? If AI were so easy to evaluate and so easy to use, then what would we need humans for? So yes, I do think building an internal benchmark is hard, because you really need to understand your use case and your users. And what you said about planning being a conceptual task, I actually disagree with. I think planning is actually pretty straightforward to evaluate. Fundamentally, planning is a search problem, right? The idea is that you're given a task and you have to achieve a goal.
And planning, conceptually, is searching over all the different paths and finding the most optimal one, or determining that no path is available. So planning actually has quite clear, easily verifiable failure modes. For example: does it call valid actions? If you give the model a set of three actions and it comes up with a fourth action, then the plan is invalid, right? Planning also has to keep track of state: actions have to be not just valid but also legally allowed. For example, we were working on a bot that does trading, and its plan was: okay, first you're going to buy this, then sell this, then buy this. But then it tries to buy something that costs $10 when it only has $8. If you can enumerate the state, you can actually verify whether the plan is valid or not. So I do think it takes a lot of benchmarking, but you just need to understand where and how the models fail, and then build a lot of examples to catch and measure the failure rate for every single one of those failure modes.
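(A sketch of the verifiable plan checks Chip describes: valid action names plus a state and budget constraint. The action set, plan format, and prices are hypothetical.)

```python
VALID_ACTIONS = {"buy", "sell", "trade"}
PRICES = {"axe": 10.0, "rope": 3.0}

def validate_plan(plan, budget):
    """Return a list of failure modes; an empty list means the plan passes."""
    errors = []
    for i, step in enumerate(plan):
        action, item = step.get("action"), step.get("item")
        if action not in VALID_ACTIONS:
            # Invalid action: the model invented something outside the set.
            errors.append(f"step {i}: invalid action {action!r}")
            continue
        price = PRICES.get(item, 0.0)
        if action == "buy":
            if price > budget:
                # Valid but not legally allowed: can't spend $10 with $8.
                errors.append(f"step {i}: can't afford {item!r} (${price} > ${budget})")
            else:
                budget -= price
        elif action == "sell":
            budget += price
    return errors

print(validate_plan([{"action": "buy", "item": "axe"}], budget=8.0))
# ["step 0: can't afford 'axe' ($10.0 > $8.0)"]
```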
Tanay Varshney (Host) 36:10 - 36:31
Someone from the side; the gentleman on the edge. Yeah. Can you speak to some of the best practices or lessons learned for integrating LLMs with tools or APIs?
Eugene Yan 36:31 - 36:35
What kind of tools? You mean like tool calling?
Audience 3 36:35 - 36:39
Yeah, tool calling, like ReAct-style agents, or anything more advanced.
Chip Huyen 36:44 - 39:00
I feel like I've talked too much already, but I can give it a shot. So I think of tool calling as having two aspects. Tool calling is basically translating from natural language into an API call, right? And there are two issues with it. One is that natural language is ambiguous. For example, say the model has a function like retrieve_products that takes a number of products and a time period, and the user says, "Show me the average price of the best-selling products." And the model is like, okay, how many best-selling products are we talking about? Is it 10, 100? Do you want best-selling products from last week or last month? So it has to guess what the user is trying to do, given this natural-language ambiguity. The second issue is that APIs can have really, really, really freaking bad documentation. This is painful. A lot of the time humans write documentation for other humans to read, whereas AI has a different perspective.
So one thing is, you have to give the model very good documentation explaining what the tool does: not just the tool name, but what it's supposed to do, what parameters it takes and their types, and also its error codes. If the model calls a function and gets an error code like, I don't know, 429, what does that mean? Or even the value it gets back: what does that value mean? And if it gets this error, what should it do next to address it? So give it a lot of good documentation. That becomes tricky because the more tools you give the model, the longer the context is going to be, right? It gets really, really complex. So there's the issue of context length, and also of the set of tools you want to give the model. I've talked to a lot of companies, and I don't think any company has gotten their model to work really well with a set of more than five tools. A lot of them have benchmarks, and you can see model performance drop as you give it a larger set of tools. But I do think it's promising, because we benchmarked a bunch of reasoning models recently and found that model performance actually gets a lot better over time with more tools. So it's very promising.
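(A sketch of the richer tool documentation Chip argues for: not just the name and parameters, but value semantics and what to do on specific error codes. The tool is hypothetical; the shape follows the common JSON tool-schema convention.)

```python
RETRIEVE_PRODUCTS_TOOL = {
    "name": "retrieve_products",
    "description": (
        "Return best-selling products for a time window. "
        "Returned 'avg_price' is in USD (e.g. 5.0 means $5.00, not $5,000). "
        "On error 429 (rate limit), wait and retry once; on error 404, "
        "the time window has no data -- tell the user instead of retrying."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "limit": {
                "type": "integer",
                "description": "How many top products to return, e.g. 10 or 100.",
            },
            "window": {
                "type": "string",
                "enum": ["last_week", "last_month"],
                "description": "Time window; ask the user if ambiguous.",
            },
        },
        "required": ["limit", "window"],
    },
}
```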
Eugene Yan 39:00 - 40:19
Yeah, I agree with most of what Chip said. I want to reiterate her last point: models get really confused when you give them a lot of tools. I've found a hierarchical approach helpful when using tools: break the problem down. Is it a search problem? Is it a verification problem? Is it a linting problem? Then you go through that step by step. That said, with reasoning models we can let the model say, "Hey, I may be doing this right or wrong," and then re-reason. But the main tool I use is retrieval, search. I think a lot of what we do right now for search or retrieval is one-shot: we call it once, decide it's good enough, and move on. I strongly suspect we can get a lot of improvement if we adopt a paradigm where we let the model reason over the retrieved results. That's what's happening with deep research, right? The model looks at the results: okay, here's what else I need, here's what else I need. Giving the model the ability to write its own queries, or to figure out that this tool didn't work, try that other tool. I think that can be a bit more effective.
Of course, that only works if your use case is not too latency-sensitive.
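(A sketch of the retrieve-then-reason loop Eugene suggests; `search` and `llm` are stand-ins for whatever retrieval function and model client your stack provides.)

```python
def retrieve_with_reasoning(question, search, llm, max_rounds=3):
    """Let the model reason over retrieved results and rewrite its own query."""
    query, notes = question, []
    for _ in range(max_rounds):
        notes.extend(search(query))
        decision = llm(
            f"Question: {question}\nRetrieved so far: {notes}\n"
            "If you can answer, reply 'ANSWER: <answer>'. Otherwise reply "
            "'SEARCH: <a better query>'."
        )
        if decision.startswith("ANSWER:"):
            return decision.removeprefix("ANSWER:").strip()
        query = decision.removeprefix("SEARCH:").strip()
    return llm(f"Question: {question}\nAnswer as best you can from: {notes}")
```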
Tanay Varshney (Host) 40:19 - 40:29
Okay, I think we're out of time. So thank you all for attending. I'll hand it over. Thank you.