通用機器人的新時代:人形機器人的崛起 [S72543]
Deepak Pathak,CEO and Co-Founder, Skild AI
Tiffany Janzen,Software Developer and Founder, Tiffintech
Jim Fan,Principal Research Scientist / Senior Research Manager, NVIDIA
Aaron Saunders,CTO, Boston Dynamics
Pras Velagapudi,CTO, Agility Robotics
Bernt Bornich,CEO and Founder, 1X
機器人領域的 ChatGPT 已經到來。人形機器人將成為世界上迄今為止規模最大的通用機器人部署之一。聽聽世界上一些領先的人形機器人公司的傑出人物對人形機器人未來的展望,以及實現這一目標需要從基礎模型、模擬技術和計算開始。
將網路規模的基礎模型應用於實體機器人任務的挑戰
數據策略的重要性,以及每種策略如何處理現實世界和模擬數據
開發人員、研究人員和最終客戶應該關注的技術和行業里程碑
主題:機器人技術 - 人形機器人
行業細分:所有行業
技術等級:技術 - 初學者
目標對象:開發人員/工程師
所有目標受眾類型:企業主管、開發人員/工程師、研究:學術
3 月 20 日,星期四
上午 5:00 - 上午 6:00 中部標準時間
AI逐字稿
大家好,這群人看起來真是太棒了!正如剛才提到的,我的名字是蒂芙妮·詹森(Tiffany Janzen),今天我將擔任你們的主持人。先簡單介紹一下我自己,我是Tiffintech的軟體開發者(Software Developer)兼創始人。你們可能不知道,但我真的是一天一天倒數計時,期待著今天的這場討論。人形機器人(Humanoids)一直是個重要議題,最近我們看到了許多相關的進展。能與這個領域的一些領袖坐在一起,聽聽他們的看法,真的非常難得。這不僅讓我們了解當前的狀況,也能展望未來的方向。現在,讓我們從一輪自我介紹開始吧,我會先從伯恩特(Bernt)開始。
Hello, WOW this crowd looks amazing! As just mentioned, my name is Tiffany Janzen and I’m going to be your moderator today. A little bit about myself, I’m a Software Developer and the founder of Tiffintech, and I don’t know about you, but I have been counting down the days for this panel. Humanoid robots have been an important topic. We’ve seen a lot of advancements with them very recently, and being able to sit down with some of the leaders in this space and hear from them is truly incredible—not only to learn where we are now, but where we are headed. Let’s start with a round of introductions; I’ll start with Bernt.
當然好。我的名字是伯恩特·博尼克(Bernt Bornich),我是1X的創始人兼首席執行官(CEO)。我們的使命是通過這些安全且聰明的人形機器人(Androids),創造充裕的勞動力。我們深信,要真正實現智慧(Intelligence),這些機器人必須與我們共同生活並學習。這也是為什麼我們認為消費市場(Consumer Market)必須先行,只有這樣,它們才能體驗到人類生活中所有的細微差別。接著,智慧才能逐步應用到各個領域,提供有用的勞動,對吧?醫院(Hospitals)、老人照護(Elderly Care)、零售(Retail)、工廠(Factories)、物流(Logistics),還有整個社會。
Sure. My name is Bernt Bornich. I’m the founder and CEO of 1X. We’re on a mission to create an abundance of labor through these safe, intelligent androids. We really believe that to truly get to intelligence, these robots need to live and learn among us. That’s why we say the consumer market has to happen first, to really be able to experience all of the nuances that make up human life. Then intelligence can start to provide useful labor in all of the verticals down the line, right? Hospitals, elderly care, retail, factories, logistics, and society as a whole.
大家好!我的名字是迪帕克·帕塔克(Deepak Pathak),我是Skild AI的首席執行官(CEO)兼聯合創始人。我們在Skild AI所做的是為機器人打造一個通用大腦(General Brain)。我們的論點是,我們可以建立一個單一的共享模型,因為機器人技術(Robotics)本身是一個數據稀缺的領域。
Hi, I am Deepak Pathak and I am CEO and co-founder of Skild AI. At Skild, what we are doing is we are building a general brain for robotics. Our thesis is that we can have a single shared model because robotics is a field which is anyway scarce of data.
我們不妨利用任何平台、任何任務、任何場景中所有可用的資源。把它想像成一個大規模基礎模型(Large-Scale Foundation Model),你可以將它應用於任何機器人、任何硬體(Hardware)、任何任務(Task)、任何場景(Scenario)。
We might as well use everything that’s available from any platform, any task, any scenario. So think of it as a large-scale foundation model you can use for any robot, any hardware, any task, any scenario?
我的名字是普拉斯·維拉加普迪(Pras Velagapudi),我是Agility Robotics的首席技術官(CTO)。在Agility Robotics,我們的人形機器人Digit是專為工作設計的,我們正在將它應用於製造(Manufacturing)和物流(Logistics)領域。今天,我們認為要讓這項技術推廣出去並從中學習,最好的方法就是獲得真實的客戶和實際的部署,讓機器人真正投入工作。這正是我們一直專注的目標——讓機器人走進職場,參與勞動。
My name is Pras Velagapudi, the CTO at Agility Robotics. At Agility Robotics, our humanoid Digit is made for work, and we’re bringing it out to manufacturing and logistics use cases. Today, we feel that the best way to get the technology out there and learn from it is to be able to get real customers and real deployments doing work. And that’s what we’ve been focused on—getting a robot out there and in the workforce.
我的名字是亞倫·桑德斯(Aaron Saunders),我是Boston Dynamics的首席技術官(CTO)。在人形機器人(Humanoids)變得流行之前,我就一直在研究相關領域。長久以來,我們在Boston Dynamics的使命始終如一,那就是讓機器人成為現實(Make Robots Real)。我們已經出貨了數千台機器人,而人形機器人(Humanoid Robot)是我們最新的產品宣告。我們非常希望將產品推向市場,讓它能執行真正有用的工作。沒錯,就是那些能讓人們從枯燥且危險的工作中解放出來的工作。這是我們長期以來一直在努力的方向。我認為還有更多工作要做,但我們對未來的發展感到非常興奮。
Um, I’m Aaron Saunders, CTO of Boston Dynamics. I’ve been working on humanoids before they were cool. And at Boston Dynamics, you know, our mission has been the same for a long time. It’s to make robots real. We’ve shipped a couple thousand robots. The humanoid is the latest kind of announcement for us. And we’re really wanting to bring our product to market that can do really useful work, right? So to do work that removes people from their dull, dangerous things. And that’s the thing that we’ve been up to for a long time. And I think there’s more work to do yet, but we’re pretty excited about where we’re going.
大家好,我是吉姆·范(Jim Fan),我在NVIDIA擔任首席研究科學家(Principal Research Scientist)兼高級研究經理(Senior Research Manager),負責Groot項目。我們在NVIDIA的登月計劃(Moonshot Initiative)是打造人形機器人(Humanoid Robots)的基礎模型(Foundation Model),也就是機器人大腦(Robot Brain)。這個名為Groot的項目也代表了我們對下一代物理人工智慧(Physical AI)計算平台的策略。我們的使命還包括讓物理人工智慧普及化。事實上,就在昨天黃仁勳(Jensen)的主題演講中,我們宣布了Groot模型的開源,這是全球首個開放的人形機器人基礎模型。
Hi, everyone. I’m Jim Fan. I’m the Principal Research Scientist and Senior Research Manager at NVIDIA, working on Project Groot. This is NVIDIA’s moonshot initiative at building the foundation model, the robot brain for humanoid robots. And Groot also represents our strategy for the next-generation computing platform for physical AI. We are also on a mission to democratize physical AI. In fact, yesterday at Jensen’s keynote, we announced the open-sourcing of the Groot model, which is the world’s first open humanoid robot foundation model.
它只有20億個參數(Parameters)。
It’s only 2 billion parameters.
但它的表現遠超預期。你基本上可以說是把當今世界最先進的自主人形智慧(Autonomous Humanoid Intelligence)握在手中。我還想說,就像今天這場討論中的其他講者一樣,我在機器人學(Robotics)變得吸引人之前就開始研究它了。今天看到這裡座無虛席,我真的非常高興,因為它終於變得吸引人了。感謝大家,你們讓我今天過得很開心!
And it punches above its weight. You are basically holding the world’s state-of-the-art autonomous humanoid intelligence in the palm of your hand. And I would also like to say that, just like everyone else on this panel, I started working on robotics before it was sexy. And today I see the full house here. So I’m just really, really glad that it becomes sexy today. So thank you all so much. All of you made my day.
再次感謝大家來到這裡,我知道我們都很興奮。在這場討論之前,我們顯然有過一次電話會議。在那次對話中,我不記得具體是誰說的,但有人提到機器人技術(Robotics)是人工智慧(AI)最古老的應用。
Again, thank you all for being here. I know we’re all excited. We had a call before our panel, obviously, and in that conversation, I can’t recall who exactly it was, but someone shared that robotics is the oldest application of AI.
從歷史上看,它的進展一直是最慢的,但現在情況已經不同了。
And historically, it’s moved the slowest, not the case anymore.
這是因為像大型基礎模型(Large Foundation Models)——例如ChatGPT這樣的時刻——改變了一切。現在我們擁有了能夠進行推理(Reasoning)的模型。我們還有能夠理解計算(Computation)的多模態模型(Multimodal Models),對吧?它們具備開放詞彙(Open Vocabulary)和對三維視覺世界(3D Visual World)的理解能力,遠比過去要好得多。這些是解決機器人技術問題的必要但不充分條件。你得先解決視覺問題(Vision),在談論通用機器人(General Purpose Robot)之前,你需要一個非常出色的視覺系統。所以我認為,當其他模型變得越來越優秀時,我們終於可以開始系統性地應對機器人技術的挑戰。這是第一點。第二點是數據(Data)方面的變化。不像語言模型(Language Models),我這裡引用一位早期來源的話,他說:「網路是人工智慧的化石燃料(Fossil Fuel)。」對語言模型來說沒錯,但機器人技術甚至連這種「化石燃料」都沒有,至少對大型模型來說是這樣。你可以從網路上抓取文字,從維基百科(Wikipedia)下載資料,但我們從哪裡抓取馬達控制(Motor Control)數據?從哪裡獲取機器人軌跡(Robot Trajectory)這樣的資料?
Because of large foundation models, like LLMs and the ChatGPT moment. Now we have models that can do reasoning. And we have multimodal models that understand computation, right? Open vocabulary, understanding of the 3D visual world, much much better than what we had before. And these are necessary but insufficient conditions to solve robotics. Like you’ve got to solve vision—to have a really good vision system—before you even talk about having a general-purpose robot. So I think as the rest of these models become really good, we can start to tackle robotics a lot more systematically. So that’s number one. And number two, what has changed is on the data side. So, you know, unlike with LLMs—and I am quoting an earlier source here, he says the internet is the fossil fuel, right, for AI—well, robotics doesn’t even have that fossil fuel, at least not the kind LLMs have. You can download text. You can scrape text from Wikipedia. Where do we scrape motor control? Where do we scrape all of those robot trajectories?
從網路上嗎?你根本找不到。所以我們必須自己生成數據(Generate Data),必須大規模收集數據(Collect Data)。我認為GPU加速模擬(GPU-Accelerated Simulation)的出現,讓這些問題變得更容易解決。因為現在你可以用大約三小時的計算時間,生成相當於十年的訓練數據(Training Data)。這真的幫助我們突破了數據困境(Data Paradox)。這是第二點。第三點是硬體(Hardware)方面的進展。我們這裡有一些最傑出的創始人,他們打造了我們見過最好的機器人硬體。我認為硬體不僅變得更好了,還便宜了很多。對吧?比如今年,我們看到的硬體價格大約在4萬美元(40K)左右,這不過是一輛車的價格。回想2001年,NASA打造了Robonaut,那是第一批重要的人形機器人之一……
From the internet? You can’t find it anywhere. So we’ve got to generate data, we’ve got to collect data at scale. And I think the advent of simulation—of GPU-accelerated simulation—really makes these problems more tractable. Because now you can generate like 10 years’ worth of training data in maybe 3 hours’ worth of compute time. So that really takes us beyond this data paradox. So that’s number two. And number three is on the hardware side. And we have some of the most extraordinary founders here building some of the best robot hardware we have ever seen. And I think hardware has become a lot better and also a lot cheaper, right? Like this year we have seen hardware in the range of maybe 40K. That’s the price of a car. Um, and back in 2001, NASA built Robonaut, one of the first major humanoids…
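As a rough aside on the "ten years of training data in about three hours of compute" figure above: the sketch below just works out the implied speed-up. The parallel-environment count and per-environment speed are illustrative assumptions, not numbers quoted by the panel.

```python
# Back-of-the-envelope check on "10 years of training data in ~3 hours of compute".
# The parallel-environment count and per-environment speed are illustrative
# assumptions, not figures quoted by the panel.

SECONDS_PER_YEAR = 365 * 24 * 3600            # ~3.15e7 seconds
target_experience_s = 10 * SECONDS_PER_YEAR   # ten years of simulated robot time
wall_clock_s = 3 * 3600                       # three hours of compute

required_speedup = target_experience_s / wall_clock_s
print(f"Aggregate speed-up over real time needed: {required_speedup:,.0f}x")

# One hypothetical way to get there: many simulated environments in parallel
# on a GPU, each running somewhat faster than real time.
parallel_envs = 4096
per_env_factor = required_speedup / parallel_envs
print(f"With {parallel_envs} parallel envs, each must run ~{per_env_factor:.1f}x real time")
```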
你知道,這麼多零散的元件(Components)已經被商品化(Commoditization)。所以我認為我們應該大力讚揚一些相關產業。對吧?消費電子(Consumer Electronics)發展出了電池(Batteries)和相機(Cameras),這些技術讓我們能夠感知世界(Perception)、觀察環境並進行計算(Computing)。回想10年、15年前,大多數機器人裡塞滿了電路板(PCBs)和電線,電池容量(Battery Capacity)卻非常小。現在情況完全改變了。我們可以放入大量的計算能力(Compute),還能安裝微型傳感器(Tiny Sensors),而且它們非常節能(Power Efficient)。所以我認為元件的商品化並不只是關於低成本(Low Cost)。我知道這是目前的一個焦點,但真正的關鍵在於,我們之所以迎來機器人創業的時代(Era of Robotic Startups),是因為全球供應鏈(Global Supply Chain)提供了許多重要的零件,就像拼圖(Puzzle Pieces)一樣。你可以把它們組裝起來。因此,我們已經將機器人社群從一群試圖設計每個齒輪(Cog)的人,轉變成能夠把這些零件拼湊起來、在更高層次運作的團隊。現在,我們有了專注於智慧層面(Intelligence Level)的公司,他們開發應用程式(Applications),而不是把所有資金和精力都花在讓一台物理機器站起來上。
You know, the commoditization of so many bits and pieces. So I think we probably need to give a huge amount of credit to some adjacent industries, right? So consumer electronics have developed batteries and cameras that, you know, the technology for perception, for seeing the world, for computing—when I look back, even 10, 15 years ago, most of the robot was full of PCBs and wires and had a very small amount of battery capacity. That’s completely changed, right? We can put a huge amount of compute in. We can put tiny sensors in, they’re power efficient. So I think the commoditization of the components—and I don’t think it’s so much about low cost. I know that’s a big focus right now, but I think the reason we’re seeing the era of the robotic startup is because there’s a global supply chain full of really important pieces that you can put together as puzzle pieces. And so we’ve elevated the robotics community from the people trying to design every cog to people who can put those things together as a puzzle and basically operate at a higher level. So now we have companies that are operating at the intelligence level that are developing applications rather than spending the same amount of capital and energy just making a physical machine stand.
是的,我想補充一些關於吉姆(Jim)最初觀點的想法。他很清楚地描述了所有改變的差異。我想再加一點:機器人技術(Robotics)不只是人工智慧(AI)的第一個應用,它其實就是人工智慧的本質。如果你回顧圖靈(Turing)的原始文件,當他談到人工智慧時,他指的是機器人。他說,你應該創造一個會學習的東西,而不是直接打造一個完美的成品。他提出要做一個像孩子一樣的機器人,然後讓它成長。你可以把這個機器人放在教室裡,讓它隨著時間和環境一起進步。這是多麼有趣的想法啊!他在1950年代就有了這樣的洞見。對吧?語言(Language)、視覺(Vision)這些都很酷,但如果你觀察自然界,相較於物理動作(Physical Action),它們出現的時間要晚得多。例如,我們用來訓練的數據(Data)可能來自過去100年、200年,最多1000年,我們不會用超過1000年的數據。但人類已經存在了超過1000年。所以,不是語言帶來了智慧(Intelligence),而是基礎設施——我們的大腦——早已存在,這是通過物理推理(Physical Reasoning)發展出來的。這就是為什麼它的影響如此巨大,你甚至不需要向一個人解釋太多。
Yeah, I can add something to Jim’s point; I think you put it very well about all the things that have changed. I want to add that robotics was not just the first application of AI; it is what AI originally was. If you look at the original documents of Turing, when he talked about AI, it was about robotics. It was like, you should make something that learns instead of directly building the finished adult; make something that looks like a child and then it can grow. You can put that same robot in the classroom and it will grow over time. Such a fascinating thought, and he had it in the 1950s, right? Because language, vision, all of these things are cool. But if you look at nature, they come much later in the timeline compared to physical action. Like, for instance, the data we are training on is from maybe the last hundred years, 200 years, let’s say a thousand; we are not training on more than a thousand years of data. Humans have been around for much longer than 1,000 years. So it’s not that language led to intelligence. The infrastructure, our brain, already existed, and it came through physical reasoning, which is why the impact is so big. Like, you don’t have to explain to a person…
機器人技術(Robotics)是什麼,你能感受到它,因為你幾乎每天都在做一些物理任務(Physical Tasks)。每家公司都會受到機器人技術的影響,這是最重要的事情。除了你們提到的其他因素外,究竟改變了什麼?這些是技術細節,沒錯,但真正改變的是我們看待機器人技術的方式。到目前為止,機器人技術一直是由控制學(Controls)主導的領域。我會說,甚至直到三、四年前,控制系統仍在驅動機器人技術。對吧?對於那些長期投入這個領域的人來說,他們知道控制學並不是專為機器人設計的。就像在第二次世界大戰期間,控制學因為飛機(Flying Planes)、導彈(Missiles)等應用而大放異彩。後來,機器人熱潮開始了,受到某種啟發,人們開始思考:我們該用什麼呢?於是他們採用了當時很出色的控制技術,這種方法延續了下來,持續了幾十年,70年。但這並不是同樣的精神,不是像孩子般的學習(Childlike Learning)。你不會先教孩子微積分(Calculus)來讓他們學會走路,而是讓他們先搞清楚關節動作(Joint Movements),然後通過經驗(Experience)學習。所以,從經驗中學習是已經發生的主要改變。現在我們看到這種轉變正在發生,我們今天剛剛發布了一個關於從經驗中學習的影片。所以我想說,一個重大的改變已經發生:我們從編程經驗(Programming Experience)轉向了從經驗中學習(Learning by Experience),這徹底改變了我們對機器人技術的思考方式。
What robotics is: you can feel it, because you do physical tasks every other day. Every company gets impacted by robotics, and that’s the main deal. So what has changed, apart from the other factors you mentioned? Those are the technical details, yes, but what has changed is how we approach robotics entirely. Robotics so far has been the field of controls. Controls was driving robotics until, I would say, even three or four years ago, right? Now, for those who have been in this area for a long time, they know controls was not designed for robotics. Controls had its real limelight during World War II, for flying planes, missiles, and everything. And then the craze for robotics began, inspired by that, and people were like, what do we use? Well, let’s use controls, which was great for that, and it stuck around; it has stuck around for decades, for 70 years. But it’s not in the same spirit; it’s not childlike learning. You don’t teach a child calculus first to learn how to walk; they figure out their joint movements and then learn by experience. So learning by experience is the main change that has happened. Now we are seeing that shift happening, and we just released a video today on learning by experience. So that is, I want to say, the one major change: instead of programming experience, we have gone to learning by experience, which is a major shift in how we think about robotics.
我想深入探討這一點。對我來說,我在這個領域待得夠久,經歷過傳統控制學(Classical Controls)的時代。但發生了一個很大的改變,那就是網路(Internet)的出現。對吧?如果你想想,這就像一個持續近30年的巨大人類實驗(Human Experiment),全世界的人都在貢獻,創造了這些龐大的數據來源(Source of Data)。於是我們可以訓練人工智慧(AI),這完全是神奇的。現在我們要做的是,請你們所有人一起參與,創造出足夠有用的東西,讓我們能以這樣的規模高效運作。這不需要你事無鉅細地告訴機器人該做什麼。這可能是通往——如果不是AGI(Artificial General Intelligence),至少是非常非常有用的機器——的路徑,或許最終也會通往AGI。
I want to actually double-click on that. So I think, for me, right, I’ve been in the field long enough to have been around in the classical controls era. But a large part of what happened was the internet, right? If you think about it, it’s this enormous human experiment of close to 30 years of everyone around the world contributing to creating these enormous sources of data. So we can train AI, which is completely magical. And now what we’re going to do is ask all of you to do the same: make robots useful enough that they can live and learn at that scale, in a way that doesn’t require you to tele-operate the robot for everything it does. And that is probably the path to, if not AGI, at least very, very useful machines, and maybe also to AGI.
我們會看到一些成果。我想我們得回到你剛才說的最後一點。但與此同時,我很好奇想聽聽你們的看法。
We’ll see some results. I think we’ve got to circle back to the last part you just said there. But in the meantime, perhaps I’m curious to hear your take.
是的,我認為這與亞倫(Aaron)的一些評論相呼應,為什麼機器人技術(Robotics)現在有點像是重新回歸。你知道,我一開始從事機器人技術,然後進展到其他各種領域,後來又繞了回來。機器人技術之所以具有挑戰性,主要有兩個原因。第一,硬體(Hardware)很難搞定;第二,世界是非結構化的(Unstructured)。如果你看看人工智慧(AI)是如何演進的,以及機器人技術是如何發展的,你會發現機器人技術很大一部分都在處理硬體這個難題。許多新興的感測器(Sensors),像是MEMS(微機電系統),還有執行器(Actuator)、驅動技術(Drive Technology)、能源儲存技術(Energy Storage Technology),這些問題都必須先解決。即便是像某些平台這樣的存在,也是在民主化(Democratize)人們讓東西在現實世界中動起來的能力,並把這種能力帶給大家,這樣他們就不用每次都從頭開始重新發明輪子(Reinventing the Wheel)。而在人工智慧這邊,我們一直在逐步前進,從解決結構化問題(Structured Problems)到越來越多的非結構化問題(Unstructured Problems),從處理查詢(Queries)和提示(Prompts),到API,再到簡化的世界模型(Simplified World Models),直到現在的非結構化世界模型(Unstructured World Models)。這個拼圖的每一塊都在提升人工智慧平台(AI Platforms),尋找新的數據攝取方式(Ingest Data),採用之前結構的最佳實踐(Best Practices),然後再往前邁進一步:如果我們拿掉這些輔助輪(Training Wheels)會怎麼樣?現在,你可能只是在看自動駕駛車(Self-Driving Car)傳來的影片,或者是我們機器人相機拍攝的以自我為中心的影片(Egocentric Video),然後思考這個世界接下來會發生什麼。所以我認為,這背後有一種逐步的進展和解鎖(Unlock)正在發生。我們現在看到的,是這些努力終於達到了轉捩點(Tipping Point)。我們終於可以說,好,我們可以開始追逐這個目標了。
Yeah, I think echoing some of Aaron’s comments here, you know why robotics is kind of coming back in. You know, hey, I started with robotics and then kind of progressed to all these other areas and then sort of circled back. Well, robotics is challenging for two reasons. One, hardware is hard, and the world is unstructured, right? If you look at how AI has evolved, right? And how robotics has evolved, a huge portion of robotics has been dealing with the hardware as a hard problem. And many rising sensors like MEMS, building an actuator and drive technology and energy storage technology, all of that stuff had to be solved. Even platforms like ours, we know, democratize people’s ability to just get things to move in the real world at all and bring that to people. So they weren’t reinventing the wheel every single time. On the AI side, we’ve been basically walking our way through going from solving structured to increasingly unstructured problems, solving problems of queries and prompts to APIs to simplified world models to now unstructured world models, where every piece of this puzzle has been sort of up-leveling the AI platforms, finding new ways to ingest data, taking the best practices of the previous sort of structure and how that works and then taking it to this next step of, okay, what if we remove some of these training wheels? And now, you’re just looking at video coming from a self-driving car or now you’re just looking at egocentric video being captured by our robot’s cameras and what is going to happen next in this world. So I think there’s a bit more of this progression and unlock that’s been happening behind the scenes. And we’re just seeing the culmination of that finally reach this tipping point. We’re now okay, we can go after this.
與世界以非結構化方式(Unstructured Ways)互動的完整問題。我認為你剛才說的最後一點現在非常重要,尤其是在談到硬體(Hardware)方面的進展時。過去幾年最大的改變之一,或許是硬體的穩健性(Robustness)——製造出能在現實世界中互動卻不易損壞的硬體的能力。因為我們這些長期在機器人領域(Robotics)工作的人都知道,對吧?如果每次實驗後都需要重建機器人(Rebuild the Robot),那麼實驗的時間就會拖得很長。但現在我們的硬體技術已經到了一個階段,我們可以在現實世界中學習,並安全地與環境互動,既不會損壞自己,也不會對周圍的世界造成傷害。這也是實現進展的一個必要條件(Necessary Condition)。
The full problem of just interacting with the world in unstructured ways. I think the last thing you said there is so important, talking about what has happened in hardware. Maybe one of the biggest things happening the last few years is the robustness of hardware and the ability to make hardware that doesn’t break when it interacts with the real world, because all of us have worked long in robotics, right? Experiments take a long time when you need to rebuild the robot every time you run it. But we’re also now really at the point in hardware where we can have something that learns in the real world and safely interacts with the world without damaging itself or its surroundings. And that’s also a necessary condition for this progress.
關於你們的策略和方法,比如人工智慧(AI)的角色如何從專家模型(Specialist Models)轉向通才模型(Generalist Models),或者你們如何看待基礎模型(Foundation Models)激增這樣的現象,你們有什麼想法?
On your strategies and approaches for things such as the role of AI, moving from specialist to generalist models, or how you think through things such as the explosion of foundation models?
是的,我可以稍微談談我們的Groot策略(Groot Strategy)。我們正在解決一個非常非常困難的問題——為各種不同的人形機器人(Humanoid Robots)打造一個通用大腦(General Purpose Brain),而不是只針對某一種。我們還希望實現所謂的跨載體性(Cross Embodiment)。那麼,我們該如何應對這個挑戰呢?
Yeah, I can perhaps talk a bit about the Groot strategy, right? We’re solving a really, really hard problem to build a general-purpose brain for all kinds of different humanoid robots, not just one. We also hope to get what we call cross embodiment. So how do we tackle this problem?
我想說,有兩個主要原則(Principles)。第一個原則是關於模型本身(Model)。我們希望它盡可能簡單,盡可能做到端到端(End-to-End),甚至到達從光子到動作(Photons to Actions)的程度。對吧?你從攝影機(Video Camera)中獲取像素(Pixels),然後直接輸出連續的浮點數(Continuous Floating-Point Numbers),這些基本上就是馬達上的控制值(Control Values)。這就是端到端的模型,沒有任何中間步驟(Intermediate Steps),盡可能保持簡單。
I would say there are two main principles. Principle number one is that the model itself—we want it to be as simple as possible. We want it to be as end-to-end as possible, to the point that it’s basically photons to actions, right? You take pixels in from like a video camera, and then you directly output continuous floating-point numbers, which are essentially the control values on the motor. And that is the model end-to-end. There are no intermediate steps, as simple as possible?
為什麼這是好的呢?因為如果我們看看自然語言處理(NLP)領域——順便說一下,自然語言處理可能是目前人工智慧(AI)解決得最成功的領域——我認為在機器人技術(Robotics)中,我們應該抄作業(Copy Homework),從那些已經成功的領域借鑑經驗。以ChatGPT為例,在ChatGPT出現之前,自然語言處理領域有點混亂。你有文本摘要(Text Summarization)、機器翻譯(Machine Translation)、內容生成(Content Generation),這些都使用完全不同的數據管道(Data Pipeline)、訓練協議(Training Protocols)和模型架構(Model Architectures),有時甚至不只一個模型。然後一個策略出現,把一切都顛覆了,因為它很簡單。它能將任何文本映射到任何其他文本(Text-to-Text Mapping),就這樣。底層是一個轉換器(Transformer),將一個整數序列(Sequence of Integers)映射到另一個整數序列。正因為它如此簡單,你才能把所有問題的數據統一到一個模型中(Unify All Data)。我認為這正是機器人技術應該抄作業的地方——讓模型盡可能簡單。第二個原則是數據管道(Data Pipeline)其實會非常複雜。模型周圍的所有東西都會很繁瑣。為什麼呢?因為對於機器人技術來說,正如我一開始提到的,數據是一個巨大的問題。你無法從YouTube或維基百科(Wikipedia)下載馬達控制數據(Motor Control Data),你哪裡都找不到。對於Groot,我們的數據策略可以組織成一個金字塔(Pyramid)。在金字塔的頂端,你有真實的機器人數據(Real Robot Data)。這是品質最高的數據,因為它是在現實世界中透過遠端操作(Teleoperation)收集的數據。但這種數據量很有限,難以擴展,因為你受限於每個機器人每天只有24小時的基本物理限制(Fundamental Physics)。對吧?就是這樣。這種數據在現實世界中很難擴展(Scale)。在金字塔的中層,模擬(Simulation)就派上用場了。我們大量依賴物理引擎(Physics Engines),比如Isaac(NVIDIA 的模擬平台),來生成海量數據(Scale Lots of Data)。這些數據可以基於現實世界收集的數據生成,或者通過經驗學習(Learning from Experience)來產生。
And why is this good? Because if we look at the field of NLP—and by the way, NLP is perhaps the most successful field so far that’s been solved by AI—I think in robotics we should copy homework, copy homework from something that already works. So for ChatGPT: right before ChatGPT, the field of NLP was kind of a mess. You had text summarization, machine translation, content generation, all using completely different data pipelines and training protocols and model architectures, sometimes not just one model. And then a strategy came along and just blew everything out of the water because it’s simple. It maps any text to any other text, and that’s it. Underneath is a transformer that maps a sequence of integers to another sequence of integers. And because it’s so simple, you can unify all of the data from all of those problems into one model. And I think that’s where robotics should copy homework: make the model as simple as possible. And the second principle is that the data pipeline will actually be very complicated. All the things that surround the model will be very complicated. And that is because for robotics, as I said at the beginning, data is a huge problem. You cannot download motor control data from YouTube or from Wikipedia. You can’t find it anywhere. So for Groot, our data strategy can be organized into a pyramid. At the top of the pyramid, you have real robot data. That’s the highest quality, because it’s collected using tele-operation in the real world. But that’s going to be quite limited and not very scalable, because you are limited by the fundamental physics of 24 hours per day per robot, right? That’s it. And it’s really hard to scale in the real world. In the middle of the pyramid, that’s where simulation comes in, where we rely heavily on physics engines, like Isaac, for example, to generate lots and lots of data. And this data can be generated based on the real-world collected data or through learning from experience.
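To make the "photons to actions" idea above concrete, here is a minimal sketch in PyTorch of what such an end-to-end policy can look like. The layer sizes, camera resolution, and action dimension are placeholders chosen for illustration; this is not the actual Groot architecture.

```python
# Minimal "photons to actions" policy sketch in PyTorch. Layer sizes, camera
# resolution, and the action dimension are illustrative placeholders, not the
# actual Groot architecture.
import torch
import torch.nn as nn

class PixelsToActionsPolicy(nn.Module):
    def __init__(self, action_dim: int = 32):
        super().__init__()
        # Encoder: raw camera pixels in...
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # ...continuous motor commands out, with no hand-written intermediate steps.
        self.head = nn.Sequential(
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # normalized joint targets
        )

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # pixels: (batch, 3, H, W) camera frames -> (batch, action_dim) actions
        return self.head(self.encoder(pixels))

policy = PixelsToActionsPolicy()
frame = torch.rand(1, 3, 96, 96)   # one RGB frame from the robot's camera
action = policy(frame)             # floating-point control values for the motors
print(action.shape)                # torch.Size([1, 32])
```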
正如大家提到的,
As people mentioned,
這就是模擬數據(Simulation Data)。請記住,在NVIDIA成為一家人工智慧公司(AI Company)之前,它是一家圖形公司(Graphics Company)。什麼是圖形引擎(Graphics Engines)?就是模擬物理(Physics)和渲染(Rendering),對吧?這就是我們的模擬策略(Simulation Strategy)。而在金字塔的底部(Bottom of the Pyramid),我們仍然需要從網路上獲取所有那些多模態數據(Multimodal Data),但這次我們的用法有些不同。對吧?我們會用這些數據來訓練視覺語言模型(Visual Language Models),這些模型可以成為視覺(Vision)、語言(Language)和動作(Action)模型的基礎。這些視覺語言模型(VLMs)是從大量的網路數據中訓練出來的,包括文本(Text)、圖像(Images)、音頻(Audio),應有盡有。最近,視頻生成模型(Video Generation Models)也變得非常出色,它們可以成為世界的神經模擬(Neural Simulations)。金字塔的最後一層其實就是這種神經模擬,對吧?它超越了傳統的圖形引擎(Traditional Graphics Engines)。這些神經模擬可以讓你提示一個視頻生成模型,要求它生成一些東西,比如「幫我幻覺(Hallucinate)出一個新的軌跡(Trajectory),一個新的機器人軌跡」。這些視頻模型對物理學(Physics)的學習非常到位,因為它們是在數億個線上影片上訓練出來的。它們學會了物理規律,能夠生成像素中物理上準確的軌跡(Physically Accurate Trajectory)。然後,你可以運行算法——我們在Groot N1中提出的東西,比如所謂的潛在動作(Latent Actions)——從這些幻覺生成的影片中提取出那些潛在動作。
Right, so that will be simulation data. And just remember, before NVIDIA was an AI company, it was a graphics company. And what are graphics engines? Simulating physics, right? Rendering. So that’s our simulation strategy. And on the bottom of the pyramid, right, we still need all of that multimodal data from the internet, but this time we use it a little bit differently, right? We’ll use it to train visual language models that can become the foundation for vision, language, action models. And the VLMs are trained from lots and lots of internet data: text, images, audio, you name it. And then recently there are also video generation models that have become so good that they can be neural simulations of the world. So the last layer of the pyramid is really the neural simulation, right? That goes beyond traditional graphics engines. With these neural simulations, you can prompt a video generation model and ask for things like, you know, hallucinate a new trajectory, a new robot trajectory for me. And the video models learn physics so well, because they’re trained on hundreds of millions of videos online, that they are able to give you a physically accurate trajectory in pixels. And then you can run algorithms, which we’re proposing in Groot N1, something called latent actions, to extract those latent actions back from the hallucinated videos.
我們稱之為機器人的夢想(Dreams of the Robot),對吧?
What we call the dreams of the robot, right?
就像人形機器人(Humanoid Robots)夢見電子羊(Electric Sheep)一樣。對吧?這很夢幻(Dreamy),你可以從中收集那些潛在動作(Latent Actions),然後將它們放回這個數據循環中。
Like the humanoid robots dream of electric sheep, right? It’s dreamy, and you collect those latent actions from it and you put them back into this data loop.
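One way to picture the "latent actions" step just described (an illustrative reading, not the Groot N1 implementation): an inverse-dynamics-style encoder labels pairs of consecutive video frames, whether dreamed or real, with a latent action that can then be fed back into training.

```python
# Toy sketch of the "latent actions" idea: an inverse-dynamics-style encoder
# labels pairs of consecutive video frames (generated or real) with a latent
# action that can be fed back into training. An illustrative reading of the
# concept, not the Groot N1 implementation.
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=8, stride=4), nn.ReLU(),  # frame_t and frame_t+1 stacked
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),  # latent "action" explaining the transition
        )

    def forward(self, frame_t: torch.Tensor, frame_next: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([frame_t, frame_next], dim=1))

encoder = LatentActionEncoder()
f_t, f_next = torch.rand(1, 3, 96, 96), torch.rand(1, 3, 96, 96)
latent_action = encoder(f_t, f_next)   # pseudo-label for one dreamed transition
print(latent_action.shape)             # torch.Size([1, 16])
```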
對於所有這些非常複雜的數據策略(Data Strategies),我們把它們壓縮起來。我們將它們壓縮成一個乾淨的成品(Clean Artifact),從光子到動作(Photons to Actions),對吧?一個20億參數的模型(2 Billion Model)就足以應對廣泛的任務(Wide Range of Tasks)。這就是Groot策略(Groot Strategy)的概述。
And with all of these very complicated data strategies, we compress them. We compress them into this one clean artifact, from photons to actions, right? A 2-billion-parameter model suffices for a wide range of tasks. So that’s an overview of the Groot strategy.
我認為這描繪了一幅非常美好的未來圖景,對吧?我們有一個簡單的大模型(Big Model),其實它甚至不算太大。這個模型基本上解決了從像素到動作(Pixels to Motion)的所有問題,對吧?
I think that paints a really kind of great future picture, right? So we have a simple, big model. It’s not even that big. That kind of solves everything—pixels to motion, right?
但我認為在這個過程中,我們也需要關注一些事情——那些我們必須承擔的責任,比如將產品推向現實世界(Real World),這需要確定性(Determinism)。對吧?當你需要向客戶交付產品時,你得明白它在意外情況(Unexpected Conditions)下會怎麼表現。你需要考慮功能安全(Functional Safety),還要思考如果在現有功能的基礎上增加新功能(New Capabilities),它會如何退化(Regress)。所以我覺得你提到了一個很重要的觀點,那就是複雜性被推到了數據上(Complexity Gets Pushed into the Data),對吧?也就是你所收集的數據。我認為我們才剛剛開始構建這個數據集(Data Set)的旅程。所以我想說,我們認為一個重要的策略(Piece of Strategy)是,確保在追求這個潛在的、非常強大的最終狀態(Powerful End State)時,不要把整個工具箱(Toolbox)都丟棄。因為作為一個社群,我們還有許多事情要做,其中之一就是維持購買機器人的客戶的信任(Trust of Customers)。我們必須運用手邊所有的工具來做到這一點。所以我認為,有很多令人興奮的新功能(New Capabilities),這些東西已經在改變機器人技術的格局(Landscape of Robotics)。但與此同時,我們也要現實一點——過去70年來,機器人技術已經積累了一個巨大的工具箱(Big Toolbox of Robotics Tools)。這些工具中有一些仍然是解決現實世界問題的正確選擇,尤其是當你在操作可能傷人的大型強力機器人(Large Powerful Robots),或者在人身邊工作時需要保持信任的時候。因為一旦你打破了這種信任,你就再也拿不回來。對吧?所以我想說,我們有一個大工具箱,我們應該為此鼓掌(Applaud)。
But I think along the way, we also need to pay attention to all the things that we have to own—delivering products out into the real world that require determinism, right? So when you need to deliver something to a customer, you need to understand what it’s going to do in unexpected conditions. You need to think about functional safety. You need to think about how it’s going to regress if you add new capabilities on top of existing ones. So I think you pointed out a really important thing, which is the complexity gets pushed into the data, right? And the data you gather. And I think we’re at the very beginning of the journey of building that data set. And so I think, you know, maybe I would say a piece of strategy that we think is important is to make sure that you don’t throw the whole toolbox out in pursuit of this potentially very powerful end state, because we have a lot of things to do as a community on the way. And one of them is to maintain the trust of the customers that are buying robots. We have to be able to do that by applying all of the tools we have. So I think there’s a lot of exciting new capabilities—things that we think will totally change the landscape of robotics, they already are. But at the same time, we need to also be realistic that there’s a big toolbox of robotics tools going back 70 years. And some of those are also the right tool to apply to solving real-world problems, especially when you’re doing things with large powerful robots that could hurt somebody or doing things around a person where you want to maintain that trust. Because as soon as you break it, you never get it back, right? So I think maybe I just say there’s a big toolbox that we need to applaud.
我的意思是,
And I mean,
非常感謝,吉姆(Jim)。我們確實是站在那個陣營裡,努力打造一個簡單的模型(Simple Model)。我們還不知道它最終會是什麼樣子,所以我們稱它為簡單,但相對來說是個簡單的模型系列(Relatively Simple Model Line)。一切都跟數據(Data)有關。如果我們要從早期到近期語言模型(LLMs)的發展中吸取教訓,我想說,有一件事經常被低估,那就是多樣性(Diversity)的重要性。
Uh, thank you very much, Jim. And we are very much in that camp, where we’re making one simple model. We don’t know exactly how it’s going to look yet, so we call it that: a relatively simple model line. It’s all about the data, right? And if we’re going to take the lessons learned from early LLMs through to the later LLMs, I think one of the things that often gets underestimated is the importance of diversity.
所以,
So,
就像語言模型(LLMs)歷史的開端一樣,對吧?很多公司試圖訓練一個很棒的模型來創作詩歌(Poems)。他們會用世界上最好的詩歌來訓練,但結果並不好用。因為除非你在與寫詩無關的、非常多樣化的數據(Diverse Data)上訓練,否則你無法獲得智慧(Intelligence)。智慧正是從這種多樣性中產生的。我們現在看到的,至少在我們的模型中,這一點對機器人技術(Robotics)來說顯然也是成立的。即使是在非常小的規模上,我們現在剛開始使用這些微小數據集(Tiny Data Sets),我們其實更受多樣性的限制,而不是數據規模(Scale of Data)的限制。所以,問題在於:你如何獲得盡可能多的任務(Tasks)?
In the beginning of the history of LLMs, right, there were a lot of companies that tried to train, let’s say, a very good model to create poems, so they would train on all the best poems in the world, and it doesn’t really work. Because unless you train on very diverse data that has nothing to do with writing poems, you’re not going to get intelligence, because intelligence comes from that diversity. And what we’re seeing now, at least in our models, is that this is obviously also true for robotics. And even at a very small scale, where we are now at the beginning with these tiny data sets, we’re actually more limited by diversity than we are by the scale of data. So it’s about, how do you get as many tasks as possible?
我們希望盡可能涵蓋多種不同的環境(Environments),最好環境中有人員走動和各種動態元素(Dynamic Stuff),這樣你才能真正理解什麼是實際任務(Actual Task)。我最喜歡的例子是開啟洗衣機(Opening a Washing Machine)。當我們走進來,看到一台洗衣機時,我們會想:「好,我們要把衣服放進那個圓洞裡。」於是我們試著打開它,找找看有沒有把手(Handle)。如果打不開,可能某處有個插銷(Latch);如果還是沒用,或許我們得把刻度盤(Dial)調回零。我們對洗衣機的運作方式有很深的理解,對吧?所以我們能搞清楚如何使用一台新的洗衣機。但現在的機器完全沒有這種能力,它們就像在學習重複動作(Repeat Motion)。這就是為什麼我們認為,把機器人大量推向現實世界(Get Robots Out There in Volume),並收集多樣化的數據(Diverse Data),是如此重要。我想說,這是我們一個很獨特的觀點(Contrarian View),也很有趣值得討論。因為我們認為,這必須在人群中實現,必須在家中實現(Happen in a Home)。安全性(Safety)必須是機器內在的特性(Intrinsic Thing)。你要如何確保機器裡的能量(Energy)不會太大、不會帶來危險?然後,我們還要思考,如何將這一點與經典工具箱(Classical Toolbox)結合起來?
And in as many different environments as possible, preferably with people moving around and dynamic stuff going on, so that you can understand what an actual task is. My favorite example is opening a washing machine. When we come in and see a washing machine, we see, okay, we’re going to put the clothes into the round hole there. So we’re going to try to open it and we try to find a handle. And if it doesn’t open, maybe there’s a latch somewhere. If not, maybe we turn the dial back to zero. We have this great understanding of how the washing machine actually works, right? So we can figure out how to use a new one, and machines today don’t have that at all. They’re kind of just learning how to repeat a motion. Um, this is why we really think it is so important to get the robots out there in volume and really get that diverse data. And I guess here’s our very contrarian view, which I think is very interesting to discuss, because that’s why we say this has to happen among people. It has to happen in a home. And safety has to be an intrinsic thing in the machine, right: how do you ensure that the energy in the machine is not so big that it’s dangerous? Um, and then thinking about how we can combine this with the classical toolboxes?
是的,我想在這裡補充一點。不像語言模型(LLMs)或視覺(Vision),當你提到機器人技術(Robotics)的時候,方法總是包含兩件事:硬體(Hardware)的方法是什麼?軟體(Software)的方法又是什麼?
And yep, yep, I think one thing I would like to add here is that, unlike LLMs or vision, when you’re talking about robotics, the approach is always two things: what’s the approach for hardware, and what’s the approach for software?
沒有人會對語言(Language)提出這樣的問題,比如該用什麼GPU,因為那個問題早已被解決。但在機器人技術中,這是兩件不同的事情,對吧?這也是一個重大問題:應該只有一種機器人嗎?比如,應該只有一台1X機器人(1X Robot)來代表所有機器人嗎?那麼,你到底在部署什麼?
Nobody asks that for language, like which hardware or GPUs to use, because that question has already been settled. But in robotics these are two different things, right? And this is a major question: should there be only one robot? Like, should there be one 1X robot that is the robot for all robots? And so, what exactly are you deploying?
如果你部署了所有的機器人(Robots),它們之間會共享一個大腦(Shared Brain)。我認為這裡蘊含了一個重要的洞見(Insight)。主要有兩件事。第一,任何人都可以從觀眾中站出來,你可以給他們一套虛擬實境套裝(VR Suit),例如動作捕捉服(Tracking Suit)、手套(Gloves)或虛擬實境頭盔(VR Headset)。他們就能控制任何一台機器人,不需要了解馬達的細節(Motor Details),也不需要知道馬達的運作原理(How the Motors Work)。這已經證明了存在一個大腦(Brain)能夠控制機器人。這是第一個面向,因此你可以從任何地方獲取數據(Data)。第二件事則是,這一點無庸置疑,大家都知道,對吧?但我們卻忽略了一種特殊的機器人——那種我們擁有海量數據的機器人,也就是人類(Humans)。我們不是機械機器人(Mechanical Robots),也不是由電力設計而成(Designed by Electricity),我們是生物機器人(Biological Robots)。但歸根究底,類似的原則(Principles)也在驅動我們。例如你的運動神經元(Motor Neurons)和感覺神經元(Sensory Neurons),它們將信號從感測器(Sensors)傳遞到大腦(Brain),再從大腦傳到馬達(Motors)。所以,如果我們認同存在一個能夠控制所有硬體(Hardware)的大腦,
And then if you’re deploying all the robots, there is a brain shared across them. And this is where I think there’s an insight. There are two things here. One is humans—like anyone can come up from the audience and you can give them a VR suit, like a tracking suit or some gloves or a VR headset. They can control any robot, any robot, and they don’t need to know the motor details. They don’t need to know how the motors work. This is already evidence that a brain can exist that can control a robot. So that’s the first aspect: you can use data from anywhere. But the second thing is—there is no doubt there, everybody knows that, right?—we are missing one special robot that is out there, and we have tons of that data. And those robots are humans. We are not mechanical robots, we are not designed with motors and electricity, we are biological robots. But at the end of the day, similar principles guide us—like your motor neurons and sensory neurons carry signals from your sensors to your brain and from your brain to the motors. So if we are agreeing that a brain can exist that controls all hardware,
那為什麼要排除生物硬體(Biological Hardware)呢?如果不排除這一點,你就可以利用人類的影片數據(Human Video Data),也就是那些我們可能尚未擁有的關於人類活動(Human Activity)的資料。例如一台1X機器人(1X Robot)去做某件事,像是拿起東西或打開冰箱(Opening a Fridge),但你自己每天可能要開冰箱10次。網路上有數以兆計(Trillions)的影片記錄了人類執行這些動作。所以,我們堅信,這是機器人技術(Robotics)中一個極其關鍵的數據來源——關於人類生活運作方式(How Human Lives Operate)的知識。你可以利用這些知識來推動發展,並與模擬(Simulation)相結合。當然,單靠這些數據並不完整,因為你不能只是觀看影片然後模仿(Watch and Play)。但這些元素是可以相互結合(Combine)的。
Why should we exclude the biological hardware? And if you don’t exclude that, you can actually use human video data, video of humans and human activity that we may not have a robot doing. A 1X robot might do something once, picking something up, opening a fridge, but you probably open a fridge 10 times a day. There are trillions of videos out there of humans doing it. So, at least our belief is that this is one of the most critical data sources for robotics: knowledge of how human lives operate, which you can actually use, in addition to simulation. Of course, it’s not complete on its own, because you cannot just watch and imitate. But these things can combine.
好吧,這一點我很快回應一下。我認為我們非常認同這個觀點。我的意思是,所有這些數據(Data)都非常有用,我們也在利用它們;這些方法並不互相排斥。
All right, very quickly on this: I think we very much agree. I mean, all that data is incredibly useful and we use it too; these are not mutually exclusive.
我明白了,我想在此澄清一下,因為這一點似乎混淆了兩件事。不,很好。我能感覺到普拉斯(Pras)對此也有一些強烈的看法。
I see, I just want to clarify, because the point was getting mixed into two things. Yeah, no, it’s good. And I can tell Pras has some strong thoughts on this, too.
嗯,作為一個經常遠端操作(Tele-operate)機器人的人,我可以說,人類大腦(Human Brain)確實很擅長控制各種平台(Platforms)。但從我的經驗來看,我得告訴你,性能水平(Level of Performance)並不總是一樣,硬體(Hardware)確實會有很大的影響。我曾經遠端操作過一台1X機器人(1X Robot),對吧?那是一次很棒的體驗(Great Experience)。但我也操作過一些工業機器人(Industrial Robots),體驗就不怎麼好(Not a Great Experience)。硬體在現實世界中的效果(Effectiveness in the Real World)真的很重要。我的意思是,這十年來我們的機器人已經在世界各地實際運作,對吧?
Well, as someone who’s tele-operated a lot of robots, I can say that, sure, the human brain is great at tele-operating a variety of platforms. But I can tell you from experience, not at the same level of performance; the hardware can definitely make a difference. And definitely, I mean, I’ve tele-operated a 1X robot, right? And it’s a great experience. I’ve tele-operated some industrial robots: not a great experience. Um, the hardware can matter a lot in the effectiveness in the real world. I mean, we’ve had our robots out working throughout the world, but that’s over the last 10 years, right?
但機器人的動態特性(Dynamics of the Machine)很重要,你可以清楚地看到這一點。我們這裡缺少一個例子,比如達文西機器人(Da Vinci Robot)。人們用這種機器人來做手術(Surgery),這已經是一家市值1000億美元的公司($100 Billion Company)。他們所做的就是通過這種方式進行操作,這真的很驚人(Amazing)。這意味著沒有人會否認人類大腦非常強大(Human Brain is Very Powerful)。但硬體(Hardware)的問題,這些討論之所以存在,正是因為機器人技術(Robotics)總是包含這兩件事——硬體和軟體的方法可能不同,但最終它們必須整合在一起(Come Together)。所以這不是非此即彼的選擇(One Hard Way or the Other),而是現實世界的數據(Real World Data)、人類數據(Human Data)、模擬(Simulation),以及從所有這些方面進行擴展(Scaling)。
But the dynamics of the machine matter, and you can really see it. One example we are missing here, which is not on this panel, is the da Vinci robot. People use that robot for doing surgery; that’s already a $100 billion company. And all they do is operate it through tele-operation, which is amazing. So nobody is disagreeing on the fact that the human brain is very powerful. And hardware: these questions come up because robotics is always these two things. The approach can be different, but in the end they all have to come together. So it’s not one or the other; it’s real-world data, human data, simulation, and scaling from all of these things.
我認為這也有點關乎自下而上(Bottom Up)和自上而下(Top Down)的問題,對吧?因為我們現在談論的很多是控制架構(Control Architecture)的自上而下方法。但我覺得自下而上的部分也很有趣,比如,你如何學習靈巧性(Dexterity),對吧?至少我們正在體驗到,手部的快速靈巧性(In-Hand Quick Dexterity)是可以通過學習獲得的。
I think it’s also a bit about like bottom up and top down, right? Because now we’re talking very much top down on the control architecture. But I think it’s also very interesting with the bottom up of like, how do you learn dexterity, for example, right? And at least we’re experiencing that learning like in-hand quick dexterity,
我們不知道該怎麼做到這一點。就像我們不知道如何建立一個遠端操作系統(Tele-operation System)來實現這一點,既要快速又要夠好,還能提供觸覺反饋(Tactile Feedback)之類的功能。但機器人其實可以學得很好,只要你給它一堆東西讓它玩(Objects to Play With)。這是可以學習的(Learnable)。然後問題就變成,你如何在抽象層(Abstraction Layer)上提升這個介面(Interface),也就是你的遠端操作介面(Tele-operation Interface)。所以你不再是說:「嘿,我要這樣捏取抓握(Pinch Grasp)。」而是更像在引導機器完成什麼任務(Guiding the Machine for Tasks),讓系統自己去學習這些細節(Details)。是的。
We don’t know how to do it. Like, we don’t know how to build a tele-operation system for that, one that’s fast and good enough and really gives you tactile feedback and all these things. But the robot can actually learn it really well if you just give it a bunch of objects to play with. This is learnable. And then it becomes a question of how you lift that interface up an abstraction layer, basically your tele-operation interface. So you’re no longer saying, hey, I’m going to pinch-grasp like this. You’re more like guiding the machine on what tasks to be done, allowing the system to actually learn the details. And yep.
我認為有一件事我們常常忽略,當我們試圖把大腦(Brain)從硬體(Hardware)中分離出來時,那就是你想完成的任務(Task)。如果你考慮的是一系列任務,裡面的物件很小(Small Objects),慣性影響不大(Inertially Irrelevant),那麼是的,你可以把大腦和身體(Body)分開很多。但現實是,我們想讓這些機器執行的任務,大多超出了簡單的桌面任務(Simple Tabletop Tasks),也就是很多人一開始會嘗試的那種。如果你想舉起又大又重又複雜的物體(Big, Heavy, Complex Objects),或者你想接觸某些柔軟或脆弱的東西,那麼我認為硬體真的很重要。我覺得硬體和軟體必須共同進化(Co-evolve)。我認為那種完全分離的想法——有一個帶API的優秀硬體平台(Good Hardware Platform with an API)配上任何軟體大腦(Software Brain)——有時候並不可行,這兩者需要共同演進(Co-evolve)。了解你的執行器(Actuator)品質,比如它有多少摩擦(Friction),對你在模擬(Simulation)中如何表現它影響很大。例如,我認為我們還需要更多時間,才能完全理解像Groot這樣的模型(Model like Groot)如何部署在A型機器人(Type A Robot)和B型機器人(Type B Robot)上,因為我們目前還沒有足夠的數據點(Data Points)來證明一個模型能適用於所有不同類型的機器人。由此產生的行為(Resulting Behavior)會有顯著差異。如果我只是要拿起一袋薯片(Bag of Chips),移動它然後放下,我覺得這無所謂;但如果我要拿起一個高精度零件(High Precision Part)並將它組裝到另一個高精度孔(High Precision Hole)中,那可能就差別很大了。
I think there’s one thing that we tend to skip when we try to separate the brain from the hardware. And that’s the task you’re trying to do. So if you’re thinking about a whole set of tasks where the objects are small and inertially irrelevant, um, yeah, you can separate a lot of the brain from the body. But I think the reality is that mostly what we want to build these machines for extends beyond the simple tabletop tasks that I think a lot of people start with. If you want to be lifting big, heavy, complex objects, or you want to be touching something soft or delicate, then I do think hardware really does matter. And I think it has to co-evolve. I think the idea that we can kind of have a complete disconnection between, you know, a good hardware platform with an API and any software brain—I think sometimes those two things need to co-evolve. Understanding the kind of quality of your actuator, how much friction it has in it, can matter a lot for how well you can represent it in simulation, for example. And I think we’re gonna need more time before we fully understand how a model like Groot, for example, deploys on a robot that’s of type A and a robot that’s of type B, because I don’t think we have enough data points yet to say that one model will deploy across all of these different kinds of robots. And there will be significant differences in the resulting behavior. And if I’m trying to pick a bag of chips and move them and drop them, I don’t think it matters, but if I’m trying to pick, you know, a high precision part and assemble it in another high precision hole, it might matter a lot.
所以我認為,關於是否能真正分離這兩件事(硬體與軟體),目前還沒有定論(Jury's Out)。這真的要看情況,也可以換個角度來看:有些硬體不適合某些任務,但有些硬體卻很適合。很多公司在這方面做得非常出色,所以……
So I think the jury’s still out on whether you can really separate those two things, and it really depends. There could be another way to look at it: one piece of hardware isn’t suited to some tasks, while another is. Lots of companies are doing really great work here, so…
我認為亞倫(Aaron)其實提到了一個非常有趣的主題,也是一個很棘手的挑戰——跨載體性(Cross Embodiment)。跨載體性對模型(Model)意味著什麼?讓我們稍微想想自己。我認為人類很擅長跨越身體(Crossing Body)。比如,每次你打開一個電玩遊戲(Video Game)開始玩的時候,你其實就在進行跨載體性,對吧?就像你在遊戲裡開車(Driving Cars),或者扮演一些奇怪的角色(Weird Character),有時候甚至是非人類角色(Non-Human Character)。然後,玩了一會兒搖桿(Joystick)後,你就會開始感覺到如何在虛擬遊戲中控制那個身體(Control That Body)。過一段時間,你就能玩得很好了。人類大腦確實很擅長這種跨越。所以我認為這是一個可以解決的問題(Solvable Problem),我們只需要找到幾個關鍵參數(Parameters)來實現它。我同意亞倫的看法,現在還太早(Too Early)。現在談什麼「零樣本跨載體性」(Zero-Shot Cross Embodiment),也就是拿來一個機器人和一個模型就魔法般運作(Magic Works),我覺得還不行,對吧?我們還沒到那一步,但總有一天我們會實現的。我認為一個方法是擁有大量不同的機器人硬體(Robot Hardware),甚至在模擬(Simulation)中有更多種類的硬體。此前,我們的研究小組做了一個很有趣的工作,雖然我得說還是像個玩具級的探索(Toy Exploration),叫做「MetaMorph」。我們在模擬中程序化生成(Procedurally Generated)了許多簡單的機器人,這些機器人有不同的關節連接方式(Joint Connectivity)。它可能看起來像蛇(Snake)、像蜘蛛(Spider),很怪,但我們生成了數千個。然後,我們用一種機器人語法(Robot Grammar)來標記(Tokenize)機器人的身體,把載體本身轉換成一個整數序列(Sequence of Integers)。一旦有了整數序列,對吧?我們就看到了轉換器(Transformers)。「注意力是你所需的一切」(Attention Is All You Need),對吧?我們把轉換器應用到這數千個載體集合中。我們發現,你其實可以推廣到第一千零一種載體(Thousand-and-First Embodiment)。但這仍然是一個很初步的實驗(Super Early Experiment)。不過我相信,如果我們能有一種通用的描述語言(Universal Description Language),再加上大量不同類型的真實機器人和模擬機器人,我們就能組織它們,從中生成大量數據(Generate Lots of Data)。這些載體就變成了一個通用的向量空間(Vector Space of Embodiment),新的機器人或許就能落在這個分佈範圍內(Within Distribution)。我還想補充,這不只是學術好奇心(Intellectual Curiosity),它正成為一個很實際的問題(Real Problem)。這裡所有的硬體公司創始人都面臨這個問題:你有不同代的機器人(Different Generations of a Robot),你在上一代收集的數據(Previous Generation Data)和訓練的模型(Trained Model),無法很好地推廣(Doesn't Generalize),甚至對你自己公司的V2和V3機器人(V2 and V3 Robots)都會顯著退化(Degrades Significantly)。甚至在同一版本的機器人內,因為製造(Manufacturing)、因為各種小缺陷(Little Defects),物理世界就是這麼混亂(Messy),對吧?每台機器人都不完全一樣(Not Replicate Perfectly)。即使在同一代內,你也會遇到跨載體性的問題,更別提跨代(Across Generations)、跨公司和設計(Across Companies and Designs)了。所以這正成為一個真實的挑戰。我認為我們才剛剛開始觸及表面(Scratching the Surface)。
Uh, I think Aaron actually touched on a very interesting topic and a very difficult challenge: cross embodiment. What does cross embodiment mean for a model? So let’s maybe think for a second about ourselves. I think humans are great at crossing bodies. Any time you open up a video game and start playing it, you’re actually doing cross embodiment, right? Like, say you’re driving cars in a game or playing some weird character, sometimes a non-human character. And then after a while, after you play with the joystick a little bit, you get a feel for how to control that body inside the virtual game. And after a while you can play it super well. The human brain is great at crossing bodies. So I think it’s a solvable problem; we just need to find those several parameters to enable this. And I agree with Aaron that for now it’s still quite early to talk about zero-shot cross embodiment, meaning that you bring a robot and a model and it just magically works. I don’t think so, right? We’re not there yet, um, but someday we will be. And I think one way to get there is to have lots and lots of different robot hardware, and even more different robot hardware in simulation. So previously, our research group had a very interesting work, but I would say still a kind of toy exploration, called MetaMorph. What we did is, in simulation, we procedurally generated a lot of simple robots with different kinds of joint connectivity. It can look like a snake, look like a spider, really weird, but we generated thousands of them. And then we used a robot grammar to tokenize the body of the robot, essentially converting the embodiment itself to a sequence of integers. And once we see a sequence of integers, right? Then we see transformers. Attention is all you need, right? We applied transformers to this whole set of thousands of embodiments. And we found that you can actually generalize to the thousand-and-first embodiment. But again, it’s a toy experiment, super early. But I do believe that if we’re able to have a universal description language, and we have lots of different types of real robots and simulation robots, and we can organize them and generate lots of data from them, then all the embodiments become this kind of universal vector space of embodiment, and perhaps a new robot will be within distribution. I also want to add that this is not just an intellectual curiosity. It’s becoming a very real problem, right? So I think all of the hardware company founders here have this issue where you have different generations of a robot, and the data you collected on the previous generation and the model you trained on that data don’t generalize, or degrade significantly, even on your own company’s V2 and V3 robots. Actually, even within the same version of the robot, because of manufacturing, because of all the little defects, it’s the physical world, it’s messy, right? Because of all that messiness, robots don’t even replicate each other perfectly. You have cross embodiment issues even within one generation of the robot, let alone across generations, let alone across different companies and designs. So it’s becoming a real problem. And I think we’re just scratching the surface.
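As a toy illustration of the "tokenize the embodiment" idea above: each joint of a robot becomes a token and a transformer consumes the sequence, so differently shaped bodies are just variable-length inputs to one model. The vocabulary and dimensions below are made up for the sketch; this is not the MetaMorph codebase.

```python
# Toy illustration of "tokenize the embodiment": each joint of a robot becomes a
# token and a transformer consumes the sequence, so differently shaped bodies are
# just variable-length inputs to one model. The vocabulary and sizes are made up
# for this sketch; it is not the MetaMorph codebase.
import torch
import torch.nn as nn

VOCAB = {"<pad>": 0, "revolute": 1, "prismatic": 2, "fixed": 3, "end_effector": 4}

class EmbodimentEncoder(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, joint_tokens: torch.Tensor) -> torch.Tensor:
        # joint_tokens: (batch, num_joints) integer ids describing the morphology
        pad_mask = joint_tokens.eq(0)
        h = self.transformer(self.embed(joint_tokens), src_key_padding_mask=pad_mask)
        return h  # per-joint features that a policy head could map to joint commands

# Two different morphologies (a 5-joint arm, a 3-joint "snake"), padded to length 5.
arm   = torch.tensor([[1, 1, 1, 1, 4]])
snake = torch.tensor([[1, 1, 4, 0, 0]])
encoder = EmbodimentEncoder()
print(encoder(arm).shape, encoder(snake).shape)  # torch.Size([1, 5, 64]) each
```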
是的,現在的確沒有太多的多樣性(Diversity),老實說。如果你看看人形機器人領域(Humanoid Space),我們幾乎都在使用相當類似的硬體形態。我們在自家的Atlas機器人上,決定只用三指夾具(Three-Finger Gripper),這打破了完全擬人化手部(Fully Anthropomorphic Hand)的趨勢。我們發現,人類真的很擅長把自己映射(Mapping Themselves)到這些設計上,即使手指數量不同,對吧?所以,一個遠端操作員(Tele-operator)在經過幾小時的訓練後,用我們的遠端操作設備(Tele-op Rig)操作三指夾具,就能做到你用五指能做的幾乎所有事。所以我認為這裡有很大的探索空間(Space to Explore)。目前,因為每個人都專注於建立基礎(Building a Foundation),我們還不夠大膽(Not Being Very Brave)。但我認為,一旦這些模型開始展現泛化能力(Generalizations),你會看到人們開始稍微偏離這些常規(Break Away from These)。這可能是好事,也可能是壞事。我覺得我們最終可能會造出一些機器人,它們的外觀離人類夠遠(Far Enough Away from Humans),可能會讓人覺得有點可怕(Scary)。但我認為,單單在機械手(Manipulator)這部分,就有非常豐富的機會空間(Rich Space of Opportunities)。比如,Agility Robotics 用了一個完全不同的夾具(Completely Different Gripper),跟其他任何人形機器人上的都不一樣,但他們仍然能完成一些相同的任務(Same Tasks)。所以我覺得這會是未來幾年一個令人興奮的主題(Exciting Topic)。
Yeah, there’s not a lot of diversity right now, honestly. If you look at the humanoid space, we’re all pretty much working with fairly similar hardware. On our robot, Atlas, we decided to only use three fingers for our gripper and, you know, it was bucking the trend of having a fully anthropomorphic hand. Um, and we found, you know, humans are so good at mapping themselves onto these things even if the number of fingers is different, right? So you can have a tele-operator operate a three-finger gripper within a couple hours of training in the tele-op rig, and they’re doing kind of everything you do with five fingers. So I think there’s a lot of space to explore here. Um, I think because everybody’s trying to get a foundation built right now, we’re not being very brave, but I think what’s going to happen, as soon as you see these generalizations start to show up in our models, is you’ll see people break away from these a little bit. That might be good or might be bad. I think, you know, we may end up with robots that look just far enough away from humans that it’s scary, but just inside of the manipulator alone, there’s such a rich space of opportunities there. I think Agility has got, you know, a completely different gripper than anything you see on these other humanoids. And they’re still able to do some of the same tasks, so I think that’s going to be an exciting topic in the coming years.
是啊,亞倫(Aaron),你給我一千台不同的Atlas,我就幫你解決,好嗎?給你。
Yeah, Aaron, you give me 1,000 different Atlases and I’ll solve it for you, all right? Here you go.
我覺得你們已經回答了我的下一個問題,那個問題特別是關於硬體(Hardware)的。所以謝謝大家。但我想繼續深入這個話題,因為這確實是一個很有趣的挑戰,你們都有很深刻的洞見(Insight),而且是從獨特的角度出發。你們會不會說剛剛談到的這些——甚至你,吉姆(Jim),當你提到即使是同一台製造出來的機器人(Same Robot That’s Manufactured),它的表現也可能有所不同(Perform Differently),這取決於很多因素——這是你們認為目前硬體方面最大的挑戰(Biggest Challenge)嗎?
Well, I feel like you already answered my next question, which was specifically around hardware. So thank you all. But I want to continue on that, because it is a really interesting challenge that you all have real insight into, and from unique perspectives. Would you say that what you were just talking about (even you, Jim, when you mentioned that the same manufactured robot might perform differently depending on many factors) is the biggest challenge right now when it comes to hardware?
我認為這絕對是其中一個挑戰(One of the Challenges)。這也促使我們去研究跨載體性(Cross Embodiment)這條路線,看看如何彌合這些差距(Bridge Some of These Gaps)。但我想把這個問題留給在座的所有專家來回答。
I think that’s definitely one of the challenges. And that also prompted us to kind of study this line of research on cross embodiment on how we can bridge some of these gaps. But I will defer this question to all our experts here.
這又回到了我認為工具箱其他部分(Rest of the Toolbox)很重要的地方,對吧?如果你製造了一個有很好校準方法(Good Calibration Methods)的機器人,如果你能理解如何描述它的特性(Characterize It),並在關節層級控制(Joint-Level Control)上做了很多扎實的工作——那些在人工智慧(AI)層面之下的東西(Stuff That Sits Way Below the AI)——那麼我覺得這些問題就不會那麼嚴重(Not as Big a Deal)。但如果你有一個機器人,你無法描述它的特性(Can’t Characterize),也沒有校準(Haven’t Calibrated),從一台到另一台之間有很多變異性(Variability from Copy to Copy),然後你只是隨便套上一個控制系統(Throw a Control Around It),不管是AI策略(AI Policy)還是其他什麼,我認為你會發現輸出的變異性很大(A Lot of Variability in the Output)。不過,我覺得現在你可以做很多工作來縮小這個差距(Minimize That Gap),我想你們可能也有一些想法。
This is back to where I think you’ll find that this is where the rest of the toolbox matters, right? So if you make a robot that has really good calibration methods, if you make a robot that you understand how to characterize—if you do a lot of the good work on the joint-level control, right? The stuff that sits way below the AI—then I think some of these things aren’t as big a deal. So I think when you have a robot that you can’t characterize, that you haven’t calibrated, that has a lot of variability from copy to copy, and you just kind of throw a control around it—whether it’s an AI policy or whether it’s something else—I think you find a lot of variability in the output. But I think you can do a lot of work to minimize that gap right now, and I think you probably have some thoughts here as well.
是的,我認為另一個面向是把機器人推向現實世界(Real World),進行製造(Manufacturing),觀察你會遇到的變異性(Variability)。你會從中獲得很多經驗教訓(Learning),這些教訓會反饋到你建立的流程(Pipeline)中。一個很好的例子是我們的 Digit 機器人,它有一個完全靠學習得來的恢復行為(Recovery Behavior)。我們已經把它部署到現實世界中,在我們的生產系統(Production Systems)上運行。我們用來訓練它的領域隨機化(Domain Randomization)和數據多樣性(Diversity of Data),都來自於我們在現實世界中的經驗(Experienced in the Real World),以及我們管理的所有 Digit 機器人車隊(Fleets)的變異性。結果發現,我們在領域隨機化和強化策略(Hardening the Policy)上花了很多功夫,以至於當我們把這個策略轉移到新款機器人上時——這款新機器人剛剛亮相(Just Debuted),比之前重10公斤(10 Kilograms Heavier),框架也更大(Much Larger Frame)——這個策略竟然一次就成功轉移(One-Shot Transferred)到了這個全新的機器人上。它的運動學(Kinematics)略有不同,負載更重(Heavier Payload),一切都變了。但之所以能做到這一點,是因為我們花了很多時間去強化(Hardening)和穩健化(Robustifying),真正理解了腳部接觸(Foot Contact)之類的細節,以及所有這些部分(All These Pieces)。所以我確實認為,隨著經驗的累積,你會在跨載體性上做得更好,這不僅僅是……
Yeah, I think another aspect to this is just, you know, getting robots out there in the real world and doing manufacturing and seeing what variability you have—you do get a lot of learning that feeds back into the pipeline you’ve built. So a great example of this is, you know, Digit has a recovery behavior that’s fully learned, right? And we’ve been deploying it out in the real world—it’s on our production systems—and the domain randomization and diversity of data that we used to train that comes from what we experienced in the real world and the variation across all of the fleets of Digits that we operate. It turned out that we were doing so much of this domain randomization and hardening the policy so much that when we transferred the policy to our new robot, which we just debuted, which is like 10 kilograms heavier—it’s like a much larger frame—the policy actually just one-shot transferred to this totally new robot, slightly different kinematics, heavier payload, everything. And it’s because we had been spending all of this time hardening and robustifying it, really understanding all the details of things like sensing, foot contact, and all of these pieces. So I do think that with experience you get better at this cross embodiment, and it isn’t that you’re just…
……也不是說你就注定得一直非常仔細地檢查每台機器人的製造序列號(Manufacturing Serial Number)。
…always doomed to needing to look at the manufacturing serial number of the robot super carefully.
當你開始做這件事,並在現實世界中累積經驗(Experienced with the Real World)時,你會更了解在訓練流程(Training Pipeline)中需要捕捉哪些關鍵因素(Levers)。
There’s some amount of—as you do it and as you get experienced with the real world—you understand more about what the levers are that you need to capture in the training pipeline.
我想你是認同的,當你從幾百台機器人擴展到幾十萬台時,這不再是一個選擇(Not a Choice)。當你擁有數千或數十萬台機器人(Hundreds of Thousands of Robots)時,你不可能為每台機器人單獨調整軟體堆疊(Software Stack)。所以我認為,這是必須發生的事(Has to Happen)。
I think you’re consented by—when you get from when you go from hundreds of thousands of robots—it’s not a choice. You can’t be tuning your software stack per robot when you have thousands or hundreds of thousands of robots. So I think it’s just something that has to happen.
我不能完全同意你們兩位的看法,但我在這一點上是認同的:校準(Calibration)很重要,非常重要(Matters a Lot)。我覺得這其實很有趣,可能有點深入(A Bit Too Deep)。當你進行領域隨機化(Domain Randomization)時,你實際上是在教你的系統保持保守(Be Conservative)。你教系統的邏輯是:「如果我不知道這麼做會發生什麼,那就當作是不安全的(Unsafe Anyway)。」這會掩蓋你的動態特性(Masks Your Dynamics)。所以,這真的取決於你想達到的目標(What You're Trying to Achieve)。如果你做了領域隨機化,你可能無法從系統中獲得同樣的性能(Same Performance),但你確實會得到一個非常穩健(Robust)的東西。如果你的校準做得好,你就能從系統中榨取更多潛力(Get More Out of Your System)。所以從長遠來看(In the Long Term),這很重要。我認為現在有一些非常令人興奮的工作正在進行,比如把機器人的歷史(Robot History)加入模型的上下文(Context of the Model)中。對於每台獨立的機器人(Individual Robot),你提取它的運行時間數據(Run-Time Data),把它放入歷史中,融入模型的實際上下文(Context)。然後,系統會在這個上下文中學習自己的動態特性(Own Dynamics),效果居然出乎意料地好(Surprisingly Well)。
I can’t agree with both of you, but like I agree along here would like—calibration matters. It matters a lot, you know. What I think is very interesting, actually—and maybe it’s a bit too deep—but like when you do the domain randomization, what you’re actually teaching your system, right, is to be conservative. You’re teaching your systems like, oh, if I don’t know what will happen if I do this—unsafe anyway, yeah. And this kind of masks your dynamics. So it really depends on what you’re trying to achieve—like you won’t get the same performance out of the system if you don’t mind around the mice, but of course you will get something that’s very robust. So if you do really good calibration, you can kinda get more out of your system. So it will matter in the long term. And I think there’s some incredibly exciting work going on right now with adding their robot history to the context of the model. So for every single individual robot, you take some of that robot’s run-time and you put it into the history, into the context of the actual model. And then it learns kinda its own dynamics in context, which actually works surprisingly well.
這真的很酷,這正是我們所相信的做法,也就是 RMA(Rapid Motor Adaptation,快速運動適應),再搭配注意力機制(Attention)來處理上下文,就是這個概念。我想從稍微不同的角度來談談這個問題:模型無法跨機器人版本沿用(Carry Your Model Across Versions)是一個大問題(Big Problem)。很難期望世界上只有一個機器人、一家公司,統治所有機器人(One Robot, One Company)。現實完全不是這樣。就像汽車行業有那麼多汽車公司(Car Companies),手機行業有那麼多手機公司(Mobile Phone Companies),對吧?
Uh, I mean, this is really cool, and that's kind of what we believe in; it's what's called RMA (Rapid Motor Adaptation), and doing it with attention over that context is exactly this idea. I'm going to give a slightly different flavor to this: it's a big problem that you can't carry your model across versions. And it's very hard to expect that there will only be one robot, one company, for all the robots in the world. It's nowhere near that: in cars there are so many car companies; in mobile phones, so many mobile phone companies, right?
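For readers who want a concrete picture of "putting the robot's own run-time history into the context of the model", here is a minimal sketch in the spirit of RMA-style, in-context adaptation: the policy input includes a rolling window of recent states and actions, so the network can infer how this particular unit responds to its own commands. The dimensions and the placeholder `net` are assumptions for illustration, not any panelist's actual architecture.

```python
from collections import deque
import numpy as np

class HistoryConditionedPolicy:
    """Wraps a state-to-action network with a rolling (state, action) history
    that is concatenated into its input, so the policy can adapt in context
    to the specific body it happens to be running on."""

    def __init__(self, net, state_dim, action_dim, history_len=50):
        self.net = net                            # callable: np.ndarray -> np.ndarray
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.history_len = history_len
        self.history = deque(maxlen=history_len)  # rows of concatenated (state, action)

    def _context(self):
        # Zero-pad so the input size stays constant before the buffer fills up.
        pad = [np.zeros(self.state_dim + self.action_dim)] * (self.history_len - len(self.history))
        return np.concatenate(pad + list(self.history))

    def act(self, state):
        obs = np.concatenate([state, self._context()])
        action = self.net(obs)
        self.history.append(np.concatenate([state, action]))
        return action
```

At every step the network sees not just the current state but how this robot has recently responded, which is what lets the same weights behave sensibly on a slightly different body.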
對於他們來說,甚至對於日常應用(Everyday Application),有這麼多GPU(GPUs)在創造價值。但你可以通過更高層次的抽象(Abstracts You Away)來脫離這些硬體細節,對吧?對於這兩種特性系統(Two Property System),機器人技術(Robotics)的等價物是什麼?讓我們來解決這個機器人問題。我想說,這裡的觀點略有不同(Slightly Different Take)。在其他領域,我們總是被從硬體(Hardware)中抽象出來,不管是視覺語言(Visual Language)還是其他。如果一家新公司要進入市場,比如 AMD 或其他公司,他們必須確保所有人都能無縫運行(Seamlessly Run)他們的 NVIDIA 代碼,或者那些真正針對 GPU 的代碼(Code That Is Really GPUs),在他們的 GPU 上。這是他們的責任(Their Burden),而不是軟體的負擔(Software Burden)。用這個比喻來說,對於我們正在建造的機器人,AI 就像大腦(Brain)。我們不應該打造一個只適用於特定機器人的大腦,而是要創造一個能適應機器人(Adapts on the Robot)的大腦。這是最大的區別(Major Difference)。人類擁有的不是一個能做很多事的系統,而是一個能學會做很多事的系統(Learn to Do Many Things)。我們腦子裡裝的是一個學習引擎(Learning Engine),它能即時學習(Learn on the Fly)。無論你聽到什麼,你都在即時學習、即時適應(Adapting on the Fly)。這將是 AI 在其他領域與機器人技術之間最大的不同(Major Great Difference)。在機器人技術中,我們真正要部署的是這些學習引擎(Learning Engines)。因為很多事情都在變化。例如,別說其他人類或汽車之類的東西了,就連你自己的身體(Your Own Body)也是如此。如果我去健身(Do Workout),運動完後我的手會酸痛(Hands Are Sore)。我要拿起牙刷或水瓶時,我的身體已經不一樣了(Different Body)。因為運動後的身體需要更多力氣(Requires a Lot More Torque)才能達到運動前的相同效果(Same Output)。所以,我們的大腦會即時適應(Adapting on the Fly)這些變化,從每一微秒(Microseconds)、幾分鐘到幾小時(Long Hours)。我認為,這就是關鍵差異(Mean Difference),也應該是、將會是,當這些 AI 應用到機器人時,與其他領域相比的區別。在其他地方,這一直是簡單的流程:訓練、部署(Train, Deploy)。你不用擔心適應(Adaptation),也不用擔心變化,因為 GPU 每天都在進步(GPUs Get Better),你都被照顧好了(Taken Care Of)。任何公司進來,你都被照顧好了。但在機器人技術中,這將是不同之處,你部署的是學習引擎。這就是為什麼這是 AI 的一種非常不同的應用(Very Different Application),與我們迄今見過的任何東西相比。
So, but the thing is, for them, and even for everyday applications, there are so many GPUs out there, and yet there's a layer that abstracts you away from the hardware, right? What is the equivalent of that for robotics? Let's come to solving robotics. So here I would take a slightly different view. In every other field we are always abstracted away from the hardware, whether it's vision or language. If a new company wants to come in, say AMD or anyone else, they have to make sure that everybody's code that was really written for NVIDIA GPUs runs seamlessly on their GPUs. That's their burden; it's not a software burden. In our analogy, AI is the brain for the robot we are building, and we shouldn't be building a brain that just works on one robot, but rather a brain that adapts on the robot. That's the major difference: what humans have is not a system that can do many things, it's a system that can learn to do many things. What we are carrying in our heads is a learning engine that can learn on the fly; whatever you hear, you are learning on the fly and adapting on the fly. And that will be the major difference between how AI has been done for everything else and how it will be done for robotics: in robotics, what we will really be deploying are these learning engines. Because so many things change. Forget about other humans or other cars; even your own body changes. If I go do a workout, when I come out of the workout my hands are sore, and when I pick up a toothbrush or a bottle I have a different body now, because my body now requires a lot more torque to get the same output than it did before the workout. So our brain is adapting on the fly to changes happening on every scale, from microseconds to minutes to long hours. And that, I believe, is what the main difference should be, and will be, when these AI models get onto robots, compared to how AI has been applied anywhere else. Everywhere else the story has been simple: train, deploy. You don't have to worry about adaptation, you don't have to worry about change, because as the GPUs get better every day you are taken care of; any company comes in, you are taken care of. But in robotics that will be the difference: you will be deploying learning engines, which is why this is a very different application of AI than anything we have seen so far.
但我認為,總體來說,機器人 AI(Robotics AI)與其他數位 AI(Digital AI)之間的這種區別(Distinction),最終也會消失(Go Away),對吧?所以……
But I think in general, this distinction between like robotics AI and other digital AI, I also think that will go away, right? So…
我想我們這些天問得太多了:「AI 能為機器人技術做什麼(What Can AI Do for Robotics)?」但我們卻不常問這個問題——
I think we ask the question "what can AI do for robotics?" way too much these days, and we don't ask the question,
「抱歉,機器人技術能為 AI 做什麼(What Can Robotics Do for AI)?」
What can AI—sorry, what can robotics do for AI?
因為你在現實世界中採取行動(Taking Actions in the Real World)時得到的數據(Data),你會有一個假設(Hypothesis),你採取行動(Take an Action),觀察結果(Observe the Result),然後你在學習(Learning)。這就是我們學習的方式,對吧?
Because the data that you get when you’re actually taking actions in the real world and you’ll have a hypothesis—you take an action, you observe the result, and you are learning. That’s how we learn, right?
我們最近在推理模型(Reasoning Models)中看到很多例子,比如它們在數學(Math)和程式碼(Code)上表現得非常出色,因為這些是可以驗證的(Verifiable)。你可以去檢查:「我做對了嗎(Did I Get It Right)?」
And we see a lot of things in reasoning models lately, for example, being incredibly good at math, being incredibly good at code, because it’s verifiable. You can go and see, like, did I get it right?
機器人可以讓你對任何事情都做到這一點(Do That for Everything)。這就是我們學習的方式(How We Learn)。從這個意義上來說,
Well, robots can allow you to do that for everything. That’s how we learn and in that sense,
從這個意義上說,我認為一個例子是幻覺(Hallucination)。幻覺在 AI 中是一個大問題(Big Problem)。你有沒有聽過機器人會產生幻覺(Robots Hallucinating)?這不是我們常討論的話題(Discussion Topic)。為什麼?因為機器人不需要靠「猜」。比如我想知道「如果我把這個按鈕從這裡按到那裡會怎麼樣?」我可以直接試試(Just Try),東西可能會掉下來(It Will Drop),我能看到結果(I Can See)。我不需要用猜的,因為我已經通過互動學到了(Learned by Interaction)。所以,既然我能互動(Interact),互動就是幻覺的敵人(Enemy of Hallucination)。因為當你互動時,幻覺就會消失(Goes Away)。然而,當你從被動數據(Passive Data)中學習時,比如數據來自維基百科(Wikipedia),你就無法驗證一切(Verify Everything),除非是數學(Math)或程式碼(Coding)。在這些領域,幻覺的問題較小(Less of a Problem),因為你可以實際驗證答案(Verify the Answer)。
I think an example is hallucination. Hallucination is a big problem in AI. Have you ever heard of robots hallucinating? It's not a topic we discuss. Why? Because robots don't have to guess. If I want to know what will happen if I push this button from here to here, I can just try it: it will drop, and I can see that. I don't have to guess, because I learned it by interaction. So since I interact, interaction is the enemy of hallucination, because as you interact, hallucination goes away. Whereas when you're learning from passive data, where the data is coming from Wikipedia, you can't go and verify everything, unless it's math or coding, where hallucination is less of a problem because you can actually verify the answer.
是的,所以會發生什麼呢?我們得到了更多的數據(Way More Data)。就像你說的,我們翻轉了金字塔(Flipped the Pyramid),對吧?你不就是這麼說的嗎?我們翻轉了金字塔。現在,機器人數據(Robot Data)的規模遠遠超過了互聯網(Bigger Than the Internet),我們就能解決今天遇到的很多問題(Solve a Lot of Problems)。而我們需要的只是更多的 GPU(GPUs)。我認為這絕對是——這一直是答案(Always the Answer),對吧?我們都在這裡,對吧?我覺得機器人絕對會有幻覺(Can Have Hallucination)。它只是以不同的方式表現出來(Manifests in a Different Way),也就是機器人預期的結果(Expected Outcome)與現實世界中發生的事(What Happens in the Real World)出現偏差(Deviation)。現在,這是可以驗證的(Verifiable),就像程式碼生成中的幻覺(Code Generation Hallucinations)在無法編譯(Don’t Compile)時可以被驗證一樣,對吧?
Yeah, so what happens is we get way more data. Like you said, we flipped the pyramid. Isn't that what you'd say? We flip the pyramid. And now we get this robot data being way bigger than the internet, and we can solve a lot of the problems we have today. And all we need is a lot more GPUs; I think that's always the answer, right? That's why we're all here. I think you absolutely can have hallucination, right? It just manifests in a different way, which is a deviation between the robot's expected outcome and what happens in the real world. Now, it's verifiable in the same way that code-generation hallucinations are verifiable when they don't compile, right?
但這種幻覺表現在,比如機器人執行了一個不可行的軌跡(Infeasible Trajectory),或者生成了一些在現實中做不到的東西。我同意,只要你能實際行動並獲得回饋,這些幻覺是可以消除的(Can Go Away)。但如果你沒有互動的能力(Ability to Interact),幻覺就永遠不會消失(Never Go Away)。舉個例子,如果我無法驗證(Cannot Verify)某件事,比如「你住在這個地方嗎(Do You Live at This Location)?」,我就無法糾正我的幻覺(Correct My Hallucination)。但在機器人技術中,大多數情況下你都能通過互動來糾正(Correct Because of Interaction)。
But it manifests in, you know, robots doing a trajectory that's infeasible, or generating something that can't actually be executed, and I agree that can go away, since you get to act. But if you do not have the ability to interact, it can never go away. Like, for instance, if I cannot verify something, say, "do you live at this location?", I can never correct my hallucination. But in robotics you can mostly correct it, because of interaction.
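As a toy illustration of hallucination "manifesting as a deviation between the expected outcome and what happens in the real world", and of why that is checkable much like code that fails to compile, the sketch below simply compares a model's predicted next state against the observed one. The tolerance and the example vectors are arbitrary assumptions.

```python
import numpy as np

def flag_physical_hallucination(predicted_next_state, observed_next_state, tolerance=0.05):
    """Return (is_hallucination, error): the prediction 'failed to compile'
    against reality if it deviates from the observation by more than the
    (illustrative) tolerance."""
    error = float(np.linalg.norm(np.asarray(predicted_next_state, dtype=float)
                                 - np.asarray(observed_next_state, dtype=float)))
    return error > tolerance, error

# Example: the model expected the object to stay on the table, but it dropped.
is_bad, err = flag_physical_hallucination([0.30, 0.80], [0.30, 0.05])
```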
我有一個很實用的例子(Practical Example),其實是我們去年做的。當時我們在辦公室裡遇到一個問題:沒有人會放下馬桶蓋(Toilet Seat)。我們用了一個以前的機器人 Eve,裝有輪子(Some Wheels),但它還是很靈活(Very Mobile)。於是我們讓它自主進去檢查(Autonomous to Go In),看看馬桶蓋是掀開還是放下(Up or Down)。我們對此跑了5040次測試(Ran 5040 on This),結果是50%掀開,50%放下。就像完全不知道(No Idea),是隨機的(Random),對吧?它分辨不出馬桶蓋是掀開還是放下。這算是個邊緣案例(Edge Case),因為它通常很擅長處理這些事(Pretty Good at These Things)。但我們讓機器人去關馬桶蓋(Close the Toilet Seat),這是一個自主策略(Autonomous Policy)。它會在浴室裡巡查(Check the Bathroom),如果馬桶蓋掀開了(Toilet Seat Goes Up),就把它放下(Put the Toilet Seat Down)。這真的很有趣(Really Fun),我們最後還為此笑了好久(Had a Lot of Fun)。這其實就是在現實世界中閉環(Closing the Loop in the Real World)。現在,模型能得到反饋(Feedback):我看到馬桶蓋放下了(Seat Is Down),我知道它現在是關著的(I Closed It);你告訴我它是掀開的(You Told Me It Was Up),你錯了(You Were Wrong)。這種閉環在其他地方也在用 AI 互動(Interact With),比如 API 或編譯器(Compilers)。你讓它生成一些結果(Emits Some Result),然後通過驗證階段(Verification Phase),把結果反饋到系統的背景中(Feedback Into the Context)。只是,在這個案例中,閉環的速度有點慢(Loop Closure Is a Little Bit Slower),因為它得通過物理過程(Going Through This)。現在的問題是,我們不知道如何在一般情況下做到這一點(General Case)。我們可以設計一個特定的東西,比如馬桶蓋(Architect One Specific Thing)。但現在的問題是,你如何為這個問題提出一個通用的公式(Formulation),把世界上的一切都接地(Grounding Everything in the World)?目前還沒人知道怎麼做到這一點(No One Knows How to Do That Yet)。
I have a very good practical example, actually, because we did this last year, when we had the problem of no one putting down the toilet seat in the office. We had one of our previous robots, Eve; it's on wheels, but it's still very mobile. So we had it autonomously go in and see if the toilet seat was up or down. And we ran 5,040 trials on this, right? And it was 50% up or down; it had no idea, it was random, right? It couldn't tell if the seat was up or down. It's kind of an edge case, because it's usually pretty good at these things. But we have the robot go and close the toilet seat, as an autonomous policy. So it will go around and check the bathroom and put the toilet seat down if the toilet seat is up. That was really fun, and we had a lot of laughs about it afterwards. And what that is, is actually closing the loop in the real world, right? So now the model can get the feedback: the seat is down, I know it's down because I closed it, and you told me it was up, so you were wrong. And we're closing the loop in other places where we're using AI to interact with, for example, APIs or compilers or things like that, where it emits some result and you put it through a verification phase that you can feed back into the context of the system. It's just that, in this case, the loop closure is a little bit slower because it's going through the physical world. Yeah, the problem right now is that we don't know how to do this in the general case, right? We can architect one specific thing, like the toilet seat. The question now is, how do you come up with some formulation of this problem where you're grounding everything in the world? And no one knows how to do that yet.
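A rough sketch of the check-act-verify loop described above, showing where the real-world feedback enters: the robot predicts, acts, re-observes, and keeps the verified outcome as a label it earned by interacting. `robot.capture_image`, `robot.close_lid`, and `classifier` are hypothetical stand-ins, not 1X's actual interface.

```python
def patrol_and_close(robot, classifier, max_attempts=3):
    """Observe, act, then re-observe to verify; return (image, prediction, outcome)
    tuples that can be fed back into training, closing the loop in the real world."""
    feedback = []
    image = robot.capture_image()
    prediction = classifier(image)                    # e.g. "seat_up" or "seat_down"
    if prediction == "seat_up":
        for _ in range(max_attempts):
            robot.close_lid()
            outcome = classifier(robot.capture_image())
            feedback.append((image, prediction, outcome))
            if outcome == "seat_down":
                break                                 # verified against reality, not against itself
    return feedback
```

The same pattern applies when the "actuator" is an API call or a compiler; the only difference here is that the loop closes through the physical world, so it is slower.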
現實世界中的學習速度(Rate of Learning in the Real World)會非常慢(Painfully Slow),對吧?我們能在現實世界中學習這些東西,因為掉東西有後果(Consequences to Dropping Something)。重力會讓它落下(Gravity Makes It Fall),你能看出壞事發生了(Something Bad Happened),對吧?但用物理機器人(Physical Robot)探索的速度(Rate We Can Explore),這又回到了數據的混合(Blend of Data),對吧?你可以做這些很令人興奮的小事(Exciting Small Things),但在你擁有足夠的數據(Enough Data)之前,你得做多少千次、百萬次這樣的事?我認為真正的問題還是:我們負擔得起嗎(Can We Afford)?
The rate of learning in the real world is going to be painfully slow, right? We can learn these things in the real world because there are consequences to dropping something: gravity makes it fall, and you can tell something bad happened, right? But the rate at which we can explore with a physical robot, I mean, that's back to the blend of data, right? You can do these really exciting small things, but how many thousands or millions of those things do you need to do before you have enough data? And so I think the question is really still: can we afford…
我知道我們時間越來越近了(Getting Closer to Time),我真的很想帶著這些問題結束。我對此很好奇(Very Curious)。在接下來的2到5年(Next 2 to 5 Years),你們覺得這會朝哪個方向發展(Where Do You See This Headed)?我故意把這個問題留得很模糊(Vague Question),你們可以按自己的方式回答(Answered How You Go)。
I know where we’re getting closer time and I really wanted to end with these questions. I’m very curious about this. In the next 2 to 5 years—where do you see this headed? And I’m going to leave that as a vague question—answered how you go.
我希望你先開始,伯恩特(Bernt)。2到5年,這是一個相當大的範圍(Pretty Big Range),考慮到目前這個領域的發展速度(Current Velocity of the Field)。如果要我說,我要作弊一下(Cheat),先說我認為這需要10年才能完全實現(Fully Pans Out)。說10年後會是什麼樣子(Where Will This Be in 10 Years)很容易。
I’d love for you to start, Bernt—so 2 to 5 years, that’s a pretty big range. Given the current velocity of the field. Okay, what would tell me—like, say, I’m going to cheat and I’m going to say—start by saying, I think it’s going to take 10 years before this fully pans out. And it’s very easy to say, where will this be in 10 years?
我認為到那時,我們的社會將會發生像幾百年前電力帶來的那種改變(Same Kind of Change)。我們現在覺得早上按下電燈開關是理所當然(Take It for Granted),這種改變將跨越數位與實體世界(Across Digital and Physical)。活在這個時代真是件有趣的事(Interesting Time to Be Alive)。我認為我們能真正專注於什麼讓我們成為人類(What Makes Us Human)。在我們創造的這個社會中,5年後,我希望能做到這一點。我覺得這很雄心勃勃(Ambitious),我們會努力推動(Push for It)。現在沒有人知道確切答案(No One Knows)。我認為這真的取決於社會採用機器人的速度(How Fast Society Adopts Robots),以及我們能多快擴大製造規模(Scale Manufacturing)。什麼樣的產品對客戶來說是有用的(Customer Is Useful),對吧?以我們的产品為例,它目前在家庭中是有用的(Useful in a Home)。它不完美(Not Perfect),不是說你什麼都不用自己做,但它很有用,也很有趣(It’s Fun)。然後你可以從這裡開始加速(Start Accelerating)。希望它不像自動駕駛汽車(Autonomous Car)那樣,還要多花10年。我不覺得會這樣,但3到5年內,我認為它幾乎會普及到大多數人中(Pretty Much Out There Amongst Most People)。即使不是每個人都擁有機器人(Everyone Has a Robot),人們也會認識有機器人的人(Know Someone Who Has a Robot),它們會普遍成為社會的一部分(Generally Part of Society),從消費者和家庭(Consumers and Homes),到工廠(Factories)、物流(Logistics)等各個領域。你可以接著說。那些常說的人……
I think then we’re going to have the same kind of change in society that we had a few hundred years ago with electricity where we now just take it for granted when you flip the light switch in the morning—that’s going to happen to neighbor across digital and physical—and man, it’s an interesting time to be alive. And I think we can really get to focus on what makes us human. In that society we’re creating—5 years—I hope for there—I think that’s ambitious. We’re going to push for it. I don’t think anyone knows at this. I think it really depends on how fast society adopts robots, and how fast we can scale manufacturing—what’s kind of like the customer is useful, right? So I would say the product we have, just as an example, is currently useful in a home. It’s not perfect—it’s not like you don’t need to do anything yourself—but it’s useful and it’s fun. And then you can kind of start accelerating from there. And hopefully it’s not an autonomous car—and it doesn’t take a decade longer now, I think—but I do think like 3 to 5 years—it is pretty much out there amongst most people. And even if not everyone has a robot, people know someone who has a robot and they’re generally part of society across everything from consumers and homes, into factories, logistics, everything else. You can go next—so these are saying that people often…
人們常常高估短期進展(Overestimate the Progress in the Short Term),卻低估長期進展(Underestimate the Progress in the Long Run)。我覺得這可能是比爾·蓋茨(Bill Gates)或其他人的話。我不能保證(Cannot Avail),這是個免責聲明(Disclaimer)。但我認為,機器人 AI(Robotics AI)有一個獨特之處(Unique About Robotics AI),讓它不同於語言模型(LLMs)或視覺語言模型(VLMs)。語言模型必須幾乎完美解決問題(Solve the Problem Almost to Completion)才能真正有用(Really Useful),比如寫作(General Writing)之類的應用。它得非常非常好(Really, Really Good)。我們之前也有不錯的系統,但直到性能達到很高水準(Really High in Performance),它們才有用。但對機器人技術來說,這並不完全正確(Not Quite True)。因為我們不需要完全解決機器人問題(Solve Robotics Fully),機器人就能變得有用。這只是想說,即使今天,已經有數十萬、數百萬台機器人被部署出去(Deployed Already Out There)。我們現在很多東西都是機器人製造的(Made by Robots),對吧?它們已經存在(Already Out There)。所以這裡的關鍵是什麼?機器人技術的關鍵在於任務決策(Decision of Tasks)。一個能解決所有任務的機器人(Solves All Tasks Everywhere)可能還很遙遠(Farther)。我不會對此做任何預測。但我們會開始看到一些機器人,能處理少數任務(Few Tasks)、特定任務(World Tasks),甚至是專業任務(Specialized Tasks),而且它們已經非常有用(Super Useful)。因為有些任務很難找到勞動力(Difficult to Define Labor)或聘請人力(Hire)。我今天跟一家公司聊過,他們因為勞動力短缺(Shortage of Labor),正在召回退休人員(Underlying People from Retirement)。專業機器人(Specialized Robots)會更快出現(Come Much Quicker),而通用的(Generalist)則會更晚。但它們從第一天起就開始發揮作用(Useful Lesson Starts from Day One)。
Overestimate the progress in the short term, but underestimate the progress in the long run. I think that's probably from Bill Gates or someone; I can't verify it, so that's a disclaimer. But I think one thing that's unique about robotics AI, and that makes it different from LLMs or VLMs, is that an LLM has to solve the problem almost to completion to be really useful, whether it's general writing or anything else; it has to be really, really good. We had decent systems earlier, but until you reach really high performance, they're not useful. That's not quite true for robotics AI, because we don't have to solve robotics fully for robots to be useful. Even today there are hundreds of thousands, even millions, of robots already deployed out there; many of the things we use are made by robots today, right? They are already out there. So what is the key part here? The key thing in robotics is the choice of tasks. The robot that solves all tasks everywhere may be farther out, and I will not make any prediction for that. But we will start seeing robots that handle a few tasks, or narrow, specialized tasks, and even those are super useful, because there are several tasks for which it's very difficult to find labor or to hire. I was talking to a company today that is bringing people back from retirement because there is a shortage of labor for exactly this kind of work. Specialized robots will come much quicker and the generalist will come later, but the usefulness starts from day one in robotics.
不像語言(Unlike Language),這是真的。如果自動駕駛汽車(Autonomous Cars)不危險(Weren't Dangerous),這個問題早在2015年就解決了(Solved in 2015)。你可以坐進一輛2015年就能載你四處跑的車(Drove You Around),它當時表現得還不錯(Doing Pretty Well)。
Unlike language, that's so true. And if autonomous cars weren't dangerous, that problem would have been solved in 2015. You could get into a car in 2015 that drove you around, and it was doing pretty well.
從某種方式開始(Start from One Way),這不是人類的結果(Not the Human Result)。是的……
Start from—in one way and it’s not the human result—yeah…
是的,我認為這是挑戰的一部分。採用(Adoption)不僅僅是技術問題(Technological Problem)。還有安全性(Safety)、社會接受度(Societal Adoption)等因素也在影響。所以在3到5年內,我們可能會看到某些領域的機器人數量遠超預期(A Lot More Robots in Certain Areas),而某些領域的則遠低於預期(A Lot Fewer Robots in Certain Areas)。但我認為重要的是,我們正真切地看到機器人發展的高峰(Culmination of Robots)。它們從歷史上單一用途(Very Single Purpose)的工具,轉變到如今人們幾乎期待它們能多用途(Multi-Purpose)的概念——也許不是通才(General Purpose),但至少是多用途,這正成為人們的期望(People’s Expectation)。我們通過這些新的基於 AI 的平台(AI-Based Platforms)展示的是:嘿,一件硬體(Piece of Hardware)可以有效做到不止一件事(Do More Than One Thing Effectively)。我認為這一點將在未來3到5年持續下去(Hold Over the Next 3 to 5 Years)。這種期望成為了新的基準線(New Water Line),是大家努力的方向。你們現在都看到了這一點,也相信這一點(Buying Into It),這很好,對吧?因為你們會把這種期望帶入社會文化(Social Culturally Forward)。比如說:「嘿,為什麼我不能有一個在家裡能做三四件事的機器人?或者在我的情況下,在倉庫(Warehouse)或物流設施(Logistics Facility)裡能做五到十件事?應該是這樣的(That Should Be How It Is)。」我認為,真正推動這一切的是人們對這些東西的需求(People Wanting Those Things),這會驅動投資(Investments)和專注(Focus),讓我們實現這些目標。
Yeah, I think that’s part of the challenge is that adoption is not just a technological problem. It’s also, you know, things like safety, things like societal adoption, also play a factor. And so in 3 to 5 years, what we might see is that there’s a lot more robots in certain areas and a lot fewer robots in certain areas than we expected. And I think the important thing, though, is we are really seeing the culmination of robots really going from being historically very single purpose to this notion that it’s almost expected that they could be multi-purpose—maybe not general purpose, but multi-purpose—that’s becoming kind of people’s expectation. And what we’re able to show with these new AI-based platforms is that hey—a piece of hardware can do more than one thing effectively. And I think that’s the piece that will hold over the next 3 to 5 years—is this expectation is the new water line that people are building towards. And you know, all of you are now seeing this and buying into it, which is great, right? Because now you’re gonna be carrying that expectation social culturally forward—and saying like, you know, hey, why can’t I have a robot that does three or four things in a home—or in my case, you know, five or ten things in a warehouse or logistics facility—like that should be how it is. And I think that’s what really drives it—is people wanting those things really drives investments and focus into getting us those things.
當人們問這個問題時,他們很幸運,我們會給出很具體的答案(Really Concrete Specific),比如「我在這個階段會有一個機器人(Robot at This Stage),對吧?它會做所有這些事(Do All These Things)。」但我認為真正的問題是,對每個人的期望沒有統一標準(No Level Setting on What Expectations Are),對吧?所以我通常會問:「我們什麼時候會有人形機器人(Humanoid Robot),它的價值能像我們的車一樣(Valuable to Us as Our Car)?」我不知道(I Have No Idea)。我們的車每天都能用(Works Every Day),在最極端的天气下(Most Extreme Weathers),考慮到投入的材料和精力(Amount of Material and Effort),它的成本幾乎不算什麼(Costs Hardly Anything)。但即使是車本身,也無法完全比擬人形機器人可能為我們生活增加的價值(Add to the Value of Our Lives)。所以我認為,我也是站在10年或更長的立場(10 Years or Longer)。這是典型的技術專家回答(Technologist Answer)。如果你問創始人(Founder),他們會說明年(Next Year);如果你問技術專家,他們會說大約10年(About 10 Years)。10年只是意味著,我們很難具體量化(Hard for Us to Quantify)你會得到什麼。我認為我們應該關注的是進展速度(Rate of Progress)和關鍵領域(Beachheads),對吧?這裡的每個團隊(Groups Up Here)都在不同領域建立有意義的立足點(Meaningful Beachhead)。隨著時間推移,這些領域會成長(Grow),這個空間會從一堆分散的點(Bunch of Dots)擴展。比如,Agility 在倉庫中解決問題(Solving Problems in a Warehouse),我們有機器人在人們家中(Robots in People’s Homes),我們也會在汽車工廠(Automotive Factories)中工作。我認為,從每個立足點(Beachheads),你會看到成長(Growth)。這不會是一夜之間的事(Not an Overnight Thing)。我不認為這裡的任何人能準確預測5年後的情況(Predict 5 Years Into the Future)。但我認為我們會看到這種成長,很快這些領域會開始重疊(Start Overlapping)。有一天,我們會有自動駕駛汽車(Autonomous Car)。回顧這個市場的歷史(History of That Market),很多人嘲笑(Shade Thrown Out)那些預測我們何時會有自動駕駛汽車的錯誤判斷。我認為這很大程度上來自社群中某些人(Elements in the Community)過於樂觀的聲明(Statements),說這會很快實現(How Quickly It Would Come)。但我很感激我的車有自動車道輔助(Autonomous Lane Assist),不會撞上前車(Doesn’t Hit the Car in Front of Me),還能防止我倒車撞到東西(Prevents Me from Backing Into Something)。這些神奇的功能(Magical Stuff)都源自自動駕駛的夢想(Dream of Having an Autonomous Car)。順便說一句,你現在就能叫到自動駕駛計程車(Autonomous Taxi),雖然花了更長時間(Took a Little Longer)。人形機器人也是如此。我認為,只要這個社群保持興奮(Community Is Excited)、積極投入(Leans In),並意識到這是一場長期遊戲(Long Game),我們就能在商業環境中(Commercial Setting)打造出有價值的專業機器人(Specialist Robots)。我認為未來1到2年內我們就能做到(Next 1 or 2 Years)。Agility 已經開始將機器人應用到這個領域(Delivering Robots Into the Space)。當這些機器人能完成10、15、20項任務時(Doing 10, 15, 20 Tasks),那會是未來5年的目標(Next 5-Year Train)。但要解決我們想像中所有行業的所有問題(Solve All of the Problems Across All of the Industries),我認為我們需要繼續夢想(Keep Dreaming),繼續努力(Keep Working)。這個行業得持續投入精力(Keep Putting Energy In),可能還要幾十年(Many More Decades),才能解決所有邊緣案例(Edge Cases)。
And when people ask this question, what they're looking for is something really concrete and specific, like "I'm going to have a robot at this stage, right? And it's going to do all these things." I think the real problem with this is that there's no level setting on what the expectations are for everybody, right? So the question I usually ask is: when will we have a humanoid robot that is as valuable to us as our car? I have no idea, right? Our car works every day, in the most extreme weather, and it costs hardly anything given the amount of material and effort that goes into it. And even the car by itself isn't quite touching what a humanoid robot might add to the value of our lives. So I'm also in the 10-years-or-longer camp, and I think that's the typical technologist answer: if you ask a founder, they're going to say next year; if you ask a technologist, they're going to say about 10 years. Ten years just means it's really hard for us to quantify specifically what you're going to have. I think the thing we should be focusing on is the rate of progress and where the beachheads are, right? Each one of the groups up here is establishing a meaningful beachhead in a different area, and over time those things are going to grow, and that space is going to go from a bunch of dots, where, you know, Agility is solving problems in a warehouse, we have robots in people's homes, and we're going to be working in automotive factories. And from each one of these beachheads you're going to see growth, right? It's not going to be an overnight thing; I don't think anybody here can predict 5 years into the future and say exactly where we're going to be. But I think we're going to see this growth, and pretty soon all those things are going to start overlapping, and one day we're going to have an autonomous car. When you look back at the history of that market, you know, there's a lot of shade thrown at how badly people predicted when we would have an autonomous car, and I think a lot of that came from statements that some elements in the community were making about how quickly it would come. But I'm pretty grateful that my car does autonomous lane assist, that it doesn't hit the car in front of me, and that it prevents me from backing into something; all that magical stuff came out of the dream of having an autonomous car. And by the way, you can get an autonomous taxi right now, so yeah, it took a little longer, and so will humanoid robots. I think as long as the community is excited, leans in, and realizes this is a long game, we're going to get to specialist robots that are delivering value in a commercial setting; I think we're going to have that in the next 1 or 2 years, and Agility is already delivering robots into this space. When we get those robots doing 10, 15, 20 tasks, that's going to be in the next 5-year range. But to solve all of the problems we imagine across all of the industries, I think we need to keep dreaming and keep working; this industry is going to have to keep putting energy in for many more decades, I think, until we've solved all those edge cases.
我真的很喜歡聽到人們說,人們往往只關注短期的事情,而這有時候會讓我感到不安。所以我想把事情分成短期和長期來談談。我認為在接下來的2到5年內,從技術角度來看,我們將能夠徹底研究具身縮放定律(Embodied Scaling Law)。我覺得大型語言模型(Large Language Models)中最關鍵的時刻,就是最初的欽奇拉縮放定律(Chinchilla Scaling Law)。基本上,這是一個指數曲線:你投入更多的運算資源(Compute),擴大數據量(Data),增加參數數量(Parameters),然後我們就能看到智能以指數級的速度增長。我認為,目前機器人領域還沒有類似的定律,因為機器人的縮放定律實在太複雜了,對吧?你還是可以進行跨模態(Cross-Modal)的建模,可以跨越硬體設備群(Hardware Fleet)來提升能力,而不是單純依賴真實的機器人數據(Real Robot Data)。那模擬數據(Simulation Data)怎麼樣呢?網際網路數據(Internet Data)又如何呢?這些都是值得思考的方向。
I really like hearing people say that, you know, people tend to focus on the short term, and that sometimes makes me feel uneasy. So let me break it down into the short term and long term. I think in the next 2 to 5 years, from a technical perspective, we will be able to fully study the embodied scaling law. I think the biggest moment in large language models is the original Chinchilla Scaling Law, basically that exponential curve where you’re putting in more compute, you scale the amount of data, you scale the number of parameters, and we’re just seeing intelligence growing exponentially. I don’t think we have anything like that for robots yet because the scaling law is so complicated for robotics, right? You can still do cross-modal modeling. You can scale across the hardware fleet, rather than relying solely on real robot data. And how about the simulation data scheme? How about the internet data? These are all worth considering.
那神經模擬(Neural Simulation)呢?就像神經網路的夢境(Neural Dreams)一樣,對吧?就像縮放定律(Scaling Law)一樣,當你生成大量的視頻內容時也是如此。所以我們將能夠研究所有這些東西,也許從現在起五年,或者更短的時間內,我們就能在螢幕上看到這樣的圖表:你清楚知道購買多少GPU(Graphics Processing Units,圖形處理單元),你的機器人就能提升多少性能。所以我們很快就能回答這個問題。短期內,品質將會大幅提升。
How about neural simulation? The neural dreams, right? Like the scaling law, as you are generating lots and lots of videos. So we will be able to study all of these things so that perhaps, you know, five years from now or sooner, we’ll have that plot on the screen where you know exactly how many GPUs you buy and how much better your robot will be. So we can answer that question. Quality will improve very soon in the short run.
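For reference, the Chinchilla-style parametric fit from the LLM world that this alludes to is usually written as

$$ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} $$

where N is the number of parameters, D the number of training tokens, E the irreducible loss, and A, B, α, β fitted constants. The open question raised here is what the embodied analogue looks like when D is a mixture of real-robot, simulated, neurally generated, and web data; as the speakers note, no such curve has been established for robotics yet.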
現在,讓我們來談談20年後會發生什麼。每次我在實驗室熬夜時,機器人總會出一些奇怪的狀況,我就會覺得特別挫折。我會想:讓我思考一下20年後會發生什麼,然後我就能繼續堅持下去,對吧?所以,展望20年後,有幾件事讓我非常興奮,而且我覺得這些並不是那麼遙遠。其中之一是機器人技術加速科學進展(Robotics Accelerating Science)。我在生物醫學(BioMed)領域有些朋友,他們做一個實驗真的非常耗時又費力。就像那些博士生需要一直待在實驗室裡,照顧那些老鼠,或者處理一盤盤的細胞樣本(Cell Dishes)。
Now, let’s talk about what’s going to happen in 20 years. Every time I stay up late at the lab, the robots break or do something weird, and I’m like, oh, so frustrated. I think: let me consider what’s going to happen in 20 years, and then I can carry on, right? So 20 years from now, there are a couple of things that I’m super excited about. I think it’s not that far away. One is robotics accelerating science. So I have some friends in BioMed, and it’s just so time-consuming to do one experiment and so laborious. Like all those PhD students need to be in the lab, right? Attending to those mice, or all those dishes of cells.
那如果我們把這些全都自動化(Automate)呢?
How about we automate all of that?
對吧?把科學實驗自動化(Automate),可能意味著醫學研究(Medical Research)不再需要花費10億美元。這些研究將能夠擴大規模,因為我們等於有了一個用智能(Intelligence)加速物理世界的API。我希望這會成為一個很棒的未來版本,或者至少是某種有潛力的東西,對吧?這是我非常興奮的一件事。另一件讓我興奮的事是機器人技術(Robotics)改進機器人自身的發展。
Right? Automating science means that maybe all the medical research will not cost a billion dollars to do. It can be scaled up, because we get this API to accelerate the physical world using intelligence. That will be a good version of the future, or something, I hope. Right? So that is one thing I'm super excited about. And the other thing is robotics improving robotics itself.
為什麼我們不讓機器人互相修復呢?我們看到那些大型工廠在製造機器人,但如果讓機器人自己來組裝(Assembly),打造下一代機器人(Next Generation of Robots)會怎麼樣?我完全不覺得這是科幻小說,因為在LLM社群中,他們其實已經走在了我們前面。雖然有點遺憾,但在LLM社群中,人們正在研究自動機器學習(AutoML)。也就是說,我們能不能透過提示(Prompt)讓我們的系統自己去研究,找到下一個最佳的變換器(Transformer),或是為智能本身設計出更好的架構(Architecture)。就在我們說話的此刻,這一切正在積極進行。他們可能會先解決這個問題,然後我們就抄他們的作業,把這些應用到物理世界中,實現遞歸自我改進(Recursive Self-Improvement)。我認為這一定會發生。不是100年後,而是在20年內,這絕對會實現。
Right, so why don’t we have robots fixing each other? We see all of those big factories making the robots, but how about the robots themselves assembling the next generation of robots? And I don’t think this is science fiction at all because actually in the LOM community, again, they are ahead of us. Unfortunately, but in the OM community, people are studying AutoML. Meaning that, can we prompt our systems to do research right to find the next best transformer, to find the next best architecture for intelligence itself? They are actively doing this as we speak and probably will solve that first, and then we’re going to copy their homework and we’ll have the physical world doing this. Recursive self-improvement. As we go, and I think that’s going to happen. Not in 100 years, only in 20 years, that’s definitely going to happen.
我要以一個樂觀的結論來結束。我認為我們這一代人,出生得太晚,無法探索地球;出生得太早,無法前往其他星系;但我們的出生時機剛剛好,能夠解決機器人技術(Robotics)的問題。未來,所有移動的東西都將實現自主化(Autonomous)。
So I’m going to end on a bright note. I think we as a generation, all of us, were born too late to explore the Earth, were born too early to travel to other galaxies, were born just in time to solve robotics, and everything that moves will be autonomous.
我覺得這是最好的結束語。非常感謝所有小組成員參加這場討論,分享你們的想法,不僅是關於我們現在的處境,更是關於未來的方向。在大家離開之前,我想說明一下,我們不會進行傳統的問答環節(Q&A)。不過,我們會先下台拿掉麥克風(Microphones),然後回到這裡。如果有興趣的人,歡迎上台直接向小組成員提問。所以我們現在要下台拆掉麥克風,之後會回來回答問題。請上台來提問吧!
Um, I mean, I think that's the best note to end on. Thank you all so much to our panelists for coming and sharing your thoughts, not only on where we are now, but on where we are headed. Just to let you know before everyone heads out: we aren't doing a traditional Q&A, but we are going to remove our mics and come back here, and anyone who's interested is welcome to come up to the stage and ask questions directly to the panelists. So we're just going to go offstage to get our mics removed, and we'll be back for any questions. Come on up to the stage.