通用機器人的新時代:人形機器人的崛起 — 歐洲、中東和非洲地區問答 [S72543a]
https://register.nvidia.com/flow/nvidia/gtcs25/vap/page/vsessioncatalog/session/1739944762346001kHVL
AI逐字稿
https://hackmd.io/@tsaicm/Sk6qcOy6kx
大家好!哇,這群觀眾看起來真是太棒了!正如Madison剛才提到的,我的名字是Tiffany Johnson,我將擔任今天的主持人。簡單介紹一下我自己,我是Tip and Tech的創辦人。我不知道大家是不是和我一樣,早就開始倒數計時,期待這場座談會。這個領域最近取得了許多重大進展,能夠與這個領域的一些領導者坐在一起,聽他們分享,真是令人難以置信。這不僅是為了了解我們現在的處境,更是為了展望未來的方向。現在,讓我們從一輪自我介紹開始吧,我先從Bernt開始。
Hello! Wow, this crowd looks amazing! As Madison just said, my name is Tiffany Johnson, and I’m going to be your moderator today. A little bit about me: I am the founder of Tip and Tech. I don’t know about you, but I’ve been counting down the days for this panel. This field has seen a lot of advancements very recently, and to be able to sit down with some of the leaders in this space and hear from them is truly incredible—not only to learn where we are now, but where we are headed. Let’s start with a round of introductions. I’ll start with Bernt.
好的。我的名字是Bernt Børnich,我是One X的創辦人兼執行長(CEO)。我們的使命是通過這些安全的智能人形機器人(Androids)來創造充裕的勞動力。我們深信,要真正實現智能,這些機器人需要與我們共同生活並學習。因此,我們認為消費市場必須先行,才能真正拓展人類生活中所有的細微之處。然後,智能才能逐步應用到各個領域的有用勞動中,例如醫院、老人照護、零售、工廠和物流等。
Sure. My name is Bernt Børnich. I’m the founder and CEO of One X, and we’re on a mission to create an abundance of labor through these safe, intelligent androids. We really believe that to truly achieve intelligence, these robots need to live and learn among us. That’s why we think the consumer market has to come first to really be able to expand all the nuances of human life. Then, you can start applying intelligence to useful labor in all the verticals down the line—hospitals, elderly care, retail, factories, logistics, and so on.
是的。我是Skild AI的執行長(CEO)兼共同創辦人。我想分享我們正在做的事情:我們在為機器人打造一個通用的“大腦”。我們的理念是,機器人領域的數據本來就稀缺,因此我們應該盡可能利用來自任何平台、任何任務、任何場景的所有可用數據,打造一個共享的單一模型。把它想像成一個大規模的基礎模型(Foundation Model),可以用於任何機器人、任何硬體、任何任務和任何場景。
Yeah. I’m the CEO and co-founder of Skild AI. What we are doing is building a general brain for robotics. Our thesis is that we can have a single shared model because robotics is a field where data is already scarce. We might as well use everything that’s available from any platform, any task, any scenario. So, think of it like a large-scale foundation model you can use for any robot, any hardware, any task, any scenario.
我是Agility Robotics的技術長(CTO)。在Agility,我們的類人機器人Digit是專為工作而設計的,我們目前正將它應用於製造業和物流領域。我們認為,要讓這項技術走向世界並從中學習,最好的方法是獲得真實的客戶並進行實際部署,讓機器人真正投入工作。這正是我們一直專注的方向——讓機器人走進職場並發揮作用。
I’m the CTO at Agility Robotics. At Agility, our humanoid Digit is made for work, and we’re bringing it out to manufacturing and logistics use cases today. We feel that the best way to get the technology out there and learn from it is to get real customers and real deployments doing work. That’s what we’ve been focused on—getting a robot out there and into the workforce.
我是Boston Dynamics的技術長(CTO)。早在類人機器人變得流行之前,我們就開始研究它們了。我們的使命長期以來一直沒變,那就是讓機器人成為現實。我們已經出貨了幾千台機器人,而類人機器人是我們最新的產品發表。我們希望推出一個能真正執行有用工作的產品,特別是那些能讓人們免於骯髒、枯燥或危險工作的任務。這是我們長期以來努力的方向。雖然還有更多工作要做,但我們對此感到非常興奮。
I’m the CTO of Boston Dynamics. We’ve been working on humanoids before they were cool. Our mission has been the same for a long time: to make robots real. We’ve shipped a couple thousand robots, and the humanoid is the latest announcement for us. We’re really wanting to bring a product to market that can do real, useful work—work that keeps people away from dirty, dull, dangerous things. That’s what we’ve been up to for a long time. There’s more work to do yet, but we’re pretty excited about it.
嗨,大家好!我是Jim Fan。我是Nvidia的同事,也是Gerald的合作夥伴,同時負責GR00T專案。GR00T是我們打造基礎模型(Foundation Model)——也就是類人機器人(Humanoid Robots)的機器人大腦(Robot Brain)——的一項倡議。這也代表了我們對下一代計算平台(Next Generation Computing Platform)的策略,特別是針對物理人工智能(Physical AI)。我們還肩負著實現物理人工智能民主化的使命。事實上,就在昨天,我發表了一場主題演講。我們宣布將GR00T N1模型開源,這是全球首個開放的類人機器人基礎模型。它只有20億個參數(Parameters),但它的表現遠超預期。你基本上可以說,手掌中握著的是當今世界最先進的自主類人智能(Autonomous Humanoid Intelligence)。我想說的是,和在座的各位一樣,我在機器人學變得“性感”之前就開始研究它了。今天看到這裡座無虛席,我真的非常高興,這個領域終於變得熱門了。非常感謝大家,你們的到來讓我這一天特別開心。謝謝你們,我知道我們都很興奮!
Hi, everyone. I’m Jim Fan. I’m a colleague at Nvidia, working with Gerald, and also the project lead for GR00T. GR00T is an initiative where we’re building the foundation model—the robot brain—for humanoid robots. It also represents our strategy for the next generation computing platform, focused on physical AI. We’re also on a mission to democratize physical AI. In fact, just yesterday, I gave a keynote speech where we announced the open-sourcing of the GR00T N1 model, the world’s first open humanoid robot foundation model. It’s only 2 billion parameters, but it punches above its weight. You’re basically holding the world’s state-of-the-art autonomous humanoid intelligence in the palm of your hand. I’d also like to say that, like everyone else on this panel, I started working on robotics before it was sexy. Today, seeing a full house here, I’m really, really glad it’s become sexy now. So thank you all so much—you’ve made my day. Thank you all for being here. I know we’re all excited!
顯然,在我們這場座談會之前,我們進行了一次電話會議。在那次對話中,我不記得具體是誰說的,但有人提到,機器人學(Robotics)是人工智能(AI)最古老的應用。從歷史上看,它的發展速度最慢,但我想說,現在情況已經不同了。那麼,是什麼改變了呢?
Obviously, we had a call before our panel. In that conversation—I can’t recall exactly who it was—someone shared that robotics is the oldest application of AI. Historically, it’s moved the slowest, but I’d say that’s not the case anymore. So, what has changed?
我想我在那次電話會議中提出了一個問題。是的,我認為最大的變化是Jensen現在開始關注機器人學(Robotics)。Jensen有點石成金的本事,對吧?只要他把手放在某個領域,那個領域就會呈指數級增長,我們常開玩笑說這是“Jensen效應”。玩笑歸玩笑,我認為機器人學是最古老的領域之一,和人工智能(AI)本身一樣悠久。機器人學之所以這麼難,是因為它面臨著一個悖論(Paradox)。
I think I raised a question in that call. Yeah, so I think the biggest change is that Jensen is now paying attention to robotics. Jensen’s got a Midas touch, right? When he puts his finger on something, it’s going to scale exponentially—we jokingly call it the “Jensen effect.” Jokes aside, I think robotics is one of the oldest fields, as old as AI itself. The reason robotics is so hard is because of what I’d call a paradox.
那麼,什麼是這個悖論呢?對人類來說容易的事情,對機器來說卻非常困難,反之亦然。那些我們覺得極其困難的事情,比如創意寫作(Creative Writing),對機器來說其實可能並不難。這就是為什麼現在電腦視覺(Computer Vision)的問題解決得比機器人學好得多。我們目前正面臨這個悖論。那麼,什麼改變了呢?我認為有幾個方面。首先是在模型(Model)方面,因為有了像大型基礎模型(Large Foundation Models)這樣的突破,比如ChatGPT的出現。現在我們有了能夠推理(Reasoning)的模型,還有多模態模型(Multimodal Models),它們能理解運算(Computation),具備開放詞彙(Open Vocabulary),並且對三維視覺世界(3D Visual World)的理解遠比過去好得多。這些是解決機器人學問題的必要但不充分條件——就像你要先解決視覺問題,擁有一個非常好的視覺系統(Vision System),才能談論通用機器人(General Purpose Robot)。所以,我認為這些模型變得越來越強大,讓我們能更系統地應對機器人學的挑戰,這是第一點。第二點是數據(Data)方面的變化。你知道,不像其他人工智能領域,有人說互聯網是AI的化石燃料(Fossil Fuel)。但對機器人學家來說,我們甚至連這種“化石燃料”都沒有,至少對大型模型(Large Models)來說是這樣。你可以從維基百科下載文本、抓取文本,但我們從哪裡抓取運動控制(Motor Control)數據?從哪裡獲取機器人軌跡(Robot Trajectory)的數據呢?
So, what is this paradox? It’s that the things that are easy for humans are really hard for machines, and vice versa. The things we find extremely hard, like creative writing, may actually not be that hard for machines. That’s why computer vision is being solved much better than robotics these days. We’re facing this paradox right now. Now, what has changed? I’d say a couple of aspects. One is on the model side—because of large foundation models like ChatGPT’s moment, we now have models that can do reasoning. We have multimodal models that understand computation, with open vocabulary and a much, much better understanding of the 3D visual world than we had before. These are necessary but insufficient conditions to solve robotics. It’s like you need to solve vision and have a really good vision system before you can even talk about having a general-purpose robot. So, I think the rest of these models are becoming really, really good, allowing us to start tackling robotics more systematically. That’s number one. Number two, what has changed is on the data side. Unlike other AI fields, someone—I think it was a colleague earlier—said the internet is the fossil fuel for AI. Well, roboticists don’t even have that fossil fuel, at least not for large models. You can download text, scrape text from Wikipedia—but where do we scrape motor control? Where do we scrape all those robot trajectories?
從網路上嗎?你哪裡都找不到。所以我們必須生成數據(Generate Data),必須大規模收集數據(Collect Data)。我認為GPU加速模擬(GPU Accelerated Simulation)的出現確實讓這些問題變得更容易處理,因為現在你可以用大約3小時的計算時間(Compute Time),生成相當於10年的訓練數據(Training Data)。這真的幫助我們突破了數據悖論(Data Paradox)。這是第二點。第三點是硬體(Hardware)方面的變化。我們有一些最傑出的創辦人,打造了我們見過最好的機器人硬體(Robot Hardware)。我注意到硬體不僅變得越來越好,也越來越便宜。比如,今年我們看到的硬體價格大約在4萬美元左右,這相當於一輛車的價格(Price of a Car)。回想2001年,NASA打造了一個機器人——最早的主要類人機器人(Humanoid Robot)之一——當時的造價是150萬美元(以2001年的美元計算)。所以,現在硬體終於變得負擔得起(Affordable),而且我相信它很快就會成為主流(Mainstream)。
From the internet? You can’t find it anywhere. So we’ve got to generate data, we’ve got to collect data at scale. I think the advent of GPU-accelerated simulation really makes these problems more tractable because now you can generate, like, 10 years’ worth of training data in maybe 3 hours of compute time. So that really takes us beyond this data paradox. That’s number two. And number three is on the hardware side. We have some of the most extraordinary founders working on some of the best robot hardware we’ve ever seen. I’ve noticed that hardware has become a lot better and also a lot cheaper. Like this year, we’ve seen hardware in the range of maybe 40K—that’s the price of a car. Back in 2001, NASA built a robot—one of the first major humanoids—and it cost like $1.5 million in 2001 USD. So it’s finally becoming affordable, and it will become mainstream very quickly.
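As a quick sanity check on those figures, here is a minimal back-of-the-envelope sketch (not from the panel itself) of what "10 years of training data in about 3 hours" implies about simulation throughput; the per-environment speed-up factor is an illustrative assumption, not a quoted number.

```python
# Back-of-the-envelope check of the "10 years of experience in ~3 hours" claim.
# The 10-year and 3-hour figures come from the panel; the per-environment
# speed-up factor below is an illustrative assumption.

SECONDS_PER_YEAR = 365 * 24 * 3600
experience_needed_s = 10 * SECONDS_PER_YEAR      # simulated experience target
wall_clock_budget_s = 3 * 3600                   # compute budget from the talk

required_speedup = experience_needed_s / wall_clock_budget_s
print(f"overall speed-up vs. real time: {required_speedup:,.0f}x")  # ~29,200x

# Assume each GPU-simulated environment runs ~10x faster than real time;
# then the number of parallel environments needed is roughly:
per_env_speedup = 10
parallel_envs = required_speedup / per_env_speedup
print(f"parallel environments at {per_env_speedup}x real time: {parallel_envs:,.0f}")
```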
Aaron,我想聽聽你的看法。你在自我介紹時提到,你在機器人學(Robotics)變“酷”之前就開始研究它了。你認為現在發生了什麼變化?
Aaron, I’d love to hear from you on this. In your intro, you mentioned you were in robotics before it was cool. So what do you think has changed?
是的,我想剛才提到了很多東西,就像一列列火車一樣。所以讓我試著挑出一些重點來說。我確實認為,縮小模擬與現實之間的差距(Sim-to-Real Gap)是一件大事。長期以來,機器人學界(Robotics Community)一直在努力打造一個能正確模擬物理現象(Physics)的模擬環境(Simulation Environment),而且這個環境還要具備計算效率(Computational Efficiency)。我們可以創造非常複雜的模型(Model),很好地呈現物理世界(Physical World),但無法讓它們以實時(Real Time)或比實時更快的速度運行。所以對我來說,最大的變化可能是我們現在能以超越實時的速度模擬真實世界的物理特性,這讓我們能加速探索更多模擬(Simulations),並利用這些模擬來開發新的技術(Develop New Technologies)。此外,還有許多零組件實現了商品化(Commoditization)。我認為我們得大大感謝一些相關產業(Adjacent Industries)。比如消費電子(Consumer Electronics)發展出了電池(Batteries)和攝影機(Cameras),這些技術用於感知(Perception)、觀察世界,以及計算(Computing)。回顧10年、15年前,大多數機器人裡塞滿了電路板(PCBs)和電線(Wires),電池容量(Battery Capacity)卻非常小。但現在完全不同了,我們可以放入大量的計算能力(Compute),還能裝進微型感測器(Tiny Sensors),而且它們非常省電(Power Efficient)。所以我認為,零組件商品化(Commoditization of Components)是關鍵。我覺得這不只是關於低成本(Low Cost),雖然這是目前的一大焦點。但我認為,我們之所以迎來機器人創業時代(Era of the Robotic Startup),是因為全球供應鏈(Global Supply Chain)提供了許多重要零件,就像拼圖(Puzzle Pieces)一樣可以組裝起來。因此,機器人學界從過去每個零件都要自行設計的階段,提升到現在能將這些零件組合成拼圖,並在更高層次上運作。於是,我們看到一些公司開始在智能層面(Intelligence Level)上運作,開發應用程式(Applications),而不是把所有資金和精力都花在讓一台物理機器(Physical Machine)站起來上。
Yeah, I think that was a lot—a lot of trains. So let me try to pick apart some pieces there. I do think the closure of the sim-to-real gap is a big deal. For a long time, the robotics community struggled with creating a simulation environment that represented physics properly and was computationally efficient. We could create very complex models that did a great job of representing the physical world, but we couldn’t run them in real time or faster than real time. So for me, the biggest change has probably been the ability to represent the physics of the real world at greater-than-real-time speeds, which lets you accelerate how many simulations you can explore and how you can use those simulations to develop new technologies. Then, you know, there’s the commoditization of so many bits and pieces. I think we probably need to give a huge amount of credit to some adjacent industries. Consumer electronics have developed batteries and cameras—like the technology for perception, for seeing the world, for computing. When I look back even 10, 15 years, most robots were full of PCBs and wires and had a very small amount of battery capacity. That’s completely changed. We can put a huge amount of compute in, we can put tiny sensors in that are power-efficient. So I think the commoditization of components—not so much about low cost, though I know that’s a big focus right now—is key. The reason we’re seeing the era of the robotic startup is because there’s a global supply chain full of really important pieces you can put together like puzzle pieces. So we’ve elevated the robotics community from people trying to design every cog to people who can put those things together as a puzzle and operate at a higher level. Now we have companies operating at the intelligence level, developing applications, rather than spending the same amount of capital and energy just making a physical machine stand.
我可以補充一點嗎?我覺得Jim一開始說得很好,他很完整地描述了所有這些改變。我想補充的是,機器人學不只是人工智能(AI)的第一個應用而已——它本來就是人工智能最初的目標。
You know, can I add something? I think what Jim said initially—he put it very well about all the things that have changed. I want to add that robotics was not just the first application of AI—it’s what AI was originally about.
如果你去看圖靈(Turing)的原始文獻,當他談到人工智能時,主要就是在講機器人學(Robotics)。他認為應該創造的不是一個像成人一樣的機器,而是一個像孩子一樣的東西(Child-like Machine),讓它可以成長:你可以把這個機器人放進教室,讓它隨著時間、在被教導的過程中逐漸學習。他早在1950年代就提出了這個想法,對吧?語言(Language)、視覺(Vision)這些東西都很酷,但如果你觀察自然(Nature),它們在時間線上比物理動作(Physical Action)出現得晚得多。比如說,我們用來訓練的數據(Data),大多來自過去100年、200年,最多不超過1000年——我們並沒有創造超過1000年的數據,而人類存在的時間遠不止於此。所以,不是語言帶來了智能(Intelligence),而是基礎設施(Infrastructure)——我們的大腦——早已存在,而大腦是通過物理推理(Physical Reasoning)發展出來的。這就是為什麼影響這麼大:很多事情你不需要向一個人解釋,他就能理解,因為這種能力建立在更早的基礎之上。
If you look at Turing’s original papers, when he talked about AI, it was really about robotics. The idea was that you shouldn’t build it like an adult—you should make something like a child. Then it can grow: you can put that robot in a classroom and it will develop over time, especially by being taught. He proposed this back in the 1950s, right? Language, vision—all these things are cool—but if you look at nature, they come much later in the timeline compared to physical action. For instance, the data we’re training on might be from the last hundred years, 200 years, less than a thousand—we haven’t created more than a thousand years of data, and humans have been around far longer than that. So it’s not that language led to intelligence. The infrastructure—our brains—already existed, and it came about through physical reasoning. That’s why the impact is so big: there’s so much you don’t have to explain to a person for them to understand, because that ability is built on this earlier foundation.
大家都知道機器人學(Robotics)是什麼——你能感受到它,因為你每天都在做物理任務(Physical Tasks),而且每家公司都會受到機器人學的影響,這是最重要的一點。那麼,除了你們提到的那些因素外,還有什麼改變了呢?這些細節確實很難回答,但真正改變的是:機器人學的範式已經完全不同了。到目前為止,機器人學一直是控制學(Control)的領域。我會說,甚至直到三、四年前,控制學都還在主導機器人學的發展。但對於那些長期在這個領域的人來說,控制學並不是專為機器人設計的。控制學是在二戰期間,因為飛機(Flying Planes)、導彈(Missiles)這些應用而大放異彩。後來機器人學興起時,大家在想:我們該用什麼呢?於是就拿來了原本為其他用途設計的控制技術(Control Techniques),這套方法就這樣延續下來,一用就是70年。但這和孩子般的學習(Childlike Learning)不一樣。你不會先教孩子微積分(Calculus),再讓他們學走路或弄清楚關節動作(Joint Movements)。孩子是通過經驗學習(Learning by Experience)的。所以,從經驗中學習是現在發生的主要變化。我們正看到這種轉變——今天我就看了一系列影片,都是關於通過經驗學習的。所以我認為,從過去的「編寫程式」轉向「從經驗中學習」,是一個重大的轉變,也改變了我們思考機器人性能(Performance)的方式。
Everybody knows what robotics is—you can feel it, because you’re doing physical tasks every day, and every company gets impacted by robotics. That’s the main deal. So what has changed, apart from all the factors you mentioned? These are difficult details, yes. But what has changed is that the paradigm of robotics is completely different now. Robotics so far has been the field of controls. Controls were driving robotics until, I’d say, even three or four years ago. Those who’ve been in this area for a long time know that controls weren’t designed for robotics. Control had its real limelight during World War II, for things like flying planes and missiles. Then robotics came along, and people wondered, “What do we use?” Well, let’s use controls, which were built for that—and that stuck around. It stuck around for decades, for 70 years. But it’s not the same thing. It’s not childlike learning. You’re not teaching a child calculus first so they can learn how to walk and figure out joint movements. Instead, they learn by experience. So learning by experience is the main change that has happened now. We’re seeing this shift happening—really, a series of videos today showed learning by experience. So that, I’d say, is one major change: instead of programming behaviors, we’ve gone to learning by experience, which is a major shift in how we think about performance.
我想再深入探討一下這一點。對我來說,我不知道自己是否在這個領域待得夠久,能夠見證傳統控制學(Classical Controls)的時代。但我認為,改變的一大原因是互聯網(Internet)。如果你想想,這就像一個巨大的實驗,近30年來,全世界的人都在貢獻,創造出龐大的數據來源(Data Source)。我們可以用這些數據來訓練人工智能(AI),這完全是神奇的。現在我們要做的是,請你們在接下來的30年裡再做一次同樣的事——到處走動,當機器人。不,我們不會真的這麼做。
I want to actually double-click on that. So I think for me—I don’t know if I’ve been in the field long enough to see the classical controls era—but a large part of what happened was the internet. If you think about it, it’s an enormous human experiment of nearly 30 years, with everyone around the world contributing to creating this enormous source of data. So we can train AI, which is completely magical. And now what we’re going to do is ask all of you to do the same again in the next 30 years—just go around and be robots. No, we’re not going to do that. But—
我們已經有了這些數據(Data)。這正是推動人工智能(AI)向前發展的原因,儘管它最初是從機器人學(Robotics)開始的。對我來說,現在的重點在於,我們如何利用這些已有的數據。
We have that data. And therefore, that is what actually moved AI forward, even though it started in robotics. To me now, it’s all about how we use that data that exists.
利用這些數據來引導(Bootstrap),讓機器人能做一些有用的東西。因為在那之後,你可以開始在真實世界(Real World)中學習,而那是智能(Intelligence)真正的來源,對吧?
To bootstrap to where a robot can do something useful. Because at that point, you can start to learn in the real world, where intelligence really comes from, right?
但你必須先讓它達到某種程度的有用性(Usefulness)。比如說,當我說「去冰箱幫我拿一罐可樂」,如果機器人有一半的時間能做到這一點,那麼我們就找到了一條可行的路徑(Feasible Path)。因為接下來我們只需要說:「這次成功了,這次沒成功」,然後讓它重複運行足夠多次,它就會變得很擅長從冰箱拿東西。我認為,這正是我們現在看到的——隨著這些多模態元素(Multimodal Elements)的出現,你無法完全解決機器人學(Robotics)或智能(Intelligence)的問題,至少我覺得單靠這種方法不行。但你可以讓系統足夠有用,進而大規模創造一個高效的數據飛輪(Data Flywheel)。這個飛輪不需要你為機器人的每一個動作都去操作它。這很可能是一條通往——如果不是完全的人工智能(AI),至少是非常非常有用的機器(Very Useful Machines)的路徑。甚至可能實現通用人工智能(AGI),我們拭目以待,看看結果如何。
But you have to get to where it is, to a certain extent, useful. So when I say, “Go grab me a Coke from the fridge,” if the robot’s able to do so half the time, then we have a feasible path to getting there. Because now we just need to say, “Okay, that worked, that didn’t work,” and we need to run it enough times, and it’ll get really good at getting stuff from the fridge. I think that’s what we’re seeing now with the advent of all these multimodal elements—you can’t really solve robotics or intelligence in general, not even through this approach, I think. But you can get to where your system is useful enough that you can create a very efficient data flywheel at scale. That doesn’t require you to operate the robot for everything it does. And that is probably the path to—if not AGI—at least very, very useful machines. And maybe even AGI. We’ll see how it plays out.
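As an illustration of the data-flywheel loop just described ("that worked, that didn't work"), here is a minimal sketch; the function names (run_episode, label_success, finetune) are hypothetical placeholders for illustration, not any panelist's actual API.

```python
# Minimal sketch of the "data flywheel": run the robot on a task, label each
# episode as success/failure, keep the successes, and retrain. The loop
# compounds: a better policy collects more successful data per round.
from typing import List

def run_episode(policy, task: str) -> dict:
    """Execute one attempt at the task and return the logged trajectory."""
    raise NotImplementedError  # robot- or simulator-specific

def label_success(episode: dict) -> bool:
    """Human or automatic check: did the robot actually fetch the Coke?"""
    raise NotImplementedError

def finetune(policy, episodes: List[dict]):
    """Update the policy on the newly collected successful trajectories."""
    raise NotImplementedError

def data_flywheel(policy, task: str, rounds: int = 10, episodes_per_round: int = 100):
    for _ in range(rounds):
        successes = []
        for _ in range(episodes_per_round):
            ep = run_episode(policy, task)
            if label_success(ep):              # "that worked, that didn't work"
                successes.append(ep)
        policy = finetune(policy, successes)   # better policy -> more successes next round
    return policy
```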
好的,我們得回到你剛才說的最後一部分。但在此之前,我很好奇想聽聽你的看法。
Okay, we’ve got to circle back to that last part you just said there. But in the meantime, I’m curious to hear your take.
是的,我想呼應Aaron的一些看法。你知道為什麼機器人學(Robotics)好像又回來了?我一開始是從機器人學入行的,後來涉足了其他領域,然後又繞了回來。機器人學之所以具有挑戰性,有兩個原因。第一,硬體(Hardware)很難搞定;第二,世界是無結構的(Unstructured)。如果你看看人工智能(AI)和機器人學是如何發展的,會發現機器人學很大一部分都在解決「硬體很難」的問題。像是微型化感測器(Miniaturizing Sensors),比如MEMS技術,還有致動器驅動技術(Actuator Drive Technology)、能源儲存技術(Energy Storage Technology),這些問題都得先解決,甚至像某些平台——
Yeah, I think echoing some of Aaron’s comments here—you know why robotics is kind of coming back? I started with robotics, then progressed to all these other areas, and then sort of circled back. Well, robotics is challenging for two reasons. One, hardware is hard, and the world is unstructured. If you look at how AI has evolved, and how robotics has evolved, a huge portion of robotics has been dealing with the “hardware is hard” problem—miniaturizing sensors like MEMS, building on actuator drive technology, energy storage technology. All of that stuff had to be solved, even platforms like—
我們知道,這民主化(Democratized)了人們單純——
We know this democratized people’s ability to just—
讓東西在現實世界(Real World)中移動,並把這種能力帶給大家。所以在人工智能(AI)這一邊,他們不用每次都重新發明輪子(Reinventing the Wheel)。我們一直在努力,從解決結構化問題(Structured Problems)逐步走向越來越非結構化的問題(Unstructured Problems)。從處理查詢(Queries)和提示(Prompts),到簡化的世界模型(Simplified World Models),再到現在的非結構化世界模型(Unstructured World Models)。這個拼圖的每一塊都在提升人工智能平台(AI Platforms)的水平,找到新的數據攝取方式(Ingest Data),採用之前結構的最佳實踐(Best Practices),然後再進一步問:如果我們拿掉這些輔助輪(Training Wheels)會怎麼樣?現在,你看到的可能是來自自動駕駛汽車(Self-Driving Car)的影片,或者是我們的機器人攝影機(Robot Cameras)捕捉到的以自我為中心的影片(Egocentric Video),接下來這個世界會發生什麼?所以我認為,這背後有一種進展和解鎖(Unlock)的過程正在發生。我們現在看到的,是這一切累積到了一個臨界點(Tipping Point)。我們終於可以說:好,我們可以開始追求這個目標了。
Get things to move in the real world at all and bring that to people. So they weren’t reinventing the wheel every single time on the AI side. We’ve been basically working our way through, going from solving structured problems to increasingly unstructured problems—solving problems of queries and prompts, to APIs, to simplified world models, to now unstructured world models—where every piece of this puzzle has been leveling up the AI platforms, finding new ways to ingest data, taking the best practices of the previous structure and how that works, and then taking it to the next step of, “Okay, what if we remove some of these training wheels?” Now, you’re just looking at video coming from a self-driving car, or now you’re just looking at egocentric video being captured by our robots’ cameras, and what’s going to happen next in this world? So I think there’s a bit more of this progression and unlock that’s been happening behind the scenes. And we’re just seeing the culmination of that finally reach this tipping point. We’re now okay—we can go after this.
以非結構化的方式與世界互動(Interacting with the World)的完整問題。
Full problem of just interacting with the world in unstructured ways.
我想你剛才說的最後一點非常重要,談到了硬體(Hardware)方面的變化。也許過去幾年最大的進展之一,就是硬體的穩健性(Robustness)。現在我們有能力製造出在與現實世界互動時不會損壞的硬體。因為我們這些長期在機器人學(Robotics)領域工作的人都知道,實驗(Experiments)很花時間。當你需要重建機器人或修復零件時,不管什麼時候運行它,都得花不少時間。但現在,硬體發展到了一個階段,我們可以讓機器人在現實世界中學習,並安全地與世界互動(Safely Interact with the World),既不會損壞自己,也不會損壞環境。這也是實現更大進展的必要條件(Necessary Condition)。這個問題很難,花了很長時間才走到這一步。
I think the last thing you said there is so important—talking about what has happened in hardware. Maybe one of the biggest things happening in the last few years is the robustness of hardware and the ability to make hardware that doesn’t break when it interacts with the real world. All of us who have worked in robotics for a long time know that experiments take quite a long time when you need to rebuild the robot, or fix parts, every time you run it. But we’re now really at a point in hardware where we can have something learning in the real world and safely interacting with the world without damaging itself or the world. And that’s also a necessary condition for this kind of progress—it took a long time; it’s a really hard problem.
你知道,聽著你們大家的意見和你們對此的想法,確實帶來了一個有趣的問題。我認為你們都有非常令人興奮且獨特的策略和方法(Strategies and Approaches)。我真的很好奇,想聽聽你們對於一些事情的看法,比如人工智能(AI)的角色如何從專家模型(Specialist Models)轉向通才模型(Generalist Models),或者你們如何看待基礎模型(Foundation Models)激增這樣的現象。
You know, even listening to you all and your thoughts around this, it brings up an interesting question. I think you all have very exciting and unique strategies and approaches. I’m really curious to hear your take on your strategies and approaches for things such as the role of AI—moving from specialist to generalist models—or how you think through things such as the explosion of foundation models.
是的,我可以談談我們的GR00T策略。我們正在解決一個非常非常困難的問題:為各種不同的類人機器人(Humanoid Robots)打造一個通用大腦(General Purpose Brain),而不僅僅是針對某一種。我們還希望實現所謂的跨載體適應(Cross Embodiment)。那麼,我們要怎麼解決這個問題呢?
Yeah, I can perhaps talk a bit about our GR00T strategy, right? We’re solving a really, really hard problem—to build a general-purpose brain for all kinds of different humanoid robots, not just one. We also hope to achieve what we call cross embodiment. So how do we tackle this problem?
我要說,我們有兩個主要原則(Principles)。第一個原則是,模型(Model)本身要盡可能簡單。我們希望它能做到端到端的(End-to-End)設計,簡單來說,就是從光子到動作(Photons to Actions)。你從攝影機(Video Camera)中獲取像素(Pixels),然後直接輸出連續的浮點數(Floating Point Numbers),這些數值基本上就是馬達(Motor)的控制值(Control Values)。這就是端到端的模型,沒有任何中間步驟(Intermediate Steps),盡可能保持簡單。
I’d say there are two main principles. Principle number one is that the model itself—we wanted it to be as simple as possible. We wanted it to be as end-to-end as possible, so that it’s basically photons to actions. Right? You take pixels in from a video camera, and then you directly output continuous floating-point numbers, which are essentially the control values for the motor. That’s the model, end-to-end. There are no intermediate steps—as simple as possible.
為什麼這樣做有好處呢?因為如果我們看看自然語言處理(NLP)的領域——順便說一下,這可能是目前人工智能(AI)解決得最成功的領域——我們作為機器人學(Robotics)的從業者,應該向已經成功的領域「抄作業」(Copy Homework)。在ChatGPT出現之前,自然語言處理的領域有點混亂。你有文本摘要(Text Summarization)、機器翻譯(Machine Translation)、程式碼生成(Code Generation),這些都使用完全不同的數據管道(Data Pipeline)、訓練協議(Training Protocols)和模型架構(Model Architectures),有時甚至不只一個模型(Model)。然後ChatGPT出現,把一切都徹底改變了。因為它很簡單,對吧?它只是把任何文本映射到任何其他文本,就這樣。底層是一個轉換器(Transformer),將一串整數序列(Integer Sequence)映射到另一串整數序列。因為它如此簡單,你就能把所有問題的數據統一到一個模型中。我認為,這正是機器人學應該學習的地方——讓模型盡可能簡單。第二個原則是,數據管道(Data Pipeline)反而會非常複雜。圍繞模型的一切都會很複雜。因為正如我一開始說的,對機器人學來說,數據(Data)是一個大問題。你不能從YouTube或維基百科下載馬達控制數據(Motor Control Data),你哪裡都找不到。對於GR00T,我們的數據策略可以組織成一個金字塔(Pyramid)。現在,閉上眼睛,想像這個金字塔。最頂端是真實機器人數據(Real Robot Data),這是最高品質的,因為沒有模擬與現實的差距(Sim-to-Real Gap)。這些數據是通過真實世界中的遠程操作(Teleoperation)收集的。但這類數據非常有限,無法大規模擴展(Scalable),因為我們受限於物理限制——每台機器人每天只有24小時,對吧?在原子世界(World of Atoms)中很難擴展。金字塔中間是模擬數據(Simulation Data)。我們非常依賴物理引擎(Physics Engines),比如Isaac,來生成大量數據。這些數據可以基於真實世界收集的數據,或者通過經驗學習(Learning from Experience)生成,就像大家提到的。這是模擬策略(Simulation Strategy)。記住,在Nvidia成為AI公司之前,它是一家圖形公司(Graphics Company)。圖形引擎(Graphics Engine)很擅長物理模擬(Physics),比如渲染(Rendering)。這是我們的模擬策略。金字塔底層,我們仍然需要從互聯網獲取多模態數據(Multimodal Data)。但這次使用方式有所不同。我們用它來訓練視覺語言模型(Visual Language Models),作為視覺(Vision)、語言(Language)和動作(Action)模型的基礎。這些模型從大量的互聯網文本(Text)、圖片(Images)、音頻(Audio)中訓練出來。最近,影片生成模型(Video Generation Models)也變得非常出色,它們可以成為世界的神經模擬(Neural Simulations),超越傳統圖形引擎(Traditional Graphics Engines)。金字塔的最後一層就是神經模擬。你可以提示一個影片生成模型,要求它生成新的軌跡(Trajectory),比如為我「幻覺」(Hallucinate)一個新的機器人軌跡。這些影片模型因為在數億個線上影片上訓練,學會了物理規則(Physics),能給你提供物理上準確的像素軌跡(Physically Accurate Trajectory in Pixels)。然後,你可以運行算法——比如我們在GR00T N1中提出的潛在動作(Latent Action)技術——從這些「幻覺」中提取動作,我們稱之為機器人的夢想(Robot Dreams)。就像類人機器人夢見電子羊(Electric Sheep)一樣。它在「做夢」,你從中收集潛在動作(Latent Actions),再放回這個數據金字塔中。有了這些非常複雜的數據策略,我們把它們壓縮(Compress)成一個乾淨的成品:從光子到動作(Photons to Actions)。一個20億參數的模型就足以應對廣泛的任務(Wide Range of Tasks)。這就是我們的策略概述。
And why is this good? Because if we look at the field of NLP—and by the way, NLP is an extremely, perhaps the most successful field so far that’s been solved by AI—I think as roboticists, we should copy homework from someone that’s already worked. Before ChatGPT, the field of NLP was kind of a mess, right? You had text summarization, machine translation, code generation—all using completely different data pipelines, training protocols, and model architectures, sometimes not just one model. Then ChatGPT came and just blew everything out of the water. Because it’s simple, right? It maps any text to any other text, and that’s it. Underneath is a transformer that maps a sequence of integers to another sequence of integers. And because it’s so simple, you can unify all the data of those problems into one model. I think that’s where robotics should copy homework—to make a model as simple as possible. The second principle is that the data pipeline will actually be very complicated. All the things that surround the model will be very complicated. That’s because, for robotics—as I said at the beginning—data is a huge problem. You cannot download motor control data from YouTube or Wikipedia. You can’t find it anywhere. So, for GR00T, our data strategy can be organized into a pyramid. Right now, close your eyes, visualize the pyramid. At the top, you have real robot data—that’s going to be the highest quality because there’s no sim-to-real gap. You collected it using teleoperation in the real world. But that’s quite limited, not very scalable, because we’re limited by the fundamental physics limit of 24 hours per robot per day. That’s it. It’s really hard to scale in the world of atoms. In the middle of the pyramid, that’s where simulation comes in. We rely heavily on physics engines, like Isaac, for example, to scale lots and lots of data. This data can be generated based on real-world collected data or through learning from experience, as people mentioned. That’s simulation data. And just remember—before Nvidia was an AI company, it was a graphics company. What are graphics engines great for? Physics, right? Rendering. So that’s our simulation strategy. And at the bottom of the pyramid, we still need all that multimodal data from the internet. But this time, we use it a little differently. We use it to train visual language models that can become the foundation for vision, language, and action models. The VLMs are trained on lots and lots of internet text, images, audio—you name it. And then recently, there are also video generation models that have become so good they can be neural simulations of the world. So the last layer of the pyramid is really the neural simulation—that goes beyond traditional graphics engines. With these neural simulations, you can prompt a video generation model and ask for things like, “Hallucinate a new trajectory, a new robot trajectory for me.” The video models learn physics so well—because they’re trained on hundreds of millions of videos online—that they’re able to give you a physically accurate trajectory in pixels. Then you can run algorithms—like what we’re proposing in GR00T N1, something called latent action—to extract back those actions from the hallucinated, what we call the “dreams” of the robot. Like the humanoid robots dreaming of electric sheep, right? It’s dreaming, and you collect those latent actions from it. You put it back into this data pyramid.
With all these very complicated data strategies, we compress them into this one clean artifact—from photons to actions. A 2-billion-parameter model suffices for a wide range of tasks. So that’s an overview of our strategy.
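To make the "photons to actions" interface concrete, here is a minimal sketch of a pixels-in, joint-commands-out policy. This is not the GR00T N1 architecture; the image resolution, layer sizes, and 23-joint action dimension are illustrative assumptions.

```python
# Minimal sketch of a "photons to actions" interface: pixels in, continuous
# motor commands out, with no hand-written intermediate representation.
import torch
import torch.nn as nn

class PixelsToActions(nn.Module):
    def __init__(self, num_joints: int = 23):
        super().__init__()
        self.encoder = nn.Sequential(              # image encoder: the "photons" side
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():                       # infer the flattened feature size
            feat_dim = self.encoder(torch.zeros(1, 3, 96, 96)).shape[1]
        self.head = nn.Sequential(                  # action head: the "actions" side
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_joints),             # continuous joint targets
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(rgb))        # (B, 3, 96, 96) -> (B, num_joints)

policy = PixelsToActions()
actions = policy(torch.rand(1, 3, 96, 96))          # floating-point motor commands
```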
我認為這描繪了一幅非常美好的未來圖景,對吧?我們有一個簡單的大模型(Model),其實它甚至不算太大,就能解決所有問題,從圖片到動作(Pictures to Motion)。但我認為在這個過程中,我們也需要關注一些事情——那些我們必須掌握的、交付到現實世界(Real World)的產品,需要具備確定性(Determinism)。當你需要為客戶交付某個東西時,你得知道它在意外情況(Unexpected Conditions)下會怎麼表現。你需要考慮功能安全(Functional Safety),還得想想如果在現有功能上增加新能力(New Capabilities),它會不會退步(Regress)。所以我覺得你提到了一個很重要的點:複雜性被推到了數據(Data)上。你收集的數據至關重要。我認為我們才剛剛開始建立這個數據集(Data Set)的旅程。所以我想說,我們認為一個重要的策略(Strategy)是,確保在追求這個潛在非常強大的終極狀態(End State)時,不要把整個工具箱(Toolbox)都丟掉。因為作為一個社群(Community),我們還有許多事情要做,其中之一就是維持購買機器人(Robots)的客戶的信任(Trust)。我們必須通過應用所有現有工具來做到這一點。我認為有很多令人興奮的新能力(New Capabilities),這些能力正在徹底改變機器人學(Robotics)的格局,已經開始了。但與此同時,我們也得現實一點:過去70年來,機器人學已經積累了一個巨大的工具箱。有些工具仍然是解決現實世界問題的正確選擇,尤其是當你操作大型、強大的機器人——這些機器人可能會傷人——或者在人身邊工作時,你得保持那份信任。因為一旦信任破裂,就再也回不來了,對吧?所以我想說,我們需要部署這個大工具箱。
I think that paints a really great future picture, right? So we have a simple big model—it’s not even that big—that kind of solves everything, pictures to motion. But I think along the way, we also need to pay attention to all the things we have to own—delivering products into the real world that require determinism. When you need to deliver something to a customer, you need to understand what it’s going to do in unexpected conditions. You need to think about functional safety. You need to think about how it’s going to regress if you add new capabilities on top of existing ones. So I think you pointed out a really important thing, which is that the complexity gets pushed into the data, right? The data you gather. And I think we’re at the very beginning of the journey of building that data set. So maybe I’d say a piece of strategy that we think is important is to make sure you don’t throw the whole toolbox out in pursuit of this potentially very powerful end state. Because we have a lot of things to do as a community on the way, and one of them is to maintain the trust of the customers buying robots. We have to be able to do that by applying all the tools we have. So I think there’s a lot of exciting new capabilities—things that we think will totally change the landscape of robotics; they already are. But at the same time, we need to be realistic that there’s a big toolbox of robotics tools going back 70 years. And some of those are also the right tools to apply to solving real-world problems, especially when you’re doing things with large, powerful robots that can hurt somebody, or doing things around people where you want to maintain that trust. Because as soon as you break it, you never get it back, right? So I think maybe we just need to deploy this big toolbox.
我要說,謝謝你,Jim。我們非常認同這種想法:打造一個簡單的模型(Simple Model)。我們還不知道它最終會是什麼樣子,所以姑且稱它為「相對簡單」的模型。一切都跟數據(Data)有關。如果我們要從早期和後期的大型語言模型(LLMs)中吸取教訓,我認為有一件事經常被低估,那就是多樣性(Diversity)的重要性。就像在大型語言模型發展的早期歷史中,很多公司試圖訓練一個很棒的模型來寫詩,它們會用世界上最好的詩來訓練。但這其實行不通。因為除非你用與寫詩無關的、非常多樣化的數據來訓練,否則你無法獲得智能(Intelligence)。智能正是從這種多樣性中來的。我們現在看到,至少在我們的模型中,這一點對機器人學(Robotics)來說同樣顯而易見。即使在非常小的規模上,用這些微小的數據集(Data Sets),我們目前受到的限制更多來自數據的多樣性,而不是數據的規模(Scale of Data)。所以,關鍵在於如何獲取盡可能多的任務(Tasks),在盡可能多不同的環境(Environments)中,最好有一些動態的、複雜的情況(Dynamic Stuff)發生,這樣才能真正理解什麼是實際的任務(Actual Task)。
Uh, I thank you very much, Jim. We’re very much in that camp where we’re making one simple model. We don’t know exactly how it’s going to look yet—so we call it simple, but a relatively simple model. It’s all about the data, right? And if we’re going to take the lessons learned from early LLMs and later LLMs, in that case, I think one of the things that often gets underestimated is the importance of diversity. In the beginning of the history of LLMs, right? A lot of companies tried to train, let’s say, a very good model to create poems—so they would train on all the best poems in the world. And it doesn’t really work. Because unless you train on very diverse data that has nothing to do with writing poems, you’re not going to get intelligence. Intelligence comes from that diversity. What we’re seeing now, at least in our models, is that this is obviously also true for robotics. Even at a very small scale, with these tiny data sets we’re starting with, we’re actually more limited by diversity than we are by the scale of data. So it’s about: how do you get as many tasks as possible, in as many different environments as possible, preferably with some messy and dynamic stuff going on as much as possible, so that you can understand what an actual task is?
我最喜歡的例子是開洗衣機(Opening a Washing Machine)。當我們走進來,看到一台洗衣機時,我們會想:好,我們要把衣服放進去。所以我們會試著打開它,找找把手(Handle)。如果打不開,可能哪裡有個插銷(Latch)。如果還是打不開,我們可能會退回去重新思考,但我們對洗衣機如何運作有很深的理解。我們能搞清楚怎麼用一台新的洗衣機。但今天的機器完全沒有這種能力,它們只是在學習重複動作(Repeat Motion)。這就是為什麼我們認為,把機器人大量投入使用(Getting Robots Out There in Volume)並獲取多樣化的數據(Diverse Data)是如此重要。我猜這是我們的一個非常反傳統的觀點(Contrarian View),也很有趣值得討論。因為我們認為,這必須在人群中發生(In People’s Lives),必須在家中發生(In a Home)。安全性(Safety)必須是機器內在的特性(Intrinsic Thing)。你要怎麼確保機器裡的能量(Energy in the Machine)不會太大、不會變得危險(Dangerous)?然後再思考,我們如何把這一切與傳統工具箱(Classical Toolbox)結合起來?
My favorite example, for instance, is opening a washing machine. When we come in, we see a washing machine—okay? We’re going to put the clothes in. So we’re going to try to open it, and we try to find a handle. If it doesn’t open, maybe there’s a latch somewhere. If not, maybe we dial back to zero. But we have this great understanding of how the washing machine actually works, right? So we can figure out how to use a new one. Machines today don’t have that at all—they’re kind of just learning how to repeat motion. This is why we really think it’s so important to get robots out there in volume and really get that diverse data. And I guess here’s our very contrarian view, which I think is very interesting to discuss—because that’s why we mean this has to happen among people. It has to happen in a home. Safety has to be an intrinsic thing to the machine—how do you ensure that the energy in the machine isn’t so big that it’s dangerous? And then think about how we can combine this with the classical toolboxes?
是的,我想在這裡補充一點,談談我們的願景(Vision)。當你提到機器人學(Robotics)時,方法是什麼?總是有兩件事:硬體(Hardware)的方法是什麼?軟體(Software)的方法是什麼?
Yep, one thing I’d like to add here is about our vision. When you’re talking about robotics, what’s the approach? It’s always two things: What’s the approach for hardware? What’s the approach for software?
沒有人問過語言(Language)是不是為GPU設計的,因為我們有Jensen寫的程式碼(Code)。但這裡有兩件不同的事,對吧?這是一個重要的問題:應該只有一個機器人嗎?比如,應該有一個X機器人作為所有機器人的代表嗎?那麼,我們到底在部署什麼呢?
Nobody asked if language was supposed to be for GPUs, because we have code—code by Jensen. So it’s like, there are two different things, right? And this is a major question: Should there be only one robot? Like, should there be one X robot for all robots? So, what are we deploying?
如果你部署了所有機器人,它們之間會共享一個大腦(Brain)。我認為這裡有兩個重點。第一,人類——任何觀眾都可以走上來,你給他們一套虛擬實境套裝(VR Suit),比如動作追蹤服(Tracking Suit)、手套(Gloves)或虛擬實境頭顯(VR Headset),他們就能控制任何機器人(Any Robot)。他們不需要知道馬達(Motor)的細節,不需要了解馬達如何運作。這已經證明了一個大腦是可能存在的,可以控制任何機器人。這是第一個面向,所以你可以從任何地方獲取數據(Data)。但第二件事是,外面沒有現成的數據,大家都知道,對吧?不過我們忽略了一個特殊的機器人,那個機器人就在我們身邊,而且我們有大量的數據——那就是人類(Humans)。我們不是機械機器人(Mechanical Robots),不是由電力驅動的。我們是生物機器人(Biological Robots)。但最終,類似的原則(Principles)也在引導我們。比如你的運動神經元(Motor Neurons)和感覺神經元(Sensory Neurons),它們把信號從感測器(Sensors)傳到大腦(Brain),然後大腦再把指令傳到馬達(Motors)。
Then, if you’re deploying all the robots, there’s a brain shared across them. This is where I think there are two things. One is humans—anyone can come up from the audience, and you can give them a VR suit, like a tracking suit, or some gloves, or a VR headset. They can control any robot—any robot—and they don’t need to know the motor details. They don’t need to know how the motors work. This is already evidence that a brain can exist that can control any robot. That’s the first aspect. So you can use data from anywhere. But the second thing is—there’s no data out there. Everybody knows that, right? But we’re missing one special robot that is out there, and we have tons of that data. And those robots are humans. We’re not mechanical robots; we’re not driven by electricity. We’re biological robots. But at the end of the day, similar principles guide us—like your motor neurons. They’re called motor neurons and sensory neurons; they carry signals from your sensors to your brain, and then from the brain to the motors.
所以,如果我們同意存在一個可以控制所有硬體(Hardware)的大腦(Brain),為什麼要排除生物硬體(Biological Hardware)呢?如果你不排除這一點,你就可以利用人類的影片數據(Human Video Data),也就是人類活動(Human Activity)的數據,而這些數據可能是我們沒有的。比如說,一個X機器人做某件事,像拿起東西、打開冰箱(Opening a Fridge),但如果連續三個月每天打開冰箱10次,這樣的數據我們可能沒有。然而,外面有數以兆計(Trillions)的影片,記錄了人類在做這些事。這是我們的信念之一:這些數據對機器人學(Robotics)來說非常關鍵,比如人類如何生活、如何操作。你可以利用這些知識,再加上模擬(Simulation)。當然,光靠這些並不完整,因為你不能只是看著影片然後模仿。你得結合起來。但這一點很關鍵。我認為我們在這一點上非常一致——所有這些數據都非常有用,我們也在使用它們。這些數據並不是互相排斥的(Mutually Exclusive),我知道這些數據就在我們身邊。
So, if we’re agreeing that a brain can exist that controls all hardware, why should we exclude the biological hardware? If you don’t exclude that, you can actually use human video data of humans and human activity that we may not have. One X robot, let’s say, doing something—picking up, opening a fridge—but three months of opening a fridge, every other day, like 10 times a day? There are trillions of videos out there of humans doing it. This is, at least our belief, one very critical piece of data for robotics—like how humans live, how they operate. So you can actually use that knowledge to go towards, in addition to simulation. Of course, it’s not complete without that, because you cannot just watch and play. But these things can combine. This is, very quickly—I think we very much agree. All that data is incredibly useful, and we use it too. These are not mutually exclusive; I know that data is near.
我只是想就這一點補充說明,因為現在話題好像混雜了兩件事。是的,這樣很好。我也看得出來Pras對此有很強烈的看法。
I just want to weigh in on that point—it was getting mixed into two things there. Yeah, it’s good. And I can tell Pras has some strong thoughts on this too.
作為一個遠程操作(Teleoperated)過很多機器人的人,我可以說,人類大腦(Human Brain)在遠程操作各種平台(Platforms)時確實很出色。但根據我的經驗,性能(Performance)水平並不總是一樣,硬體(Hardware)絕對會帶來差異。我曾經遠程操作過One X機器人,那是一次很棒的體驗,對吧?但我也操作過一些工業機器人(Industrial Robots),體驗就不怎麼好。硬體確實很重要,它會決定性能的一些特性(Characteristics of Performance)。
Well, as someone who’s teleoperated a lot of robots, I can say that sure, the human brain is great at teleoperating a variety of platforms. But I can tell you from experience—not at the same level of performance. The hardware can definitely make a difference. And I mean, I’ve teleoperated a One X robot, right? It’s a great experience. I’ve also teleoperated some industrial robots—not a great experience. The hardware can matter a lot in this and does define some of the characteristics of performance.
我認為這一點很重要,要注意到會有差異(Differences)。
And I think that’s important to note—that there will be differences.
需要一定程度的努力去打造合適的硬體(Hardware),讓它具備可控性(Controllable)、正確的感測能力(Right Sensing)和慣性特性(Inertial Properties),才能在現實世界(Real World)中有效。有些機器人平台已經陪伴我們超過10年了,對吧?但機器的動力學(Dynamics of the Machine)很重要,你真的能看出差別。再舉一個我們這裡沒提到的例子——達文西機器人(Da Vinci Robot)。人們用這個機器人進行手術操作(Surgical Operations),背後已經是一家市值超過1000億美元的公司,而他們做的就是讓人透過這台機器人完成手術。這太驚人了。這也說明沒有人會否認人類大腦(Human Brain)非常強大。硬體(Hardware)和軟體的問題總是圍繞著這兩個面向:方法(Approach)可以不同,但最終它們必須結合在一起,不是非此即彼;歸根結柢還是數據、更多的數據,而模擬(Simulation)正在幫我們把這些資源規模化。我認為這也有點像自下而上(Bottom-Up)和自上而下(Top-Down)的結合。剛才我們談了很多關於控制架構(Control Architecture)的自上而下方法,但我覺得自下而上的方法也很有趣,比如你是怎麼學習靈巧性(Dexterity)的,對吧?至少我們正在親身經歷這個過程。以學習靈巧操作為例,我們不知道怎麼做——我們不知道如何打造一個遠程操作系統(Teleoperation System),讓它既快速又好,還能提供足夠的觸覺反饋(Tactile Feedback)和其他東西。但如果只是給機器人一堆物件讓它玩,它其實能學得很好。
There is some amount of building the right hardware to make it controllable, to have the right sensing, to have the right inertial properties to make it effective in the real world. Some of these platforms have been with us for the last 10 years, right? But the dynamics of the machine matter, and you can really see it. Take an example we’re missing here—the Da Vinci robot. People use that robot for surgical operations; that’s already a $100-billion-plus company, and all they do is operate through that machine. This is amazing. So nobody’s disagreeing that the human brain is very powerful. And the hardware and software questions are always these two things: the approaches can be different, but in the end they have to come together—it’s not one or the other. In the end it’s the usual story: data and more data, with simulation scaling up those resources. I think it’s also a bit of bottom-up and top-down, right? Because now we’re talking very much top-down about the control architecture. But I think the bottom-up side is also very interesting—like, how do you learn dexterity, for example? And at least we’re experiencing that. For learning dexterous manipulation, we don’t know how to do it—we don’t know how to build a teleoperation system for it that’s fast and good enough and really gives you tactile feedback and all these things. But the robot can actually learn it really well if you just give it a bunch of objects to play with.
這是可以學習的(Learnable)。
This is learnable.
然後問題就變成了,你如何把那個介面(Interface)往上提升。你其實是在你的協作介面(Collaboration Interface)上加了一層抽象(Abstraction Layer)。所以你不再說:「嘿,我要這樣完成抓握(Grasp)。」你更像是引導機器(Guide the Machine)去完成什麼任務(Tasks),並讓系統自己去學習靈巧性(Dexterity)。是的。
And then it becomes a question of how you lift that interface. You’re essentially adding an abstraction layer on top of your collaboration interface. So you’re no longer saying, “Hey, I’m going to finish a grasp like this.” You’re more like guiding the machine on what tasks to be done and allowing the system to actually learn dexterity. Yep.
我認為,當我們試圖把大腦(Brain)和硬體(Hardware)分開時,有一件事我們常會忽略,那就是你想完成的任務(Tasks)。如果你考慮的是一系列任務,這些任務中的物件很小、慣性上無關緊要(Inertially Irrelevant),那麼是的,你可以把大腦和身體(Body)分開很多。但現實是,我們想讓這些機器執行的任務,大多超出了簡單的桌面任務(Tabletop Tasks)——我想這是很多人開始時會做的。如果你想舉起又大又重又複雜的物件(Complex Objects),或者想接觸鋒利的金屬板零件(Sharp Sheet Metal Parts),或者處理高溫的東西(Hot Objects),因為你希望把人從製造環境(Manufacturing Environment)中移走,遠離危險(Hazard),然後用機器人(Robot)取代他們,那麼我認為硬體確實很重要,而且影響比表面上更深。我覺得「選一個好的硬體平台(Hardware Platform),配上API,再加上任何軟體大腦(Software Brain)就行」這種想法,有時過於單純。這些選擇牽涉到你對執行器(Actuator)品質的理解、你能不能準確地刻畫它,而這對你能否在模擬(Simulation)中忠實呈現它影響很大。我認為我們還需要更多時間,才能完全理解像GR00T這樣的模型如何部署在A型機器人和B型機器人上。因為我們目前還沒有足夠的數據點(Data Points)來證明,一個模型(Model)能適用於所有不同類型的機器人,而最終行為(Resulting Behavior)不會有顯著差異。如果我只是要拿起一袋薯片(Bag of Chips),移動它然後放下,我覺得這無關緊要;但如果我要拿起一個高精度零件(High Precision Part),把它裝進另一個高精度孔(High Precision Hole)裡,那可能就差很多了。
So I think there’s one thing that we tend to skip when we try to separate the brain from the hardware—and that’s the tasks you’re trying to do. If you’re thinking about a whole set of tasks where the objects are small and inertially irrelevant—yeah, you can separate a lot of the brain from the body. But I think the reality is that most of what we want to build these machines for extends beyond the simple tabletop tasks that I think a lot of people start with. If you want to be lifting big, heavy, complex objects, or you want to be touching sharp sheet-metal parts, or working with something hot—because you can remove a person from a manufacturing environment, take them away from the hazard, and replace that with a robot—then I do think hardware really does matter, and it goes deeper than it looks. I think the idea that we can make a clean separation between a good hardware platform with an API and any software brain is sometimes too simple. Those decisions involve understanding the quality of your actuator and how well you can characterize it, which can matter a lot for how well you can represent it in simulation, for example. And I think we’re going to need more time before we fully understand how a model like GR00T, for example, deploys on a robot that’s of Type A and a robot that’s of Type B. Because I don’t think we have enough data points yet to say that one model will deploy across all these different kinds of robots and there won’t be significant differences in the resulting behavior. If I’m trying to pick a bag of chips and move them and drop them, I don’t think it matters. But if I’m trying to pick a high-precision part and assemble it in another high-precision hole, it might matter a lot.
所以對我來說,大腦(Brain)和硬體(Hardware)這兩個部分到底能不能真正分開,答案取決於具體情況,而且可能還要一段時間才能看清楚。過去這些年裡,許多公司已經累積了非常棒的經驗,所以我認為這是——
So for me, the verdict on whether you can really separate those two pieces—the brain and the hardware—really depends on the situation, and it may take a while before we know. A lot of great experience has been built up across companies over the years, so I think it is—
我認為Aaron其實提到了一個很有趣的話題,也是一個很困難的挑戰:跨載體適應(Cross Embodiment)。跨載體適應對模型(Model)意味著什麼?
I think Aaron actually touched on a very interesting topic and a very difficult challenge of cross embodiment—you know, what cross embodiment means for a model.
那麼,讓我們稍微想想我們自己吧。
So let’s maybe think a second about ourselves.
身為人類,我們從出生起就是在同一個身體裡被養育長大的。
As humans, we’re raised in, and grow up with, a single body.
但當你打開一個電玩遊戲(Video Game)並開始玩的時候,
But anytime you open up a video game and you start playing it,
你其實就在進行跨載體適應(Cross Embodiment)。
You’re actually doing a cross embodiment.
對吧?比如說,你在電玩遊戲(Video Game)裡開車,或者扮演一個奇怪的角色,甚至是非人類角色(Non-Human Character)。玩了一會兒,用搖桿(Joystick)操作後,你會開始感覺到如何在虛擬遊戲中控制那個身體(Body)。過了一段時間,你就能玩得非常好。人類大腦(Human Brain)在跨環境(Cross Environments)方面真的很厲害。所以我認為這是一個可以解決的問題(Solvable Problem)。我們只需要找到一組參數(Parameters)來實現這一點。我同意Aaron的看法,現在還太早(Early)。談論「零樣本跨載體適應(Zero-Shot Cross Embodiment)」——意思是你拿來一個機器人,模型(Model)就能神奇地運作——我覺得還不行。我們還沒到那一步。但總有一天我們會做到。我認為一個方法是擁有大量不同的機器人硬體(Robot Hardware),甚至在模擬(Simulation)中有更多不同的機器人硬體。以前,我們的研究小組做了一個很有趣的工作,但我得說還是像玩具級的探索性工作(Exploratory Work),叫做「MetaMorph」。我們在模擬中用程序生成(Procedurally Generated)了很多簡單機器人(Simple Robots),它們有不同的關節連接(Joint Connectivity)。有的像蛇(Snake),有的像蜘蛛(Spider),真的很怪。但我們生成了數千個這樣的機器人。然後我們用一種機器人語法(Robot Grammar)來組織這些機器人的身體,把載體(Embodiment)本身轉換成一串整數序列(Sequence of Integers)。一旦有了整數序列,我們就想到轉換器(Transformers)——「注意力就是你所需要的一切(Attention Is All You Need)」,對吧?我們把轉換器應用到這數千個載體的集合上。我們發現,你真的可以推廣(Generalize)到第1001個載體。但這還是一個很初級的玩具實驗(Toy Experiment),非常早期。但我相信,如果我們能有一種通用的描述語言(Universal Description Language),再加上許多不同類型的真實機器人和模擬機器人,我們就能組織它們,從中生成大量數據(Data)。這些載體會形成一個通用的空間,也就是載體的向量空間(Vector Space of Embodiments),或許一個新機器人就會落在這個分佈範圍內(In Distribution)。我還想補充,這不只是學術好奇心(Intellectual Curiosity),它正成為一個很現實的問題(Real Problem)。所有公開上市公司的創辦人都面臨這個問題:你有不同世代的機器人(Generations of Robots),你在上一代收集的數據和基於這些數據訓練的模型,並不能很好地推廣,甚至對你自己公司的B2和B3機器人也不行。更別提同一版本的機器人了,因為製造(Manufacturing)的緣故,還有各種小缺陷(Defects)。這是個物理世界(Physical World),很混亂(Messy),對吧?因為這些差異,機器人甚至無法完美複製同一個模型(Model)。即使在同一代機器人中,你也會遇到跨載體適應的問題,更不用說跨世代,甚至跨不同公司和設計。這已經是個真實的問題了。我認為這條研究路線——
Right? Like if you’re—let’s say you’re driving a car in a game, or I’m playing something like a weird character, sometimes a non-human character. After a while—after you play with a joystick a little bit—you’ll get a feel for how you control that body inside the virtual game. And after a while, you can play it super well. The human brain is great across environments. So I think it’s a solvable problem. We just need to find that set of parameters to enable this. And I agree with Aaron that for now, it’s too early—it’s still quite early—to talk about, like, zero-shot cross embodiment, meaning you bring a robot and the model just magically works. I don’t think so, right? We’re not there yet. But someday we will. And I think one way to do that is to have lots and lots of different robot hardware, and even more different robot hardware in simulation. Previously, our research group had a very interesting work—but I’d say it’s still kind of a toy, exploratory work—called MetaMorph. What we did is, in simulation, we procedurally generated a lot of simple robots with different kinds of joint connectivity. It can look like a snake, look like a spider—really weird. But we generated thousands of them. And then we used a robot grammar to organize the body of the robot—essentially converting the embodiment itself into a sequence of integers. Once we see a sequence of integers, right, then we see transformers—“Attention is all you need,” right? We applied transformers to this whole set of thousands of embodiments. And we found that you can actually generalize to the thousand-first embodiment. But again, it’s a very toy experiment and super early. But I do believe that if we’re able to have a universal description language and we have lots of different types of real robots and simulation robots, we can organize them. We can generate lots of data from them. All these embodiments become this kind of universal space, a vector space of embodiments, and perhaps a new robot will be within distribution. And I also want to add that this is not just an intellectual curiosity—it’s becoming a very real problem, right? So I think all the public company founders here have this issue where you have different generations of robots, and the data you collected on the previous generation, and the model you trained on that data—it doesn’t generalize or degrades significantly, even to your own company’s B2 and B3 robots. Actually, even forget that—we’re seeing the same version of the robot, because of manufacturing, because of all the little defects. It’s a physical world—it’s messy, right? Because of all the messiness, robots don’t even always replicate the same model perfectly. You have a cross embodiment issue even within one generation of the robot—let alone across generations, let alone across different companies and designs. It’s becoming a real problem. And I think with this research direction—
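To illustrate the "embodiment as a sequence of integers" idea, here is a minimal sketch in the spirit of that MetaMorph-style experiment; the toy joint-type vocabulary, model sizes, and per-joint output are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch: describe a robot's morphology as a sequence of integer tokens
# (a toy "robot grammar") and let a transformer consume it, producing one
# output per joint token regardless of how many joints the robot has.
import torch
import torch.nn as nn

JOINT_VOCAB = {"revolute": 0, "prismatic": 1, "fixed": 2}   # toy grammar

def tokenize_embodiment(joint_types):
    """Map a list of joint-type strings to a sequence of integers."""
    return torch.tensor([JOINT_VOCAB[j] for j in joint_types])

class EmbodimentTransformer(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(len(JOINT_VOCAB), d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)            # one command per joint token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens.unsqueeze(0))           # (1, num_joints, d_model)
        return self.head(self.encoder(x))             # (1, num_joints, 1)

# A "snake" and a "spider" get different-length token sequences, same model.
snake = tokenize_embodiment(["revolute"] * 8)
spider = tokenize_embodiment(["revolute", "fixed", "revolute", "revolute"] * 4)
model = EmbodimentTransformer()
print(model(snake).shape, model(spider).shape)
```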
是的,老實說,現在的多元性並不多。如果你觀察人形機器人(humanoid)的領域,我們幾乎都在使用非常相似的東西,基本上就是人體的複製品。以我們自己為例,我們決定只用三根手指作為夾具(gripper),這其實打破了追求完全擬人化手部的趨勢。我們發現,人類非常擅長把自己的動作映射過去,即使只有三根手指也沒問題,對吧?所以,一個遠端操作員(tele-operator)在遠端操作設備(teleop rig)上經過幾小時的訓練,就能用三指夾具完成你用五根手指能做的幾乎所有事情。所以我認為這裡有很大的探索空間。因為現在每個人都專注於把基礎打好,我們還不夠大膽。但我覺得,一旦這些模型開始趨於穩定,你就會看到有些人開始跳脫這種形態,這可能是好事,也可能是壞事。我認為,我們最終可能會造出一些外觀離人類剛好遠到讓人害怕的機器人。但我覺得,單單在操控裝置(manipulator)這部分,就有非常豐富的機會空間。Agility的夾具設計就完全不同於你在其他人形機器人上看到的任何東西,而它們依然能顯著超越過去的表現。所以,我覺得這將會是未來幾年一個令人興奮的主題。
Yeah, there’s not a lot of diversity right now, honestly. If you look at the humanoid space, we’re all pretty much working with a fairly similar thing—it’s a replication of my body. In our case, we decided to only use three fingers for the gripper, which bucks the trend of having a fully anthropomorphic hand. And we found that humans are so good at mapping themselves that even three fingers work, right? So you can have a tele-operator, within a couple of hours of training on the teleop rig, operating a three-finger gripper and doing pretty much everything you do with five fingers. So I think there’s a lot of space to explore here. Because everybody’s trying to get a foundation built right now, we’re not being very brave. But I think that as soon as things start to stabilize in our models, you’ll see people break away from this a little bit—that might be good or might be bad. I think we may end up with robots that look just far enough away from humans that it’s scary. But just inside the manipulator alone, there’s such a rich space of opportunity. Agility has a completely different gripper than anything you see on these other humanoids, and they’re still able to go well beyond what was possible before. So I think that’s going to be an exciting topic in the coming years.
Aaron,你之前在場外就已經給了我一千種不同的答案。好的,謝謝。
Aaron, you gave me a thousand different answers out there before. All right. Thank you.
我覺得你已經回答了我接下來關於硬體(hardware)的問題,所以謝謝你。好,我想繼續談這個話題,因為這確實是一個非常有趣的挑戰,而你們每個人都從獨特的視角來看待它。剛才有人提到,即使是同一型號、同一批製造出來的機器人,表現也可能不一樣,一切都取決於具體情況。你們會說這就是目前硬體方面最大的挑戰嗎?
Well, I feel like you already answered my next question, which was simply around hardware—so thank you. All right, I want to continue on that, because it’s a really interesting challenge and you each come at it from a unique perspective. When you were mentioning just now that the same model of robot, as manufactured, might perform differently—it just depends—would you say that’s the biggest challenge right now when it comes to hardware?
我認為這絕對是挑戰之一。這也促使我們開始研究一條關於跨載體適應(cross embodiment)的研究路線,探索如何彌合這些差距。但我想把這個問題交給我們這裡的專家。我覺得這又回到了工具箱(toolbox)其他部分重要性的問題,對吧?如果你製造了一個擁有出色校準方法(calibration methods)的機器人,如果你能理解如何去表徵(characterize)這個機器人,如果你在關節層級控制(joint level control)上做了很多扎實的工作——也就是那些位於AI之下的基礎層——我認為這些問題就不會顯得那麼嚴重。反過來,當你面對一個無法表徵、沒有校準、從一台到另一台有很大變異性(variability)的機器人時,你只是隨手丟一個控制器(controller)給它,不管是人工智能策略(AI policy)還是其他東西,你會發現輸出的結果差異很大。但我認為,你可以做很多工作來把這種變異性降到最低。我也覺得你可能有些想法。對,我認為另一個層面是讓機器人真正進入現實世界。
I think that’s definitely one of the challenges, and it’s also what prompted us to study this line of research on cross embodiment and how we can bridge some of these gaps. But I would refer this question to our experts here. This is back to where I think the rest of the toolbox matters, right? If you make a robot that has really good calibration methods, if you make a robot that you understand how to characterize, if you do a lot of good work on the joint-level control—the stuff that sits way below the AI—then I think some of these things aren’t as big a deal. Whereas when you have a robot that you can’t characterize, that you haven’t calibrated, that has a lot of variability from copy to copy, and you just throw a controller at it—whether it’s an AI policy or something else—I think you find a lot of variability in the output. But I think you can do a lot of work to minimize that variability. And I think you probably have some thoughts here as well. Yeah, I think another aspect of this is just getting robots out there in the real world.
在製造過程中觀察變異性(variability),你確實能學到很多東西,而這些經驗會反饋到你所建立的流程(pipeline)中。一個很好的例子是「數字」(Digit),它具備一種完全透過學習得來的恢復行為(recovery behavior)。我們一直在現實世界中部署它,這應用在我們的生產系統(production systems)上。我們用來訓練的領域隨機化(domain randomization)和數據多樣性(diversity of data),都源於我們在現實世界中體驗到的反饋,以及我們所擁有的所有「數字」機器人車隊(fleets)的變異。結果證明,我們在領域隨機化和強化策略(hardening the policy)方面做了大量工作,以至於當我們將這個策略轉移到新機器人上時,效果依然出色。
doing manufacturing and seeing what variability you have, you do get a lot of learning that feeds back into the pipeline that you build. A great example of this is Digit, which has a recovery behavior that is fully learned. We've been deploying it out in the real world; it's on our production systems. The domain randomization and the diversity of data that we used to train it come from what we experienced in the real world and the variation across all of the Digits that we've got in our fleets. It turned out that we were doing so much of this domain randomization and hardening the policy so much that when we transferred the policy to our new robot,
事實證明,我們做了這麼多的領域隨機化(domain randomization)和強化策略(hardening the policy),當我們將策略轉移到剛剛亮相的新機器人上時,它就像是一個更大的框架,重了10公斤(10 kilograms heavier)。這個策略竟然一次就成功轉移到這個全新的機器人上,儘管它的運動學(kinematics)略有不同,負載(payload)更重,所有的條件都不一樣。這是因為我們花了很多時間去強化和穩健化(robustifying),像是深入理解所有細節,比如腳部接觸(foot contact)以及所有這些部分的運作。所以我確實認為,隨著經驗的累積,你會在交叉體現(cross embodiment)方面做得更好,不需要總是小心翼翼地檢查機器人的製造序號(manufacturing serial number)。當你在現實世界中累積經驗時,你會更清楚在訓練流程(training pipeline)中需要抓住哪些關鍵點。
which we just debuted, and which is about 10 kilograms heavier with a much larger frame, the policy actually just one-shot transferred to this totally new robot: slightly different kinematics, heavier payload, everything. And it's because we had been spending all of this time hardening and robustifying, doing the sim-to-real transfer and really understanding all the details of things like foot contact and all of these pieces. So I do think that with experience you get better at this cross embodiment, and it isn't that you always need to look at the manufacturing serial number of the robot super carefully. There's some amount of, as you do it and as you get experience with the real world, understanding more about what the levers are that you need to capture in the training pipeline.
而且,當你的機器人數量從幾百個增加到幾千個時,這不再是一個選擇。你無法在擁有數千甚至數十萬個機器人時,還為每一個機器人單獨調整軟體堆疊(software stack)。所以我認為,這是必然會發生的事情。
And when you go from hundreds of robots to thousands, it's not a choice. You can't be tuning your software stack per robot when you have thousands or hundreds of thousands of robots. So I think it's just something that has to happen.
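As a concrete illustration of the domain-randomization and policy-hardening loop described above, here is a minimal Python sketch. The parameter names, ranges, and the `simulator`/`policy` interfaces are illustrative assumptions for this write-up, not any panelist's actual training stack.

```python
import numpy as np

# Illustrative ranges only: which physical parameters get randomized, and how
# widely, is exactly the kind of knob that gets re-tuned as fleet data comes back.
RANDOMIZATION_RANGES = {
    "payload_kg":      (0.0, 12.0),   # cover a heavier next-generation robot
    "link_mass_scale": (0.9, 1.1),    # copy-to-copy manufacturing variation
    "joint_friction":  (0.5, 1.5),
    "foot_friction":   (0.6, 1.2),    # foot-contact details called out above
    "motor_strength":  (0.85, 1.15),
}

def sample_dynamics(rng: np.random.Generator) -> dict:
    """Draw one randomized set of dynamics parameters for a simulation episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

def harden_policy(policy, simulator, episodes: int = 10_000, seed: int = 0):
    """Train the policy across many randomized 'virtual robots' so a single
    controller tolerates copy-to-copy variation instead of being tuned per unit."""
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        simulator.set_dynamics(sample_dynamics(rng))  # each episode sees a different body
        rollout = simulator.run_episode(policy)       # collect one trajectory
        policy.update(rollout)                        # e.g. a PPO-style update step
    return policy
```

The same idea is what makes the transfer to a heavier frame described above plausible: if the randomization ranges already cover the new payload and kinematics, the policy has effectively trained on that robot before it exists.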
我同意你們兩位的看法,但校準(calibration)真的非常重要。我覺得這一點很有趣,或許有點太深奧了。但就像你在做領域隨機化(domain randomization)時,你實際上是在教你的系統保持保守。你是在告訴系統:「如果我不知道會發生什麼,只要我這麼做,就一定是安全的。」這種方式在某種程度上掩蓋了系統的動態性(dynamic)。所以,這真的取決於你想要實現什麼目標。如果你在領域隨機化的情況下,你不會從系統中獲得同樣的性能,但當然,你會得到一個非常穩健(robust)的東西。
I kind of agree with both of you, but calibration matters a lot, you know. I think this is very interesting, actually, and maybe it's a bit too deep, but when you do domain randomization, what you're actually teaching your system is to be conservative. You're teaching your system: if I don't know what will happen when I do this, I'm safe anyway. And that kind of masks your dynamics. So it really depends on what you're trying to achieve. You won't get the same performance out of the system if you domain randomize, but of course you will get something that's very robust.
如果你做好了校準(calibration),你就能從系統中獲得更多效益。所以從長遠來看,這一點很重要。我認為現在有一些令人興奮的工作正在進行,比如將每個單獨機器人的歷史(robot history)加入到模型的脈絡(context)中。你可以取用某個機器人自身的運行數據,把它放進模型的歷史脈絡裡,然後模型會在脈絡中學習這台機器人自己的動態(dynamics),這種方法的效果出乎意料地好。這真的很酷,這類工作被稱為RMA(Rapid Motor Adaptation),曾經引起不少關注。這就是核心概念。我想從一個稍微不同的角度來探討這個問題。如果模型無法跨版本(across versions)移轉,那會是一個大問題。很難期望世界上只有一種機器人、只有一家公司製造所有機器人。就像汽車產業一樣,完全不是這樣,有那麼多汽車公司;手機產業也是如此,有那麼多手機公司,對吧?
So if you do really good calibration, you can get more out of your system, so it will matter in the long term. And I think there's some incredibly exciting work going on right now with adding the robot's own history to the context of the model, for every single individual robot. You take some of that robot's runtime data, you put it into the history, into the context of the actual model, and then it learns its own dynamics in context, which actually works surprisingly well. This is really cool, and that's the line of work called RMA (Rapid Motor Adaptation) that got a bit of attention. That's the idea. I want to go with a slightly different flavor on this. It's a big problem if you cannot carry your model across versions, and it's very hard to expect that there will be only one robot, one company making all the robots in the world. It's nowhere like that in cars; there are so many car companies. In mobile phones, so many mobile phone companies, right?
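To make the "robot history in the model's context" idea just mentioned concrete, here is a rough sketch of an RMA-style, history-conditioned policy. The dimensions, class name, and the stand-in linear "network" are assumptions for illustration only.

```python
from collections import deque
import numpy as np

STATE_DIM, ACTION_DIM, HISTORY_LEN = 12, 4, 50   # illustrative sizes

class HistoryConditionedPolicy:
    """Wraps a base policy so its input also contains a rolling window of this
    robot's own recent (state, action) pairs; the network can then infer the
    particular unit's dynamics in context instead of assuming a nominal body."""

    def __init__(self, base_policy):
        self.base_policy = base_policy
        # Start the window filled with zeros so the input size stays constant.
        self.history = deque(
            [np.zeros(STATE_DIM + ACTION_DIM)] * HISTORY_LEN, maxlen=HISTORY_LEN
        )

    def act(self, state: np.ndarray) -> np.ndarray:
        obs = np.concatenate([state] + list(self.history))    # state + own history
        action = self.base_policy(obs)
        self.history.append(np.concatenate([state, action]))  # update the context
        return action

# Usage with a stand-in "network": a random linear map of the right shape.
rng = np.random.default_rng(0)
W = rng.normal(size=(ACTION_DIM, STATE_DIM + HISTORY_LEN * (STATE_DIM + ACTION_DIM)))
policy = HistoryConditionedPolicy(lambda obs: W @ obs)
print(policy.act(rng.normal(size=STATE_DIM)))                  # -> 4-dim action
```

The design choice worth noting is that nothing is recalibrated per unit: the adaptation happens online, inside the model's context, which is exactly the contrast with per-robot tuning being discussed here.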
但問題在於,對每一個應用程式(application)來說,市面上都有很多種GPU(圖形處理單元),這本身會帶來挑戰。但你有像CUDA或作業系統(operating system)這樣的一層,把你從硬體中抽象出來,對吧?那麼在解決機器人技術(robotics)問題時,機器人領域的等價物是什麼?我在這裡想提出一個稍微不同的觀點。在其他所有領域,無論是視覺還是語言,我們總是被從硬體中抽象出來。如果一家新公司要進入市場,比如說AMD或其他公司,它們必須確保其他所有人原本為NVIDIA GPU寫的程式碼,能夠無縫地在它們的GPU上運行。這是它們的責任,不是軟體的負擔。打個比方,人工智慧(AI)就像是我們正在建造的機器人的大腦。我們不應該打造一個只適用於某一台機器人的大腦,而是應該打造一個能適應機器人的大腦。這是主要的區別。人類擁有的不是一個能做很多事的系統,而是一個能學會做很多事的系統。我們頭腦中搭載的是一個學習引擎(learning engine),它能即時學習(learn on the fly);無論你經歷什麼,你都在即時學習並適應。這將是與人工智慧在其他領域應用方式的最大不同。
But the thing is, for every application there are so many GPUs, and that creates challenges. But then you have something like CUDA, or the operating system, which abstracts you away from it, right? What is the equivalent for robotics when it comes to solving robotics? So here I would give a slightly different take. In every other field, whether it's vision or language, we are always abstracted away from the hardware. If a new company has to come in, let's say AMD or any other company, they have to make sure that everybody else's code, code that was written for NVIDIA GPUs, runs seamlessly on their GPUs. It's their burden; it's not the software's burden. By analogy, AI is the brain for the robot that we are building, and we shouldn't be building a brain that just works on one robot, but rather a brain that adapts to the robot. And that's the major difference: what humans have is not a system which can do many things, it's a system which can learn to do many things. What we are carrying in our heads is a learning engine that can learn on the fly; whatever you experience, you are learning on the fly and adapting on the fly. And that will be the major difference from how AI has been done for everything else.
在機器人技術中,我們真正部署的將是這些學習引擎(learning engines)。因為很多事情都會發生,例如,不考慮人類、其他汽車等等,甚至是最基本的東西——你自己的身體。如果我去健身(workout),我的手臂會感到痠痛。當我要拿起牙刷或水瓶時,我的身體已經不一樣了。因為現在我的身體需要更多的努力(effort),才能達到健身前同樣的輸出(output)。我們的大腦能即時適應這些變化,這些變化可能發生在每一微秒、幾分鐘,甚至幾小時。這就是我認為應該或將會成為的主要差異。當這些人工智慧模型應用於機器人時,與它們在其他領域的應用相比,其他地方的模式很簡單:大腦訓練好後部署(train deploy)。你不必擔心適應的問題,因為隨著硬體(如GPU)變得更好,一切都被照顧好了。任何公司進入市場,你的需求都會被滿足。但在機器人技術中,區別在於你部署的是學習引擎,這是人工智慧一種非常不同的應用,與我們迄今見過的任何東西都不一樣。但我認為,總體來說,機器人人工智慧(robotics AI)與其他數位人工智慧(digital AI)之間的這種區別最終也會消失。我覺得我們現在太常問「人工智慧能為機器人做什麼」這個問題了。
And for robotics, what we will really be deploying are these learning engines. Because many things change; for instance, forget about humans and other cars and so on, take even the most basic thing, your own body. If I go to a workout, my hands are sore afterwards. When I have to pick up a toothbrush or even a bottle, I have a different body now, because my body now requires a lot more effort to get the same output that it did before the workout. Our brain is adapting on the fly to these changes, which happen on every scale from microseconds to minutes to long hours, and this, I believe, is what the main difference should be or will be when these AI models go onto the robot, compared to how they have been applied anywhere else. Everywhere else the story has been simple: train, deploy. You don't have to worry about adaptation, because everything is taken care of for you as the GPUs get better; whatever company comes in, you are taken care of. But in robotics the difference is that you will be deploying learning engines, which is why this is a very different application of AI than anything we have seen so far. But I think in general this distinction between robotics AI and other digital AI will also go away, right? I think we ask the question "what can AI do for robotics" way too much these days.
我們不會問這個問題:「人工智慧(AI)能為什麼做什麼?」對不起,我是說:「機器人技術(robotics)能為人工智慧做什麼?」因為當你在現實世界中採取行動時,你所得到的數據——你有一個假設(hypothesis),採取行動,觀察結果,然後學習——這就是我們學習的方式,對吧?
And we don't ask the question: what can robotics do for AI? Because when you're actually taking actions in the real world, you have a hypothesis, you take an action, you observe the result, and you learn. That's how we learn, right?
我們最近在推理模型(reasoning models)中看到了很多東西,例如,它們在數學和程式碼(code)方面表現得非常出色,因為這些是可以驗證的(verifiable)。你可以去檢查,比如說:「我做對了嗎?」
And we see a lot of things in reasoning models lately, for example, being incredibly good at math, being incredibly good at code, because it's verifiable. You can go and see, like, did I get it right?
而機器人技術(robotics)能讓你對所有事情都做到這一點,這就是我們學習的方式。在這個脈絡中,我想舉一個例子:幻覺(hallucination)。幻覺在大型語言模型(LLMs)中是一個大問題。你有沒有聽過機器人產生幻覺(robots hallucinating)?這根本不是我們會討論的話題。為什麼?因為獎勵(rewards)不會產生幻覺。如果我想知道「把這個瓶子從這裡推到那裡會發生什麼」,我可以直接試試看。它會掉下來,我能親眼看到。我透過互動就能學到。所以,只要我進行互動,互動就是幻覺的敵人,因為當你互動時,幻覺就會消失。然而,當你從被動數據(passive data)中學習,比如數據來自維基百科(Wikipedia),你就無法驗證每一個細節,除非是數學或程式設計這種幻覺問題較少的領域,因為你能實際驗證答案。所以,結果是我們獲得了更多的數據。就像你說的,我們翻轉了金字塔(flipped the pyramid),他是這麼說的吧?現在我們得到的全球數據量遠遠超過了互聯網。我們可以解決今天遇到的所有問題,而我們需要的只是更多的GPU。
Well, robotics kind of allows you to do that for everything. That's how we learn, and in that sense, I think an example is hallucination. Hallucination is a big problem in LLMs. Have you ever heard of robots hallucinating? It's not a topic we even discuss. Why? Because rewards cannot hallucinate. If I want to know what will happen if I push this bottle from here to there, I can just try it. It will drop, I can see it. I learn from my interaction. So since I interact, interaction is the enemy of hallucination, because as you interact, hallucination goes away. Whereas when you're learning from passive data, where the data is coming from Wikipedia, you can't go and verify everything, unless it's math or coding, where hallucination is less of a problem because you can actually verify the answer. So what happens is we get way more data. Like you said, we flipped the pyramid, is that what he says, you flip the pyramid, and now we get this global data being way bigger than the internet. And we can solve all the problems we have today. And all we need is a lot more GPUs.
我認為我們絕對會說這永遠是答案,是的。
I think we'd absolutely say that's always the answer, yeah.
我們都在這裡,對吧?
So we're all here, right?
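A toy sketch of the "interaction is the enemy of hallucination" point made above: the agent makes a prediction, acts, observes what actually happened, and only the verified outcome goes back into its context. The tiny simulated world and all names here are invented for illustration.

```python
import random

class ToyTableWorld:
    """Stand-in for the real world: pushing the object off the edge makes it drop."""
    def __init__(self):
        self.object_on_table = True
    def push_object(self):
        self.object_on_table = False          # the world, not the model, decides the outcome
    def observe(self) -> str:
        return "on_table" if self.object_on_table else "dropped"

def predicted_outcome() -> str:
    """Placeholder for a model's guess; like any generative model, it can be wrong."""
    return random.choice(["on_table", "dropped"])

def interact_and_verify(world: ToyTableWorld) -> list[str]:
    """One hypothesis-action-observation loop; disagreements become corrections."""
    context = []
    guess = predicted_outcome()               # hypothesis
    world.push_object()                       # action
    observed = world.observe()                # verifiable result from the real world
    context.append(f"predicted={guess}, observed={observed}")
    if guess != observed:
        context.append("correction: the real world contradicted the prediction")
    return context

print(interact_and_verify(ToyTableWorld()))
```

The toilet-seat anecdote that follows has exactly this structure: act, observe the real seat state, and feed the verified outcome back into the model's context.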
我想你絕對會有幻覺(hallucination),對嗎?它以不同的方式表現出來,也就是機器人預期的結果與現實世界中發生的事情出現偏差(deviation)。現在,這是可以驗證的,就像程式碼生成(code generation)的幻覺在無法編譯(compile)時是可以驗證的一樣,對吧?但這種幻覺表現在機器人執行了一個不可行的軌跡(trajectory),或者生成了一些東西,而你會發現只要能互動(interact),這些問題就能消失。如果沒有互動的能力,幻覺就永遠不會消失。例如,你永遠無法確定某件事,比如「你住在這個地方嗎?」如果我無法驗證,我就無法修正我的判斷。但在機器人技術中,大多數情況下你能透過互動來修正錯誤。我有一個很實際的例子,去年我們就遇到了這個問題:在辦公室裡沒有人會把馬桶座(toilet seat)放下來。我們用了一個以前的機器人「夏娃」(Eve),它有輪子(wheels),但仍然很靈活。所以我們讓它自主進去檢查馬桶座是向上還是向下。我們對此進行了5040次測試,對吧?
Um, I think you absolutely can have hallucination, right? It manifests in a different way, as a deviation of the robot's expected outcome from what happens in the real world. Now, it's verifiable, in the same way that code-generation hallucinations are verifiable when they don't compile, right? But it manifests in the robot doing a trajectory that's infeasible, or generating something that doesn't match reality, and you realize it can go away because you can interact. If you do not have the ability to interact, it can never go away. For instance, if I cannot verify something, like whether you live at this location, I can never correct my estimate. But in robotics you can mostly correct it, because of interaction. I have a very good practical example, actually, from last year, where we had a problem of no one putting down the toilet seat in the office. We have one of our previous robots, Eve, which is on wheels but still very mobile. So we had it autonomously go in and see if the toilet seat was up or down. And we ran 5040 on this, right?
結果是50%的時間馬桶座是向上或向下。它完全不知道,完全是隨機的(random),無法分辨馬桶座是向上還是向下。這是一個邊緣案例(edge case),因為它通常很擅長處理這些事情。但我們讓機器人繼續去關閉馬桶座。這是一個自主策略(autonomous policy)。它們會四處檢查浴室,如果馬桶座是向上的,就把它放下。這真的很有趣,我們最後也因此玩得很開心。這就是在現實世界中真正閉環(closing the loop)的例子,對吧?現在,模型能得到反饋:馬桶座放下了。我知道馬桶座放下了。我關閉它,我知道它在下面。你告訴我它是向上的?你錯了。
And it was 50% up or down. It had no idea; it was random, it couldn't tell if the seat was up or down. It's kind of an edge case, because it's usually pretty good at these things. But we had the robots go and close the toilet seat. This is an autonomous policy: they go around and check the bathroom and put the toilet seat down if it was up. That was really fun, and we had a lot of fun with it. And that is actually closing the loop in the real world, right? So now the model can get the feedback: the seat is down. I know the seat is down. I closed it, I know it's down. And you told me it was up? You were wrong.
這類似於在其他地方閉環,我們使用人工智慧來互動。例如,API(應用程式介面)或編譯器(compilers)之類的東西,你讓它生成一些結果,然後通過驗證階段(verification phase),你可以將結果反饋到系統的脈絡(context)中。只是這種情況下,閉環稍微慢一點,因為它需要經歷這個過程。
And it’s similar to closing the loop in other places where we’re using AI to interact with. For example, APIs or compilers or things like that, where you have it, emits some result and you put it through a verification phase that you can feedback into the context of the system. It’s just in this case,
現在的問題是,我們不知道如何在一般情況下(general case)做到這一點,對吧?我們可以針對某個特定的事情,比如馬桶座,來設計一個方案。但現在的問題是,你如何為這個問題找到一個通用的公式(formulation),讓所有東西都能在現實世界中被接地(grounding)?我想目前還沒有人知道該怎麼做。
the loop closure is a little bit slower, because it's going through this. Yeah, the problem right now is that we don't know how to do this in the general case, right? We can architect one specific thing, like the toilet seat. The question now is, how do you come up with some formulation of this problem where you're grounding everything in the world? No one quite knows how to do that yet, I guess.
在現實世界中反覆學習將會是痛苦且緩慢的,對嗎?我們可以在現實世界中學習這些東西,因為如果把東西丟下去會有後果(consequences)。重力(gravity)會讓它掉落,你能看出發生了不好的事情,對吧?但是我們用實體機器人(physical robot)探索的速度,這又回到了數據混合(blend of data)的問題,對吧?我的意思是,你可以做這些真正令人興奮的小事情,但你需要做多少千次或百萬次這樣的事情,才能擁有足夠的數據(data)?所以我認為這個問題仍然是:我們能負擔得起在現實世界中生成這些數據嗎?
The rate of learning in the real world is going to be painfully slow, right? We can learn these things in the real world because there are consequences; if you drop something, gravity makes it fall, and you can tell something bad happened, right? But the rate at which we can explore with a physical robot, that's back to the blend of data, right? You can do these really exciting, small things, but how many thousands or millions of those things do you need to do before you have enough data? So I think the question really is still: can we afford to produce that data in the real world?
不過你也有一些願景(vision)。模擬(simulation)同樣很有吸引力,所以我認為那是另一個你可以取得的數據來源。是的,模擬也需要更多的GPU。好吧,我們時間快到了,我真的很想用這個問題來收尾,我對此很好奇:在未來2到5年內,你覺得這會走向何方?我會把這個問題留得模糊(vague),你們可以隨意回答。
But OK, you have some vision there, too. Simulation is also attractive, so that's another source of data you can have. Yeah, and simulation also needs more GPUs. All right, we're getting close to time, and I really want to end with this question; I'm very curious about it. In the next 2 to 5 years, where do you see this heading? I'm going to leave that as a vague question. Answer it how you will.
我很希望由你先開始。2到5年,考慮到目前這個領域的發展速度(velocity),這是一個相當大的範圍(range)。好,我要耍點小聰明,先說我認為需要10年時間,這一切才會完全實現。要說10年後這會是什麼樣子反而很容易,對吧?
I'd love for you to start, Birds. So, 2 to 5 years; that's a pretty big range given the current velocity of the field. OK, I'm going to cheat and start by saying I think it's going to take 10 years before this fully comes out. And it's very easy to say where this will be in 10 years.
我認為到那時,我們的社會將會經歷一次與幾百年前電力帶來的那種改變相當的轉變。我們現在覺得早上按下電燈開關(light switch)是理所當然的事情,這種改變將同樣發生在數位和實體勞動(labor across digital and physical)領域。這真是一個有趣的、值得活著的時代。我認為我們可以真正專注於什麼是讓我們成為人類(what makes us human)的核心。在我們正在創造的社會裡,我希望5年內我們能達到這個目標。我認為這是個雄心勃勃的目標,我們會努力去實現。我覺得現在沒有人能確定,這真的取決於社會採用機器人(robots)的速度有多快,以及我們能多快擴大製造業(scale manufacturing)的規模。我們現在有點像是站在「有用」的臨界點(cusp)上,對吧?所以我想說,以我們現在的產品為例,它目前在家庭中已經是有用的(useful)。它並不完美,不是說你完全不需要自己動手,但它確實有用而且有趣。從這裡開始,你可以加速發展。希望它不會像自動駕駛汽車那樣,比我們預想的再多花十年時間。但我確實認為,在3到5年內,機器人技術幾乎會普及到大多數人中間。即使不是每個人都擁有機器人,人們也會認識某個擁有機器人的人,它們會逐漸成為社會的一部分,從消費者和家庭(consumers and homes),到工廠(factories)、物流(logistics)等各個領域。接下來換你。有人說,人們常常高估短期內的進展(short-term progress),卻低估長期內的進展(long-term progress)。這句話可能出自比爾·蓋茨(Bill Gates)或某個人,所以我要先聲明(disclaimer)這只是引用。但我認為機器人人工智慧(robotics AI)有一個獨特之處,讓它與大型語言模型(LLMs)或視覺語言模型(VLMs)不同。語言模型必須幾乎完全解決問題(solve the problem to completion)才能真正變得有用,無論是寫程式、一般寫作(general writing)或其他任務,它都必須非常非常出色。我們更早以前就有還不錯的系統,但在達到非常高的性能(high performance)之前,它們都不算真正有用。但對於機器人技術來說,這並不完全適用。因為我們不需要完全解決機器人技術的所有問題,機器人就能變得有用(useful)。就像今天,已經有數十萬甚至數百萬台機器人被部署(deployed)在外面。我們今天使用的許多東西都是由機器人製造的,對吧?它們已經存在,已經融入生活中。所以這裡的關鍵是什麼?在機器人技術中,關鍵在於任務的劃分(division of tasks)。一個能在任何地方解決所有任務的機器人可能還很遙遠,我不會對此做任何預測。但我們會開始看到一些機器人,它們可能解決幾個任務,或特定領域的任務,甚至只是單一任務,作為專才(specialist),而這些已經非常有用了。因為有些任務很難找到勞動力(labor)或僱到人來完成。我前幾天和一些公司聊過,他們因為勞動力短缺(shortage of labor)而把已退休的人請回來工作。針對這些特定的專門任務(specialized tasks),專用機器人會更快出現,其餘的則還需要時間。但在機器人技術中,有用性從第一天就開始了(useful from day one),這一點和語言模型不同。而如果自動駕駛汽車(autonomous cars)不是因為危險,那個問題早就算解決了。
I think then we're going to have the same kind of change in society that we had a few hundred years ago with electricity, where we now just take it for granted when you flip the light switch in the morning. That's going to happen to labor across digital and physical, and man, what an interesting time to be alive. I think we can really get to focus on what makes us human in the society we're creating. Five years? I hope we are there; I think that's ambitious, and we're going to push for it. I don't think anyone knows at this point. I think it really depends on how fast society adopts robots and how fast we can scale manufacturing. We're kind of at the cusp of useful, right? So I would say the product we have, just as an example, is currently useful in a home. It's not perfect; it's not like you don't need to do anything yourself, but it's useful and it's fun. And then you can kind of start accelerating from there, and hopefully it's not like autonomous cars and doesn't take a decade longer than we think. But I do think in 3 to 5 years it is pretty much out there amongst most people. Even if not everyone has a robot, people know someone who has a robot, and they're generally part of society, across everything from consumers and homes into factories, logistics, everything else. You can go next. So, there's a saying that people often overestimate progress in the short term but underestimate progress in the long run, and this is probably by Bill Gates or someone, so this is a disclaimer. But I think one thing unique about robotics AI that makes it different from LLMs or VLMs is that an LLM has to really solve the problem almost to completion to be really useful; whether it's coding, general writing, or anything, it has to be really, really good. We had decent systems earlier, but until you reach really high performance, they are not useful. That's not quite true for robotics AI, because we don't have to solve robotics fully for robotics to be useful. Even today, there are hundreds of thousands, even millions, of robots already deployed out there. Many of the things we use today are made by robots, right? They are already out there; they're already here. So what is the key part? The key thing in robotics is the division of tasks. The robot that solves all tasks everywhere may be farther out, so I won't make any prediction for that, but we will start seeing robots that solve a few tasks, or domain-specific tasks, or even a single task, as specialists, and even those are super useful. Because there are several tasks for which it is very difficult to find labor or to hire. I was talking to some companies the other day who are bringing people back from retirement because there is a shortage of labor. For those specific, specialized tasks, dedicated robots will come much quicker, and the rest will take longer. But in robotics, usefulness starts from day one, unlike language. And if autonomous cars weren't dangerous, that problem would have been solved already.
你可以在2015年坐進一輛載你四處跑的車(car),它的表現相當不錯。從某種方式開始,然後它並不是人類的劍(human sword)。好吧。
You could get into a car that drove you around in 2015. It was doing pretty well. Start from in one way and then it’s not the human sword. Ok.
是的,我認為這是挑戰的一部分,因為採用(adoption)不僅僅是技術問題(technological problem)。它還涉及安全性(safety)、社會接受度(societal adoption)等因素,所以這些也會發揮作用。
Yeah, I think that's part of the challenge: adoption is not just a technological problem. Things like safety and societal acceptance also play a factor. And so,
在3到5年內,我們可能會看到某些領域(areas)的機器人(robots)數量大幅增加,而在我們預期的一些領域中機器人數量卻比預想少得多。
In 3 to 5 years, what we might see is that there’s a lot more robots in certain areas and a lot fewer robots in certain areas that we expected.
不過,我認為重要的是,我們正真切地看到機器人技術達到一個頂峰(culmination)。從歷史上看,機器人一直是單一用途(single purpose)的,但現在有一個概念逐漸形成,人們幾乎開始期待它們能實現多用途(multi-purpose),或許不是通用的(general purpose),但至少是多用途。這正成為人們的期待(expectation)。我們透過這些新的基於人工智慧的平台(AI-based platforms)展示的是:嘿,一件硬體(piece of hardware)可以有效完成不止一件事。我認為這一點,就像他說的,將在未來3到5年內持續下去。這種期待成為人們正在努力達到的新的基準線(water line)。你們所有人都看到了這一點,並且接受了它,這很好,對吧?因為現在你們將把這種期待帶入社會文化(social culturally)中向前推進,並開始說:「嘿,為什麼我不能擁有一個在家裡能做三四件事的機器人,或者在我的情況下,在倉庫(warehouse)或物流設施(logistics facility)中能做五到十件事?事情就應該是這樣的。」我認為真正推動這一切的是人們對這些東西的渴望(wanting those things),這驅動了投資(investment)和專注(focus),讓我們實現這些目標。
And I think the important thing, though, is that we are really seeing the culmination of robots going from being, historically, very single-purpose to this notion that it's almost expected they can be multi-purpose; maybe not general-purpose, but multi-purpose. That's becoming people's expectation. What we're able to show with these new AI-based platforms is that, hey, a piece of hardware can do more than one thing effectively. And I think that, as he said, will hold over the next 3 to 5 years: this expectation is the new water line that people are building towards. And all of you are now seeing this and buying into it, which is great, right? Because now you're going to be carrying that expectation forward, socially and culturally, and saying, hey, why can't I have a robot that does three or four things in a home, or in my case five or ten things in a warehouse or logistics facility? That should be how it is. And I think that's what really drives it: people wanting those things drives investment and focus into getting us those things.
當人們問這個問題時,他們其實想要一個非常具體、明確的答案,例如「我會在某個日期擁有一個機器人(robot),對吧?它會做所有這些事情。」但我認為真正的問題在於,大家對期望值(expectations)的基準(level setting)並不一致。所以我通常會問的問題是:「我們什麼時候會有一個像我們的汽車(car)一樣對我們有價值的人形機器人(humanoid robot)?」
You know, when people ask this question, they're looking for something really concrete and specific, like, I'm going to have a robot at this date, right, and it's going to do all these things. And I think the real problem with this is that there's little level setting on what the expectations are for everybody. So the question I usually ask is: when will we have a humanoid robot that is as valuable to us as our car?
我不知道,對吧?我們的汽車每天在最極端的天氣(extreme weathers)下都能運作,考慮到製造它所投入的材料和努力(material and effort),它的成本幾乎微不足道(costs hardly anything)。而且即使是汽車,它本身的價值可能都還比不上人形機器人能為我們生活增添的價值(add to the value of our lives)。我也認為這需要10年或更長的時間(10 years or longer),而這是典型的技術專家式回答(technologist's answer)。如果你問創始人(founder),他們會說明年(next year);如果你問技術專家(technologist),他們會說大約10年。10年只是意味著我們很難具體量化(quantify specifically)你將會擁有什麼。我認為我們應該關注的是進展的速度(rate of progress)和各個領域的灘頭陣地(beachheads)。這裡的每個團隊都在不同的領域建立了一個有意義的灘頭陣地(meaningful beachhead)。隨著時間推移,這些領域會成長,從一堆分散的點(a bunch of dots)開始擴展。例如,Agility在倉庫(warehouse)中解決問題,我們在人們的家中有了機器人,我們還會在汽車工廠(automotive factories)中工作。我認為從這些灘頭陣地中,你會看到成長(growth),對吧?這不會是一夜之間(overnight)發生的事情。我不認為這裡的任何人能準確預測(predict)5年後的未來,說出我們會在哪裡。但我相信我們會看到這種成長。很快,這些領域會開始重疊(overlapping),然後有一天我們會擁有自動駕駛汽車(autonomous car)。回顧那個市場的歷史(history of that market),很多人嘲諷當年對我們何時會有自動駕駛汽車的預測(predictions)有多糟糕。我認為這很大程度上來自於社群中某些人(elements in the community)對其實現速度所做的過於樂觀的聲明(statements)。但我很感激我的車有自動車道輔助(autonomous lane assist),不會撞上前面的車,還能防止我倒車撞到東西(backing into something)。所有這些神奇的功能(magical stuff)都源自於擁有自動駕駛汽車的夢想(dream of having an autonomous car)。順便說一句,你現在就可以叫到自動駕駛計程車(autonomous taxi),雖然這花了比預期更長的時間(took a little longer),人形機器人也會是如此。我認為,只要這個社群保持興奮(excited)、積極投入(leans in),並意識到這是一場長期的比賽(long game),我們就能做出在商業環境(commercial setting)中提供價值的專才機器人(specialist robots)。我認為在未來1到2年內我們就會實現這一點,Agility已經在向這個領域交付機器人(delivering robots)。當我們讓這些機器人能完成10、15、20項任務時,那大概會是在接下來的5年之內。而要解決我們在所有行業中想像到的所有問題(solve all of the problems across all of the industries),我認為我們需要繼續夢想(keep dreaming)、繼續努力(keep working),這個行業還需要再投入幾十年的精力(many more decades),直到我們解決所有那些邊緣案例(edge cases)。
I have no idea, right? Our car works every day in the most extreme weather. It costs hardly anything given the amount of material and effort that goes into it. And even then, the car by itself isn't quite touching the value a humanoid robot might add to our lives. I'm also in the 10-years-or-longer camp, and I think that's the typical technologist's answer. If you ask a founder, they're going to say next year; if you ask a technologist, they're going to say it's about 10 years. Ten years just means it's really hard for us to quantify specifically what you're going to have. I think the thing we should be focused on is the rate of progress and where the beachheads are. Each one of the groups up here is establishing a meaningful beachhead in a different area, and over time those things are going to grow; that space is going to grow from a bunch of dots, right? Agility is solving problems in a warehouse, we have robots in people's homes, we're going to be working in automotive factories, and I think from each one of these beachheads you're going to see growth, right? It's not going to be an overnight thing. I don't think anybody here can predict 5 years into the future and say exactly where we're going to be, but I think we're going to see this growth. And pretty soon all those things are going to start overlapping, and one day we're going to have an autonomous car. When you look back at the history of that market, there's a lot of shade thrown at how badly people predicted when we would have an autonomous car, and I think a lot of that came from statements that some elements in the community were making about how quickly it would come. But I'm pretty grateful that my car does autonomous lane assist, doesn't hit the car in front of me, and prevents me from backing into something. All that magical stuff came out of the dream of having an autonomous car. By the way, you can get an autonomous taxi right now; so yeah, it took a little longer, and so will humanoid robots. And I think as long as the community is excited, leans in, and realizes this is a long game, we'll get to specialist robots that are delivering value in a commercial setting. I think we're going to have that in the next 1 or 2 years; Agility is already delivering robots into the space. When we get those robots doing 10, 15, 20 tasks, that's going to be in the next 5 years. But when we're going to solve all of the problems we imagine across all of the industries, I think we need to keep dreaming and keep working, and this industry is going to have to keep putting energy in for many more decades, until we've solved all those edge cases.
我真的很喜歡剛才那句話:人們往往高估短期(short term)的進展,低估長期(long term)的發展。所以讓我把這分成短期和長期來談。我認為在未來2到5年,從技術角度(technical perspective)來看,我們都能開始研究具身的縮放法則(embodied scaling laws)。我認為大型語言模型(large language model)領域最重大的時刻,就是最初的縮放法則(scaling law):基本上是一條指數曲線,你投入更多運算資源(compute),擴展數據量(amount of data),增加參數數量(number of parameters),然後你會看到智能(intelligence)不斷上升。但我認為我們在機器人技術(robotics)中還沒有類似的東西,因為縮放法則在機器人領域太複雜了(complicated)。在機器人技術中,你可以擴展模型規模,可以擴展硬體艦隊(hardware fleet)和真實機器人數據(real robot data),對吧?
I really like the saying from a moment ago, that people tend to overestimate the short term and underestimate the long term. So let me break it down into short term and long term. I think in the next 2 to 5 years, from a technical perspective, we will all be able to start studying the embodied scaling laws. I think the biggest moment in large language models was the original scaling law: basically that exponential curve where you put in more compute, you scale the amount of data, you scale the number of parameters, and you just see intelligence going up. I don't think we have anything like that for robotics yet, because the scaling law is so complicated for robotics, right? You can scale across models, you can scale across the hardware fleet and the real robot data, right?
那模擬數據方案(simulation data scheme)呢?互聯網數據方案(internet data scheme)呢?還有神經模擬(neural simulation)、神經夢境(neural dreams)呢?當你生成大量視頻(lots and lots of videos)時,縮放法則又會是什麼樣子?
And how about the simulation data scheme? How about the internet data scheme? How about neural simulation, the neural dreams? What's the scaling law as you are generating lots and lots of videos?
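To make "studying an embodied scaling law" concrete, here is a minimal sketch that fits a power law of policy error versus training compute and extrapolates it; every number below is invented purely for illustration and says nothing about any real robot or model.

```python
import numpy as np

# Hypothetical measurements: training compute (GPU-hours) vs. policy error rate.
compute = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
error   = np.array([0.52, 0.33, 0.21, 0.13, 0.082])

# Fit error ~ a * compute**(-b), i.e. a straight line in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(error), deg=1)
a, b = np.exp(intercept), -slope
print(f"fitted scaling law: error ~ {a:.2f} * compute^(-{b:.2f})")

# Extrapolate: how much better would the policy be with 10x more compute?
print(f"predicted error at 1e7 GPU-hours: {a * 1e7 ** (-b):.3f}")
```

A plot of that fitted curve, with separate curves for each data source (real fleet data, simulation, internet video, neural world models), is roughly the kind of "this many GPUs buys this much better a robot" chart being described here.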
我們將能夠研究所有這些東西,所以也許5年後,甚至更早,我們就能在螢幕上看到這樣的圖表(plot),清楚知道你買了多少GPU(圖形處理單元,GPUs),機器人(robot)會變得多好。這就能定量(quantify)回答這個問題。這在短期內很快就會實現。現在,讓我們談談20年後會發生什麼。每次我在筆記型電腦前熬夜(stay up late),機器人總會開始做一些奇怪的事情(doing something weird),我就會覺得好沮喪(frustrated)。這時我就會想,20年後會發生什麼,然後我就能繼續下去,對吧?所以,20年後有幾件事讓我特別興奮(super excited),而且我覺得這些並不是那麼遙遠(not that much far away)。其中之一是機器人技術正在加速科學發展(accelerating science)。我有一些在生物醫學(BioMed)領域的朋友,做一個實驗(experiment)實在太耗時(time consuming)、太費力(laborious)了。就像所有博士生(PhD students)都需要待在實驗室(lab)裡,照顧那些老鼠(mouse)或細胞培養皿(vicious of cells),對吧?
So we will be able to study all of these things, so that perhaps 5 years from now, or sooner, we'll have that plot on the screen where you know exactly, for however many GPUs you buy, how much better the robot will be. Then we can answer that question quantitatively. That will happen fairly soon. Now, let's talk about what's going to happen in 20 years. You know, every time I stay up late at the laptop and the robots are doing something weird, I'm like, oh, so frustrated; let me think about what's going to happen in 20 years, and then I can carry on, right? So, 20 years from now, there are a couple of things that I'm super excited about that I think are not that far away. One is robotics accelerating science. I have some friends in BioMed, and it's just so time-consuming and laborious to do one experiment. All those PhD students need to be in the lab, right, attending to the mice, to all those dishes of cells?
我們把所有這些都自動化(automate)怎麼樣?對,自動化科學(automating science),或許到時所有的醫學研究(medical research)就不需要花費10億美元($1 billion)。它會被規模化(scaled up),因為我們有了這個加速物理世界(physical world)的API,對吧?利用智能(intelligence)來實現,也許那會是我們模型的第十個版本或什麼的,我希望如此。所以這是我超級興奮(super excited)的一件事。另一件事是機器人技術自動化機器人本身(robotics automating robotics itself)。
How about we automate all of that? Right, automating science, so that maybe all the medical research will not cost $1 billion to do. It will be scaled up, because we get this API to accelerate the physical world, right? Using intelligence; perhaps that will be version ten of our model or something, I hope. So that is one thing I'm super excited about. And the other thing is robotics automating robotics itself.
對啊,為什麼不讓機器人互相修理(robots fixing each other)呢?我們看到那些大工廠(big factories)在製造機器人,但如果讓機器人自己組裝(assembly)下一代機器人(next generation of the robots)呢?
Right so why don’t we have robots fixing each other? Right so we see all of those big factories making the robots, but how about what the robots themselves assembly, the next generation of the robots?
我不認為這是科幻(science fiction),因為實際上大型語言模型(LLM)社群已經領先我們了,可惜的是。在那個社群裡,人們正在研究AutoML,意思是:我們能不能提示(prompt)大型語言模型去做深入研究(deep research),去尋找下一個最好的變換器(transformer),尋找智能本身(intelligence itself)的下一個最佳架構(architecture)?而且就在我們說話的此刻,人們正積極地在做這件事(actively doing this)。也許大型語言模型那邊會先解決這個問題,然後我們會抄他們的作業(copy their homework),讓物理世界也實現這種遞迴自我改進(recursive self-improvement),不是嗎?我認為這會發生,對吧?不是在100年後,而是在20年內,這絕對會發生(definitely gonna happen)。所以我要以一個樂觀的音符(bright note)結束。我認為我們這一代人(generation),包括我們所有人,生得太晚(born too late)無法探索地球(explore the earth),生得太早(born too early)無法前往其他星系(travel to other galaxies),但我們正好及時出生(born just in time)來解決機器人問題(solve robotics)。而所有會移動的東西(everything that moves),最終都將實現自主(autonomous)。
And I don’t think this science fiction at all because actually in the LOM community began they are ahead of us. Unfortunately, but in our community people are studying AutoML, meaning that can we be prompt? It’s our arms to do deep research, to find the next best transformer to find the next best architecture for intelligence itself and people actively doing this as we speak. And probably OM is gonna solve that first and then we’re gonna copy your homework, and we’ll have the physical world doing this recursive self-improvement as would we go? And I think that’s gonna happen, right? Not in 100 years, only in 20 years. That’s definitely gonna happen. So I’m going to end on a bright note. I think we, as a generation and all of us, were born too late to explore the earth, were born too early to travel to other galaxies, were born just in time to solve robotics. And everything that moves will be the top.
我認為這是最好的結束音符(best note to end on)。非常感謝我們的小組成員(panelists)來到這裡分享你們的想法,不僅是關於我們現在的處境(where we are now),還有我們未來的方向(where we are headed)。在大家離開之前要知道,我們不會進行傳統的問答(Q and A),但我們會拿掉我們的麥克風(remove our mics),回到這裡。對於任何感興趣的人(anyone who’s interested),請隨時上台(come up to the stage),你可以直接向小組成員提問(ask your questions directly)。所以我們要先下台(go off stage)拿掉麥克風(get our mic removed),然後回來回答任何問題。請上台(come up to the stage)。
I think that's the best note to end on. Thank you all so much to our panelists for coming and sharing your thoughts, not only on where we are now, but on where we are headed. Just to note before everyone heads out: we aren't doing a traditional Q and A, but we are going to get our mics removed and come back here, and anyone who's interested is welcome to come up to the stage and ask your questions directly to the panelists. So we're just going to go off stage to get our mics removed; we'll be back for any questions. Come up to the stage.