
Introduction to NVIDIA Cosmos World Foundation Models [S72431]
NVIDIA Cosmos 世界基礎模型簡介 [S72431]
https://www.nvidia.com/en-us/on-demand/session/gtc25-s72431/?playlistId=playList-79f11c8a-c249-4bc5-b27f-a8ebcf18497a
Ming-Yu Liu, Vice President of Generative AI Research, NVIDIA
劉明宇,NVIDIA 生成式人工智慧研究副總裁

NVIDIA Cosmos is changing how robots and autonomous vehicles learn, using world foundation models to accelerate physics-aware synthetic video generation. Join Ming-Yu Liu, NVIDIA's Vice President of Generative AI Research, to learn how Cosmos is democratizing physical AI development, giving developers open models and tools to build custom world models faster than ever.

NVIDIA Cosmos 正在改變機器人和自動駕駛汽車的學習方式,利用世界基礎模型加速具物理感知的合成影片生成。與 NVIDIA 生成式 AI 研究副總裁 Ming-Yu Liu 一起了解 Cosmos 如何實現物理 AI 開發的民主化——為開發者提供開放模型和工具,以前所未有的速度建立自訂世界模型。

I'm glad to be here. Thank you for attending this session. Today, I’m excited to introduce the Cosmos World Foundation Model platform that we are building, which is open to all. Our target audience is physical AI developers. Physical AI is a trending topic, and it’s not just about AI living inside your computer. It needs to interact with the physical world. However, when AI interacts with the real world, there’s a risk of causing unintended consequences, right?

很高興能來到這裡。感謝各位參加這場演講。今天,我很高興向大家介紹我們正在打造的Cosmos世界基礎模型(Cosmos World Foundation Model)平台,這個平台是開放給所有人的。我們的目標對象是物理人工智慧(Physical AI)的開發者。物理人工智慧是一個熱門話題,它不僅僅是存在於你的電腦中的人工智慧。它需要與現實世界互動。然而,當人工智慧與現實世界互動時,存在造成意外後果的風險,對吧?

We believe it’s not responsible to deploy premature AI systems that operate unchecked in the physical world, as this could have negative impacts on markets and safety. Instead, we need smarter approaches. We believe physical AI must first be trained extensively in a digital environment before it interacts with the real world. There are two critical components to enabling physical AI. The first is the digital training of robots. These robots run models, often called perception models, which process sensory data to trigger actions like motor control. If you attended our general keynote, you may have heard about our latest perception model, which is designed for humanoid robots.

我們認為,部署尚未成熟且未經充分測試的物理人工智慧系統是不負責任的行為,因為這可能對市場和安全性產生負面影響。我們需要更智慧的解決方案。我們相信,物理人工智慧必須先在數位環境中進行廣泛的訓練,然後才能與現實世界互動。實現物理人工智慧有兩個關鍵組成部分。第一個是機器人的數位訓練。這些機器人運行模型,通常稱為感知模型(Perception Models),這些模型處理感測數據以觸發動作,例如馬達控制。如果你參加了我們的總體主題演講,你可能聽說過我們最新的感知模型,這個模型是為人形機器人(Humanoid Robots)設計的。

The second critical component is a digital twin of the world, which we call a world simulator. With this world simulator, your physical AI doesn’t need to interact directly with the real world during development. Instead, it can engage with the simulator, significantly reducing development costs and risks. This is where the Cosmos World Foundation Model comes in. Our goal is to empower all of you to build high-quality world simulators for your physical AI applications.

第二個關鍵組成部分是世界的數位孿生(Digital Twin),我們稱之為世界模擬器(World Simulator)。有了這個世界模擬器,你的物理人工智慧在開發過程中無需直接與現實世界互動。相反,它可以與模擬器互動,這大大降低了開發成本和風險。這就是Cosmos世界基礎模型(Cosmos World Foundation Model)的用武之地。我們的目標是讓你們都能為自己的物理人工智慧應用建立高品質的世界模擬器。

The Cosmos World Foundation Model platform includes three main components. The first is pre-trained world foundation models. These models are trained on large-scale, open-domain video datasets, covering diverse scenarios like autonomous driving, human-object interactions, and more. As a result, these models have a broad understanding of the physical world. The second component is post-training scripts. These scripts allow you to fine-tune the pre-trained Cosmos foundation models for your specific applications. Every physical AI system has unique sensor configurations—some have two cameras, others have six, and some include LiDAR sensors. The way the world is perceived varies across these systems, so customization is essential. We provide these scripts to help you adapt the pre-trained models to your needs.

Cosmos世界基礎模型平台包含三個主要組成部分。第一個是預訓練世界基礎模型(Pre-trained World Foundation Models)。這些模型在大型、開放領域的影片數據集上進行訓練,涵蓋了多樣化的場景,例如自動駕駛、人與物體的互動等等。因此,這些模型對物理世界有廣泛的理解。第二個組成部分是後訓練腳本(Post-training Scripts)。這些腳本讓你能夠針對特定應用對預訓練的Cosmos基礎模型進行微調。每個物理人工智慧系統都有獨特的感測器配置——有些有兩個攝影機,有些有六個,還有些包括雷射雷達(LiDAR)感測器。不同系統感知世界的方式各異,因此客製化是不可或缺的。我們提供這些腳本來幫助你根據需求調整預訓練模型。
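
To make the post-training idea concrete, here is a minimal fine-tuning sketch in PyTorch. It assumes a hypothetical pre-trained `model` that returns its training loss and a user-supplied dataset of (clip, condition) pairs; the actual Cosmos post-training scripts are more involved than this.

```python
import torch
from torch.utils.data import DataLoader

def post_train(model, dataset, epochs=1, lr=1e-5, device="cuda"):
    """Fine-tune a pre-trained world model on application-specific clips."""
    model = model.to(device).train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=2, shuffle=True)
    for _ in range(epochs):
        for clips, conditions in loader:   # video tensors + text/sensor conditioning
            loss = model(clips.to(device), conditions)  # assumed to return its loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```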

The third component is the video data curation pipeline. Videos are dense with information—think millions of pixels and tokens. To build high-quality models, you need access to vast amounts of video data, even for post-training. Processing these videos is computationally intensive. Over the past decade, we’ve built libraries to streamline video processing, and we’re making these tools available to you. The Cosmos World Foundation Model platform, including all these components, is open source, so you can leverage them for your data processing needs. This is the vision we have for empowering developers to create robust physical AI systems.

第三個組成部分是影片數據整理管線(Video Data Curation Pipeline)。影片資訊非常密集——想想數百萬像素和標記。要建立高品質的模型,你需要大量的影片數據,即使是在後訓練階段。處理這些影片需要極高的計算能力。在過去十年中,我們建立了簡化影片處理的函式庫,現在我們將這些工具提供給你們。Cosmos世界基礎模型平台,包括所有這些組成部分,都是開源的(Open Source),讓你能夠利用它們來滿足你的數據處理需求。這是我們賦能開發者打造強大物理人工智慧系統的願景。

With Cosmos, you can use your own datasets tailored to your specific applications. We don’t dictate the scripts or models you should use for your applications—whether it’s for autonomous vehicles, robot dogs, humanoid robots, or other systems. The flexibility is yours to define what works best for your needs.

有了Cosmos,你可以使用為特定應用量身定制的自己的數據集。我們不會規定你應該用於應用的腳本或模型——無論是用於自動駕駛車輛、機器狗、人形機器人(Humanoid Robots)還是其他系統。你可以自由定義最適合你需求的方案。

I also want to highlight the media industry, where visualization tools are critical for developers. At NVIDIA, we've built DGX systems and large-scale GPU clusters to support the creation of foundation models. Cosmos is built on NVIDIA's AI infrastructure, paired with our ecosystem partners. This combination delivers powerful simulators with capabilities like 3D conditioning, photorealism, and advanced imagination features. These models can be optimized on NVIDIA's infrastructure to run efficiently on edge devices.

我還想強調媒體行業,視覺化工具對開發者至關重要。在NVIDIA,我們打造了DGX系統和大型GPU集群來支持基礎模型的創建。Cosmos建立在NVIDIA的AI基礎設施上,並與我們的生態系統合作夥伴結合。這種組合提供了強大的模擬器,具備3D調節(3D Conditioning)、照片真實感(Photorealism)和先進的想像功能。這些模型可以在NVIDIA的基礎設施上優化,以在邊緣設備(Edge Devices)上高效運行。

Our curation pipeline ensures you can process data effectively, which is especially important for physical AI developers. We've collected vast amounts of driving videos from dashcams, human-object interaction videos, and data capturing natural dynamics to teach AI how the physical world works. These datasets span various categories that we believe are essential for physical AI development, and they're accessible through our open-source curation pipeline.

我們的數據策劃管線(Curation Pipeline)確保你能有效處理數據,這對物理人工智慧(Physical AI)開發者尤為重要。我們收集了來自行車記錄儀的大量駕駛影片、人與物體互動影片,以及捕捉自然動態的數據,用以教導人工智慧物理世界的運作方式。這些數據集涵蓋了我們認為對物理人工智慧開發至關重要的多個類別,並可通過我們的開源策劃管線(Open-source Curation Pipeline)存取。

The curation pipeline is open source and available for you to use. Let me give you an example. Suppose you have a one-hour video you want to use to train your physical AI. Most of the time, that video might contain mundane or repetitive content, right? Our pipeline starts by breaking the one-hour video into shorter clips. This requires encoding the video to identify meaningful segments—discarding parts that are too dark or irrelevant. We then categorize the clips into specific physical AI categories to control the proportion of data used for training.

策劃管線是開源的,你可以自由使用。讓我舉個例子。假設你有一段一小時的影片,想用來訓練你的物理人工智慧。大多數時候,這段影片可能包含單調或重複的內容,對吧?我們的管線首先將這一小時的影片分解成較短的片段。這需要對影片進行編碼,識別有意義的片段——丟棄過暗或無關的部分。然後,我們將這些片段分類到特定的物理人工智慧類別中,以控制用於訓練的數據比例。
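
As a toy illustration of these first curation steps (not the actual Cosmos pipeline), the sketch below splits a long video into clips at abrupt frame changes and drops clips that are too dark; the thresholds are made-up values.

```python
import cv2
import numpy as np

def split_and_filter(path, diff_thresh=40.0, dark_thresh=20.0):
    """Split a video at crude shot boundaries, then drop overly dark clips."""
    cap = cv2.VideoCapture(path)
    clips, current, prev = [], [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        # Start a new clip when the frame content changes abruptly.
        if prev is not None and np.abs(gray - prev).mean() > diff_thresh:
            clips.append(current)
            current = []
        current.append(frame)
        prev = gray
    cap.release()
    if current:
        clips.append(current)
    # Keep only clips whose average brightness suggests usable content.
    def brightness(clip):
        return np.mean([cv2.cvtColor(f, cv2.COLOR_BGR2GRAY).mean() for f in clip])
    return [c for c in clips if brightness(c) > dark_thresh]
```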

Some clips might not be ideal for learning the physics of the world—like those with heavy text overlays—so we filter them out. We also use vision-language models to describe the video content, enabling more controllable data generation for training. Our database contains a massive collection of videos, but many may be similar. To avoid wasting GPU resources on redundant data, we represent each video as a vector and remove duplicates based on similarity. We also prioritize videos with optimal duration and resolution to maximize GPU processing efficiency.

有些片段可能不適合學習世界的物理特性——例如帶有大量文字覆蓋的片段——所以我們會將它們過濾掉。我們還使用視覺語言模型(Vision-Language Models)來描述影片內容,從而實現更可控的數據生成以用於訓練。我們的數據庫包含大量影片,但許多影片可能相似。為了避免在重複數據上浪費GPU資源,我們將每個影片表示為一個向量,並根據相似性去除重複內容。我們還優先選擇具有最佳時長和解析度的影片,以最大化GPU處理效率。
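
The deduplication step can be pictured as below: assuming each video has already been reduced to an embedding vector by some encoder (out of scope here), a clip is kept only if it is not too similar to anything already kept. The 0.95 threshold is illustrative.

```python
import numpy as np

def deduplicate(embeddings, threshold=0.95):
    """Greedy dedup: keep index i only if its cosine similarity to every
    previously kept embedding stays below the threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, v in enumerate(normed):
        if all(float(v @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept  # indices of videos worth spending GPU time on
```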

Now, let’s talk about the Cosmos World Foundation Models. As I mentioned earlier, our platform has three main components. In the world foundation model suite, we’ve developed three distinct models. The first, which we introduced at a conference two months ago, is called Cosmos Predict. This is a world simulator that predicts future states based on current observations. It helps your AI anticipate what will happen next in the physical world.

現在,讓我們來談談Cosmos世界基礎模型(Cosmos World Foundation Models)。正如我之前提到的,我們的平台有三個主要組成部分。在世界基礎模型套件中,我們開發了三個不同的模型。第一個是兩個月前在會議上介紹的,名為Cosmos Predict。這是一個世界模擬器(World Simulator),根據當前觀察預測未來狀態。它幫助你的人工智慧預測物理世界中接下來會發生什麼。

The second model, Cosmos Transfer, was showcased earlier this year. It enables the transfer of knowledge from one domain to another—for example, from synthetic environments to real-world scenarios. These synthetic domains can be generated using NVIDIA’s Omniverse platform. The third model is Cosmos Reason, a reasoning model designed specifically for physical AI. While it’s not optimized for tasks like Olympic-level math or coding, it’s built to help robots navigate and interact effectively in the physical world.

第二個模型Cosmos Transfer在今年早些時候展示。它能夠將知識從一個領域轉移到另一個領域——例如,從合成環境到現實世界場景。這些合成領域可以使用NVIDIA的Omniverse平台生成。第三個模型是Cosmos Reason,一個專為物理人工智慧設計的推理模型(Reasoning Model)。雖然它不擅長像奧林匹克級別的數學或編碼這樣的任務,但它旨在幫助機器人在物理世界中有效導航和互動。

Let me paint a mental picture. You have three models: Cosmos Predict, Cosmos Transfer, and Cosmos Reason. Cosmos Predict acts as a world simulator. Based on the current state—say, a sequence of video inputs from a robot’s sensors—it predicts the next state. For example, given a video feed up to the present moment, the model might output a control signal, text description, or the next frame to anticipate what happens next. This is incredibly useful for applications like autonomous navigation or robotic planning.

讓我為你描繪一幅畫面。你有三個模型:Cosmos Predict、Cosmos Transfer和Cosmos Reason。Cosmos Predict作為一個世界模擬器。根據當前狀態——例如,來自機器人感測器的一系列影片輸入——它預測下一個狀態。例如,給定截至目前的影片饋送,模型可能輸出控制信號、文字描述或下一幀,以預測接下來會發生什麼。這對於自動導航或機器人規劃等應用非常有用。
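
As a mental model in code, a Predict-style rollout might look like the hypothetical sketch below; `predict_next` and the dummy model are assumptions for illustration, not the released API.

```python
import torch

class DummyWorldModel:
    """Stand-in world model: returns a noisy copy of the last observed frame."""
    def predict_next(self, frames, prompt):
        return frames[-1] + 0.01 * torch.randn_like(frames[-1])

def rollout(model, frames, prompt, horizon=16):
    """Autoregressively extend (T, C, H, W) frames by `horizon` predicted steps."""
    for _ in range(horizon):
        nxt = model.predict_next(frames, prompt)            # (C, H, W)
        frames = torch.cat([frames, nxt.unsqueeze(0)], 0)   # append along time
    return frames

future = rollout(DummyWorldModel(), torch.zeros(4, 3, 64, 64), "robot picks up a cup")
```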

This predictive capability is valuable for evaluating physical AI systems. When you train a model for a robot, you often test multiple variations to find the best one. Cosmos Predict allows you to simulate these scenarios digitally, saving time and resources before deploying the model in the real world. Our models are designed to help developers iterate quickly and deploy robust physical AI solutions.

這種預測能力對於評估物理人工智慧系統非常有價值。當你為機器人訓練模型時,通常會測試多個變體以找到最佳方案。Cosmos Predict讓你能夠在數位環境中模擬這些場景,在現實世界部署模型之前節省時間和資源。我們的模型旨在幫助開發者快速迭代並部署穩健的物理人工智慧解決方案。

When deciding which model to deploy on a physical robot, you can’t just guess—you need to test them thoroughly. Some models might perform poorly, and deploying a bad one could disrupt your environment. So, how do you identify the best candidates before deployment? With our world simulator, you can let your policy models interact with a virtual world. Suppose you test 1,000 models and narrow it down to the top ten. You can then deploy those on a real robot to see how they perform, saving significant time and resources.

當你決定將哪個模型部署到物理機器人上時,你不能僅靠猜測——你需要徹底測試。有些模型可能表現得很差,部署一個不好的模型可能會破壞你的環境。那麼,在部署之前,你如何找出最好的候選者呢?有了我們的世界模擬器(World Simulator),你可以讓你的策略模型(Policy Models)與虛擬世界互動。假設你測試了1,000個模型,然後篩選出前十名。你可以將這些模型部署到真實機器人上,觀察它們的表現,從而節省大量的時間和資源。
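
A hedged sketch of that screening workflow: roll every candidate policy out inside the world simulator, score the rollouts, and keep only the top few for real hardware. The `simulator` and `policy` interfaces here are assumptions.

```python
def screen_policies(policies, simulator, score_fn, episodes=10, top_k=10):
    """Rank candidate policies by average simulated score; return top-k indices."""
    scores = []
    for policy in policies:
        total = 0.0
        for _ in range(episodes):
            state, done = simulator.reset(), False
            while not done:
                action = policy(state)
                state, done = simulator.step(action)  # world model predicts next state
            total += score_fn(state)                  # e.g. 1.0 if the task succeeded
        scores.append(total / episodes)
    ranked = sorted(range(len(policies)), key=lambda i: -scores[i])
    return ranked[:top_k]  # the only candidates you deploy on a real robot
```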

This approach also helps in other ways. For example, if you want to test your robot in a kitchen environment, you might have access to ten different kitchens. But what if you need your robot to work in thousands of kitchens? With our simulator, you can generate countless virtual kitchens to test whether your robot is robust enough before releasing it to your customers’ real-world kitchens.

這種方法在其他方面也有幫助。例如,如果你想在廚房環境中測試你的機器人,你可能有十個不同的廚房可以使用。但如果你需要你的機器人在數千個廚房中工作呢?有了我們的模擬器,你可以生成無數個虛擬廚房,來測試你的機器人是否足夠穩健,然後再將其發布到客戶的現實世界廚房中。

Moreover, if the world simulator is good at predicting future states, it can be repurposed to generate actions toward desired outcomes. This makes it a powerful policy model for training. If you have a near-perfect world simulator, you can even apply classical control methods to consistently achieve optimal solutions. Additionally, the simulator can act as a data generator, producing synthetic data to train your robot, further enhancing its performance.

此外,如果世界模擬器擅長預測未來狀態,它可以被重新配置為生成朝向期望結果的動作。這使其成為一個強大的策略模型,用於訓練。如果你的世界模擬器近乎完美,你甚至可以應用經典控制方法(Classical Control Methods)來持續實現最佳解決方案。此外,模擬器還可以作為數據生成器(Data Generator),生成合成數據來訓練你的機器人,進一步提升其性能。
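
One way to picture "repurposing the simulator for control" is random-shooting model-predictive control, sketched below under an assumed `simulate(state, actions) -> (final_state, total_reward)` interface; this is a generic technique, not the specific method used in Cosmos.

```python
import numpy as np

def mpc_action(simulate, state, action_dim, horizon=10, candidates=256, seed=0):
    """Sample candidate action sequences, roll each out in the world model,
    and execute the first action of the best-scoring sequence."""
    rng = np.random.default_rng(seed)
    best_score, best_actions = -np.inf, None
    for _ in range(candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        _, score = simulate(state, actions)   # the simulator predicts the outcome
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions[0]
```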

Now, let me discuss one of the world foundation models we've developed, which is based on a diffusion approach for video generation. We use a causal tokenizer, meaning the model's predictions depend only on past and present data, not the future—mimicking how the physical world works. Building on this, we created a standard diffusion-based architecture at a large scale. This model removes noise from corrupted tokens to generate high-quality videos.

現在,讓我介紹我們開發的一個世界基礎模型(World Foundation Model),它基於用於影片生成的擴散方法(Diffusion Approach)。我們使用了因果標記器(Causal Tokenizer),這意味著模型的預測僅依賴於過去和現在的數據,而不依賴未來——這模擬了物理世界的運作方式。在此基礎上,我們創建了一個大規模的標準擴散架構(Diffusion-based Architecture)。這個模型從損壞的標記中去除噪聲,生成高品質的影片。
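
To give a feel for the diffusion sampling loop, here is a generic Euler-style sampler under a velocity-prediction assumption; it is not the exact scheduler Cosmos uses.

```python
import torch

@torch.no_grad()
def sample(model, shape, steps=50, device="cpu"):
    """Assumes `model(x, t)` predicts a denoising velocity at noise level t."""
    x = torch.randn(shape, device=device)  # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        v = model(x, t)                    # predicted direction toward clean data
        x = x - v * dt                     # one Euler integration step
    return x                               # denoised latent video tokens
```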

We developed two model families of different sizes. The first is a 7-billion-parameter text-to-video diffusion model, and the second is a 14-billion-parameter model. Through testing, we found that the 14-billion-parameter model outperforms the 7-billion-parameter one. Initially, we focused on generating videos from text prompts. Later, we extended the model to predict future video frames based on past videos and text inputs, effectively creating a world simulator. Both models have been released for public use.

我們開發了兩個不同規模的模型家族。第一個是70億參數的文字到影片擴散模型(Text-to-Video Diffusion Model),第二個是140億參數的模型。通過測試,我們發現140億參數的模型表現優於70億參數的模型。最初,我們專注於從文字提示生成影片。後來,我們擴展了模型,使其能根據過去的影片和文字輸入預測未來的影片幀,從而有效地創建了一個世界模擬器。這兩個模型都已公開發布供大家使用。

Over the past few months, we’ve refined our approach to video generation. You start with a prompt—say, a text description or past video frames—and the model generates future states. This creates realistic conditions for testing your physical AI in diverse environments. For example, you can simulate different scenarios to evaluate how your AI performs before deploying it in the real world.

在過去幾個月裡,我們改進了影片生成的方法。你從一個提示開始——例如文字描述或過去的影片幀——然後模型生成未來的狀態。這為測試你的物理人工智慧(Physical AI)創造了逼真的條件,讓你能在多樣化的環境中進行評估。例如,你可以模擬不同的場景,來評估你的AI在現實世界部署之前的表現。

We’ve included several examples in our research paper, and most of them are being released today as open-source resources. You can find them on NVIDIA’s updated website, along with detailed documentation. For instance, we’ve developed a control system where, in addition to text prompts, the model takes camera trajectories as inputs. This allows you to virtually navigate a simulated world by controlling camera movements, like exploring a virtual lab or a city street.

我們在研究論文中包含了幾個例子,其中大多數今天以開源資源的形式發布。你可以在NVIDIA更新的網站上找到它們,連同詳細的文件。例如,我們開發了一個控制系統,除了文字提示外,模型還接受攝影機軌跡(Camera Trajectories)作為輸入。這讓你能通過控制攝影機移動,虛擬地導航模擬世界,例如探索虛擬實驗室或城市街道。

To make this possible, we analyzed a subset of videos where we could robustly extract camera poses. For driving scenarios, this is easier because vehicle motion data helps determine camera positions. During training, our diffusion model conditions on these camera trajectories, learning to remove noise and generate coherent videos. At test time, you can use camera poses to control the model, enabling navigation in a virtual world—like shopping in a supermarket or operating in an industrial setting for robotic applications.

為了實現這一點,我們分析了一組影片,從中可以穩健地提取攝影機姿態(Camera Poses)。對於駕駛場景,這更容易,因為車輛運動數據有助於確定攝影機位置。在訓練期間,我們的擴散模型(Diffusion Model)以這些攝影機軌跡為條件,學習去除噪聲並生成連貫的影片。在測試時,你可以使用攝影機姿態來控制模型,從而在虛擬世界中進行導航——例如在超市購物或在工業環境中操作機器人應用。
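
A schematic of one conditioned training step, assuming a hypothetical `model(noisy, t, poses)` that returns a noise estimate; the shapes and noising schedule are simplified for illustration.

```python
import torch
import torch.nn.functional as F

def training_step(model, latents, camera_poses, optimizer):
    """One denoising step for (B, C, T, H, W) video latents, conditioned on poses."""
    b = latents.shape[0]
    t = torch.rand(b, device=latents.device).view(-1, 1, 1, 1, 1)  # noise level
    noise = torch.randn_like(latents)
    noisy = (1 - t) * latents + t * noise           # interpolate toward pure noise
    pred = model(noisy, t.flatten(), camera_poses)  # pose-conditioned noise estimate
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```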

Many robots, like those in autonomous vehicles, use multiple cameras. Instead of generating a single view, our model can produce six or eight views simultaneously, using a transformer architecture. These views are interconnected, with each attending to the others to ensure consistency. This multi-view generation is detailed in our paper, though some proprietary components aren’t included in the open-source release. By conditioning the model on vehicle trajectories, you can generate not just static views but dynamic scenes tied to the vehicle’s motion, leveraging datasets that already include trajectory data.

許多機器人,例如自動駕駛車輛中的機器人,使用多個攝影機。我們的模型不僅能生成單一視圖,還能同時生成六個或八個視圖,使用變換器架構(Transformer Architecture)。這些視圖相互關聯,每個視圖都與其他視圖互動以確保一致性。這種多視圖生成(Multi-view Generation)已在我們的論文中詳細說明,儘管一些專有組件未包含在開源版本中。通過以車輛軌跡為條件,你不僅能生成靜態視圖,還能生成與車輛運動相關的動態場景,利用已包含軌跡數據的數據集。
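
The consistency mechanism can be sketched as joint attention over all views: tokens from every camera are flattened into one sequence so each view attends to the others. This is a minimal stand-in, not the proprietary multi-view module.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Joint self-attention across V camera views of N tokens each."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, views):                  # (B, V, N, D)
        b, v, n, d = views.shape
        x = views.reshape(b, v * n, d)         # one sequence over all views
        out, _ = self.attn(x, x, x)            # every token sees every view
        return out.reshape(b, v, n, d)
```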

Our Cosmos model also supports general text prompts to guide robotic actions. For example, you can instruct a robot to “organize books by placing them vertically on a shelf,” and the model will generate a video of the future state based on that command. This output can serve as a training signal or a preview of the robot’s behavior. Beyond text, you can condition the model with other inputs, like sensor data, to further refine the generation process.

我們的Cosmos模型還支持通用文字提示(Text Prompts)來指導機器人動作。例如,你可以指示一個機器人“通過將書籍垂直放置在書架上來整理書籍”,模型將根據該指令生成未來狀態的影片。這個輸出可以作為訓練信號或機器人行為的預覽。除了文字,你還可以使用其他輸入(如感測器數據)來進一步優化生成過程。

Now, let’s discuss Cosmos Transfer. This model excels at transferring knowledge across domains, such as from synthetic to real-world environments. It’s a conditional world generator, meaning it can adapt to specific inputs to create tailored simulations. For example, you might input a depth map or a segmentation mask alongside RGB data to guide the generation. These capabilities allow Cosmos Transfer to produce diverse outputs, like realistic RGB renderings or detailed depth-based worlds, depending on your needs.

現在,讓我們來談談Cosmos Transfer。這個模型擅長跨領域傳輸知識,例如從合成環境到現實世界環境。它是一個條件世界生成器(Conditional World Generator),意味著它可以適應特定輸入來創建量身定制的模擬。例如,你可以輸入深度圖(Depth Map)或分割遮罩(Segmentation Mask)以及RGB數據來指導生成。這些功能使Cosmos Transfer能夠產生多樣化的輸出,例如逼真的RGB渲染(RGB Renderings)或基於深度的詳細世界,具體取決於你的需求。

With Cosmos Transfer, you can map data from various modalities—like depth maps or segmentation masks—into videos that resemble real-world footage. This makes it ideal for creating simulations that bridge synthetic and real environments. The model operates causally, meaning it generates outputs based only on past and present inputs, mimicking the physics of the real world. Let me show you a video our team created to demonstrate this.

有了Cosmos Transfer,你可以將來自不同模態的數據——如深度圖(Depth Maps)或分割遮罩(Segmentation Masks)——映射到類似現實世界影片的視頻。這使其非常適合創建連接合成環境與真實環境的模擬。該模型以因果方式運作,意味著它僅根據過去和現在的輸入生成輸出,模擬現實世界的物理特性。讓我展示一段我們團隊策劃的影片,展示這一點。

Here’s how it works: we start with the first three transformer blocks from the Cosmos Predict model and add a separate branch to process conditional inputs, such as segmentation masks or depth maps. These inputs are fed into the model, which then integrates them into the main branch to produce the final output. During training, the diffusion model learns to use this conditional information to remove noise, effectively generating high-quality videos. This setup allows us to connect control signals—like camera poses or text prompts—to the final output, creating a versatile control system.

這是它的運作方式:我們從Cosmos Predict模型的前三個變換器塊(Transformer Blocks)開始,添加一個獨立分支來處理條件輸入,例如分割遮罩或深度圖。這些輸入被送入模型,然後整合到主分支中以產生最終輸出。在訓練期間,擴散模型(Diffusion Model)學習使用這些條件資訊來去除噪聲,有效生成高品質的影片。這種設置使我們能夠將控制信號——如攝影機姿態(Camera Poses)或文字提示(Text Prompts)——連接到最終輸出,創建一個多功能的控制系統。
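
In code, this control-branch idea (ControlNet-style) might look like the sketch below: duplicate the base model's early blocks into a side branch that consumes the conditional input, and add its zero-initialized output back into the main trunk. Block counts and fusion details are simplified assumptions.

```python
import copy
import torch.nn as nn

class ControlBranch(nn.Module):
    """Side branch initialized from the base model's early transformer blocks."""
    def __init__(self, base_blocks, cond_channels, hidden_dim):
        super().__init__()
        self.blocks = copy.deepcopy(base_blocks)           # weights copied from base
        self.embed = nn.Linear(cond_channels, hidden_dim)  # lift condition to tokens
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.proj.weight)                   # zero-init: training starts
        nn.init.zeros_(self.proj.bias)                     # as the unmodified base model

    def forward(self, cond_tokens):
        h = self.embed(cond_tokens)
        for blk in self.blocks:              # assumed token-to-token transformer blocks
            h = blk(h)
        return self.proj(h)                  # residual to be added into the main branch
```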

This control system is especially powerful for Cosmos Transfer, as it supports multiple modalities. You can input segmentation masks, depth maps, camera trajectories, or even text instructions, and the model will generate corresponding videos. For example, in scenarios where you want to keep the foreground—like a robot or a person—unchanged but vary the background, you can use segmentation masks to assign strong weights to the foreground. This ensures the main objects remain consistent while allowing the background to vary creatively, providing diverse training data.

這個控制系統對Cosmos Transfer尤其強大,因為它支持多種模態。你可以輸入分割遮罩、深度圖、攝影機軌跡,甚至文字指令,模型將生成相應的影片。例如,在你希望保持前景——如機器人或人——不變,但改變背景的場景中,你可以使用分割遮罩為前景分配較高的權重。這確保主要對象保持一致,同時允許背景以創意方式變化,提供多樣化的訓練數據。
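
A minimal sketch of that spatially weighted fusion: per-pixel weight maps (for example, high weight on the foreground-mask branch) blend the residuals from several control branches.

```python
import torch

def fuse_controls(residuals, weight_maps):
    """residuals: list of (B, T, H, W, D) branch outputs; weight_maps: list of
    matching (B, T, H, W) per-pixel weights, assumed to sum to 1 across branches."""
    fused = torch.zeros_like(residuals[0])
    for r, w in zip(residuals, weight_maps):
        fused = fused + w.unsqueeze(-1) * r  # broadcast weight over feature dim
    return fused
```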

Let me give you a specific example. Imagine a person performing an activity, like walking. Using a foreground mask, we can prioritize the person and their objects, ensuring their appearance—like skin tone or clothing color—remains consistent. The model then generates a video where the foreground is preserved, but the background can be altered based on the layout, geometry, or semantics of the scene. This is particularly useful for applications like autonomous driving, where we’ve used high-definition LiDAR data to condition the model and generate realistic driving scenarios.

讓我舉一個具體的例子。想像一個人在進行某項活動,比如走路。使用前景遮罩(Foreground Mask),我們可以優先考慮該人和他們的物體,確保他們的外觀——如膚色或服裝顏色——保持一致。然後,模型生成一段影片,前景保持不變,但背景可以根據場景的佈局、幾何形狀或語義進行更改。這對於自動駕駛等應用尤其有用,我們使用高解析度雷射雷達(LiDAR)數據來調節模型,生成逼真的駕駛場景。

Another example involves multi-view generation for robots with multiple cameras, like those in self-driving cars. By conditioning the model on camera poses and vehicle trajectories, we can generate multiple synchronized views—say, six or eight—while ensuring they’re consistent with each other. This is critical for training robots to navigate complex environments, like industrial settings or crowded supermarkets.

另一個例子涉及為配備多個攝影機的機器人(如自動駕駛汽車)進行多視圖生成(Multi-view Generation)。通過以攝影機姿態和車輛軌跡為條件,我們可以生成多個同步視圖——例如六個或八個——同時確保它們彼此一致。這對於訓練機器人在複雜環境中導航(如工業環境或擁擠的超市)至關重要。

Now, let’s briefly touch on Cosmos Reason. This is our reasoning model, designed specifically for physical AI tasks. Unlike general-purpose models that excel at tasks like Olympiad-level math or coding, Cosmos Reason focuses on physical common sense. It’s like writing a plan on paper before acting—you reason through the steps, identify potential mistakes, and refine your approach. This makes it particularly effective for tasks requiring spatial awareness, which is one of three key areas of physical common sense we’ve identified as critical for physical AI.

現在,讓我們簡單談談Cosmos Reason。這是我們專為物理人工智慧任務設計的推理模型(Reasoning Model)。與擅長奧林匹克級別數學或編碼的通用模型不同,Cosmos Reason專注於物理常識(Physical Common Sense)。這就像在行動前在紙上寫下計劃——你推理每個步驟,找出潛在錯誤,並完善你的方法。這使其在需要空間意識(Spatial Awareness)的任務中尤其有效,這是我們認為對物理人工智慧至關重要的三個物理常識關鍵領域之一。

Our goal isn’t to build a general artificial intelligence (AGI) that solves every problem. Instead, we’re focused on creating tools that are practical and useful for physical AI developers. By emphasizing physical common sense—like spatial reasoning, object interactions, and environmental dynamics—we aim to empower developers to build robust, real-world AI systems.

我們的目標不是打造一個解決所有問題的通用人工智慧(AGI)。相反,我們專注於創建對物理人工智慧開發者實用且有用的工具。通過強調物理常識——如空間推理、物體互動和環境動態——我們旨在賦能開發者打造穩健的現實世界人工智慧系統。

For physical AI, we focus on three key areas of physical common sense: space, time, and fundamental physics. Space includes understanding environments and their relationships, such as object positions and layouts. Time involves reasoning about actions, their causality, and planning sequences. Fundamental physics covers attributes like object permanence, mechanics, electromagnetism, thermodynamics, and material properties. These are the core principles we believe a physical AI agent must understand to operate effectively in the real world.

對於物理人工智慧(Physical AI),我們專注於物理常識(Physical Common Sense)的三個關鍵領域:空間、時間和基礎物理學。空間包括理解環境及其關係,例如物體位置和佈局。時間涉及推理動作、其因果關係以及規劃序列。基礎物理學涵蓋物體恆存性(Object Permanence)、力學、電磁學、熱力學和材料屬性等屬性。這些是我們認為物理人工智慧代理必須理解的核心原則,以便在現實世界中有效運作。

These principles also tie into embodied reasoning, which is critical for physical AI. Embodied reasoning involves processing complex sensory inputs, predicting the effects of actions, respecting physical constraints, and learning from interactions with the environment. Different embodiments—like a humanoid robot versus a self-driving car—require tailored reasoning approaches. For example, a person watching a cooking video processes visual and auditory data to learn a task, while a robot might rely on sensor data to perceive the world. Despite these differences, we believe all physical AI systems must have the capability to process complex sensory inputs to survive and thrive in the physical world.

這些原則還與具身推理(Embodied Reasoning)相關,這對物理人工智慧至關重要。具身推理涉及處理複雜的感官輸入、預測動作效果、尊重物理限制,以及從與環境的互動中學習。不同的具身形式——如人形機器人與自動駕駛汽車——需要量身定制的推理方法。例如,一個人觀看烹飪影片時,處理視覺和聽覺數據來學習任務,而機器人可能依賴感測器數據來感知世界。儘管存在這些差異,我們相信所有物理人工智慧系統都必須具備處理複雜感官輸入的能力,以在物理世界中生存和蓬勃發展。

Due to time constraints, I can’t go into every detail here, but our research paper provides a comprehensive breakdown of these concepts, including embodied reasoning and physical common sense. It’s available online today, so I encourage you to check it out. In essence, our Cosmos Reason model is a multimodal reasoning system. It takes inputs like videos, processes them through a vision encoder to create video tokens, and then applies reasoning steps—similar to thinking through a problem—before producing a final answer. This is a standard architecture, but the key lies in how we curate the training data to embed physical common sense and embodied reasoning.

由於時間限制,我無法在此詳細說明每一個細節,但我們的研究論文提供了這些概念的全面分解,包括具身推理和物理常識。該論文今天已在線上發布,我鼓勵你去查看。基本上,我們的Cosmos Reason模型是一個多模態推理系統(Multimodal Reasoning System)。它接受像影片這樣的輸入,通過視覺編碼器(Vision Encoder)處理它們以創建影片標記(Video Tokens),然後應用推理步驟——類似於思考問題——在產生最終答案之前。這是一個標準架構,但關鍵在於我們如何策劃訓練數據,以嵌入物理常識和具身推理。
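
Structurally, such a multimodal reasoner can be sketched as below; the encoder, adapter, and language backbone are placeholders, not the Cosmos Reason internals.

```python
import torch
import torch.nn as nn

class MultimodalReasoner(nn.Module):
    """Vision encoder -> video tokens -> language model that reasons, then answers."""
    def __init__(self, vision_encoder, language_model, vision_dim, lm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder           # frames -> (B, N, vision_dim)
        self.adapter = nn.Linear(vision_dim, lm_dim)   # map into the LM token space
        self.lm = language_model                       # assumed to accept embeddings

    def forward(self, frames, question_embeddings):
        video_tokens = self.adapter(self.vision_encoder(frames))
        inputs = torch.cat([video_tokens, question_embeddings], dim=1)
        return self.lm(inputs)                         # reasoning steps + final answer
```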

To build Cosmos Reason, we start with a pre-trained multimodal model and fine-tune it for physical AI tasks using supervised learning. This fine-tuning is guided by a question-answering dataset annotated by humans, which teaches the model to reason about physical effects and interactions. We also incorporate verifiable rewards to ensure the model’s outputs align with physical reality. Our approach is grounded in a detailed analysis of what’s essential for physical AI, as outlined in our paper.

為了打造Cosmos Reason,我們從一個預訓練的多模態模型開始,並使用監督學習(Supervised Learning)針對物理人工智慧任務進行微調。這種微調由人類註釋的問答數據集(Question-Answering Dataset)指導,該數據集教導模型推理物理效果和互動。我們還引入了可驗證的獎勵(Verifiable Rewards),以確保模型的輸出與物理現實一致。我們的方法基於對物理人工智慧必需元素的詳細分析,如我們論文中所述。

We use a verifiable reward system, which is essentially a binary evaluation—think of it as a one or zero score for correctness. To train our model, we apply the Group Relative Policy Optimization algorithm, or GRPO, which is a simplified approach inspired by reinforcement learning techniques. The idea is straightforward: we group multiple solutions together, and if one stands out as uniquely effective, it’s given higher weight. If there are many good solutions, they’re all considered, but when every solution is correct, it’s less significant because it’s an easy case. This method naturally emphasizes challenging examples that push the model to improve.

我們使用了一個可驗證的獎勵系統(Verifiable Reward System),這基本上是一個二元評估——可以想像成正確與否的一或零分數。為了訓練我們的模型,我們應用了群組相對策略優化算法(Group Relative Policy Optimization, GRPO),這是一個受強化學習技術啟發的簡化方法。這個概念很簡單:我們將多個解決方案分組在一起,如果某個方案特別有效,就會給予更高的權重。如果有很多好的解決方案,它們都會被考慮,但當每個方案都正確時,其重要性就降低,因為這是一個簡單的情況。這種方法自然強調那些能推動模型改進的挑戰性示例。
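
Numerically, the group-relative weighting works out as in this small sketch: each binary reward is normalized against its group's mean and spread, so all-correct (easy) groups contribute nothing while a lone success in a hard group is strongly upweighted.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """rewards: (groups, samples) array of 0/1 verifiable scores."""
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)  # advantage of each sample within its group

print(grpo_advantages([[1, 1, 1, 1],    # easy group: all correct -> advantages ~ 0
                       [1, 0, 0, 0]]))  # hard group: the one success dominates
```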

One task we use to train Cosmos Reason involves playing a video and asking the AI to determine whether it’s moving forward or backward in time. Let’s do a quick test: is the first video playing forward or backward? This task helps the model develop a sense of temporal dynamics without requiring human intervention. By analyzing the video, the AI learns how time works in the physical world, which is critical for physical AI applications.

我們用來訓練Cosmos Reason的一個任務是播放一段影片,並要求人工智慧判斷影片是向前還是向後播放。讓我們做一個快速測試:第一段影片是向前還是向後播放?這個任務幫助模型發展對時間動態(Temporal Dynamics)的感知,而無需人工干預。通過分析影片,人工智慧學習物理世界中時間的運作方式,這對物理人工智慧應用至關重要。
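
Generating training examples for this task needs no human labels at all, as the toy sketch below shows: reverse half of the clips and record the direction as the verifiable answer.

```python
import random

def make_direction_examples(clips, seed=0):
    """clips: lists of frames in true temporal order -> (video, label) pairs."""
    rng = random.Random(seed)
    examples = []
    for clip in clips:
        if rng.random() < 0.5:
            examples.append((clip, "forward"))
        else:
            examples.append((clip[::-1], "backward"))  # reversed clip, known answer
    return examples
```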

The second video is trickier, right? We intentionally design some tasks to be more challenging to push the model’s limits. Another task we use is a video puzzle. We mix multiple video clips together and ask questions like: “Given 32 frames, find two frames that come from the same sequence.” This forces the model to work hard to understand visual and temporal relationships. These tasks, paired with verifiable rewards, help the AI develop a deeper understanding of physical interactions, which can benefit your physical AI projects.

第二段影片更棘手,對吧?我們有意設計一些更具挑戰性的任務來突破模型的極限。我們使用的另一個任務是影片拼圖(Video Puzzle)。我們將多個影片片段混合在一起,並提出問題,例如:“給定32幀,找出來自同一序列的兩幀。”這迫使模型努力理解視覺和時間關係。這些任務與可驗證的獎勵結合,幫助人工智慧更深入地理解物理互動,這對你的物理人工智慧項目有益。
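
A possible generator for such puzzles, again with a verifiable answer built in (this assumes the pooled clips contain at least `n_frames` frames in total):

```python
import random

def make_puzzle(clips, n_frames=32, seed=0):
    """Shuffle frames drawn from several clips; ground truth is every index
    pair whose two frames come from the same source clip."""
    rng = random.Random(seed)
    pool = [(frame, cid) for cid, clip in enumerate(clips) for frame in clip]
    sample = rng.sample(pool, n_frames)
    frames = [f for f, _ in sample]
    pairs = [(i, j) for i in range(n_frames) for j in range(i + 1, n_frames)
             if sample[i][1] == sample[j][1]]
    return frames, pairs
```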

Here’s an example of how we apply Cosmos Reason. We asked the model a question: the overall goal is for an agent to pour milk into a cup. The agent in the video is currently performing one subtask out of many required to complete this goal. Based on the video and the instruction, what is the most likely next immediate subtask? This type of question tests the model’s ability to reason about sequences of actions in a physical context, ensuring it can anticipate and plan effectively.

這是我們應用Cosmos Reason的一個例子。我們向模型提出了一個問題:總體目標是一個代理將牛奶倒入杯子。影片中的代理目前正在執行完成該目標所需的眾多子任務之一。根據影片和指令,下一個最可能的立即子任務是什麼?這類問題測試模型在物理情境中推理動作序列的能力,確保它能有效預測和規劃。

Let’s look at an example with Cosmos Reason. In one video, an agent is pouring milk into a cup. The question is: what’s the immediate next action? Since the milk has just been poured, the most likely next step is to stop pouring and put the bottle away. This demonstrates the model’s ability to reason about physical sequences. Another task involves determining whether a video is playing forward or backward. For instance, if the amount of pink powder in a container increases while it decreases in a bucket, the model infers the action is being undone, indicating the video is playing backward.

讓我們來看一個Cosmos Reason的例子。在一段影片中,一個代理正在將牛奶倒入杯子。問題是:下一步的立即動作是什麼?由於牛奶剛剛被倒出,最可能的下一步是停止倒奶並將瓶子收起來。這展示了模型推理物理序列的能力。另一個任務是判斷影片是向前還是向後播放。例如,如果容器中的粉紅色粉末增加,而桶中的粉末減少,模型會推斷動作正在被撤銷,表明影片是向後播放。

In another scenario, we asked the model to predict the next action for a vehicle based on its current motion. The AI might initially consider multiple options—say, turning left, right, or continuing straight. But then it notices a double line, indicating a left turn is prohibited. Instead of choosing an invalid option, the model correctly reasons that none of the provided actions are appropriate. This kind of nuanced reasoning is what we aim to empower in physical AI systems.

在另一個場景中,我們要求模型根據車輛當前的運動預測下一步動作。人工智慧可能最初考慮多個選項——例如向左轉、向右轉或繼續直行。但隨後它注意到雙黃線,表明禁止左轉。模型沒有選擇無效選項,而是正確推理出提供的動作都不合適。這種細緻的推理正是我們希望在物理人工智慧系統(Physical AI)中賦能的能力。

To evaluate Cosmos Reason, we created a set of benchmarks tailored for physical AI use cases and compared it to popular models like Claude, GPT, and others. Starting with a pre-trained backbone, we fine-tuned Cosmos Reason with physical AI-specific data. The results show significant improvements in tasks involving physical reasoning. These benchmarks are detailed in our paper, but due to time constraints, I’ll move on to the key point: everything we’ve developed is open source.

為了評估Cosmos Reason,我們創建了一套為物理人工智慧使用案例量身定制的基準測試(Benchmarks),並將其與Claude、GPT等流行模型進行比較。從預訓練的基礎模型(Pre-trained Backbone)開始,我們使用物理人工智慧特定的數據對Cosmos Reason進行微調。結果顯示,在涉及物理推理的任務中有了顯著改進。這些基準測試在我們的論文中有詳細描述,但由於時間限制,我將直接進入重點:我們開發的一切都是開源的(Open Source)。

We’ve released the model checkpoints for Cosmos Predict and Cosmos Transfer. Cosmos Reason is still undergoing some additional work, but it will be released soon. The technical details are thoroughly documented in our papers, two of which were published today. One covers Cosmos Predict, and the other discusses Cosmos Transfer and Cosmos Reason. Our goal is to keep improving these models—think Cosmos Predict 2, Transfer 2, and Reason 2—until they’re so robust that developers no longer need our direct support.

我們已經發布了Cosmos Predict和Cosmos Transfer的模型檢查點(Model Checkpoints)。Cosmos Reason仍在進行一些額外工作,但很快也將發布。技術細節在我們的論文中有詳細記載,今天發布了其中兩篇。一篇涵蓋Cosmos Predict,另一篇討論Cosmos Transfer和Cosmos Reason。我們的目標是不斷改進這些模型——想想Cosmos Predict 2、Transfer 2和Reason 2——直到它們足夠穩健,開發者不再需要我們的直接支持。

We believe there’s a symbiotic relationship between generation and reasoning. Generative models, like those producing videos, can enhance reasoning by providing rich data. Conversely, reasoning models can critique generated outputs—for example, identifying why a video is fake by analyzing physical inconsistencies. This feedback loop improves both capabilities. To make our tools accessible, we’re releasing native Python scripts for developers who prefer working directly with Python. These are available today, and we’re planning additional deployments to make them even more robust.

我們相信生成(Generation)和推理(Reasoning)之間存在共生關係。生成模型(如生成影片的模型)可以通過提供豐富的數據來增強推理能力。相反,推理模型可以評判生成的輸出——例如,通過分析物理不一致性來識別一段影片為何是假的。這種反饋循環改進了兩者的能力。為了讓我們的工具更易於使用,我們為偏好直接使用Python的開發者發布了原生Python腳本(Native Python Scripts)。這些腳本今天已可用,我們還計劃進行額外的部署,使其更加穩健。

Thank you for your attention. I encourage you to explore our papers and open-source resources to see how Cosmos can empower your physical AI projects.

感謝你的聆聽。我鼓勵你探索我們的論文和開源資源,了解Cosmos如何為你的物理人工智慧項目賦能。