深入探討液冷 AI 資料中心的未來(由 Supermicro 提供)[S74305]
今天各位朋友,歡迎來到GTC。我的名字是惠普爾·博多尼(Whipple Bodoni),我在NVIDIA的汽車團隊(Automotive Team)工作。我的專長是感測器(Sensors)與感官生態系統(Sensory Ecosystem)。我負責研究光達(LiDAR)和雷達(Radar)。今天我將擔任這場會議的主持人。在我們開始之前,有幾項例行公告,我相信大家可能已經聽過了,但我還是得再次提醒。請記得下載NVIDIA GTC手機應用程式(Mobile App),以獲取最新的更新資訊、會議目錄(Session Catalog),以及方便快捷的會議問卷調查(Session Survey)和其他內容。會議結束後,請到展覽廳(Exhibit Hall)探索一番。下午5點到7點,展覽廳還會舉辦一場招待會(Reception)。請務必去GTC公園(GTC Park)看看,那裡有許多藝術裝置(Art Installations)、休息區(Lounge Areas)和美食餐車(Food Trucks)。此外,還有一些贊助商的活動(Sponsor Activations)。晚上8點53分起,還會有一個熱鬧的夜市(Night Market)開放,大家如果有時差問題(Jet Lag),不妨出去走走看看。
Folks, welcome to GTC. My name is Whipple Bodoni, and I work in the Automotive Team over here at NVIDIA. My specialty is in sensors and the sensory ecosystem. I work on LiDAR and radar. Today, I will be the host for this session. Before we begin, a few housekeeping announcements, which I'm sure you must have heard already, but I have to repeat them anyway. Don't forget to download the NVIDIA GTC Mobile App for various updates, the session catalog, a quick and easy-to-complete session survey, as well as other things. Explore the exhibit hall after we're done over here. There's also a reception from 5 to 7 in the exhibit hall. Make sure you check out the GTC Park—you'll find a lot of art installations, lounge areas, and food trucks. There are also some sponsor activations. And there's going to be a very vibrant night market open from 8:53 PM, so make sure to check it out, and all of you who are jet-lagged can probably get out there.
這場會議的錄影(Session Recording)將在48小時內透過會議目錄提供給所有現場參與者(Attendees),並於下週開放給訂閱NVIDIA隨選服務(NVIDIA On Demand)的公眾觀看。現在,讓我們進入今天的會議主題。這場演講的標題是「深入探討液冷AI資料中心的未來」(A Deep Dive into the Future of Liquid-Cooled AI Data Centers),由來自Supermicro的CW Chen主講。CW Chen是Supermicro先進散熱解決方案(Advanced Thermal Solutions)的總經理(General Manager)。Supermicro將深入探討當前與未來的液冷AI基礎設施(Liquid-Cooled AI Infrastructure),分享全球最大規模液冷AI部署的真實成功案例,並預覽未來的液冷整體解決方案(Total Solutions)。他將強調液冷的最佳實踐(Best Practices)和參考架構(Reference Architectures),以最大化整個資料中心的運算密度(Computing Density)、效率(Efficiency)和生產力(Productivity)。
The session recording will be made available to all attendees over here from the session catalog within 48 hours, and to the public who subscribe to NVIDIA On Demand next week. Now, let's get started with the session. This session, titled "A Deep Dive into the Future of Liquid-Cooled AI Data Centers," is presented by CW Chen from Supermicro. CW Chen is the General Manager for Advanced Thermal Solutions at Supermicro. Supermicro dives into the present and future of liquid-cooled AI infrastructure, covering real-world success stories from the world's largest liquid-cooled AI deployments and previewing future liquid-cooling total solutions. Supermicro highlights liquid-cooling best practices and reference architectures to maximize computing density, efficiency, and productivity for the whole data center.
交付下一代AI基礎設施(Next Generation of AI Infrastructure)不僅僅是新的GPU,還需要重新思考系統節點(System Nodes)、機架(Racks)、網路(Networking)和整個資料中心(Data Center)。液冷技術能將運算密度提升一倍,並大幅改善效率與生產力。成功案例包括部署全球最大規模的液冷AI資料中心(Liquid-Cooled AI Data Center)。液冷技術為客戶和環境帶來雙重效益(Double Benefit)。Supermicro提供即插即用的機架(Plug-and-Play Racks),採用業界領先的直接晶片液冷技術(Direct-to-Chip Liquid Cooling)。會議結束時,大家有機會提問幾個問題。我會拿著麥克風(Microphone)在房間前方協助傳遞,後方也會有工作人員協助。女士們、先生們,讓我們歡迎CW Chen。謝謝大家!
Delivering the next generation of AI infrastructure is more than new GPUs: it requires a rethinking of system nodes, racks, networking, and the whole data center. Liquid cooling can double computing density with massive efficiency and productivity improvements. Success story: deploying the world's largest liquid-cooled AI data center. Liquid cooling provides a double benefit to the customer and to the environment. Deploy plug-and-play racks with industry-leading direct-to-chip liquid cooling. You'll have some time at the end of the session for a few questions. I'll bring the microphone around at the front of the room, and our folks in the back will assist back there. Ladies and gentlemen, CW Chen. Thank you!
那麼,如何才能有效地冷卻這些晶片呢?你無法再單靠空調(Air Conditioner)或任何傳統方式來解決。因此,我們需要轉向液冷技術(Liquid Cooling)。這就是為什麼液冷現在如此重要、如此熱門的原因。它是當前最新NVIDIA平台(NVIDIA Platform)的唯一解決方案。從Supermicro的角度來看,我們預期目前—並非下一代,而是現階段—GPU和CPU的TDP已經超過600瓦。對於機架密度(Rack Density)來說,很輕易就能超過130千瓦。這可能像是一條指數曲線(Exponential Curve),我們也在努力將機架密度提升到盡可能高的水準。所以或許明年,當我們再次舉辦這樣的演講時,這個數字可能已經翻倍甚至增長三倍。最終,你會需要液冷技術。如果能提早採用,你就能更早開始賺錢或回收成本(Get the Money Back)。稍後我會展示一些我們的總擁有成本(TCO, Total Cost of Ownership)計算,讓你清楚明白為什麼我們的執行長Charles Liang說液冷幾乎是免費的,還附帶額外好處,因為這項技術能幫你賺很多錢。
So how can you cool it down efficiently? You cannot use an air conditioner or any traditional means anymore. So we need to go to liquid cooling. That’s the reason liquid cooling right now is so critical and so popular. It’s the only solution for the latest NVIDIA platform right now. From Supermicro’s side, we expect—not really next-gen, but currently—GPU and CPU TDP to be over 600 watts. Regarding rack density, it will easily exceed 130 kilowatts. And maybe because it’s an exponential curve, we are also trying to push the rack density as high as possible. So maybe next year, when we present again, it might have already doubled or tripled. Eventually, you’ll need liquid cooling. If you start earlier, then you can earn money or get the money back earlier. Later, I’ll show you some of our TCO calculations. You can very clearly understand why our CEO Charles Liang said liquid cooling is almost free with a bonus—because it can help you make a lot of money with this technology.
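To make the rack-density arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. None of these wattages come from the talk; they are illustrative assumptions only, meant to show how per-chip TDPs add up to a three-digit-kilowatt rack.

```python
# Illustrative rack-power estimate; every wattage below is an assumption, not a figure from the talk.
def rack_power_kw(nodes_per_rack=8, gpus_per_node=8, gpu_tdp_w=1000,
                  cpus_per_node=2, cpu_tdp_w=400, other_w_per_node=1500):
    """Rough per-rack IT power: GPUs + CPUs + per-node overhead (memory, NICs, fans, losses)."""
    node_w = gpus_per_node * gpu_tdp_w + cpus_per_node * cpu_tdp_w + other_w_per_node
    return nodes_per_rack * node_w / 1000.0

print(f"Estimated rack density: {rack_power_kw():.0f} kW")
# ~82 kW with these assumptions; denser nodes or higher TDPs push past the 130 kW quoted above.
```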
那麼,Supermicro目前能提供什麼呢?我們知道液冷服務(Liquid Cooling Services)和系統(Systems)非常複雜,因此我們希望部署一套優質的解決方案,並實現綠色運算(Green Computing),讓這一切盡早發生。我們希望提供並確信自己能交付完整的液冷解決方案(Total Liquid Cooling Solution)。我們可以進行解決方案整合(Solution Integration),也能執行測試與驗證(Testing and Validation),當然還有現場部署(On-Site Deployment)。我們不僅提供硬體系統(Hardware System),還開發了一套優秀的服務管理工具(Service Management Tool)。這套工具能監控每顆晶片(Chip)、每台伺服器(Server)、每個節點(Node),甚至包括我稍後會介紹的冷卻塔(Cooling Tower)。Supermicro擁有最廣泛的產品線(Product Line),能支持液冷解決方案。我們有GPU伺服器(GPU Server)、高效能運算伺服器(HPC Server),還有企業伺服器(Enterprise Server),這些都能支援液冷技術。最重要的是,Supermicro的所有液冷伺服器(Liquid Cooling Servers)和機架(Racks)都是由Supermicro自行設計,內含更多解決方案,例如冷卻分配單元(CDU, Cooling Distribution Unit)、管路系統(Manifold),甚至冷卻塔。我們真正為客戶和親密合作夥伴提供一站式服務(One-Stop Shop),幫助他們讓資料中心(Data Center)更環保、更高效。
Okay, so what can Supermicro offer right now? Because we know liquid cooling services and systems are very complicated, we want to deploy a very good solution and make green computing happen as early as possible. We want to deliver, and we know we can offer a total liquid cooling solution. We can do solution integration, testing, and validation, and of course, on-site deployment. We’re not only providing the hardware system but also a very good service management tool. We can monitor each chip, each server, each node, even the cooling tower that I’ll introduce more about later. Supermicro has the widest product line to support liquid cooling solutions. We have GPU servers, HPC servers, and enterprise servers—they all can support liquid cooling. The most important thing is that all Supermicro liquid cooling servers and racks are designed by Supermicro, with additional solutions inside, including the CDU, manifold, and even the cooling tower. We truly offer a one-stop shop to our customers and close partners, helping them make their data centers greener and faster.
從Supermicro的角度,我們目前主要專注於直接晶片液冷(DLC, Direct-to-Chip Liquid Cooling),因為我們與NVIDIA團隊(NVIDIA Team)以及我們的合作夥伴和客戶密切合作。我們發現,相較於當前的傳統空氣冷卻資料中心(Traditional Air-Cooled Data Center),這種解決方案對客戶來說是最合理的選擇(Sensible Solution)。我們還提供了一套優秀的模組化解決方案(Modular Box Solution)來支持我們的DLC技術,這是一套高效的晶片冷卻方案(Chip Cooling Solution)。
From Supermicro's side, we still mainly focus on direct-to-chip liquid cooling (DLC), because we work very closely with the NVIDIA team and our partners and customers. We found that, compared to traditional air-cooled data centers right now, this kind of solution is the most sensible choice for our customers. We also provide a very good modular box solution to support our DLC offering—an efficient chip-cooling solution.
我們可以提供內建於機架(Rack)中的整合式冷卻分配單元(Integrated CDU, Cooling Distribution Unit)。這個CDU能直接與外部的冷卻塔(Cooling Tower)連接。在這個最新的平台上,你不需要額外的冷卻設備(Chiller),或者可以大幅減少冷卻設備的規模。這能為你節省大量的功耗(Power Consumption)。如果你有多個機架,需要更強大或更多的CDU,我們也能提供獨立式CDU(Standalone CDU),支援1兆瓦(1 Megawatt)或1.5兆瓦的冷卻能力,我們都能滿足。如果你的資料中心(Data Center)或客戶的資料中心目前仍使用空氣冷卻版本(Air-Cooled Version Data Center),我們也能提供後裝式冷卻解決方案(Retrofit Cooling Solution),由Supermicro設計與整合。這樣可以從空氣冷卻機架中移除熱量(Heat),然後透過外部冷卻塔冷卻後循環回來(Circulation),繼續運算。不論你的應用場景如何,我們都能提供這樣的整合解決方案(Integration Solution),讓你的資料中心節省高達40%的能源(Energy)。
We can provide an integrated CDU, which is built inside the rack already. The CDU can directly connect with our cooling tower outside. On this latest platform, you don't need a chiller, or you can reduce the size of your chiller significantly. You can save a lot of power consumption with this solution. If you have a lot of racks and need more, or more powerful, CDUs, we can also offer standalone CDUs: a single unit can support 1 megawatt or 1.5 megawatts of cooling capacity. If your data center or your customer's data center still uses an air-cooled design, we can also provide a retrofit cooling solution, built and integrated by Supermicro. This allows you to remove heat from an air-cooled rack, route it to our cooling tower outside, cool it down, and send it back to complete the circulation loop. No matter your application, we can offer this kind of integration solution to make your data center save up to 40% in energy.
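As a rough illustration of how standalone-CDU capacities like 1 MW or 1.5 MW translate into rack counts, here is a minimal sizing sketch. It is my own arithmetic under stated assumptions, not Supermicro sizing guidance.

```python
import math

def cdus_needed(num_racks, rack_kw=130, cdu_capacity_kw=1000, spares=1):
    """Row/facility CDU count for a given heat load, plus N+spares redundancy."""
    total_heat_kw = num_racks * rack_kw
    return math.ceil(total_heat_kw / cdu_capacity_kw) + spares

# Example: 16 racks at 130 kW each is 2,080 kW of heat -> 3 x 1 MW CDUs, plus one spare for N+1.
print(cdus_needed(16))  # 4
```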
我們很高興能與大家分享去年的經驗。我們交付了—我相信是當時最大的—液冷AI資料中心(Liquid-Cooled AI Data Center)。我們與NVIDIA團隊(NVIDIA Team)以及客戶密切合作,交付了超過6000個液冷系統(Liquid-Cooled Systems)給客戶現場。所有液冷機架(Liquid-Cooled Racks)內部都搭載了Supermicro的液冷解決方案。你可以看到伺服器外部有大量的管路(Piping),每個機架內都內建了直接CDU(Direct CDU)。我們是如何做到的?交付時間非常短,因為一切都在Supermicro的生產線上完成。當你收到機架時,只需連接三樣東西:電源(Power)、網路(Internet)和水管(Water Pipe),機架就能立即運作。我們能幫助你實現這一點。最重要的是,我們還使用了NVIDIA的光譜網路平台(NVIDIA Spectrum Networking Platform)。在這個最大的液冷AI資料中心裡,有許多創新技術(Innovations),幫助客戶縮短交付時間(Lead Time)並節省大量能源。如果你有興趣,歡迎造訪我們的網站,上面有YouTube影片詳細介紹資料中心內部的樣貌。我試著拍了一些照片展示,因為時間有限。你可以看到資料大廳(Data Hall),所有的機架、Supermicro冷卻設備(Cooling Equipment)和路旁的管路系統(Manifold)。這些都是真實場景,照片中還有NVIDIA的網路平台與管路完美整合。這是目前最新且最大的液冷資料中心,甚至資料中心的管徑(Piping Diameter)達到近30英寸,非常巨大。
Okay, we’re very happy to share with you that last year we delivered what I think was the largest liquid-cooled AI data center. We worked with the NVIDIA team and our customer. We delivered over 6,000 liquid-cooled systems to our customer’s site, and all the liquid-cooled racks have Supermicro’s liquid cooling solutions built inside. You can see a lot of piping outside the servers, and every rack has a direct CDU built inside. How did we make it happen? The deployment time is very short because everything is done in Supermicro’s production line. When you receive a rack, the only three things you need are power, internet, and a water pipe, and the rack is ready to go. We can help you do this. The most important thing is that we also used NVIDIA’s Spectrum networking platform. In this largest liquid-cooled AI data center, there’s a lot of innovation inside that helps customers shorten lead time and save a lot of energy. If you’re interested, you’re welcome to visit our website. There’s a YouTube video describing in detail what it looks like inside the data center. I just tried to capture some pictures here because we have limited time. You can see the data hall—all the racks, Supermicro cooling equipment, and the manifold on the roadside. This is a real design, and in these pictures, you can see the NVIDIA networking platform and the piping combined together, integrating very well. This is the latest and largest liquid-cooled data center, and you can see even the piping in the data center is almost 30 inches—very big.
這裡有很多線纜(Cabling)和真實的機架交換設備(Rack Switching Equipment)。我們非常有經驗,已經為一位客戶在這個資料中心交付了超過5萬個GPU。如果你有客戶,或想了解更多我們如何合作或是否有合作機會,歡迎隨時聯繫我們。我想分享一張五年前的OCP資料(OCP Data)曲線圖,雖然有些老舊,但我認為我們仍在這條路上。你可以看到,當時圖表提到,如果你想要低PUE(Power Usage Effectiveness)和高密度(High Density),就必須採用液冷技術(Liquid Cooling)。市場趨勢顯示,對於空氣冷卻(Air Cooling)的資料中心,最佳PUE大約是1.5左右,機架密度約在20到30千瓦之間。但如果你想轉向更低的PUE或更高密度的應用,就必須採用液冷技術。我們非常有信心推薦液冷,因為它能為你的資料中心節省大量空間(Space)、降低功耗成本(Power Consumption Cost),最重要的是,與Supermicro合作,我們能幫你在短時間內完成部署,就像你現在使用空氣冷卻機架一樣。目前我們提供冷卻設備和Supermicro機架整體解決方案(Total Solution)。這是最新的平台,例如B200。
There’s a lot of cabling and real rack-switching equipment here. We are truly experienced—we’ve delivered over 50,000 GPUs to one of our customers in this data center. If you have any customers or want to know more about how we can work together or if there’s any cooperation opportunity, just feel free to contact us. Here, I want to share a curve—it’s very old data from five years ago from OCP data, but I think we’re still on this path. You can see, I remember the first time I saw this figure—it mentioned that if you want a low PUE and high density, you need to go to liquid cooling. The market trend for air-cooled data centers shows the best PUE we can achieve is around 1.5 or so, with a rack density of about 20 to 30 kilowatts per rack. But if you want to switch to a lower PUE or high-density application, you need to go to liquid cooling. We’re here with high confidence to recommend liquid cooling—it can do a lot for your data center because you can save a lot of space, save a lot of money on power consumption, and most importantly, if you work with Supermicro, we can help you deploy in minimal time, just like what you’re doing with air-cooled racks. Right now, we have cooling equipment and can also provide Supermicro rack total solutions. You can see this is the latest platform, like the B200.
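Since PUE comes up repeatedly from here on, it may help to restate the metric: PUE is total facility energy divided by IT energy, so 1.0 is the theoretical floor. A minimal sketch using the figures quoted above:

```python
def pue(total_facility_kw, it_kw):
    """Power Usage Effectiveness = total facility power / IT power (1.0 is the ideal floor)."""
    return total_facility_kw / it_kw

# For 10 MW of IT load:
print(pue(15_000, 10_000))  # 1.5 -> roughly the best case for air cooling quoted above
print(pue(11_000, 10_000))  # 1.1 -> the kind of figure liquid cooling targets later in the talk
```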
我們可以根據你的需求打造整個機架(Rack),並依據你的配置(Configuration),提供過去一些熱門的解決方案(Popular Solutions)。從Supermicro的角度,我們不僅提供液冷機架(Liquid-Cooled Racks),我們還有AI機架(AI Racks)、GPU機架(GPU Racks),還能提供企業級直接液冷機架(Enterprise Direct Liquid-Cooled Racks)。最重要的是,我們還有高密度高效能運算機架(High-Density HPC Racks)。你可以看到每個機架內部的設計都不一樣,因為我們希望為客戶提供最佳解決方案(Best Solution)和終極解決方案(Ultimate Solution)。所有機架都是Supermicro自行設計的,包含Supermicro設計的冷卻分配單元(CDU, Cooling Distribution Unit)和分配管路系統(Distribution Manifold)。這些冷卻設備(Cooling Equipment)和CDU能針對每個獨立單元(Unit)進行最佳化。我們會根據你的不同配置、不同階段(Phase),甚至不同壓力條件(Pressure Conditions),為客戶設計最佳解決方案。這就是我們目前正在做的事情。
We can build the whole rack for you and, according to your configuration, offer some of our popular past solutions. From Supermicro, we are not only providing liquid-cooled racks—we have AI racks, GPU racks, and we can also provide enterprise direct liquid-cooled racks. Most importantly, we also have high-density HPC racks. You can see each rack has differences built inside, because we want to deliver the best, ultimate solution to our customers. All the racks are designed by Supermicro, including Supermicro-designed CDUs and distribution manifolds. These cooling components and CDUs can be optimized for each individual unit. Based on your different configurations, different phases, or even different pressure conditions, we can design the best solution for our customers. That's what we are doing right now.
我想舉一個例子來說明。以冷卻設備(Cooling Equipment)為例,我們能保證並承諾為客戶提供市場上最佳的冷卻性能(Cooling Performance)。拿一個例子來說,縱軸代表熱阻(Thermal Resistance),通常越低越好。藍線代表市場上的產品(Market Product),綠線代表Supermicro的產品(Supermicro Product)。我們的產品性能比市場上的現有產品高出30%。我們有一支非常專業且專注的團隊(Dedicated Expert Team),進行非常細緻的模擬(Simulation)。我們希望模擬每一滴液體(Droplet),確保從晶片(Chip)中汲取最大的熱量(Heat)。無論是溫度(Temperature)、流量的分佈(Flow Distribution),還是壓力條件下的速度(Velocity),我們都能做到最佳設計。我們還有很棒的製造合作夥伴(Manufacturing Partner),擅長製作散熱器(Heat Sink)中的微小部件。我想簡單提一下,這些散熱器的設計非常接近晶片,就像你的頭髮一樣貼近頭皮。我們設計出非常緊湊的尺寸(Compact Size),雖然小而複雜,但冷卻能力(Cooling Capacity)非常強大。因為我們提供整體解決方案(Total Solution),就像人體一樣,我們不僅關注核心的CDU(就像心臟),還照顧管路系統(就像血管),確保冷卻液能流遍整個系統,讓你的伺服器(Server)保持健康運作。
Here, I want to take one example. For cooling equipment, we can guarantee and promise our customers the best cooling performance compared to the market. Take one example: the vertical axis is thermal resistance—usually, lower is better. The blue line is a market product, and the green line is the Supermicro product. We have 30% better performance than current products in the market. We also have a very dedicated and expert team. We do a lot of tiny detail simulations to capture the maximum heat from the chip with each droplet. Whether it’s temperature, flow distribution, or velocity under pressure conditions, we can design it optimally. We also have a good partner who excels at manufacturing the tiny components inside the heat sink. I just want to bring it up—the heat sink design is very close to your chip, like your hair to your scalp. We come up with a very compact size—small and complex, but with very powerful cooling capacity. Because we provide a total solution, just like your human body, we’re not only taking care of your heart (the CDU) but also your blood vessels. We make sure the coolant flows through your system, keeping your servers healthy.
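The thermal-resistance comparison can be made concrete with the usual cold-plate relation T_chip ≈ T_coolant_in + R_th × P. The resistance values below are illustrative placeholders, not the actual data behind the curves on the slide; the point is only that a 30% lower R_th directly lowers chip temperature at a given power.

```python
def chip_temp_c(coolant_in_c, power_w, r_th_c_per_w):
    """Steady-state chip temperature from the simple cold-plate model: T = T_in + R_th * P."""
    return coolant_in_c + r_th_c_per_w * power_w

# Illustrative: a 1,000 W chip on 45 C inlet water, comparing a baseline cold plate
# against one with 30% lower thermal resistance.
print(round(chip_temp_c(45, 1000, 0.030), 1))  # 75.0 C
print(round(chip_temp_c(45, 1000, 0.021), 1))  # 66.0 C -> the same chip runs about 9 C cooler
```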
關於CDU,我要感謝我們的合作夥伴,他們也在現場。我們目前擁有市場上最強大的機架CDU(Rack CDU)。它的容量高達250千瓦(Over 250 Kilowatts),不僅能支援當前的130千瓦機架(130 Kilowatt Rack),還能應對下一代的需求。我們的設計非常聰明,具備N+2備援(N+2 Redundancy),並支援熱插拔泵浦(Hot-Swap Pump)。對於下一代AI資料中心來說,冷卻是最關鍵的部分(Most Critical)。接下來,你需要思考如何降低維護成本(Maintenance Cost)。這是我們的關鍵功能(Key Feature)。如果你要做預防性維護(Preventive Maintenance),特別是針對移動部件,比如泵浦(Pump),我們的第一層保障是備援設計(Redundancy),第二層是熱插拔功能(Hot-Swap)。這樣你無需關閉整個機架(Shut Down the Rack)來進行維護,只需使用Supermicro的備用零件(Spare Part),在兩分鐘內更換泵浦,機架就能繼續運作,為你的公司持續創造收益(Keep Making Money)。此外,我們還有觸控螢幕面板(Touch Screen Panel),所有細節都能從面板上看到。我們也支援資料通訊埠(Data Communication Port),包括ASAP、SNMP或RESTful API等協議(Protocols)。在下一代AI資料中心中,不僅伺服器需要更聰明,冷卻設備(Cooling Equipment)也需要更智能。我們在CDU內建了另一層控制邏輯(Control Logic),讓它更具智慧。
For the CDU, I want to thank our partners—they’re also here. We have the most powerful rack CDU in the market right now. It has a capacity of over 250 kilowatts, so we can support not only the current 130-kilowatt racks but also next-generation needs. It’s designed very smartly with N+2 redundancy and supports hot-swap pumps. For your next AI data center, cooling is the most critical part. Next, you need to think about how to reduce maintenance costs. That’s a key feature. If you want to do preventive maintenance, especially for moving parts like pumps, the first layer is redundancy, and the second is hot-swap capability. You don’t need to shut down the whole rack for maintenance—just take a Supermicro spare part, replace the pump in two minutes, and the rack keeps running, making money for your company. We also have a touch screen panel—every detail is visible on it. Another thing is we support data communication ports with protocols like ASAP, SNMP, or RESTful API. In the next-generation AI data center, not only do your servers need to be smarter, but your cooling equipment needs to be smarter too. We’ve built another layer of control logic inside this CDU.
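Because the CDU exposes SNMP and a RESTful interface, its telemetry can be pulled into an existing monitoring stack. The sketch below is purely illustrative: the endpoint path, field names, and credentials are hypothetical placeholders, not Supermicro's documented API, so treat it as a pattern and consult the actual CDU documentation.

```python
import requests

CDU_HOST = "10.0.0.50"  # hypothetical CDU management address
TELEMETRY_URL = f"https://{CDU_HOST}/api/v1/telemetry"  # hypothetical endpoint, for illustration only

def poll_cdu(session: requests.Session) -> dict:
    """Fetch one telemetry sample (e.g. supply/return temperature, flow rate, pump speed)."""
    resp = session.get(TELEMETRY_URL, timeout=5)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    with requests.Session() as s:
        s.auth = ("monitor_user", "change_me")  # placeholder credentials
        sample = poll_cdu(s)
        # Field names below are assumptions for the sketch, not a real schema.
        print(sample.get("supply_temp_c"), sample.get("return_temp_c"), sample.get("pump_speed_pct"))
```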
我們還能防止冷凝問題(Condensation Issue)。這個CDU提供許多功能,你可以調整泵浦速度(Pump Speed),監控機架(Rack)的濕度(Humidity)以及其他相關數據,一切盡在掌握之中。這個CDU雖然尺寸非常緊湊(Compact Size),但功能強大(Powerful)。在電源方面(Power Supply),我們也內建了高效能電源供應器(High-Efficiency Power Supply)。我們是唯一通過伺服器級鈦金認證(Server-Level Titanium Label)的設計,這能延長使用壽命(Longer Lifespan)。所有部件都由Supermicro設計並整合成機架解決方案(Rack Solutions)。這裡我想給大家一些數據,這是Supermicro的基本承諾(Bottom Line)。因為我們提供液冷伺服器(Liquid-Cooled Servers),我們的解決方案能保證至少90%的液冷覆蓋率(Liquid Cooling Coverage Ratio)給客戶。如果你想要達到100%,我們也能做到,這完全取決於你的配置(Configuration)。以我們的標準產品為例,比如搭載CPU、GPU和V3或新一代CPU(Next-Gen CPU)的產品,搭配記憶體(Memory),採用液冷技術(Liquid Cooling),冷卻覆蓋率可達93%。如果你需要更高的液冷覆蓋率,歡迎聯繫我們。目前我們有些產品已經能達到100%的液冷覆蓋。這項數據對下一代AI資料中心(Next-Gen AI Data Center)非常重要,因為它能決定資料中心冷卻設備(Chiller)的規模,甚至在100%覆蓋的情況下,你可能完全不需要冷卻設備。Supermicro的液冷服務(Liquid Cooling Services)可以幫你實現這一點。
We can also prevent the condensation issue. You can control a lot of functions—adjust the pump speed, monitor the humidity of the rack, everything. This CDU is a very compact size but very powerful. On the power side, we also have high-efficiency power supplies. We're the only ones who design server-level titanium-label certified power supplies to make your equipment last longer. Everything is designed and integrated by Supermicro as rack solutions. Here, I want to give you some numbers—this is the bottom line from Supermicro's side because we provide liquid-cooled servers. The baseline of our solution is that we can guarantee at least a 90% liquid cooling coverage ratio to our customers. If you want 100%, we can do that too—it all depends on your configuration. Take our standard product as an example: with a CPU, GPU, V3 or next-gen CPU, paired with memory, all on liquid cooling, the coverage is up to 93%. If you need an even higher liquid cooling coverage, just contact us. Right now, we have some products that can reach 100% liquid cooling coverage. This number is very important for next-gen AI data centers because it decides the size of the chiller in your data center—and at 100% coverage you may not need a chiller at all. Supermicro's liquid cooling services can help you get there.
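The reason the coverage ratio drives chiller sizing is simple arithmetic: whatever fraction of the IT load the cold plates do not capture still has to be removed by air. A minimal sketch using the percentages quoted above:

```python
def residual_air_load_kw(it_load_kw, liquid_coverage):
    """Heat left for the air side (CRAH/chiller) after direct-to-chip cooling takes its share."""
    return it_load_kw * (1.0 - liquid_coverage)

# A 130 kW rack at 93% coverage leaves ~9 kW for air handling;
# at 100% coverage the air-side load, and hence the chiller, effectively disappears.
print(round(residual_air_load_kw(130, 0.93), 1))  # 9.1
print(round(residual_air_load_kw(130, 1.00), 1))  # 0.0
```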
我要做一個簡單的結論,關於液冷技術(Liquid Cooling Technology)。它提供了一個機會,讓我們相較於空氣冷卻(Air Cooling)大幅降低功耗成本(Power Cost)。考慮到液冷基礎設施(Liquid Cooling Infrastructure)的電力成本,我們可以節省高達89%。對於整個資料中心的電力成本(Electricity Cost),我們能節省高達40%。因為所有重要元件(Key Components),像是CPU、GPU和記憶體(Memory),都由液冷技術冷卻,我們不需要太多額外的設備,也不需要強大的風扇(Powerful Fans)。因此,噪音水平(Noise Level)降低了55%。由於密度(Density)越來越高,根據我們的數據,資料中心的空間(Data Center Space)可以節省高達80%。這項技術能為你帶來許多好處。在下一代AI資料中心中,我們預期它將非常省電(Power-Saving)、符合綠色運算(Green Computing),而且非常安靜,因為內部不再需要那麼多強大的風扇。
Let me make a small conclusion here about liquid cooling technology. It presents an opportunity to reduce power costs compared to air cooling. Considering the electricity cost of liquid cooling infrastructure, we can save up to 89%. For the overall data center’s electricity cost, we can save up to 40%. Because all the important or key components—like the CPU, GPU, and memory—are cooled by liquid, we don’t need so many additional things inside or such powerful fans. So, we reduce the noise level by up to 55%. And because the density is getting higher, according to our data, you can save up to 80% of the data center’s space. This technology can bring you a lot of benefits. In the next AI data center, we expect it to be very power-saving, green computing, and also very quiet because there aren’t so many powerful fans inside.
Supermicro從液冷解決方案(Liquid Cooling Solutions)中能帶來的效益包括:我們現在就能提供下一代600瓦的解決方案(Next-Generation 600-Watt Solution)。我們的機架密度(Rack Density)可以支援高達130千瓦(130 Kilowatts)。在省電方面(Power Saving),功耗可降低15%,最高可達50%。最重要的是,Supermicro的所有解決方案不僅支援液冷(Liquid Cooling),還支援溫水液冷(Warm Water Liquid Cooling)。這意味著我們能支援最高45°C的冷卻水溫。我給大家一個概念:我來自台灣,住在台灣時常去泡溫泉(Hot Spring)。溫泉溫度大約是40°C,你無法待太久,因為太熱了,對吧?
The benefits of our liquid cooling solutions at Supermicro are: we can offer a next-generation 600-watt solution to you right now. The rack density we can support is up to 130 kilowatts. In terms of power saving, we can reduce power consumption by 15% and up to 50%. Most importantly, all of Supermicro's solutions can support not only liquid cooling but also warm water liquid cooling. That means we can support facility water up to 45°C. Just to give you a concept: I'm from Taiwan, and when I'm in Taiwan, I often go to hot springs. The hot spring is around 40°C—you can't stay too long because it's too hot, right?
但Supermicro的伺服器(Servers)能在45°C下正常運作。這個數字非常重要,因為如果我們能支援如此高的溫度(High Temperature),那麼你的資料中心(Data Center)就不需要額外的冷卻設備(Chiller)。你只需大幅縮減冷卻設備的尺寸(Size),就能為機架(Racks)騰出更多空間。因為我們採用了溫水液冷技術(Warm Water Liquid Cooling Technology),我們正在與一些客戶合作,收集他們的回饋並實現熱能再利用(Heat Reuse)。試想一下,當你在使用ChatGPT或Open AI時,房間裡的暖氣系統(Warm Air System)或游泳池(Swimming Pool)可以利用這些運算產生的熱量(Heat)來加熱。理論上,Supermicro的解決方案能讓PUE(Power Usage Effectiveness)低於1。沒錯,因為我們重複利用了這些熱量,對吧?
But Supermicro’s servers can survive at 45°C. Okay, that number is very important because if we can support such a high temperature, then you don’t need a chiller in your data center anymore—you just reduce the size of your cooling equipment significantly, and then you can give more space to the racks. Because we have warm water liquid cooling technology, some of our customers—we’re working with them to take their feedback and reuse the heat. Just imagine that when you’re using ChatGPT or Open AI, the warm air system or swimming pool in your room is heated by the heat from those computations. Theoretically, Supermicro’s solution can achieve a PUE below 1. Yeah, because we reuse that heat, right?
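Strictly speaking PUE bottoms out at 1.0, so the "below 1" idea is usually expressed with the related Energy Reuse Effectiveness (ERE) metric, which credits heat exported for reuse. A minimal sketch of that bookkeeping, with illustrative numbers rather than measurements from the talk:

```python
def ere(total_facility_energy, reused_energy, it_energy):
    """Energy Reuse Effectiveness = (total facility energy - energy reused elsewhere) / IT energy."""
    return (total_facility_energy - reused_energy) / it_energy

# Illustrative: a PUE-1.1 facility that exports 20% of its input energy as useful warm water
# (district heating, a pool, etc.) has an "effective PUE" below 1.
print(round(ere(total_facility_energy=1.1, reused_energy=0.2, it_energy=1.0), 2))  # 0.9
```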
最重要的是,我們還能提供整合式軟體(Integrated Software),在接下來的兩頁投影片中我會詳細介紹。這裡有一個我們的客戶案例,因為空間有限(Space is Limited),他們希望轉向液冷技術(Liquid Cooling)。舉個例子,假設你需要100千瓦(100 Kilowatts)的運算密度(Computing Density),採用空氣冷卻機架(Air-Cooled Racks)需要好幾個機架才能達成;採用液冷整合技術(Liquid-Cooled Integrated Technology)後,只需一個機架就能達成同樣效果。這樣你就能在資料中心(Data Center)獲得更多空間(Space),或者在相同空間內放入更多機架。這就是液冷技術的美妙之處。目前我們正在與客戶合作進行一個加熱項目(Heating Project),這是一個熱能再利用(Heat Reuse)的實際案例。我們使用Supermicro設計的冷卻設備(Cooling Equipment)和伺服器(Servers),將所有元件整合到一台伺服器中。之後,我們將多台伺服器整合成一個機架(Rack)。現在,我們將多個機架整合成一個叢集(Cluster)。所有熱量(Heat)都被內部的冷卻器(Cooler)吸收,透過CDU(Cooling Distribution Unit)進行熱交換(Heat Exchange),然後CDU將設施用水(Facility Water)直接送到外部的冷卻塔(Cooling Tower)。這就是我們正在運作的模式。你可以看到,在傳統空氣冷卻資料中心(Traditional Air-Cooled Data Center)或節能資料中心(Eco Data Center)中,通常會有冷卻設備(Chiller)或HVAC系統(HVAC System),因為它們需要將環境溫度冷卻到例如25°C,特別是在像香港(Hong Kong)這樣炎熱潮濕的地方。但有了我們的解決方案,我們保證你可以直接與冷卻塔連接,不需要額外的冷卻水(Chilled Water)。
The most important thing is that we can also provide you with integrated software—later, in the next two slides, I'll introduce more. Here's one of our customers' cases: because space is limited, they want to switch to liquid cooling. Let's take an example: say you need 100 kilowatts of computing density. With air-cooled racks, that load has to be spread across several racks; with liquid-cooled integrated technology, just one rack is needed. So, you can get more space or fit more racks in the same space in your data center. That's the beauty of liquid cooling technology. This is a heating project we're doing with our customer right now—just to give you a real example of heat reuse. We have Supermicro-designed cooling equipment and servers, integrating everything into a server. Later on, we integrate all the servers into a rack. Right now, we integrate a lot of racks into a cluster. All the heat is absorbed by the coolers inside and exchanged by the CDU. Then, the CDU sends the facility water directly outside to the cooling tower. That's the working model we're using. Usually, in traditional air-cooled data centers or eco data centers, there's a chiller or HVAC system because they need to cool the environment to, let's say, 25°C—especially in a hot, humid place like Hong Kong. They need chilled water, but with our solution, we can guarantee you can connect directly to our cooling tower without it.
關於這個冷卻塔(Cooling Tower),如果你有時間,歡迎來我們的園區(Campus)參觀。我們有自己設計的冷卻塔,特別為多種場景(Occasions)設計。它內建非常聰明的邏輯(Smart Logic),能實現節能(Energy Saving)。我們能保持高效能(High Performance)和強大的機架運算能力(Rack Computing Capability),同時讓你的資料中心功耗(Power Consumption)降到最低,節省高達40%的能源。對於下一代AI資料中心(Next-Generation AI Data Center),能源分配(Energy Allocation)將非常關鍵。這裡以一個小型資料中心(Small-Size Data Center)為例。在空氣冷卻資料中心(Air-Cooled Data Center)中,幾年前的平均PUE(Power Usage Effectiveness)是1.6,這是最新節能資料中心的平均值。HVAC系統(HVAC System)通常佔據32%的功耗。但如果你轉向液冷技術(Liquid Cooling),PUE可以降到1.1。你可以從HVAC系統中節省大量能源,還能從伺服器端(Server Side)節省15%的功耗,因為我們減少了風扇(Fans)的數量和功率(Power)。如果你從空氣冷卻轉換到液冷,PUE會從1.6降到1.1。最重要的是,我們能幫你快速實現這個目標(Achieve This Goal)。下一頁投影片我會給你一個例子,來說明這個PUE 1.1的案例。
For this cooling tower, if you have time, you’re welcome to visit our campus. We have our own designed cooling tower, specially designed for several occasions. It also has very smart logic built inside, so we can achieve energy saving. We can keep you at very high performance with powerful rack computing capability, but also the lowest power consumption in your data center—saving up to 40%. For the next-generation AI data center, energy allocation will be critical. Here, take an example of a small-size data center. For air-cooled data centers a few years ago, the PUE was 1.6—I think that’s the average in the latest eco data centers. The HVAC system usually occupies 32% of power consumption. But if you switch to liquid cooling, the PUE is 1.1. You can save a lot of energy from the HVAC system and also save 15% on the server side because we reduce the number or power of fans. If you switch from air cooling to liquid cooling, the PUE will drop from 1.6 to 1.1. The most important thing is we can help you achieve this goal very fast. In the next slide, I’ll take one example for you about this PUE of 1.1.
我們提到,相較於傳統空氣冷卻資料中心(Traditional Air-Cooled Data Center),我們的解決方案能節省約40%的能源(Energy Saving)。在同樣的空間(Same Space)內,你可以在資料中心放入更多機架(Racks)。因為我們減少了HVAC系統(HVAC System)的能源消耗(Power Consumption),你就可以將這些節省下來的電力用於運算節點(Computing Nodes)。這樣,你的運算能力(Computing Capability)會更強大,同時大幅降低這類解決方案的功耗。Supermicro的綠色運算(Green Computing)能提供液冷服務(Liquid Cooling Services)和更高操作溫度的伺服器(Higher Operating Temperature Servers)。我們的伺服器能支援最高45°C的設施用水(Facility Water)。即使在你的環境中(Environment),我們的伺服器也能在35°C下穩定運作(Work Very Well)。因此,你的資料大廳(Data Hall)不需要那麼強大的空調(Air Conditioning)。所有最新的Supermicro伺服器都配備鈦金認證電源供應器(Titanium-Label PSU),能節省大量電力並提供極高的效率(High Efficiency)。更重要的是,我們可以根據你的機架需求(Rack Requirement),例如42U、48U、52U或36U,幫助你在資料中心實現最高的機架密度(Rack Density)。
As we said, that's almost 40% energy saving compared to a traditional air-cooled data center. In the same space, you can put more racks in your data center, because the energy saved from the HVAC system can be spent on your computing nodes instead. So, you get more powerful computing capability while cutting a lot of power with this kind of solution. Supermicro's green computing can provide liquid cooling services and higher-operating-temperature servers: we can support facility water up to 45°C, and our servers still work very well with a 35°C ambient in your environment. So, you don't need such powerful air conditioning inside your data halls. All the latest Supermicro servers have titanium-label PSUs that save a lot of power and offer very high efficiency. The important thing is that, according to your rack requirements—42U, 48U, 52U, or 36U—we can help you achieve the highest rack density in your data center.
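The "spend the HVAC savings on compute" point can be quantified: under a fixed utility feed, the IT power you can actually deliver is roughly the facility budget divided by PUE. A minimal sketch using the 1.6 and 1.1 PUE figures quoted above, with the budget chosen only for illustration:

```python
def deliverable_it_mw(facility_budget_mw, pue):
    """IT power that fits under a fixed facility power budget at a given PUE."""
    return facility_budget_mw / pue

budget_mw = 15.0  # illustrative utility feed
air = deliverable_it_mw(budget_mw, 1.6)     # ~9.4 MW of IT load
liquid = deliverable_it_mw(budget_mw, 1.1)  # ~13.6 MW of IT load
print(f"~{(liquid / air - 1) * 100:.0f}% more compute under the same facility budget")  # ~45%
```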
這裡我想舉一個真實案例,數據來自第三方(Third Party)。這是在滿載壓力模式(Full Load Stress Mode)下的測試結果,使用Supermicro解決方案,我們達到了PUE 1.08(Power Usage Effectiveness 1.08),這是真實數據(Real Data)。這不是模擬(Simulation)的結果,而是由我們的客戶和第三方測量的。平均PUE是1.17。最重要的是,這是在亞洲地區(Asia Side)測試的,不是在美國。這裡氣候非常炎熱且高濕度(Hot and High Humidity),但我們依然能實現這樣的目標。從我們的觀點來看,下一代資料中心(Next-Generation Data Center)必須轉向液冷技術(Liquid Cooling),並將PUE降到1.2以下,甚至1.1以下。在總擁有成本(TCO, Total Cost of Ownership)方面,這項技術能幫你節省多少資金呢?這裡用同一個例子來說明,假設總功率(Total Power)是10兆瓦(10 Megawatts)。對於空氣冷卻版本(Air-Cooled Version),部分PUE(Partial PUE)大約是1.5;而採用液冷技術後,PUE是1.08。如果你使用8000個NVIDIA H100 GPU(NVIDIA H100 GPUs),空氣冷卻需要256個機架(256 Racks)。但有了液冷技術,我們能提高密度(Increase Density),只需要128個機架(128 Racks)。
Here, I want to take an example—it’s real data with a third party. This result is with full load in stress mode. We reached a PUE of 1.08 with the Supermicro solution—this is real data. It’s not from a lot of simulations; it’s measured by our customer and a third party. The average PUE is 1.17. The most important thing is that this is on the Asia side, not in the US—it’s a very hot and high-humidity region, yet we can still reach this goal. From our point of view, the next-generation data center needs to switch to liquid cooling and reach a PUE below 1.2, or even below 1.1, something like that. In terms of TCO, how much money can you save with this technology? Here’s the same example: we take a total power of 10 megawatts. For the air-cooled version, the partial PUE is around 1.5 something, and for liquid cooling, the PUE is 1.08. If we assume over 8,000 NVIDIA H100 GPUs, the rack count is 256 racks for air cooling. But with liquid cooling technology, we can increase the density, so the total racks drop to only 128.
因為我們大幅減少了風扇系統(Fan System)和空調(Air Conditioning)的功耗,整個資料中心的總功耗(Total Power)從15,000千瓦(15,000 Kilowatts)降到接近10,000千瓦(10,000 Kilowatts)。以年度電力消耗(Annual Power Consumption)來說,空氣冷卻大約是140,000兆瓦時(140,000 Megawatt-Hours),而液冷是85,000兆瓦時(85,000 Megawatt-Hours),減少了將近38%。總投資(Total Investment)方面,大家常問我液冷(Liquid Cooling)是否比空氣冷卻(Air Cooling)更貴。老實說,如果你是新建一個資料中心(New Data Center),液冷的成本其實比空氣冷卻更便宜。
The total power reduces because we cut a lot from the fan system and air conditioning power consumption. The total power for your data center drops from 15,000 kilowatts to almost 10,000 kilowatts. Okay? The annual power consumption is about 140,000 megawatt-hours for air cooling, and for liquid cooling, it’s 85,000 megawatt-hours—it’s reduced by almost 38%. Regarding total investment, everybody keeps asking me if liquid cooling is more expensive than air cooling. To be honest, if you’re building a new data center, it’s cheaper than air cooling.
因為你需要考慮你的HVAC系統(HVAC System),採用我們的解決方案後,你可以減少這部分系統以及資料中心的管路(Piping)。你可以在之前的投影片中看到相關內容。在最新的AI資料中心(AI Data Center)中,傳統的HVAC系統已經不復存在。這樣我們就能大幅節省基礎設施(Infrastructure)的成本。總投資(Total Investment)方面,對於空氣冷卻版本(Air-Cooled Version),資本支出(CAPEX)大約是3.34億美元(334 Million USD)。而根據我們的數據,液冷版本(Liquid-Cooled Version)的CAPEX低於空氣冷卻,只有3.21億美元(321 Million USD)。在總運行成本(Running Cost)方面,第一年的運營支出(OPEX)對於空氣冷卻是3800萬美元(38 Million USD),而液冷版本是2300萬美元(23 Million USD),減少了將近40%。因此,五年後,你能透過這項技術節省超過8000萬美元(80 Million USD)。我們正在誠摯地與客戶合作並進行交流。最終,你必須轉向液冷技術(Liquid Cooling),因為TDP(Thermal Design Power)越來越高。越早投資(Invest Earlier),你就越早能回收資金(Get the Money Back)並開始賺錢(Make Money)。這也是液冷技術(Liquid Cooling)最近如此熱門的另一個主要原因。它不僅解決了晶片TDP的問題,還能幫你在資料中心省錢甚至賺錢。
You also need to consider your HVAC system: with our solution you can reduce that system and the piping in the data center, as you saw in the previous slide. In the latest AI data centers, there's no HVAC anymore—so we can save a lot on your infrastructure. The total investment, the CAPEX for the air-cooled version, is about 334 million USD. For liquid cooling, according to our data, it's less than air cooling, at 321 million USD. The total running cost—the first-year OPEX—is 38 million USD for the air-cooled version and 23 million USD for the liquid-cooled version, a reduction of almost 40%. So after five years, you can save over 80 million USD with this technology. Right now, we sincerely work with and talk to our customers. Eventually, you need to go to liquid cooling because the TDP keeps getting higher and higher. Investing as early as possible lets you get the money back and start making money sooner. That's another main reason why liquid cooling is so hot recently—it's not just about solving chip TDP issues; you also save, or even make, money in your data centers.
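The annual-energy and five-year figures above follow from straightforward arithmetic. The sketch below roughly reproduces them (8,760 hours per year; CAPEX/OPEX values as quoted; small differences from the quoted totals are rounding on the original slide or here):

```python
HOURS_PER_YEAR = 8760

def annual_energy_mwh(total_power_kw):
    return total_power_kw * HOURS_PER_YEAR / 1000.0

air_mwh = annual_energy_mwh(15_000)   # ~131,400 MWh/year (the talk quotes ~140,000)
dlc_mwh = annual_energy_mwh(10_000)   # ~87,600 MWh/year  (the talk quotes ~85,000)
print(f"annual energy reduction ~{(1 - dlc_mwh / air_mwh) * 100:.0f}%")  # ~33% here vs ~38% quoted

# Five-year view using the quoted CAPEX/OPEX (USD millions), assuming flat yearly OPEX.
capex_air, capex_dlc = 334, 321
opex_air, opex_dlc = 38, 23
savings_5yr = (capex_air - capex_dlc) + 5 * (opex_air - opex_dlc)
print(f"five-year savings ~${savings_5yr}M")  # ~$88M, consistent with "over 80 million USD"
```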
Supermicro能提供所有解決方案(All Solutions),一站式服務(One-Stop Shop),包括完整的液冷解決方案(Total Liquid Cooling Solution),涵蓋CDU(Cooling Distribution Unit)、伺服器(Servers)、機殼(Chassis),甚至冷卻塔(Cooling Tower)。我們還有SuperCloud Composer,這是一款非常出色的測量工具(Measurement Tool)。你可以用它監控機架(Rack)中的一切,包括運算節點(Computing Nodes)、CDU,甚至冷卻塔的狀態。我們能幫助你在非常早的階段(Early Stage)實現綠色運算(Green Computing)。Supermicro一直大力推廣綠色運算,我們提供用戶友好的設施連接(User-Friendly Facility Connection),全部採用業界標準(Industry Standards),技術非常成熟(Mature)且穩定(Stable)。你無需擔心漏水風險(Risk of Leakage),因為這種技術已經應用了六七十年。我們的團隊還能從供水線(Supply Line)和回流線(Return Line)提供支援。我們可以提供設施房(Facility House)、閥門(Valves),以及所有的管路設計(Piping Design)。此外,我們只使用卡洛克接頭(Camlock Fittings),這是一種簡單且非常可靠的連接器(Reliable Connector),適用於所有管路。我們希望讓客戶更容易採用這種綠色運算解決方案,讓你的工作更輕鬆。
Supermicro can provide all the solutions—a one-stop shop—the total liquid cooling solution, including CDUs, servers, chassis, and even the cooling tower. We also have SuperCloud Composer, a very good measurement tool. You can monitor everything in the rack, including the computing nodes, CDU, and even the cooling tower. We can help you realize green computing at a very early stage. Here, Supermicro always strongly promotes green computing as a company. We provide and design user-friendly facility connections—all industry standards, so they’re very mature and stable. You don’t need to worry about the risk of leakage because this has been around for maybe 60 or 70 years already. Our team can also support you from the supply line or return line—we can provide a facility house, valves, and all the piping designs. We only use Camlock fittings, an easy and very reliable connector for all the piping. We want to make it easier, and our customers’ jobs easier, to have this kind of green computing solution.
我們目前能提供什麼呢?在之前的投影片中,我們提到Supermicro交付伺服器(Deliver Servers)、開發機架(Develop Racks)、交付叢集(Deliver Clusters)。但現在我們正在做的是提供完整的資料中心模組化解決方案(Data Center Building Block Solution),我們稱之為ADCPPS(Advanced Data Center Power and Performance Solution)。根據你的伺服器設計(Server Design)或伺服器配置(Server Configuration),我們可以推薦並提供你所需的最佳伺服器,並將其整合到機架中(Integrate into a Rack)。我們還能將多個機架整合成一個叢集(Cluster),並設計管路(Piping)。我們也能從我們的園區(Campus)交付冷卻塔(Cooling Tower)。我們專注於液冷AI(Liquid-Cooled AI),歡迎來我們的園區參觀。目前我們園區裡有一個展示廳(Demo Room),你可以親自體驗。
What we can offer right now—in the previous slide, we mentioned Supermicro delivers servers, develops racks, and delivers clusters. But most of what we’re doing right now is delivering the complete data center building block solution—we call it ADCPPS. According to your server design or configuration, we can provide or recommend the best servers you need and integrate them into a rack. We can also integrate several racks into a cluster and design the piping. We can deliver your cooling tower from our campus too. We focus on liquid-cooled AI—welcome to our campus. We have a demo room in our campus right now. Okay?
這是一個軟體管理系統(Software Management System)。我認為Supermicro是唯一一家能同時監控伺服器(Servers)、CDU(Cooling Distribution Unit),甚至冷卻塔(Cooling Tower)的公司。這套系統適用於多種應用(Applications),我們在系統內建了大量的邏輯(Logic)。我們利用AI技術(AI Technology)來冷卻AI資料中心(AI Data Center)。透過這套系統,你甚至可以自動調整風扇速度(Fan Speed)或泵浦速度(Pump Speed),以節省資料中心的電力(Power)。Supermicro的冷卻塔標準容量是15兆瓦(15 Megawatts),機械PUE(Mechanical PUE)為1.09(1.09),使用壽命(Life Cycle)非常長,超過15年。最重要的是,如果你能提供預測(Forecast),我們能在4週內(4 Weeks)交付冷卻塔。因為時間有限,我快速分享一些照片,給我1分鐘。這是我們想與你分享的內容。即使在非常惡劣的環境(Harsh Environment),比如大雪(Heavy Snow),客戶依然需要冷卻塔來降低設備溫度(Cool Down the System)。我們的Supermicro冷卻塔在這種惡劣環境下依然運作良好(Work Very Well)。即使一夜大雪,兩天內一切就能完成,這台機架(Rack)在一週內就能上線(Online)。
This is a software management system. I think Supermicro is the only one that monitors servers, CDUs, and even the cooling tower. This one is for various applications, and we’ve built a lot of logic inside the system. We use AI technology to cool down AI data centers. With this, you can even automatically adjust the fan speed or pump speed to save power in your data centers. Supermicro’s cooling tower is a standard 15-megawatt unit, with a mechanical PUE of 1.09. The life cycle is quite long—over 15 years. Most importantly, if you have a forecast, we can help deliver the cooling tower within 4 weeks. Some pictures—because we’re almost out of time—okay, just give me 1 minute. This is something we can share with you. In a very harsh environment with heavy snow, they still need a cooling tower to cool down the system. Our Supermicro cooling tower still worked very well in this harsh environment. Even with heavy snow one night, you can see everything is finished in 2 days, and this rack is online within 1 week.
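Using "AI to cool AI" ultimately means closed-loop control: read telemetry, then set fan and pump speeds. The sketch below is a deliberately simple proportional controller on coolant return temperature, just to illustrate the idea; it is not SuperCloud Composer's actual control logic, and the setpoints are assumptions.

```python
def pump_speed_pct(return_temp_c, setpoint_c=45.0, min_pct=30.0, max_pct=100.0, gain=8.0):
    """Toy proportional control: ramp pump speed up as return water approaches the setpoint."""
    error = return_temp_c - (setpoint_c - 5.0)  # start ramping 5 C below the setpoint
    speed = min_pct + gain * max(error, 0.0)
    return min(max(speed, min_pct), max_pct)

for t in (35.0, 40.0, 43.0, 46.0):
    print(f"return water {t:.0f} C -> {pump_speed_pct(t):.0f}% pump speed")
# 35 C -> 30%, 40 C -> 30%, 43 C -> 54%, 46 C -> 78%
```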
Supermicro的整體產能(Capacity)每月達到5500台(5,500 Units Per Month),大多數在美國(US),因此我們能確保品質(Measure the Quality)。我們可以提供12種不同的機架解決方案(12 Rack Solutions)給客戶。最重要的是,我們提供APOC服務(APOC Service),具備安全的遠端存取功能(Secure Remote Access)。這裡有一些照片與大家分享。我們的設計經過充分整合(Fully Integrated),特別針對液冷機架(Liquid-Cooled Racks)進行優化。我們擁有強大的產能(Capacity),能以最高品質(Highest Quality)交付機架給客戶。
Supermicro's overall capacity is around 5,500 units per month, most of it in the US, so we can ensure the quality. We can provide 12 different rack solutions to our customers. Most importantly, we have an APOC service with secure remote access. Here are some pictures to share with you. Our designs are fully integrated and well thought out, especially for liquid-cooled racks, and we have plenty of capacity to deliver the highest-quality racks to our customers.
這裡有兩個數字。第一個是下一代AI資料中心(Next-Generation AI Data Center)的能源使用效率(Power Usage Effectiveness, PUE),透過液冷技術(Liquid Cooling Technology)可達到90%的效率提升。第二個是五年內資料中心成本節省(Data Center Cost Saving)36%。只要與Supermicro合作,就能實現這些成果。這是最後一頁投影片,關於我們的平台(Platform),非常感謝大家。這一切都由NVIDIA提供技術支援(Powered by NVIDIA),我們與NVIDIA密切合作(Close with NVIDIA),與NVIDIA團隊(NVIDIA Team)共同打造了智能液冷(Smart Liquid Cooling)和綠色運算解決方案(Green Computing Solution)。這就是我們目前能提供的解決方案。如果你有興趣,歡迎造訪我們的網站(Website)了解更多。
Okay, so two numbers. One is the power usage effectiveness for the next AI data center—90% from liquid cooling technology. And 36% data center cost saving over 5 years. Just work with Supermicro. The last slide—this is a little bit about the platform, and thanks a lot. It’s all powered by NVIDIA, and we work closely with NVIDIA. We built the smart liquid cooling and green computing solution with the NVIDIA team. We can deliver that—it’s the solution right now. If you’re interested, welcome to our website to check it out.
謝謝大家,很抱歉(時間有限)。我們可以讓內容更豐富。再次感謝大家。
Okay, thank you. Sorry—with more time I could cover more. Thank you, thank you, thank you.
如果有幾個問題,我們可以在這場演講和下一場演講之間回答。如果你有問題想問CW Chen博士(Dr. CW Chen),可以隨時上前來問他。
We can take maybe a few questions between this talk and the next talk. If you have questions for Dr. CW Chen, you can always come up here and ask him.
是的。
Yes.
感謝你的演講。我有一個簡單的問題。每當新一代GPU(Graphics Processing Unit)推出時,它的性能更強大,但也帶來更多熱量(Heat),消耗更多電力(Power)。如果我們想保持GPU的密度(Density)不變,你認為液冷技術(Liquid Cooling)能跟得上每兩年或四年新一代GPU帶來的功耗和熱量增加嗎?
Thank you for your talk. I have a quick question. Every time there is a new generation of GPU, it’s much more powerful, but it’s also bringing more heat and consuming much more power. If we want to keep the same density of GPUs, do you believe that liquid cooling will be able to keep up with the increase in power and heat brought by a new generation of GPUs every 2 or 4 years?
是的,是的。
Yeah, yeah.
我認為這是一個非常好的問題,因為我們知道現在每一代GPU的TDP(Thermal Design Power)都在增加。從Supermicro的設計角度(Supermicro Design),我們會提前規劃(Plan in Advance)。我們會設計更高的冷卻容量(Cooling Capacity),即使是單個CDU(Cooling Distribution Unit),也能支援下一代(Next Generation)。而且,因為每個GPU平台(GPU Platform)都有不同的需求,我們會針對每個晶片(Chip)進行特別設計(Special Design)。如果機架結構(Mechanism)保持不變,我們仍然可以使用相同的冷卻設備(Cooling Equipment)來應對下一代。所以,是的,可以跟得上。
I think it's a very good question, because we know right now every generation's TDP is getting higher. From our Supermicro design perspective, we plan for it in advance. We design in more cooling capacity—even a single CDU can support the next generation. Also, because each GPU platform is different, we have a special design for each chip. If the rack mechanism stays the same, we can still use the same cooling equipment for the next generation. So yes, it can keep up.
你看到的故障率(Failure Rates)是多少?有哪些可靠性(Reliability)問題?我這麼問是因為我想知道需要內建什麼樣的備援(Redundancy)。是機架級備援(Rack-Level Redundancy)還是列級備援(Row-Level Redundancy)?比如,如果機架裡的一個泵浦(Pump)故障了,我得關閉整個系統(Shut the Whole Thing Down),直到更換完成我才恢復運作。這是影響單個機架(Rack)還是整列(Row)?
What kind of failure rates do you see? What kind of reliability issues are there? I ask because I’m wondering about what kind of redundancy you need to build in—is it rack-level redundancy or row-level redundancy? Like, if I take a failure on a pump in a rack and I have to shut the whole thing down, I’m down until it’s replaced. Is that a rack or a row?
是的,對吧?
Yeah, right?
我不知道其他公司的情況,但從Supermicro的角度(Supermicro Side)來說,我們的結果顯示,相較於空氣冷卻(Air Cooling)的GPU系統,我們的故障率(Failure Rate)只有一半(Half)。有時候GPU可能會因為某些原因損壞,但液冷(Liquid Cooling)非常穩定,因為溫度低(Low Temperature)。至於泵浦(Pump)的問題,目前為止,與我們供應商(Vendor)的良好合作關係(Good Partnership)讓我們還沒遇到任何泵浦故障(No Pump Issue)。我們已經交付了總計400兆瓦(400 Megawatts)的冷卻容量給客戶。如果你在考慮備援(Redundancy),我認為N+1(N+1 Redundancy)是必要的。但如果是獨立式CDU(Standalone CDU),裡面有兩個泵浦(Two Pumps),採用1+1備援(1+1 Redundancy),我認為這已經足夠了(Quite Enough)。
Okay, I don't know about other companies, but from Supermicro's side, our results show the failure rate compared with air-cooled GPU systems is only half. Sometimes GPUs fail for one reason or another, but liquid cooling is very stable because of the low temperatures. Regarding the pumps, so far—I think because we have a good partnership with our vendor—there's been no pump issue, and we've already delivered a total of 400 megawatts of cooling capacity to customer sites. If you're considering redundancy, I think N+1 is needed. But if you're talking about a standalone CDU with two pumps inside, 1+1 redundancy, I think that's quite enough.
有很多有趣的問題。
A lot of interesting questions.
我想問一下,使用45°C(設施用水溫度)有什麼條件限制(Condition Limitation)嗎?
So may I ask, is there any condition or limitation on using 45°C facility water?
是的,當然有,這取決於機架內有多少GPU(Graphics Processing Units)。如果以我們的標準產品(Standard Product)為例,比如內含64個GPU(64 GPUs),我們可以支援最高45°C(Up to 45°C)。但如果你想要放入更多GPU,我們會建議你降低設施用水溫度(Facility Water Temperature)。不過,如果你的資料中心環境條件(Environmental Conditions)非常複雜(Complicated),我們可以與你合作,嘗試在40°C或45°C下運作。
Yes, yes, for sure—it depends on how many GPUs are inside. If it’s our standard product, let’s say 64 GPUs inside, we can support up to 45°C. But if you want more than that, we will recommend you need to reduce the facility water temperature. But if you have a very complicated computing environment in your data center, we can work with you to try even 40°C or 45°C in your data center, yes.
我有一個問題,我想這可能是最後一個,至少在你上前與CW Chen博士(Dr. CW Chen)交談之前。液冷解決方案(Liquid Cooling Solution)和空氣冷卻解決方案(Air-Cooled Solution)之間的成本差異(Cost Delta)是多少?假設標準化到資料中心的規模(Size of the Data Center)。
Yeah, so I have a question. I think it might be the last one, at least over here before you come and talk to Dr. CW Chen. It’s about the cost delta between the air-cooled solution and the liquid-cooled solution, normalized to whatever size of the data center.
是的,因為現在對於空氣冷卻的伺服器(Air-Cooled Servers),散熱器(Heat Sink)越來越大(Getting Bigger),成本也因此變得更高(More Expensive)。考慮到整體成本(Total Cost),液冷機架(Liquid-Cooled Rack)的成本實際上低於空氣冷卻(Air Cooling)。老實說,你能節省大量的冷卻設備材料(Cooling Material),也能避免許多低效的設計(Inefficient Design)。
Yeah, because right now for air-cooled servers, you see the heat sinks are getting bigger and bigger, and they're getting expensive too. Considering the total cost, the overall cost of a liquid-cooled rack is lower than air cooling. To be honest, you save a lot of cooling material and you avoid a lot of inefficient design.
所以我們付給你的費用比空氣冷卻解決方案要少,因為你提供了液冷解決方案(Cooling Solution)?
So we'd pay you less for a liquid-cooled solution than for an air-cooled one?
是的。感謝你的問題。讓我們為他鼓掌,非常感謝。
Yeah. Thank you, thank you for your question. Yes. A round of applause for you. Thank you very much.
感謝大家抽出時間。是的。
Thank you for your time. Yes.
這場演講非常精彩,我們即將進入下一個環節。但在這之前,還有機會提問。
So that was great—we're going to be moving on to the next session shortly, but there's still a chance for a few more questions before then.