# 2025 AI Model API Comparison
The comparison covers the following metrics:
- **API speed**: request latency and throughput (requests per second, RPS)
- **Functionality**: multimodal capabilities (image, audio, video), code generation, long-context handling
- **Response time**: average response time, maximum/minimum latency
- **Price**: billing model (per request, per token) and per-model pricing details (a worked cost sketch follows this list)
- **Intelligence**: reasoning ability, creativity, breadth of knowledge
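Since every model below bills per token, it is worth seeing how a per-request cost falls out of per-million-token prices. Below is a minimal sketch in Python; the prices are the ones quoted in the table and are illustrative only (they may have changed since publication).

```python
# Per-request cost estimator for token-based billing.
# Prices are USD per million tokens, as quoted in the comparison table.

PRICES = {
    # model: (input $/M tokens, output $/M tokens)
    "o3-mini": (1.10, 4.40),
    "o1": (15.00, 60.00),
    "deepseek-r1": (0.55, 2.19),
    "claude-3.7-sonnet": (3.00, 15.00),
    "nova-pro": (0.80, 3.20),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request = tokens / 1M * price-per-M."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

if __name__ == "__main__":
    # Example: a 2,000-token prompt with a 500-token answer.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 2_000, 500):.4f}")
```

For instance, the 2,000/500-token request above costs about $0.0044 on o3-mini versus $0.06 on o1, which is where the "roughly 93% cheaper" figure in the table comes from.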
**Model** | **API speed** (request latency, throughput RPS) | **Functionality** (multimodality, code generation, long context) | **Response time** (average response time, max/min latency) | **Price** (billing model, pricing) | **Intelligence** (reasoning, creativity, knowledge coverage)
---|---|---|---|---|---
**OpenAI o3-mini (High)** | Lower latency than its predecessor; supports high concurrency. A "reasoning effort" setting adjusts how deeply the model thinks: in low mode the first token arrives in ~0.5 s with throughput of ~175 tokens/s ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=increases%20inference%20costs%20and%20latency,latency%20to%20the%20first%20token)); high mode adds extra chain-of-thought steps and latency can reach ~15 s ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=increases%20inference%20costs%20and%20latency,latency%20to%20the%20first%20token)). | Text-only model; no image input ([OpenAI o3-mini | OpenAI](https://openai.com/index/openai-o3-mini/#:~:text=o3%E2%80%91mini%20does%20not%20support%20vision,opens%20in%20a%20new%20window)). Strong at code generation and mathematical reasoning, with developer features such as function calling and structured outputs ([OpenAI o3-mini | OpenAI](https://openai.com/index/openai-o3-mini/#:~:text=OpenAI%20o3%E2%80%91mini%20is%20our%20first,opens%20in%20a%20new)). Context window of roughly 200K input + 100K output tokens ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=reasoning%20models%20DeepSeek,knowledge%20cutoff%20is%20October%202023)). | The default medium mode responds quickly, a clear speedup over the o1 model; the high-effort mode thinks longer and responds more slowly, with latency reaching ten-plus seconds ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=increases%20inference%20costs%20and%20latency,latency%20to%20the%20first%20token)). Overall it trades speed against accuracy flexibly (see the reasoning-effort sketch after the table). | Usage-based billing: about $1.10/M input tokens and $4.40/M output tokens ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=,API%20calls%20users%20can%20make)), roughly 93% cheaper than OpenAI o1 ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=,API%20calls%20users%20can%20make)). ChatGPT Plus and other paid users get high mode at no extra charge. | Outstanding STEM reasoning; in high mode it slightly beats OpenAI o1 on math and coding benchmarks ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=%2A%20In%20OpenAI%E2%80%99s%20tests%2C%20o3,7%20percent%20and%2039%20percent)). On common-sense QA, however, even high mode scores only 13.8%, below o1 (47%) and GPT-4o (39%) ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=medium%20effort%2C%20and%20it%20outperformed,7%20percent%20and%2039%20percent)). Strong logical reasoning overall, with creativity and knowledge coverage on par with the best compact models.
**OpenAI o1** | OpenAI's first chain-of-thought model; each request runs extra reasoning steps, so initial latency is high: the first token typically takes about 10 s ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=runs%20faster%20%28168,latency%20to%20the%20first%20token)). Generation throughput can reach about 200 tokens/s ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=runs%20faster%20%28168,latency%20to%20the%20first%20token)). API concurrency is rate-limited (top-tier paid users get higher quotas). | Supports multimodal input with visual reasoning: images can be supplied as context ([OpenAI o3-mini | OpenAI](https://openai.com/index/openai-o3-mini/#:~:text=o3%E2%80%91mini%20does%20not%20support%20vision,opens%20in%20a%20new%20window)). Also offers function calling, system messages, and similar API features. Context length is comparable to GPT-4o (about 128K). Strong on code, math, and other complex tasks. | Because of the "thinking" tokens, average responses are slow: a complete answer takes several seconds to ten-plus seconds. There is no fast mode; on complex problems it deliberately reasons through multiple internal steps, with noticeable latency. | Expensive API pricing: ~$15/M input tokens, ~$60/M output tokens ([Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1](https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1#:~:text=surcharge%20for%20reasoning)). The highest pricing tier in OpenAI's lineup, aimed at high-accuracy workloads. | Outstanding advanced reasoning and broad knowledge coverage. Excellent on academic benchmarks (e.g., 92% on MMLU ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=medium%20effort%2C%20and%20it%20outperformed,7%20percent%20and%2039%20percent))), and markedly better than newer models on common-sense QA. Top-tier math reasoning and code understanding ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=%2A%20In%20OpenAI%E2%80%99s%20tests%2C%20o3,7%20percent%20and%2039%20percent)). Strong creativity with fewer hallucinations; one of the strongest all-round models of its time.
**DeepSeek R1** | A large Mixture-of-Experts model: 671B total parameters, but only about 37B are activated per inference pass ([o3-mini vs DeepSeek-R1 - Detailed Performance & Feature Comparison](https://docsbot.ai/models/compare/o3-mini/deepseek-r1#:~:text=DeepSeek)), and parallelized inference lifts throughput somewhat. No official latency figures, but its chain-of-thought reasoning carries the same thinking overhead. The web/mobile chat supports file uploads, though the model itself does not directly process non-text input ([Google launches Gemini 2.0 Pro, Flash-Lite and connects reasoning model Flash Thinking to YouTube, Maps and Search | VentureBeat](https://venturebeat.com/ai/google-launches-gemini-2-0-pro-flash-lite-and-connects-reasoning-model-flash-thinking-to-youtube-maps-and-search/#:~:text=Neither%20DeepSeek,and%20file%20uploads%20or%20attachments)). | Text-only LLM; no native image or other multimodal input ([Google launches Gemini 2.0 Pro, Flash-Lite and connects reasoning model Flash Thinking to YouTube, Maps and Search | VentureBeat](https://venturebeat.com/ai/google-launches-gemini-2-0-pro-flash-lite-and-connects-reasoning-model-flash-thinking-to-youtube-maps-and-search/#:~:text=Neither%20DeepSeek,and%20file%20uploads%20or%20attachments)). Reasoning is strengthened by a two-stage RL pipeline that discovers reasoning patterns plus two SFT stages that preserve general-task performance ([o3-mini vs DeepSeek-R1 - Detailed Performance & Feature Comparison](https://docsbot.ai/models/compare/o3-mini/deepseek-r1#:~:text=DeepSeek)). Context window: 128K input + 32K output ([o3-mini vs DeepSeek-R1 - Detailed Performance & Feature Comparison](https://docsbot.ai/models/compare/o3-mini/deepseek-r1#:~:text=Input%20Context%20Window)) ([o3-mini vs DeepSeek-R1 - Detailed Performance & Feature Comparison](https://docsbot.ai/models/compare/o3-mini/deepseek-r1#:~:text=Maximum%20Output%20Tokens)). Strong at code and at producing human-readable reasoning traces. Open-source; can be run through its own API or Hugging Face ([o3-mini vs DeepSeek-R1 - Detailed Performance & Feature Comparison](https://docsbot.ai/models/compare/o3-mini/deepseek-r1#:~:text=API%20Providers)). | Chains of thought by default, so responses are slightly slower than ordinary chat models. First-token latency is not officially stated, but its UI streams output while it thinks. Complex questions may still take several seconds of thinking; the overall experience is close to OpenAI o1. | Extremely cheap: about $0.55/M input and $2.19/M output tokens ([o3-mini vs DeepSeek-R1 - Detailed Performance & Feature Comparison](https://docsbot.ai/models/compare/o3-mini/deepseek-r1#:~:text=Output)) (roughly half the price of o3-mini ([o3-mini vs DeepSeek-R1 - Detailed Performance & Feature Comparison](https://docsbot.ai/models/compare/o3-mini/deepseek-r1#:~:text=per%20million%20tokens))). Open weights also allow self-hosting to cut costs further. | Reasoning widely regarded as comparable to OpenAI o1 ([o3-mini vs DeepSeek-R1 - Detailed Performance & Feature Comparison](https://docsbot.ai/models/compare/o3-mini/deepseek-r1#:~:text=discovering%20improved%20reasoning%20patterns%20and,math%2C%20code%2C%20and%20reasoning%20tasks)). Performance on math, code, and complex reasoning approaches o1 ([o3-mini vs DeepSeek-R1 - Detailed Performance & Feature Comparison](https://docsbot.ai/models/compare/o3-mini/deepseek-r1#:~:text=discovering%20improved%20reasoning%20patterns%20and,math%2C%20code%2C%20and%20reasoning%20tasks)), and chat-arena preferences show it holding its own against o1 and many top closed models ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=applications,least%20for%20common%2C%20everyday%20prompts)) ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=We%E2%80%99re%20thinking%3A%C2%A0Regardless%20of%20benchmark%20performance%2C,least%20for%20common%2C%20everyday%20prompts)). Creativity and knowledge coverage are also first-rate, though, possibly because it is strict about factual claims, it trails knowledge-focused large models slightly on common-sense QA.
**Gemini 2.0 Pro (Experimental)** | Google's most advanced model family; while experimental, API calls are strictly rate-limited ([Gemini Pro 2.0 Experimental (free) - API, Providers, Stats](https://openrouter.ai/google/gemini-2.0-pro-exp-02-05:free#:~:text=Gemini%20Pro%202,limited%20by)). Its deeper reasoning adds latency over the base version. Suited to batch analysis of long texts rather than low-latency serving. | Multimodal AI that reads text and code, and **likely** supports images and other visual input (the Gemini family as a whole emphasizes multimodality). Integrates with search and a code-execution environment, so it can query the web and run code ([Google Gemini 2.0 Pro Experimental vs OpenAI o3-mini](https://www.analyticsvidhya.com/blog/2025/02/gemini-2-0-pro-experimental-vs-o3-mini/#:~:text=Google%E2%80%99s%20Gemini%202,date%20information)). Context window of up to 2M tokens, enough to process massive amounts of information ([Gemini app adding 2.0 Pro and 2.0 Flash Thinking Experimental](https://9to5google.com/2025/02/05/gemini-2-0-pro-flash-thinking-experimental-app/#:~:text=For%20the%20Gemini%20API%2C%20there%E2%80%99s,get%201%20million%20like%20before)). Best on complex programming tasks and world-knowledge reasoning ([Gemini app adding 2.0 Pro and 2.0 Flash Thinking Experimental](https://9to5google.com/2025/02/05/gemini-2-0-pro-flash-thinking-experimental-app/#:~:text=Google%20says%202,%E2%80%9D)). | Built for complex tasks, it reasons more deeply, so average response times run longer than instant-chat models. Questions demanding rigorous reasoning can take ten-plus seconds or more for a full answer; simple queries return within acceptable time. | Still in limited preview, with no standalone API pricing published. Gemini Advanced subscribers ($19.99/month) get early access ([Gemini app adding 2.0 Pro and 2.0 Flash Thinking Experimental](https://9to5google.com/2025/02/05/gemini-2-0-pro-flash-thinking-experimental-app/#:~:text=Gemini%20Advanced%20subscribers%20%28%2419,AI%20Studio%20%2B%20Vertex%20AI)); developers can try it via AI Studio/Vertex AI, currently without extra charges but with quotas. | Widely regarded as Google's most intelligent model to date. Excellent coding ability ([Gemini app adding 2.0 Pro and 2.0 Flash Thinking Experimental](https://9to5google.com/2025/02/05/gemini-2-0-pro-flash-thinking-experimental-app/#:~:text=Google%20says%202,%E2%80%9D)), and leading comprehension of complex long texts and knowledge reasoning; for hard problems earlier models could not answer, it often finds more complete solutions. It sets new marks for Google models in reasoning rigor and knowledge coverage while also reaching top-tier creative generation.
**Claude 3.7 Sonnet** | Anthropic's new-generation model with a "dual-mode" design: standard mode is very fast, with a quoted typical chat latency of ~200 ms ([Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1](https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1#:~:text=Claude%203,16s)) (likely time-to-first-token or short replies). Measured average response time is about 1.16 s ([Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1](https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1#:~:text=cites%20,16s)). Extended thinking mode trades speed for stronger reasoning, thinking internally for 15 s or more ([Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1](https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1#:~:text=Image)). Developers can balance latency against accuracy dynamically (see the thinking-budget sketch after the table). | Primarily a text model (the API accepts image input; there is no audio or video modality). Supports long context above the 100K level, with a configurable "thinking" token budget of up to 128K for internal reasoning ([Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1](https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1#:~:text=For%20anyone%20looking%20to%20test,creates%20entirely%20new%20optimization%20opportunities)). Strong at code, math, and multilingual QA, with improvements in instruction following and content safety. No need to split fast and smart models: one Claude serves both quick and deep use. | Near-instant answers in ordinary chat ([Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1](https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1#:~:text=Claude%203,16s)). With deep reasoning enabled, the model may "ponder" first, taking ten-plus seconds to produce a more thorough answer ([Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1](https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1#:~:text=Image)). Developers can cap the thinking-token count to bound maximum latency and guarantee results within an expected time. | Pricing unchanged from the previous Sonnet generation: ~$3/M input, ~$15/M output tokens ([Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1](https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1#:~:text=Claude%203,No%20extra%20surcharge%20for%20reasoning)). Thinking tokens count toward output, but Anthropic adds no surcharge for them ([Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1](https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1#:~:text=Claude%203,No%20extra%20surcharge%20for%20reasoning)). Much cheaper than OpenAI o1 ($15/$60), slightly above o3-mini ([Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1](https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1#:~:text=surcharge%20for%20reasoning)). | Broad, balanced intelligence. Excellent complex reasoning, with top scores on graduate-level QA (e.g., 78.2% on GPQA Diamond ([Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1](https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1#:~:text=Benchmarks))) and 86.1% on the multi-domain MMLU ([Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1](https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1#:~:text=Image)). Strong at code generation, translation, and related tasks. Good creativity and context understanding, and the controllable "thinking" markedly cuts hallucinations and errors; one of the strongest all-round models available today.
**Gemini 2.0 Flash** | Google's production-oriented efficiency model, emphasizing low latency and high throughput. Inference is very fast: measured time-to-first-token of only about 0.46 s ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=increases%20inference%20costs%20and%20latency,latency%20to%20the%20first%20token)); generation rate about 168.8 tokens/s ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=increases%20inference%20costs%20and%20latency,latency%20to%20the%20first%20token)), among the best in its class. Supports large-scale concurrent calls, suited to real-time applications. | Strong multimodal support, handling text, images, and other inputs ([Google launches Gemini 2.0 Pro, Flash-Lite and connects reasoning model Flash Thinking to YouTube, Maps and Search | VentureBeat](https://venturebeat.com/ai/google-launches-gemini-2-0-pro-flash-lite-and-connects-reasoning-model-flash-thinking-to-youtube-maps-and-search/#:~:text=Neither%20DeepSeek,and%20file%20uploads%20or%20attachments)) (Gemini Flash lets users attach images and files as context). Capable broad reasoning and tool use, connecting to Google Maps, YouTube, Search, and other services ([Google launches Gemini 2.0 Pro, Flash-Lite and connects reasoning model Flash Thinking to YouTube, Maps and Search | VentureBeat](https://venturebeat.com/ai/google-launches-gemini-2-0-pro-flash-lite-and-connects-reasoning-model-flash-thinking-to-youtube-maps-and-search/#:~:text=But%20there%20was%20some%20news,services%20like%20DeepSeek%20and%20OpenAI)). Context window of 1M tokens ([Google launches Gemini 2.0 Pro, Flash-Lite and connects reasoning model Flash Thinking to YouTube, Maps and Search | VentureBeat](https://venturebeat.com/ai/google-launches-gemini-2-0-pro-flash-lite-and-connects-reasoning-model-flash-thinking-to-youtube-maps-and-search/#:~:text=programming%20interface%20%28API%29)), enough for very long document summarization and complex reasoning. | Extremely short average response times; most queries get near-instant answers ([Google launches Gemini 2.0 Pro, Flash-Lite and connects reasoning model Flash Thinking to YouTube, Maps and Search | VentureBeat](https://venturebeat.com/ai/google-launches-gemini-2-0-pro-flash-lite-and-connects-reasoning-model-flash-thinking-to-youtube-maps-and-search/#:~:text=in%20December%2C%20is%20now%20production)). The low-latency optimization makes waiting barely perceptible, and the UI can display its thinking steps in real time (Flash Thinking mode) for fluid interaction. | Very affordable per-million-token billing. The Gemini 2.0 Flash-Lite variant costs only ~$0.075/M input and $0.30/M output ([Google launches Gemini 2.0 Pro, Flash-Lite and connects reasoning model Flash Thinking to YouTube, Maps and Search | VentureBeat](https://venturebeat.com/ai/google-launches-gemini-2-0-pro-flash-lite-and-connects-reasoning-model-flash-thinking-to-youtube-maps-and-search/#:~:text=As%20shown%20in%20the%20table,maintaining%20the%20same%20cost%20structure)); the full Flash version is priced somewhat higher (speculatively about double, i.e., $0.15/$0.60) yet still far below comparably capable competitors. The pricing gives it a clear cost-effectiveness advantage. | Overall intelligence second only to the Pro model and far beyond the previous generation. Users in third-party chat arenas rate it highly, often above OpenAI o1 and DeepSeek R1 ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=applications,least%20for%20common%2C%20everyday%20prompts)) ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=We%E2%80%99re%20thinking%3A%C2%A0Regardless%20of%20benchmark%20performance%2C,least%20for%20common%2C%20everyday%20prompts)). Strong on expert QA, code, and everyday conversation, with accurate, coherent output. It trails the Pro version slightly on extremely complex tasks, but is already one of the most capable and broadly useful general-purpose models.
**OpenAI GPT-4o (Nov 2024)** | OpenAI's multimodal flagship; the "o" stands for "omni". Optimized inference and cost reductions give it somewhat better speed and concurrency than early GPT-4. Offers a 128K context window ([Benchmarking Amazon Nova and GPT-4o models with FloTorch | AWS Machine Learning Blog](https://aws.amazon.com/blogs/machine-learning/benchmarking-amazon-nova-and-gpt-4o-models-with-flotorch/#:~:text=FloTorch%20used%20the%20GPT,The%20inference)) (input) with a 16K output cap ([Benchmarking Amazon Nova and GPT-4o models with FloTorch | AWS Machine Learning Blog](https://aws.amazon.com/blogs/machine-learning/benchmarking-amazon-nova-and-gpt-4o-models-with-flotorch/#:~:text=FloTorch%20used%20the%20GPT,API%20calls%20using%20the%20same)). Given the model's scale, per-request latency is relatively high, but platforms such as Azure support batching and higher concurrency quotas ([Benchmarking Amazon Nova and GPT-4o models with FloTorch | AWS Machine Learning Blog](https://aws.amazon.com/blogs/machine-learning/benchmarking-amazon-nova-and-gpt-4o-models-with-flotorch/#:~:text=Both%20models%20were%20evaluated%20by,function%20code%20is%20as%20follows)). | Genuinely multimodal: accepts text, image, and audio input and generates text (and audio output) ([GPT-4o - Wikipedia](https://en.wikipedia.org/wiki/GPT-4o#:~:text=GPT,3%20%5D%20Its%20application)) ([What Is GPT-4o? - IBM](https://www.ibm.com/think/topics/gpt-4o#:~:text=What%20Is%20GPT,audio%2C%20image%20and%20video%20input)). Handles real-time conversation, QA, code writing, and many other tasks ([GPT-4o explained: Everything you need to know - TechTarget](https://www.techtarget.com/whatis/feature/GPT-4o-explained-Everything-you-need-to-know#:~:text=OpenAI%27s%20GPT,Q%26A%2C%20text%20generation%20and%20more)). Industry-leading at understanding complex images and speech; OpenAI's first model to genuinely fuse vision and hearing. | As a high-accuracy large model, responses run noticeably long: complex questions can take seconds to tens of seconds to answer fully. Even optimized, GPT-4o is slower than small models, though the streaming API can mask the wait by emitting output progressively (see the streaming sketch after the table). | Expensive: the cited comparison puts it at roughly $18/M input and $72/M output tokens, 22.5× the price of Amazon Nova Pro ([Amazon Nova Pro vs GPT-4 - DocsBot AI](https://docsbot.ai/models/compare/amazon-nova-pro/gpt-4#:~:text=Amazon%20Nova%20Pro%20vs%20GPT,for%20input%20and%20output%20tokens)). That is the commercial API price; in ChatGPT even free users get limited access (Plus users get higher quotas ([GPT-4o - Wikipedia](https://en.wikipedia.org/wiki/GPT-4o#:~:text=GPT,3%20%5D%20Its%20application))). The high cost reflects its capability and multimodality. | A peak of multi-domain intelligence. Scores about 88.7% on the MMLU knowledge test ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=medium%20effort%2C%20and%20it%20outperformed,7%20percent%20and%2039%20percent)) (on par with o1), and about 39% on common-sense QA (SimpleQA), ahead of most models and second only to o1 ([o3-mini Puts Reasoning in High Gear, How to Train for Computer Use, Gemini 2.0 Thinks Faster, and more...](https://www.deeplearning.ai/the-batch/issue-287/#:~:text=medium%20effort%2C%20and%20it%20outperformed,7%20percent%20and%2039%20percent)). Excellent across complex reasoning, creative writing, code, and nearly everything else; treated as an industry baseline. In image understanding and cross-modal reasoning it is far ahead of the pack.
**Meta Llama 3.3 70B** | Meta's latest open model, an efficient, scalable 70-billion-parameter Transformer ([What Is Meta's Llama 3.3 70B? How It Works, Use Cases & More | DataCamp](https://www.datacamp.com/blog/llama-3-3-70b#:~:text=Architecture%3A%20Efficient%20and%20scalable)). Its **Grouped-Query Attention (GQA)** mechanism markedly improves inference efficiency ([What Is Meta's Llama 3.3 70B? How It Works, Use Cases & More | DataCamp](https://www.datacamp.com/blog/llama-3-3-70b#:~:text=What%E2%80%99s%20different%20about%20Llama%203,far%20less%20demanding%20on%20hardware)), letting it perform near the level of far larger models despite its size. Can run on local hardware, with deployment optimized for ordinary GPUs and workstations ([What Is Meta's Llama 3.3 70B? How It Works, Use Cases & More | DataCamp](https://www.datacamp.com/blog/llama-3-3-70b#:~:text=Designed%20for%20accessible%20hardware)). Against the 405B-parameter Llama 3.1, version 3.3 achieves comparable performance at sharply reduced compute cost ([What Is Meta's Llama 3.3 70B? How It Works, Use Cases & More | DataCamp](https://www.datacamp.com/blog/llama-3-3-70b#:~:text=What%E2%80%99s%20different%20about%20Llama%203,far%20less%20demanding%20on%20hardware)). | Text-only instruction-tuned model; only the Instruct version is released ([Model Cards and Prompt formats - Llama 3.3](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/#:~:text=Model%20Cards%20and%20Prompt%20formats,comprehensive%20technical%20information%20about)) (no base pretrained variant). No image or audio input ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=Something%20missing%20from%20Mistral%20Large,increasingly%20looking%20to%20build%20with)). Trained on some 15 trillion tokens ([What Is Meta's Llama 3.3 70B? How It Works, Use Cases & More | DataCamp](https://www.datacamp.com/blog/llama-3-3-70b#:~:text=Training%20and%20fine)) and tuned with SFT and RLHF ([What Is Meta's Llama 3.3 70B? How It Works, Use Cases & More | DataCamp](https://www.datacamp.com/blog/llama-3-3-70b#:~:text=Training%20a%20model%20like%20Llama,understanding%20of%20language%20and%20knowledge)) ([What Is Meta's Llama 3.3 70B? How It Works, Use Cases & More | DataCamp](https://www.datacamp.com/blog/llama-3-3-70b#:~:text=%2A%20Supervised%20fine,feedback%20to%20refine%20its%20behavior)), improving safety and helpfulness over the previous generation. Supports a long context window (128K tokens, per Meta's model card). Well suited to content creation, chat, code, and research QA ([What Is Meta's Llama 3.3 70B? How It Works, Use Cases & More](https://www.datacamp.com/blog/llama-3-3-70b#:~:text=What%20Is%20Meta%27s%20Llama%203,1)). | Large, yet designed for inference efficiency: response speed is excellent among models of its size. Quantized local deployment lets developers run it fairly smoothly on a single machine (reports show even quantized CPU inference producing output). Compared with cluster-bound giants, it delivers lower latency and faster output in short-to-medium interactions ([What Is Meta's Llama 3.3 70B? How It Works, Use Cases & More | DataCamp](https://www.datacamp.com/blog/llama-3-3-70b#:~:text=What%E2%80%99s%20different%20about%20Llama%203,far%20less%20demanding%20on%20hardware)) ([What Is Meta's Llama 3.3 70B? How It Works, Use Cases & More | DataCamp](https://www.datacamp.com/blog/llama-3-3-70b#:~:text=Designed%20for%20accessible%20hardware)). | Released under an open research license allowing free non-commercial use ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=However%2C%20it%E2%80%99s%20important%20to%20note,405%20billion%20parameters%2C%20of%20course)). Self-hosting incurs no API fees, and it is also available on cloud services such as AWS Bedrock (with inference far cheaper than closed models). Enterprise commercial use requires a paid license from Meta. Overall, its cost is far below closed models of similar capability, making it the high-performance open alternative. | Performance rivals models several times its size ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=cost%20for%20open%20models%2C%20backing,with%20a%20handful%20of%20benchmarks)). On code generation, logical reasoning, math, and common-sense QA, Llama 3.3 70B even surpasses Meta's own 405B Llama 3.1 ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=cost%20for%20open%20models%2C%20backing,with%20a%20handful%20of%20benchmarks)) ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=Large%202%20appears%20to%20outpace,123%20billion%2C%20to%20be%20precise)). It was also specifically trained against hallucination, preferring to answer "not sure" over guessing ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=be%20precise)). Its benchmark results approach top closed models, offering strong evidence that "scaling laws are not dead" ([llama3.3:70b - Ollama](https://ollama.com/library/llama3.3:70b#:~:text=llama3.3%3A70b%20,43GB%20%3B%20params%20%C2%B7%2096B)). Excellent creativity and multilingual ability; one of the strongest open instruction-tuned LLMs available.
**Mistral Large 2 (Nov 2024)** | Mistral AI's next-generation flagship LLM, at roughly 123B parameters ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=cost%20for%20open%20models%2C%20backing,with%20a%20handful%20of%20benchmarks)) ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=Large%202%20appears%20to%20outpace,123%20billion%2C%20to%20be%20precise)). Despite the size, inference is optimized, and Mistral exposes it for high-speed calls on its own platform, "La Plateforme" ([Large Enough | Mistral AI](https://mistral.ai/news/mistral-large-2407#:~:text=Mistral%20Large%202%20is%20exposed,Mistral%20Large%202)). Very long context: a 128K-token input window ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=and%20text%20simultaneously%2C%20a%20feature,increasingly%20looking%20to%20build%20with)) (about 300 pages of text). No image or other multimodal input; like Meta's concurrently released Llama 3.1, it is aimed at text tasks ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=Something%20missing%20from%20Mistral%20Large,increasingly%20looking%20to%20build%20with)). | Text-only large model with no multimodal features ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=Something%20missing%20from%20Mistral%20Large,increasingly%20looking%20to%20build%20with)). Trained to match the latest OpenAI/Meta models, its code generation and mathematical reasoning stand with the very best ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=Mistral%20released%20a%20fresh%20new,code%20generation%2C%20mathematics%2C%20and%20reasoning)) ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=terms%20of%20code%20generation%2C%20mathematics%2C,and%20reasoning)). Special training reduces its tendency to hallucinate: when it does not know, it is more likely to admit uncertainty ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=be%20precise)). Multilingual ability is stronger than the previous version, suiting enterprise long-document analysis and complex QA. | Inference requires substantial compute, but Mistral provides cloud inference services (e.g., Snowflake Cortex ([New Mistral Large 2 model available in Snowflake Cortex AI](https://docs.snowflake.com/en/release-notes/2024/other/2024-08-29-mistral-large2#:~:text=New%20Mistral%20Large%202%20model,inference%20in%20Snowflake%20Cortex%20AI))) whose optimized implementations lower latency. Most queries are answered within seconds; long-document summarization can be slower given the huge context. Relative to larger closed models, its performance-to-latency ratio stands out. | The commercial API uses tiered billing, recently cut to about $2.00/M input and ~$6.00/M output tokens ([Mistral Large 2 (Nov '24) - Intelligence, Performance & Price Analysis](https://artificialanalysis.ai/models/mistral-large-2#:~:text=Analysis%20artificialanalysis,00%2C)) (a blended ~$3/M at a typical 3:1 input:output ratio ([Mistral Large 2 (Nov '24) - Intelligence, Performance & Price Analysis](https://artificialanalysis.ai/models/mistral-large-2#:~:text=Mistral%20Large%202%20,00%2C))). Model weights are free for research and non-commercial use ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=However%2C%20it%E2%80%99s%20important%20to%20note,405%20billion%20parameters%2C%20of%20course)). Commercial deployment requires a paid license, yet it remains inexpensive next to closed models of the same class. | Mistral claims Large 2's overall strength matches the latest OpenAI and Meta models ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=Mistral%20released%20a%20fresh%20new,code%20generation%2C%20mathematics%2C%20and%20reasoning)) ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=terms%20of%20code%20generation%2C%20mathematics%2C,and%20reasoning)). In the official benchmarks it even beats the 405B-parameter Llama 3.1 on code and math tasks ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=cost%20for%20open%20models%2C%20backing,with%20a%20handful%20of%20benchmarks)) ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=Large%202%20appears%20to%20outpace,123%20billion%2C%20to%20be%20precise)). Across evaluations its reasoning, knowledge, and creativity sit in the first tier, and its much-reduced rate of untruthful answers ([Millennium New Horizons](https://www.mnh.vc/blog/mistrals-large-2-is-its-answer-to-meta-and-openais-latest-models#:~:text=be%20precise)) makes it particularly suitable for enterprise use. Though not multimodal, as an open flagship it has been called a "Game-Changer" for business adoption ([Mistral Large 2: A Game-Changer for Business AI Adoption - Medium](https://medium.com/@cognidownunder/mistral-large-2-a-game-changer-for-business-ai-adoption-1502bbdf1948#:~:text=Mistral%20Large%202%3A%20A%20Game,can%20approach%20AI%20adoption)).
**xAI Grok (Beta)** | xAI's Grok model, launched by Elon Musk's team and currently in beta. Known for instant responsiveness: in tests it begins answering after an average of just 0.3 s ([Grok Beta - Intelligence, Performance & Price Analysis](https://artificialanalysis.ai/models/grok-beta#:~:text=Analysis%20artificialanalysis,30s)) (very low time-to-first-token). Output speed is only middling, however, at about 66.8 tokens/s ([Grok Beta - Intelligence, Performance & Price Analysis](https://artificialanalysis.ai/models/grok-beta#:~:text=Analysis%20artificialanalysis,30s)), due to model and infrastructure constraints. Overall it fits fast short-form Q&A, but is less brisk than peers when generating long documents. | Optimized mainly for text conversation. Reportedly integrates real-time web access, drawing current information from the internet (especially the X platform) as an aid. No reported image-input support; the focus is dialogue and QA. Its training gives it a lively, humorous style, and answers often carry a playful tone. It also works as a coding assistant and handles general knowledge QA. | Reacts extremely fast to simple questions: replies feel almost instant. Complex or very long prompts can lag slightly, owing to reasoning depth or context limits, but the beta generally puts speed first. Very agile in short exchanges, though at ~66 tokens/s long-form generation is strained and takes noticeably longer to finish. | Currently offered only to a subset of X users (X Premium+ subscribers). No official API pricing or general-availability plan has been announced. Free during the beta, with call limits. Commercial pricing is expected to undercut comparable OpenAI models to attract users. | xAI claims Grok 3 leads on academic benchmarks and user preference, with a Chatbot Arena Elo of 1402 ([Grok 3 Beta — The Age of Reasoning Agents](https://x.ai/blog/grok-3#:~:text=Grok%203%20has%20leading%20performance,1402%20in%20the%20Chatbot)). In xAI's launch demos it beat well-known models including Gemini 2.0 Pro, DeepSeek V3, and GPT-4o on certain tests ([Grok 3 Technical Review: Everything You Need to Know - Helicone](https://www.helicone.ai/blog/grok-3-benchmark-comparison#:~:text=Grok%203%20Technical%20Review%3A%20Everything,5%20Sonnet)). Those results are contested, with the community questioning the evaluation's fairness ([Did xAI Cheat? The Truth About Grok-3's Benchmarks! - YouTube](https://www.youtube.com/watch?v=tuUS5a8qnms#:~:text=Did%20xAI%20Cheat%3F%20The%20Truth,video%2C%20we%20break%20down)). Independent evaluations put Grok at 83.1% on long-context retrieval (LOFT) and 78.9% on MMLU-Pro: first-tier results, but no clear break past the strongest existing models ([Grok 3 Review: A Critical Look at xAI's 'Smartest AI' Claim. - Medium](https://medium.com/@bernardloki/grok-3-review-a-critical-look-at-xais-smartest-ai-claim-aea15ca38b66#:~:text=Grok%203%20Review%3A%20A%20Critical,This%20suggests%20xAI)). Overall Grok performs strongly, with high reasoning ability and good knowledge coverage, though its creativity and other qualities await more public validation.
**Amazon Nova Pro** | AWS's high-performance multimodal model, offering industry-leading speed on the Bedrock platform ([Amazon Nova: Meet our new foundation models in Amazon Bedrock](https://www.aboutamazon.com/news/aws/amazon-nova-artificial-intelligence-bedrock-aws#:~:text=a%20wide%20range%20of%20tasks,intelligence%20classes%20in%20Amazon%20Bedrock)). The Nova Micro variant exceeds 200 tokens/s ([Amazon Nova - Generative Foundation Model - AWS](https://aws.amazon.com/ai/generative-ai/nova/#:~:text=problem,applications%20that%20require%20fast%20responses)); the larger Nova Pro has somewhat lower throughput but still sustains 100+ tokens/s. Latency optimization is excellent: the fastest responder in its class ([Amazon Nova: Meet our new foundation models in Amazon Bedrock](https://www.aboutamazon.com/news/aws/amazon-nova-artificial-intelligence-bedrock-aws#:~:text=a%20wide%20range%20of%20tasks,intelligence%20classes%20in%20Amazon%20Bedrock)). Scales elastically for high-RPS enterprise applications. | A powerful multimodal foundation model supporting text, image, and video input with text output ([Amazon Nova - Generative Foundation Model - AWS](https://aws.amazon.com/ai/generative-ai/nova/#:~:text=Amazon%20Nova%20Micro%2C%20Amazon%20Nova,speed%2C%20and%20cost%20operation%20points)). Understands video content, parses charts and documents, and handles complex QA and code generation ([Amazon Nova - Generative Foundation Model - AWS](https://aws.amazon.com/ai/generative-ai/nova/#:~:text=Amazon%20Nova%20Pro%20is%20a,The%20capabilities%20of)). Strong at agentic tasks, executing multi-step workflows ([Amazon Nova - Generative Foundation Model - AWS](https://aws.amazon.com/ai/generative-ai/nova/#:~:text=Amazon%20Nova%20Pro%2C%20coupled%20with,art%20accuracy%20on%20text)). Context window of up to 300K tokens ([Benchmarking Amazon Nova and GPT-4o models with FloTorch | AWS Machine Learning Blog](https://aws.amazon.com/blogs/machine-learning/benchmarking-amazon-nova-and-gpt-4o-models-with-flotorch/#:~:text=FloTorch%20used%20the%20GPT,The%20inference)), with a maximum output of about 5,000 tokens ([Benchmarking Amazon Nova and GPT-4o models with FloTorch | AWS Machine Learning Blog](https://aws.amazon.com/blogs/machine-learning/benchmarking-amazon-nova-and-gpt-4o-models-with-flotorch/#:~:text=FloTorch%20used%20the%20GPT,API%20calls%20using%20the%20same)). Balances accuracy, speed, and cost across an exceptionally wide range of uses ([Amazon Nova - Generative Foundation Model - AWS](https://aws.amazon.com/ai/generative-ai/nova/#:~:text=Amazon%20Nova%20Pro%20is%20a,The%20capabilities%20of)). | Deeply optimized for low latency. In practice, whether answering questions or running complex analyses, Nova Pro returns results at near-real-time speed. Its latency and throughput lead its class, meeting demanding real-time interaction requirements. Nova models run on AWS's in-house Inferentia hardware, further ensuring stable, fast responses. | Billed pay-as-you-go on Bedrock: on-demand calls cost ~$0.80/M input and ~$3.20/M output tokens ([[new multi-model] Amazon Nova released just now. - Reddit](https://www.reddit.com/r/singularity/comments/1h5ugjs/new_multimodel_amazon_nova_released_just_now/#:~:text=Amazon%20Nova%20Pro.%20,%241.60)); batch (asynchronous) calls are half price ($0.40/$1.60) ([[new multi-model] Amazon Nova released just now. - Reddit](https://www.reddit.com/r/singularity/comments/1h5ugjs/new_multimodel_amazon_nova_released_just_now/#:~:text=Amazon%20Nova%20Pro.%20,%241.60)). That is roughly 25% of the cost of closed models of comparable intelligence ([Amazon Nova: Meet our new foundation models in Amazon Bedrock](https://www.aboutamazon.com/news/aws/amazon-nova-artificial-intelligence-bedrock-aws#:~:text=a%20wide%20range%20of%20tasks,intelligence%20classes%20in%20Amazon%20Bedrock)). For applications processing large volumes of multimodal data, the pricing is very attractive. | Capability approaching the top closed models. Amazon says Nova Pro rivals Anthropic's Claude 3.5 on many tasks ([Amazon unveils Nova Pro, its LLM that is on par with Claude 3.5 Sonnet : r/singularity](https://www.reddit.com/r/singularity/comments/1h5ug30/amazon_unveils_nova_pro_its_llm_that_is_on_par/#:~:text=edit%3A%20they%20are%20training%20Nova,performance)). It excels at video summarization, complex QA, mathematical reasoning, and code generation ([Amazon Nova - Generative Foundation Model - AWS](https://aws.amazon.com/ai/generative-ai/nova/#:~:text=Amazon%20Nova%20Pro%20is%20a,The%20capabilities%20of)), with first-rate multilingual understanding and instruction following ([Amazon Nova - Generative Foundation Model - AWS](https://aws.amazon.com/ai/generative-ai/nova/#:~:text=Amazon%20Nova%20Pro%2C%20coupled%20with,art%20accuracy%20on%20text)). A higher-end Nova Premier is still in training, but Nova Pro already covers the vast majority of common scenarios with excellent accuracy and reasoning, making it a cost-effective general-purpose choice ([Amazon Nova: Meet our new foundation models in Amazon Bedrock](https://www.aboutamazon.com/news/aws/amazon-nova-artificial-intelligence-bedrock-aws#:~:text=a%20wide%20range%20of%20tasks,intelligence%20classes%20in%20Amazon%20Bedrock)).
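The o3-mini row above mentions a "reasoning effort" setting that trades latency for thinking depth. Here is a minimal sketch of selecting it through the OpenAI Python SDK; it assumes `openai>=1.0` with `OPENAI_API_KEY` set and the documented `reasoning_effort` parameter for OpenAI's reasoning models (the prompt is a placeholder).

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "low" favors latency (~0.5 s to first token per the table);
# "high" adds chain-of-thought steps and can push latency to ~15 s.
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # "low" | "medium" (default) | "high"
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(response.choices[0].message.content)
```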
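The Claude 3.7 Sonnet row notes that developers can cap the thinking-token budget to bound latency. A minimal sketch with the Anthropic Python SDK, assuming the documented extended-thinking `thinking` parameter (`ANTHROPIC_API_KEY` set; the model id, budget, and prompt are example values):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,  # thinking tokens count toward output tokens
    thinking={
        "type": "enabled",
        "budget_tokens": 4000,  # cap internal reasoning to bound latency
    },
    messages=[{"role": "user", "content": "Plan a zero-downtime database migration."}],
)

# The response interleaves "thinking" blocks with the final "text" blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)
```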
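Finally, the GPT-4o row points out that streaming can mask a large model's latency. A minimal streaming sketch with the same OpenAI SDK assumptions as above: tokens print as they are generated rather than after the whole completion finishes.

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the history of Unicode."}],
    stream=True,  # server sends incremental deltas instead of one final message
)
for chunk in stream:
    # Some chunks (e.g., the final one) may carry no content delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```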