Computer Use Agent 自動化流程說明

# Computer Use Agent 自動化流程說明本文件說明 **Azure OpenAI Computer Use Agent (CUA)** 。 --- ![image](https://hackmd.io/_uploads/SkeI93Fy-e.png) --- ## 參考資料 [CUA Repo](https://github.com/Azure-Samples/computer-use-model) --- ## 一、前置需求 ### Require Access for computer use Model: + 一個Subscription Id 對應一個申請單 [Request Form](https://aka.ms/oai/cuaaccess) ### 環境準備: + 可自動登入 & 禁用螢幕保護與自動鎖定 & 禁用睡眠/休眠的Virtual Machine. --- ## 二、範例程式本程式僅作為示範如何自動化執行 CUA 的範例，若需應用於正式生產環境，請依需求進行調整與最佳化。 ### 1. Download Code & Create Python venv + 請下載[Sample Code]() + Create a Virtual Environment 1. 切換Sample Code資料夾 ```bash cd "您的專案資料夾路徑" ``` 建立python venv ```bash python -m venv .venv ``` 2. Activate venv ```bash # For Windows .\.venv\Scripts\activate ``` 3. Install Dependency** ```bash # For Windows pip install -r requirements.txt ``` ### 🔹 2. 環境變數設定 (Env) 請在專案中設定以下環境變數： | 變數名稱 | 說明 | |----------|------| | **AZURE_OPENAI_ENDPOINT** | Azure OpenAI 服務的 Endpoint URL | | **AZURE_OPENAI_DEPLOYMENT_NAME** | 建議使用 `gpt-4.1` 作為部署名稱 | | **AZURE_OPENAI_API_KEY** | Azure OpenAI 的 API 金鑰 | | **NOTIFICATION_URL** | *(選填)* 可自訂通知機制，例如使用 Logic Apps。<br>參考：[Logic Apps - Teams Channel Notification] | > ⚠️ **提醒：** 範例程式使用的 CUA 模型為 `computer-use-preview`。若模型名稱有調整，請至 main.py 進行修改。 ## 🧠 LLM 在 CUA 架構中的「三層輸出」 | 層級 | 名稱 | 說明 | 典型欄位 | |------|------|------|-----------| | (A) | **指令層 (Action Layer)** | LLM 告訴 runtime 具體要執行的行動，例如「click」「type」「open」等。 | `agent.actions` | | (B) | **授權層 (Consent Layer)** | 當行動有風險或需要明確授權（例如開啟應用程式、發送訊息）時，CUA runtime 會要求明確 consent。 | `agent.requires_consent`、`agent.pending_safety_checks` | | (C) | **語意層 (Message Layer)** | LLM 回覆給使用者的自然語言（包含詢問、說明、回報），用於「對話互動」。 | `agent.messages`、`agent.reasoning_summary` | --- ## ⚙️ 狀態屬性對照表 | 屬性 | 說明 | 常見出現時機 | 是否可自動化 | |------|------|---------------|----------------| | `requires_consent` | 模型要執行高風險動作（滑鼠、鍵盤） | 開啟應用程式、點擊、輸入 | ✅ 可用 `args.autoplay` 自動略過 | | `pending_safety_checks` | 模型觸發安全檢查 | 訪問敏感路徑、修改系統設定 | ✅ 可自動略過 | | `reasoning_summary` | 模型推理出的下一步摘要 | LLM 輸出「我將執行...」等描述 | ✅ 可記錄但不自動化 | | `messages` | 模型完整自然語言輸出 | 問題、回覆、確認、報告、錯誤 | ✅ 可由 LLM 判斷是否為確認語句 | --- ## 💬 `agent.messages` 的用途與典型內容 `agent.messages` 是模型的自然語言回覆，包含以下類型： | 類型 | 範例 | 說明 | |------|------|------| | **認知類** | “Got it, I’ll open Edge.” | 模型理解任務 | | **確認類** | “Should I proceed to send the message?” | 詢問是否要繼續 | | **狀態類** | “I’ve opened the vacation portal.” | 回報目前進度 | | **結果類** | “Message has been sent.” | 任務完成通知 | | **錯誤類** | “I couldn’t find the button.” | 報告錯誤或中斷 | --- ## 🔍 自動判斷下一步「Status」的邏輯使用一個 LLM-based 檢測器來判斷 `agent.messages` 的下一步狀態。 + autoplay: 自動執行 + human_in_loop: 人為介入 + complete: 工作完成 ### System Prompt： ```python You are a message classifier for an autonomous agent system. You must classify the agent's message into one of three categories: 1. "autoplay" — The Agent response describes an intention, ongoing action, or explicitly requests approval/confirmation to proceed. 2. "human_in_loop" — The Agent response asks for help, guidance, a decision, reports an error. 3. "complete" — The agent reports the task is finished or successfully completed. ```