# Linux系統自動化運維

* Course name (Chinese): `Linux系統自動化運維`
* Class name (English): `Automatic Operation and Maintenance for Linux System`
* Teacher: `Professor 柯志亨`
* About Class (Room, Credits, Hours): `3 Credits, 3 Hours`
* Year you took this class: `114`
* Score you give to this class: `9.5/10`
* Telegram Group: Open for students

## Class Summary

### What This Course Is About

This course was designed to build a modern AI Operations Engineer, blending cutting-edge AI application development with robust, production-grade systems management. You didn't just learn how to use AI; you learned how to build, deploy, manage, monitor, and automate it.

The curriculum was built on two parallel tracks that ultimately merged.

### 1. The AI Application & Agent Development Track

This track took you on a journey from being a consumer of AI to a builder of autonomous AI agents.

* **Foundation (Weeks 1-4):** You started by learning how to interact with Large Language Models (LLMs) like Gemini using the LangChain framework. The key project was building a Retrieval-Augmented Generation (RAG) system, which taught the AI to answer questions based on external documents rather than just its internal knowledge. You then gave this AI access to live information by integrating the Google Search API.
* **Self-Hosting & Specialization (Weeks 2-9):** You moved beyond cloud APIs to running your own models locally with Ollama. This gave you more control, privacy, and the ability to use specialized models such as Llava (for image recognition) and breeze-7b (for Traditional Chinese). You also explored the Model Context Protocol (MCP), a secure, modern standard for letting AI applications like Claude interact with your local files without uploading them.
* **Building Autonomous Agents (Weeks 5, 8, 10):** The culmination of this track was creating true AI agents. You learned how to build "tools" (Python functions) and hand them to an agent framework (smol-agents). The AI would then act as a "brain," intelligently deciding which tool to use (check a server's status, install software, fetch stock data, or send a Telegram message) to solve a user's request.

### 2. The DevOps & Systems Administration Track

This track provided the essential skills to ensure that the AI applications you build are reliable, scalable, and secure.

* **Linux & Security Fundamentals (Weeks 2, 13):** You mastered core Linux concepts, from advanced command-line text processing (grep, regex) to fine-grained file permissions (chattr, ACLs) for securing the system.
* **Containerization with Docker (Weeks 9-12):** You learned how to package applications and their dependencies into portable containers. You progressed from running single containers to building your own custom images with a Dockerfile, and finally to orchestrating entire multi-container applications (such as a PHP + MySQL stack) with Docker Compose while understanding Docker networking.
* **Monitoring & Alerting (Weeks 3, 5):** You built a complete, end-to-end monitoring pipeline. You used Prometheus to collect system metrics, wrote custom scripts to monitor specific services (like SSH), and configured Alertmanager to send automatic notifications to Telegram when a system failed.
* **Automation with Ansible (Weeks 14-15):** Finally, you learned how to automate the management of all your servers. You moved from running single ad-hoc commands to writing declarative Ansible Playbooks, using variables, templates, and handlers to configure multiple servers consistently and efficiently.

### Conclusion

In essence, this was a comprehensive "AI Ops" / "LLM Ops" course.
You have acquired a holistic skill set that enables you not only to develop sophisticated AI applications but also to confidently deploy, manage, and maintain them in a real-world, production environment.

## Week 1 Analysis & Summary

---

This week focused on **setting up the development environment** and **learning LangChain concepts step by step**, starting from the very basics and building up to a relatively complex application: RAG (Retrieval-Augmented Generation).

### 1. Environment & Operating System Setup

The initial step was to prepare the development environment on **Ubuntu Linux**. This included installing the necessary Python libraries. You verified the installation using a terminal command to ensure all `langchain` packages were correctly installed.

**Terminal Command:**
```bash
pip list | grep langchain
```

* **Explanation:** This command lists all installed Python packages (`pip list`) and then filters that list (`grep`) to show only the lines containing the word "langchain". This is a quick way to check the versions of `langchain`, `langchain-community`, `langchain-core`, etc., that are being used.

---

### 2. Basic Interaction with the LLM via LangChain

Once the environment was ready, you began interacting with the LLM (Gemini Pro) using LangChain.

#### a. Direct Model Invocation (Basic Invoke)

This is the simplest way to get a response from the LLM.

**Code:**
```python
from langchain_google_genai import GoogleGenerativeAI

llm = GoogleGenerativeAI(model="gemini-pro", google_api_key="...")
result = llm.invoke("what is the capital of Taiwan")
print(result)
```

* **Explanation:**
    * `GoogleGenerativeAI(...)`: Initializes the Gemini Pro model as an `llm` object.
    * `llm.invoke(...)`: Sends the question directly to the model and waits for a result. This demonstrates the most basic interaction: one input, one output.

#### b. Using a Prompt Template to Control Output

For more structured interactions, you used `ChatPromptTemplate`.

**Code:**
```python
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a world class technical documentation writer. You will answer in Traditional Chinese"),
    ("user", "{input}")
])
chain = prompt | llm
result = chain.invoke({"input": "where is Kinmen?"})
print(result)
```

* **Explanation:**
    * `ChatPromptTemplate.from_messages(...)`: Creates a conversation template.
        * **`system`**: Provides instructions or a "persona" to the AI.
        * **`user`**: A placeholder for the user's input/question.
    * `chain = prompt | llm`: Creates a simple "chain" using LangChain Expression Language (LCEL). Data flows from the `prompt` to the `llm`.
    * `chain.invoke(...)`: Executes the chain by providing a value for the `{input}` placeholder.

#### c. Adding an Output Parser to Clean the Result

To get a clean string output, you added a `StrOutputParser`.

**Code:**
```python
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()
chain = prompt | llm | output_parser
result = chain.invoke({"input": "any food is recommended in Kinmen?"})
print(result)
```

* **Explanation:**
    * `StrOutputParser()`: A component that automatically formats the LLM's result into a simple string.
    * `chain = prompt | llm | output_parser`: The chain is now extended. The flow is: `prompt` -> `llm` -> `output_parser`.

---

### 3. Advanced Concept: Retrieval-Augmented Generation (RAG)

This was the culmination of the week's learning: making the AI answer questions based on an external webpage.
**Full RAG Code:**
```python
from langchain_community.document_loaders import WebBaseLoader
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import GoogleGenerativeAI

# Initialize the LLM and Embeddings models
llm = GoogleGenerativeAI(model="gemini-pro", google_api_key="...")
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key="...")

# 1. Load Data: Fetch content from a URL
loader = WebBaseLoader("https://csie.nqu.edu.tw/p/412-1038-469.php?Lang=zh-tw")
docs = loader.load()

# 2. Split Text: Break the document into smaller chunks
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)

# 3. Vector Store: Convert text chunks to vectors and store them in FAISS
vector = FAISS.from_documents(documents, embeddings)

# 4. Create a Prompt for Answering Based on Context
prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}""")

# 5. Create the Retrieval Chain
document_chain = create_stuff_documents_chain(llm, prompt)
retriever = vector.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)

# 6. Invoke the Chain with a Question
response = retrieval_chain.invoke({"input": "柯志亨的email?"})
print(response["answer"])
```

* **RAG Code Explanation:**
    1. **`WebBaseLoader`**: Loads the entire HTML content from the given URL.
    2. **`RecursiveCharacterTextSplitter`**: Splits the large document into several smaller chunks.
    3. **`FAISS.from_documents`**: A key step where each text chunk is converted into a numerical vector (embedding) and stored in `FAISS`, a fast vector database for similarity search.
    4. **`Prompt Template`**: A custom prompt that instructs the LLM to answer the question (`{input}`) **only** based on the provided `{context}`.
    5. **`create_retrieval_chain`**: Combines the `retriever` (which fetches relevant documents from FAISS) and the `document_chain` (which sends the documents and question to the LLM).
    6. **`retrieval_chain.invoke`**: Executes the entire process: search, context stuffing, and generation.

## Week 2 Analysis & Summary

---

This week's focus shifted towards more practical, hands-on skills, including **advanced Linux command-line usage**, **system integration via APIs**, and a significant new topic: **running and interacting with local Large Language Models (LLMs)**.

### 1. Advanced Linux Command-Line Proficiency

Building on the Ubuntu environment from Week 1, this week delved into powerful command-line tools for file manipulation and text processing.

#### a. File Globbing & Pattern Matching with `ls`

You explored how to select files based on patterns instead of listing them one by one.

**Terminal Commands:**
```bash
# List files starting with 'a' followed by 'b' or 'c'
ls a[bc]*

# List files that do NOT start with a lowercase letter
ls [!a-z]*

# List files that start with an uppercase letter using POSIX classes
ls [[:upper:]]*
```

* **Explanation:** These commands demonstrate advanced file selection techniques.
`[]` is used to define a set of characters, `!` negates the set, and POSIX character classes (`[:upper:]`, `[:lower:]`, etc.) provide a more standardized way to match character types.

#### b. Text Searching with `grep` and Regular Expressions (Regex)

You learned to search for specific text patterns within files or command outputs using `grep`.

**Terminal Commands:**
```bash
# Find lines that START with the letter 'a'
echo "abcd.txt" | grep "^a"

# Find lines that END with the letter 't'
echo "Abcd.txt" | grep "t$"

# Find lines matching ZERO or MORE 'a's (this matches every line)
echo "aaab" | grep "a*"
```

* **Explanation:** This section covers the fundamentals of regular expressions. `^` anchors a pattern to the start of a line, `$` anchors it to the end, and `*` is a quantifier meaning "zero or more of the preceding character" (which is why `a*` matches every line, even one with no 'a' at all).

---

### 2. System Integration & Automation

This topic focused on making different systems talk to each other and setting up a more versatile development environment.

#### a. Creating a Telegram Notifier

You learned how to send messages to a Telegram chat using Python, which is useful for notifications and simple bots.

**Code:**
```python
import requests

def send_msg(msg: str, token: str, chatID: str):
    """Sends a message to a specific Telegram chat using the Bot API."""
    assert isinstance(msg, str), "Message must be a string"
    # The URL carries the bot token; the chat ID and message text are passed as
    # query parameters so that requests URL-encodes them correctly.
    url = f'https://api.telegram.org/bot{token}/sendMessage'
    requests.get(url, params={"chat_id": chatID, "text": msg})

# --- Example Usage ---
my_message = "Test from Python script"
bot_token = "Your_Bot_Token_Here"
chat_id = "-Your_Chat_ID_Here"
send_msg(my_message, bot_token, chat_id)
```

* **Explanation:** This code defines a function that makes an HTTP GET request to the Telegram Bot API endpoint. It passes the bot's `token` in the URL and sends the destination `chatID` and the `msg` as query parameters.

#### b. Setting up Windows Subsystem for Linux (WSL)

The provided link shows the process for installing a Linux environment (like Ubuntu) directly within Windows. This is a popular alternative to dual-booting or using a virtual machine for development.

---

### 3. Exploring Local LLMs with Ollama

This is a major conceptual leap from Week 1. Instead of using a cloud-based LLM service, you explored how to run an LLM directly on your own machine using **Ollama**.

**Code:**
```python
from langchain_community.llms import Ollama

# Define the connection to the local Ollama server
host = "127.0.0.1"
port = "11434"  # Default Ollama port

# Initialize the LangChain LLM object to use the local model
llm = Ollama(
    base_url=f"http://{host}:{port}",
    model="llama3.2:1b",  # Specify the local model to use
    temperature=0
)

# Invoke the local model
response = llm.invoke("Who are you?")
print(response)
```

* **Explanation:**
    * **Ollama:** A tool that simplifies downloading and running LLMs (like Llama 3.2) on your personal computer.
    * **`langchain_community.llms.Ollama`**: The LangChain integration that allows your Python code to communicate with the local Ollama server.
    * **`base_url`**: Tells LangChain where to find the running Ollama instance (typically on `localhost`).
    * This setup provides more privacy, no API costs, and greater control over the model, making it a powerful alternative to cloud APIs.
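Because the LangChain `Ollama` object exposes the same runnable interface as the cloud models, it can be dropped into the Week 1 chain unchanged. A minimal sketch (assuming the `llama3.2:1b` model has already been pulled and the Ollama server is running on the default port):

```python
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Local model served by Ollama instead of the Gemini cloud API
llm = Ollama(base_url="http://127.0.0.1:11434", model="llama3.2:1b", temperature=0)

# The same prompt/parser pipeline used with Gemini in Week 1
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a world class technical documentation writer."),
    ("user", "{input}"),
])
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"input": "where is Kinmen?"}))
```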
## Week 3 Analysis & Summary

---

This week's learning was divided into three core areas: further **configuring the development environment**, introducing professional **system monitoring with Prometheus**, and significantly **upgrading the local AI deployment** with new models and user interfaces.

### 1. Environment Configuration & Essential Tools

You continued to set up the **Windows Subsystem for Linux (WSL)** environment to make it more functional for daily tasks.

* **Installing Google Chrome on Ubuntu:** Since a standard Linux server doesn't come with a web browser, you manually installed one using command-line tools.
    * **Commands:** The process involves downloading the `.deb` package file and then using `dpkg` (Debian Package Manager) to install it.
    * **Reference:** [書 - Ubuntu Q&A - 安裝 Google Chrome](https://samwhelp.github.io/book-ubuntu-qna/read/case/app/google-chrome/install)
* **Time Synchronization:** You configured the system's time zone to `Asia/Taipei` and synchronized it with a standard time server. This is a crucial step for ensuring accurate logs and scheduled tasks.
    * **Commands:** `timedatectl set-timezone`, `ntpdate`, `hwclock`.
    * **Reference:** Provided by the instructor (`Ke-柯志亨`) in the chat log.

---

### 2. Introduction to System Monitoring with Prometheus

A major new topic this week was **Prometheus**, an industry-standard open-source tool for monitoring system metrics and triggering alerts.

* **What is Prometheus?** It's a system that periodically collects (scrapes) metrics like CPU usage, memory, and disk space from various machines and services. It stores this data and allows you to visualize it and set up alerts for specific conditions.
    * **Reference:** [Prometheus Official Website](https://prometheus.io/)
* **Installation:** You installed the core Prometheus components using `apt`.
    * **Commands:** `sudo apt install prometheus prometheus-node-exporter ...`
    * **Reference:** Provided by the instructor in the chat log.
* **Configuration:** You learned how to tell Prometheus which machines (targets) to monitor by editing its configuration file.
    * **Code Snippet:**
    ```yaml
    - job_name: node
      static_configs:
        - targets: ['192.168.190.200:9100', '192.168.190.201:9100']
    ```
    * **Explanation:** This configuration tells Prometheus to scrape metrics from two different machines (at port `9100`, the default for `node-exporter`) and group them under the job name `node`.
    * **Reference:** General Prometheus configuration example, similar to what's found in tutorials like the provided CSDN blog post.

---

### 3. Advanced Local AI Deployment & New Tools

This week, you significantly enhanced the local AI capabilities built in Week 2.

* **Downloading New, Specialized LLMs:** You were instructed to download several powerful models using `ollama pull`.
    * `ollama pull llava`: A **multimodal model** capable of **image recognition**. This allows your local AI to understand and describe images, not just text.
    * `ollama pull ycchen/breeze-7b-instruct-v1_0:latest`: A large language model specifically developed by MediaTek for **Traditional Chinese**.
    * `ollama pull nomic-embed-text`: A specialized model used for creating **embeddings**. This is a critical component of the RAG (Retrieval-Augmented Generation) process from Week 1, enabling the AI to answer questions based on your own documents; a short usage sketch follows below.
    * **Reference:** Instructions from the instructor and the screenshot showing the download of `nomic-embed-text`.
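    As a quick illustration of where `nomic-embed-text` fits, here is a minimal sketch (assuming the local Ollama server from Week 2 is running) that swaps it in for the Google embeddings used in the Week 1 RAG pipeline:

    ```python
    from langchain_community.embeddings import OllamaEmbeddings

    # Embeddings served by the local Ollama instance (default port 11434)
    embeddings = OllamaEmbeddings(model="nomic-embed-text")

    # Each chunk of text becomes a vector; the Week 1 call
    # FAISS.from_documents(documents, embeddings) accepts this object unchanged.
    vector = embeddings.embed_query("where is Kinmen?")
    print(len(vector))  # dimensionality of the embedding vector
    ```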
* **Installing New User Interfaces for Ollama:** To make interacting with these local models easier, you explored two new web-based UIs.
    * **Open WebUI:** A comprehensive, self-hosted web interface that provides a rich user experience for managing and chatting with all your local Ollama models.
        * **Command:** `pip install open-webui`
    * **Page Assist (Chrome Extension):** A lightweight UI that runs directly inside your Chrome browser. It can connect to your local Ollama instance, allowing you to use your local AI models to summarize web pages or ask questions without leaving the browser.
        * **Reference:** [Page Assist on Chrome Web Store](https://chromewebstore.google.com/detail/page-assist-a-web-ui-for/jfgfiigpkhlkbnfnbobbkinehhfdhndo) and the provided screenshot showing it in action.
* **Exploring Other AI Tools:**
    * **NotebookLM:** You were introduced to Google's NotebookLM, an AI-powered research and writing tool that uses your own documents as its knowledge base.
        * **Reference:** [Google NotebookLM](https://notebooklm.google.com/)

## Week 4 Analysis & Summary

---

This week marked a significant leap in complexity, moving from static data models to building an **advanced AI agent with live web search capabilities**. You also delved deeper into **custom system monitoring scripts** for Prometheus.

### 1. Building an AI Agent with Real-time Web Search

The primary project this week was to give the Language Model the ability to browse the internet to find up-to-date information.

#### a. Setting up Google Search API

To allow the AI to perform Google searches, you needed to set up a **Google Programmable Search Engine (PSE)** and obtain a **Custom Search API Key**.

* **References:**
    * [Google Cloud Console for API Keys](https://console.cloud.google.com/apis/credentials)
    * [Programmable Search Engine Control Panel](https://programmablesearchengine.google.com/controlpanel/create)

#### b. Basic Google Search Tool in LangChain

First, you created a simple tool to test the search functionality.

**Code:**
```python
import os
from langchain_core.tools import Tool
from langchain_google_community import GoogleSearchAPIWrapper

os.environ["GOOGLE_CSE_ID"] = "YOUR_CSE_ID"
os.environ["GOOGLE_API_KEY"] = "YOUR_API_KEY"

search = GoogleSearchAPIWrapper()

tool = Tool(
    name="google_search",
    description="Search Google for recent results.",
    func=search.run,
)

# The script asks about the weather in Taipei
print(tool.run("how is the weather in Taipei now?"))
```

**Project Result Example:** When the code above is executed (e.g., as a file named `test_sel.py`), it performs a Google search about the weather in Taipei. Its output is a long string of text summarizing the search results, exactly as seen in the image `Screenshot 2025-06-08 at 9.52.41 PM.jpg` you provided.

---

### 2. Custom Monitoring with Prometheus Pushgateway

You continued learning about Prometheus, focusing on how to monitor scripts using the **Pushgateway**.

#### a. Custom Bash Script to Push Metrics

You were shown a script that checks if a machine is online (`ping`) and then pushes the result to the Pushgateway.

**Code (Bash Script):**
```bash
#!/usr/bin/bash

instance_name=$(hostname -f)
label="ens33_status"

# Ping the target once, waiting at most 1 second; suppress all output
ping -c1 -W1 192.168.164.135 > /dev/null 2>&1
result=$?

if [ $result = "0" ]; then
    status=1  # Success
else
    status=0  # Failure
fi

# Push the status to the Pushgateway
echo "$label $status" | curl --data-binary @- http://192.168.164.13:9091/metrics/job/pushgateway_test/instance/$instance_name
```
* **References:** Instructor's examples and blog posts like those from CSDN and Cnblogs provided in the chat.

#### b. Querying Metrics in Prometheus

Once metrics are sent to Prometheus, you can use the **PromQL** query language to calculate and display them, for instance, calculating CPU usage.

**Example PromQL Query:**
```
100 * (1 - sum by (instance) (increase(node_cpu_seconds_total{mode="idle"}[5m])) / sum by (instance) (increase(node_cpu_seconds_total[5m])))
```

**Project Result Example:** A PromQL query like the one above would be run in the Prometheus web interface. The result is a graph of time-series data, exactly as seen in the image ![Screenshot 2025-06-08 at 9.59.56 PM](https://hackmd.io/_uploads/HkawwfXQeg.jpg)

---

### **Preparation for Next Week**

The week concluded with a clear assignment: ensure your **WSL environment is fully set up with Python and Ollama installed**, and download two specific models (`llama3.2:latest` and `mistral:instruct`) in preparation for the next session.

## Week 5 Analysis & Summary

---

This week, you integrated concepts from previous sessions to build two sophisticated systems. The first project focused on creating a **tool-using AI agent** with a local LLM. The second project involved setting up a complete **end-to-end monitoring and alerting pipeline** using Prometheus and Alertmanager.

### 1. Advanced AI Agent Concepts with Local LLMs

You moved beyond simple Q&A to explore how a Large Language Model can act as a "brain" that decides which tool to use for a given task.

#### a. Manual RAG with Ollama and Google Search

First, you created a script that shows the fundamental logic of Retrieval-Augmented Generation (RAG).

**Code:**
```python
import os
import ollama
from langchain_core.tools import Tool
from langchain_google_community import GoogleSearchAPIWrapper

os.environ["GOOGLE_CSE_ID"] = "YOUR_CSE_ID"
os.environ["GOOGLE_API_KEY"] = "YOUR_API_KEY"

# 1. Define the search tool
search = GoogleSearchAPIWrapper()
tool = Tool(name="Google Search", description="Search Google...", func=search.run)

# 2. Explicitly get context by running the tool
question = "where is the capital city of France?"
context = tool.run(question)

# 3. Manually create a prompt, injecting the context
prompt = f"""Use the following pieces of context to answer the question at the end.
Context: {context}.\n
Question: {question}
Helpful Answer:"""

# 4. Call the local LLM with the context-rich prompt
response = ollama.chat(model='llama3.2:1b', messages=[
    {'role': 'system', 'content': 'You are a useful AI assistant...'},
    {'role': 'user', 'content': f'{prompt}'},
])
output = response['message']['content']
print(output)
```

* **Explanation:** This script demonstrates the core RAG process manually. It explicitly fetches information using the search tool and then "stuffs" that context into a prompt before sending it to a local Ollama model. This gives you a clear understanding of how an LLM can be guided by external data.

#### b. Building a Tool-Using Agent (LLM as a Router)

This was the most advanced AI concept of the week. You created a system where the LLM doesn't answer directly but instead decides which function (tool) to call based on the user's prompt.
**Code:**
```python
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain.tools.render import render_text_description
from langchain_core.output_parsers import JsonOutputParser

# 1. Initialize the local model
model = OllamaLLM(model='mistral:instruct')

# 2. Define functions as tools using the @tool decorator
@tool
def add(first: int, second: int) -> int:
    "Add two integers."
    return first + second

@tool
def multiply(first: int, second: int) -> int:
    """Multiply two integers together."""
    return first * second

# 3. Create a text description of the tools for the LLM
tools = [add, multiply]
rendered_tools = render_text_description(tools)

# 4. Create a system prompt instructing the LLM to act as a tool router
system_prompt = f"""You are an assistant that has access to the following set of tools.
Here are the names and descriptions for each tool:
{rendered_tools}
Given the user input, return the name and input of the tool to use.
Return your response as a JSON blob with 'name' and 'arguments' keys."""

prompt = ChatPromptTemplate.from_messages([("system", system_prompt), ("user", "{input}")])

# 5. Build the chain and force JSON output
chain = prompt | model | JsonOutputParser()

# --- Example Usage ---
# The LLM decides to use the 'multiply' tool
print(chain.invoke({'input': 'What is 3 times 23'}))
# Expected output: {'name': 'multiply', 'arguments': {'first': 3, 'second': 23}}
```

* **Explanation:** This is the foundation of an AI agent. The LLM is given a list of available tools and their descriptions. Its primary job is not to calculate "3 times 23" but to recognize that this task requires the `multiply` tool and to output the correct tool name and arguments in a structured JSON format. Another part of your application would then execute the chosen function, as shown in the sketch below.
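A minimal sketch of that dispatch step, reusing the `tools` list and `chain` defined above (the exact wiring used in class may have differed):

```python
# Map tool names to the @tool objects defined above
tool_map = {t.name: t for t in tools}

# 1. Ask the LLM which tool to call
decision = chain.invoke({'input': 'What is 3 times 23'})
# e.g. {'name': 'multiply', 'arguments': {'first': 3, 'second': 23}}

# 2. Look up the chosen tool and execute it with the returned arguments
chosen_tool = tool_map[decision['name']]
result = chosen_tool.invoke(decision['arguments'])
print(result)  # 69
```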
---

### 2. End-to-End System Monitoring & Alerting

This project built a complete pipeline that monitors a service, detects failures, and sends a notification.

#### a. Custom Metric Collection

You first set up a service to monitor (Apache web server) and created a script to check its status.

* **Setup:**
    1. Install Apache: `sudo apt install apache2`
    2. Create a monitoring script (`test_httpd_status.sh`) that uses `curl` to check if the web server is responding. If it is, a status of `1` is pushed to the Prometheus Pushgateway; otherwise, `0` is pushed.
    3. Schedule the script to run every minute using a cron job: `*/1 * * * * /home/user/test_httpd_status.sh`
* **Project Result Example:** After the script runs, you can see the metric `apache2_status` with a value of `1` in your Prometheus dashboard, as shown in the screenshot you provided (`Screenshot 2025-06-08 at 10.01.49 PM.jpg`). This confirms that your custom metric collection is working.

#### b. Alerting with Prometheus and Alertmanager

Next, you configured the system to automatically send an alert when the service fails.

* **Prometheus Alerting Rule:** You defined a rule that tells Prometheus to fire an alert named `InstanceApache2Down` if the `apache2_status` metric is `0` for more than 1 minute.
* **Code (Prometheus `rules.yml`):**
    ```yaml
    groups:
    - name: example
      rules:
      - alert: InstanceApache2Down
        expr: apache2_status == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} Apache2 down"
    ```
* **Alertmanager Configuration:** You set up Alertmanager to receive these alerts from Prometheus and route them to a specific destination.
* **Code (Alertmanager `alertmanager.yml`):**
    ```yaml
    route:
      receiver: 'telegram'
    receivers:
    - name: 'telegram'
      telegram_configs:
      - api_url: 'https://api.telegram.org'
        bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
    ```
* **End-to-End Flow:**
    1. The cron job runs the script.
    2. The script pushes `apache2_status=0` to Pushgateway because the server is down.
    3. Prometheus scrapes this metric.
    4. The `InstanceApache2Down` alert condition is met and fires after 1 minute.
    5. Prometheus sends the alert to Alertmanager.
    6. Alertmanager receives the alert and, based on its routing rules, sends a notification message to your specified Telegram chat.

## Week 6 Analysis & Summary

---

Week 6 continued the two Week 5 projects: the **tool-using AI agent** built on a local LLM (the manual RAG script and the LLM-as-router chain) and the **end-to-end monitoring and alerting pipeline** for the Apache web server using Prometheus, Pushgateway, and Alertmanager. The code and configuration are the same as in the Week 5 section above.
## Midterm Assignment: If the SSH Server Shuts Down, Send a Message to Telegram

---

This project combines everything you have learned from Weeks 1-5 into a single, practical, automated monitoring system.
The goal is to create a reliable alert that notifies you immediately if a critical service on your server goes down.

### Brief Project Explanation

The primary objective is: **"If the SSH service on a server shuts down, the system must automatically send an alert message to a Telegram chat."** This simulates a common and critical real-world scenario. The workflow is identical to the Apache monitoring you did in Week 5, but the target is changed from the *Apache web service* to the *SSH service*.

---

### Complete Code & Configuration

Here are all the components you need to build the system.

#### 1. The Monitoring Script (`check_ssh.sh`)

This is a Bash script that will check if the SSH port (port 22) is open and listening. We will use the `netcat` (`nc`) utility for this, as it's a very efficient tool for checking a specific port.

```bash
#!/usr/bin/bash
# A script to check the status of the SSH service and push it to the Pushgateway.

# --- Configuration ---
# The IP address of your Prometheus Pushgateway
PUSHGATEWAY_IP="192.168.164.134"  # Change to your Pushgateway's IP
INSTANCE_NAME=$(hostname -f | cut -d. -f1)
JOB_NAME="ssh_monitoring"
METRIC_NAME="ssh_service_status"

# --- Port Checking Logic ---
# Use 'nc' (netcat) to check port 22 (SSH) on localhost.
# -z: Zero-I/O mode (scanning).
# -w1: Wait no more than 1 second for a connection.
nc -z -w1 127.0.0.1 22

# The exit code '$?' will be 0 if the port is open (success), and non-zero if closed (failure).
if [ $? -eq 0 ]; then
    status=1  # 1 means 'UP' or 'Running'
else
    status=0  # 0 means 'DOWN' or 'Stopped'
fi

# --- Push Metric ---
# Format the data: metric_name value
# Push the data to the Pushgateway using curl.
echo "$METRIC_NAME $status" | curl --data-binary @- http://${PUSHGATEWAY_IP}:9091/metrics/job/${JOB_NAME}/instance/${INSTANCE_NAME}
```

**Automation:** You need to run this script periodically. The standard way is using a `cron` job.

```bash
# Open the crontab editor
crontab -e

# Add this line to run the script every minute
*/1 * * * * /path/to/your/check_ssh.sh
```

#### 2. Prometheus Alerting Rule (`ssh_rules.yml`)

Prometheus needs to know when to trigger an alarm. You will define a rule that becomes active when the `ssh_service_status` metric is `0`.

```yaml
groups:
- name: SSH_Service_Alerts
  rules:
  - alert: SSHServerDown
    # 'expr' is the trigger condition: if the ssh service status is 0.
    expr: ssh_service_status == 0
    # 'for' means the condition must be true for 1 minute before the alert
    # becomes active. This prevents false alarms.
    for: 1m
    labels:
      # A label to categorize the alert, e.g., 'critical'.
      severity: critical
    annotations:
      # 'summary' is the title of the alert.
      summary: "SSH service down on instance {{ $labels.instance }}"
      # 'description' is the body of the alert message.
      description: "The SSH service on job '{{ $labels.job }}' at instance {{ $labels.instance }} has been down for more than 1 minute."
```

#### 3. Alertmanager Receiver Configuration (`alertmanager.yml`)

This configuration follows the same pattern as the one from Week 5, extended with grouping settings and a message template. Alertmanager just needs to know how to route alerts to Telegram.
```yaml
route:
  # The default receiver for all alerts.
  receiver: 'telegram-notifications'
  # Grouping and timing settings.
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h

receivers:
- name: 'telegram-notifications'
  telegram_configs:
  - api_url: 'https://api.telegram.org'
    # REPLACE with your Bot Token
    bot_token: 'YOUR_TELEGRAM_BOT_TOKEN'
    # REPLACE with your Chat ID
    chat_id: YOUR_TELEGRAM_CHAT_ID
    # The message to be sent, using Go templating
    message: |
      [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
      *Summary:* {{ .CommonAnnotations.summary }}
      *Description:* {{ .CommonAnnotations.description }}
```

---

### Expected Final Result

When everything is configured correctly, here is what will happen:

1. **Normal Condition:** The `check_ssh.sh` script runs every minute, finds port 22 open, and pushes `ssh_service_status 1` to Prometheus. No alerts are active.
2. **SSH Server Goes Down:** You manually stop the SSH service on your server (e.g., `sudo systemctl stop ssh`).
3. **Failure Detection:** The next time `check_ssh.sh` runs, it fails to connect to port 22 and pushes `ssh_service_status 0`.
4. **Alert Becomes Active:** Prometheus receives the `0` status. After this condition persists for 1 minute (as defined by `for: 1m`), the `SSHServerDown` alert enters a "firing" state.
5. **Notification is Sent:** Prometheus sends the alert details to Alertmanager.
6. **Message Received in Telegram:** Alertmanager receives the alert, formats the message according to its template, and sends it to your Telegram chat. You will receive a message that looks like this:

> **[FIRING:1] SSHServerDown**
> *Summary:* SSH service down on instance your-server-hostname
> *Description:* The SSH service on job 'ssh_monitoring' at instance your-server-hostname has been down for more than 1 minute.

## Week 7 (Holiday)

## Week 8 Analysis & Summary

This week, you explored a new agent framework and were introduced to containerization with Docker. The focus was on building more autonomous agents that can execute complex, multi-step tasks.

#### 1. A New Framework: `smol-agents`

You transitioned from LangChain to `smol-agents`, a lightweight library for creating tool-calling agents. It offers a simple and direct way to define tools and run an agent loop.

#### 2. Building an Advanced, Multi-Tool Agent

The main project for the week involved creating an agent with several tools to perform system diagnostics (a sketch of two such tools follows the list below).

* **Tools Defined:**
    * `is_ip`: A helper function to validate IP addresses.
    * `pingtestwithip`: A tool that runs the `ping` command to check if an IP is reachable.
    * `DN2IP`: A tool that uses the `dig` command to resolve a domain name to an IP address.
    * `sendmsg2TG`: A tool that sends a message to Telegram using the `requests` library.
* **Agentic Workflow (Chaining Tools):** The key learning was seeing how the agent can chain tools together to solve a complex problem. The provided screenshot shows this perfectly:
    1. **User asks:** "can i ping www.nqu.edu.tw?"
    2. **Agent's Plan (Step 1):** The agent realizes it cannot ping a domain name directly. It first calls the `DN2IP` tool to get the IP address for `www.nqu.edu.tw`.
    3. **Agent's Plan (Step 2):** Using the IP address (`120.125.96.159`) returned from the first tool, the agent now calls the `pingtestwithip` tool.
    4. **Final Answer (Step 3):** After receiving the "ping test ok" result, the agent formulates a final, comprehensive answer for the user.
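The instructor supplied the full tool set; as a minimal sketch of the idea, two of these tools might be written like this in `smol-agents` (the tool bodies and the `LiteLLMModel` wiring to the local Ollama server are assumptions, not the original class code):

```python
import subprocess
from smolagents import tool, ToolCallingAgent, LiteLLMModel

@tool
def DN2IP(domain: str) -> str:
    """Resolves a domain name to an IP address using dig.

    Args:
        domain: The domain name to resolve (e.g., 'www.nqu.edu.tw').
    """
    out = subprocess.run(["dig", "+short", domain], capture_output=True, text=True)
    return out.stdout.strip() or f"could not resolve {domain}"

@tool
def pingtestwithip(ip: str) -> str:
    """Pings an IP address once to check whether it is reachable.

    Args:
        ip: The IPv4 address to ping (e.g., '120.125.96.159').
    """
    ok = subprocess.run(["ping", "-c1", "-W1", ip], capture_output=True).returncode == 0
    return "ping test ok" if ok else "ping test failed"

# Local model served through Ollama (assumed endpoint and model name)
model = LiteLLMModel(model_id="ollama_chat/mistral:instruct",
                     api_base="http://127.0.0.1:11434")
agent = ToolCallingAgent(tools=[DN2IP, pingtestwithip], model=model)
agent.run("can i ping www.nqu.edu.tw?")
```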
#### 3. Introduction to Docker

You were introduced to **Docker**, a foundational technology for modern software development.

* **Core Concepts:**
    * **Isolation:** Docker allows you to package an application and all its dependencies (libraries, system tools, etc.) into a standardized unit called a "container." This container runs in isolation from the rest of the system, ensuring that it works consistently everywhere.
    * **cgroups (Control Groups):** This is a Linux kernel feature that Docker uses to limit and monitor the resources (CPU, memory, etc.) that a container can use.
* **Setup:** You were instructed on how to add your user to the `docker` group (`sudo usermod -aG docker $USER`), which allows you to run `docker` commands without needing `sudo` every time.
* **Reference:** [Cnblogs article on Docker](https://www.cnblogs.com/wtzbk/p/15077185.html)

## Assignment: Check if a Service is Running

**Objective:** Develop a tool that can determine whether a web server (like Apache2) is running on the local machine.

**Solution:** We will create a Python function `is_service_running` that attempts to connect to the local web server on port 80. The `@tool` decorator makes it available to the agent.

```python
import requests
from smolagents import tool

@tool
def is_service_running(service_name: str) -> str:
    """
    Checks if a known local service is running by testing its default port.
    Currently supports 'apache2'.

    Args:
        service_name: The name of the service to check (e.g., 'apache2').

    Returns:
        A string indicating if the service is 'running' or 'not running'.
    """
    if service_name.lower() == 'apache2':
        # Apache typically runs on port 80
        url = "http://127.0.0.1:80"
        try:
            # Use a short timeout to avoid long waits
            response = requests.get(url, timeout=2)
            # A successful request (even with an error code) means something is listening.
            return f"The {service_name} service is running."
        except requests.ConnectionError:
            return f"The {service_name} service is not running."
    else:
        return f"I don't know how to check the status of the '{service_name}' service."

# To use this, you would add `is_service_running` to your agent's list of tools.
# agent = ToolCallingAgent(tools=[..., is_service_running], model=model)
# agent.run("is the apache2 running?")
```

**How it Works:** When the user asks, "is the apache2 running?", the AI agent sees the `is_service_running` tool. The tool's description tells the AI that it can check service statuses. The agent then calls the tool with the argument `service_name='apache2'`, and the function returns a human-readable string with the answer.

---

## Assignment: Install a Software Package

**Objective:** Develop a tool that allows the AI to install software packages on the system.

**Solution:** This tool is more powerful and requires executing system commands with `sudo`, so it should be used with extreme caution. The tool will construct and run an `apt-get install` command.

```python
import os
from smolagents import tool

@tool
def install_package(package_name: str) -> str:
    """
    Installs a software package on this Debian/Ubuntu system using apt-get.
    This requires passwordless sudo privileges for the user running the script.

    Args:
        package_name: The name of the package to install (e.g., 'htop').

    Returns:
        A message indicating whether the installation was successful or failed.
    """
    # WARNING: This command runs with sudo. It is a security risk.
    command = f"sudo apt-get install -y {package_name}"
    print(f"Executing command: {command}")

    # os.system returns 0 on success
    result_code = os.system(command)

    if result_code == 0:
        return f"Successfully installed the package: {package_name}."
    else:
        return f"Failed to install the package: {package_name}. Check for errors in the terminal."

# To use this, you would add `install_package` to your agent's list of tools.
# agent = ToolCallingAgent(tools=[..., install_package], model=model)
# agent.run("please install the htop package")
```
else: return f"Failed to install the package: {package_name}. Check for errors in the terminal." # To use this, you would add `install_package` to your agent's list of tools. # agent = ToolCallingAgent(tools=[..., install_package], model=model) # agent.run("please install the htop package") ``` **How it Works:** When the user asks to "install htop", the AI recognizes the `install_package` tool from its description. It calls the tool with `package_name='htop'`. The Python function then executes `sudo apt-get install -y htop`, and reports back to the user whether the command succeeded or failed. --- ## Week 9 Analysis & Summary --- This week introduced a cutting-edge concept: the **Model Context Protocol (MCP)**, a standardized way for AI models to securely interact with local data. You also gained more hands-on experience with **Docker** by managing and running containerized web servers. ### 1. Practical Docker Usage: Running Web Servers You continued to build on your Docker knowledge from the previous weeks. The focus was on the lifecycle of containers: pulling images, running them, checking their status, and cleaning up. * **Docker Commands Review:** * `docker pull centos:centos7.9.2009`: Downloads a specific version of the CentOS Linux image from Docker Hub. * `docker images`: Lists all the images you have downloaded locally. * `docker run -it centos:centos7.9.2009 echo "hi"`: Starts a new container from the CentOS image, runs the command `echo "hi"`, and then exits. * `docker ps`: Shows currently *running* containers. * `docker ps -a`: Shows *all* containers, including those that have stopped (exited). * `docker rm -f $(docker ps -a -q)`: A powerful command to forcefully remove (`rm -f`) all containers (`ps -a -q` lists the IDs of all containers). This is useful for cleaning up your system. * **Project Result Example:** ![image](https://hackmd.io/_uploads/Hky7RfQQeg.png) , successfully ran multiple web server containers. Docker assigned different internal IP addresses to them, and by mapping different host ports (e.g., 8080) to the container's port 80, you could access both web servers (`myweb1` and `myweb2`) from your browser simultaneously. --- ### 2. The Model Context Protocol (MCP) This was the core new topic of the week. MCP is a proposed open standard designed to solve a major problem: how can AI models (especially those running in desktop apps like Claude) securely access your local files and data without you having to upload them? * **What is it?** MCP acts as a universal bridge. An AI application that supports MCP can request context from a local "MCP server." This server is a small, local program that has permission to access specific folders on your computer. * **Why is it important?** It enhances privacy and security. Your files on your desktop never leave your machine. The AI application only receives the *content* of the files it needs, when it needs it, through this secure, local protocol. * **References:** * Official Website: [mcp.so](https://mcp.so/) * News Article (Chinese): [Business Next: What is MCP?](https://www.bnext.com.tw/article/82706/what-is-mcp) ### 3. Setting Up a Local MCP Environment The main task was to set up an environment where the Claude Desktop App could talk to a local MCP server that has access to your desktop files. #### a. Installing Development Tools (`uv`) To manage the Python environment for developing or running MCP servers, you used `uv`, a very fast, modern Python package installer from Astral (the makers of Ruff). 
* **Installation:** You used a PowerShell command to download and install `uv`.
* **Usage:** You then used `uv python install 3.12` to quickly install a specific version of Python.
* **Reference:** The process is shown clearly in the screenshot: ![image](https://hackmd.io/_uploads/Sk_Y0fQXel.png)

#### b. Configuring the Claude App for MCP

You configured the **Claude Desktop App** to enable its experimental MCP feature. This was done by creating a configuration file that tells the app how to start your local MCP server.

* **Configuration File (`claude_desktop_config.json`):**
    ```json
    {
      "mcpServers": {
        "filesystem": {
          "command": "npx",
          "args": [
            "-y",
            "@modelcontextprotocol/server-filesystem",
            "/Users/user/Desktop"
          ]
        }
      }
    }
    ```
* **Explanation:**
    * `"filesystem"`: You are defining a context source named "filesystem".
    * `"command": "npx"`: This tells the Claude app to use `npx` (Node.js Package Runner) to start the server.
    * `"args"`: These are the arguments passed to `npx`.
        * `@modelcontextprotocol/server-filesystem`: This is the official MCP server package that knows how to read files.
        * `/Users/user/Desktop`: This is the crucial part. You are giving this specific server permission to access **only** the files in your Desktop folder.
* **End Result:** When you are in the Claude app and type `@filesystem`, the app will automatically start this local server. You can then ask questions about files on your desktop (e.g., "@filesystem summarize my report.docx"), and Claude can answer without the file ever being uploaded to Anthropic's servers.

## Week 10 Analysis & Summary

#### 1. Building Custom Docker Images with a `Dockerfile`

This week, you learned the professional way to create Docker images. Instead of manually changing a running container and using `docker commit`, you now define the entire image build process in a text file called a `Dockerfile`.

* **Dockerfile Instructions:** (the assembled file is sketched below)
    * `FROM ubuntu:22.04`: Specifies the base image to start from.
    * `RUN apt update; apt install -y apache2`: Executes commands inside the image, like installing software.
    * `ADD index.html /var/www/html`: Copies files from your local machine into the image.
    * `EXPOSE 80`: Informs Docker that the container will listen on this port.
    * `ENTRYPOINT ["/usr/sbin/apache2ctl", "-D", "FOREGROUND"]`: Sets the default command to run when the container starts.
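Assembled, the `Dockerfile` for the custom web image looks like this (reconstructed from the instructions listed above; `index.html` is your own page sitting in the build directory):

```dockerfile
# Base image
FROM ubuntu:22.04

# Install the Apache web server inside the image
RUN apt update; apt install -y apache2

# Copy your own page from the build context into the web root
ADD index.html /var/www/html

# Document the port the container listens on
EXPOSE 80

# Run Apache in the foreground so the container keeps running
ENTRYPOINT ["/usr/sbin/apache2ctl", "-D", "FOREGROUND"]
```

You would build and run it with something like `docker build -t ubuntu:web .` followed by `docker run -d -p 8081:80 ubuntu:web`, which matches the two web servers on ports `8081` and `8080` in the result below.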
* **Project Result Example:**
    * You practiced this by creating a custom `ubuntu:web` image. The screenshot ![d](https://hackmd.io/_uploads/SJIZ-QQmxx.jpg) shows this custom image running alongside the standard `ubuntu:apache2` image, each serving content on a different port (`8081` and `8080`).
    * Other screenshots show the results of running different containers: ![b](https://hackmd.io/_uploads/SyZ4bQ77xl.jpg) shows the default Apache "It works!" page, ![c](https://hackmd.io/_uploads/SJ-SWQXXlx.jpg) shows a custom "hello world" page served from a container, and ![a](https://hackmd.io/_uploads/rJjX-XQ7gg.jpg) shows a containerized web app vulnerable to SQL injection (`' OR 1=1#`).

#### 2. Expanding the Model Context Protocol (MCP) Ecosystem

You made your Claude AI even smarter by adding two new standard MCP servers to your configuration file, allowing it to access new types of real-time information.

* **`mcp-server-time`:**
    * **Purpose:** Provides the current date and time as a context source.
    * **Use Case:** You can now ask time-sensitive questions like, "@time what day is it today?" and the AI will get the live, accurate answer from this local server.
* **`mcp-server-fetch`:**
    * **Purpose:** Allows the AI to fetch the raw content from any URL.
    * **Use Case:** This is extremely powerful. You can ask, "@fetch summarize the contents of https://www.ithome.com.tw/". The `fetch` server will download the HTML of the page and provide it to the AI as context for summarization.
* **Updated Configuration (`claude_desktop_config.json`):**
    ```json
    {
      "mcpServers": {
        "filesystem": { "command": "..." },
        "time": {
          "command": "python",
          "args": ["-m", "mcp_server_time"]
        },
        "fetch": {
          "command": "uvx",
          "args": ["mcp-server-fetch"]
        }
      }
    }
    ```

#### 3. Exploring Automated Hacking and Security Tools

You were introduced to `auto-hacker`, an open-source project that uses browser automation (`playwright`) to perform security-related tasks. The setup commands (`conda create`, `pip install playwright`, `playwright install`) prepare the environment for running this kind of advanced, automated tool. This connects to the security theme seen in the SQL injection example.

## Assignment: Fetch Stock Prices and Save as CSV

**Objective:** Develop a tool that allows an AI agent to get the last 3 days of stock prices for a given company (e.g., META), format the data as a CSV file, and save it to the user's Desktop.

**Solution:** This task requires creating a new, specialized tool. We'll use the `yfinance` library to get stock data and `pandas` to easily handle and save the data. This tool can be added to the `fastmcp` application you started building.

First, ensure you have the necessary libraries installed: `pip install yfinance pandas`

Next, define the tool in your Python script:

```python
import os
import yfinance as yf
import pandas as pd
from fastmcp import FastMCP
from datetime import date, timedelta

# Assume 'mcp' is your FastMCP app instance
mcp = FastMCP("stock-fetcher-app")

@mcp.tool()
def get_stock_price(ticker: str, days: int = 3) -> str:
    """
    Fetches recent historical stock data for a given ticker, saves it as
    a CSV file on the Desktop, and returns the status.

    Args:
        ticker: The stock symbol to look up (e.g., 'META', 'GOOGL').
        days: The number of recent days of data to fetch. Defaults to 3.

    Returns:
        A confirmation message with the file path or an error message.
    """
    try:
        # 1. Fetch data using yfinance
        stock = yf.Ticker(ticker)
        end_date = date.today()
        # Fetch a few extra calendar days to make sure we get 3 trading days
        start_date = end_date - timedelta(days=days + 4)
        hist_data = stock.history(start=start_date.strftime('%Y-%m-%d'),
                                  end=end_date.strftime('%Y-%m-%d'))

        if hist_data.empty:
            return f"Could not find any data for ticker '{ticker}'."

        # 2. Format the data: select the last 'days' rows and desired columns
        recent_data = hist_data.tail(days)[['Open', 'High', 'Low', 'Close', 'Volume']]
        recent_data.index = recent_data.index.strftime('%Y-%m-%d')  # Format date
        recent_data.reset_index(inplace=True)
        recent_data.rename(columns={'index': 'Date'}, inplace=True)

        # 3. Save to Desktop
        desktop_path = os.path.join(os.path.expanduser("~"), "Desktop")
        file_path = os.path.join(desktop_path, f"{ticker}_last_{days}_days.csv")
        recent_data.to_csv(file_path, index=False)

        return f"Successfully saved {ticker} stock data to {file_path}"
    except Exception as e:
        return f"An error occurred: {str(e)}"

# To run this, you would add the tool to your MCP application and ask the AI:
# "can you get recent 3 days for META stock price? Save it in csv format and save it on the Desktop"
```

**Project Result:** When the AI uses this tool, it will create a CSV file on your desktop.
The screenshot you provided ![e](https://hackmd.io/_uploads/H17vWX7Xxx.jpg) shows exactly this result, though it was for TSMC (`台積電`) instead of META. The content and format (Date, Open, High, Low, Close, etc.) are what this tool is designed to produce. --- ## Week 11 Analysis & Summary --- This week, you moved from single, isolated containers to building a fully functional, **multi-container web application**. The core topics were understanding **Docker networking models** and **managing container images** for sharing and deployment. ### 1. Managing & Sharing Docker Images Before deploying applications, you need to manage your images. This involves versioning them and storing them in a registry. * **Tagging Images (`docker tag`):** Tagging is how you give an image a specific name and version. The standard format is `username/repository:tag`. * **Command:** `docker tag ubuntu:web smallko/ubuntu:web` * **Explanation:** This command takes your existing local image `ubuntu:web` and gives it a new name, `smallko/ubuntu:web`. The `smallko/` prefix is the username, which is necessary for pushing the image to a registry. * **Image Registries (Docker Hub & Harbor):** * **Docker Hub:** A public, cloud-based registry where anyone can store and download images. To push your tagged image, you would first log in (`docker login`) and then use `docker push smallko/ubuntu:web`. Your screenshot ![b](https://hackmd.io/_uploads/ryiMSX7Qxg.jpg) shows the successful login process. * **Harbor:** An open-source, private registry that you can host yourself. This is used by companies to store their internal, proprietary container images. * **Reference:** [用 Harbor 架設私有 Docker 倉庫 (Medium)](https://medium.com/starbugs/%E7%94%A8-harbor-%E6%9E%B6%E8%A8%AD%E7%A7%81%E6%9C%89-docker-%E5%80%89%E5%BA%AB-9e7eb2bbf769) --- ### 2. Understanding Docker Networking Models For containers to work together, they need to communicate. Docker provides several network models to achieve this. * **Default Bridge Network (`docker0`):** * By default, all containers are attached to a network bridge called `docker0`. Each container gets its own IP address on this network (e.g., 172.17.0.2, 172.17.0.3). * **Limitation:** Containers on the default bridge can only communicate with each other using their IP addresses. They **cannot** use their names for communication. Your screenshot ![c](https://hackmd.io/_uploads/Hk1MSQmQxg.jpg) clearly shows this setup. * **Custom Bridge Network:** * **This is the recommended approach for multi-container applications.** You create your own isolated network using `docker network create mynet`. * **Key Advantage:** Containers attached to the same custom network can resolve each other by their container names. Docker provides an internal DNS service for this. This means `container1` can simply connect to `container2` using the hostname `container2`. Your screenshot ![d](https://hackmd.io/_uploads/ryKgrQQQgx.jpg) shows two containers on a custom network, each with its own IP. * **Other Modes:** * `--network=none`: The container is completely isolated from any network. * `--network=container:<name>`: One container shares the network stack of another existing container. They will have the same IP address. --- ### 3. Project: Deploying a Multi-Container LAMP Stack Application This week's main project was to deploy a classic web application stack using three separate, communicating containers. #### a. The Components: 1. **Backend (Database):** A `mysql` (or `mariadb`) container to store the application's data. 2. 
---

### 3. Project: Deploying a Multi-Container LAMP Stack Application

This week's main project was to deploy a classic web application stack using three separate, communicating containers.

#### a. The Components:

1. **Backend (Database):** A `mysql` (or `mariadb`) container to store the application's data.
2. **Frontend (Web Server):** A `php:7.1-apache` container that runs the PHP code and serves the website.
3. **Management (DB Tool):** A `phpmyadmin` container that provides a web interface to manage the MySQL database.

#### b. The Deployment Process & Commands:

The key to making this work is running all three containers on the **same custom network** (`mynet`).

1. **Create the Network:**
    ```bash
    docker network create mynet
    ```
2. **Run the MySQL Database Container:**
    ```bash
    # -d: run in detached mode
    # --name mysql: give the container a specific name
    # -e MYSQL_ROOT_PASSWORD=root: set the root password for the database
    # --network=mynet: attach to our custom network
    docker run -itd --name mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=root --network=mynet mysql:5.7.24
    ```
3. **Run the phpMyAdmin Container:**
    ```bash
    # --name phpmyadmin: give it a name
    # -e PMA_HOST="mysql": THIS IS THE CRITICAL STEP. We tell phpMyAdmin that the database host
    #   is named "mysql". Docker's DNS will resolve this name to the IP of the MySQL container.
    # -p 8080:80: map host port 8080 to the container's port 80
    docker run --name phpmyadmin -d --network=mynet -e PMA_HOST="mysql" -p 8080:80 phpmyadmin/phpmyadmin
    ```
4. **Run the PHP/Apache Container:**
    ```bash
    docker run --name my-php-apache -d -p 8000:80 --network=mynet php:7.1-apache
    ```

#### c. Verifying the Result:

* Your screenshot ![f](https://hackmd.io/_uploads/Hy5TVXmXgx.jpg) shows you successfully using `docker exec` to enter the `mysql` container and create a database and table with some data.
* Your screenshot ![e](https://hackmd.io/_uploads/H15nNQXXxe.jpg) shows you accessing the PHP server on port 8000 and viewing the `phpinfo()` page, confirming the web server is running.
* Your final screenshot ![g](https://hackmd.io/_uploads/HJnoEQQ7lx.jpg) ties everything together. It shows the phpMyAdmin web interface (running on port 8080) successfully connected to the `mysql` container and displaying the data you inserted. This proves that the container-to-container communication over the custom network is working perfectly.

## Week 12 Analysis & Summary

---

This week, you transitioned from managing single containers to orchestrating entire application environments. The main themes were using **Docker Compose** for declarative, multi-container setups, understanding **persistent data storage** with volumes, and packaging a complete **Machine Learning application** into a portable Docker image. You also got a brief introduction to container orchestration with **Docker Swarm**.

### 1. Docker Compose: Managing Multi-Container Applications

Instead of running multiple long `docker run` commands, you learned to define your entire application stack in a single `docker-compose.yml` file. This is the standard way to manage complex local development environments.

* **The `docker-compose.yml` File:** This YAML file describes all the services, networks, and volumes your application needs (a minimal sketch follows this list).
    * **`services`**: Each container is a "service" (e.g., `web`, `mysql`).
    * **`depends_on`**: Ensures services start in the correct order (e.g., start the database before the web server that connects to it).
    * **`volumes`**: This is a crucial concept for **persistent data**. The line `- /home/user/mysql:/var/lib/mysql` maps the `mysql` data directory *inside* the container to a folder on your host machine. This means your database data is saved on your computer and will still be there even if you delete the container.
    * **`networks`**: Connects all services to a custom network, allowing them to communicate using their service names (e.g., the PHP code can connect to `mysql` by name).
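The notes don't reproduce the compose file itself, so here is a minimal sketch consistent with the project described below. The service names, image tags, password, and host path are taken from the notes; the exact file used in class may differ:

```yaml
version: "3"
services:
  mysql:
    image: mysql:5.7.24
    environment:
      MYSQL_ROOT_PASSWORD: root
    volumes:
      - /home/user/mysql:/var/lib/mysql   # database files persist on the host
    networks:
      - mynet
  web:
    image: php:7.1-apache
    ports:
      - "8000:80"
    depends_on:
      - mysql                             # start the database first
    networks:
      - mynet
networks:
  mynet:    # custom network; services resolve each other by service name
```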
* **Project: PHP + MySQL Web App with Docker Compose**

    You deployed a classic web application stack defined entirely in a `docker-compose.yml` file.

    * **PHP Code (`index.php`):**
    ```php
    <?php
    // The hostname is the service name from docker-compose.yml
    $servername="mysql";
    $username="root";
    $password="root";
    $dbname="testdb";
    $conn = new mysqli($servername, $username, $password, $dbname);
    if($conn->connect_error){
        die("connection failed: " . $conn->connect_error);
    } else {
        echo "connect OK!" . "<br>";
    }
    $sql="select name,phone from addrbook";
    $result=$conn->query($sql);
    // Display each row returned by the query
    while($row = $result->fetch_assoc()){
        echo $row["name"] . " " . $row["phone"] . "<br>";
    }
    $conn->close();
    ?>
    ```
    * **Result:** Your screenshot ![e](https://hackmd.io/_uploads/rJh-uQ7mex.jpg) shows the web page successfully displaying "connect OK!" followed by data (`tom`, `mary`). This proves that the `web` container successfully connected to the `mysql` container using its service name, fetched the data, and rendered it.

---

### 2. Containerizing a Machine Learning Application

This project involved taking a complete ML workflow—from training a model to serving it via an API—and packaging it into a single, portable Docker image.

* **The Components:**
    1. **`train_model.py`:** A script to train a `scikit-learn` decision tree model on the Iris dataset and save it as `model.pkl`.
    2. **`server.py`:** A `Flask` web server that loads `model.pkl` and exposes a `/api` endpoint for predictions.
    3. **`Dockerfile`:** The blueprint that assembles everything.
* **The `Dockerfile`:**
    ```dockerfile
    # Start from a base image with Python
    FROM nitincypher/docker-ubuntu-python-pip

    # Set the working directory inside the container
    WORKDIR /app

    # Copy and install dependencies
    COPY ./requirements.txt /app/requirements.txt
    RUN pip install -r requirements.txt

    # Copy the application scripts
    COPY server.py /app
    COPY train_model.py /app

    # Command to run when the container starts:
    # First, train the model. Then, start the API server.
    CMD python /app/train_model.py && python /app/server.py
    ```
* **Result:** Your screenshot ![c](https://hackmd.io/_uploads/SyxZum7mxg.jpg) shows the `docker build` process successfully executing the steps in the `Dockerfile` to create the `iris:2.0` image. The final step is to run this image and test it with the `client.py` script to get a live prediction.

---

### 3. Introduction to Container Orchestration: Docker Swarm

You briefly touched upon Docker Swarm, which is Docker's native solution for managing applications across multiple machines (a "cluster").

* **Key Concepts:**
    * **Manager & Workers:** A swarm consists of manager nodes (which control the cluster) and worker nodes (which run the containers).
    * **Services:** Instead of running single containers, you define "services" (e.g., `docker service create ...`; a typical command sequence is sketched below). A service can have multiple replicas (copies) of a container running across the cluster for high availability and load balancing.
* **Result:**
    * Your screenshot ![f](https://hackmd.io/_uploads/HkWx_mQXeg.jpg) shows a Docker Swarm visualizer displaying a cluster with one manager and two worker nodes.
    * Your screenshot ![g](https://hackmd.io/_uploads/Sko1OXm7el.jpg) shows three browser tabs, each hitting a different IP address but showing the same "It works!" page. This is a classic demonstration of a load-balanced service running with three replicas in a Docker Swarm.
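The notes don't list the exact commands used, so the following is a sketch of the kind of workflow that would produce this result. It assumes the Apache `httpd` image (whose default page is "It works!") and the port 8881 that reappears in next week's HAProxy configuration:

```bash
# On the manager node: initialize the swarm
docker swarm init

# On each worker node: join using the token printed by 'swarm init'
# docker swarm join --token <token> <manager-ip>:2377

# Create a service with three replicas, published on port 8881
docker service create --name web --replicas 3 -p 8881:80 httpd

# Inspect the service and see which nodes the replicas landed on
docker service ls
docker service ps web
```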
## Week 13 Analysis & Summary

---

This week, you explored two key areas essential for managing production environments: **advanced Linux file permissions** for securing systems, and **advanced Docker Swarm management** for controlling application deployment and availability.

### 1. Advanced Linux File & Directory Permissions

You moved beyond basic `chmod` to understand more granular and powerful ways to control file access.

#### a. Standard Permissions with Users & Groups

The main goal was to configure a file (`hi.txt`) to be readable by users `user` and `mary`, but completely inaccessible to `peter`.

* **Setup:**
    1. Create the necessary users (`adduser mary`, `adduser peter`), as shown in your screenshot (`b.jpg`).
    2. Create a new group, e.g. `sharedgroup` (`groupadd sharedgroup`).
    3. Add `user` and `mary` to this new group (`usermod -aG sharedgroup user`, `usermod -aG sharedgroup mary`).
    4. Change the group ownership of `hi.txt` to `sharedgroup` (`chown :sharedgroup hi.txt`).
    5. Set the permissions to `660` (`chmod 660 hi.txt`).
* **Result:**
    * The `660` permission translates to `rw-` for the owner, `rw-` for the group, and `---` for others.
    * Since `user` and `mary` are in the `sharedgroup`, they can read the file.
    * Since `peter` is not the owner and not in the group, he falls under "others" and has no permissions.
    * Your screenshot ![c](https://hackmd.io/_uploads/B1Q4q7mXge.jpg) perfectly demonstrates this outcome, where `peter` gets a "Permission denied" error while the owner can still read the file.

#### b. Directory Permissions

You learned a critical rule: **file operations like deleting or renaming are controlled by the permissions of the *parent directory*, not the file itself.**

* **Demonstration:**
    1. A directory `testdir` is created.
    2. The owner's write (`w`) permission is removed from the directory (`chmod u-w testdir`).
    3. Even though the user still owns the files *inside* `testdir`, they can no longer delete or rename them because they don't have write permission on the directory.
* **Project Result:** Your screenshot ![d](https://hackmd.io/_uploads/ByyB57QQxe.jpg) shows exactly this: attempting to `rm` or `mv` a file inside the modified directory results in a "Permission denied" error.

#### c. Special File Attributes with `chattr`

This is an advanced technique that provides security beyond standard permissions. These attributes can even restrict the `root` user.

* `chattr +i <file>`: Sets the **immutable** attribute. The file cannot be deleted, renamed, modified, or written to by *any* user, including root.
* `chattr +a <file>`: Sets the **append-only** attribute. Data can only be added to the end of the file. Existing data cannot be modified or deleted. This is very useful for log files.
* **Project Result:** Your screenshot (`e.jpg`) clearly shows that after setting the `+i` attribute on a file, even the `root` user gets an "Operation not permitted" error when trying to delete or modify it. A short sequence reproducing this experiment is sketched below.
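A minimal way to reproduce the experiment, assuming a throwaway file and root privileges:

```bash
touch demo.txt
chattr +i demo.txt      # set the immutable attribute
lsattr demo.txt         # the 'i' flag is now shown
rm demo.txt             # fails, even as root: "Operation not permitted"
echo hi >> demo.txt     # writes also fail
chattr -i demo.txt      # remove the attribute so the file can be cleaned up
rm demo.txt             # now succeeds
```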
---

### 2. Advanced Docker Swarm & Load Balancing

You learned how to manage a Swarm cluster professionally by controlling where services are placed and how traffic is distributed.

#### a. Node & Service Management

* **Draining a Node:** The command `docker node update --availability drain <node_name>` gracefully removes all tasks from a node without removing the node from the swarm. This is the correct way to prepare a node for maintenance or a reboot.
* **Service Placement with Labels and Constraints:** This is a powerful feature for controlling deployment.
    1. **Labeling Nodes:** You can add custom metadata (labels) to your nodes, such as `docker node update --label-add env=prod ubuntu3`.
    2. **Creating Constrained Services:** You can then create a service that will *only* run on nodes with a specific label: `docker service create --constraint node.labels.env==test ...`.
    3. **Updating Constraints:** You can dynamically move a service from one environment to another by updating its constraints: `docker service update --constraint-rm ...` and `--constraint-add ...`.
* **Project Result:** Your screenshot ![i](https://hackmd.io/_uploads/rkv_9XmQxx.jpg) shows the `web` service running on nodes with the `env=test` label. After the update, ![j](https://hackmd.io/_uploads/BkgY5X7mlg.jpg) shows the `web1` service has moved to the `ubuntu3` node, which has the `env=prod` label.

#### b. Using HAProxy as a Load Balancer

While Swarm has its own internal load balancer, **HAProxy** is a powerful, dedicated external load balancer. You configured it to distribute traffic across your Swarm services.

* **HAProxy Configuration (`haproxy.cfg`):**
    ```
    frontend myfrontend
        bind 0.0.0.0:9999
        default_backend myservers

    backend myservers
        balance roundrobin
        server server1 192.168.164.134:8881
        server server2 192.168.164.135:8881
        server server3 192.168.164.136:8881
    ```
* **Explanation:**
    * The `frontend` listens for all incoming traffic on port `9999`.
    * It forwards this traffic to the `backend` pool named `myservers`.
    * The `backend` uses a `roundrobin` algorithm to distribute requests evenly among the three defined server IPs, which are your Swarm nodes running the web service.
* **Project Result:** Your screenshot ![h](https://hackmd.io/_uploads/ByjF97Xmxe.jpg) shows `docker ps` with the `haproxy-master` container running and listening on port `9999`. The browser is pointed to this port, and HAProxy successfully forwards the request to one of the backend web servers, which returns the "It works!" page.

## Week 14 Analysis & Summary

---

This week marks a major shift from manual server management to automated **Configuration Management**. You were introduced to industry-standard tools and focused on learning **Ansible** to automate tasks across multiple machines efficiently and reliably.

### 1. Introduction to Configuration Management

Managing one server is easy, but configuring and maintaining tens or hundreds of servers manually is impossible. Configuration Management tools solve this problem by allowing you to define the desired state of your systems in code.

* **Key Tools:**
    * **Ansible:** The focus of this week. It is agentless, meaning it communicates with managed nodes over standard SSH without needing any special software installed on them.
    * **SaltStack:** Another powerful, Python-based automation tool.
    * **Puppet:** One of the oldest and most established tools in this space.
* **Ansible's Advantage:** Its agentless architecture makes it simple to get started with. All you need is Python on the managed nodes and SSH access.

---

### 2. Setting Up the Ansible Environment

The foundation of Ansible is its ability to securely connect to and execute commands on remote servers.

#### a. Passwordless SSH Login

This is the most critical prerequisite. The Ansible control node must be able to SSH into the managed nodes without a password prompt.

1. **Generate SSH Keys:** On the control node (ubuntu1), you generate a key pair.
    * **Command:** `ssh-keygen`
    * As shown in your screenshot ![a](https://hackmd.io/_uploads/SyS43QQ7xe.jpg), this creates `id_rsa` (private key) and `id_rsa.pub` (public key) in the `~/.ssh/` directory.
2. **Modify Hostname Resolution:** You edited the `/etc/hosts` file to map hostnames (ubuntu2, ubuntu3) to their IP addresses. This allows you to use easy-to-remember names instead of IPs.
3. **Allow Root Login on Managed Nodes:** For `ssh-copy-id` to work with the root user, you must first permit root login over SSH on the managed nodes (ubuntu2, ubuntu3).
    * **Action:** Edit `/etc/ssh/sshd_config` and change `PermitRootLogin` to `yes`.
    * This step is shown clearly in your screenshot ![b](https://hackmd.io/_uploads/BJfS3QX7gx.jpg).
4. **Copy Public Key to Managed Nodes:** The `ssh-copy-id` command appends your public key to the `~/.ssh/authorized_keys` file on the remote machine.
    * **Command:** `ssh-copy-id root@ubuntu2`
    * Your screenshot ![c](https://hackmd.io/_uploads/BkKB3m7Xex.jpg) shows the process, which failed initially due to a password issue but then succeeded, allowing passwordless login.

#### b. The Inventory File

The inventory file is the heart of Ansible. It tells Ansible which servers it can manage. The default location is `/etc/ansible/hosts`.

* **Defining Hosts and Groups:** You can group servers logically (e.g., `[web]`, `[db]`) for targeting commands.
* **Specifying Connection Details:** You can specify non-standard details like the SSH port.
* **Example Inventory:**
    ```ini
    [web]
    ubuntu2

    [db]
    # This host will be connected to on port 2222
    ubuntu3:2222

    [allservers]
    ubuntu2
    ubuntu3:2222
    ```

---

### 3. Running Ad-Hoc Commands

Ansible has two primary modes: **Ad-Hoc** (for single, one-off tasks from the command line) and **Playbook** (for complex, multi-step tasks defined in a file). This week focused on Ad-Hoc commands.

The basic structure is: `ansible <host-pattern> -m <module_name> -a "<arguments>"`

#### a. Testing Connectivity (`ping` module)

This is the "hello world" of Ansible. It verifies that Ansible can connect to the nodes, find a Python interpreter, and execute a module.

* **Command:** `ansible allservers -m ping`
* **Result:** You received a `SUCCESS` message with a `"ping": "pong"` response, confirming your setup is correct.

#### b. Executing Remote Commands (`command` and `shell` modules)

* **`command` module:** A simple module for executing basic commands. It does **not** support shell features like pipes (`|`), redirection (`>`, `<`), or variables (`$HOME`).
* **`shell` module:** Executes commands through the node's default shell (e.g., `/bin/sh`). It supports all standard shell features.
* **Project Results:**
    * Your screenshot ![d](https://hackmd.io/_uploads/rkiInmQ7xg.jpg) shows a successful `command` module execution: `ansible web -m command -a "chdir=/tmp cat hi.txt"`.
    * Your screenshot ![e](https://hackmd.io/_uploads/SkzD2QX7le.jpg) demonstrates the difference. An `ifconfig` command with a pipe (`|`) fails with the `command` module but works perfectly with the `shell` module.

#### c. Executing Local Scripts (`script` module)

This module takes a script from your local control node, copies it to the remote nodes, executes it, and then deletes it.

* **Command:** `ansible web -m script -a test.sh`
* **Project Result:** Your screenshot ![f](https://hackmd.io/_uploads/rJOPhQXQle.jpg) shows the `test.sh` script being run on `ubuntu2` and its output (`hello world` and the hostname) being returned.
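The notes don't include `test.sh` itself; a minimal sketch consistent with that output would be:

```bash
#!/bin/bash
# Copied to and executed on each remote node by the script module
echo "hello world"
hostname
```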
#### d. Managing Packages and Services

Ansible uses modules to ensure a system is in a specific state. This is called **idempotency** – running the same command multiple times produces the same result without error.

* **`apt` module:** Manages system packages.
    * `state=present`: Ensures the package is installed.
    * `state=absent`: Ensures the package is removed.
* **`service` module:** Manages system services.
    * `state=started`: Ensures the service is running.
    * `state=stopped`: Ensures the service is stopped.
* **Example Commands:**
    ```bash
    # Ensure Apache is installed on all web servers
    ansible web -m apt -a "name=apache2 state=present"

    # Ensure Apache is running on all web servers
    ansible web -m service -a "name=apache2 state=started"
    ```

## Week 15 Analysis & Summary

---

This week, you moved from simple, one-off ad-hoc commands to the core of Ansible's power: **Playbooks**. A playbook is a YAML file where you define a complete, ordered set of tasks to be executed on a group of hosts. This allows for repeatable, version-controlled, and complex infrastructure automation.

### 1. The Anatomy of an Ansible Playbook

A playbook is a list of "plays," and each play consists of a set of tasks.

* **Key Components:**
    * `hosts`: Specifies which group of servers from your inventory file this play will run on.
    * `tasks`: A list of actions to be performed. Each task calls an Ansible module.
    * `name`: A human-readable description for each task, which appears in the command-line output.
* **Execution:** You run a playbook using the command `ansible-playbook <filename.yml>`.
* **Project Result:** Your screenshot ![a](https://hackmd.io/_uploads/S1kS0Qm7ee.jpg) shows a very basic playbook, `test1.yml`, that uses the `command` module to run `/usr/bin/wall hello world` on all hosts in the `web` group. The terminal output shows the successful execution of the play.

---

### 2. Using Variables to Make Playbooks Dynamic

Hardcoding values is inefficient. Ansible provides several powerful ways to use variables to make your playbooks flexible and reusable.

#### a. Ansible Facts

Ansible automatically gathers hundreds of "facts" (system details) from each managed node. You can use these variables directly in your playbooks.

* **Example:** Using the built-in fact `{{ ansible_fqdn }}` (Fully Qualified Domain Name) to create a unique log file for each server.
    ```yaml
    - hosts: allservers
      tasks:
        - name: Create a log file named after the host
          file:
            name: "/data/{{ ansible_fqdn }}.log"
            state: touch
    ```
* **Project Result:** Your screenshot ![d](https://hackmd.io/_uploads/H1Fr0X7Qxg.jpg) shows you discovering these facts by running `ansible allservers -m setup | grep fqdn`, which displays the FQDN for each managed node.

#### b. Inventory Variables

You can define variables directly in your inventory file (`/etc/ansible/hosts`), either per-host or per-group.

* **Example Inventory:**
    ```ini
    [allservers]
    ubuntu2 http_port=8080
    ubuntu3:2222 http_port=8081

    [allservers:vars]
    domainname=example.com
    ```
* **Explanation:** `ubuntu2` gets a specific `http_port` of `8080`. All servers in the `allservers` group get the `domainname` variable.

#### c. Command-Line Variables

You can pass variables directly when you run the playbook using the `-e` (extra-vars) flag. This is useful for one-time overrides.

* **Command:** `ansible-playbook -e "pkname=apache2" test4.yml`
* **Project Result:** Your screenshot ![c](https://hackmd.io/_uploads/ByHLAmXXxe.jpg) shows this exact command being used to pass the package name `apache2` to a playbook.
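`test4.yml` itself isn't reproduced in the notes; a minimal sketch of what it might look like, assuming it simply installs whatever package `pkname` names:

```yaml
- hosts: web
  tasks:
    - name: Install the package passed in with -e "pkname=..."
      apt:
        name: "{{ pkname }}"
        state: present
```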
---

### 3. Templates for Dynamic Configuration Files

This is one of Ansible's most powerful features. The `template` module uses the **Jinja2** templating engine to create configuration files based on variables.

* **Template File (`ports.conf.j2`):**
    ```j2
    # This port number will be replaced by the variable's value
    Listen {{ http_port }}
    ```
* **Playbook Task:**
    ```yaml
    - name: Copy dynamic config file
      template:
        src: ports.conf.j2
        dest: /etc/apache2/ports.conf
    ```
* **How it Works:** When the playbook runs, it processes the `.j2` file. For the host `ubuntu2`, `{{ http_port }}` becomes `8080`. For `ubuntu3`, it becomes `8081`. The resulting personalized `ports.conf` file is then copied to each respective server.
* **Project Result:** Your screenshot ![e](https://hackmd.io/_uploads/B1JwRmmmgg.jpg) is the direct outcome of this process, showing one Apache server running on port `8080` and the other on `8081`.

---

### 4. Controlling Execution Flow

#### a. Handlers: Triggering Actions on Change

A **handler** is a special task that only runs when it is "notified" by another task. A task sends a notification only if it makes a change to the system.

* **Use Case:** You only want to restart the Apache service if its configuration file has actually changed.
* **Playbook with Handler:**
    ```yaml
    - hosts: web
      tasks:
        - name: Copy ports.conf
          copy:
            src: ports.conf
            dest: /etc/apache2/
          notify: restart apache2 service   # Notify the handler
      handlers:
        - name: restart apache2 service
          service:
            name: apache2
            state: restarted
    ```
* **Explanation:** The `restart apache2 service` handler will *only* run if the `copy` task actually copies a new version of the file. If the file is already correct, no change is made, and the service is not needlessly restarted.

#### b. Conditionals: Running Tasks Selectively

The `when` clause allows you to run a task only if a certain condition is true.

* **Example:**
    ```yaml
    - name: Shutdown a specific computer
      command: /sbin/shutdown -h now
      when: ansible_fqdn == "test-8081.test.com"
    ```
* **Explanation:** This task will be skipped on all hosts *except* the one whose FQDN exactly matches "test-8081.test.com".

## Presentation

### Infrastructure as Code

https://www.canva.com/design/DAGpTpbdkWM/WDr2EetMU-G_HANlv6npiA/edit?utm_content=DAGpTpbdkWM&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton