# CS410 Homework 5: Knowledge Representation and Reasoning (KRR)

==**Due Date: 02/24/2026 at 11:59pm**== <!-- Update due date and Ed link -->

**Need help?** Feel free to post on [Edstem](https://edstem.org/us/courses/93617) for TA assistance.

:::danger
**⚠️ Battle Alert 🏆 ⚠️**

You must activate the CS410 virtual environment every time you work on a CS410 assignment! You can find the activation instructions [here](https://hackmd.io/@cs410/BJvhqHXuR). If you forget to activate this virtual environment, you will almost certainly encounter import errors, and your code will not run.
:::

![arena5](https://hackmd.io/_uploads/HJ-wspIrZe.jpg)

## Assignment Overview

Knowledge Representation and Reasoning (KRR) is a fundamental area of artificial intelligence that focuses on how to formally represent information about the world and use that information to make inferences. KRR systems aim to capture knowledge in a structured format that enables automated reasoning and decision-making.

In this assignment, you will explore a key area of KRR: **Probabilistic Reasoning**. You will use Bayesian networks to represent probabilistic knowledge and reason under uncertainty.

## Learning Objectives

What you will know:

* Two inference methods for estimating joint probabilities in Bayesian networks, which can be used to compute the probability of events given some evidence.

What you will be able to do:

* Use Pandas and DataFrames to manage large datasets in Python.
* Construct Bayesian networks by learning CPTs from data and query them programmatically to answer probabilistic questions using exact and approximate inference techniques.

# Bayesian Networks

Bayesian networks (also called belief networks or Bayes nets) are probabilistic graphical models that represent variables and the conditional dependencies among them.
The relationships between variables are represented using a directed acyclic graph (DAG), where each variable in the model is represented by a node in the graph, and each edge represents a conditional dependency between variables. Attached to each node is a conditional probability table (CPT), which gives the conditional probability of that variable given its parents. Together, these CPTs afford a compact representation of the full joint distribution over all variables in the model.

### Conditional Probabilities

For a node $X$ with parent nodes $\{V_1, V_2, \dots, V_k\}$, the CPT defines the **conditional probabilities** of $X$ given all combinations of its parents' values. Formally:

$$
\text{CPT}(X) = \left\{ P(X = x \mid V_1 = v_1, \dots, V_k = v_k) \ \Big| \ x \in \text{Values}(X),\ v_i \in \text{Values}(V_i) \right\}
$$

Where:

- $\text{Values}(X)$ is the set of possible values of $X$.
- $P(X = x \mid V_1 = v_1, \dots, V_k = v_k)$ is the probability that $X$ takes value $x$, given that its parents have values $v_1, \dots, v_k$.

Not all nodes in a Bayesian network have parents. For nodes without parents, the CPT simply contains the prior probabilities over all values of that variable. In the first part of this assignment, you will be constructing CPTs from a dataset of samples.

<!--### Joint Probabilities ### Exact Inference-->

## Data Structures

### Pandas

In this assignment, you will be working with the *Pandas* library in Python. Pandas (frequently abbreviated *pd*) is a library for working with large-scale tabular datasets. As such, it provides many common data analytics tools. The base structure for storing data in Pandas is a *DataFrame*. Consider loading the following Comma Separated Values (CSV) data into pandas (which may be done with `df = pd.read_csv('filename.csv')`).
| Smoker | Exercises | Heart Disease Risk |
|--------|-----------|--------------------|
| Yes | No | High |
| No | Yes | Low |

This will construct a DataFrame with 3 columns and 2 rows, containing the CSV data, and store it in the variable `df`. Pandas defaults to using the first row of a CSV as the column labels.

## Accessing Data in DataFrames

### Column Access

To access a column of data, you can use `df["Exercises"]`, which will return a Series containing the data in that column: `["No", "Yes"]`. Additionally, multiple columns can be accessed at once if a list of column names is provided.

```python
# Access a single column (returns a Series)
exercise_data = df["Exercises"]
print(exercise_data)
# Output:
# 0     No
# 1    Yes
# Name: Exercises, dtype: object

# Access multiple columns (returns a DataFrame)
health_data = df[["Smoker", "Heart Disease Risk"]]
print(health_data)
# Output:
#   Smoker Heart Disease Risk
# 0    Yes               High
# 1     No                Low
```

### Row Access

To access an individual row in the DataFrame, you can use `df.iloc[i]` (for integer-location based indexing):

```python
# Access the first row (index 0)
first_row = df.iloc[0]
print(first_row)
# Output: Smoker: Yes, Exercises: No, Heart Disease Risk: High

# Access the second row (index 1)
second_row = df.iloc[1]
```

### Iterating Through Rows

If you wish to loop through the rows of a DataFrame and perform some operation, you can use the `iterrows()` method:

```python
for index, row in df.iterrows():
    print(f"Row number {index}:")
    print(f"  Smoker: {row['Smoker']}")
    print(f"  Exercises: {row['Exercises']}")
    print(f"  Heart Disease Risk: {row['Heart Disease Risk']}")
```

## Useful DataFrame Methods

There are a number of other functions built into DataFrames that you might find useful for this assignment:

### `value_counts()`

Counts the frequency of each unique value in a column:

```python
# Count how many people smoke vs don't smoke in the dataframe provided
# value_counts returns the values in that column as well as their frequency
smoker_counts = df['Smoker'].value_counts()
print(smoker_counts)
# Sample Output:
# Yes    200
# No     800

# Count heart disease risk levels
risk_counts = df['Heart Disease Risk'].value_counts()
```

`value_counts` can also be used with multiple columns at a time, in which case it will count the number of times each distinct row appears in the DataFrame.

A larger introduction to Pandas can be found [here](https://pandas.pydata.org/docs/user_guide/10min.html), and many other guides exist from other sources. **You are welcome to use any built-in method within Pandas for this assignment, not just the ones listed above.**

## Heart Disease Bayesian Network

In the first part of this assignment, you will be working with the following Bayesian network for analyzing the risk of heart disease. The network has 6 nodes with the structure shown below:

```mermaid
graph TD;
    Smoker --> BloodPressure
    Exercise --> BloodPressure
    Exercise --> Cholesterol
    BloodPressure --> HeartDiseaseRisk
    Smoker --> HeartDiseaseRisk
    Cholesterol --> HeartDiseaseRisk
    HeartDiseaseRisk --> ECGResults
```

## Conditional Probability Tables

### 1. Smoking Status (Root Node)

| Smoking Status | Probability |
|----------------|-------------|
| Smoker | 0.20 |
| Non-smoker | 0.80 |

### 2. Exercise Level (Root Node)

| Exercise Level | Probability |
|----------------|-------------|
| High | 0.30 |
| Low | 0.70 |

### 3. BloodPressure

| Smoking | Exercise | High | Normal |
|---------|----------|------|--------|
| No | High | 0.15 | 0.85 |
| No | Low | 0.30 | 0.70 |
| Yes | High | 0.25 | 0.75 |
| Yes | Low | 0.50 | 0.50 |

### 4. Cholesterol

| Exercise | High Cholesterol | Normal Cholesterol |
|----------|------------------|--------------------|
| High | 0.25 | 0.75 |
| Low | 0.45 | 0.55 |

### 5. HeartDiseaseRisk

| Smoking | BloodPressure | Cholesterol | High | Low |
|---------|---------------|-------------|------|-----|
| No | Normal | Normal | 0.05 | 0.95 |
| No | High | Normal | 0.10 | 0.90 |
| No | Normal | High | 0.08 | 0.92 |
| No | High | High | 0.15 | 0.85 |
| Yes | Normal | Normal | 0.12 | 0.88 |
| Yes | High | Normal | 0.25 | 0.75 |
| Yes | Normal | High | 0.20 | 0.80 |
| Yes | High | High | 0.40 | 0.60 |

### 6. ECGResults

| HeartDiseaseRisk | Abnormal | Normal |
|------------------|----------|--------|
| High | 0.60 | 0.40 |
| Low | 0.20 | 0.80 |

## Exact Inference

:::info
**Task 1:** Exact Inference

Compute the following probabilities by hand, and include your work and results in your README.md. If you prefer, you can upload a handwritten document or a PDF generated by LaTeX as a separate file. Include the filename in your README.md.

1. A patient receives an abnormal ECG result, but is an otherwise healthy individual (exercise=high, bloodpressure=normal, smoker=no, cholesterol=normal). What is the probability they have high heart disease risk? What is the probability they have low heart disease risk?
2. What is the conditional probability that someone at high risk for heart disease is also a smoker?
3. What is the prior probability that someone is at high risk for heart disease (with no other evidence)?
:::

:::warning
We are treating "has heart disease" as `HeartDiseaseRisk=High`, not `ECGResults=Abnormal` (or anything else).
:::

:::spoiler Tips
There are multiple ways to get answers for each of these questions. You will likely need to use Marginalization, the Chain Rule, and Bayes' Rule in some form to answer these questions.
:::

## Car Diagnostics Network

The mechanics in a car garage put together the following Bayesian network structure to help them diagnose issues in cars. It models the relationship between different components in cars.
The mechanics collected data on each car that came into their shop for a year, tracking the status of each component. Now they want you to help them to answer queries about the probability of certain issues. First, you must finish constructing the Bayesian network by building CPTs for each node based on the data. Then, you will implement methods to estimate the probability of some issue with the car given other known evidence.

```mermaid
graph TD;
    %% Root nodes (no parents)
    MainFuse["Main Fuse"]
    BatteryAge["Battery Age"]
    Alternator["Alternator"]
    Distributor["Distributor"]
    StarterMotor["Starter Motor"]
    SparkPlugs["Spark Plugs"]
    FuelSystem["Fuel System"]
    AirFilter["Air Filter"]

    %% Intermediate nodes
    ChargingSystem["Charging System"]
    BatteryVoltage["Battery Voltage"]
    VoltageAtPlug["Voltage at Plug"]
    StarterSystem["Starter System"]
    SparkTiming["Spark Timing"]
    SparkQuality["Spark Quality"]
    AirSystem["Air System"]
    CarCranks["Car Cranks"]

    %% Output nodes
    Headlights["Headlights"]
    CarStarts["Car Starts"]

    %% Connections
    Alternator --> ChargingSystem
    ChargingSystem --> BatteryVoltage
    BatteryAge --> BatteryVoltage
    BatteryVoltage --> Headlights
    BatteryVoltage --> StarterSystem
    BatteryVoltage --> VoltageAtPlug
    MainFuse --> StarterSystem
    MainFuse --> VoltageAtPlug
    StarterMotor --> StarterSystem
    StarterSystem --> CarCranks
    Distributor --> SparkTiming
    Distributor --> VoltageAtPlug
    SparkPlugs --> SparkQuality
    VoltageAtPlug --> SparkQuality
    AirFilter --> AirSystem

    %% All systems feed into Car Starts
    SparkTiming --> CarStarts
    SparkQuality --> CarStarts
    FuelSystem --> CarStarts
    AirSystem --> CarStarts
    CarCranks --> CarStarts
```

# Implementing Bayes Networks

:::info
**Task 2.1:** Constructing CPTs

Implement the `build_cpt_from_data` method in `bayesian_network.py`.
This method:

- Takes a node name and pandas DataFrame as input
- Counts occurrences of each outcome in the training data
- Returns a CPT object with learned conditional probabilities

**Important: CPT Dictionary Structure**

Your CPT table must use the correct dictionary format:

- **Root nodes (no parents):** `{"Yes": 0.2, "No": 0.8}` - Simple outcome to probability mapping
- **Nodes with one parent:** `{("High",): {"Yes": 0.3, "No": 0.7}}` - Parent value tuple maps to outcome probabilities
- **Nodes with multiple parents:** `{("Yes", "High"): {"High": 0.5, "Normal": 0.5}}` - Parent combination tuple maps to outcome probabilities

**Critical:** When a node has multiple parents like `["Smoking", "Exercise"]`, the tuple must be in that exact order: `("smoking_value", "exercise_value")`.
:::

:::success
Note: The conditional probabilities in the CPT are computed using relative frequencies from the dataset. This corresponds to the maximum likelihood estimate (MLE) of the conditional distribution P(X|Pa(X)). In other words, given a fixed network structure, the CPT values are chosen to maximize the likelihood of the observed data with respect to the parameters of the network.
:::

Run ``python bayesian_network.py --network heart --dataset heart_data.csv`` and ``python bayesian_network.py --network car --dataset car_data.csv`` to print out and inspect the Bayesian networks and CPTs produced by your code. Your CPTs for the heart disease network should closely match the probabilities provided above.

:::info
**Task 2.2:** Querying CPTs

Implement the `get_probability` method in the `CPT` class of `bayesian_network.py`. This method retrieves conditional probabilities from your CPT dictionary.
**Method Signature:** `get_probability(outcome, parent_values=None)`

**Usage Examples:**

- **Root node (Smoking):** `cpt.get_probability("Yes")` → returns P(Smoking=Yes)
- **Node with parents (BloodPressure):** `cpt.get_probability("High", {"Smoking": "Yes", "Exercise": "Low"})` → returns P(BloodPressure=High | Smoking=Yes, Exercise=Low)

**Key Requirements:**

- The `parent_values` dictionary must contain ALL parent nodes as keys
- Parent values must be converted to a tuple in the correct order to match your CPT table keys
- For the BloodPressure example above, convert `{"Smoking": "Yes", "Exercise": "Low"}` to the tuple `("Yes", "Low")` based on the parents list `["Smoking", "Exercise"]`
:::

:::spoiler Hint
If you have implemented your CPT table correctly as a dictionary, `get_probability` amounts to a lookup in that dictionary. Just make sure the order in which you construct the key (a tuple of strings) is the same order as when the table was constructed.
:::

:::info
**Task 2.3:** Testing

In `unit_tests.py`, we have provided tests for checking the network structure and CPTs of the heart disease Bayesian network. You do not need to write additional tests, but you can if you want.
:::

### Monte Carlo Sampling

Bayesian networks are **generative** models. You may be familiar with this term (or the shortening GenAI) from other classes of models, like LLMs or image-generation models such as DALL·E or Midjourney. What does it mean to be a generative model? Generative models represent joint probabilities of data. In Bayesian networks, we can query our network for the probability of any node and value given some set of evidence. Large Language Models learn to estimate the probability of the next token (e.g., next word) given context (i.e., preceding tokens). Why does learning this probability distribution relate at all to being **generative**? If we have learned a probability distribution, we can sample new data from that distribution to generate new examples.
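To make this concrete, here is a minimal sketch of drawing one value from a distribution stored in the Task 2.1 root-node dictionary format. The `sample_value` helper and `smoking_cpt` names are illustrative examples, not part of the assignment stencil:

```python
import random

# A CPT entry for a hypothetical root node, in the Task 2.1
# root-node format: a simple mapping from outcome to probability.
smoking_cpt = {"Yes": 0.2, "No": 0.8}

def sample_value(distribution):
    """Draw one outcome at random, weighted by its probability."""
    outcomes = list(distribution.keys())
    weights = list(distribution.values())
    # random.choices performs a weighted draw; k=1 returns a one-element list
    return random.choices(outcomes, weights=weights, k=1)[0]

# Drawing many samples approximately recovers the learned distribution.
samples = [sample_value(smoking_cpt) for _ in range(10_000)]
print(samples.count("Yes") / len(samples))  # close to 0.2
```

Repeating a draw like this for every node, with each node conditioned on its parents' already-sampled values, yields one complete sample from the joint distribution.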
A sample from a Bayesian network is a value for each node, drawn at random from the specified (conditional) probabilities. Sampling can provide a computationally efficient means of estimating the probabilities of queries in Bayesian networks. Rather than running an exact algorithm to compute the conditional probability of a query, sampling generates a number of samples and estimates the probability of the query from those samples.

In this assignment you will implement **rejection sampling**. Rejection sampling works by generating $n$ samples and ignoring (rejecting) any samples generated where the evidence does not match the query.

How can we actually generate a sample? Look back at the heart disease network. What variables are easy to draw new values for? Exercise and smoking are easy to generate values for because they are root nodes (i.e., they have no parents). The values of all other nodes depend on their parents. We cannot sample new values for non-root nodes until we sample the values of their parents. We have to generate samples in a specific order, starting at the roots of the network and working our way down. This order is called the *topological order* of nodes in a graph. So long as we sort our nodes into topological order, we can simply sample new values for each node using the previously sampled values. Topological sorting can be done with DFS.

:::info
**Task 3.1: Implement Rejection Sampling**

Implement the `estimate_probability` method in `bayesian_network.py`. The method takes in a query node (e.g., "Battery Age"), query value (e.g., "Old"), and (optionally) evidence (e.g., Car Starts=False), and should return the estimated probability using rejection sampling.

**Algorithm Steps:**

1. **Generate samples:** Use prior sampling to create N complete samples
2. **Filter by evidence:** Keep only samples that match all evidence constraints (reject other samples)
3. **Count matches:** Among valid samples, count how many have query_node = query_value
4. **Calculate probability:** Return count / total_valid_samples

:::warning
Implementing `forward_sample` and `_sample_node_value` will be very helpful here
:::
:::

:::success
**Tip:** We have provided a heart disease dataset that matches the example from the previous questions. You can test your `estimate_probability` method with the answers you found previously!
:::

:::info
**Task 3.2: Querying Bayesian Networks**

Use your `estimate_probability` method to answer the following questions:

1. What is the probability a car that comes into the shop has an engine that starts (i.e., P(CarStarts=True))?
2. What is the probability the battery voltage is strong when we know the battery is new and the charging system is working?
3. A new car has come into the shop that won't start. The headlights are dim. When the key is turned, the engine cranks. Should the mechanics investigate the fuel system, spark plugs, or alternator first? Which is most likely to be broken given the evidence provided?

Implement your queries in the `query_network.py` script and run it to get your results. Include your results in your README.md file.
:::

:::info
**Task 3.3: A closer look at sampling**

In the previous task, you used rejection sampling to estimate probabilities. During this process, you specified some number of samples to generate that match your provided evidence and rejected other samples.

1. Which of the queries from Task 3.2 would require the highest number of total samples to reach the same precision threshold? Why do you think that is?
2. Try the queries from 3.2 again with different numbers of samples (`estimate_probability` takes `num_samples` as an argument). Try with 10 samples, 100, 1,000, and 10,000 samples. What trends do you notice as you increase the number of samples? Do any of the queries converge faster than the others?
3. Suppose you ran your `estimate_probability` for $n$ samples. How could you know if $n$ was a large enough number of samples?
How could you know if you needed to run more samples? What information would you track, and how would you make that decision?

Add the answers to these questions to your README.md.
:::

:::info
**Task 3.4**

If you knew the labor cost of checking the outcome of each node (e.g., checking the headlights or battery voltage is easy, but testing spark plugs takes an hour of labor), how might you use a Bayesian network to determine which component a mechanic should test next? Include 2-3 sentences in your README.md on how you might incorporate labor cost.
:::

## Submission

### Download

Please click [here](https://classroom.github.com/a/4X8a_WUS) to access the assignment.

### Handin

Your handin should contain the following:

- all files, including comments describing the logic of your implementations
- a README containing:
    - your responses to any conceptual questions
    - known problems in your code
    - anyone you worked with
    - any outside resources used (e.g., Stack Overflow, ChatGPT)

### Gradescope

Submit your assignment via Gradescope. To submit through GitHub, follow this sequence of commands:

1. `git add -A`
2. `git commit -m "commit message"`
3. `git push`

Now, you are ready to upload your repo to Gradescope.

:::danger
**⚠️WARNING⚠️**

Make sure you have the correct assignment repo and branch selected before submitting.
:::

:::info
If you are having difficulties submitting through GitHub, you may submit by zipping up your hw folder.
:::

### Grading

Your code for Bayesian Networks will be graded on correctness.
<!-- TODO FIX RUBRIC-->

### Rubric

| Component | Points | Notes |
| -------- | -------- | -------- |
| Exact Inference | 15 | Points awarded for answers and work shown in README for Task 1. |
| Building & Querying CPTs | 40 | Points awarded for correct implementation of `build_cpt_from_data` and `get_probability`. |
| Approximate Inference | 15 | Points awarded for correct implementation of rejection sampling. |
| Querying Bayesian Networks | 10 | Points awarded for answering 3.2 in the README. |
| Varying sample size and application of Bayesian Networks | 20 | Points awarded for answering 3.3 and 3.4 in the README. |

:::success
<p style="text-align: center;">
<img src="https://hackmd.io/_uploads/rkhjnTUSbx.png" alt="Fifth Homework" width="800"/>

![hog-rider-relaxing](https://hackmd.io/_uploads/B1PAn68SWx.jpg)

Now go relax!
</p>
:::