# Homework 6: CSCI 347: Data Mining

Name: **Sam Behrens**

Show your work. Include any code snippets you used to generate an answer, using comments in the code to clearly indicate which problem corresponds to which code.

1. [13 points] Suppose you are developing a clustering algorithm, and you have a data set with a ground-truth clustering available. Your ground-truth clustering contains 4 clusters $T_1,...,T_4$, but your clustering algorithm found only 3 clusters $C_1,...,C_3$. The intersection $n_{ij}$ between each of the clusters $C_i$ found by your clustering algorithm and the ground-truth clusters $T_j$ is represented by the table below, where the entry in the $i$th row and $j$th column is $n_{ij}$:

|  | T1 | T2 | T3 | T4 |
| --- | --- | --- | --- | --- |
| C1 | 5 | 45 | 35 | 5 |
| C2 | 85 | 4 | 5 | 10 |
| C3 | 10 | 1 | 10 | 55 |

A. [3 points] Compute the precision of cluster C2.

```
(1/104) * 85 = 85/104 ≈ 0.817
```

B. [4 points] Compute the precision of the entire clustering.

```
(1/270) * (45 + 85 + 55) = 185/270 ≈ 0.685
```

C. [3 points] Compute the recall of cluster C1.

```
(1/50) * 45 = 45/50 = 0.9
```

D. [3 points] Compute the F-score of cluster C2.

```
recall(C2) = 85/100 = 0.85
precision(C2) = 85/104 ≈ 0.817
2 * (0.85 * 0.817) / (0.85 + 0.817) = 1.389 / 1.667 ≈ 0.833
```

2.
[7 points] Consider the following data that shows transactions of items purchased in a supermarket:

| Transaction ID | Items |
|:--------------:|:-----------------------------------------------------------------------------:|
| 1 | Toilet paper (9), beans (2), rice (8), milk (5), baby wipes (1), diapers (4) |
| 2 | Oat milk (6), beans (2), toilet paper (9), orange juice (7) |
| 3 | Oat milk (6), milk (5), orange juice (7), toilet paper (9) |
| 4 | Beans (2), toilet paper (9), baby wipes (1), diapers (4) |
| 5 | Toilet paper (9), butter (3), baby wipes (1), diapers (4) |
| 6 | Milk (5), toilet paper (9), orange juice (7) |
| 7 | Milk (5), rice (8), toilet paper (9) |
| 8 | Beans (2), milk (5), rice (8), toilet paper (9) |
| 9 | Milk (5), butter (3), diapers (4) |
| 10 | Beans (2), rice (8), toilet paper (9), baby wipes (1) |

A. [2 points] What is the support of the itemset {milk, toilet paper}?

support of {milk, toilet paper} = **5**

B. [3 points] If we use a minimum support threshold of 4, what are the itemsets of size 2 that are frequent?

Items are referred to by the ID in parentheses in the transaction table. Only items with support of at least 4 (items 1, 2, 4, 5, 8, and 9) can appear in a frequent pair, so only pairs of those items are counted. Frequent pairs are shown in bold.

| Itemset | Support |
| -------- | ------- |
| 1 | 4 |
| 2 | 5 |
| 3 | 2 |
| 4 | 4 |
| 5 | 6 |
| 6 | 2 |
| 7 | 3 |
| 8 | 4 |
| 9 | 9 |
| 1, 2 | 3 |
| 1, 4 | 3 |
| 1, 5 | 1 |
| 1, 8 | 2 |
| **1, 9** | **4** |
| 2, 4 | 2 |
| 2, 5 | 2 |
| 2, 8 | 3 |
| **2, 9** | **5** |
| 4, 5 | 2 |
| 4, 8 | 1 |
| 4, 9 | 3 |
| 5, 8 | 3 |
| **5, 9** | **5** |
| **8, 9** | **4** |

```
1, 9 (baby wipes and toilet paper)
2, 9 (beans and toilet paper)
5, 9 (milk and toilet paper)
8, 9 (rice and toilet paper)
```

C. [2 points]

a. [1 point] What is the confidence of the rule {rice} -> {beans}?

```
{8} -> {2}
support({2, 8}) = 3
support({8}) = 4
3/4 = 0.75
```

b. [1 point] What is the confidence of the rule {beans} -> {rice}?

```
{2} -> {8}
support({2, 8}) = 3
support({2}) = 5
3/5 = 0.6
```

3.
[EXTRA CREDIT: 6 points] Write a function in Python that computes the F-score of a clustering output, with 2 input parameters: an array containing the labels of each data point in the ground-truth clustering, and an array containing the labels of each data point in the clustering output. You may use the `contingency_matrix` function in scikit-learn.
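
A minimal sketch of one way to approach this, under stated assumptions: the function name `clustering_f_score` is hypothetical, and the overall F-score is assumed to be the unweighted mean of the per-cluster F-scores, with each found cluster matched to the ground-truth cluster it overlaps most (the same matching used for precision and recall in Problem 1). The sketch counts overlaps with `collections.Counter` rather than calling `contingency_matrix`, so it runs without scikit-learn. The check at the bottom rebuilds label arrays from the Problem 1 table.

```python
# Problem 3: F-score of a clustering output (sketch, see assumptions above)
from collections import Counter


def clustering_f_score(labels_true, labels_pred):
    """Mean per-cluster F-score of a clustering against ground-truth labels.

    Assumption: each found cluster is matched to the ground-truth cluster it
    overlaps most, and the result is the unweighted mean over found clusters.
    """
    if len(labels_true) != len(labels_pred) or len(labels_true) == 0:
        raise ValueError("label arrays must be non-empty and of equal length")

    # overlap[(c, t)] = number of points with found label c and true label t
    overlap = Counter(zip(labels_pred, labels_true))
    pred_sizes = Counter(labels_pred)  # |C_i|
    true_sizes = Counter(labels_true)  # |T_j|

    scores = []
    for c, size_c in pred_sizes.items():
        # Ground-truth cluster with the largest overlap with found cluster c
        best_t, n_ct = max(
            ((t, cnt) for (ci, t), cnt in overlap.items() if ci == c),
            key=lambda pair: pair[1],
        )
        # 2 * prec * recall / (prec + recall) simplifies to 2n / (|C| + |T|)
        scores.append(2 * n_ct / (size_c + true_sizes[best_t]))
    return sum(scores) / len(scores)


# Check against Problem 1: expand the n_ij table into two label arrays.
table = {
    "C1": {"T1": 5, "T2": 45, "T3": 35, "T4": 5},
    "C2": {"T1": 85, "T2": 4, "T3": 5, "T4": 10},
    "C3": {"T1": 10, "T2": 1, "T3": 10, "T4": 55},
}
labels_pred, labels_true = [], []
for c, row in table.items():
    for t, count in row.items():
        labels_pred += [c] * count
        labels_true += [t] * count

print(round(clustering_f_score(labels_true, labels_pred), 3))
```

If a different convention is expected (for example, weighting each cluster's F-score by its size), only the final averaging line would need to change.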