# Homework 6: CSCI 347: Data Mining
Name: **Sam Behrens**
Show your work. Include any code snippets you used to generate an answer, using comments in the code to clearly indicate which problem corresponds to which code.
1. [13 points] Suppose you are developing a clustering algorithm, and you have a data set with a ground-truth clustering available. Your ground-truth clustering contains 4 clusters $T_1,...,T_4$, but your clustering algorithm found only 3 clusters $C_1,...,C_3$. The intersection $n_{ij}$ between each of the clusters $C_i$ found by your clustering algorithm and the ground-truth clusters $T_j$ is represented by the table below, where the entry in the $i$th row and $j$th column is $n_{ij}$:
| | T1 | T2 | T3 | T4 |
| --- | --- | --- | --- | --- |
| C1 | 5 | 45 | 35 | 5 |
| C2 | 85 | 4 | 5 | 10 |
| C3 | 10 | 1 | 10 | 55 |
A. [3 points] Compute the precision of cluster C2.
```
precision(C2) = max_j(n_2j) / |C2| = 85/104 = 0.817
```
B. [4 points] Compute the precision of the entire clustering.
```
precision = (45 + 85 + 55) / 270 = 185/270 = 0.685
```
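As a quick check on parts A and B, here is a minimal Python sketch that recomputes both precisions from the table. The names `n`, `cluster_precision`, and `overall_precision` are illustrative and not part of the assignment; it assumes precision is the majority count of a cluster divided by the cluster size.

```python
# Contingency table n[i][j]: rows are found clusters C1..C3,
# columns are ground-truth clusters T1..T4.
n = [
    [5, 45, 35, 5],
    [85, 4, 5, 10],
    [10, 1, 10, 55],
]

def cluster_precision(row):
    """Fraction of a cluster's points that fall in its majority ground-truth cluster."""
    return max(row) / sum(row)

total = sum(sum(row) for row in n)  # 270 points overall

# Precision of the whole clustering: sum of each row's maximum over the total count.
overall_precision = sum(max(row) for row in n) / total

print(round(cluster_precision(n[1]), 3))  # precision of C2 -> 0.817
print(round(overall_precision, 3))        # -> 0.685
```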
C. [3 points] Compute the recall of cluster C1.
```
recall(C1) = n_12 / |T2| = 45 / (45 + 4 + 1) = 45/50 = 0.9
```
D. [3 points] Compute the F-score of cluster C2.
```
recall(C2) = 85/100 = 0.85
precision(C2) = 85/104 = 0.817
F(C2) = 2 * (0.85 * 0.817) / (0.85 + 0.817) = 1.389 / 1.667 = 0.833
```
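The recall and F-score in parts C and D can be checked the same way. This is a sketch under the assumption that a cluster's recall is its majority count divided by the size of the matching ground-truth cluster, and that the F-score is the harmonic mean of precision and recall; the helper names are hypothetical.

```python
# Contingency table: rows C1..C3, columns T1..T4.
n = [
    [5, 45, 35, 5],
    [85, 4, 5, 10],
    [10, 1, 10, 55],
]

def precision(i):
    """Majority count of cluster i divided by the size of cluster i."""
    return max(n[i]) / sum(n[i])

def recall(i):
    """Majority count of cluster i divided by the size of its majority ground-truth cluster."""
    j = n[i].index(max(n[i]))              # column of the majority ground-truth cluster
    truth_size = sum(row[j] for row in n)  # |T_j|, summed down the column
    return n[i][j] / truth_size

def f_score(i):
    """Harmonic mean of precision and recall for cluster i."""
    p, r = precision(i), recall(i)
    return 2 * p * r / (p + r)

print(recall(0))             # C1: 45/50 -> 0.9
print(round(f_score(1), 3))  # C2 -> 0.833
```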
2. [7 points] Consider the following data that shows transactions of items purchased in a supermarket:
| Transaction ID | Items |
|:--------------:|:-----------------------------------------------------------------------------:|
| 1 | Toilet paper (9), beans (2), rice (8), milk (5), baby wipes (1) , diapers (4) |
| 2 | Oat milk (6), beans (2), toilet paper (9), orange juice (7) |
| 3 | Oat milk (6), milk (5), orange juice (7), toilet paper (9) |
| 4 | Beans (2), toilet paper (9), baby wipes (1), diapers (4) |
| 5 | Toilet paper (9), butter (3), baby wipes (1), diapers (4) |
| 6 | Milk (5), toilet paper (9), orange juice (7) |
| 7 | Milk (5), rice (8), toilet paper (9) |
| 8 | Beans (2), milk (5), rice (8), toilet paper (9) |
| 9 | Milk (5), butter (3) , diapers (4) |
| 10 | Beans (2), rice (8), toilet paper (9), baby wipes (1) |
A. [2 points] What is the support of the itemset {milk, toilet paper}?
support of {milk, toilet paper} = **5** (transactions 1, 3, 6, 7, and 8; transaction 2 contains oat milk, not milk)
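A small Python sketch for counting support, using the item IDs shown in parentheses in the transaction table. The `transactions` list and the `support` helper are illustrative names, and support is taken here as an absolute count, which matches the threshold used in part B.

```python
# Transactions 1..10 from the table, written as sets of item IDs
# (1 = baby wipes, 2 = beans, 3 = butter, 4 = diapers, 5 = milk,
#  6 = oat milk, 7 = orange juice, 8 = rice, 9 = toilet paper).
transactions = [
    {9, 2, 8, 5, 1, 4},
    {6, 2, 9, 7},
    {6, 5, 7, 9},
    {2, 9, 1, 4},
    {9, 3, 1, 4},
    {5, 9, 7},
    {5, 8, 9},
    {2, 5, 8, 9},
    {5, 3, 4},
    {2, 8, 9, 1},
]

def support(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

print(support({5, 9}))  # {milk, toilet paper} -> 5
```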
B. [3 points] If we use a minimum support threshold of 4, what are the itemsets of size 2 that are frequent?
| Itemset (item IDs) | Support |
| -------- | ------- |
| 1 | 4 |
| 2 | 5 |
| 3 | 2 |
| 4 | 4 |
| 5 | 6 |
| 6 | 2 |
| 7 | 3 |
| 8 | 4 |
| 9 | 9 |
| 1, 2 | 3 |
| 1, 4 | 3 |
| 1, 5 | 1 |
| 1, 8 | 2 |
| **1, 9** | **4** |
| 2, 4 | 2 |
| 2, 5 | 2 |
| 2, 8 | 3 |
| **2, 9** | **5** |
| 4, 5 | 2 |
| 4, 8 | 1 |
| 4, 9 | 3 |
| 5, 8 | 3 |
| **5, 9** | **5** |
| **8, 9** | **4** |
```
1, 9 (baby wipes and toilet paper)
2, 9 (beans and toilet paper)
5, 9 (milk and toilet paper)
8, 9 (rice and toilet paper)
```
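The table above can be reproduced by brute force. The sketch below enumerates every pair of item IDs and keeps those meeting the threshold; it is a plain exhaustive count rather than an Apriori implementation, and the names `MIN_SUPPORT`, `pair_support`, and `frequent_pairs` are assumptions made for illustration.

```python
from itertools import combinations

# Transactions 1..10 as sets of item IDs from the table.
transactions = [
    {9, 2, 8, 5, 1, 4},
    {6, 2, 9, 7},
    {6, 5, 7, 9},
    {2, 9, 1, 4},
    {9, 3, 1, 4},
    {5, 9, 7},
    {5, 8, 9},
    {2, 5, 8, 9},
    {5, 3, 4},
    {2, 8, 9, 1},
]
MIN_SUPPORT = 4

def support(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

# All distinct item IDs, then the support of every 2-item combination.
items = sorted(set().union(*transactions))
pair_support = {pair: support(pair) for pair in combinations(items, 2)}

# Keep only the pairs that meet the minimum support threshold.
frequent_pairs = {p: s for p, s in pair_support.items() if s >= MIN_SUPPORT}

print(frequent_pairs)  # {(1, 9): 4, (2, 9): 5, (5, 9): 5, (8, 9): 4}
```

Because every frequent pair must be built from frequent single items, the table only needs to list pairs drawn from items 1, 2, 4, 5, 8, and 9; the exhaustive count confirms that no pair involving items 3, 6, or 7 reaches the threshold.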
C. [2 points]
a. [1 point] What is the confidence of the rule {rice} -> {beans}?
```
{8} -> {2}
support({2, 8}) = 3
support({8}) = 4
confidence = 3/4 = 0.75
```
b. [1 point] What is the confidence of the rule {beans} -> {rice}?
```
{2} -> {8}
support({2, 8}) = 3
support({2}) = 5
confidence = 3/5 = 0.6
```
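Both confidences in part C follow from the same two support counts, which a short Python sketch makes explicit. The `confidence` helper is a hypothetical name, and it assumes the usual definition conf(X -> Y) = support(X ∪ Y) / support(X).

```python
# Transactions 1..10 as sets of item IDs from the table (2 = beans, 8 = rice).
transactions = [
    {9, 2, 8, 5, 1, 4},
    {6, 2, 9, 7},
    {6, 5, 7, 9},
    {2, 9, 1, 4},
    {9, 3, 1, 4},
    {5, 9, 7},
    {5, 8, 9},
    {2, 5, 8, 9},
    {5, 3, 4},
    {2, 8, 9, 1},
]

def support(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def confidence(antecedent, consequent):
    """conf(X -> Y) = support(X union Y) / support(X)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(confidence({8}, {2}))  # {rice} -> {beans}: 0.75
print(confidence({2}, {8}))  # {beans} -> {rice}: 0.6
```

The two rules share a numerator but not a denominator, which is why confidence is not symmetric.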
3. [EXTRA CREDIT: 6 points] Write a function in Python that computes the F-score of a clustering output, with 2 input parameters: an array containing the labels of each data point in the ground-truth clustering, and an array containing the labels of each data point in the clustering output. You may use the contingency_matrix function in scikit-learn.
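One possible sketch of such a function is below. It assumes the F-score of a clustering is the mean of the per-cluster F-scores, where each found cluster is matched to its majority ground-truth cluster; if the course defines it differently, the last line would need to change. To stay dependency-free it builds the contingency counts with `collections.Counter` instead of scikit-learn's `contingency_matrix`, which could be substituted for that step. The function and variable names are illustrative.

```python
from collections import Counter

def clustering_f_score(true_labels, pred_labels):
    """Mean of per-cluster F-scores (assumed definition of the clustering F-score)."""
    # Contingency counts: counts[(c, t)] = points in found cluster c with true label t.
    counts = Counter(zip(pred_labels, true_labels))
    cluster_sizes = Counter(pred_labels)
    truth_sizes = Counter(true_labels)

    f_scores = []
    for c, size in cluster_sizes.items():
        # Majority ground-truth label inside cluster c and its count.
        t_best, n_best = max(
            ((t, k) for (ci, t), k in counts.items() if ci == c),
            key=lambda pair: pair[1],
        )
        precision = n_best / size
        recall = n_best / truth_sizes[t_best]
        f_scores.append(2 * precision * recall / (precision + recall))
    return sum(f_scores) / len(f_scores)

# A clustering that matches the ground truth up to relabeling scores 1.0.
print(clustering_f_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```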