# CS 3120 Machine Learning HW3
Devon DeJohn, Spring 2020
## Imports
```python
import cv2
import pathlib
import random
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report as crep
from itertools import product
from tabulate import tabulate
```
## Train, Test, Validate
My default model's parameters differ slightly from `sklearn`'s defaults, as set in the initial call to `super().__init__()`:
```python
DEFAULT = {
    "n_neighbors": 3,
    "metric": "manhattan",
    "weights": "distance",
    "n_jobs": -1,
}


class Model(KNeighborsClassifier):
    """A container for a KNN classifier."""

    def __init__(self, path: str):
        super().__init__(**DEFAULT)
        self.dims = (16, 16)
        self.labels = {}
        self.path = path
        self.parts = ["train", "test", "validate"]
        self.load_data()
        self.fit(self.train.X, self.train.Y)
    # end
```
I also implemented my own version of `train_test_split` that supports an arbitrary number of partitions rather than just two, so the data can be split into train, test, and validation sets in a single call:
```
Partition 0: 'train'
    size: 2100 / 3000
    pcnt: 70.0 %
Partition 1: 'test'
    size: 600 / 3000
    pcnt: 20.0 %
Partition 2: 'validate'
    size: 300 / 3000
    pcnt: 10.0 %
```
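A multi-partition split like the one above can be sketched as follows. The function name `multi_split` and its dict-of-fractions signature are my own assumptions; the original implementation may differ.

```python
import random


def multi_split(data, labels, parts):
    """Split (data, labels) into named partitions, e.g.
    parts={"train": 0.7, "test": 0.2, "validate": 0.1}."""
    pairs = list(zip(data, labels))
    random.shuffle(pairs)
    out, start = {}, 0
    for name, frac in parts.items():
        # round() keeps partition sizes as close to the requested
        # fractions as possible for any dataset size
        stop = start + round(frac * len(pairs))
        chunk = pairs[start:stop]
        out[name] = ([x for x, _ in chunk], [y for _, y in chunk])
        start = stop
    return out
```

With 3000 samples and fractions 0.7 / 0.2 / 0.1, this yields partitions of 2100, 600, and 300, matching the output above.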
## K=3, using the $\ell_1$-norm
For `n_neighbors=3`, using the `manhattan` distance metric, or the $\ell_1$-norm, we have:
```
              precision    recall  f1-score   support

         cat       0.49      0.44      0.46       202
         dog       0.47      0.57      0.52       223
       panda       0.73      0.63      0.68       175

    accuracy                           0.54       600
   macro avg       0.56      0.54      0.55       600
weighted avg       0.55      0.54      0.54       600
```
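The per-class numbers above come from `sklearn`'s `classification_report`; precision and recall themselves reduce to simple counts, which the following standalone sketch illustrates (the helper name is mine, not part of the assignment code):

```python
def precision_recall(y_true, y_pred, label):
    """Per-class precision and recall from true/false positive counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```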
## Retraining
I added the ability to retrain the model directly through the `Model` type by making `sklearn`'s `KNeighborsClassifier` a superclass of `Model`, so the classifier's attributes and methods are accessible on `Model` instances.
## Performance
Taking the Cartesian product of these parameters, we can retrain the model on every combination and measure the mean accuracy:
```python
dims = [8, 16, 32, 64]
neighbors = [3, 5, 7, 9]
metrics = ["manhattan", "euclidean"]
weights = ["uniform", "distance"]
```
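Enumerating the grid is a direct application of `itertools.product`, which the `Imports` section already pulls in. The dict keys below mirror the `DEFAULT` parameters plus an image-dimension entry; the exact structure passed to the retraining call is my assumption.

```python
from itertools import product

dims = [8, 16, 32, 64]
neighbors = [3, 5, 7, 9]
metrics = ["manhattan", "euclidean"]
weights = ["uniform", "distance"]

# every combination of the four parameter lists: 4 * 4 * 2 * 2 = 64 runs
grid = [
    {"n_neighbors": k, "metric": m, "weights": w, "dims": (d, d)}
    for d, k, m, w in product(dims, neighbors, metrics, weights)
]
```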
Each time `Model.retrain()` is called, the seed for the `random` module is reset, so the test partition is identical across runs.
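A minimal illustration of why reseeding makes the split reproducible (the helper name and the seed value `1234` are assumptions, not the assignment's actual values):

```python
import random


def sample_test_indices(n, k, seed=1234):
    # resetting the seed before each split keeps the held-out
    # test indices identical from one retraining run to the next
    random.seed(seed)
    return random.sample(range(n), k)
```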
Each subplot below corresponds to a distance metric (row) and a weighting scheme (column). The horizontal axis of each subplot gives the pixel dimensions of the rescaled image data, and the vertical axis gives the number of neighbors used in voting.

As seen from the tabular data below, we achieve the highest accuracy with `9` neighbors, using the inverse of the `distance` as a vote weight, measured using the $\ell_1$-norm, for a `16 x 16`-pixel image.
In general, $k$-nearest neighbors is a poor choice for naive image classification. A potential improvement would be a rudimentary face-detection step that isolates the animal's head, crops the image to that region, and only then resizes.
Even then, the variation in the image data is highly dependent on lighting, camera angle, and many other factors which can't be accounted for simply by analyzing the pixel intensities themselves.
| k | metric | weights | dims | accuracy |
|:-----------:|:---------:|:---------:|:--------:|:----------:|
| 5 | euclidean | uniform | (64, 64) | 0.4567 |
| 7 | euclidean | uniform | (64, 64) | 0.47 |
| 3 | euclidean | uniform | (64, 64) | 0.4717 |
| 3 | euclidean | distance | (64, 64) | 0.4717 |
| 5 | euclidean | distance | (64, 64) | 0.4717 |
| 9 | euclidean | uniform | (64, 64) | 0.4767 |
| 3 | euclidean | uniform | (32, 32) | 0.4817 |
| 7 | euclidean | distance | (64, 64) | 0.4817 |
| 3 | euclidean | distance | (32, 32) | 0.4833 |
| 5 | euclidean | uniform | (32, 32) | 0.485 |
| 9 | euclidean | distance | (64, 64) | 0.485 |
| 7 | euclidean | uniform | (32, 32) | 0.4933 |
| 5 | euclidean | distance | (32, 32) | 0.5033 |
| 9 | euclidean | uniform | (32, 32) | 0.5033 |
| 3 | manhattan | uniform | (64, 64) | 0.51 |
| 5 | manhattan | uniform | (64, 64) | 0.5117 |
| 9 | euclidean | distance | (32, 32) | 0.5133 |
| 5 | euclidean | uniform | (16, 16) | 0.515 |
| 9 | manhattan | uniform | (64, 64) | 0.515 |
| 7 | euclidean | distance | (32, 32) | 0.5183 |
| 3 | manhattan | distance | (64, 64) | 0.5183 |
| 3 | euclidean | uniform | (16, 16) | 0.52 |
| 5 | manhattan | uniform | (32, 32) | 0.5217 |
| 3 | euclidean | distance | (16, 16) | 0.525 |
| 5 | manhattan | uniform | (16, 16) | 0.525 |
| 3 | manhattan | uniform | (8, 8) | 0.53 |
| 5 | euclidean | distance | (16, 16) | 0.5333 |
| 7 | euclidean | uniform | (16, 16) | 0.5333 |
| 5 | manhattan | distance | (16, 16) | 0.535 |
| 9 | manhattan | uniform | (32, 32) | 0.535 |
| 7 | manhattan | uniform | (64, 64) | 0.535 |
| 9 | manhattan | distance | (64, 64) | 0.535 |
| 3 | manhattan | uniform | (32, 32) | 0.5367 |
| 5 | manhattan | distance | (64, 64) | 0.5367 |
| 3 | manhattan | uniform | (16, 16) | 0.5383 |
| 7 | manhattan | uniform | (32, 32) | 0.5383 |
| 3 | euclidean | uniform | (8, 8) | 0.54 |
| 3 | manhattan | distance | (32, 32) | 0.54 |
| 3 | manhattan | distance | (8, 8) | 0.5417 |
| 9 | euclidean | uniform | (16, 16) | 0.5433 |
| 5 | manhattan | distance | (32, 32) | 0.5433 |
| 7 | euclidean | distance | (16, 16) | 0.545 |
| 7 | euclidean | uniform | (8, 8) | 0.5483 |
| 9 | manhattan | distance | (32, 32) | 0.5483 |
| 7 | manhattan | distance | (64, 64) | 0.5483 |
| 9 | euclidean | distance | (16, 16) | 0.55 |
| 3 | manhattan | distance | (16, 16) | 0.5517 |
| 7 | manhattan | distance | (32, 32) | 0.5533 |
| 7 | euclidean | distance | (8, 8) | 0.555 |
| 3 | euclidean | distance | (8, 8) | 0.5567 |
| 7 | manhattan | uniform | (16, 16) | 0.5567 |
| 5 | euclidean | uniform | (8, 8) | 0.56 |
| 5 | euclidean | distance | (8, 8) | 0.5633 |
| 9 | euclidean | uniform | (8, 8) | 0.5633 |
| 9 | manhattan | uniform | (16, 16) | 0.565 |
| 7 | manhattan | uniform | (8, 8) | 0.5683 |
| 7 | manhattan | distance | (16, 16) | 0.5683 |
| 5 | manhattan | uniform | (8, 8) | 0.57 |
| 5 | manhattan | distance | (8, 8) | 0.57 |
| 9 | manhattan | uniform | (8, 8) | 0.5733 |
| 9 | euclidean | distance | (8, 8) | 0.5767 |
| 9 | manhattan | distance | (8, 8) | 0.5783 |
| 7 | manhattan | distance | (8, 8) | 0.5817 |
| 9 | manhattan | distance | (16, 16) | 0.585 |
## Validate
After selecting the best model and evaluating it on the held-out validation partition, we arrive at:
```
              precision    recall  f1-score   support

         cat       0.54      0.57      0.56       108
         dog       0.44      0.53      0.48        90
       panda       0.83      0.62      0.71       102

    accuracy                           0.58       300
   macro avg       0.60      0.58      0.58       300
weighted avg       0.61      0.58      0.59       300
```