# CS 3120 Machine Learning HW3
Devon DeJohn, Spring 2020
## Imports
```python
import cv2
import pathlib
import random
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report as crep
from itertools import product
from tabulate import tabulate
```
## Train, Test, Validate
My default model's parameters differ slightly from `sklearn`'s defaults, as set in the initial call to `super().__init__()`:
```python
DEFAULT = {
    "n_neighbors": 3,
    "metric": "manhattan",
    "weights": "distance",
    "n_jobs": -1,
}


class Model(KNeighborsClassifier):
    """A container for a KNN classifier."""

    def __init__(self, path: str):
        super().__init__(**DEFAULT)
        self.dims = (16, 16)
        self.labels = {}
        self.path = path
        self.parts = ["train", "test", "validate"]
        self.load_data()
        self.fit(self.train.X, self.train.Y)
    # end
```
I also implemented my own version of `train_test_split` that supports an arbitrary number of partitions rather than just two, so the data can be split into train, test, and validation sets in a single call:
```
Partition 0: 'train'
    size: 2100 / 3000
    pcnt: 70.0 %
Partition 1: 'test'
    size: 600 / 3000
    pcnt: 20.0 %
Partition 2: 'validate'
    size: 300 / 3000
    pcnt: 10.0 %
```
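A multi-partition split like the one above can be sketched as follows. The function name `multi_split` and its dict-of-fractions signature are my own assumptions; the original implementation may differ.

```python
import random


def multi_split(data, labels, parts):
    """Split (data, labels) into named partitions, e.g.
    parts={"train": 0.7, "test": 0.2, "validate": 0.1}."""
    pairs = list(zip(data, labels))
    random.shuffle(pairs)
    out, start = {}, 0
    for name, frac in parts.items():
        # round() keeps partition sizes as close to the requested
        # fractions as possible for any dataset size
        stop = start + round(frac * len(pairs))
        chunk = pairs[start:stop]
        out[name] = ([x for x, _ in chunk], [y for _, y in chunk])
        start = stop
    return out
```

With 3000 samples and fractions 0.7 / 0.2 / 0.1, this yields partitions of 2100, 600, and 300, matching the output above.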
## K=3, using the $\ell_1$-norm
For `n_neighbors=3`, using the `manhattan` distance metric, or the $\ell_1$-norm, we have:
```
              precision    recall  f1-score   support

         cat       0.49      0.44      0.46       202
         dog       0.47      0.57      0.52       223
       panda       0.73      0.63      0.68       175

    accuracy                           0.54       600
   macro avg       0.56      0.54      0.55       600
weighted avg       0.55      0.54      0.54       600
```
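The per-class numbers above come from `sklearn`'s `classification_report`; precision and recall themselves reduce to simple counts, which the following standalone sketch illustrates (the helper name is mine, not part of the assignment code):

```python
def precision_recall(y_true, y_pred, label):
    """Per-class precision and recall from true/false positive counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```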
## Retraining
I added the ability to retrain the model directly through the `Model` type by making `sklearn`'s `KNeighborsClassifier` a superclass of `Model`, so the classifier's attributes and methods are accessible on `Model` instances.
## Performance
Taking the Cartesian product of these parameters, we can retrain the model on every combination and measure the mean accuracy:
```python
dims = [8, 16, 32, 64]
neighbors = [3, 5, 7, 9]
metrics = ["manhattan", "euclidean"]
weights = ["uniform", "distance"]
```
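Enumerating the grid is a direct application of `itertools.product`, which the `Imports` section already pulls in. The dict keys below mirror the `DEFAULT` parameters plus an image-dimension entry; the exact structure passed to the retraining call is my assumption.

```python
from itertools import product

dims = [8, 16, 32, 64]
neighbors = [3, 5, 7, 9]
metrics = ["manhattan", "euclidean"]
weights = ["uniform", "distance"]

# every combination of the four parameter lists: 4 * 4 * 2 * 2 = 64 runs
grid = [
    {"n_neighbors": k, "metric": m, "weights": w, "dims": (d, d)}
    for d, k, m, w in product(dims, neighbors, metrics, weights)
]
```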
Each time `Model.retrain()` is called, the seed for the `random` module is reset, so the test partition is identical across runs.
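A minimal illustration of why reseeding makes the split reproducible (the helper name and the seed value `1234` are assumptions, not the assignment's actual values):

```python
import random


def sample_test_indices(n, k, seed=1234):
    # resetting the seed before each split keeps the held-out
    # test indices identical from one retraining run to the next
    random.seed(seed)
    return random.sample(range(n), k)
```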
Each subplot below corresponds to a distance metric (row) and a weighting scheme (column). The horizontal axis of each subplot gives the pixel dimensions of the rescaled image data, and the vertical axis gives the number of neighbors used in voting.

As seen from the tabular data below, we achieve the highest accuracy with `9` neighbors, using the inverse of the `distance` as a vote weight, measured using the $\ell_1$-norm, for a `16 x 16`-pixel image.
In general, $k$-nearest neighbors is a poor choice for naive image classification. A potential improvement would be a rudimentary face-detection step that isolates the animal's head, crops the image to that region, and only then resizes.
Even then, the variation in the image data is highly dependent on lighting, camera angle, and many other factors which can't be accounted for simply by analyzing the pixel intensities themselves.
| k | metric | weights | dims | accuracy |
|:-----------:|:---------:|:---------:|:--------:|:----------:|
| 5 | euclidean | uniform | (64, 64) | 0.4567 |
| 7 | euclidean | uniform | (64, 64) | 0.47 |
| 3 | euclidean | uniform | (64, 64) | 0.4717 |
| 3 | euclidean | distance | (64, 64) | 0.4717 |
| 5 | euclidean | distance | (64, 64) | 0.4717 |
| 9 | euclidean | uniform | (64, 64) | 0.4767 |
| 3 | euclidean | uniform | (32, 32) | 0.4817 |
| 7 | euclidean | distance | (64, 64) | 0.4817 |
| 3 | euclidean | distance | (32, 32) | 0.4833 |
| 5 | euclidean | uniform | (32, 32) | 0.485 |
| 9 | euclidean | distance | (64, 64) | 0.485 |
| 7 | euclidean | uniform | (32, 32) | 0.4933 |
| 5 | euclidean | distance | (32, 32) | 0.5033 |
| 9 | euclidean | uniform | (32, 32) | 0.5033 |
| 3 | manhattan | uniform | (64, 64) | 0.51 |
| 5 | manhattan | uniform | (64, 64) | 0.5117 |
| 9 | euclidean | distance | (32, 32) | 0.5133 |
| 5 | euclidean | uniform | (16, 16) | 0.515 |
| 9 | manhattan | uniform | (64, 64) | 0.515 |
| 7 | euclidean | distance | (32, 32) | 0.5183 |
| 3 | manhattan | distance | (64, 64) | 0.5183 |
| 3 | euclidean | uniform | (16, 16) | 0.52 |
| 5 | manhattan | uniform | (32, 32) | 0.5217 |
| 3 | euclidean | distance | (16, 16) | 0.525 |
| 5 | manhattan | uniform | (16, 16) | 0.525 |
| 3 | manhattan | uniform | (8, 8) | 0.53 |
| 5 | euclidean | distance | (16, 16) | 0.5333 |
| 7 | euclidean | uniform | (16, 16) | 0.5333 |
| 5 | manhattan | distance | (16, 16) | 0.535 |
| 9 | manhattan | uniform | (32, 32) | 0.535 |
| 7 | manhattan | uniform | (64, 64) | 0.535 |
| 9 | manhattan | distance | (64, 64) | 0.535 |
| 3 | manhattan | uniform | (32, 32) | 0.5367 |
| 5 | manhattan | distance | (64, 64) | 0.5367 |
| 3 | manhattan | uniform | (16, 16) | 0.5383 |
| 7 | manhattan | uniform | (32, 32) | 0.5383 |
| 3 | euclidean | uniform | (8, 8) | 0.54 |
| 3 | manhattan | distance | (32, 32) | 0.54 |
| 3 | manhattan | distance | (8, 8) | 0.5417 |
| 9 | euclidean | uniform | (16, 16) | 0.5433 |
| 5 | manhattan | distance | (32, 32) | 0.5433 |
| 7 | euclidean | distance | (16, 16) | 0.545 |
| 7 | euclidean | uniform | (8, 8) | 0.5483 |
| 9 | manhattan | distance | (32, 32) | 0.5483 |
| 7 | manhattan | distance | (64, 64) | 0.5483 |
| 9 | euclidean | distance | (16, 16) | 0.55 |
| 3 | manhattan | distance | (16, 16) | 0.5517 |
| 7 | manhattan | distance | (32, 32) | 0.5533 |
| 7 | euclidean | distance | (8, 8) | 0.555 |
| 3 | euclidean | distance | (8, 8) | 0.5567 |
| 7 | manhattan | uniform | (16, 16) | 0.5567 |
| 5 | euclidean | uniform | (8, 8) | 0.56 |
| 5 | euclidean | distance | (8, 8) | 0.5633 |
| 9 | euclidean | uniform | (8, 8) | 0.5633 |
| 9 | manhattan | uniform | (16, 16) | 0.565 |
| 7 | manhattan | uniform | (8, 8) | 0.5683 |
| 7 | manhattan | distance | (16, 16) | 0.5683 |
| 5 | manhattan | uniform | (8, 8) | 0.57 |
| 5 | manhattan | distance | (8, 8) | 0.57 |
| 9 | manhattan | uniform | (8, 8) | 0.5733 |
| 9 | euclidean | distance | (8, 8) | 0.5767 |
| 9 | manhattan | distance | (8, 8) | 0.5783 |
| 7 | manhattan | distance | (8, 8) | 0.5817 |
| 9 | manhattan | distance | (16, 16) | 0.585 |
## Validate
After selecting the best model and evaluating it on the held-out validation partition, we arrive at:
```
              precision    recall  f1-score   support

         cat       0.54      0.57      0.56       108
         dog       0.44      0.53      0.48        90
       panda       0.83      0.62      0.71       102

    accuracy                           0.58       300
   macro avg       0.60      0.58      0.58       300
weighted avg       0.61      0.58      0.59       300
```