# Project 1 and Machine Learning
## Table of Contents
1. [Machine Learning](#machine)
2. [Project 1](#project)
3. [Thinking More about Memory](#memory)
### There's a new drill out today!
It's due Friday. Remember, these are designed to be extremely short (this one has only 2 questions) and to help me notice when students have questions that I might not otherwise think to prepare for.
### Tim's Hours on Monday
Tim can start at 2:30 as normal, but has to stop at 3 for an unexpected appointment. To help make up the missing time as effectively as possible, there's a [post on Ed](https://edstem.org/us/courses/13110/discussion/633679) asking if anyone is unable to make the usual hours but would like to meet.
## Building Classifiers <a name="machine"></a>
Suppose we have the job of taking in information about an animal's _size_ and _fuzziness_ and deciding whether it is a cat or a dog. We could certainly write a Python function that does that; we might even have a guess about how to write it. Something like:
```python=
def classify(size: int, fuzziness: int) -> str:
    # Classify an animal as a dog or a cat from two measurements.
    if size > 10:
        return 'dog'
    if size > 8 and fuzziness > 10:
        return 'dog'
    else:
        return 'cat'
```
This program would "work" but hopefully you have some questions about it.
<details>
<summary><B>Think, then click!</B></summary>
Among other questions (what are the units? why only size and fuzziness?) you might ask:
Where did the 8 and the 10 come from---"expert" knowledge, or somewhere else? If our knowledge changes, how should we decide on new numbers?
</details>
<br/>
### Machine Learning
Machine learning is a field concerned with (among other things) finding patterns in data and extrapolating from them. Its ideas are well suited to our classification task, but we'll have to do a bit of work to make them applicable.
First, let's collect a large set of size and fuzziness data that's already been classified. Since we know (in theory, anyway) whether each datapoint is a dog or a cat, we can use this data to _train_ a classifier using machine learning. Let's visualize this by drawing a 2-dimensional graph, where the Y axis corresponds to size, and the X axis to fuzziness:

I've put two example animals on the chart: Boatswain the dog, and Charli the cat. Here's a picture of Charli helping me prepare to teach.
<center>
<img src="https://i.imgur.com/md4oTcJ.jpg" alt="Charli Cat" width="300"/>
</center>
We'll represent dogs as blue circles and cats as orange squares. As Boatswain is much larger than Charli, and a tiny bit more fuzzy, he appears above and slightly to the right of Charli on the graph.
Let's add all the other dogs and cats from our training set. Then, a very basic machine learning algorithm might use the existing data to sketch a border line: on one side, it guesses "dogs", and on the other, "cats".

Some approaches build a more complicated border, which might be able to do better than the straight line I've drawn here. But this is a decent high-level sketch of how we might think about building a classifier.
Note that error is always a possibility: here we've misclassified one dog as a cat, and one cat as a dog. There are also a couple of animals that are quite close to the line, and so might easily see their classification change if the training data are updated.
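To make "sketch a border line" a bit more concrete: a straight-line border in two dimensions amounts to a weighted-sum threshold. Here's a minimal sketch, where the weights and threshold are invented for illustration (a real classifier would learn them from the training data):
```python
def classify_learned(size: float, fuzziness: float) -> str:
    # Hypothetical weights and threshold, made up for illustration.
    score = 0.8 * size + 0.3 * fuzziness   # weighted sum of the two features
    return 'dog' if score > 9.0 else 'cat' # which side of the line are we on?
```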
#### Something I wonder
This is just a toy example, but already you might see some industrial applications and potential problems that could arise from errors. What do you see?
<details>
<summary><B>Think, then click!</B></summary>
Any veterinarian would tell you that some foods or medicines that are good for a cat might hurt a dog (or the other way around). So even in a toy example like this, there could be real world consequences to errors in a classifier.
We might build a much better classifier by adding more dimensions of information, but even then, a machine is capable of error just like a human is.
Classification is tough.
</details>
### Classifying Song Genre
Let's move on to something a bit more complex, and also more related to your first project. We're going to try to figure out what genre of music (pop, hip hop, rock, blues, country, metal, ...) a song belongs to, just based on its lyrics. Here are a few example lines from a chorus:
```
And here I go again, I'm drinkin' one, I'm drinkin' two
I got my heartache medication, a strong dedication
To gettin' over you, turnin' me loose
On that hardwood jukebox lost in neon time...
```
Do you have any guesses about what genre this song is from? If so, why?
<details>
<summary><B>Think, then click!</B></summary>
It is, in fact, a country song: "Heartache Medication" by Jon Pardi (2019).
</details>
If we believe there's a correlation between the words in a song and its genre, it might make sense to use _word counts_ to train a classifier. And that's what you'll be doing in Project 1.
How might this work? Let's consider a shorter song, which might be sung at our mascot's birthday:
```
Happy Birthday to You
Happy Birthday to You
Happy Birthday Dear Bruno
Happy Birthday to You.
```
If we ran our word-counting program on this song, we'd get (after removing punctuation!):
```python
{'Happy': 4,
 'Birthday': 4,
 'to': 3,
 'Dear': 1,
 'Bruno': 1,
 'You': 3}
```
We could get similar word count information for every song in our training set. We'll call this the _term frequency_ table for a given song.
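If it helps to see that in code, here's one minimal way to build such a table. (This is just a sketch; the project stencil may tokenize lyrics differently.)
```python
import string
from collections import Counter

def term_frequency(lyrics: str) -> dict[str, int]:
    # Remove punctuation, then count how many times each word appears.
    cleaned = lyrics.translate(str.maketrans('', '', string.punctuation))
    return dict(Counter(cleaned.split()))
```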
From there, we can go on to imagine (although I won't attempt to draw it!) a graph with hundreds or thousands of dimensions, where each dimension corresponds to a word. Then, a song could be placed in that graph according to how many times each word appeared in it.
So far, so good, but:
* Aren't some words more important than others?
* How would we measure distance in this mind-boggling graph?
* How do we actually go about drawing that line between genres?
We'll answer the latter two questions in the project handout. For now, let's focus on that first question; it's vital. The country song we looked at a few minutes ago contains quite a few instances of the word "I". For our classification purposes, do we really care as much about "I" as we do about "drinkin'" or "heartache"?
Probably not, which is where the next idea comes in. We'd like to value differences in some word counts more than others, and to do this we'll use an idea called _inverse document frequency_, which essentially means finding out how common a word is across the entire training set. Words like "a" and "I" will be common; words like "drinkin'" will be less so.
Putting these two ideas together gives us a metric called _TF-IDF_, short for the combination of term frequency with inverse document frequency. More details about this in the project; for now, just think about it as a sort of multiplier we'll apply to the term frequency value that makes very common words less important. Here's what Happy Birthday's frequency table looks like with an IDF factor applied:
```python
{'Happy': 4,
 'Birthday': 8,
 'to': 0,
 'Dear': 2,
 'Bruno': 8,
 'You': 1.5}
```
Hardly any songs use "Birthday" or "Bruno", but many use "to".
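As a rough sketch of how such numbers might be computed: the logarithmic formula below is one common choice, though the project handout defines the exact version you should use.
```python
import math

def idf(word: str, corpus: list[dict[str, int]]) -> float:
    # How rare is this word across the training set?
    containing = sum(1 for tf_table in corpus if word in tf_table)
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(tf_table: dict[str, int], corpus: list[dict[str, int]]) -> dict[str, float]:
    # Scale each raw count by the word's rarity: common words shrink
    # toward zero, and rare words get boosted.
    return {word: count * idf(word, corpus) for word, count in tf_table.items()}
```
With this particular formula, a word that appears in every single song in the training set gets a weight of 0, much like "to" in the table above.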
### Discussion: Let's think about classifiers
We saw one possible issue above: errors in classification can cause real consequences. But, it might be that an error about song genre is less dangerous than an error about animal species.
That was about the _impact_ of errors, though. Since machine learning can be so dependent on training data, let's think a bit about ways that a machine-learning algorithm might be led astray, or might lead us astray.
This is a really broad question, but it needs to be!
<details>
<summary><B>Think, then click!</B></summary>
A short list of factors might include:
- issues with the features chosen (to discover a relationship between tooth shape and animal species, you need to realize that tooth shape is a thing you should be considering);
- sampling issues with the data itself (too small a training set, bias in the population sampled, external causes for patterns discovered, ...);
- cognitive biases in setting up and analyzing the problem (confirmation, availability, and many others; feedback loops can even form where statistical or ML results reinforce a bias that led to the results).
</details>
<br/>
Whenever you're learning a new technology, it's useful to think a bit about potential threats, biases, and other factors that might influence how you use it. You'll be doing some reading on this concurrently with the project, but I'm glad we got to _start_ the discussion here in class.
I want to tell you a story that might make one of the points above more real to you. During World War 2, the U.S. military was very interested in ways they could reduce the chance of aircraft loss. You can't afford to armor an airplane as much as you might like to: the more it weighs, the harder it is to get off the ground, the less agile it is, and the more fuel it needs. The U.S. Army and Navy thought that they might be able to look at their planes after a battle and add armor to the places where they saw bullet holes.
[Abraham Wald](https://en.wikipedia.org/wiki/Abraham_Wald) was a mathematician who was brought in to work on the project.
Wald was originally from Austria, but was unable to get a University position because of anti-Semitic discrimination at the time. When Germany annexed Austria, he fled and came to the United States.
Wald noticed a problem with the military's reasoning: something called _survivorship bias_, which is a kind of sampling bias. Any airplane they could examine for damage after a battle had, by definition, avoided the fate that they wanted to prevent. The others had already crashed, exploded, or were otherwise unavailable for a careful examination of damage.
Wald found that, rather than armoring the places where surviving planes showed damage, other places (like the fuel supply) needed to be given much more consideration. It's estimated that Wald's statistical work saved hundreds of lives.
## Project 1 <a name="project"></a>
We talked a bit about Project 1; see the project handout for a more detailed presentation. The summary is that we're asking you to build a classifier for song genre based on a corpus of training data we'll provide.
You'll use TF-IDF for the classifier, and just return the genre of the nearest neighbor in the training data. There are many ways that you might define "nearest"; we've given you code for a metric called cosine similarity and suggest you use that.
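To give a flavor of what "nearest neighbor by cosine similarity" means, here's a minimal sketch over TF-IDF vectors stored as dictionaries. (For the project itself, use the cosine-similarity code we provide and the definitions in the handout.)
```python
import math

def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    # Dot product divided by the product of the vector lengths;
    # 1.0 means the two vectors point in exactly the same direction.
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify_genre(song: dict[str, float],
                   training: list[tuple[dict[str, float], str]]) -> str:
    # Return the genre of the most similar song in the training set.
    _, best_genre = max(training, key=lambda pair: cosine_similarity(song, pair[0]))
    return best_genre
```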
## Thinking More about Memory <a name="memory"></a>
(If time permits!) Let's talk a bit more about what happens in memory when you create container objects like lists, sets, and dictionaries.
Here's an example snapshot of part of the state of a Python program. We've loaded a bunch of song lyrics into memory, and in this case we've chosen to store them all in a dictionary (`songs`) indexed by the name of each song.
This snapshot was taken right after executing the line:
```python=
happy_words = songs['Birthday']
```
Notice what's happened. The program knows the name `happy_words` now, and the name `happy_words` refers to a list in memory that's storing the lyrics to "Happy Birthday". In fact, it's the same list that the dictionary was (and still is) storing as the value for the `'Birthday'` key.

Sometimes, this is exactly the situation we want. But sometimes it isn't. What could go wrong here?
(Note that this issue, in another form, came up for some of you on the homework.)
<details>
<summary><B>Think, then click!</B></summary>
Both `happy_words` and `songs['Birthday']` point to the same list in memory. When we ran `happy_words = songs['Birthday']`, we didn't copy the _data_, we copied a reference to the list.
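Here's a tiny demonstration of the aliasing, using a made-up `songs` dictionary:
```python=
songs = {'Birthday': ['Happy', 'Birthday', 'to', 'You']}
happy_words = songs['Birthday']
happy_words.append('!')     # modify the list via one name...
print(songs['Birthday'])    # ...and the other name sees the change too
```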
This sort of thing is often normal, but can lead to errors when you modify a list and don't expect that modification to be reflected elsewhere. If you want to be absolutely safe, you can create what's called a _defensive copy_:
```python=
happy_words = list(songs['Birthday'])
```
which creates an entirely new list with the same data in it.
If you're worried about this kind of thing happening, you can test it using the `id` function:
```python=
print(id(songs['Birthday']))
print(id(happy_words))  # matching ids mean both names refer to the same object
```
When given an object, the `id` function returns an identifier for that object. The identifier is guaranteed to be unique and constant for the duration of the object's lifetime. (Of course, this means that two objects might be given the same identifier, so long as they never exist at the same time.)
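For instance, continuing the made-up `songs` example from above:
```python=
original = songs['Birthday']
copied = list(songs['Birthday'])
print(id(original) == id(songs['Birthday']))  # True: same object (aliasing)
print(id(copied) == id(songs['Birthday']))    # False: the copy is a new list
```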
</details>
<br/>