# Working with Files

## Table of Contents

0. [Logistics](#logistics)
1. [Homework 1](#hw1)
2. [Working With Files](#files)
3. [Word Counts](#counts)
4. [Wondering](#wondering)

## Logistical Comments <a name="logistics"></a>

* We introduced TAs today!
* Homework 0 is due today! This homework is only asking for a little information about you, and should be extremely quick.
* Homework 1 (Python Practice) is due on Tuesday. This homework will award full credit for participation! Dive in, practice and review your Python, and don't worry about how you do. We made this change based on your answers to the "What do you need / What scares you?" exercise.
* There will be a Python Review Session at 10am EDT on Sunday, September 12 over Zoom. See the Homework 1 page for details.
* We delayed the start of drills due to Wednesday's lecture being optional; more info is coming on these soon!
* TA hours are starting, and Tim's hours are on Monday afternoon.
* Labs start next week. We've currently set up 4 sections in person and 1 section over Zoom.

## About Homework 1 <a name="hw1"></a>

Today's lecture is structured (in part) to be a helpful example of the sort of code you might write for Homework 1.

## Working with files <a name="files"></a>

In CSCI 0111, you learned how to work with data organized in several structures: tables, lists, trees, and hashtables. You also might have seen data loaded from Google Sheets, or CSV (comma-separated value) files.

In this class, we’ll see how to load data from a couple of other sources: text files and web pages. These sources are incredibly common, but not as convenient to work with right away as what you saw in 0111. Today we’ll cover text files; we’ll work with web pages later in the course, once we’ve learned a bit about how they are structured.

Let’s say we want to write a program that works with the complete text of Frankenstein, by Mary Wollstonecraft Shelley.
The text is available [here](http://www.gutenberg.org/files/84/84-0.txt) via [Project Gutenberg](http://www.gutenberg.org/), an online collection of public-domain books.

First, we'll download the file and save it to disk somewhere. By default, it should be named `84-0.txt`. This isn't very descriptive, so I renamed mine to `frankenstein.txt`.

In the future, whenever you see a piece of code preceded by a `>`, that means I am running it in the Python prompt. For example, after running `python3`, we can load the file into Python like this:

```
> frankenstein_file = open('frankenstein.txt', 'r')
```

The `r` means we want to open the file for Reading. Opening a file gives us a sort of token that we can use to work with it. Crucially, this isn't the same thing as the text itself; that text is still stored in the file! To get the text itself, we need to read the file's contents into a string:

```
> frankenstein = frankenstein_file.read()
```

Now we have the whole text as a really long string. You could see the text by asking Python to print `frankenstein`, but I won't do that here because it would take up a lot of space in the notes!

Something you might try is comparing what Python prints out for `frankenstein` against what it prints out for `frankenstein_file`. Remember:

* The text of the book is represented as a string (`frankenstein`); and
* The file, which provides access to the book's data, is represented as a file object (`frankenstein_file`).

Now that we have the text, what are some things we might do with it?

### Replacing words and writing files

Maybe we want to rewrite Frankenstein to instead be about Bruno the bear (Brown's sports mascot).
The easiest way to do this is probably just to take the text of Frankenstein and replace "Frankenstein" with "Bruno" everywhere:

```
> bruno = frankenstein.replace("Frankenstein", "Bruno")
```

Now that we’ve done that, we could save the results in a new file:

```
> bruno_file = open('bruno.txt', 'w')
> bruno_file.write(bruno)
> bruno_file.close()
```

Here the `w` means we want to open the file for Writing. Be careful: if `bruno.txt` already exists, opening it this way overwrites its contents. Closing the file when we are done makes sure everything we wrote is actually saved to disk.

If we look at that file, we can see that the text has been rewritten. Everything will be identical to the original, except that any occurrences of "Frankenstein" will have been replaced with "Bruno". And since it's saved into a file, you could share it easily: put it on your website, send it as an email attachment, or just archive it for future generations.

## Word counts <a name="counts"></a>

Let's say we wanted to create a count of the number of times every word appears in Frankenstein. How should we get started?

<details>
<summary>Think, then click!</summary>

I use a strategy of:

- writing down the shape of the data I've already got (here, it's a string containing the entire book);
- writing down everything I need to do with the data (here, it's counting the number of times every word appears); and then
- using that info to decide what intermediate data structures I should use. (Here, a dictionary would probably make sense, since that way we can store a count for every word.)

You'll sometimes hear me say that "queries influence structure", which is the idea that your choice of data structures should be governed by what you need to do with the data. Hence this strategy.

</details>

Let's try writing a helper function to do that. We'll start by writing the skeleton of the function, and then add the dictionary (which starts out empty) and return it.

```
def count_words(s: str):
    counts = {}
    # Code goes here, I don't know what yet
    return counts
```

Now we just have to figure out how to convert the input string into the dictionary. To do that, we need to break the goal down into subtasks. What are they?
<details>
<summary>Think, then click!</summary>

We need to:

* break up the input string into words; and
* count each word.

To break up the input, we'll use the string's `split()` method, which converts a string into a list of strings, splitting wherever there is whitespace (spaces, tabs, newlines, etc.). To count the words, we'll loop over that list!

</details>

```
def count_words(s: str):
    counts = {}
    for word in s.split():
        if word not in counts:
            # First time we've seen this word
            counts[word] = 1
        else:
            counts[word] += 1
    return counts
```

You might not have seen `+=` before. When we write `counts[word] += 1`, it's just shorthand for `counts[word] = counts[word] + 1`.

What can we do with this dictionary? Well, lots of things! But for today, let's try to figure out what the most common word in the text is. There are a few ways to do this, but let's opt for another loop. We can loop over a dictionary just like we loop over a list; if we write `for word in counts`, the loop runs once for every key in the dictionary:

```
def most_common(counts: dict):
    most_common = ''
    most_common_count = 0
    for word in counts:
        # Keep track of the highest count seen so far
        if counts[word] > most_common_count:
            most_common = word
            most_common_count = counts[word]
    return most_common
```

Finally: we may want to run our word-counting program as a script from the terminal. This would let us count words for different files. We can do that like this:

```
if __name__ == '__main__':
    import sys
    print(most_common(count_words(open(sys.argv[1], 'r').read())))
```

That strange `__name__ == '__main__'` business is how we say "run this code only when the program is executed as a script from the terminal" (as opposed to being imported by another program). In particular, we’re going to get the first argument passed to our script (that’s `sys.argv[1]`), open that file, read it, and run our word count program. If we save all this to a file called `word_count.py`, we can do (in the terminal):

```
% python3 word_count.py frankenstein.txt
the
```

Not very surprising, in retrospect.

## Some things I wonder <a name="wondering"></a>

#### Something concrete

I wonder what the top 10 words are in Frankenstein.
How would we find out?

#### Something code-related

Last time we talked about different kinds of "efficiency" in good code. I wonder how efficient the code we wrote today is. How would we find out? What factors might influence efficiency?

#### Something else

We talked about some aspects of "good code" last time. I wonder what else makes code good. I wonder if it's possible for code to be morally or ethically good (or bad).
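
#### One possible starting point for the top-10 question

Returning to the concrete question above about the top 10 words: here is a minimal sketch of one way to approach it, not the only way and not necessarily the best. It reuses the counting loop from today and then sorts the dictionary's (word, count) pairs by count. The helper name `top_n` and the tiny sample string are made up for illustration; they are not part of the lecture code.

```python
def count_words(s: str) -> dict:
    # Same counting approach as in today's notes
    counts = {}
    for word in s.split():
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts


def top_n(counts: dict, n: int) -> list:
    # Hypothetical helper: return the n (word, count) pairs with the highest counts.
    # sorted() is stable, so words with equal counts stay in first-seen order.
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# A tiny made-up sample, so the result is easy to check by hand
sample = "the cat and the dog and the bird"
print(top_n(count_words(sample), 2))  # prints [('the', 3), ('and', 2)]
```

To try this on the book, you could pass the `frankenstein` string from earlier to `count_words` and ask for `top_n(..., 10)`. Note that this sketch still treats `The` and `the` (or `monster` and `monster,`) as different words, which is worth thinking about before trusting the answer.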