---
tags: assignments
title: HW4
---

## Homework 4

### Setup

Run `cs6-work hw4` to start working on this assignment.

### Help

There are several commands that these problems expect you to use. If you get stuck, read their man pages and look up some examples of how to use them.

* ```wc```
* ```cat```
* ```uniq```
* ```sort```
* ```tr```
* ```sed```
* ```awk```
* ```grep```

These webpages may be helpful:

* [sed](https://www.grymoire.com/Unix/Sed.html)
* [awk](https://www.grymoire.com/Unix/Awk.html)
* [grep](https://opensourceforu.com/2012/06/beginners-guide-gnu-grep-basics/)
* [curl](https://curl.haxx.se/libcurl/)

### Problem 1

***cs6-work:*** While penguins love puzzles, they occasionally take a break to read. After reading Alice in Wonderland, Patricia the chinstrap penguin is curious: how many times is each of the words in Alice in Wonderland used? To help her out, write a script to do the analysis.

To get started, explore the directories that we set up for you. This problem's files are in `~/hw4/p1`. `alice.txt` is a text file containing a version of Alice in Wonderland that has been cleaned of capital letters, numbers, and punctuation. `count_uses.sh` is an empty bash file that you will be filling in.

Fill in `count_uses.sh` with a bash script that reads `alice.txt` and lists each of the unique words used in the book with its number of occurrences. `count_uses.sh` should be run by passing in the path to the text file to read. The command we will use to run the script is `./count_uses.sh /<pathtoalice>/alice.txt`. The output should be recorded in a file named `output.txt` in the same directory, formatted as follows:

```
hello 156
i 45
like 34
bananagrams 22
...
leastusedword 1
```

**Constraints:**

* Do not use associative arrays; your solution should make use of some of the programs listed above.
* The words should be listed in descending order of frequency with a single space between the word and the number of occurrences.
Words with the same number of occurrences should be sorted lexicographically.
* Do not use absolute paths in your script. Your ```count_uses.sh``` will be put in a sandboxed folder with ```alice.txt``` as a sibling and run. We will grade the ```output.txt``` that is generated.
* Do not rename `alice.txt` or `count_uses.sh`.

### Problem 2

***cs6-work:*** While Patricia is reading Alice in Wonderland, her friend Charlie the crested penguin has a hankering for some Shakespeare. He knows the URL, but needs your help getting the text. Also, he's considering auditioning for one of the roles, but wants to avoid the role with the most lines.

To get the text, we will be using the ```curl``` command. Run ```man curl``` to read about the specifics. Try running the following command OUTSIDE the cs6-work environment:

```
curl -o ./raw.html http://shakespeare.mit.edu/macbeth/full.html
```

This will download the entire play to a text file named ```raw.html``` in your directory. Handy, right?

Unfortunately, `curl` doesn't work in the cs6-work environment, so we will be using `cat` to get the play's contents. Because this was noticed late, you may need to recover the plays and stencil from the `~/backups/hw4/p2` directory. Your script should end up being able to take a single argument specifying the location of the input play's file.

We will be using the ```p2``` directory for this problem. Start editing the script named `play_parser.sh`. We have provided you with a stencil. Your job is to fill in the appropriate URL, `awk`, and `sed` commands. You may edit other lines of the file to see intermediate values and can restore the stencil from the `~/backups` directory.

Next, continue working on ```play_parser.sh```. The end goal is to have it print a single line to stdout. That line should be formatted as ```58 george```, where `george` is the character with the most lines and 58 is the number of lines George has. Make the actor's name all lower case.
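
The output format described above can be illustrated with a small sketch. This is not part of the stencil: the helper name `to_lower` and the sample values `58` and `GEORGE` are made up for illustration, and the sketch only shows one way to lowercase a name with `tr` and print it in the `<count> <name>` layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the stencil): lowercase whatever arrives on stdin.
to_lower() {
    tr '[:upper:]' '[:lower:]'
}

# Made-up sample values, used only to show the "<count> <name>" layout.
count=58
name="GEORGE"

# Print the count, a single space, and the lowercased name on one line.
printf '%s %s\n' "$count" "$(printf '%s' "$name" | to_lower)"
```

How the count and the name are actually extracted from the play is left to your `awk` and `sed` commands.
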
For our purposes, a "line" is counted as one item spoken by the character and contained by \<A\> tags. So, the entry:

```
<A NAME=speech15><b>MACBETH</b></a>
<blockquote>
<A NAME=3.4.43>Sweet remembrancer!</A><br>
<A NAME=3.4.44>Now, good digestion wait on appetite,</A><br>
<A NAME=3.4.45>And health on both!</A><br>
</blockquote>
<A NAME=speech16><b>LENNOX</b></a>
<blockquote>
<A NAME=3.4.46>May't please your highness sit.</A><br>
<p><i>The GHOST OF BANQUO enters, and sits in MACBETH's place</i></p>
</blockquote>
<A NAME=speech17><b>MACBETH</b></a>
<blockquote>
<A NAME=3.4.47>Here had we now our country's honour roof'd,</A><br>
<A NAME=3.4.48>Were the graced person of our Banquo present;</A><br>
<A NAME=3.4.49>Who may I rather challenge for unkindness</A><br>
<A NAME=3.4.50>Than pity for mischance!</A><br>
</blockquote>
```

consists of 7 lines for Macbeth and 1 line for Lennox.

To simplify the exercise, treat lines that are spoken by more than one person (e.g. lines spoken by "all," "lords," "both murderers," or "soldiers") as if they are just another character. Do not put in the work to determine who "all" is for the scene and add lines to each person in "all."

Here are a couple of correct examples to test your script against:

```bash
$ ./play_parser.sh merry_wives
433 falstaff
$ ./play_parser.sh hamlet
1495 hamlet
```

We will test your script against several plays, so it should work for all of them.

### Problem 3

***cs6-work:*** Arnold the Adélie penguin is trying to understand how numbers work. He needs things built from the ground up, starting with positive integers and working all the way up to scientific notation. To convey meaning to Arnold effectively, construct a regular expression for each of the following classes.

We have provided commands in the `cs6-work` environment that allow you to test your regular expressions. The first tests your expressions against a short list of simple cases that are available to you.
The second does more extensive edge case testing. However, it tells you what fraction of the cases you got right but not which cases. These are the commands:

```
regex-test <path to your regex file> easy
regex-test <path to your regex file> hard
```

The test cases for the first script are an excellent source of examples to get you started. They can be found in:

```
/usr/local/regex-test/testcases
```

For each of these regular expressions, we recommend you follow these steps:

* List cases that should be identified and cases that should not.
* Consider edge cases. Is it alright if this character doesn't happen at all? Must it happen at least once? Etc.
* Construct each of the pieces separately. For example, if you were working on email addresses, construct two separate regular expressions for the part before the @ symbol and the part after. Once each part is working, join them together.
* Double and triple check your escape characters.
* Use online tools to test and do research for your regular expressions. [This](https://regexr.com/) is a personal favorite.
* Do not search for the answers to these problems online.
* Wildcards (`*`) in regular expressions behave differently than wildcards for matching filenames! For example, ```ls file*txt``` would list all files that start with "file" and end with "txt" (so, hopefully text files; the "." that would normally be included before "txt" is omitted because it is also a special character in regexes). However, the regular expression ```file*txt``` would only match the following group of strings:

  ```
  filtxt
  filetxt
  fileetxt
  fileeetxt
  ...
  ```

For this problem, put your regexes into the `~/hw4/p3/regex.txt` file, one per line.

##### 1. Positive integers

Begin by considering what falls into the category of "positive integer," or $\mathbb{N}$.

##### 2. Integers

Integers include all positive integers as well as zero and negative integers. This set is represented by $\mathbb{Z}$.

##### 3. Decimals

Decimals in theory include all of $\mathbb{R}$, but some numbers would require an infinite amount of space to store on a computer. That is not your concern, however, as you just need to write a regex to recognize them :smile: Do not allow any decimals with zero significant figures other than the string "0".

##### 4. Decimals in scientific notation

These numbers all fall in $\mathbb{R}$, but are denoted differently. We will be using [E notation](https://en.wikipedia.org/wiki/Scientific_notation#E-notation). The exponents must be integers and the base number may be any decimal.

### Problem 4

***cs6-work:*** Pat the penguin is a teacher and wants to award a prize to the student who wrote the longest essay. You tell her that there are several ways to measure length: character count, word count, and line count. Pat thinks about this, and decides to give the prize to the student with the greatest sum of these three metrics. Not the most normal measuring method, you think, but no issue when it comes to scripting!

Write Pat a script ```longest_essay.sh``` that examines every other file in its directory and prints a single line to stdout: the name of the file that has the greatest sum of characters, words, and lines.

Write some fake essays in this folder to do testing. If you do no testing, you will lose points.
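
One possible way to set up test data is sketched below. The file names and contents are made up, and the scratch directory from `mktemp -d` is an assumption for illustration; the essays can just as well be written directly in the folder that holds ```longest_essay.sh```. The loop prints the line, word, and byte counts that `wc` reports for each essay, so the expected winner can be worked out by hand before running your script.

```shell
#!/usr/bin/env bash
# Build a few throwaway essays in a scratch directory (hypothetical names and contents).
workdir=$(mktemp -d)
printf 'one two three\n' > "$workdir/short.txt"
printf 'a much longer essay\nspread over two lines\n' > "$workdir/long.txt"
: > "$workdir/empty.txt"   # an empty essay is a useful edge case

# Report lines, words, and bytes for each essay, followed by the file name.
# Note: -c counts bytes, which equals the character count for plain ASCII text.
for essay in "$workdir"/*.txt; do
    wc -l -w -c "$essay"
done

# The scratch directory is left in place; remove it with: rm -r "$workdir"
```

Adding up the three numbers for each file by hand gives the answer your script should print for this set of essays.
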