CS200 Bridge 4: Structured data for formatting

# CS200 Bridge Assignment: Structured data for formatting

*[Work in Pyret, putting your work in a file called `document-code.arr`]*

You might have experience formatting a paper or report that includes headers, paragraphs, bulleted lists, etc., by pressing buttons in a word processor (such as Microsoft Word or Google Docs). You might also have seen that online articles and other websites (including this one!) have different sections and components that are displayed a bit differently. In this assignment, we'll explore two ways that computers can store *structured text*.

:::info
office hours recording

See [**here**](https://brown.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=0428f945-e616-44f7-9e45-b308016439af) for the summer 2025 office hours recording on this assignment. *Note: as with all of the office hours recordings, you will get a lot more out of watching this recording after you've attempted the problems in this assignment.*
:::

## Part 1: Parsing HTML

**[HTML, or Hyper-Text Markup Language](https://en.wikipedia.org/wiki/HTML)**, is the language that describes websites on the internet. It is made up of content surrounded by *tags*, which follow specific rules and dictate how a website will be displayed when it loads on your screen. A very basic website might have the following HTML (the output is on the left):

| Output | HTML |
| -------- | -------- |
| ![cats](https://hackmd.io/_uploads/BkIR6kxeel.png) | <pre>\<html><br>\<head><br>\<title>Cat facts\</title><br>\</head><br>\<body><br>\<h1>Did you know:\</h1><br>\<ul><br>\<li>Cats sleep \<b>12-16\</b> hours a \<b>day\</b>\</li><br>\<li>No two cat nose prints are alike\</li><br>\</ul><br>\<p>Created by the \<b>CS111\</b> staff!\</p><br>\</body><br>\</html></pre> |

In the example, `html`, `head`, `title`, etc. are the tags. There is a nested structure to HTML, where content and additional structures are contained between *opening* and *closing* tags (e.g.
`Created by the <b>CS111</b> staff!` is contained by the `p` tag because it falls between `<p>` and `</p>` -- note that the tag contains both the text "Created by..." and additional structure in the form of the `b` tag). We call the tag and the content contained by the tag an *HTML element*.

We're going to use a simplified version of HTML with the following rules:

* In the description below, "contents" refers to a string with no linebreaks that gets displayed on the page
* Every document begins with `<html>` and ends with `</html>`
* The `html` tag contains two HTML elements (in order): `head` and `body`
* The `head` tag contains one HTML element (`title`) and nothing else
* The `title` tag contains only contents (the `title` tag is for describing the website. It doesn't get displayed on the page directly, but appears in the window/tab heading)
* The `body` tag can contain any number/combination of five elements: `h1`, `h2`, `h3`, `p`, and `ul`
* The three header tags (`h1`, `h2`, `h3`) contain only contents
* The `p` tag contains contents, some of which may be surrounded by `b` tags
* The unordered list tag (`ul`) contains any number of `li` elements
* The list item tag (`li`) contains contents, some of which can be surrounded by `b` tags

The important thing to notice is that the HTML defines the *structure* of the website, which is different from how the website *appears* in your browser. In other words, website servers store HTML files, and your web browser reads in those files to figure out how to display the contents.

In order to display an HTML file, a browser needs to have an internal representation of the HTML structure that it can compute on. In Pyret, we can use structured data for this internal representation. Click [here](https://code.pyret.org/editor#share=18uMxY51XE33oHUsh3q7NInoC__SvDtXc&v=fee2ecd) for a file that contains definitions of structured data that can be used to represent the rules written above.
:::spoiler Our example website represented using our data
![cat_data](https://hackmd.io/_uploads/H1mMYgxeeg.png)
:::

**Task 1:** Make a copy of the file. In a comment in the file, answer the following questions:

1. Why do the definitions use different types instead of nesting a bunch of variants under one type? That is, what could go wrong if we put the `head` and `body` variants under the `HTML` type?
2. Why do we have three almost-identical looking variants for `h1`, `h2`, `h3`? Since they all hold the same information, why not just use a single `header` variant?
3. Our HTML rules allow us to arbitrarily mix headers, paragraphs, and bulleted lists in the body of a document. This is also allowed by the structure of our data, because the `body` variant is just a list of `Elements`. What would have to change about our `data` definitions if we changed the rules so that every paragraph (`p` tag) needed to be preceded by a header (`h1`, `h2`, or `h3` tag)? Explicitly write out the changes to the `Element` (or any other) type as well as any new types that you might define, if needed.

**Task 2:** Write a function `fun text-to-html(input-text :: String) -> HTML:` that takes in an input string formatted in HTML and produces the corresponding data of type `HTML`. You can assume the following about the input string:

- It's all one line (contains no newlines)
- It has no extra spaces surrounding the tags (e.g. `<head><title>...` instead of `<head> <title>`)
- It is well-formed HTML according to the rules above. In particular, every opening tag is eventually closed with a closing tag, the string starts with `<html>`, and the nesting rules obey the structure we've defined.
- All HTML tags are lowercase and have no superfluous spaces (e.g.
`<ul>` rather than `< UL >`)
- Contents do not contain the `<` or `>` characters

:::spoiler How to define inputs for testing
Instead of typing out a really long input String in one line, you can use multi-line strings (defined using three backticks) and then remove the newlines using `string-replace` (the special String `"\n"` represents a newline character). For example, our website above could be described as:

````
cat-website = ```<html>
<head><title>Cat facts!</title></head>
<body>
<h1>Did you know:</h1>
<ul>
<li>Cats sleep <b>12-16</b> hours a <b>day</b></li>
<li>No two cat nose prints are alike</li>
</ul>
<p>Created by the <b>CS111</b> staff!</p>
</body>
</html>```
````

and then we could call `text-to-html` with the expression `text-to-html(string-replace(cat-website, "\n", ""))`. The downside is that the Pyret editor will want to automatically add spaces at the beginning of each line, so you'll have to be careful to remove them.

You can also break a String up into multiple lines in the Pyret window using addition:

```
cat-website-2 = "<html>" +
  "<head><title>Cat facts!</title></head>" +
  "<body>" +
  "<h1>Did you know:</h1>" +
  "<ul>" +
  "<li>Cats sleep <b>12-16</b> hours a <b>day</b></li>" +
  "<li>No two cat nose prints are alike</li>" +
  "</ul>" +
  "<p>Created by the <b>CS111</b> staff!</p>" +
  "</body>" +
  "</html>"
```

In this case, the String doesn't contain any newlines, so you can call `text-to-html` with this input directly.
:::

:::spoiler Hints
This is a pretty complicated function, and will probably require a few helper functions. We suggest starting with the helper functions and thoroughly testing them before combining them for the full task. Consider the following tips:

- The `string-split` function can help you get the text before and after an opening (or closing!)
tag
- There are probably some operations (such as removing opening/closing tags) that you'll need to do repeatedly -- helper functions will really come in handy here
- Try to work from the inside-out: define a function that transforms the contents of a `p`/`li` tag into a `List<Text>`, thoroughly test it, and then see how you can use that function in a function that transforms the contents of a `ul` tag into a `List<List<Text>>`, etc.
:::

## Part 2: Outputting markdown

Now that we have an internal representation of our website, we can reason about how a browser would display it. Rather than trying to output an image or website, we're going to create output in another text-based format, called *markdown.*

[Markdown](https://en.wikipedia.org/wiki/Markdown), like HTML, is a language for describing the structure of a document. It's a bit simpler than HTML and used for basic documents (this very page that you're reading was written in Markdown!). Take a look at the Wikipedia examples to see how Markdown is formatted. For us, the relevant rules are:

- Different levels of headers are represented using different numbers of `#`. That is, the content for an HTML header 1 (`h1`) would be preceded by `# ` in Markdown (and by `## ` and `### ` for `h2` and `h3`, respectively)
- Items in bulleted lists are preceded by `- `. Every item in a bulleted list is separated by one newline
- Bold text is surrounded by `**` on each side
- Every new element in a body (header, paragraph, bulleted list) is separated from the previous one using two newlines (so that a blank line appears between elements)

Using these rules, the body of our example cat webpage would look like:

```
# Did you know:

- Cats sleep **12-16** hours a **day**
- No two cat nose prints are alike

Created by the **CS111** staff!
```

:::spoiler What does this look like when rendered on your page?
# Did you know:

- Cats sleep **12-16** hours a **day**
- No two cat nose prints are alike

Created by the **CS111** staff!
:::

**Task 3:** Write a function `fun body-to-md(b :: Body) -> String:` that takes in a well-formed `Body` element and produces a Markdown String.

:::spoiler How to create newlines in the output String
Above, we learned that the String `"\n"` represents a newline, so the Markdown output of our example should actually be

```
"# Did you know:\n\n- Cats sleep **12-16** hours a **day**\n- No two cat nose prints are alike\n\nCreated by the **CS111** staff!"
```

*(Note the lack of spaces after `\n` and note that there are no additional newlines at the bottom of the file -- your output is expected to behave in the same way).*

You can append a newline to a String using the same String operations we know -- for example, the expression `"Hello World" + "\n"` would result in the String `"Hello World\n"`.

If you want to see how the String displays using newlines, use the `print` function, which will print out the String by replacing `\n` with newlines (the function also returns the String, so you'll see it at the bottom of the output). For example, if `cat-html-body` was the `Body` of the cat website as internally represented using structured data, we could run the expression `print(body-to-md(cat-html-body))` to see the output displayed using newlines.
:::

## Reflection

In a block comment, write a few sentences about what you’ve learned from these problems. Also mention any questions you have (if any) based on these exercises.

## Handin

Turn in a file `document-code.arr` with your code. In particular, make sure you have tested your functions on multiple inputs.

:::warning
***Information about the Autograder***

Just like previous assignments, Gradescope will run our autograder tests on your code and display the results. **It is your responsibility to make sure that all autograder tests pass**.
* Make sure you're using our datatype definitions for the problem, and that the functions are defined accordingly
* The hints from the previous three assignments apply -- check the filename, check the function definitions, and write plenty of tests of your own.
:::
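
As a companion to the Task 2 hints and the notes on newlines, here is a minimal sketch of how the built-in string operations mentioned in this handout behave on small inputs. The helper name `text-after` is hypothetical: it is not in the provided file and not a required function, and it is only one of many ways you might start building your own helpers.

```pyret
# Hypothetical helper for illustration only; it is NOT part of the
# provided file and NOT a required function.
# Returns the text following the first occurrence of `marker` in `s`,
# or "" when `marker` does not occur in `s`.
fun text-after(s :: String, marker :: String) -> String:
  parts = string-split(s, marker)
  # string-split splits at the FIRST occurrence only, so the result has
  # one element (marker absent) or exactly two (before and after).
  if parts.length() == 1:
    ""
  else:
    parts.get(1)
  end
where:
  text-after("<h1>Hi</h1>", "<h1>") is "Hi</h1>"
  text-after("no tags here", "<h1>") is ""
end

check "built-in string behavior":
  # Only the first closing tag is used as the split point
  string-split("<li>a</li><li>b</li>", "</li>") is [list: "<li>a", "<li>b</li>"]
  # Removing newlines from a multi-line input
  string-replace("<p>\nhi\n</p>", "\n", "") is "<p>hi</p>"
  # Appending a newline with ordinary String addition
  "Hello World" + "\n" is "Hello World\n"
end
```

Because `string-split` stops at the first match, the text after the split still contains any later tags, which is why the hints suggest combining several small, well-tested helpers.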