Lab 10: Iterators and Regex

Learning Objectives

In this lab, you will learn to:

create iterable objects in Python
pattern match using Regex

Setup

Use this GitHub Classroom link to get the stencil code for this assignment.

Intro

Alan was playing with his Webkinz, and decided to lead an icebreaker to get them all to know each other. However, before he could start introducing them, he had to check each Webkinz's name tag and learn their names. Realizing he had to find a way to visit each of his toys in sequential order, a lightbulb lit up in his mind: Iterators!

What is an iterator?

In both Python and Java, we have iterated over lists using a notation that visits each element in the list exactly once, as in:

// Java
for (DataType item : linklist) {
    ...
}

# Python
for elt in lst:
   ...

The idea of "visit each item once" isn't limited to lists. We could imagine doing that over any data structure that has a collection of elements, such as all the nodes in a tree, all the nodes in a graph, or all the KVPairs in a hashtable/dictionary.

Because iterating over elements makes sense in a lot of situations, many languages give programmers a way to make for-loops work on their own data structures (e.g., graphs). In this section of lab, we will teach you how to write your own iterators in Python (a similar process works in Java).

Writing Iterators

There are two key steps in iterating over a collection of items:

identifying the first element
from any one element, producing the next element to visit

In Python, we capture these two steps by writing two (dunder) methods named __iter__ and __next__.

__iter__ defines the initial state of the iterator by indicating the element of the iterable object that the iterator will start iterating from (usually the first element), and returns the iterator itself.
__next__ can be called repeatedly to return what the next element of the iterable structure should be and advance your place in the data structure by one location. If there is not a next element, __next__ should raise StopIteration.

Writing a LinkedList iterator

Python's built-in lists are built on arrays. In LinkedList.py, we've given you the Python code for a basic LinkedList class, similar in style to what we wrote in Java.

Task 1: The bottom of LinkedList.py contains an example of making a list with this class and an attempt to write a for-loop over it. Try to run the code and see what error you get from Python.

Comment out the for-loop code (for now). Now we will write the iterator needed to get that code to run.

Task 2: Identify the first element that the for-loop should visit (don't overthink this).

Now, think about the __next__ method. Python's for-loop mechanism will call this method over and over, each time returning the next element of the list for the for-loop to visit (the first time it is called, it returns the item identified in the previous question).

Task 3: Discuss how to design the __next__ method: how will your code keep track of the current and next elements to return? Do you do this with a helper function? With an additional variable? Does that variable live in the __next__ method? In the class?

Task 4: Implement your design by filling in the __iter__ and __next__ methods in the stencil code. The comments in the method tell you what to do. When you reach the end of the iterable structure, __next__ should raise StopIteration.

Try uncommenting and running the loop at the bottom of LinkedList.py to see if your iterator is working as expected.

Checkpoint! Call over a TA to check over your work. Ask them any questions you have about how this works.

A subtlety (for those who are interested)

The iterator design we have done here highlight the basic concept of __iter__ and __next__ methods. There's a problem with our current design though. Assume that you had a large data structure that multiple objects had a reference to.

What happens if two for loops get set off on the same data structure object at the same time?

Our current design will fail, as both loops would be using the same variable to track the current item. A more robust iterator design actually returns a separate object for the iterator, so that each iterator can manage its own state independently.

Designing a HashMap iterator

Let's go back to HW3, where you implemented hashtables. How might you implement a Hashtable iterator? What would that even mean?

We'll do this as a design exercise; you won't turn this into code.

Recall the internal data structure of a hashtable: we had an array of lists of KVPair. Assume that we want a hashtable for-loop to iterate over the KVPairs. For example (assuming we had implemented the same design in Python):










# using a for-loop to print a dictionary
def print_ht(my_hashtable):
    for pair in my_hashtable:
       print(str(pair.key) + " -> " + str(pair.value))
       
sample_ht = Hashtable()
sample_ht.insert("a", 3)
sample_ht.insert("b", 6)
sample_ht.insert("c", 1)
print_ht(sample_ht)

Task 5: Your classmates are having a debate about what the order of pairs would be upon printing the hashtable.

one says the pairs would print in the order (by keys) "a", "b", "c" because that's the order in which they were added
another says the order is unpredictable to the programmer

Discuss these two perspectives. Why might the second student claim the order is unpredictable? Which one do you think should be right? Does it matter?

Task 6: Now think about how you might implement the hashtable. Answer the following two questions:

Where in the underlying data structure might we find the first KVPair? Think about the array of lists of KVPairs. Where could your code easily get a first element?
In the LinkedList iterator, you kept track of the current node in the list in order to compute __next__. What might you keep track of here in order to advance to the next KVPair for the loop?

Checkpoint! Call over a TA to check over your work. Ask them any questions you have about how this works.

Task 7 (if you are still within the first hour of lab, otherwise skip): Think about the efficiency of your design for __next__. Ideally, an iterator visits each item to be produced exactly once, with constant time needed to advance from one item to the next under the hood. Does your design achieve that? If not, how might your design achieve that?

Regexes

Imagine that you were given the text from an email message and asked to find all of the references to Brown course codes (such as "CSCI0200") in the message. How might you do this?

If you were searching for a specific course, you could get the text as a list of strings and loop over all of the strings, checking which ones were equal to "CSCI0200". But we're asking something different: we're asking you to identify all strings of a specific shape (here, four capital letters followed by four digits). How would you write that program?

Tasks like this that involve searching (and perhaps replacing) strings based on their shapes are common in programming. So much so that languages provide constructs to describe the shape they want, while the language handles checking each string against the expected shape. These shapes are called regexes or regexps (short for "regular expressions", a term that arises from the mathematical foundations of computation).

This part of lab will teach you how to write regular expressions. You will need this on the upcoming Search project. (And you'll find these a handy tool to know in general.)

We're going to start having you work in a website designed to teach about writing regular expressions. Afterwards, you'll write a small program to see how to use regexes within a Python program.

Open regex101.com. Here's what you will see (with our annotations in pink):

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Warmup Exercises

Task 8: Put the following string into the "Test String" box, and the pattern string cat into the "Regular Expression" box, as follows:

The cat in the hat was late, and that was a great surprise

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Notice that the cat string is highlighted in the text, showing that it matched your regex.

Now let's start building up patterns in which not all characters are fixed. Let’s look for every 3 letter word in a body of text that ends in “at” (e.g. “cat”, “bat”, “sat”, etc). For this, we can use [a-z]at. The brackets create a set of characters to search for, and the a-z gives a range of every character between a and z in the ASCII character list.

Task 9: Put this new regex into the site, and see what matches.

Now instead, suppose we want every single word ending in “at”, regardless of length (e.g. “at”, “bat”, “seat”, etc). For this, we can use a quantifier. The * quantifier says that there can be any number, including 0, of the character proceeding it in the string it finds. This means [a-z]*at will do what we want; find every String consisting of some number of letters followed by “at”.

Task 10: Put this new regex into the site, and see what matches.

If we want to exclude “at” and only find words with 3 or more letters, we could instead use [a-z]+at, as the + quantifier requires there be one or more a-z characters.

Task 11: Put this new regex into the site, and see what matches.

This regex still has some issues though. For example, if we wanted to used it to find all of the words ending in “at”, we see (by the highlights in the website) that it matches "lat" (as part of "late"). We need some way to specify that the “at” actually be at the end of the word. For this, we can use a “word boundary,” which requires that there be “word characters” (letters, numbers or underscores) on only one side of the boundary. This is represented as \b in the regex syntax. So our new regex, [a-z]*at\b, will match “at”, “cat”, “seat”, and “great”, but not “ate”, “late”, “bat9”, or “at_home”.

Task 12: Put this new regex into the site, and see what matches.

If we want to avoid matches like “seat” or “great”, we could add a word boundary to the beginning as well, for a final regex of \b[a-z]at\b.

More challenging examples

Now that you've seen the basics of character ranges, quantifiers, and boundaries, we can build up to more interesting patterns. The bottom of our course regex guide shows other characters you can use in regexes (to match things like spaces, sets of characters, and more).

Task 13: Craft regexes to match on the following. Try them out in the regex101 website (modify your text string as needed to check your work):

All space separated words.
All words that start with a capital letter.
Any capitalization of the word train (i.e., TRAIN, train, tRAIn, TrAiN, etc,)
All words of length 4 (i.e., 4 characters)
All words inside quotes (Hint: think escaped characters!)

Regexes in Python

Now we're going to use regexes in Python! Python comes with a built in re library that performs many of the functions yould would want using regexes. You can read all about this library here. The function that will get the most use out of in this class is findall, which takes in a regex and a string to look through and returns a list of all the instances from the string that match the given regex. Other methods you may find helpful are finditer, sub, and split.

In Python, the syntax you use to specify a regex is r"<insert regex here>". The r and the quotation marks are necessary, as they help Python better parse escape characters.

In Regex.py you will find a class called Registrar, which handles student emails and the course codes each student email (and thus, each student) is associated with. There are three methods in the class, view_emails, view_courses, and make_changes. Assume the registrar is for Brown University, and thus has Brown-styled emails and course codes. These methods should have the following functionality:

view_emails should print out all of the emails found in the document. For this problem, a Brown email address contains any number (at least 1) of names separated by single underscores, followed by a group of zero or more numbers and then "@brown.edu", where a name is a group of lowercase letters and dashes.
view_courses should print out all of the course codes found in the document. Courses consist of 3 to 4 capital letters, followed by 4 numbers, with an optional capital letter at the end!
make_changes takes in a regex of a pattern that should be changed and a replacement string of what that pattern should be changed to. This is a general method, and therefore should be able to make changes to either emails or course codes. (Hint: check out the sub method!)

Task 14: Fill out the methods in the Registrar class with the specifications above. Be prepared to a show a TA the results of your regex matching!

Checkoff! Congratulations, you've completed this week's lab! Check in with a TA, and then you can go home. Nice job!

Please let us know if you find any mistakes, inconsistencies, or confusing language in this or any other CSCI0200 document by filling out the anonymous feedback form.