# Cleaning and Preparing Data in Python (Basics)

The MoMA data is in a CSV file called `artworks.csv`. Here's what the first five lines of that file look like:

![](https://i.imgur.com/Up2NpUh.png)

```python
# import the reader function from the csv module
from csv import reader

# use the Python built-in function open()
# to open the artworks.csv file
opened_file = open('artworks.csv')

# use csv.reader() to parse the data from
# the opened file
read_file = reader(opened_file)

# use list() to convert the read file
# into a list of lists format
moma = list(read_file)

# remove the first row of the data, which
# contains the column names
moma = moma[1:]
```

---

### str.replace()

> To change part of a string, we'll learn the str.replace() method. The str.replace() method is like a "find and replace" tool for strings. Let's look at the individual steps required to change our string:
>
> - We need to find all instances of the old substring, "red".
> - We need to replace each of those instances with the new substring, "blue".
>
> To achieve this using str.replace(), we need to provide two arguments:
>
> - old: The substring we want to find and replace.
> - new: The substring we want to replace old with.
>
> Both of these are positional arguments, so we pass them in order without specifying their names. Let's look at what this looks like in the diagram below:
> ![](https://i.imgur.com/RUIwPeV.png)
>
> We may decide that we can just replace the substring "r" with "R". Let's look at what happens when we do that:
> ![](https://i.imgur.com/QQRYp79.png)
>
> Because the substring "r" was also found in the words "favorite" and "color", those words were changed too, giving us "favoRite" and "coloR".
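The find-and-replace behaviour described above can be sketched in a few lines. The example sentence here is an assumption chosen for illustration, since the diagrams are images and their exact strings are not reproduced in the text.

```python
# Hypothetical example sentence, used only to illustrate str.replace()
sentence = "red is my favorite color"

# Replace every occurrence of the old substring with the new one
blue_sentence = sentence.replace("red", "blue")
print(blue_sentence)  # blue is my favorite color

# A one-character substring matches everywhere it occurs, not just in "red"
too_greedy = sentence.replace("r", "R")
print(too_greedy)  # Red is my favoRite coloR

# str.replace() returns a new string; the original is left unchanged
print(sentence)  # red is my favorite color
```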
> Be careful where a substring might be hidden inside other words. If this happens, just use a longer substring:
> ![](https://i.imgur.com/YWqmJmE.png)
> ```python
> age1 = "I am thirty-one years old"
> age2 = age1.replace("one", "two")  # "I am thirty-two years old"
> ```

### [str.title()](https://docs.python.org/3/library/stdtypes.html#str.title)

> The str.title() method returns a copy of the string with the first letter of each word transformed to uppercase and the remaining letters to lowercase (also known as title case).
> ```python
> my_string = "The cool thing about this string is that it has a CoMbInAtIoN of UPPERCASE and lowercase letters!"
> my_string_title = my_string.title()
> print(my_string_title)
> # Output:
> # The Cool Thing About This String Is That It Has A Combination Of Uppercase And Lowercase Letters!
> ```
> - Instructions
>
> 1. Create a function called strip_characters(), which accepts a string argument and:
>    - Iterates over the bad_chars list, using str.replace() to remove each character.
>    - Returns the cleaned string.
> 2. Create an empty list, stripped_test_data.
> 3. Iterate over the strings in test_data, and on each iteration:
>    - Use the function you created earlier to clean the string.
>    - Append the cleaned string to the stripped_test_data list.

```python
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's",
             "C. 1990-1999"]

bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]

def strip_characters(string):
    # remove every unwanted character by replacing it with an empty string
    for char in bad_chars:
        string = string.replace(char, "")
    return string

stripped_test_data = []
for s in test_data:
    test_str = strip_characters(s)
    stripped_test_data.append(test_str)
print(stripped_test_data)
```

```python
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's",
             "C. 1990-1999"]

bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]

def strip_characters(string):
    for char in bad_chars:
        string = string.replace(char, "")
    return string

stripped_test_data = ['1912', '1929', '1913-1923',
                      '1951', '1994', '1934',
                      '1915', '1995', '1912',
                      '1988', '2002', '1957-1959',
                      '1955', '1970', '1990-1999']

def process_date(date):
    # a range such as "1913-1923" becomes the rounded average of its two years
    if "-" in date:
        split_date = date.split("-")
        date = round((int(split_date[0]) + int(split_date[1])) / 2)
    else:
        date = int(date)
    return date

processed_test_data = []
for d in stripped_test_data:
    date = process_date(d)
    processed_test_data.append(date)

# apply both steps to the date at index 6 of every row
# (moma is the list of lists loaded from artworks.csv earlier)
for row in moma:
    date = row[6]
    date = strip_characters(date)
    date = process_date(date)
    row[6] = date
```

###### tags: `python`
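As a quick check of the date-handling step, the sketch below restates `process_date()` with the same logic as above so that it runs on its own, then applies it to a few cleaned strings (the `samples` list is just a hand-picked subset for illustration). One detail worth knowing: Python's built-in `round()` rounds exact halves to the nearest even integer, so the midpoint of `1990-1999` (1994.5) becomes 1994 rather than 1995.

```python
def process_date(date):
    # A range such as "1913-1923" becomes the rounded midpoint of its two years
    if "-" in date:
        split_date = date.split("-")
        date = round((int(split_date[0]) + int(split_date[1])) / 2)
    else:
        date = int(date)
    return date

# Hand-picked subset of the cleaned test strings
samples = ["1912", "1913-1923", "1957-1959", "1990-1999"]
for s in samples:
    print(s, "->", process_date(s))
# 1912 -> 1912
# 1913-1923 -> 1918
# 1957-1959 -> 1958
# 1990-1999 -> 1994
```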