# Bookbub README

## Run Instructions

1. Download my solution (`solution.zip`).
2. Ensure that `python3 --version` returns a version of at least 3.5.2. If not, update Python. I have not tested my code with versions earlier than 3.5.2.
3. Additionally, ensure that `which python3` returns `/usr/bin/python3`. If this is not the case, either make `/usr/bin/python3` a symbolic link pointing to what `which python3` returned or use `python3 ./sol.py` in place of `./sol.py` in the following steps.
4. Execute `unzip solution.zip` in the same directory as `solution.zip`.
5. `cd ./bookbub-problem`
6. `./sol.py` will execute my code on the sample files (`sample_book_json.txt` and `sample_genre_keyword_value.csv`).
7. To see the flags that `sol.py` accepts, execute `./sol.py -h`.
    * `-h`: Display the help message and exit.
    * `-i`: Ignore letter case in keywords and book descriptions.
    * `-p`: Ignore punctuation in keywords and book descriptions. Apostrophes are removed, and every other character that is neither alphanumeric nor whitespace is replaced with a space. Runs of whitespace are then squeezed, resulting in a string of alphanumeric words separated by single spaces.
    * `--books`: Specify a different book data file.
    * `--keys`: Specify a different key data file.
8. One example command leveraging these flags is `./sol.py -i -p --books ./data/test_books.json --keys ./data/test_keys.csv`. I used these files to test the `-i` and `-p` flags and examine edge cases.

## Trade-Offs and Edge Cases

The first edge case I spent some time thinking about was capitalization. The simplest implementation would take the keywords and look for exact matches in the description of the book. This would cause the keyword "space" to not be found in the sentence "Space: the final frontier." I felt that this was undesirable behavior, so I added the `-i` flag.

The next case I considered was punctuation.
I could think of arguments for leaving punctuation as an exact match (for example, if a keyword like "R&R" or "m&m" were used), but I also saw the merit in ignoring punctuation if a comma landed in the middle of a keyword or someone used an apostrophe incorrectly (such as a keyword of "it's magic" when the description reads "its magic"). So, I implemented the `-p` flag.

Another edge case I considered is whether the key "aaaa" should be "found" once or twice in the string "aaaaa". In regular expression matching the key is found once, but there are two locations in the string where "aaaa" occurs; they just overlap. Because the default behavior of `findall` is to not count overlapping matches separately, and I could not think of a good use case for counting the key twice, I did not implement overlapping matches.

For sorting the titles alphabetically, I decided to ignore case. This causes the list of titles `a, C, b` to be sorted to `a, b, C` instead of `C, a, b`, which I feel is desirable.

One trade-off made is that of memory for speed. My current implementation stores the entirety of the book list and keyword list in memory. For large book lists and/or keyword lists, this could become an issue. To solve this, I can think of two straightforward approaches. The first would be to store the data structures on disk in a database. This would eliminate the memory issue but require SQL integration and increase runtime significantly. The second is to leave the files as they are and iterate through them line by line, keeping only the current book in memory. This would lose the performance that we gain from iterating through the key file only once, and it would require code that can parse parts of a file without loading all of it into memory.

## Time Spent

I spent about 2 hours on the core product and another 2 hours writing comments, composing the README, adding features, and testing.

## Notes

I chose Python for the language because it is very easy to make small projects with it.
If I were adding this to a project of mine that already existed, I would have used a statically typed language (my current favorite is Kotlin). Unfortunately, the time overhead of creating a new project in Kotlin is greater than I wanted to spend setting up this project.

Another consideration was whether my code should support any key. I decided that it should. With regular expression matching, supporting arbitrary keys was the easiest option to implement. If I restricted the keys to a maximum of two words, it might be possible to write a more performant algorithm that treats book descriptions as lists of words and iterates through them, but I felt that this would be a small theoretical gain for a very real and large amount of work, and it would also limit the user.
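
As a rough illustration of the regular-expression approach and the edge cases discussed in this README, here is a minimal sketch. It is not the code in `sol.py`: the names `normalize`, `count_keyword`, and `sort_titles` are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and keywords are assumed to match as plain substrings with no word-boundary check.

```python
import re


def normalize(text, ignore_case=False, ignore_punctuation=False):
    """Hypothetical helper mirroring the -i and -p behavior described above."""
    if ignore_punctuation:
        # Apostrophes are dropped outright, so "it's" becomes "its".
        text = text.replace("'", "")
        # Assumption: "alphanumeric" means ASCII letters and digits.
        # Every other non-whitespace character becomes a space.
        text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
        # Squeeze runs of whitespace (this also trims both ends).
        text = " ".join(text.split())
    if ignore_case:
        text = text.lower()
    return text


def count_keyword(keyword, description, ignore_case=False, ignore_punctuation=False):
    """Count non-overlapping occurrences of keyword in description."""
    keyword = normalize(keyword, ignore_case, ignore_punctuation)
    description = normalize(description, ignore_case, ignore_punctuation)
    if not keyword:
        # An empty pattern would match at every position, so skip it.
        return 0
    # re.escape lets arbitrary keys such as "R&R" be matched literally;
    # re.findall returns non-overlapping matches, scanning left to right.
    return len(re.findall(re.escape(keyword), description))


def sort_titles(titles):
    """Sort titles alphabetically, ignoring letter case."""
    return sorted(titles, key=str.lower)
```

With this sketch, `count_keyword("aaaa", "aaaaa")` returns 1 because the two possible positions overlap, and `count_keyword("space", "Space: the final frontier.")` returns 0 unless `ignore_case=True` is passed.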