# Hands-On 2

###### tags: `homework/lab`

This is the place to ask/share/answer any questions related to the hands-on lab.

:::info
Make sure to **CAREFULLY READ** both the [Assignment slide](https://docs.google.com/presentation/d/1uRevIyBBRYXpB5TLw4qeEth8tIFQy0C6vRwsrNBXt60/edit#slide=id.g8325a9dbc4_0_0) and the [SOP slide](https://docs.google.com/presentation/d/1uRevIyBBRYXpB5TLw4qeEth8tIFQy0C6vRwsrNBXt60/edit#slide=id.g8318e5b55d_0_0).

Submission deadline: 4/15 Wednesday 22:00. Late submission deadline: 4/16 Thursday 22:00 **(10 point penalty)**.
:::

A few useful links for this week's lab:
* [Write "Pythonic" code](https://docs.python-guide.org/writing/style/)
* [Hyphen, en-dash, em-dash](https://en.wikipedia.org/wiki/Dash)
* [TF-IDF Solution (Medium article)](https://medium.com/@shmueli/a-tf-idf-based-news-recommendation-system-from-scratch-75e73c2acc63?source=friends_link&sk=6c276a4c5e687aabc7870a4ba4fca1e5)
* [NLTK](https://www.nltk.org/)
* [SpaCy](https://spacy.io)

---

> NLTK POS-tags "Trump’s" as `("Trump", "NNP"), ("’", "NNP"), ("s", "NN")`. Is this punctuation mark (`’`) going to be a problem?

>> Maybe you forgot to mention the tokenizer? When I tried, I got a different result. The NLTK POS tag for "Trump’s" is `NN`:
>> `pos_tag(['Trump’s'])`
>> `[('Trump’s', 'NN')]`
>> [name=Boaz] [color=#e54927]

>> Sorry, you're right. I used `word_tokenize` first, and then `pos_tag`.

>> No problem! So can you post the exact sentence you are passing to `pos_tag(word_tokenize(...))`? The more accurate and detailed the information, the better :) [name=Boaz] [color=#e54927]

>> I copied an example from the corpus:
>> `sentence = 'After Donald Trump’s election in November, he observed that the world was watching U. S. political developments with some stupefaction.'`
>> The sentence is processed with `pos_tag(word_tokenize(sentence))`, and it returns:
>> `[('After', 'IN'), ('Donald', 'NNP'), ('Trump', 'NNP'), ('’', 'NNP'), ('s', 'JJ'), ('election', 'NN'), ..... ]`
>> So `Trump` and `’` are separated and both labeled `NNP`. Also, in this sentence, `U.S.` is split into `U.` and `S.`.

> I also ran into the same problem using NLTK's `word_tokenize` and `pos_tag`. One of the answers I got was `('Trump', '’')`.
> I think `U.S.` being split is fine, since `U.` and `S.` are both classified as `NNP`? `U.S.` being one of the answers is reasonable.

>> Agreed about the `U.S.` problem, thanks! Did you run the titles instead of the contents? I got the answer `('Trump', '’')` for the titles too, but not for the contents.

>> I got `('Trump', '’')` as an answer using either one of them as the corpus.
>> After loading the corpus, I had to convert the data into a string, `sent_tokenize(str(doc))`, or an error occurs when `sent_tokenize` is called.

>> I re-ran it and got `('Trump', '’')` for the contents, too.

>> Great discussion! Here's a small puzzle: try to run the following code:
>> `pos_tag(word_tokenize("After Donald Trump's election in November, he observed that the world was watching U. S. political developments with some stupefaction."))`
>> Is it different and/or better than the previous POS tagging? Can you see why? [name=Boaz] [color=#e54927]

>> NLTK recognizes `'` but not `’` as the apostrophe. Since NLTK doesn't recognize `’`, it thinks `’` might be a new word (`NNP`)?

>> Exactly! `pos_tag()` works nicely with `'`, but not with `’`. I would guess it's because the model was trained on ASCII data. `'` is ASCII 39, and we can type it from any keyboard. `’` is the Unicode character `\u2019`.
>> Also, take a look [here](https://en.wikipedia.org/wiki/Quotation_mark#Unicode_code_point_table). [name=Boaz] [color=#e54927]

>> According to the above discussion, do we need to handle `('Trump', '’')`, or can we take it as an answer?

>> `('Trump', '’')` is not a valid answer. [name=Boaz] [color=#e54927]

>> So we need to convert `’` to `'`? If we do that, `word_tokenize("Trump's")` will become `["Trump", "'s"]`, but `Trump's` is not exactly equal to `Trump`.

>>> True! But for our purpose (finding two consecutive proper nouns) it is good enough :) [name=Boaz] [color=#e54927]

>> I found a Python package called `unidecode` that may be useful for this issue. Try `pip install unidecode` in a terminal with Python 3.x. Then:
>> ```python
>> from unidecode import unidecode
>> tst_str = "After Donald Trump’s election in November, he observed that the world was watching U. S. political developments with some stupefaction."
>> unidecode(tst_str)
>> ```
>> It should give you the ASCII-transliterated string. [name=Lien_0856721]

>>> 👍👍👍
>>> (By the way, in Google Colab you can install Python packages with `!pip`.) [name=Boaz] [color=#e54927]

>>>> If a sentence's tokens have the form `NNP1 NNP2’s NNP3 NNP4`, do all of the combinations (`NNP1 NNP2`, `NNP2 NNP3`, `NNP3 NNP4`) meet the requirement "where both tokens are PROPER NOUNS"?

>>>>> Take a hard look at the output of `pos_tag(word_tokenize(<your_original_sentence>))`
>>>>> (after replacing `’` with `'`). [name=Boaz] [color=#e54927]
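To make the thread's conclusion concrete, here is a minimal sketch of the apostrophe normalization (the `normalize_apostrophes` helper is our own name, not part of NLTK, and the tags in the comments are typical outputs, not guaranteed):

```python
from nltk import pos_tag, word_tokenize
# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

def normalize_apostrophes(text):
    # Replace the Unicode right single quotation mark (U+2019)
    # with the ASCII apostrophe (U+0027) before tokenizing.
    return text.replace('\u2019', "'")

sentence = "After Donald Trump’s election in November, he observed that the world was watching U. S. political developments with some stupefaction."

print(pos_tag(word_tokenize(sentence)))
# ... ('Trump', 'NNP'), ('’', 'NNP'), ('s', 'JJ') ...   <- stray NNP for ’

print(pos_tag(word_tokenize(normalize_apostrophes(sentence))))
# ... ('Trump', 'NNP'), ("'s", 'POS') ...   <- possessive ending, as expected
```

---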
> Do we need to remove the punctuation in part 2?

>> If we keep punctuation like `.`, `,`, etc., the symbols will end up as features, which we don't want (can you see why?). So it's advisable to remove the punctuation. With SpaCy there is an easy way to do that... [name=Boaz] [color=#e54927]

---

> I think there is a small issue in [lab1-solution.ipynb](https://colab.research.google.com/gist/bshmueli/5dee9055c6cbc386bb84ab023e8bc964/lab1-solution.ipynb). It looks like `nan` values make the sorted results weird in the code posted in the Medium article. For example, for the top-5 results, instead of finding (1) 0.4397 (2) 0.4312 (3) 0.4208 (4) 0.4205 ..., the code outputs (1) 0.4397 (2) 0.4312 (3) 0.4030 (4) 0.3627 (5) 0.3031. If you look through the whole output, you can still find 0.4208 (near a `nan`). You can find full details [here](https://colab.research.google.com/drive/1Q_4D-wsRoj7tOceG4iWllQg2_Jw1urbI) (a copy of the code you shared). There is some discussion of this on [Stack Overflow](https://stackoverflow.com/a/18062760): the reason is that `nan` is neither greater than nor less than the other elements, so there is no strict ordering defined. Thank you.

>> It's not stated in the above post, but you are referring to the vector similarity values in part (2) of this week's assignment. Since you are getting a similarity value of `nan`, check under what conditions a similarity of `nan` is returned. And then: what kind of text data produces this kind of condition? Indeed, part (b) requires a little bit of detective work :) If anyone manages to solve this puzzle, feel free to share the solution here and earn some karma points! [name=Boaz] [color=#e54927]

>>> I see, thank you for your reply. The line `if a_2 == 0 or b_2 == 0: return float('nan')` is fine for the Reuters dataset but not for the BuzzFeed dataset, since the Reuters dataset doesn't have empty contents. Alternatively, we can use some preprocessing strategy to filter out those empty contents. That said, would changing this line to `if a_2 == 0 or b_2 == 0: return 0` be better? Otherwise, the sorting (`sorted(range(len(similarities)), key=lambda i: similarities[i])[-k:]`) will still suffer from `nan` whenever `a_2 == 0 or b_2 == 0`.

>>>> Good analysis! Returning 0 from `cosine_similarity()` is indeed a quick fix (although not ideal, since mathematically it is incorrect). [name=Boaz] [color=#e54927]

>>>>> I got it. The `sorted(range(len(similarities)), key=lambda i: similarities[i])[-k:]` part is what needs to be changed. Thanks :satisfied:
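Following up on the thread above, here is a minimal sketch of a `nan`-aware top-k selection (the `top_k` name is our own; `similarities` is assumed to be the list produced by the lab's `cosine_similarity()`):

```python
import math

def top_k(similarities, k):
    # nan compares False against every value, so Python's sort gives an
    # unpredictable order when nans are mixed in. Drop them up front
    # instead of returning 0 from cosine_similarity().
    valid = [i for i, s in enumerate(similarities) if not math.isnan(s)]
    return sorted(valid, key=lambda i: similarities[i])[-k:]

sims = [0.4397, float('nan'), 0.4312, 0.4208, float('nan'), 0.4205]
print(top_k(sims, 3))  # [3, 2, 0] -> 0.4208, 0.4312, 0.4397
```

---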
> According to the above discussion, do we need to handle `('Trump', '’')`, or can we take it as an answer?

>> I've copied your question (with the answer) to the relevant discussion. [name=Boaz] [color=#e54927]

---

> Do we have to apply lowercasing in both parts of this hands-on?

>> Is it a good idea to use lowercase for part (1)? What about part (2)? [name=Boaz] [color=#e54927]

>>> Using lowercase might give very different results in part (1)! Which one should we use, or does it not matter?
>>> E.g., "New York" is NNP, but if we lowercase it, it becomes "new" (NN) and "york" (maybe still NNP); see the sketch at the bottom of this page.
>>> ![](https://i.imgur.com/6YL6ljp.png)

>>>> Choose the one that makes more sense and gives more accurate results. [name=Boaz] [color=#e54927]

---

> I noticed that the content of corpus[61] from the BuzzFeed News dataset is empty, which results in the type of the content being classified as `float`. Is that normal, or did I do something wrong somewhere?

>> Indeed! In this news dataset, a few of the articles have empty contents. (Can you see from the corresponding titles what kind of stories have empty content?) [name=Boaz] [color=#e54927]

---

> I would like to ask a question about part 1. If, for example, we see that both the bigrams (('United', 'NNP'), ('States', 'NNP')) and (('United', 'NNP'), ('States', 'NNPS')) exist, should we consider them the same when counting bigrams?
> In other words, should it be:
> 1. (('United', 'States'), 2) -> considered the same
> 2. ((('United', 'NNP'), ('States', 'NNP')), 1) and ((('United', 'NNP'), ('States', 'NNPS')), 1) -> counted separately due to the difference in tags
>
> [name=Wilbert_0856021]

>> Choose the one that makes more sense according to the homework description. [name=Boaz] [color=#e54927]

---

> Hi, I found that the title of BuzzfeedDataset[555] is `nan`, but it has content ("Brigitte Gabriel, the most...").
> Is this normal, or do I need to check my code again? Thanks 🙏 [name=Eric_0850746]

>> If you don't trust your code, it's always a good idea to examine the data :) [name=Boaz] [color=#e54927]

---

> Excuse me, I want to ask: if my answer in lab 1 was wrong, can I submit an updated one? Will you accept a late submission of it? Thank you.

>> Sorry, no late submissions. [name=Boaz] [color=#e54927]

---
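As referenced in the lowercasing thread above, here is a minimal sketch of how case affects NLTK's proper-noun tagging (the exact tags depend on the tagger model, so the comments show typical outputs):

```python
from nltk import pos_tag, word_tokenize
# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

print(pos_tag(word_tokenize("I moved to New York last year.")))
# 'New' and 'York' are typically both tagged NNP (proper noun)

print(pos_tag(word_tokenize("i moved to new york last year.")))
# Without the capitalization cue, the tagger usually tags 'new' as
# JJ or NN, so the proper-noun bigram ('New', 'York') disappears
```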