# Hands-On 1
This is the place to ask/share/answer any questions related to the hands-on lab. English only, please... thank you!

###### tags: `homework/lab`

#### We would like to make our NLP e-meetings even better. Please help by filling this [short anonymous survey](https://docs.google.com/forms/d/e/1FAIpQLSc5u5RHW96LE-tBN4NQkR4KhBQY-raAuvTFYm_FjcI4bCYbRw/viewform). Thank you!

---

> Do we have to do steps 1~3 on both title and content? Or content only?
>> The final program should use the content for computing the similarities; the output should display the corresponding titles. [name=Boaz][color=#e54927]

> Is efficiency a part of the score? Also, is there a time limit?
>> No, we do consider complexity when scoring this assignment. Please refer to the Google Slides for complete details. Thank you! [name=Boaz][color=#e54927]

> Should we remove stopwords in the reading stage (which will affect the output) or only in the computing stage (when computing TF-IDF, so the output is not affected)?
>> I found that it is not a problem for the output. Thanks!
>> Tip: the vocabulary should contain 1000 features; the features must not be stopwords or punctuation [name=Boaz][color=#e54927]

> How to deal with "U.S.", "U. S." and "U. S"? Should we categorize them as "US" or as "U" & "S"?
>> For this week's assignment, "U.S.", "U. S." and "U. S" should all be converted into the tokens "u" and "s". [name=Boaz][color=#e54927]

> Should we remove the key article, whose similarity is 1, from the result?
>> No need to remove the seed article. [name=Boaz][color=#e54927]

> How do we select 1000 features with TF-IDF? (The same way as in BOW, using the most common 1000?)
>> Please see the [Google Slides](https://docs.google.com/presentation/d/1Yo7BCyhqKtFmXaBFcBt-grcwwAJMCL4ghOgQKhF4QZ8/edit#slide=id.g81c4077e59_0_10), slide number 14, bullet 4.
>> Let me know if you still need help! [name=Boaz][color=#e54927]

> I want to make sure whether we should output the result in reverse order, because the example output order on the E3 system is reversed compared with the one in the Google Slides.
>> Both are fine :) [name=Boaz][color=#e54927]

> This project is supposed to use the article **Content** as the corpus to calculate similarity. However, I think that **Title**s are also important. A **Title** can be a kind of keyword extraction for a **Content**. Can I calculate the similarity of **Title**s and **Content** respectively, then combine them? By the way, I think a stemmer may be helpful; may I use a stemmer? [name=0856721][color=#e54927]
>> You are welcome to use both the title and the content, just make sure your Python notebook has clear documentation about the operation of the program and how you combined the two.
>> Regarding the use of a [stemmer](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html): remember that for this assignment, you cannot use any NLP library, and thus you will need to write your own, which is not trivial. [name=Boaz][color=#e54927]

> Where should we submit the link to our notebook? The link in the slides seems to be broken.
>> Please submit the notebook and the link using the e3 system. The link should start with `colab.research.google.com/…` [name=Boaz][color=#e54927]

> When computing TF, is TF the number of times a vocabulary word appears divided by the total number of words in the document, or the number of times a vocabulary word appears divided by the total number of vocabulary words in this document? I think the former is better because the latter may cause NaN.
>> In the TF-IDF formula we use, the term frequency (TF) of a token is the number of times the token appears in the document (also known as "raw count"). For example, in the document `some are born great, some achieve greatness, and some have greatness thrust upon them`, the TF of `some` is 3.
>> [name=Boaz][color=#e54927]

> ![](https://i.imgur.com/XoUiBaV.png)
> For this formula, word x may exist in multiple documents y and have different TF-IDF values. How do we select x's TF-IDF score? Select the highest? Thanks.
>> I am not sure I understand your question. I recommend referring to the [Lecture 3 slides](https://e3new.nctu.edu.tw/mod/folder/view.php?id=93407) where TF-IDF is explained. They should contain the answer to your question. [name=Boaz][color=#e54927]

> When computing TF-IDF using the content of the corpus, the RAM crashes, but it works fine when using the titles of the corpus. Do you know how to fix it?
>> Google Colab allocates up to 25GB of RAM. This is more than enough for our assignment, and indeed all other submissions don't have this issue. I suggest you carefully check your program to find which parts are consuming unreasonably large amounts of memory. [name=Boaz][color=#e54927]

> Can I modify my code on Colab if I come up with a better solution? Or should the code on Colab be the same as the uploaded one? Thank you!
>> The code on Colab must be the same as the uploaded one. [name=Boaz][color=#e54927]

> Is late submission after the deadline allowed?
>> In general there is no submission after the deadline. Let us know if you have any exceptional or extenuating circumstances. [name=Boaz][color=#e54927]

> TA, can you release the correct code for Hands-on 1? I would like to know where I went wrong.
>> Absolutely. A solution will be shared on Friday. [name=Boaz][color=#e54927]

> Hi TA, I have a question about hands-on 1. The feedback you gave me said that some stopwords remain in my corpus. I discovered that this is caused by removing stopwords before removing punctuation. I found that there are some words like "you're" and "it's" in the stopwords list, so I need to keep punctuation in the document to match the same pattern. (I know the list also contains 's' and "re".) Otherwise, the stopwords that contain punctuation will be meaningless.
> But if I do so, some stopwords like 's' will remain, derived from "U. S.". Is it correct to remove stopwords twice, before and after removing punctuation?
>> Let's discuss this on Friday. [name=Boaz][color=#e54927]

> Hi Boaz, in your Medium article, there is a spelling mistake in the sentence "The vocabilary is stored in the vocab variable." (should be vocabulary)
>> Article fixed, thank you! [name=Boaz][color=#e54927]

---
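
For reference, here is a minimal sketch in plain Python (no NLP library) of the tokenization and raw-count TF described in the answers above. It is not the official solution. The function names `tokenize`, `remove_stopwords` and `term_frequency` and the tiny `STOPWORDS` set are made up for illustration, and filtering stopwords after punctuation has been stripped is only one possible ordering, not necessarily the one the TA recommends.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list for illustration only; the real list used
# in the assignment is longer and is not reproduced here.
STOPWORDS = {"are", "and", "have", "them", "upon", "the", "you", "re", "it", "s"}


def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop stopwords from a token list that has already lost its punctuation."""
    return [t for t in tokens if t not in stopwords]


def term_frequency(tokens):
    """Raw-count TF: the number of times each token appears in one document."""
    return Counter(tokens)


doc = "some are born great, some achieve greatness, and some have greatness thrust upon them"
tf = term_frequency(tokenize(doc))
print(tf["some"])  # raw count of "some" in doc

# All three spellings end up as the same two tokens.
print(tokenize("U.S."), tokenize("U. S."), tokenize("U. S"))
```

In this sketch, "you're" is split into `you` and `re` by the tokenizer, so a single stopword pass after tokenization removes both pieces, provided the stopword list contains them.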