# Hands-On 3
###### tags: `homework/lab`
This is the place to ask/share/answer any questions related to the hands-on lab.
:::info
Make sure to **CAREFULLY READ** both the [Assignment slide](https://docs.google.com/presentation/d/1PYeqezfxk-B8I3EYmV-G9DTpBbAe3O9fqb3UW0I3NL4/edit?usp=sharing) and the [SOP slide](https://docs.google.com/presentation/d/1PYeqezfxk-B8I3EYmV-G9DTpBbAe3O9fqb3UW0I3NL4/edit?usp=sharing)
Submission deadline: 5/13 Wednesday 22:00.
Late submission deadline: 5/14 Thursday 22:00 **(10 point penalty)**.
**A grade of 0 (zero) will be given for submissions after "Late submission"**
:::
---
See [Google Slides #17](https://docs.google.com/presentation/d/1PYeqezfxk-B8I3EYmV-G9DTpBbAe3O9fqb3UW0I3NL4/edit#slide=id.g84d8703439_98_0) for more information about Part 2 (bidirectional model)
---
>If the bigram model is trained on the training tweets, then the perplexity of the test tweets under this model should be larger than that of the training tweets, right? Does it make sense for both average perplexities to be around 1? I can't really tell.
> What does V mean in Laplace smoothing? [Here](https://web.stanford.edu/~jurafsky/slp3/3.pdf) it says there are V words in the vocabulary. I am not sure whether this means all unique tokens in the whole training data, or only the tokens in each text.
>> The vocabulary is the **set** of all valid tokens.[name=Boaz][color=#e54927]
> Since the bigram model is based on the training data, do we need to apply Laplace smoothing when we calculate the avg. perplexity of the training data?
>> Please read the homework description carefully. You need to apply Laplace smoothing as clearly explained in the [Google Slide](https://docs.google.com/presentation/d/1PYeqezfxk-B8I3EYmV-G9DTpBbAe3O9fqb3UW0I3NL4/edit#slide=id.g776f20793f_0_3) (Page 15, 4th bullet)[name=Boaz][color=#e54927]
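
For reference, here is a minimal sketch of add-one (Laplace) smoothing for bigram probabilities. The function name, `Counter`-based counts, and toy data below are illustrative assumptions, not the assignment's required implementation:

```python
from collections import Counter

def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed P(w | w_prev): the bigram count gets +1 and the
    denominator grows by the vocabulary size V."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

# Toy usage (illustrative counts only, not the assignment data):
tokens = ["<s>", "i", "love", "nlp", "</s>"]
unigram_counts = Counter(tokens[:-1])                  # counts of conditioning words
bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))  # counts of adjacent token pairs
V = len(set(tokens))                                   # |vocabulary|
print(laplace_bigram_prob("i", "love", bigram_counts, unigram_counts, V))  # (1+1)/(1+5)
```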
> I have a question about the vocab, in Google Slide (Page 15, 3rd bullet): should we replace all the infrequent tokens in the corpus with '<UNK>' and then compute the bigrams, or should we leave the original corpus unchanged and only replace the terms in the vocab?
>> I treated words that appear fewer than three times as '<UNK>' when filtering the vocabulary. Then I set all words that are not in the vocabulary to '<UNK>' when building the bigrams. [name=0856154]
> I have a question about the vocab in Google Slide (Page 15, 3rd bullet) too. Do we also need to replace tokens in the test data that don't appear in the training data with **<UNK>**? I think they should be replaced since they are also unknown tokens for the language model, but the slides only mention replacing infrequent terms (appearing only 1 or 2 times in the training data).
>> After you create your vocabulary, you replace all non-vocabulary words by **<UNK>** in both the training and test data. This is explained in bullet 3. For more information about this technique, see Page 12 [here](https://web.stanford.edu/~jurafsky/slp3/3.pdf), in the paragraph that starts with "The second alternative, in situations where we don’t have a prior vocabulary in advance..."[name=Boaz][color=#e54927]
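
A minimal sketch of the vocabulary/**<UNK>** procedure described above. The threshold of keeping tokens that appear at least 3 times follows the thread; the function names and toy data are illustrative assumptions:

```python
from collections import Counter

def build_vocab(tokenized_tweets, min_count=3):
    """Keep tokens that appear at least `min_count` times in the training data."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    return {tok for tok, c in counts.items() if c >= min_count} | {"<UNK>"}

def replace_oov(tokenized_tweets, vocab):
    """Map every out-of-vocabulary token to <UNK>; apply this to both the
    training data (after the vocab is built) and the test data."""
    return [[tok if tok in vocab else "<UNK>" for tok in tweet]
            for tweet in tokenized_tweets]

# Toy usage:
train = [["i", "love", "nlp"], ["i", "love", "tweets"], ["i", "love", "nlp"]]
vocab = build_vocab(train, min_count=2)
print(replace_oov(train, vocab))
```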
> I want to sanity-check my result. Could anyone tell me what a reasonable range for the output is? Something like the result in the Colab (10.55)? I got a result less than one, so maybe I am doing something wrong?
>> I got an answer much larger than 1 :rolling_on_the_floor_laughing:
>>> My mistake. Now it is around 570. By the way, I noticed that some sentences are full of emojis, so their probabilities are very close to zero (or exactly zero). Do you have this problem?
>>>> I ran into this problem too.
>>>> The following is my solution:
>>>> First, calculate the cross-entropy $H(W)$, because each $P(w_i|w_{i-1})$ may be so small that $P(W)$ underflows to zero.
>>>> $P(W) = p(w_2|w_1) \cdots p(w_i|w_{i-1}) \cdots p(w_n|w_{n-1})$
>>>> $H(W) = -\frac{1}{N} \log_2 P(W) = -\frac{1}{N} \big(\log_2 p(w_2|w_1) + \cdots + \log_2 p(w_n|w_{n-1})\big)$
>>>> Each $\log_2 p(w_i|w_{i-1})$ stays in a representable range, so $H(W)$ can be computed in float or double precision.
>>>> Second, calculate the perplexity as $\text{Perplexity}(W) = 2^{H(W)}$.
>>>> [name=0860911]
>>>>> Thanks for your helpful solution! It works pretty well! Now I understand why Boaz showed two formulations of perplexity in the Colab.
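
A minimal sketch of the log-space computation described in the thread above; `bigram_prob` stands for whatever smoothed probability function you already built, and the toy usage is illustrative only:

```python
import math

def perplexity(sentence_tokens, bigram_prob):
    """Perplexity of one tokenized sentence, computed in log space so the
    product of many small probabilities never underflows to zero.
    `bigram_prob(w_prev, w)` is assumed to return a smoothed P(w | w_prev)."""
    log_prob = 0.0
    n = 0
    for w_prev, w in zip(sentence_tokens[:-1], sentence_tokens[1:]):
        log_prob += math.log2(bigram_prob(w_prev, w))  # sum of log2 P(w_i | w_{i-1})
        n += 1
    cross_entropy = -log_prob / n   # H(W) = -(1/N) * log2 P(W)
    return 2 ** cross_entropy       # Perplexity(W) = 2^H(W)

# Toy usage with a dummy uniform probability (an assumption, not the real model):
print(perplexity(["<s>", "hello", "world", "</s>"], lambda w_prev, w: 0.1))  # ≈ 10.0
```

Natural logs with `math.exp` give the same result, as long as the base of the logarithm and of the final exponentiation match.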
> Do we need to consider unidecode in this hands-on homework?
>> Please see first bullet in the assignment: "For tokenization, first convert to lowercase, then use the NLTK TweetTokenizer. That’s it!"[name=Boaz][color=#e54927]
> Although it is late, I still have a question about preprocessing. According to the introduction in the Google Slides, we only need to tokenize each tweet and add '\<s>' and '\</s>' to the start and end respectively. However, I found that if we split each tweet into sub-sentences (most tweets are not a single sentence) and add the 's' tags to the start and end of each sub-sentence, the average training/testing perplexity decreases (1800 -> 1360, 2270 -> 1630). The effect is even more significant in Part 2 with an optimal gamma value. Based on this observation, I am curious whether this preprocessing is genuinely effective, or whether the improvement is just due to the additional 's' tags.
>> It's a good question. You are right that if we break the corpus into shorter sentences, the perplexity will be lower because short sentences are easier to predict. What you suggest is useful if we want to look at individual sentences. It really depends on how we plan to use the model. Imagine for example that we want to create a model that will auto-generate entire tweets, not just sentences. In this case, our kind of model will probably be more useful![name=Boaz][color=#e54927]
>>>Got it! Thanks for your reply.