---
tags: NLP
title: Tweet Tokenization
---
## Tweet Tokenization
**Due Date:** Thursday, March 31, 2022, 09:00
**Submission Format:** .zip file (f.lastname.zip) with all the code (or written answers) packed within. All libraries used in the implementation should be listed in a requirements file. **Only .py and .txt files are considered for grading**
**Python version:** Python 3 or greater
**Grading:** Max of 50 points. Maximum points for each task are indicated
**Data:** [.zip for all .txt files](https://drive.google.com/file/d/1RrqXVKor165zdTleTCKN8wWzWaWA_XMG/view?usp=sharing)
## Goal
The goal of the assignment is to write a tweet tokenizer. The input to your code will be a set of tweet texts, and the output will be the tokens in each tweet. The assignment is made up of three tasks.
## Input Data
The data contains 5 files, each containing 44 tweets. Tweets are separated by newlines. For manual tokenization, only one file should be used.
## Task 1 (20 points)
As a first task you need to tokenize a small number of tweets by hand. This will allow you to understand the problem from a linguistic point of view. The guidelines for tweet tokenization are as follows:
* Each smiley is a separate token
* Each hashtag and each user reference is an individual token
* If a word is written with spaces between its letters (e.g. "N A T O"), it is converted to a single token
* If a sentence ends with a word that legitimately has a full stop (abbreviations, for example), add a final full stop
* All punctuation marks are individual tokens, including double and single quotes
* A URL is a single token
Example of output:
```
Input tweet
@xfranman Old age has made N A T O!
Tokenized tweet (separated by comma)
@xfranman , Old , age , has , made , NATO , !
```
## Task 2 (20 points)
Write a program that takes as input a list of tweets on a topic and outputs the tokenization for each one. In your program, you should implement the following 4 tokenizers (a rough sketch of options 1 and 4 is given after the list):
1. White Space Tokenization
2. [Sentencepiece](https://arxiv.org/pdf/1808.06226.pdf)
3. Tokenizing text using regular expressions
4. TweetTokenizer (optional)
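A minimal sketch of the white-space option and the (optional) TweetTokenizer option is shown below; NLTK's `TweetTokenizer` is assumed as one possible choice for item 4, and the function names are placeholders, not a required interface:
```python
# Sketch of two of the tokenizers (illustrative only, not a full solution).
from nltk.tokenize import TweetTokenizer  # assumed choice for option 4


def whitespace_tokenize(tweet):
    """White Space Tokenization: split the tweet on runs of whitespace."""
    return tweet.split()


def nltk_tweet_tokenize(tweet):
    """Tokenize a tweet with NLTK's TweetTokenizer."""
    return TweetTokenizer().tokenize(tweet)


if __name__ == "__main__":
    example = "In all life this is my SECOND tweet that i like! #life"
    print(whitespace_tokenize(example))
    print(nltk_tweet_tokenize(example))
```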
For tokenizing text using regular expressions, use the rules from Task 1: combine them into regular expressions and create a tokenizer, roughly as sketched below.
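One possible way to turn the Task 1 rules into a regex tokenizer is sketched here. The pattern and the helper that collapses spaced-out letters (the "N A T O" rule) are rough approximations and will not cover every edge case:
```python
# Sketch of a regex-based tokenizer built from the Task 1 rules (illustrative only).
import re

TOKEN_PATTERN = re.compile(r"""
    https?://\S+                  # URLs are single tokens
  | [@\#]\w+                      # user references and hashtags
  | [:;=8][-o*']?[)\](\[dDpP/]    # a few common smileys
  | \w+(?:[-']\w+)*               # ordinary words (internal - or apostrophe allowed)
  | [^\w\s]                       # any other punctuation mark, one per token
""", re.VERBOSE)

SPACED_LETTERS = re.compile(r"\b(?:[A-Za-z] )+[A-Za-z]\b")


def regex_tokenize(tweet):
    # Collapse letters written with spaces ("N A T O" -> "NATO").
    tweet = SPACED_LETTERS.sub(lambda m: m.group(0).replace(" ", ""), tweet)
    return TOKEN_PATTERN.findall(tweet)


if __name__ == "__main__":
    print(regex_tokenize("@xfranman Old age has made N A T O!"))
    # -> ['@xfranman', 'Old', 'age', 'has', 'made', 'NATO', '!']
```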
<!--You will be evaluated finally on f-measure. Precision will denote the fractions of tokens (that were outputted by your system) that matched the gold standard. Recall will denote the fraction of tokens (that were in the gold) that matched the system output. -->
Example of how your program should work:
```
$ ls
my_tokenizer.py file1.txt
```
```
$ python3 my_tokenizer.py
usage: my_tokenizer.py [-flag value]*
This program ...
list of flags:
source
-file.txt file where the input source is stored
method
-wst : White Space Tokenization
-sentpiece : Sentencepiece tokenizer
-ret : Tokenizing text using regular expressions
-twt : Tokenizing text using TweetTokenizer
output
  -indicates where to dump the results. If not indicated,
   the results are printed to the terminal
```
```
$ python3 my_tokenizer.py -method wst -source file1.txt
hello this is my output for the assignment.
8 tokens
[hello, this, is, my, output, for, the, assignment.]
In all life this is my SECOND tweet that i like! #life
12 tokens
[in, all, life, this, is, my, second, tweet, that, i, like!, #life]
...
```
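One possible way to wire up the command-line interface shown above is an argparse skeleton like the following (a sketch only; the actual tokenizer functions and output writing are assumed to live elsewhere in your program):
```python
# Sketch of an argparse-based CLI matching the flags shown above.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Tweet tokenizer")
    parser.add_argument("-source", required=True,
                        help="file where the input tweets are stored")
    parser.add_argument("-method", required=True,
                        choices=["wst", "sentpiece", "ret", "twt"],
                        help="tokenization method to use")
    parser.add_argument("-output",
                        help="where to dump the results; printed to the "
                             "terminal if not indicated")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Dispatch to the chosen tokenizer here (not shown in this sketch).
    print(args.source, args.method, args.output)
```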
### Output Format
Each tokenized tweet should be represented in the following format:
* First line contains the full original tweet as given in input file
* The next line should be an integer (n) specifying the number of tokens in the tweet
* After that, the n tokens on one line, separated by commas
* All the output should be without quotes and in lowercase letters
For each input file, the output should be written to a separate output file named "output_n.txt" ***(e.g. output_1.txt, output_3.txt)***
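A small sketch of writing one output file in this format could look like the following (the function and variable names are only illustrative):
```python
# Sketch: write tokenized tweets to output_n.txt in the required format.
def write_output(tokenized_tweets, file_index):
    """tokenized_tweets is a list of (original_tweet, tokens) pairs."""
    with open("output_%d.txt" % file_index, "w", encoding="utf-8") as out:
        for tweet, tokens in tokenized_tweets:
            out.write(tweet + "\n")                                 # original tweet
            out.write(str(len(tokens)) + "\n")                      # token count n
            out.write(", ".join(t.lower() for t in tokens) + "\n")  # lowercased tokens
```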
## Task 3 (10 points)
For this task you will use the SentencePiece text tokenizer. Your task is to read how it works and write an explanation of at least 10 sentences describing how the tokenizer works. In addition to writing the explanation, you will have to apply it to the data used in the previous tasks. **The explanation should be written in a .txt file and included in the submission .zip file.**
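As a rough illustration of applying SentencePiece to the tweet data (assuming the recent Python API of the `sentencepiece` package; the `vocab_size` value here is an arbitrary choice, not a requirement):
```python
# Sketch: train a small SentencePiece model on one tweet file and use it.
import sentencepiece as spm

# Train a subword model directly on the raw tweets (vocab_size is a guess).
spm.SentencePieceTrainer.train(
    input="file1.txt", model_prefix="tweets_sp", vocab_size=400)

# Load the trained model and tokenize a tweet into subword pieces.
sp = spm.SentencePieceProcessor(model_file="tweets_sp.model")
print(sp.encode("In all life this is my SECOND tweet that i like! #life",
                out_type=str))
```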
## Resources
1. [Regular Expressions 1](https://realpython.com/regex-python/)
2. [Regular Expressions 2](https://realpython.com/regex-python-part-2/)
3. [SentencePiece](https://github.com/google/sentencepiece)
4. [sentencepiece tokenizer](https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15)