---
tags: NLP
title: Tweet Tokenization
---
## Tweet Tokenization
**Due Date:** Thursday, March 31, 2022, 09:00
**Submission Format:** .zip file (f.lastname.zip) with all the code (or written answers) packed within. All libraries used in the implementation should be listed in a requirements file. **Only .py and .txt files are considered for grading**
**Python version:** Python 3 or greater
**Grading:** Max of 50 points. Maximum points for each task are indicated
**Data:** [.zip for all .txt files](https://drive.google.com/file/d/1RrqXVKor165zdTleTCKN8wWzWaWA_XMG/view?usp=sharing)
## Goal
The goal of the assignment is to write a tweet tokenizer. The input to your code will be a set of tweet texts, and the output will be the tokens in each tweet. The assignment is made up of three tasks.
## Input Data
The data contains 5 files, each containing 44 tweets. Tweets are separated by newlines. For manual tokenization, only one file should be used.
## Task 1 (20 points)
As a first task you need to tokenize a small number of tweets by hand. This will allow you to understand the problem from a linguistic point of view. The guidelines for tweet tokenization are as follows:
* Each smiley is a separate token
* Each hashtag and each user reference is an individual token
* If a word is written with spaces between its letters (e.g. "N A T O"), it is converted to a single token
* If a sentence ends with a word that legitimately has a full stop (abbreviations, for example), add a final full stop
* All punctuation marks are individual tokens, including double and single quotes
* A URL is a single token
Example of output:
```
Input tweet
@xfranman Old age has made N A T O!
Tokenized tweet (separated by comma)
@xfranman , Old , age , has , made , NATO , !
```
## Task 2 (20 points)
Write a program that takes as input a list of tweets on a topic and outputs the tokenization for each one. In your program, you should implement the following 4 tokenizers (a rough sketch of options 1 and 4 is given after the list):
1. White Space Tokenization
2. [Sentencepiece](https://arxiv.org/pdf/1808.06226.pdf)
3. Tokenizing text using regular expressions
4. TweetTokenizer (optional)
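A minimal sketch of the white-space option and the (optional) TweetTokenizer option is shown below; NLTK's `TweetTokenizer` is assumed as one possible choice for item 4, and the function names are placeholders, not a required interface:
```python
# Sketch of two of the tokenizers (illustrative only, not a full solution).
from nltk.tokenize import TweetTokenizer  # assumed choice for option 4


def whitespace_tokenize(tweet):
    """White Space Tokenization: split the tweet on runs of whitespace."""
    return tweet.split()


def nltk_tweet_tokenize(tweet):
    """Tokenize a tweet with NLTK's TweetTokenizer."""
    return TweetTokenizer().tokenize(tweet)


if __name__ == "__main__":
    example = "In all life this is my SECOND tweet that i like! #life"
    print(whitespace_tokenize(example))
    print(nltk_tweet_tokenize(example))
```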
For tokenizing text using regular expressions, use the rules from Task 1: combine them into regular expressions and create a tokenizer, roughly as sketched below.
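One possible way to turn the Task 1 rules into a regex tokenizer is sketched here. The pattern and the helper that collapses spaced-out letters (the "N A T O" rule) are rough approximations and will not cover every edge case:
```python
# Sketch of a regex-based tokenizer built from the Task 1 rules (illustrative only).
import re

TOKEN_PATTERN = re.compile(r"""
    https?://\S+                  # URLs are single tokens
  | [@\#]\w+                      # user references and hashtags
  | [:;=8][-o*']?[)\](\[dDpP/]    # a few common smileys
  | \w+(?:[-']\w+)*               # ordinary words (internal - or apostrophe allowed)
  | [^\w\s]                       # any other punctuation mark, one per token
""", re.VERBOSE)

SPACED_LETTERS = re.compile(r"\b(?:[A-Za-z] )+[A-Za-z]\b")


def regex_tokenize(tweet):
    # Collapse letters written with spaces ("N A T O" -> "NATO").
    tweet = SPACED_LETTERS.sub(lambda m: m.group(0).replace(" ", ""), tweet)
    return TOKEN_PATTERN.findall(tweet)


if __name__ == "__main__":
    print(regex_tokenize("@xfranman Old age has made N A T O!"))
    # -> ['@xfranman', 'Old', 'age', 'has', 'made', 'NATO', '!']
```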
<!--You will be evaluated finally on f-measure. Precision will denote the fractions of tokens (that were outputted by your system) that matched the gold standard. Recall will denote the fraction of tokens (that were in the gold) that matched the system output. -->
Example of how your program should work:
```
$ ls
my_tokenizer.py file1.txt
```
```
$ python3 my_tokenizer.py
usage: my_tokenizer.py [-flag value]*
This program ...
list of flags:
source
-file.txt file where the input source is stored
method
-wst : White Space Tokenization
-sentpiece : Sentencepiece tokenizer
-ret : Tokenizing text using regular expressions
-twt : Tokenizing text using TweetTokenizer
output
  -indicates where to dump the results. If not indicated,
   the results are printed to the terminal
```
```
$ python3 my_tokenizer.py -method wst -source file1.txt
hello this is my output for the assignment.
8 tokens
[hello, this, is, my, output, for, the, assignment.]
In all life this is my SECOND tweet that i like! #life
12 tokens
[in, all, life, this, is, my, second, tweet, that, i, like!, #life]
...
```
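One possible way to wire up the command-line interface shown above is an argparse skeleton like the following (a sketch only; the actual tokenizer functions and output writing are assumed to live elsewhere in your program):
```python
# Sketch of an argparse-based CLI matching the flags shown above.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Tweet tokenizer")
    parser.add_argument("-source", required=True,
                        help="file where the input tweets are stored")
    parser.add_argument("-method", required=True,
                        choices=["wst", "sentpiece", "ret", "twt"],
                        help="tokenization method to use")
    parser.add_argument("-output",
                        help="where to dump the results; printed to the "
                             "terminal if not indicated")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Dispatch to the chosen tokenizer here (not shown in this sketch).
    print(args.source, args.method, args.output)
```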
### Output Format
Each tokenized tweet should be represented in the following format:
* First line contains the full original tweet as given in input file
* The next line should be an integer (n) specifying the number of tokens in the tweet
* After that, the n tokens on one line, separated by commas
* All the output should be without quotes and in lowercase letters
For each input file, the output should be written to a separate output file named "output_n.txt" ***(e.g. output_1.txt, output_3.txt)***
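A small sketch of writing one output file in this format could look like the following (the function and variable names are only illustrative):
```python
# Sketch: write tokenized tweets to output_n.txt in the required format.
def write_output(tokenized_tweets, file_index):
    """tokenized_tweets is a list of (original_tweet, tokens) pairs."""
    with open("output_%d.txt" % file_index, "w", encoding="utf-8") as out:
        for tweet, tokens in tokenized_tweets:
            out.write(tweet + "\n")                                 # original tweet
            out.write(str(len(tokens)) + "\n")                      # token count n
            out.write(", ".join(t.lower() for t in tokens) + "\n")  # lowercased tokens
```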
## Task 3 (10 points)
For this task you will use the SentencePiece text tokenizer. Your task is to read how it works and write an explanation of at least 10 sentences describing how the tokenizer works. In addition to writing the explanation, you will have to apply it to the data used in the previous tasks. **The explanation should be written in a .txt file and included in the submission .zip file.**
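As a rough illustration of applying SentencePiece to the tweet data (assuming the recent Python API of the `sentencepiece` package; the `vocab_size` value here is an arbitrary choice, not a requirement):
```python
# Sketch: train a small SentencePiece model on one tweet file and use it.
import sentencepiece as spm

# Train a subword model directly on the raw tweets (vocab_size is a guess).
spm.SentencePieceTrainer.train(
    input="file1.txt", model_prefix="tweets_sp", vocab_size=400)

# Load the trained model and tokenize a tweet into subword pieces.
sp = spm.SentencePieceProcessor(model_file="tweets_sp.model")
print(sp.encode("In all life this is my SECOND tweet that i like! #life",
                out_type=str))
```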
## Resources
1. [Regular Expressions 1](https://realpython.com/regex-python/)
2. [Regular Expressions 2](https://realpython.com/regex-python-part-2/)
3. [SentencePiece](https://github.com/google/sentencepiece)
4. [sentencepiece tokenizer](https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15)