---
title: CR677 Thesis corpus
tags: Thesis, corpus, Kent
description: Main readme file for CR677 thesis corpus
---

# CR677

Thesis corpus: Automatically identifying cryptocurrency scams on Twitter using machine learning

GitHub: TBD

## Who am I?

- Camran Roudbai
- cr677@kent.ac.uk

---

### File structure

---

```bash
cr677_corpus/
├── background_researches/    --- Files used for research
├── machine_learning/         --- Folder containing machine learning data
│   ├── data/                 --- Dataset used to create the v2 and v3 models
│   │   ├── pos/              --- Known scam tweets
│   │   └── neg/              --- Known non-scam tweets
│   ├── data_v2/              --- Dataset used to create the v4 model
│   │   ├── train/            --- Training dataset
│   │   │   ├── neg/          --- Known scam tweets
│   │   │   └── pos/          --- Known non-scam tweets
│   │   └── test/             --- Testing dataset
│   │       ├── neg/          --- Known scam tweets
│   │       └── pos/          --- Known non-scam tweets
│   ├── my_model/             --- V1 model
│   ├── my_model_v2/          --- V2 model
│   ├── my_model_v3/          --- V3 model
│   └── my_model_v4/          --- V4 and final model
├── preprocessing/            --- Files used to preprocess the data
└── results/                  --- Result data
```

## Usage

---

### Installation and requirements

---

```bash
sudo apt install python3 python3-pip
pip install tensorflow numpy transformers torch matplotlib nltk emoji tweepy
```

Note that `re` and `urllib` are part of the Python standard library and do not need to be installed separately.

I ran my tests using a rather powerful GPU/CPU combo (RTX 3080 mobile / i7-10875H). A dedicated graphics card is recommended for training; failing that, use a powerful CPU, otherwise the training step will take an extremely long time. Alternatively, you can use something like Google Colab, which gives you free access to powerful hardware for machine learning training via Jupyter notebooks.

### Usage

---

You can either use one of the pre-trained models included in machine_learning/ (v4 is the recommended version, as it is the most effective with properly formatted inputs) or train your own model and use that.

#### Using the pre-trained models

---

First, you will need to gather the tweets you want to classify and convert them to the format best suited to the model you choose.

- Model v4:

In preprocessing/gather_tweets.py, specify the keyword or keywords the tweets must contain, as well as the maximum age of the tweets, as follows:

```python
search_words = ["#btc"]
date_since = "2015-11-14"
```

To specify multiple keywords, simply put "AND" between each keyword.

Launch the script and redirect the output to a text file:

```bash
python3 gather_tweets.py > machine_learning/btc_v4.txt
```

In the machine_learning/ folder, edit test_model.py to match the model being used for classification and the file where the formatted tweets to be classified are stored:

```python
model = keras.models.load_model('my_model_v4')
...
with open("btc_v4.txt", 'r') as f:
    btc = f.readlines()
```

Finally, launch the model:

```bash
python3 test_model.py
```

After initialising the model, the script prints the tweets that exceed the scam-probability threshold, along with the value associated with each tweet.
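For illustration, the core of this classification step might look something like the minimal sketch below. This is not the actual test_model.py: the 0.5 threshold and the variable names are illustrative, and it assumes the exported model accepts raw tweet text directly.

```python
import tensorflow as tf
from tensorflow import keras

# Load the saved v4 model (sketch; assumes it accepts raw text strings).
model = keras.models.load_model('my_model_v4')

# Read one formatted tweet per line, skipping blank lines.
with open("btc_v4.txt", 'r') as f:
    tweets = [line.strip() for line in f if line.strip()]

# Score every tweet in one batch.
scores = model.predict(tf.constant(tweets))

# Lower scores indicate a higher probability that the tweet is a scam;
# the 0.5 cut-off here is purely illustrative.
for tweet, score in zip(tweets, scores):
    if score[0] < 0.5:
        print(f"{score[0]:.3f}  {tweet}")
```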
The closer the value is to 0, the more probable it is that the tweet is a scam.

- Model v1 to v3:

For models v1 to v3 the steps are similar, but instead of gather_tweets.py you will need to use gather_tweets_v1.py, as it uses tokenizer_light.py (the original BERTweet tokenizer), which strips out more data (usernames, URLs, RT status, etc.) and leaves fewer features for the model to base its classification on.

##### v1-2-3 vs. v4

V4 uses a different type of tweet during training and classification. Here is a comparison between a sample of each format (usernames anonymized in the v4 example):

```
v1-3: @USER @USER Who is selling Bitcoin right now ? Square / Twitter , PayPal , Elon Musk all have billi ... HTTPURL

v4: RT @XXXX : If they sell it on @amazon , you can buy it with #crypto . Just get a giftcard on the BitPay app : ... https://t.co/XXXXXXX XXXXXXXX https://XXXXXX.com/wallet/
```

#### Training your own model

---

To train your own model, you will need a dataset of known scam and known non-scam tweets. In data/ and data_v2/ you will find the tweets I used to train models v1-3 and model v4 respectively.

Once the dataset is created, you can launch the create_model.py script, which will create, train, and validate a model based on the data you provided. Make sure to change the name of your exported model, or it will overwrite the one already provided:

```python
export_model.save('my_model_v4')
```

At the end of training, some metrics are printed; you can judge the effectiveness of your model from these. For instance, here are the results for model v4:

```
Accuraccy is : 0.9810344576835632
Precision is : 0.9784736037254333
Recall is : 1.0
F1 Score is : 0.9891196949840357
```

Two plots will also be displayed: one showing training and validation loss over epochs, the other showing training and validation accuracy over epochs. You can also find more of my results in the results/ folder.
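For reference, these metrics follow their standard definitions. Here is a minimal, hypothetical sketch of how they can be computed from a model's predictions; the `y_true`/`y_pred` names and the 1 = scam label convention are illustrative, not taken from create_model.py.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from binary labels.

    Illustrative sketch: 1 marks a scam tweet, 0 a non-scam tweet.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1
```

### Thank you!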