# Geoloc module navigation

The Geolocation module classifies users by their tweets using 4 models:

- NER
- language identification model
- Twitter API model
- decision tree model that predicts the user's country from the results of the 3 models above

## Datasets' structure

The following applies to the current dataset:

- Columns with features of the Twitter API model have the format <country_name>API, e.g. BelarusAPI, SaudiArabiaAPI. The country name starts with an uppercase letter.
  ![](https://i.imgur.com/edniUZS.png)
- Columns with features of the NER model have the format <country_name>, e.g. Belarus, SaudiArabia. The country name starts with an uppercase letter.
  ![](https://i.imgur.com/4o6hV8o.png)
- Columns with features of the language identification model have the format <language_short_form>, e.g. en, ru. All letters are lowercase.
  ![](https://i.imgur.com/aZ7q5qq.png)
- The target column with the ground truth must always be named *true_location*.

## Folder structure

There are 4 folders in the [main pipeline folder](https://drive.google.com/drive/u/1/folders/1vyKKdkKkpQCFVW2iZokMflJjERui7t70):

![](https://i.imgur.com/d0ocqTP.png)

### 1. Folder [raw_features](https://drive.google.com/drive/u/1/folders/1jIMb9YsZG4vxwwAkTCgh96H2cNhUKPRP)

This folder contains unprocessed datasets. To use them in a model, the raw files need to be preprocessed first.

![](https://i.imgur.com/z2mCzyn.png)

The folder named 3k_data contains a dataset with 3 feature types (language, NER results, Twitter API results).

Polina to Alla: let's drop the 3k+1k followers for the time being, it'll appear there in the future.

Inside this folder there are folders corresponding to the feature types.

![](https://i.imgur.com/pdGjVOi.png)

Each of these folders contains a file with raw data and a notebook that preprocesses it and puts the result in the [preprocessed_features folder](https://drive.google.com/drive/u/1/folders/1CH83WDXrVMfJVDbzvzMfFzNAjJI57XWz).
![](https://i.imgur.com/BBWLZW8.png)

The [1k_data folder](https://drive.google.com/drive/u/1/folders/1JsweRSXZX8EGg2EyVB6LHX6PQl30P5JK) contains the merged dataset of all 3 feature types and a notebook that preprocesses this dataset and puts the result in the [preprocessed_features folder](https://drive.google.com/drive/u/1/folders/1CH83WDXrVMfJVDbzvzMfFzNAjJI57XWz).

![](https://i.imgur.com/9QMTxIp.png)

### 2. Folder [preprocessed_features](https://drive.google.com/drive/u/1/folders/1CH83WDXrVMfJVDbzvzMfFzNAjJI57XWz)

This folder contains folders with preprocessed datasets of 3k, 1k, and merged (therefore 4k) rows, each with 3 feature types.

![](https://i.imgur.com/iQoXERD.png)

The [1k_data folder](https://drive.google.com/drive/u/1/folders/1cAbjIfA1yPQ36AysyXOwD7yyuF1rkcf2) contains a ready-to-use dataset with 1k users. The [3k_data folder](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3) contains:

- 3 datasets, one for each feature type ([twitter_api_data_preprocessed.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3), [ner_data_preprocessed.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3), [lang_data_preprocessed.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3))
- a [notebook to merge all 3 in one](https://colab.research.google.com/drive/172YqKWp_mCyR9rxunCA7dQ903mDbbxSS?ouid=106594660584588640625), combining the features of the 3 models
- a file with the true countries for the 3k users, named [ground_truth_3k.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3)
- a [notebook](https://colab.research.google.com/drive/10VdC3qwy8S0R99LCBIMq5NyEp-qIlMqc?authuser=1) for merging the [ground truth label data](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3) with the merged dataset of [3k users and all feature types for them](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3)
- the dataset [merged_features_3k.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3), which contains the 3 merged datasets
- the dataset [merged_features_with_ground_truth_3k.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3), which is ready for use in a model

![](https://i.imgur.com/vaArjWr.png)

**Important note!** The dataset [merged_features_with_ground_truth_3k.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3) is itself ready for use in a model. All the remaining intermediate files and notebooks can be used as a reference when processing new datasets from raw data that may be added later.

### 3. Folder [models](https://drive.google.com/drive/u/1/folders/19NQdMFtPmpdInLzN2LYszbLFN0ihE85-)

This folder contains subfolders named after the ML model types.

![](https://i.imgur.com/rketcQ6.png)

These folders contain notebooks with the corresponding model experiments, fitting, and testing.

![](https://i.imgur.com/okWd66b.png)
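
When working with a ready-to-use dataset in a model notebook, it can help to split its columns back into feature types using the naming conventions from "Datasets' structure". The snippet below is a minimal sketch of that idea, not code taken from the existing notebooks. The function name `group_columns`, the group labels, and the regular expressions are assumptions made for illustration; in particular, it assumes language short forms are 2 or 3 lowercase letters, so any other short lowercase column (for example one named `id`) would need explicit handling.

```python
import re

# Assumed patterns, derived from the column naming conventions described above.
API_COLUMN = re.compile(r"^[A-Z][A-Za-z]*API$")   # e.g. BelarusAPI, SaudiArabiaAPI
NER_COLUMN = re.compile(r"^[A-Z][A-Za-z]*$")      # e.g. Belarus, SaudiArabia
LANG_COLUMN = re.compile(r"^[a-z]{2,3}$")         # e.g. en, ru (length is an assumption)
TARGET_COLUMN = "true_location"                   # ground truth column name


def group_columns(columns):
    """Group column names by feature type (hypothetical helper)."""
    groups = {"twitter_api": [], "ner": [], "lang": [], "target": [], "other": []}
    for name in columns:
        if name == TARGET_COLUMN:
            groups["target"].append(name)
        elif API_COLUMN.match(name):
            # Checked before NER, because the NER pattern also matches "...API" names.
            groups["twitter_api"].append(name)
        elif NER_COLUMN.match(name):
            groups["ner"].append(name)
        elif LANG_COLUMN.match(name):
            groups["lang"].append(name)
        else:
            groups["other"].append(name)
    return groups


# Example with column names taken from the conventions above.
example = ["BelarusAPI", "SaudiArabiaAPI", "Belarus", "SaudiArabia", "en", "ru", "true_location"]
for feature_type, names in group_columns(example).items():
    print(feature_type, names)
```

With pandas, the same helper could be called as `group_columns(df.columns)` after loading a CSV, and a non-empty `other` group is a hint that a new dataset does not yet follow the conventions.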