# Geoloc module navigation
The geolocation module classifies users by country based on their tweets, using 4 models:
- NER
- language identification model
- Twitter API model
- decision tree model that predicts the user's country from the results of the 3 models above
## Datasets' structure
The following conventions apply to the current dataset:
- Columns with features of the Twitter API model have the format `<country_name>API`, e.g. BelarusAPI, SaudiArabiaAPI. The country name starts with an uppercase letter.
- Columns with features of the NER model have the format `<country_name>`, e.g. Belarus, SaudiArabia. The country name starts with an uppercase letter.
- Columns with features of the language identification model have the format `<language_short_form>`, e.g. en, ru. All letters are lowercase.
- The target column with the ground truth must always be named *true_location*.
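
The naming conventions above make it possible to tell the feature groups apart from column names alone. The following is a minimal sketch, not code from the project notebooks: the function name `group_columns` and the sample column `user_id` are hypothetical, and it assumes that language codes are purely alphabetic.

```python
def group_columns(columns):
    """Split column names into feature groups using the naming conventions."""
    groups = {"api": [], "ner": [], "lang": [], "target": [], "other": []}
    for name in columns:
        if name == "true_location":
            # Ground-truth target column.
            groups["target"].append(name)
        elif name.endswith("API") and name[:1].isupper():
            # Twitter API features: <country_name>API.
            groups["api"].append(name)
        elif name.isalpha() and name.islower():
            # Language identification features: all-lowercase short codes.
            groups["lang"].append(name)
        elif name[:1].isupper():
            # NER features: capitalized country name without a suffix.
            groups["ner"].append(name)
        else:
            # Anything else (ids, helper columns) is left unclassified.
            groups["other"].append(name)
    return groups


groups = group_columns(
    ["BelarusAPI", "SaudiArabia", "en", "ru", "true_location", "user_id"]
)
```

Checking for the `API` suffix before the plain capitalized case matters, because both Twitter API and NER columns start with an uppercase letter.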
## Folder structure
There are 4 folders in the [main pipeline folder](https://drive.google.com/drive/u/1/folders/1vyKKdkKkpQCFVW2iZokMflJjERui7t70):

### 1. Folder [raw_features](https://drive.google.com/drive/u/1/folders/1jIMb9YsZG4vxwwAkTCgh96H2cNhUKPRP)
This folder contains unprocessed datasets. The raw files must be preprocessed before they can be used in a model.

The folder named 3k_data contains a dataset with 3 feature types (language, NER results, Twitter API results).
Note (Polina to Alla): the 3k+1k followers dataset is left out for the time being; it will be added here in the future.
Inside the 3k_data folder there are subfolders, one per feature type.

Each of these subfolders contains a file with raw data and a notebook that preprocesses it and saves the result to the [preprocessed_features folder](https://drive.google.com/drive/u/1/folders/1CH83WDXrVMfJVDbzvzMfFzNAjJI57XWz).

The [1k_data folder](https://drive.google.com/drive/u/1/folders/1JsweRSXZX8EGg2EyVB6LHX6PQl30P5JK) contains the merged dataset of all 3 feature types and a notebook that preprocesses this dataset and saves the result to the [preprocessed_features folder](https://drive.google.com/drive/u/1/folders/1CH83WDXrVMfJVDbzvzMfFzNAjJI57XWz).

### 2. Folder [preprocessed_features](https://drive.google.com/drive/u/1/folders/1CH83WDXrVMfJVDbzvzMfFzNAjJI57XWz)
This folder contains subfolders with preprocessed datasets of 3k, 1k, and merged (therefore 4k) rows, each with 3 feature types.

The [1k_data folder](https://drive.google.com/drive/u/1/folders/1cAbjIfA1yPQ36AysyXOwD7yyuF1rkcf2) contains a ready-to-use dataset with 1k users.
The [3k_data folder](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3) contains:
- 3 datasets, one for each feature type
([twitter_api_data_preprocessed.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3), [ner_data_preprocessed.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3), [lang_data_preprocessed.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3) ),
- a [notebook to merge all 3 into one](https://colab.research.google.com/drive/172YqKWp_mCyR9rxunCA7dQ903mDbbxSS?ouid=106594660584588640625) dataset covering the 3 models
- a file with the true countries for the 3k users, named [ground_truth_3k.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3)
- a [notebook](https://colab.research.google.com/drive/10VdC3qwy8S0R99LCBIMq5NyEp-qIlMqc?authuser=1) for merging the [ground truth label data](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3) with the merged dataset of [3k users and all their feature types](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3)
- the dataset [merged_features_3k.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3), which contains the 3 merged datasets
- the dataset [merged_features_with_ground_truth_3k.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3), which is ready for use in a model
- all the remaining intermediate files and notebooks, which should be used as a reference when processing new raw datasets

**Important note!**
The dataset [merged_features_with_ground_truth_3k.csv](https://drive.google.com/drive/u/1/folders/1u12QQM1gx2fBaGve4bjH6GU7zOPw2Ue3) is itself ready for use in a model. All the remaining intermediate files and notebooks can be used as a reference for processing new datasets from raw data that may be added later.
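
The merge steps described for the 3k data (three per-feature datasets combined into one, then joined with the ground truth) can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the code from the linked notebooks: the join column `user_id`, the helper names `read_rows` and `merge_on_key`, and all sample values are hypothetical.

```python
import csv
import io


def read_rows(text, key):
    """Parse CSV text into {key value: row dict without the key column}."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        row_key = row.pop(key)
        rows[row_key] = row
    return rows


def merge_on_key(tables):
    """Inner-join several {key: row} tables, keeping keys present in all."""
    common = set(tables[0])
    for table in tables[1:]:
        common &= set(table)
    merged = {}
    for k in sorted(common):
        combined = {}
        for table in tables:
            combined.update(table[k])
        merged[k] = combined
    return merged


KEY = "user_id"  # hypothetical join column, not confirmed by the source

# Tiny made-up stand-ins for the three preprocessed files and the ground truth.
api_csv = "user_id,BelarusAPI,SaudiArabiaAPI\n1,0.9,0.1\n2,0.2,0.8\n"
ner_csv = "user_id,Belarus,SaudiArabia\n1,3,0\n2,0,5\n"
lang_csv = "user_id,ru,en\n1,0.7,0.3\n2,0.1,0.2\n"
truth_csv = "user_id,true_location\n1,Belarus\n2,SaudiArabia\n3,Belarus\n"

# Step 1: merge the three feature tables into one.
features = merge_on_key([read_rows(t, KEY) for t in (api_csv, ner_csv, lang_csv)])
# Step 2: attach the ground-truth label; users without features are dropped.
dataset = merge_on_key([features, read_rows(truth_csv, KEY)])
```

An inner join is used here so that every row in the result has all three feature types and a label; whether the notebooks drop or keep incomplete users is not stated in this guide. With pandas, the same two steps would typically be written with `DataFrame.merge(..., on=KEY, how="inner")`.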
### 3. Folder [models](https://drive.google.com/drive/u/1/folders/19NQdMFtPmpdInLzN2LYszbLFN0ihE85-)
This folder contains subfolders named after ML model types.

Each subfolder contains notebooks with the corresponding model experiments, fitting, and testing.
