# Q & A
- How did they obtain the raw data, and what source code did they use to convert it?
## Supporting GPUs in CWL
**https://github.com/common-workflow-language/common-workflow-language/issues/587**
# Running the code
NMEA code (Besançon team).
Pierre received CSV files (one zip file per day).
###### tags: `TSAR`
## Step-1: Raw data
Raw data is provided by Statsat (and some other sources) and is stored in a shared Google Drive. The data is "private" to the team and cannot be made public.
To access/download data from Google Drive, we use Data Version Control (DVC).
Raw data is in NMEA format (the text-based message format used for AIS transmissions).
**For now it is still unclear how the raw data is transformed to CSV.**
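For reference, a single AIS message in NMEA form is a text sentence like the one below. A minimal decoding sketch using the third-party `pyais` package follows; this is an assumption on my part, not necessarily the tool that was used to produce the CSV files.
```python
# Hedged sketch: decode one AIVDM sentence with the third-party
# `pyais` package (pip install pyais). NOT confirmed to be the tool
# used to generate the CSV files described below.
from pyais import decode

# Example sentence taken from the GPSD AIVDM/AIVDO documentation.
sentence = b"!AIVDM,1,1,,B,177KQJ5000G?tO`K>RA1wUbN0TKH,0*5C"

msg = decode(sentence)
print(msg.asdict())  # mmsi, lon, lat, speed, course, heading, ...
```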
## Step-2: Data pre-processing
**Detecting Intentional AIS Shutdown in Open Sea Maritime Surveillance Using Self-Supervised Deep Learning** (legacy name: pre-processing)
Journal paper & AIS shutdown detection
- Data: Google Drive `DATA/dvc/representation_learning`
- Branch: `v4` (most recent; AIS shutdown only, IEEE ITS journal)
- Code to run: `src/data_processing.py`
The script is run once per month.
It takes as input zip files that are made available per month, i.e. one folder per month.
Example: for January, the corresponding folder is called **1** and contains:
```
data % ls 1/*.zip
1/ais_20200101.zip 1/ais_20200106.zip 1/ais_20200111.zip 1/ais_20200116.zip 1/ais_20200121.zip 1/ais_20200126.zip 1/ais_20200131.zip
1/ais_20200102.zip 1/ais_20200107.zip 1/ais_20200112.zip 1/ais_20200117.zip 1/ais_20200122.zip 1/ais_20200127.zip
1/ais_20200103.zip 1/ais_20200108.zip 1/ais_20200113.zip 1/ais_20200118.zip 1/ais_20200123.zip 1/ais_20200128.zip
1/ais_20200104.zip 1/ais_20200109.zip 1/ais_20200114.zip 1/ais_20200119.zip 1/ais_20200124.zip 1/ais_20200129.zip
1/ais_20200105.zip 1/ais_20200110.zip 1/ais_20200115.zip 1/ais_20200120.zip 1/ais_20200125.zip 1/ais_20200130.zip
```
So there is one zip file per day. Each zip file contains a single CSV file with the following header and content:
```
mmsi;lon;lat;date_time_utc;sog;cog;true_heading;nav_status;rot;message_nr;source
210725000;20.1073;-34.9536;2020-01-31 00:00:00;12.4;268.5;269;0;-2;1;s
212111000;16.1965;70.1045;2020-01-31 00:00:00;12.9;48.3;44;0;0;1;g
219332000;-40.2628;-22.554;2020-01-31 00:00:00;0.0;216.9;37;3;4;1;s
219348000;7.25622;57.9044;2020-01-31 00:00:00;19.0;279.5;282;0;0;1;g
219468000;3.2278;56.5623;2020-01-31 00:00:00;0.2;86.6;272;0;71;3;g
235011640;1.5655;61.4111;2020-01-31 00:00:00;1.1;97.9;123;0;-720;3;g
257019000;16.3963;69.4115;2020-01-31 00:00:00;6.3;19.5;16;0;37;3;g
257014940;10.9415;64.9146;2020-01-31 00:00:00;0.0;360.0;511;7;-731;1;g
257149000;15.066;68.9111;2020-01-31 00:00:00;0.0;189.5;44;7;0;1;g
257961500;14.5809;68.2411;2020-01-31 00:00:00;0.1;41.3;511;-99;-99;18;g
```
So the data contains both satellite messages ("s" in the `source` column) and ground-station messages ("g").
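For example, pandas can read one daily CSV straight out of its zip file (file name taken from the listing above):
```python
# Read one day of AIS messages; pandas infers the zip compression
# and reads the single CSV inside it.
import pandas as pd

df = pd.read_csv("1/ais_20200101.zip", sep=";", parse_dates=["date_time_utc"])

# Split satellite vs ground-station messages on the `source` column.
sat = df[df["source"] == "s"]
ground = df[df["source"] == "g"]
print(len(sat), len(ground))
```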
Then we run **data_processing.py** for each month; the months are independent, so this step is embarrassingly parallel (see the parallel-run sketch below):
```
python data_processing.py -m 1
```
Write "data/processed/1.h5"
Folders are a bit messy.
data_processing.py assumes the zip files are in ../data/1/*.zip
but output data is written in the local folder (where the code is) in data/processed/1.h5
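Since the months are independent, they can be processed in parallel. A minimal sketch (hypothetical helper, not part of the repository) using Python's multiprocessing:
```python
# Hypothetical helper (not in the repo): run the per-month
# pre-processing in parallel, one subprocess per month.
import subprocess
from multiprocessing import Pool

def process_month(month: int) -> int:
    return subprocess.call(["python", "data_processing.py", "-m", str(month)])

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # 4 months at a time
        print(pool.map(process_month, range(1, 13)))
```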
Input data and `params.yaml` can be copied from the ex3 cluster:
```
scp -i $HOME/.ssh/id_rsa_bullet annefou@srl-login1.ex3.simula.no:/global/D1/projects/Dynamic_auto_encoder/data/1/\*.zip .
scp -i $HOME/.ssh/id_rsa_bullet annefou@srl-login1.ex3.simula.no:/global/D1/projects/Dynamic_auto_encoder/AAAI_context_aware_ae_reproducibility_code/params.yaml .
```
This step also seems to be available from another GitLab repository: https://gitlab.com/reproducibility-code/context-aware-autoencoders-for-anomaly-detection-in-maritime-surveillance
We still need the zip files (one folder per month; each folder contains many zip files, one per day, and each zip file contains a CSV file). It is still unclear how these CSV files were generated from the raw data.
## Get input data (not raw)
```
for f in {1..12}; do echo $f; mkdir $f; scp -i $HOME/.ssh/id_rsa_bullet annefou@srl-login1.ex3.simula.no:/global/D1/projects/Dynamic_auto_encoder/data/$f/\*.zip $f/.; done
```
## Run preprocessing to generate HDF5
```
for f in {1..12}; do python -m src.data_processing -A -m $f; done
```
The `-A` option creates a new column with the distance to the closest harbor.
The `-A` option slows the code down considerably, and this part should be optimized, for example (see the sketch after this list):
- Create a grid (at which resolution? 10 m?) with pre-computed distances; the main task is then to find the nearest grid point and take its pre-computed distance.
- Use a BallTree (or a similar spatial index).
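A minimal sketch of the BallTree idea with scikit-learn; the `harbors` and `positions` arrays are hypothetical placeholders, not names from the actual code:
```python
# Nearest-harbor distance via scikit-learn's BallTree. The haversine
# metric expects [lat, lon] in radians and returns great-circle
# distances in radians, so multiply by the Earth radius to get km.
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

harbors = np.radians([[59.91, 10.75], [60.39, 5.32]])    # [lat, lon]
positions = np.radians([[61.41, 1.57], [69.41, 16.40]])  # AIS positions

tree = BallTree(harbors, metric="haversine")
dist_rad, _ = tree.query(positions, k=1)  # distance to nearest harbor
print(dist_rad[:, 0] * EARTH_RADIUS_KM)   # distances in km
```
The tree is built once over the (small) set of harbors, and all vessel positions are then queried in a single batch, which is what makes this fast.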
For reference, processing one month currently takes about 13 minutes:
```
python -m data_processing -m 1  577.36s user 149.28s system 95% cpu 12:41.10 total
```
## Concatenate all outputs
`data_concatenation.py` concatenates all the months (`1.h5`, `2.h5`, etc.) into `all.h5`.
It is not clear whether `all.h5` is actually used.
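A hedged sketch of what `data_concatenation.py` presumably does, assuming each monthly file stores a single pandas object (so `read_hdf` needs no key); the `key` used for writing is a guess:
```python
# Concatenate the monthly HDF5 tables into one file.
import pandas as pd

months = [pd.read_hdf(f"data/processed/{m}.h5") for m in range(1, 13)]
pd.concat(months, ignore_index=True).to_hdf(
    "data/processed/all.h5", key="data", mode="w"
)
```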
## Create datasets
This step creates the training, validation, and test datasets.
```
python datasets_creation.py -m 1 -M 1
```
- `-m`: start month
- `-M`: end month
The result is one set of `.npy` files per split (here for month 1):
```
(tsar) annef@Annes-MacBook-Pro src % ls -lrt ../../data/dataset
total 11509880
-rw-r--r-- 1 annef staff 4552014512 Nov 14 14:10 train_trajectories_1.npy
-rw-r--r-- 1 annef staff 29751856 Nov 14 14:10 train_vesseltype_1.npy
-rw-r--r-- 1 annef staff 29751856 Nov 14 14:10 train_status_1.npy
-rw-r--r-- 1 annef staff 29751856 Nov 14 14:10 train_fishing_type_1.npy
-rw-r--r-- 1 annef staff 29751856 Nov 14 14:10 train_combination_1.npy
-rw-r--r-- 1 annef staff 562702304 Nov 14 14:10 test_trajectories_1.npy
-rw-r--r-- 1 annef staff 3677920 Nov 14 14:10 test_vesseltype_1.npy
-rw-r--r-- 1 annef staff 3677920 Nov 14 14:10 test_status_1.npy
-rw-r--r-- 1 annef staff 3677920 Nov 14 14:10 test_fishing_type_1.npy
-rw-r--r-- 1 annef staff 3677920 Nov 14 14:10 test_combination_1.npy
-rw-r--r-- 1 annef staff 594418592 Nov 14 14:10 validation_trajectories_1.npy
-rw-r--r-- 1 annef staff 3885216 Nov 14 14:10 validation_vesseltype_1.npy
-rw-r--r-- 1 annef staff 3885216 Nov 14 14:10 validation_status_1.npy
-rw-r--r-- 1 annef staff 3885216 Nov 14 14:10 validation_fishing_type_1.npy
-rw-r--r-- 1 annef staff 3885216 Nov 14 14:10 validation_combination_1.npy
```
Everything is stored as numpy arrays (not very portable...).
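The arrays can at least be inspected quickly; `mmap_mode="r"` avoids pulling the ~4.5 GB trajectories file entirely into memory (paths taken from the listing above):
```python
# Quick sanity check of the generated arrays without loading
# the large trajectory file fully into RAM.
import numpy as np

traj = np.load("../../data/dataset/train_trajectories_1.npy", mmap_mode="r")
status = np.load("../../data/dataset/train_status_1.npy")
print(traj.shape, traj.dtype)
print(status.shape, status.dtype)
```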