# Q & A - How did they get the data (raw data) and what source code did they use to convert?

## Support GPUs and CWL

**https://github.com/common-workflow-language/common-workflow-language/issues/587**

# Running code

NMEA code (Besançon team). Pierre received CSV files (one zip file per day).

###### tags: `TSAR`

## Step-1: Raw data

Raw data is provided by Statsat (and some other sources) and is stored in a shared Google Drive. The data is "private" to the team and cannot be made public. To access/download data from Google Drive, we use Data Version Control (DVC).

Raw data is in NMEA format (a specific AIS transmission format consisting of text messages).

**For now, I still do not know how the raw data is transformed to CSV.**

## Step-2: Data pre-processing

**Detecting Intentional AIS Shutdown in Open Sea Maritime Surveillance Using Self-Supervised Deep Learning** (legacy name: pre-processing). Journal paper & AIS shutdown detection.

- Data: Google Drive `DATA/dvc/representation_learning`
- Branch: `v4` -> most recent: only AIS shutdown (IEEE ITS journal)
- Code to run: `src/data_processing.py`

The script is run once per month. It takes as input the zip files that are made available per month, i.e. one folder per month. Example: for January, the corresponding folder is called **1** and contains:

```
data % ls 1/*.zip
1/ais_20200101.zip 1/ais_20200106.zip 1/ais_20200111.zip 1/ais_20200116.zip 1/ais_20200121.zip 1/ais_20200126.zip 1/ais_20200131.zip
1/ais_20200102.zip 1/ais_20200107.zip 1/ais_20200112.zip 1/ais_20200117.zip 1/ais_20200122.zip 1/ais_20200127.zip
1/ais_20200103.zip 1/ais_20200108.zip 1/ais_20200113.zip 1/ais_20200118.zip 1/ais_20200123.zip 1/ais_20200128.zip
1/ais_20200104.zip 1/ais_20200109.zip 1/ais_20200114.zip 1/ais_20200119.zip 1/ais_20200124.zip 1/ais_20200129.zip
1/ais_20200105.zip 1/ais_20200110.zip 1/ais_20200115.zip 1/ais_20200120.zip 1/ais_20200125.zip 1/ais_20200130.zip
```

So there is one zip file per day, and each zip file contains exactly one CSV file. The CSV is semicolon-separated with the following header/content (see the loading sketch below):

```
mmsi;lon;lat;date_time_utc;sog;cog;true_heading;nav_status;rot;message_nr;source
210725000;20.1073;-34.9536;2020-01-31 00:00:00;12.4;268.5;269;0;-2;1;s
212111000;16.1965;70.1045;2020-01-31 00:00:00;12.9;48.3;44;0;0;1;g
219332000;-40.2628;-22.554;2020-01-31 00:00:00;0.0;216.9;37;3;4;1;s
219348000;7.25622;57.9044;2020-01-31 00:00:00;19.0;279.5;282;0;0;1;g
219468000;3.2278;56.5623;2020-01-31 00:00:00;0.2;86.6;272;0;71;3;g
235011640;1.5655;61.4111;2020-01-31 00:00:00;1.1;97.9;123;0;-720;3;g
257019000;16.3963;69.4115;2020-01-31 00:00:00;6.3;19.5;16;0;37;3;g
257014940;10.9415;64.9146;2020-01-31 00:00:00;0.0;360.0;511;7;-731;1;g
257149000;15.066;68.9111;2020-01-31 00:00:00;0.0;189.5;44;7;0;1;g
257961500;14.5809;68.2411;2020-01-31 00:00:00;0.1;41.3;511;-99;-99;18;g
```

So the data contains both satellite messages ("s" in the last column) and ground messages ("g"). We then run **data_processing.py** for each month (the months are independent, so this is embarrassingly parallel):

```
python data_processing.py -m 1
```

This writes `data/processed/1.h5`.

Folders are a bit messy: `data_processing.py` assumes the zip files are in `../data/1/*.zip`, but the output is written in the local folder (where the code is) as `data/processed/1.h5`.

```
scp -i $HOME/.ssh/id_rsa_bullet annefou@srl-login1.ex3.simula.no:/global/D1/projects/Dynamic_auto_encoder/data/1/\*.zip .
scp -i $HOME/.ssh/id_rsa_bullet annefou@srl-login1.ex3.simula.no:/global/D1/projects/Dynamic_auto_encoder/AAAI_context_aware_ae_reproducibility_code/params.yaml .
```
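For quick inspection, here is a minimal sketch, assuming only pandas, of loading one daily file and splitting satellite from ground messages. The path follows the folder layout above; pandas reads a zip that contains a single CSV transparently.

```python
# Minimal sketch: load one daily AIS zip and split satellite ("s")
# from ground ("g") messages. Assumes pandas is installed; the path
# follows the per-month folder layout shown above.
import pandas as pd

# pandas infers zip compression and reads the single CSV inside
df = pd.read_csv(
    "data/1/ais_20200101.zip",
    sep=";",
    parse_dates=["date_time_utc"],
)

satellite = df[df["source"] == "s"]
ground = df[df["source"] == "g"]
print(f"{len(satellite)} satellite and {len(ground)} ground messages")
```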
I think this step is also available from another GitLab repository: https://gitlab.com/reproducibility-code/context-aware-autoencoders-for-anomaly-detection-in-maritime-surveillance

We still need the zip files (one folder per month; each folder contains many zip files, one per day, and each zip file contains one CSV file; it is still unclear how these CSV files were generated from the raw data).

## Get input data (not raw)

```
for f in {1..12}; do echo $f; mkdir $f; scp -i $HOME/.ssh/id_rsa_bullet annefou@srl-login1.ex3.simula.no:/global/D1/projects/Dynamic_auto_encoder/data/$f/\*.zip $f/.; done
```

## Run preprocessing to generate HDF5

```
for f in {1..12}; do python -m src.data_processing -A -m $f; done
```

The `-A` option adds a new column with the distance to the closest harbor. This option slows down the code considerably, and this part should be optimized. For example:

- Create a grid (which resolution? 10 m?) with pre-computed distances; the main task is then to find the nearest grid point and look up its pre-computed distance.
- Use the BallTree algorithm or similar (see the sketch at the end of this note).

For reference, processing one month takes about 13 minutes:

```
python -m data_processing -m 1  577.36s user 149.28s system 95% cpu 12:41.10 total
```

## Concatenate all outputs

`data_concatenation.py` concatenates all the months (1.h5, 2.h5, etc.) into `all.h5`. It is not clear whether `all.h5` is actually used.

## Create datasets

This step creates the "train, validation and test" datasets:

```
python datasets_creation.py -m 1 -M 1
```

- `-m`: start month
- `-M`: end month

The result of this step is three datasets, for training, validation and testing:

```
(tsar) annef@Annes-MacBook-Pro src % ls -lrt ../../data/dataset
total 11509880
-rw-r--r--  1 annef  staff  4552014512 Nov 14 14:10 train_trajectories_1.npy
-rw-r--r--  1 annef  staff    29751856 Nov 14 14:10 train_vesseltype_1.npy
-rw-r--r--  1 annef  staff    29751856 Nov 14 14:10 train_status_1.npy
-rw-r--r--  1 annef  staff    29751856 Nov 14 14:10 train_fishing_type_1.npy
-rw-r--r--  1 annef  staff    29751856 Nov 14 14:10 train_combination_1.npy
-rw-r--r--  1 annef  staff   562702304 Nov 14 14:10 test_trajectories_1.npy
-rw-r--r--  1 annef  staff     3677920 Nov 14 14:10 test_vesseltype_1.npy
-rw-r--r--  1 annef  staff     3677920 Nov 14 14:10 test_status_1.npy
-rw-r--r--  1 annef  staff     3677920 Nov 14 14:10 test_fishing_type_1.npy
-rw-r--r--  1 annef  staff     3677920 Nov 14 14:10 test_combination_1.npy
-rw-r--r--  1 annef  staff   594418592 Nov 14 14:10 validation_trajectories_1.npy
-rw-r--r--  1 annef  staff     3885216 Nov 14 14:10 validation_vesseltype_1.npy
-rw-r--r--  1 annef  staff     3885216 Nov 14 14:10 validation_status_1.npy
-rw-r--r--  1 annef  staff     3885216 Nov 14 14:10 validation_fishing_type_1.npy
-rw-r--r--  1 annef  staff     3885216 Nov 14 14:10 validation_combination_1.npy
```

Everything is stored as NumPy arrays (not very portable...).
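As a follow-up to the BallTree suggestion above, here is a minimal sketch assuming scikit-learn is available. The arrays `harbor_coords` and `ais_coords` are hypothetical placeholders, not names from the actual code; scikit-learn's haversine metric expects (lat, lon) in radians and returns great-circle distances on the unit sphere.

```python
# Hypothetical sketch of the BallTree idea for the -A option: build a
# tree over harbor positions once, then query the nearest harbor for
# all AIS positions in one vectorised call.
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

harbor_coords = np.array([[59.91, 10.75],   # (lat, lon) in degrees
                          [63.43, 10.40]])  # made-up example harbors
ais_coords = np.array([[60.00, 10.00],      # made-up example positions
                       [64.00, 11.00]])

# The haversine metric requires (lat, lon) in radians.
tree = BallTree(np.radians(harbor_coords), metric="haversine")
dist_rad, _ = tree.query(np.radians(ais_coords), k=1)

# Distances come back on the unit sphere; scale to kilometres.
dist_to_harbor_km = dist_rad[:, 0] * EARTH_RADIUS_KM
print(dist_to_harbor_km)
```

Building the tree is done once, and each nearest-neighbor query is roughly logarithmic in the number of harbors, which should avoid the pairwise distance computation that presumably makes `-A` slow.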