# T-SAR project

###### tags: `TSAR`

## Glossary

- AIS: Automatic Identification System

## Research Object in RoHub

https://w3id.org/ro-id/7998d851-41e8-4c51-aa06-deff6fd5f09a

## Google Drive & Document

https://docs.google.com/document/d/19Qc0lhSPTjIbaZdruwgNjxIk-EW3C7HQ2BGX1iEKfRo/edit

## Data version control

Data Version Control (DVC) was used initially but has been abandoned.

- https://dvc.org
- Google Drive & DVC: https://dvc.org/doc/user-guide/how-to/setup-google-drive-remote

## Connection to EX3

### Within Simula premises (SRL Wifi)

```
ssh -Y2C -i $HOME/.ssh/id_rsa_bullet $USER@srl-login1.ex3.simula.no
```

### External access to EX3

```
ssh -Y2C -i $HOME/.ssh/id_rsa_bullet $USER@dnat.simula.no -p 60441
```

## Connection to Virtual Machine (Ubuntu)

```
ssh -X -i .ssh/tsar.pem ubuntu@147.251.21.143
```

## Connection using container with GPUs

Login on EX3 (with a tunnel for Jupyter):

```
ssh -i $HOME/.ssh/id_rsa_bullet -t -L 6826:localhost:6826 annefou@srl-login1.ex3.simula.no ssh -L 6826:localhost:6826 annefou@10.128.0.31
```

Then:

```
module load singularity-ce/3.10.0
```

Then check which GPU is free:

```
nvidia-smi
```

If an NVIDIA GPU is free (here GPU 3):

```
CUDA_VISIBLE_DEVICES=3 singularity run --nv tensorflow_22.12-tf2-py3.sif jupyter lab --no-browser --port 6826 --ip=0.0.0.0
```

Then open a browser locally, e.g.:

```
http://localhost:6826
```

### Data

#### AIS input data

The original raw data was not used (no code was provided to read it) because the PhD student had too many difficulties with it; a new set of data was provided instead (zipped csv). The data used by the PhD student was provided to me as zipped csv files:

- one folder per month
- in each folder, one file per day
- each file is a zipped csv

This covers one year of data (2020). On EX3, the data is available in `/global/D1/projects/Dynamic_auto_encoder/data_2`:

```
total 218623126
drwxr-sr-x 2 pierbernabe VIAS          32 Oct 24 17:41 7
-rw-r--r-- 1 pierbernabe VIAS 22858619976 Oct 24 17:42 05.zip
-rw-r--r-- 1 pierbernabe VIAS 11808122600 Oct 24 17:42 12.zip
-rw-r--r-- 1 pierbernabe VIAS 15273734189 Oct 24 17:43 11.zip
-rw-r--r-- 1 pierbernabe VIAS 23494586610 Oct 24 17:44 08.zip
drwxr-sr-x 2 pierbernabe VIAS          32 Oct 24 17:45 1
drwxr-sr-x 2 pierbernabe VIAS          30 Oct 24 17:46 2
-rw-r--r-- 1 pierbernabe VIAS 21658866912 Oct 24 17:47 10.zip
-rw-r--r-- 1 pierbernabe VIAS 13934018250 Oct 24 17:47 03.zip
drwxr-sr-x 2 pierbernabe VIAS          32 Oct 24 17:48 8
drwxr-sr-x 2 pierbernabe VIAS          32 Oct 24 17:49 5
-rw-r--r-- 1 pierbernabe VIAS 13047349461 Oct 24 17:49 02.zip
drwxr-sr-x 2 pierbernabe VIAS          31 Oct 24 17:50 9
drwxr-sr-x 2 pierbernabe VIAS          31 Oct 24 17:51 6
-rw-r--r-- 1 pierbernabe VIAS 21026944568 Oct 24 17:52 09.zip
drwxr-sr-x 2 pierbernabe VIAS          32 Oct 24 17:53 3
drwxr-sr-x 2 pierbernabe VIAS          32 Oct 24 17:53 10
-rw-r--r-- 1 pierbernabe VIAS 13825555823 Oct 24 17:54 01.zip
drwxr-sr-x 2 pierbernabe VIAS          31 Oct 24 17:55 4
drwxr-sr-x 2 pierbernabe VIAS          31 Oct 24 17:56 11
-rw-r--r-- 1 pierbernabe VIAS 22814658248 Oct 24 17:56 07.zip
-rw-r--r-- 1 pierbernabe VIAS 22253417527 Oct 24 17:57 04.zip
-rw-r--r-- 1 pierbernabe VIAS 21874196166 Oct 24 17:58 06.zip
drwxr-sr-x 2 pierbernabe VIAS          32 Oct 24 17:58 12
```

Each month is then pre-processed to compute additional information (columns).

##### Links for reading AIS raw data

The following web links may be useful:

- https://towardsdatascience.com/automatic-identification-system-and-parquet-365160852b3
- https://github.com/ScottSyms/AISArchive
- https://github.com/schwehr/libais
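For a quick look at the data, each daily zip can be read directly with pandas, which transparently decompresses a zip archive that contains a single file. A minimal sketch, assuming the folder layout above and the semicolon-separated format shown under Step-1 below (the path is illustrative):

```
import pandas as pd

# One day of AIS messages; pandas decompresses the zip transparently
# as long as it contains exactly one csv file.
df = pd.read_csv(
    "1/ais_20200101.zip",  # illustrative path: month folder "1", one zip per day
    sep=";",
    parse_dates=["date_time_utc"],
)
print(df.shape)
print(df["source"].value_counts())  # "s" = satellite, "g" = ground
```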
#### Auxiliary input data

Additional csv files are used, for instance during preprocessing:

- `mmsi_with_vesseltype.csv` (2 columns, one header row): a list of MMSI (vessel identifiers) with the corresponding vessel type.
  - Q: Who provided this file? How is it generated? Is there a database with valid MMSI and vessel types?
  - Q: Can this dataset be published as Open data? (why not?!)
- `anchorages.csv` (6 columns, one header row): contains s2id, latitude, longitude, label, sublabel, iso3. Used to find the nearest anchorage for each AIS message.
  - Q: Where does it come from?
  - Q: Can this dataset be published as Open data?
- `fishing-vessels-v2.csv` (one header row): characteristics of fishing vessels (identified by MMSI). Columns: mmsi, flag_ais, flag_registry, flag_gfw, vessel_class_inferred, vessel_class_inferred_score, vessel_class_registry, vessel_class_gfw, self_reported_fishing_vessel, length_m_inferred, length_m_registry, length_m_gfw, engine_power_kw_inferred, engine_power_kw_registry, engine_power_kw_gfw, tonnage_gt_inferred, tonnage_gt_registry, tonnage_gt_gfw, registries_listed, fishing_hours_2012, fishing_hours_2013, fishing_hours_2014, fishing_hours_2015, fishing_hours_2016, fishing_hours_2017, fishing_hours_2018, fishing_hours_2019, fishing_hours_2020.
  - Q: Who provided this file?
  - Q: Is https://globalfishingwatch.org relevant?

## Codes

The PhD student focused on machine learning, i.e. the development of deep-learning codes to classify the type of message (voluntary AIS shutdown or not). The workflow is:

1. Pre-processing (including auto-labelling)
2. Preparation of the input data for machine learning (training, test, validation)
3. Training: sometimes with a loop over different models (because the focus is on ML)
4. Test & validation
5. "Visualization": basic visualization

No proper workflow management system is used (the attempt to use DVC was abandoned); something like Snakemake, which is simple enough for researchers, would probably be better.

## Preprocessing code

The source code is usually in `pre-processing/src`. The `param.yml` file is a parameter file where you specify the input files, where to write the output, etc.

The preprocessing step reads the zipped csv files. The script is often called `data_processing.py`; its options differ slightly from one repository to another (one repo = one paper). Some options compute additional columns, such as the nearest distance to a satellite or anchorage; others select only certain types of vessels, such as fishing vessels. Current issues:

- Paths are hard-coded.
- Processing is usually done per month, accumulating data within the month (the rationale for this is not really justified).
- The code needs to be rewritten: use classes to make it clearer and easier to extend, separate the pre-processing from the auto-labelling (see the sketch after this section), support different data formats, add quality controls, etc. Take sample AIS data.

It would be much better to release the new code under the MIT license, but obviously there is no interest in discussing this option and deciding before the end of the project; it is an issue for the Research Engineer. One option would be to separate the auto-labelling (which is what the PhD student has in his papers along with the ML) and make the pre-processing (QC, etc.) open under the MIT license.
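A minimal sketch of what the proposed split could look like; all class and method names below are hypothetical and not taken from the existing code:

```
import pandas as pd


class Preprocessor:
    """Quality control and derived columns (candidate for the open, MIT-licensed part)."""

    def __init__(self, anchorages: pd.DataFrame):
        self.anchorages = anchorages

    def run(self, messages: pd.DataFrame) -> pd.DataFrame:
        return self.add_features(self.quality_control(messages))

    def quality_control(self, messages: pd.DataFrame) -> pd.DataFrame:
        # Example check: drop positions outside valid lat/lon ranges.
        valid = messages["lat"].between(-90, 90) & messages["lon"].between(-180, 180)
        return messages[valid]

    def add_features(self, messages: pd.DataFrame) -> pd.DataFrame:
        # Placeholder: distance to nearest anchorage, distance to satellite, ...
        return messages


class AutoLabeller:
    """Auto-labelling of suspected AIS shutdowns (the part tied to the papers)."""

    def label(self, messages: pd.DataFrame) -> pd.DataFrame:
        # Placeholder: assign a shutdown / no-shutdown label, e.g. per trajectory gap.
        messages["label"] = 0
        return messages
```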
### conda environment

There is a file called `requirements.txt`, but it seems to contain many more packages than needed. I tried to make a smaller conda environment:

```
name: tsar
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - pandas
  - tensorflow
  - tqdm
  - numba
  - swifter
  - scikit-learn
  - flask
  - numpy_ringbuffer
  - pyorbital
  - plotly
  - matplotlib
  - python-kaleido
  - pip
  - pip:
      - conda-lock[pip_support]
```

`conda-lock` is meant to generate a more reproducible environment across different architectures.

The best would be to have a separate conda environment for each step in the workflow (one for pre-processing, auto-labelling, ML, prediction & visualization).

## Gitlab repositories

There are several repositories, one per paper. For now this is not really relevant: the focus is on pre-processing & auto-labelling, then ML and visualization.

## Step-1: data pre-processing

**Detecting Intentional AIS Shutdown in Open Sea Maritime Surveillance Using Self-Supervised Deep Learning** (legacy name: pre-processing). Journal paper & AIS shutdown detection.

- Data: Google Drive `DATA/dvc/representation_learning`
- Branch: `v4` -> most recent: only AIS shutdown (IEEE ITS journal)
- Code to run: `src/data_processing.py`, to be run per month

It takes as input the data zip files, which are available per month (one folder per month). Example for January: the corresponding folder is called **1** and contains:

```
data % ls 1/*.zip
1/ais_20200101.zip 1/ais_20200106.zip 1/ais_20200111.zip 1/ais_20200116.zip 1/ais_20200121.zip 1/ais_20200126.zip 1/ais_20200131.zip
1/ais_20200102.zip 1/ais_20200107.zip 1/ais_20200112.zip 1/ais_20200117.zip 1/ais_20200122.zip 1/ais_20200127.zip
1/ais_20200103.zip 1/ais_20200108.zip 1/ais_20200113.zip 1/ais_20200118.zip 1/ais_20200123.zip 1/ais_20200128.zip
1/ais_20200104.zip 1/ais_20200109.zip 1/ais_20200114.zip 1/ais_20200119.zip 1/ais_20200124.zip 1/ais_20200129.zip
1/ais_20200105.zip 1/ais_20200110.zip 1/ais_20200115.zip 1/ais_20200120.zip 1/ais_20200125.zip 1/ais_20200130.zip
```

So there is one zip file per day, and each zip file contains a single csv file. The format is semicolon-separated csv with the following header and content:

```
mmsi;lon;lat;date_time_utc;sog;cog;true_heading;nav_status;rot;message_nr;source
210725000;20.1073;-34.9536;2020-01-31 00:00:00;12.4;268.5;269;0;-2;1;s
212111000;16.1965;70.1045;2020-01-31 00:00:00;12.9;48.3;44;0;0;1;g
219332000;-40.2628;-22.554;2020-01-31 00:00:00;0.0;216.9;37;3;4;1;s
219348000;7.25622;57.9044;2020-01-31 00:00:00;19.0;279.5;282;0;0;1;g
219468000;3.2278;56.5623;2020-01-31 00:00:00;0.2;86.6;272;0;71;3;g
235011640;1.5655;61.4111;2020-01-31 00:00:00;1.1;97.9;123;0;-720;3;g
257019000;16.3963;69.4115;2020-01-31 00:00:00;6.3;19.5;16;0;37;3;g
257014940;10.9415;64.9146;2020-01-31 00:00:00;0.0;360.0;511;7;-731;1;g
257149000;15.066;68.9111;2020-01-31 00:00:00;0.0;189.5;44;7;0;1;g
257961500;14.5809;68.2411;2020-01-31 00:00:00;0.1;41.3;511;-99;-99;18;g
```

So the data contains both satellite data ("s" in the last column) and ground-based data ("g").

### Run preprocessing to generate HDF5

```
for f in {1..12}; do python -m src.data_processing -A -m $f; done
```

The `-A` option creates an additional column with the distance to the closest harbor. This option slows the code down considerably; the algorithm needs to be changed! For example (see the sketch after this list):

- Use a BallTree algorithm or similar.
- Create a grid (at which resolution? 10 m?) with pre-computed distances; the main task is then only to find the nearest grid point and take its pre-computed distance.
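A minimal sketch of the BallTree option with scikit-learn (already in the draft conda environment), assuming `anchorages.csv` has the latitude/longitude columns described above. The haversine metric works on coordinates in radians and returns great-circle distances on the unit sphere, hence the scaling by the Earth's radius:

```
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

anchorages = pd.read_csv("anchorages.csv")
# The haversine metric expects (lat, lon) in radians.
tree = BallTree(
    np.radians(anchorages[["latitude", "longitude"]].to_numpy()),
    metric="haversine",
)

def add_nearest_anchorage(messages: pd.DataFrame) -> pd.DataFrame:
    """Add the distance (km) to, and the index of, the nearest anchorage."""
    dist, idx = tree.query(np.radians(messages[["lat", "lon"]].to_numpy()), k=1)
    messages["anchorage_dist_km"] = dist[:, 0] * EARTH_RADIUS_KM
    messages["anchorage_idx"] = idx[:, 0]
    return messages
```

Each query is then logarithmic in the number of anchorages instead of a linear scan, which is presumably where the `-A` slowdown comes from.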
Then we run **data_processing.py** for each month (this should normally be embarrassingly parallel; is there really a need to process per month and accumulate data?):

```
python data_processing.py -m 1
```

This writes `data/processed/1.h5`.

The folders are a bit messy: `data_processing.py` assumes the zip files are in `../data/1/*.zip`, but the output is written in the local folder (where the code is), in `data/processed/1.h5`.

## Visualization

- `prediction.py` shows how to make a prediction (one entry --> one output).
- `server_ASD_detection.py` is the program that sends data to the visualizer: https://gitlab.com/simula_ais_message/pre-processing/-/blob/v4/src/server_ASD_detection.py

### Simple visualization with H3

![](https://i.imgur.com/vQexGO4.png)

Made with `maritime_H3_visu.ipynb`. QC: why do we have so much data over land?

### How are AIS anomalies usually displayed to the operator?

- How many panels, and what should the panels contain?
- UX design: we need inputs from the clients/potential users.
- What are we willing to demonstrate?
- How are anomalies displayed in existing software?
- See https://docs.google.com/presentation/d/1mY7tPh02UpeQ_MLX-t_fBmaQuc37XQ5XfEH4H-ndJo8/edit?usp=sharing

The main initial objective is to understand what to display and how to display it.

### Maritime Informatics

http://maritime-informatics.com/bereta/

Example of visualization: https://youtu.be/jO3l1Bf-Z1I

It may be interesting to display:

- Incidents per area (area chosen and displayed on a map).
- Vessels of interest: a worldwide map where the X vessels with the biggest anomalies are displayed, with X = 100 or whatever the end user would like to see.

### Coastline

Taken from https://www.naturalearthdata.com/downloads/10m-physical-vectors/10m-coastline/: 10 m coastlines, in shapefile format.

### Marine regions

https://www.marineregions.org/downloads.php

Downloaded products:

- Maritime Boundaries Geodatabase: Contiguous Zones (24NM), version 3
- World EEZ v11 (2019-11-18, 127 MB)

### Sample data to visualize

https://doi.org/10.5281/zenodo.3754481

Single Ground Based AIS Receiver Vessel Tracking Dataset, by Kontopoulos I.; Vodas M.; Spiliopoulos G.; Tserpes K.; Zissis D.

### Spire Maritime

https://docs.platform-xyzt.ai/tutorials/maritime/understanding-AIS-data.html

In the example data set, the following properties make up each record:

- **MMSI**: a unique identifier for a vessel. All records belonging to the same vessel have the same identifier.
- **Longitude**: the horizontal coordinate (typically in the WGS-84 reference)
- **Latitude**: the vertical coordinate (typically in the WGS-84 reference)
- **Time stamp**: the time the record was logged
- **Speed (knots)**: the speed of the vessel
- **Course**: the heading of the vessel (in degrees)
- **Draught (dm)**: the draught of the vessel in dm
- **Ship type**: the type of the ship, e.g., "Cargo, all ships of this type", Fishing, ...
- **Status**: the status of the ship, e.g., Underway using engine, Moored, ...
- **Collection type**: how the record was collected: through satellite, through a terrestrial receiver, or through a receiver on another ship (Dynamic A and B)

Other YouTube videos from Spire:

- https://youtu.be/GgHuMSsI7i8
- https://youtu.be/BvXMRo78XRA

### Anomaly Detection in Maritime AIS Tracks: A Review of Recent Approaches

By Konrad Wolsing, Linus Roepert, Jan Bauer and Klaus Wehrle: https://www.mdpi.com/1450022

Shows a plot of the vessels (lat, lon) with markers whose color encodes the SOG (speed over ground, in knots).
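Such a plot is easy to reproduce with plotly (already in the draft conda environment). A minimal sketch, assuming a dataframe with the `lat`, `lon` and `sog` columns of the raw AIS csv (the input path is illustrative):

```
import pandas as pd
import plotly.express as px

# One day of AIS messages (illustrative path, semicolon-separated csv).
df = pd.read_csv("1/ais_20200101.zip", sep=";")

fig = px.scatter_geo(
    df,
    lat="lat",
    lon="lon",
    color="sog",  # marker color encodes speed over ground (knots)
    hover_data=["mmsi", "date_time_utc"],
    title="Vessel positions colored by SOG",
)
fig.show()
```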
### NOAA Office for Coastal Management

https://coast.noaa.gov/data/marinecadastre/ais/AISDataHandlerTutorial_Jan2013.pdf

### SAT-AIS Services

http://www.sat-ais.org

### Copernicus Security Services - Maritime Surveillance component

https://insitu.copernicus.eu/FactSheets/CSS_Maritime_Surveillance/

![](https://i.imgur.com/yRNNx6f.png)

### Marine Traffic Research

https://www.marinetraffic.com/research/

### Danish Maritime Authority

https://dma.dk/safety-at-sea/navigational-information/ais-data

Sample datasets are also provided. Columns in the csv files:

| # | Column | Description |
|---|--------|-------------|
| 1 | Timestamp | Timestamp from the AIS base station, format: 31/12/2015 23:59:59 |
| 2 | Type of mobile | The type of target this message is received from (Class A AIS vessel, Class B AIS vessel, etc.) |
| 3 | MMSI | MMSI number of the vessel |
| 4 | Latitude | Latitude of the message report (e.g. 57,8794) |
| 5 | Longitude | Longitude of the message report (e.g. 17,9125) |
| 6 | Navigational status | Navigational status from the AIS message if available, e.g. 'Engaged in fishing', 'Under way using engine', etc. |
| 7 | ROT | Rate of turn from the AIS message if available |
| 8 | SOG | Speed over ground from the AIS message if available |
| 9 | COG | Course over ground from the AIS message if available |
| 10 | Heading | Heading from the AIS message if available |
| 11 | IMO | IMO number of the vessel |
| 12 | Callsign | Callsign of the vessel |
| 13 | Name | Name of the vessel |
| 14 | Ship type | AIS ship type of this vessel |
| 15 | Cargo type | Type of cargo from the AIS message |
| 16 | Width | Width of the vessel |
| 17 | Length | Length of the vessel |
| 18 | Type of position fixing device | Type of position-fixing device from the AIS message |
| 19 | Draught | Draught field from the AIS message |
| 20 | Destination | Destination from the AIS message |
| 21 | ETA | Estimated time of arrival, if available |
| 22 | Data source type | Data source type, e.g. AIS |
| 23 | Size A | Length from GPS to the bow |
| 24 | Size B | Length from GPS to the stern |
| 25 | Size C | Length from GPS to the starboard side |
| 26 | Size D | Length from GPS to the port side |

http://web.ais.dk/aisdata/

### What to focus on?

- Create a simple Jupyter notebook with interactive visualization. We still need minimal information on what is usually displayed.
- There is no point in developing software that mimics real-time processing (we do not know what the current processing pipelines are).
- Shouldn't we focus on cross-validation (using satellite images, other information, etc.) and write a Jupyter notebook to show it (narrative + interactive visualization)?

## Training & Testing

- training.py
- testing.py

## Concatenate all outputs

`data_concatenation.py` concatenates all the months (`1.h5`, `2.h5`, etc.) into `all.h5`. It is not clear whether `all.h5` is actually used.
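If `all.h5` is kept, the concatenation itself is a few lines of pandas. A minimal sketch, assuming each monthly file stores a single dataframe under one key (the key name `data` is hypothetical; check `data_concatenation.py` for the real one):

```
import pandas as pd

KEY = "data"  # hypothetical key name

# Read the twelve monthly files written by the preprocessing step.
# Requires the PyTables package (pandas HDF5 backend).
months = [pd.read_hdf(f"data/processed/{m}.h5", key=KEY) for m in range(1, 13)]
pd.concat(months, ignore_index=True).to_hdf("data/processed/all.h5", key=KEY, mode="w")
```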
## Create datasets

This step creates the datasets for training, validation and testing:

```
python datasets_creation.py -m 1 -M 1
```

where `-m` is the start month and `-M` the end month. The result is three datasets (train, validation, test):

```
(tsar) annef@Annes-MacBook-Pro src % ls -lrt ../../data/dataset
total 11509880
-rw-r--r--  1 annef  staff  4552014512 Nov 14 14:10 train_trajectories_1.npy
-rw-r--r--  1 annef  staff    29751856 Nov 14 14:10 train_vesseltype_1.npy
-rw-r--r--  1 annef  staff    29751856 Nov 14 14:10 train_status_1.npy
-rw-r--r--  1 annef  staff    29751856 Nov 14 14:10 train_fishing_type_1.npy
-rw-r--r--  1 annef  staff    29751856 Nov 14 14:10 train_combination_1.npy
-rw-r--r--  1 annef  staff   562702304 Nov 14 14:10 test_trajectories_1.npy
-rw-r--r--  1 annef  staff     3677920 Nov 14 14:10 test_vesseltype_1.npy
-rw-r--r--  1 annef  staff     3677920 Nov 14 14:10 test_status_1.npy
-rw-r--r--  1 annef  staff     3677920 Nov 14 14:10 test_fishing_type_1.npy
-rw-r--r--  1 annef  staff     3677920 Nov 14 14:10 test_combination_1.npy
-rw-r--r--  1 annef  staff   594418592 Nov 14 14:10 validation_trajectories_1.npy
-rw-r--r--  1 annef  staff     3885216 Nov 14 14:10 validation_vesseltype_1.npy
-rw-r--r--  1 annef  staff     3885216 Nov 14 14:10 validation_status_1.npy
-rw-r--r--  1 annef  staff     3885216 Nov 14 14:10 validation_fishing_type_1.npy
-rw-r--r--  1 annef  staff     3885216 Nov 14 14:10 validation_combination_1.npy
```

Everything is stored as numpy arrays (not very portable...). The NN weights are also stored as numpy arrays. Could HDF5 be used instead?

## Lessons learned

- What have we learned?
- How could we do better in future projects?

### Working alone is not efficient

The PhD student was left alone and had limited technical skills when starting. We see a slow but constant improvement in the code with each new paper, but the quality of the final result is still quite low. Obviously the code has never been reviewed; pair programming and similar practices do not exist at Simula, and PhD students do not receive the necessary training in programming and code design (modularity) when starting. Is this true for all of Simula's groups? It seems to be true for students in VIAS. What is the role of Scientific Computing?

### Refactoring by a Research Engineer (RE) at the end is costly

REs need to be incorporated early on, not at the end of the project (not after the papers have been published). Time is wasted when REs are asked to refactor at the end, and PhD students do not learn best practices.

### No common e-infrastructure at Simula

It looks like everyone is working in silos, and no code is shared. Do we re-invent the wheel for every single new project? Several projects have overlapping objectives, but there is not much interest in collaboration. People like to work alone...

### What could be done better?

Scientific Computing has started an initiative for reproducible research at Simula. The focus should not be on reproducibility only. Could Open Science practices be adopted at Simula?