# Hands-on Data Anonymization April 2021

:::success
## Course practicalities
- **Course organizer**: Enrico Glerean
- **Guest lecturers**: Dan Häggman, Abraham Zewoudie
- **Contact**: enrico.glerean@aalto.fi
- **Zoom Link**: SENT VIA EMAIL
- **hackMD for async course chat**:
    - DAY1 (Archived): https://hackmd.io/@eglerean/dataAnon2021chatDay1
    - DAY2 (Archived): https://hackmd.io/@eglerean/dataAnon2021chatDay2
    - DAY3 (Archived): https://hackmd.io/@eglerean/dataAnon2021chatDay3
    - DAY4 (Archived): https://hackmd.io/@eglerean/dataAnon2021chatDay4
- **Times**: **12, 15, 19, and 22 April at 11:50-15:00** (we always start *10 minutes before 12:00* to have an informal tea/coffee together). Two 15-minute breaks, at the beginning of each hour after the first one (i.e. at 13:00 and 14:00).
- **Learning goals**: The goals for this course are ***practical***: to have participants actually de-identify/pseudo-anonymize/anonymize personal data in many of its forms, and to use modern techniques for working with personal/sensitive data when anonymization is not possible.
- **Course structure**: each part starts with conceptual/theoretical background and is then followed by a hands-on session of "doing/coding together". If you do not plan to do the hands-on part, it is OK to attend only the start of each part.
- **Target audience**: anyone working with personal data in research. We are *very* diverse, and we need to help each other learn by doing. The teachers will try to help especially those who need the course credit. Note: if you do not know how to code (or if you are not familiar with language X), you are not expected to learn it during this course. The goal for you is to find a way to reach your anonymization goals with the tools you are familiar with. Identifying what you need to learn outside this course to reach those goals is also an important one! **This is not a computer science course on data privacy or data security**. For that you can check https://mycourses.aalto.fi/course/view.php?id=28167 .
- **Course credit**: The course gives 1 ECTS credit (equivalent to 27 hours of work, of which about half is spent attending the contact sessions). The credit is registered automatically for those who have an Aalto student number. Other participants can request a certificate that can be converted into 1 ECTS.
- **To get the 1 ECTS**: attend all contact sessions, complete the hands-on parts during the sessions, and submit the homework results ***before May 9 2021 (Mother's Day!)***. If you need to skip more than one lecture, you will be asked to compensate with extra homework (this is just to be fair towards other participants, not to punish you!).
:::

:::danger
## DISCLAIMER: WORK IN PROGRESS
This is a pilot course and you are pilot test participants. We can improve it together (and even teach it together in the future)!
:::

## Course structure

*Week 1 focuses on tabular data, week 2 focuses on more advanced structured and unstructured data types.
We can still adapt the course content to participants' wishes (especially considering those who will complete the course for the credit).*

## Day 1 (Mon 12/04/2021) - Identifying the problem

:::spoiler Learning Outcomes:
- you understand the importance of data anonymization
- you can evaluate if a dataset is anonymous
- you are able to peer review data management plans and grant applications which require (pseudo)anonymization
- you are able to peer review (pseudo)anonymized tabular datasets in research studies

### Timetable for day 1

| Start | End | Topic | Notes |
| ----- | --- | ----- | ----- |
| 11:50 | 12:00 | Informal tea/coffee break | |
| 12:00 | 12:15 | Practicalities + Icebreaker | |
| 12:15 | 12:50 | Data Anonymization Basics pt 1 | [video](https://www.youtube.com/watch?v=ILXeA4fx3cI) [slides](https://docs.google.com/presentation/d/1Y_iuG6bDSWdnBoUfd6n_tGlj9Gd8Trzi1IfbhQDbf-8/edit?usp=sharing) |
| 12:50 | 13:00 | Q&A | |
| 13:00 | 13:15 | Break | |
| 13:15 | 13:45 | Data Anonymization Basics pt 2 | [video](https://www.youtube.com/watch?v=ILXeA4fx3cI) [slides](https://docs.google.com/presentation/d/1Y_iuG6bDSWdnBoUfd6n_tGlj9Gd8Trzi1IfbhQDbf-8/edit?usp=sharing) |
| 13:45 | 14:00 | Q&A + intro to Exercise 1 | |
| 14:00 | 14:15 | Break | |
| 14:15 | 15:00 | Exercise 1 in smaller groups and solutions together | [dataset](https://docs.google.com/spreadsheets/d/1mHvshwjQiCm2y10aGlkIjnNStzxrCZfcM9gqg2GX3eU/edit?usp=sharing) |
:::

## Day 2 (Thu 15/04/2021) - Solutions for tabular data

:::spoiler Learning Outcomes:
- you can de-identify/pseudo-anonymize/anonymize personal data in tabular form
- you can evaluate if a (pseudo)anonymization strategy is successful
- you learn some workflows for keeping sensitive and less sensitive data separated and linked

### Timetable for day 2

*Note: the timetable is still subject to changes according to participants' wishes*

| Start | End | Topic | Notes |
| ----- | --- | ----- | ----- |
| 11:50 | 12:00 | Informal tea/coffee break | |
| 12:00 | 12:05 | Practicalities recap | |
| 12:15 | 12:50 | Amnesia tool demo + Exercise | See links below |
| 12:50 | 13:00 | Q&A | |
| 13:00 | 13:15 | Break | |
| 13:15 | 13:30 | ARX data anonymization tool | |
| 13:30 | 13:45 | Reflections + Google Scholar exercise | |
| 13:45 | 14:00 | Q&A | |
| 14:00 | 14:15 | Break | |
| 14:15 | 15:00 | More advanced options + Exercise | |
| - | - | Workflow for secure processing and research ethics considerations (if there is time) | |

### Notes for Day 2

**Amnesia tool**
- Tool: https://amnesia.openaire.eu/amnesia/index.html
- Documentation: https://amnesia.openaire.eu/about-documentation.html
- Get some demo data: https://amnesia.openaire.eu/Datasets.zip
- Then for the exercise:
    1) https://amnesia.openaire.eu/Scenarios/AmnesiaTutorialKAnon.pdf
    2) https://amnesia.openaire.eu/Scenarios/AmnesiaKMAnonymityTutorial.pdf

**ARX tool**
- https://arx.deidentifier.org/
- https://www.youtube.com/channel/UCcGAF5nQ_O6ResEF-ivsbVQ/videos

### Reflections - Which one to choose?

**Amnesia**
Pro: simple to explore the properties of a tabular dataset, no installation needed.
Con: only k-anonymity is provided, it can sometimes be buggy, it needs to be installed locally for larger datasets, and it is difficult to ensure reproducibility (*) using only the graphical tool.

**ARX**
Pro: state of the art for doing anonymization, lots of options.
Con: steeper learning curve.

***Which one to choose?*** Start with the solution that best suits your skills. Some people might be able (or might prefer) to code the anonymization techniques presented in these tools themselves; a minimal sketch of the core idea is shown below.
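For instance, checking how k-anonymous a table currently is boils down to grouping it by its quasi-identifiers and looking at the smallest group size. A minimal sketch with pandas, where the file name and the quasi-identifier columns (`age`, `zip_code`, `gender`) are hypothetical placeholders for your own data:

```python
import pandas as pd

# Hypothetical quasi-identifiers; replace with the columns of your own dataset
quasi_identifiers = ["age", "zip_code", "gender"]

df = pd.read_csv("survey_responses.csv")  # hypothetical file name

# k is the size of the smallest group of rows sharing the same quasi-identifier values:
# every record is then indistinguishable from at least k-1 others on these columns.
group_sizes = df.groupby(quasi_identifiers).size()
k = group_sizes.min()
print(f"The dataset is {k}-anonymous with respect to {quasi_identifiers}")

# Groups that are too small (here: fewer than 5 rows) are the re-identification risks
print(group_sizes[group_sizes < 5])
```

Reaching a target k then means generalizing (e.g. binning ages, truncating postcodes) or suppressing rows until the smallest group is large enough, which is essentially what Amnesia and ARX automate for you.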
For small datasets (e.g. 30 participants) even "manual" anonymization might be an option, but if the numbers grow and you need to do this multiple times, choosing a more robust tool makes your life easier. My flowchart for picking the right tool would be:
- fewer than 30 rows: do it manually with Excel
- fewer than 200 rows and fewer than 5 columns: Amnesia
- more than 200 rows / 5 columns: ARX and/or your preferred programming language

(*) Come to my lecture on questionable research practices! 28.4.2021 at 10-11 AM https://aalto.zoom.us/j/61120027388

---

### Reflection: which k number / which technique to choose?

**Google Scholar exercise**
Divide into groups based on research interest. Each group tries to find papers in their field where anonymization was used and/or k-anonymity was mentioned. Paste links to the studies below. You can also do this alone if you do not want to work in groups, but let's find interesting papers!

* BreakOut Room 1: Medical (brain imaging, physiological time series, health data)
    * link here
    * another link here
* BreakOut Room 2: Behavioural sciences
    * link here
    * another link here
* BreakOut Room 3: Data from interviews
    * link here
    * another link here
* BreakOut Room 4: WRITE YOUR BROAD TOPIC
    * etc etc
    * etc
* ADD MORE ROOMS IF NEEDED

#### Results and final reflection
It is a very challenging task to find out what others have used to anonymize tabular microdata: anything from k = 3 to thousands was found. Maybe we should look at the protocols of datasets released by consortium efforts and see if they mention the magic value of k. I guess a possible explanation is also that researchers tend NOT to share these details in their studies, leaving the data inaccessible for future reuse. The situation is different when it comes to more complex data where the risk of re-identification is much higher (e.g. from a picture of a face).

---

### Multidimensional k-anonymity

**Mondrian method ([LeFevre et al 2006](http://pages.cs.wisc.edu/~lefevre/MultiDim.pdf))**

![](https://i.imgur.com/QEtg2aW.png)
*Figure from [these slides](https://www.slideshare.net/hirsoshnakagawa3/privacy-protectin-models-and-defamation-caused-by-kanonymity)*

Comments: it has the same issues as other clustering or dimensionality reduction algorithms. If the clusters / principal components do not make much sense, it is difficult to claim anything generalizable about the findings. A simplified sketch of the Mondrian partitioning idea is shown below.
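To make the Mondrian idea concrete, here is a minimal greedy sketch in Python/pandas. It assumes numeric quasi-identifiers and a dataset with at least k rows; the column names, file name, and k value in the usage comment are hypothetical. It only illustrates the recursive median cut from LeFevre et al. 2006 and is not a substitute for ARX or sdcMicro:

```python
import pandas as pd

def mondrian_partition(df, quasi_identifiers, k):
    """Greedy top-down Mondrian (simplified): recursively cut the partition on the
    quasi-identifier with the widest range, at its median, as long as both halves
    still contain at least k rows."""
    spans = {q: df[q].max() - df[q].min() for q in quasi_identifiers}
    for q in sorted(spans, key=spans.get, reverse=True):
        median = df[q].median()
        left, right = df[df[q] <= median], df[df[q] > median]
        if len(left) >= k and len(right) >= k:
            return (mondrian_partition(left, quasi_identifiers, k)
                    + mondrian_partition(right, quasi_identifiers, k))
    return [df]  # no allowed cut left: this partition becomes one equivalence class

def generalize(partitions, quasi_identifiers):
    """Replace each quasi-identifier value by the min-max range of its partition."""
    generalized = []
    for part in partitions:
        part = part.copy()
        for q in quasi_identifiers:
            part[q] = f"{part[q].min()}-{part[q].max()}"  # same range for the whole group
        generalized.append(part)
    return pd.concat(generalized)

# Hypothetical usage (numeric quasi-identifiers only):
# df = pd.read_csv("survey_responses.csv")
# qi = ["age", "zip_code"]
# df_anon = generalize(mondrian_partition(df, qi, k=5), qi)
```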
---

### Coding anonymization of tabular microdata

**Python**
- https://github.com/leo-mazz/crowds
- https://medium.com/brillio-data-science/a-brief-overview-of-k-anonymity-using-clustering-in-python-84203012bdea
- Use ARX from Python (Docker needed)
    - https://navikt.github.io/arxaas/
    - https://pypi.org/project/arkhn-arx/

**R**
- https://cran.r-project.org/web/packages/sdcMicro/index.html

**Stata**
- https://stats.idre.ucla.edu/stata/code/generate-anonymous-keys-using-mata/

**Matlab**
:::

## Day 3 (Mon 19/04/2021) - Visual, speech, and geospatial data

:::spoiler Learning Outcomes:
- you can de-identify/pseudo-anonymize/anonymize personal data from visual material
- you can de-identify speech
- you can de-identify geospatial data

### Timetable for day 3

*Note: the timetable is still subject to changes according to participants' wishes*

| Start | End | Topic | Notes |
| ----- | --- | ----- | ----- |
| 11:50 | 12:00 | Informal tea/coffee break | |
| 12:00 | 12:05 | Practicalities recap | |
| 12:15 | 12:50 | Facial blurring, pose estimation, and hidden EXIF data | |
| 12:50 | 13:00 | Q&A | |
| 13:00 | 13:15 | Break | |
| 13:15 | 13:45 | Text and speech anonymization | |
| 13:45 | 14:00 | Q&A | |
| 14:00 | 14:15 | Break | |
| 14:15 | 15:00 | Geospatial data anonymization | Lecture by Dan Häggman, [slides](https://drive.google.com/file/d/1Zi2s0BW3533zZOkDz9gN20qUNi6VnwvW/view?usp=sharing) |

### Notes for day 3

Data for today's practicals: https://drive.google.com/drive/folders/17TNrNxEBNYHs_jlu-Ys1RXL2NbfO0zp4?usp=sharing (from the "This is Aalto" event: https://www.aalto.fi/en/events/online-event-this-is-aalto-our-direction-in-research-education-and-impact)

### Removing faces in photographs and videos

Removing faces in photographs:
- manually via a web browser (possible with https://filmora.wondershare.com/video-editing-tips/blur-face-online.html but maybe not recommended if the material is sensitive)
- manually with software like GIMP/Photoshop
- programmatically (in Python with deface)

Removing faces in videos:
- manually with Filmora (demo with face removal + mosaic). *Note: Filmora no longer allows free exports and it now adds a watermark to the anonymized video. Enrico is investigating better options (Adobe Premiere Pro might be one available for Aalto users, but it seems not everyone at Aalto can access it)*
- programmatically with Python deface https://pypi.org/project/deface/

Enrico's Python environment for the exercise:

```bash
python -m venv videodeface
iswin=$(uname | grep MINGW | wc -l)
if [ $iswin -eq 1 ]; then
    echo "we are on a Windows machine"
    source videodeface/Scripts/activate
else
    echo "I assume this is Linux or Mac"
    source videodeface/bin/activate
fi
pip install deface
# or the development version from GitHub:
pip install 'git+https://github.com/ORB-HD/deface'
pip install exif # used for the EXIF data exercise
# if you do not use it with jupyter, ignore below
pip install ipykernel
python -m ipykernel install --name=videodeface
```

### Hidden personal data in images and videos (and all files)

- EXIF data
    - remove it manually, e.g. via the Windows file properties dialog
    - edit EXIF programmatically with https://pypi.org/project/exif/ (see the sketch below)
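A minimal sketch for stripping EXIF metadata (including GPS coordinates) with the `exif` package, assuming a recent version of the package and its `has_exif`/`delete_all()`/`get_file()` interface; the file names are hypothetical, and you should always check with a second tool that the metadata is really gone:

```python
from exif import Image

# Read a photo and check whether it carries EXIF metadata (hypothetical file name)
with open("exifimages/photo.jpg", "rb") as f:
    img = Image(f)

if img.has_exif:
    print("EXIF tags found:", img.list_all())  # e.g. gps_latitude, datetime_original, model

    img.delete_all()  # drop every EXIF tag from the in-memory copy

    # Write the cleaned bytes to a new file; keep the original in a secure location
    with open("exifimages/photo_noexif.jpg", "wb") as out:
        out.write(img.get_file())
else:
    print("No EXIF metadata found")
```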
An alternative for visual material: OpenPose (replace the person with an estimated pose skeleton).

Exercises:
- unzip exifimages.zip and anonymize a picture in that subfolder
- anonymize the video using Filmora or Python deface

---

### Anonymizing text / transcriptions

- https://github.com/microsoft/presidio
    - Demo at https://presidio-demo.azurewebsites.net/
- https://github.com/openredact
- https://nlp.stanford.edu/software/CRF-NER.html
- for medical records: https://medium.com/nedap/de-identification-of-ehr-using-nlp-a270d40fc442

---

### Speech morphing

- No out-of-the-box solution without coding, unless one uses text-to-speech synthesis
- Useful links from Abraham:
    - https://github.com/sarulab-speech/lightweight_spkr_anon
:::

## Day 4 (26/04/2021) - Secure analysis workflows. Differential privacy and data synthesis. Medical images and physiological data.

:::spoiler {state="open"}
Learning Outcomes:
- you learn about workflows for working with sensitive data, especially when anonymization is not possible
- you learn about differential privacy, personal data synthesis, and federated analysis approaches, and are able to synthesize tabular microdata
- you can de-identify/pseudo-anonymize medical images in DICOM and other formats

#### Timetable for day 4

*Note: the timetable is still subject to changes according to participants' wishes*

| Start | End | Topic | Notes |
| ----- | --- | ----- | ----- |
| 11:50 | 12:00 | Informal tea/coffee break | |
| 12:00 | 12:05 | Practicalities recap | |
| 12:05 | 12:15 | Secure processing workflows (remote computing, subject ID pseudo-anonymization) | |
| 12:15 | 12:50 | Differential privacy, data synthesis, federated analysis approaches | |
| 12:50 | 13:00 | Q&A | |
| 13:15 | 14:00 | Medical images: DICOM + exercise | |
| 14:00 | 14:15 | Break | |
| 14:15 | 14:45 | MRI defacing + exercise, considerations for time series pseudo-anonymization | |
| 14:45 | 15:00 | Final recap | |

## Materials for day 4

## Secure workflows at Aalto

What do we do when we cannot anonymize / de-identify personal data? Some data are impossible to de-identify (think of a fingerprint). Sometimes we need to keep a way to trace back the identity (longitudinal studies, studies that link with data from other sources).

1. Pseudo-anonymize subject identifiers. Sometimes it is enough to have a secret list stored somewhere; sometimes you need more complicated approaches (e.g. generation of a hashed subject ID, with the hash key owned by an organization such as THL).
2. Make sure your workflow is secure: secure storage (short and long term), secure processing, interactive use
    - FINDATA: audit of everything? https://www.findata.fi/ Coming soon...
    - Aalto shared folders
        - https://www.youtube.com/watch?v=1Sck_R1glCs
        - https://drive.google.com/file/d/151wBvk90xJgvRwx6WJeb0tdiSj3AhIoq/view
    - sensitive code: keep the sensitive parts off GitHub (all you need to run git is a remote SSH server that you share with colleagues)
    - VDI https://vdi.aalto.fi/
    - A list of workflows https://scicomp.aalto.fi/triton/usage/workflows/
    - CSC ePouta https://research.csc.fi/-/epouta

## Differential privacy and personal data synthesis

- Perturb the values while keeping the structure in the data (the more you perturb, the less usable the data becomes); a minimal sketch of this idea is shown right after this list.
- Data synthesis https://github.com/DPBayes/twinify (note: install with

```
pip install git+https://github.com/DPBayes/twinify.git@v0.1.1
pip install jaxlib==0.1.51
```
)
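To make the "perturb the values" idea concrete, here is a minimal sketch of the Laplace mechanism for releasing a differentially private mean of a bounded numeric column. The epsilon value, bounds, and data are hypothetical; for real projects use a maintained tool such as twinify (above) or a dedicated differential-privacy library rather than hand-rolled noise:

```python
import numpy as np

rng = np.random.default_rng()

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean via the Laplace mechanism.

    The mean of n values bounded in [lower, upper] has sensitivity (upper - lower) / n,
    so we add Laplace noise with scale sensitivity / epsilon.
    Smaller epsilon = stronger privacy = more noise = less usable output.
    """
    values = np.clip(np.asarray(values, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(values)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

# Hypothetical usage: a bounded "age" column, epsilon = 1.0
ages = [23, 35, 41, 29, 52, 47, 33]
print("True mean:", np.mean(ages))
print("DP mean:  ", dp_mean(ages, lower=18, upper=90, epsilon=1.0))
```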
Challenges:
- Precision medicine vs differential privacy: impossible?
- How do we synthesize fake images?
    - Cross-modal synthesis: https://medium.com/analytics-vidhya/medical-imaging-being-transformed-with-gan-mri-to-ct-scan-and-many-others-18a307ef528
    - Other papers: https://arxiv.org/pdf/1907.08533.pdf https://arxiv.org/pdf/2003.13653.pdf https://arxiv.org/abs/2009.05946

## Federated analysis

- Bring your code to where the data are (e.g. a hospital) and only store the model (prediction model, normative model, etc.), e.g. from Wikipedia: https://en.wikipedia.org/wiki/Federated_learning#/media/File:Federated_learning_process_central_case.png

Challenges: technical and human bottlenecks
- How to run a complex pipeline inside a hospital? Containers!
- But who is going to check your code...?

## Medical imaging

### DICOM headers

- Read them graphically: https://nrg.wustl.edu/software/dicom-browser/ or pick your favorite from https://www.fossmint.com/linux-dicom-viewers-for-doctors/
- Read them from the command line: https://dcmtk.org/dcmtk.php.en
    - Precompiled versions I have found:
        - Linux: https://github.com/QIICR/dcmtk-dcmqi/releases/download/0d2826645/dcmtk-Linux-0d2826645.zip
        - Win (not tested): https://github.com/QIICR/dcmtk-dcmqi/releases
- Read them programmatically with pydicom. I recommend installing it through https://pydicom.github.io/deid/ (`pip install deid`), which we will use later.
- Which DICOM exercise do you want to do?

## Anonymize by converting to another format

- dcm2niix (briefly on BIDS) with https://zenodo.org/record/16956#.YIZ-EWczZnI

```bash
git clone https://github.com/rordenlab/dcm2niix.git
cd dcm2niix
mkdir build && cd build
#module load cmake
cmake .. # if needed: "cmake .. -DUSE_STATIC_RUNTIME=OFF -DCMAKE_INSTALL_PREFIX=/scratch/PATHTOBINFOLDER"
make     # and with the option above "make install"
# and then we run
./bin/dcm2niix -ba y /m/nbe/scratch/braindata/eglerean/dicomdeid/SE000002/
```

:::danger
Exercise 1: download the Zenodo DICOM files https://zenodo.org/record/16956#.YIZ-EWczZnI and then pick a graphical tool (if you are unsure, pick https://nrg.wustl.edu/software/dicom-browser/ ). Were you able to open them with a graphical tool?

Exercise 2: command-line tools for exploring DICOM headers.
:::

:::warning
BREAK! Let's be back at 14:12. Mark your presence if you didn't yet.
:::

## Defacing

https://open-brain-consent.readthedocs.io/en/latest/anon_tools.html

Let's test mri_deface with https://zenodo.org/record/16956#.YIZ-EWczZnI

Matlab option: https://www.fieldtriptoolbox.org/faq/how_can_i_anonymize_dicom_files/

## Considerations on medical time series

- Time series modelling can make them "less personal": ARIMA, n-grams (good luck!)
- For those of you working with M/EEG: there can be personal data in the headers, but I have never seen it. Some options:
    - https://mne.tools/dev/generated/mne.io.anonymize_info.html
    - https://www.fieldtriptoolbox.org/faq/how_can_i_anonymize_fieldtrip_data/
:::
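As a follow-up to the M/EEG pointers above, a minimal sketch using MNE-Python's built-in anonymization (assuming a reasonably recent MNE version; the file name is hypothetical). `Raw.anonymize()` strips or obscures the subject information stored in the header, but you should always inspect the saved file yourself:

```python
import mne

# Hypothetical file name; the same approach works for other raw formats MNE can read
raw = mne.io.read_raw_fif("sub-01_task-rest_meg.fif")

raw.anonymize()                    # remove subject name/birthday, adjust meas_date, etc.
print(raw.info["subject_info"])    # verify what (if anything) is left in the header

raw.save("sub-01_task-rest_meg_anon.fif", overwrite=True)
```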