# Hands-on Data Anonymization, February 2023

Link to this doc: https://hackmd.io/@eglerean/anonfeb2023

:::success
## Workshop practicalities
- **Course organizer**: Enrico Glerean
- **Contact**: Instructor enrico.glerean@aalto.fi / Organizer agata.bochynska@ub.uio.no
- **Learning goals**: The goals of this workshop are ***practical***: participants will actually de-identify/pseudonymize/anonymize personal data, and will use modern techniques for working with personal/sensitive data when anonymization is not possible.
- **Target audience**: anyone working with personal data in research. Please note that **this is not a computer science course on data privacy or data security**.
- Please bring your laptop if possible.
- Interactive HackMD chat: https://hackmd.io/@eglerean/HODAFeb2023chat
:::

:::danger
### DISCLAIMER
This is a work in progress! Materials, links and other resources will be added just before / during the workshop. Let's use this page as a shared document and reference.
:::

## Course structure

*We focus on tabular data and interview data. There are also materials on more advanced structured and unstructured data types (electronic medical records, medical images, ...). We can still adapt the course content to participants' wishes.*

## Part 1 - Concepts (12:00 - 12:50)

### Learning outcomes
- you understand the importance of data anonymization
- you understand how anonymization aligns with research ethics, data protection law (GDPR), and open science
- you can evaluate whether a dataset is anonymous
- you are able to peer review data management plans and grant applications that require (pseudo)anonymization
- you are able to peer review (pseudo)anonymized datasets in research studies

### Notes
- We adapt to the audience, considering what was covered during the morning seminar
- Let's do an icebreaker in the [chat](https://hackmd.io/@eglerean/HODAFeb2023chat)

### Slides
- https://docs.google.com/presentation/d/1dxK-7PrIcl73laNcQu1VkJ3D7F4iV8gr0CX0jXXX8Wc/edit?usp=sharing

## Part 2 - Data anonymization with interview data and text (13:10 - 14:00)

### Learning outcomes
- you can de-identify/pseudonymize/anonymize data from interviews (audio/visual or just text)
- you can evaluate whether a (pseudo)anonymization strategy is successful
- you are aware of the limitations of these approaches

### Working with audio/visual interviews
- Video
    - Is it necessary? If not, delete it.
    - If it is necessary: blur faces (see the face-blurring sketch after this list)
        - manually, with software like GIMP/Photoshop/Premiere
        - programmatically in Python: https://pypi.org/project/deface/
        - OpenFace or OpenPose: http://multicomp.cs.cmu.edu/resources/openface/ & https://github.com/CMU-Perceptual-Computing-Lab/openpose
        - synthesis / generative approaches (i.e. deepfakes): https://github.com/hukkelas/DeepPrivacy
- Audio (= speech)
    - Is it necessary? If not, transcribe it.
    - If it is necessary:
        - there is no out-of-the-box solution without coding, unless one uses text-to-speech synthesis
        - https://github.com/sarulab-speech/lightweight_spkr_anon
        - for transcription, https://openai.com/blog/whisper/ can be run locally without giving your data to the "cloud"
- Content of the interview (= text)
    - Manually: see the set of rules at https://www.fsd.tuni.fi/en/services/data-management-guidelines/anonymisation-and-identifiers/
    - Programmatically (see the Presidio sketch after this list):
        - https://github.com/microsoft/presidio
            - Demo at https://presidio-demo.azurewebsites.net/
        - https://github.com/openredact
        - https://nlp.stanford.edu/software/CRF-NER.html
        - for medical records: https://medium.com/nedap/de-identification-of-ehr-using-nlp-a270d40fc442
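To make the video bullets above concrete: tools like deface wrap the whole workflow into one command, but the core idea fits in a short script. Here is a minimal sketch, assuming OpenCV (`opencv-python`) is installed; the file names `interview.mp4` and `blurred.mp4` are made-up placeholders, and the Haar-cascade detector is a simple stand-in for the stronger detectors used by deface or DeepPrivacy.

```python
# Minimal face-blurring sketch using OpenCV's built-in Haar cascade.
# File names are placeholders; requires `pip install opencv-python`.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

cap = cv2.VideoCapture("interview.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter(
    "blurred.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height)
)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Detect faces and replace each detected region with a heavy Gaussian blur
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(
            frame[y:y + h, x:x + w], (51, 51), 0
        )
    out.write(frame)

cap.release()
out.release()
```

Note that Haar cascades miss profile and occluded faces (and voices, clothing and surroundings also identify people), so always review the output before treating it as de-identified.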
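Similarly, for the text bullets: Presidio is not only a demo web page, it can also be scripted. A minimal sketch, assuming the `presidio-analyzer` and `presidio-anonymizer` packages are installed together with a spaCy English model, and using a made-up example sentence:

```python
# Minimal sketch of programmatic text de-identification with Microsoft Presidio.
# Requires: pip install presidio-analyzer presidio-anonymizer
# plus a spaCy model, e.g.: python -m spacy download en_core_web_lg
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Maija Virtanen and my phone number is 040 1234567."

# Detect PII entities (names, phone numbers, ...) in the text
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

# Replace each detected entity with a placeholder such as <PERSON>
anonymizer = AnonymizerEngine()
print(anonymizer.anonymize(text=text, analyzer_results=results).text)
```

Automatic NER-based tools catch direct identifiers (names, phone numbers, emails) but miss indirect ones (a rare profession, a unique event in the story), so treat the output as a first pass to review manually, not as proof of anonymity.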
-> Exercise 1

---

## Part 3 - Data anonymization with tabular data (background factors) (14:00 - 14:30)

### Learning outcomes
- you can de-identify/pseudonymize/anonymize personal data in tabular form
- you will get an overview of the Amnesia tool
- you are aware of the limitations of these approaches

**Amnesia tool**
- Tool: https://amnesia.openaire.eu/amnesia/index.html
- Documentation: https://amnesia.openaire.eu/about-documentation.html
- Get some demo data: https://amnesia.openaire.eu/Datasets.zip
- Then, for the exercise:
    1) https://amnesia.openaire.eu/Scenarios/AmnesiaTutorialKAnon.pdf
    2) https://amnesia.openaire.eu/Scenarios/AmnesiaKMAnonymityTutorial.pdf

**ARX tool**
- https://arx.deidentifier.org/
- https://www.youtube.com/channel/UCcGAF5nQ_O6ResEF-ivsbVQ/videos

### Reflections: which one to choose?

**Amnesia**
Pro: a simple way to explore the properties of a tabular dataset; the online version requires no installation.
Con: only k-anonymity (and km-anonymity) is provided; it can sometimes be buggy; for real data or for large datasets it needs to be installed locally.

**ARX**
Pro: state of the art for anonymization, with lots of options.
Con: steeper learning curve.

***Which one to choose?*** Start with the solution that best suits your skills. Some people might be able (or might prefer) to code the anonymization techniques presented in these tools themselves. For a small dataset (e.g. 30 participants) even "manual" anonymization might be an option, but as the numbers grow, and if you need to do this multiple times, a more robust tool makes your life easier. My flowchart for picking the right tool would be:
- fewer than 30 rows: do it manually with Excel
- fewer than 200 rows and fewer than 5 columns: Amnesia
- more than 200 rows / 5 columns: ARX and/or your preferred programming language

---

### Multidimensional k-anonymity

**Mondrian method ([LeFevre et al. 2006](http://pages.cs.wisc.edu/~lefevre/MultiDim.pdf))**

Instead of generalizing each quasi-identifier independently, Mondrian recursively partitions the data along the quasi-identifier dimensions (kd-tree style) until a partition cannot be split without dropping below k records, and then generalizes within each partition.

![](https://i.imgur.com/QEtg2aW.png)
*Figure from [these slides](https://www.slideshare.net/hirsoshnakagawa3/privacy-protectin-models-and-defamation-caused-by-kanonymity)*

Comments: it shares the issues of other clustering and dimensionality-reduction algorithms. If the clusters / principal components do not make much sense, it is difficult to claim anything generalizable about the findings.
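Whatever tool you end up using, the core check behind k-anonymity is simple enough to do yourself. Below is a minimal sketch, assuming pandas is installed and a CSV file `survey.csv` with the hypothetical quasi-identifier columns `age`, `gender` and `zip`; it computes the k of a table as the size of the smallest group of rows sharing a quasi-identifier combination:

```python
# Minimal k-anonymity check with pandas: a table is k-anonymous with respect
# to a set of quasi-identifiers if every combination of their values occurs
# at least k times. File and column names here are hypothetical examples.
import pandas as pd

df = pd.read_csv("survey.csv")
quasi_identifiers = ["age", "gender", "zip"]

# k is the size of the smallest equivalence class
k = df.groupby(quasi_identifiers).size().min()
print(f"The table is {k}-anonymous for {quasi_identifiers}")

# Show the rarest combinations, i.e. the records most at risk of re-identification
print(df.groupby(quasi_identifiers).size().sort_values().head())
```

This only verifies k-anonymity for the columns you declare as quasi-identifiers; the hard part, which Amnesia and ARX help with, is choosing those columns and generalizing their values until k is acceptable without destroying the usefulness of the data.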
---

### Coding anonymization of tabular microdata

:::spoiler
**Python**
- https://github.com/leo-mazz/crowds
- https://medium.com/brillio-data-science/a-brief-overview-of-k-anonymity-using-clustering-in-python-84203012bdea
- Use ARX from Python (Docker needed):
    - https://navikt.github.io/arxaas/
    - https://pypi.org/project/arkhn-arx/

**R**
- https://cran.r-project.org/web/packages/sdcMicro/index.html

**Stata**
- https://stats.idre.ucla.edu/stata/code/generate-anonymous-keys-using-mata/

**Matlab**
:::

---

## Extra Materials

:::spoiler

# Secure workflows

What do we do when we cannot anonymize / de-identify personal data? Some data are impossible to de-identify (think of a fingerprint). Sometimes we need to keep a way to trace back the identity (longitudinal studies, studies that link with data from other sources).

1. Pseudonymize subject identifiers
    - Sometimes it is enough to have a secret list stored somewhere; sometimes you need a more complicated approach, e.g. generating a hashed user ID, with the hash key owned by an organization such as THL (a minimal sketch of this idea is at the end of this page).
2. Make sure your workflows are secure: secure storage (short and long term), secure processing, interactive use
    - FINDATA: audit of everything? https://www.findata.fi/ Coming soon...
    - Aalto shared folders
    - Sensitive code: keep the sensitive parts off GitHub (all you need to run git is a remote SSH server that you share with colleagues)
    - VDI https://vdi.aalto.fi/
    - A list of workflows https://scicomp.aalto.fi/triton/usage/workflows/
    - CSC SD Desktop https://research.csc.fi/-/sd-desktop

## Differential privacy and personal data synthesis

- Perturb the values while keeping structure in the data (the more you perturb, the less usable the data)
- Data synthesis: https://github.com/DPBayes/twinify. Note: install with
```
pip install git+https://github.com/DPBayes/twinify.git@v0.1.1
pip install jaxlib==0.1.51
```

Challenges:
- Precision medicine vs differential privacy: impossible?
- How do we synthesize fake images?
    - Cross-modal synthesis: https://medium.com/analytics-vidhya/medical-imaging-being-transformed-with-gan-mri-to-ct-scan-and-many-others-18a307ef528
    - Other papers: https://arxiv.org/pdf/1907.08533.pdf, https://arxiv.org/pdf/2003.13653.pdf, https://arxiv.org/abs/2009.05946

## Federated analysis

- Bring your code to where the data are (e.g. a hospital) and only store the model (prediction model, normative model, etc.); see e.g. this figure from Wikipedia: https://en.wikipedia.org/wiki/Federated_learning#/media/File:Federated_learning_process_central_case.png

Challenges: technical and human bottlenecks
- How do you run a complex pipeline inside a hospital? Containers!
- But who is going to check your code...?
:::

## Part 4 - Wrap-up and looking ahead (14:30 - 15:xx)

Free discussion and reflections on what we have learned, related to topics such as open science, AI, sensitive data, etc.

https://coderefinery.github.io/manuals/chat/
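Finally, the hashed-user-ID idea mentioned in the secure workflows section above, as a minimal sketch using only Python's standard library; the secret key and participant identifiers below are made-up examples:

```python
# Minimal pseudonymization sketch: derive a stable pseudonym for each
# participant with a keyed hash (HMAC). Only whoever holds SECRET_KEY can
# re-create or verify the mapping; the key and IDs below are made up.
import hmac
import hashlib

SECRET_KEY = b"store-this-key-separately-and-securely"

def pseudonym(participant_id: str) -> str:
    """Return a stable, key-dependent pseudonym for a participant ID."""
    digest = hmac.new(SECRET_KEY, participant_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:12]  # shortened for readability

for pid in ["maija.virtanen@example.org", "matti.m@example.org"]:
    print(pid, "->", pseudonym(pid))
```

A plain, unkeyed hash of an identifier is not enough: anyone can hash a list of candidate names or email addresses and look for matches. The keyed variant avoids this, provided the key is stored separately from the data (e.g. held by a trusted organization such as THL). Either way, the result is pseudonymous rather than anonymous data under the GDPR, since the mapping can be recomputed.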