Hands-on Data Anonymization February 2023
LINK TO THIS DOC:
Workshop practicalities
- Course organizers: Enrico Glerean
- Contact: Instructor enrico.glerean@aalto.fi / Organizer agata.bochynska@ub.uio.no
- Learning goals: The goals for this workshop are practical: to have participants actually de-identify/pseudo-anonymize/anonymize personal data and to use modern techniques for working with personal/sensitive data when anonymization is not possible.
- Target audience: anyone working with personal data in research. Please note that this is not a computer science course on data privacy or data security.
- Please bring your laptop if possible
- Interactive HackMD chat: https://hackmd.io/@eglerean/HODAFeb2023chat
DISCLAIMER
This is work in progress! Materials, links and other resources will be added just before / during the workshop. Let's use this page as a shared document and reference.
Course structure
We focus on tabular data and interview data. There are also materials on more advanced structured and unstructured data types (electronic medical records, medical images, …). We can still adapt the course content to participants' wishes.
Part 1 - Concepts
(12:00 - 12:50)
Learning Outcomes:
- you understand the importance of data anonymization
- you understand how anonymization is aligned with research ethics, data protection law (GDPR), and open science
- you can evaluate if a dataset is anonymous
- you are able to peer review data management plans and grant applications which require (pseudo)anonymization
- you are able to peer review (pseudo)anonymized datasets in research studies
Notes
- We adapt to the audience considering what was covered during the morning seminar
- Let's do an icebreaker in the chat
Slides
Part 2 - Data anonymization with interview data and text
(13:10 - 14:00)
Learning Outcomes:
- you can de-identify/pseudo-anonymize/anonymize data from interviews (audio/visual or just text)
- you can evaluate if a (pseudo)anonymization strategy is successful
- you are aware of the limitations of these approaches
Working with audio/visual interviews
- Video
- Is it necessary? If not, delete it
- If it is necessary: blur faces
- Audio (= speech)
- Is it necessary? If not, transcribe
- If it is necessary: alter the voice (e.g. pitch shifting)
- Content of the interview (= text)
-> Exercise 1
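For the text content, a minimal Python sketch of dictionary-based pseudonymization; the identifiers, placeholders and transcript below are made-up examples, not part of the exercise materials:

```python
import re

# hypothetical mapping from direct identifiers found in the transcript to codes
replacements = {
    "Maija Virtanen": "[PARTICIPANT_1]",
    "Aalto University": "[ORGANIZATION_1]",
    "Helsinki": "[CITY_1]",
}

def pseudonymize(text: str) -> str:
    """Replace every known identifier with its placeholder (case-insensitive)."""
    for name, code in replacements.items():
        text = re.sub(re.escape(name), code, text, flags=re.IGNORECASE)
    return text

transcript = "Maija Virtanen said she moved to Helsinki to work at Aalto University."
print(pseudonymize(transcript))
```

The mapping table should be stored separately and securely (it is the key back to the identities), and manual review is still needed because simple search-and-replace does not catch indirect identifiers.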
Part 3 - Data anonymization with tabular data (background factors)
(14:00 - 14:30)
Learning outcomes
- you can de-identify/pseudo-anonymize/anonymize personal data in tabular form
- you will get an overview of the Amnesia tool
- you are aware of the limitations of these approaches
Amnesia tool
ARX tool
Reflections - Which one to choose?
Amnesia
Pro: simple to explore properties of tabular dataset, no installation needed
Con: only k-anonymity is provided, it can sometimes be buggy, and it needs to be installed locally for real data or for large datasets
ARX
Pro: state of the art for doing anonymization, lots of options
Con: steeper learning curve
Which one to choose? Start with a solution that best suits your skills. Some people might be able (or might prefer) to code the anonymization techniques presented in these tools themselves. For small datasets (e.g. 30 participants) even "manual" anonymization might be an option, but if the numbers grow and you need to do this multiple times, choosing a more robust tool makes your life easier.
My flowchart for picking the right tool would be
- less than 30 rows: do it manually with Excel
- less than 200 rows and less than 5 columns: Amnesia
- more than 200 rows / 5 columns: ARX and/or your preferred programming language
Multidimensional K-anonymity
Mondrian method (LeFevre et al 2006)
Figure from these slides
Comments: it has the same issues as other clustering or dimensionality reduction algorithms. If the clusters / principal components do not make much sense, it is difficult to claim something generalizable about the findings.
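As a rough illustration (not the reference implementation from the paper), here is a minimal Python sketch of Mondrian-style greedy partitioning for numerical quasi-identifiers; the toy columns "age" and "zip" are made up:

```python
import pandas as pd

def mondrian_partition(df, quasi_ids, k):
    """Recursively cut the partition on the quasi-identifier with the widest
    range, splitting at the median, as long as both halves keep >= k rows."""
    spans = {q: df[q].max() - df[q].min() for q in quasi_ids}
    for q in sorted(spans, key=spans.get, reverse=True):
        median = df[q].median()
        left, right = df[df[q] <= median], df[df[q] > median]
        if len(left) >= k and len(right) >= k:
            return (mondrian_partition(left, quasi_ids, k)
                    + mondrian_partition(right, quasi_ids, k))
    return [df]  # no allowed cut: this partition is one equivalence class

def generalize(partitions, quasi_ids):
    """Replace each quasi-identifier with the min-max range of its partition."""
    out = []
    for part in partitions:
        part = part.copy()
        for q in quasi_ids:
            part[q] = f"{part[q].min()}-{part[q].max()}"
        out.append(part)
    return pd.concat(out)

# toy data with hypothetical quasi-identifiers "age" and "zip"
toy = pd.DataFrame({"age": [23, 25, 31, 35, 45, 47, 52, 58],
                    "zip": [100, 102, 110, 115, 200, 205, 210, 215]})
print(generalize(mondrian_partition(toy, ["age", "zip"], k=2), ["age", "zip"]))
```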
Coding anonymization of tabular microdata
Python
R
Stata
Matlab
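Whatever language you pick, the core checks are short. A minimal pandas sketch (column names are hypothetical) that reports the k-anonymity level of a table, i.e. the size of its smallest equivalence class over the quasi-identifiers:

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_ids: list) -> int:
    """Size of the smallest group sharing the same quasi-identifier values:
    the table is k-anonymous for every k up to this number."""
    return int(df.groupby(quasi_ids).size().min())

df = pd.DataFrame({"age_group": ["20-30", "20-30", "30-40", "30-40"],
                   "postcode": ["021**", "021**", "007**", "007**"],
                   "diagnosis": ["A", "B", "A", "C"]})
print(k_anonymity(df, ["age_group", "postcode"]))  # -> 2
```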
Secure workflows
What do we do when we cannot anonymize / de-identify personal data? Some data are impossible to de-identify (think of a fingerprint). Sometimes we need to keep a way to trace back the identity (longitudinal studies, studies that link with data from other sources).
- Pseudo-anonymize subject identifiers: sometimes it is enough to have a secret list stored somewhere, sometimes you need more complicated approaches (e.g. generation of a hashed user ID, with the hash key owned by an organization such as THL); see the sketch after this list
- Make sure your workflow is secure: secure storage (short and long term), secure processing, interactive use
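A minimal sketch of the hashed-ID idea using a keyed hash (HMAC); the secret key and the example identifier below are made up, and in practice the key would be held by the trusted party:

```python
import hashlib
import hmac

SECRET_KEY = b"key-held-by-the-trusted-organization"  # hypothetical secret key

def pseudonym(subject_id: str) -> str:
    """Derive a stable pseudonymous code with a keyed hash (HMAC-SHA256);
    without the key the mapping cannot be recreated or reversed."""
    return hmac.new(SECRET_KEY, subject_id.encode(), hashlib.sha256).hexdigest()[:16]

print(pseudonym("participant-042"))  # same input + same key -> same pseudonym
```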
Differential privacy and personal data synthesis
- Perturb the values while keeping structure in the data (the more you perturb, the less usable the data); see the sketch below
- Data synthesis https://github.com/DPBayes/twinify (note:install with
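A minimal sketch of the perturbation idea, using the Laplace mechanism on a single count query; the epsilon values are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(2023)

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count perturbed with Laplace noise of scale sensitivity/epsilon
    (sensitivity = 1 for a counting query): smaller epsilon, stronger privacy."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

print(dp_count(42, epsilon=0.5))  # noisier, more private
print(dp_count(42, epsilon=5.0))  # closer to the true value, less private
```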
Challenges
Federated analysis
Challenges: technical and human bottlenecks
- How to run a complex pipeline inside a hospital? Containers!
- But who is going to check your code…?
Part 4 - Wrap-up and looking ahead
(14:30 - 15:xx)
Free discussion and reflections on what we have learned related to topics such as open science, AI, sensitive data, etc…
https://coderefinery.github.io/manuals/chat/