# Hands-on Data Anonymization, February 2023

Link to this doc: https://hackmd.io/@eglerean/anonfeb2023

:::success
## Workshop practicalities
- **Course organizer**: Enrico Glerean
- **Contact**: Instructor enrico.glerean@aalto.fi / Organizer agata.bochynska@ub.uio.no
- **Learning goals**: The goals of this workshop are ***practical***: participants will actually de-identify/pseudonymize/anonymize personal data, and will use modern techniques for working with personal/sensitive data when anonymization is not possible.
- **Target audience**: anyone working with personal data in research. Please note that **this is not a computer science course on data privacy or data security**.
- Please bring your laptop if possible.
- Interactive HackMD chat: https://hackmd.io/@eglerean/HODAFeb2023chat
:::

:::danger
### DISCLAIMER
This is a work in progress! Materials, links and other resources will be added just before / during the workshop. Let's use this page as a shared document and reference.
:::

## Course structure

*We focus on tabular data and interview data. There are also materials on more advanced structured and unstructured data types (electronic medical records, medical images, ...). We can still adapt the course content to participants' wishes.*

## Part 1 - Concepts (12:00 - 12:50)

### Learning outcomes
- you understand the importance of data anonymization
- you understand how anonymization aligns with research ethics, data protection law (GDPR), and open science
- you can evaluate whether a dataset is anonymous
- you are able to peer review data management plans and grant applications that require (pseudo)anonymization
- you are able to peer review (pseudo)anonymized datasets in research studies

### Notes
- We adapt to the audience, considering what was covered during the morning seminar
- Let's do an icebreaker in the [chat](https://hackmd.io/@eglerean/HODAFeb2023chat)

### Slides
- https://docs.google.com/presentation/d/1dxK-7PrIcl73laNcQu1VkJ3D7F4iV8gr0CX0jXXX8Wc/edit?usp=sharing

## Part 2 - Data anonymization with interview data and text (13:10 - 14:00)

### Learning outcomes
- you can de-identify/pseudonymize/anonymize data from interviews (audio/visual or just text)
- you can evaluate whether a (pseudo)anonymization strategy is successful
- you are aware of the limitations of these approaches

### Working with audio/visual interviews
- Video
    - Is it necessary? If not, delete it.
    - If it is necessary: blur faces (see the face-blurring sketch after this list)
        - manually, with software like GIMP/Photoshop/Premiere
        - programmatically in Python: https://pypi.org/project/deface/
        - OpenFace or OpenPose: http://multicomp.cs.cmu.edu/resources/openface/ & https://github.com/CMU-Perceptual-Computing-Lab/openpose
        - synthesis / generative approaches (i.e. deepfakes): https://github.com/hukkelas/DeepPrivacy
- Audio (= speech)
    - Is it necessary? If not, transcribe it.
    - If it is necessary:
        - there is no out-of-the-box solution without coding, unless one uses text-to-speech synthesis
        - https://github.com/sarulab-speech/lightweight_spkr_anon
        - for transcription, https://openai.com/blog/whisper/ can be run locally without giving your data to the "cloud"
- Content of the interview (= text)
    - Manually: see the set of rules at https://www.fsd.tuni.fi/en/services/data-management-guidelines/anonymisation-and-identifiers/
    - Programmatically (see the Presidio sketch after this list):
        - https://github.com/microsoft/presidio
            - Demo at https://presidio-demo.azurewebsites.net/
        - https://github.com/openredact
        - https://nlp.stanford.edu/software/CRF-NER.html
        - for medical records: https://medium.com/nedap/de-identification-of-ehr-using-nlp-a270d40fc442
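To make the video bullets above concrete: tools like deface wrap the whole workflow into one command, but the core idea fits in a short script. Here is a minimal sketch, assuming OpenCV (`opencv-python`) is installed; the file names `interview.mp4` and `blurred.mp4` are made-up placeholders, and the Haar-cascade detector is a simple stand-in for the stronger detectors used by deface or DeepPrivacy.

```python
# Minimal face-blurring sketch using OpenCV's built-in Haar cascade.
# File names are placeholders; requires `pip install opencv-python`.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

cap = cv2.VideoCapture("interview.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter(
    "blurred.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height)
)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Detect faces and replace each detected region with a heavy Gaussian blur
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(
            frame[y:y + h, x:x + w], (51, 51), 0
        )
    out.write(frame)

cap.release()
out.release()
```

Note that Haar cascades miss profile and occluded faces (and voices, clothing and surroundings also identify people), so always review the output before treating it as de-identified.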
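Similarly, for the text bullets: Presidio is not only a demo web page, it can also be scripted. A minimal sketch, assuming the `presidio-analyzer` and `presidio-anonymizer` packages are installed together with a spaCy English model, and using a made-up example sentence:

```python
# Minimal sketch of programmatic text de-identification with Microsoft Presidio.
# Requires: pip install presidio-analyzer presidio-anonymizer
# plus a spaCy model, e.g.: python -m spacy download en_core_web_lg
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Maija Virtanen and my phone number is 040 1234567."

# Detect PII entities (names, phone numbers, ...) in the text
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

# Replace each detected entity with a placeholder such as <PERSON>
anonymizer = AnonymizerEngine()
print(anonymizer.anonymize(text=text, analyzer_results=results).text)
```

Automatic NER-based tools catch direct identifiers (names, phone numbers, emails) but miss indirect ones (a rare profession, a unique event in the story), so treat the output as a first pass to review manually, not as proof of anonymity.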
-> Exercise 1

---

## Part 3 - Data anonymization with tabular data (background factors) (14:00 - 14:30)

### Learning outcomes
- you can de-identify/pseudonymize/anonymize personal data in tabular form
- you will get an overview of the Amnesia tool
- you are aware of the limitations of these approaches

**Amnesia tool**
- Tool: https://amnesia.openaire.eu/amnesia/index.html
- Documentation: https://amnesia.openaire.eu/about-documentation.html
- Get some demo data: https://amnesia.openaire.eu/Datasets.zip
- Then, for the exercise:
    1) https://amnesia.openaire.eu/Scenarios/AmnesiaTutorialKAnon.pdf
    2) https://amnesia.openaire.eu/Scenarios/AmnesiaKMAnonymityTutorial.pdf

**ARX tool**
- https://arx.deidentifier.org/
- https://www.youtube.com/channel/UCcGAF5nQ_O6ResEF-ivsbVQ/videos

### Reflections: which one to choose?

**Amnesia**
Pro: a simple way to explore the properties of a tabular dataset; the online version requires no installation.
Con: only k-anonymity (and km-anonymity) is provided; it can sometimes be buggy; for real data or for large datasets it needs to be installed locally.

**ARX**
Pro: state of the art for anonymization, with lots of options.
Con: steeper learning curve.

***Which one to choose?*** Start with the solution that best suits your skills. Some people might be able (or might prefer) to code the anonymization techniques presented in these tools themselves. For a small dataset (e.g. 30 participants) even "manual" anonymization might be an option, but as the numbers grow, and if you need to do this multiple times, a more robust tool makes your life easier. My flowchart for picking the right tool would be:
- fewer than 30 rows: do it manually with Excel
- fewer than 200 rows and fewer than 5 columns: Amnesia
- more than 200 rows / 5 columns: ARX and/or your preferred programming language

---

### Multidimensional k-anonymity

**Mondrian method ([LeFevre et al. 2006](http://pages.cs.wisc.edu/~lefevre/MultiDim.pdf))**

Instead of generalizing each quasi-identifier independently, Mondrian recursively partitions the data along the quasi-identifier dimensions (kd-tree style) until a partition cannot be split without dropping below k records, and then generalizes within each partition.

![](https://i.imgur.com/QEtg2aW.png)
*Figure from [these slides](https://www.slideshare.net/hirsoshnakagawa3/privacy-protectin-models-and-defamation-caused-by-kanonymity)*

Comments: it shares the issues of other clustering and dimensionality-reduction algorithms. If the clusters / principal components do not make much sense, it is difficult to claim anything generalizable about the findings.
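Whatever tool you end up using, the core check behind k-anonymity is simple enough to do yourself. Below is a minimal sketch, assuming pandas is installed and a CSV file `survey.csv` with the hypothetical quasi-identifier columns `age`, `gender` and `zip`; it computes the k of a table as the size of the smallest group of rows sharing a quasi-identifier combination:

```python
# Minimal k-anonymity check with pandas: a table is k-anonymous with respect
# to a set of quasi-identifiers if every combination of their values occurs
# at least k times. File and column names here are hypothetical examples.
import pandas as pd

df = pd.read_csv("survey.csv")
quasi_identifiers = ["age", "gender", "zip"]

# k is the size of the smallest equivalence class
k = df.groupby(quasi_identifiers).size().min()
print(f"The table is {k}-anonymous for {quasi_identifiers}")

# Show the rarest combinations, i.e. the records most at risk of re-identification
print(df.groupby(quasi_identifiers).size().sort_values().head())
```

This only verifies k-anonymity for the columns you declare as quasi-identifiers; the hard part, which Amnesia and ARX help with, is choosing those columns and generalizing their values until k is acceptable without destroying the usefulness of the data.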
---

### Coding anonymization of tabular microdata

:::spoiler
**Python**
- https://github.com/leo-mazz/crowds
- https://medium.com/brillio-data-science/a-brief-overview-of-k-anonymity-using-clustering-in-python-84203012bdea
- Use ARX from Python (Docker needed):
    - https://navikt.github.io/arxaas/
    - https://pypi.org/project/arkhn-arx/

**R**
- https://cran.r-project.org/web/packages/sdcMicro/index.html

**Stata**
- https://stats.idre.ucla.edu/stata/code/generate-anonymous-keys-using-mata/

**Matlab**
:::

---

## Extra Materials

:::spoiler

# Secure workflows

What do we do when we cannot anonymize / de-identify personal data? Some data are impossible to de-identify (think of a fingerprint). Sometimes we need to keep a way to trace back the identity (longitudinal studies, studies that link with data from other sources).

1. Pseudonymize subject identifiers
    - Sometimes it is enough to have a secret list stored somewhere; sometimes you need a more complicated approach, e.g. generating a hashed user ID, with the hash key owned by an organization such as THL (a minimal sketch of this idea is at the end of this page).
2. Make sure your workflows are secure: secure storage (short and long term), secure processing, interactive use
    - FINDATA: audit of everything? https://www.findata.fi/ Coming soon...
    - Aalto shared folders
    - Sensitive code: keep the sensitive parts off GitHub (all you need to run git is a remote SSH server that you share with colleagues)
    - VDI https://vdi.aalto.fi/
    - A list of workflows https://scicomp.aalto.fi/triton/usage/workflows/
    - CSC SD Desktop https://research.csc.fi/-/sd-desktop

## Differential privacy and personal data synthesis

- Perturb the values while keeping structure in the data (the more you perturb, the less usable the data)
- Data synthesis: https://github.com/DPBayes/twinify. Note: install with
```
pip install git+https://github.com/DPBayes/twinify.git@v0.1.1
pip install jaxlib==0.1.51
```

Challenges:
- Precision medicine vs differential privacy: impossible?
- How do we synthesize fake images?
    - Cross-modal synthesis: https://medium.com/analytics-vidhya/medical-imaging-being-transformed-with-gan-mri-to-ct-scan-and-many-others-18a307ef528
    - Other papers: https://arxiv.org/pdf/1907.08533.pdf, https://arxiv.org/pdf/2003.13653.pdf, https://arxiv.org/abs/2009.05946

## Federated analysis

- Bring your code to where the data are (e.g. a hospital) and only store the model (prediction model, normative model, etc.); see e.g. this figure from Wikipedia: https://en.wikipedia.org/wiki/Federated_learning#/media/File:Federated_learning_process_central_case.png

Challenges: technical and human bottlenecks
- How do you run a complex pipeline inside a hospital? Containers!
- But who is going to check your code...?
:::

## Part 4 - Wrap-up and looking ahead (14:30 - 15:xx)

Free discussion and reflections on what we have learned, related to topics such as open science, AI, sensitive data, etc.

https://coderefinery.github.io/manuals/chat/
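Finally, the hashed-user-ID idea mentioned in the secure workflows section above, as a minimal sketch using only Python's standard library; the secret key and participant identifiers below are made-up examples:

```python
# Minimal pseudonymization sketch: derive a stable pseudonym for each
# participant with a keyed hash (HMAC). Only whoever holds SECRET_KEY can
# re-create or verify the mapping; the key and IDs below are made up.
import hmac
import hashlib

SECRET_KEY = b"store-this-key-separately-and-securely"

def pseudonym(participant_id: str) -> str:
    """Return a stable, key-dependent pseudonym for a participant ID."""
    digest = hmac.new(SECRET_KEY, participant_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:12]  # shortened for readability

for pid in ["maija.virtanen@example.org", "matti.m@example.org"]:
    print(pid, "->", pseudonym(pid))
```

A plain, unkeyed hash of an identifier is not enough: anyone can hash a list of candidate names or email addresses and look for matches. The keyed variant avoids this, provided the key is stored separately from the data (e.g. held by a trusted organization such as THL). Either way, the result is pseudonymous rather than anonymous data under the GDPR, since the mapping can be recomputed.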