
Hands-on Data Anonymization February 2023

LINK TO THIS DOC:

https://hackmd.io/@eglerean/anonfeb2023

Workshop practicalities

  • Course organizers: Enrico Glerean
  • Contact:
    Instructor enrico.glerean@aalto.fi / Organizer agata.bochynska@ub.uio.no
  • Learning goals: The goals for this workshop are practical: to have participants actually de-identify/pseudo-anonymize/anonymize personal data, and to use modern techniques for working with personal/sensitive data when anonymization is not possible.
  • Target audience: anyone working with personal data in research. Please note that this is not a computer science course on data privacy or data security.
  • Please bring your laptop if possible
  • Interactive HackMD chat: https://hackmd.io/@eglerean/HODAFeb2023chat

DISCLAIMER

This is work in progress! Materials, links and other resources will be added just before / during the workshop. Let's use this page as a shared document + reference.

Course structure

We focus on tabular data and interview data. There are also materials on more advanced structured and unstructured data types (e.g. electronic medical records, medical images). We can still adapt the course content to participants' wishes.

Part 1 - Concepts

(12:00 - 12:50)
Learning Outcomes:

  • you understand the importance of data anonymization
  • you understand how anonymization is aligned with research ethics, data protection law (GDPR), open science
  • you can evaluate if a dataset is anonymous
  • you are able to peer review data management plans and grant applications which require (pseudo)anonymization
  • you are able to peer review (pseudo)anonymized datasets in research studies

Notes

  • We adapt to the audience considering what was covered during the morning seminar
  • Let's do an icebreaker in the chat

Slides

Part 2 - Data anonymization with interview data and text

(13:10 - 14:00)

Learning Outcomes:

  • you can de-identify/pseudo-anonymize/anonymize data from interviews (audio/visual or just text)
  • you can evaluate if a (pseudo)anonymization strategy is successful
  • you are aware of the limitations of these approaches

Working with audio/visual interviews

-> Exercise 1
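
For the text side of this part, here is a minimal illustrative sketch (not part of the official exercise; the transcript, names and placeholder tags are made up) of how direct identifiers in a transcript could be replaced with consistent pseudonym tags in Python:

```python
import re

# Hypothetical example: pseudonymize direct identifiers in an interview transcript.
# The identifier list and placeholder scheme are made up for illustration; a real
# project needs a curated list (names, places, employers, dates) plus manual review.
transcript = "Interviewer: Maija, you said you moved to Espoo in 2019 to work at Nokia."
identifiers = {
    "Maija": "[PARTICIPANT_1]",
    "Espoo": "[CITY]",
    "Nokia": "[EMPLOYER]",
}

pseudonymized = transcript
for original, placeholder in identifiers.items():
    # word-boundary match so substrings inside other words are not replaced
    pseudonymized = re.sub(rf"\b{re.escape(original)}\b", placeholder, pseudonymized)

print(pseudonymized)
# Interviewer: [PARTICIPANT_1], you said you moved to [CITY] in 2019 to work at [EMPLOYER].
```

Note that indirect identifiers (here the year 2019) may still need generalization or removal depending on the re-identification risk.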


Part 3 - Data anonymization with tabular data (background factors)

(14:00 - 14:30)

Learning outcomes

  • you can de-identify/pseudo-anonymize/anonymize personal data in tabular form
  • you will get an overview of the Amnesia tool
  • you are aware of the limitations of these approaches

Amnesia tool

ARX tool

Reflections - Which one to choose?

Amnesia
Pro: simple to explore properties of tabular dataset, no installation needed
Con: only k-anonymity is provided; it can sometimes be buggy; it needs to be installed locally for real data or for large datasets

ARX
Pro: state of the art for doing anonymization, lots of options
Con: steeper learning curve

Which one to choose? Start with a solution that best suits your skills. Some people might be able (or might prefer) to code the anonymization techniques presented in these tools themselves (a minimal coded example is sketched after the flowchart below). For small datasets (e.g. 30 participants) even "manual" anonymization might be an option, but if the numbers grow and you need to do this multiple times, choosing a more robust tool makes your life easier.

My flowchart for picking the right tool would be

  • less than 30 rows: do it manually with Excel
  • less than 200 rows and less than 5 columns: Amnesia
  • more than 200 rows / 5 columns: ARX and/or your preferred programming language
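
If you go the coding route, a minimal sketch of how a k-anonymity check could look in Python/pandas (the table and quasi-identifier columns below are made up for illustration):

```python
import pandas as pd

# Hypothetical microdata; the quasi-identifier columns are made up for illustration.
df = pd.DataFrame({
    "age_group": ["20-29", "20-29", "30-39", "30-39", "30-39", "40-49"],
    "gender":    ["F",     "F",     "M",     "M",     "M",     "F"],
    "zip3":      ["001",   "001",   "002",   "002",   "002",   "003"],
    "score":     [12, 15, 9, 11, 14, 8],
})

quasi_identifiers = ["age_group", "gender", "zip3"]

# k-anonymity: every combination of quasi-identifier values must occur at least k times
group_sizes = df.groupby(quasi_identifiers).size()
print(f"The table is {group_sizes.min()}-anonymous w.r.t. {quasi_identifiers}")

# Groups below a target k (here k=2) would need further generalization or suppression
print(group_sizes[group_sizes < 2])
```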

Multidimensional K-anonymity

Mondrian method (LeFevre et al 2006)


Figure from these slides

Comments: it has the same issues as other clustering or dimensionality reduction algorithms. If the clusters / principal components do not make much sense, it is difficult to claim anything generalizable about the findings.
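
To make the idea concrete, here is a simplified, illustrative sketch of Mondrian-style greedy partitioning for numeric quasi-identifiers (made-up data; this is not the reference implementation from the paper):

```python
import pandas as pd

def mondrian_partition(df, quasi_identifiers, k):
    """Greedy Mondrian-style partitioning: recursively split the data at the
    median of the quasi-identifier with the widest range, stopping whenever a
    split would leave fewer than k rows on either side.
    Simplified sketch for numeric quasi-identifiers only."""
    spans = {col: df[col].max() - df[col].min() for col in quasi_identifiers}
    split_col = max(spans, key=spans.get)

    if spans[split_col] == 0 or len(df) < 2 * k:
        return [df]  # cannot split further without breaking k-anonymity

    median = df[split_col].median()
    left = df[df[split_col] <= median]
    right = df[df[split_col] > median]

    if len(left) < k or len(right) < k:
        return [df]

    return (mondrian_partition(left, quasi_identifiers, k)
            + mondrian_partition(right, quasi_identifiers, k))

# Made-up data; each returned partition is then generalized to its min-max ranges
data = pd.DataFrame({"age": [23, 25, 31, 33, 35, 41, 44, 47],
                     "zip": [10, 11, 12, 20, 21, 22, 30, 31]})
for part in mondrian_partition(data, ["age", "zip"], k=2):
    print(part.agg(["min", "max"]))
```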


Coding anonymization of tabular microdata

Python
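
An illustrative sketch of two common building blocks, generalization and suppression, using pandas (the column names and bin edges below are made up):

```python
import pandas as pd

# Hypothetical microdata; column names and bin edges are made up for illustration.
df = pd.DataFrame({
    "age": [23, 27, 34, 38, 45, 51, 62, 74],
    "postcode": ["00100", "00120", "00140", "00200", "00210", "00220", "00230", "00240"],
    "diagnosis": ["A", "A", "B", "B", "A", "C", "A", "B"],
})

# Generalization: replace exact ages with coarse age bands
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 70, 120], right=False,
                        labels=["<30", "30-49", "50-69", "70+"])

# Generalization: keep only the first three digits of the postcode
df["postcode_area"] = df["postcode"].str[:3]

# Suppression: drop the original precise quasi-identifiers before sharing
anonymized = df.drop(columns=["age", "postcode"])
print(anonymized)
```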

R

Stata

Matlab


Extra Materials

Secure workflows

What do we do when we cannot anonymize / de-identify personal data? Some data are impossible to de-identify (think of a fingerprint). Sometimes we need to keep a way to trace back the identity (longitudinal studies, studies that link with data from other sources).

  1. Pseudo-anonymize subject identifiers
    Sometimes it is enough to have a secret list stored somewhere; sometimes you need more complicated approaches (e.g. generation of a hashed user ID, with the hash key owned by an organization such as THL). A minimal sketch of this idea is shown after this list.

  2. Make sure your workflow is secure: secure storage (short and long term), secure processing, secure interactive use
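
A minimal sketch of the hashed-ID idea from point 1, using a keyed hash (HMAC-SHA256). The key and identifier below are made-up placeholders; in practice the key would be generated and held by the trusted organization:

```python
import hmac
import hashlib

# Made-up placeholder: the real key must be long, random, and stored separately
# (e.g. held by the trusted organization), otherwise anyone who knows the
# original identifiers can recompute the mapping.
SECRET_KEY = b"replace-with-a-long-random-secret"

def pseudonym(subject_id: str) -> str:
    """Keyed hash so identical inputs always map to the same pseudonym,
    but the mapping cannot be recomputed without the secret key."""
    return hmac.new(SECRET_KEY, subject_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

print(pseudonym("010190-123X"))  # same input -> same pseudonym on every run
```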

Differential privacy and personal data synthesis

  • Perturb the values while keeping structure in the data (the more you perturb, the less usable the data becomes). A toy example of noise perturbation is sketched after this list.
  • Data synthesis: https://github.com/DPBayes/twinify (note: install with
pip install git+https://github.com/DPBayes/twinify.git@v0.1.1
pip install jaxlib==0.1.51)
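
As a toy illustration of the perturbation idea (the classical Laplace mechanism on a single count query; this is not what twinify does internally):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy example of the Laplace mechanism: release a count with differential privacy.
# A counting query has sensitivity 1: adding or removing one person changes it by at most 1.
true_count = 42          # e.g. number of participants with some attribute
epsilon = 1.0            # privacy budget: smaller epsilon = more noise, more privacy
sensitivity = 1.0

noisy_count = true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
print(round(noisy_count, 2))
```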

Challenges

Federated analysis

Challenges: technical and human bottlenecks

  • How to run a complex pipeline inside a hospital? Containers!
  • But who is going to check your code?

Part 4 - Wrap-up and looking ahead

(14:30 - 15:xx)

Free discussion and reflections on what we have learned, related to topics such as open science, AI, sensitive data, etc.

https://coderefinery.github.io/manuals/chat/