Camila Rangel Smith
# What is QUIPP

> These notes are a mixture of documentation from QUiPP, discussions with members of the team, extracts from papers, reflections from Camila and [notes](https://hackmd.io/tpy3YfaARPWYLA7Rle_5yQ) written by Callum.

## Description

The QUiPP software is a pipeline for generating synthetic tabular data. QUiPP uses a variety of methods as implemented by several libraries and provides measures of privacy and utility on the resulting released datasets. Whilst the synthetic data generation methods come from external libraries, the privacy and utility metrics have been chosen and implemented by the QUiPP team.

QUiPP is a Makefile-based reproducible pipeline. The pipeline is highly configurable through input JSON files, which tell the pipeline which dataset to use, which synthesis methods to run (as well as providing configuration intrinsic to each method), and subsequently which privacy and utility metrics to calculate. The input data is present as two files: a `.csv` file which must contain column headings (along with the column data itself), and a JSON file describing the types of the columns used for synthesis.

The following figure indicates the full pipeline, as run on an input file called example.json. This input file has the keywords *dataset* (the base of the filename to use for the original input data) and `synth-method`, which refers to one of the synthesis methods. As output, the pipeline produces:

- A number of **output synthetic datasets** as `.csv` files (configurable by the input JSON file), e.g. synthetic_data_1.csv, synthetic_data_2.csv, ...
- **Privacy metrics** in JSON files, such as the disclosure risk privacy score (e.g. disclosure_risk.json)
- **Utility metrics** in several JSON files, such as correlation-like utility metrics and classification tasks (e.g. sklearn_classifiers.json)

![](https://i.imgur.com/rfSyvd1.png)
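For orientation, a run-input file could be assembled as in the minimal sketch below. Only the `dataset` and `synth-method` keywords come from the description above; the path and any other structure are hypothetical placeholders, so the example files shipped in `run-inputs/` should be treated as the authoritative format.

```python
import json

# Hypothetical sketch of a run-input file. Only "dataset" and "synth-method"
# are keywords mentioned in these notes; the values and layout below are
# illustrative placeholders, not QUiPP's actual schema.
run_input = {
    "dataset": "datasets/my-dataset/my-dataset",  # assumed: base path of the .csv/.json input pair
    "synth-method": "ctgan",                      # one of the directories under synth-methods/
}

with open("run-inputs/example.json", "w") as f:
    json.dump(run_input, f, indent=2)
```

Running `make` at the top level then picks this file up together with any other JSON files in `run-inputs/` (see "Running the pipeline" below).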
The file-based design of the pipeline makes it very customisable, allowing it to run on any dataset (as long as its format and variable types are properly described in the dataset JSON file) with any configuration of the available methods and assessment metrics. The data formats and types that can be synthesised do not depend on the QUiPP pipeline, but on the synthesis method used.

Furthermore, this design makes QUiPP extensible. For example, to add a new synthetic data method you just need a new folder with the method name, and a file named `run` (which is typically a Python wrapper that picks up the values from the input JSON and passes them to a Python script housing the method).

Finally, QUiPP can also generate toy datasets that can be used in the pipeline. This is not part of the main functionality of the pipeline, but it can be used for experimentation and was useful in the early days of the development of the software.

The table below summarises all the methods (synthetic data generation, utility and privacy estimation) available in QUiPP and provides details on their implementation.

| Method | Data | QUiPP implementation | Comments |
| -------- | -------- | -------- | -------- |
| **Generation** | | | |
| CTGAN | Tabular data which can contain continuous and discrete values. | Uses the [CTGAN library](https://pypi.org/project/ctgan/). | Seems to work with most/all example datasets from QUiPP. |
| PrivBayes | Tabular data with four data types: integer, float, datetime and string. | Uses the [DataSynthesizer](https://github.com/gmingas/DataSynthesizer) library. | Uses a DataSynthesizer fork, which has been modified by the QUiPP team. Seems to work with most/all example datasets from QUiPP. |
| Synthpop | Any data type, according to [Table 1 of its accompanying paper](https://www.jstatsoft.org/article/view/v074i11). | Uses the [synthpop](https://CRAN.R-project.org/package=synthpop) R package, particularly the CART method. | The pipeline can only use the Polish data set embedded in the synthpop package or the ONS Census data set available in QUiPP. |
| SGF | Only supports discrete variables. | Uses the [Synthetic Data Generation Framework](https://vbinds.ch/node/69). | Only works for the dummy A&E generated dataset. |
| Baselines | - | Implemented within QUiPP | Sampling from the original dataset with and without replacement. |
| **Utility** | | | |
| Classifier metrics | Target variable must be binary categorical data. | Implemented within QUiPP | Compares ML classifier performance metrics for the original and released datasets on a given target variable (LogisticRegression, KNeighborsClassifier, SVC, RandomForestClassifier, etc. from [sklearn](https://scikit-learn.org/stable/)). |
| Correlation metrics | Categorical | Implemented within QUiPP | Cramer's V and Theil's U for categorical-categorical combinations of columns, and correlation ratios, from the [dython library](http://shakedzy.xyz/dython/). |
| Feature importance | Categorical, numerical, ordinal. | Implemented within QUiPP (this was part of the research done within the project) | Calculates differences in feature importance ranking between the original and released datasets, using a random forest classification model, various feature importance measures and various feature rank/score comparison measures. |
| **Privacy** | | | |
| Disclosure risk | - | Implemented within QUiPP | Only works with partially generated data (inspired by this [paper](http://www2.stat.duke.edu/~jerry/Papers/PSD08.pdf)). |

## How to use

The installation instructions are found in the [README](https://github.com/alan-turing-institute/QUIPP-pipeline#local-installation) of the repo. To run the pipeline, you need to understand the top-level directory structure, which mirrors the data pipeline. Not all directories in there are relevant for running the pipeline, so I will only describe the ones that are (a more detailed description of the top-level repo can be found [here](https://github.com/alan-turing-institute/QUIPP-pipeline#top-level-directory-contents)).

Relevant directories for running QUiPP:

- `env-configuration`: Set-up of the computational environment needed by the pipeline and its dependencies
- `generators`: Quickly generating toy input data for the pipeline from a few tunable and well-understood models
- `datasets`: Sample data that can be consumed by the pipeline
- `synth-methods`: One directory per library/tool, each of them implementing a complete synthesis method
- `utility-metrics`: Scripts relating to computing the utility metrics
- `privacy-metrics`: Scripts relating to computing the privacy metrics
- `run-inputs`: Parameter JSON files (see below), one for each run

### Running the pipeline

1. Have an input JSON file in `run-inputs/` for each desired synthesis (a dataset + a synthesis method + the metrics of utility and privacy to be calculated).
2. Run `make` in the top-level QUIPP-pipeline directory.
   - This will run a synthesis for each JSON input file found in `run-inputs/`.
   - This also runs the generation of a toy dataset and its subsequent synthesis.

   When the pipeline is run, additional directories are created:
   - `generator-outputs`: Sample generated input data (using `generators`)
   - `synth-output`: Contains the result of each run (as specified in `run-inputs`), which will typically consist of the synthetic data itself and a selection of utility and privacy scores
3. `make clean` removes all synthetic output and generated data.

## Methods implemented

> Disclaimer: Some descriptions below are extracted directly from the papers describing the methods.

### CTGAN

[CTGAN](https://pypi.org/project/ctgan/) is a library that implements a synthesiser for tabular data based on generative adversarial networks (GANs), presented at the NeurIPS 2019 conference in the paper [Modeling Tabular Data using Conditional GAN](https://arxiv.org/abs/1907.00503). In the paper, the authors found that modelling tabular data poses unique challenges for GANs, causing them to fall short of other baseline methods. These challenges include the need to simultaneously model discrete and continuous columns, the multi-modal non-Gaussian values within each continuous column, and the severe imbalance of categorical columns. To address these challenges, the conditional tabular GAN (CTGAN) method is proposed, which introduces several new techniques: augmenting the training procedure with mode-specific normalization, architectural changes, and addressing data imbalance by employing a conditional generator and training-by-sampling (details of what this means can be found in the paper).

To use CTGAN, the input data must be in either a numpy.ndarray or a pandas.DataFrame object with two types of columns: continuous (can contain any numerical value) and discrete (contain a finite number of values, whether these are string values or not).
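To make this concrete, here is a minimal standalone sketch of calling the ctgan library directly (this is not QUiPP's wrapper). The class name differs between library versions: older releases expose `CTGANSynthesizer`, newer ones `CTGAN`.

```python
import pandas as pd
from ctgan import CTGAN  # in older ctgan releases this class is called CTGANSynthesizer

# Tiny toy table with one continuous and one discrete column,
# repeated so the GAN has more than a handful of rows to train on.
df = pd.DataFrame({
    "age": [23, 45, 31, 62, 50, 29],
    "employment": ["employed", "retired", "employed", "retired", "student", "student"],
})
df = pd.concat([df] * 100, ignore_index=True)

model = CTGAN(epochs=10)                         # small epoch count, just for the sketch
model.fit(df, discrete_columns=["employment"])   # discrete columns must be listed explicitly
synthetic = model.sample(200)                    # draw 200 synthetic rows
print(synthetic.head())
```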
**Summary of the QUiPP implementation**: the latest version of the library is in `develop-paper`. Seems to work on several test datasets.

### PrivBayes

Given a dataset *D*, PrivBayes first constructs a Bayesian network *N*, which (i) provides a succinct model of the correlations among the attributes in *D* and (ii) allows us to approximate the distribution of data in *D* using a set *P* of low-dimensional marginals of *D*. After that, PrivBayes injects noise into each marginal in *P* to ensure **differential privacy**, and then uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in *D*. Finally, PrivBayes samples tuples from the approximate distribution to construct a synthetic dataset (description taken from the paper [PrivBayes: Private Data Release via Bayesian Networks](https://dl.acm.org/doi/10.1145/3134428)).

QUiPP uses the PrivBayes implementation within the DataSynthesizer fork found [here](https://github.com/gmingas/DataSynthesizer). DataSynthesizer learns a differentially private Bayesian network capturing the correlation structure between attributes, then draws samples from this model to construct the result dataset. DataSynthesizer supports four data types (integer, float, datetime and string). The system allows users to explicitly specify attribute data types; if an attribute data type is not specified by the user, the system infers it (details about this are in this [document](https://github.com/gmingas/DataSynthesizer/blob/master/docs/cr-datasynthesizer-privacy.pdf)). DataSynthesizer also lets users state whether an attribute is categorical or continuous, but it is my understanding that the method is not directly compatible with continuous features: it converts them to ordinal categorical variables via a histogram.

**Summary of the QUiPP implementation**: the latest version of the library is in `develop-paper`. Seems to work on several test datasets.

### Synthpop

[Synthpop](https://CRAN.R-project.org/package=synthpop) is an R package for producing synthetic microdata using sequential modelling. The synthesising methods can be parametric or non-parametric. The latter is based on classification and regression trees (CART), which can handle any type of data. The parametric methods can handle numeric, binary, unordered factor and ordered factor data types. The methods currently implemented and the data types that they accept are listed in [Table 1 of its accompanying paper](https://www.jstatsoft.org/article/view/v074i11). The synthetic values of the variables are generated sequentially from their conditional distributions given the variables already synthesised. The QUiPP input JSON file allows for the specification of the type of synthesis to be used and the order in which the columns are synthesised. The synthesis can also be done only on the marginals.

**Summary of the QUiPP implementation**: the latest version of the library is in `2011-uk-census-microdata`. The wrapper around the library seems to be customised for two of the existing QUiPP example datasets, namely the Polish and UK census ones. It is not clear to me at the moment of writing whether this will run on a new dataset out of the box.

### SGF

The SGF library uses a probabilistic model that captures the joint distribution of attributes. The model is learned from data samples and a defined directed acyclic graph (DAG) drawn from the original dataset, where the nodes are the random variables and the edges represent the probabilistic dependencies between them. The main draw of this method is that it offers to generate synthetic data in a privacy-preserving manner by using a mechanism that enforces **plausible deniability**. Plausible deniability is a criterion that provides a formal privacy guarantee, notably for releasing sensitive datasets: an output record can be released only if a certain number of input records are indistinguishable, up to a privacy parameter. This notion does not depend on the background knowledge of an adversary.

The mechanism works in the following way: given a generative model M, a dataset D, and privacy parameters k and γ, output a synthetic record y or nothing.

1. Randomly sample a seed record d ∈ D.
2. Generate a candidate synthetic record y = M(d).
3. Invoke a plausible deniability privacy test based on the privacy parameters.
4. If the tuple passes the test, then release y. Otherwise, there is no output.

The larger the privacy parameter k is, the larger the indistinguishability set for the input data record. Also, the closer the privacy parameter γ is to 1, the stronger the indistinguishability of the input record among other plausible records. More details about plausible deniability and the SGF method can be found in [this paper](https://vbinds.ch/sites/default/files/PDFs/VLDB17-Bindschaedler-Plausible.pdf).

In the QUiPP pipeline, the input JSON file allows for the configuration of the privacy parameters. The SGF method only supports discrete variables, thus continuous variables must be discretised a priori by the user. QUiPP provides extra scripts to create the configuration files (beyond the input JSON file) needed to set up the DAG, the discretisation and other synthesis parameters.

**Summary of the QUiPP implementation**: the latest version of the library is in `develop`. The wrapper around the library seems to be customised for a purposely generated example dataset. This workflow doesn't seem to be well integrated with the general design of the QUiPP pipeline; it was an early-stage development that was not carried over with the pipeline. However, it is the only method with a privacy implementation other than differential privacy. I doubt this will run on a new dataset out of the box.

### Baseline methods

#### Subsample

Returns a random subsample of entries, without replacement, from the original dataset.

#### Bootstrap

Returns a random sample of entries, with replacement, from the original dataset.

#### Ensemble

Seems to refer to the "sample" method from Synthpop.

**Summary of the QUiPP implementation**: the latest version is in `develop-paper`. Seems to work on several test datasets.

## Assessment metrics implemented

### Privacy

#### Privacy parameter as input

Privacy can be given as an input parameter (whether this is differential privacy or plausible deniability) in the PrivBayes and SGF method implementations.

#### Disclosure risk

The disclosure risk can be defined as the risk that an intruder with access to samples from the original dataset can identify information about an individual in the released dataset. Calculating disclosure risks can be relevant when producing partially synthetic datasets, where some columns remain unchanged. QUiPP calculates a number of metrics, such as `EMRi`, `TMRi`, `TMRa`, `EMRi_norm` and `TMRi_norm`.

> DISCLAIMER: I don't understand how these metrics are calculated; there is no clear documentation around it.

**Summary of the QUiPP implementation**: the latest version is in `develop`. Seems to be useful only in very specific cases.

#### PATE-GAN

> DISCLAIMER: I don't understand how this library is used within QUiPP; there is no clear documentation around it.

### Utility

#### Classifier metrics

Calculates the performance of different machine learning classifiers trained on the original and released datasets for a specified target variable, and then tested on the original. These values are then compared to estimate the utility of the released dataset. Results are saved to `.json` files and an HTML report is generated.
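The general idea can be sketched with plain scikit-learn: train the same classifiers once on the original data and once on the released data, then score both on a held-out slice of the original. This is an illustration of the approach, not QUiPP's exact code, and it assumes numeric features and a 0/1 binary target.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def classifier_utility(original_df, synthetic_df, target):
    """Train the same classifiers on the original and on the synthetic data,
    score both on a held-out slice of the original data, and report the gap.
    Sketch of the general idea only, not QUiPP's exact procedure."""
    X = original_df.drop(columns=[target])
    y = original_df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    results = {}
    for name, clf in [("logistic_regression", LogisticRegression(max_iter=1000)),
                      ("random_forest", RandomForestClassifier(random_state=0))]:
        real = clone(clf).fit(X_train, y_train)
        synth = clone(clf).fit(synthetic_df.drop(columns=[target]),
                               synthetic_df[target])
        results[name] = {
            "trained_on_original": f1_score(y_test, real.predict(X_test)),
            "trained_on_synthetic": f1_score(y_test, synth.predict(X_test)),
        }
    return results
```

The closer the two scores are for each classifier, the better the released data has preserved the predictive signal for that target.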
**Summary of the QUiPP implementation**: the latest version is in `develop-paper`, but it doesn't seem to have many changes from `develop`.

#### Correlation metrics

Several correlation-like utility metrics are calculated for all combinations of columns of the original and released datasets. Results are saved into a `.json` file and can be compared to estimate the utility of the released dataset. The metrics are the implementations from the [dython library](http://shakedzy.xyz/dython/) of the following association measures for categorical variables:

- Cramer's V
- Theil's U
- Correlation ratio
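For reference, these measures can also be computed directly with dython for individual column pairs; a minimal sketch (independent of QUiPP's wrapper) follows, assuming a dython version where the functions live in `dython.nominal`.

```python
import pandas as pd
from dython.nominal import cramers_v, theils_u, correlation_ratio

# Toy frame: two categorical columns and one numerical column.
df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "C", "C"],
    "group":  ["x", "x", "y", "y", "x", "y"],
    "income": [30, 32, 45, 47, 31, 46],
})

print(cramers_v(df["city"], df["group"]))           # symmetric categorical-categorical association
print(theils_u(df["city"], df["group"]))            # asymmetric categorical-categorical association
print(correlation_ratio(df["city"], df["income"]))  # categorical-numerical association
```

Computing the same numbers on the original and released datasets and comparing them is the essence of this utility check.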
**Summary of the QUiPP implementation**: the latest version is in the `develop-paper` branch, but it doesn't seem to have many changes from the `develop` branch.

#### Feature importance

The differences in feature importance ranking between the original and released datasets are calculated, using a random forest classification model, various feature importance measures and various feature rank/score comparison measures. The results are saved into a `.json` file and can be compared to estimate the utility of the released dataset. Some of the rank comparison metrics implemented are the following:

- Rank-biased overlap
- Correlated rank similarity metrics
- L2 norm
- KL divergence
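As a rough sketch of the idea (not the project's actual implementation), one can fit the same random forest on the original and on the released data and compare the two importance vectors, here with the L2 norm mentioned above; any of the rank-based measures could be substituted at the same point.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def importance_gap(original_df, synthetic_df, target):
    """Fit the same random forest on the original and synthetic data and
    compare the resulting feature-importance vectors (L2 norm here).
    Sketch of the idea only, not QUiPP's implementation."""
    feats = [c for c in original_df.columns if c != target]

    rf_orig = RandomForestClassifier(random_state=0).fit(
        original_df[feats], original_df[target])
    rf_syn = RandomForestClassifier(random_state=0).fit(
        synthetic_df[feats], synthetic_df[target])

    # L2 distance between the importance vectors; the rankings themselves
    # could instead be compared with measures such as rank-biased overlap.
    return float(np.linalg.norm(
        rf_orig.feature_importances_ - rf_syn.feature_importances_))
```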
**Summary of the QUiPP implementation**: only implemented in the `develop-paper` branch. Seems to work on several test datasets.

## Streams of work/functionalities

The main working branch is `develop`. However, this branch doesn't seem to have had any new developments (beyond changes to the documentation) in over a year. Two branches have diverged from `develop` and contain important new contributions; the description of QUiPP above tries to include all of the newest contributions. One of the branches is `2011-census-microdata`, where the census data was generated for the ONS project. This branch contains the most documented example of how QUiPP works, in the form of notebooks. The other branch is `develop-paper`, which has extra utility metrics (e.g. feature importance) and updated synthesis methods with respect to `develop` and `2011-census-microdata` (plus other developments made specifically to run benchmarking experiments for a paper, locally and in Azure), but without very clear documentation/examples.

## Final thoughts (critique) on QUiPP

There has been an enormous amount of work dedicated to the QUiPP pipeline, and in this section I do not aim to criticise the efforts made by the team, but to provide a view that might help us think about how we need to take QUiPP forward.

- Several methods have been customised for specific datasets (e.g. the Census dataset or generated data). There is no guarantee that they work on a new one out of the box (this seems to be the case for Synthpop and SGF).
- Synthetic data generation libraries tend to have several different methods implemented. QUiPP in some cases reduces the potential of these libraries by only exposing one or a few of their methods.
- The level of customisability of the pipeline also makes it very difficult to understand, even in the parts that are documented. Other areas are not documented at all.
- It feels that, for the same effort of understanding a new library and writing a wrapper for QUiPP, a user would rather just use the library by itself, unless QUiPP has something unique to offer.
- It is not clear to me what the unique thing QUiPP has to offer is. QUiPP does not have an authoritative definition of how to measure utility/privacy (and probably no one does). Perhaps QUiPP could function better as a benchmarking tool, as it seemed to on the `develop-paper` branch.
- The more libraries in different programming languages get added to the pipeline, the more complicated it becomes to install and run (the SGF library is an example of this). A good amount of work has to be done to keep the pipeline compatible and interoperable.
