# Platforms for Synthetic Data
The primary purpose of data synthesis in the ONS is to give researchers and staff early access to safe data while their application to use the real, confidential data progresses. In the meantime, users can explore data that resembles the real thing, assess its suitability for their aims, and develop their analyses outside the secure environment.
Now that our data synthesis projects are maturing, we must find an appropriate system for people to access these synthetic outputs. Any such system requires two parts: a place to store the synthetic data and an environment in which to develop code.
Importantly, we want a system that is **secure**, **easy to use**, and **cost-effective**.
While data storage options are seemingly plentiful, fulfilling these criteria imposes constraints on the development environment. The remainder of this document assesses several popular storage and development systems against these criteria.
## Google Colaboratory
Colaboratory (Colab for short) is an in-browser, GPU-enabled Python environment using the familiar Jupyter notebook interface. It does not provide a storage solution.
- **Security:** Colab offers little by way of security. Although notebooks can be made private, they can be exported freely, and data can be exported from them to the user's Google Drive.
- **Ease of use:** Using Colab is straightforward, and access to GPUs may be helpful to some analysts. However, both RAM and storage are limited, and there is no interface for R developers. Also, individual sessions can run for a maximum of 12 hours.
- **Costs:** Standard-size Colab sessions are available for free so long as the user has a Google account, but larger instances can be created with a Colab Pro or Pro+ subscription.
## Google Cloud
Google Cloud is one of the big players in cloud-based storage and development solutions. It offers integrated services covering each of our needs, and it is already used by a number of government projects.
Of the services offered, two options stand out. First, a virtual machine with the synthetic dataset attached as a persistent disk; this would effectively be a hosted container. Second, data stored in a restricted Cloud Storage bucket, with users given access to a Vertex AI Workbench instance, another notebook-style environment.
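As a rough sketch of the second option, a user inside a Workbench notebook could pull the synthetic data from the restricted bucket with the `google-cloud-storage` client library. The bucket and object names below are hypothetical, and authentication is assumed to come from the instance's service account.

```python
import io

import pandas as pd
from google.cloud import storage

# Credentials are assumed to come from the Workbench instance's
# service account, so none are configured explicitly here.
client = storage.Client()

# Hypothetical bucket and object names for the synthetic dataset.
blob = client.bucket("ons-synthetic-data").blob("census/synthetic_census.csv")

# Download the object into memory and load it as a DataFrame.
frame = pd.read_csv(io.BytesIO(blob.download_as_bytes()))
print(frame.head())
```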
- **Security:** Administrators have fine-grained control over access to Google Cloud resources (storage buckets and development instances), and the documentation is thorough.
- **Ease of use:** Users work in a fully fledged development environment for several languages. The Workbench sessions are also scalable, with the option of spinning up a Spark cluster or using the BigQuery API (see the sketch after this list). This sort of distributed workflow will be required for analysing larger datasets.
- **Costs:** Costs scale with resource use, so a quote would be required. Some capabilities may be covered by an existing licence.
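To give a flavour of the scalable workflow mentioned above, here is a minimal sketch of running an aggregation through the BigQuery client library from a notebook, so the heavy computation happens in the warehouse rather than the session. The project, dataset, and table names are hypothetical.

```python
from google.cloud import bigquery

# Hypothetical project ID; credentials again come from the
# instance's service account.
client = bigquery.Client(project="ons-synthesis")

# Aggregate a hypothetical synthetic table server-side and pull
# back only the small result set.
query = """
    SELECT region, COUNT(*) AS n
    FROM `ons-synthesis.synthetic.census`
    GROUP BY region
    ORDER BY n DESC
"""
results = client.query(query).to_dataframe()
print(results)
```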
## Amazon Web Services (AWS)
As the other big player, AWS offers an array of storage and development services that have been adopted by government bodies, including the Ministry of Justice, for infrastructure and research.
For our purposes, we would likely store the synthetic data in private S3 buckets and provide users with limited access to a SageMaker notebook instance.
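As a rough illustration, a user in a SageMaker notebook might read the synthetic data from a private bucket with `boto3`. The bucket and key names are hypothetical, and credentials are assumed to come from the notebook's execution role.

```python
import io

import boto3
import pandas as pd

# The notebook instance's execution role is assumed to supply
# credentials with read access to the bucket.
s3 = boto3.client("s3")

# Hypothetical bucket and object key for the synthetic dataset.
response = s3.get_object(Bucket="ons-synthetic-data", Key="census/synthetic_census.csv")

# Read the object body straight into a DataFrame.
frame = pd.read_csv(io.BytesIO(response["Body"].read()))
print(frame.head())
```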
- **Security:** Within AWS, administrators have low-level control over users' access to, and functionality within, their resources. Policies are defined programmatically in a JSON-based language (see the sketch after this list).
- **Ease of use:** A notebook-style interface suits exploratory work, and users would have access to GPU-backed instances. SageMaker also provides an R kernel.
- **Costs:** As with Google Cloud, costs scale with resource use, so a quote would be required.
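To give a flavour of the policy language mentioned under security, here is a minimal sketch of a least-privilege, read-only policy for a hypothetical synthetic-data bucket, created via `boto3`. The bucket ARN and policy name are illustrative only.

```python
import json

import boto3

# Least-privilege policy: users may list the bucket and read its
# objects, and nothing else. The bucket name is hypothetical.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::ons-synthetic-data",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::ons-synthetic-data/*",
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="SyntheticDataReadOnly",  # illustrative name
    PolicyDocument=json.dumps(policy_document),
)
```

A policy like this would then be attached to the group of approved researchers, keeping their access limited to reading the synthetic outputs.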