One of many nice things about [Google BigQuery](https://cloud.google.com/bigquery/docs/introduction) is that they constantly add new features. Recently, we had a look at [BigQuery Column Encryption](https://cloud.google.com/bigquery/docs/column-key-encrypt) with Google KMS integration and it’s quite cool.
BigQuery by default offers multiple layers of security: encryption at rest and properly configured IAM policies prevent unauthorized parties to access your data. These measures provide strong enough guarantees for security for some of the use-cases. However in some industries for sensitive data such as PII (Personally Identifiable Information) the defaut measures are not enough, an additional layer of security is required. BigQuery’s column encryption can help you in this case, BigQuery allows you to encrypt columns, decrypt encrypted columns in SQL using [AEAD family functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/aead_encryption_functions) like AEAD.ENCRYPT and AEAD.DECRYPT to minimize the risk of data exposure
About AEAD, we just need to know this is a strong cipher function that guarantees your data can never be exposed without the encryption key. More can be found here [AEAD encryption concepts](https://cloud.google.com/bigquery/docs/reference/standard-sql/aead-encryption-concepts)
So that’s about encryption/decryption when the data is already in BigQuery using SQL. What if we have to encrypt information outside of BigQuery and later decrypt it using only native BigQuery SQL? We actually had this challenge in of our recent project, where our task was to built a real-time data ingestion pipeline from on-prem datasources into BigQuery using the Confluent tech-stack. Due to regulatory requirements we had to encypt certain sensitive fields before the change events reached the Kafka topic. We used custom Java [Single Message Transform](https://docs.confluent.io/platform/current/connect/transforms/overview.html) to fulfill this requirement, to encrypt customer data in the middle of the data ingestion pipeline.
In this post, we will demonstrate encrypting a simple ``"hello world"`` string in Java to get the ciphertext, and decrypt that ciphertext in BigQuery.
To do that, we will use [Tink library](https://developers.google.com/tink) to encrypt our plaintext in a way that BigQuery can understand and decrypt using the native AEAD functions. From Tink’s documentation:
> Tink is an open-source cryptography library written by cryptographers and security engineers at Google. Tink's secure and simple APIs reduce common pitfalls through user-centered design, careful implementation and code reviews, and extensive testing
Tink supports Java, Python, and C++ programming languages. In Java we need to include the following dependecies:
- `com.google.crypto.tink:tink-gcpkms:1.6.1`
- `com.google.crypto.tink:tink:1.6.1`
# Encrypt data using plaintext key
In order to replicate our example you need to fulfill the following prerequisites:
- [Install the tinkey](https://developers.google.com/tink/tinkey-overview) utility tool
- a [GCP project](https://cloud.google.com/free)
- with at least [BigQuery Job User](https://cloud.google.com/bigquery/docs/access-control#bigquery) role
- BigQuery API [enabled](https://cloud.google.com/endpoints/docs/openapi/enable-api)
Once you have the prerequisites ready, follow the steps below:
1. First, create a encryption key using 'tinkey' utility (you can to install it by following [guides](https://developers.google.com/tink/tinkey-overview) of Tink documentation)
`tinkey create-keyset --key-template AES256_GCM --out-format json --out plaintext-dek.json`
2. Use that `plaintext-dek.json` key to encrypt data in Java
```java
public static void encryptWithPlaintextKeyset() throws IOException, GeneralSecurityException {
KeysetHandle keysetHandle =
CleartextKeysetHandle.read(JsonKeysetReader.withPath(Paths.get("plaintext-dek.json")));
AeadConfig.register();
Aead aead = keysetHandle.getPrimitive(Aead.class);
byte[] plainText = "hello world".getBytes(StandardCharsets.UTF_8);
byte[] associatedData = "tink".getBytes(StandardCharsets.UTF_8);
byte[] cipherText = aead.encrypt(plainText, associatedData);
String cipherTextBase64 = Base64.getEncoder().encodeToString(cipherText);
System.out.println(cipherTextBase64);
}
```
3. Take a note at the base64 output of the cipher text, and decrypt it using BigQuery SQL
```sql
SELECT
AEAD.DECRYPT_STRING(
KEYS.KEYSET_FROM_JSON('<content of plaintext-dek.json>'), -- content of our plaintext-dek.json
FROM_BASE64('<base64 representation of ciphertext>'), -- base64 of cipher text
'tink' -- associated data
) AS plaintext
```
BigQuery automatically logs all the queries, so adding the plaintext encryption key to the query directly means that plaintext key can be found in the BigQuery logs. [Query parameters](https://cloud.google.com/bigquery/docs/parameterized-queries) can be used to avoid this security risk.
4. Done, you'll get the plaintext `hello world` back.
# Encrypt data using an encrypted key protected by Google KMS.
The approach where we use multiple layer of keys to encrypt data is called [envelope encryption](https://cloud.google.com/kms/docs/envelope-encryption). A data encryption key (DEK) is used to encrypt data itself, and the DEK is again encrypted by another key called key encryption key (KEK). We still use the AEAD functions but this time the key is protected (encrypted) by Google KMS so without appropriate permissions such as `Cloud KMS CryptoKey Encrypter/Decrypter` on the key in KMS, no one can decrypt and use our key. You may tempting to ask why not use Google KMS to encrypt our data directly? It won't work as Google KMS is not suitable for encrypting large volume of data (maximum size that you can encrypt using KMS is about 64KBs).
In order to execute the steps described below you have to meet the following prerequisites(additionally to the ones mentioned in the previous example):
- a [GCP project](https://cloud.google.com/free)
- with at least [Cloud KMS Admin](https://cloud.google.com/kms/docs/reference/permissions-and-roles#predefinedy) role
- Google KMS API [enabled](https://cloud.google.com/endpoints/docs/openapi/enable-api)
- gcloud [installed and authenticated](https://cloud.google.com/sdk/docs/install-sdk)
Follow the steps below:
1. Create a keyring named 'demo-keyring-us' in KMS
```
gcloud kms keyrings create demo-keyring-us --location us --project <gcp-project-id>
```
2. Create a key named 'demo-key' in KMS as KEK
```
gcloud kms keys create demo-key --keyring demo-keyring-us --purpose encryption --location us --project <gcp-project-id>
```
3. Generate the encrypted DEK using tinkey
```
tinkey create-keyset --key-template AES256_GCM \
--master-key-uri "gcp-kms://projects/<gcp-project-name>/locations/us/keyRings/demo-keyring-us/cryptoKeys/demo-key" \
--out encrypted-dek.json
```
4. Open `encrypted-dek.json`, note the `encryptedKeyset` value. This is the base64 representation of the the encrypted DEK keyset. We will use this during the decryption in BigQuery SQL later.
5. Encrypt the plaintext in Java using the `encrypted-dek.json` file. In order to execute this Java code, you will need to [authenticate properly to GCP](https://cloud.google.com/docs/authentication), the easiest way in a local environment is via `gcloud auth application-default login`.
```java
public static void encryptWithEncryptedKeyset() throws IOException, GeneralSecurityException {
AeadConfig.register();
String kekUri = "gcp-kms://projects/<gcp-project-name>/locations/us/keyRings/demo-keyring-us/cryptoKeys/demo-key";
GcpKmsClient.register(Optional.of(kekUri), Optional.empty());
KeysetHandle kekHandle = KeysetHandle.generateNew(KmsAeadKeyManager.createKeyTemplate(kekUri));
Aead keyAead = kekHandle.getPrimitive(Aead.class);
KeysetHandle keysetHandle = KeysetHandle.read(JsonKeysetReader.withPath(Paths.get("encrypted-dek.json")), keyAead);
Aead aead = keysetHandle.getPrimitive(Aead.class);
byte[] plainText = "hello world".getBytes(StandardCharsets.UTF_8);
byte[] associatedData = "tink".getBytes(StandardCharsets.UTF_8);
byte[] cipherText = aead.encrypt(plainText, associatedData);
String cipherTextBase64 = Base64.getEncoder().encodeToString(cipherText);
System.out.println(cipherTextBase64);
}
```
6. Finally, decrypt cypertext with BigQuery SQL
```sql
DECLARE kms_resource_name string;
DECLARE first_level_keyset bytes;
SET kms_resource_name = 'gcp-kms://projects/<gcp-project-name>/locations/us/keyRings/demo-keyring-us/cryptoKeys/demo-key';
SET first_level_keyset = from_base64('<base64-encrypted-keyset>'); -- our base64 encrypted keyset before
SELECT
AEAD.DECRYPT_STRING(
KEYS.KEYSET_CHAIN(kms_resource_name, first_level_keyset),
FROM_BASE64('<base64 representation of ciphertext>'),
"tink"
) AS plaintext
```
You can verify that we get our `hello world` plaintext back.
That's all we got. You can access the the Java source code in the [bq-column-level-encryption-example](https://github.com/aliz-ai/bq-column-level-encryption-example) repository. Head over [tink](https://developers.google.com/tink) and [BigQuery documentation](https://cloud.google.com/bigquery/docs/introduction) for more advanced capabilities such as key rotation.
Have a good day!