
How to use Terraform to create Azure Data Factory pipelines

Intro

Most online resources, including the Microsoft documentation, recommend using Azure Data Factory (ADF) in Git mode rather than Live mode. Git mode has real advantages: a team can work on the resources collaboratively, and changes that introduce bugs can be reverted. However, the way Git mode is implemented does not deliver the main benefit of infrastructure as code, which is being able to reason about the state of deployed resources just by reading the code. With Git integration enabled, the deployed resources are described not by the main branch but by an auto-generated adf_publish branch. That branch consists of ARM templates, which are verbose and not human-friendly: just take a look at this example from Microsoft, which contains only one activity. Now imagine dozens of pipelines with complex activities, several datasets and linked services.

We want to show how to mitigate the shortcomings of ADF's Git mode while keeping the advantages of code stored in source control. To achieve this, we use Terraform to deploy both the ADF instance (in Live mode) and its resources, with the Terraform code stored in a Git repo. We also show how to work around a limitation that Azure's Terraform provider has with more complex pipelines.

With this method, you can look at the code (or a specific tagged version of it) and tell for sure what is (or was) deployed.

There is one limitation to using Terraform, though. Currently, the azurerm Terraform provider cannot create pipelines whose variables are of any type other than string. For that we have a workaround.

Approach

Prerequisites

If you, like us, use a CI/CD pipeline to provision resources, you are probably using a service principal on your build agent. To have one solution that works both locally (for debugging and testing) and on the build agent, you will need the service principal credentials available in both places.

The snippets below also assume Bash, so if you want to use them as-is, make sure it is installed.
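As a rough sketch of what this prerequisite looks like on the Terraform side: the azurerm provider can read the service principal credentials from the ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_TENANT_ID and ARM_SUBSCRIPTION_ID environment variables, so nothing secret needs to live in the code. The provider sources below are the usual registry addresses; version pinning is left out and is up to you.

```hcl
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    # Needed later for the null_resource workaround.
    null = {
      source = "hashicorp/null"
    }
  }
}

# No credentials here: the provider picks up ARM_CLIENT_ID, ARM_CLIENT_SECRET,
# ARM_TENANT_ID and ARM_SUBSCRIPTION_ID from the environment, which works the
# same way on a developer machine and on the build agent.
provider "azurerm" {
  features {}
}
```

Reusing the same variables for the Azure CLI calls later in this post is what keeps local runs and build-agent runs identical.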

The fileset approach

We didn't start from scratch: we already had pipelines in JSON format and a process to generate them, so we decided that these resources would stay as they are. This separation lets the Platform and Data Engineering parts of the team each work in the language they know best. Platform engineers use Terraform to provision resources and get the most out of it, while data engineers edit pipelines in the same format in which ADF represents them. In the image below you can see how the pipelines and triggers (the resources that stay in JSON format) were stored in our case.

Other resources, such as linked services and datasets, could be migrated to Terraform code directly, which allowed us to get rid of their JSON files.

Our approach ensures that every time a data engineer generates new pipelines (or modifies existing ones), the Terraform code automatically picks them up and deploys them.
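To give an idea of what such a migration might look like, here is a minimal sketch of a Live-mode data factory with one linked service. All names, the region and the connection-string variable are placeholders invented for illustration, and the argument names assume a 3.x or later azurerm provider, so check them against the version you use.

```hcl
# Placeholder names throughout; adjust to your environment.
variable "landing_storage_connection_string" {
  type      = string
  sensitive = true
}

resource "azurerm_data_factory" "this" {
  name                = "DATA_FACTORY_NAME"
  location            = "westeurope"
  resource_group_name = "RESOURCE_GROUP_NAME"

  # No github_configuration or vsts_configuration block is declared,
  # so the factory is not linked to a repo and stays in Live mode.
}

resource "azurerm_data_factory_linked_service_azure_blob_storage" "landing" {
  name              = "ls_landing_blob"
  data_factory_id   = azurerm_data_factory.this.id
  connection_string = var.landing_storage_connection_string
}
```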

To read and process a set of existing files, we use Terraform's fileset function. Let's see it in action. First, we add a local variable using fileset:

locals {
  pipelines       = { for value in fileset("./pipelines", "*.json") : value => jsondecode(file("./pipelines/${value}")) }
  data_factory_id = "DATA_FACTORY_ID"
}

This code iterates over all the JSON files stored in the pipelines folder and deserializes them into a map keyed by file name. If your pipelines follow a certain naming convention or live in a different folder, you can adjust the glob pattern and/or the location.

Ideally we would manage the pipelines with the azurerm_data_factory_pipeline resource. At the time of writing, however, we couldn't: its variables field only accepts a map of strings, and our pipelines use variables of type array. To work around this bug, we used null_resource. Please note that this is only a temporary workaround and should not be used unless needed, which was exactly our case. Once the bug is fixed, this blog will be updated with the proper solution.
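As an illustration of adapting the pattern and location, the sketch below assumes a hypothetical layout where pipeline files sit in subfolders of adf/pipelines and their names start with pl_. The folder, the prefix and the local names are made up for this example; fileset itself supports ** for matching across directories.

```hcl
locals {
  # Hypothetical location, resolved relative to this module.
  pipelines_dir = "${path.module}/adf/pipelines"

  # Keys are paths relative to pipelines_dir, values are the decoded JSON.
  pipelines_by_convention = {
    for relative_path in fileset(local.pipelines_dir, "**/pl_*.json") :
    relative_path => jsondecode(file("${local.pipelines_dir}/${relative_path}"))
  }
}

# Handy while experimenting: shows which files the pattern picked up.
output "discovered_pipeline_files" {
  value = keys(local.pipelines_by_convention)
}
```

Running terraform plan with the output in place is a quick way to confirm that the pattern matches the files you expect before wiring the map into any resource.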
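For context, this is roughly what the resource we would have liked to use looks like, and where the limitation bites. It is a sketch only: it assumes each JSON file follows the usual ADF layout with a top-level name and a properties object containing activities and variables, which may not match your files exactly.

```hcl
resource "azurerm_data_factory_pipeline" "pipelines" {
  for_each = local.pipelines

  name            = each.value.name
  data_factory_id = local.data_factory_id

  # Assumption: activities live under properties.activities in each file.
  activities_json = jsonencode(each.value.properties.activities)

  # The variables argument is a map of strings, so only string variables can
  # be passed through. A definition such as
  #   { "type": "Array", "defaultValue": ["a", "b"] }
  # has no representation here and is filtered out, which changes the pipeline.
  variables = {
    for name, definition in try(each.value.properties.variables, {}) :
    name => try(definition.defaultValue, "")
    if definition.type == "String"
  }
}
```

Dropping the array variables is obviously not acceptable for pipelines that depend on them, which is why we went with the workaround described next.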

The null_resource workaround

locals {
  ...
  tmp_files_location = ".terraform/tmp"
  data_factory_name  = "DATA_FACTORY_NAME"
  rg_name            = "RESOURCE_GROUP_NAME"
  tenant_id          = "YOUR_TENANT_ID"
}

resource "null_resource" "pipelines" {
  for_each = local.pipelines

  triggers = {
    on_change                        = md5(jsonencode(each.value))
    tenant_id                        = local.tenant_id
    data_factory_name                = local.data_factory_name
    pipeline_name                    = each.value.name
    data_factory_resource_group_name = local.rg_name
  }

  provisioner "local-exec" {
    when    = create
    command = <<-EOC
      az login --service-principal -u $ARM_CLIENT_ID -p $ARM_CLIENT_SECRET --tenant "${local.tenant_id}"
      az account set --subscription $ARM_SUBSCRIPTION_ID
      az datafactory pipeline create --factory-name "${local.data_factory_name}" --name "${each.value.name}" --resource-group "${local.rg_name}" --pipeline "@${path.root}/pipelines/${each.key}"
    EOC
    interpreter = [
      "bash",
      "-c"
    ]
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<-EOC
      az login --service-principal -u $ARM_CLIENT_ID -p $ARM_CLIENT_SECRET --tenant "${self.triggers.tenant_id}"
      az account set --subscription $ARM_SUBSCRIPTION_ID
      az datafactory pipeline delete --factory-name "${self.triggers.data_factory_name}" --name "${self.triggers.pipeline_name}" --resource-group "${self.triggers.data_factory_resource_group_name}" -y
    EOC
    interpreter = [
      "bash",
      "-c"
    ]
  } 
}

We decided not to use azurerm_data_factory_pipeline at all, even for the initial resource creation, because mixing the two approaches would bring even more issues. For example, we would have to use a timestamp() trigger on null_resource and effectively re-create the pipelines on every apply.

Using only null_resource lets us use an md5 hash of each pipeline's content as a trigger, so a pipeline is re-created only when its file changes. Because null_resource knows nothing about the Azure resource it created, we also need a destroy-time provisioner to delete the pipelines when we run terraform destroy. As a consequence, every value the delete command needs has to be stored in the triggers block: destroy-time provisioners can only reference the resource itself (via self), count.index and each.key, not locals or other external values.

Once again, this is a workaround that will be removed once azurerm_data_factory_pipeline supports complex variables.

Conclusion

These snippets should give you a good starting point if you want the advantages of both Git and Infrastructure as Code (IaC). Of course, this is not universal and your use case might require some adjustments, but feel free to experiment, it's worth it!
Nevertheless, as with all solutions, this approach has its pros and cons:

Pros

  • It’s possible to deploy and roll back to any version/tag of your pipelines
  • Same as with Git mode integration:
    • All the crucial workflows are stored in the source control
    • It allows for incremental changes of data factory resources regardless of what state they are in

Cons

  • It's not possible to use multiple branches on the same ADF instance
  • Engineers who want to test locally need the same environment variables exported on their machine as on the build agent

As mentioned before, the workaround is just a temporary fix until the bug in the azurerm provider is resolved. Once it is, the null_resource part should no longer be necessary.
