How to use Terraform to create Azure Data Factory pipelines

Intro

Most online resources, including Microsoft's documentation, suggest using Azure Data Factory (ADF) in Git mode instead of Live mode. Git mode has some advantages, such as the ability to work on the resources collaboratively as a team and the ability to revert changes that introduced bugs. However, the way Git mode is implemented does not take advantage of the infrastructure-as-code approach, that is, being able to reason about the state of deployed resources just by looking at the code. When Git integration is enabled, it is not the main branch that shows you the deployed resources, but an auto-generated adf_publish branch. This branch consists of ARM templates, which are verbose and not human-friendly: just take a look at this example from Microsoft, which contains only one activity. Now imagine that you have dozens of pipelines with complex activities, several datasets and linked services.

We want to show how to mitigate the shortcomings of ADF's Git mode while still benefiting from having the code in source control. To achieve this, we'll use Terraform to deploy both ADF (in Live mode) and its resources, with the Terraform code stored in a Git repo. We'll also show how to work around some of the limitations of Azure's Terraform provider when it comes to more complex pipelines.

With our method, one can simply look at the code (or a specific tagged version of it) and tell for sure what is (or was) deployed.

There is one limitation to using Terraform, though. Currently, the azurerm Terraform provider doesn't allow creating pipelines that contain variables of any type other than string (see https://github.com/hashicorp/terraform-provider-azurerm/issues/14198). For that, we have a workaround.

Approach

Prerequisites

If you, like us, use a CI/CD pipeline to provision the resources, you are probably using a service principal on your build agent. To have one solution that works both locally (during debugging and testing) and on the build agent, you will need the service principal credentials available in both places, exported as the ARM_CLIENT_ID, ARM_CLIENT_SECRET and ARM_SUBSCRIPTION_ID environment variables that the snippets below use.

Also, the snippets below run their commands through Bash and call the Azure CLI, so if you want to use them as-is, make sure you have both installed.
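As a rough sketch of this setup: the azurerm provider reads the service principal credentials from the ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_TENANT_ID and ARM_SUBSCRIPTION_ID environment variables, so the same configuration can run locally and on the build agent. The version pin, resource names and location below are assumptions made for illustration, not values from our setup.

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0" # assumption: any release you have tested will do
    }
  }
}

# No credentials in code: the provider picks up the ARM_* environment
# variables exported on the developer machine or the build agent.
provider "azurerm" {
  features {}
}

# Hypothetical names and location, replace with your own.
resource "azurerm_resource_group" "adf" {
  name     = "rg-adf-example"
  location = "West Europe"
}

resource "azurerm_data_factory" "this" {
  name                = "adf-example-live" # must be globally unique
  location            = azurerm_resource_group.adf.location
  resource_group_name = azurerm_resource_group.adf.name

  # No github_configuration or vsts_configuration block is set,
  # so the factory is not connected to a Git repo and stays in Live mode.
}
```

With a factory managed like this, the hard-coded placeholders used later (DATA_FACTORY_ID, DATA_FACTORY_NAME, RESOURCE_GROUP_NAME) could instead be taken from azurerm_data_factory.this and azurerm_resource_group.adf.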

The `fileset` approach

Since we didn't start from scratch but already had pipelines in JSON format and a process to generate them, we decided that these resources would stay as-is. This separation lets the Platform and Data Engineering parts of the team work efficiently in the languages they know best: platform engineers can use Terraform to provision resources and get the most out of it, while data engineers can edit the pipelines in the same format in which they are represented in ADF. In the image below you can see how the pipelines and triggers (the resources that stay in JSON format) were stored in our case.

Other resources, such as linked services and datasets, could be migrated to Terraform code directly, allowing us to get rid of their JSON files.
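To give an idea of what such a migration can look like, here is a minimal sketch of a blob storage linked service and a JSON dataset defined in Terraform. All names, the container layout and the two input variables are hypothetical; they only illustrate the shape of the resources and are not taken from our repository.

```hcl
# Hypothetical inputs, declared here so that the sketch is self-contained.
variable "data_factory_id" {
  type        = string
  description = "Resource ID of the Data Factory that owns these resources"
}

variable "storage_connection_string" {
  type      = string
  sensitive = true
}

# Replaces a linked service JSON file.
resource "azurerm_data_factory_linked_service_azure_blob_storage" "landing" {
  name              = "ls_landing_blob"
  data_factory_id   = var.data_factory_id
  connection_string = var.storage_connection_string
}

# Replaces a dataset JSON file and refers to the linked service by name.
resource "azurerm_data_factory_dataset_json" "events" {
  name                = "ds_events_json"
  data_factory_id     = var.data_factory_id
  linked_service_name = azurerm_data_factory_linked_service_azure_blob_storage.landing.name

  azure_blob_storage_location {
    container = "landing"
    path      = "events"
    filename  = "events.json"
  }

  encoding = "UTF-8"
}
```

Because the dataset references the linked service resource, Terraform creates them in the right order without any extra configuration.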

Our approach makes sure that every time a data engineer generates new pipelines (or modifies existing ones), the Terraform code automatically picks them up and deploys them.

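To read and process a set of existing files, we use Terraform's fileset function (https://www.terraform.io/language/functions/fileset). Let's see it in action. First, we add a local value that uses fileset and also deserializes our JSON files so that we can deploy them later: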

locals {
  pipelines       = { for value in fileset("./pipelines", "*.json") : value => jsondecode(file("./pipelines/${value}")) }
  data_factory_id = "DATA_FACTORY_ID"
}

This code iterates over all the JSON files stored in the pipelines folder and deserializes them into a map keyed by file name. If your pipelines have to adhere to a certain naming convention or live in a different folder, you can modify the glob pattern and/or the location.
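As an illustration of adjusting the pattern and location, the sketch below assumes a hypothetical convention where pipeline files are prefixed with pl_ and may live in subfolders. The local names and the prefix are made up for this example; the output is only there to let you check what Terraform has picked up.

```hcl
locals {
  # Hypothetical location, resolved relative to this module.
  pipelines_dir = "${path.module}/pipelines"

  # "**" also matches files in nested folders; "pl_*.json" is an assumed naming convention.
  prefixed_pipeline_files = fileset(local.pipelines_dir, "**/pl_*.json")

  # Same idea as before: a map from relative file name to the decoded JSON document.
  prefixed_pipelines = {
    for f in local.prefixed_pipeline_files :
    f => jsondecode(file("${local.pipelines_dir}/${f}"))
  }
}

# Lists the pipeline names found on disk, assuming every file has a top-level "name" field.
output "discovered_pipelines" {
  value = sort([for p in local.prefixed_pipelines : p.name])
}
```

Running terraform plan after adding or removing a file shows the change in this output, which is a quick way to confirm that the pattern matches what you expect.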

Ideally we'd like to use the azurerm_data_factory_pipeline resource to manage the pipelines. At the time of writing, however, we couldn't, since its variables field only accepts a map of strings and our pipelines used variables of type array. To work around the bug, we used null_resource. Please note that this is only a temporary workaround and should not be used unless needed, which was exactly our case. When the bug is fixed, this blog will be updated with the proper solution.
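To make the limitation concrete, here is a minimal sketch of what the resource accepted at the time of writing. The resource label, pipeline name and variable are made up for illustration; local.data_factory_id is the placeholder from the locals block above.

```hcl
# Hypothetical pipeline, used only to illustrate the limitation.
resource "azurerm_data_factory_pipeline" "string_vars_only" {
  name            = "example_pipeline"
  data_factory_id = local.data_factory_id

  # "variables" is a map of strings, so a string variable works:
  variables = {
    target_folder = "landing"
  }

  # An ADF variable of type Array, which in the pipeline JSON looks like
  #   "variables": { "files": { "type": "Array", "defaultValue": [] } }
  # cannot be expressed here, because a list value is not accepted.
}
```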

The `null_resource` workaround

locals {
  # ... (the locals from the previous snippet stay as they are)
  tmp_files_location = ".terraform/tmp"
  data_factory_name  = "DATA_FACTORY_NAME"
  rg_name            = "RESOURCE_GROUP_NAME"
  tenant_id          = "YOUR_TENANT_ID"
}

resource "null_resource" "pipelines" {
  for_each = local.pipelines

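  # on_change re-creates the pipeline whenever the content of its JSON file changes.
  # The other values are stored here because the destroy-time provisioner
  # can only read them through self.triggers.
  triggers = {
    on_change                        = md5(jsonencode(each.value))
    tenant_id                        = local.tenant_id
    data_factory_name                = local.data_factory_name
    pipeline_name                    = each.value.name
    data_factory_resource_group_name = local.rg_name
  }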

  provisioner "local-exec" {
    when    = create
    command = <<-EOC
      az login --service-principal -u "$ARM_CLIENT_ID" -p "$ARM_CLIENT_SECRET" --tenant "${local.tenant_id}"
      az account set --subscription "$ARM_SUBSCRIPTION_ID"
      az datafactory pipeline create --factory-name "${local.data_factory_name}" --name "${each.value.name}" --resource-group "${local.rg_name}" --pipeline "@${path.root}/pipelines/${each.key}"
    EOC
    interpreter = [
      "bash",
      "-c"
    ]
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<-EOC
      az login --service-principal -u "$ARM_CLIENT_ID" -p "$ARM_CLIENT_SECRET" --tenant "${self.triggers.tenant_id}"
      az account set --subscription "$ARM_SUBSCRIPTION_ID"
      az datafactory pipeline delete --factory-name "${self.triggers.data_factory_name}" --name "${self.triggers.pipeline_name}" --resource-group "${self.triggers.data_factory_resource_group_name}" -y
    EOC
    interpreter = [
      "bash",
      "-c"
    ]
  } 
}

We decided not to use azurerm_data_factory_pipeline at all, even for initial resource creation, because mixing the two approaches would bring even more issues to the table, such as having to use a timestamp() trigger on null_resource and effectively re-creating the pipelines on every apply.

Using only null_resource allowed us to use an md5 hash of the file content as a trigger, so that a pipeline is re-created only when its file content changes. This approach means that we also need a destroy-time provisioner to delete pipelines when they are removed or when we run terraform destroy. As a consequence, we had to store the values it needs in the triggers block, because destroy-time provisioners cannot reference locals or other resources, only the resource's own attributes through self (plus count.index and each.key).

Once again, this is a workaround that will be removed once azurerm_data_factory_pipeline supports complex variables.
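For reference, this is roughly what we would expect the replacement to look like once the provider supports it. It is only a sketch under assumptions: it presumes that every pipeline file has the ADF shape with a top-level name and a properties.activities list, and it leaves out variables and parameters, since how the provider will expose non-string variables is not known yet.

```hcl
# Sketch of a possible future replacement for the null_resource workaround.
resource "azurerm_data_factory_pipeline" "pipelines" {
  for_each = local.pipelines

  name            = each.value.name
  data_factory_id = local.data_factory_id

  # Assumes the file layout { "name": "...", "properties": { "activities": [ ... ] } }.
  activities_json = jsonencode(each.value.properties.activities)
}
```

The fileset part stays exactly the same in this version; only the resource that consumes local.pipelines changes.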

Conclusion

These snippets should give you a good starting point if you want the advantages of both Git and Infrastructure as Code (IaC). Of course, this is not universal and your use case might require some adjustments, but feel free to experiment, it's worth it!
Nevertheless, as with all solutions, this approach has its pros and cons:

Pros

- It's possible to deploy and roll back any version/tag of your pipelines
- Same as with Git mode integration:
  - All the crucial workflows are stored in source control
  - It allows for incremental changes of data factory resources regardless of what state they are in

Cons

- It's not possible to use multiple branches on the same ADF instance
- Engineers need to have the same environment variables exported on their local machine as on the build agent if they want to test locally

As mentioned before, the workaround is just a temporary fix until the bug is solved in the azurerm provider. Once it's fixed, the null_resource part should no longer be necessary.
