# How to use Terraform to create Azure Data Factory pipelines
## Intro
Most online resources, including Microsoft's documentation, suggest using Azure Data Factory (ADF) in Git mode instead of Live mode because it has some [advantages](https://docs.microsoft.com/en-us/azure/data-factory/source-control#advantages-of-git-integration), such as the ability to work on resources collaboratively as a team or to revert changes that introduced bugs. However, the way Git mode is implemented does not take advantage of the infrastructure-as-code approach: being able to reason about the state of deployed resources just by looking at the code. When Git integration is enabled, it is not the `main` branch that shows you the deployed resources, but an auto-generated `adf_publish` branch. This branch consists of ARM templates, which are verbose and not human-friendly; just take a look at [this example](https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-resource-manager-template#review-template) from Microsoft, which contains only one activity. Now imagine that you have dozens of pipelines with complex activities, several datasets, and linked services.
We want to show how to mitigate the shortcomings of ADF's Git mode and still benefit from having the code stored in source control. To achieve this, we'll use Terraform to deploy both ADF (in Live mode) and its resources, which implies that the Terraform code is stored in the Git repo. We'll also show how to circumvent one of the [limitations](https://github.com/hashicorp/terraform-provider-azurerm/issues/14198) that Azure's Terraform provider has when it comes to more complex pipelines.
Using our method, one can simply look at the code (or a specific tagged version of it) and tell for sure what is (or was) deployed.
There is one limitation to using Terraform, though. Currently, the `azurerm` Terraform provider [doesn't allow](https://github.com/hashicorp/terraform-provider-azurerm/issues/14198) the creation of pipelines that contain variables of any type other than `string`. For that, we have a workaround.
## Approach
### Prerequisites
If you, like us, use a CI/CD pipeline to provision resources, you are probably using a service principal on your build agent. To have one solution that works both locally (for debugging and testing) and on the build agent, you will need the service principal credentials. The snippets below expect them in the environment variables `ARM_CLIENT_ID`, `ARM_CLIENT_SECRET`, and `ARM_SUBSCRIPTION_ID`.
The provisioner commands call the Azure CLI (`az`) from Bash, so if you want to use the snippets as-is, make sure both are installed.
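If the `azurerm` provider is part of the same configuration, it can pick up the same service principal from the environment. The sketch below assumes that `ARM_CLIENT_ID`, `ARM_CLIENT_SECRET`, `ARM_TENANT_ID`, and `ARM_SUBSCRIPTION_ID` are exported (`ARM_TENANT_ID` is an addition here; the snippets in this post hard-code the tenant instead). It uses the `azurerm_client_config` data source as one possible way to read the tenant back; the local name `current_tenant_id` is made up for illustration.

```hcl
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
  }
}

# No credentials in code: the provider reads ARM_CLIENT_ID, ARM_CLIENT_SECRET,
# ARM_TENANT_ID and ARM_SUBSCRIPTION_ID from the environment when they are
# not set explicitly.
provider "azurerm" {
  features {}
}

# Exposes the identity the provider is currently authenticated as.
data "azurerm_client_config" "current" {}

locals {
  # Assumption: this could stand in for a hard-coded tenant ID.
  current_tenant_id = data.azurerm_client_config.current.tenant_id
}
```

With this in place, the same exported variables serve both the provider and the `az login` calls used later on.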
### The `fileset` approach
Since we didn't start from scratch but already had pipelines in JSON format and a process to generate them, we decided that these resources would stay as-is. This separation allows the Platform and Data Engineering parts of the team to be as efficient as possible and to use the languages they know best: platform engineers can use Terraform to provision resources and get the most out of it, while data engineers can edit the pipelines in the same format in which they are represented in ADF. The images below show how the pipelines and triggers (the resources that stay in JSON format) were stored in our case.
![](https://i.imgur.com/lULgges.png)
![](https://i.imgur.com/VfE7FW4.png)
Other resources, such as linked services and datasets, could be migrated to Terraform code directly, allowing us to get rid of these JSON files:
![](https://i.imgur.com/MVif2OD.png)
![](https://i.imgur.com/t2yliEC.png)
Our approach makes sure that every time a data engineer generates new pipelines (or modifies existing ones), Terraform automatically picks them up and deploys them.
To read and process a set of existing files, we use Terraform's [fileset](https://www.terraform.io/language/functions/fileset) function. Let's see it in action. First, we add a local variable using `fileset`:
```hcl
locals {
  # Map of file name => decoded pipeline JSON, one entry per file in ./pipelines
  pipelines = {
    for value in fileset("./pipelines", "*.json") :
    value => jsondecode(file("./pipelines/${value}"))
  }
  data_factory_id = "DATA_FACTORY_ID"
}
```
This code iterates over all the JSON files stored in the `pipelines` folder and deserializes them. If your pipelines have to adhere to a certain naming convention or live in a different folder, you can modify the pattern and/or the location.
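If your files live elsewhere or follow a naming convention, only the arguments to `fileset` need to change. Here is a minimal sketch, assuming a hypothetical `adf/pipelines` folder with nested subfolders and a `pl_` file name prefix (neither is part of our setup, and all local names are made up for illustration):

```hcl
locals {
  # Hypothetical location; adjust to your repository layout.
  pipelines_dir = "${path.module}/adf/pipelines"

  # "**" lets the pattern descend into nested folders;
  # "pl_*.json" is an assumed naming convention.
  pipeline_files = fileset(local.pipelines_dir, "**/pl_*.json")

  # Map of relative file path => decoded JSON document.
  nested_pipelines = {
    for f in local.pipeline_files :
    f => jsondecode(file("${local.pipelines_dir}/${f}"))
  }
}

# Handy for checking in `terraform plan` which files were picked up.
output "pipeline_files" {
  value = sort(tolist(local.pipeline_files))
}
```

Anchoring the folder to `path.module` rather than `./` keeps the lookup independent of the directory Terraform is run from.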
Ideally, we'd use the `azurerm_data_factory_pipeline` resource to manage the pipelines. At the time of writing, however, that wasn't possible because its `variables` field only accepts a map of `string`, and our pipelines used variables of type `array`. To work around the [bug](https://github.com/hashicorp/terraform-provider-azurerm/issues/14198), we used `null_resource`. Please note that this is only a temporary workaround and should not be used unless needed, which was exactly our case. When the bug is fixed, this blog will be updated with the proper solution.
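To check whether your own pipelines are affected, you can list the files that declare a non-string variable. This is only a sketch: it assumes the JSON follows the usual ADF layout, with variables under `properties.variables.<name>.type`, and it reuses `local.pipelines` from the snippet above. The local and output names are made up for illustration.

```hcl
locals {
  # Files whose pipeline declares at least one variable that is not a String.
  # Pipelines without a "variables" section are skipped; a variable without a
  # "type" is treated as a String (an assumption).
  pipelines_with_complex_variables = [
    for file_name, pipeline in local.pipelines : file_name
    if length([
      for var_name, var_def in try(pipeline.properties.variables, {}) : var_name
      if lower(try(var_def.type, "String")) != "string"
    ]) > 0
  ]
}

output "pipelines_with_complex_variables" {
  value = local.pipelines_with_complex_variables
}
```

If the output is an empty list, the plain `azurerm_data_factory_pipeline` resource may already be enough for your case.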
### The `null_resource` workaround
```hcl
locals {
  # ... (locals from the previous snippet)
  tmp_files_location = ".terraform/tmp"
  data_factory_name  = "DATA_FACTORY_NAME"
  rg_name            = "RESOURCE_GROUP_NAME"
  tenant_id          = "YOUR_TENANT_ID"
}

resource "null_resource" "pipelines" {
  for_each = local.pipelines

  # Everything the destroy-time provisioner needs has to be stored here.
  triggers = {
    on_change                        = md5(jsonencode(each.value)) # re-create when the file content changes
    tenant_id                        = local.tenant_id
    data_factory_name                = local.data_factory_name
    pipeline_name                    = each.value.name
    data_factory_resource_group_name = local.rg_name
  }

  # Create (or re-create) the pipeline from its JSON file.
  provisioner "local-exec" {
    when    = create
    command = <<-EOC
      set -e
      az login --service-principal -u "$ARM_CLIENT_ID" -p "$ARM_CLIENT_SECRET" --tenant "${local.tenant_id}"
      az account set --subscription "$ARM_SUBSCRIPTION_ID"
      az datafactory pipeline create --factory-name "${local.data_factory_name}" --name "${each.value.name}" --resource-group "${local.rg_name}" --pipeline "@${path.root}/pipelines/${each.key}"
    EOC
    interpreter = [
      "bash",
      "-c"
    ]
  }

  # Delete the pipeline; only values reachable through `self` can be used here.
  provisioner "local-exec" {
    when    = destroy
    command = <<-EOC
      set -e
      az login --service-principal -u "$ARM_CLIENT_ID" -p "$ARM_CLIENT_SECRET" --tenant "${self.triggers.tenant_id}"
      az account set --subscription "$ARM_SUBSCRIPTION_ID"
      az datafactory pipeline delete --factory-name "${self.triggers.data_factory_name}" --name "${self.triggers.pipeline_name}" --resource-group "${self.triggers.data_factory_resource_group_name}" -y
    EOC
    interpreter = [
      "bash",
      "-c"
    ]
  }
}
```
We decided not to use `azurerm_data_factory_pipeline` at all, even for the initial resource creation, because mixing the two approaches would bring even more issues to the table, such as having to use a `timestamp()` trigger on the `null_resource`, which effectively re-creates the pipelines on every apply.
Using only `null_resource` allowed us to use an `md5` hash of the file content as a trigger, so a pipeline is re-created only when its file changes. This approach means that we also need a destroy-time provisioner to delete the pipelines when we run `terraform destroy`. As a consequence, we had to copy the values it needs into the `triggers` block, because destroy-time provisioners [cannot access](https://github.com/hashicorp/terraform/issues/23679) external variables; they can only reference attributes of the resource itself through `self`.
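The restriction is easier to see in isolation. In the sketch below (the resource name, the local, and the `echo` command are placeholders), the destroy-time provisioner can only read values through `self`, so anything it needs is first copied into `triggers`:

```hcl
locals {
  factory = "DATA_FACTORY_NAME"
}

resource "null_resource" "destroy_example" {
  triggers = {
    # Stored in state at create time, so it is still available on destroy.
    factory = local.factory
  }

  provisioner "local-exec" {
    when = destroy
    # Allowed:  self.triggers.factory
    # Rejected: local.factory (only self, count.index and each.key
    #           may be referenced in a destroy-time provisioner)
    command = "echo \"would delete pipelines from ${self.triggers.factory}\""
  }
}
```

Keep in mind that every key in `triggers` also forces a replacement of the resource when its value changes.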
Once again, this is a workaround that will be removed once `azurerm_data_factory_pipeline` supports complex variables.
## Conclusion
These snippets should give you a good starting point if you want the advantages of both Git and Infrastructure as Code (IaC). Of course, this is not universal and your use case might require some adjustments, but feel free to experiment; it's worth it!
Nevertheless, as with all solutions, this approach has its pros and cons:
### Pros
- It's possible to deploy and roll back any version/tag of your pipelines
- Same as with Git mode integration:
  - All the crucial workflows are stored in source control
  - It allows for incremental changes of data factory resources regardless of what state they are in
### Cons
- It's not possible to use multiple branches on the same ADF instance
- Engineers must have the same environment variables exported on their local machine as on the build agent if they want to test locally
As mentioned before, the workaround is just a temporary fix until the provider [bug](https://github.com/hashicorp/terraform-provider-azurerm/issues/14198) is resolved. Once it is fixed, the `null_resource` part should no longer be necessary.