# How to use Terraform to create Azure Data Factory pipelines
## Intro
Most of the online resources, including the Microsoft documentation, suggest using Azure Data Factory (ADF) in Git mode instead of Live mode, as it has some advantages, such as the ability to work on the resources collaboratively as a team or to revert changes that introduced bugs. However, the way Git mode is implemented does not take advantage of the infrastructure-as-code approach: being able to reason about the state of the deployed resources just by looking at the code. When Git integration is enabled, it is not the `main` branch that shows you the resources, but rather an auto-generated `adf_publish` branch. This branch is implemented using ARM templates, and it is quite verbose and not human-friendly: just take a look at this example from Microsoft, which only contains one activity. Now imagine that you have dozens of pipelines with complex activities, several datasets and linked services.

We want to show how to mitigate the shortcomings of ADF's Git mode and still benefit from the advantages of code stored in source control. In order to achieve this, we'll use Terraform to deploy both ADF (in Live mode) and its resources. This implies that the Terraform code is stored in the Git repo. Next to that, we'll show how to circumvent some of the limitations that Azure's Terraform provider has when it comes to more complex pipelines.

Using our method, one can simply look at the code (or a specific tagged version of it) and tell for sure what is (or was) deployed.
There is one limitation to using Terraform, though. Currently, the `azurerm` Terraform provider doesn't allow the creation of complex pipelines that contain variables of any type other than `string`. But for that we have a workaround.

## Approach
### Prerequisites
If you, like us, are using a CI/CD pipeline to provision the resources, you are probably using a service principal on your build agent. In order to have one solution that works both locally (during debugging and testing) and on the build agent, you will need the service principal credentials.

Also, this is supposed to work with Bash, so if you want to use it as-is, make sure you have it installed.
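As a minimal sketch, assuming the service principal credentials are exposed through the standard `ARM_*` environment variables (which works both locally and on most build agents), the provider block itself can stay free of secrets:

```hcl
provider "azurerm" {
  features {}

  # No credentials in code: the provider reads ARM_CLIENT_ID,
  # ARM_CLIENT_SECRET, ARM_TENANT_ID and ARM_SUBSCRIPTION_ID
  # from the environment, both locally and on the build agent.
}
```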
### The `fileset` approach

Since we didn't start from scratch, but already had existing pipelines in JSON format and a process to generate them, we decided that these resources would stay as-is. This separation allows the Platform and Data Engineering parts of the team to be as efficient as possible and use the languages they are most used to: platform engineers can use Terraform to provision resources and get the most out of it, while data engineers can work on and edit the pipelines in the same format as they are represented in ADF. In the image below you can see how the pipelines and triggers (the resources that will stay in JSON format) were stored in our case.
Other resources, such as linked services and datasets, could be migrated to the Terraform code directly, allowing us to get rid of their JSON files.
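For example, an Azure Blob Storage linked service can be declared directly with the `azurerm` provider. This is just an illustrative sketch: the referenced resources (`azurerm_data_factory.this`, `azurerm_storage_account.this`) are made-up names, and the exact resource type depends on what your linked service points to.

```hcl
# Illustrative only: a linked service defined in Terraform instead of JSON.
resource "azurerm_data_factory_linked_service_azure_blob_storage" "blob" {
  name              = "ls_blob_storage"
  data_factory_id   = azurerm_data_factory.this.id
  connection_string = azurerm_storage_account.this.primary_connection_string
}
```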
Our approach makes sure that every time a data engineer generates new pipelines (or modifies existing ones), the Terraform code automatically picks them up and deploys them.
To read and process a set of existing files, we use Terraform's `fileset` function. Let's see it in action. First, we add a local variable using `fileset`.
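A minimal version of that local could look like the following (assuming the JSON definitions live in a `pipelines` folder next to the module; adjust the names to your own layout):

```hcl
locals {
  # Map of pipeline name => deserialized pipeline definition,
  # built from every JSON file in the pipelines folder.
  pipelines = {
    for f in fileset("${path.module}/pipelines", "*.json") :
    trimsuffix(f, ".json") => jsondecode(file("${path.module}/pipelines/${f}"))
  }
}
```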
This code iterates over all the JSON files stored in the `pipelines` folder and deserializes them. If your pipelines have to adhere to a certain naming convention or live in a different folder, you can modify the mask and/or the location.

Ideally, we'd like to use the `azurerm_data_factory_pipeline` resource to manage the pipelines but, at the time of writing, we couldn't do that, since its field `variables` only allows a map of `string`, and our pipelines used variables of type `array`. To work around this bug, we used `null_resource`. Please note that this is only a temporary workaround and should not be used unless needed, which was exactly our case. When the bug is fixed, this blog will be updated with the proper solution.

### The `null_resource` workaround

We decided not to use
`azurerm_data_factory_pipeline` at all, even for the initial resource creation, because mixing the two approaches would bring even more issues to the table, such as having to use a `timestamp()` trigger on the `null_resource` and effectively re-creating the pipelines on every apply.

Using only `null_resource` allowed us to use an `md5` hash as the trigger in order to re-create pipelines once their file content changes. This approach also means that we need an 'on destroy' condition to delete the pipelines when we run `terraform destroy`. As a consequence, we had to define the variables in the `triggers` block, as destroy-time provisioners cannot access external variables.
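Below is a simplified sketch of the resulting resource. The surrounding resource names are illustrative, and the create/delete commands are an assumption on our part (here the Azure CLI `datafactory` extension, which would need to be installed on the agent); any other way of calling the ADF API from Bash would work just as well. The important parts are the `triggers` block and the destroy-time provisioner:

```hcl
resource "null_resource" "pipeline" {
  for_each = local.pipelines

  # Everything the destroy-time provisioner needs must live in triggers,
  # because at destroy time only self.triggers is accessible.
  triggers = {
    resource_group = azurerm_resource_group.this.name
    factory_name   = azurerm_data_factory.this.name
    pipeline_name  = each.key
    # Changing the file content changes the md5, which re-creates the resource.
    content_md5 = md5(file("${path.module}/pipelines/${each.key}.json"))
  }

  # Create or update the pipeline from its JSON definition.
  # Depending on how your JSON files are structured, you may need to pass
  # only the `properties` object instead of the whole file.
  provisioner "local-exec" {
    interpreter = ["bash", "-c"]
    command     = <<-EOT
      az datafactory pipeline create \
        --resource-group '${self.triggers.resource_group}' \
        --factory-name '${self.triggers.factory_name}' \
        --name '${self.triggers.pipeline_name}' \
        --pipeline @'${path.module}/pipelines/${each.key}.json'
    EOT
  }

  # Delete the pipeline when this resource is destroyed.
  provisioner "local-exec" {
    when        = destroy
    interpreter = ["bash", "-c"]
    command     = <<-EOT
      az datafactory pipeline delete --yes \
        --resource-group '${self.triggers.resource_group}' \
        --factory-name '${self.triggers.factory_name}' \
        --name '${self.triggers.pipeline_name}'
    EOT
  }
}
```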
Once again, this is a workaround which will be removed once `azurerm_data_factory_pipeline` supports complex variables.

## Conclusion
These snippets should give you a good starting point if you want the advantages of both Git and Infrastructure as Code (IaC). Of course, this is not universal and your use case might require some adjustments, but feel free to experiment, it's worth it!
Nevertheless, as with all solutions, this approach has its pros and cons:
**Pros**

**Cons**
As mentioned before, the workaround is just a temporary fix until Azure comes up with a solution for the bug. When it's fixed, the `null_resource` part should not be necessary anymore.