---
# System prepended metadata

title: 'Data Pipeline Package Comparison: Airflow, Metaflow, Prefect'
tags: [anue]

---

# Data Pipeline Package Comparison: Airflow, Metaflow, Prefect

###### tags: `anue`
> [time=Thu, Dec 19, 2019 3:30 PM]
- Airflow
- MetaFlow
- Prefect


Key concepts:
- DAG
- Operators



---
## Compare
| Tool | Owner | Feature | Cloud Services Support | GitHub Created At |
| ------ | ------ | -------- | ------ | ----- |
| Airflow | Airbnb/Apache | Full-featured | GCP/AWS | 2015-04-13 |
| Metaflow | Netflix | Saves every result | AWS | 2019-09-17 |
| Prefect | PrefectHQ | Lightweight | GCP/AWS | 2018-06-29 |


## Airflow


----
## DAG
```python
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/apache/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG(
    'tutorial', default_args=default_args, schedule_interval=timedelta(days=1))

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

templated_command = """
    {% for i in range(5) %}
        echo "{{ ds }}"
        echo "{{ macros.ds_add(ds, 7)}}"
        echo "{{ params.my_param }}"
    {% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

t2.set_upstream(t1)
t3.set_upstream(t1)

```
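
The two `set_upstream` calls at the end of the DAG declare that `print_date` must finish before `sleep` and `templated` can start. As a rough illustration of how such declarations translate into an execution order, the sketch below uses the standard library's `graphlib` (Python 3.9+) rather than Airflow itself. The `upstream` mapping is a hand-written stand-in for the dependencies, not Airflow's internal data structure.

```python
from graphlib import TopologicalSorter

# Hand-written stand-in for the dependencies declared in the DAG above:
# each key is a task_id, each value is the set of tasks that must run first.
upstream = {
    "print_date": set(),
    "sleep": {"print_date"},       # t2.set_upstream(t1)
    "templated": {"print_date"},   # t3.set_upstream(t1)
}

# static_order() yields every task only after all of its upstream tasks.
order = list(TopologicalSorter(upstream).static_order())
print(order)
```

`print_date` always comes first; the relative order of `sleep` and `templated` is not fixed, because neither depends on the other. In Airflow the same relationships can also be written with the bitshift syntax, for example `t1 >> t2` and `t1 >> t3`.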

----

![](https://i.imgur.com/GPcR5cl.png)

---

## Metaflow
[Metaflow tutorials](https://docs.metaflow.org/getting-started/tutorials)

- Developed by Netflix
- It automatically saves everything your code produces to S3, which makes flows portable and makes it easy to pick up again from failed tasks
- The data and every `run()`, whatever time it was executed, are bundled inside the flow
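
To make the "save everything, then pick up from the failed task" idea concrete, here is a toy sketch in plain Python. It is not Metaflow's implementation: a temporary local directory stands in for S3, and `run_flow`, `steps`, and `fail_at` are hypothetical names invented for this illustration.

```python
import pickle
import tempfile
from pathlib import Path


def run_flow(steps, store, run_id, fail_at=None):
    """Run steps in order, persisting the state after each one.

    Steps whose snapshot already exists for this run_id are skipped,
    which is what lets a failed run pick up where it stopped.
    """
    run_dir = Path(store) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    state, executed = {}, []
    for name, func in steps:
        snapshot = run_dir / f"{name}.pkl"
        if snapshot.exists():
            # Reuse the saved result instead of recomputing the step.
            state = pickle.loads(snapshot.read_bytes())
            continue
        if name == fail_at:
            raise RuntimeError(f"step {name!r} failed")
        state = func(dict(state))
        snapshot.write_bytes(pickle.dumps(state))
        executed.append(name)
    return state, executed


# Hypothetical three-step flow; each step receives and returns a dict.
steps = [
    ("start", lambda s: {**s, "numbers": [1, 2, 3]}),
    ("double", lambda s: {**s, "doubled": [n * 2 for n in s["numbers"]]}),
    ("end", lambda s: {**s, "total": sum(s["doubled"])}),
]

store = tempfile.mkdtemp()  # local directory standing in for S3
try:
    run_flow(steps, store, run_id="run-1", fail_at="double")
except RuntimeError as err:
    print("first attempt:", err)

# Second attempt with the same run_id: 'start' is loaded, not re-executed.
state, executed = run_flow(steps, store, run_id="run-1")
print(executed, state["total"])
```

On the second attempt only `double` and `end` are executed, because the result of `start` was already persisted. Metaflow itself exposes this kind of recovery through its `resume` command.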

----

```python
# 00-helloworld/helloworld.py
from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):
    """
    A flow where Metaflow prints 'Hi'.

    Run this flow to validate that Metaflow is installed correctly.

    """
    @step
    def start(self):
        """
        This is the 'start' step. All flows must have a step named 'start' that
        is the first step in the flow.

        """
        print("HelloFlow is starting.")
        self.next(self.hello)

    @step
    def hello(self):
        """
        A step for metaflow to introduce itself.

        """
        print("Metaflow says: Hi!")
        self.next(self.end)

    @step
    def end(self):
        """
        This is the 'end' step. All flows must have an 'end' step, which is the
        last step in the flow.

        """
        print("HelloFlow is all done.")


if __name__ == '__main__':
    HelloFlow()

```


----

```bash
python 00-helloworld/helloworld.py show
```

![](https://i.imgur.com/WOEwuiG.png)

----

```bash
python 00-helloworld/helloworld.py run
```

![](https://i.imgur.com/WJTYzxD.png)



---

## Prefect
[Prefect blog on Medium](https://medium.com/the-prefect-blog?source=post_sidebar--------------------------post_sidebar-)


----

![](https://i.imgur.com/MKBQmx1.png)

----

- Integrates with Dask
- Similar to Airflow, but with a cleaner design
- Minimal effort to get started
- Uses the `@task` decorator to turn a function into a node in the DAG
- Lightweight

----


```python
from prefect import task, Flow


@task
def extract():
    """Get a list of data"""
    return [1, 2, 3]

@task
def transform(data):
    """Multiply the input by 10"""
    return [i * 10 for i in data]

@task
def load(data):
    """Print the data to indicate it was received"""
    print("Here's your data: {}".format(data))

    
with Flow('ETL') as flow:
    e = extract()
    t = transform(e)
    l = load(t)

flow.visualize()
```
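
Inside the `with Flow('ETL')` block, calling a decorated task does not run the function right away; it records a node and its dependencies, and the work happens later when the flow is run. The toy sketch below imitates that idea in plain Python. `ToyFlow`, `toy_task`, and `Node` are hypothetical names made up for this illustration and do not reflect Prefect's actual internals; in particular, the flow is passed to the decorator explicitly instead of being picked up from a `with` block.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One recorded task call: the function plus its upstream nodes."""
    func: object
    upstream: tuple


@dataclass
class ToyFlow:
    name: str
    nodes: list = field(default_factory=list)

    def run(self):
        """Execute nodes in recording order, feeding results downstream."""
        results = {}
        for node in self.nodes:
            args = [results[id(dep)] for dep in node.upstream]
            results[id(node)] = node.func(*args)
        return {node.func.__name__: results[id(node)] for node in self.nodes}


def toy_task(flow):
    """Decorator factory: calling the decorated function records a Node."""
    def decorator(func):
        def record(*upstream):
            node = Node(func, upstream)
            flow.nodes.append(node)
            return node
        return record
    return decorator


flow = ToyFlow("ETL")


@toy_task(flow)
def extract():
    return [1, 2, 3]


@toy_task(flow)
def transform(data):
    return [i * 10 for i in data]


e = extract()       # nothing is executed yet, only recorded
t = transform(e)    # 'e' is a Node, so this records a dependency edge
results = flow.run()
print(results)
```

Recording order is already a valid execution order here, since a node can only depend on nodes created before it. A real scheduler would additionally handle retries, state tracking, and parallel execution.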

----

![](https://i.imgur.com/mzjSgN6.png)

----

[cloud-scheduler](https://www.prefect.io/products/cloud-scheduler/)
![](https://i.imgur.com/L7YvMnn.png)



----

## Conclusion
- Prefect: looks a lot nicer to use and is a lot easier to get started with
- Metaflow: aimed at data science; a good fit for machine learning projects
- Airflow: suited to large projects used by many people


## Reference