[Design Doc] Job Pipelining

We migrated all design docs to the Notion platform -> https://pl-strflt.notion.site/Bacalhau-Pipelines-4f4b46558148477db721b8b8f1f09766

Enrico, 2022-10-10

Kudos to everyone who worked on this epic before myself - I took the liberty of copy/pasting some of your thoughts here.

Context

Bacalhau (currently v0.3.x) can run workloads composed of a single job, but to execute a multi-step workload users have to manually submit and track each job. This (1) poses a significant burden on the user, (2) makes reproducing a pipeline difficult, and (3) means data origin can only be determined by manually logging each step and backtracking.

In an effort to offer broader support for modern multi-step workloads, this document aims to design a compelling pipelining feature that is user-friendly and allows for complex workloads.

In this context, a Pipeline is completely user-defined: the user writes a pipeline spec detailing what each step (i.e. a Bacalhau job) is and how the steps relate to one another.

Philippe This last point could be a useful restriction. Still, at some point, won't we have to decide which DAG format we want to use? This would tell us (1️⃣ ‒ inflow) which tools / SDKs data engineers may use to define / compute DAGs (ie their description) and (2️⃣ - outflow) what format orchestrators should be able to read.

This document is based on prior work:

Personas

  • User: a computational scientist, interested in consuming jobs/pipelines. Can be a data engineer as well. In short, they're the bacalhau docker run cli users.
  • Compute Provider (CP): a third-party willing to run a Bacalhau node on their infrastructure. In short, they're the bacalhau serve users. We consider this persona too because any new piece of infrastructure must be easily deployable.

Philippe Might it be useful to define sub-personae from the start, based on their main priorities? This distinction currently appears mostly in the user stories, but it could make more sense to differentiate at the persona level so we are not limited in the usage and number of user stories we may have.

  • e.g. Data integrator (will be interested in an automated lineage solution and the visualization of graphs of data dependencies).
  • e.g. Data scientist, focused on asynchronous collaboration (interested in the availability of the descriptions of DAGs of tasks, and in an automated lineage solution).
  • e.g. DevOps engineer, focused on performance (will be interested in caching strategies).

Philippe

  • Two personae (Data User and Compute Provider) make a lot of sense to start with. But one of the advantages of modern workflow automation platforms is their extensibility to infrastructures of very different sizes, since they are agnostic to the scale of the underlying architecture. One workflow that runs on a large global infrastructure in production may have been designed and prepared by a solo engineer running a local computation engine. In the early stages of workflow development, data users and compute providers are often the same (third?) persona, working on smaller static datasets. Workflow platforms and tooling often support and promote this way of developing and then deploying workflows, and it could be valuable for us to consider this persona quite early too.

This is mentioned in feature n°9.

Called the "local to multi-node cluster" journey in other Bacalhau material.

User stories

  • As a computational scientist, I want to branch pipelines with two or more parallel jobs and then reduce the results back into a single job. For example, see @NiklasTR's example use case
  • As a computational scientist, I want to reproduce an existing pipeline
  • As a data engineer, I want to be able to track down the origin of a data artifact generated by a pipeline
  • As a computational scientist, I need to list & inspect pipelines

Goals

  • Allow for multi-step job pipelines (i.e. chain consecutive/parallel jobs)
  • Achieve pipeline reproducibility
  • Achieve data lineage
  • Avoid any operational burden for users & Compute Providers
  • Aim at a smooth onboarding for existing multi-step workloads

Long term goals

In the long run we'd like to add support for running popular distributed compute frameworks such as Spark on Bacalhau.

Walid: This requires inter-job communication, or even better intra-CP communication, and co-locating compute nodes within the same data centres or geographical regions as much as possible to minimize the added latency of moving intermediary results across multiple hops and regions. This would also allow Bacalhau to run Spark-like jobs that require inter-job communication, since the communication would happen between compute nodes in a private network of a single SP.

The above would introduce networking changes as well as some security concerns. Let's keep this as a future goal.

Features

List of essential features:

  1. Simple multi-step pipelines - e.g. input -> job 1 -> job 2 -> output
  2. Fan in jobs - e.g. start with two separate jobs, fan into a single job
  3. Fan out jobs - e.g. start with a single job, fan out to two separate jobs
  4. Easy data passing - e.g. use the output from the previous job as the input to the next one
  5. Pipeline parameters - e.g. parameterized input data, some configuration
  6. Pipeline scheduling - e.g. run every day, see Cron as a service #70
  7. Failure control - e.g. retries, fail early, continue despite failures, etc.
  8. Pipeline status - e.g. track where we currently are in a pipeline run
  9. Development & testing - e.g. local environment to run pipelines or single jobs within a pipeline, test harness

Philippe

About Simple multi-step pipelines

I suppose there is still room for clarification in the kinds of pipelines we want to support or not: sequential? Branching/joining? Dynamic/conditional workflows? Non-deterministic workflows? Triggers (scheduling is mentioned in the list of essential features; triggering is loosely mentioned in the list of desirable features)?

NB: this is also conditioned by the DAG format (/ framework).

Philippe

About Pipeline parameters

We may want to clarify whether we are talking about configuration of the pipeline, of the compute infrastructure, or both.

Philippe We may add a feature about description of DAGs, their availability and (ideally) their discoverability.

List of desirable features:

  1. Job caching - e.g. don't re-run something we already have a result for
  2. Pipeline triggering - e.g. run this pipeline when something happens
  3. Asset typing - e.g. this job expects a json file, a stream, etc.
  4. Pipeline lifecycle - e.g. can inject or run certain things at certain points
  5. Pipeline hooks - e.g. can connect to a websocket to listen for lifecycle changes
  6. Logging - e.g. stream logs from all jobs in a pipeline back to user
  7. Monitoring - e.g. expose metrics about jobs in a pipeline
  8. Data-driven pipelines - e.g. pipelines react to new data

Philippe

About Job caching

What is our take on deterministic operations in the current design? Enforcement a priori? Validation a posteriori, possibly relying on Verifiers (as they are called in the current Bacalhau architecture)? The user's responsibility? A toggleable feature?

Philippe

About Asset typing

There might be something interesting to do here with multiformats, with information included in CIDs without having to fetch the underlying data.

Technical design

Be responsible or not?

The first point to discuss is whether or not Bacalhau takes responsibility over a Pipeline. This translates into different UX from the user/CP perspective.

Not Responsible

To alleviate Bacalhau from that responsibility, we could extend the current API to allow for pluggability into an Orchestrator managed externally, either by the CP or just running on the user's laptop. This shouldn't entail lots of engineering work: build a plug-in for our favourite Orchestrator and add some pipeline-related info to the jobspec.


For example, a hypothetical BacalhauOperator built as an Airflow operator would compose a DAG as shown in the snippet below. That's code a user can run on their laptop. Then, they'd connect to their Orchestrator dashboard to monitor the progress of the pipeline. At this point, bacalhau list will show two individual jobs.

# connect to localhost:8080 or a remote managed service
with DAG("my-dag") as dag:
    resize_job = BacalhauOperator(
        verb="docker run",
        inputs="QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72",
        image="dpokidov/imagemagick:7.1.0-47-ubuntu",
        cmd="magick mogrify -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg'",
    )
    blackwhite_job = BacalhauOperator(
        verb="run python",
        image="python3.10",
        cmd="python convert_bw.py",
    )
    resize_job >> blackwhite_job
Pros & Cons
  • Pros
    1. Relatively easy existing pipeline onboarding: just replace existing tasks with Bacalhau jobs
    2. Separation of concerns - Bacalhau stays tiny and doesn't get bloated with more code
    3. Writing a connector/operator is relatively easy
  • Cons
    1. Makes our value proposition weaker i.e. there's no Bacalhau Pipeline as such - @lewq long term goals?
    2. We don't really have solid datapoints on what orchestration systems our users prefer - where to start from?
    3. The full pipeline lifecycle is controlled outside of Bacalhau - it could be hard for us to get feedback/improve on this feature
    4. We'd still have to deploy an Orchestrator service somewhere for bootstrapping demos

Responsible

Alternatively, Bacalhau could take full responsibility over pipelines, meaning a Pipeline becomes a first-class citizen just like a Job. This means:

  1. The Orchestrator is shipped with Bacalhau and is part of the core codebase. Should it be self-built or not? Let's discuss this in the sections below
  2. The user submits and lists a Pipeline directly to Bacalhau using a to-be-defined DSL (a hypothetical sketch follows below)
  3. The Pipeline API supports Create and Read but not Update/Delete (immutable), similar to a normal Job

Although this approach requires a good amount of engineering work, I believe this is the way to go because it:

  • Allows us to control the pipeline lifecycle, test it and debug issues
  • Means the pipeline spec can be persisted and used for reproducibility and data lineage
  • Makes the value proposition of Bacalhau more compelling because (a) there's no set-up effort required of the user, and (b) CPs can deliver that functionality too "just by installing Bacalhau".
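
To make this option more concrete, here is a rough sketch of what a pipeline spec could look like. Everything below is hypothetical: the actual DSL is still to be defined, and the field names (steps, verb, inputs, the {{ steps.resize.outputs }} templating) are assumptions; the job parameters are simply lifted from the Airflow-style example earlier in this document.

# Hypothetical pipeline spec; the real DSL is to be defined.
pipeline_spec = {
    "name": "image-processing",
    "steps": [
        {
            "id": "resize",
            "verb": "docker run",
            "image": "dpokidov/imagemagick:7.1.0-47-ubuntu",
            "cmd": "magick mogrify -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg'",
            "inputs": ["QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72"],
        },
        {
            "id": "blackwhite",
            "verb": "run python",
            "image": "python3.10",
            "cmd": "python convert_bw.py",
            # hypothetical templating: consume the output CID of the previous step
            "inputs": ["{{ steps.resize.outputs }}"],
        },
    ],
}
# Create + Read only: once submitted to the requestor node the spec is immutable,
# which is what makes it usable for reproducibility and lineage.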

High level architecture

According to the docs:

The requestor node is responsible for handling requests from clients using JSON over HTTP and is the main “custodian” of jobs submitted to it.

It's therefore a good candidate for handling a pipeline request too. An orchestrator interface could use either an external orchestrator (e.g. one the CP already manages) or a self-built, internal orchestrator.

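As a rough illustration of that boundary (sketched in Python for brevity, even though the implementation would live in the Go codebase; all names below are made up), the requestor node would call into something like:

from abc import ABC, abstractmethod


class Orchestrator(ABC):
    """Hypothetical interface the requestor node talks to."""

    @abstractmethod
    def submit_pipeline(self, spec: dict) -> str:
        """Register a pipeline and return its id."""

    @abstractmethod
    def pipeline_status(self, pipeline_id: str) -> str:
        """Report where a pipeline run currently is."""


class ExternalOrchestrator(Orchestrator):
    """Delegates to an engine the CP (or the user) already runs, e.g. Airflow."""


class InternalOrchestrator(Orchestrator):
    """Self-built engine shipped with the Bacalhau node itself."""

Either implementation can sit behind the same API, so the choice between external and internal engines (discussed below) would not leak into the user-facing Pipeline spec.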

Outstanding questions

External engine

The idea is that an external engine is managed by a third party. It may be the case that CPs already have their own Orchestrators running, and naturally they'd like to use those. However, while we do have some understanding of what current users prefer as orchestrators (see List of open-source orchestrators), we lack knowledge about what managed services they use, if any.

Internal engine

A Bacalhau node is composed of the following services:

  • bacalhau-daemon: compute
  • ipfs-daemon: storage
  • openresty: check node health
  • prometheus-daemon (optional): for app metrics

The orchestrator engine could run in a dedicated internal service, next to those listed above, or live within bacalhau-daemon. Either way, we can include that as part of the installation process and make it CP friendly.

Orchestrator Engine: Self-built vs open-source?

Prior research on this topic highlighted two alternatives: should we build our own orchestrator engine, or just use an open-source solution? While Phil's thread leans towards a self-built solution, Polyphene's thread hints that a stable open-source engine would be better.

  1. Leverage an open-source orchestration platform like Airflow, and build an operator that is able to control Bacalhau jobs
  2. Build DAG orchestration into Bacalhau

Then hide the interaction with Airflow/Bacalhau behind an interface and expose the ability to create and manage DAGs in the API. As you can see, both options require hiding the implementation of the DAG and adding the ability to interact with the DAG through an API. The key difference is what has control over the DAG.

Open-source solution

Pros:

  • Handing off control to a well-established DAG solution.
  • Probably has more functionality than we can hope to implement in Bacalhau.

Cons:

  • Server operator burden. We'd have to ask CPs to run and operate Airflow, along with bacalhau.
  • Operational burden. What about HA? What about backups? What about updates? How do we orchestrate it?
  • Will have to implement a custom Airflow operator (or equivalent) to allow it to interact with the bacalhau network.
Self-built Into Bacalhau

Pros:

  • Full control, tight integration, can improve over time.
  • Likely to be more robust over the long term, due to fewer external dependencies
  • Less operational and server operator burden.

Cons:

  • Probably more engineering effort, but might not be much more. (Implementing an Airflow Operator vs. implementing control of a DAG)

For more details, check the Engineering Effort Discussion here.

Reproducibility

Philippe I certainly lack general knowledge of the common lexicon: are we talking about caching? Some enforcement of deterministic operations? Or something else?

TODO

Lineage

TODO - see https://openlineage.io/

What IPFS/IPLD/Estuary/web3.storage features can we use to persist lineage?

List of open-source orchestrators

Airflow POC

Concepts

Provider

Providers can contain operators, hooks, sensors, and transfer operators to communicate with a multitude of external systems (reference)

Operator

A reusable, pre-made Task template whose logic is all done for you and that just needs some arguments.

How to Create a custom operator
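
For reference, a custom operator is just a subclass of Airflow's BaseOperator with an execute() method. Below is a minimal, purely illustrative sketch of what a Bacalhau submit operator could look like; the class name and its arguments are made up, and the CLI flags are the --id-only/--wait ones used in the XCom example later in this doc.

import shlex
import subprocess

from airflow.models.baseoperator import BaseOperator


class BacalhauSubmitOperator(BaseOperator):
    """Hypothetical operator: submits a docker job and returns the job id."""

    def __init__(self, image: str, cmd: str, **kwargs) -> None:
        super().__init__(**kwargs)
        self.image = image
        self.cmd = cmd

    def execute(self, context):
        # The value returned by execute() is pushed to XCom automatically,
        # so a downstream task can pull the job id.
        result = subprocess.run(
            ["bacalhau", "docker", "run", "--id-only", "--wait", self.image]
            + shlex.split(self.cmd),
            check=True,
            capture_output=True,
            text=True,
        )
        return result.stdout.strip()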

Hook

A Hook is a high-level interface to an external platform that lets you quickly and easily talk to them without having to write low-level code that hits their API or uses special libraries

Passing data between tasks: XComs

XComs are one method of passing data between tasks, but they are only appropriate for small amounts of data. Large data sets require a method making use of intermediate storage and possibly utilizing an external processing framework.

https://big-data-demystified.ninja/2020/04/15/airflow-xcoms-example-airflow-demystified/

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# op1 pushes the Bacalhau job id to XCom; op2 pulls it to fetch the results
with DAG('first-bacalhau-dag', start_date=datetime(2021, 1, 1)) as dag:
    op1 = BashOperator(
        task_id='submit-a-job',
        bash_command='echo hello {{ ti.xcom_push(key="jobid", value="$(bacalhau docker run --id-only --wait ubuntu date)") }}',
    )

    op2 = BashOperator(
        task_id='get-a-job',
        bash_command='bacalhau get --download-timeout-secs 10 --output-dir /tmp/enrico/ {{ ti.xcom_pull(key="jobid") }}',
    )

    op1 >> op2

Lineage

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
Lineage support is very experimental and subject to change. (reference)

  • Inlet: upstream task ids, attr annotated object
  • Outlet: can only be attr annotated object
from airflow.lineage.entities import File

# FILE_CATEGORIES is assumed to be defined elsewhere, e.g. ["CAT1", "CAT2", "CAT3"]
f_in = File(url="/tmp/whole_directory/")
outlets = []
for file in FILE_CATEGORIES:
    f_out = File(url="/tmp/{}/{{{{ data_interval_start }}}}".format(file))
    outlets.append(f_out)

run_this = BashOperator(
    task_id="run_me_first",
    bash_command="echo 1",
    dag=dag,
    inlets=f_in,
    outlets=outlets,
)

Metadata is pushed into XCOM

What would a BacalhauOperator look like?

Docker operator

DockerOperator - Execute a command inside a docker container.

t_print = DockerOperator(
    api_version="1.19",
    docker_url="tcp://localhost:2375",
    image="centos:latest",
    mounts=[Mount(source="/your/host/output_dir/path", target="/your/output_dir/path", type="bind")],
    command=f"cat {t_move.output}",
    task_id="print",
    dag=dag,
)

https://airflow.apache.org/docs/apache-airflow-providers-docker/stable/index.html

K8s operator

KubernetesPodOperator - Execute a task in a Kubernetes Pod

write_xcom = KubernetesPodOperator(
    namespace='default',
    image='alpine',
    cmds=["sh", "-c", "mkdir -p /airflow/xcom/;echo '[1,2,3,4]' > /airflow/xcom/return.json"],
    name="write-xcom",
    do_xcom_push=True,
    is_delete_operator_pod=True,
    in_cluster=True,
    task_id="write-xcom",
    get_logs=True,
)

pod_task_xcom_result = BashOperator(
    bash_command="echo \"{{ task_instance.xcom_pull('write-xcom')[0] }}\"",
    task_id="pod_task_xcom_result",
)

https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/4.4.0/

Postgres

PostgresOperator - run SQL query

get_birth_date = PostgresOperator(
    task_id="get_birth_date",
    postgres_conn_id="postgres_default",
    sql="SELECT * FROM pet WHERE birth_date BETWEEN SYMMETRIC %(begin_date)s AND %(end_date)s",
    parameters={"begin_date": "2020-01-01", "end_date": "2020-12-31"},
)

Dbt Cloud (!!!)

DbtCloudRunJobOperator - to trigger a run of a dbt Cloud job

DbtCloudJobRunSensor to periodically retrieve the status of a dbt Cloud job run and check whether the run has succeeded.

DbtCloudGetJobRunArtifactOperator to download dbt-generated artifacts for a dbt Cloud job run

DbtCloudListJobsOperator to list all jobs tied to a specified dbt Cloud account.

https://airflow.apache.org/docs/apache-airflow-providers-dbt-cloud/stable/operators.html

## Trigger job
trigger_job_run2 = DbtCloudRunJobOperator(
    task_id="trigger_job_run2",
    job_id=48617,
    wait_for_termination=False,
    additional_run_config={"threads_override": 8},
)
## Poll status
job_run_sensor = DbtCloudJobRunSensor(
    task_id="job_run_sensor", run_id=trigger_job_run2.output, timeout=20
)

get_run_results_artifact = DbtCloudGetJobRunArtifactOperator(
    task_id="get_run_results_artifact", run_id=trigger_job_run1.output, path="run_results.json"
)

https://airflow.apache.org/docs/apache-airflow-providers-dbt-cloud/stable/index.html

Single VS Multiple operators?

Single:

Multiple:

  • PRO: Bacalhau comes with multiple verbs and one is normally interested in doing more than just DockerRun

Bacalhau

  • BacalhauDockerRunOperator
  • BacalhauGetOperator
  • Bacalhau[...]Operator
  • BacalhauWasmOperator

Probably hooks too; look at how the dbt Cloud provider is made.

Use the REST API instead of the client?
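
If we go for multiple thin operators, they could share a common hook, mirroring how the dbt Cloud provider is laid out. The sketch below is hypothetical and simply wraps the CLI calls already used in the POC; the same hook could later target the REST API without changing the operators built on top of it.

import subprocess

from airflow.hooks.base import BaseHook


class BacalhauHook(BaseHook):
    """Hypothetical hook: the one place that knows how to talk to Bacalhau.

    Shells out to the CLI for now; could be swapped for the REST API later.
    """

    def submit_docker_job(self, image: str, cmd: str) -> str:
        out = subprocess.run(
            ["bacalhau", "docker", "run", "--id-only", "--wait", image, *cmd.split()],
            check=True, capture_output=True, text=True,
        )
        return out.stdout.strip()  # the job id

    def get_results(self, job_id: str, output_dir: str) -> None:
        subprocess.run(
            ["bacalhau", "get", "--download-timeout-secs", "10",
             "--output-dir", output_dir, job_id],
            check=True,
        )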

Meeting minutes

Will move this to another doc as soon as I wrap my head around hackmd


10-25-2022

  • Enrico
    • Described progress
  • Philippe
    • Unify pipelines, template level
      • 1 CID in, 1 CID OUT
      • it's actually 1+ CIDs in, 1 CID out
    • Fan in/out
      • remove string keys, use indices
      • We output an array containing indices
    • Flyte
      • built in scheduler
      • data caching
      • not best rep for post-Airflow era
  • Enrico
    • extensibility is key
  • Philippe
    • Simon's img processing examples
  • Demo:
    • Img processing - cool, must run on laptop
  • Philippe
    • Docker image could be CID so
      • deterministic task could be made out of a tuple: docker-CID, input-CID
      • use IPFS-backed Docker registry

10-19-2022

  • Part 1) Alternatives to Airflow
    • Philippe
      • Kedro and others may not come with a scheduler attached
      • Differentiators
        • Airflow still dominant - network effect!
        • post-Airflow mindset -> extra abstraction layer (e.g. Dagster)
      • Task scheduler is key
        • Prefect changed it recently
      • Popularity
        • Dagster (!) - focus on data integration
        • Prefect - the next airflow
        • ( Metaflow (by Netflix) )
      • Dagster less mature
      • Github stars
        • Prefect 10k stars (!) - possibly will grow
      • Will share examples offline
    • Luke
      • put research in writing
      • any front runner?
      • Philippe:
        • Prefect first, Argo 2nd
        • Python, YAML, (visual editor?)
  • Part 2) AIRFLOW research
    • Kai
      • CID as output is small - good news
      • Bacalhau could be bacalahu\
      • Airflow
        • XCom is cool for intermediate steps
        • How do you get your end result out of your pipeline
        • Philippe:
          • templating
        • Airflow should output a CID, input CID as Bacalhau normally does
    • Philippe: pipeline state could sit on IPFS
    • Kai: pipeline export format? Philippe: look for a common interface
    • Philippe: dbt operator
      • Figure out how Prefect manages task comms

10-13-2022

meeting goals

  • Spark or not?
  • Be responsible or not?
  • Enrico: Elaborate on Pros and Cons
  • opts
    • Operator
      • POC bac can talk to operator
    • embedded
      • core feature
  • Kay:
    • job should be aware it's part of a pipeline
    • visualize the whole thing
  • Philippe: integration is key to address Walid's concerns
  • Luke:
    • we can do both (not now)
      • INTEGRATION
        • engage with existing community - Airflow cool!
        • get feedback from them
        • Luke votes for this option
      • NATIVE DAG
  • Kay:
    • cost of NATIVE is expensive, INTEG. is cheap
  • Luke: figure out how to pass data across jobs in INTEGRATION
  • Philippe: Pros and Cons - see what Lison says
  • Luke:
    • Build Integration prototype
    • Free compute for you!
  • Kay:
    • Airflow: onboarding should be easy
  • Kay: would be cool to write a NATIVE DAG design
  • Philippe:
    • onboarding to Airflow could be tricky - surely need to optimize
    • Passing a CID is enough? Need more data?
    • How does caching and determinism impact Pipelines? Don't know yet
  • DECISION:
    • Enrico: Go ahead with prototype
    • Enrico: Expand on INTEGRATION
    • Philipp - Thomas:
      • pick a 2nd orch. other than Airflow
      • investigate caching in Airflow

10-06-2022

  • User research
    • Popular systems? HPC pipelines - ask Wes?
  • Goal: push data to filecoin
  • Compatibility with existing systems
  • Philippe: CoD summits are good for data research (Lisbon is coming up!)
    • Personas: DS vs Data engineering (latter probably more promising, more narrow scope)
    • data integrations - smaller set of tools
    • Reliability is needed
  • Luke: Data engineering space!
    • dbt
    • snowflake
    • docker is main interface! what's the equivalent for DAGs?
    • find a niche that works well with pvt dataset
  • Philippe
    • open dataset & open science
    • first users could be: scientists that collaborate
  • Kay
    • meta-dag idea
  • Philippe
    • dag: ipfs, IPLD fashion
  • Luke
    • lineage
    • provenance
    • reproducibility
  • Philippe
    • 4-5 orch. systems, pick 1
    • Airflow, Dagster, Prefect; maybe Argo and Luigi; Luke: Flyte
  • Kay
    • stress on the "reproducibility"
    • ideas
      • option 1 - connect to an external DAG system - have bacalhau talk to it
      • option 2 - DAG managed from within a Bacalhau job (maybe not good)
  • Luke
    • pitch to airflow user
      • here's free compute, (!) data is public
      • here's the connection to airflow
  • Luke
  • Check what opinions DAG systems have on data input/output
    • Airflow
    • others?

Next steps:

  1. Prepare a design doc
  2. Discovery work: explore the grounds for integrations (limitations, etc.)
  3. Build a Prototype & Future Plans