Develop & Run Flytekit with Jupyter Notebook

# Develop & Run Flyte Workflows with Jupyter Notebook

![image](https://hackmd.io/_uploads/H1tnBHQxkl.png)

## Motivation

Prototyping, experimenting, and visualizing results are crucial in data science and machine learning tasks. With Flytekit’s workflow orchestration embedded in Jupyter notebooks, users can create scalable pipelines while retaining the agility and ease of the notebook experience. This integration eliminates the friction between prototyping and production, enabling users to execute tasks, test workflows on smaller datasets, and fine-tune hyperparameters interactively. This approach extends Flytekit’s capabilities to a broader audience of researchers and developers by simplifying collaboration and boosting productivity.

## Challenges of Running Flytekit in an Interactive Environment

Integrating Flytekit into Jupyter notebooks introduces several challenges:

1. **Code Serialization and Registration:** Flyte workflows are typically registered using command-line tools like `flytectl` or `pyflyte`, which handle code serialization and uploading to the Flyte backend. While `FlyteRemote` allows users to fetch and execute previously registered entities, the initial registration process still relies on these command-line tools. This creates friction in interactive environments where users expect simple, direct execution without additional command-line steps or external dependencies. Alternative approaches, such as using Python packages like `Click` to execute `pyflyte` commands within Python code, would also add unnecessary complexity to the execution process.
2. **Notebook-Specific Code Management:** Jupyter notebooks present unique challenges for code management during runtime registration. While the IPython kernel maintains a shared namespace across cells, it only tracks the execution state (variables, functions, classes) without preserving references to the original source code.
Additionally, notebook code is stored in `.ipynb` files using a JSON format rather than as standard Python modules, creating a mismatch with Flyte's registration mechanism. This makes it difficult to extract and package the complete source code needed for workflow registration, especially when code dependencies are spread across multiple cells. This differs fundamentally from traditional Python modules where code is available in plain text files with clearly traceable dependencies.

These challenges indicate the need for a more cohesive and intuitive approach. Ideally, users should be able to develop, register, and execute workflows directly within the Jupyter notebook environment without needing to manually extract code or rely on complex external tools.

## How Does Interactive Mode Work?

Flytekit’s interactive mode builds upon the existing [fast registration](https://docs.flyte.org/en/latest/user_guide/flyte_fundamentals/registering_workflows.html#fast-registration) method to better suit interactive environments like Jupyter notebooks. By setting `interactive_mode_enabled=True` in the `FlyteRemote` constructor, Flytekit triggers a modified fast registration process that serializes entities (e.g., tasks, workflows, and launch plans) directly from memory. Rather than requiring a complete project package, this approach creates a pickled object saved as a `pkl.gz` file, which is then registered to the specified Flyte cluster and uploaded to the configured blob store (e.g., `s3`, `gcs`).

To complement this, a new task resolver, `DefaultNotebookTaskResolver`, has been designed to load the task functions directly from the pickled file on the remote cluster. This resolver fetches the corresponding task entity based on its task name and executes it.
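
The idea of serializing entities from memory and resolving them by name can be sketched with the standard library alone. This is a toy illustration, not Flytekit's implementation: it swaps in stdlib `marshal` for `cloudpickle` to stay dependency-free, it only handles plain functions without closures or globals, and `dump_entities` and `resolve_entity` are hypothetical names invented for this example.

```python
import builtins
import gzip
import marshal
import pickle
import types


def hello(name: str) -> str:
    return f"Hello {name}!"


def world(pre: str) -> str:
    return f"{pre} Welcome to the Jupyter Notebook!"


def dump_entities(entities: dict) -> bytes:
    """Serialize functions by value (their bytecode), not by import path."""
    by_value = {name: marshal.dumps(fn.__code__) for name, fn in entities.items()}
    # The compressed blob stands in for the uploaded pkl.gz file.
    return gzip.compress(pickle.dumps(by_value))


def resolve_entity(blob: bytes, name: str):
    """Hypothetical resolver: rebuild a single function from the blob by name."""
    by_value = pickle.loads(gzip.decompress(blob))
    code = marshal.loads(by_value[name])
    # Rebuild the function with a fresh globals dict; no source file is involved.
    return types.FunctionType(code, {"__builtins__": builtins}, name)


blob = dump_entities({"hello": hello, "world": world})
remote_hello = resolve_entity(blob, "hello")
remote_world = resolve_entity(blob, "world")
greeting = remote_world(remote_hello("Flyte"))
print(greeting)
```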

Here’s a comparison between the original fast registration and the new interactive mode:

![jupyter](https://hackmd.io/_uploads/SyAYbX2lkx.jpg)

This approach has both pros and cons:

- **Pros:**
    - **No need for real files:** With Flyte’s interactive mode, tasks and workflows are serialized directly from memory. This eliminates the need for creating physical files or packaging entire project directories, simplifying the development workflow and making it easier to iterate on your code quickly. By working directly from memory, you reduce the complexity of file management, saving time and avoiding potential file system issues.
    - **Lightweight:** Instead of bundling the entire project directory, only the specific entities (tasks and workflows) that are actively used are serialized. This approach not only minimizes the package size but also reduces deployment and registration overhead. Developers can focus solely on the necessary tasks without worrying about extraneous files or dependencies, leading to faster iterations and more streamlined deployments.
- **Cons:**
    - **Python and package versions must match between environments:** Since entities are serialized at the binary level using `cloudpickle`, both the local and remote environments must have compatible Python versions and module versions to avoid deserialization issues. Users can address this by specifying a container image with matching versions for each task using [ImageSpec](https://docs.flyte.org/en/latest/user_guide/customizing_dependencies/imagespec.html#imagespec).
    - **No support for dynamic workflows:** [Dynamic workflows](https://docs.flyte.org/en/latest/user_guide/advanced_composition/dynamic_workflows.html#dynamic-workflow) are compiled and registered at runtime in the remote cluster. Without access to the complete project source, this compilation process becomes challenging in the remote environment.
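
One lightweight way to catch version drift before it surfaces as a deserialization error is to compare environment fingerprints. The sketch below uses only the standard library; `environment_fingerprint` and `find_mismatches` are hypothetical helpers written for this example, and the remote values are made-up placeholders, not recommended versions.

```python
import sys
from importlib import metadata


def environment_fingerprint(packages):
    """Collect the Python version and the installed versions of the given packages."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None  # package is not installed in this environment
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "packages": versions,
    }


def find_mismatches(local, remote):
    """Return human-readable differences between two fingerprints."""
    problems = []
    if local["python"] != remote["python"]:
        problems.append(f"python: local {local['python']} != remote {remote['python']}")
    for pkg, local_version in local["packages"].items():
        remote_version = remote["packages"].get(pkg)
        if local_version != remote_version:
            problems.append(f"{pkg}: local {local_version} != remote {remote_version}")
    return problems


local = environment_fingerprint(["flytekit", "cloudpickle"])
# Placeholder fingerprint of the task image (assumed values for illustration only).
remote = {"python": "3.11", "packages": {"flytekit": "1.14.0", "cloudpickle": "3.0.0"}}
for problem in find_mismatches(local, remote):
    print(problem)
```

The same fingerprint function could be run inside a task to capture the remote side. Any values that differ can then be pinned through the `python_version` and `packages` arguments of `ImageSpec`.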
## Usage

The example in this section can be found in [flytesnacks](https://github.com/flyteorg/flytesnacks/pull/1754). All the following code should be executed within a Jupyter Notebook environment.

### How to Set Up Interactive Mode in Jupyter?

1. **Initialize FlyteRemote with Interactive Mode Enabled**

   Set up `FlyteRemote` with `interactive_mode_enabled=True`, which is all you need to get started:

   ```python
   from flytekit.remote import FlyteRemote
   from flytekit.configuration import Config

   remote = FlyteRemote(
       Config.for_sandbox(),
       default_project="flytesnacks",
       default_domain="development",
       interactive_mode_enabled=True,  # Optional in notebooks - automatically enabled
   )
   ```

   Note: When running in a Jupyter notebook, interactive mode is automatically enabled even without specifying this parameter. This example uses the sandbox configuration. See the [FlyteRemote](https://docs.flyte.org/en/latest/api/flytekit/design/control_plane.html) documentation for other ways to configure your connection.

2. **Develop Tasks and Workflows**

   Create tasks and workflows using Flytekit:

   ```python
   from flytekit import task, workflow

   @task
   def hello(name: str) -> str:
       return f"Hello {name}!"

   @task
   def world(pre: str) -> str:
       return f"{pre} Welcome to the Jupyter Notebook!"

   @workflow
   def wf(name: str) -> str:
       return world(pre=hello(name=name))
   ```

3. **Run Entities on the Remote Cluster**

   Execute tasks or workflows with `remote.execute()`, which automatically fetches and registers entities if needed:

   ```python
   # Execute the task
   exe = remote.execute(hello, inputs={"name": "Flyte"})

   # This will print the URL to the console
   print(exe.execution_url)

   # Wait for the task to complete
   exe = exe.wait(poll_interval=1)

   # Print the outputs
   print(exe.outputs)
   ```

   Alternatively, you can execute a workflow and wait for completion in one step:

   ```python
   # Execute the workflow and wait for it to complete
   exe = remote.execute(wf, inputs={"name": "world"}, wait=True)

   # Print the outputs
   print(exe.outputs)
   ```

Currently, the interactive mode supports various entity types, including tasks, workflows, map tasks, and multiple plugins (e.g., FileSensor Agent and K8s).

### Developing and Versioning

> **TL;DR:** In interactive mode, Flytekit uses function names as entity identifiers and automatically handles versioning. Every time you execute a cell that defines a task or workflow, it creates a new version - but when you just use existing entities without redefining them, it reuses the previous version.

Before diving into versioning, it’s essential to understand how Jupyter Notebook and Flyte handle code execution and naming.

When you define a task or workflow in Flytekit, each entity is automatically named based on its context in the code. In traditional Python scripts, this naming is typically determined by the module name and the entity's identifier. However, in Jupyter Notebooks, where code is stored in dynamically generated virtual files, Flytekit simplifies this by using only the function name, regardless of the notebook's filename.

Since you'll likely be updating your tasks and workflows frequently in an interactive environment, Flytekit implements an automatic versioning system to manage these changes efficiently.
Each time you execute a cell containing task or workflow definitions, Flytekit generates a new version and performs a fast registration - even if the code hasn't changed. This behavior changes only when you run cells that use existing entities without redefining them. In such cases, Flytekit simply retrieves the previously registered version, bypassing the registration process.

Let's look at some examples:

1. When you first execute a cell in a Jupyter Notebook, the entity name matches the function name, and FlyteRemote generates a version:

   ![image](https://hackmd.io/_uploads/H1tYJ0ne1e.png)

2. If you re-execute the cell without changes, the name and version remain the same:

   ![image](https://hackmd.io/_uploads/SJWTy03l1x.png)

3. However, when you modify and execute the entity with new content, FlyteRemote assigns a new version. Flytekit then registers and executes the new entity:

   ![image](https://hackmd.io/_uploads/S17sZ0neye.png)

Therefore, when working in a Jupyter Notebook, there’s no need to manually manage Flyte versions. Each code update automatically creates a new version, enabling rapid iteration and experimentation without the hassle. This streamlined approach keeps your workflow agile and perfectly suited for the dynamic nature of interactive notebooks!

## Conclusion

Flyte’s integration with Jupyter Notebooks opens up a world of possibilities for data scientists and engineers. By automating task registration and embracing interactive development, Flyte lets you focus on innovation and experimentation without the overhead. Whether you’re building complex workflows or fine-tuning models, Flyte’s interactive mode empowers you to bring your ideas to life faster and more efficiently. Dive in, iterate boldly, and unlock the full potential of your workflows!