# Scientific File System (SciF) Next Steps
**Moved to OSF Wiki: https://osf.io/xdkcy/wiki/Requirements**
## Background
[Makeflow plus Scientific Filesystem proof-of-concept #132](https://github.com/AgPipeline/issues-and-projects/issues/132)
This document follows the outline provided in this blog post: https://www.perforce.com/blog/alm/how-write-software-requirements-specification-srs-document
and in this IEEE SRS template: http://www.cse.msu.edu/~cse870/IEEEXplore-SRS-template.pdf
## Purpose of this document
The purpose of this document is to define the requirements for the open drone processing pipeline and to prioritize steps toward implementation.
This document is intended for use by developers in designing and implementing the pipeline. It is also intended for use by users and contributors interested in understanding the purpose of the software and how to make meaningful contributions.
While this document is intended to be a blueprint for the software, it will be updated as needed with agreement among David, Chris, and Julian.
## Overall Description
See Schnaufer, Pistorius, LeBauer 2020 https://doi.org/10.1117/12.2560008 available as a preprint here: https://eartharxiv.org/vezn5/
## User Needs
User needs are defined in Schnaufer et al. and the NIFA proposal.
### User Stories
In this section we use User Stories to identify our intended audience from the user’s perspective.
These stories describe the people our software is designed to support, as well as those we hope will join our efforts; the user's perspective helps explain and focus our objectives.
#### User 1: A researcher has purchased a new drone and has collected a season's worth of data.
It is fall, and they want to convert these images into plot level measurements that are similar to what they collected in previous years using standard methods to measure plants.
They have a campus license for ArcGIS but have not used it to process images before.
They are currently considering a commercial software package but the license costs \$5-10k/year.
#### User 2: The lead of the information technology group at a Land Grant University
They have received multiple requests to support the computing and data storage needs of researchers who have been collecting drone imagery over the last few years.
They have been asked for help processing as well as publishing these data in a way that complies with funding agencies and publishers.
#### User 3: A graduate student in computer science has published a new algorithm.
They don't have the need or funding to establish an experimental trial and collect data from a field site, but they can find published drone imagery in a repository.
They just want to see if their new algorithm is efficient and correct in a real-world scenario.
#### User 4: A pipeline developer.
A researcher has an existing pipeline and wants to expand its capabilities: for example, extending it to other sensors or to high performance computing environments, adopting standards, or enabling archiving and sharing of results.
This researcher already has a workflow that they are using. They are able to use the existing Docker containers for their new sensor workflow. They can copy and modify their existing workflow to make workflows for the new sensors.
examples: FieldImageR, ImageBreed, UASTools, ...
### Other Users
New to flying drones and needs a processing pipeline: this researcher will need to learn the ins-and-outs of flying drones. They will be able to deploy existing Drone Processing Pipeline workflows “as is” and use the results to evaluate whether the pipeline is sufficient for their needs. If changes are needed, the modular nature of the workflows allows for easy modification, and the templating of Transformer code allows for easy development of new or replacement algorithms.
A researcher has developed a new way to measure and wants to replace an existing measurement formula with a new one: the new algorithm can be developed using one of the code templates as a starting point. The code template allows them to focus on their new measurement algorithm rather than on the execution environment. Additionally, since they are using a well-defined environment, other researchers can make use of the new algorithm by reusing their Entry Point and Environment components to create a new Transformer.
A researcher wants to reproduce a published result that was reached through the pipeline: this assumes the researcher has access to the original data and has the version information of the pipeline available. This researcher stands up a copy of the pipeline using the same version of the pipeline and components as the original research, and runs the data through it to reproduce the results. There may be processing environment factors that prevent the exact same results from being reproduced, but by being able to reproduce the processing pipeline these factors are minimized and can be accounted for.
## Assumptions and Dependencies
* Everyone can use Docker to share computing environments.
* Use defined inputs and outputs
* Core system only relies on a filesystem + docker
* Pipeline developers should clearly define requirements for testing and documenting algorithms
* Algorithm developers are responsible for testing and packaging algorithms
# System Features and Requirements
The goal is to minimize the overhead for end users and contributors.
## Functional Requirements
* Minimize requirements on end user
* A user only needs to provide command line instructions for converting defined inputs into defined outputs, plus documentation and tests that define functionality and limits (see the sketch after this list)
* Any script, binary, python or R function, or docker image is acceptable
* Minimize complexity required to contribute an algorithm
* Minimize complexity required to deploy
* Can run (even if slowly) on a laptop
* Can be distributed to available HPC resources
* Should be easy to deploy on a laptop, HPC, or cloud service
* Computing environment (priorities)
* docker on linux [1], macOS [2], and windows [3]
* HPC [2]
* Kubernetes [2?]
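A hedged sketch of what the minimal end-user contract above might look like. The script name, image name, and flags are hypothetical placeholders, not an existing interface:

```bash
# Any command line that maps the defined inputs to the defined outputs is
# acceptable: a script, a binary, a Python or R function, or a docker image.
# The names and flags here are placeholders.
python canopy_cover.py --orthomosaic season_ortho.tif --plots plots.shp --output canopy_cover.csv

# The same contract packaged as a docker image:
docker run --rm -v "$PWD:/data" example/canopy-cover:1.0 \
    --orthomosaic /data/season_ortho.tif --plots /data/plots.shp --output /data/canopy_cover.csv
```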
## External Interface Requirements
* Inputs and outputs follow standards and conventions where available
* Core pipelines can run without a database server, but it is possible to add data access and upload through database and API interfaces
* Anything that is not standardized should be minimized
* Use Python where possible, as this is familiar to scientists
* Follow standards and conventions in data
* Follow standard patterns of software architecture
* anything non-standard should be well justified in documentation
* Preference for usability over performance
## Non-Functional Requirements
# Roadmap
## Features List
### Alpha (Date)
Begin with orthomosaic
Inputs:
* orthomosaic
* shapefile
Outputs:
* csv file with canopy cover
### Beta (Date)
Add Orthomosaicing, georeferencing
Inputs:
* Drone Files
* Experimental metadata
Outputs:
* Orthomosaic
### 1.0 (Date)
## Discussions
* Docker Images (maybe this is really architecture?)
* prefer Ubuntu LTS [1]
> [name=schnaufer] [time=Tue, May 26,2020 10:08] Any lightweight Linux system should be OK
* Layers
* \+ SciF
* \+ makeflow + workqueue
* Can use any image on dockerhub
* not necessary for ODM, which runs on Ubuntu 16.04 LTS ([supported until 2024](https://en.wikipedia.org/wiki/Ubuntu_version_history#Table_of_versions)) and has migration plans to 18.04 and Python 3
* imagine a very fragile CentOS 4 server. Or any future situation in which an algorithm has been packaged in a deprecated OS, VM, or version of a language (e.g. python)
* Could use chroot.
* Could also use workqueue to call this image from an orchestrator
* discussion of input/output mapping
> [name=schnaufer] [time=Tue, May 26,2020 10:15] What's the discussion? @dlebauer Can you provide information as to what you think the discussion is about
> This whole section was written by David. Please ask him for clarification. I added the section further down [Proposed solution: Multi-image pipelines](#Proposed-solution-Multi-image-pipelines), as well as the personas
> [name=julianp-diag] [time=Tue, May 26,2020 10:16]
* input: `level_n/productname`
* output: `level_n+1/derivativename`
* human-readable, descriptive paths and files
* e.g. include level, data product, algorithm, date, and optionally location???
* something like TERRA REF (see the path-layout sketch after this list)
* static vs dynamic workflows
* why, how important, reference CCTools discussion
* Makeflow is "static" but can be semi-dynamic because list.txt can contain a list of files that can't be determined before the workflow starts (e.g. plot clip exception)
* Avoid docker-in-docker in production pipelines because (@julianp-diag?)
> Docker themselves advise against running Docker in Docker: <https://hub.docker.com/_/docker>
> "Although running Docker inside Docker is generally not recommended, there are some legitimate use cases, such as development of Docker itself."
> Some of the things may not apply in modern Docker (i.e. launching top-level, sibling containers from inside another container), but to me it still seems more complexity than necessary for what we're trying to achieve.
> Not sure what the added benefit of running Docker from Docker would be.
> [name=julianp-diag] [time=Wed, May 27, 2020 13:30]
> Correction: I believe this advice refers to running the Docker _daemon_ inside another Docker container, not running the Docker _client_ from within a Docker container. The latter is not as fraught, but still adds extra unnecessary complexity, and may cause problems in HPC and certain workflow engines. Especially if we also end up running the workflow engine in a container. The levels of indirection may become very difficult for pipeline creators to untangle. I will want to know what benefit adding Docker running Docker would have over the various solutions I proposed below for handling 3rd party images.
> [name=julianp-diag] [time=Wed, May 27, 2020 14:20]
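The path-layout sketch referenced in the input/output mapping item above. One possible layout following the `level_n/productname` convention, mirroring the TERRA REF style; the product names, dates, and algorithms shown are illustrative only, not an agreed standard:

```
level_0/rgb_images/2020-05-20/DJI_0384.JPG          # raw sensor data
level_1/orthomosaic/2020-05-20/orthomosaic.tif      # derived by the orthomosaic step
level_2/canopy_cover/2020-05-20/canopy_cover.csv    # derived by the canopy cover step
```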
---
### 2020-05-27 - Tentative agenda for Zoom meeting
Participants: Chris, David, Julian
Date & time: Thursday, May 28 14:00 – 15:00
Zoom link: David's
Scribe: David
#### Agenda
- Clarification from @dlebauer of requirements when using 3rd party container images ([name=julianp-diag])
- See [below](#2020-05-26---Clarification-of-requirements-when-using-3rd-party-container-images)
- Based on 'clarification of requirements' above, what is the benefit of running Docker from Docker which alternative options outlined below do not satisfy? ([name=julianp-diag])
- What is blocking Chris from implementing first iteration of pipeline as prototyped in proof-of-concept? ([name=julianp-diag])([name=schnaufer] The definition of releases - see below)
- ...
- ...
#### Concrete outcomes
- Agreement on what will be in scope for the implementation plan for this first iteration of the Makeflow + SCIF pipeline ([name=julianp-diag])
- update (Chris): using transformers as-is and opendronemap as a 'sibling container' which is the preferred alternative to issues w/ docker-in-docker
- Agreement on what will NOT be in scope for this first iteration of the Makeflow + SCIF pipeline ([name=julianp-diag])
- Start recording ideas for things which are NOT in scope, to evaluate and prioritize for future iterations ([name=julianp-diag])
- ...
- ...
#### Notes
- Chris: Re Docker in Docker, we are not using this but "sibling containers"
- Using sibling container for ODM
- Got it working end-to-end
- Julian: Asked David to clarify requirements regarding 3rd party container images
- Chris: These questions are already in the weeds
- Real questions we should be asking...
- What should be in the first iteration of pipeline?
- Example: Two core options, and an optional addon
- version 0.1:
- Begin with orthomosaic and a shape file and/or BETYdb
- All the way to a CSV file that contains canopy cover
- \+ documentation + tests
- version 0.2:
- Begin with drone captured images & GCP (Shapefile and/or BETYdb)
- All the way to a CSV file that contains canopy cover
- \+ documentation + tests
- Possible add on:
- Upload to BETYdb
- David: Talked to Chris, who is having trouble with the orthomosaic geographic accuracy using ODM
- There is a requirement for a GCP file with ODM to obtain geographic accuracy
- this is something that drone user needs to create
- is this necessary if someone uses a plot-defining interface like UASTools?
- If we start with Option 1, we can punt on ODM for now
- Chris: Wants to know what the _first_ deliverable is
- Chris: Has both Option 1 and Option 2 ready, barring the GCP issue
- Chris: Any objections to having Option 2 as first deliverable, and Option 1 as second deliverable?
- We can implement Option 1 as an app in the SCIF container & Option 2 as another app
- And individual steps in workflows remain as SCIF apps in same containers
- David: What's in the way of me running this on my laptop and sharing it with people?
- Chris: Copy workflow and remove ODM step
- Julian: Idea - naming convention?
- Apps which are steps: `example_step`, or `step_example`
- Apps which are workflows: `example_workflow`, or `workflow_example`
- David: Concern about how difficult it is for people to integrate their own algorithms
- Julian: We could parameterize the `makeflow` entrypoint so it can run any `.jx` file
- Each workflow can be its own recipe, which bundles a `.jx` file, and this workflow app can call the `makeflow` app with its own `.jx` file (see the recipe sketch at the end of these notes)
- Julian: "Clarification from @dlebauer of requirements when using 3rd party container images"
- David: Must be able to handle a Docker image as an algorithm issue
- David: Both option A and option B satisfy the requirements
- David: Hopefully this is a rare occurrence
- Chris: Option B makes some things simpler, but will make some things more complicated
-
- Finalized: Two scenarios for running the workflows
- In particular, when there are one or more algorithms which can only be consumed as separate container images
- A: Docker Siblings
- SCIF app in 'main' image which may call `docker run` on algorithm image
- B: Multi-image
- No SCIF app calls `docker run`
- Move container details to workflow engine level
- E.g. `makeflow --docker` or `makeflow --singularity`
- Container details also not coupled to `workflow.jx`, it's handled by workflow engine 'wrappers'
- Open question: SCIF installed on 3rd party image or not. Or just set the minimum environment variables necessary (e.g. SCIF_APPDATA)? The only reason is for .jx files to have consistency across steps.
- Chris: Current priority should be to make it as easy as possible to 'consume' (run) workflows. Next make it easy to add steps or customize existing workflows.
- Chris: Highest priority is to have something working. It doesn't matter so much if only _we_ feel the pain of integrating algorithms initially, as long as it is easy for the pipeline consumers to use.
- Julian: So if multiple options have similar/same 'ease of use', what do we use as criteria for choosing approach? Ease of algorithm integration? Conceptual simplicity? :shrug:
-
**Whiteboard**

- David: Question - AGISoft as container?
- Chris: yes.
- But AGISoft would be installed as a Python library within a SCIF app
- How to 'archive' version of AGISoft used as part of workflow?
- Chris: Provenance of AGISoft version, metadata, etc.
-
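A minimal sketch of the naming-convention and "workflow app calls makeflow with its own `.jx` file" ideas from the notes above. The app names, file names, and makeflow flags are assumptions for illustration, not the agreed design:

```
%apprun step_soilmask
    exec python "${SCIF_APPROOT}/soilmask.py" "$@"

%appfiles workflow_canopy_cover
    canopy_cover.jx

%apprun workflow_canopy_cover
    # A workflow app bundles its own .jx file and hands it to the workflow engine
    # (assumes %appfiles placed canopy_cover.jx in this app's root)
    exec makeflow --jx "${SCIF_APPROOT}/canopy_cover.jx" --jx-args "$1"
```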
---
### 2020-05-27 - More background on rationale for SCIF, and alternative ways to combine it with third-party container images
I chose SCIF in order to provide the following features:
1. A uniform interface to scientific code (entry points to replace our custom transformer boilerplate code)
2. Consistent environment variables across transformers which pipeline creators can rely on
3. Provide internal modularity and ability to install multiple transformers in one container, thereby being able to have all the steps for a pipeline packaged in one container image
I tried to get as close as possible to the functionality of the previous architecture, while also endeavoring to combine all the steps for the pipeline in one container.
After the demo of the proof-of-concept David said that it is only a 'nice to have' for all the steps of a pipeline to be in one container, but it _is_ a hard requirement to be able to integrate algorithms which are difficult to install, and for which working versions may only exist on old Docker images.
To make the life of 'Pipeline creators' easier, it is still valuable to provide features 1 and 2 above.
If for some reason it is not possible for a specific algorithm to be bundled with others, but there does exist a working Docker image, then SCIF can still help. Features 1 and 2 (uniform interface, consistent environment) can be provided in at least three ways:
A. Derived image with SCIF
B. Mount SCIF into container on startup
C. Custom wrapper script to provide a subset of SCIF functionality
#### A. Derived image with SCIF
Note: This is something which a person with the 'Transformer packager' role might do.
- Create a Dockerfile which uses the working Docker image as a base (`FROM`)
- In Dockerfile install SCIF
- Run `scif install` on a recipe file which sets up the Scientific Filesystem as a new layer
- Use this derived image when instantiating the container for a step in a pipeline
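A minimal sketch of such a Dockerfile, assuming a hypothetical published image `oldproject/algorithm:1.2`, a recipe file `algorithm_v1.2.scif`, and a base image that has `pip` available (otherwise the static Go binary mentioned elsewhere in this document could be copied in instead):

```dockerfile
# Start from the unmodified 3rd-party image that already contains the working algorithm
FROM oldproject/algorithm:1.2

# Install the SCIF client (Python package; a static binary would also work)
RUN pip install scif

# Add the SCIF recipe and build the /scif filesystem as a new layer
COPY algorithm_v1.2.scif /opt/algorithm_v1.2.scif
RUN scif install /opt/algorithm_v1.2.scif

# Expose the uniform SCIF interface as the container entry point
ENTRYPOINT ["scif"]
```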
#### B. Mount SCIF into container on startup
Note: Generating the SCIF boilerplate is something which a person with the 'Transformer packager' role might do.
One relatively simple and reliable way to do this (which will also provide maximum compatibility with other tools that use SCIF):
- Perform the steps for solution "A. Derived image with SCIF"
- Use the static binary version of SCIF (not the Python one)
- Export the layers which contain the `scif` binary and newly created SCIF file hierarchy into a local directory
- Commit this directory to source control as a transformer (aka boilerplate around a scientific algorithm), and possibly provide a versioned tar file for every release
When a person with the 'Pipeline creator' role wants to integrate such a transformer as a step into a pipeline:
- Get the transformer boilerplate layer (either as a tar file or by checking it out from a source control system)
- Mount the contents of the transformer boilerplate layer into the container on startup
- Use the SCIF entry point as the entry point for the `docker run` command (instead of the default one provided by image creator)
The above steps for the 'Pipeline creator' can be automated by a convenience script which the 'Transformer packager' can provide as part of the documentation in the SCIF transformer layer repository and tar file.
Yes, in essence this is a way to create a derived image, but dynamically.
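A hedged sketch of what the 'Pipeline creator' steps above might look like, assuming a transformer boilerplate layer exported to a hypothetical `/opt/transformer-algorithm` directory and the same hypothetical image as before:

```bash
# Inject a pre-built SCIF layer into an unmodified 3rd-party image at startup.
# - the first mount provides the exported /scif file hierarchy
# - the second mount provides the static `scif` binary
# - the entry point is overridden to use `scif` instead of the image default
docker run --rm \
    -v /opt/transformer-algorithm/scif:/scif \
    -v /opt/transformer-algorithm/bin/scif:/usr/local/bin/scif \
    -v "$PWD/data:/scif/data" \
    --entrypoint /usr/local/bin/scif \
    oldproject/algorithm:1.2 \
    run algorithm /scif/data/input.tif
```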
#### C. Custom wrapper script to provide a subset of SCIF functionality
Note: This is something which a person with the 'Pipeline creator' role might do.
Note: This does not involve installing the SCIF application into the container
- The script sets the minimum environment variables necessary for consistency with other transformers which _do_ leverage the standard way of using SCIF
- Some of these environment variables, example/default values, and descriptions:
- SCIF_BASE: /scif - the root location for SCIF
- SCIF_DATA: /scif/data - the root location for apps data
- SCIF_MESSAGELEVEL: INFO - a client level of verbosity. Must be one of CRITICAL, ABORT, ERROR, WARNING, LOG, INFO, QUIET, VERBOSE, DEBUG
- SCIF_APPNAME: example - the active software app
- SCIF_APPDATA: /scif/data/example - the data root for the active software app
- SCIF_APPROOT: /scif/apps/example - the install root for the active software app
- SCIF_APPENV: /scif/apps/example/scif/environment.sh - a shell script to source for the software app environment
- These and other environment variables described here: <https://sci-f.github.io/spec-v1#environment-namespace>
- The remainder of the script depends on the specific workflow system and/or containerization technology that the pipeline creator is using
- E.g. the script could call `docker run` with the image and tag, passing in the environment variables above using `-e`
Note: Despite SCIF not being installed on the container, setting these environment variables will provide a uniform experience for a 'Pipeline creator' who wants to combine a third-party container image as a step in a pipeline alongside standard SCIF transformers.
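A hedged sketch of such a wrapper script. The environment variables are the ones listed above; the image name and app name ("example") are placeholders, and the mount point and arguments are assumptions:

```bash
#!/usr/bin/env bash
# Wrapper providing a subset of SCIF's conventions without installing SCIF
# in the 3rd-party container.
set -euo pipefail

APP_NAME=example
IMAGE=oldproject/algorithm:1.2

# Minimum environment variables for consistency with standard SCIF transformers
export SCIF_BASE=/scif
export SCIF_DATA=/scif/data
export SCIF_MESSAGELEVEL=INFO
export SCIF_APPNAME="${APP_NAME}"
export SCIF_APPDATA="/scif/data/${APP_NAME}"
export SCIF_APPROOT="/scif/apps/${APP_NAME}"

# Hand the environment to the unmodified 3rd-party container
docker run --rm \
    -e SCIF_BASE -e SCIF_DATA -e SCIF_MESSAGELEVEL \
    -e SCIF_APPNAME -e SCIF_APPDATA -e SCIF_APPROOT \
    -v "$PWD/data:${SCIF_APPDATA}" \
    "${IMAGE}" "$@"
```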
---
### 2020-05-26 - Clarification of requirements when using 3rd party container images
@dlebauer, could you please clarify whether one of your requirements for the pipeline is:
**Option A:**
- To pull, at runtime, original, unmodified Docker images (specifically images published/maintained by 3rd parties) from DockerHub
- The workflow engine/workers create containers directly from these images
- The workflow engine/workers can't make any assumptions about file system organization, entry point, or other metadata inside the container - just whatever the container author provided
- Nothing is added at container startup other than possibly setting environment variables and mounting volumes
As opposed to:
**Option B:**
- Creating a Dockerfile which specifies a published Docker image (as above) in the `FROM` line
- Adding a layer to the Dockerfile to provide a standardized entry point, and conveniences like metadata for introspection, testing, etc.
- Otherwise the Dockerfile leaves the published image as is
If you have a hard requirement to use unmodified images (Option A) then SCIF (as normally used) may not be a good fit.
Note: The previous design for transformers does not satisfy this requirement (Option A) either. That design also leverages a Dockerfile to build on top of published Docker images like OpenDroneMap. It adds layers which contain files like:
- `configuration.py`
- `entrypoint.py`
- `transformer.py`
- `transformer_class.py`
- `worker.py`
- `packages.txt`
- `settings.yaml`
It then installs various packages specified in `packages.txt`.
See: <https://github.com/AgPipeline/transformer-opendronemap/tree/2ff5943af6546d126f1128cea47a291157e42b44>
The only difference between the transformers in the proposed design below (SCIF and Makeflow multi-image pipelines) and the previous design when building on published 3rd party images like OpenDroneMap is that I use SCIF to replace all the custom transformer boilerplate above. I've replaced all the files in the transformer-opendronemap repository with a SCIF recipe. Otherwise we can use the OpenDroneMap image as published:
<https://github.com/julianpistorius/drone-makeflow/blob/765ad1a2d28a30a21937a48673e4a88aff058722/scif_app_recipes/opendronemap_v0.9.1_ubuntu16.04.scif>
P.S. In the case of transformers [like soil mask](https://github.com/julianpistorius/transformer-soilmask/tree/scif) and other algorithms where we don't build on top of published 3rd party images:
1. I'm using SCIF to replace all the boilerplate
2. The remaining Python code is mainly science code
3. SCIF makes it possible to compose these transformers into a single container image
P.P.S. If Option A is a hard requirement, but we still want to use SCIF if possible, there may be a way to 'inject' the SCIF environment into an unmodified third-party container at startup using volume/bind mounts and changing the entry point dynamically. It's a bit more complexity, but not nearly as much as running Docker from Docker.
Update: See [More background on rationale for SCIF, and alternative ways to combine it with third-party container images](#2020-05-27---More-background-on-rationale-for-SCIF-and-alternative-ways-to-combine-it-with-third-party-container-images) above
### 2020-05-22 - Julian
#### Summary of Chris & David's concerns
- Adding transformers based on hard-to-build algorithms/apps which _do_ have a working container image
- Chris: *~~"(W)hat do you think is the best approach is for isolating "old" solutions in the SciF approach?"~~*
> [name=schnaufer] [time=Tue, May 26,2020 10:20] Why not use an industry supported solution that involves less support by our limited resources? This ties into the question of: "why build it when we can just use it"
- David: "If someone provides a docker container - for a non-trivial build,
is it necessary to re-construct a new image or can't we just use what is
available? (this seems the point of docker images ...)"
#### Proposed solution: Multi-image pipelines
(DRAFT)
##### Packaging pipeline steps in the ideal/easy cases
- Install all steps as SCIF recipes in one container image, same as now
##### Packaging pipeline steps in a 'tricky' case
- Pipeline steps will be installed as SCIF recipes in more than one container image (as few as possible)
- Create a Dockerfile that uses the existing (ancient) Docker image as a base for a new image
- Add a layer to this new image that contains the SCIF application (`pip install scif`, or a [static binary using Go](https://github.com/sci-f/scif-go)) and the SCIF recipe file
- Run `scif install` with the recipe file to set up the SCIF filesystem and an entry point that wraps the ancient image's entry point
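Once built, the derived image can be driven entirely through the standard SCIF client. A sketch, assuming the image tag and app name below (both placeholders) and that the image's entry point is `scif`:

```bash
# Build the derived image from the Dockerfile + recipe described above
docker build -t agpipeline/odm-scif:0.9.1 .

# Discover, test, and run the SCIF apps it exposes (app name is illustrative)
docker run --rm agpipeline/odm-scif:0.9.1 apps
docker run --rm agpipeline/odm-scif:0.9.1 test odm
docker run --rm -v "$PWD/images:/scif/data/odm" agpipeline/odm-scif:0.9.1 run odm
```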
> ##### Packaging using Docker images for certain processing steps
> * Install all SCIF recipes in one image
> * The workflow "documents" the version of the docker image to use as part of the workflow
> [name=schnaufer] [time=Tue, May 26,2020 10:21]
> Re 'documents version of docker image: Yep. See below.
> Re 'installing all SCIF recipes in one image: Not sure how you would like to handle hard-to-build algorithms which may only provide an ancient Docker image, while also using SCIF recipes. SCIF by definition/convention works inside a container (i.e. don't use it to run Docker commands using docker-in-docker, etc.)
> [name=julianp-diag] [time=Tue, May 26,2020 10:23]
> @julianp-diag
> * Please provide a link to "don't use to run Docker commands using docker-in-docker"
> * Please expand on etc.
> [name=schnaufer] [time=Tue, May 26,2020 10:21]
> @schnaufer: See <https://sci-f.github.io/#how-does-scif-related-to-containers>
>
> "Although scif is not exclusively for containers, in that **a container can provide an encapsulated, reproducible environment**, **the scientific filesystem works optimally when contained**. Containers traditionally have one entrypoint, one environment context, and one set of labels to describe it. A container created with a Scientific Filesystem can expose multiple entry points, each that includes its own environment, metadata, installation steps, tests, files, and a primary executable script. SCIF thus brings _internal modularity_ and _programatic accessibility_ **to encapsulated, reproducible environments**"
>
> To extract main points:
> - "A container can provide an encapsulated, reproducible environment"
> - "The scientific filesystem works optimally when contained"
> - SCIF brings '**internal modularity** and programatic accessibility' to (containers)
>
> My conclusion/understanding from my reading and experience:
> - SCIF is not meant to live outside the encapsulated, reproducible environment (container) for which it is providing modularity and programatic accessibility
>
> I can ask @vsoch (SCIF author) to confirm.
>
> [name=julianp-diag] [time=Tue, May 26, 2020 10:55 AM]
> Also: <https://sci-f.github.io/goals>
>
> "At it’s core, the scientific filesystem is a simple description of how to organize software and metadata on a filesystem for discoverability. This description encompasses a filesystem structure to ensure that scientific software is distinct from standard software on the host, and is interacted with by way of a set of environment variables and functions to expose the structure to the user. In that containers provide encapsulated, reproducible environments, SCIF works optimally when installed and used within a container"
>
> [name=julianp-diag] [time=Tue, May 26, 2020 10:58 AM]
> @julianp-diag new comments
> Broadly, I don't see how the above links answer my concerns above https://hackmd.io/KJ2bgG9_RRCY3jr8jCRlsw?both#Summary-of-Chris-amp-David%E2%80%99s-concerns
>
> Specifically:
> 1. I agree with your comment: "SCIF is not meant to live outside the encapsulated, reproducible environment (container) for which it is providing modularity and programatic accessibility".
> 2. I don't read the first link the way you do and I'm not seeing anything that says "don't use docker in docker"
> 3. The second link to the philosophy is very much tied into why they recommend running scif in a Docker container. I don't see how this precludes docker-in-docker.
> 4. It looks to me as if having a SciF command which consistently runs a specific docker container through its interface and follows the rules is allowed
> [name=schnaufer] [time=Tue, May 26,2020 11:15]
> You are correct, it is _technically_ possible.
> I don't think this is the way SCIF is meant to work.
> Most importantly I think running Docker from Docker as part of our workflow will introduce a lot of complexity.
>
> [name=julianp-diag] [time=Tue, May 26, 2020 11:48]
> Proposal: Let's try both ways for a while, and see what we like the best.
>
> [name=julianp-diag] [time=Tue, May 26, 2020 12:03]
> Ah, I forgot to mention that using Docker-in-Docker will rule out a few things:
>
> - Docker on Mac and Windows
> - Kubernetes
> - HPC
>
> There are also some nice things we get for free from using SCIF in the standard way that we would not get if we used it in this 'indirect' way. I will make a list.
>
> [name=julianp-diag] [time=Tue, May 26, 2020 13:10]
> I should have started out with this, but I accept the proposal to try it out. <checkmark> (I don't know how to add a real one)
> [name=schnaufer] [time=Tue, May 26,2020 13:45]
> @julianp-diag I am unaware of the HPC exclusion, can you elaborate? What prevents it from working?
> [name=schnaufer] [time=Tue, May 26,2020 13:26]
> * I'm looking forward to the list
> * docker-in-docker on Mac works - just tried it.
> * I don't have access to Kubernetes, but from what I can find it looks do-able (https://www.google.com/search?q=Kubernetes+%22docker+in+docker%22&rlz=1C5CHFA_enUS825US825&oq=Kubernetes+%22docker+in+docker%22&aqs=chrome..69i57j0l7.7153j0j4&sourceid=chrome&ie=UTF-8)
> * I'm not going to try Windows since it's not that Docker friendly to begin with
> [name=schnaufer] [time=Tue, May 26,2020 13:39]
> Re HPC: You can't run Docker in HPC, so you have to use Singularity. I don't think you can run Singularity in Singularity
> [name=julianp-diag] [time=Tue, May 26, 2020 16:10 PM]
##### Creating pipelines in all cases
- Run Makeflow and WorkQueue outside of the container
- Modify `workflow.jx` to use Makeflow [wrapper commands](https://cctools.readthedocs.io/en/latest/makeflow/#wrapper-commands) to run `docker run` or `singularity run`
- Include Docker/Singularity image name and version for step/steps as part of configuration (`jx-args.json`)
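A minimal sketch of this pattern. The step command, image name, and file names are placeholders, and the exact `jx-args` file format should be checked against the Makeflow JX documentation; as noted in the meeting notes above, `makeflow --docker`/`--singularity` wrappers are the alternative that keeps container commands out of `workflow.jx` entirely.

A hypothetical `workflow.jx`:

```
{
  "rules": [
    {
      "command": "docker run --rm -v $(pwd):/data " + STEP_IMAGE + " run soilmask /data/" + ORTHOMOSAIC + " /data/masked.tif",
      "inputs": [ ORTHOMOSAIC ],
      "outputs": [ "masked.tif" ]
    }
  ]
}
```

A hypothetical `jx-args.json` carrying the container details and inputs:

```
{
  "STEP_IMAGE": "agpipeline/soilmask-scif:1.0",
  "ORTHOMOSAIC": "orthomosaic.tif"
}
```

Run from outside any container:

```bash
makeflow --jx workflow.jx --jx-args jx-args.json
```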
##### Pros
- Simple for transformer packager to package
- Simple for pipeline developer to integrate
- If we leave Makeflow and WorkQueue out of the images entirely we can be 'less opinionated' about the workflow engine, and can show a simple, non-idempotent bash script which runs the steps one by one for somebody who wants to port it to another workflow engine
##### Cons
- In some cases there will no longer be a single container image for all the steps in a pipeline
- A single artifact for a multi-image pipeline would have to be a tar file that contains multiple image archives as well as the workflow description (`workflow.jx`)
- For maximum reproducibility we would have to make sure that the hypothetical ancient image is locked down (including layers further down, see Rocker and Jupyter Docker Stacks), or always use a tar export of these kinds of images
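A hedged sketch of how such a single artifact might be assembled; the image names and file names are placeholders:

```bash
# Bundle a multi-image pipeline into one distributable artifact.
mkdir -p images
docker save -o images/steps-scif.tar agpipeline/steps-scif:1.0
docker save -o images/odm-ancient.tar oldproject/odm:0.4

tar -czf canopy_cover_pipeline_v1.0.tar.gz \
    images/ \
    workflow.jx \
    jx-args.json
```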
#### Personas/roles
##### Role: Algorithm creator
- Likely a domain scientist, and not a trained software engineer
- Likely uses Jupyter, RStudio for development
- May use Conda environments during development
- May have simple test cases and test data to test correctness of the algorithm ~~(hopefully!)~~
> [name=schnaufer] [time=Tue, May 26,2020 10:20] I'm not sure "hopefully" should be part of the persona
- May have a public code repository with source code, installation scripts, and tests
- May publish algorithms as packages on PyPI, Conda, CRAN, etc.
- May publish a container image with an entrypoint to execute the algorithm
- Should not have to care about packaging into transformers
- Should not have to care about workflow engines
##### Role: Transformer packager
- They wrap scientific algorithms to present a uniform, standardized interface
- They may be a domain scientist with some computing skills
- We assume they are somewhat comfortable with using and creating container recipes (at least Docker, ideally also Singularity)
- They may be a junior software engineer with no domain science knowledge
- The goal of turning an algorithm into a transformer is to make it easy for consumers of algorithms to have a consistent, predictable, uniform way to:
- Discover
- Inspect
- Test
- Execute
- Provide input
- Receive output
- Should not have to know details of underlying algorithms
- Should be easy and quick for packagers to wrap an algorithm
- The packaging process should include a mechanism to automatically update, test and publish new versions of a transformer when a new version of the underlying algorithm is published (Continuous Analysis/Continuous Deployment)
- If the algorithm creator supplied tests and test data, these should be included in the transformer tests
- Maintenance of an existing transformer should ideally require zero, or at most very minimal ongoing effort
- Should not package it in a way that restricts a transformer to working on specific workflow engines
- Should not package it in a way that restricts a transformer to working on a specific container format (Docker, Singularity, etc.)
- We assume they are somewhat comfortable with the Unix commandline interface
##### Role: Algorithm/Transformer consumers
- They consume algorithms packaged as transformers
- They may be a domain scientist with some computing skills
- They may be a junior software engineer with no domain science knowledge
- They should be able to easily discover, inspect, test, and execute an algorithm
- They should be able to easily validate that an algorithm is correct (based on assumptions and understanding of original algorithm creator)
- Their experience should be uniform and predictable if they have previously used other algorithms packaged as transformers
- They may want to execute a transformer on different platforms:
- Laptop
- HPC
- Lab server
- Virtual machines (also cloud computing)
- Interactive workbench environments (Binder, CyVerse VICE, Code Ocean)
- They may not be familiar or comfortable with more than the basics of using containerization technology (Docker or Singularity)
- They may not be very comfortable with the Unix commandline interface
##### Role: Pipeline creator
- Subset of 'Transformer consumer' role
- They combine algorithms into computational pipelines
- They publish pipelines for consumption by pipeline consumers
- They may publish pipelines on GitHub
- They may publish pipelines on platforms that allow pipeline consumers to run pipelines using their own data (Science Gateways, Galaxy, CyVerse Nafigos, etc.)
- They may use any workflow engine (e.g. Make, Snakemake, doit, Makeflow, Nextflow, CWL Runner, etc.)
- We assume they are comfortable with the mechanism and syntax for creating computational pipelines using their preferred workflow engine
- We assume they are somewhat familiar and comfortable with containerization technology (Docker or Singularity), especially as leveraged by their preferred workflow engine
- We assume they are somewhat comfortable with the Unix commandline interface
##### Role: Pipeline consumer
- They consume pipelines of algorithms packaged as transformers
- They may be a domain scientist with basic computing skills
- They should be able to easily validate that a pipeline is correct (based on assumptions and understanding of original pipeline creator)
- They should be able to easily discover, inspect, test, and execute individual steps in a pipeline
- Their experience should be uniform and predictable if they have previously used other pipelines composed out of transformers
- They may want to execute a pipeline on different platforms:
- Laptop
- HPC
- Lab server
- Virtual machines (also cloud computing)
- Interactive workbench environments (Binder, CyVerse VICE, Code Ocean)
- Platforms that allow unsophisticated pipeline consumers to run pipelines using their own data (Science Gateways, Galaxy, CyVerse Nafigos, etc.)
- Clusters (K8s, Nomad, etc.)
- They may have little or no familiarity with containerization technology (Docker or Singularity)
- They may not know the syntax or mechanism for constructing computational workflows for workflow engines
- They may not be familiar with the Unix commandline interface at all
### Meta
- [ ] Ground rules
- Time box
- Scope
- Tie-breaker
## Agenda
1. Extend existing prototype to plot clipping using unaltered code from [drone repo](https://github.com/AgPipeline/drone-pipeline-environment/tree/master/transformer-plotclip) to help in determining next steps
* Have it almost working
* Concerned that the way SCIF calls processes will cause jobs to hang due to large volumes of generated screen output (everything showing up at once suggests the caller stops and waits for the process instead of proactively reading and printing its output; this can cause apps to block when/if the pipe fills up)
* Especially ODM, which produces a lot of data
* Clarification: The pipe will not fill up. Makeflow is buffering the output of the individual steps, because there can be multiple steps running concurrently, and interleaved step outputs would not be very friendly/useful.
* Potential solution: [`makeflow_monitor`](https://cctools.readthedocs.io/en/latest/makeflow/#monitoring)
2. Do we want to re-design Transformers? (Yes/No) If yes, how?
* Third option: There are no transformers anymore
* Only scientific packages (see soil mask & ODM for examples) which may (soil mask) or may not (ODM) live in our organization
* Using SCIF recipes to turn 'scientific packages' into 'scientific apps' (see the recipe sketch at the end of this agenda). SCIF recipes are responsible for:
* Installing & configuring scientific packages
* Standard locations for input & output data
* Entry points to run scientific apps (`%apprun`)
* Running tests (`%apptest`)
* Metadata
* Help
* strip out all extra code / dependencies on clowder, geostreams, terrautils, bety; re-add as needed using independent modules
* alpha release: only dependency is on filesystem - no APIs or databases
* beta: add options for BETYdb, clowder, calliope based on end-user demand / motivated by specific use cases
* Make python libraries for base code
* Remove layered docker images
* Cookie cutter recipes for algorithms
* need to define clear use cases for these - what do these add beyond the python function capabilities? (one answer: guiding users to define metadata, tests, documentation)
* AKA Pit of Success :tm:
* <https://drivendata.github.io/cookiecutter-data-science/>
> We're not talking about bikeshedding the indentation aesthetics or pedantic formatting standards — ultimately, data science code quality is about correctness and reproducibility.
* No intermediate steps for users
* [This is the MVP (Minimal Viable Product) specification](https://hackmd.io/mSYV8WPsQ-GSsg84rCfxmA) that I put together for Julian
3. How to distribute workload from SciF solution?
* Makeflow already handles this
4. What are the next development steps for pipeline processing solution?
* Complete canopy cover workflow ~~(make distributable? Already done)~~
* Add SciF & makeflow for IR and Lidar? (do afterwards)
* these are lower priority than RGB - perhaps for Beta release
* Optional plot shape sources: shapefile, ~~BETYdb (wkt)~~, databases, **GeoJson** ([if SRID mandatory](https://gis.stackexchange.com/questions/297139/geojson-coordinate-projection)), ~~BRAPI (geojson)~~, others?
* see web UI for end-user plot specification --> generate polygons and make available on disk as a file; later insert into BETYdb
* focus on supporting a single OGC compliant format (maybe geopackage or geojson?)
* Then create tools for converting from shapefiles, BETYdb (WKT from the API which I think is also OGC, or we can extend brapi to provide geojson) and other files (csv, excel) as indicated by end users
6. Add Web UI to allow plot detection when missing GCP? Beta release?
* This is only required / provides an option if valid GCP files and shapefiles are not provided
* Present OM and allow user to draw rectangle
* Use BETYdb/Shapefile/spreadsheet/databases to inform what plots are named and shapes
* Review how [plotshpcreate](https://github.com/andersst91/UAStools/wiki/plotshpcreate.R) and [Fieldimager](https://github.com/OpenDroneMap/FIELDimageR) approach this **with an eye on using them**
* beta release - develop shiny app for one or both of these tools, based on experience using them: also integrate into Makeflow workflow
7. Review and update timeline, releases, and features
* https://app.zenhub.com/workspaces/drone-processing-pipeline-5e7e97f39771620e1b5a8893/roadmap
* iRods (other File Store) support: Keep towards end of Beta
* has CyVerse irods been updated so that it can use s3 API? Answer: No
* MVP 1.0: <https://github.com/julianpistorius/drone-makeflow/tree/scif>
* `wget -O test/testimages.tar https://de.cyverse.org/dl/d/84A57A62-B6EB-4826-ADC4-337D4A0ABBEA/images.tar`
* MVP 2.0: <https://github.com/bioteam/minio-irods-gateway>
* MVP 2.1: WebDAV - CyVerse IRODS supports it now
* Makeflow has 'remote files' support:
* <https://cctools.readthedocs.io/en/latest/makeflow/#mounting-remote-files>
* This means we can use standard iCommands
* Another file-management option:
* <https://www.anaconda.com/accessing-remote-data-generalized-file-system/>
* <https://www.anaconda.com/fsspec-remote-caching/>
* <https://filesystem-spec.readthedocs.io/>
* start/stop/defining workflow API: *Post-beta* for now
* Jupyter Notebook for developing algorithms & creating Docker SCIF images: **start with Scientific Package development**
* Use cases:
* Running custom pipelines
* Monitoring progress
* Developing pipelines
* Developing scientific packages
* Standalone (see <https://github.com/drivendata/cookiecutter-data-science>)
* In the context of a custom pipeline
* You can package Jupyter Notebooks at three levels:
* Pipeline
* Scientific app (using SCIF recipe, and add a `<app_name>-notebook` entry point)
* Scientific package
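The recipe sketch referenced in agenda item 2 above: a SCIF recipe can carry the install, run, test, and help sections for a scientific package. A minimal sketch; the app name, package name, and commands are hypothetical:

```
%appinstall canopy_cover
    # hypothetical scientific package published by the algorithm creator
    pip install agpipeline-canopy-cover

%apprun canopy_cover
    exec canopy-cover "$@"

%apptest canopy_cover
    exec canopy-cover --self-test

%apphelp canopy_cover
    Compute plot-level canopy cover from an orthomosaic and plot geometries.
```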

### Example GCP file
WGS84 UTM 12N
605599.527 3030095.289 0 4525 1389 DJI_0384.JPG
605599.527 3030095.289 0 3918 1389 DJI_0385.JPG
605599.527 3030095.289 0 3645 1402 DJI_0386.JPG
605599.527 3030095.289 0 2510 118 DJI_0418.JPG
605599.527 3030095.289 0 2663 779 DJI_0419.JPG
605591.724 3030065.035 0 1086 2934 DJI_0427.JPG
605591.724 3030065.035 0 924 2310 DJI_0428.JPG
605591.724 3030065.035 0 795 1647 DJI_0429.JPG
605591.724 3030065.035 0 618 1021 DJI_0430.JPG
605591.724 3030065.035 0 500 364 DJI_0431.JPG
605609.420 3030065.288 0 3935 2975 DJI_0426.JPG
605609.420 3030065.288 0 3760 2307 DJI_0427.JPG
605609.420 3030065.288 0 3638 1669 DJI_0428.JPG
605609.420 3030065.288 0 3465 1004 DJI_0429.JPG
605609.420 3030065.288 0 3318 356 DJI_0430.JPG
605601.078 3030034.951 0 2750 289 DJI_0407.JPG
605601.078 3030034.951 0 2894 957 DJI_0408.JPG
605601.078 3030034.951 0 3066 1604 DJI_0409.JPG
605601.078 3030034.951 0 3219 2283 DJI_0410.JPG
605601.078 3030034.951 0 3375 2956 DJI_0411.JPG