---
tags: Galaxy
---
# UPDATED ROADMAP FOR 2023
The Galaxy project has been recognised by major global projects in recent years. It has proliferated into multiple new directions.
- **Positive**:
+ The working groups seem to work well, and major changes could be implemented
+ The various UI efforts in the last year demonstrated the real potential of our community and team
+ The project and the community are able to quickly adapt to new scientific use-cases and software development initiatives (e.g., VGP, Pha4ge ...)
+ It is a global effort with true community support and major funding for main instances
- **Negative**:
+ The team does not grow at the same speed. Having money does not make it easy to hire, and certain grant aims are lagging behind
+ The WGs should be a more central point for new contributors
+ We have a capacity problem in our admin community
The following is a list of priorities grouped by WGs, listed in the order of importance.
## Additional (not necessarily short-term points from JohnC)
- Visualizing data flow
- Caching jobs
- Spin your own Galaxy
- Seamlessly connecting Galaxy workflows to cloud resources
- Workflow change logs
- Packaging frontend and backend for reuse
- Parameter sweeping
- Tool sweeping
- Building cohorts over time
- Building cohorts for searching datasets
- [Histories as interactive notebooks](https://github.com/galaxyproject/galaxy/issues/8584)
- Sharing across Galaxies
- Collaborations on Galaxy
- One click analyses
- Data ingress and processing wizards
Cohort building = collection-like entities that can grow over time. There should be fixable data ingress protocols defined in Galaxy and shown as part of the "collection", workflows that are automatically run on new data, and abstractions and UI elements for aggregating and summarizing the data. It seems to me that in a lot of these big projects data comes in over time, and Galaxy doesn't have any way of connecting this data and growing an analysis fluidly over time.
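The cohort idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Cohort` class, its `ingest` method, and the `on_new_data` callbacks (standing in for "workflows that are automatically run on new data") are hypothetical names and are not part of Galaxy's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Cohort:
    """Hypothetical collection-like entity that can grow over time."""

    name: str
    datasets: List[str] = field(default_factory=list)
    # Callbacks standing in for workflows that run automatically on new data.
    on_new_data: List[Callable[[str], str]] = field(default_factory=list)
    results: List[str] = field(default_factory=list)

    def ingest(self, dataset_id: str) -> None:
        """Add one dataset and run every registered workflow on it."""
        if dataset_id in self.datasets:
            return  # ignore repeats so re-ingesting the same data is harmless
        self.datasets.append(dataset_id)
        for workflow in self.on_new_data:
            self.results.append(workflow(dataset_id))

    def summary(self) -> Dict[str, object]:
        """Aggregate view that a UI element could display."""
        return {
            "name": self.name,
            "datasets": len(self.datasets),
            "results": len(self.results),
        }


# Data arrives in batches over time; the analysis grows with it.
cohort = Cohort(name="example-cohort")
cohort.on_new_data.append(lambda ds: f"variants({ds})")
for batch in ["batch-1", "batch-2", "batch-1"]:
    cohort.ingest(batch)
print(cohort.summary())
```

The point of the sketch is the coupling between ingress and analysis: adding data is the only trigger a user needs, and the summary always reflects everything seen so far.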
## UI/UX
1. Underlying infrastructure
- Vue as the sole framework in primary app
- Zero Backbone, Zero jQuery, single entry
- Grids, Upload, FormElements
- Rule Builder UX+Refactor
- Testing for Accessibility
2. History
- Graph view (<kbd>BACKEND</kbd>)
- Architecture: new scroller
- "Jump-to" bookmark
- Collection-level versus dataset-level behavior
3. Activity Bar
4. Notification framework (<kbd>BACKEND</kbd>)
- A notification page
- Badge on the activity bar
5. Visualisation
- IGV.js replacing Trackster (with other options to come, including JBrowse, etc)
- Split from client build, API-driven registry & build management (<kbd>BACKEND</kbd>)
- ITs
6. Dataset-view: tabbed interface in the middle pane
- Comprehensive component for displaying all of a Dataset’s related sub-interfaces (display, viz, edit, info, etc.)
7. Dataset-management related features (*not likely for this round*) (<kbd>BACKEND</kbd>)
- Design a good UX for Scratch history / History archival
8. UI simplification: One button-type analyses
- Select a well developed workflow and prototype a "one-click" type of analysis
9. Paper
## Backend
1. Underlying infrastructure
- address limitations in the task execution framework
- SQLAlchemy 2.0
- FastAPI
- Port (and document) more workflow APIs to FastAPI
2. Assist System WG (<kbd>SYSTEMS</kbd>)
- IDC
- Pulsar hardening
- Metascheduling
- switching .eu and .org to TPV
- deciding on how to emit Pulsar state information to implement metascheduling
- creating one big pulsar network for usegalaxy.*? (failover, more resilient usegalaxy.* services)
- Get the new ToolShed deployed
3. User based object store
- External stores
- Scratch (see https://github.com/galaxyproject/galaxy/pull/14073)
4. federated and data-local computing on commercial cloud(s):
- From .org, run an analysis on AWS/GCP that processes data on AWS/GCP and stores results on AWS/GCP (which requires user-based object store?)
5. Merge and harden the Tool Shed replacement (<kbd>SYSTEMS</kbd>)
6. Push ITs to be considered "stable" (Tool Shed ready)
- What is missing?
## Testing & Hardening
1. Support other WG and new contributors to write tests
- Expand testing tutorial
2. Ongoing work on testing infrastructure with a focus on deployment tests
3. Upgrade tests, test infrastructure for database access, add documentation
4. Systematic improvement of test coverage:
- Prioritize features that lack test coverage, are critical, and are known to break
- Improve documentation on Galaxy's testing utilities: help write more/better tests using existing infrastructure
5. Talk and write more about the testing efforts in Galaxy
## Tools / WFs
1. Improve subworkflow maintenance user story:
- Replace / upgrade subworkflow, keep connections (as far as possible)
- Workflows as trees
- Link child and derived workflows back to parent workflows
2. Execution of workflow and tool tests using embedded Pulsar by default
- Harden Pulsar support; less work for admins to route tools to non-pulsar destinations
- Support for sending steps that require large resources to external TES server
3. Improve support for job caching framework (<kbd>Backend</kbd>)
4. Step javascript expressions (<kbd>UI/UX</kbd>)
5. A website for IWC workflows
6. Workflow development
- Workflow editor improvements: making workflow elements selectable, copyable, and pastable
- IWC procedure for workflow submission will benefit from simplification
- Versioning
- Named versions
- A nice UI for going back or displaying differences
7. High-importance tools & workflows
- Machine learning
- Genome assembly/Long-read analysis
- Spatial analyses
8. Executable Workflow Editor Tour(s) and tutorials
9. Standalone workflow graph view (builds on reactive workflow editor work, with UI/UX)
- For Static Page, Progress View, Pages / Workflow Reports
- Entry in galaxy-hub for every (new or updated) workflow
10. Schema for job and test definitions (23.2)
- Make it easier and faster to write and validate tests and jobs
:::danger
**US PROBLEM**: Keeping old tools up-to-date!
:::
## Systems
1. Make VGP workflows available on the 3 big usegalaxy.* instances
2. Deployment of iRODS on .org
3. Evaluate/collect all hacks that are currently used to keep usegalaxy.* working. Talk to WGs to get it fixed, or make plans to improve the situation
- Potential candidates:
- Fix toolbox handling
- Data-managers
- Better error reporting
4. Reference data handling
## GOATS
1. Provide better support for the GTN & outreach
- Hire & onboard a communications specialist (PIs)
- Editing GTN
- Search and apply for GTN related funding
2. GTN infrastructure help
3. Grow & diversify the [Galaxy Event Horizon](https://galaxyproject.org/events/) to reach wider audiences
- Actively encourage Galaxy team members to present at conferences & locally
- Work with the community members to help them publish & publicize their work
## Cancer Applications (ITCR)
- Scientific Goals
- Multimodal and spatial analysis of molecular tumor datasets to understand how tumors adapt to therapy and identify therapeutic opportunities
- Predictive analysis based on machine learning. Examples: (1) use RNA-seq to predict response to therapy and (2) predict the % of cells that are proliferating in a tumor from a histopathology image.
- Analysis of large cancer datasets located on AWS/GCP/Azure
- Key Galaxy needs
- Simpler user interface and Activities
- Better tabular dataset support and inline text editing
- Data-local processing on AWS/GCP from a public server: (1) enable Pulsar on AWS/GCP/Azure and tackle billing questions and (2) user-based object storage so everything is stored on AWS/GCP/Azure
- New and simpler machine learning tool suite and a model repository
- Workflows, examples, and tutorials for (1) multiplexed tissue imaging and (2) machine learning. IWC?
## Human Genetics (AnVIL)
- Key goals
- Provide Galaxy users with an accessible Galaxy platform to operate on protected and private data
- An integration platform for GA4GH APIs
- Support large scale analyses
- Key Galaxy needs
- Read-optimized copy of the usegalaxy database
- Improve results & consistency of containerized tool tests, extending to end-to-end workflows
- Ability & documentation to run jobs with input data in a bucket and the results get deposited in a bucket
- Optimize startup
- Speed up the startup; reduce startup deps & services
- Standalone, static client
- Continue adding support for interfacing with GA4GH APIs, primarily Passports, TES, and WES
:::danger
Old text from previous years
:::
# Galaxy: The next 6-12 months
> Updated Jan 31, 2023
:::info
This document will be posted to the Galaxy Hub after working groups meeting on February 23, 2023.
:::
## Priorities by Working Groups
The current composition of working groups can be found [here](https://docs.google.com/spreadsheets/d/1CWUpoxyMQ1KU8eb8G7XEnP8AmYCgRtCBSuWjCKiHJBI/edit#gid=0).
## UI/UX
- Remove old cruft and harmonize grids
- Visualization / ITs framework polishing
- Progressing along the [history roadmap](https://hackmd.io/3Ib2vyGzSWGsDK2kDGR0Cw)
- UI for scratch space and storage selection
- Deeper integration of GTN into Galaxy
- left side panel tutorial search
- Better visualization of tabular datasets
- Better inline text editing of datasets
- ==Question: what is the role/future of scratchbook?==
## Backend
- Refactoring the Galaxy object store code
- finalizing iRODS for ORG, S3 ...
- data streaming API without data caching
- User based object store
- external stores
- scratch (see https://github.com/galaxyproject/galaxy/pull/14073)
- ==needs (Jeremy, we need input here): remote data and handling intermediate data. Can be used as a driving example==
- Multi-user instance with sensitive (encrypted) data
- Pulsar:
- maintenance
- metascheduling (TPV and CERN)
- switching .eu and .org to TPV
- investigating meta scheduling solutions
- deciding on how to emit Pulsar state information to implement metascheduling
- creating "Pulsar cloud" if we want to create a freely accessible Pulsar network for any platform (not just Galaxy)
![](https://i.imgur.com/YzLeHSq.png)
<small>A global pulsar "cloud". Here green arrows represent Galaxy use while blue show potential use by other systems. By opening the network to communities beyond Galaxy we will again solidify our lead in this space.</small>
## Tools / WFs
- Making toolpanel more usable
- finishing tool search
- depending on how well tool search works, we may or may not invest more time into tool ontology (and tagging)
- Workflow development
- Workflow editor improvements: making workflow elements selectable, copyable and pastable
- IWC procedure for workflow submission will benefit from simplification
- Versioning
- named version
- nice UI for going back or displaying differences
## Toolshed
- ==Toolshed evolution (need input from John)==
## Deployment (systems WG)
- Deployment simplification - building up capacity and making Admin life easier (fix toolbox handling, data-managers, better error reporting)
- Reference data handling: need to be finally solved
- Deployment of iRODS ...
- What hacks are currently used to keep usegalaxy.* working? This way we can identify pain points and incorporate them into this roadmap
## GTN
- GTN infrastructure help
- Editing GTN
- Support GTN (Money and people)
## Open Questions and Suggestions
- It would be nice to have end-to-end analyses in mind to guide development, perhaps something like the GCC demo that we discussed. Should we add this?
- Can we add emphasis on IWC, ITC, and complete production analyses?
- How much emphasis to place on public Galaxy services (responsiveness, capacity, features, bugs, tool/workflow availability) vs. framework development?
# OLD ROADMAP
The Galaxy project has grown considerably in the past five years. It has proliferated into multiple new directions. These developments have two main effects:
- **Positive**:
+ The project is well known, and it is easy to join multiple new scientific and software development initiatives (e.g., VGP, Pha4ge ...)
+ It is a global effort with a true community support and major funding for main instances
- **Negative**:
+ The team does not grow at the same speed. Having money does not make it easy to hire
+ Multiple directions dilute focus; we risk doing multiple things half-assed instead of doing a few things *really* well
## The Focus Areas
:::success
A lot of time has been spent on distilling these priorities. In this current form they represent shared goals of `.org` and `.eu` communities and agreed upon by all the PIs from both sides of the Pond.
:::
We need to decide on the primary direction for the next several years. These need to be consistently maintained, refined, and re-evaluated. Here are the main directions that warrant our immediate attention:
1. **[Making Galaxy known](#1-Making-Galaxy-known)** for what it *already* has. Galaxy pioneered many features such as the workflow language, graphical workflow editor, collections, and many other things. The trouble is that these are not that well known because we never published them. Now is the time to do it.
2. **[Remote data and compute](#2-Remote-data-and-compute)**. Within this direction we need to hit several targets. (1) Deploy global pulsar network, (2) switch all `usegalaxy.*` instances to use the global network, (3) enable data proximal compute and BYO{C,D}.
3. **[Improvement and polish for UI and UX](#3-UI-and-UX)**. User interface is one of the defining features of Galaxy. Here we will focus on (1) small usability fixes that can be done quickly, (2) the new history as well as (3) the hierarchical view.
4. **[Workflow enhancements](#4-Workflows)**. Galaxy is facing increasing competition from newer workflow management systems such as Nextflow. We need to improve the usability of workflows and grow the portfolio of "best practices" via IWC. Our workflows need to be easier to create and maintain and they need to be appealing to bioinformaticians.
:::info
**Note** that a lot of these things are *"almost"* there. Rough drafts of the papers exist. Pulsar can almost be used in this context. New history is sorta done, etc. Let's focus on the transition between "*almost working*" and "*reliably working*".
:::
### 1. Making Galaxy known
> PIs, WG: GOATS
There are a number of Galaxy components that are known to us but largely unknown to the community. By working continuously on Galaxy components many of us no longer feel how innovative these components actually are. We need to promote them!
Actions
#### 1.1. Papers to write
In the next two years we need to write and submit the following manuscripts (also see this [memo](https://docs.google.com/document/d/1y7P7FKWRE3E4UnaXHLMm_f4tr5LFpRmtzXFvxqxdLCE/edit?usp=sharing) from Björn):
1. **Collections: UI for large datasets** -- collections are an incredible part of Galaxy UI. This is the only system of its kind. This paper should focus on collections, group tags, and examples ranging from simple (e.g., large paired collections) to complex (e.g., hierarchical data used in RNAseq);
2. **Galaxy workflow creation and execution** -- description of the language, comparison with other systems, expression tools, subworkflows, and API-driven use;
* Substantial work has been done on IWC. Workflow editor has been rewritten in Vue with many new features.
3. **Using remote resources with Pulsar** -- description of Pulsar, the concept of Pulsar end-points and Pulsar cloud. Potential use of Pulsar cloud without Galaxy.
> *Genome Research* may be interested in one or two of these as it is seeking papers with potential high future citation count (as was the case for many previous Galaxy papers). *Nature* {*Methods, Biotechnology*} and *Genome Biology* are, of course, other possibilities.
:::warning
Are we missing anything? What other components do we need to promote the hell out of? Add to this list!
:::
#### 1.2. The new hub and unification of web presence
In addition to writing papers we need to clean up and unify our web-presence. The new Hub framework has been developed for this purpose.
* [x] EU has migrated to use the Hub infrastructure
#### 1.3. What can we do by GCC 2022?
- [x] Planemo
- [ ] Collection manuscripts;
- [x] Moving covid19.galaxyproject.org to hub infrastructure.
### 2. Remote data and compute
> WG: Systems, Backend, Testing
Galaxy is unique in that it is an analysis framework coupled with **real** and **always-on** compute infrastructure: one can run analyses right now! We need to further develop our ability to utilize remote compute and storage. This Fall we will establish a formal collaboration with [CERN](https://home.cern/) to align some of our development efforts as the two communities (biomedical and particle physics [experimental physics in general]) have a large number of common goals.
#### 2.1. Global Pulsar network
Perform necessary development to enable deployment of globally distributed Pulsar nodes on a variety of computational resources. In terms of resources we need to satisfy two extremes:
- **Processing very large numbers of small/medium samples** -- an ability to process very large numbers of datasets that are small to medium in size. A perfect example of this is SARS-CoV-2 analysis. It involves 100,000s of samples that are relatively small.
- **Processing very large samples that require substantial compute** -- an ability to process a small number of samples, each of which consists of very large (TB-sized) files that need large-memory multi-core infrastructure for processing. Examples of this are our VGP effort and the image processing needs of the ITCR project.
All other research scenarios fall between these two extremes.
----
![](https://i.imgur.com/YzLeHSq.png)
<small>A global pulsar network. Here green arrows represent Galaxy use while blue show potential use by other systems. By opening the network to communities beyond Galaxy we will again solidify our lead in this space.</small>
----
#### 2.2. What do we need?
While Pulsar is *"almost"* ready, there are a number of developments that need to take place before this vision can be implemented:
- **A meta-scheduler** -- effective use of the network will require the development of a meta-scheduler that would allow different global Galaxy instances to access the same Pulsar endpoint and utilize allocated resources efficiently.
- **An actual pulsar network** -- a deployed collection of Pulsar endpoints on a variety of compute resource types and at different geographic locations (US, EU, AU).
- **A straightforward deployment** -- it should be easy to deploy a Pulsar end-point on most resources.
- **An ability to pass over user credentials** -- how do we connect, say, an XSEDE allocation to a user account?
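To make the meta-scheduler requirement more concrete, here is a minimal Python sketch of one possible placement rule: prefer an endpoint close to the data, then the least-loaded one. Everything in it is an assumption for illustration. The `Endpoint` fields, the `pick_endpoint` function, and the endpoint names are hypothetical, and this is not how Pulsar or TPV make decisions today.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    """Hypothetical description of one Pulsar endpoint (total_cores > 0)."""

    name: str
    region: str  # e.g. "US", "EU", "AU"
    total_cores: int
    busy_cores: int

    @property
    def free_cores(self) -> int:
        return self.total_cores - self.busy_cores


def pick_endpoint(
    endpoints: List[Endpoint], data_region: str, cores_needed: int
) -> Optional[Endpoint]:
    """Pick an endpoint with enough free cores, preferring data-proximal ones."""
    candidates = [e for e in endpoints if e.free_cores >= cores_needed]
    if not candidates:
        return None  # nothing can run the job right now
    # False sorts before True, so endpoints in the data's region come first;
    # ties are broken by the fraction of cores already in use.
    return min(
        candidates,
        key=lambda e: (e.region != data_region, e.busy_cores / e.total_cores),
    )


network = [
    Endpoint("us-hpc", "US", total_cores=128, busy_cores=120),
    Endpoint("eu-cloud", "EU", total_cores=64, busy_cores=32),
    Endpoint("au-cloud", "AU", total_cores=64, busy_cores=0),
]
small_job = pick_endpoint(network, data_region="EU", cores_needed=8)
large_job = pick_endpoint(network, data_region="EU", cores_needed=48)
print(small_job.name, large_job.name)
```

A real meta-scheduler would also need the Pulsar state information and user credentials mentioned in the list above; the sketch only shows why both data location and current load have to be visible to whatever makes the decision.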
```
#### 2.3. What can we do by GCC 2022?
We should aim for the following scenario:
- Stage a number of SARS-CoV-2 datasets in EU
- A `usegalaxy.org` user accesses these files and adds them to a history as a collection
- The same user starts a variant calling workflow that schedules jobs on a Pulsar node proximal to these datasets
- The end result of this workflow is stored at `usegalaxy.org`
- Similarly, can we demonstrate computation near VGP data? (we may need to copy it from AWS to a public resource).
In practice:
- Go to upload tool and find remote data you want to analyze
- "Import" data ... but the upload renders data in history without uploading
- Map datasets against reference by starting your tool
- Galaxy is picking the correct destination where data is stored
- and schedules jobs
- outputs (e.g., a list of sites or a plot) are transferred back or
While this example is possible in principle, there are a number of questions. For example, how does the user see these remote data? Are they visible in `.org` library?
```
### 3. UI and UX
> WG: UI/UX, Testing, GOATS
This is an enormously important area for our future.
#### 3.1. Small interface improvements
Historically we postponed small fixes "until the new history is done". We need to comb through existing issues to identify bugs/features that can be fixed/developed independently of the new history. Here are some examples:
- [x] Dataset deletion from history is painfully slow;
* I think this is done when using celery as a task backend
- Can we enable sorting and simple search on tabular datasets displayed in the center pane (this should only work on datasets smaller than some practical limit);
- [x] Can we visualize notebook content by clicking on the 'eye' icon?
:::warning
Are there other "low hanging fruits" -- bugs and features that can be fixed/implemented quickly?
:::
#### 3.2. New tool panel
* The new tool panel has been prototyped and is exactly what we need. It needs to be deployed (coherently) across `usegalaxy.*`.
* Not done yet, but less important as we are shifting to a search-first (in contrast to browse-first) model and will get rid of the tool panel by default
#### 3.3. New history
New history exists but has a large number of issues. The UI working group needs to ensure that they have a good real-world use example to drive the tune-up of the new history **before** doing any new work on it.
* Bjoern thinks this is mostly done, there are still bugs, but the new history rocks. Anton what is your feeling?
#### 3.4. Hierarchical view
The core idea of history--a linear succession of datasets--has remained unchanged since ~2006. For large histories this linearity presents a problem: it becomes difficult to understand the relationship between datasets (e.g., "which of the original inputs did these derived datasets originate from?"). An incremental solution to this problem was the development of *name tags*. Yet this only partially solves the problem.
-----
![](https://i.imgur.com/ndZzDLH.png)
<small>Prototype for rendering Galaxy history as a graph. In this analysis there are two initial datasets: a collection of multiple fastq files (`5:mt`) and a single BED dataset containing SNP annotations from the UCSC Table Browser (`[12: SNPs: dbSNP153]`). While this analysis is simple, it involves branching. Using the current view it is difficult to discern where initial datasets are and where the branching point is. A graph-like rendering will allow visualizing datasets and tools that have been used to produce them. It should be possible to toggle between data-only and tool-only (workflow) view. </small>
-----
We need to develop alternative history views, which will graphically represent relationships between datasets (see above). This graph-based view will toggle between representing datasets or tools as graph nodes. We also need to implement a number of selection mechanisms (by type, by tool, by initial dataset, and others) that will highlight datasets (or tools) and facilitate downstream analyses.
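The toggle between a data-only and a tool-only view can be illustrated with a small Python sketch. It assumes a history can be reduced to job records of the form (tool, inputs, outputs); the record layout, the tool names, and the function names below are hypothetical and do not reflect Galaxy's data model or API.

```python
# Hypothetical job records: (tool, input datasets, output datasets).
jobs = [
    ("map_reads", ["5:mt"], ["20:mapped"]),
    ("call_variants", ["20:mapped"], ["30:variants"]),
    ("intersect", ["30:variants", "12:SNPs"], ["40:annotated"]),
]


def dataset_view(jobs):
    """Data-only view: an edge from each job input to each job output."""
    return {(i, o) for _tool, ins, outs in jobs for i in ins for o in outs}


def tool_view(jobs):
    """Tool-only (workflow) view: an edge from producer tool to consumer tool."""
    producer = {o: tool for tool, _ins, outs in jobs for o in outs}
    return {
        (producer[i], tool)
        for tool, ins, _outs in jobs
        for i in ins
        if i in producer  # initial datasets have no producing tool
    }


def initial_datasets(jobs):
    """Datasets that are consumed by some job but produced by none."""
    produced = {o for _tool, _ins, outs in jobs for o in outs}
    consumed = {i for _tool, ins, _outs in jobs for i in ins}
    return sorted(consumed - produced)


print(initial_datasets(jobs))
print(sorted(tool_view(jobs)))
```

Both views are derived from the same job records, so switching between them would not require storing two graphs. Selection by initial dataset then amounts to following edges of `dataset_view` outward from one of the entries returned by `initial_datasets`.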
### 4. Workflows
> Workflows, Tools, Backend, UI/UX, Testing, GOATS
Workflows are essential for applying Galaxy to various types of data analysis. The more high quality workflows we have, the more people Galaxy will reach. This will also drive our tool development efforts.
#### 4.1. The workflow roadmap
The Workflow WG has developed a very [detailed roadmap](https://docs.google.com/presentation/d/1R7gkC603gne59IlP31dACn72XAGRfC63ZdHaA9q5Oh0/edit?usp=sharing) for workflow functionality. We include workflows in the vision to reflect the need to make them "first class citizens". To achieve this we will need to make workflows:
- **Simpler and clearer** -- more intuitive ways to launch workflows and generated invocation reports. Ability to see a workflow preview. Conditional flow
- **More appealing to pipeline developers** -- workflow nodes need to reveal more information such as description of inputs or outputs and the way command line is formed. Ability to edit some aspects of the `.gx2` files from the editor. Integration with notebook environments.
- **More flexible** -- user installable tools and adapters for small bash scripts. Progressive deletion of intermediate data if all jobs in a given step are successful. Real webhooks and resource hints for a tool and for a workflow as a whole.
- **More powerful** -- conditionals again! Scheduling plugins and ability to use external schedulers such as AirFlow. Changes to the way parallelization is done without polluting the database with intermediate items.
- **Easier to create and edit** -- ability to create workflows locally. Ability to select multiple nodes and create subworkflows. Automated mechanism for inspecting a workflow and suggesting simplification via replacing repetitive steps with subworkflows. Text mode for the editor.
- **More visible** -- see [Section 1](#1-Making-Galaxy-known) above.
#### 4.2. What can we do by GCC 2022?
- A collection of 10 great best practice workflows
- Better integration of notebooks and conditionals;
- Embed parts of the workflow editor as standalone components (for reports, previews etc).
## What is next?
We would like to ask WGs to:
1. Assess if the goals for GCC are reasonable. Does anything need to be scaled down or expanded?
2. Identify components that are within this WG's expertise and which would require collaboration with other WGs
3. Create a roadmap for the next two tertiles ([Nov, Dec, Jan, Feb] & [Mar, Apr, May, June]) that outlines steps necessary for implementing these components.
4. Communicate this to PIs (you can do this in any form such as inviting us to relevant meetings). However, whatever this form is the roadmap needs to be easily accessible.
5. This needs to be wrapped up by Nov 15. By this date WGs and PIs need to be in sync.
### Comments
Don't we need more focus on community based activity to help achieve these goals? The technical changes are necessary but not sufficient.
#### Regarding 1.:
*Making Galaxy known for what it already is* - There may be things the community might want to do to help - the GGSC is willing to try to assist. Any suggestions are welcomed!
Big science initiatives - the next VGP, may deserve a place in this section too. Setting aside some resources and having a strategy for identifying a target group and helping them get some of their workflows into Galaxy would be one possible approach.
#### Regarding 3.1:
Galaxy’s UI has incredible potential for improvement if we resolve existing UI bugs and performance issues and add proper tests to avoid reoccurrence. There is a conflict between adding new features and fixing what's there. New features may lead to new issues while existing issues remain unresolved. Moving forward it would be great to start off by adding a new testing framework to complement selenium tests, and then utilize it to create more sophisticated and comprehensive UI tests to guide any further development. This is similar to how we complemented qunit tests with jest, which already resulted in significantly improved test coverage. Most (if not all) of the UI performance issues we experience today can be resolved if cataloged by tests and prioritized.
#### Regarding 3.1:
Stepwise internationalisation of the UI and perhaps eventually even tool form text and help (yes, yet more work to maintain - needs resources!) opens lots of potential new places for Galaxy to be useful, especially among non-English speaking scientists. It's underway, but is extending this as far as practicable worth embedding in 3.1 as a long-term design goal?
## Open Questions and Suggestions
- Emphasis on IWC, ITC, and complete production analyses
- IWC has a bunch of high quality workflows and external contributors. I think it's a good idea to emphasize that this is important and tell them which direction we see as a priority - BUT in the end scientists need to create and verify these workflows - who is doing that?
- J: IMO the Galaxy development team needs to partner with scientists to see their use cases through. We cannot push this entirely onto scientists using Galaxy.
- How much emphasis to place on public Galaxy services (responsiveness, capacity, features, bugs, tool/workflow availability) vs. framework development?
- B: I would prioritise framework developments which take some load from our Admins. This way they can concentrate on better deployments, fewer bugs and a more stable system.
- J: I support framework developments that improve our public services and administration. I would say this should be a priority over other framework developments in 2023. Our public services could be better with increased focus.
:::info
**Things to add**
- Intrainstance (including GTN) communication, branding, etc. (GOATS?)
- Rewarding contributors such as by resurrecting "Contributor of the month" mechanism
:::