An increasing number of studies employ workflows involving methods and tools from different domains of the life sciences. A high-level view of such workflows and corresponding publications is shown in Figure 1. Bringing such workflows to the EOSC environment, where they can be collaboratively developed, used, re-used and combined, presents numerous challenges. The objective of WP2 in EOSC-Life is to facilitate the porting of scientific workflows by the RIs to the EOSC environment and to help make them FAIR.
The plan to achieve this objective, described in more detail in the “Current WP2 roadmap” section, has many facets: creating suitable workflow metadata standards, providing a workflow registry to support findability, supporting the adoption and use of workflow testing for easier maintenance, and helping with the adoption of cloud technologies so that these workflows can be executed on EOSC-compatible cloud infrastructures.
Figure 1. Examples of workflows from publications in which the workflow is composed of tools from domains covered by different LS RIs.
For our purpose, the following definitions apply:
A workflow describes a set of computational tasks and their relationships.
A tool is a piece of software used by researchers to carry out a computational task.
A command-line tool is a piece of software that runs as a non-interactive program that automatically terminates upon task completion.
A workflow management system or workflow engine is software in charge of executing and monitoring a workflow. Workflow engines are meant to automate the running of workflows and as such are generally not suitable for interactive tasks and often only support the running of command-line tools. A workflow engine generally includes components for task execution (e.g. running a given software, handling failure), task scheduling (e.g. running tasks in parallel), resource provisioning, managing metadata and provenance (where, when and with what parameters and input was a task run?) and data management (is input data available? where to send output).
Registries are catalogues that can be queried manually or automatically to locate and obtain tools and workflows.
Tool interoperability in the context of EOSC-Life means that a workflow can access and use resources and tools from different domains of the life sciences represented by the BMS RIs.
In WP2, workflows are specified in terms of the flow of data between a set of tools. We note that tools are not limited to the implementation of an atomic task but can also implement a workflow.
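The definitions above can be illustrated with a minimal sketch in Python. It assumes a hypothetical three-step workflow whose task names and stand-in tools are invented for illustration; real tools would be command-line programs, and a real workflow engine would add scheduling, failure handling and provenance tracking. The sketch only shows a workflow specified as the flow of data between tools, and a whole workflow wrapped so that it can be used as a single tool.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a task, each value is the set of
# tasks whose output it consumes (the flow of data between tools).
DEPENDENCIES = {
    "align_reads": {"trim_reads"},
    "call_variants": {"align_reads"},
    "trim_reads": set(),
}

# Stand-in tools: plain functions replacing real command-line programs.
TOOLS = {
    "trim_reads": lambda reads: reads.strip(),
    "align_reads": lambda reads: f"aligned({reads})",
    "call_variants": lambda alignment: f"variants({alignment})",
}


def execution_order(dependencies):
    """Return one valid order in which the tasks can be run."""
    return list(TopologicalSorter(dependencies).static_order())


def run_workflow(dependencies, tools, initial_input):
    """Run every task after its upstream tasks and return all outputs."""
    outputs = {}
    for task in execution_order(dependencies):
        upstream = sorted(dependencies[task])
        # Tasks without upstream tasks consume the workflow input.
        inputs = [outputs[name] for name in upstream] or [initial_input]
        outputs[task] = tools[task](*inputs)
    return outputs


def as_tool(dependencies, tools, final_task):
    """Wrap a whole workflow so that it can be used as a single tool."""
    return lambda data: run_workflow(dependencies, tools, data)[final_task]


if __name__ == "__main__":
    print(execution_order(DEPENDENCIES))
    variant_caller = as_tool(DEPENDENCIES, TOOLS, "call_variants")
    print(variant_caller("  ACGT  "))
```

The function returned by as_tool can itself be registered as a task in a larger workflow, which mirrors the remark that a tool may implement a workflow rather than an atomic task.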
WP2 focuses on the implementation, sharing and maintenance of workflows, covering for example tool packaging, containerisation and workflow management systems. Challenges in provisioning and integrating cloud infrastructures, on the other hand, are outside its remit; cloud deployment is done in cooperation with EOSC-Life WP7.
To maximise the use of WP2 resources and promote interoperability, WP2 will focus on a limited number of components and build upon resources already available.
To promote findability and reusability, WP2 will unify tool and workflow descriptions using structured data, provide a workflow registry that leverages current resources, and create a specific service to support workflow maintenance through automated analysis and testing.
Current WP2 roadmap
Reviews of online materials and publications related to the activities of the LS RIs, as well as informal discussions with individual researchers within some of the RIs (including during the project kick-off meeting), identified a range of tools and workflow systems in common use. This was complemented by a survey of the EOSC-Life science demonstrators. Based on this, WP2 has developed an initial technical roadmap that highlights technologies and standards that can be readily supported within the project. The technologies and standards include the Linux operating system, the Conda package manager, Singularity (and/or Docker) for containerisation, CWL for describing data analysis workflows, Nextflow for running workflows on the command line and the Galaxy platform as a web-based UI for building and running data analysis workflows. In addition, there is growing interest in the use of RStudio and Jupyter notebooks. To build on existing efforts and expertise, WP2 will aim to use these tools or ensure compatibility with them.
Tool packaging and distribution
Conda is a cross-platform package and environment manager used to install and manage software packages and their dependencies.
Bioconda is a channel for the Conda package manager specializing in bioinformatics software. Through continuous integration, Bioconda packages are made available as Docker and Singularity containers. https://docs.conda.io/en/latest/ https://bioconda.github.io/
Docker is a popular software container platform. Software containers excel at providing software portability and repeatability, particularly on cloud infrastructures. Docker is compliant with the Open Container Initiative (OCI), which develops open industry standards for container formats and runtimes. https://www.docker.com/
Singularity is another popular software container platform. Singularity has greater adoption than Docker in HPC, as it integrates with many resource managers and historically its engine never required root privileges. Tools are available to convert Docker containers to Singularity. Singularity is compliant with the Open Container Initiative. https://sylabs.io/
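As an illustration of how the same containerised tool might be launched under either platform, the following Python sketch builds, without executing, the command line for each runtime. The helper name and the image reference are hypothetical. The sketch assumes the standard "docker run" and "singularity exec" subcommands, and Singularity's docker:// prefix, which fetches and converts a Docker image.

```python
import shlex

# Hypothetical image reference, used for illustration only.
IMAGE = "example.org/hypothetical/samtools:1.0"


def container_command(runtime, image, tool_argv):
    """Build the argument list that runs tool_argv inside image."""
    if runtime == "docker":
        # --rm removes the container once the tool has exited.
        return ["docker", "run", "--rm", image, *tool_argv]
    if runtime == "singularity":
        # docker:// asks Singularity to fetch and convert the Docker image.
        return ["singularity", "exec", f"docker://{image}", *tool_argv]
    raise ValueError(f"unsupported runtime: {runtime}")


if __name__ == "__main__":
    for runtime in ("docker", "singularity"):
        argv = container_command(runtime, IMAGE, ["samtools", "--version"])
        # Print the command instead of running it.
        print(shlex.join(argv))
```

Keeping the runtime as a parameter reflects the roadmap's position that Singularity and Docker are interchangeable for packaging purposes, with the choice driven by the execution environment (for example HPC versus cloud).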
bio.tools strives to provide a comprehensive registry of software and databases for the biological and biomedical sciences. Resources are described in a rigorous semantics and syntax, providing end-users with the convenience of concise, consistent and therefore comparable information. Each bio.tools entry is assigned a human-readable, unique identifier, which provides a persistent reference and a means to trace resources and integrate bio.tools data with other resources.
BioContainers is a community-driven project that provides the infrastructure and basic guidelines to create, manage and distribute bioinformatics packages (e.g. Conda) and containers (e.g. Docker, Singularity). BioContainers provides recipes to build software containers and packages, and also a registry to host all the BioContainers images. As a community-driven project, it allows users to request support to create containers that don’t yet exist.
Workflow specification and management systems
Workflow management system agnostic description and interoperability
The Common Workflow Language (CWL) is selected as the standard for describing tools and workflows that can be executed by multiple workflow engines such as Nextflow and Snakemake. ELIXIR has invested in the support of CWL. CWL is also used by the EU’s BioExcel2 Centre of Excellence for Biomolecular modelling, and by the IBISBA ESFRI for Industrial Biotechnology. CWL participates in the GA4GH Task Execution API (a minimal common API for submitting a single job to a remote execution endpoint) and the GA4GH Workflow Execution API (a minimal common API for submitting workflow requests to workflow execution systems in a standardized way). http://www.commonwl.org
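To give an idea of what a CWL tool description contains, the following Python sketch assembles a minimal description of the echo command and writes it as JSON, which CWL accepts alongside YAML. The function names and output file name are hypothetical; the fields follow the CommandLineTool class of CWL v1.2 and should be checked against the specification before use.

```python
import json


def echo_tool_description():
    """Return a minimal CWL CommandLineTool description as a dictionary."""
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": "echo",
        "inputs": {
            # One string input, passed as the first positional argument.
            "message": {"type": "string", "inputBinding": {"position": 1}},
        },
        # The tool only prints to standard output, so no outputs are declared.
        "outputs": [],
    }


def write_description(path):
    """Write the description to path as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(echo_tool_description(), handle, indent=2)


if __name__ == "__main__":
    # Hypothetical file name.
    write_description("echo-tool.cwl")
```

A file produced this way is intended to be passed to a CWL runner such as the reference implementation cwltool, together with a value for the message input.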
Workflow Management Systems (WMS)
EOSC-Life aims to provide an environment that supports a wide range of workflow management systems for its RI developers and users.
Some workflow systems have been identified as meriting dedicated attention.
Galaxy is a web-based scalable platform for running biomedical data analysis tasks. In Galaxy, workflows are built by selecting from a web interface a series of operations to apply to the data. The saved history of applied operations constitutes a shareable and reusable workflow.
Tools are made available to Galaxy by writing a wrapper script and a description, and can be distributed via the Galaxy ToolShed. The recommended best practice for managing tool dependencies is the use of (Bio)conda. Galaxy can also use containers (Docker, Singularity) to run jobs. Galaxy can run workflows on remote resources using its Pulsar network.
Galaxy is integrating support for CWL. As a first step, export of ‘abstract CWL’ will be developed; this can serve as metadata (suited e.g. for inclusion in a workflow registry) but is not executable CWL.
ELIXIR runs an EU-wide Galaxy installation, hosted and managed by ELIXIR Germany. https://usegalaxy.eu/ https://galaxyproject.org/
Nextflow is a platform for data-driven computational pipelines executable from the command line and from executable notebooks. Nextflow is specifically designed for scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages. Its domain-specific language (DSL) simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.
Nextflow is an open-source platform that has commenced a commercialisation activity.
Nextflow has not yet committed to supporting CWL. https://www.nextflow.io/
Executable (Notebook) Environments
Jupyter is a web-based computational notebook used for literate programming supporting several programming languages. Workflows are composed of interactive programming in a selected language and consist of documented code chunks and their output.
Data provenance is not tracked, although some plugins such as Verdant or external tools such as noWorkflow can be used for this, and there are some experimental systems such as ProvBook. https://jupyter.org/
RStudio is a web-based computational notebook dedicated to the R statistical software environment. Workflows are composed of interactive programming in the R language and consist of documented code chunks and their output.
Data provenance can be tracked using various R packages such as RDataTracker, adapr, recordr or repo. https://www.rstudio.com/
WorkflowHub.eu is a registry for workflows developed within EOSC-Life. It is an entirely EOSC-Life product, though co-financed by other projects. Despite its relatively recent availability, it is already widely used. Workflows can be searched for and launched in Galaxy directly from WorkflowHub.
LifeMonitor is a service developed in EOSC-Life WP2 to support the sustainability and reusability of published computational workflows through long-term periodic testing and workflow analysis. An early alpha version was released in May 2021. LifeMonitor focuses on workflow test monitoring and integration with WorkflowHub through RO-Crate. LifeMonitor obtains the test metadata from the workflow’s RO-Crate and uses it to communicate with the relevant CI services. The service already supports the submission of workflows with associated test metadata through a specialized RO-Crate profile. Test outcomes are collected and exposed via a RESTful API. Test execution results from Jenkins, Travis CI and GitHub Actions are gathered and exposed under a common interface, accessible by clients authenticated with a WorkflowHub or GitHub identity. Many new features are currently in the works, including tighter integration with WorkflowHub, e-mail notifications of failing tests, a Web interface and tools to assist in test suite creation and workflow maintenance. The LifeMonitor team participates in the WorkflowHub Club.
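The idea of exposing results from several CI services under a common interface can be sketched as follows. This is not LifeMonitor's implementation or API. The class, function and service keys are hypothetical, and the payloads are assumed to be simplified dictionaries that keep only the field in which each service reports the build result ("result" for Jenkins, "state" for Travis CI, "conclusion" for GitHub Actions).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildOutcome:
    """Service-independent record of one test execution."""
    service: str
    passed: bool


# Assumption: each adapter reads a simplified payload holding only the
# field that carries the build result for that service.
ADAPTERS = {
    "jenkins": lambda payload: payload.get("result") == "SUCCESS",
    "travis": lambda payload: payload.get("state") == "passed",
    "github": lambda payload: payload.get("conclusion") == "success",
}


def normalise(service, payload):
    """Convert a service-specific payload into a BuildOutcome."""
    if service not in ADAPTERS:
        raise ValueError(f"unsupported CI service: {service}")
    return BuildOutcome(service=service, passed=ADAPTERS[service](payload))


def all_passing(outcomes):
    """Report whether every collected outcome is a pass."""
    return all(outcome.passed for outcome in outcomes)


if __name__ == "__main__":
    collected = [
        normalise("jenkins", {"result": "SUCCESS"}),
        normalise("travis", {"state": "failed"}),
        normalise("github", {"conclusion": "success"}),
    ]
    print(collected)
    print(all_passing(collected))
```

Confining the service-specific knowledge to one adapter per service means that supporting a further CI service would only require one more entry in the table, while clients continue to see a single record type.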
RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata. It is based on schema.org annotations in JSON-LD, and aims to make best practice in formal metadata description accessible and practical for use in a wider variety of situations, from an individual researcher working with a folder of data to large data-intensive computational research environments.
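A minimal sketch of the JSON-LD metadata for a crate holding a single workflow file is given below in Python. It assumes the RO-Crate 1.1 layout (a metadata descriptor, a root dataset and one entity per file) and the entity types commonly used for workflows; the function name, file name and workflow title are hypothetical, and the result should be validated against the RO-Crate specification before being relied upon.

```python
import json


def workflow_crate_metadata(workflow_file, name):
    """Return RO-Crate style JSON-LD metadata describing one workflow file."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                # Metadata descriptor: identifies this file and the root dataset.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                # Root dataset: the crate itself, listing its parts.
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "hasPart": [{"@id": workflow_file}],
                "mainEntity": {"@id": workflow_file},
            },
            {
                # The workflow file, typed with schema.org-based terms.
                "@id": workflow_file,
                "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
                "name": name,
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical workflow file and title.
    metadata = workflow_crate_metadata("workflow.cwl", "Example workflow")
    print(json.dumps(metadata, indent=2))
```

By convention the serialised output is stored as ro-crate-metadata.json at the top of the folder that contains the workflow file, which is what allows a plain folder of data to be treated as a crate.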