Work Package 5 (WP5): Service Building

WP5 is the main development effort of EUDAT2020; it consolidates and extends the technical architecture of the CDI. Besides improving the EUDAT CDI and its core software, it continues research activities begun in the first phase of EUDAT. Of these, the Generic Execution Framework (GEF) is of particular interest to us. MPI-M has assumed a coordinating role for the GEF and is steering its development (subtask 5.4.2, Workflow Generic Execution Framework and Workspaces) to ensure our requirements are implemented. The work done in WP5 is summarized in annual reports, which are released via EUDAT's own B2Share publication service. As of November 2017, the first annual report, covering 2016, has been made publicly available. The 2017 annual report will be published before work on the final report starts in January 2018.

Workflow Generic Execution Framework and Workspaces
MPI-M is coordinating development of the GEF with the Eberhard Karls Universität Tübingen (EKUT) and the Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique (CERFACS). Because the GEF is developed as an agile software project, it remains open to new requirements from the climate community and the other participating scientific communities. Over the last two years, EUDAT partners and user communities have been invited to test and scrutinize the GEF during its development, and they have suggested further requirements that are still flowing into the development process.

The Generic Execution Framework
The GEF allows containerized scientific tools to be executed close to where the data is stored, which in most use cases is within the EUDAT CDI. Features include:

  • Annotated Docker images (called GEF services) are used to encapsulate scientific tools such as the Climate Data Operators (CDOs) and make scientific toolchains reproducible. Docker images are stand-alone, executable packages that bundle a lightweight virtualized Linux environment with all the software components needed to run the desired tool; a running image forms a transient workspace.
  • In the spirit of Open Science, GEF services are to be hosted in a public repository. We aim to have a prototype repository ready by the end of the project in February 2018. This prototype will be based on Docker Hub technology.
  • Integration with the EUDAT AAI (i.e. the EUDAT B2Access service). A separate GEF configuration will run independently of the EUDAT AAI.
  • User interaction via a GUI as well as an HTTP API.
  • Integration with all data services of the EUDAT CDI is the goal. B2Share and B2Drop are already integrated. B2Safe integration will be implemented by the end of the project.
  • Data can be specified by Persistent Identifiers (PIDs) or URLs.
  • The GEF backend can interface with Docker Server and Docker Swarm installations on various platforms.
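
As an illustration of accepting input data either as a PID or as a URL, the following minimal sketch normalizes both forms into a URL that can be fetched. It assumes that PIDs are Handle-style identifiers of the form prefix/suffix and that they are resolved through the public Handle proxy at hdl.handle.net; the function name to_url and the example identifiers are hypothetical and are not taken from the GEF code base.

```python
from urllib.parse import urlparse

# Public Handle System proxy; assumed here as the resolver for Handle-style PIDs.
HANDLE_PROXY = "https://hdl.handle.net/"


def to_url(reference: str) -> str:
    """Return a fetchable URL for a data reference given as a URL or a PID.

    Hypothetical helper: this is a sketch, not the GEF implementation.
    """
    ref = reference.strip()

    # A reference with an http(s) scheme is already a URL; pass it through.
    if urlparse(ref).scheme in ("http", "https"):
        return ref

    # Accept an optional "hdl:" scheme prefix in front of the handle.
    if ref.lower().startswith("hdl:"):
        ref = ref[4:]

    # A Handle has the form "<prefix>/<suffix>"; reject anything else.
    prefix, sep, suffix = ref.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a URL or Handle-style PID: {reference!r}")

    return HANDLE_PROXY + ref
```

With this helper, to_url("hdl:12345/example-id") and to_url("12345/example-id") yield the same proxy URL, while an http or https URL is returned unchanged.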

[Figure: GEF architecture, with its frontend and backend connected to a Docker Swarm installation that deploys GEF services retrieved from an (as yet incomplete) GEF service repository.]

The first beta version of the GEF (TRL6) was released in September 2017, and we intend to keep working towards a release candidate (TRL7) until the end of the project. The GEF source code, along with documentation, can be found on GitHub.

Integration with ESGF and EGI as a Final Use Case for the GEF
The main goal of all ENES members participating in the development is to enable the GEF to integrate with ESGF data nodes and the EGI Federated Cloud. The GEF ESGF/EGI use case requires the following setup:

  • CMIP5 data is stored at the Centre Informatique National de l'Enseignement Supérieur (CINES) site on ESGF data nodes (and alternatively on a data node running the EUDAT data storage service B2SAFE).
  • The GEF runs on a virtual machine in the EGI Federated Cloud.
  • Docker images (GEF services) are available from a remote GEF service repository and contain the same scientific toolchain that is installed on the ESGF compute nodes, in this case the Climate Data Operators (CDOs).
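
To make the role of a containerized toolchain in this setup more concrete, the sketch below assembles a docker run command line that would apply a CDO operator to a file inside a container. It is only an assumption about how such a service could be invoked by hand: the image name example/gef-cdo, the mount points /input and /output, and the helper cdo_container_command are hypothetical, and the GEF backend may start containers quite differently. The flags used (--rm, -v) are standard Docker CLI options, and timmean is a standard CDO operator.

```python
from typing import List


def cdo_container_command(
    image: str,
    input_dir: str,
    output_dir: str,
    operator: str,
    infile: str,
    outfile: str,
) -> List[str]:
    """Build the argv for running one CDO operator in a throwaway container.

    Hypothetical helper for illustration; it only builds the command and
    does not start Docker itself.
    """
    return [
        "docker", "run",
        "--rm",                          # remove the container when it exits
        "-v", f"{input_dir}:/input:ro",  # input data mounted read-only
        "-v", f"{output_dir}:/output",   # results written back to the host
        image,
        "cdo", operator,
        f"/input/{infile}",
        f"/output/{outfile}",
    ]


# Example: time mean of a NetCDF file (all names are placeholders).
command = cdo_container_command(
    image="example/gef-cdo",
    input_dir="/data/cmip5",
    output_dir="/data/results",
    operator="timmean",
    infile="tas.nc",
    outfile="tas_timmean.nc",
)
```

The resulting list can be passed to subprocess.run(command, check=True) on a host where Docker and a suitable image are available. Mounting the input read-only keeps the source data untouched while the container writes only to the output directory.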

The GEF will be able to interface with ESGF via an HTTP API conforming to the Web Processing Service (WPS) standard and trigger post-processing remotely. Two use case variants are planned that depend on the availability of compute nodes within the ESGF e-infrastructure:

  1. With at least one ESGF compute node available at the CINES site, the user can use the GEF, connected to ESGF via the planned WPS interface, to trigger post-processing with the CDOs on the available ESGF compute nodes.
  2. Without an ESGF compute node available, CMIP5 data is downloaded to the EGI Federated Cloud and post-processed by a GEF service that mirrors the scientific tools available on the ESGF compute node, the CDOs in this particular case.
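
The choice between the two variants can be sketched as follows. The sketch assumes a WPS 1.0.0 endpoint addressed with key-value-pair (KVP) encoding, since the WPS version and request encoding of the future ESGF interface are not fixed here; the endpoint URL, the process identifier cdo_timmean and both function names are hypothetical.

```python
from typing import Dict
from urllib.parse import urlencode


def wps_execute_url(endpoint: str, process_id: str, inputs: Dict[str, str]) -> str:
    """Build a WPS 1.0.0 KVP Execute request URL (assumed encoding)."""
    # WPS KVP encoding joins inputs as "name=value" pairs separated by ";".
    data_inputs = ";".join(f"{name}={value}" for name, value in inputs.items())
    query = urlencode({
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "identifier": process_id,
        "DataInputs": data_inputs,
    })
    return f"{endpoint}?{query}"


def plan_postprocessing(
    compute_node_available: bool,
    wps_endpoint: str,
    process_id: str,
    inputs: Dict[str, str],
) -> Dict[str, object]:
    """Pick use case variant 1 or 2 depending on compute node availability."""
    if compute_node_available:
        # Variant 1: trigger post-processing remotely on the ESGF compute node.
        return {
            "variant": 1,
            "action": "wps-execute",
            "url": wps_execute_url(wps_endpoint, process_id, inputs),
        }
    # Variant 2: download the data and run the mirrored GEF service instead.
    return {
        "variant": 2,
        "action": "download-and-run-gef-service",
        "inputs": dict(inputs),
    }
```

For example, plan_postprocessing(True, "https://esgf.example.org/wps", "cdo_timmean", {"dataset": "tas.nc"}) returns a plan containing a ready-to-send Execute URL, while passing False returns a plan that lists the inputs to be downloaded to the EGI Federated Cloud.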

This will be the final and most elaborate GEF use case for the climate community within the EUDAT project, which ends in February 2018. After EUDAT ends, DKRZ is to take over coordination of GEF development and may push it to the level of a production-ready system.