Electron cryo-microscopy produces large volumes of raw scientific data, reaching 1-2 TB per day. This project arose as an attempt to bring this kind of data into closer alignment with the FAIR principles. To achieve this, we have developed a pipeline for automated annotation and archival of raw data in the Onedata data management system, and we have increased the FAIRness of the data analysis by exporting the image processing workflow in Common Workflow Language (CWL), using a CryoEM ontology, and depositing workflows in WorkflowHub.
The system is composed of a frontend web-browser-based application, which gathers the metadata related to a particular dataset and submits a request for data deposition to the Onedata cloud. The request is processed by a backend Python-based application running on the local storage server, which serves as a local cloud provider in the federated network. The backend service creates a new data “space” for each dataset, checks for the availability of new data points, and appends them to the “space” on the fly. The backend service also sets rules for data archival and publication. Once the data acquisition is terminated in the frontend application, the metadata, together with the cloud identifiers, are added to an internal catalog. The data owner can access the data directly through the Onedata interface using the data identifier, or download it via a link; both are sent automatically by email once the data acquisition is completed.
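The "check for new data and append it on the fly" step of the backend can be illustrated with a minimal Python sketch. It is not the actual backend code: the function names `find_new_files` and `sync_once` and the `append_to_space` callback are hypothetical, and no Onedata API is called. The callback stands in for whatever operation adds a file to the dataset's "space".

```python
from pathlib import Path
from typing import Callable, List, Set


def find_new_files(acquisition_dir: Path, seen: Set[Path]) -> List[Path]:
    """Return files under acquisition_dir that are not yet in `seen`, sorted."""
    current = {p for p in acquisition_dir.rglob("*") if p.is_file()}
    return sorted(current - seen)


def sync_once(
    acquisition_dir: Path,
    seen: Set[Path],
    append_to_space: Callable[[Path], None],
) -> List[Path]:
    """Run one polling pass: hand every new file to the callback.

    `append_to_space` is a hypothetical placeholder for the operation
    that adds a file to the dataset's "space"; it is an assumption of
    this sketch, not part of any real client library.
    """
    new_files = find_new_files(acquisition_dir, seen)
    for path in new_files:
        append_to_space(path)
        seen.add(path)  # record only after the callback succeeded
    return new_files
```

A service built along these lines would call `sync_once` repeatedly (for example, every few seconds) until the frontend signals that the acquisition has ended. A production version would also need to skip files that are still being written by the microscope software.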
In addition, an external user working in the Scipion processing framework can retrieve the raw data from Onedata (using the Scipion Onedata plugin) and draw on a template that another user has previously uploaded to the WorkflowHub public catalog (via the Scipion WorkflowHub plugin), which is enriched with CryoEM ontology terms (Refs. 1-3). The user can then continue processing and, if good results are achieved, submit the raw and intermediate data to EMPIAR, where the scientific community can readily understand the Scipion workflow used in the data analysis by means of the viewer integrated in the EMPIAR website. This viewer has been extended with new functionality that gives an improved view of the data at the different steps of the analysis, thereby providing some evaluation of data quality.
References:
1. https://www.ebi.ac.uk/ols/ontologies/cryoem
ISC, Masaryk University, Brno, Czech Republic
CNB-CSIC, Madrid, Spain
CEITEC, Masaryk University, Brno, Czech Republic