Large-scale screening efforts, whether based on drug libraries, RNAi or CRISPR, have become very common in modern biological research. The microscopy images from these screens contain much more information than was used in the original analysis and therefore represent a rich resource for data re-use, especially given the high cost of performing the initial screening.
In particular, such screen data can contain information on phenotypes that were previously not of interest and were therefore not analysed or detected. Extracting this new phenotypic information, however, requires re-analysing the very large sets of images gathered in the original screen. Running a typical image analysis pipeline, which segments the objects in the images and extracts features from them, at the scale of genome-wide screens requires high-performance computing infrastructure.
This EOSC-Life pilot project demonstrated the use of such an infrastructure to run new image analysis via the cloud on publicly available image data sets from genome-wide screens. As an example case, this project investigated the biology of the nucleolus, a small membrane-less organelle found within the nucleus of the cell. The process by which nucleolar structure is established and maintained is still poorly understood, although phase separation has been hypothesized to play a key role.
Analysing existing screening data for gene perturbations that affect the number and structure of nucleoli can provide insight into how phase separation impacts nucleolar structure and which proteins play a key role in controlling the process.
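To make the idea of searching for perturbations that affect the number of nucleoli more concrete, the following is a minimal sketch of one common way to flag outliers in screening data: a robust z-score based on the median and the median absolute deviation (MAD). It is an illustration only, not the method used in the pilot project. The function names, the threshold of 3 and the example values are all hypothetical.

```python
from statistics import median


def robust_z_scores(values):
    """Return (x - median) / (1.4826 * MAD) for each value.

    The factor 1.4826 makes the MAD comparable to a standard
    deviation when the data are roughly normally distributed.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    scale = 1.4826 * mad
    return [(v - med) / scale for v in values]


def flag_hits(counts_by_gene, threshold=3.0):
    """Return {gene: score} for genes whose |robust z-score| >= threshold.

    counts_by_gene maps a gene name to a summary measurement, here
    assumed to be the mean number of nucleoli per nucleus after
    knock-down of that gene.
    """
    genes = list(counts_by_gene)
    scores = robust_z_scores([counts_by_gene[g] for g in genes])
    return {g: s for g, s in zip(genes, scores) if abs(s) >= threshold}


# Hypothetical measurements, invented purely for illustration.
example = {
    "GENE_A": 2.9,
    "GENE_B": 3.1,
    "GENE_C": 3.0,
    "GENE_D": 3.2,
    "GENE_E": 2.8,
    "GENE_F": 1.2,  # far fewer nucleoli than the rest
}
hits = flag_hits(example)
print(sorted(hits))
```

Median-based statistics are used in this sketch because a handful of strong hits in a genome-wide screen would otherwise inflate the mean and standard deviation and mask themselves.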
Rather than performing new experiments, this pilot project reused the wealth of image data provided by already published image-based RNAi screens, which are publicly available in the IDR, and protein data from the Human Protein Atlas. Integrative analysis of multiple image datasets takes advantage of complementary information provided by the use of different assays and reporters in different systems. Major obstacles to the reuse and analysis of genome-scale image datasets are the complexity and the size of the data, which preclude easy network transfer and require a high-performance computing infrastructure to process in a reasonable amount of time.
The pilot project of Demonstrator 6 connected the public data resources, such as the IDR, with Galaxy, and brought popular image analysis tools of the CellProfiler suite into Galaxy to allow for collection and analysis of data in the cloud.
This project demonstrates how data re-use can substitute for new data generation and establishes a re-usable workflow.
Training on how to run the workflow in Galaxy – https://training.galaxyproject.org/training-material/topics/imaging/tutorials/tutorial-CP/tutorial.html
EMBL HD (Beatriz Serrano-Solano / University of Freiburg (former EMBL), Jean-Karim Hériché, Yi Sun)
Sci Life Lab
University of Dundee (Jean-Marie Burel)
EMBL EBI
The close feedback loop was very helpful, particularly sharing an office with some EOSC-Life technical experts! It was very helpful to have hackathons and various other face to face events. We wouldn’t have access to the cloud resources if it weren’t for EOSC-Life, so that added value to this project. Connecting Galaxy to the imaging community was important in both directions, both new tools and useful imaging data being made available to people who might need it.
– Beatriz Serrano