The Common Infrastructure for National Cohorts in Europe, Canada and Africa (CINECA) project has been running since January 2019. In this blog, we’d like to give you an impression of what the consortium aims to achieve and the activities that The Hyve has been involved in over the past year.
The CINECA project aims to bring together patient genomic and biomolecular data from 18 research institutions and biobanks from ten different countries across three continents. This creates a virtual cohort of no less than 1.4 million individuals located all over the world.
Performing analyses on such a large cohort should help scientists better explore, for instance, the effects of a range of mutations in humans and establish links between genetic variants and certain diseases. It should also speed up the development of drugs tailored to the genetic variants that cause cancer, cystic fibrosis and other conditions.
To achieve this goal, CINECA is developing a federated cloud-based network. Some of the partners have already successfully implemented workflows for analysing patient genomic and biomolecular data. The challenge is now to scale up the effort and make the analysis available in different institutions and biobanks participating in the CINECA project.
Personal health train
One issue you quickly encounter with such an initiative is that patient data often cannot be exported from the hospital or research institution for safety and privacy reasons. The solution that the CINECA project will adopt is the Personal Health Train concept. The idea behind this concept is that the data stays in the same physical location, but scientists from participating institutions can access the relevant data via a workflow that is adopted by the different partners. The advantage of this approach is that data can be shared selectively, without exposing privacy-sensitive information. At the same time, pooling cohorts means that researchers can now analyse data from thousands or even more than a million individuals. This is a huge advantage for research areas such as genomics, where big data analysis is essential if you want to determine whether a mutation is associated with an increased or decreased risk of certain diseases or types of cancer.
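To make the idea concrete, here is a minimal sketch of the pattern, not CINECA's actual workflow. It assumes each site runs the same small function on its own records and releases only aggregate counts, which a coordinator then pools. The `SiteSummary` schema and all function names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class SiteSummary:
    """Aggregate counts a site releases (hypothetical schema, no individual rows)."""
    cases: int
    carrier_cases: int
    controls: int
    carrier_controls: int


def summarise_locally(records: Iterable[Tuple[bool, bool]]) -> SiteSummary:
    """Runs inside a site. Each record is (is_case, carries_variant)."""
    cases = carrier_cases = controls = carrier_controls = 0
    for is_case, carries_variant in records:
        if is_case:
            cases += 1
            carrier_cases += int(carries_variant)
        else:
            controls += 1
            carrier_controls += int(carries_variant)
    return SiteSummary(cases, carrier_cases, controls, carrier_controls)


def pool(summaries: Iterable[SiteSummary]) -> SiteSummary:
    """Runs at the coordinator. It only ever sees the per-site counts."""
    total = SiteSummary(0, 0, 0, 0)
    for s in summaries:
        total = SiteSummary(
            total.cases + s.cases,
            total.carrier_cases + s.carrier_cases,
            total.controls + s.controls,
            total.carrier_controls + s.carrier_controls,
        )
    return total


def carrier_frequencies(summary: SiteSummary) -> Tuple[float, float]:
    """Carrier frequency among cases and among controls in the pooled cohort."""
    if summary.cases == 0 or summary.controls == 0:
        raise ValueError("need at least one case and one control")
    return (
        summary.carrier_cases / summary.cases,
        summary.carrier_controls / summary.controls,
    )


# Toy data standing in for two sites; real records never leave the site.
site_a = summarise_locally([(True, True), (True, False), (False, False)])
site_b = summarise_locally([(True, True), (False, True), (False, False), (False, False)])
pooled = pool([site_a, site_b])
```

In practice the released aggregates would also need disclosure controls, such as minimum cell counts, which this sketch leaves out.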
To keep the work manageable, task forces focussing on different work packages have been created: Federated Data Discovery, Authentication & Authorisation, Harmonised Metadata, Federated Data Analysis, and the ELSI framework. In the past year, The Hyve has been overseeing the activities of the Federated Data Analysis group. Of course, the activities of the different work packages cannot be completely separated, as the decisions of one group often have implications for the others. Think about questions like: Which types of metadata and data do we need to collect from this cohort? What are the minimal requirements for running the workflow across all participating institutions, hospitals or biobanks? What communication barriers need to be resolved before the systems can be aligned? Needless to say, this is all very much work in progress.
The CINECA project is led by the European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI). What makes the project stand out is that it goes beyond Europe. It is truly a transcontinental effort, linking European cohorts to patient groups in Canada and Africa. This approach should enhance our global understanding of which mutations have a significant impact on disease.
In the past year, the focus of the project was mainly on making an inventory: listing the cohorts at the various institutions and the genomic and biomolecular data available for their participants. Another important aspect was describing the data analysis workflows that have already been implemented successfully at one or more institutions, that is, in a relatively small-scale setting. The challenge is now to scale up, merge and roll out these workflows at the different partner institutions. Of course, you then have to overcome the problem that the IT infrastructure differs from country to country and across continents.
In the past year, several use cases were identified for federated workflows that provide a meaningful scientific output. Here are three examples of use cases in which The Hyve will be involved.
The first use case involves running federated joint variant genotyping across cohorts. Joint genotyping can be implemented in various ways, depending on the data access levels of the participating cohorts. The data required could be raw sequencing data (FASTQ/long read formats), reference-aligned data (BAM/CRAM) and/or individual genotypes (VCF/GVCF).
The second use case concerns running a federated Expression Quantitative Trait Loci (eQTL) analysis across different cohorts. eQTLs are genetic variants that explain variation in gene expression levels. An eQTL may regulate different genes depending on the tissue type and disease state. The cohort data required for an eQTL analysis are genotypes (DNA data) and gene expression levels (RNA data).
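As a rough illustration of what "explaining variation in gene expression" means, the sketch below fits a simple least-squares line of expression on genotype dosage (0, 1 or 2 copies of an allele) for one variant and one gene. This is a simplifying assumption made for illustration: real eQTL pipelines add covariates, normalisation and multiple-testing correction, and the function name `eqtl_slope` is hypothetical.

```python
from typing import Sequence


def eqtl_slope(dosages: Sequence[float], expression: Sequence[float]) -> float:
    """Least-squares slope: change in expression per extra copy of the allele."""
    n = len(dosages)
    if n != len(expression) or n < 2:
        raise ValueError("need two equally long series with at least two samples")
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    var_g = sum((g - mean_g) ** 2 for g in dosages)
    if var_g == 0:
        raise ValueError("all samples share the same genotype at this variant")
    cov = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    return cov / var_g


# Toy data: expression rises by one unit per allele copy.
slope = eqtl_slope([0, 1, 2, 1], [1.0, 2.0, 3.0, 2.0])
```

A slope close to zero would suggest that the variant has little effect on the expression of that gene in the sampled tissue.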
The third use case is running a federated Polygenic Risk Score (PRS) analysis across different cohorts, estimating the likelihood that an individual develops a particular phenotype given his or her genotype. A PRS weights trait-associated alleles across many loci. The data required from these cohorts are genotypes and phenotypes for the reference data, or a pre-existing PRS, plus genotypes and phenotypes from individuals of a similar ethnic background to evaluate the efficacy of the PRS.
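In its simplest form, a PRS is a weighted sum: for each variant, the number of copies of the trait-associated allele an individual carries is multiplied by a per-variant weight, and the products are added up. The sketch below assumes the weights are already available (for example from a pre-existing PRS). The variant IDs and weight values are made up for illustration.

```python
from typing import Mapping


def polygenic_risk_score(
    dosages: Mapping[str, float], weights: Mapping[str, float]
) -> float:
    """Sum of weight * allele dosage over variants present in both mappings."""
    return sum(w * dosages[variant] for variant, w in weights.items() if variant in dosages)


# Hypothetical weights and one individual's allele dosages (0, 1 or 2 copies).
weights = {"variant_1": 0.10, "variant_2": 0.25, "variant_3": -0.05}
individual = {"variant_1": 2, "variant_2": 1}  # variant_3 was not genotyped

score = polygenic_risk_score(individual, weights)
```

Variants missing from the individual's data are skipped here. A real analysis would have to decide whether to impute them or to restrict the score to variants genotyped in every cohort.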
All participating partners were supposed to meet in Toronto, Canada, from 17 to 19 March to discuss the current state of affairs. Unfortunately, the conference had to be cancelled because of the coronavirus outbreak. It was changed to an online meeting that took place on the same dates. Despite the changes, the partners involved were able to provide meaningful updates on the work done and on how to proceed. News and future events can be found on the CINECA website.