Common Data Models for FAIR biomedical data

Kees van Bochove

01-03-2019 10 min read

Much of The Hyve's day-to-day business can be summarized as supporting customers in making their biomedical data FAIR. In practice, this means that we have now executed dozens of projects harmonizing biomedical data, with healthcare sources ranging from local GP offices to clinical trials from the largest pharmaceutical companies in the world, and with biology data ranging from whole genome sequencing to physical activity data. In the course of these projects, we've also encountered and worked with a dozen different models for representing clinical data, such as i2b2/tranSMART, OMOP, FHIR, RDF, CDISC SDTM, ODM, etc. Using a standard, instead of just an Excel file and codebook, greatly helps to implement FAIR principle R1.3: (meta)data meet domain-relevant community standards.

In this blog post I will address some of the questions that people often ask me regarding the Common Data Model for FAIR biomedical data.

"How do you choose which data model applies to your data management process?"

This is not an easy question to answer, and it depends on many different factors, such as the nature (observational? interventional?) and source of the data, update frequency, intended usage, current and expected data modalities, applicable rules, regulations, best practices etc. A nice example of a context-based comparison of data models is the recent EMA report on comparing OMOP vs Sentinel for European healthcare data. But of course there are general patterns. A national health data network will have a different data integration approach than a bench-side cell line experiment. I would like to share a few insights based on 'frequently asked questions' that we often get about common data models as a step to make biomedical healthcare data FAIR.

"What models can you recommend?"

Let's look at a generalized use case, where we try to find a data model and tooling to harmonize data from several biomedical studies (say clinical trials or investigator-initiated studies) for analytics purposes. Important questions here are who the beneficiaries of this data integration are and what type of analysis they are planning to do with the data (the 'use case' in IT lingo). Especially when starting out, you should try to demonstrate and maximize the utility of the approach directly to the end-users (whether these are translational medicine scientists, department heads, medical doctors, citizens/patients etc.). If you approach The Hyve with such a request, we can help you find a suitable technical integration approach, for example:

using OHDSI tooling to query observational health datasets in a uniform way, leveraging the OMOP data model
using an interface such as Glowing Bear on top of conformed clinical trial data using the i2b2/tranSMART data model
using Avro schemas from RADAR-BASE to persist data from mobile questionnaires and wearable sensors
using cBioPortal to provide oncology researchers with direct access to the integrated clinical (survival status, cancer staging, etc.) and omics (SNPs, CNVs etc.) data
using PhUSE or a similar RDF-based approach to ensure that the data is queryable with semantic web and search tools such as Disqover
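
Whichever of these approaches fits, the harmonization step underneath tends to come down to mapping each study's local codes and units onto one shared representation. Below is a minimal sketch of that idea in Python. The study names, codebooks, field names and the HarmonizedRecord type are all hypothetical and only serve as illustration; a real project would map to the vocabulary of the chosen data model instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarmonizedRecord:
    """Target representation shared by all studies (hypothetical)."""
    study: str
    subject_id: str
    sex: str          # "female", "male" or "unknown"
    weight_kg: float


# Hypothetical per-study codebooks: each study encodes sex differently.
SEX_CODEBOOKS = {
    "study_a": {"M": "male", "F": "female"},
    "study_b": {"1": "male", "2": "female"},
}

# Hypothetical unit conventions: study_a reports kg, study_b reports pounds.
WEIGHT_TO_KG = {"study_a": 1.0, "study_b": 0.45359237}


def harmonize(study, raw):
    """Map one raw study record (a dict) onto the shared representation."""
    # Unmapped local codes fall back to "unknown" instead of raising.
    sex = SEX_CODEBOOKS[study].get(str(raw["sex"]), "unknown")
    weight_kg = round(float(raw["weight"]) * WEIGHT_TO_KG[study], 1)
    return HarmonizedRecord(study, str(raw["id"]), sex, weight_kg)


if __name__ == "__main__":
    print(harmonize("study_a", {"id": "A-001", "sex": "F", "weight": 70}))
    print(harmonize("study_b", {"id": 17, "sex": 1, "weight": 154}))
```

Keeping the codebooks as data rather than as code means that adding a study is a matter of adding one more mapping, which is essentially what a community standard gives you for free.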

"Which services does The Hyve provide?"

We help academic hospitals and our big pharma customers improve their level of data harmonization. This benefits analytics and data reuse per the FAIR principles, from a proof of concept (POC) based on a few studies to a company-wide strategy for making R&D data FAIR. In this outtake from a Pistoia webinar, I discussed a few examples (see the video below). In the rest of this post, I will highlight OHDSI as one example.

[Embedded video: outtake from the Pistoia webinar]

"As an example, where does the OMOP CDM fit in?"

This is a question I often get. There is much more to say on this topic than a blog post can do justice to, but it is a valid question and there isn't a lot of practical information out there to help answer it. So here is an attempt to summarize some of our findings at The Hyve, where the OMOP CDM is one of the data models we often use in projects.

The OMOP data model is very well suited for observational data and is widely used (an estimated 1 billion health records worldwide are represented in it). Its scope is to model (observational) medical history data in such a way that it enables systematic analysis of associations between interventions (drug exposure, procedures, healthcare policy changes, etc.) and outcomes of these interventions (condition occurrences, mortality etc.). The default OHDSI tool, ATLAS, excels at performing this type of systematic analysis over multiple databases and the associated tools also allow advanced analyses such as patient-level prediction. There's a ton of material on the use of OMOP and OHDSI, and also the ETL and analytics involved, on the OHDSI YouTube channel. The OMOP model can be found here.
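
To make the idea of relating interventions to outcomes a little more concrete, here is a minimal sketch in Python using an in-memory SQLite database. The table and column names follow the OMOP CDM tables person, drug_exposure and condition_occurrence, but heavily simplified (the real tables have many more columns). The concept IDs and patient rows are invented placeholders, and the query is only an illustration, not how ATLAS or the other OHDSI tools work internally.

```python
import sqlite3


def build_demo_db():
    """Create a tiny, simplified OMOP-like database with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY,
            year_of_birth INTEGER
        );
        CREATE TABLE drug_exposure (
            drug_exposure_id INTEGER PRIMARY KEY,
            person_id INTEGER,
            drug_concept_id INTEGER,
            drug_exposure_start_date TEXT
        );
        CREATE TABLE condition_occurrence (
            condition_occurrence_id INTEGER PRIMARY KEY,
            person_id INTEGER,
            condition_concept_id INTEGER,
            condition_start_date TEXT
        );
    """)
    conn.executemany("INSERT INTO person VALUES (?, ?)",
                     [(1, 1950), (2, 1962), (3, 1971)])
    # Concept IDs 1001/1002/2001 are placeholders, not real vocabulary IDs.
    conn.executemany("INSERT INTO drug_exposure VALUES (?, ?, ?, ?)", [
        (10, 1, 1001, "2018-01-10"),
        (11, 2, 1001, "2018-03-05"),
        (12, 3, 1002, "2018-02-01"),
    ])
    conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)", [
        (20, 1, 2001, "2018-02-15"),  # after the exposure to drug 1001
        (21, 2, 2001, "2017-12-01"),  # before the exposure, so not an outcome
        (22, 3, 2001, "2018-04-01"),  # after exposure to a different drug
    ])
    return conn


def persons_with_outcome_after_exposure(conn, drug_concept_id,
                                        condition_concept_id):
    """Return person_ids with the condition recorded after the drug start."""
    rows = conn.execute("""
        SELECT DISTINCT d.person_id
        FROM drug_exposure d
        JOIN condition_occurrence c ON c.person_id = d.person_id
        WHERE d.drug_concept_id = ?
          AND c.condition_concept_id = ?
          -- ISO-8601 date strings compare correctly as text
          AND c.condition_start_date > d.drug_exposure_start_date
        ORDER BY d.person_id
    """, (drug_concept_id, condition_concept_id)).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(persons_with_outcome_after_exposure(conn, 1001, 2001))
```

Because every OMOP database shares the same tables and concept vocabulary, a query of this shape can in principle be run unchanged across many sites, which is what makes systematic analysis over multiple databases feasible.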

"Can you give us a flavour of the OHDSI community?"

One of the main driving forces behind the OHDSI community is Patrick Ryan from the pharmaceutical company Janssen. In his talks, I’ve seen him use OMOP databases with several hundred million patient records in them. The scale of use is impressive and truly worldwide. For example, Korea has a very active K-OHDSI community, and China has its own ‘OHDSI China’ meetings. The OHDSI meetings are also famous for their great atmosphere. Last year in Rotterdam, we even had one of our close collaborators and leaders in OHDSI Europe, prof. Peter Rijnbeek from ErasmusMC, singing on stage! My personal favourite OHDSI gimmick is the 'LegendMed Central', which has millions of automatically generated research papers, showing how real the need is for systematic, large-scale evidence generation in observational research versus single hypothesis-based studies (background). Within the community you’ll encounter a great mix of passion for helping patients and scientific rigour, together with friendship and a sense of community.

"How is The Hyve involved with OMOP/OHDSI?"

We are supporting multiple customers with their enterprise-scale implementations of OMOP, and also provide OMOP mapping and OHDSI tools installation services. Within the EHDEN project we lead the technical implementation work package, together with Janssen, and we are even rolling out a European federated health data network based on OMOP together with Odysseus! I can highly recommend everyone using OMOP/OHDSI to join the OHDSI community and attend their meetings.

"And what about i2b2/tranSMART and other models such as CDISC?"

The i2b2/tranSMART, FHIR, openEHR, CDISC and cBioPortal data models and associated applications all enable different use cases and have their own scope. For example, cBioPortal (both data model and application) was specifically built to analyse cancer genomics datasets and study associations between the genomic make-up of cancer cells and clinical outcomes such as survival. cBioPortal can be seen in action on its public website (see for example patient view and a cohort view). In addition, the CDISC models are optimized to represent clinical study and trial data in a way that is transparent for regulators. The i2b2/tranSMART data model is well suited to represent clinical trial data alongside electronic health record (EHR) data (which could even come from OMOP) and to support tasks such as data browsing, data access requests, cohort selection and data sharing with tools like Glowing Bear and Podium. The i2b2/tranSMART model leverages a star schema, which provides a lot of flexibility to define the domains and concepts you would like to include. You can check out the loading tools.
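
The flexibility of a star schema is easiest to see in a small example. The sketch below, in Python with an in-memory SQLite database, uses a central observation_fact table surrounded by patient_dimension and concept_dimension tables. The names follow i2b2 conventions but the tables are strongly simplified, and the concept codes (DEMO:...) and all rows are invented for illustration. It is not the actual tranSMART schema or its loading tooling.

```python
import sqlite3


def build_star_schema():
    """Create a simplified i2b2-style star schema with invented demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patient_dimension (
            patient_num INTEGER PRIMARY KEY,
            sex_cd TEXT
        );
        CREATE TABLE concept_dimension (
            concept_cd TEXT PRIMARY KEY,
            name_char TEXT
        );
        -- The fact table: one row per observation, pointing at dimensions.
        CREATE TABLE observation_fact (
            patient_num INTEGER REFERENCES patient_dimension(patient_num),
            concept_cd TEXT REFERENCES concept_dimension(concept_cd),
            start_date TEXT,
            nval_num REAL
        );
    """)
    conn.executemany("INSERT INTO patient_dimension VALUES (?, ?)",
                     [(1, "F"), (2, "M"), (3, "F")])
    add_concept(conn, "DEMO:SBP", "Systolic blood pressure (mmHg)")
    conn.executemany("INSERT INTO observation_fact VALUES (?, ?, ?, ?)", [
        (1, "DEMO:SBP", "2019-01-05", 150.0),
        (2, "DEMO:SBP", "2019-01-07", 120.0),
        (3, "DEMO:SBP", "2019-01-09", 145.0),
    ])
    return conn


def add_concept(conn, concept_cd, name):
    """A new concept or domain is just a new dimension row, no schema change."""
    conn.execute("INSERT INTO concept_dimension VALUES (?, ?)",
                 (concept_cd, name))


def select_cohort(conn, concept_cd, min_value):
    """Return patients with at least one observation above min_value."""
    rows = conn.execute("""
        SELECT DISTINCT f.patient_num
        FROM observation_fact f
        JOIN concept_dimension c ON c.concept_cd = f.concept_cd
        WHERE c.concept_cd = ? AND f.nval_num > ?
        ORDER BY f.patient_num
    """, (concept_cd, min_value)).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_star_schema()
    print(select_cohort(conn, "DEMO:SBP", 140))

    # Adding a completely different modality reuses the same tables.
    add_concept(conn, "DEMO:STEPS", "Daily step count")
    conn.execute("INSERT INTO observation_fact VALUES (?, ?, ?, ?)",
                 (2, "DEMO:STEPS", "2019-01-07", 8000.0))
    print(select_cohort(conn, "DEMO:STEPS", 5000))
```

The same cohort query works for blood pressure and for step counts, because the meaning of each observation lives in the concept dimension rather than in dedicated columns.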

Again, this quick overview doesn't really do all these tools and models justice. There's a lot more to say about, for example, the rise of FHIR, the importance of architecting for scalability and the support of multiple modalities, the metadata that should go with these datasets, and our research on data model crossovers such as representing clinical trials in OMOP. Implementation of the FAIR principles remains a very active area of research, which we continue in our projects such as FAIRplus and EHDEN, and with our collaborators in the Pistoia Alliance, GO-FAIR, DTL, ELIXIR, etc.

We’d like to hear your opinions and suggestions on what you would like to see covered in a future post!