This case study describes how The Hyve supported University College London (UCL) in mapping data from the UK Biobank (UKB) to the OMOP Common Data Model (OMOP CDM). UK Biobank is a large-scale registry containing medical and genetic data from half a million participants from the UK’s general population − mainly healthy subjects − aged 37 to 73 years. The initial data were collected from 2006 to 2010 and have been routinely enriched since then. The dataset consists of multiple baseline assessments, which include surveys, blood, urine and/or DNA samples, and MRI imaging data. The information is additionally linked to electronic health records (EHRs) (from primary care, secondary care and cancer care, including historical records), death registries, and − more recently − information on COVID-19 testing.
The UKB database is globally accessible to approved research initiatives seeking to shed light on common and life-threatening diseases, providing insights into prevention, risk factors, and treatment efficacy. In short, UKB is an extraordinary resource at the service of modern medicine, which will keep enabling scientific discoveries that benefit human health in years to come.
UCL contracted The Hyve in the context of the European Health Data Evidence Network (EHDEN), a precompetitive initiative around OMOP/OHDSI. The Hyve is a leading provider of OMOP/OHDSI services and tech leader in EHDEN. We also support the training of Small and Medium-sized Enterprises (SMEs) and help pharmaceutical companies, academia and healthcare institutions adopt open-source tools by teaching their IT experts and researchers how to map data to the OMOP CDM. The main goal of the collaboration between The Hyve and UCL was to make the UKB dataset available for research related to the COVID-19 pandemic by converting it to a standardised and widely adopted model for observational health data. The process came with several challenges that called for custom solutions, including the in-house creation of a new ETL development tool: Delphyne.
Why map the UKB data to OMOP?
Sound and reliable analytics require data standards, interoperable vocabularies, and ETL conventions. This is typically not possible with unprocessed real-world data based on EHRs, for example due to the inconsistent use of medical terminology and measurement units, or the excessive occurrence of free text fields. Such inconsistencies are typically a consequence of observational data being captured in different ways or reflecting different conventions (or lack thereof) in various care settings, etc. The breadth of clinical, scientific and technical competencies required in the journey from source data to evidence further complicates the process; this is especially true for the vast and heterogeneous dataset provided by UKB.
The Hyve supported UCL in mapping UKB data to the OMOP CDM (resulting in a so-called “OMOPed” dataset) to ensure the interoperability and reproducibility of future analysis results. The OMOP CDM provides a standardised data model and terminology that can be used for research across healthcare providers and regions. The model is complemented by the OHDSI suite, a set of open-source tools for standardised analytics that enables researchers from multiple organisations to work collaboratively on data held at different locations. The collection of tools is constantly evolving and expanding to meet the needs of the research community.
While mapping UKB data to the OMOP CDM, The Hyve was able to overcome common challenges encountered when working with observational health data by adopting a collaborative approach with UCL and by developing new technical solutions.
Collaborative Agile development
Thanks to the close collaboration with UCL, The Hyve was able to fully understand and process privacy-sensitive UKB data without obtaining direct access to them. This is a common issue when working with patient data: privacy rules and regulations often prohibit access or make the transfer of personal data particularly costly and complex. To tackle it, we used a process based on Agile development practices that The Hyve has perfected over the years while working with other healthcare providers, registries and research organisations. Here are the main steps we took with regard to the UKB data stored and managed by UCL:
- UCL provided The Hyve with a scan report of the data source, produced with the White Rabbit OHDSI tool. The Hyve used the scan report to generate synthetic datasets for ETL development (step 1).
- The Hyve developed the ETL (step 2).
- Once tested with the synthetic data, the code was run locally by UCL on a part of the original UKB data. Data quality checks were executed and The Hyve was given feedback on the changes that needed to be made to the code. The Hyve refined the ETL scripts and returned them to UCL for a new iteration (step 3).
- Finally, the ETL was run on the complete UKB dataset at UCL, producing the full OMOPed version and a final report.
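The first step above, generating synthetic data from a scan report, can be illustrated with a minimal sketch. It assumes the scan report has already been reduced to value frequencies per source field; the `SCAN_REPORT` dictionary, its field names and values, and the `generate_synthetic_rows` helper are all hypothetical and do not reflect the actual White Rabbit file layout or the tooling The Hyve used.

```python
import random

# Hypothetical scan-report excerpt: for each source field, the observed
# values and how often they occurred. A real scan report is a spreadsheet;
# this dict is only a stand-in for illustration.
SCAN_REPORT = {
    "sex": {"Female": 54, "Male": 46},
    "assessment_centre": {"centre_A": 30, "centre_B": 50, "centre_C": 20},
    "year_of_birth": {"1948": 25, "1955": 40, "1962": 35},
}


def generate_synthetic_rows(scan_report, n_rows, seed=0):
    """Draw each field independently, weighted by its observed frequency."""
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    rows = []
    for i in range(n_rows):
        row = {"person_id": i + 1}  # synthetic identifier, not a real participant ID
        for field, counts in scan_report.items():
            values = list(counts)
            weights = list(counts.values())
            row[field] = rng.choices(values, weights=weights, k=1)[0]
        rows.append(row)
    return rows


synthetic = generate_synthetic_rows(SCAN_REPORT, n_rows=5)
for row in synthetic:
    print(row)
```

Sampling every field independently preserves the per-field value distributions but not the correlations between fields. That is usually sufficient for exercising ETL code paths, and it means no real participant record can be reconstructed from the synthetic data.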
This collaborative Agile development approach, based on regular consultation and feedback sessions with our client, is fundamental to achieving a high-quality, high-coverage output that enables different analysis queries. In other words, the time invested with UCL in understanding the dataset and technical requirements resulted in a versatile OMOPed UKB dataset that will form a sound basis for multiple future research studies, without the need to remap the data for specific purposes. This saves our client time and money in the long term. Moreover, assuming no large restructuring of the UKB registry will occur, the high level of code optimization means that the ETL could also be executed on new UKB data releases with no or very minor changes, should the need arise.
A new ETL framework
While The Hyve makes regular use of open-source tools from the OHDSI stack, the bulk of the mapping work for UKB was performed using a custom ETL framework: Delphyne. Rather than replacing or competing with existing OHDSI tools, we developed Delphyne to fill a gap in the current set of capabilities and integrate seamlessly with the rest of the tools. Delphyne can facilitate the conversion of any data source to the OMOP CDM, and we intend to use the ETL framework for other projects and clients. Moreover, our developers can easily adapt the package to specific source data peculiarities or conversion goals.
For the UKB mapping effort, Delphyne specifically helped us address the following challenges:
- Mapping of non-standard ontologies. A first mapping challenge we faced was that some fields contained free text, while others made use of a variety of ontologies. Reviewing, harmonising, and mapping such terms can entail a significant manual effort. In order to enable better collaborative research, the OHDSI community prescribes which ontologies to use (for example, SNOMED). These are referred to as standard vocabularies. For free-text fields, there is little choice but to manually review the mappings to the standard OMOP vocabularies. However, for fields containing terms from valid non-standard OMOP vocabularies (e.g. ICD-9-CM, ICD-10, ICD-O-3, dm+d, OPCS-4, Read, HES specialty), Delphyne allowed us to completely automate the semantic mappings to the standard OMOP vocabularies.
- Data heterogeneity amongst data providers. The different healthcare systems in England, Scotland, and Wales do not always gather the same information, and they may use different ontologies, for example for drug prescriptions or clinical procedures. This can lead to errors if the data are not handled with specific mapping logic for each provider. To facilitate quality assessment of the ETL output, we extended the standard CDM v5.3 model with fields that captured the provenance of each record. Delphyne made this process extremely straightforward by allowing us to customise the OMOP CDM table definitions.
- Conversion of a wide format table to a long format. UKB participants can have up to four baseline assessments, executed during multiple visits, that are in the form of questionnaires, interviews, measurements, biological samples (blood, urine, saliva), data from activity monitors, or scans. Thus for each patient, a wide variety of variables and time points had to be extracted from a 500,000 by 9,000 table. Delphyne enabled us to easily implement and integrate in the ETL the custom processing logic required for this table, despite its unusual format. Moreover, we could execute the code in a memory-efficient way, thanks to its built-in batch mode functionality and other performance optimization options.
- Working with an evolving data source. While we were implementing the UKB ETL, COVID-19 data were added to the biobank. Mapping these new data required an update of the mapping scripts. This was easily implemented, without disruptions to the general ETL execution flow, thanks to the modular structure of the ETL framework. We also needed to refresh the OMOP vocabularies whenever updates relevant to the UKB mapping effort were published, for example when a UKB-specific OMOP vocabulary was released. This has typically been a manual and time-consuming process; however, Delphyne has greatly improved its efficiency by automating the loading of standard vocabularies from disk.
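To illustrate the automated semantic mapping mentioned in the first challenge, here is a minimal sketch of how a source code from a non-standard vocabulary can be resolved to a standard concept by following "Maps to" relationships in the OMOP vocabulary tables. It is not Delphyne's implementation: the helper names `build_lookup` and `map_code` are hypothetical, the tables are tiny in-memory excerpts, and the concept IDs are invented for the example.

```python
# Toy excerpts of the OMOP CONCEPT and CONCEPT_RELATIONSHIP tables.
# Concept IDs and the SNOMED codes are invented for illustration only.
CONCEPT = [
    {"concept_id": 1001, "vocabulary_id": "ICD10", "concept_code": "I10", "standard_concept": None},
    {"concept_id": 1002, "vocabulary_id": "ICD10", "concept_code": "E11", "standard_concept": None},
    {"concept_id": 2001, "vocabulary_id": "SNOMED", "concept_code": "snomed-a", "standard_concept": "S"},
    {"concept_id": 2002, "vocabulary_id": "SNOMED", "concept_code": "snomed-b", "standard_concept": "S"},
]
CONCEPT_RELATIONSHIP = [
    {"concept_id_1": 1001, "concept_id_2": 2001, "relationship_id": "Maps to"},
    {"concept_id_1": 1002, "concept_id_2": 2002, "relationship_id": "Maps to"},
    {"concept_id_1": 2001, "concept_id_2": 1001, "relationship_id": "Mapped from"},
]


def build_lookup(concepts, relationships):
    """Return {(vocabulary_id, concept_code): [standard concept_ids]}."""
    source_ids = {(c["vocabulary_id"], c["concept_code"]): c["concept_id"] for c in concepts}
    standard_ids = {c["concept_id"] for c in concepts if c["standard_concept"] == "S"}
    targets = {}
    for rel in relationships:
        # Only "Maps to" links that end in a standard concept are followed.
        if rel["relationship_id"] == "Maps to" and rel["concept_id_2"] in standard_ids:
            targets.setdefault(rel["concept_id_1"], []).append(rel["concept_id_2"])
    return {key: targets.get(cid, []) for key, cid in source_ids.items()}


def map_code(lookup, vocabulary_id, code):
    """Resolve a source code; unknown or unmapped codes fall back to concept_id 0."""
    return lookup.get((vocabulary_id, code)) or [0]


lookup = build_lookup(CONCEPT, CONCEPT_RELATIONSHIP)
print(map_code(lookup, "ICD10", "I10"))
print(map_code(lookup, "ICD10", "Z99"))
```

The function returns a list because a single source code may map to more than one standard concept. Falling back to concept ID 0 for unmapped codes follows the OMOP convention and keeps unmapped records visible for later review.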
Additionally, Delphyne allowed us to track the ETL execution steps with detailed logging and summary reports, which is particularly useful in the context of an iterative development approach like the one we carried out with UCL.
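The wide-to-long conversion and the batch-wise, logged execution described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not Delphyne's API: the column naming `<field>-<instance>.<array>`, the sample values, and the `wide_to_long` helper are assumptions made for the example.

```python
import csv
import io
import logging
from itertools import islice

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("wide_to_long")

# Tiny stand-in for the wide baseline table: one row per participant,
# one column per field/instance/array combination (assumed naming scheme).
WIDE_CSV = """eid,50-0.0,50-1.0,21001-0.0
1,170,171,24.5
2,165,,27.1
3,,180,
"""


def wide_to_long(reader, batch_size):
    """Yield batches of long-format records, skipping empty cells."""
    while True:
        batch = list(islice(reader, batch_size))  # only batch_size wide rows in memory
        if not batch:
            break
        records = []
        for row in batch:
            for column, value in row.items():
                if column == "eid" or value == "":
                    continue
                field_id, rest = column.split("-")
                instance, array_index = rest.split(".")
                records.append({
                    "eid": row["eid"],
                    "field_id": field_id,
                    "instance": instance,
                    "array_index": array_index,
                    "value": value,
                })
        logger.info("batch of %d wide rows -> %d long records", len(batch), len(records))
        yield records


reader = csv.DictReader(io.StringIO(WIDE_CSV))
long_records = [rec for batch in wide_to_long(reader, batch_size=2) for rec in batch]
print(len(long_records))
```

Because the reader is consumed lazily, only one batch of wide rows is held in memory at a time, and the per-batch log line gives a simple running account of how many records each step produced.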
“A successful journey from Real World Data to Real World Evidence is made of collaborative efforts: multiple expertises need to join forces working with open-source tools, the most powerful asset to empower an open-science mindset.” - Rosa Bianca Gallo, Sales Executive, The Hyve
UKB is an incredibly rich resource for healthcare research. Given its size and complexity, the mapping work required great adaptability and a constant exchange of ideas. For example, over the course of the project we frequently consulted the OHDSI community and shared some of our learnings back with it. We founded and led the OHDSI UKB working group (WG) (currently known as Registry WG), leveraged the expertise of the OHDSI Oncology WG, and initiated several OHDSI Forums discussions.
From a technical perspective, a one-size-fits-all approach does not work when tackling datasets as complex and heterogeneous as UKB. It is important to use flexible tools that allow for custom solutions, tailored to the dataset characteristics and to the specific analysis requirements. A powerful ETL framework such as Delphyne was essential in successfully carrying out the conversion effort. Delphyne and existing open-source components in the OHDSI suite allowed us to keep our approach agile and deliver a high-quality mapping of the UKB data, enabling future research to build upon our efforts.
This particular project with UCL also led us to improve two widely used OHDSI suite components: Usagi and the Data Quality Dashboard (DQD).
More about that in a future blog from The Hyve’s OMOP specialists!
“The Hyve is a European leading SME focusing primarily on open-source solutions in healthcare and bioinformatics. Their acknowledgment of the FAIR principles and close connection with the OHDSI community result in an efficient support of open science with respect for data privacy and sensitivity. Our cooperation with The Hyve on harmonization of two health data sources within the international projects BigData@Heart and EHDEN was very smooth. Perfect communication, together with The Hyve’s experience in iterative and agile development using synthetic data helped to separate the development process and the ETL deployment on a client side. Moreover, even though the access to the deployment environment was restricted to The Hyve, their experts were always able to provide consultation and help, including further quality assessment on harmonization results.” - Spiros Denaxas and Václav Papež, University College London