Mapping UK Biobank to the OMOP CDM: development of two OHDSI tools

The UK Biobank (UKB) is a medical registry containing vast amounts of medical data from 500,000 participants. Last year, The Hyve collaborated with University College London (UCL) to convert this data into the Observational Medical Outcomes Partnership Common Data Model, or OMOP CDM (Read more). This article describes how we used, adapted, and shared back with the community two of the OHDSI tools used in that project: Usagi, used in the preparation phase, and the Data Quality Dashboard (DQD), used for quality assessment.

Open-source contribution to other OHDSI components

The journey to convert data into the OMOP CDM includes various steps. Within the OHDSI (Observational Health Data Sciences and Informatics) community, several tools have been developed to make the process easier and more standardized. A general workflow using the OHDSI tools includes a thorough preparation phase, the design of the main ETL (Extract, Transform, Load) procedure, a quality assessment process, and finally, making the data available and ready for analysis. The ETL in-depth design is left to the developer, and OHDSI does not maintain a universal tool.

Usagi

The data preparation step before ETL aims to understand the data and to produce the syntactic and semantic mappings. Understanding the data is done with the White Rabbit tool, which produces a report containing an overview of the source tables with descriptive statistics. The syntactic mapping is a detailed document describing how the source tables and fields will be translated into those of the target model. This document serves as a reference point for the ETL engineers and is created using the Rabbit in a Hat tool. The semantic mapping consists of translating the codes in the source data to an existing “standard” vocabulary that the OHDSI community has agreed to use as a common one.

There are different levels of semantic mapping. The source values can be coded either in the OMOP standard already or in a “non-standard” vocabulary. When the source values are coded in a “non-standard” vocabulary for which a mapping is available, it is simpler to use the existing mapping files. However, when the source values are coded in a “non-standard” vocabulary with no existing mapping, or are not coded at all (for example, text fields), the data owner or medical personnel should supervise a conversion effort to the appropriate “standard” vocabularies. Within the OHDSI community, the Usagi tool is used for this effort.

Usagi is a Java application that takes all source concepts and finds suitable equivalent “standard” OMOP concepts. The advantage of Usagi is that it searches for similar terms in common vocabularies and classes and shows a suggested mapping based on a matching score. The Usagi mappings can then be reviewed by multiple people involved in the project. Nevertheless, there might be cases in which the data owner is required to make the final decision based on their expertise.
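To make the idea of a matching score concrete, here is a minimal Python sketch that ranks candidate terms by string similarity to a source term. It is not how Usagi computes its score: it uses the standard-library difflib module as a stand-in, and the function name suggest_mappings and the candidate terms are made up for illustration.

```python
from difflib import SequenceMatcher


def suggest_mappings(source_term, candidate_terms, top_n=3):
    """Rank candidate terms by string similarity to the source term.

    Illustrative stand-in only: Usagi uses its own term-matching logic.
    """
    scored = []
    for candidate in candidate_terms:
        # ratio() returns a similarity between 0.0 and 1.0.
        matcher = SequenceMatcher(None, source_term.lower(), candidate.lower())
        scored.append((candidate, matcher.ratio()))
    # Highest score first; a reviewer still confirms or overrides the suggestion.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Hypothetical candidate names, not taken from the OMOP vocabularies.
    candidates = ["Asthma", "Heart attack", "Fracture of arm"]
    for name, score in suggest_mappings("heart attack", candidates):
        print(f"{score:.2f}  {name}")
```

The point of the sketch is the workflow rather than the scoring method: the tool proposes ranked candidates, and a person with domain knowledge makes the final choice.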


Usagi interface. The overview table contains the source terms; the selected mapping shows the mapping suggested by Usagi, which can be changed; and the search facility lists all the “standard” concepts one can choose from.

While performing the preparation steps for the UKB mapping, we found that several concepts required pre-coordination between two or more concepts. For example, the field ‘Alcohol intake frequency’ (1558) can take seven values (1=‘Daily’ to 6=‘Never’ and -3=‘Prefer not to answer’). Representing such a field requires combining concepts, which differed from the typical OMOP perspective and was not supported by the tool. Our proposed solution was to map this field into a combination of a variable and a value taken from the available standard OMOP concepts. Following the example, the variable concept was ‘Alcohol intake,’ and the value could be ‘Only occasionally.’ We implemented this by extending Usagi so that the mapping of a source code can contain multiple standard concepts. You can follow the discussion in more detail in the relevant OHDSI forums post.
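The variable-plus-value idea can be sketched as a small Python data structure. This is only an illustration, not Usagi's file format: the class name TargetConcept, the function split_targets, and the source key are hypothetical, and the answer code is left as a placeholder because the text does not say which answer corresponds to ‘Only occasionally’. The mapping types ‘EVENT’ and ‘VALUE’ are the ones used in the extended tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetConcept:
    concept_name: str
    mapping_type: str  # "EVENT" (the variable) or "VALUE" (its value)


# One source answer mapped to two target concepts.
# The answer code is a placeholder, not a real UKB code.
MAPPINGS = {
    ("1558", "<answer code>"): [
        TargetConcept("Alcohol intake", "EVENT"),
        TargetConcept("Only occasionally", "VALUE"),
    ],
}


def split_targets(targets):
    """Return (event_name, value_name) from a list of target concepts."""
    event = next((t.concept_name for t in targets if t.mapping_type == "EVENT"), None)
    value = next((t.concept_name for t in targets if t.mapping_type == "VALUE"), None)
    return event, value


if __name__ == "__main__":
    print(split_targets(MAPPINGS[("1558", "<answer code>")]))
```

Keeping the variable and the value as separate targets means the ETL can store each in its own field instead of needing one standard concept per possible answer.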

Usagi functionality extension to accommodate pre-coordinated concepts. The selected mapping (middle rectangles) contains two target concepts, one with mapping type ‘EVENT’ and one with ‘VALUE’.

Usagi can accelerate the semantic mapping step, but it requires a high level of medical expertise. Thus multiple people often need to work with the same file. The Hyve added metadata fields to Usagi to enhance this collaborative approach. To improve the reviewing process of the semantic mappings, we extended the code to keep track of the authors. The mapping files can now be shared between the collaborators while easily monitoring the participants who produced each mapping.
Similarly, some mappings required special attention: for example, mappings with a higher level of uncertainty (“ill” could refer to many different conditions) or ones that are not worth retaining (“patient wears only pink clothes”). Therefore, this Usagi extension also includes ‘flag’ and ‘ignore’ buttons to mark those cases, respectively.

Usagi interface with the added metadata for the person who performed the review (top right corner) and options to ignore or flag mappings (bottom rectangle).

Since the user provides the tool with the initial source file, an older version of Usagi can be used if the above cases apply to your data. There is also the option to load export files from older versions of Usagi into the latest version; however, not all the fields shown above will be filled, and the user will have to enter them manually.

"A common question is if Usagi can be used as a general vocabulary mapping tool. The application was designed to be used with the OMOP CDM vocabularies, which contains over a hundred source terminologies (Athena Standardised vocabularies) like SNOMED, ICD, LOINC, and RxNORM. However, a user can theoretically index any vocabulary of interest, as long as it is provided in the OMOP format" - Maxim Moinat (Data Engineer, The Hyve).

All additions turned out to be useful, making the concept mapping process more collaborative between The Hyve and UCL, resulting in higher levels of completeness and confidence in the mappings. You can always find the latest version of the tool on its GitHub page!

Data Quality Dashboard (DQD)

Following the general workflow, we can perform the ETL once the preparation step is complete. Once the ETL is achieved, we need to perform a quality assessment to ensure that the data are ready for subsequent analysis, for example, an epidemiological study.

The OHDSI toolset has a tool for this purpose: the Data Quality Dashboard (DQD). It includes a set of distinctive quality checks, organized and harmonized to be comparable between different data sets. It follows the work of M.G. Kahn et al. (2016), where they define three categories of data quality: conformance, completeness, and plausibility. It is important to note that these individual tests are run at the data owners' side, thus maintaining their privacy.

The DQD is an R package that runs more than 3,000 individual quality checks on the OMOPed data. Each check passes or fails depending on a predefined threshold: the percentage of rows that are allowed to violate the condition tested by that check. For example, the DQD checks whether a person's year of birth is always after 1850. The DQD runs these tests against the OMOPed database on the data owner's side. Then, it returns a summary of the results as a JSON (JavaScript Object Notation) file that can be shared among the conversion team members. The DQD summary results can be used to check for major issues in the source data, for example, if compulsory OMOP CDM fields are left empty. They can also be used to check the transformation itself and highlight areas that might need improvement (for example, lack of coverage of certain domains).
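The threshold logic described above can be sketched in a few lines of Python. This is not the DQD code (the DQD is an R package that runs SQL against the database); the function run_check and its result keys are hypothetical, and the rule that a check fails when the violating percentage is strictly above the threshold is an assumption of this sketch.

```python
def run_check(rows, is_violation, threshold_pct):
    """Evaluate one quality check over a list of rows.

    Assumption: the check fails when the percentage of violating rows
    is strictly greater than threshold_pct.
    """
    total = len(rows)
    violated = sum(1 for row in rows if is_violation(row))
    pct_violated = 100.0 * violated / total if total else 0.0
    return {
        "violated_rows": violated,
        "total_rows": total,
        "pct_violated": pct_violated,
        "failed": pct_violated > threshold_pct,
    }


if __name__ == "__main__":
    # Made-up rows; a year of birth of 1850 or earlier counts as a violation.
    persons = [{"year_of_birth": y} for y in (1940, 1955, 1962, 1849)]
    result = run_check(persons, lambda p: p["year_of_birth"] <= 1850, threshold_pct=1.0)
    print(result)
```

With a threshold of 0, a single violating row is enough to fail the check, which is the strictest setting.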

The interface of the DQD. Figure from the DQD GitHub repository (https://github.com/OHDSI/DataQualityDashboard).

In practice, we run the ETL several times to improve the mapping quality. We also run the DQD after each ETL iteration. We can visualize the performance difference between iterations in the - newly added - graph below. In the video below, most of the checks pass (blue dots). Compared to the previous ETL run, most checks remain similar (see dashed line), and a few have improved.

Scatterplot comparing the data quality results from two consecutive ETL iterations from UKB.

One of the previously mentioned data quality categories is completeness (or concept mapping coverage). However, this information is often hidden among the other quality checks. Therefore, we made a visualization of the concept mapping coverage from the DQD results. It shows the percentage of unique source concepts that are present in the OMOPed data (in light blue), and the percentage of source records that could be successfully mapped (in dark blue). This utility was handy during the UK Biobank ETL because it allowed us to work collaboratively, particularly when deciding the focus of further code improvements.
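The two coverage figures can be illustrated with a short Python sketch. It is a simplified stand-in, not the DQD utility itself: the function mapping_coverage and its input shape (one pair of source code and mapped flag per source record) are assumptions made for this example.

```python
def mapping_coverage(records):
    """Compute concept mapping coverage.

    records: iterable of (source_code, is_mapped) pairs, one per source record.
    Returns (pct_unique_codes_mapped, pct_records_mapped).
    """
    records = list(records)
    if not records:
        return 0.0, 0.0
    all_codes = {code for code, _ in records}
    mapped_codes = {code for code, is_mapped in records if is_mapped}
    mapped_records = sum(1 for _, is_mapped in records if is_mapped)
    return (
        100.0 * len(mapped_codes) / len(all_codes),
        100.0 * mapped_records / len(records),
    )


if __name__ == "__main__":
    # Made-up data: one frequent mapped code and two rare unmapped codes.
    data = [("A", True)] * 8 + [("B", False), ("C", False)]
    pct_codes, pct_records = mapping_coverage(data)
    print(f"unique codes mapped: {pct_codes:.1f}%, records mapped: {pct_records:.1f}%")
```

Looking at both numbers helps with prioritization: in the made-up data only one of three codes is mapped, yet most records are covered because the mapped code is the frequent one.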

Barplot showing the concept mapping coverage of one ETL iteration from UKB.


Lastly, as previously mentioned, each DQD check passes or fails depending on a predetermined threshold and the percentage of the converted data that satisfies a specific condition. The tool holds a default threshold for each check. However, for the UKB conversion effort, we sometimes wanted to deviate from the default thresholds. For example, we considered it important that all observations have a date in the past, so we enforced a stricter rule by changing this check's threshold from the 1% default value to 0%.

The default threshold values are written in a large table in the current code. Changing a threshold means editing this table, a process that is cumbersome and error-prone because of the table's size. To facilitate the process, we created a new framework consisting of a script that takes a simpler table and automatically applies the changes.
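The general approach of applying a small override table to a large default table can be sketched as follows. This is not the script we contributed: the column names (check, table, field, threshold), the check name date_not_in_future, and the functions read_rows and apply_overrides are all hypothetical and do not reflect the actual DQD threshold file layout.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def row_key(row):
    """Identify a threshold row by check name, table, and field."""
    return (row["check"], row["table"], row["field"])


def apply_overrides(threshold_rows, overrides):
    """Return a copy of threshold_rows with the overridden thresholds applied.

    Raises KeyError if an override does not match any existing row,
    so that typos in the override table are caught early.
    """
    updated = {row_key(r): dict(r) for r in threshold_rows}
    for override in overrides:
        key = row_key(override)
        if key not in updated:
            raise KeyError(f"no such check: {key}")
        updated[key]["threshold"] = override["threshold"]
    return list(updated.values())


if __name__ == "__main__":
    defaults = read_rows(
        "check,table,field,threshold\n"
        "date_not_in_future,observation,observation_date,1\n"
        "is_required,person,person_id,0\n"
    )
    overrides = read_rows(
        "check,table,field,threshold\n"
        "date_not_in_future,observation,observation_date,0\n"
    )
    for row in apply_overrides(defaults, overrides):
        print(row)
```

Keeping the overrides in their own small table makes the project-specific deviations easy to review, and the defaults remain untouched.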

These new utilities facilitated the ETL journey, enhanced the collaboration between The Hyve and UCL, and improved the final OMOPed data set. These new DQD utilities can be found on GitHub!

Summary

In summary, The Hyve performs ETL with proprietary tools (for example, with the Delphyne tool introduced here) in addition to the general OHDSI tools. For the UK Biobank project, The Hyve further developed Usagi and the DQD.

The additions to the tools aim to fill gaps that one might find while using the tools for specific cases. These new utilities helped make the work more straightforward and strengthened our collaboration with UCL. The proposed additions are incorporated into the latest release of Usagi. Soon these changes will also be added to the new DQD release. This way, they are easily accessible to the OHDSI community.

Our mission at The Hyve is to help the scientific community by enabling open science, and we strive to do this one utility at a time.