Mapping UK Biobank to the OMOP CDM: development of two OHDSI tools

UK Biobank (UKB) is a medical registry containing vast amounts of medical data from 500,000 participants. Last year, The Hyve collaborated with University College London (UCL) to convert this information into the Observational Medical Outcomes Partnership common data model, or OMOP CDM (Read more). This article illustrates how we used, adapted and shared back with the community two of the OHDSI tools leveraged in this project: Usagi, used in the preparation phase, and the Data Quality Dashboard (DQD), used for quality assessment.

Open-source contribution to other OHDSI components

The journey to convert data into the OMOP CDM includes various steps, and within the OHDSI (Observational Health Data Sciences and Informatics) community several tools have been developed to make the process easier and more standardized. A general workflow using the OHDSI tools includes a thorough preparation phase, the design of the main ETL (Extract, Transform, Load) procedure, a quality assessment process, and finally making the data available and ready for analysis. Note that the in-depth design of the ETL is left to the developer, and OHDSI does not maintain a universal ETL tool.


The preparation of the data for the ETL aims to understand the data and to produce the syntactic and semantic mappings. Understanding the data is done with a tool named White Rabbit, which produces a report containing an overview of the source tables with descriptive statistics. The syntactic mapping is a document describing in detail how the source tables and fields will be translated to those of the target model. This document serves as a reference point for the ETL engineers and is created with Rabbit in a Hat. The second component, the semantic mapping, consists of translating the codes in the source data to an existing “standard” vocabulary that the OHDSI community has agreed to use in common. There are different levels of semantic mapping. The source values can either be already coded in the OMOP standard, or coded in a “non-standard” vocabulary. When the source values are coded in a “non-standard” vocabulary for which a mapping is available, it is simplest to use the existing mapping files. However, if the source values are coded in a “non-standard” vocabulary with no existing mapping available, or not coded at all (for example text fields), then the data owner or medical personnel should supervise a conversion effort to the appropriate “standard” vocabularies. Within the OHDSI community, the Usagi tool is used for this effort.

In short, Usagi is a Java application that takes all source concepts and finds suitable equivalent “standard” OMOP concepts. The advantage of Usagi is that it searches for similar terms in common vocabularies and classes, and shows a suggested mapping based on a matching score. The Usagi mappings can then be reviewed by multiple people involved in the project. Nevertheless, there might be cases in which the data owner is required to make the final decision based on their expertise.

Usagi interface. The overview table contains the source terms, the selected mapping shows the mapping suggested by Usagi (which can be changed), and the search facility lists all the “standard” concepts one can choose from.

Whilst performing the preparation steps for the UKB mapping, we found that several source concepts were pre-coordinated, meaning that one source concept combines two or more standard concepts. For example, the field ‘Alcohol intake frequency’ (1558) can take seven values (such as 1=‘Daily’ to 6=‘Never’ and -3=‘Prefer not to answer’). Each field and value pair carries both a question and an answer, which differs from the typical OMOP perspective of one source code mapping to one standard concept, and was not supported by the tool. Our proposed solution was to map this field (alcohol intake frequency) into a combination of variable and value from the available standard OMOP concepts. Following the example, the variable concept was ‘Alcohol intake’ and the value could be ‘Only occasionally’. We applied these changes by extending Usagi to allow the mapping of a source code to contain multiple standard concepts. You can follow the discussion in more detail in the relevant OHDSI forums post.
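To make the variable and value idea concrete, here is a minimal Python sketch of how one source field and value pair could resolve to two target concepts. It is only an illustration, not the Usagi implementation or its file format: the `Target` class, the `MAPPINGS` table, the `to_observation` helper and the choice of source value 5 for ‘Only occasionally’ are all assumptions made for this example, and concept names are used in place of real concept IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One target concept of a mapping (hypothetical structure)."""
    concept_name: str
    mapping_type: str  # 'EVENT' for the variable, 'VALUE' for its value


# Hypothetical mapping table: (UKB field id, source value) -> target concepts.
# Pairing source value 5 with 'Only occasionally' is an assumption for illustration.
MAPPINGS = {
    ("1558", "5"): [
        Target("Alcohol intake", "EVENT"),
        Target("Only occasionally", "VALUE"),
    ],
}


def to_observation(field_id, source_value):
    """Split a pre-coordinated source code into a variable and a value concept."""
    targets = MAPPINGS.get((field_id, source_value))
    if targets is None:
        return None  # unmapped source code
    by_type = {t.mapping_type: t.concept_name for t in targets}
    return {"variable": by_type.get("EVENT"), "value": by_type.get("VALUE")}


print(to_observation("1558", "5"))
```

Keeping the mapping type on each target lets a single source code carry any number of targets, which is the property the extension needed.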

Usagi functionality extension to accommodate pre-coordinated concepts. The selected mapping (middle rectangles) contains two target concepts, one with mapping type ‘EVENT’ and one with ‘VALUE’.

Usagi can accelerate the semantic mapping step, but it requires a high level of medical expertise. Thus multiple people often need to work with the same file. The Hyve added metadata fields to Usagi to enhance this collaborative approach. In order to improve the reviewing process of the semantic mappings, we extended the code to keep track of the authors. Then the mapping file can be shared between the collaborators whilst easily monitoring the participants who produced each mapping. In a similar way, there were mappings that required special attention. For example, mappings for which there is a higher level of uncertainty (“ill” could refer to many different conditions), or ones that are not worth retaining (“patient wears only pink clothes”). Therefore, this Usagi extension also includes ‘flag’ and ‘ignore’ buttons to clearly highlight those cases, respectively.

Usagi interface with the added metadata for the person who performed the review (top right corner) and options to ignore or flag mappings (bottom rectangle).

Since the user provides the tool with the initial source file, an older version of the tool is still very useful if none of the above cases apply to your data. You are also free to load Usagi export files from an older version into the latest version. In that case, however, not all the fields shown above will be filled, and they will have to be entered anew if desired.

"A common question is if Usagi can be used as a general vocabulary mapping tool. The application was designed to be used with the OMOP CDM vocabularies, which contains over a hundred source terminologies (Athena Standardised vocabularies) like SNOMED, ICD, LOINC and RxNORM. However, a user can theoretically index any vocabulary of interest, as long as it is provided in the OMOP format" - Maxim Moinat (Data Engineer, The Hyve).

All additions turned out to be useful, making the concept mapping process more collaborative between The Hyve and UCL, and they resulted in higher levels of completeness and confidence in the mappings. You can always find the latest version of the tool on its GitHub page!

Data Quality Dashboard

Following the general workflow, once the preparation step is complete we can perform the ETL. Once the ETL is finished, we need a form of quality assessment to make sure that the data are ready for subsequent analysis, for example an epidemiological study.

The OHDSI toolset includes a tool for this purpose: the Data Quality Dashboard. It includes a set of distinct quality checks that are organized and harmonized to be comparable between different data sets. It follows the work of M.G. Kahn et al. (2016), who define three categories of data quality: conformance, completeness and plausibility. Importantly, these individual tests are run on the data owner's side, thus maintaining the privacy of the data.

In essence, the DQD is an R package that runs more than 3,000 individual quality checks on the OMOPed data. Each check passes or fails depending on a predefined threshold for the percentage of rows that satisfy the specific condition of that check. For example, the DQD checks whether a person's year of birth is always after 1850. The DQD runs these tests on the OMOPed database, on the data owner's side. It then returns a summary of the results in a JSON (JavaScript Object Notation) file that can be shared among the conversion team members. The DQD summary results can be used to check for major issues in the source data (for example, compulsory OMOP CDM fields left empty) or in the transformation itself, and also to highlight areas that might need improvement (for example, lack of coverage of certain domains).
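The pass or fail logic of a single check can be sketched in a few lines of Python. This is not the DQD code (the DQD itself is an R package that runs its checks against the database); the function name `evaluate_check`, its parameters and the sample years of birth are assumptions made for this illustration, and the check is phrased here in terms of the rows that violate the condition.

```python
import json


def evaluate_check(num_violated_rows, num_rows, threshold_pct):
    """Fail a check when the percentage of violating rows exceeds the threshold."""
    pct_violated = 100.0 * num_violated_rows / num_rows if num_rows else 0.0
    status = "FAIL" if pct_violated > threshold_pct else "PASS"
    return {
        "pct_violated": pct_violated,
        "threshold_pct": threshold_pct,
        "status": status,
    }


# Made-up sample data: one of four persons has a year of birth not after 1850.
years_of_birth = [1948, 1952, 1840, 1961]
violations = sum(1 for year in years_of_birth if year <= 1850)

result = evaluate_check(violations, len(years_of_birth), threshold_pct=1.0)
print(json.dumps(result))  # a summary like this can be shared as JSON
```

With a threshold of 1%, a single implausible year of birth among four rows is enough to fail the check, while the same check on clean data passes.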

Interface of the DQD. Figure from the DQD GitHub repository (

In practice we run the ETL several times to improve the quality of the mapping, and we run the DQD after each ETL iteration. We can visualize the difference in performance between iterations in the newly added graph below. In this example, most of the checks pass (blue dots). Compared with the previous ETL run, most checks remain similar (see dashed line), and a few have improved.


One of the previously mentioned data quality categories is completeness (or concept mapping coverage). However, this information is often hidden among the other quality checks. Therefore we made a visualization of the concept mapping coverage from the DQD results. It shows the percentage of unique concepts from the source data that are present in the OMOPed data (in light blue), and the percentage of records in the source data that could be successfully mapped (in dark blue). This utility was especially useful during the UK Biobank ETL because it allowed us to work in a collaborative way, particularly when deciding the focus of further code improvements.
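The two coverage figures answer different questions, which a small Python sketch can show. It assumes the usual OMOP convention that concept ID 0 marks an unmapped record; the function `mapping_coverage`, the source codes and the target concept IDs are made up for this example and are not taken from the DQD.

```python
def mapping_coverage(records):
    """Return (% of unique source codes mapped, % of records mapped).

    `records` is a list of (source_code, target_concept_id) pairs, where a
    target concept ID of 0 means that no standard concept was found.
    """
    if not records:
        return 0.0, 0.0
    all_codes = {code for code, _ in records}
    mapped_codes = {code for code, concept_id in records if concept_id != 0}
    mapped_records = sum(1 for _, concept_id in records if concept_id != 0)
    pct_codes = 100.0 * len(mapped_codes) / len(all_codes)
    pct_records = 100.0 * mapped_records / len(records)
    return pct_codes, pct_records


# Made-up records: code "A" is frequent and mapped, code "B" is unmapped.
records = [("A", 101), ("A", 101), ("B", 0), ("C", 102)]
pct_codes, pct_records = mapping_coverage(records)
print(f"unique codes mapped: {pct_codes:.1f}%, records mapped: {pct_records:.1f}%")
```

Because frequent codes weigh more in the record figure, the two numbers can diverge, and that gap is what helps decide which unmapped codes to tackle first.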

Barplot showing the concept mapping coverage of one ETL iteration from UKB.

Lastly, as previously mentioned, the DQD yields a list of checks that may pass or fail depending on a predetermined threshold and the percentage of the converted data that satisfy a specific condition. The tool holds default thresholds for each check. However, for the UKB conversion effort we sometimes wanted to deviate from the defaults. For example, we considered it important that all observations have a date in the past, so we enforced a stricter rule by changing this check's threshold from the 1% default value to 0%. In the current code the threshold values are written in a large table. Changing a threshold value would mean editing that table, a process that is cumbersome and prone to errors due to its size. To facilitate the process, we created a new framework consisting of a script that takes a simpler table and applies the changes automatically.
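The idea of applying a small table of overrides to a large threshold table can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the script we contributed: the column names (`check_name`, `cdm_table`, `cdm_field`, `threshold`), the check names and the function `apply_overrides` are hypothetical and do not reflect the actual DQD threshold files.

```python
import csv
import io

# Hypothetical stand-in for the large default threshold table.
FULL_TABLE = """check_name,cdm_table,cdm_field,threshold
date_in_past,observation,observation_date,1
date_in_past,measurement,measurement_date,1
is_required,person,year_of_birth,0
"""

# The simpler table: only the thresholds that should deviate from the default.
OVERRIDES = """check_name,cdm_table,cdm_field,threshold
date_in_past,observation,observation_date,0
"""


def row_key(row):
    """Identify a threshold by check, table and field."""
    return (row["check_name"], row["cdm_table"], row["cdm_field"])


def apply_overrides(full_csv, overrides_csv):
    """Return the rows of the full table with the overridden thresholds applied."""
    overrides = {
        row_key(row): row["threshold"]
        for row in csv.DictReader(io.StringIO(overrides_csv))
    }
    rows = list(csv.DictReader(io.StringIO(full_csv)))
    unmatched = set(overrides)
    for row in rows:
        key = row_key(row)
        if key in overrides:
            row["threshold"] = overrides[key]
            unmatched.discard(key)
    if unmatched:
        # A typo in the override table should not be silently ignored.
        raise KeyError(f"Overrides without a matching check: {sorted(unmatched)}")
    return rows


updated = apply_overrides(FULL_TABLE, OVERRIDES)
for row in updated:
    print(row["check_name"], row["cdm_table"], row["cdm_field"], row["threshold"])
```

Raising an error for overrides that match nothing is a design choice of this sketch: it surfaces exactly the kind of mistake that manual editing of a large table tends to hide.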

These new utilities facilitated the ETL journey, enhanced the collaboration between The Hyve and UCL, and improved the final OMOPed data set. The new DQD utilities can be found on GitHub!


In summary, The Hyve performs ETL with proprietary tools (like our Delphyne tool introduced here) in addition to the general OHDSI tools. For the UK Biobank project, The Hyve extended Usagi and the DQD.

The additions to the tools aim to fill gaps that one might find whilst using the tools for specific cases. The new utilities have definitely made our work easier and have strengthened our collaboration with UCL. The proposed additions are incorporated into the latest release of Usagi. Soon they will also be added to the new DQD release. This way they are easily accessible to the OHDSI community.

At The Hyve we have a mission of helping scientists and enabling open software. We are doing it one utility at a time.