Making End-To-End Privacy-Preserving Federated Research a Reality

Customer

Our customer is a global, top-10 pharmaceutical company that is committed to delivering innovative medicines to improve patient outcomes.

The challenge

For a disease-specific research network, our customer is connecting its federated analysis platform to real-world data sources in hospitals worldwide. Federated data analysis is a privacy-preserving technique that allows multiple parties to collaboratively analyze data without sharing patient-level information. It achieves this by bringing the analysis code to the data, rather than moving the data itself, thereby protecting sensitive information and complying with privacy regulations such as GDPR and HIPAA.
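As a purely illustrative sketch of this principle (not the customer's actual platform), the following computes a pooled mean across three hospitals, where each site runs the analysis code locally and returns only a count and a sum:

    # Illustrative sketch of the federated principle: the analysis code runs
    # at each site, and only aggregates cross the site boundary.
    # Function and variable names are hypothetical, not the platform's API.

    def local_summary(ages):
        """Runs inside a hospital; only a count and a sum leave the site."""
        return {"n": len(ages), "sum": float(sum(ages))}

    def pooled_mean(site_summaries):
        """Runs centrally; combines aggregates without patient-level data."""
        total_n = sum(s["n"] for s in site_summaries)
        total_sum = sum(s["sum"] for s in site_summaries)
        return total_sum / total_n

    # Three hospitals each compute a local summary of, say, age at diagnosis:
    summaries = [
        local_summary([54, 61, 72]),
        local_summary([48, 59]),
        local_summary([66, 70, 75, 58]),
    ]
    print(pooled_mean(summaries))  # only aggregates were ever shared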

Like centralized analysis, federated analysis requires data to be harmonized — meaning all data sources must share a common structure and vocabulary.

To lower the barrier for hospitals to participate in the research network, the task of data harmonization was assigned to The Hyve. To ensure end-to-end privacy preservation, The Hyve had to harmonize patient-level data without ever accessing it. This was especially demanding, since the participating hospitals were located in different countries and used a broad range of data management systems and non-standardized data formats. Beyond these data-focused challenges, we also had to address highly varying levels of resource availability and technical skills across the participating hospitals.

Our goal was to solve the challenge while applying three important principles:

  • Low effort: streamlined and efficient processes, keeping the effort required from participating hospitals low and predictable.
  • Easy: lightweight data pipelines, easy to deploy and run, using broadly adopted open-source technology.
  • High quality: built-in validation to ensure the harmonized data meets the quality standards of the research network.

How we solved it

For each participating hospital, we developed a source-specific data pipeline to convert the source data to the data model used in the network. Together with the customer, we created a custom data model that specifically accommodates the analysis needs of the research network. Our long-term experience with a wide range of data models covering all domains of the pharmaceutical lifecycle, from translational biology to target discovery and real-world data, helped speed up this purpose-driven model design.

Preparation

As a first step, we asked the hospitals to run WhiteRabbit, OHDSI's open-source data profiling tool, on their data sources.

WhiteRabbit scans the source data and reports in detail on the tables, the fields, and the values that appear in each field, without revealing any patient-level data. The scan produces a report that we used as a reference when designing the harmonization scripts and overall pipelines.
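Because the scan report is a regular Excel workbook, it can also be inspected programmatically. The snippet below is a minimal sketch of that; the sheet and column names are assumptions based on the default layout of recent WhiteRabbit versions and may differ per version:

    import pandas as pd

    # Read the field-level overview from the WhiteRabbit scan report.
    # Sheet and column names are assumptions based on the default layout.
    overview = pd.read_excel("ScanReport.xlsx", sheet_name="Field Overview")

    # Count scanned fields per source table ...
    print(overview.groupby("Table")["Field"].count())

    # ... and flag mostly-empty columns, the kind of detail that guided
    # the design of the harmonization scripts.
    sparse = overview[overview["Fraction empty"] > 0.9]
    print(sparse[["Table", "Field", "Fraction empty"]])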
Next, we organized a mapping workshop with each data provider's domain expert to create structure mappings (source-to-target table mappings). We created these mappings in Rabbit-In-a-Hat, into which we loaded the WhiteRabbit scan report (source information) and the research network's custom data model (target information). The workshop was also used to clarify the meaning of individual fields in the source data and to discuss their semantic mappings, both to capture the true meaning of the data and to better understand site-specific characteristics.
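Rabbit-In-a-Hat itself is a graphical tool, but its output eventually has to drive code. As an illustration (all table and field names below are invented), a structure mapping can be carried into a pipeline as a declarative source-to-target map:

    # Hypothetical example of a structure mapping expressed in code:
    # one source table mapped onto a table of the network's custom model.
    STRUCTURE_MAPPINGS = {
        "admissions": {                       # source table
            "target_table": "visit",          # table in the custom data model
            "fields": {
                "admission_date": "visit_start_date",
                "discharge_date": "visit_end_date",
                "ward_code": "care_site_code",
            },
        },
    }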

Implementation

After collecting all required source data information, we implemented the source-to-target mappings as a Python pipeline writing to a PostgreSQL database. Before sharing the pipeline with a site, we tested it using synthetic data generated from the WhiteRabbit scan report.
Sites could then run the data harmonization pipeline in their own environment, following easy-to-understand instructions provided by The Hyve. A successful run produced a converted dataset in the database; if a run failed, the log report clearly indicated at which step the conversion stopped, making it straightforward to find a resolution.
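The sketch below condenses the shape of such a pipeline, assuming pandas and SQLAlchemy against the site's Postgres instance; the connection string, file names, and transform are placeholders, not the delivered code:

    import logging
    import pandas as pd
    from sqlalchemy import create_engine

    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    log = logging.getLogger("harmonization")

    # Placeholder connection string; each site configures its own database.
    engine = create_engine("postgresql://user:password@localhost:5432/harmonized")

    def transform_admissions(df):
        """Hypothetical transform: rename source fields to the target model."""
        return df.rename(columns={"admission_date": "visit_start_date"})

    STEPS = [
        ("extract", lambda ctx: ctx.update(df=pd.read_csv("admissions.csv"))),
        ("transform", lambda ctx: ctx.update(df=transform_admissions(ctx["df"]))),
        ("load", lambda ctx: ctx["df"].to_sql("visit", engine, if_exists="replace", index=False)),
    ]

    ctx = {}
    for name, step in STEPS:
        try:
            step(ctx)
            log.info("step '%s' completed", name)
        except Exception:
            # The log makes clear at which step the conversion failed.
            log.exception("step '%s' failed", name)
            raise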

Validation/Verification

The last step in the conversion pipeline was validation of the converted data. To ensure consistency and high data quality across the federated research network, we designed and implemented detailed checks for data completeness, consistency, plausibility, precision, and conformance. We also implemented descriptive statistics, giving the customer early insight into the data and informing feasibility assessments even before a source was connected to the federated platform.
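A strongly simplified sketch of such checks is shown below; the actual validation package covers each quality dimension in far more depth, and all field names here are illustrative:

    import pandas as pd

    def validate_visits(visits: pd.DataFrame) -> list:
        """Toy examples of three of the five quality dimensions."""
        issues = []
        # Completeness: required fields must be filled.
        if visits["visit_start_date"].isna().any():
            issues.append("completeness: missing visit_start_date values")
        # Plausibility: a visit cannot end before it starts.
        if (visits["visit_end_date"] < visits["visit_start_date"]).any():
            issues.append("plausibility: visit_end_date before visit_start_date")
        # Conformance: codes must come from the agreed vocabulary.
        allowed = {"inpatient", "outpatient", "emergency"}
        if not set(visits["visit_type"].dropna()).issubset(allowed):
            issues.append("conformance: unexpected visit_type codes")
        return issues

    # Descriptive statistics shared alongside the quality report.
    def describe(visits: pd.DataFrame) -> pd.DataFrame:
        return visits.describe(include="all")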

After running the data validation package, the hospitals shared the data quality and summary statistics report with The Hyve, and resolutions for data quality issues were discussed in a data harmonization review meeting. Data quality fixes were implemented, and the updated data harmonization pipeline was shared with the site. On average, three implement-run-validate iterations were needed per participating hospital.

The outcome

The data harmonization pipelines developed by The Hyve enabled our customer to build a fully privacy-preserving federated research network, powered by high-quality data.

Let's start collaborating

Working with real-world data across hospitals? Need high-quality, GDPR-compliant harmonization without accessing patient-level data?

The Hyve can help you build a scalable, privacy-preserving data infrastructure that is lightweight, easy to implement, and tailored to your research goals.
