Benchmarking big observational health data

Many users of tools for querying data in data warehouses, such as tranSMART and OHDSI, face performance issues when trying to analyse large amounts of observational health data (OHD). When searching a large database for all patients who smoke and have high blood pressure, you may have to wait minutes before the answer appears on screen. During my internship at The Hyve, I am trying to find solutions for such performance issues in big data analysis.

What we see now is that traditional relational database management systems (RDBMSs) have trouble keeping up with the increasingly large amounts of OHD collected by healthcare providers. RDBMSs are a better match for transactional workloads with many rapid, small data updates than for occasional, intensive aggregation queries. Modern columnar database systems seem better suited to big-data queries, but there is no healthcare-relevant comparison between RDBMSs and column-based database systems on which to base solid decisions about how to bring data warehouses up to speed.
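To make the difference in storage layout concrete, here is a toy sketch in Python. It is not how any of the systems in this post is implemented; the field names, the sample values, and the 140 mmHg threshold for "high blood pressure" are assumptions chosen purely for illustration.

```python
# Toy illustration only: real DBMSs add compression, indexes and
# vectorized execution on top of these layouts.

# Row-oriented layout: all attributes of one record are stored together.
# Field names and values are hypothetical.
rows = [
    {"patient_id": 1, "smoker": True, "systolic_bp": 150},
    {"patient_id": 2, "smoker": False, "systolic_bp": 120},
    {"patient_id": 3, "smoker": True, "systolic_bp": 135},
    {"patient_id": 4, "smoker": True, "systolic_bp": 160},
]

# Column-oriented layout: each attribute is stored as its own array.
columns = {
    "patient_id": [r["patient_id"] for r in rows],
    "smoker": [r["smoker"] for r in rows],
    "systolic_bp": [r["systolic_bp"] for r in rows],
}


def count_row_store(records, threshold=140):
    """Count smokers with high blood pressure by visiting whole records."""
    return sum(
        1 for r in records if r["smoker"] and r["systolic_bp"] >= threshold
    )


def count_column_store(cols, threshold=140):
    """Same count, but only the two relevant columns are read."""
    return sum(
        1
        for smoker, bp in zip(cols["smoker"], cols["systolic_bp"])
        if smoker and bp >= threshold
    )


row_count = count_row_store(rows)
column_count = count_column_store(columns)
print(row_count, column_count)
```

Both functions return the same answer. The point is that the columnar version never touches `patient_id` or any other attribute the query does not need, which matters when a table has dozens of columns and millions of rows.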

During my internship, I am evaluating alternative database systems to power data analysis tools for OHD. I defined a benchmark of example data and queries to help make decisions based on clear metrics. To make sure the results are meaningful, I used the MIMIC-III dataset (1), which contains eleven years of observational health data from approximately 60,000 intensive care unit admissions of over 40,000 patients. The dataset consists of bedside vital sign measurements, laboratory tests, medications, demographics, and many other variables. The queries I defined cover a wide range of healthcare use cases, many of which were gathered from data warehousing applications here at The Hyve.
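For readers who want a feel for what such a benchmark looks like, below is a minimal timing harness in Python. It is a sketch under stated assumptions, not the benchmark used in this study: SQLite (from the Python standard library) stands in for the systems under test, the `observations` table and its columns are hypothetical and unrelated to the MIMIC-III schema, the data is synthetic, and the median over five runs is just one possible metric.

```python
import sqlite3
import statistics
import time


def time_query(conn, sql, repeats=5):
    """Run a query several times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch everything so the query completes
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# SQLite in memory is only a stand-in for a real DBMS under test.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations ("
    "patient_id INTEGER, smoker INTEGER, systolic_bp INTEGER)"
)
# Synthetic rows, generated purely for illustration.
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [(i, i % 2, 100 + i % 80) for i in range(10_000)],
)

# Hypothetical benchmark queries, keyed by a descriptive name.
queries = {
    "smokers_with_high_bp": (
        "SELECT COUNT(DISTINCT patient_id) FROM observations "
        "WHERE smoker = 1 AND systolic_bp >= 140"
    ),
}

results = {name: time_query(conn, sql) for name, sql in queries.items()}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

Repeating each query and reporting the median reduces the influence of one-off effects such as cold caches. A real comparison would also need identical hardware, identical data, and a tuned configuration for every system.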

Four different open-source database management systems (DBMS) are up for the test: Apache Druid, ClickHouse, MonetDB, and PostgreSQL. These systems differ, amongst other things, in storage architecture and available features.

My prediction is that all columnar DBMSs will outperform PostgreSQL, the traditional DBMS currently used in tranSMART and OHDSI, on many performance criteria. Among the columnar DBMSs, I expect ClickHouse and MonetDB to outperform Apache Druid because they have less overhead. If you’re interested in the technical details of my study, the full-length article can be found here. And make sure to check out my next blog post with the results!

1. MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35.

Written by

Bas Katsma