Capr+: The R package you didn't know you needed for your cohort creation

Have you ever wished that there was a simple R package with the same functionality as OHDSI’s ATLAS? Well, have we got news for you! Meet Capr+ (Capr-PLUS), an R package developed by The Hyve's Guus Wilmink.

In today's healthcare landscape, the ability to learn from real-world experience is just as important as controlled clinical research. Real-World Data (RWD) captures the complexity of patient journeys as they happen in real life. It spans electronic health records, insurance claims, and even wearable device readings, and offers a dynamic and often unfiltered view of treatment effectiveness and disease progression. Harnessing this data effectively requires not just access but structure, standardization, and collaboration. That’s where the Observational Health Data Sciences and Informatics (OHDSI) initiative comes in: an open-science movement that aims to transform fragmented healthcare data into a FAIR, analyzable resource, empowering researchers and clinicians worldwide to uncover patterns, improve patient care, and drive evidence-based policy at an unprecedented scale.

We at The Hyve are experts in leveraging OHDSI tools to facilitate research on healthcare data. Our expertise ranges from harmonizing observational health data into the OMOP Common Data Model (CDM) and other data frameworks, and ensuring data quality, to supporting cohort creation and feasibility assessments for study execution. This broad scope involves collaboration with diverse stakeholders, including pharmaceutical companies, universities, and government agencies, giving us valuable insight into the features that enhance the journey from raw data to completed research. Capr-PLUS was born from this understanding, shaped by our work with data partners in the PHEMS project, a Horizon Europe initiative.

The advantages of Capr-PLUS are not limited to the users of OMOP CDM. Teams that generate large numbers of patient cohorts from data in various data models or frameworks can significantly benefit from the cohort creation functionalities of ATLAS and Capr-PLUS. These tools empower research teams to create highly granular, fit-for-purpose patient cohorts without requiring technical expertise. With the addition of Capr-PLUS, this process is now faster and more efficient. 

Why Capr+

Capr was initially developed to enable users to construct OHDSI cohort definitions in R without the use of ATLAS. ATLAS is a web-based tool developed by the OHDSI community that facilitates cohort design and the execution of analyses. Capr is built upon the same foundational OHDSI Java library as ATLAS, circe-be, which composes SQL (or SQL-like) queries from cohort definitions expressed as CohortExpression objects. Capr is now part of the open-source OHDSI Health Analytics Data-to-Evidence Suite (HADES), an R-based toolkit built around OHDSI and the OMOP Common Data Model.

Like ATLAS, Capr supports building concept sets for use in cohort definitions. In OMOP, a concept set is a collection of concepts from the OMOP standard vocabulary that represent a clinical or observational idea. The OMOP vocabulary is a community-determined standard derived from commonly used standardized vocabularies, such as ICD-10-CM, SNOMED, and RxNorm. Because OMOP vocabularies are hierarchical, concept sets allow for the optional inclusion of ‘child’ concepts, which makes concept set building easier.

Researchers then use the concept sets created with Capr or ATLAS to construct project-relevant or clinically relevant patient cohorts that can be analyzed and compared against one another downstream. They may experiment with several cohort definitions before identifying their final cohorts. Because this process is highly iterative and can be very time-consuming, it is important for cohort creation tools to operate swiftly and efficiently.
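
To make the optional inclusion of ‘child’ concepts more concrete, here is a minimal sketch in Python using an in-memory SQLite database. It is not how Capr or ATLAS resolve concept sets internally. It only assumes the standard OMOP concept_ancestor table (with its ancestor_concept_id and descendant_concept_id columns), and the concept IDs and the resolve_concept_set helper are made up for illustration.

```python
import sqlite3


def resolve_concept_set(conn, items):
    """Expand (concept_id, include_descendants) pairs into a sorted list of IDs.

    Hypothetical helper for illustration only; not part of Capr or Capr-PLUS.
    """
    resolved = set()
    for concept_id, include_descendants in items:
        resolved.add(concept_id)
        if include_descendants:
            # concept_ancestor links every concept to all of its descendants.
            rows = conn.execute(
                "SELECT descendant_concept_id FROM concept_ancestor "
                "WHERE ancestor_concept_id = ?",
                (concept_id,),
            )
            resolved.update(row[0] for row in rows)
    return sorted(resolved)


# Toy vocabulary: concept 1 has children 2 and 3; concept 9 stands alone.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept_ancestor ("
    "ancestor_concept_id INTEGER, descendant_concept_id INTEGER)"
)
conn.executemany(
    "INSERT INTO concept_ancestor VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 2), (3, 3), (9, 9)],
)

with_children = resolve_concept_set(conn, [(1, True)])
without_children = resolve_concept_set(conn, [(1, False), (9, False)])
```

With the descendants flag switched on, a single parent concept pulls in its whole branch of the hierarchy. With it switched off, only the concepts listed explicitly end up in the set.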

ATLAS vs Capr

The intention behind Capr is to enable seamless workflows between cohort definitions and HADES tools. However, ATLAS provides some additional summary functionalities that Capr, at the time of writing, does not. Namely:

  1. ATLAS, unlike Capr, displays the occurrence frequency for each concept in a concept set per dataset. This includes frequencies on the record level and on the person level, both for the concept itself and for its hierarchical descendants. This helps assess how well each concept in a set is represented within the data.
  2. For each concept, ATLAS displays whether it is a Standard, Non-Standard, or Classification concept. To achieve interoperability with other OMOP datasets, it is important that Standard concepts are used wherever possible. Capr can therefore provide further value to users by integrating checks on concept standardness.

Incorporating these features into the Capr library as Capr-PLUS has substantially elevated its functionality, fostering greater transparency and offering valuable insights into the processes of concept set and cohort definition.

Use case: The PHEMS project

The PHEMS project (Pediatric Hospitals as European drivers for multi-party computation and synthetic data generation capabilities across clinical specialities and data types) is a consortium within the Horizon Europe programme that aims to revolutionize the way pediatric health data is managed and utilized across Europe. PHEMS involves a total of six hospitals from different parts of Europe, in addition to technical and expert organizations. To enable co-development, the project aims to harmonize the data held in the data pools of the different hospitals.

The project’s key target is to develop an ecosystem that enables research collaboration and information management using data from different hospitals in compliance with the EU General Data Protection Regulation (GDPR). Sensitive information is not transferred from its original location (an approach known as federated learning), and privacy is protected by anonymizing the data.

PHEMS is set to conduct studies on three key pediatric clinical use cases: managing cardiac patient pathways for children, predicting sepsis in pediatric intensive care, and treating hematology/hemophilia in young patients. These use cases aim to enhance understanding of pediatric healthcare operations and confirm the effectiveness of data.

For these use cases, meaningful cohort definitions need to be constructed based on both the end goal of each use case and the availability of data, particularly the variables to be collected. The availability of these variables (for example, heart rhythm metrics, lab measurements, and comorbidities in the cardiac use case) determines the extent to which the cohort definitions are feasible.

To make the construction of cohorts and their concept sets more insightful, The Hyve has implemented the aforementioned ‘missing’ ATLAS functionalities in an open-source R library called Capr-PLUS, a fork of the original Capr library. These implementations include the ‘countOccurrences’ function, which addresses point 1, and the ‘isStandard’, ‘isStandardCS’, and ‘isStandardDB’ functions, which address point 2.

1. The ‘countOccurrences’ function, as its name suggests, counts the occurrences of each concept given a connection to an OMOP CDM instance. It returns the Record Count, Person Count, Descendant Record Count, and Descendant Person Count for each concept per OMOP CDM table. The function runs a single SQL query against the CDM instance and builds the returned table, which lists each concept_id and its name, the OMOP table to which it has been mapped, and the counts.

Table 1. Example results from the ‘countOccurrences’ function, storing information about the concepts as well as their counts.

2. The ‘isStandard’, ‘isStandardDB’, and ‘isStandardCS’ functions return the standardness of concepts given a list of concepts, a CDM instance, or a concept set, respectively. The ‘isStandard’ and ‘isStandardDB’ functions use simple SQL queries that match the concepts in question against the OMOP ‘Concept’ table to fetch their standardness. This makes it possible to check the standardness of concepts irrespective of whether they are present in a concept set.

Table 2. Example results of the ‘isStandard’ function, which retrieves OMOP standardness for a list of concepts. Only non-standard results are returned.
Table 3. Example results of the ‘isStandardCS’ function, which retrieves standardness given a concept set. S = Standard concept, C = Classification concept.
Table 4. Example results of the ‘isStandardDB’ function, which retrieves standardness for the concepts in an OMOP database, checking concepts across OMOP tables. S = Standard concept, C = Classification concept.
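
The logic behind both kinds of lookup can be sketched roughly as follows, again in Python with an in-memory SQLite database. This is an illustration under stated assumptions, not the Capr-PLUS implementation: the count_occurrences and standardness helpers are hypothetical, only the condition_occurrence table is queried (the real function reports per OMOP CDM table), and descendant counts here include the concept itself because concept_ancestor holds a self-referencing row for every concept. The table and column names follow the OMOP CDM, where standard_concept is ‘S’, ‘C’, or empty.

```python
import sqlite3

# One query returns all four counts per concept (assumption: conditions only).
COUNT_SQL = """
SELECT c.concept_id,
       c.concept_name,
       (SELECT COUNT(*) FROM condition_occurrence co
         WHERE co.condition_concept_id = c.concept_id) AS record_count,
       (SELECT COUNT(DISTINCT co.person_id) FROM condition_occurrence co
         WHERE co.condition_concept_id = c.concept_id) AS person_count,
       (SELECT COUNT(*) FROM condition_occurrence co
          JOIN concept_ancestor ca
            ON ca.descendant_concept_id = co.condition_concept_id
         WHERE ca.ancestor_concept_id = c.concept_id) AS descendant_record_count,
       (SELECT COUNT(DISTINCT co.person_id) FROM condition_occurrence co
          JOIN concept_ancestor ca
            ON ca.descendant_concept_id = co.condition_concept_id
         WHERE ca.ancestor_concept_id = c.concept_id) AS descendant_person_count
  FROM concept c
 WHERE c.concept_id IN ({placeholders})
 ORDER BY c.concept_id
"""


def count_occurrences(conn, concept_ids):
    """Hypothetical helper: record and person counts, with and without descendants."""
    placeholders = ",".join("?" * len(concept_ids))
    sql = COUNT_SQL.format(placeholders=placeholders)
    return conn.execute(sql, list(concept_ids)).fetchall()


def standardness(conn, concept_ids):
    """Hypothetical helper: map concept_id to 'S', 'C', or None (non-standard)."""
    placeholders = ",".join("?" * len(concept_ids))
    sql = (
        "SELECT concept_id, standard_concept FROM concept "
        f"WHERE concept_id IN ({placeholders}) ORDER BY concept_id"
    )
    return dict(conn.execute(sql, list(concept_ids)).fetchall())


# Toy CDM with made-up concepts, patients, and records.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (concept_id INTEGER, concept_name TEXT, standard_concept TEXT);
CREATE TABLE concept_ancestor (ancestor_concept_id INTEGER, descendant_concept_id INTEGER);
CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER);
""")
conn.executemany(
    "INSERT INTO concept VALUES (?, ?, ?)",
    [(1, "Parent condition", "S"), (2, "Child condition", "S"),
     (3, "Source code", None), (4, "Grouping", "C")],
)
conn.executemany(
    "INSERT INTO concept_ancestor VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 2), (3, 3), (4, 4)],
)
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?)",
    [(10, 1), (10, 1), (11, 2), (12, 2), (10, 2)],
)

counts = count_occurrences(conn, [1, 2])
flags = standardness(conn, [1, 3, 4])
```

In this toy data the parent concept is recorded twice for a single patient, but once its child concept is counted as well it covers five records across three patients. That gap between direct and descendant counts is exactly what helps when judging how well a concept is represented in a dataset.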

Concluding remarks

In conclusion, Capr-PLUS offers a powerful, flexible tool for constructing patient cohorts from real-world data, for those working with OMOP as well as other data models. Its enhancements, such as the countOccurrences and isStandard functions, address critical gaps in the original Capr library: they give users deeper insight into concept sets and cohort definitions and make cohort definition faster, more transparent, and more granular. Its successful application in the PHEMS project underscores its potential to drive evidence-based research and improve patient outcomes.

The newly developed features have been submitted to HADES through the Capr GitHub repository and have gone through an initial review, and the resulting feedback is being implemented. We invite any users of OMOP and the HADES tool suite to share their suggestions for future enhancements. Feature requests can be submitted via the OHDSI developer forum or through the relevant GitHub repositories of HADES. The development of such tools continues to support engineers, researchers, and clinicians in making better use of healthcare data.
