Initial Work on tranSMART’s “core”

Introduction

At the tranSMART Developer Workshop in London last February, there was a consensus that the future of tranSMART should include a core that implements the essential functionality of the application. The rest would be built upon this core, opening the way for a more modular architecture – with a stable API against which plugins could be developed – and for a better-quality code base, since writing the core would imply at least a partial rewrite of tranSMART’s haphazard code.

The exact boundaries of this core were quite unclear after the workshop, but it could be inferred that it should include a data model and an interface for accessing and possibly submitting the data.

Modules core-api and core-db

Given the interest of the community in developing a core component, we decided to start development and see what this component could look like.

We decided to create two modules:

The module core-api is a Groovy Maven project with the interfaces needed to program against the core. It contains very little logic.

The module core-db is a Grails plugin that implements core-api and adds other functionality upon which the tranSMART Grails application relies. The transmartApp project, which is the current, monolithic tranSMART Grails application, was made to depend on core-db (although transmartApp supports a plugin mechanism, which is used for an R plugin, it is nevertheless tightly coupled to that “plugin”). This allows for a transitional phase in which functionality can be moved gradually from transmartApp to core-db and other future plugins.

It should be noted that core-db probably has too large a scope right now. Arguably, it should only implement core-api and not include other functionality such as controllers. This can be addressed in the future.

Elimination of i2b2 Application Dependency

The tranSMART application requires having i2b2 running on a JBoss server with which it can communicate. We decided to focus our initial core development effort on eliminating this runtime dependency. The task was small enough to be feasible in a few weeks.

First, we need to clarify the relationship between i2b2 and tranSMART. TranSMART appropriated the i2b2 data model and extended it in several ways. It eschewed the standard i2b2 tools and broke the abstraction provided by i2b2: except in the specific instances described below, it goes directly to the i2b2 tables in the database instead of using the interfaces that the i2b2 cells provide via web services.

At this stage, there is little point in keeping this link that enforces some degree of i2b2 compatibility. The biggest advantage would be compatibility with i2b2 tools and the ability to incorporate i2b2 upstream developments. But the model has already been extended to the point where standard i2b2 tools cannot capture all the information in the database, and merging upstream developments is both dangerous (tranSMART does not use the standard interfaces) and of limited usefulness (tranSMART mostly cares only about the i2b2 data model, not the i2b2 logic as implemented in i2b2, as the calls to i2b2 are few). Given this scenario, it makes sense to sever the connection to i2b2 and give tranSMART complete freedom to evolve independently.

In any case, if it turns out that it’s desirable to synchronize the two projects in the future, the abstraction set forth in core-api will make such a task easier, as one could replace the logic in core-db with calls to i2b2 web services. Consumers of core-api would not need to be changed. Finally, the reimplementation carried out tries not to break the i2b2 data model, even if that means populating tables and columns otherwise unused by core-db.

Remaining i2b2 Calls

The i2b2 calls that remained were made by the JavaScript front-end. Because the web browser cannot communicate directly with i2b2 owing to the same-origin policy, tranSMART provides a proxy that forwards XML requests and responses between the web client and i2b2 (incidentally, this proxy has several security issues – it does no validation on the endpoint specified by the client, so that even file:// URLs are allowed, and it does not restrict the calls that can be made to i2b2).
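As an illustration of the endpoint problem, a proxy of this kind could check the requested URL against an allowlist before forwarding anything. The following is only a minimal Python sketch of that idea, not tranSMART's proxy code; the names ALLOWED_ENDPOINTS, ALLOWED_PATH_PREFIXES and is_allowed_endpoint, as well as the host, port and paths, are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of (scheme, host, port) the proxy may contact.
ALLOWED_ENDPOINTS = {("http", "i2b2.example.org", 9090)}

# Hypothetical path prefixes, one per i2b2 cell the front-end needs.
ALLOWED_PATH_PREFIXES = (
    "/i2b2/rest/OntologyService/",
    "/i2b2/rest/QueryToolService/",
)


def is_allowed_endpoint(url: str) -> bool:
    """Return True only for URLs that point at an allowlisted i2b2 service."""
    parts = urlsplit(url)
    # Reject file://, ftp:// and anything else that is not plain HTTP(S).
    if parts.scheme not in ("http", "https"):
        return False
    try:
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False
    if port is None:
        port = 443 if parts.scheme == "https" else 80
    if (parts.scheme, parts.hostname, port) not in ALLOWED_ENDPOINTS:
        return False
    # Refuse path traversal, then restrict the calls that can be made.
    if ".." in parts.path.split("/"):
        return False
    return parts.path.startswith(ALLOWED_PATH_PREFIXES)
```

With such a check in place, a request for file:///etc/passwd or for an arbitrary host would be refused before any connection is opened.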

Three i2b2 cells were involved – Project Management, Ontology Management and Data Repository (CRC).

The Project Management cell is unimportant; its purpose is to allow discovery of the available cells and their service endpoints.

For the Ontology Management cell, three calls were relevant:

  • getCategories – Used to get the root nodes of the dataset explorer tree. These root nodes are special in several respects, but these are not relevant here.
  • getChildren – Obtain the nodes directly below a certain node.
  • getNameInfo – Fetch information about a node, identified by name.

For the Data Repository cell, only two calls turned out to be important:

  • CRC_QRY_runQueryInstance_fromQueryDefinition – Define a query and run it in one go; in tranSMART, only used to create patient sets.
  • CRC_QRY_getRequestXml_fromQueryMasterId – Retrieve the definition of a previously created query.

Replacement of Ontology Management

A survey of the usage of the Ontology Management cell from the JavaScript frontend revealed not only the calls that needed to be reimplemented (perhaps partially), but also the sort of data that needed to be returned. This last inquiry informed the design of the OntologyTerm class in core-api. This class also specifies the method that replaces the getChildren i2b2 service call; we opted for non-anemic domain objects. The other calls, more static in nature, are specified in ConceptsResource.

The functionality specified in core-api and implemented in core-db is but a subset of the Ontology Management cell, but this is all that’s needed given the limited usage in the JavaScript frontend.
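To make the "non-anemic domain object" idea concrete, here is a small Python sketch in which the term itself answers the getChildren question, while the more static lookups live on a resource object. It is an illustration only: core-api is written in Groovy, and the method names get_categories, find_by_name and get_children, the InMemoryConcepts class and the "/"-separated keys are assumptions made for this example, not the actual core-api signatures.

```python
class OntologyTerm:
    """Hypothetical stand-in for core-api's OntologyTerm."""

    def __init__(self, key, store):
        self.key = key                                   # full path of the node
        self.name = key.rstrip("/").rsplit("/", 1)[-1]   # last path segment
        self.level = key.strip("/").count("/")           # 0 for root nodes
        self._store = store

    def get_children(self):
        # Behaviour lives on the domain object, replacing the getChildren call.
        return self._store.children_of(self)


class InMemoryConcepts:
    """Plays the role of ConceptsResource, backed by a dict instead of a database."""

    def __init__(self, keys):
        self._terms = {key: OntologyTerm(key, self) for key in keys}

    def get_categories(self):
        # Root nodes of the tree (replaces getCategories).
        roots = (t for t in self._terms.values() if t.level == 0)
        return sorted(roots, key=lambda t: t.key)

    def find_by_name(self, name):
        # Look up nodes by name (replaces getNameInfo).
        hits = (t for t in self._terms.values() if t.name == name)
        return sorted(hits, key=lambda t: t.key)

    def children_of(self, parent):
        # Nodes exactly one level below the parent and under its path.
        kids = (
            t for t in self._terms.values()
            if t.level == parent.level + 1 and t.key.startswith(parent.key)
        )
        return sorted(kids, key=lambda t: t.key)


concepts = InMemoryConcepts([
    "/Public Studies/",
    "/Public Studies/Study A/",
    "/Public Studies/Study A/Age/",
    "/Public Studies/Study B/",
])
root = concepts.get_categories()[0]
print([child.name for child in root.get_children()])
```

Because callers only see the term and the resource, the dict-backed store could be swapped for a database-backed one (or for calls to i2b2 web services) without changing them.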

Replacement of Data Repository (CRC)

The Data Repository cell is considerably more complex than the Ontology Management cell. It allows creating queries involving any number of ontology terms (and even other objects, such as other queries) and restricting data by value, possibly even doing unit conversions. Several types of query results are possible and the cell also allows submission of data.

Fortunately, the usage scenarios from the JavaScript frontend require only a small fraction of this functionality. The queries are limited to specifying ontology terms, possibly constrained by value, and the only result type requested is patient sets.

Even under this limited subset, implementing the required functionality is not trivial. The very dynamic model used by i2b2, where ontology terms specify actual SQL queries, means that SQL queries must be generated on-the-fly, which raises some extra problems like requiring more data sanitization and escaping.
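The sanitization concern can be illustrated with a short Python sketch. It assumes the usual i2b2 metadata columns (c_facttablecolumn, c_tablename, c_columnname, c_operator, c_dimcode) and shows one possible way of turning a term into SQL: identifiers are checked against a strict pattern, operators against an allowlist, and the value is passed as a bound parameter. The function build_patient_query, the reduced two-table schema and the "/"-separated paths are hypothetical; this is not the code in core-db.

```python
import re
import sqlite3

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
ALLOWED_OPERATORS = {"=", "LIKE"}


def build_patient_query(term):
    """Build (sql, params) selecting the patients that have the given term."""
    # Identifiers cannot be bound as parameters, so validate them strictly.
    for column in ("c_facttablecolumn", "c_tablename", "c_columnname"):
        if not IDENTIFIER.fullmatch(term[column]):
            raise ValueError(f"unsafe identifier in {column}: {term[column]!r}")
    operator = term["c_operator"].upper()
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"operator not allowed: {operator!r}")

    value = term["c_dimcode"]
    predicate = f"{term['c_columnname']} {operator} ?"
    if operator == "LIKE":
        # Escape LIKE wildcards in the stored path, then match it as a prefix.
        value = value.replace("!", "!!").replace("%", "!%").replace("_", "!_") + "%"
        predicate += " ESCAPE '!'"

    fact_column = term["c_facttablecolumn"]
    sql = (
        "SELECT DISTINCT patient_num FROM observation_fact "
        f"WHERE {fact_column} IN "
        f"(SELECT {fact_column} FROM {term['c_tablename']} WHERE {predicate})"
    )
    return sql, [value]


# Tiny in-memory database standing in for the i2b2 tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE concept_dimension (concept_cd TEXT, concept_path TEXT);
    CREATE TABLE observation_fact (patient_num INTEGER, concept_cd TEXT);
    INSERT INTO concept_dimension VALUES ('AGE', '/Study A/Age/'), ('SEX', '/Study A/Sex/');
    INSERT INTO observation_fact VALUES (1, 'AGE'), (2, 'AGE'), (3, 'SEX');
""")
age_term = {
    "c_facttablecolumn": "concept_cd",
    "c_tablename": "concept_dimension",
    "c_columnname": "concept_path",
    "c_operator": "LIKE",
    "c_dimcode": "/Study A/Age/",
}
sql, params = build_patient_query(age_term)
patients = sorted(row[0] for row in conn.execute(sql, params))
print(patients)
```

The split matters because table and column names have to be spliced into the SQL text, whereas values never need to be.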

The specification of the subset is mainly set forth in these files in core-api:

  • QueryDefinition – represents a query definition, which is composed of a name and a set of panels, which are intersected to form the result.
  • Panel – aggregates a set of items and has a flag that allows negating the encoded set.
  • Item – includes the key for an ontology term, possibly with a value constraint.
  • ConstraintByValue – represents a value constraint, which further limits the set represented by an ontology term.
  • QueriesResource – this is the service class, the entry point for the replacements of the i2b2 calls.
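The way these classes fit together can be sketched in Python as follows. The class names mirror the list above, but the fields, the operator names and the run_query function are hypothetical, and the data comes from an in-memory dict rather than a database. One assumption deserves emphasis: the items within a panel are combined by union here, as in i2b2; the list above only states that the panels are intersected.

```python
import operator
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical operator names for value constraints.
OPERATORS = {
    "LOWER_THAN": operator.lt,
    "GREATER_THAN": operator.gt,
    "EQUAL_TO": operator.eq,
}


@dataclass(frozen=True)
class ConstraintByValue:
    operator: str
    value: float


@dataclass(frozen=True)
class Item:
    concept_key: str
    constraint: Optional[ConstraintByValue] = None


@dataclass(frozen=True)
class Panel:
    items: Tuple[Item, ...]
    invert: bool = False  # negate the set encoded by the panel


@dataclass(frozen=True)
class QueryDefinition:
    name: str
    panels: Tuple[Panel, ...]


def patients_for_item(item, observations):
    """Patients with an observation for the term, filtered by the constraint."""
    values = observations.get(item.concept_key, {})
    if item.constraint is None:
        return set(values)
    test = OPERATORS[item.constraint.operator]
    return {p for p, v in values.items() if test(v, item.constraint.value)}


def run_query(definition, observations, all_patients):
    """Return the patient set: panels are intersected, items in a panel united."""
    result = set(all_patients)
    for panel in definition.panels:
        panel_set = set()
        for item in panel.items:
            panel_set |= patients_for_item(item, observations)  # assumed union
        if panel.invert:
            panel_set = set(all_patients) - panel_set
        result &= panel_set
    return result


# observations[concept_key][patient_id] = value
observations = {
    "/Study A/Age/": {1: 34, 2: 61, 3: 47},
    "/Study A/Smoker/": {2: 1},
}
query = QueryDefinition(
    name="older non-smokers",
    panels=(
        Panel(items=(Item("/Study A/Age/", ConstraintByValue("GREATER_THAN", 40)),)),
        Panel(items=(Item("/Study A/Smoker/"),), invert=True),
    ),
)
print(run_query(query, observations, {1, 2, 3}))
```

In this example the first panel selects patients 2 and 3, the inverted second panel selects patients 1 and 3, and their intersection leaves patient 3.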

Conclusion

This work serves two purposes: it provides a real-world example of how the development of the much desired “core” can proceed, and it provides a tangible benefit for the current tranSMART code base – namely, it is no longer necessary to run the i2b2 web application alongside tranSMART.

The integration work in transmartApp is available in the core-integration branch of The Hyve’s fork. It’s now also been pushed to the transmartApp repository.
