Rewiring tranSMART from product to platform

A year ago, tranSMART was the internal data warehouse for translational research at Johnson & Johnson, and for a number of projects in which J&J companies participated, such as IMI U-BIOPRED. It had already gained considerable traction by then, including a CIO 100 award and a Bio-IT World Best Practices Award, both in 2010. But the really disruptive step was taken by J&J (or Janssen, as it is called nowadays) and Recombinant Data (now Recombinant by Deloitte) when they completed the legal process to make tranSMART open source and published the source code on the internet, a little more than a year ago.

At first, not much seemed to happen with that source code. Sure, a lot was going on behind the scenes. Other major pharmaceutical companies started doing pilots with tranSMART. University medical centers in the United States and across the world started to investigate tranSMART and its possibilities. Disease foundations started dialogues about how they could leverage tranSMART for sharing data and knowledge. Standards organizations and government regulatory agencies showed interest. Some companies even invested in implementing new functionality in tranSMART. But from an open source point of view, tranSMART still largely remained a proprietarily built software product which happened to have (parts of) its source code published.

This was bound to change, however. And it did! For me, as a long-time open source advocate, it is fascinating to see how the disruptive power of the sharing and collaboration values behind the open source philosophy works its magic even in the IP fortresses of this world, the pharmaceutical companies. The step that the J&J scientists took proved to be visionary. It was just a matter of time before large public and public-private initiatives started picking up the opportunities that an open source platform like tranSMART could provide. An important project in this space is IMI eTRIKS, a joint EFPIA and European Union project whose mission is to support all other IMI projects with a shared IT infrastructure. The eTRIKS consortium decided to leverage tranSMART to build this infrastructure. A comparable project in which we are heavily involved, CTMM TraIT, has the same mission and commitment for 20+ CTMM translational research projects funded by the Dutch government. And it is only a matter of time before other national and international initiatives around translational research realize the enormous potential that collaboration around a platform like tranSMART could provide, in both software and data interoperability, and join in.

However, all this momentum and excitement creates enormously high expectations for the open source technology that tranSMART is, or rather, has to be. TranSMART faces the challenge of progressing from a complex, largely monolithic product with fairly well-defined user workflows and a manageable number of data sources towards an open ecosystem that can support the multitude of use cases it is now facing, both at the user level and from a software development and deployment perspective. An example of this (taken from CTMM TraIT) is the expansion from a clinical trial and biomedical knowledge perspective towards the real-time translational use case in academic research hospitals, where clinical and biobanking data from the hospital domain is anonymized and fed into the realm of the medical researchers. All this has to happen without tranSMART losing its original appeal and strength: a powerful, ready-to-go user interface for analyses such as gene expression and survival analyses, as well as a biomolecular knowledge base search interface.

On a more technical level, the current code base of tranSMART, while based on other innovative open source big data tools such as Grails, R, Solr and Pentaho, and on more domain-specific projects such as i2b2 and GenePattern, still needs a lot of work to become a truly versatile, modular, well-tested and well-documented open source ecosystem. The good news is that this work has already started and is progressing at an increasingly fast pace. Instrumental in this is the worldwide open source community that tranSMART has attracted, which is growing rapidly. An important enabler for this community is the recently started tranSMART Foundation, an initiative of, among others, the NCIBI at the University of Michigan. The tranSMART Foundation has found a natural home at the global Pistoia Alliance, which confirms and strengthens its origins in, and strong bonds with, the pharmaceutical industry. At the same time, the foundation works closely with academic partners such as the already mentioned NCIBI in Ann Arbor, Imperial College in London (one of the primary partners in IMI eTRIKS), and the VU University Medical Center (VUmc) in Amsterdam.

Not entirely incidentally, Ann Arbor (Feb 2013), London (Feb 2013) and Amsterdam (June 2013) are also the places where community meetings around this initiative of rewiring and expanding tranSMART have been and will be organized. The mission at hand is to rewrite and enhance the current tranSMART open source codebase into a structured ecosystem of open source modules, APIs, processing pipelines and data sources that will stand the test of time and form a solid base for the rich platform that tranSMART will become in the next 5 to 10 years. This includes alignment with similar open source initiatives and projects, some of which arguably have a better documented and tested codebase or richer feature sets in certain areas. Examples of such projects include of course i2b2, but also TCGA / Firehose, cBioPortal, SAGE Synapse, OpenPHACTS, BioMedBridges and many others. Not to mention interfacing with tools such as OpenClinica, Galaxy, XNAT and Phenotype Database, all of which are on the roadmap for the CTMM TraIT project. None of this should slow down the tranSMART core API development, however. On the contrary, by building upon standards and decisions already taken in these projects, and by treating them as reference architectures, we will be able to ensure interoperability and take advantage of all the work already done there.
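
To give a rough idea of what a core API with interchangeable modules behind it could look like, here is a minimal sketch in Python. It is purely illustrative: tranSMART itself is built on Grails, and every name in the sketch (Observation, ClinicalDataStore, InMemoryStore, mean_value) is hypothetical and not taken from the tranSMART code base. The point is only that analysis code talks to an abstract interface, while a simple reference implementation (here an in-memory one) can later be swapped for a different storage module without touching the callers.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Observation:
    """One clinical fact about a patient (hypothetical, heavily simplified)."""
    patient_id: str
    concept: str
    value: float


class ClinicalDataStore(ABC):
    """Hypothetical storage-agnostic interface for a clinical data module."""

    @abstractmethod
    def add(self, observation: Observation) -> None:
        """Store a single observation."""

    @abstractmethod
    def observations(self, concept: str) -> List[Observation]:
        """Return all observations recorded for the given concept."""


class InMemoryStore(ClinicalDataStore):
    """Reference implementation that keeps everything in a dictionary.

    Another storage module would implement the same two methods.
    """

    def __init__(self) -> None:
        self._by_concept: Dict[str, List[Observation]] = {}

    def add(self, observation: Observation) -> None:
        self._by_concept.setdefault(observation.concept, []).append(observation)

    def observations(self, concept: str) -> List[Observation]:
        # Return a copy so callers cannot modify the store's internal list.
        return list(self._by_concept.get(concept, []))


def mean_value(store: ClinicalDataStore, concept: str) -> Optional[float]:
    """Toy analysis that depends only on the interface, not on the backend."""
    values = [obs.value for obs in store.observations(concept)]
    return sum(values) / len(values) if values else None


# Usage: the analysis works with any ClinicalDataStore implementation.
store = InMemoryStore()
store.add(Observation("p1", "age", 54.0))
store.add(Observation("p2", "age", 60.0))
print(mean_value(store, "age"))  # prints 57.0
```

A real API would of course need far richer concepts (studies, ontologies, access control), but the same separation between interface and reference implementation applies.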

In this process, we will also open up the persistence layer of tranSMART to a much broader range of possibilities, going from the RDBMSs it uses today, such as Oracle and Postgres, to NoSQL solutions such as Hadoop HBase and MongoDB. In the end, designing the tranSMART APIs will amount to taking some reasonable, well-argued decisions based upon the complex and rich background already present, and providing reference implementations for these APIs. So far this has been approached in an agile way, taking the various tranSMART modules piece by piece, starting with the clinical data layer, which rested on an outdated version of i2b2. A first version of that part has already been delivered. Currently, the high-dimensional data APIs are being designed, and up next will be the job execution and workflow layer as well as the biomolecular knowledge base which powers the tranSMART search interface. Besides further core development, another equally important step is to improve the current state of the installation, administration and user documentation of the various tranSMART components, from both a developer and a user perspective. Important milestones in this process are the community meetings, so if you are reading this and want to contribute, feel free to register for our upcoming Amsterdam meeting!