TranSMART data exploration and analysis using Python Client and Jupyter Notebook

Elisa Cirillo

13-12-2018

In a previous blog on Glowing Bear we highlighted its visualization features, based on data stored in the tranSMART 17.1 data warehouse. The advantage of Glowing Bear is that it allows users without any programming skills to perform queries and display the results for one or more patient groups. In this blog, we’ll discuss how to access and analyse the data in tranSMART 17.1 programmatically, using the Python client and Jupyter Notebook.

Jupyter Notebook

Many scientists use Jupyter Notebook to analyse specific data subsets and visualize the results. A Notebook can contain both code and rich text elements, such as figures, links and equations. This mix of code and text elements allows the user to combine analysis descriptions and results in one document. Jupyter Notebook supports a range of programming languages. In recent years, The Hyve participated in a project that has improved the analytical power of the server version of Jupyter Notebook.

Python and R client

Earlier this year, The Hyve released a Python client and an R client to speed up the process of accessing data in tranSMART. You could say the clients provide a door that gives programmatic access to the patient and sample data in tranSMART. Previously, the user needed to export a subset of data in the right format, download it and subsequently load it into the Jupyter Notebook or another application of choice. With the Python and R clients, the data subset in tranSMART can be accessed directly from the Notebook.

Cohort selection

A cohort can be selected from tranSMART 17.1 in two ways: in the Jupyter Notebook via the Python client and in Glowing Bear. When selecting a cohort in Jupyter Notebook, you access the full tranSMART database and enter the selection criteria in the Notebook environment. This method works fine, as long as you know which parameters are available for the patient group of interest.
The second method works by selecting patients in Glowing Bear based on certain selection criteria. Once this cohort is saved in Glowing Bear, an identifier (ID) is created. By copying this ID into the Jupyter Notebook, the scientist can access this subset of patients directly via the Python client. Further analysis of the selected cohort can then be performed in the Notebook. An exercise explaining both types of cohort selection can be found here.
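
To make the difference between the two routes a bit more concrete, here is a minimal sketch of the kind of request that could sit behind each of them. It does not use the Python client itself. It assumes the tranSMART 17.1 v2 REST API with its JSON constraints, and the server address, the study ID and the function names are made up for illustration.

```python
import json
from urllib.parse import urlencode

# Hypothetical server address; replace it with your own tranSMART instance.
BASE_URL = "https://transmart.example.org"


def study_constraint(study_id):
    """Selection criterion entered in the Notebook: everything in one study.

    The constraint layout is an assumption based on the v2 REST API.
    """
    return {"type": "study_name", "studyId": study_id}


def patient_set_constraint(patient_set_id):
    """Selection by the ID of a cohort that was saved in Glowing Bear."""
    return {"type": "patient_set", "patientSetId": patient_set_id}


def observations_url(constraint):
    """Build (but do not send) an observations request for a constraint."""
    query = urlencode({"constraint": json.dumps(constraint)})
    return f"{BASE_URL}/v2/observations?{query}"


# Route 1: criteria typed in the Notebook (the study ID is a placeholder).
url_by_criteria = observations_url(study_constraint("EXAMPLE_STUDY"))

# Route 2: the ID copied from Glowing Bear (42 is a placeholder).
url_by_cohort_id = observations_url(patient_set_constraint(42))
```

In practice the Python client takes care of authentication and of turning the response into something you can analyse, so you would normally not build these URLs by hand.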

Custom analysis

A big advantage of using a Jupyter Notebook in combination with the Python client is that it enables advanced data visualization (such as graphs) and analysis of one or more cohorts, thanks to the libraries for data analysis that are available in the Notebook. An example of visual analysis of a data subset is described here.
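
As a small illustration of comparing cohorts once the data is in the Notebook, the sketch below averages one variable per cohort. The records are made up and their field names are hypothetical; they do not reflect the format returned by the Python client. Only the standard library is used, although in a real Notebook you would more likely reach for a data analysis library.

```python
from collections import defaultdict
from statistics import mean

# Made-up observations for illustration only; the layout is an assumption.
observations = [
    {"cohort": "A", "patient": 1, "concept": "Age", "value": 54},
    {"cohort": "A", "patient": 2, "concept": "Age", "value": 46},
    {"cohort": "B", "patient": 3, "concept": "Age", "value": 60},
    {"cohort": "B", "patient": 4, "concept": "Age", "value": 62},
    {"cohort": "B", "patient": 4, "concept": "Weight", "value": 80},
]


def mean_per_cohort(records, concept):
    """Return the mean value of one concept for every cohort."""
    values = defaultdict(list)
    for record in records:
        if record["concept"] == concept:
            values[record["cohort"]].append(record["value"])
    return {cohort: mean(vals) for cohort, vals in values.items()}


age_by_cohort = mean_per_cohort(observations, "Age")
```

The resulting dictionary has one entry per cohort and can be passed straight to a plotting library to draw, for example, a bar chart.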

Data upload

The Jupyter Notebook can be used not only for data analysis and visualization, but also when uploading data into the tranSMART data warehouse. The Hyve has developed this feature as part of the CTMM TraIT project, an initiative that wants to accelerate translational medicine by improving diagnosis. Jupyter Notebook is used to pre-process different types of data by formatting them in such a way that they can be handled by the upload tool, transmart-batch. The upload itself is done outside of the Notebook.
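
To give an idea of what such a pre-processing step can look like, here is a minimal sketch that turns comma-separated text into a clean tab-separated table. It assumes that the upload tool expects tab-separated files. The function name and the sample data are made up, and the parameter and mapping files that an actual upload also needs are not shown.

```python
import csv
import io


def to_tab_separated(csv_text):
    """Convert comma-separated text to tab-separated text.

    Whitespace around each cell is stripped along the way.
    """
    reader = csv.reader(io.StringIO(csv_text))
    output = io.StringIO()
    writer = csv.writer(output, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow([cell.strip() for cell in row])
    return output.getvalue()


# Made-up sample input for illustration only.
raw = "SUBJ_ID, Age, Gender\nP1, 54, female\nP2, 46, male\n"
formatted = to_tab_separated(raw)
```

The formatted text can then be written to a file in the Notebook and handed over to the upload tool.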

Conclusion

Using Jupyter Notebook to visualize and validate data in tranSMART is easy for scientists with programming skills and it unlocks a whole range of extra functionalities.