How to run the tranSMART open source package for translational research on Amazon AWS

Great news for translational research: Johnson & Johnson recently open sourced their translational medicine data warehouse, called tranSMART. A paper on tranSMART was published in 2010, and just a few days ago a first version of the source code was put on GitHub.

In this blog post, I will do a walkthrough of how to get tranSMART running on Amazon Web Services. It is a Grails project, so that part is not too complex to deploy, but it also has a lot of dependencies, including an Oracle database, the open source i2b2 project, and various bioinformatics tools such as GenePattern.

Obtaining a host machine

First, we need a suitable host machine. A large part of the software runs on the JVM, so it should be relatively cross-platform. A Linux server would be an obvious choice. But we also need Oracle, so we are a little bit limited in which Linux flavors we can use here. For this walkthrough, I will use Oracle XE, since I would like to use 'freely' available and open source components as much as possible. Oracle XE is available under the Oracle Technology Network Developer License Terms, and you need to create an account with Oracle for that. Licensing for this project is horribly complicated anyway: i2b2 has its own i2b2 Software License, Oracle has its OTN license (but you will want to get a commercial one for a serious deployment), GenePattern has its own license, and the tranSMART Grails application itself is released under GPLv3.

Considering all this, I went for Red Hat Enterprise Linux: it works well with Oracle, and it's well supported by AWS. I used the 64-bit version and an m1.large instance. I'm assuming you are familiar with AWS; if not, you can of course also go for a RHEL flavour such as CentOS on a local cloud server, use a VM on your local computer with virtualization software like VirtualBox, or even try to install everything directly on your OS. I use RightScale for managing my AWS instances, and I tested this with RHEL 6.2 starter (ami-41d0052) on the AWS US East region, and the RightScale CentOS 5.6 (ami-6c0c3f18) on the AWS EU region. For convenience, I also entered the IP address of the newly started server in the DNS via Amazon Route53, so that I can just use transmarttest.thehyve.nl as hostname. You might want to configure the security group to close all ports except 22 at the very least, because Oracle and the various web applications needed for tranSMART expose several ports.

Adding swap memory to the host

For Oracle and the various web applications to work smoothly, you also need swap space, preferably several GBs of it. Vanilla AWS Red Hat images don't come with swap attached by default, so you first need to create an EBS volume for this (4 GB, for example) and attach it to your tranSMART host machine. I like to call the volume /dev/sdm (m for memory). You can do this easily with RightScale, but for the purpose of this tutorial I used the AWS management console.

After attaching the volume, you have to enable it in the OS, where the volume attached as /dev/sdm shows up as /dev/xvdm. We need to SSH into the machine for that:

mbpkees:~ kees$ ssh -i ~/.ssh/transmarttest.pem -l root transmarttest.thehyve.nl
The authenticity of host 'transmarttest.thehyve.nl (107.22.91.110)' can't be established.
RSA key fingerprint is 67:2e:29:3c:12:1a:fb:8d:ad:9a:c0:32:b3:2f:62:d4.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'transmarttest.thehyve.nl,107.22.91.110' (RSA) to the list of known hosts.
[root@ip-10-250-107-148 ~]# free
             total       used       free     shared    buffers     cached
Mem:       7646264     219792    7426472          0      10044      50032
-/+ buffers/cache:     159716    7486548
Swap:            0          0          0
[root@ip-10-250-107-148 ~]# mkswap /dev/xvdm
mkswap: /dev/xvdm: warning: don't erase bootbits sectors
on whole disk. Use -f to force.
Setting up swapspace version 1, size = 4194300 KiB
no label, UUID=fbdd5854-21ca-4873-b303-c121a45b4be3
[root@ip-10-250-107-148 ~]# echo '/dev/xvdm swap swap defaults 0 0' >> /etc/fstab
[root@ip-10-250-107-148 ~]# swapon -a
[root@ip-10-250-107-148 ~]# free
             total       used       free     shared    buffers     cached
Mem:       7646264     225248    7421016          0      10092      51004
-/+ buffers/cache:     164152    7482112
Swap:      4194296          0    4194296
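
One thing to watch out for in the session above: running the echo into /etc/fstab a second time adds a duplicate line. Here is a minimal sketch of how you could guard against that and read back the swap size afterwards. The helper names add_swap_entry and swap_total_kib are my own invention for this example; the device name and fstab line are the ones used above.

```shell
# Sketch: register a swap device in fstab only once, then report the swap size.
# add_swap_entry and swap_total_kib are hypothetical helper names.

# Append a swap line for DEVICE to FSTAB unless one is already present.
add_swap_entry() {
    local device="$1" fstab="${2:-/etc/fstab}"
    if grep -q "^${device}[[:space:]]" "$fstab" 2>/dev/null; then
        return 0
    fi
    echo "${device} swap swap defaults 0 0" >> "$fstab"
}

# Print the SwapTotal value (in KiB) from a meminfo-style file.
swap_total_kib() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^SwapTotal:/ { print $2 }' "$meminfo"
}
```

As root, you would then run something like `add_swap_entry /dev/xvdm && swapon -a && swap_total_kib`, which should print the same swap total that free reports.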

Installing Oracle XE

Next, you need to obtain an Oracle XE (I used Oracle Database Express Edition 11g Release 2) RPM from the Oracle website, and get that onto your Linux server. If you copy the download link with the AuthParam you get after logging in, you can use wget or curl. Unzip the file, and then install the RPM:

[root@ip-10-250-107-148 ~]# cd Disk1/
[root@ip-10-250-107-148 Disk1]# rpm -ivh oracle-xe-11.2.0-1.0.x86_64.rpm
Preparing... ########################################### [100%]
1:oracle-xe ########################################### [100%]
Executing post-install steps...
You must run '/etc/init.d/oracle-xe configure' as the root user to configure the database.
[root@ip-10-250-107-148 Disk1]# /etc/init.d/oracle-xe configure

Just accept the defaults, enter a password, and configure Oracle to start on boot.

Now, you should be able to log in to the database with the Oracle sqlplus shell:

[root@ip-10-250-107-148 bin]# export ORACLE_HOME=/u01/app/oracle/product/11.2.0/xe
[root@ip-10-250-107-148 bin]# export ORACLE_SID=XE
[root@ip-10-250-107-148 bin]# export NLS_LANG=`$ORACLE_HOME/bin/nls_lang.sh`
[root@ip-10-250-107-148 bin]# export PATH=$ORACLE_HOME/bin:$PATH
[root@ip-10-250-107-148 bin]# ./sqlplus
SQL*Plus: Release 11.2.0.2.0 Production on Tue Feb 14 11:57:52 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Enter user-name: system
Enter password:
Connected to:
Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production
SQL> exit
Disconnected from Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production
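
Since you will need these environment variables in every new shell, it can help to wrap them in a small function. This is just a sketch: set_oracle_env is a hypothetical name, the default path is the ORACLE_HOME from the session above, and the nls_lang.sh call is skipped when the script is not there.

```shell
# Sketch: set the Oracle XE environment in one place.
# set_oracle_env is a hypothetical helper; pass another ORACLE_HOME as $1 if needed.
set_oracle_env() {
    ORACLE_HOME="${1:-/u01/app/oracle/product/11.2.0/xe}"
    ORACLE_SID=XE
    export ORACLE_HOME ORACLE_SID
    # nls_lang.sh ships with Oracle XE; skip it when it is not present.
    if [ -x "$ORACLE_HOME/bin/nls_lang.sh" ]; then
        NLS_LANG=$("$ORACLE_HOME/bin/nls_lang.sh")
        export NLS_LANG
    fi
    # Prepend the Oracle bin directory only if it is not on PATH yet.
    case ":$PATH:" in
        *":$ORACLE_HOME/bin:"*) ;;
        *) PATH="$ORACLE_HOME/bin:$PATH"; export PATH ;;
    esac
}
```

If you save this in a file of your choice and source it, calling `set_oracle_env` once should be enough to run sqlplus and the other Oracle tools from any directory.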

Import of the tranSMART database

From this point on, you can import the tranSMART database. However, we first have to create a number of tablespaces that are referenced in the import, which we can do in sqlplus:

SQL> create tablespace i2b2_data logging datafile 'i2b2.dat' size 32m autoextend on;
Tablespace created.
SQL> create tablespace transmart logging datafile 'transmart.dat' size 32m autoextend on;
Tablespace created.
SQL> create tablespace biomart logging datafile 'biomart.dat' size 32m autoextend on;
Tablespace created.
SQL> create tablespace deapp logging datafile 'deapp.dat' size 32m autoextend on;
Tablespace created.
SQL> create tablespace searchapp logging datafile 'searchapp.dat' size 32m autoextend on;
Tablespace created.
SQL> create tablespace indx logging datafile 'indx.dat' size 32m autoextend on;
Tablespace created.
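
If you prefer not to type these six statements by hand, you could generate them with a small shell loop. The sketch below only prints the SQL; print_tablespace_sql is a hypothetical helper name, and the statements are the same as in the session above.

```shell
# Sketch: print the CREATE TABLESPACE statements for a list of tablespace names.
# print_tablespace_sql is a hypothetical helper name.
print_tablespace_sql() {
    for name in "$@"; do
        # The datafile is named after the tablespace, minus a trailing "_data"
        # (so i2b2_data uses i2b2.dat, as in the session above).
        printf "create tablespace %s logging datafile '%s.dat' size 32m autoextend on;\n" \
            "$name" "${name%_data}"
    done
}
```

For example, `print_tablespace_sql i2b2_data transmart biomart deapp searchapp indx > create_tablespaces.sql` writes a script (the file name is just an example) that you can run from sqlplus with `@create_tablespaces.sql`.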

We also need an Oracle directory object that points to the location of the import files:

SQL> create or replace directory test_dir as '/u01/app/oracle/import_transmart';
Directory created.
SQL> grant read,write on directory test_dir to system;
Grant succeeded.

Next, we download the tranSMART GPL 0.9 .dmp and .exp files:

[root@ip-10-250-107-148 local]# cd /u01/app/oracle/
[root@ip-10-250-107-148 oracle]# su oracle
bash-3.2$ mkdir import_transmart
bash-3.2$ cd import_transmart
bash-3.2$ wget http://transmartproject.org/do...
--2012-02-14 13:13:44-- http://transmartproject.org/do...
Resolving transmartproject.org... 184.73.171.235
Connecting to transmartproject.org|184.73.171.235|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4251648 (4.1M) [video/unknown]
Saving to: `transmart.dmp'
100%[=============================================================>] 4,251,648 1.42M/s in 2.8s
2012-02-14 13:13:47 (1.42 MB/s) - `transmart.dmp' saved [4251648/4251648]
bash-3.2$ wget http://transmartproject.org/do...
--2012-02-14 13:14:18-- http://transmartproject.org/do...
Resolving transmartproject.org... 184.73.171.235
Connecting to transmartproject.org|184.73.171.235|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19371 (19K)
Saving to: `transmart.exp'
100%[=============================================================>] 19,371 98.4K/s in 0.2s
2012-02-14 13:14:19 (98.4 KB/s) - `transmart.exp' saved [19371/19371]
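
Before starting the import, it doesn't hurt to check that both files arrived completely. A minimal sketch, assuming the sizes reported by wget above (4251648 and 19371 bytes) are the ones to expect; has_size is a hypothetical helper name.

```shell
# Sketch: check that a downloaded file has the expected size in bytes.
# has_size FILE BYTES succeeds only if FILE exists and is exactly BYTES long.
has_size() {
    [ -f "$1" ] || return 1
    actual=$(wc -c < "$1" | tr -d ' ')
    [ "$actual" = "$2" ]
}
```

In the import_transmart directory you could then run `has_size transmart.dmp 4251648 && has_size transmart.exp 19371 && echo "downloads look complete"`.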

Now that we have created the tablespaces and set up the database dump, the only thing that is left is to do the actual import:

bash-3.2$ ./impdp system/[yourpassword] directory=test_dir dumpfile=transmart.dmp logfile=transmart.imp schemas=i2b2hive,i2b2metadata,i2b2sampledata,i2b2demodata,i2b2workdata,biomart,biomart_user,deapp,searchapp

This will take a while, but it should fill your database with a starting point for tranSMART.
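
As an alternative to one long command line, Data Pump can also read its parameters from a parameter file via parfile=, which keeps the schema list readable and the password out of your shell history. Here is a sketch that writes such a file. The helper name write_impdp_parfile and the file name transmart.par are my own assumptions; the parameter values are the ones from the import command above.

```shell
# Sketch: write a Data Pump parameter file with a comma-separated schema list.
# write_impdp_parfile is a hypothetical helper: first argument is the output
# file, the remaining arguments are the schema names.
write_impdp_parfile() {
    parfile="$1"; shift
    # Join the remaining arguments (schema names) with commas.
    schemas=$(IFS=,; printf '%s' "$*")
    {
        echo "directory=test_dir"
        echo "dumpfile=transmart.dmp"
        echo "logfile=transmart.imp"
        echo "schemas=$schemas"
    } > "$parfile"
}
```

You would call it as `write_impdp_parfile transmart.par i2b2hive i2b2metadata i2b2sampledata i2b2demodata i2b2workdata biomart biomart_user deapp searchapp` and then start the import with `impdp system parfile=transmart.par`, which should prompt you for the password.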

The final step is to install and start JBoss (with i2b2), Tomcat (with the tranSMART Grails application), Solr, R, and optionally GenePattern and PLINK. These steps are documented in the Install Guide document, so I won't repeat them here. Happy researching!
