These days, computers need to cope with the vast amounts of data that the Life Science sector is constantly generating. I am not talking about storing the data, but about the ability to perform computationally intensive analyses on these large data sets. To increase the analysis power available to one of our customers, we decided to bring together two components: JupyterHub and HPC.
Jupyter and HPC
Some of you may already know the popular notebook application Jupyter. Others may already know the server version of this application: JupyterHub. And researchers who use Jupyter a lot may know that there is a new edition coming: JupyterLab. Even though JupyterLab is unfortunately still in beta, it already looks very promising: it aims to be more flexible and extensible, with better “app” support, so you can adjust it to your specific needs more easily.
Jupyter is an interactive computational notebook: essentially a programming environment that shows you the results of a computation almost as soon as you type it (think of Excel, but with much more power; Excel on steroids). Although it is possible to install Jupyter on an end user’s machine, its installation and configuration require a lot of expertise. Also, the computational power of a locally installed Jupyter is inherently limited, and the user’s notebooks and data are tied to a single machine. To overcome these limitations, the Jupyter community has created JupyterHub, a multi-user, web-based hosting solution for Jupyter. The benefits are obvious: users don’t need to install Jupyter locally and can access it from any computer.
What made this project so interesting to us was the fact that we needed to integrate Jupyter notebooks with an HPC. HPC? This abbreviation stands for High Performance Computing, to some also known as a “supercomputer”, although a supercomputer is only one type of HPC.
For the non-technical readers: you can compare an HPC to a large number of computers connected to each other, with a tool that distributes jobs and workload over them. HPCs are becoming more necessary every day, as the amount of data that scientists, analysts and statisticians use is growing rapidly. A genomics data set nowadays can easily take up more than 1 TB. Try doing some calculations on that with your own laptop. I can almost guarantee it will not go well... or you will end up with a serious coffee addiction before it is finished.
So why am I telling you this? For a recent project we looked into installing, integrating and configuring JupyterHub and JupyterLab for one of our customers. When we began this project, JupyterLab was still in alpha, so we sadly could not trust it enough to deliver a reliable solution for our customer and had to limit ourselves to JupyterHub. In this blog post I want to share the knowledge we gained during this project.
HPCs have become an essential addition to a bioinformatician’s resources when it comes to providing extra computing power for analysing big data sets. Another application that is rapidly becoming indispensable is JupyterHub. JupyterHub provides an easy, interactive, experimental, almost playful way of analysing data. Through integration with an HPC, more powerful analytics can be done, and users do not have to lose significant time learning how to run analyses on the HPC.
So what approaches did we try, and what have we learned from them?
Approach one: just run it (on a more powerful machine)
Our first approach was to run every single notebook as a separate HPC job. Since an HPC provides more computational power than the end user’s machine, the limits of the analytics that you can do within a notebook expand dramatically. We encapsulated every notebook within a container, such as Docker or Singularity, to capture content like environment variables, and this container was then spawned across several nodes on the HPC.
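To make this concrete: a containerized notebook job could be submitted with something like the following hypothetical SLURM script. The Singularity image path, resource requests and port are all made-up placeholders, not taken from the actual project:

```shell
#!/bin/bash
#SBATCH --job-name=jupyter-notebook
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=08:00:00

# Run a single-user notebook server inside a Singularity image;
# the image path and port below are placeholders.
singularity exec /shared/images/jupyter.sif \
    jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
```

A script like this would be handed to the scheduler with `sbatch`, which is exactly where the shutdown problem described below comes in: the job keeps its resources until the notebook server is stopped.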
However, starting these notebooks could take quite some time. Additionally, an HPC system administrator will always require users to perform certain operations within the infrastructure themselves, as there are risks involved with spawning these notebooks directly on an HPC. For example, it is entirely possible that a user forgets to shut down their notebook and unknowingly ties up a portion of the HPC’s resources. This can be overcome, but it was not within the scope of the project.
Approach two: parallelizable with more interactivity
So we decided to go for another, more “code interactive” approach. For that, we chose an old trusted friend from the Python and HPC community: ipyparallel. If you’re not familiar with this plugin: it is a child module of IPython, the project that has grown into Jupyter notebooks.
ipyparallel establishes a cluster consisting of one controller and multiple engines. Within your Python code you can call this controller and distribute your code, scripts, calculations and analytics across the engines (parallelization), each running separately on a node of the HPC.
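ipyparallel itself needs a live controller and engines, so as a purely local, standard-library stand-in, the pattern can be sketched with `concurrent.futures`: the worker processes play the role of the engines, and `pool.map` plays the role of ipyparallel’s `view.map_sync`. The function name `analyze_sample` is just illustrative:

```python
from concurrent.futures import ProcessPoolExecutor

def analyze_sample(x):
    # Stand-in for a per-sample computation that ipyparallel would
    # ship from the controller to an engine on an HPC node.
    return x * x

if __name__ == "__main__":
    # Locally, worker processes play the role of ipyparallel engines;
    # with ipyparallel this would roughly be rc[:].map_sync(analyze_sample, range(8)).
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(analyze_sample, range(8)))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The difference, of course, is that with ipyparallel each “worker” can be a full HPC node rather than a local process.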
The only thing left is getting the engines to run on the HPC and connect to the controller. This sounds like an easy step, but as we unfortunately discovered, it is not.
There are many configurable parameters within an HPC: the network’s security measures, differences in environments, accessibility of tools, locations of modules, and more. So a lot has to be taken into account and transferred between the server where JupyterHub is running and the HPC. Luckily, for this project many of these pitfalls could be circumvented by using a folder shared between the HPC and JupyterHub.
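Concretely, the shared folder can carry ipyparallel’s connection files: the controller writes a JSON connection file into its profile directory, and the engines on the HPC read it from there. A sketch, assuming a hypothetical shared path:

```shell
# On the JupyterHub server: start the controller and let it write its
# connection files into a directory the HPC nodes can also see.
ipcontroller --profile-dir=/shared/ipyparallel/profile_hpc --ip='*'

# In the HPC batch job: start an engine that reads the controller's
# connection file from the shared profile directory.
ipengine --file=/shared/ipyparallel/profile_hpc/security/ipcontroller-engine.json
```

Without a shared folder, these connection files would have to be copied to the HPC by some other means, which is exactly the kind of transfer problem mentioned above.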
Finally, the last thing you need to keep in mind is which tool will be used for scheduling jobs. There are many options available these days: SLURM, LSF, MPI, and more. Most of these tools have existed for quite some time, which means you get to choose from a variety of trustworthy job schedulers. It also means that a module such as ipyparallel supports several of them out of the box. So if you know what you are doing, it will hopefully be just a matter of changing the configuration file and creating a template for submitting the ipengine.
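For SLURM, for example, ipyparallel ships launcher classes that submit the engines as batch jobs. A hedged sketch of what an `ipcluster_config.py` might contain; the exact option and class names vary between ipyparallel versions, and the template below is a simplified placeholder:

```
# ipcluster_config.py (sketch; option names differ across ipyparallel versions)
c.IPClusterEngines.engine_launcher_class = 'SlurmEngineSetLauncher'

# {n} and {profile_dir} are filled in by ipyparallel when submitting the job.
c.SlurmEngineSetLauncher.batch_template = """#!/bin/bash
#SBATCH --job-name=ipengine
#SBATCH --ntasks={n}
srun ipengine --profile-dir={profile_dir}
"""
```

Equivalent launcher classes exist for other schedulers, so switching from one to another is mostly a matter of swapping the launcher and adjusting the template.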
The flexibility of JupyterHub surprised me a lot. There are so many plugins, interchangeable parts like “Spawners”, authentication mechanisms and supported languages. For almost any functionality there is already something available. Most of the problems arise from getting all the components to work together properly... so hopefully this will improve with JupyterLab.
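Swapping those interchangeable parts is a matter of a few lines in `jupyterhub_config.py`. A hedged sketch using the third-party batchspawner project, which submits each user’s notebook server as an HPC batch job; the option values are site-specific placeholders, not the actual configuration from this project:

```
# jupyterhub_config.py (sketch; values are placeholders)
c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'

# batchspawner submits each single-user server as a SLURM job:
c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
c.SlurmSpawner.req_runtime = '8:00:00'  # hypothetical walltime limit
```

This is precisely the kind of component-swapping that makes JupyterHub flexible, and also the place where most of the integration effort goes.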
As for the HPC, there are many things you need to keep in mind: which variables a user can or wants to alter, which ones they are actually allowed to alter, getting the data to the HPC in a secure and consistent way, and getting it back to JupyterHub without any data loss.
Overall, in my opinion, the most important thing to keep in mind is that a lot of the work of integrating Jupyter with HPCs has already been done. So after some research into your specific infrastructure, for example which tools are already used there, you can try setting it up yourself, and hopefully you will succeed!
And if not, just contact us! We are happy to help you.
Have you also worked with Jupyter and an HPC? Share your experiences with us!