The excellent comic by Jeff Lofvers illustrates what you often face in software development but also in data science. You prepare a data analysis or predictive model, but when you want to share it, it does not work on someone else's machine. It fails because libraries are missing, libraries have the wrong version (“dependency hell”), or configurations differ. Time-consuming troubleshooting starts.
The solution is not far away: Docker solves the problem of reproducibility in a lightweight manner, but also offers you many other advantages.
What is Docker?
Docker is free software that performs operating-system-level virtualisation. Docker is used to run software packages called containers. Containers are isolated from each other and bundle their own application, tools, libraries and configuration files. All containers are run by a single operating system kernel and are thus more lightweight than virtual machines. [Wikipedia on Docker]
Docker makes it easy to create, run and distribute applications. Each application is packaged together with everything it needs to run. The concept is designed so that the container can run in any Docker runtime environment.
Advantages of Docker:
With Docker, you ensure that your software artefact (application, data analysis, predictive model, etc.) runs in all Docker runtime environments. Your shipments are more robust, as the container contains everything that’s needed to run your artefact. You are distributing not only the code, but also the environment.
Docker equips you with one uniform and consistent runtime environment for all kinds of software artefacts. It reduces the time for system administration and lets you focus on your core work. You might know Anaconda environments; Docker is something similar for the whole software ecosystem.
a.) Version control of Docker container code
A Docker image, from which containers are started, is built from a script (the Dockerfile), which is a human-readable summary of the necessary software dependencies and environment. This script can be version controlled, so every change to the environment is traceable.
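To make the idea of a human-readable, versionable build script concrete, here is a minimal Python sketch that renders such a script as text. The function name `render_dockerfile`, the base image `python:3.11-slim`, the pinned library versions and the file name `analysis.py` are illustrative assumptions, not taken from a real project.

```python
from typing import Mapping


def render_dockerfile(base_image: str, pinned: Mapping[str, str], entrypoint: str) -> str:
    """Return Dockerfile text that pins every Python dependency to an exact version."""
    # Sort the packages so the output is stable and produces clean diffs in version control.
    requirements = " ".join(
        f"{name}=={version}" for name, version in sorted(pinned.items())
    )
    lines = [
        f"FROM {base_image}",                               # base environment
        "WORKDIR /app",                                     # working directory in the image
        f"RUN pip install --no-cache-dir {requirements}",   # exact library versions
        "COPY . /app",                                      # the analysis code itself
        f'CMD ["python", "{entrypoint}"]',                  # what the container runs
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are assumed for illustration only.
    print(render_dockerfile(
        "python:3.11-slim",
        {"pandas": "2.2.2", "scikit-learn": "1.4.2"},
        "analysis.py",
    ))
```

Because the result is plain text, it can be committed next to the analysis code, and a change of a library version shows up as a one-line diff.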
b.) Uniform distribution environment for all artefacts
Docker images can be stored in a repository (a registry) within your organisation. You keep the whole version history this way.
Docker containers can easily be ported from one Docker environment to another. Orchestrators such as Docker Swarm or Kubernetes let you scale applications across a cluster. Costs for system administration and operation are reduced this way.
However, what are the use cases for Docker in the data science universe? I will concentrate on the data science OSEMN process (Obtain, Scrub, Explore, Model, iNterpret).
In practice, the process today involves a wide variety of tools and programming languages. Docker is the go-to platform to manage these heterogeneous technology stacks, as each container provides exactly the runtime environment needed by the one application it is packed around. Interference between technology stacks is reduced this way.
Data is the oil for data science. You retrieve it, e.g. from surveys, clinical trials, web scraping, scientific experiments, corporate applications or simulations. Typically, data engineers handle the data, but other stakeholders are involved as well, which leads to a wide diversity of database systems and programming languages.
All these technology stacks can be run independently within Docker containers.
The data obtained in step 1 is the oil, but right now it’s crude oil. You need to clean, process and combine it into the data you need for analysis and modelling.
Some of these use cases might already be handled in the data retrieval step and rely more on a data engineering technology stack. Other use cases overlap with the exploration and modelling phase and involve technologies more typical for data analytics.
A lot of data analytics work is done in Notebooks (Jupyter, RMarkdown) which need to be published. You can use a central Jupyter instance for the organisation. The problem with this approach is that you might be stuck with fixed configurations and library versions. Another method would be to publish one or more Notebooks with Docker containers. Then you are more flexible with particular setups.
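As a rough sketch of publishing a notebook environment in a container, the following Python snippet assembles a `docker run` call for a Jupyter image. The image name `jupyter/scipy-notebook`, its internal port 8888 and its home directory `/home/jovyan` are assumptions about one commonly used public image; the host directory is hypothetical.

```python
import shlex


def notebook_run_command(image: str, notebook_dir: str, host_port: int = 8888) -> list[str]:
    """Build a `docker run` call that serves the notebooks in `notebook_dir`."""
    return [
        "docker", "run",
        "--rm",                                      # remove the container when it stops
        "-p", f"{host_port}:8888",                   # host port -> Jupyter port in the container
        "-v", f"{notebook_dir}:/home/jovyan/work",   # mount the notebooks into the container
        image,
    ]


if __name__ == "__main__":
    # "/srv/notebooks" is a hypothetical host directory.
    command = notebook_run_command("jupyter/scipy-notebook", "/srv/notebooks")
    print(shlex.join(command))
    # To actually start the container: subprocess.run(command, check=True)
```

Each team can start its own container from a different image or image tag, which is what gives you the flexibility that a single central Jupyter instance lacks.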
In the exploration phase, your task is to understand what patterns and values the data holds. You want to make the results available to everyone interested.
The cleaned and preprocessed data is used to train machine learning or deep learning algorithms. This way you create models, which are mathematical representations of the observed data. They can be used for predictions, forecasts and quantification of the ineffable.
Training neural networks typically requires a lot of GPU power. To isolate the training process in a Docker container you need Nvidia Docker, because GPU access cannot be provided in a hardware-agnostic and platform-agnostic way.
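On current Docker versions, GPU access is usually requested with the `--gpus` flag of `docker run`, which relies on the NVIDIA Container Toolkit being installed on the host. The sketch below only assembles such a call; the image name `my-training-image:latest` and the script `train.py` are hypothetical placeholders.

```python
import shlex


def gpu_training_command(image: str, script: str, gpus: str = "all") -> list[str]:
    """Build a `docker run` call that exposes host GPUs to a training container."""
    return [
        "docker", "run",
        "--rm",            # remove the container after training finishes
        "--gpus", gpus,    # requires Docker 19.03+ and the NVIDIA Container Toolkit
        image,
        "python", script,  # the training entry point inside the image
    ]


if __name__ == "__main__":
    # Both names are hypothetical and stand in for your own image and script.
    print(shlex.join(gpu_training_command("my-training-image:latest", "train.py")))
```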
The data science insights are communicated and visualised. Models are distributed as microservices.
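As a rough illustration of a model distributed as a microservice, the following sketch wraps a prediction function in a small HTTP service using only the Python standard library. The `predict` function with its fixed weights is a hypothetical stand-in for a trained model, and the `/predict` endpoint and port 8000 are assumptions made for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features: list[float]) -> float:
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    weights = [0.5, -0.25, 1.0]
    return sum(w * x for w, x in zip(weights, features))


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            score = predict([float(x) for x in payload["features"]])
        except (ValueError, KeyError, TypeError):
            # Covers malformed JSON, a missing key and non-numeric features.
            self.send_error(400, "expected a JSON body with a 'features' list")
            return
        body = json.dumps({"prediction": score}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        # Suppress the default per-request logging to stderr.
        pass


def make_server(host: str = "0.0.0.0", port: int = 8000) -> HTTPServer:
    """Create the service; call serve_forever() on the result to start it."""
    return HTTPServer((host, port), PredictHandler)
```

Appending `make_server().serve_forever()` to the file starts the service. Packaged in a container, the image would run this file and publish port 8000, so consumers only need the endpoint and not the model's libraries.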
Docker is the way to go to manage the heterogeneous technology landscape in data science.