Jens Laufer
writes about Software Development, Data Science, Entrepreneurship, Traveling and Sports

Example Use Cases of Docker in the Data Science Process

...or how to avoid the It-works-on-my-computer-but-nowhere-else-problem

The excellent comic by Jeff Lofvers illustrates what you often face in software development, but also in data science. You are preparing a data analysis or predictive model, but when you want to share it, it does not work on someone else's machine. It fails because libraries are missing, libraries have the wrong version (“dependency hell”), or configurations differ. Time-consuming troubleshooting starts.

The solution is not far away: Docker solves the problem of reproducibility in a lightweight manner, but also offers you many other advantages.

What is Docker?

Docker is free software that performs operating-system-level virtualisation. Docker is used to run software packages called containers. Containers are isolated from each other and bundle their own application, tools, libraries and configuration files. All containers are run by a single operating-system kernel and are thus more lightweight than virtual machines. [Wikipedia on Docker]

Docker makes it easy to create, run and distribute applications. Applications are packaged up with everything that is needed to run them. The concept guarantees that a container can be run on every Docker runtime environment.
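A quick sketch of that guarantee, assuming only that Docker is installed and the daemon is running: the two commands below behave the same on every machine with a Docker runtime.

```shell
# Pull a pinned public image and run a throwaway container from it.
# --rm removes the container again after the command exits.
docker pull python:3.10-slim
docker run --rm python:3.10-slim python --version
```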

Advantages of Docker:

  1. Reproducibility

    With Docker, you ensure that your software artefact (application, data analysis, predictive model etc.) runs on all Docker runtime environments. Your shipments are more robust, as the container contains everything that’s needed to run your artefact. You are not distributing only the code, but also the environment.

  2. Consistency

    Docker equips you with one uniform and consistent runtime environment for all kinds of software artefacts. It reduces the time for system administration and lets you focus on your core work. You might know Anaconda environments; Docker is something similar for the whole software ecosystem.

  3. Traceability

    a.) Version controlling of Docker container code

    A Docker container is built from a script, which is a human-readable summary of the necessary software dependencies and the environment. This script can be version controlled, so the environment is entirely traceable.

    b.) Uniform distribution environment for all artefacts

    Docker images can be stored in a registry within your organisation. You keep the whole version history this way.

  4. Portability

    Docker containers can easily be ported from one Docker environment to another. Docker Swarm (or Kubernetes) lets you scale applications automatically. This reduces the costs of system administration and operation.
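The script mentioned above under traceability is the Dockerfile. A minimal, illustrative sketch for a Python-based analysis (the file names `requirements.txt` and `run_analysis.py` are placeholders, not from the article):

```dockerfile
# Pin the OS and Python version via the base image
FROM python:3.10-slim

# Install pinned library versions for reproducibility
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Add the analysis code itself
COPY . /app
WORKDIR /app
CMD ["python", "run_analysis.py"]
```

Built with `docker build -t my-analysis:1.0 .`, the resulting image can be tagged and pushed to an organisation-internal registry (e.g. `docker push registry.example.com/my-analysis:1.0`, a placeholder hostname), which preserves the version history, and later scaled out with Docker Swarm or Kubernetes.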

However, what are the use cases for Docker in the data science universe? I will concentrate on the OSEMN data science process:

OSEMN: Data Science Process

Use Cases of Docker in the Data Science Process

The reality today is that the process involves a wide variety of tools and programming languages. Docker is the go-to platform for managing these heterogeneous technology stacks, as each container provides exactly the runtime environment needed by the one application it is packed around. Interference between technology stacks is reduced this way.

1. Obtain: Gather Data from relevant sources

Data is the oil of data science. You retrieve it, e.g., from surveys, clinical trials, web scraping, scientific experiments, corporate applications or simulations. Typically, data engineers deal with the data, but other stakeholders are involved as well, which leads to a wide diversity of database systems and programming languages.

All these technology stacks can be run independently within Docker containers.
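For example (image tags, names and ports below are illustrative), two entirely different database systems can run side by side without installing either on the host:

```shell
# PostgreSQL for relational data
docker run -d --name pg -e POSTGRES_PASSWORD=secret -p 5432:5432 postgres:15

# MongoDB for document data -- a separate, isolated stack
docker run -d --name mongo -p 27017:27017 mongo:6
```

Both containers are isolated from each other; removing them (`docker rm -f pg mongo`) leaves the host untouched.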

2. Scrub: Clean and aggregate data to formats the machine understands

The data obtained in step 1 is the oil, but right now it’s crude oil. You need to clean, process and combine it into the data you need for analysis and modelling.

Some of these use cases may already be handled in the data-retrieval step and belong more to a data engineering technology stack. Other use cases overlap with the exploration and modelling phases and involve technologies more typical of data analytics.

3. Explore: Find significant patterns and trends

A lot of data analytics work is done in notebooks (Jupyter, RMarkdown), which need to be published. You could use one central Jupyter instance for the whole organisation; the problem with this approach is that you might be stuck with fixed configurations and library versions. Another method is to publish one or more notebooks with Docker containers. This way you stay more flexible with particular setups.
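A sketch of the container-per-notebook approach, using the official `jupyter/datascience-notebook` image (any project-specific image with pinned libraries would work the same way):

```shell
# Start a self-contained Jupyter server on port 8888,
# mounting the current project directory into the container
docker run --rm -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/datascience-notebook
```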

In the exploration phase, all you have to do is understand what patterns and values are in the data. You want to make the results available to everyone interested.

4. Model: Construct models to predict and forecast

The cleaned and preprocessed data is used to train machine learning or deep learning algorithms. This way, you create models that are a mathematical representation of the observed data. They can be used for predictions, forecasts and quantification of the ineffable.

Training neural networks takes a lot of GPU power. Since GPU access cannot be handled in a hardware-agnostic and platform-agnostic way, you need NVIDIA Docker to isolate the training process in a Docker container.
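With the NVIDIA Container Toolkit installed on the host, the `--gpus` flag passes the GPUs into the container. A hedged sketch (the image tags and `train.py` are illustrative):

```shell
# Check that the GPUs are visible from inside a container
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# Run the training script in a GPU-enabled framework image
docker run --rm --gpus all -v "$PWD":/workspace \
  tensorflow/tensorflow:latest-gpu python /workspace/train.py
```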

5. Interpret: Put the results to good use

The data science insights are communicated and visualised. Models are distributed as microservices.
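Packaging a model as a microservice is then just one more container. A minimal sketch, assuming a serialised model and a small Flask app exposing a prediction endpoint (`model.pkl` and `serve.py` are hypothetical file names, not from the article):

```dockerfile
FROM python:3.10-slim

# Pin the serving dependencies
RUN pip install --no-cache-dir flask==3.0.3 scikit-learn==1.4.2

# The serialised model and a small web app serving predictions
COPY model.pkl serve.py /app/
WORKDIR /app
EXPOSE 8000
CMD ["python", "serve.py"]
```

The same image can be handed to operations and scaled like any other microservice.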


Docker is a powerful tool for data scientists, too, and can be applied at all stages of the OSEMN process. You can ship all kinds of artefacts in a consistent, reproducible and traceable way. The artefacts can differ widely in their technology stack, which is the reality in data science projects. Data engineers work with databases like Oracle, MySQL, MongoDB, Redis or Elasticsearch, and with programming languages like Java, Python or C++. In the analytics and modelling team, people might work with R, Python, Julia or Scala, while data storytellers tell their story with d3.js in JavaScript or use Tableau. As specialists are rare, it’s better to let them work with familiar technologies than to push them into something unknown. You get better results faster.

Docker is the way to go to manage the heterogeneous technology landscape in data science.


Running Jekyll with Docker

Running Kibana on Docker

HOWTO: Tableau Server Linux in Docker Container

Running dash app in docker container

Learn How to Dockerize a ShinyApp in 7 Steps

How to compile R Markdown documents using Docker

Tutorial: Running a Dockerized Jupyter Server for Data Science

Kafka-Docker: Steps To Run Apache Kafka Using Docker

Dockerizing an Angular App

Dockerizing a Node.js web app

Dockerize Vue.js App

Running MongoDB as a Docker container

Dockerize PostgreSQL

Docker Image with Chromedriver and Geckodriver

Dockerize your Python Application
