Docker Compose for Data Science

05 January, 2018

Two tools that have made easily repeatable development environments and experiments possible over the past few years are Docker and Jupyter. Let's look at how we can use Docker Compose to rapidly deploy a Jupyter environment with a Postgres database.

Before moving forward, make sure you have Docker and Docker Compose installed. Installing them is outside the scope of this tutorial, but instructions can be found on the Docker website.

Once Docker and Docker Compose are up and running, let's define the file structure of our project. Start by creating three directories in your current directory: jupyter, notebooks, and data. Within the jupyter directory, create a file named Dockerfile, and in your current directory, create a file named docker-compose.yml.

$ mkdir ./jupyter ./data ./notebooks
$ touch ./jupyter/Dockerfile
$ touch ./docker-compose.yml

At this point, your working directory should have the following file structure:

.
├── data
├── docker-compose.yml
├── jupyter
│   └── Dockerfile
└── notebooks

Let's add some contents to our Dockerfile. Fortunately, the Jupyter project provides a series of Dockerfiles with various configurations that we can start from. For this project, we will base our own Dockerfile on the datascience-notebook.

Add the following contents to ./jupyter/Dockerfile:

FROM jupyter/datascience-notebook

# Print the Python version at build time as a quick sanity check
RUN python --version

# Install packages that are easiest to get from the conda-forge channel
RUN conda install --quiet --yes -c conda-forge osmnx dask

# Install additional geospatial and analysis dependencies with pip
RUN pip install -U geopandas \
                   geopy \
                   rtree \
                   folium \
                   shapely \
                   fiona \
                   six \
                   pyproj \
                   numexpr==2.6.4 \
                   elasticsearch \
                   geojson \
                   plotly \
                   tqdm \
                   mapboxgl \
                   cufflinks \
                   geohash2 \
                   tables \
                   mixpanel \
                   GeoAlchemy2

# Declare /notebooks as a volume and make it the working directory
VOLUME /notebooks
WORKDIR /notebooks

While the bulk of what's needed is provided by the jupyter/datascience-notebook base image, defining our own Dockerfile lets us add dependencies that the base image doesn't ship. In this case, we're installing a handful of packages that are useful for geospatial data processing, such as geopandas and geopy.
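Once everything is up and running (we'll get there shortly), you can sanity-check the geospatial install from a notebook. Here's a minimal sketch; the two coordinates are arbitrary example points, and the distance check uses Web Mercator, which distorts distances away from the equator, so treat it as a smoke test rather than a measurement:

import geopandas as gpd
from shapely.geometry import Point

# Two example points as (longitude, latitude): San Francisco and New York
gdf = gpd.GeoDataFrame(
    {"city": ["San Francisco", "New York"]},
    geometry=[Point(-122.4194, 37.7749), Point(-73.9857, 40.7484)],
    crs="EPSG:4326",
)

# Reproject to Web Mercator and measure the straight-line distance in meters
projected = gdf.to_crs("EPSG:3857")
print(projected.geometry.iloc[0].distance(projected.geometry.iloc[1]))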

Now that we have our Dockerfile defined, let's add some content to ./docker-compose.yml that will help us get Jupyter and Postgres kicked off.

Add the following content to ./docker-compose.yml:

version: "3"
services:
  jupyter:
    build:
      context: ./jupyter
    ports:
      - "8888:8888"
    links:
      - postgres
    volumes:
      - "./notebooks:/notebooks"
      - "./data:/data"

  postgres:
    image: postgres
    restart: always
    environment:
      POSTGRES_USER: data
      POSTGRES_PASSWORD: data
      POSTGRES_DB: data

This file defines two services: jupyter and postgres. The jupyter service is built from the Dockerfile defined above, and the postgres service uses the official postgres image from Docker Hub. The postgres service is accessible from the jupyter service at the hostname postgres, and to keep things easy, the database name, username, and password are all set to "data". Finally, we mount the two directories created above, ./notebooks and ./data, into the jupyter service at /notebooks and /data, respectively.
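From a notebook, connecting to the database then looks like the following sketch. One caveat: pandas ships with the base image, and SQLAlchemy comes along as a dependency of GeoAlchemy2, but a Postgres driver may not be present, so you may need to add psycopg2-binary to the pip install list in the Dockerfile above.

import pandas as pd
from sqlalchemy import create_engine

# Inside the jupyter container, the hostname "postgres" resolves to the
# postgres service; the user, password, and database are all "data"
engine = create_engine("postgresql://data:data@postgres:5432/data")

# Run a trivial query to confirm the connection works
print(pd.read_sql("SELECT version();", engine))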

If all of the above went well, all that's left to get things spinning is to run the following in our working directory:

$ docker-compose up

This should result in the jupyter container being built, and then both the jupyter and postgres services being started. You should see something like this in your console:

[Screenshot: console output from docker-compose up]

The first time we access the Jupyter web interface, we need to use the link printed in the console (it includes an authentication token); after that, we can visit http://localhost:8888 directly. Click the link in the console, and a browser should open to the following view:

[Screenshot: the Jupyter notebook landing page]

From here, you can create a new notebook; the .ipynb file will live on the host machine's filesystem in ./notebooks. You can use the ./data directory to store any data that you might want to access from your notebooks.
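For example, a notebook cell like the following round-trips a CSV through Postgres. This is just a sketch; trips.csv is a hypothetical file you would drop into ./data on the host.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://data:data@postgres:5432/data")

# /data inside the container is the ./data directory on the host;
# trips.csv is a hypothetical example file
df = pd.read_csv("/data/trips.csv")

# Write the frame to Postgres, then read it back to verify
df.to_sql("trips", engine, if_exists="replace", index=False)
print(pd.read_sql("SELECT COUNT(*) FROM trips;", engine))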

All of the code described above can be found on GitHub.

I hope this was helpful! If you have any questions, go ahead and open an issue.