This is the first in a series of blog posts about building better Docker images.
Docker Inc. is widely credited with taking containers from geekdom into the real world inhabited by us developers, and it did this by providing easy-to-use tools for building, sharing and running containers. Key to this are the docker build command and the Dockerfile.
But whilst this makes building a container image fairly easy, it doesn’t necessarily make it easy to build a good container image. What do we mean by ‘good’? Many factors contribute, but this series of posts will focus on one: the need to create small images. There are two key reasons why size matters.
Firstly, images are frequently and repeatedly pulled across the internet, so the smaller the image the faster this will happen. Clearly a good thing.
Secondly, small containers contain less ‘stuff’, and the less ‘stuff’ you have in your container, the smaller the attack surface available to anyone trying to get in and cause damage. Hence well-designed, lightweight containers not only load faster but are also more secure. This series of posts describes approaches for achieving this.
As an example we’ll use a Docker image that contains the RDKit cheminformatics toolkit. Our Squonk Computational Notebook uses RDKit extensively, and this container image and related ones are used frequently.
Let’s look at how we first went about this. The RDKit docs provide good information about how to build RDKit from the source code in GitHub. We wanted to be able to build versions at any time, including from the different branches and tags, so building from source seemed to be a sensible approach.
So we created a Dockerfile to handle this. It took a little trial and error to identify all the packages that were needed, but the end result is a well-defined and repeatable process that builds a container image for RDKit.
The Dockerfile looks like this:
The approach should be reasonably clear:
It takes about an hour to build, but eventually you get an image that can be used to run RDKit:
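As a quick check inside the running container, a smoke test along the following lines could confirm that RDKit is usable. This is only a minimal sketch: it assumes RDKit's Python bindings are installed and importable as rdkit, and the helper name rdkit_smoke_test is hypothetical, chosen purely for illustration.

```python
# Minimal smoke test, intended to be run with the Python interpreter
# inside the container.
# Assumption: RDKit's Python bindings are importable as `rdkit`.


def rdkit_smoke_test(smiles="CCO"):
    """Return RDKit's canonical SMILES for `smiles`, or None if RDKit is unavailable."""
    try:
        from rdkit import Chem  # only importable where RDKit is installed
    except ImportError:
        return None
    mol = Chem.MolFromSmiles(smiles)  # returns None if the SMILES cannot be parsed
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


if __name__ == "__main__":
    result = rdkit_smoke_test()
    if result is None:
        print("RDKit is not importable in this environment")
    else:
        print("RDKit parsed ethanol as:", result)
```

If the import fails, the function returns None rather than raising, so the same script can be run in any Python environment to see whether RDKit is present.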
Whilst this is a nice way to illustrate and reliably reproduce the process of building RDKit, it does have a number of significant issues.
It’s the last of these that we want to focus on. This image is an extreme case of a nasty anti-pattern that affects nearly all container images you will find on DockerHub or other repositories: the resulting image contains various artefacts that are needed to build the image, but not to run the container once it is built.
Specifically, it contains git, wget and the entire build infrastructure including make, cmake, gcc and g++, as well as the apt package manager. It also contains the checked-out RDKit GitHub repository. That is a lot of extra fluff, none of which is needed to actually run RDKit, which is the sole purpose of this container image.
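One way to see this for yourself is to ask, from inside a container, which of these build-time tools are still on the PATH. The sketch below is illustrative only: the tool list simply mirrors the names mentioned above, and the helper name find_build_tools is hypothetical. It uses nothing beyond the Python standard library.

```python
import shutil

# Build-time tools named in the text; illustrative, not exhaustive.
BUILD_TOOLS = ["git", "wget", "make", "cmake", "gcc", "g++", "apt"]


def find_build_tools(tools=BUILD_TOOLS):
    """Map each tool name to its location on PATH, or None if it is not found."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_build_tools().items():
        status = path if path else "not found"
        print(f"{tool:8s} {status}")
```

Run inside an image built the way described here, you would expect most of these to be found; in a lean runtime image, ideally none of them would be.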
So whilst this Dockerfile is useful for illustrating how to build different versions of RDKit, and could even be useful for an RDKit developer who needs to rebuild things and do some hacking, it’s a poor example of how to build a container for just running RDKit. The image is huge and, with all those unnecessary extras, has a pretty large attack surface.
We can do much better, and later posts will show various approaches for doing this. Take a look at the next post.
If these Docker images are of use to you, you can find the source code on GitHub and the images on DockerHub.