This series of posts describes how we can generate smaller Docker images. In the first post we outlined a common problem with container images - that they frequently contain artefacts that were needed to build the software or to install it into the container. We’ll show one approach that can be used to avoid this extra bloat, and so generate smaller and more secure containers.

The RDKit container that we built in the first post was a fairly extreme case of this anti-pattern: the RDKit source code, along with gcc, g++, make, cmake and everything else needed to build it, ended up inside the final container, even though none of it serves any purpose when running the RDKit that was built.

Surely there is a better way to do this? There is. In fact, there are several!

One thing we could do is not build RDKit but just install from RPM or DEB packages. That would definitely address much of the problem, but that approach didn’t really work for us as the RPM and DEB packages for RDKit have historically not been kept up to date, and we also needed the flexibility to build bespoke distributions that would not be possible if we just relied on the standard distributions.

So no problem, we just build the packages ourselves and then install them. But that wouldn’t help if we built the packages inside our final container. In fact it would be worse, as we’ve already shown that we can build and install directly without the need for any package manager.

So what we did was use the builder pattern. This involves using one big fat image to build the artefacts as packages, and then installing those packages into a second lightweight image. The process is orchestrated by a simple bash script.

Before we look at the details it’s worth pointing out that Docker has recently introduced multi-stage builds that can be used to achieve much the same thing. This is a fairly new feature that has not yet worked its way into some Linux distributions, so we’ll stick with the more manual approach here. A nice comparison has been written by Alex Ellis.

So how does our approach work? Full details are in the GitHub repo.

The process is orchestrated by the build.sh bash script.

Step 1 is to use a Dockerfile similar to the one we described in the previous post. This builds RDKit from source, but rather than installing the resulting build into the image we build DEB and RPM packages. The Dockerfile is here but the key part is this:

RUN cmake -Wno-dev \
  -DRDK_INSTALL_INTREE=OFF \
  -DRDK_BUILD_INCHI_SUPPORT=ON \
  -DRDK_BUILD_AVALON_SUPPORT=ON \
  -DRDK_BUILD_PYTHON_WRAPPERS=ON \
  -DRDK_BUILD_SWIG_WRAPPERS=ON \
  -DCMAKE_INSTALL_PREFIX=/usr \
  ..

RUN nproc=$(getconf _NPROCESSORS_ONLN)\
  && make -j $(( nproc > 2 ? nproc - 2 : 1 ))\
  && make install\
  && cpack -G DEB\
  && cpack -G RPM
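
The job count passed to make leaves two cores free on larger machines and falls back to a single job on small ones. As a minimal sketch of that arithmetic, here it is as a standalone bash function (the name make_jobs is ours, purely for illustration, and is not part of the repo):

```shell
#!/usr/bin/env bash
# make_jobs is a hypothetical helper, not part of the original build scripts.
# It mirrors the expression used in the Dockerfile: use all but two cores,
# but never fewer than one job.
make_jobs() {
  local nproc=$1
  echo $(( nproc > 2 ? nproc - 2 : 1 ))
}

# Show the job count that would be used for a few core counts.
for cores in 1 2 4 8; do
  echo "$cores cores -> make -j $(make_jobs "$cores")"
done
```

So a machine with 8 cores would run 6 parallel jobs, while a machine with 1 or 2 cores would run just one.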

The main bash script invokes this Dockerfile as follows:

docker build -f Dockerfile-build-debian\
  -t $BASE/rdkit-build:$TAG\
  --build-arg RDKIT_BRANCH=$BRANCH .

This builds RDKit as before, but then uses cpack to build RPM and DEB packages. Yes, you can build both on the same system, in this case a Debian one. We also build the artefacts we need for a Java-based RDKit image.
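
The docker build command relies on three variables: BASE, TAG and BRANCH. Here is a sketch of how a script might set them up and assemble the command. The defaults are our assumptions rather than values taken from build.sh: BASE and TAG match the image listing later in this post, while the BRANCH default is purely a guess. The function name build_command is also hypothetical.

```shell
#!/usr/bin/env bash
# Assumed defaults, overridable from the environment. These are
# illustrative only; check build.sh in the repo for the real values.
BASE="${BASE:-informaticsmatters}"
TAG="${TAG:-latest}"
BRANCH="${BRANCH:-master}"   # hypothetical default branch name

# build_command is a hypothetical helper that prints the docker build
# command for the build image instead of running it (a dry run).
build_command() {
  local base=$1 tag=$2 branch=$3
  echo "docker build -f Dockerfile-build-debian -t $base/rdkit-build:$tag --build-arg RDKIT_BRANCH=$branch ."
}

build_command "$BASE" "$TAG" "$BRANCH"
```

Printing the command first is a cheap way to check that the variables expand as intended before starting a long build.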

The second step creates a running container from that image and copies the built artefacts from the container to the host system. This is done as follows:

rm -rf artifacts/$TAG
mkdir -p artifacts/$TAG
mkdir artifacts/$TAG/debs
mkdir artifacts/$TAG/rpms
mkdir artifacts/$TAG/java
docker run -it --rm -u $(id -u)\
  -v $PWD/artifacts/$TAG:/tohere:Z\
  $BASE/rdkit-build:$TAG bash -c\
  'cp build/*.deb /tohere/debs && cp build/*.rpm /tohere/rpms && cp Code/JavaWrappers/gmwrapper/org.RDKit.jar /tohere/java && cp Code/JavaWrappers/gmwrapper/libGraphMolWrap.so /tohere/java'
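
Since the later image builds depend on these files, it can be worth checking that the copy actually produced them before moving on. A minimal sketch, assuming the directory layout created above (check_artifacts is a hypothetical helper of ours, not something found in build.sh):

```shell
#!/usr/bin/env bash
# check_artifacts is a hypothetical helper, not part of build.sh.
# It returns 0 only if the given directory holds at least one .deb,
# at least one .rpm, and both Java files that the copy step should produce.
check_artifacts() {
  local dir=$1
  ls "$dir"/debs/*.deb > /dev/null 2>&1 \
    || { echo "no DEB packages in $dir/debs" >&2; return 1; }
  ls "$dir"/rpms/*.rpm > /dev/null 2>&1 \
    || { echo "no RPM packages in $dir/rpms" >&2; return 1; }
  [ -f "$dir/java/org.RDKit.jar" ] \
    || { echo "missing $dir/java/org.RDKit.jar" >&2; return 1; }
  [ -f "$dir/java/libGraphMolWrap.so" ] \
    || { echo "missing $dir/java/libGraphMolWrap.so" >&2; return 1; }
}
```

It could be called as check_artifacts "artifacts/$TAG" || exit 1 straight after the docker run command.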

The remaining steps build different Docker images for different purposes. We build:

  1. A Debian-based image with Python using the DEB packages (Dockerfile-python-debian).
  2. A CentOS 7 based image with Python using the RPM packages (Dockerfile-python-centos).
  3. A Debian-based image with Java using the DEB packages (Dockerfile-java-debian).
  4. A Debian-based image with Java and the Tomcat servlet container using the DEB packages (Dockerfile-tomcat-debian).

All of those from one master build. Nice!
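
The four final builds follow the same shape, so they can be pictured as a simple loop. This is a sketch rather than the actual contents of build.sh: the image_for helper and the defaults for BASE and TAG are our assumptions, chosen so that the resulting names match the image listing further down.

```shell
#!/usr/bin/env bash
# Assumed defaults; the real values live in build.sh.
BASE="${BASE:-informaticsmatters}"
TAG="${TAG:-latest}"

# image_for is a hypothetical helper: it derives the image name from the
# Dockerfile name, e.g. Dockerfile-python-debian -> rdkit-python-debian.
image_for() {
  local dockerfile=$1
  echo "$BASE/rdkit-${dockerfile#Dockerfile-}:$TAG"
}

# Print one build command per final image. This is a dry run; drop the
# echo to actually run the builds.
for f in Dockerfile-python-debian Dockerfile-python-centos \
         Dockerfile-java-debian Dockerfile-tomcat-debian; do
  echo docker build -f "$f" -t "$(image_for "$f")" .
done
```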

But the proof of the pudding is in the eating. What are the sizes of those images and how does this compare with the previous approach which yielded an image that was 1.25GB in size?

$ docker images
REPOSITORY                                        TAG                       IMAGE ID            CREATED             SIZE
informaticsmatters/rdkit-tomcat-debian            latest                    7fa32622d1fe        31 hours ago        381 MB
informaticsmatters/rdkit-java-debian              latest                    60c9fc7b7c72        31 hours ago        357 MB
informaticsmatters/rdkit-python-centos            latest                    380a50f7ddd3        31 hours ago        542 MB
informaticsmatters/rdkit-python-debian            latest                    eacb6065c14c        31 hours ago        414 MB
informaticsmatters/rdkit-build                    latest                    7b2cc073b265        31 hours ago        2.27 GB

You’ll see that the build image (rdkit-build) is even bigger at 2.27GB, but that’s expected as it now also contains the RPM and DEB packages. The real comparison is the rdkit-python-debian image which comes in at 414MB, 33% of the size of the original one. That’s a massive improvement!
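
The percentage is easy to check with a little shell arithmetic. In this sketch percent_of is a hypothetical helper of ours, and we assume 1GB = 1000MB when converting the original 1.25GB image size:

```shell
#!/usr/bin/env bash
# percent_of is a hypothetical helper using integer arithmetic: it prints
# size_mb as a whole-number percentage of original_mb, rounded down.
percent_of() {
  local size_mb=$1 original_mb=$2
  echo $(( size_mb * 100 / original_mb ))
}

original_mb=1250   # the original 1.25GB image, assuming 1GB = 1000MB
echo "rdkit-python-debian: $(percent_of 414 "$original_mb")% of the original"
echo "rdkit-python-centos: $(percent_of 542 "$original_mb")% of the original"
```

This gives 33% for the Debian image and 43% for the CentOS one.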

You’ll notice that the CentOS-based image is a bit bigger at 542MB. This is why we have typically used Debian-based images: the debian:jessie image on Docker Hub is 100MB whereas the centos:7 image is 204MB. More on this in the next post.

So we’ve succeeded in creating a container image that is much reduced in size, so is more efficient and more secure. But this isn’t the end of the story. These images are still not ideal for a number of reasons. For one thing we have the DEB or RPM packages ‘stuck’ inside the resulting images eating up unnecessary bytes and for another we still have unnecessary packages installed - those of the package managers themselves. We can tweak the build process a bit to address the first issue, but for the second we need a more radical approach. We’ll describe that in the next post in the series.

Check out the material for this post in our GitHub repo. In there you’ll find a few other things. For example we’re trying to make a CentOS-based build image, but that’s proving to be a little more tricky.