How does Docker know when to use the cache during a build and when not?


Caching Problem Overview


I'm amazed at how well Docker's layer caching works, but I'm also wondering how it determines whether it may use a cached layer.

Let's take these build steps for example:

Step 4 : RUN npm install -g   node-gyp
 ---> Using cache
 ---> 3fc59f47f6aa
Step 5 : WORKDIR /src
 ---> Using cache
 ---> 5c6956ba5856
Step 6 : COPY package.json .
 ---> d82099966d6a
Removing intermediate container eb7ecb8d3ec7
Step 7 : RUN npm install
 ---> Running in b960cf0fdd0a

For example, how does it know it can use the cached layer for npm install -g node-gyp but has to create a fresh layer for npm install?

Caching Solutions


Solution 1 - Caching

The build cache process is explained fairly thoroughly in the Best practices for writing Dockerfiles: Leverage build cache section.

> - Starting with a parent image that is already in the cache, the next instruction is compared against all child images derived from that base image to see if one of them was built using the exact same instruction. If not, the cache is invalidated.
>
> - In most cases, simply comparing the instruction in the Dockerfile with one of the child images is sufficient. However, certain instructions require more examination and explanation.
>
> - For the ADD and COPY instructions, the contents of the file(s) in the image are examined and a checksum is calculated for each file. The last-modified and last-accessed times of the file(s) are not considered in these checksums. During the cache lookup, the checksum is compared against the checksum in the existing images. If anything has changed in the file(s), such as the contents and metadata, then the cache is invalidated.
>
> - Aside from the ADD and COPY commands, cache checking does not look at the files in the container to determine a cache match. For example, when processing a RUN apt-get -y update command the files updated in the container are not examined to determine if a cache hit exists. In that case just the command string itself is used to find a match.
>
> Once the cache is invalidated, all subsequent Dockerfile commands generate new images and the cache is not used.
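
To make these rules concrete, here is a rough sketch of how they would apply to the steps from the question. The base image is an assumption (the question does not show its first three steps), and the comments describe the behaviour quoted above rather than anything verified against a particular Docker version.

```dockerfile
# Hypothetical base image; the question's own steps 1-3 are not shown.
FROM node:20

# RUN: only the command string is compared. If this line is unchanged and the
# parent layer is in the cache, the layer is reused, even if a newer node-gyp
# has been published in the meantime.
RUN npm install -g node-gyp

# WORKDIR: also matched on the instruction text alone.
WORKDIR /src

# COPY: a checksum of package.json (not its modification or access time) is
# part of the cache lookup. Editing the file invalidates this layer.
COPY package.json .

# Once a layer has been invalidated, every later instruction is rebuilt, so
# this runs again although its command string did not change.
RUN npm install
```

This would explain the output in the question: the first two steps match on instruction text, while the COPY step misses because the file checksum differs.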

You will run into situations where OS packages, npm packages, or a Git repo have been updated to newer versions (say, a ~2.3 semver range in package.json), but because neither your Dockerfile nor your package.json has changed, Docker will continue using the cache.

It's possible to programmatically generate a Dockerfile that busts the cache by modifying lines based on smarter checks (e.g. retrieving the latest commit SHA of a Git branch to use in the clone instruction). You can also periodically run the build with --no-cache=true to force updates.
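
One possible variation on this idea, instead of generating the whole Dockerfile, is to pass the commit SHA in as a build argument. The sketch below is only an illustration: the base image, the repository URL and the REPO_SHA argument name are all made up, and it relies on a RUN instruction missing the cache when a build argument it references changes value.

```dockerfile
# Hypothetical base image (assumed to ship with git) and repository URL.
FROM node:20

# REPO_SHA is a hypothetical build argument holding the branch head, e.g.:
#   docker build \
#     --build-arg REPO_SHA="$(git ls-remote https://example.com/your/repo.git refs/heads/main | cut -f1)" \
#     .
ARG REPO_SHA

# The command references REPO_SHA, so a new commit on the branch changes the
# value seen by this step and the clone is executed again instead of being
# served from the cache.
RUN git clone https://example.com/your/repo.git /src \
 && git -C /src checkout "$REPO_SHA"

# To force a full rebuild regardless of any checks:
#   docker build --no-cache=true .
```

The advantage of a build argument is that the Dockerfile itself stays static; the trade-off is that whoever runs the build has to remember to supply the value.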

Solution 2 - Caching

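It's because your package.json file has been modified. Step 6 (COPY package.json .) does not report "Using cache"; it ends with "Removing intermediate container" instead, which means a new layer was built. Every step after that, including npm install, has to run again.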

That's also usually the reason why package-manager (vendor/3rd-party) manifest files such as package.json are COPY'ed first during docker build. After that you run the package-manager installation, and only then do you add the rest of your application, i.e. src.

If your dependencies haven't changed, these steps are served from the build cache.
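
A minimal sketch of that ordering, assuming a Node.js project whose application code lives in a src directory (the base image and paths are illustrative, not taken from the question):

```dockerfile
# Hypothetical base image and paths.
FROM node:20
WORKDIR /src

# 1. Copy only the dependency manifest. This layer changes only when
#    package.json itself changes.
COPY package.json .

# 2. Install dependencies. Served from the cache as long as step 1 was.
RUN npm install

# 3. Copy the application code last. Edits to the source invalidate the
#    cache from here on, leaving the install layer above untouched.
COPY src ./src
```

With the steps in the opposite order (application code first, then npm install), every source edit would force a full dependency install.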

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Hedge | View Question on Stackoverflow |
| Solution 1 - Caching | Matt | View Answer on Stackoverflow |
| Solution 2 - Caching | schmunk | View Answer on Stackoverflow |