Rebuilding When a File's Checksum Changes

Summary:

In this article, Ask Mr. Make shows a simple hack to GNU Make to cause it to do the right thing when the contents of a source file change.

A common scenario is that an engineer working on a build on their local machine rebuilds all objects and later gets the latest version of source files from source code control.  Some source control systems set the timestamp on the source files to the time at which the file was checked in; in that case the freshly built object files can have timestamps that are later than those of the (potentially changed) source code, and GNU Make sees nothing to rebuild.

In this article I show a simple hack to GNU Make to cause it to do the right thing when the contents of a source file change.

A Simple Example

The following simple Makefile builds an object file foo.o from foo.c and foo.h using the GNU Make built-in rule to make a .o file from a .c.

.PHONY: all
all: foo.o

foo.o: foo.c foo.h

If either foo.c or foo.h is newer than foo.o then foo.o will be rebuilt.

If foo.h were to change without updating its timestamp then GNU Make would do nothing.   For example, if foo.h were updated from source code control, this Makefile might do the wrong thing.
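The failure is easy to reproduce outside of Make.  The following shell sketch is only an illustration: the file contents and dates are arbitrary, foo.o is an empty stand-in for a real object file, and the [ -nt ] test merely approximates the newer-than comparison that GNU Make performs.

```shell
# Work in a scratch directory so nothing real is touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

printf 'int x;\n' > foo.h
touch -t 202001010000 foo.h      # pretend foo.h was checked in long ago
touch -t 202001020000 foo.o      # the stand-in object file is a day newer

# Change the contents, then put the old timestamp back, as some
# source control systems do when they update a file.
printf 'int y;\n' > foo.h
touch -t 202001010000 foo.h

if [ foo.h -nt foo.o ]; then
    verdict="rebuild"
else
    verdict="up to date"         # what a timestamp-only check concludes
fi
echo "timestamp check says: $verdict"
```

The contents of foo.h have changed, yet a comparison based purely on timestamps still reports that nothing needs doing.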

To work around this problem what's needed is a way to force GNU Make to consider the contents of the file and not its timestamp.  Since GNU Make can only handle timestamps internally, we need to hack the Makefile so that file timestamps are related to file contents.

Hashing File Contents

An easy way to detect a change in a file is to use a cryptographic hash function, such as MD5, to generate a hash of the file.  Since any change in the file will cause the hash to change, just examining the hash will be enough to detect a change in the file's contents.

To force GNU Make to check the contents of each file we'll associate a file with the extension .md5 with every source code file that we want to test. Each .md5 file will contain the MD5 checksum of the corresponding source code file.

In the example above source code files foo.c and foo.h will have associated .md5 files foo.c.md5 and foo.h.md5.  To generate the MD5 checksum we can use the md5sum utility which outputs a hexadecimal string containing the MD5 checksum of its input file.

If we arrange that the timestamp of the .md5 file changes when the checksum changes then GNU Make can check the timestamp of the .md5 file in lieu of the actual source file. 

In the example, GNU Make would check the timestamp of foo.c.md5 and foo.h.md5 to determine whether foo.o needs to be rebuilt.
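Before wiring this into a Makefile, the idea can be tried directly in the shell.  This is a minimal sketch under stated assumptions: update_md5 is a hypothetical helper written for this illustration, not something GNU Make or md5sum provides, and it rewrites the .md5 file only when the stored checksum differs from the current one.

```shell
# update_md5 FILE: refresh FILE.md5 only if FILE's checksum differs from
# the one stored there (hypothetical helper, for illustration only).
update_md5() {
    new=$(md5sum "$1")
    old=$(cat "$1.md5" 2>/dev/null)   # empty if FILE.md5 does not exist yet
    if [ "$new" != "$old" ]; then
        printf '%s\n' "$new" > "$1.md5"
        echo "updated"
    else
        echo "unchanged"              # file, and so its timestamp, left alone
    fi
}

dir=$(mktemp -d)
cd "$dir" || exit 1
: > foo.h                             # start with an empty foo.h
first=$(update_md5 foo.h)             # no foo.h.md5 yet, so it is written
second=$(update_md5 foo.h)            # same contents, so nothing is written
echo '// Add a comment' > foo.h
third=$(update_md5 foo.h)             # contents changed, so it is rewritten
echo "$first $second $third"
```

Because the .md5 file is written only when the checksum differs, its timestamp moves exactly when the contents of the source file change, which is the property GNU Make needs.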

The Modified Makefile

Here's the completed Makefile with MD5 checksum checking:

to-md5 = $1 $(addsuffix .md5,$1)

.PHONY: all
all: foo.o

foo.o: $(call to-md5,foo.c foo.h)

%.md5: FORCE
	@$(if $(filter-out $(shell cat $@ 2>/dev/null),$(shell md5sum $*)),md5sum $* > $@)

FORCE:

The first thing to notice here is that the prerequisite list for foo.o has changed from foo.c foo.h to $(call to-md5,foo.c foo.h).  The to-md5 function defined in the Makefile returns the file names in its argument followed by the same names with the suffix .md5 added.  So after expansion the line reads foo.o: foo.c foo.h foo.c.md5 foo.h.md5.  This tells GNU Make that foo.o is to be rebuilt if either of the .md5 files is newer, as well as if either foo.c or foo.h is newer.

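To ensure that each .md5 file always carries the correct timestamp, the rule that makes it is run every time.  Each .md5 file is remade by the %.md5: FORCE rule.  Because FORCE is a rule with no prerequisites and no commands, and no file named FORCE exists, GNU Make always considers it to have been updated, which means that the .md5 files are examined on every run.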

The commands for the %.md5: FORCE rule will only actually rewrite the .md5 file if it doesn't exist, or if the checksum stored in it differs from the current checksum of the corresponding file.  That works as follows.

The $(shell md5sum $*) checksums the file that matches the % part of %.md5.  For example, when this rule is being used to generate the foo.h.md5 file, % matches foo.h and the match is stored in $*.

The $(shell cat $@ 2>/dev/null) gets the contents of the current .md5 file (or a blank if it doesn't exist; note how the 2>/dev/null means that errors are ignored) and then the $(filter-out ...) compares the checksum retrieved from the .md5 file and the checksum generated by md5sum.  If they are the same then $(filter-out ...) is an empty string.

If the checksum has changed then the rule will actually run md5sum $* > $@, which updates the .md5 file's contents and timestamp.  The stored checksum will be available for later runs of Make, and the changed timestamp on the .md5 file will cause the related object file to be rebuilt.

The Hack in Action

To see this in action we create files foo.c and foo.h and run GNU Make:

$ touch foo.c foo.h
$ ls
foo.c  foo.h  Makefile
$ make
cc    -c -o foo.o foo.c
$ ls
foo.c  foo.c.md5  foo.h  foo.h.md5  foo.o  Makefile

GNU Make has generated the object file foo.o and two .md5 files: foo.c.md5 and foo.h.md5.   Each .md5 file contains the checksum of the file:

$ cat foo.c.md5
d41d8cd98f00b204e9800998ecf8427e  foo.c

First, we verify that everything is up to date and then that changing the timestamp on either foo.c or foo.h causes foo.o to be rebuilt.

$ make
make: Nothing to be done for `all'.
$ touch foo.c
$ make
cc    -c -o foo.o foo.c
$ make
make: Nothing to be done for `all'.
$ touch foo.h
$ make
cc    -c -o foo.o foo.c

To demonstrate that changing the contents of a source file will cause foo.o to be rebuilt we can cheat by changing the contents of, say, foo.h and then touching foo.o.  That way we know that foo.o is newer than foo.h but that foo.h's contents have changed since the last time foo.o was built.

$ make
make: Nothing to be done for `all'.
$ cat foo.h.md5
d41d8cd98f00b204e9800998ecf8427e  foo.h
$ cat > foo.h
// Add a comment
$ touch foo.o
$ make
cc    -c -o foo.o foo.c
$ cat foo.h.md5
65f8deea3518fcb38fd2371287729332  foo.h

There you can see that foo.o was rebuilt even though it was newer than all the related source files and that foo.h.md5 has been updated with the new checksum of foo.h.

Improvements

There are a couple of improvements that can be made to the code as it stands: the first is an optimization, the second makes the code ignore changes in whitespace in a source file.

When the checksum of a file has changed the rule to update the .md5 file actually ends up running md5sum twice on the same file with the same result.   That's a waste of time.   If you are using GNU Make 3.80 or above it's possible to store the output of md5sum $* in a temporary variable called CHECKSUM and just use the variable:

%.md5: FORCE
	@$(eval CHECKSUM := $(shell md5sum $*))$(if $(filter-out $(shell cat $@ 2>/dev/null),$(CHECKSUM)),echo $(CHECKSUM) > $@)

The other improvement is to make the checksum insensitive to changes in whitespace.  After all it would be a pity if two developers' differing opinions of the right amount of indentation caused object files to rebuild when nothing else had changed.

The md5sum utility itself does not have a way of ignoring whitespace, but it's easy enough to pass the source file through tr to strip whitespace before handing it to md5sum for checksumming.
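As a rough sketch of that pipeline, assuming a tr that understands the [:space:] character class (GNU tr does): ws_md5 is a hypothetical helper name, and the three small C files exist only for the demonstration.

```shell
# ws_md5 FILE: MD5 of FILE with all whitespace removed
# (hypothetical helper, for illustration only).
ws_md5() {
    tr -d '[:space:]' < "$1" | md5sum | cut -d' ' -f1
}

dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'int main(void)\n{\n    return 0;\n}\n' > a.c
printf 'int main(void)\n{\n\treturn 0;\n}\n'   > b.c   # tab instead of spaces
printf 'int main(void)\n{\n    return 1;\n}\n' > c.c   # a real change

sum_a=$(ws_md5 a.c); sum_b=$(ws_md5 b.c); sum_c=$(ws_md5 c.c)
[ "$sum_a" = "$sum_b" ] && echo "a.c and b.c: same checksum"
[ "$sum_a" != "$sum_c" ] && echo "a.c and c.c: different checksum"
```

Note that removing every whitespace character is a blunt instrument: it also hides changes where whitespace matters, such as inside string literals or between two tokens.  Because md5sum reads from a pipe here, its output no longer includes the file name, so the rule in the Makefile would need adjusting to match.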

Conclusion

I hope this hack proves useful in your real Makefiles.  If you do use it, or improve it, drop me a line.
