Release Engineering Refactored: Eliminate Variables with One True Machine

Summary:

Sometimes, the best answer is to rephrase the question. This was the approach that one of our biggest customers took when undertaking a new effort to improve their Release Engineering process. They first asked themselves: How can we make the process of making products faster, more reliable, and more efficient? It’s worth pausing to understand what the process is today before thinking about improving it. Whether managed by a dedicated team or not, Release Engineering is the part of a software development organization that’s responsible for actually converting the millions of lines of carefully crafted source code into a useful software product or service for the end-user. More interestingly, it must also be able to show definitively what went into a release if (or, more realistically, when) the need to modify it arises.

At its simplest, this means when you run the compiler you should take good notes. In the real world, however, RelEng intersects with virtually every aspect of development: it has large touch points with source configuration management, testing, documentation, deployment, support, and product management. The ideal release process is transparent, flexible, efficient, all-encompassing, traceable, scalable, robust, and fast. RelEng teams often maintain dozens of active branches, each on a diverse set of platforms, each requiring tens of thousands of build and test steps to complete. Googling for “Release Engineering Best Practices” and a few other permutations yields no shortage of blogs, articles, and posts chock-full of checklists with plenty of good advice: number every build, always tag your sources, save the binaries, keep track of test outputs, etc. Like any other subject that’s paradoxically both uncharted and ubiquitous, Release Engineering is ripe for internet commentary, and practitioners and pundits are happy to advise.

What’s conspicuously absent, however, are suggestions on how to contain the complexity and manage the workload. It’s easy to see why. So much of RelEng is concerned with validation, proof, consistency—it’s responsible for producing the golden master, the only build that actually matters—so no price tag in terabytes, clock cycles or man-hours is too high so long as the data collected was in the name of tracking every possible input and its impact on any possible output.

But what if instead of asking what we can do, we ask what we can do without? This requires a kind of process engineering leap-of-faith, a fundamental belief that speed is more valuable than quality, detail, or precision, simply because without speed you won't be capable of delivering the quality, detail, or precision you're looking for. The idea is not that we would have positively identified issue X sooner (as the team will happily tell you, there is no way to catch that without more resources, more tests, and more time, all yoked to more process), but rather that a substantial chunk of the issues we worked through before reaching X would have been discovered faster. If the ultimate goal of RelEng is to produce the flawless golden master as efficiently as possible, then it follows that we should be racing through the plastic prototypes (read: your nightly builds) that precede it as fast as we can.

What’s one surprising way we can optimize a Release Engineering process? Do the majority of the runs in one, unchanging compute environment. This means eliminating all platforms, architectures, tools, and environments except the most common one. It means carefully defining a box, which I’ll call the One True Machine (OTM), knowing precisely what’s on it, and making instances of that machine as easily accessible to everyone as a blank sheet of paper.
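One way to make "knowing precisely what's on it" concrete is to publish the OTM definition as a small manifest and compare each new instance against it. The following is a minimal sketch under stated assumptions, not the customer's actual tooling: the manifest keys, the version strings, and the helper names `manifest_fingerprint` and `drift` are all hypothetical.

```python
import hashlib
import json


def manifest_fingerprint(manifest):
    """Return a stable SHA-256 fingerprint for a manifest dictionary.

    Keys are sorted before hashing, so two manifests with the same
    contents always produce the same fingerprint.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drift(expected, actual):
    """Return {name: (expected_value, actual_value)} for each differing entry.

    An entry missing on one side is reported with None on that side.
    """
    names = sorted(set(expected) | set(actual))
    return {
        name: (expected.get(name), actual.get(name))
        for name in names
        if expected.get(name) != actual.get(name)
    }


# Hypothetical published definition of the One True Machine.
OTM_MANIFEST = {"os": "linux-x64", "compiler": "gcc 4.4.7", "make": "3.81"}

# Hypothetical inventory collected from a freshly cloned instance.
instance = {"os": "linux-x64", "compiler": "gcc 4.4.7", "make": "3.82"}

if manifest_fingerprint(instance) != manifest_fingerprint(OTM_MANIFEST):
    for name, (want, got) in drift(OTM_MANIFEST, instance).items():
        print(f"{name}: expected {want!r}, found {got!r}")
```

Comparing fingerprints first keeps the common case (an exact clone) down to a single string comparison; the per-entry report is only needed when an instance has drifted from the published definition.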

In the old days that may have meant disk-imaging systems, but today the name of the game is virtualization. The vision is simply this: at the stroke of one mouse click, a crisp new system spins up that’s ready to build the product from sources. Want to test your new product? Make another fresh system, install the built product, and test it. Want to install another version? Take another sheet from the OTM pad and have at it. Call it the power of the private clone cloud: not only is it virtual and elastic, every system is identical; the whole environment is only available in one flavor.

The discipline enforced by this restriction is incredibly liberating: whole categories of problems (botched PATHs, conflicting libraries, clashing toolchains) are suddenly impossible. More powerfully: explicitly publishing and promoting the default build/test/release environment makes it a new common language for the team. The base VM template is effectively a gold standard that is reliably consistent between projects and activities, as commonplace as a PDF file or a TCP socket. A customer team that adopted this approach experienced a transformative clarity best described as “a kind of reverse Tower of Babel”; for the first time, there was a single way to talk about both infrastructure and product issues.

Jettisoning everything except “one unchanging compute environment” may sound like the doctor just prescribed amputation at the neck for your headache. “Not gonna work for us,” you might be saying (if you’re still reading!). “Our product ships on eleven platforms times at least four active branches per platform. The whole point of RelEng is to actually build those bits, from those branches, for those platforms. Just skipping them is not a solution.” Naturally: the One True Machine pattern offers nothing to the supremely irritating Solaris 8 patch stream your biggest customer has bribed you to keep on life support since 1995. Nor does it do anything to relieve the burden of porting critical infrastructure to a shaky new platform when the product takes on a new row of cells in the OS support matrix. So really, what’s the point?

The point is best illustrated with a Pareto chart, a trusty graphic long employed by quality managers from all walks of life for visualizing the impact of the lowest-hanging fruit. Make a list of all your problems and sort them into buckets by platform. What percent of the issues could you have caught on the single most popular host platform/configuration? If your picture looks anything like our customers’, the answer is “a lot,” often a clear majority. Declaring that RelEng will only regularly test on one platform, only allow developer-scheduled builds on that platform, and only run initial regressions on that platform is another way of saying that the team is optimizing its ability to attack its most common problems.
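The bucketing exercise described above can be sketched in a few lines. This is an illustrative sketch only: the platform names and issue counts are invented, the function name `pareto_by_platform` is hypothetical, and it assumes each issue has already been labeled with the platform on which it could have been caught.

```python
from collections import Counter


def pareto_by_platform(issue_platforms):
    """Return (platform, count, cumulative_percent) rows, largest bucket first.

    issue_platforms is an iterable with one platform label per issue.
    """
    counts = Counter(issue_platforms)
    total = sum(counts.values())
    rows = []
    running = 0
    for platform, count in counts.most_common():
        running += count
        rows.append((platform, count, round(100.0 * running / total, 1)))
    return rows


# Invented data: one label per issue in a hypothetical defect tracker export.
issues = (
    ["linux-x64"] * 6
    + ["windows-x64"] * 2
    + ["solaris-sparc"]
    + ["aix-ppc"]
)

for platform, count, cumulative in pareto_by_platform(issues):
    print(f"{platform:15s} {count:3d} {cumulative:6.1f}%")
```

The first row answers the question in the text: its cumulative percentage is the share of issues the single most popular platform would have caught on its own.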

There’s a secondary, more subtle, but more important effect of using One True Machine: it’s so much faster, more efficient, and generally more fun to use that the rest of the organization will slowly align behind it. New features are written with testing on the OTM in mind. Cross-compilers and emulators for oddball platforms materialize. The default ‘OS:’ field in the defect tracker is set to it. Productive teams relish momentum, and the presence of an express train is reason enough to redefine goals as its destination.

At our customer, the One True Machine became the only way to build and test the product. By petrifying this part of the process, they opened up countless other possibilities that were not only easy to implement but also easy to sustain. The result was a release team that was at once nimble, scalable, and, perhaps most importantly, transparent. When the rest of the software development organization saw how easily they could request builds, schedule test runs, and get fast feedback, they were directly motivated to continue to allow RelEng to take the one great shortcut (“we’re only going to make this work on one box”) that fueled a huge, self-perpetuating productivity gain across the company.

About the Author: Usman Muzaffar is Vice President of Product Management at Electric Cloud, the leading provider of software production management solutions. He was part of the team that founded the company and served as one of the original developers on both the ElectricAccelerator and ElectricCommander products. Prior to Electric Cloud, he worked as a Software Engineer at Scriptics, Inc. and Interwoven (acquired by Autonomy) designing and developing content management, syndication, and distribution systems. He holds a BA in Molecular Biology from Northwestern University.

AgileConnection is a TechWell community.
