Managing an Agile product launch — over Christmas

The word monolithic could have been coined for the Saturn V rocket: 110 meters (363 feet) tall and filled to the brim with liquid oxygen, liquid hydrogen and kerosene.

Especially in DevOps, the word monolithic has negative connotations: unwieldy, cumbersome and difficult to change. But the Saturn V rocket, and the Apollo program of which it was part, was actually a more flexible solution than one might expect. Indeed, the Apollo program was the opposite of monolithic; it was frankly Agile, to an extent that would make modern-day space entrepreneurs such as Elon Musk and Richard Branson envious!

Apollo 8 logo

Operational agility is a good thing, but sometimes the flexibility of deploying safely whenever you want and whenever you need leads to unexpected deployments, such as the flight of Apollo 8 at Christmas 1968!

In a previous article I discussed the mission types of Apollo. The idea was that, over a period of 3 years, the astronauts would fly a variety of missions which would become gradually more and more complex, testing and validating the various parts of the spacecraft, until they were confident that a Moon landing would succeed.

In fact, there were three main pieces of hardware developed for the missions:

The Saturn V rocket itself.
The Command/Service Module (CSM) which carried the astronauts to orbit around the Moon and then safely back to Earth.
The Lunar Module (LM) which carried two astronauts down to the Moon.

Each of these was designed, developed and tested independently by different contractors. So each mission type would validate a different combination of components, in an ever more advanced configuration and mission.

Mission types of the Apollo Program (NASA)

For example, the A mission would be unmanned and test just the Saturn V with a simplified Command/Service Module (CSM), while the B mission would test only the Lunar Module.

The C mission would involve humans for the first time, but only the Saturn rocket and the CSM, not the Lunar Module (LM). The combined Saturn/CSM/LM flight would be designated the D mission.

However, in mid-1968 it became obvious that the Lunar Module was behind schedule and that no combined flight would be possible until 1969. This would put a lot of pressure on the schedule: the D, E and F missions and the lunar landing itself (the G mission) would all have to be achieved in one year.

One of the primary goals of a modern cloud-native / DevOps development environment is to avoid exactly these kinds of problems. Because each cloud-native micro-service is independent, you can deploy it on whatever schedule you like — without waiting for other teams to finish their development.

This is more or less what Apollo managers decided to do with Apollo 8. Instead of Apollo 8 being the first flight of the combined stack, Apollo 8 would be the first flight to the Moon, albeit with only two-thirds of the components.

Putting it in modern perspective, flying a mission in Earth orbit is equivalent to developing and testing your application in your private cloud (or a specific instance in a public cloud) and flying a mission to the Moon is like deploying it into a production environment (whether a different private cloud or a different public cloud instance makes no difference for the purposes of this analogy).

And, as in a modern DevOps environment, the challenge is not whether the newly developed components will function in the new environment (whether in orbit around the Moon or in a new cloud location), but whether you have the operational capabilities to support them there.

For Apollo:

Do we have the technical capability to communicate with the astronauts when they are so far away?
Do we know which calculations to make to map out their route back and forth?
Will the communications delay (about 1.3 seconds each way, so roughly 2.6 seconds to send a message to the Moon and get a response) affect our capability to support them?
With the sudden change of plans, will we have time to practice for the new mission and adapt the flight plans?
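
The communications delay in the list above can be estimated from the Earth-Moon distance and the speed of light. The following is a minimal sketch that assumes the commonly quoted average distance of 384,400 km; the real distance changes over the Moon's orbit, and the function names are purely illustrative.

```python
# Rough light-time estimate for Earth-Moon communications.
# Assumption: the average Earth-Moon distance. The real distance varies
# between roughly 363,000 km and 405,000 km over the Moon's orbit.
SPEED_OF_LIGHT_KM_S = 299_792.458  # exact, by definition of the meter
AVG_EARTH_MOON_KM = 384_400        # commonly quoted average distance


def one_way_delay_s(distance_km: float) -> float:
    """Seconds for a radio signal to travel distance_km."""
    return distance_km / SPEED_OF_LIGHT_KM_S


def round_trip_delay_s(distance_km: float) -> float:
    """Seconds to send a message and hear the reply (ignores human and equipment latency)."""
    return 2 * one_way_delay_s(distance_km)


if __name__ == "__main__":
    print(f"one way:    {one_way_delay_s(AVG_EARTH_MOON_KM):.2f} s")
    print(f"round trip: {round_trip_delay_s(AVG_EARTH_MOON_KM):.2f} s")
```

At the average distance this works out to roughly 1.3 seconds each way, which is noticeable in conversation but small compared with the time a human needs to react.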

Using an IBM Mainframe to calculate flight plans (IBM)

For modern DevOps:

Do we have the proper monitoring and observability tools deployed to the new environment? Are they connected in a standard way to our central operational tools?
Have we addressed all the security and regulatory compliance requirements for the new environment?
Is the new environment covered in all our runbooks?
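
These questions lend themselves to an automated gate that runs before the first deployment to a new environment. The sketch below is only one possible shape for such a check; the `Environment` fields, the `REQUIRED_RUNBOOKS` names and the `readiness_gaps` function are all hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Environment:
    """Hypothetical description of a deployment target (all fields are illustrative)."""
    name: str
    monitoring_connected: bool = False
    compliance_signed_off: bool = False
    runbooks: Set[str] = field(default_factory=set)


# Hypothetical list of runbooks that every environment is expected to have.
REQUIRED_RUNBOOKS = {"rollback", "incident-escalation"}


def readiness_gaps(env: Environment) -> List[str]:
    """Return the operational gaps that should block a first deployment."""
    gaps = []
    if not env.monitoring_connected:
        gaps.append("monitoring is not connected to central operations tooling")
    if not env.compliance_signed_off:
        gaps.append("security and compliance review is not signed off")
    missing = sorted(REQUIRED_RUNBOOKS - env.runbooks)
    if missing:
        gaps.append("missing runbooks: " + ", ".join(missing))
    return gaps


if __name__ == "__main__":
    prod = Environment("production", monitoring_connected=True, runbooks={"rollback"})
    for gap in readiness_gaps(prod):
        print(f"{prod.name}: {gap}")
```

An empty result means the environment is operationally ready; anything else is a list of work for the operations team rather than for the developers of the component.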

The fact of the matter is that while an Apollo mission was an event, due to the physical nature of the launch and the humans involved, a DevOps deployment should be a matter of course and not something notable. Indeed, a common good practice is to deploy as often as possible, even if there is no new component to deploy, simply to validate that the DevOps pipeline is working correctly and that the target environments are configured to accept new applications.

Once the decision was made to convert Apollo 8 from a D mission (testing all three components in Earth orbit) to a C’ mission (testing only two components, but in lunar orbit), there wasn’t much to be done for the hardware because it had been designed for this kind of mission in the first place. But in Houston, the flight controllers and NASA engineers scrambled to update their procedures and documentation to be able to support the new kind of mission.
The IBM mainframes needed to be re-programmed again and again to calculate the mission parameters:

“To [the flight controllers], Apollo 8 was the mission; it would be their greatest achievement. Living in the world of pure mathematics, they were the first generation fully at home with computers — incredibly young, dreamers and visionaries who were venturing in their imaginations and theories with the crew into the unknown, working at the very edge of our knowledge and primed to overcome any difficulties that came their way. Their work, coded into computers and plotted in piles of charts and graphs littering their consoles, was the foundation of every computer instruction in the Saturn rocket and aboard the spacecraft. The [engineers] were totally dependent on the millions of lines of code that they wrote in a variety of computer languages such as COBOL and HAL.”
— Flight Controller Gene Kranz, Failure Is Not An Option (pg 238)

All this work, and to launch during the Christmas season, no less!

Calculating Flight Plan to the Moon (NASA/JPL)

But time and tide wait for no man (or woman, as seen in the movie Hidden Figures), and due to the alignment of the Earth and Moon, the only way to complete the mission on schedule and fly to the Moon in 1968 was to fly during the Christmas season. Developers, operators and Site Reliability Engineers try to avoid working during holidays, but it is sometimes unavoidable.

One benefit modern operations teams have over their 1960s-era brethren is the added automation and AI capabilities of the computers they use as tools. Instead of relearning and reconfiguring the operational parameters for every mission (in DevOps parlance, every deployment or every new environment), we can use practices such as GitOps or Infrastructure as Code to keep monitoring and observability definitions synchronized across environments, and we can use advanced solutions such as Watson AIOps, which understand the topology and behaviour of new deployments automatically. Engineers need to learn and memorize much less for each new deployment, while the operational software brings much more insight to their fingertips.
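
To illustrate what keeping definitions synchronized means in a GitOps workflow, here is a minimal sketch that compares the alert definitions stored in Git with those found in an environment and reports any drift. The data layout and the `find_drift` function are assumptions made for illustration; this is not the API of Watson AIOps or of any GitOps tool, which typically reconcile such differences continuously rather than just report them.

```python
from typing import Dict, List

# Hypothetical alert definitions, keyed by alert name, as they might be loaded
# from files in a Git repository (the source of truth in a GitOps workflow).
Definitions = Dict[str, dict]


def find_drift(desired: Definitions, deployed: Definitions) -> List[str]:
    """Compare what Git says should exist with what an environment actually has."""
    drift = []
    for name in sorted(desired.keys() | deployed.keys()):
        if name not in deployed:
            drift.append(f"{name}: missing from environment")
        elif name not in desired:
            drift.append(f"{name}: not tracked in Git")
        elif desired[name] != deployed[name]:
            drift.append(f"{name}: differs from Git")
    return drift


if __name__ == "__main__":
    desired = {
        "high-latency": {"threshold_ms": 500},
        "error-rate": {"threshold_pct": 1},
    }
    staging = {
        "high-latency": {"threshold_ms": 800},  # edited by hand, so it drifted
    }
    for line in find_drift(desired, staging):
        print(line)
```

Running the same comparison against every environment gives a quick answer to whether a new target is monitored in the same way as the existing ones.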

Watson AIOps builds a topology and shows you changes over time (IBM)

Adding to this, AI solutions such as Watson AIOps can learn from previous deployments and runbooks and make recommendations — as if they were experienced engineers — to help engineers less familiar with the new deployments.

In the coming year I will start adding lessons from the Shuttle missions and more modern space flights to the articles I publish. January 2021 marks the 35th anniversary of the Challenger disaster. Unlike the lessons of Apollo, many of these articles will be lessons in the vein of “what not to do”.

There are still many parallels to make between the domain of space flight and the practices of DevOps, Cloud Service Management, Operations and Site Reliability Engineering. I look forward to 2021, and in the meantime I wish a safe and happy holiday season to all.

For future lessons and articles, follow me here or on Medium as Robert Barron, or as @flyingbarron on Twitter and LinkedIn.

This article was originally published on Medium.

AIOps
