Co-Author: @Jun Duan, Research Staff Member, Hybrid Cloud Services
Zombie servers are infesting your data centers and cloud accounts. These zombies won't eat your brains, but they will cost you money while wasting energy and increasing your carbon footprint. Studies show that between 25% and 30% of all servers and VMs are zombies, which may translate into more than 10 million zombie servers [*][**]. A proliferation of zombie servers also means that resources can't be applied to more critical workloads, which slows down innovation. With an increased emphasis on fighting climate change and reducing greenhouse gases, it is important that the IT industry do its part to promote sustainability.
What are Zombie Servers?
Zombie servers are difficult to identify in real-world environments. It is not a simple matter of looking at utilization. Some servers are important but are used only for lightweight user processes, such as text editing, IDEs, notebooks, and various utilities, that generate almost zero resource utilization. On the other hand, servers loaded with insignificant maintenance processes, such as virus scans or updates, often consume large amounts of CPU and memory. These servers may appear active but serve no real productive purpose.
Applying AI To the Problem
An instinctive way to deal with this issue is to define a set of "if-then" rules, e.g. "If CPU usage is lower than 10%, then it is a zombie." Unfortunately, human-crafted rules often won't work well in this situation, especially where there is lightweight use.
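To make the limitation concrete, here is a minimal Python sketch of that CPU rule. The `VmSnapshot` fields, the VM names, and the numbers are hypothetical and chosen only for illustration; they are not taken from any IBM tool.

```python
from dataclasses import dataclass


@dataclass
class VmSnapshot:
    """Hypothetical per-VM summary used only for this illustration."""
    name: str
    cpu_percent: float        # average CPU utilization over an observation window
    interactive_logins: int   # number of user logins in the same window


def is_zombie_by_rule(vm: VmSnapshot, cpu_threshold: float = 10.0) -> bool:
    """The naive rule from the text: CPU usage below the threshold means zombie."""
    return vm.cpu_percent < cpu_threshold


# A VM used daily for notebooks and text editing, with almost no CPU load.
notebook_vm = VmSnapshot("notebook-vm", cpu_percent=2.0, interactive_logins=14)
# A forgotten VM that only runs virus scans and updates.
forgotten_vm = VmSnapshot("forgotten-vm", cpu_percent=45.0, interactive_logins=0)

print(is_zombie_by_rule(notebook_vm))   # True: an actively used VM is flagged
print(is_zombie_by_rule(forgotten_vm))  # False: an unused VM is missed
```

The rule gets both machines wrong. Adding further hand-written conditions can patch individual cases, but the rule set tends to grow with every new usage pattern.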
However, thanks to recent advances in AI, we now have another weapon to attack the problem of zombie servers. AI is a natural complement to rules-based systems. It is particularly good at use cases where there are too many rules to define, where the rules are too complex to define, or where the rules cannot be well defined at all.
To empower AI to find zombies for us, two prerequisites must be met. The first is data. The second is ground truth. Enterprises are already using automation tools to manage their VMs from a central control plane. We piggyback on these tools to minimize the effort needed to acquire data. Moreover, from the control plane it is possible to track the owner of each VM, who holds the ground truth on whether a VM is a zombie or not. It is the owner who can advise an AI on what a zombie looks like.
Gradually, as the AI consumes more data and receives more advice from the owners collectively, it becomes experienced enough to identify zombies on its own.
Let's see how we bring this idea to reality in IBM Research, the research and development division for IBM.
Experiments at IBM Research
An AI infrastructure optimization solution from IBM Research adopts a very lightweight approach to collecting data from the virtual machines. A simple set of commands is used, which takes only a couple of seconds to run. These commands are native to operating systems - for both Linux and Windows. No extra agent needs to be installed. The output of the commands is sent to a central data collector, and then consumed by a prediction algorithm.
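As a rough sketch of what agentless collection could look like, the Python snippet below runs a few OS-native commands and packages their output as JSON for a collector. The article does not list the actual commands, so the command sets, the `collect_snapshot` helper, and the payload layout are all assumptions made for illustration.

```python
import json
import platform
import subprocess
from typing import Callable, Dict, List

# Hypothetical command sets; the real solution's commands are not specified.
COMMANDS: Dict[str, Dict[str, List[str]]] = {
    "Linux": {
        "logged_in_users": ["who"],
        "processes": ["ps", "-eo", "pid,user,pcpu,pmem,comm"],
        "uptime": ["uptime"],
    },
    "Windows": {
        "processes": ["tasklist"],
        "connections": ["netstat", "-an"],
    },
}


def run_command(argv: List[str]) -> str:
    """Run one native command and return its stdout, or '' if it fails."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=10)
        return result.stdout
    except (OSError, subprocess.TimeoutExpired):
        return ""


def collect_snapshot(
    hostname: str,
    os_name: str,
    runner: Callable[[List[str]], str] = run_command,
) -> str:
    """Run the commands for the given OS and return a JSON payload."""
    commands = COMMANDS.get(os_name, {})
    outputs = {label: runner(argv) for label, argv in commands.items()}
    return json.dumps({"host": hostname, "os": os_name, "outputs": outputs})


if __name__ == "__main__":
    # In a real deployment this payload would be sent to the central collector.
    payload = collect_snapshot(platform.node(), platform.system())
    print(len(payload), "bytes collected")
```

Passing the command runner as a parameter keeps the sketch testable without touching the host, and an unknown operating system simply yields an empty set of outputs.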
This AI solution has a web UI to interact with the users. Once connected to the database of virtual machine owners, the UI can send surveys and receive responses from the owners. In this way, the ground truth on which machines are zombies is collected. This information is also consumed by the prediction algorithm.
The algorithm gets two inputs: the data from the data collector and the ground truth from the web UI. Beginning with the data, the IBM AI infrastructure optimization solution extracts a rich set of features. The features cover every aspect of a virtual machine: from users' activities to network connections, from running processes to resource utilization. From the UI, the ground truth is translated into binary labels indicating whether each machine is a zombie or not. After that, the features and the labels are combined to create datasets. Finally, AI models are trained on the datasets. The models can then make suggestions, which are communicated to the machine owners through the web UI.
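The flow described above (features plus owner labels, combined into a dataset, then used to train a model) can be sketched in a few lines of Python. This is a toy illustration, not the IBM implementation: the feature names, VM data, and survey answers are invented, and a nearest-centroid classifier stands in for whatever models the solution actually uses.

```python
from typing import Dict, List, Tuple

# Hypothetical feature names; the real feature set is much richer.
FEATURE_NAMES = ["cpu_percent", "interactive_logins", "open_connections"]

# Invented features from the data collector, keyed by VM id.
features_by_vm = {
    "vm-a": {"cpu_percent": 2.0, "interactive_logins": 12, "open_connections": 8},
    "vm-b": {"cpu_percent": 40.0, "interactive_logins": 0, "open_connections": 1},
    "vm-c": {"cpu_percent": 5.0, "interactive_logins": 9, "open_connections": 6},
    "vm-d": {"cpu_percent": 35.0, "interactive_logins": 0, "open_connections": 0},
    "vm-e": {"cpu_percent": 38.0, "interactive_logins": 0, "open_connections": 1},
}
# Invented owner survey answers from the web UI; vm-e has no answer yet.
survey_answers = {"vm-a": "in use", "vm-b": "zombie", "vm-c": "in use", "vm-d": "zombie"}


def build_dataset(features, answers) -> Tuple[List[List[float]], List[int]]:
    """Join features with survey answers; label 1 means zombie, 0 means in use."""
    X, y = [], []
    for vm_id, answer in answers.items():
        if vm_id in features:
            X.append([float(features[vm_id][name]) for name in FEATURE_NAMES])
            y.append(1 if answer == "zombie" else 0)
    return X, y


def train_centroids(X, y) -> Dict[int, List[float]]:
    """Compute the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids: Dict[int, List[float]], row: List[float]) -> int:
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


X, y = build_dataset(features_by_vm, survey_answers)
model = train_centroids(X, y)
unlabeled = [float(features_by_vm["vm-e"][name]) for name in FEATURE_NAMES]
print(predict(model, unlabeled))  # 1: suggest to the owner that vm-e may be a zombie
```

Note that the unscaled features here let CPU dominate the distance, so a practical model would normalize the features and use far more data. The point is only the shape of the pipeline: join, label, train, suggest.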
This AI solution has been running in IBM Research since 2018, managing 2k+ virtual machines around the globe. It has generated roughly USD 15 million in cost savings since the initial rollout. The accuracy of our models has increased over time, with the latest-generation model achieving an F1 score of 0.96.
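For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall, which is equivalent to 2·TP / (2·TP + FP + FN). The short Python sketch below computes it from a confusion count. The counts are made up to show how a value of 0.96 can arise; they are not IBM's evaluation data.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        return 0.0  # no positives predicted or present; define F1 as 0
    return 2 * true_positives / denominator


# Illustrative counts only: 48 zombies correctly flagged, 2 active VMs wrongly
# flagged, and 2 zombies missed.
example = f1_score(true_positives=48, false_positives=2, false_negatives=2)
print(round(example, 2))  # 0.96
```

F1 is a reasonable choice for this problem because it penalizes both kinds of mistakes: flagging a machine that is still in use, and missing a machine that is truly a zombie.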
But that's just the start of the story. We are now actively working with external clients to unleash the full potential of the tool. For example, where a client has one pool of VMs on-premises, combined with a second pool of VMs on public cloud, the AI solution seamlessly interacts with both pools, because it natively supports any hybrid cloud scenario.
IBM Automation and Watson AIOps
Fundamentally, the detection and elimination of zombie servers is an exercise in applying the principles of AI and automation to an IT problem. The IBM Automation portfolio consists of a set of software products built on a common foundation for supporting event processing, data transformation, AI modelling and process mining. By leveraging these technologies, IBM Cloud Pak for Watson AIOps enables organizations to apply AI techniques without the need for data scientists.
Cloud Pak for Watson AIOps takes an application-centric view of infrastructure and helps avoid incidents proactively as well as reactively, manage governance and compliance, optimize costs, and ensure efficiency. The research around zombie servers and the IBM Research AI solution described above relates to the lifecycle management of infrastructure and services. It can help provide AI-powered actionable insights and recommendations that reduce costs and optimize the overall energy footprint of the infrastructure. By combining innovative approaches with proven hybrid cloud management and AIOps tools, IBM is looking to drive the next level of sustainable computing.
While the initial effort is focused on zombie servers, we are exploring whether the same principles can be applied to identifying unproductive usage of containers, storage or SaaS services. We have started some early engagements and are looking for more interested customers who might benefit from this technology. So reach out to your IBM representative if you wish to discuss your use case with us.
* 2015 WSJ - https://www.wsj.com/articles/zombie-servers-theyre-here-and-doing-nothing-but-burning-energy-1442197727
** 2017 Stanford / Anthesis Group Study - https://www.anthesisgroup.com/wp-content/uploads/2019/11/Comatose-Servers-Redux-2017.pdf