It’s 9:00 PM on a Friday night. You get a call that applications are running slow and your Ceph cluster is posting strange, cryptic error messages. You log into the Ceph console to see what is going on. You copy the error string and paste it into your search engine. It comes back with several references in chats and a couple of related blogs, but nothing jumps out at you as the cause or the solution. You start digging through pages and pages of chats, hoping to find something. You message a couple of people who posted about similar error messages, but nobody responds at midnight on a Saturday morning.
In the meantime, application performance is still in the tank. You need to do something. You know you can perform a rolling reboot of the cluster nodes. A couple of hours later the cluster has been rebooted, you have a new set of error messages on top of the original ones, and application performance is still impacted. You search for the new error messages and still find nothing that points to a solution.
You reread the chats, looking at the suggestions people posted. You start trying the ones that seem to make sense. You get some new messages, you search, you try, and still nothing helps. Then you get an OSD failure message, then another. Ceph starts to rebuild the data replicas, which impacts performance further. And as OSDs fail, you become concerned about whether you have enough spare capacity for the system to completely rebuild the replicas.
What was a performance problem is now a critical event putting your data at risk. Of course, management and business owners must be notified that system integrity, as well as the data, is at risk. A complete rebuild and restore will take days, and the applications will be down the whole time.
If this sounds far-fetched, it isn’t: something very similar happened to one of my customers. I have left the details vague to protect the unlucky.
Imagine an alternative. You contact IBM support and are directed to run some diagnostic commands. A video call is set up with a couple of the development engineers who work on the code to further diagnose the issue. A remedy is defined, OSDs don’t fail, data is never at risk, application impact is minimized, and you get some sleep over the weekend.
Why you should pay for IBM Storage Ceph.
Paying for support on open-source software is like buying insurance, but not like your homeowner’s insurance. The IT equivalent of homeowner’s insurance is an investment in disaster recovery infrastructure: you build it, but you will likely never need it. Paying for support on Ceph is more like an extended warranty package for the appliances in your home. “Eventually, something is going to break.”
Maybe you are never going to face a cascade of OSD failures that could lead to data loss. But you can get a Ceph roadmap update with a discussion of how the changes affect your specific use case. You can ask about the effects of a tuning parameter and learn that it is affected by other settings and will in turn affect yet more parameters. (Tuning Ceph is like tuning a Formula 1 race car. It is about the proper balance of many interrelated components.) Or you are onboarding new workloads, which requires the purchase of additional nodes. Having an expert review your hardware configs before you make the purchase, and before you wait weeks for delivery so you can begin testing, gives you confidence and may prevent a bad capital investment.
These are all nice-to-haves. You can probably live without them; many customers do. But let’s look at the real value of support for Ceph. It is 9:00 on a Friday night and … What is the value of opening a support ticket? I poked around the web looking for cost-of-downtime estimates. The Ponemon report from 2016 seems to be the most recent widely referenced study. The cost of downtime per minute ranged from $926 to $17,244, with an average of about $9,000. Minutes are hard to comprehend; they go by fast when things are going wrong. At the average rate, that is roughly half a million dollars per hour, or about $12 million to $13 million per day, and that assumes you can save your data. (The cost of losing data will be significantly higher and can expose you to legal, reputational, and ultimately even existential risk.)
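To make the per-minute figures easier to relate to, here is a minimal sketch of the arithmetic in Python. It uses only the three per-minute numbers quoted above; the function and variable names are my own, chosen for illustration, and the results are simple multiplications rather than anything taken from the report itself.

```python
# Per-minute downtime costs quoted in the text (2016 Ponemon figures), in US dollars.
# The "average" entry uses the rounded value of about $9,000 from the text.
COST_PER_MINUTE = {"low": 926, "average": 9_000, "high": 17_244}

MINUTES_PER_HOUR = 60
MINUTES_PER_DAY = 24 * MINUTES_PER_HOUR


def downtime_cost(cost_per_minute: float, minutes: float) -> float:
    """Return the total cost of an outage lasting `minutes` at a flat per-minute rate."""
    return cost_per_minute * minutes


if __name__ == "__main__":
    for label, rate in COST_PER_MINUTE.items():
        per_hour = downtime_cost(rate, MINUTES_PER_HOUR)
        per_day = downtime_cost(rate, MINUTES_PER_DAY)
        print(f"{label:>7}: ${per_hour:,.0f} per hour, ${per_day:,.0f} per day")
```

At the rounded average of $9,000 per minute this works out to $540,000 per hour and $12,960,000 per day. Rounding the hourly figure down to half a million first gives the $12M-per-day ballpark.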
Not all issues are hard outages. Some don’t even affect the business. Maybe the value is having IBM support research an issue while you get some sleep on Saturday morning. Most likely, saving you one day of full production, or a few days of business-impacting performance issues, is all it would take to justify the cost of your Ceph insurance policy.
I almost forgot to mention the meeting that the IT leaders at the customer above had with the line-of-business leaders after the event. I wasn’t invited, but I doubt it was pleasant. I’m not sure how you measure the value of not having to explain to execs why you didn’t pay for support.
Having a support contract provides peace of mind as you realize the business benefits of Ceph or other Red Hat / IBM supported open-source technology.
If you are not familiar with Ceph or IBM’s role in supporting the Ceph project, take a look at Marcel’s blog “Ceph or IBM Storage Ceph, what are the differences?”