Originally posted by: seb_
There are two major categories of problems usually reported to technical support:
1. Ongoing problems, including problems that happen again and again and problems that can be recreated.
2. One-time occurrences where a so-called root cause analysis (RCA) is necessary.
Technically, problems from both categories can be very similar. But whether the support person has to investigate something in the past or something that will certainly happen again in the future totally changes the necessary approach. This is an important difference, because I often see things done wrong.
So what to do?
It's not that difficult. For the ongoing or recurring problems of category 1 it's all about setting a baseline and gathering the right data at the right time. You don't want the investigation to be flawed by high-volume error messages that have nothing to do with the problem. Often there are error counters that increased sometime in the past without any relation to the actual problem you have right now. False indications and misguided troubleshooting are time-consuming and in the worst case lead to wrong assumptions and therefore to wrong and maybe even harmful action plans. You certainly don't want that.
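To illustrate the baseline idea: a minimal sketch, assuming you can export the error counters of a port as simple name/value pairs at two points in time. The counter names and values here are made up for the example and don't belong to any particular product. Comparing a snapshot taken before the problem is recreated with one taken afterwards filters out everything that was already there:

```python
def counter_deltas(baseline, current):
    """Return only the counters that increased since the baseline snapshot."""
    deltas = {}
    for name, value in current.items():
        increase = value - baseline.get(name, 0)
        if increase > 0:
            deltas[name] = increase
    return deltas


# Hypothetical counter snapshots (names and numbers are illustrative only).
baseline = {"crc_err": 1204, "link_fail": 3, "enc_out": 87}
current = {"crc_err": 1204, "link_fail": 3, "enc_out": 112}

print(counter_deltas(baseline, current))  # {'enc_out': 25}
```

The 1204 old CRC errors look scary, but they didn't move while the problem was happening, so they drop out of the picture.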
But for the category 2 problems, deleting the messages from the past and clearing the current error counters would be a big problem. While this is often the first reaction of an admin or even the first suggested action of a support person, it will void any root cause analysis. If you go for a tabula rasa, what would be left to investigate? Therefore it's most important to gather as much data as possible before destroying what's needed to find out what happened. And deliberately clearing the counters is not the only way to destroy it.
But it needs to go online again! Now!
Of course there's a high priority on getting things running again. Every minute of downtime costs money and reputation. If it's absolutely clear what caused the problem: Go on! Do what needs to be done to get back into production. But if the problem is not understood and the outage could happen again, minimizing the downtime by prematurely changing things is actually not minimizing the downtime. This boomerang could hit you even harder the second time!
So the best approach is to gather data from all devices and components that are related to the problem, starting with the device reporting the problem and then the ones connected to it.
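The collection order described above (the reporting device first, then what is connected to it, then the next hop) can be sketched as a breadth-first walk. This is only an illustration: the topology map and the device names are hypothetical, and in practice you would build such a map from your own SAN documentation.

```python
from collections import deque


def collection_order(topology, reporting_device):
    """Return devices in the order to collect data from them:
    the reporting device first, then its neighbors, then theirs."""
    order = []
    seen = {reporting_device}
    queue = deque([reporting_device])
    while queue:
        device = queue.popleft()
        order.append(device)
        for neighbor in topology.get(device, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical connectivity map: device -> directly connected devices.
topology = {
    "host1": ["switch_a"],
    "switch_a": ["host1", "switch_b", "storage1"],
    "switch_b": ["switch_a", "storage2"],
    "storage1": ["switch_a"],
    "storage2": ["switch_b"],
}

print(collection_order(topology, "host1"))
# ['host1', 'switch_a', 'switch_b', 'storage1', 'storage2']
```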
But even if no action takes place, evidence will be lost over time. Many of these one-time problems recover by themselves. Nobody really did anything and still it works again. Those are the cases where I get data from today to find the reason for a problem from two weeks ago. Nobody cleared any counter or error log and still no root cause can be found anymore.
In Fibre Channel-based SANs and many other storage technologies - even with all these new developments - there are still two fundamental roles known from good ol' SCSI times: the initiator and the target. They might not be called like that anymore in the technology of your choice, but the roles are still there. The initiator is the active guy that wants something - writing, reading, knowing, changing; the target is the passive one that is just there to serve and deliver. Usually not for just one but for a lot of initiators. A target should care about itself. It should be aware of everything happening to its own components and it should report on that. But it's the initiator that is responsible for the error recovery against the target. That's the reason why you often see error messages on the host while the error log of the attached storage device is empty for that point in time. That's how it's designed. Imagine a storage subsystem serving 30 hypervisor hosts, each with 5-10 virtual machines - all with volumes on that storage. There's always "something going on". Slow-drain devices, minor physical problems, rebooting hosts, bugs in applications, "workload agglutination", problems in the SAN and much more could lead to error recovery against that storage device. It's just not its business to care about all that.
So while you don't see much in the normal error log, it could still be that the target logged something internally. Take the IBM SAN Volume Controller (SVC) for example. Towards its virtualized backend storage systems the SVC acts as an initiator. You'll find lots of information about error recovery that took place against them (if there was any). But you'll hardly find anything regarding the hosts. That's where the SVC is the target. And still it's important to gather its data - as early as possible and as much as possible. These internal logs wrap quite quickly, but if you gather them in time, chances are good that they still contain the timeframe of the problem. For SVC (and the whole Storwize family) it's usually in the livedumps (a.k.a. statesaves), so better create new ones. The other products usually have extended data collections, too.
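Why "wrapping" makes timing so important can be shown with a toy model. This is not how any particular product implements its internal logs; it's just a fixed-size buffer, with a made-up capacity of 5 entries, that silently drops the oldest entry whenever a new one arrives:

```python
from collections import deque


def wrapped_log(capacity, events):
    """Simulate a fixed-size log that overwrites its oldest entries."""
    log = deque(maxlen=capacity)
    for event in events:
        log.append(event)
    return list(log)


# Eight events arrive, but the (hypothetical) log only holds five.
events = [f"event {n}" for n in range(1, 9)]

print(wrapped_log(5, events))
# ['event 4', 'event 5', 'event 6', 'event 7', 'event 8']
```

If the problem was "event 2", it's gone, although nobody cleared anything. On a busy system the normal traffic alone is enough to push it out.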
So always keep in mind: For everything that happened in the past, gather the data before you actually do something. Or:
Before you jump, save a dump!