Principles of troubleshooting for Power Systems with AIX

By Jose luis Ortega posted Sun February 28, 2021 10:49 AM

  

Students attend classes for years to study computer programming and system design, but they usually receive no troubleshooting training. This skill is expected to develop with experience. That may be a valid point: most Power Systems professionals develop troubleshooting abilities as they work with the systems. But the results vary widely, from having little or no troubleshooting skill to being an expert troubleshooter.

The problem with this approach is that troubleshooting becomes more art than science, dependent on past experience with the problem at hand. Therefore, in this article, I want to discuss a general approach and concepts for troubleshooting on Power Systems with AIX.

A methodical approach

Usually, when dealing with a problem, we prefer to dive in and fix it rather than work through a methodical plan. This can be fine for simple issues. The trouble is that it is not always possible to tell which issues are simple. Sometimes simple problems turn into quite complicated ones, and difficult-looking issues turn out to be simple. Therefore, having a troubleshooting plan is an asset. Let’s talk about the common steps of troubleshooting: define the problem, understand the system, isolate and fix the problem, and confirm the solution.

Defining the problem and its nature

Witnesses to real-life crimes may misidentify suspects, exaggerate some things and forget others; the same may happen when dealing with system users. We need to give our users the same critical treatment that investigators give their witnesses. Therefore, ask about everything. For example, these are common questions: Have any changes been made recently? When did you first notice the problem? Do multiple servers experience the same problem? Also, keep in mind that asking questions helps you identify expectation problems, that is, user expectations that go beyond the scope of the machine or application.

Questions are also helpful in dealing with angry users. Also, use active listening, that is, recap what the user said in your own words and say it back to them. For instance, you could say, “Let me make sure I understand you correctly: you’re saying the process is taking four times longer than it normally takes,” and ask for confirmation.

If possible, replicate the problem and see it for yourself. Although this is not always possible, don't settle for just a few log files. Seeing the problem firsthand lets you determine its nature, because sometimes the user’s point of view is simply that the “system is not working”, and that covers the whole stack: Power Systems, applications, networking and so forth. Refinement is needed in this step.

The bottom line so far is to understand exactly what the users of the system perceive the problem to be. Don't try to troubleshoot a system without a clear description of the problem. Keep in mind that small misconceptions can end up making a huge difference. Therefore, get firsthand information, identify expectation problems and remember that problems can be hidden by other problems.

Understand the system

By now, you have gathered some information about the system during the problem definition step. A problem exists when the system gives you an unexpected response, so effective troubleshooting starts with a good understanding of the system and its components. The more information you have about the normal operation of the system, the better. As a result, it’s in our best interest to keep big-picture schematics of the different subsystems: storage, Ethernet, PowerVM and so forth. These also allow you, as the troubleshooter, to theorize about potential causes.

In addition, data-gathering tools help reduce the time spent on initial problem determination. Hence, build your skills with data-gathering tools such as snap, zsnap, perfpmr and pdump, among others. You can read about these tools on this IBM support page. Learning about the following topics will also increase your troubleshooting capacity: error logs, LVM commands, installation and backups, system dumps, diagnostics, boot problems and LED codes.

The system history

Imagine your first appointment with a doctor. One of the first things the doctor does is review your medical history. Do the same with your systems. It is rare for problems to show up without any change; usually something happened that caused the problem. Therefore, it’s important to keep a history for each system. For instance, the nmon and lpar2rrd tools help you analyze performance problems. The errpt, alog, diag and syslog facilities and the HMC user interface show the errors logged by the system. The Cfg2html and HMCScanner tools allow you to identify configuration changes.
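One way to put that history to work is to line up the recorded changes against the moment the problem was first noticed. The following is a minimal Python sketch and not part of any tool mentioned above; the `changes_before` helper, the change log entries and the seven-day window are all hypothetical, made up for illustration.

```python
from datetime import datetime, timedelta


def changes_before(problem_seen, changes, window_days=7):
    """Return the changes recorded in the window before the problem, newest first.

    changes is a list of (timestamp, description) tuples.
    """
    start = problem_seen - timedelta(days=window_days)
    recent = [c for c in changes if start <= c[0] <= problem_seen]
    return sorted(recent, key=lambda c: c[0], reverse=True)


# Hypothetical change history for one system (illustrative values only).
change_log = [
    (datetime(2021, 2, 1, 9, 0), "service pack applied"),
    (datetime(2021, 2, 20, 22, 30), "SAN zoning changed"),
    (datetime(2021, 2, 24, 14, 0), "application release deployed"),
    (datetime(2021, 2, 26, 8, 0), "new LPAR added to the frame"),
]
first_noticed = datetime(2021, 2, 25, 10, 0)

# Changes closest to the first report are listed first.
for when, what in changes_before(first_noticed, change_log):
    print(when.isoformat(sep=" "), what)
```

The newest change before the first report is only a first suspect, not a verdict; widening `window_days` is a cheap next step when nothing in the window explains the symptom.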

A common source of software problems is prerequisites that are not met or instructions that are not followed exactly. To avoid the former, use the FLRT tool to verify compatibility levels across products. Also, bear in mind that a hardware problem can be the culprit, especially when no software or workload changes have taken place. Use the errpt command, since it reports hardware problems. But keep in mind that some hardware problems may not be reported and can be undetectable except in certain settings. For example, a loose cable in a Port-Channel of a SEA can cause dropped packets.

Finally, if the above checks pass, you could be dealing with a dormant bug. However, I've met a lot of people who call product support to report a bug as soon as they have a problem. Certainly, there are a lot of software bugs out there causing problems, but a bad procedure may well be the cause. In either case, a better route before opening a PMR is to take a break, check the APARs in AIX or VIOS HIPER Issues to see whether your problem has already been reported, and then download the fix. Also check AIX forums such as Reddit, rootvg.net and AIX to see whether anyone else has seen the problem before.

Isolate the problem: Identify the root cause

It is rare to find problems where the entire design is at fault. It’s beneficial to have the big picture at times, but a totally messed-up system is uncommon. If that happens, my recommendation is to start the implementation over. For example, you may have to recreate a PowerVM implementation because of storage design problems. Most of the time, problems are limited to a few components. Here, use the schematics of your environment to isolate the problem, and half-split each subsystem until you find the fault: test at the midpoint, keep the half that still shows the failure, and repeat.
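Half-splitting is essentially a binary search over an ordered chain of components. Here is a minimal Python sketch under stated assumptions: there is a single fault, everything upstream of it tests healthy and everything from it onward fails, and `works_up_to` is a hypothetical probe that you would replace with a real check. The component names in `chain` are illustrative only.

```python
def half_split(components, works_up_to):
    """Return the first faulty component in an ordered chain, or None.

    components: list ordered along the data path.
    works_up_to(i): hypothetical probe, True when the chain is healthy
    through components[i].
    """
    if not components or works_up_to(len(components) - 1):
        return None  # the whole chain tests healthy, nothing to isolate
    lo, hi = 0, len(components) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if works_up_to(mid):
            lo = mid + 1  # the fault lies after the midpoint
        else:
            hi = mid      # the fault is at or before the midpoint
    return components[lo]


# Illustrative chain, from the application down to the network.
chain = ["application", "AIX LPAR", "virtual adapter", "VIOS",
         "SEA", "physical adapter", "switch port"]


def probe(i):
    # Stand-in for a real test: pretend everything before the SEA works.
    return i < chain.index("SEA")


print(half_split(chain, probe))
```

With seven components the search needs only a few probes instead of checking every component in turn, which matters when each probe is an expensive manual test.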

When a problem occurs on more than one system, look for differences and similarities between the systems. For performance problems, it might be difficult to simulate the production workload in a test environment, even when the systems are identical. Still, isolating the failing part is a good strategy, since a problem is much easier to fix once its source has been found.
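Comparing a healthy system with a failing one can be sketched as a simple dictionary comparison. This Python example is only an illustration: the attribute names and values are hypothetical, and in practice you would fill the dictionaries from your own configuration reports.

```python
def compare_systems(good, bad):
    """Split configuration attributes into those that match and those that differ."""
    keys = sorted(set(good) | set(bad))
    same = {k: good[k] for k in keys
            if k in good and k in bad and good[k] == bad[k]}
    # Attributes missing on one side show up as None in the pair.
    different = {k: (good.get(k), bad.get(k)) for k in keys
                 if good.get(k) != bad.get(k)}
    return same, different


# Hypothetical attributes collected from two systems (illustrative values only).
healthy = {"os_level": "level A", "mtu": 9000, "queue_depth": 20}
failing = {"os_level": "level A", "mtu": 1500, "queue_depth": 20}

same, different = compare_systems(healthy, failing)
print("same:", same)
print("different:", different)
```

The differences give you suspects to test; the similarities tell you what you can probably rule out.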

Fix the problem

Fixing the problem is usually easy once you’ve discovered its root cause. Sometimes you just need to select the right procedure. Avoid “reinventing the wheel”, so to speak. For example, you can find the service procedures for common problems in the Troubleshooting AIX and Troubleshooting and support for Power pages. On the other hand, there will be situations where you need to come up with your own solution, or where you need to call IBM support and send all the collected data.

In any case, the troubleshooting effort should be planned. Define the intended results and the metrics to be measured before the fix begins, document all efforts, learn what caused the problem, and confirm that the solution worked. Plan to keep the failure from happening again.

In addition, it’s better to prevent problems than to deal with them. A lack of attention to configuration best practices needlessly adds considerable risk. Following the Service and support best practices for Power Systems is a great way to start adopting them. For instance, having planned Service Pack updates will avoid many hitches that you might otherwise encounter in your environment.

Confirm the solution

We usually take this step for granted, but we need to confirm that the solution worked. This lets us know that we didn’t fall into a false-fix trap, which would make our efforts unproductive. Troubleshooting is a very important skill because others judge us by how quickly and reliably we fix problems, but it’s better to be cautious when a quick and simple fix presents itself, because it could be a false fix.
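One guard against a false fix is to decide on the target before the fix and then require several measurements afterward rather than a single good run. The Python sketch below assumes a metric where lower is better, such as a job runtime in minutes; the `fix_confirmed` helper, the sample values and the thresholds are hypothetical.

```python
from statistics import median


def fix_confirmed(baseline, after_fix, target, min_samples=5):
    """Return True only when repeated post-fix measurements meet the target.

    Lower values are assumed to be better (for example, runtime in minutes).
    """
    if len(after_fix) < min_samples:
        return False  # one good run is not enough evidence
    typical_ok = median(after_fix) <= target
    # Even the worst post-fix run must beat the typical pre-fix run.
    worst_ok = max(after_fix) < median(baseline)
    return typical_ok and worst_ok


# Hypothetical runtimes in minutes (illustrative values only).
before = [40, 42, 41, 39, 43]
after = [11, 10, 12, 10, 11]

print(fix_confirmed(before, after, target=12))
print(fix_confirmed(before, after[:1], target=12))
```

The second call returns False even though the single run looks good, which is exactly the caution this step asks for.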
