Originally posted by: Casey_B
PowerHA allows any script to be configured to start, stop and even monitor
an application.
This ability provides a lot of power, but can also cause a lot of problems.
Here are some of the roadblocks that I have personally seen, and
some hints on avoiding them.
Please share any problem that you have seen, or tips that you have! I hope that this is helpful.
Logging.
-
Logging is so basic that everyone does it to some degree. Here are some things to consider to make your logs more useful.
-
Don't log only failures; make sure to log progress and times as well.
-
Take, for example, this made-up script:
#!/bin/ksh
# Program: App stop script
load_database_environment
open_database_connection
stop_database_command
if [ $? -ne 0 ]; then
print "ERROR! my database won't stop"
exit 1
fi
If the stop script hangs, there is little in your log to tell you whether it was load_database_environment,
open_database_connection, or even stop_database_command that hung.
-
If you use ksh, "set -x" is a good feature to read up on.
-
Another useful ksh feature is the built-in $SECONDS variable, which holds the number of seconds since the shell started.
-
If you are looking at your application script logs, expect that you will be looking through a lot of logs.
-
Make sure that you can easily determine what your important log entries are.
-
My personal favorite method is to prefix each trace line by setting PS4 in ksh.
-
I also personally use a prefix for every log entry to show if it is an error, or a warning, or just informational.
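As a rough sketch of the logging ideas above (progress messages, elapsed time from $SECONDS, a severity prefix on every entry, and a PS4 trace prefix), something like the following could be placed at the top of a start or stop script. The function names log and run_step and the log path are made up for illustration; they are not part of PowerHA.

```shell
# Illustrative only: the log path and function names are assumptions.
LOG=${LOG:-/tmp/app_stop.log}

# Prefix for "set -x" trace lines: elapsed seconds and script line number.
PS4='+ [${SECONDS}s:${LINENO}] '

log() {
    # $1 = severity (INFO, WARN, ERROR); the rest is the message.
    sev=$1
    shift
    printf '%s %-5s [%ss] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$sev" "$SECONDS" "$*" >> "$LOG"
}

run_step() {
    # Log before and after each command so a hang points at one step.
    log INFO "starting: $*"
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        log ERROR "rc=$rc from: $*"
    else
        log INFO "finished: $*"
    fi
    return "$rc"
}
```

The made-up stop script could then call "run_step load_database_environment", "run_step open_database_connection" and so on. If the script hangs, the last "starting:" line in the log names the step that never finished. Adding "set -x" near the top turns on tracing with the PS4 prefix.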
Assume nothing.
-
High availability is the art of performing the best possible action in the worst possible scenario.
-
When you have a hardware failure, things may work differently than in normal testing.
Specific examples and considerations:
-
Storing your libraries, or executables in an NFS mounted directory may be problematic, especially if the NFS mount is not controlled by PowerHA.
-
Consider the case where your application libraries and executables are stored in an NFS mounted directory.
-
For convenience, you added the directory to root's PATH, and LIBPATH. (Through /.profile, or maybe even /etc/environment)
LIBPATH=/nfs_mounted:$LIBPATH
PATH=/nfs_mounted:$PATH
-
Now assume that the network connection to the NFS directory fails. At this point, even "ls" may appear to hang!
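One defensive option is to set the search paths explicitly at the top of the script instead of inheriting root's environment. This sketch assumes that everything the stop script itself needs lives on local disk. The directory list is only an example.

```shell
# Assumption: the commands this script needs are in local directories.
# Search only those, so a dead NFS mount in root's PATH cannot stall lookups.
PATH=/usr/bin:/usr/sbin:/bin:/sbin
export PATH

# Assumption: nothing this script runs directly needs libraries from /nfs_mounted.
unset LIBPATH
```

If the application's own commands really do need the NFS directory, it may be safer to add it back only for those specific calls.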
-
Have a secondary plan for stopping the application when the normal method fails.
-
What if all your application's executables are missing on the node?
-
Would you want PowerHA to wait until you could sort it out? Would you want to manually kill your processes?
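A secondary plan could look roughly like this: run the normal stop command with a time limit, and report failure if it does not finish so the caller can fall back to something else. The function name stop_with_fallback is hypothetical, and the right time limit depends entirely on your application.

```shell
stop_with_fallback() {
    # $1 = time limit in seconds; remaining arguments = the normal stop command.
    limit=$1
    shift
    "$@" &
    stop_pid=$!
    waited=0
    # Poll once per second until the stop command exits or the limit is reached.
    while kill -0 "$stop_pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            # The normal stop is hung: kill only that one command, by PID.
            kill -9 "$stop_pid" 2>/dev/null
            wait "$stop_pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # The stop command finished on its own: pass its exit status through.
    wait "$stop_pid"
}
```

A stop script might call "stop_with_fallback 120 stop_database_command" and, only if that returns non-zero, move on to killing the application's known processes.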
-
Scripts may not behave the same way when run automatically as they do on the command line.
-
Maybe the "ops" user id is used by PowerHA to stop applications.
-
Maybe it is also used by the application administrators for interactive login.
-
The application administrators want to see the following prompt when they log in:
$ su - ops
WARNING: You are on the production machine, please hit enter if you want to continue!!
Now imagine the following in the application stop script:
#!/bin/ksh
#Program: App stop script 2
su - ops -c /apps/bin/stop_app
-
To expand on the previous example, the operators worked out how to avoid the prompt in non-interactive mode.
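How the operators avoided the prompt is not stated. One common approach, shown here only as an assumption, is to prompt only when the session is attached to a terminal. The function name confirm_production is made up.

```shell
# Hypothetical fragment for the ops user's .profile.
confirm_production() {
    # Prompt only when both stdin and stdout are terminals, so that
    # "su - ops -c ..." run from a script is never held up waiting for input.
    if [ -t 0 ] && [ -t 1 ]; then
        echo "WARNING: You are on the production machine, please hit enter if you want to continue!!"
        read answer
    fi
    return 0
}
```

The .profile would then call confirm_production once. Note that this only helps when the stop script really runs without a terminal; a stop script started by hand from a login session would still see the prompt.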
-
Now one of the ops added the following into the ops .profile:
if [ -e /var/apps/log/something ]; then
echo "WARN: We have to do something with something."
echo "WARN: Or maybe we have new mail!!"
fi
Now consider the revised application stop script:
#!/bin/ksh
#Program: App stop script 3
result=$(su - ops -c /apps/bin/stop_app)
if [ -n "$result" ]; then
echo "ERROR: stop_app returned a message, must be an error!"
fi
This application stop script would work in testing, but fail once the ops user got some mail, or the .profile printed anything to the screen.
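A more robust variant decides success from the exit status and treats any output as information for the log. This is a sketch that assumes /apps/bin/stop_app exits non-zero when it fails; the function name run_stop is hypothetical.

```shell
run_stop() {
    # Capture the output for the log, but judge success by exit status only.
    output=$("$@" 2>&1)
    rc=$?
    if [ -n "$output" ]; then
        echo "INFO: stop command said: $output"
    fi
    if [ "$rc" -ne 0 ]; then
        echo "ERROR: stop command exited with status $rc"
    fi
    return "$rc"
}
```

In the stop script this would be called as "run_stop su - ops -c /apps/bin/stop_app". A message about new mail from the .profile then shows up as an INFO line instead of being mistaken for a failure.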
-
Leave nothing
-
Make sure that the application leaves nothing behind.
-
Even if your stop script performs well under normal test conditions,
your application may fail to stop processes, remove shared memory segments, remove IPC sockets, or unload shared libraries.
-
Know what processes are used by the application, and make sure to kill any of them left after a normal stop.
-
You can check for shared memory, and ipc sockets with "ipcs"
-
Any that belong to your application and are no longer in use can be deleted with "ipcrm".
-
Unused shared libraries can be unloaded from memory by using "slibclean"
-
slibclean is a fairly safe command to run.
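To make the cleanup check repeatable, the "ipcs" output can be filtered by owner. The sketch below assumes the AIX-style layout where each entry line carries the columns T, ID, KEY, MODE, OWNER, GROUP; check the output on your own system before relying on it. The function name and the owner name used in the usage note are hypothetical.

```shell
# Assumption: AIX-style "ipcs" entry lines with the columns
#   T  ID  KEY  MODE  OWNER  GROUP
# where T is "m" for shared memory, "s" for semaphores, "q" for message queues.
ipc_ids_for_owner() {
    # $1 = type letter, $2 = owner; reads ipcs output on stdin, prints matching IDs.
    awk -v t="$1" -v o="$2" '$1 == t && $5 == o { print $2 }'
}
```

For example, "ipcs -m | ipc_ids_for_owner m db2inst1" would list the shared memory IDs owned by a user named db2inst1. Review the list, confirm the application is really down, and only then pass each ID to "ipcrm -m".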
-
Be careful with kills
-
When you kill processes as mentioned above, make sure you never kill more than you need.
-
I wrote a stop script that killed everything running under the database user id.
While I was working together with the database administrators to return a system to working order...
I was logged in under my id, and they were working through logs under the database user id.
They said to stop the database while they looked through the logs...
Seconds later, I heard a "Heeeeyyy...I got kicked off of the system" :)
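One way to narrow a kill is to select processes by exact command name rather than by user id alone. This is only a sketch: the function name is made up, and it assumes your "ps" accepts the "-o pid= -o comm=" form to print a PID and a command name per line.

```shell
pids_with_exact_name() {
    # $1 = exact command name; reads "PID COMMAND" lines on stdin, prints the PIDs.
    awk -v name="$1" '$2 == name { print $1 }'
}
```

For example, "ps -u dbuser -o pid= -o comm= | pids_with_exact_name db2" would print only the PIDs of processes named exactly db2 that belong to a (hypothetical) user dbuser, leaving that user's login shells and editors alone.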
-
grep -w can help with preventing wrong kills
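For example:
"ps -ef | grep db2" will return lines with db2, db2das, db2prod, db2test, etc.
"ps -ef | grep -w db2" will only return lines with "db2" as a separate word.
Note that characters such as "-" and "/" count as word boundaries, so a line containing "db2-test" or "/opt/db2/bin" would still match.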
What other common problems have you seen? What other tips do you have to share about writing application scripts in PowerHA?