IBM Verify

Join this online user group to communicate across Security product users and IBM experts by sharing advice and best practices with peers and staying up to date regarding product enhancements.

wga_notifications and its watchdogs

  • 1.  wga_notifications and its watchdogs

    Posted Wed September 25, 2019 01:13 PM
    Hi community,

    We are running ISAM9 in a Docker environment, and we notice that the wga_notifications process seems to take quite a bit of CPU (sometimes up to 40%).

    It seems to me that the process is responsible for monitoring the appliance's health, but I'm not sure whether this has a place in a container, since in such an environment the healthcheck is what reports the health of the container to Docker.

    As a consequence, I am considering stopping the wga_notifications process and its watchdog.
    Has anybody seen similar issues, or does anybody have some insight into the possible consequences of stopping these processes?

    Thx in advance

    ------------------------------
    Kristof Goossens
    ------------------------------


  • 2.  RE: wga_notifications and its watchdogs

    Posted Thu September 26, 2019 03:22 AM
    Hi Kristof,

    From a recent OpenMic on ISAM monitoring (see: https://www-01.ibm.com/support/docview.wss?uid=ibm10872972) I understood that the wga_notifications process reports on disk, CPU, certificate expiration and possibly other things.
    It seems to read SNMP data and can be tuned using tuning parameters such as:
    • wga_notifications.disk.usage_warning_percentage
    • wga_notifications.disk.usage_alert_percentage
    • wga_notifications.cpu.usage_warning_percentage
    • wga_notifications.cpu.usage_alert_percentage
    • wga_notifications.cert.expiration_date_warning_days
    • wga_notifications.cert.expiration_date_alert_days
    Hopefully this helps a bit in your understanding of this process. I'm not sure whether it's a good idea to stop such a process; let's see what others think.

    Kind regards, Peter.

    ------------------------------
    Peter Volckaert
    Senior Sales Engineer
    Authentication and Access
    IBM Security
    ------------------------------



  • 3.  RE: wga_notifications and its watchdogs

    Posted Fri October 18, 2019 03:48 PM
    Hi Peter

    Would you happen to know the Appliance default values for all the wga_notifications parameters? We can set these values, but we cannot query the Appliance for the defaults.

    Thanks

    ------------------------------
    Sylvain Gilbert
    ------------------------------



  • 4.  RE: wga_notifications and its watchdogs

    Posted Mon October 21, 2019 03:01 AM

    Hi Sylvain,

    No, I do not know the defaults, and I couldn't find them in the Knowledge Center either.
    Someone else on the forum might know, of course. Meanwhile, it's a good idea to set your own thresholds.

    Kind regards, Peter.



    ------------------------------
    Peter Volckaert
    Senior Sales Engineer
    Authentication and Access
    IBM Security
    ------------------------------



  • 5.  RE: wga_notifications and its watchdogs

    Posted Mon October 21, 2019 05:02 PM
    The defaults are:

    • cert.expiration_date_alert_days = 14
    • cert.expiration_date_warning_days = 30
    • cpu.usage_alert_percentage = 90
    • cpu.usage_warning_percentage = 80
    • disk.usage_alert_percentage = 90
    • disk.usage_warning_percentage = 80
    • hvdb.usage_alert_percentage = 90
    • hvdb.usage_warning_percentage = 80
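
A minimal sketch of how a warning/alert pair such as 80/90 could be applied to a usage reading. The function name `classify_usage` is hypothetical, and the inclusive comparison (a reading equal to the threshold triggers it) is an assumption; this is an illustration, not the daemon's actual logic.

```shell
# Hypothetical helper: classify an integer usage percentage against a
# warning and an alert threshold (defaulting to the 80/90 values above).
# Assumption: a reading equal to a threshold already triggers it.
classify_usage() {
  usage="$1"
  warning="${2:-80}"
  alert="${3:-90}"
  if [ "$usage" -ge "$alert" ]; then
    echo alert
  elif [ "$usage" -ge "$warning" ]; then
    echo warning
  else
    echo ok
  fi
}

classify_usage 85        # CPU at 85% with the default thresholds
classify_usage 85 70 85  # same reading with custom thresholds
```

The second and third arguments are optional, so the same helper can be reused for the disk and hvdb pairs or for site-specific values.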

    I hope that this helps,

    Scott.

    ------------------------------
    Scott Exton
    IBM
    Gold Coast
    ------------------------------



  • 6.  RE: wga_notifications and its watchdogs

    Posted Mon October 21, 2019 09:12 PM

    Thanks Scott

    While we are on the subject, I have received questions from teammates asking why the Appliance does not send an SNMP trap (or just report to the Event Log, depending on the System Alert configured) when the CPU usage (for instance) drops back below a warning or critical level.

    Currently, the Appliance Event Log only shows occurrences where the CPU usage rises above one of the warning/critical thresholds. Because the return below a threshold is not reflected in the Event Log, one cannot assess how long the condition lasted.

    Although one can use the LMI/REST API to query CPU usage data points and visualize the usage pattern (trend analysis), from a pure event management standpoint, knowing when the CPU usage returns to a more normal pattern could help an external monitoring system auto-resolve incidents or, conversely, delay the automatic opening of an incident ticket if the high CPU usage only reflects a very short one-time spike.
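
The auto-resolve and delayed-open idea described above can be sketched as a small decision helper. Everything here is hypothetical: the function name `incident_action`, the epoch-second arguments, and the hold-down window are assumptions for illustration, not features of the Appliance or of any particular monitoring product.

```shell
# Hypothetical helper for an external monitoring system.
# Arguments (all epoch seconds except the last):
#   $1 time the high-usage event was raised
#   $2 time the matching clear event arrived ("" if none yet)
#   $3 current time
#   $4 hold-down window in seconds before a ticket is opened
incident_action() {
  raised_at="$1"
  cleared_at="$2"
  now="$3"
  hold_down="$4"
  if [ -z "$cleared_at" ]; then
    # Condition still active: open a ticket only once it has lasted long enough.
    if [ $((now - raised_at)) -ge "$hold_down" ]; then echo open; else echo wait; fi
  elif [ $((cleared_at - raised_at)) -lt "$hold_down" ]; then
    # Cleared inside the window: treat as a short spike, no ticket.
    echo skip
  else
    # Cleared after a ticket would have been opened: auto-resolve it.
    echo resolve
  fi
}

incident_action 1000 "" 1030 300    # raised 30 s ago, 5-minute hold-down
incident_action 1000 1500 1600 300  # cleared after 500 s
```

A longer hold-down suppresses more one-time spikes at the cost of reacting later to a sustained problem.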

    I am open to hearing from others in the field as well: what are the best practices for event management in this respect?

    Thanks



    ------------------------------
    Sylvain Gilbert
    ------------------------------



  • 7.  RE: wga_notifications and its watchdogs

    Posted Tue October 22, 2019 02:19 AM
    Sylvain,
     
    This is just a limitation with the current event framework which we are using.  Feel free to raise an RFE if you need the event framework to raise an event when the system recovers from a prior alert.
     
    Thanks.
     



    Scott A. Exton
    Senior Software Engineer
    Chief Programmer - IBM Security Access Manager

    IBM Master Inventor


    Phone: 61-7-5552-4008
    E-mail: scotte@au1.ibm.com
    L11 & L7 Seabank
    Southport, QLD 4215
    Australia






  • 8.  RE: wga_notifications and its watchdogs

    Posted Thu October 24, 2019 09:26 AM
    Edited by Sylvain Gilbert Thu October 24, 2019 09:27 AM
    Hello Scott

    Yesterday I deleted the RFE that I had opened earlier this week.

    The Appliance DOES record the expected behavior in its Event Log, or send the corresponding SNMP traps, as can be observed in this event trail from our back-end monitoring system:

    Major someserver 2019-10-23 07:49:32.0 [ISS]LogData: WGAWA0643E High CPU utilization: 100% (CPUUtilizationState)[name=system,priority=high] logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_CPUUtilizationState IS-AAM IM317009 N

    Minor someserver 2019-10-23 07:48:32.0 2019-10-23 08:05:11.0 [ISS]LogData: WGAWA0043W High CPU utilization: 81% (WGAWA0043W)[name=system,priority=medium] logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_WGAWA0043W IS-AAM N

    Minor someserver 2019-10-23 07:25:08.0 2019-10-23 07:26:11.0 [ISS]LogData: WGAWA0043W High CPU utilization: 85% (WGAWA0043W)[name=system,priority=medium] logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_WGAWA0043W IS-AAM Y

    Minor someserver 2019-10-23 07:16:04.0 2019-10-23 07:17:11.0 [ISS]LogData: WGAWA0043W High CPU utilization: 88% (WGAWA0043W)[name=system,priority=medium] logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_WGAWA0043W IS-AAM Y

    Minor someserver 2019-10-23 07:01:08.0 2019-10-23 07:01:12.0 [ISS]LogData: WGAWA0043W High CPU utilization: 89% (WGAWA0043W)[name=system,priority=medium] logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_WGAWA0043W IS-AAM N
    Major someserver 2019-10-23 06:47:24.0 2019-10-23 07:26:08.0 [ISS]LogData: WGAWA0643E High CPU utilization: 90% (CPUUtilizationState)[name=system,priority=high] logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_CPUUtilizationState IS-AAM Y

    Indeterminate someserver 2019-10-23 06:43:43.0 2019-10-23 06:43:47.0 [ISS]LogData: WGAWA0650I The CPU utilization has fallen below the configured threshold: 79% (CPUUtilizationState)[name=system,priority=low]***Clear*** logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_CPUUtilizationState IS-AAM N

    Minor someserver 2019-10-23 06:41:43.0 2019-10-23 06:50:11.0 [ISS]LogData: WGAWA0043W High CPU utilization: 86% (WGAWA0043W)[name=system,priority=medium] logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_WGAWA0043W IS-AAM Y

    Indeterminate someserver 2019-10-23 06:27:57.0 2019-10-23 06:30:01.0 [ISS]LogData: WGAWA0650I The CPU utilization has fallen below the configured threshold: 76% (CPUUtilizationState)[name=system,priority=low]***Clear*** logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_CPUUtilizationState IS-AAM Y

    Minor someserver 2019-10-23 06:20:13.0 2019-10-23 06:31:11.0 [ISS]LogData: WGAWA0043W High CPU utilization: 82% (WGAWA0043W)[name=system,priority=medium] logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_WGAWA0043W IS-AAM Y

    Indeterminate someserver 2019-10-23 06:06:46.0 2019-10-23 06:11:31.0 [ISS]LogData: WGAWA0650I The CPU utilization has fallen below the configured threshold: 75% (CPUUtilizationState)[name=system,priority=low]***Clear*** logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_CPUUtilizationState IS-AAM Y

    Minor someserver 2019-10-23 06:05:46.0 2019-10-23 06:11:11.0 [ISS]LogData: WGAWA0043W High CPU utilization: 80% (WGAWA0043W)[name=system,priority=medium] logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_WGAWA0043W IS-AAM Y

    Major someserver 2019-10-22 19:09:21.0 2019-10-22 19:21:04.0 [ISS]LogData: WGAWA0643E High CPU utilization: 92% (CPUUtilizationState)[name=system,priority=high] logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_CPUUtilizationState IS-AAM Y

    Indeterminate someserver 2019-10-22 19:04:59.0 2019-10-22 19:05:05.0 [ISS]LogData: WGAWA0643E High CPU utilization: 93% (CPUUtilizationState)[name=system,priority=high] logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_CPUUtilizationState IS-AAM N

    Indeterminate someserver 2019-10-22 18:49:00.0 2019-10-22 18:49:04.0 [ISS]LogData: WGAWA0650I The CPU utilization has fallen below the configured threshold: 72% (CPUUtilizationState)[name=system,priority=low]***Clear*** logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_CPUUtilizationState IS-AAM N

    Indeterminate someserver 2019-10-22 18:41:30.0 2019-10-22 18:44:31.0 [ISS]LogData: WGAWA0643E High CPU utilization: 94% (CPUUtilizationState)[name=system,priority=high] logdatatrap SNMPTRAP-iss-ISS-MIB-logdatatrap_CPUUtilizationState IS-AAM Y
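
An event trail like the one above can be condensed into raise/clear pairs for easier reading. The sketch below is a hedged example: the function name `summarize_cpu_events` is hypothetical, the whitespace-separated layout is that of this monitoring export rather than an IBM format, and taking the first timestamp on each line as the event time is an assumption. It treats WGAWA0650I as the clear event and any other WGAWA message as a raise.

```shell
# Hypothetical helper: read event lines on stdin and print
# "<date> <time> <RAISE|CLEAR> <message id> <utilization>" per event.
summarize_cpu_events() {
  awk '
    {
      ts = ""; id = ""; pct = ""
      for (i = 1; i <= NF; i++) {
        # Assumption: the first date field and the field after it are the event time.
        if (ts == "" && $i ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/) ts = $i " " $(i + 1)
        # Message id as a standalone field, for example WGAWA0043W.
        if (id == "" && $i ~ /^WGAWA[0-9]+[EWI]$/) id = $i
        # Utilization value, for example 81%.
        if (pct == "" && $i ~ /^[0-9]+%$/) pct = $i
      }
      if (id == "") next
      kind = (id == "WGAWA0650I") ? "CLEAR" : "RAISE"
      print ts, kind, id, pct
    }'
}

# Usage: summarize_cpu_events < events.txt   (events.txt is a hypothetical export file)
```

Sorting the output by its first two columns gives a chronological view in which each CLEAR line marks the end of the preceding high-usage period.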

    Cheers


    ------------------------------
    Sylvain Gilbert
    ------------------------------



  • 9.  RE: wga_notifications and its watchdogs

    Posted Thu September 26, 2019 05:28 PM
    Kristof,
     
    It's interesting that the process is taking so much CPU.  Is there any chance that you could get a pstack of the wga_notifications daemon while it is consuming the CPU?  Also, in which container is it consuming the CPU (i.e. config / WRP / runtime / DSC)?
     
    Anyway, the notifications daemon monitors a number of aspects of the system.  Some of these things are runtime based and will be handled by the Docker environment (e.g. CPU / Memory / Disk / WRP health).  There are other notifications however which are configuration based and won't be handled by the Docker environment (e.g. outstanding pending changes, certificate expiry warnings).
     
    The bottom line is that the notifications daemon is not required in the runtime container (i.e. WRP, runtime profile, DSC) as the Docker health check will be adequate to monitor the health of the container.  The notifications daemon does however provide some additional benefits to the configuration container.
     
    I hope that this helps,
     
    Scott.
     
     



    Scott A. Exton
    Senior Software Engineer
    Chief Programmer - IBM Security Access Manager







  • 10.  RE: wga_notifications and its watchdogs

    Posted Fri September 27, 2019 02:30 AM
    Hi Scott,

    Thx for your answer.
    It's not constant CPU usage, but regular peaks. We saw this in the WRP containers. We don't run an LMI container in our environments; we run it locally and export the snapshot. Locally, we don't see the wga_notifications issue.

    It could be due to the rigid segregation of our containers: we limit network access to the bare minimum where possible.

    I am not sure what you mean by "a pstack of the process", but I guess I could give you the information you want if you can elaborate on that :)

    Kind regards

    ------------------------------
    Kristof Goossens
    ------------------------------



  • 11.  RE: wga_notifications and its watchdogs

    Posted Fri September 27, 2019 02:56 AM
    Kristof,
     
    If you run 'pstack <pid>' (substituting <pid> with the process identifier for the wga_notifications process) it will provide you with a stack trace which shows what the various threads are doing within this process.  This could help narrow down the cause for the CPU spike.  Having said this, you should be OK to simply kill the notifications daemon in the WRP container (I've actually just delivered a change in our current development stream so that we no longer start the notifications daemon in the WRP or runtime containers).
     
    Thanks.
     



    Scott A. Exton
    Senior Software Engineer
    Chief Programmer - IBM Security Access Manager







  • 12.  RE: wga_notifications and its watchdogs

    Posted Fri September 27, 2019 04:07 AM
    Hi Scott,

    I did try to issue "pstack $pid", but as it doesn't return any output, I figured I was doing something wrong, or maybe you meant something else :)

    [isam@5e16f511dbeb /]$ ps -f -C wga_notifications
    UID        PID  PPID  C STIME TTY          TIME CMD
    isam      3907     1  0 09:58 ?        00:00:00 wga_notifications
    isam      3908  3907  3 09:58 ?        00:00:10 wga_notifications
    [isam@5e16f511dbeb /]$ for pid in $(ps -C wga_notifications -o pid |grep -v PID); do echo $pid; pstack $pid; done
    3907
    3908
    [isam@5e16f511dbeb /]$

    Kind regards

    ------------------------------
    Kristof Goossens
    ------------------------------



  • 13.  RE: wga_notifications and its watchdogs

    Posted Fri September 27, 2019 04:45 PM
    Kristof,
     
    Sorry, I forgot that the SETPCAP Docker capability is required in order to be able to run the pstack command.  There is no way to add this capability after the container has been started and so you probably won't be able to run this command.
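
Since the capability has to be granted when the container is created, a restart with Docker's `--cap-add` option is one way to try this. The wrapper below is only a sketch: the function name `run_with_setpcap` and the image name in the usage comment are hypothetical, any environment variables or volumes the image needs must be passed as extra arguments, and whether pstack then produces output has not been verified here.

```shell
# Hypothetical wrapper: start a container with the SETPCAP capability added.
#   $1  image name
#   $2+ any further "docker run" options the deployment needs
run_with_setpcap() {
  image="$1"
  shift
  # --cap-add is applied at container creation time; it cannot be
  # changed on a container that is already running.
  docker run --detach --cap-add SETPCAP "$@" "$image"
}

# Usage (image name and options are placeholders):
#   run_with_setpcap example/isam-image:tag --name wrp-debug
```

Keeping the extra capability on a dedicated debug container, rather than on the regular runtime containers, limits the change to the troubleshooting session.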
     
    Thanks.
     



    Scott A. Exton
    Senior Software Engineer
    Chief Programmer - IBM Security Access Manager
