You're a Guardium admin. It's Monday morning. One of your stakeholders tells you the email they usually get from the Guardium audit process never arrived, or it arrived but there is no data at all in the report, or there is some data but important data is missing. What can you do?
Here's a repost of the Open Mic I did on this topic back in 2017:
This looks like one issue, but it can have any number of causes, so narrowing it down first is essential. Let's start by looking at all the steps in the data flow, every link in the chain where something could break.
- Traffic is captured by KTAP (Unix) or WFP (Windows) and matched against the Inspection Engine parameters.
- KTAP sends matched traffic to STAP, and STAP transmits the data to the collector, unless the ignore session flag is set for that traffic.
- The sniffer receives the traffic, parses it, applies policy and logs the traffic.
- The data is exported to an aggregator and imported. (Optional)
- The scheduled audit job fires and runs each task in order.
- The report results are sent to the list of receivers.
Troubleshooting KTAP, WFP and STAP
If KTAP or WFP has a problem, you won't get any TCP traffic. You might get some shmem or local traffic depending on the DBMS and IE configuration. One exception: if you are using EXIT libraries to capture traffic, KTAP is not required.
If all traffic from one host is missing, this is the place to start.
- Make sure the STAP appears in STAP Control on the collector and the status is green.
- If STAP is green, use the STAP Monitor view to run Verify on the inspection engine. If it fails, the "Run Diagnostics" link it provides might show the problem.
- If STAP is yellow, run lsmod | grep ktap on Linux (or the equivalent command on other Unix platforms) to make sure KTAP is loaded, and check guard_tap.ini for ktap_installed=1.
- If STAP is red, check that the Guardium services are installed and running on the host. Also check the Guardium logs on the host and run STAP diag.
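The two checks for a yellow STAP can be scripted once you have collected the raw text. The following is a minimal sketch in Python, assuming a Linux host. It works only on text you supply (the output of lsmod and the contents of guard_tap.ini). The function names, the module name ktap_12345 and the sample ini lines are made up for illustration and are not part of any Guardium tooling.

```python
def ktap_module_loaded(lsmod_output: str) -> bool:
    """Return True if any module name in the lsmod output contains 'ktap'."""
    for line in lsmod_output.splitlines():
        fields = line.split()
        # The module name is the first column of each lsmod row.
        if fields and "ktap" in fields[0].lower():
            return True
    return False


def ktap_installed_flag(ini_text: str) -> bool:
    """Return True if the ini text contains the line ktap_installed=1."""
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "ktap_installed":
            return value.strip() == "1"
    return False


# Hypothetical sample inputs, not taken from a real host.
sample_lsmod = (
    "Module                  Size  Used by\n"
    "ktap_12345            200704  1\n"
    "xfs                  1200000  2\n"
)
sample_ini = "some_other_setting=0\nktap_installed=1\n"

print("KTAP loaded:", ktap_module_loaded(sample_lsmod))
print("ktap_installed=1:", ktap_installed_flag(sample_ini))
```

If either check comes back False on a host where STAP is yellow, that points at KTAP rather than at the collector.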
If you usually get everything you expect from a host but traffic goes missing at random, the issue could be an overflow in the STAP buffer. Check the STAP Event Log on the collector and look for "STAP Buffer Overflow" messages. Note the timestamp of each overflow. Is there a pattern? What was happening on the DB at that time? Sometimes a heavy job was starting up. DB restarts also produce heavy traffic, because thousands of users suddenly reconnect and open new sessions. Every buffer overflow results in a small amount of lost traffic. Frequent overflows should be addressed with the DBA. You may also need to adjust your policy to ignore more sessions where the traffic is not needed to meet audit requirements.
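One quick way to look for a pattern is to count the overflows by hour of day. This is a minimal sketch in Python. It assumes you have copied the overflow timestamps out of the STAP Event Log by hand, and that they are in YYYY-MM-DD HH:MM:SS form. The timestamp format and the sample values are assumptions for illustration, not the actual log layout.

```python
from collections import Counter
from datetime import datetime


def overflows_by_hour(timestamps):
    """Count overflow events per hour of day (0-23), sorted by hour."""
    counts = Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour for ts in timestamps
    )
    return dict(sorted(counts.items()))


# Hypothetical timestamps of "STAP Buffer Overflow" messages.
sample = [
    "2017-06-05 02:01:13",
    "2017-06-06 02:03:40",
    "2017-06-07 02:00:58",
    "2017-06-07 14:22:05",
]

print(overflows_by_hour(sample))  # {2: 3, 14: 1}
```

In this made-up sample, three of the four overflows fall in the 02:00 hour, which would be a good reason to ask the DBA what runs on the DB at that time.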
Troubleshooting the Sniffer
If the sniffer goes down, you won't see any traffic from that collector at all. Everything goes through the sniffer. The Buff Usage Monitor report is your best friend here. Changes in the TID or PID column indicate a sniffer restart. Frequent restarts can cause random, sporadic lost traffic. High volumes in the Analyzer Queue or Logger Queue can trigger sniffer restarts and should be investigated too. Some load balancing might be needed.
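Spotting PID changes by eye gets tedious on a long report. Here is a minimal sketch in Python that flags them, assuming you have exported the Buff Usage Monitor rows and reduced each one to a (timestamp, PID) pair in time order. The row layout and the sample values are hypothetical, not the real export format.

```python
def find_sniffer_restarts(rows):
    """Return the timestamps of rows whose PID differs from the previous row."""
    restarts = []
    previous_pid = None
    for timestamp, pid in rows:
        # A different PID than the row before means the sniffer restarted.
        if previous_pid is not None and pid != previous_pid:
            restarts.append(timestamp)
        previous_pid = pid
    return restarts


# Hypothetical (timestamp, PID) pairs, oldest first.
sample_rows = [
    ("2017-06-05 08:00", 4321),
    ("2017-06-05 08:01", 4321),
    ("2017-06-05 08:02", 5110),
    ("2017-06-05 08:03", 5110),
    ("2017-06-05 08:04", 5342),
]

print(find_sniffer_restarts(sample_rows))
# ['2017-06-05 08:02', '2017-06-05 08:04']
```

The timestamps it returns are the ones to line up against the times your traffic went missing.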
If the sniffer is down, the Buff Usage Monitor will show all zeroes. Log in to the CLI and try "start inspection-core". If the Buff Usage Monitor has no rows at all, the monitor service is down. Run "restart stopped services" from the CLI.
If the scope of the missing traffic is oddly specific, like certain DB_USERs or certain kinds of commands, the issue is likely your policy. Use a session report that includes the Ignored Flag column to see if some sessions you wanted to capture are being ignored.
If you make changes to policy, always clone it first so you can easily roll back. Test outside production whenever feasible. To prove whether policy is involved in your issue, briefly install the Allow All policy that ships with Guardium. It has no rules at all and should capture everything. Have the DBA standing by to run the exact traffic you are missing. Give the system two minutes to process, then reinstall your normal policy. If you leave the Allow All policy in place for hours, there is a real chance the collector could fill up.
Troubleshooting Reports and Audits
If the policy is not at fault, conditions on your reports might be filtering out data that was actually logged by Guardium. You might see the traffic in one report but not in another. Clone your report and remove as many conditions as necessary until you see the traffic you want.
It's usually worthwhile to check the collector directly if traffic is missing from an audit or report that ran on an aggregator. The Aggregation/Archive log is the best tool for chasing these issues. Don't forget you can drill down on this report to see debug-level information.
An odd issue can occur if you restore old data without including the first day of that month. Archives are incremental. Only the first day of each month contains a full set of static tables like GDM_ACCESS. Without those tables your reports will not be able to show the data. So if you need data from July 4th, restore July 1, July 2, July 3 and July 4 to be sure you have it all.
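The July example generalizes to any date: restore every day from the 1st of the month through the day you need. This is a small Python sketch of that arithmetic. The function name is made up and the year 2017 is only for illustration.

```python
from datetime import date, timedelta


def days_to_restore(target: date):
    """Return every day from the 1st of the target's month through the target."""
    first = target.replace(day=1)
    span = (target - first).days
    return [first + timedelta(days=n) for n in range(span + 1)]


for day in days_to_restore(date(2017, 7, 4)):
    print(day.isoformat())
# 2017-07-01
# 2017-07-02
# 2017-07-03
# 2017-07-04
```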
Some of these cases will need to go to IBM Support, but any troubleshooting you do in advance will greatly speed resolution of the issue.