IBM QRadar

Disconnected Log Collector uplink failure detection

By Katsuyuki Hirayama posted Wed March 22, 2023 02:19 AM

This blog is intended to help users who are trying to solve a specific problem. It does not replace product publications or information from support. The results were not necessarily obtained in a formally supported environment. They were gathered under limited conditions over a short period of time and are not the cumulative result of multiple tests. Results may vary depending on environment differences and timing dependencies.

Introduction
How to tune the Linux OS
Uplink failure detection time differences
When QRadar HA is used
Conclusion
References

Introduction

Disconnected Log Collector (DLC) is software that gathers events from a set of log sources and sends them to an IBM QRadar deployment (Event Collector, All-in-One, etc.). It does not parse events; it gathers them via multiple supported protocols and sends them to an upstream node via UDP or TLS over TCP.

When TLS over TCP is chosen, DLC buffers incoming events while it is disconnected from the uplink node and sends them when the connection is restored. The buffer capacity is configurable and is limited by the available disk space.

But this only happens when DLC knows that the connection is disrupted.

At the moment, the TLS over TCP protocol seems to rely on the TCP layer to detect uplink failure. If the uplink node, such as an Event Collector, can send an RST packet back to the DLC node, all is well: DLC starts buffering events until the uplink comes back online. But what happens if the counterpart suddenly becomes unresponsive?

This situation is typically seen with layer 2 problems such as an unplugged Ethernet cable or an L2 switch issue, or with a firewall misconfiguration. In such scenarios, nothing notifies the DLC node of what is going on.

The TCP layer is, of course, designed to recover from this type of disruption. It retransmits the packet until it gets an ACK from the counterpart again. This works perfectly if the failure is resolved within a short period of time. But TCP retransmission eventually times out. Once the retry count is exhausted and the TCP layer concludes that the counterpart is no longer available, the DLC application finally learns that the uplink is no longer working.

DLC then starts to buffer events, but only after the TCP timeout. The TCP layer takes minutes to finally give up retransmitting, and minutes' worth of lost logs is not trivial in some cases.

Note: If the EPS into the DLC is high enough, the TCP send buffer will fill up and DLC may notice the failure before the TCP timeout (not tested).

How to tune the Linux OS

TCP timeout is a problem common to many TCP applications, not just DLC, so there is a way to control it.

The useful parameter to tune is net.ipv4.tcp_retries2. The Linux documentation [1] explains that "this value influences the timeout of an alive TCP connection, when RTO retransmissions remain unacknowledged. Given a value of N, a hypothetical TCP connection following exponential backoff with an initial RTO of TCP_RTO_MIN would retransmit N times before killing the connection at the (N+1)th RTO."
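
To get a feel for what the quoted formula means in seconds, here is a minimal sketch of the arithmetic. It assumes the usual Linux kernel constants TCP_RTO_MIN = 200 ms and TCP_RTO_MAX = 120 s, which are not stated in this article, and the function name is made up for illustration. With these assumptions the default of 15 works out to about 924.6 seconds, which is the figure the kernel documentation quotes as a lower bound for the effective timeout.

```python
# Assumed Linux kernel defaults (not stated in the article).
TCP_RTO_MIN = 0.2    # seconds
TCP_RTO_MAX = 120.0  # seconds


def hypothetical_timeout(retries2: int) -> float:
    """Seconds until the connection is killed, i.e. at the (N+1)th RTO."""
    # The RTO doubles on every retransmission (exponential backoff)
    # and is capped at TCP_RTO_MAX.
    return sum(min(TCP_RTO_MIN * 2 ** i, TCP_RTO_MAX) for i in range(retries2 + 1))


if __name__ == "__main__":
    for n in (5, 8, 9, 10, 15):
        print(f"tcp_retries2={n}: about {hypothetical_timeout(n):.1f} s")
```

Real connections start from an RTO derived from the measured round-trip time, so the timeout observed on an actual system is usually somewhat longer than this hypothetical value.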

You can check the current net.ipv4.tcp_retries2 setting as follows:

# sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15

To change it, assign a value to the parameter like this:

# sysctl net.ipv4.tcp_retries2=9

The change is applied immediately.

To make the change permanent, add the following line to the /etc/sysctl.conf file:

net.ipv4.tcp_retries2 = 9

Note: The parameter is system-wide. Any other services running on the DLC machine are affected as well. Even if the machine is dedicated to DLC, fundamental services such as SSH are also affected.
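
After editing /etc/sysctl.conf it can be handy to confirm that the persisted value and the running value agree. The following is only a sketch: the function names are hypothetical, it reads just /etc/sysctl.conf (files under /etc/sysctl.d/ may also set the parameter on some systems), and it relies on the standard procfs path /proc/sys/net/ipv4/tcp_retries2 for the running value.

```python
from pathlib import Path
from typing import Optional

KEY = "net.ipv4.tcp_retries2"


def persisted_value(conf_text: str, key: str = KEY) -> Optional[int]:
    """Return the last value assigned to key in sysctl.conf-style text, or None."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not line or line[0] in "#;":
            continue
        name, sep, rest = line.partition("=")
        if sep and name.strip() == key:
            value = int(rest.strip())
    return value


def running_value(proc_path: str = "/proc/sys/net/ipv4/tcp_retries2") -> int:
    """Read the value currently in effect from procfs (Linux only)."""
    return int(Path(proc_path).read_text().strip())


if __name__ == "__main__":
    try:
        saved = persisted_value(Path("/etc/sysctl.conf").read_text())
        live = running_value()
    except OSError as exc:
        print(f"could not read settings: {exc}")
    else:
        print(f"running={live} persisted={saved}")
        if saved != live:
            print("The running value differs from /etc/sysctl.conf.")
```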

Uplink failure detection time differences

To measure the period of lost logs, special syslog messages like the ones below are used. They are generated by a simple Python script [2].

<14>Syslog Message Test - device_time=2023-03-05 09:41:01.082521 index=0 seq=1061 [end]
<14>Syslog Message Test - device_time=2023-03-05 09:41:02.083629 index=0 seq=1062 [end]
<14>Syslog Message Test - device_time=2023-03-05 09:41:03.084509 index=0 seq=1063 [end]

A custom DSM is also defined to parse the device time and the sequence number above, in order to monitor for lost log messages.
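
The same idea can be sketched outside QRadar. The snippet below is not the custom DSM used in the tests; it is a hypothetical Python helper that extracts device_time and seq from the message layout shown above and reports every jump in the sequence number together with the device time delta across it. It assumes a single generator (one index value) and messages arriving in order.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Matches the test message layout shown above.
PATTERN = re.compile(
    r"device_time=(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) index=\d+ seq=(\d+)"
)


def find_gaps(lines: Iterable[str]) -> List[Tuple[int, int, timedelta]]:
    """Return (last_seq_before_gap, first_seq_after_gap, device_time_delta) per gap."""
    gaps = []
    prev = None  # (seq, device_time) of the previous parsed message
    for line in lines:
        m = PATTERN.search(line)
        if m is None:
            continue  # not a test message
        when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        seq = int(m.group(2))
        if prev is not None and seq != prev[0] + 1:
            gaps.append((prev[0], seq, when - prev[1]))
        prev = (seq, when)
    return gaps


if __name__ == "__main__":
    received = [
        "<14>Syslog Message Test - device_time=2023-03-05 09:41:01.082521 index=0 seq=1061 [end]",
        "<14>Syslog Message Test - device_time=2023-03-05 09:41:02.083629 index=0 seq=1062 [end]",
        # Hypothetical message after an outage; not taken from the test results.
        "<14>Syslog Message Test - device_time=2023-03-05 09:44:30.000000 index=0 seq=1270 [end]",
    ]
    for before, after, lost in find_gaps(received):
        print(f"seq {before} -> {after}: {after - before - 1} messages missing over {lost}")
```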

The following is the test configuration.

The steps to create the uplink disruption are as follows:

1. Confirm that the syslog messages are arriving in the QRadar Log Activity
2. Disable the NIC of the QRadar All-in-One machine from the VMware infrastructure (this also cuts off Console UI access)
3. Wait for approx. 30 minutes
4. Re-enable the NIC (Console UI access is recovered)
5. Watch the QRadar Log Activity until the incoming syslog messages appear again

The results of each test case are shown below. The tested net.ipv4.tcp_retries2 values are 5, 8, 9, 10, and 15.

Test case	Parsing disruption time (starttime delta)	Log loss duration (device time delta)	Remarks
net.ipv4.tcp_retries2=15 (Default)	0:30:53	0:11:12	Default setting causes a longer log loss duration
net.ipv4.tcp_retries2=10	0:30:51	0:06:01
net.ipv4.tcp_retries2=9	0:30:19	0:03:27
net.ipv4.tcp_retries2=8	0:30:42	0:01:44
net.ipv4.tcp_retries2=5	0:30:42	0:00:26	May be too short to ride out a temporary network disruption while the network converges

With the default net.ipv4.tcp_retries2=15 setting, it takes more than 10 minutes for DLC to notice the uplink failure.

A shorter timeout is not always better, especially when your network takes a long time to converge after a disruption. If the network recovers from the disruption before the timeout, no logs are lost. But a short timeout could be the worst bet if it expires before the network recovers.

If you would rather rely on the DLC buffer than pray for TCP recovery, setting a smaller value could be a meaningful option.
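
One way to reason about "smaller" is to start from the log loss you can tolerate and work backwards. The sketch below is an illustration only: the function name is hypothetical, and it again assumes the Linux constants TCP_RTO_MIN = 200 ms and TCP_RTO_MAX = 120 s, which are not stated in this article. It returns the largest tcp_retries2 whose hypothetical timeout stays within a given budget in seconds.

```python
from typing import Optional

# Assumed Linux kernel defaults; real RTOs depend on the measured round-trip time.
TCP_RTO_MIN = 0.2    # seconds
TCP_RTO_MAX = 120.0  # seconds


def largest_retries_within(budget_seconds: float, max_retries: int = 15) -> Optional[int]:
    """Largest tcp_retries2 whose hypothetical timeout stays within the budget."""
    best = None
    elapsed = 0.0
    for n in range(max_retries + 1):
        # The connection is killed at the (n+1)th RTO, so add the RTO for step n.
        elapsed += min(TCP_RTO_MIN * 2 ** n, TCP_RTO_MAX)
        if elapsed > budget_seconds:
            break
        best = n
    return best


if __name__ == "__main__":
    for budget in (60, 120, 240, 600):
        print(f"budget {budget} s -> tcp_retries2={largest_retries_within(budget)}")
```

The measured log loss durations in the table above are somewhat longer than these hypothetical timeouts, so leave some margin and confirm the result with a real test.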

A reasonable middle ground would be around net.ipv4.tcp_retries2=9.

When QRadar HA is used

If the uplink node is in a QRadar HA configuration, the timeout behavior is quite different.

The following is the test configuration.

The steps to create the uplink disruption are as follows:

1. Confirm that the syslog messages are arriving in the QRadar Log Activity
2. Disable the NIC of the QRadar All-in-One PRIMARY machine from the VMware infrastructure
3. Wait for the QRadar HA SECONDARY to take over the VIP and for Console UI access to be restored
4. Watch the QRadar Log Activity (on SECONDARY) until the incoming syslog messages appear again

The results of each test case are shown below. The tested net.ipv4.tcp_retries2 values are 8 and 9.

Test case	Parsing disruption time (starttime delta)	Log loss duration (device time delta)	Remarks
net.ipv4.tcp_retries2=9	0:12:31	0:12:31	Tried 3 times in total with almost the same results
net.ipv4.tcp_retries2=8	0:12:08	0:03:59

In the net.ipv4.tcp_retries2=9 case, the log loss duration became much longer than in the previous single uplink node scenario. Comparing the tcpdump captures shows that the HA secondary returned an RST packet in response to the very last retransmission of the previous TCP session, so no timeout occurred.

The net.ipv4.tcp_retries2=8 case timed out before an RST arrived from the HA secondary.

QRadar All-in-One is probably the slowest type of node to complete an HA takeover, while Event Collector is usually the fastest. HA takeover time may also vary depending on the size and load of each QRadar deployment. Measuring the actual TCP timeout in each environment before deciding on the net.ipv4.tcp_retries2 value is therefore recommended.

Conclusion

As of v1.7.3, DLC seems to rely on the TCP retransmission timeout to detect uplink failure.

Unless you are sure that your network failures can always be resolved within a defined time frame, or you are fine with losing log data during longer network failures, you will need to tune the TCP retransmission timeout.

Because no single value suits all DLC scenarios, you will need to test in your own QRadar environment to decide the best timeout setting for your DLC deployment.

References

[1] linux/ip-sysctl.txt at v4.15 · torvalds/linux: https://github.com/torvalds/linux/blob/v4.15/Documentation/networking/ip-sysctl.txt
[2] Send UDP timestamped Syslog messages for SIEM test purpose: https://gist.github.com/khirazo/d979417011d8fcbab0af6a8135191021
