Disconnected Log Collector High Availability using VRRP

By Katsuyuki Hirayama posted Sun March 26, 2023 09:35 PM

  

This blog is intended to help users who are trying to solve a specific problem. It does not replace product publications or information from support. The results were not necessarily obtained in a formally supported environment. They were collected under limited conditions over a short period of time and are not the cumulative result of multiple tests. Results may vary depending on environment differences and timing dependencies.

Introduction

Disconnected Log Collector (DLC) is software that gathers events from a set of log sources and sends them to an IBM QRadar deployment (Event Collector, All-in-One, etc.). It does not parse events; it gathers them via multiple supported protocols and sends them to an upstream node via UDP or TLS over TCP.

Unlike a QRadar Event Collector, DLC is software running on a generic Linux OS, not an appliance with a High Availability feature. DLC is an important component for log ingestion and needs to stay up and running with minimal disruption, especially when log source devices send over UDP and therefore cannot tell whether the destination is receiving the messages.

This is where VRRP can help. The Virtual Router Redundancy Protocol (VRRP) [1] is a networking protocol that automatically assigns a virtual IP address to one of the hosts participating in HA. This increases the availability of the DLC devices because the virtual IP moves to another host when priorities change or advertisements are lost.

The VRRP implementation used in this blog is keepalived [2].

Configuration

DLC configuration

Because two nodes participate in the HA configuration, there are also two DLC configurations, each with a different UUID.

A TLS over TCP log source definition listening on port 32500 is shared among the DLC instances. It works as the tunnel from each DLC to the QRadar Collector.

The actual log source definition is a forwarded log source. Its log source identifier is suffixed with the DLC UUID, so the same identifier (typically an IP address) appears as two different log source identifiers.

In the QRadar Log Source Management app, three log sources are configured: one shared log source for the TLS over TCP tunnel, one for DLC1 running on the VRRP MASTER, and another for DLC2 running on the VRRP BACKUP.

VRRP installation and configuration

keepalived is transparent to the DLC software, so no DLC configuration change is required. For log sources, the only consideration is to use the virtual IP as the destination.

keepalived has its own configuration file and also needs some Linux OS and firewall configuration.

1. First, install keepalived.

yum -y install keepalived

2. Next, configure keepalived (/etc/keepalived/keepalived.conf) as shown below:

MASTER

There are two parts in the configuration.

vrrp_script defines the monitoring scripts. chk_dlc monitors DLC service availability, and chk_syslog_port monitors whether DLC is listening on its internal syslog port.

vrrp_instance defines the virtual IP for this HA cluster and references the two vrrp_script definitions.

! Configuration File for keepalived

global_defs {
   notification_email {
   }
   enable_script_security
}

vrrp_script chk_dlc {
    script "/usr/bin/systemctl is-active dlc"
    interval 3
    timeout 3
    fall 3
    rise 2
}

vrrp_script chk_syslog_port {
    script "< /dev/tcp/127.0.0.1/1514"
    interval 3
    timeout 3
    fall 3
    rise 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens33
    virtual_router_id 51
    priority 101
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.254.55
    }
    track_script {
        chk_dlc
        chk_syslog_port
    }
}

BACKUP

Define priority 100 instead of the priority 101 above. The node with the higher value has higher priority and becomes the MASTER in the election.

Other than the priority, the configuration is the same as on the MASTER.
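
The chk_syslog_port check relies on the shell's /dev/tcp redirection. If a standalone check script is preferred, the same probe could be sketched in Python as follows. This is only a minimal sketch and not part of the tested setup: the names port_is_open and main are hypothetical, and the default 127.0.0.1:1514 simply mirrors the configuration above.

```python
import socket
import sys


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def main(argv):
    """Return 0 when the port accepts connections, 1 otherwise."""
    # Defaults mirror chk_syslog_port; both can be overridden on the command line.
    host = argv[1] if len(argv) > 1 else "127.0.0.1"
    port = int(argv[2]) if len(argv) > 2 else 1514
    return 0 if port_is_open(host, port) else 1
```

Saved as a script that ends with sys.exit(main(sys.argv)), it could be referenced from a vrrp_script block, since keepalived treats a zero exit status as success and a non-zero one as failure.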

3. keepalived needs a dedicated user to run the vrrp_script checks. Create the keepalived_script user and group:

groupadd -r keepalived_script
useradd -r -s /sbin/nologin -g keepalived_script -M keepalived_script

4. If SELinux is in enforcing mode, vrrp_script execution will be blocked. Use the setenforce 0 command to switch SELinux to permissive mode. To make the change permanent, update the /etc/sysconfig/selinux file as follows:

SELINUX=permissive

5. Your firewall configuration must also be updated to allow VRRP traffic as shown below. Otherwise the two nodes cannot see each other's advertisements and both become MASTER, which leaves the ARP tables of adjacent nodes inconsistent for the virtual IP.

firewall-cmd --add-rich-rule='rule protocol value="vrrp" accept' --permanent
firewall-cmd --reload

6. You can start keepalived just like any other service.

systemctl start keepalived

To make the service start automatically after a reboot, use the following command.

systemctl enable keepalived

Once all configuration is done, the ip a command shows the virtual IP address assigned to the MASTER node.

MASTER

[root@centos ~]# ip a
(omitted)
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:00:82:56 brd ff:ff:ff:ff:ff:ff
    inet 192.168.254.53/24 brd 192.168.254.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet 192.168.254.55/32 scope global ens33
       valid_lft forever preferred_lft forever
(omitted)

BACKUP

The inet 192.168.254.55/32 scope global ens33 line is not displayed on the BACKUP node.

If the virtual IP address is assigned to the BACKUP node, check the priority and the vrrp_script definitions in keepalived.conf.

If both nodes have the virtual IP, check the firewall configuration.

If the cause is still unclear, use tcpdump to check whether advertisements are arriving from the other node.

tcpdump -v -i ens33 host 224.0.0.18

keepalived logs are in /var/log/messages.

MASTER

Mar 18 16:35:23 centos Keepalived_vrrp[1090]: VRRP_Script(chk_dlc) failed
Mar 18 16:35:23 centos Keepalived_vrrp[1090]: VRRP_Script(chk_syslog_port) failed
Mar 18 16:35:24 centos Keepalived_vrrp[1090]: VRRP_Instance(VI_1) Now in FAULT state

BACKUP

Mar 18 16:35:24 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Transition to MASTER STATE
Mar 18 16:35:25 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Entering MASTER STATE

Fail-over scenarios

Three fail-over scenarios are tested for this blog:

  1. DLC service stop, restart after 5 minutes (MASTER → BACKUP → MASTER)
  2. DLC VM NIC disable, re-enable after 5 minutes (MASTER → BACKUP → MASTER)
  3. DLC VM MASTER restart (MASTER → BACKUP → MASTER)

To measure the period during which logs were lost, special syslog messages like the ones below are used. They are generated by a simple Python script [3].

<14>Syslog Message Test - device_time=2023-03-05 09:41:01.082521 index=0 seq=1061 [end]
<14>Syslog Message Test - device_time=2023-03-05 09:41:02.083629 index=0 seq=1062 [end]
<14>Syslog Message Test - device_time=2023-03-05 09:41:03.084509 index=0 seq=1063 [end]

A custom DSM is also defined to parse the above device time and the sequence number to monitor the lost log messages.
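
The measurement in this blog was done with the custom DSM inside QRadar. As a rough offline illustration of the same idea, the sketch below parses messages in the format shown above and reports each jump in the sequence number together with the device time delta across it. The names parse_message and find_gaps are hypothetical, and the sketch assumes the messages are supplied in arrival order from a single sender.

```python
import re
from datetime import datetime

# Matches the device_time, index and seq fields of the test message format.
LINE_RE = re.compile(
    r"device_time=(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"index=(?P<index>\d+) seq=(?P<seq>\d+)"
)


def parse_message(line):
    """Return (device_time, seq) for one test message, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    return ts, int(m.group("seq"))


def find_gaps(lines):
    """Return a list of (last_seq_before, first_seq_after, device_time_delta)."""
    gaps = []
    prev = None
    for line in lines:
        cur = parse_message(line)
        if cur is None:
            continue  # skip lines that are not test messages
        if prev is not None and cur[1] != prev[1] + 1:
            gaps.append((prev[1], cur[1], cur[0] - prev[0]))
        prev = cur
    return gaps
```

For the three sample messages above, find_gaps returns an empty list because seq 1061 to 1063 is contiguous.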

DLC service stop, restart after 5 minutes

Triggering the VRRP takeover took longer than in the other scenarios, because vrrp_script needs time to conclude that the service and the port are no longer available. You can tune this by changing the vrrp_script interval and fall parameters.
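
To get a feel for how interval and fall affect detection, the rough model below estimates how long a vrrp_script takes to be declared failed. It is a simplification under stated assumptions, not keepalived's actual scheduler: it assumes every run after the outage fails immediately and ignores script run time and the timeout setting. The function name script_detection_window is hypothetical.

```python
def script_detection_window(interval, fall):
    """Return rough (min, max) seconds until a vrrp_script is declared failed."""
    if interval <= 0 or fall < 1:
        raise ValueError("interval must be positive and fall at least 1")
    # The first failing run happens between 0 and `interval` seconds after the
    # outage starts; `fall - 1` further failing runs follow at `interval` spacing.
    return (fall - 1) * interval, fall * interval


# Values from the configuration in this blog: interval 3, fall 3.
print(script_detection_window(3, 3))
```

With interval 3 and fall 3 the model gives a window of 6 to 9 seconds. Lowering either value shortens detection at the cost of reacting to shorter glitches.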

Linux logs (/var/log/messages) extracts

MASTER
Mar 18 16:35:17 centos systemd: Stopping Disconnected log collector...
Mar 18 16:35:18 centos systemd: Stopped Disconnected log collector.
Mar 18 16:35:23 centos Keepalived_vrrp[1090]: VRRP_Script(chk_dlc) failed
Mar 18 16:35:23 centos Keepalived_vrrp[1090]: VRRP_Script(chk_syslog_port) failed
Mar 18 16:35:24 centos Keepalived_vrrp[1090]: VRRP_Instance(VI_1) Now in FAULT state

BACKUP
Mar 18 16:35:24 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Transition to MASTER STATE
Mar 18 16:35:25 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Entering MASTER STATE

MASTER
Mar 18 16:40:24 centos systemd: Starting Disconnected log collector...
Mar 18 16:40:24 centos systemd: Started Disconnected log collector.
Mar 18 16:40:30 centos Keepalived_vrrp[1090]: VRRP_Script(chk_dlc) succeeded
Mar 18 16:40:36 centos Keepalived_vrrp[1090]: VRRP_Script(chk_syslog_port) succeeded
Mar 18 16:40:36 centos Keepalived_vrrp[1090]: VRRP_Instance(VI_1) Entering BACKUP STATE

BACKUP
Mar 18 16:40:37 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Received advert with higher priority 101, ours 100
Mar 18 16:40:37 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Entering BACKUP STATE

MASTER
Mar 18 16:40:38 centos Keepalived_vrrp[1090]: VRRP_Instance(VI_1) Transition to MASTER STATE
Mar 18 16:40:39 centos Keepalived_vrrp[1090]: VRRP_Instance(VI_1) Entering MASTER STATE

DLC VM NIC disable, re-enable after 5 minutes

The BACKUP detected the MASTER's absence more quickly, because the NIC going down stops advertisements from arriving. The defined advertisement interval is advert_int 1, and the three-strike rule makes the BACKUP start its transition to MASTER after about 3 seconds.
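
The three-strike rule can be made concrete with the VRRP version 2 timers from RFC 3768, where a backup waits for Master_Down_Interval = 3 × Advertisement_Interval + Skew_Time, with Skew_Time = (256 − Priority) / 256 seconds. The sketch below assumes VRRPv2 timing applies to this setup; the function name master_down_interval is hypothetical.

```python
def master_down_interval(advert_int, priority):
    """Return the VRRPv2 (RFC 3768) Master_Down_Interval in seconds."""
    if not 1 <= priority <= 254:
        raise ValueError("priority of a backup router must be in 1..254")
    # A higher priority gives a smaller skew, so it times out slightly earlier.
    skew_time = (256 - priority) / 256
    return 3 * advert_int + skew_time


# BACKUP node values from this blog: advert_int 1, priority 100.
print(master_down_interval(1, 100))
```

For the BACKUP node here this gives about 3.6 seconds, which is the upper bound for noticing silent loss of advertisements. Other events, such as keepalived itself detecting a link-down, can trigger a transition sooner.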

Linux logs (/var/log/messages) extracts

MASTER
Mar 18 17:04:23 centos kernel: e1000: ens33 NIC Link is Down
Mar 18 17:04:24 centos Keepalived_vrrp[1090]: Kernel is reporting: interface ens33 DOWN
Mar 18 17:04:24 centos Keepalived_vrrp[1090]: VRRP_Instance(VI_1) Entering FAULT STATE
Mar 18 17:04:24 centos Keepalived_vrrp[1090]: VRRP_Instance(VI_1) Now in FAULT state

BACKUP
Mar 18 17:04:24 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Transition to MASTER STATE
Mar 18 17:04:25 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Entering MASTER STATE

MASTER
Mar 18 17:04:31 centos DLC: WARNING: Failed to check connection: java.net.SocketException: Network is unreachable (connect failed)

Mar 18 17:09:28 centos kernel: e1000: ens33 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Mar 18 17:09:28 centos Keepalived_vrrp[1090]: VRRP_Instance(VI_1) Entering BACKUP STATE

BACKUP
Mar 18 17:09:29 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Received advert with higher priority 101, ours 100
Mar 18 17:09:29 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Entering BACKUP STATE

MASTER
Mar 18 17:09:30 centos Keepalived_vrrp[1090]: VRRP_Instance(VI_1) Transition to MASTER STATE
Mar 18 17:09:31 centos Keepalived_vrrp[1090]: VRRP_Instance(VI_1) Entering MASTER STATE

DLC VM MASTER restart

Because the MASTER node was restarted to simulate the failure, some of its /var/log/messages entries are missing.

Linux logs (/var/log/messages) extracts

BACKUP
Mar 18 17:30:24 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Transition to MASTER STATE
Mar 18 17:30:25 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Entering MASTER STATE

MASTER
Mar 18 17:30:43 centos network: Bringing up interface ens33:  [  OK  ]
Mar 18 17:30:55 centos Keepalived_vrrp[1093]: VRRP_Instance(VI_1) Entering BACKUP STATE

BACKUP
Mar 18 17:30:56 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Received advert with higher priority 101, ours 100
Mar 18 17:30:56 centos2 Keepalived_vrrp[1097]: VRRP_Instance(VI_1) Entering BACKUP STATE

MASTER
Mar 18 17:30:57 centos Keepalived_vrrp[1093]: VRRP_Instance(VI_1) Transition to MASTER STATE

Results summary

| Number | Test scenario | Log lost duration (device time delta) |
|--------|---------------|---------------------------------------|
| 1 | DLC service stop, restart after 5 minutes (MASTER → BACKUP → MASTER) | 0:00:11 |
| 2 | DLC VM NIC disable, re-enable after 5 minutes (MASTER → BACKUP → MASTER) | 0:00:04 |
| 3 | DLC VM MASTER restart (MASTER → BACKUP → MASTER) | 0:00:03 |

Considerations

VRRP can provide a quick HA fail-over solution for DLC machines, but there are some considerations and differences compared to the QRadar HA feature available on the QRadar Event Collector.

Pros

  • Much faster fail-over than QRadar HA (because there is no storage sync, as described in Cons)
  • From the DLC perspective, nothing is shared between the MASTER and the BACKUP except the virtual IP
    • This means the two DLCs don't have to be at exactly the same version

Cons

  • Storage sync is not part of the solution
    • This means the BACKUP node only forwards traffic that arrives after the fail-over to the upstream Collectors. Events remaining in the MASTER's buffer stay unforwarded until the connection from the MASTER to the Collector is recovered
    • This is the flip side of the pro above: as nothing is shared, the DLC must be configured twice, even though the two configurations are very similar
  • PULL type log sources (such as LogFile, REST API, etc.) are not covered by the VRRP HA solution
    • You can still use PULL type protocols on each DLC, but once a node is down, there is no mechanism for the other node to take over the pull cursor/pointer information and resume polling from the point of failure
    • You'll need to combine this with another HA solution, such as one provided by the VM infrastructure or other HA software, to restore the MASTER DLC. Recovery will take longer, but PULL type log sources will not lose events during the DLC disruption because the logs wait on the source device until they are pulled

keepalived can have additional vrrp_script definitions, and you can write your own shell script to check whatever you think is necessary. For example, you could monitor the reachability of the upstream Collector, but I personally do not recommend it for the following reasons:

  • If the uplink is redundant (L2/L3 level recovery, QRadar HA in Collectors, etc.), the failure will be temporary and service will be restored in minutes. A DLC takeover may add confusion to that recovery and eventually lose more events
  • An upstream failure can affect both the MASTER and the BACKUP node at the same time

It is also important to note that DLC (as of 1.7.3) is not good at detecting some types of network failure [4], and VRRP is not a solution for that.

Conclusion

As DLC is not an appliance, it is not a complete solution by itself, and we need to consider other aspects such as high availability, system backup, OS monitoring, etc.

VRRP is a relatively simple high availability solution that is transparent to the DLC service. It can be stopped with the systemctl command if you need to isolate a problem and see whether VRRP is related to the DLC behavior. It is also relatively easy to maintain because the configuration file is small.

DLC is not a managed host, and its version doesn't have to match the QRadar SIEM version 1:1 (although the latest version is usually recommended). As the VRRP HA pair doesn't share anything about the DLC service, the MASTER and the BACKUP are not required to be at exactly the same DLC version. This makes the QRadar SIEM and DLC upgrade plans flexible.

Red Hat recommends that all systems run the same keepalived version [5]. However, because a node can hold the virtual IP on its own without its pair (it simply becomes the MASTER when no neighbor is present), you can upgrade the nodes one at a time with careful planning.

References

  1. Virtual Router Redundancy Protocol - Wikipedia — https://en.wikipedia.org/wiki/Virtual_Router_Redundancy_Protocol
  2. Keepalived for Linux — https://www.keepalived.org/
  3. Send UDP timestamped Syslog messages for SIEM test purpose — https://gist.github.com/khirazo/d979417011d8fcbab0af6a8135191021
  4. Disconnected Log Collector uplink failure detection — https://community.ibm.com/community/user/security/blogs/katsuyuki-hirayama1/2023/03/21/disconnected-log-collector-uplink-failure-detectio
  5. Chapter 2. Keepalived Overview Red Hat Enterprise Linux 7 | Red Hat Customer Portal — https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/load_balancer_administration/ch-keepalived-overview-vsa
