By Jeff Rusk and Jenna Degaust
Disaster Recovery (DR) is a key aspect of the resiliency of a QRadar deployment. A wide variety of DR solutions are currently deployed in the field, including redundant console-only configurations, event and flow forwarding based solutions, and even full event distribution to two deployments (often termed “dual home”). These solutions vary greatly in complexity, cost, and effectiveness. For the most part, however, customers rely on significant customization, usually offered by IBM professional services, for the setup and configuration of their DR solution.
Future of QRadar Disaster Recovery
Given the importance that QRadar customers place on the resiliency of their SIEM, IBM is currently developing a number of enhancements to make DR a more standardized, supportable, and cost-effective offering. Development is actively working to improve our DR capabilities in the following ways:
- Facilitate setup
- Automate data synchronization
- Automate configuration synchronization
- Enable easier activation of DR if an incident occurs
- Provide better monitoring of both sites
At a high level, the solution is intended to use an enhanced Backup/Recovery API to transfer configuration data from the Main Site to the DR Site, and an efficient Ariel Copy mechanism to frequently move event and flow data stored in the Ariel database from any Event Processor in production to a comparable Event Processor in the DR deployment. All of this functionality will be managed by a DR App that is installed on both the production environment and the DR deployment to guide the user through setup, configuration, monitoring, failover, and failback operations. The resulting solution will reduce the level of advanced internal product knowledge required to configure DR, reduce the troubleshooting required when things change in the production environment, and leave the customer with a fully supported solution that QRadar Customer Support can assist with when required.
Overall, the intended solution follows this model:
DR Site Setup
Most DR solutions today require significant custom work, and most of that work needs the assistance of professional services to carry out. Furthermore, subsequent maintenance is likely to require scheduling these services again for further customization, at further expense. The DR app is intended to guide customers, whether or not they have contracted professional services for some of this work, through the most important steps in DR site setup. An automated workflow, a sample of which is illustrated below, will expedite both initial configuration and subsequent maintenance.
Another major hurdle for engineers setting up a QRadar DR site is mapping hosts from the production environment to the DR deployment. There is currently no tool for mapping hosts between sites, and doing so can be a very time-consuming exercise, especially when deciding where the Ariel data will be synchronized to on the DR deployment. As part of the DR app, we are developing functionality that will provide automatic host discovery, facilitate intuitive host pairing, and offer a customizable view of the information needed to make host mapping decisions.
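To make the host pairing problem concrete, here is a minimal sketch of how pairing suggestions could be generated. It is not the DR app's implementation: the `Host` fields, the role labels, and the rule "same role, at least as much storage" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    role: str        # hypothetical labels, e.g. "console", "event_processor"
    storage_gb: int  # usable Ariel storage; illustrative field only


def suggest_pairs(production, dr):
    """Suggest one DR host for each production host.

    Greedy heuristic (an assumption, not product behavior): handle the
    largest production hosts first, and give each one the smallest unused
    DR host that has the same role and at least as much storage.
    Returns (pairs, unmatched) where pairs maps production name -> DR name.
    """
    pairs = {}
    unmatched = []
    available = sorted(dr, key=lambda h: h.storage_gb)
    for prod in sorted(production, key=lambda h: h.storage_gb, reverse=True):
        candidate = next(
            (h for h in available
             if h.role == prod.role and h.storage_gb >= prod.storage_gb),
            None,
        )
        if candidate is None:
            unmatched.append(prod.name)  # left for the administrator to map
        else:
            pairs[prod.name] = candidate.name
            available.remove(candidate)
    return pairs, unmatched
```

Anything the heuristic cannot place is returned in `unmatched` rather than forced into a pairing, which mirrors the idea that the administrator makes the final mapping decision.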
Event and Flow Synchronization
One of the most significant challenges of DR, and certainly the one with the most at stake, is synchronizing the event and flow data between the production environment and the DR deployment. There is no out-of-the-box solution for this. What is typically done in the field is a manual implementation of forwarding rules, primarily dependent on custom scripting provided by professional services. A process is being developed that would efficiently and consistently copy Ariel data from any given Event Processor in the production environment to a specified Event Processor in the DR deployment. While the DR app will be pre-loaded with the recommended default settings, the QRadar DR administrator will be able to customize, tune, and monitor this operation without being burdened with the implementation details or having to constantly maintain custom scripting. This capability includes retroactively synchronizing data as far back as required (provided there are no disk space or data accessibility blockers) and reporting the last successful synchronization time and results through the app's UI. Advanced bandwidth management will be made available through the UI to allow users to manage network resources and constraints around this operation.
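As a rough illustration of retroactive synchronization and bandwidth planning, the sketch below splits the gap since the last successful synchronization into fixed-size time windows and estimates how long a copy would take under a bandwidth cap. The function names, the one-hour window, and the `max_lookback` parameter are assumptions for this example and do not describe the Ariel Copy mechanism itself.

```python
from datetime import datetime, timedelta


def pending_windows(last_synced, now, window=timedelta(hours=1), max_lookback=None):
    """Return (start, end) time windows, oldest first, from last_synced to now.

    max_lookback optionally limits how far back to synchronize, for example
    when older data is no longer available on disk.
    """
    start = last_synced
    if max_lookback is not None:
        start = max(start, now - max_lookback)
    windows = []
    while start < now:
        end = min(start + window, now)
        windows.append((start, end))
        start = end
    return windows


def transfer_seconds(size_bytes, limit_mbit_per_s):
    """Estimate seconds needed to copy size_bytes under a cap in megabits/s."""
    bytes_per_second = limit_mbit_per_s * 1_000_000 / 8
    return size_bytes / bytes_per_second


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 0, 0)
    now = datetime(2024, 1, 1, 3, 30)
    for start, end in pending_windows(last, now):
        print(start.isoformat(), "->", end.isoformat())
    print(transfer_seconds(1_000_000_000, 80), "seconds for 1 GB at 80 Mbit/s")
```

Processing windows oldest first means an interrupted run can resume from the last completed window, which is one simple way to track and report a "last successful synchronization time".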
Also critical to any DR solution is the transfer of all the configuration data necessary for the DR deployment to be activated and take over as a live QRadar SIEM environment in the event of an actual disaster (or a test failover for compliance testing). This is attempted in many ways in the field, often with a combination of custom scripting, /opt/qradar/bin/contentManagement.pl operations, and manual file handling. Ideally, QRadar's existing backup/restore functionality could be used more easily for this purpose, but it is currently not suitable out-of-the-box. As part of our suite of DR enhancements, we are developing the necessary APIs and Backup/Restore granularity to enable seamless transfer of configuration data between production and DR deployments. This feature will combine the ability to independently restore critical configuration items using the DR app with appropriate scheduling and monitoring through the app's UI.
DR Site Activation
In the event of an actual disaster, or even a test failover to validate compliance, there is usually a significant process involved in getting the DR site fully activated. This may involve manual investigation and confirmation of the state of both systems, execution of scripts, manual restarting of services, and repointing of various Event Collectors and/or log sources, in addition to any number of other processes. While the initial release of the DR app will not remove all work at the Event Collector and log source level, the activation of the DR system will be greatly streamlined through this app. In keeping with most workflow and compliance requirements we've discussed with stakeholders, we will keep the actual activation of the DR site in the hands of an administrator; there will be no automatic failover without human intervention. However, activation will be greatly simplified in the app's UI, to the point where the DR administrator can select “Take Over as Primary” and be directed through any subsequent steps. Once the required human activation is initiated, the app will handle all the required service management to bring the DR system into production function.
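The "no automatic failover" rule can be pictured as a small state machine in which monitoring can only raise a flag, and only an explicit administrator action activates the site. The sketch below is purely illustrative: the class, the state names, and the listed activation steps are hypothetical and are not taken from the DR app.

```python
from enum import Enum


class SiteState(Enum):
    STANDBY = "standby"
    AWAITING_CONFIRMATION = "awaiting_confirmation"
    ACTIVE = "active"


class DrSite:
    """Toy model of a DR site that only activates on human request."""

    def __init__(self):
        self.state = SiteState.STANDBY
        self.log = []

    def report_main_site_unreachable(self):
        # Monitoring may flag a problem, but it never activates the site.
        if self.state is SiteState.STANDBY:
            self.state = SiteState.AWAITING_CONFIRMATION
            self.log.append("main site unreachable; waiting for administrator")

    def take_over_as_primary(self, confirmed_by):
        # Allowed from STANDBY as well, to cover planned test failovers.
        if not confirmed_by:
            raise PermissionError("activation requires a named administrator")
        if self.state is SiteState.ACTIVE:
            return
        self.log.append(f"activation confirmed by {confirmed_by}")
        # Hypothetical steps standing in for the app's service management.
        for step in ("stop synchronization", "start services", "mark site as primary"):
            self.log.append(step)
        self.state = SiteState.ACTIVE
```

The point of the design is that there is no code path from a monitoring event to `ACTIVE`; the only transition into that state goes through `take_over_as_primary` with a named administrator.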
While the main goal of this series of enhancements is to help our customers meet their DR use cases, a number of features will also be included in the product for other uses. A Backup/Restore API will be made available that allows users to customize and automate certain backup and recovery functions, such as running on-demand data backups of the previous day, or scripting a custom configuration backup schedule if the nightly backup schedule does not meet their needs. Administrators can also expect to see additional restore options. Furthermore, while the bandwidth throttling capability is primarily applicable to Ariel data synchronization in a DR context, bandwidth throttling between other hosts will be made more readily available and configurable through an API exposed for that purpose.
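For readers unfamiliar with bandwidth throttling, a token bucket is one common way to cap a transfer rate while still allowing short bursts. The sketch below shows that general technique only; it is an assumption chosen for illustration, not a description of how QRadar throttles traffic or of the planned API, and all names in it are hypothetical.

```python
import time


class TokenBucket:
    """Allow up to rate_bytes_per_s on average, with bursts up to capacity_bytes."""

    def __init__(self, rate_bytes_per_s, capacity_bytes, clock=time.monotonic):
        self.rate = rate_bytes_per_s
        self.capacity = capacity_bytes
        self.tokens = capacity_bytes  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last check, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def consume(self, nbytes):
        """Take nbytes worth of tokens if available; return True on success."""
        self._refill()
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

    def wait_time(self, nbytes):
        """Seconds until nbytes could be sent; 0.0 if it can be sent now."""
        if nbytes > self.capacity:
            raise ValueError("chunk is larger than the bucket capacity")
        self._refill()
        if self.tokens >= nbytes:
            return 0.0
        return (nbytes - self.tokens) / self.rate
```

A sender would call `consume(len(chunk))` before each chunk and sleep for `wait_time(len(chunk))` when it returns `False`. The `clock` parameter exists only so that the behavior can be checked deterministically without real delays.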
These enhancements are all currently in development, or under consideration for design and development. A phased approach is planned to ensure that the most immediately needed value is added to the QRadar product line first, with further enhancements and features to follow in later phases. Customers still have an opportunity to provide significant input into those later phases. Satisfying customer use cases around DR is of key interest to the authors, and we welcome any input you, as a QRadar user or administrator, can provide to help us scale out the best solutions for customers' DR deployments.