Last February, IBM General Manager Rob Thomas wrote in a blog post entitled “Scaling the AI Ladder” that "There is no AI without IA". This fundamental principle of machine learning and AI is as true today as it has ever been, yet it is very often overlooked, not just with respect to machine learning, but for any kind of analytics, orchestration, or automation you plan to base on data. I would argue that the lack of a proper, well-defined IA (Information Architecture) in the cybersecurity operations landscape is one of the things that has held the industry back the most. It is part of why your security data lake project will likely fail, it is why analytics and correlation rules are not portable across tools, and it is a large contributor to the cybersecurity skills shortage.
What is an "information architecture"? In short, information architecture is the structural design of information: organizing it so that it can reliably support things such as findability, usability, and analytics. It encompasses multiple aspects of data curation, including validation, translation, normalization, and cleansing.
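To make those curation steps concrete, here is a minimal sketch in Python. It assumes two hypothetical tools that report the same login event in different shapes; every field name and value below is invented for illustration and is not taken from any real product. Each record is validated, cleansed, and translated into one common shape.

```python
import ipaddress
from datetime import datetime, timezone

# Hypothetical records: the same login as two different tools might report it.
firewall_event = {"src": "10.0.0.5", "user": "ALICE", "ts": "2020-01-15T08:30:00Z"}
endpoint_event = {"source_ip": " 10.0.0.5 ", "username": "alice", "epoch": 1579077000}


def clean_ip(raw):
    """Cleanse whitespace and validate that the value really is an IP address."""
    return str(ipaddress.ip_address(raw.strip()))


def normalize_firewall(event):
    """Translate the hypothetical firewall record into the common shape."""
    ts = datetime.strptime(event["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {
        "src_ip": clean_ip(event["src"]),
        "user": event["user"].strip().lower(),
        "timestamp": ts.isoformat(),
    }


def normalize_endpoint(event):
    """Translate the hypothetical endpoint record into the common shape."""
    ts = datetime.fromtimestamp(event["epoch"], tz=timezone.utc)
    return {
        "src_ip": clean_ip(event["source_ip"]),
        "user": event["username"].strip().lower(),
        "timestamp": ts.isoformat(),
    }


if __name__ == "__main__":
    print(normalize_firewall(firewall_event))
    print(normalize_endpoint(endpoint_event))
```

Once both records share one shape, a single query or analytic can run over either source without knowing which tool produced the data.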
Without a proper information architecture, the data that one must work with tends not to look the same: it is stored in different places, in different formats, and is accessed through disparate API models. Without such a model to organize your data against, querying data becomes very hard, and building analytics, machine learning, and AI models becomes extremely difficult.
Does this problem area sound familiar? It should — because this is the typical world that we deal with daily in all aspects of cybersecurity operations. We encounter it while ingesting data, while correlating data, while threat hunting, and while responding to incidents.
No two tools in your enormous, unmanageable arsenal speak the same language, no two log sources look the same, no analytics work across the tools, and insights are not easily shared among them. This is the root of many challenges in the domain. It is why extracting actionable value from a security data lake is so hard. It is why an analyst who is an expert in one toolchain cannot easily write a query that works in another tool, and why an analytic written against one data source is not easily portable to another. It is also why there is such a large skills gap in cybersecurity: the tools and data change so often that keeping up with the landscape is a never-ending task.
IBM Security has long recognized the need for this information architecture in cybersecurity operations. In the threat intelligence space, several such information architectures exist, with STIX 2 and TAXII 2 rapidly becoming the most widely adopted standards in this domain. But no widely adopted information architecture exists for cybersecurity operations. True, there are some nascent efforts afoot, such as Sigma, that attempt to help solve this problem, but they take a fundamentally different approach rather than creating a true information architecture around the data.
We asked ourselves: could the already existing STIX 2 standard help in this problem space? It turns out the answer is yes.
One facet of STIX 2 is its cyber observable data model, built from STIX Cyber-observable Objects and commonly abbreviated SCO. The purpose of SCO in STIX is to allow one to model the kinds of "observations" one is typically interested in within a cybersecurity context: things such as network activity, file accesses, user activity, process activity, and the like. SCO allows one to model all of these observations in a structured way, and it is easily extensible with new observation types that do not already exist.
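As an illustration of what such a structured observation can look like, the sketch below builds a single network connection as a Python dictionary. It follows the STIX 2.0 `observed-data` layout as I understand it, in which cyber observable objects sit in an `objects` map and refer to one another by key. The identifier, addresses, ports, and timestamps are all invented, and `describe_traffic` is a hypothetical helper rather than part of any STIX library.

```python
import json

# One observation: a TCP connection between two IPv4 addresses.
# All values are made up for illustration.
observed_data = {
    "type": "observed-data",
    "id": "observed-data--b67d30ff-02ac-498a-92f9-32f845f448cf",
    "created": "2020-01-15T08:31:00.000Z",
    "modified": "2020-01-15T08:31:00.000Z",
    "first_observed": "2020-01-15T08:30:00Z",
    "last_observed": "2020-01-15T08:30:05Z",
    "number_observed": 1,
    "objects": {
        "0": {"type": "ipv4-addr", "value": "10.0.0.5"},
        "1": {"type": "ipv4-addr", "value": "203.0.113.7"},
        "2": {
            "type": "network-traffic",
            "src_ref": "0",  # points at object "0" above
            "dst_ref": "1",  # points at object "1" above
            "src_port": 49152,
            "dst_port": 443,
            "protocols": ["tcp"],
        },
    },
}


def describe_traffic(observed):
    """Resolve the src/dst references of each network-traffic object into text."""
    objects = observed["objects"]
    lines = []
    for obj in objects.values():
        if obj["type"] != "network-traffic":
            continue
        src = objects[obj["src_ref"]]["value"]
        dst = objects[obj["dst_ref"]]["value"]
        protocols = "/".join(obj["protocols"])
        lines.append(f"{src}:{obj['src_port']} -> {dst}:{obj['dst_port']} ({protocols})")
    return lines


if __name__ == "__main__":
    print(json.dumps(observed_data, indent=2))
    print(describe_traffic(observed_data))
```

Because every source that reports network activity would be expressed with the same object types and property names, a helper like this works no matter which tool the observation came from.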
While SCO was created in order to be able to report observations of activity tied to threat intelligence reports, it turns out that SCO is also the perfect basis for our much-needed information architecture in cybersecurity operations.
This is why we chose STIX 2 as the open information architecture to build IBM Cloud Pak for Security around. Using STIX 2 as our unifying IA, and translating all native data to and from it across our entire data layer, frees the analytics and applications from the data and lets us deliver our future services and software identically across any security data in any product or cloud. It will allow security analytics to operate across all products in the SOC, and allow insights and orchestration to work seamlessly across them, without an unmanageable morass of point-to-point integrations.
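The idea of translating to and from one shared model can be sketched in a few lines of Python. The example below is a toy, not a description of how any IBM product performs translation: it handles only a single equality comparison written in STIX pattern syntax, and the two target tools (`tool_a`, `tool_b`), their field names, and their query syntaxes are all hypothetical.

```python
import re

# Matches one simple comparison such as [ipv4-addr:value = '10.0.0.5'].
# Real STIX patterns are far richer; this toy handles only this form.
PATTERN_RE = re.compile(r"^\[\s*([a-z0-9-]+):([a-z0-9_.]+)\s*=\s*'([^']*)'\s*\]$")

# Hypothetical mappings from the shared model to each tool's native field name.
FIELD_MAPS = {
    "tool_a": {("ipv4-addr", "value"): "sourceip"},
    "tool_b": {("ipv4-addr", "value"): "src.ip"},
}

# Hypothetical native query syntaxes. A real translator would also have to
# escape the value correctly for each target language.
RENDERERS = {
    "tool_a": lambda field, value: f"SELECT * FROM events WHERE {field} = '{value}'",
    "tool_b": lambda field, value: f'{field}:"{value}"',
}


def parse_pattern(pattern):
    """Split a simple pattern into (object type, property, value)."""
    match = PATTERN_RE.match(pattern.strip())
    if match is None:
        raise ValueError(f"unsupported pattern: {pattern!r}")
    return match.groups()


def to_native(pattern, tool):
    """Translate one shared pattern into the query language of one tool."""
    obj_type, prop, value = parse_pattern(pattern)
    field = FIELD_MAPS[tool][(obj_type, prop)]
    return RENDERERS[tool](field, value)


if __name__ == "__main__":
    pattern = "[ipv4-addr:value = '10.0.0.5']"
    for tool in FIELD_MAPS:
        print(tool, "->", to_native(pattern, tool))
```

The analyst writes the pattern once. Supporting a new tool means adding one mapping and one renderer, rather than rewriting every analytic for that tool.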
We did not stop there. We decided to go further to try to revolutionize this industry, and fully open sourced the core engine we developed to enable this, IBM STIX Shifter, on GitHub. By using STIX Shifter yourself, either as a library or from the command line, you can add basic security data federation to any environment, product, or SOC that allows external integrations. We have already open sourced a library of nine connectors and are adding more all the time. We hope that others will follow our lead and leverage (and contribute to!) this project, allowing practitioners to start moving away from an unmanageable security data lake as the solution to their problems, and to focus instead on real outcomes that can be achieved today with their existing data.
If you're interested in joining us on this journey, whether as a product developer, a SOC analyst, a threat hunter, or an incident responder, please reach out and engage! We think that the future of cybersecurity operations under a proper and open "information architecture" is very bright, and we look forward to increased engagement from the community.