Accelerate privacy compliance reporting with automated sensitive data discovery in IBM Knowledge Catalog

By Michal Szylar posted Tue September 12, 2023 04:42 AM


By Pat O'Sullivan and Michal Szylar

As organizations navigate a dynamic data privacy landscape, they need to reckon with sensitive data residing in multiple data sources across the organization. This is a particular challenge as organizations grow and sensitive information propagates to multiple locations.

To handle sensitive data in a compliant manner, it is essential to discover where such data exists and to classify and tag it appropriately, so that protection rules can be applied automatically to safeguard it.

To support the data privacy and governance goals of organizations, IBM introduced a new data privacy accelerator to augment the powerful data governance capabilities of IBM Knowledge Catalog. The data privacy accelerator, introduced as part of IBM Knowledge Catalog 4.7, provides a curated selection of pre-defined constructs specifically designed to help organizations that deploy IBM Knowledge Catalog meet their data privacy requirements. The design of the accelerator was based on feedback from earlier client deployments addressing data privacy needs, and on insights garnered from IBM's own experience with privacy compliance across the more than 170 countries in which we do business.

Your organization can address data privacy requirements at speed and scale by leveraging the capabilities of IBM’s data privacy accelerator, listed below:

Unified view of organization-wide sensitive data

To put in place adequate guardrails around how sensitive data is used, it is imperative to have a complete and accurate organization-wide view of which data assets have been identified as significant in terms of their sensitivity. One effective way to get a comprehensive view of data asset sensitivity is to build a dashboard that shows which data assets are classified as sensitive, which are classified as personal, and which are not classified. Such a dashboard provides valuable assistance to data stewards, data engineers and other users responsible for the governance of the data assets.

IBM Knowledge Catalog provides such a dashboard.

The dashboard combines the IBM Knowledge Catalog reporting database with a set of sample dashboard reports and related SQL queries. One of the key objectives of the dashboard views is to give data stewards and administrators the means to review key information about the sensitivity status of their data assets. For example, a data steward can easily view which assets are classified as Personal Information (PI) or Sensitive Personal Information (SPI), and which assets have PI or SPI classifications but do not have associated data protection or data quality rules.
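
As a rough illustration of the kind of query behind such a view, the following sketch uses an in-memory SQLite database to list PI or SPI assets that have no associated rule. The tables `assets` and `asset_rules`, their columns, and the sample asset names are assumptions made up for this example; they are not the actual schema of the IBM Knowledge Catalog reporting database, and the sample queries shipped with the dashboard may be structured quite differently.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database using a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE assets (
            asset_id INTEGER PRIMARY KEY,
            name TEXT,
            classification TEXT  -- 'PI', 'SPI', or NULL when not classified
        );
        CREATE TABLE asset_rules (
            asset_id INTEGER,
            rule_type TEXT       -- 'data_protection' or 'data_quality'
        );
        """
    )
    conn.executemany(
        "INSERT INTO assets VALUES (?, ?, ?)",
        [
            (1, "CUSTOMER.EMAIL", "PI"),
            (2, "PATIENT.DIAGNOSIS", "SPI"),
            (3, "PRODUCT.SKU", None),
            (4, "CUSTOMER.SSN", "SPI"),
        ],
    )
    conn.executemany(
        "INSERT INTO asset_rules VALUES (?, ?)",
        [(1, "data_protection"), (4, "data_quality")],
    )
    return conn


def unprotected_sensitive_assets(conn):
    """Return (name, classification) for PI/SPI assets with no rule of any type."""
    query = """
        SELECT a.name, a.classification
        FROM assets AS a
        LEFT JOIN asset_rules AS r ON r.asset_id = a.asset_id
        WHERE a.classification IN ('PI', 'SPI')
          AND r.asset_id IS NULL   -- the LEFT JOIN found no matching rule
        ORDER BY a.name
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for name, classification in unprotected_sensitive_assets(build_demo_db()):
        print(f"{name}: classified as {classification}, no associated rule")
```

In this sketch an asset counts as covered if it has either kind of rule; a real report might well track data protection rules and data quality rules separately.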

You can adopt and tailor this sample dashboard provided by IBM for the needs of your organization, or you can use it as a guide for creating a dashboard using other business intelligence reporting tools. For more information about how to download and implement these dashboards, see this other blog post.

A data privacy-oriented hierarchy of data classes

A critical step to ensure effective discovery of sensitive data is to analyze, classify and assign the appropriate data classes to enterprise-wide data. It can be a daunting task to execute the data discovery and enrichment process needed to define and maintain an extensive set of well-categorized data classes.

IBM’s data privacy accelerator streamlines the process by providing a pre-defined set of data classes, organized in a hierarchy of categories and including new data classes specifically tailored for data privacy. You can leverage this work by choosing which data classes to apply or ignore when you import and enrich the technical metadata for your data assets. This gives you a high degree of control over how your data is processed and classified, with minimal setup. The following image shows an example of some of this data class hierarchy.

A data privacy-specific taxonomy

A core part of this accelerator is a set of 500-600 business terms (depending on industry) that cover all of the main elements of the business language relevant to data privacy. These business terms are categorized in a specific data privacy taxonomy, so that users can see at a glance which business terms pertain to various areas of data privacy such as Financial Information, Health & Biometric, Government IDs, and so on.

The sets of industry-specific business terms are integrated into the overall core vocabulary of the IBM Knowledge Accelerators. This means that the data privacy taxonomy of terms can be used on its own, or as one of many views of the central vocabulary of the enterprise. Thus, data privacy does not operate in isolation, but can be managed as part of a range of other business initiatives. The following image shows categorized views of the taxonomy of business terms.

Additionally, all the business terms and other artifacts in the data privacy accelerator have already been assigned very specific PI (Personal Information) and SPI (Sensitive Personal Information) classifications. These classifications have been extensively reviewed to ensure that they align with recommended classification in accordance with major data privacy regulations such as GDPR (EU General Data Protection Regulation) and CCPA (California Consumer Privacy Act), accelerating your productive use of these artifacts to support the relevant regulatory requirements.

Using the data classes and business terms as part of metadata enrichment

In IBM Knowledge Catalog, the metadata enrichment process is the means by which data profiling, quality analysis and business term assignment are carried out. During this automatic, ML-powered discovery process, it is possible to indicate which sets of data classes and business terms should be used during a specific metadata enrichment job. This means that an organization can be very prescriptive about which sets of data classes or business terms the process should use, depending on which data assets are being analyzed. Having these governance artifacts readily arranged in their respective category hierarchies greatly assists this selection process. The following image shows the point in the metadata enrichment process where the user is invited to select the relevant categories containing data classes or business terms.
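
The effect of selecting categories for an enrichment job can be pictured with a small sketch. It models each data class as a name plus a slash-separated category path and keeps only the classes that sit in, or below, a selected category. The `DataClass` type, the path format, and the sample class names are hypothetical and chosen only for illustration; this is not how IBM Knowledge Catalog represents or filters its artifacts internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataClass:
    name: str
    category: str  # hypothetical slash-separated path, e.g. "Data Privacy/Government IDs"


def in_category(path, selected):
    """True if path is the selected category itself or is nested below it."""
    return path == selected or path.startswith(selected + "/")


def classes_for_job(all_classes, selected_categories):
    """Keep only the data classes that fall under one of the selected categories."""
    return [
        dc
        for dc in all_classes
        if any(in_category(dc.category, sel) for sel in selected_categories)
    ]


# Hypothetical sample data; the names are illustrative only.
CATALOG = [
    DataClass("Passport Number", "Data Privacy/Government IDs"),
    DataClass("Credit Card Number", "Data Privacy/Financial Information"),
    DataClass("Product Code", "Retail/Catalog"),
]

if __name__ == "__main__":
    for dc in classes_for_job(CATALOG, ["Data Privacy/Government IDs"]):
        print(dc.name)
```

Selecting a parent category such as "Data Privacy" in this sketch pulls in every class nested beneath it, which is the convenience a well-organized hierarchy provides.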

A set of sample data privacy policies and rules

Data protection regulations are typically enforced via a set of policies and associated rules. Given the variety of regulations across jurisdictions, the ways organizations enforce them, and the different business needs to be addressed, it is not possible to define an exhaustive set of policies and rules. However, the data privacy accelerator includes a set of sample policies and rules that show how such artifacts can be defined and how they relate to the other data governance artifacts. You can use the samples as both a guide and a shortcut for creating policies and rules that meet the needs of your organization. The following image shows a typical policy, specifically a sample policy covering data disclosures and the associated sub-policies covering areas such as the masking, disclosure and protection of data assets.

These sample policies and associated rules are grouped in what are typically the main functional areas to be addressed as part of any data privacy program.
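
To make the idea of a rule tied to a classification more concrete, here is a minimal sketch of a rule that masks or withholds a value based on its PI or SPI classification. The `ProtectionRule` type, the `mask` and `deny` actions, and the masking format are assumptions invented for this example; they do not reproduce the sample rules in the accelerator or the rule syntax of IBM Knowledge Catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectionRule:
    name: str
    classification: str  # classification the rule applies to, e.g. "SPI"
    action: str          # hypothetical actions: "mask" or "deny"


def mask(value):
    """Replace all but the last four characters with asterisks."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


def apply_rules(value, classification, rules):
    """Apply the first rule matching the classification; pass through otherwise."""
    for rule in rules:
        if rule.classification == classification:
            if rule.action == "deny":
                return None
            if rule.action == "mask":
                return mask(value)
    return value


# Hypothetical rule set for illustration.
RULES = [
    ProtectionRule("Mask sensitive personal information", "SPI", "mask"),
    ProtectionRule("Withhold personal information", "PI", "deny"),
]

if __name__ == "__main__":
    print(apply_rules("123-45-6789", "SPI", RULES))
    print(apply_rules("jane@example.com", "PI", RULES))
    print(apply_rules("SKU-1", None, RULES))
```

Keeping the rule keyed on the classification rather than on individual columns reflects the point made above: once assets carry PI or SPI classifications, rules can be applied to them automatically.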


The data privacy accelerator offers clients of IBM Knowledge Catalog a set of pre-defined components to kickstart the automatic discovery and processing of sensitive data as part of a broader data privacy initiative. It is possible to deploy just the data privacy accelerator on IBM Knowledge Catalog, or the same content can be deployed as part of a broader industry-wide knowledge accelerator covering a range of business issues.

You can learn more about this topic with the resources below: