Some of the most commonly asked questions from customers new to data governance or Watson Knowledge Catalog are: Where do we start? What is the operating model? What business metadata can we capture?
This blog provides an overview of the core governance artifacts that underpin the data governance framework in Watson Knowledge Catalog (WKC). These artifacts not only help you define your business vocabulary, policies, standards, rules and relationships, but are also key to enabling the built-in automated processes in WKC that speed up the curation, cataloguing and protection of your assets. These processes include profiling and classifying your data to understand its content, automatically linking business and technical metadata, evaluating the quality of your assets and dynamically protecting your sensitive data. A good understanding of these core governance artifacts, and of how they can be managed and used, will therefore help you take full advantage of WKC’s advanced capabilities to support your enterprise data governance initiatives.
Business terms describe a business concept or definition. For example, what does the term “Customer Lifetime Value” mean and how is it measured? Business terms form the enterprise vocabulary that your users understand. They provide business context for your technical assets, whether a table, column, file or document, and are fundamental to giving users a good understanding of their data. Business terms are typically assigned to the relevant technical assets and/or to columns within data assets. Some customers with mature data governance programs make it their practice to require that all technical metadata be associated with a business term before it can be published to the data catalog and shared with users. This lets users pick up the business context as they discover technical metadata, and fully understand what it means and how it can be used.
While assigning business terms to every piece of technical metadata may sound daunting, WKC’s automated Data Discovery processes can tag technical metadata with business terms automatically by leveraging the business term definitions and their associations, especially with data classes. Additionally, business terms can be used in Data Protection Rules to determine which sensitive data needs to be protected and how, and in Automation Rules to determine how data quality assessment is adjusted when given conditions are met.
Data Classes define the logic or expression used to classify and identify the type of data in a given column or field of a data asset. WKC includes over 200 data classes with predefined matching patterns to discover values such as email addresses and social security numbers, amongst many others. The matching logic can be provided by Java code, a regular expression, a list of valid values, a reference data set, or a cluster of similar columns identified using a patent-protected “Fingerprint” algorithm. Data classes are used in automatic data profiling and data discovery to infer or classify the type of content in each column of your data assets. They are the means by which WKC’s data discovery automates the process of finding where your critical data elements and personally identifiable information are located. When business terms are mapped to data classes, business terms can be automatically assigned to discovered and classified columns in the same data discovery job. You should review the default data classes and decide which ones to keep or disable, then add your own custom data classes for all the critical data elements that you want WKC to discover automatically.
Like business terms, data classes are used in other areas of WKC. In Data Protection Rules, data classes can be used to dynamically mask data. For example, a rule can be created to mask all columns that contain Social Security Numbers, as identified by the data class assigned to the columns. In Data Quality analysis, data class violations can impact data quality scores. In Automation Rules, data classes can be used as part of the condition that automates tasks to fine-tune how data quality should be measured.
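To make the idea of a data class more concrete, here is a minimal Python sketch of how value-level matching logic can classify a column and then drive business term assignment. It is an illustration only and not how WKC is implemented: the class and function names, the simplistic patterns, the 0.8 match threshold and the term mapping are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class DataClass:
    """A named matcher that decides whether a single value fits the class."""
    name: str
    matches: Callable[[str], bool]


def regex_class(name: str, pattern: str) -> DataClass:
    """Build a data class whose matching logic is a regular expression."""
    compiled = re.compile(pattern)
    return DataClass(name, lambda value: compiled.fullmatch(value) is not None)


def valid_values_class(name: str, valid: Iterable[str]) -> DataClass:
    """Build a data class whose matching logic is a list of valid values."""
    allowed = frozenset(valid)
    return DataClass(name, lambda value: value in allowed)


# Illustrative data classes; the patterns are deliberately simplistic.
DATA_CLASSES = [
    regex_class("Email Address", r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    regex_class("US Social Security Number", r"\d{3}-\d{2}-\d{4}"),
    valid_values_class("Country Code", ["US", "CA", "GB", "DE"]),
]

# Hypothetical mapping of data classes to business terms.
TERM_FOR_DATA_CLASS = {
    "Email Address": "Customer Email",
    "US Social Security Number": "Tax Identifier",
}


def classify_column(values, data_classes=DATA_CLASSES, threshold=0.8) -> Optional[str]:
    """Return the name of the best-matching data class, or None if no class
    matches at least `threshold` of the non-empty values."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    best_name, best_ratio = None, 0.0
    for data_class in data_classes:
        hits = sum(1 for v in non_empty if data_class.matches(v))
        ratio = hits / len(non_empty)
        if ratio >= threshold and ratio > best_ratio:
            best_name, best_ratio = data_class.name, ratio
    return best_name


def assign_term(values) -> Optional[str]:
    """Classify a column, then look up the business term mapped to its class."""
    return TERM_FOR_DATA_CLASS.get(classify_column(values))
```

In this sketch, a column of mostly well-formed email addresses would be classified as "Email Address" and, through the mapping, tagged with the "Customer Email" term in the same pass. A column classified as "Country Code" receives no term because none is mapped to that class.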
Classifications are special labels used to classify assets based on their level of sensitivity or confidentiality to your organization. Unlike data classes, which include logic to match data values, classifications are simply labels. WKC includes three commonly used classifications: Personally Identifiable Information, Sensitive Personal Information and Confidential. You can keep or change these, or add your own classifications that are relevant to your organization. For example, an organization may create classifications for Restricted Data, Private Data and Public Data according to its own corporate data security guidelines. To protect highly sensitive data, you can create a Data Protection Rule in WKC that blocks users from accessing a data asset based on its classification.
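The two protection examples mentioned so far, masking columns by data class and denying access by classification, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than WKC's rule engine: the `Asset` and `Column` types, the rule sets and the masking style (replacing every character with "X") are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Column:
    name: str
    data_class: Optional[str]  # data class assigned during profiling, if any
    values: List[str]


@dataclass
class Asset:
    name: str
    classifications: Set[str]
    columns: List[Column]


# Hypothetical rule definitions.
MASKED_DATA_CLASSES = {"US Social Security Number"}
DENIED_CLASSIFICATIONS = {"Confidential"}


def read_asset(asset: Asset) -> Dict[str, List[str]]:
    """Return the asset's data with the protection rules applied."""
    # Deny rule: block access when the asset carries a denied classification.
    if asset.classifications & DENIED_CLASSIFICATIONS:
        raise PermissionError(f"Access to {asset.name} is denied by classification")
    result = {}
    for column in asset.columns:
        if column.data_class in MASKED_DATA_CLASSES:
            # Mask rule: hide every character of each value in the column.
            result[column.name] = ["X" * len(v) for v in column.values]
        else:
            result[column.name] = list(column.values)
    return result
```

Note the difference in granularity: the classification is a label on the asset as a whole, so it gates access to the entire asset, while the data class is attached to individual columns, so masking can be applied column by column.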
Reference Data Set
Reference Data Sets provide logical groupings of code values (reference data values), such as product codes and country codes. These are typically sets of allowed values associated with data fields. You create reference data sets in WKC so that enterprise standards can be accessed centrally by users or by consuming applications through APIs. Reference data sets can also be used to provide the matching pattern for data classes, allowing data fields to be automatically classified through data profiling and discovery. These data classes can then be used in data quality analysis to evaluate the quality and consistency of the values in data columns.
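As a rough illustration of how a set of allowed values can feed a quality check, the following Python sketch scores a column against a reference data set. The `ReferenceDataSet` type, the sample country codes and the scoring formula (share of conforming values) are assumptions for this example and do not reflect how WKC computes data quality scores.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ReferenceDataSet:
    name: str
    values: Dict[str, str]  # code -> description

    def is_valid(self, code: str) -> bool:
        return code in self.values


# Hypothetical reference data set of country codes.
COUNTRY_CODES = ReferenceDataSet(
    "Country Codes",
    {"US": "United States", "CA": "Canada", "GB": "United Kingdom", "DE": "Germany"},
)


def check_column(column: List[str], ref: ReferenceDataSet) -> Tuple[float, List[str]]:
    """Return the share of values found in the reference set, plus the violations."""
    violations = [v for v in column if not ref.is_valid(v)]
    score = 1.0 - len(violations) / len(column) if column else 1.0
    return score, violations
```

For a column holding "US", "CA", "XX" and "DE", this sketch would report "XX" as the single violation, which is the kind of inconsistency a reference-data-backed data class is meant to surface.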
Policies describe and document your organization’s guidelines, regulations, standards or procedures to ensure that data and information assets are properly managed and used. Examples of policies include Sensitive Data Handling and Data Sharing Agreement. You can create policies and sub-policies, and associate them with the Governance Rules and Data Protection Rules that support them.
Governance rules provide the business description of the required behaviour or actions to be taken in order to implement a given governance policy. They are descriptive business rules and are not enforceable, unlike other rules that you can define in WKC such as Data Protection Rules, Data Rules, Data Quality Rules or Automation Rules. (Enforceable rules will be explained in another blog.)
Organizing governance artifacts with categories
Categories provide logical groupings according to subject or domain, such as geography and lines of business. They can be organized in a hierarchy, very much like the file folders on your computer. All the governance artifacts that I have described above can be organized by categories, and each category can contain different governance artifact types. There is no limit to the number of category hierarchies you can create, or to the number of categories within a hierarchy. However, from a usability perspective, the simpler the structure, the easier it is for your users to navigate. Hierarchies that go more than five levels deep become tedious and can make information less discoverable for users. Each governance artifact can have only one primary (owning) category, which is the category that “owns” the artifact. If you want the same governance artifact to appear under other categories as a reference, you can associate additional categories with the governance artifact as secondary (referencing) categories.
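The primary and secondary category model can be pictured with a small Python sketch. The types and method names below are hypothetical and only mirror the rules described above: a category has at most one parent, and an artifact has exactly one primary category plus any number of secondary ones.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Category:
    name: str
    parent: Optional["Category"] = None

    def path(self) -> str:
        """Return the full path from the top-level category down to this one."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))

    def depth(self) -> int:
        """Number of levels from the top of the hierarchy (top level is 1)."""
        return len(self.path().split("/"))


@dataclass
class Artifact:
    name: str
    primary: Category  # the single owning category
    secondary: List[Category] = field(default_factory=list)  # referencing categories

    def locations(self) -> List[str]:
        """Every category path under which the artifact appears, owner first."""
        return [self.primary.path()] + [c.path() for c in self.secondary]


geography = Category("Geography")
emea = Category("EMEA", parent=geography)
finance = Category("Finance")

# The term is owned by Finance and also referenced from Geography/EMEA.
clv = Artifact("Customer Lifetime Value", primary=finance, secondary=[emea])
```

A helper like `depth()` could also be used to flag categories that sit more than five levels down, in line with the usability guideline above.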
Creating relationships between governance artifacts
You can create associations between different governance artifact types. The table below summarizes the relationships that can be created via the WKC UI. Note that some of the relationships cannot be created bi-directionally. For example, you can add a Classification to a Business Term, but not vice versa. In this scenario, the related Business Term can be viewed from the "Related Content" tab of the Classification in the UI.
As you can see, some of these governance artifacts, especially business terms, data classes and their relationships, are heavily leveraged in other areas of WKC, such as data discovery, data quality and data protection. They provide the critical knowledge in your data governance framework that makes it possible to automate complex processes to discover, understand and enrich metadata.