Purpose
In this article, we’ll see how to add remote data sources to IBM watsonx.data, and how to add tables from these remote sources to an IBM Cloud Pak for Data governed catalog. We’ll additionally apply a data protection rule to mask some of that data, protecting sensitive information whether the table is viewed in IBM Cloud Pak for Data or in IBM watsonx.data.
Requirements
- IBM Cloud Pak for Data 5.0.3+ (with IBM Knowledge Catalog)
- IBM watsonx.data 2.0.3+
- an external data source (e.g., PostgreSQL)
Audience
This article is intended for users who have familiarity with IBM Cloud Pak for Data and IBM Knowledge Catalog, and who are interested in integration with IBM watsonx.data.
How does data protection work in IBM watsonx.data?
With the introduction of IBM watsonx.data-Knowledge Catalog service integrations in IBM Cloud Pak for Data 4.8.4, tables in IBM watsonx.data that are additionally included in an IBM Cloud Pak for Data governed catalog will be subject to any data protection rules that are defined in that cluster. In IBM Cloud Pak for Data 5.0.3, this integration was extended to include remote data sources in IBM watsonx.data.
IBM watsonx.data Admins can create or modify these service integrations on the Access Control -> Integrations page:
Creating an IBM watsonx.data-Knowledge Catalog integration
Once the integration is configured, tables in the specified storage catalogs are subject to any data protection that is applied in IBM Cloud Pak for Data.
Adding remote data sources to IBM watsonx.data
Remote data sources can be added by any IBM watsonx.data user who has the Admin or Manager role for the instance’s Presto engine. Presto engine Users can also add a data source, but cannot associate it with the engine; an engine Admin or Manager must make that association before the data source becomes active.
To check which users have these roles, we can navigate to Infrastructure Manager, click the Presto engine tile, and click ‘Access control’. Alternatively, we can navigate to Access Control -> Infrastructure, find the Presto engine entry, and use its ellipsis menu to ‘Manage access’. In both cases, we see the list of users who have various rights to the instance’s Presto engine.
Presto engine’s users, and their respective roles
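The role distinction above can be summarized in a few lines of Python. This is only a conceptual sketch of the behavior described in this article, not an IBM watsonx.data API: the `EngineRole` enum, the two capability sets, and the helper functions are hypothetical names introduced for illustration.

```python
from __future__ import annotations

from enum import Enum


class EngineRole(Enum):
    """Presto engine roles mentioned in this article (hypothetical model)."""
    ADMIN = "Admin"
    MANAGER = "Manager"
    USER = "User"


# Capabilities as described in the text above; not an exhaustive permission list.
CAN_ADD_DATA_SOURCE = {EngineRole.ADMIN, EngineRole.MANAGER, EngineRole.USER}
CAN_ASSOCIATE_WITH_ENGINE = {EngineRole.ADMIN, EngineRole.MANAGER}


def can_activate_data_source(role: EngineRole) -> bool:
    """A data source is active only once it is also associated with the engine."""
    return role in CAN_ADD_DATA_SOURCE and role in CAN_ASSOCIATE_WITH_ENGINE


def who_can_activate(user_roles: dict[str, EngineRole]) -> list[str]:
    """Return, in sorted order, the users able to add and associate a data source."""
    return sorted(u for u, r in user_roles.items() if can_activate_data_source(r))
```

Given a mapping of user names to roles, `who_can_activate` lists the users who could complete both steps on their own; an engine User would still need an Admin or Manager to finish the association.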
An authorized user can navigate to the Infrastructure Manager and click ‘Add component’.
Adding a remote data source to IBM watsonx.data
In this example, we will add a PostgreSQL data source.
Adding a remote PostgreSQL data source
After entering the data source details, we click ‘Test connection’ to confirm that the remote data is accessible. We can opt to ‘Associate catalog’ while creating the data source, or do this afterwards on the Infrastructure Manager page. Once the data source and its catalog are created, the catalog needs to be associated with the Presto engine. This is done on the Infrastructure Manager page by hovering over the new catalog, clicking ‘Manage associations’, and selecting the Presto engine in the resulting dialog.
Associating (or disassociating) a catalog and engine
We can click the catalog tile and then ‘Access control’ to specify which users should have access to this remote data.
Adding users to a catalog’s Access Control
Authorized users can now browse tables in our PostgreSQL database.
Browsing remote data in IBM watsonx.data
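Besides browsing in the UI, the same remote table could be previewed programmatically through the Presto engine. The following is a minimal sketch that assumes the open-source `presto-python-client` package (`prestodb`) and basic authentication over HTTPS; the host, port, credentials, and the catalog, schema, and table names are all placeholders, and the authentication and TLS settings an instance requires may differ.

```python
def quote_identifier(name: str) -> str:
    """Quote an SQL identifier with double quotes, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def preview_query(catalog: str, schema: str, table: str, limit: int = 10) -> str:
    """Build a SELECT that previews the first rows of catalog.schema.table."""
    qualified = ".".join(quote_identifier(p) for p in (catalog, schema, table))
    return f"SELECT * FROM {qualified} LIMIT {int(limit)}"


def preview_table(host, port, user, password, catalog, schema, table):
    """Run the preview query against a Presto engine and return the rows.

    Assumes basic authentication over HTTPS; adjust for your deployment.
    """
    import prestodb  # imported lazily; install with: pip install presto-python-client

    conn = prestodb.dbapi.connect(
        host=host,
        port=port,
        user=user,
        catalog=catalog,
        schema=schema,
        http_scheme="https",
        auth=prestodb.auth.BasicAuthentication(user, password),
    )
    cur = conn.cursor()
    cur.execute(preview_query(catalog, schema, table))
    return cur.fetchall()
```

A call such as `preview_table("presto.example.com", 443, "some_user", "some_password", "postgres_remote", "public", "customers")` would return the first rows of the hypothetical `customers` table, provided the user has been granted access to the catalog.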
Applying data protection to remote tables
Since data protection is controlled by IBM Knowledge Catalog, we need to add our table to a governed catalog in IBM Cloud Pak for Data. We create an IBM watsonx.data Presto connection so we can browse IBM watsonx.data tables.
IBM Cloud Pak for Data connection to IBM watsonx.data
Our data steward then uses this connection to add the PostgreSQL table to the governed catalog.
Remote PostgreSQL table in a governed catalog
Additionally, our data steward makes the determination that this table includes some Personally Identifiable Information, and tags the asset with this classification.
IBM Knowledge Catalog assigned the Person Name data class to the contact_name column, and the data steward labeled the asset as containing Personally Identifiable Information
In our example, the data steward creates a data protection rule that redacts columns assigned the Person Name data class when two conditions hold: the asset is classified as containing Personally Identifiable Information, and the individual viewing the asset is not in the ‘Authorized for PII’ group. Note that IBM Knowledge Catalog’s profiler runs automatically against all tables added to a governed catalog; it assigned the data classes in the above screenshot without any user intervention (although users can modify these assignments).
Data protection rule to redact person names that are deemed Personally Identifiable Information
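To make the rule’s logic concrete, here is a small Python model of the three conditions it combines. This is a conceptual sketch only, not how IBM Knowledge Catalog evaluates or enforces rules: the `Asset` class, the helper functions, and the choice to replace each character with "X" are assumptions made for illustration, and the sample values used with it are invented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Asset:
    """Hypothetical stand-in for a catalog asset and its governance metadata."""
    classifications: set[str]
    column_data_classes: dict[str, str]  # column name -> assigned data class


def redact(value) -> str:
    # Assumption for illustration: replace every character with "X".
    return "X" * len(str(value))


def should_redact(asset: Asset, column: str, user_groups: set[str]) -> bool:
    """All three conditions of the example rule must hold for redaction."""
    return (
        asset.column_data_classes.get(column) == "Person Name"
        and "Personally Identifiable Information" in asset.classifications
        and "Authorized for PII" not in user_groups
    )


def preview_row(asset: Asset, row: dict, user_groups: set[str]) -> dict:
    """Return the row as the given user would see it under the example rule."""
    return {
        col: redact(val) if should_redact(asset, col, user_groups) else val
        for col, val in row.items()
    }
```

In this model, a user outside the ‘Authorized for PII’ group sees the `contact_name` value replaced while other columns pass through unchanged, and a member of the group sees the row as stored.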
When this table is previewed by a user who is not in the ‘Authorized for PII’ group, the contact_name column data is redacted. This is the case in IBM Cloud Pak for Data:
Redacted data in IBM Cloud Pak for Data
The same redaction applies in IBM watsonx.data:
Redacted data in IBM watsonx.data
Summary
With the introduction of IBM Cloud Pak for Data 5.0.3, data protection rules can be applied not only to tables that have been ingested into IBM watsonx.data, but also to remote data, that is, external data sources that are linked to an IBM watsonx.data instance. We can use data protection rules to mask the data in particular columns, to allow or deny access to a table, or to filter rows in a table (including or excluding rows that contain particular values). These protections are in effect when users view the remote data, whether in IBM watsonx.data or in IBM Cloud Pak for Data.