Data source definitions de-mystified: using DSDs with IBM watsonx.data

By Lisa Mallahan posted 14 days ago

  

What is a DSD?

Data source definitions are a new asset type in IBM Cloud Pak for Data 5.0. They are used to organize and manage connections, and to determine whether data protection (e.g., data masking through data protection rules) is performed at the data source, by the chosen protection method, or by IBM Knowledge Catalog. The function of connections does not change in IBM Cloud Pak for Data 5.0; data source definitions can optionally be used to group all connections to the same data source, and to ensure consistent data protection through these connections.

IBM Cloud Pak for Data’s data source definitions support the protection methods used by the following services: IBM Data Virtualization, IBM Security Guardium Data Protection, and IBM watsonx.data. Data sources that do not use any of these protection methods will continue to be protected by IBM Knowledge Catalog.

Note that most of the examples in this article are specific to IBM watsonx.data.

Benefits of using DSDs

Multiple endpoints are understood to describe the same data source: Data source definitions provide a means of grouping all endpoints associated with a particular data source, and of ensuring consistent data protection on data accessed through those different routes. Once several endpoints have been defined as different paths to the same data source, data accessed through any of them is guaranteed to use the same data protection method.
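The grouping idea can be sketched in a few lines of Python. This is only a conceptual model, not an IBM API: the `Endpoint` and `DataSourceDefinition` classes, the `resolve_protection` helper, and the example hostnames are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Endpoint:
    """One route to a data source (hypothetical model, not an IBM API)."""
    host: str
    port: int


@dataclass
class DataSourceDefinition:
    """Groups every endpoint that leads to the same data source."""
    name: str
    protection_method: str
    endpoints: set = field(default_factory=set)


def resolve_protection(endpoint, definitions):
    """Return the protection method that applies to a connection's endpoint."""
    for dsd in definitions:
        if endpoint in dsd.endpoints:
            return dsd.protection_method
    # No matching definition: IBM Knowledge Catalog performs the protection.
    return "IBM Knowledge Catalog"


# Two made-up routes to the same watsonx.data instance.
internal = Endpoint("wxd.internal.example.com", 443)
external = Endpoint("wxd.public.example.com", 443)
lakehouse = DataSourceDefinition(
    name="lakehouse",
    protection_method="IBM watsonx.data",
    endpoints={internal, external},
)

for route in (internal, external, Endpoint("other.example.com", 8443)):
    print(route.host, "->", resolve_protection(route, [lakehouse]))
```

Both routes listed in the definition resolve to the same protection method, while an endpoint that belongs to no definition falls back to IBM Knowledge Catalog.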

Visibility into connections across all components of IBM Cloud Pak for Data: Whether or not you create any data source definitions, a new ‘Connection assignments’ view provides a list of all connections used throughout the IBM Cloud Pak for Data deployment, whether these occur in catalogs, in projects, or in deployment spaces. We can see which connections are governed by data source definitions, where they exist, and who owns them. Additionally, a blue ‘i’ notation shows which connections are of types that are incompatible with data source definitions. (For example, Presto connections can’t be governed by data source definitions; data accessed by a Presto connector cannot be protected or masked at its source, so any data protection applied to these connected assets is performed by IBM Knowledge Catalog instead of by the source.)

No more double-masking: Data source definitions prevent the double-masking that sometimes occurred in previous versions of IBM Cloud Pak for Data. As an example, when a table in IBM watsonx.data is added to a governed catalog in IBM Cloud Pak for Data, and data protection rules are applied to mask particular columns in the table:

  • Without any data source definition, IBM watsonx.data masks the columns needed to comply with the data protection rules, and IBM Knowledge Catalog additionally masks the values it receives from IBM watsonx.data, so already-masked values are masked a second time. In this case, the masked values in IBM watsonx.data and in IBM Cloud Pak for Data do not match.

  • With a data source definition that specifies IBM watsonx.data as the data protection method, IBM Knowledge Catalog delegates the enforcement to IBM watsonx.data, which is responsible for masking data at the source. Whether viewed in IBM watsonx.data or in IBM Cloud Pak for Data, the masked values are the same.
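A toy Python sketch shows why the two cases diverge. The truncated SHA-256 hash used here is only a stand-in chosen for illustration; it is an assumption, not the masking method that IBM watsonx.data or IBM Knowledge Catalog actually applies, and the function names are hypothetical.

```python
import hashlib


def mask(value: str) -> str:
    """Stand-in masking function: a short, deterministic hash of the input."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def value_seen_in_cpd(raw: str, has_dsd: bool) -> str:
    """Model the value shown in IBM Cloud Pak for Data for one masked column."""
    masked_at_source = mask(raw)  # the source masks the value first
    if has_dsd:
        # Enforcement is delegated to the source, so the value passes through.
        return masked_at_source
    # Without a data source definition the value is masked a second time.
    return mask(masked_at_source)


raw = "alice@example.com"
print("at source:      ", mask(raw))
print("CPD, with DSD:  ", value_seen_in_cpd(raw, has_dsd=True))
print("CPD, without:   ", value_seen_in_cpd(raw, has_dsd=False))
```

With the definition in place the two views print the same masked value; without it, the second view is a mask of a mask and no longer matches the source.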

Where to find DSDs

Data source definitions reside in the Connectivity catalog, which is the new name for what was called Platform Connections in pre-5.0 releases (and which is still accessed via the ‘Data’ heading in the main menu).

As with any other catalog, users must be included in Connectivity’s Access Control in order to see it. Additionally, to see the new ‘Data source definitions’ tab in Connectivity, users must have one or both of the new ‘Create data source definitions’ and ‘Manage data source definitions’ permissions.

For users without either of these permissions, Connectivity’s ‘Data source definitions’ tab won’t be displayed. The difference between the two permissions applies only to the new ‘Connection assignments’ view. When viewed by a user with the ‘Create data source definitions’ permission, the view shows only the connections that the user has access to; when viewed by a user with the ‘Manage data source definitions’ permission, it lists all connections (these users can see summary information for connections they don’t have access to, but can get details only on connections they do have access to).
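The visibility rules above can be summarized as a small Python sketch. It is a model of the described behavior only: the `Connection` class, the `connection_assignments` function, and the user and connection names are hypothetical and do not correspond to any IBM API.

```python
from dataclasses import dataclass

CREATE = "Create data source definitions"
MANAGE = "Manage data source definitions"


@dataclass(frozen=True)
class Connection:
    """A connection plus the users who have access to it (hypothetical model)."""
    name: str
    accessible_to: frozenset


def connection_assignments(user, permissions, connections):
    """Return (connection name, details available) rows for a user.

    Returns None when the user holds neither permission, because the
    'Data source definitions' tab is then not displayed at all.
    """
    if not set(permissions) & {CREATE, MANAGE}:
        return None
    rows = []
    for conn in connections:
        has_access = user in conn.accessible_to
        # 'Manage' lists every connection; 'Create' lists only accessible ones.
        if MANAGE in permissions or has_access:
            rows.append((conn.name, has_access))
    return rows


connections = [
    Connection("wxd-prod", frozenset({"ana"})),
    Connection("wxd-dev", frozenset({"raj"})),
]
print(connection_assignments("ana", {CREATE}, connections))
print(connection_assignments("ana", {MANAGE}, connections))
print(connection_assignments("ana", set(), connections))
```

The second element of each row marks whether the user can drill into details, which mirrors the summary-only view that ‘Manage’ users get for connections they cannot access.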

Creating an IBM watsonx.data DSD

Prerequisites:

  • on-prem IBM Cloud Pak for Data, with IBM Knowledge Catalog and IBM watsonx.data services installed
  • an IBM Knowledge Catalog integration, configured in IBM watsonx.data
  • an IBM Cloud Pak for Data governed catalog that includes a connection to IBM watsonx.data

For more information on how to set up the integration between IBM watsonx.data and IBM Knowledge Catalog, or for detailed instructions on creating the IBM Cloud Pak for Data connector to IBM watsonx.data, see: https://developer.ibm.com/tutorials/awb-data-privacy-using-watsonx-data-with-ibm-knowledge-catalog

Process:

When IBM Data Virtualization is included in a Cloud Pak for Data deployment, an IBM Data Virtualization data source definition is automatically generated by the system. But any data source definitions for IBM watsonx.data must be created manually, by a user with one or both of the permissions described above. A data source definition can be created before or after the connections it will describe.

  • Creating a DSD from an existing connection: If an IBM watsonx.data connection has already been created, a corresponding data source definition can be generated from it. Find an instance of this connection on the Data Source Definitions - Connection Assignments tab, expand it, and click ‘Add to data source definition’ and then ‘Create new’.

In the resulting popup, provide a name, select ‘IBM watsonx.data’ as the data source type, and confirm the endpoint details that are automatically retrieved and populated from the connection.

At any point, you can manually add more endpoints to this data source definition, or select another existing connection and click ‘Add to data source definition’ and then ‘Add to existing’.

  • Creating a DSD from scratch: Alternatively, a data source definition can be created prior to the creation of any associated connections, via the ‘New data source definition’ button on the Data Source Definitions - Data Source Definitions tab.

This follows the same UI flow, but the endpoint details need to be provided by the user. In this case, provide IBM watsonx.data’s hostname, port 443, and the instance ID as shown in the IBM watsonx.data URL or in its ‘Instance details’ popup (opened from the ‘i’ icon in the left panel of IBM watsonx.data).

  • Allowing access to the DSD: With either creation method, the new data source definition will be visible only to the person who created it. Determine who else in the organization should be able to see it, and add those individuals by clicking the ‘Add members’ link in the success notification.

If you want all users who have data source definition permissions to be able to see this data source definition, you can change the data source definition from ‘private’ to ‘public’. Alternatively, you can add specific users — but only users who already have data source definition permissions.

NOTE: If you miss the ‘Add members’ link in the success notification, you can find the data source definition in Platform Assets Catalog, and update its access requirements there. It is not possible to do this from the Connectivity catalog.
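The access rules for a data source definition can be modeled roughly as follows. This is a sketch of the behavior described above, not product code: the `DsdAccess` class, its methods, and the user names are hypothetical, and it assumes that only holders of a data source definition permission can ever see a definition.

```python
from dataclasses import dataclass, field

DSD_PERMISSIONS = {
    "Create data source definitions",
    "Manage data source definitions",
}


@dataclass
class DsdAccess:
    """Who can see one data source definition (hypothetical model)."""
    creator: str
    public: bool = False  # a new definition starts out private
    members: set = field(default_factory=set)

    def add_member(self, user, permissions):
        """Add a specific user; only users with a DSD permission qualify."""
        if not set(permissions) & DSD_PERMISSIONS:
            raise ValueError(f"{user} has no data source definition permission")
        self.members.add(user)

    def can_see(self, user, permissions):
        """Apply the private/public rule for a user with given permissions."""
        if not set(permissions) & DSD_PERMISSIONS:
            return False
        if self.public:
            return True
        return user == self.creator or user in self.members


access = DsdAccess(creator="lisa")
manage = {"Manage data source definitions"}
print(access.can_see("raj", manage))   # private, not yet a member
access.add_member("raj", manage)
print(access.can_see("raj", manage))   # now an explicit member
access.public = True
print(access.can_see("ana", manage))   # public: any permitted user
```

Switching `public` to `True` corresponds to changing the definition from ‘private’ to ‘public’, while `add_member` corresponds to adding specific users.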

Summary:

Data source definitions offer a means to define how data protection will be applied to assets from a particular data source. They allow multiple endpoints to be listed for each data source, so that connections following different routes to the data will all be recognized as using the same source. And they ensure the integrity of protected data, by eliminating the possibility of double-masking. Additionally, the new Connection Assignments view lists every connection used across the various components in IBM Cloud Pak for Data, allowing for greater oversight and management of the use of these connections. Try out data source definitions today!

