Cloud Pak for Data

 View Only

Simplified Data Management in Cloud Pak for Data – Introduction to Data Virtualization

By Tanishq Verma posted Wed November 16, 2022 07:55 AM

  

In the ever-growing world of analysis, the businesses are deeply influenced and grow depending on how fast and accurate their operations are based on the domain in which they operate. For each of these data-driven operations, relying on data fetched from disparate and remote data sources, a significant cost is involved in the movement of data into a central repository which is error-prone, and is a major concern due to the overhead involved in tasks like copying the data, ensuring consistency and compatibility with the target location for the operation.

This is where IBM Data Virtualization comes to rescue - with simplified data management, improvised collaboration, reduced costs, and centralized access as some of its features. Data Virtualization helps businesses in variety of use cases, like accessing disparate data with universal query engine, building a trusted data foundation with a data fabric solution, and breaking down data silos and accelerating queries. Data Virtualization hides the complexity, not required for users, while encouraging the development and implementation of strategy based on the virtualized data.

This topic is divided in two blogs that provide step-by-step information on how to use Data Virtualization.


This first blog covers about Data Virtualization providing information on following concepts:

  • What is Data Virtualization and why do we need it?
  • How does Data Virtualization work?
  • How to get access to Data Virtualization in Cloud Pak for Data?
  • Roles in Data Virtualization
  • How to Virtualize?


What is Data Virtualization and why do we need it?

Data Virtualization (DV) provides the ability to view, access, and analyze the organizational business data without the need to be concerned about its location (i.e., data source) and as close to real time as possible.

This read-only feature integrates data sources across multiple types and locations, like traditional structured data, relational databases, social media, Internet of Things (IOT) etc. and turns it into one logical data view without the need for data movement or replication.

The data is maintained at their source locations and made available as one singular view for applications, analytics, data science tools and business users in an organization. Hence the challenges like the constant need of data replication, data inconsistent and unnecessary duplication due to requirement by business users for different kinds of views, etc. are overcome.

The IBM's integrated data and AI platform, the Cloud Pak for Data (CP4D), specializes in its ability to break down organizational silos to provide easy access to all business data. And with IBM Virtual Data Pipeline to instantly provision database clones in short amount of time regardless of the data size, the organizations get more time to focus on the business logic and other operations instead of handling the data from data sources. These two components work together which makes the DV reachable to users and helps them stay ahead in their fields.

How does Data Virtualization work?

The DV comprises of three technologies: IBM's Common SQL Engine, Db2 Big SQL and QueryPlex. The combination of these three technologies provides a unique solution, providing businesses the ability to easily virtualize data without having to go through the effort of merging the data into a common data source.

Some of its characteristics are:

  • DV includes several performance related features to improve the distributed access to data. These includes generating optimized queries that take advantage of native capabilities of the individual data sources such as doing joins and aggregation locally, the ability to generate additional filters based on cross source joins to further limit the data retrieved and the ability to retrieve data results in parallel to multiple worker nodes running within Cloud Pak for Data itself.
  • DV also includes a managed caching layer which can be used to cache data and/or query results and to automatically refresh the cache. IBM’s optimizer is intelligent enough to match incoming queries against the available caches without the user queries specifically referencing them.
  • DV also provides deep policy enforcement in conjunction with Watson Knowledge Catalog (WKC) to ensure that all access to data is properly protected based on the rules and policies defined in WKC.

With these characteristics and technology behind Data Virtualization, its scope can be summarised as:


Scope of Data Virtualization

Connect: The process starts by connecting to data sources. Users can add databases, cloud-based storages, files, etc. For more information, refer Supported data sources in Data Virtualization.

Join, create, and govern: Create virtualized tables and views

Consume: Use the virtualized data in analytics project, catalogs, and other applications

How to get access to Data Virtualization in Cloud Pak for Data?

This service is not available by default. An administrator must install this service on the CP4D platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

Note: On CP4D 4.6 and CP4D on Cloud, the service is referred to as Watson Query in the Services Catalog. And on CP4D 4.0 and 4.5, it is shown as Data Virtualization.

The user who provisions the DV service is automatically assigned the “Data Virtualization Admin” role. After the service is provisioned, the DV Admin can give other users or groups access to the service. The DV Admin is the manager of the DV instance and assigns appropriate DV roles to CP4D users or groups.

To assign DV roles to CP4D users, a Data Virtualization Admin could follow the steps mentioned below:

  1. On the CP4D navigation menu, click Data > Data virtualization


  2. From the service menu, go to Administration > User management and then select the appropriate tab to give access to an individual user or to a platform user group


    List of users with their DV roles

    In CP4D 4.0, the User management option is directly available under the service menu


  3. Click Grant access to assign DV roles to CP4D users and search for the user to assign role to

    Search results in 'Grant access to new users' window
  4. Click the dropdown for the user and select a DV role from the list namely, Admin, Engineer, Steward, and User

    Selecting DV role for new user
  5. Click Add

    Role selected for new DV user
    New user visible in list of DV users

To check/verify access, DV Admin(s) can search the usernames or names in search bar like below:

Checking access for user's account in DV by DV Admin


For more information, readers can refer Managing roles for users and groups in Data Virtualization on Cloud Pak for Data docs.

Roles in Data Virtualization

The roles in Data Virtualization are different from the roles in Cloud Pak for Data. They are not assigned to any CP4D user implicitly. As mentioned before, only a user with DV Administrator role can grant access to other users.
There are four roles in DV, and the following table describes the permissions that are associated with each of the roles:

Roles

Permissions

Data Virtualization Admin

  • Administer the service
  • Administer the database
  • Access data
  • Manage data sources
  • Manage users and assign Data Virtualization roles
  • Create and share any schema
  • Manage data caches
  • Manage data queries
  • Assign virtual tables and views to data requests and catalogs

Data Virtualization Engineer

  • Access connection details
  • Manage data sources
  • Create virtual tables and views
  • Create and manage private schema
  • Assign virtual tables and views to data requests and catalogs

Data Virtualization User

  • Access connection details
  • Create virtual views over existing virtual tables and views
  • Create and manage private schema

Data Virtualization Steward

  • Access connection details
  • Access data
  • Create virtual views over existing virtual tables and views
  • Create and manage private schema

How to virtualize data?

Any user with DV Engineer or DV Admin role can virtualize data from data sources added to Data Virtualization.

Adding a data source can be done in one of the following ways:

  1. Adding a new connection: User enters the connection details for the data source to add the connection
  2. Adding an existing platform connection: User selects the already existing Platform connection from the Platform assets catalog to add the connection. This option is available as 'Existing connection' in CP4D 4.0 and as 'Existing platform connection' in CP4D 4.5
  3. Adding a remote data source: User adds the remote data source to access data to virtualize in DV

Note:

  • To add a platform connection, a DV Engineer or Admin must be added as a viewer or above to CP4D Platform assets catalog first
  • After adding data source, any DV Engineer or Admin can use the data source to create virtual tables, no matter which user added the data source
  • CP4D documentation mentions the supported data sources for DV at Supported data sources in Data Virtualization.


About data used in this blog:

  • This blog makes use of synthetic data based on the UCI German Credit dataset, "Statlog (German Credit Data)". Only a part of the data has been taken and split into 2 parts to be uploaded to databases as "APPLICANT FINANCIAL DATA" and "APPLICANT PERSONAL DATA".
  • This blog also makes use of data based on OpenML dataset called "Titanic".

For example, adding a Db2 Warehouse platform connection as a data source:

  1. On the CPD navigation menu, click Data > Data virtualization.
    The service menu opens to the Data sources page by default (since the user is a DV Admin)

    Available data sources in DV
  2. Click on Add connection > Existing platform connection

    Filter the list, if required, and select the required connection of the Type 'IBM Db2 Warehouse'


    Filtered list of Platform connections of IBM Db2 Warehouse type
  3. Click 'Add'
  4. To enhance parallelism, user can add a remote connector.
    This step is optional and can be skipped by clicking on 'Skip'

    Adding remote connector (optional step)
  5. User should see the updated table of data sources with the Db2 Warehouse data source in 'Active' status.

    Db2 Warehouse connection added as data source with 'Active' status
To virtualize tables from the Db2 Warehouse connection/data source:
  1. Go to 'Virtualize' page from the DV service menu
  2. Add the tables to be virtualized to the cart
    In CP4D 4.5, DV provides two ways to explore through the data sources to select the tables to add to the cart:
    • List type interface
    • Explorer type interface

    The List type interface, also available in CP4D 4.0, provides users with a list of all the available tables to virtualize at once from all the valid data sources. This list is updated whenever the page is loaded or there is a change in the data sources and is often visible to users with a load process running to update the list with latest status of tables in the data sources.

    Note:

    • If the data source(s) contain a very large number of tables, it can take time to prime the meta-data cache. Users might need to reload through the refresh button explicitly to force DV to update the tables from data sources.
    • User must wait until all the values for the results are loaded to proceed with virtualization steps

    List of tables from all valid data sources in DV
    1. Once loaded, users can filter the list based on 3 parameters, namely, Source type, Database and Schema, to view the desired table(s) to virtualize

      Filtered list of tables for Db2 Warehouse database and DVTESTDB2WH schema


      Users can view the data in the table by clicking the eye icon at the end of the row in the list


      Data present in the table APPLICANTFINANCIALDATA
    2. After selecting the desired table(s), they can be added to the cart by clicking 'Add to cart' button

      Selected table to add to cart for virtualization
      Table added to cart for virtualization


    The Explorer type interface, available in CP4D 4.5, provides a more responsive and intuitive way to select tables for virtualization, where users can traverse their way to table(s) through definite path, instead of filtering out the results.

    The DV shows initial view as the list of the data sources (called connections):


    List of all connections in DV - Explorer type interface
    1. Users click on the connection to select
      From the selected connection, users then select the desired schema and view the tables present inside it

      List of all schemas in selected Db2 Warehouse connection
      List of all tables in the DVTESTDB2WH schema of Db2 Warehouse connection

      Users can view the data in the table by clicking the entry in the list


      Data present in the table APPLICANTFINANCIALDATA
    2. Users can then either select the table(s) by clicking the checkbox beside the tables, then click 'Add to cart' button that appears or directly add them to cart by pressing the add-to-cart icon

      Add to cart button visible on clicking the checkbox beside the table entry
      Add to cart icon visible beside the table entry
      View cart button becomes active as cart becomes non-empty
  3. Configure the location and tables for virtualization
    1. Click 'View cart'

      Viewing the items in cart and configuration options for virtualization
    2. Select 'Virtualized data' in the 'Assign to' section. The section gives options to assign the virtualized data to:

      Assign to

      When to use this option

      Data request

      Select Data request if you created the virtual table in response to a data request. Then, choose the appropriate request. The table also appears in Virtualized data.

      Project

      Select Project if you created the virtual table to use in a specific analytics project. Then, choose the appropriate project. The table also appears in Virtualized data.

      Virtualized data

      Select Virtualized data if the table was not created in response to a data request or to use in a specific project. This setting is the default if no data requests or projects exist.


      Note: If the final product user wishes to expose to the Project or Data Request is going to be a combination of multiple virtual tables via joins or other SQL operations, the underlying virtual tables do not need to be assigned at this stage and can remain private in "Virtualized data". Doing this helps to avoid extraneous assets in the project that user would not expect to access directly.
    3. Select 'Publish to catalog', if the virtualized data is needed to be published to a selected catalog.

      Selecting catalog to publish the virtualized table (enforced publishing - disabled)

      It might happen that the 'Publish to catalog' is already selected, showing a catalog selected in the dropdown and the options being disabled. This is "enforced publishing" feature which can be modified or set up by a Data Virtualization Admin. See section "Enforced Publishing" for details


      'Publish to catalog' showing primary catalog (enforced publishing - enabled)
    4. Edit the table and schema names, if required. Note that by default, the schema name is same as the username of the user virtualizing the data.
    5. Edit the schema of the virtual table by selecting 'Edit columns' from the 3-dot menu at the end of the table entry.

      Checking options for the added table in cart
      Here, user can perform operations on the table structure like selecting columns, edit column names and change data types for the virtualized table. Once done, click 'Apply'

      Viewing/Editing schema of virtualized table
  4. Virtualize the table
    1. Click 'Virtualize'
    2. Click 'Continue' on the 'Confirm virtualization' pop up. Wait for the virtualization and publish (if catalog was selected) processes to complete

      Virtualize objects dialog box with successful status for virtualization and publish
    3. Click 'View virtualized data' or navigate from the service menu. Observe the virtualized table in the list of virtual objects

      Observing the recently virtualized table in list of all virtual objects

Virtual Objects

There are two types of virtual objects:

  1. Virtualized table: The virtualization of a table in data source, creates virtualized table
  2. (Join) View: The joining of 2 or more virtualized tables, creates a join view

To create a join view, user must have two or more virtualized tables in the DV instance. There are two ways to create a view in DV:

  1. Using DV 'Join' functionality
  2. Using DV SQL Editor

Data Virtualization Join Functionality

DV provides users a way to create a view from two virtualized tables present in the instance which the users have access to.

Users can follow the steps to create a view, built from joining virtualized tables made from a table in Db2 Warehouse and a collection in MongoDB:

Note: The collection on MongoDB was virtualized in the same way the table from Db2 Warehouse was virtualized, as mentioned at the start of this section where user adds data source to Data Virtualization and then select the table in DV to virtualize it.

This assumes that user has virtualized tables from Db2 Warehouse table and MongoDB


  1. On the CPD navigation menu, click Data > Data virtualization
  2. From the service menu, select Virtualized data
  3. Select the checkboxes for the virtualized tables

    Selecting two virtual objects to join
  4. Click 'Join'

    Viewing available columns in each virtualized table
  5. Select the columns to join the tables on, by joining those columns through drag and drop of column from one section to other; and select the columns to include in the resulting view by checking or unchecking the boxes beside the individual column names

    Joining the key columns from each table


    Note:
    Users can also use “Open in SQL editor” button to use the SQL directly and express more complex SQL operations in this view. This is described in more detail in section "Data Virtualization SQL Editor".

  6. Click 'Next'
    User observes a preview of the view that will be created. Here, user can change column names as desired

    Viewing/Editing column names for the view
  7. Click 'Next'
  8. Select 'Virtualized data' to generate and store the virtualized table in Data Virtualization. However, user can also select appropriate option from the available choices:

    Assign to

    When to use this option

    Data request

    Select Data request if you created the virtual table in response to a data request. Then, choose the appropriate request. The table also appears in Virtualized data.

    Project

    Select Project if you created the virtual table to use in a specific analytics project. Then, choose the appropriate project. The table also appears in Virtualized data.

    Virtualized data

    Select Virtualized data if the table was not created in response to a data request or to use in a specific project. This setting is the default if no data requests or projects exist.


    Assign the data to 'Virtualized data', choose catalog to publish to, as applicable and edit the table and schema names, if required.

    Providing name for the join view with enforced publishing enabled
  9. Click 'Create view'. Wait for the virtualization and publish (if selected/enforced) processes to complete

    Dialog box with successful status for virtualization and publish for Join view
  10. Confirm the entry of the view in 'Virtualized data' page

    Observing the recently created join view in the list of all virtual objects
  11. Confirm the entry of the view in the primary catalog, assuming the catalog was selected

    Data published after virtualization to primary catalog

Enforced Publishing

This feature of DV enforces each new virtualized objects to be published to the catalog, known as "primary catalog". This primary catalog needs to be a "governed" catalog. From Cloud Pak for Data documentation:

“A governed catalog is a catalog that is enabled with the Enforce data policies setting when the catalog is created.”


When Enforce publishing to a governed catalog option is enabled:

  1. The "Publish to catalog" option is checked and disabled when user review their cart and virtualize tables. Hence, the publish process is executed subsequently after the virtualization process.

    Primary catalog selected and option to select any other catalog or option to publish is disabled
    Successful status for view creation and publish to catalog for the created view
  2. In this mode, if a user publishes the virtualized data from DV, they can publish the data only to the primary catalog.



    Details/Status for publish to catalog
    Data published from DV to primary catalog

When Enforce publishing to a governed catalog option is disabled:

  1. The Publish to catalog option is unchecked and enabled for user to choose it and the catalog to publish the table post virtualization. If not selected, the data is only virtualized and assigned to the selected option, i.e., 'Data request', 'Project' or 'Virtualized data', but it is not published to any catalog.

    Option to publish to catalog and dropdown to select catalog is active
  2. In this mode, if a user publishes the virtualized data from DV, they can publish the data to any of the catalog(s), they have Editor/Admin access to.

    Selecting catalog to publish the virtual object due to disabled enforced publishing
    Details/Status for publish to catalog
    Data published from DV to user selected catalog

As a DV Admin, a user can enable or disable enforced publishing by following the steps:
  1. Go to Service settings from the service menu in DV

    General tab in Service settings
  2. Go to 'Governance tab'

    Governance tab in Service settings
  3. Toggle the switch for 'Enforce publishing to a governed catalog'
    Note: Ensure that when turning it on, user has 'Admin' access to a governed catalog to set it as primary catalog.

Data Virtualization SQL Editor

DV provides users a way to query the virtual objects in SQL syntax. It also enables users to create views using SQL in complexity of their choice, whether it be a view on a single virtualized table or multiple virtualized tables/views, whatever the users require.

To create the same view from the previous section, follow the steps to generate SQL query by DV:

  1. On the CP4D navigation menu, click Data > Data virtualization
  2. From the service menu, select Virtualized data
  3. Select the checkboxes for the virtualized tables
  4. Click the Join button
  5. Select the columns to join the tables on, by joining those columns through drag and drop of column from one section to other
  6. Click the 'Open in SQL editor' hyperlink

    Open in SQL editor link to generate SQL query for join view

    The user is navigated to SQL Editor populated with query to create view.

    Here, user can edit view name, column names, remove/add columns and add complex SQL expressions such as functions, case expressions, aggregation, and subqueries to fully express the logic desired. Finally, execute the query to create the view.

    Generated SQL query for join view and its result after execution

Note:

  • Users can directly go to 'Run SQL' from the service menu and execute custom SQL queries to create views.
  • The enforced publishing feature is not applied on view created from SQL Editor. Thus, view will not be published to the primary catalog automatically, even if enforced publishing is enabled; user must publish it explicitly from “Virtualized data” page.

Summary

This blog covered a brief introduction and the need of Data Virtualization. It explained how one could get access to Data Virtualization in Cloud Pak for Data and the four roles of Data Virtualization.

The next blog covers how users could virtualize data and access it in Cloud Pak for Data: Simplified Data Management in Cloud Pak for Data – Working with Virtualized Data

References

Save on data movement and storage costs with IBM Cloud Pak for Data by Mukta Singh, Program Director of IBM Data and AI, IBM

Data Virtualization: The Evolution of the Data Lake by Jim Wankowski, Digital Technical Engagement Team

Managing roles for users and groups in Data Virtualization - Cloud Pak for Data official documentation

Supported data sources in Data Virtualization - Cloud Pak for Data official documentation




#CloudPakforDataGroup
#Highlights
#Highlights-home
0 comments
107 views

Permalink