In the ever-growing world of analytics, how fast and accurately a business can run its data-driven operations deeply influences how it grows within its domain. These operations rely on data fetched from disparate, remote data sources, and moving that data into a central repository carries a significant cost: the process is error-prone, and tasks like copying the data and ensuring consistency and compatibility with the target location add considerable overhead.
This is where IBM Data Virtualization comes to the rescue, with simplified data management, improved collaboration, reduced costs, and centralized access among its features. Data Virtualization helps businesses in a variety of use cases, like accessing disparate data with a universal query engine, building a trusted data foundation with a data fabric solution, and breaking down data silos to accelerate queries. Data Virtualization hides the complexity that users do not need to see, while encouraging the development and implementation of strategy based on the virtualized data.
This topic is divided into two blogs that provide step-by-step information on how to use Data Virtualization.
This first blog introduces Data Virtualization, covering the following concepts:
- What is Data Virtualization and why do we need it?
- How does Data Virtualization work?
- How to get access to Data Virtualization in Cloud Pak for Data?
- Roles in Data Virtualization
- How to virtualize data?
What is Data Virtualization and why do we need it?
Data Virtualization (DV) provides the ability to view, access, and analyze an organization's business data without needing to be concerned about its location (that is, the data source), and as close to real time as possible.
This read-only feature integrates data sources across multiple types and locations, such as traditional structured data, relational databases, social media, and the Internet of Things (IoT), and turns them into one logical data view without the need for data movement or replication.
The data stays at its source location and is made available as one singular view for applications, analytics, data science tools, and business users in an organization. This overcomes challenges like the constant need for data replication, data inconsistency, and the unnecessary duplication that results when business users require different kinds of views.
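The core idea, one logical view over separate sources with the join happening at query time instead of after copying data into a central repository, can be sketched locally. The toy example below is not DV itself: it uses SQLite's ATTACH as a stand-in for federation, and all table and column names are made up for illustration.

```python
import sqlite3

# Toy stand-in for federation: two separate "databases" queried through
# a single session, with the join done at query time (no data copied
# into a central repository). Names here are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("ATTACH ':memory:' AS sales")   # pretend: remote source 1
con.execute("ATTACH ':memory:' AS crm")     # pretend: remote source 2

con.execute("CREATE TABLE sales.orders (id INT, customer_id INT, amount REAL)")
con.executemany("INSERT INTO sales.orders VALUES (?, ?, ?)",
                [(1, 10, 99.5), (2, 11, 25.0)])
con.execute("CREATE TABLE crm.customers (id INT, name TEXT)")
con.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                [(10, 'Ada'), (11, 'Grace')])

# One query spanning both "sources", as if they were a single database.
rows = con.execute(
    "SELECT c.name, o.amount FROM crm.customers c "
    "JOIN sales.orders o ON o.customer_id = c.id ORDER BY c.name"
).fetchall()
print(rows)  # [('Ada', 99.5), ('Grace', 25.0)]
```

In DV, of course, the sources stay remote and the engine decides how much of the join to push down to each source; the sketch only shows the single-logical-view experience.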
IBM's integrated data and AI platform, Cloud Pak for Data (CP4D), specializes in breaking down organizational silos to provide easy access to all business data. And with IBM Virtual Data Pipeline, which can provision database clones in a short amount of time regardless of data size, organizations get more time to focus on business logic and other operations instead of shuttling data between sources. Together, these two components put DV within reach of users and help them stay ahead in their fields.
How does Data Virtualization work?
DV combines three technologies: IBM's Common SQL Engine, Db2 Big SQL, and QueryPlex. Together, they provide a unique solution that gives businesses the ability to easily virtualize data without going through the effort of merging the data into a common data source.
Some of its characteristics are:
- DV includes several performance-related features to improve distributed access to data. These include generating optimized queries that take advantage of the native capabilities of the individual data sources (such as performing joins and aggregations locally), the ability to generate additional filters based on cross-source joins to further limit the data retrieved, and the ability to retrieve data results in parallel across multiple worker nodes running within Cloud Pak for Data itself.
- DV also includes a managed caching layer which can be used to cache data and/or query results and to automatically refresh the cache. IBM’s optimizer is intelligent enough to match incoming queries against the available caches without the user queries specifically referencing them.
- DV also provides deep policy enforcement in conjunction with Watson Knowledge Catalog (WKC) to ensure that all access to data is properly protected based on the rules and policies defined in WKC.
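The cache-matching idea in the second bullet can be illustrated with a deliberately simplified sketch. This is not IBM's optimizer, which matches queries to caches far more intelligently; the toy below only normalizes the query text so that trivially different phrasings of the same query hit the same cached result.

```python
import hashlib

class QueryResultCache:
    """Toy query-result cache: not IBM's implementation, just an
    illustration of matching incoming queries to cached results."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(sql: str) -> str:
        # Normalize case and whitespace so equivalent phrasings map to
        # the same cache entry. (A real optimizer does semantic
        # matching, not mere text normalization.)
        normalized = " ".join(sql.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def put(self, sql: str, rows):
        self._store[self._key(sql)] = rows

    def get(self, sql: str):
        return self._store.get(self._key(sql))

cache = QueryResultCache()
cache.put("SELECT name FROM customers WHERE id = 10", [("Ada",)])
# A differently formatted version of the same query still hits the cache.
print(cache.get("select name   from customers where id = 10"))  # [('Ada',)]
```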
With these characteristics and the technology behind Data Virtualization, its scope can be summarized as:
Connect: The process starts by connecting to data sources. Users can add databases, cloud-based storage, files, etc. For more information, refer to Supported data sources in Data Virtualization.
Join, create, and govern: Create virtualized tables and views.
Consume: Use the virtualized data in analytics projects, catalogs, and other applications.
How to get access to Data Virtualization in Cloud Pak for Data?
This service is not available by default. An administrator must install this service on the CP4D platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.
Note: On CP4D 4.6 and CP4D on Cloud, the service is referred to as Watson Query in the Services catalog. On CP4D 4.0 and 4.5, it is shown as Data Virtualization.
The user who provisions the DV service is automatically assigned the “Data Virtualization Admin” role. After the service is provisioned, the DV Admin can give other users or groups access to the service. The DV Admin is the manager of the DV instance and assigns appropriate DV roles to CP4D users or groups.
To assign DV roles to CP4D users, a Data Virtualization Admin can follow these steps:
- On the CP4D navigation menu, click Data > Data virtualization
- From the service menu, go to Administration > User management and then select the appropriate tab to give access to an individual user or to a platform user group
In CP4D 4.0, the User management option is directly available under the service menu
- Click Grant access to assign DV roles to CP4D users, and search for the user to assign a role to
- Click the dropdown for the user and select a DV role from the list: Admin, Engineer, Steward, or User
- Click Add
To check or verify access, DV Admin(s) can search for usernames or names in the search bar.
For more information, refer to Managing roles for users and groups in Data Virtualization in the Cloud Pak for Data docs.
Roles in Data Virtualization
The roles in Data Virtualization are different from the roles in Cloud Pak for Data. They are not assigned to any CP4D user implicitly. As mentioned before, only a user with the DV Admin role can grant access to other users.
There are four roles in DV, and the following table describes the permissions associated with each role:
| Role | Permissions |
| --- | --- |
| Data Virtualization Admin | Administer the service; administer the database; access data; manage data sources; manage users and assign Data Virtualization roles; create and share any schema; manage data caches; manage data queries; assign virtual tables and views to data requests and catalogs |
| Data Virtualization Engineer | Access connection details; manage data sources; create virtual tables and views; create and manage private schemas; assign virtual tables and views to data requests and catalogs |
| Data Virtualization User | Access connection details; create virtual views over existing virtual tables and views; create and manage private schemas |
| Data Virtualization Steward | Access connection details; access data; create virtual views over existing virtual tables and views; create and manage private schemas |
How to virtualize data?
Any user with DV Engineer or DV Admin role can virtualize data from data sources added to Data Virtualization.
Adding a data source can be done in one of the following ways:
- Adding a new connection: User enters the connection details for the data source to add the connection
- Adding an existing platform connection: User selects the already existing Platform connection from the Platform assets catalog to add the connection. This option is available as 'Existing connection' in CP4D 4.0 and as 'Existing platform connection' in CP4D 4.5
- Adding a remote data source: User adds the remote data source to access data to virtualize in DV
Note:
- To add a platform connection, a DV Engineer or Admin must first be given Viewer access (or above) to the CP4D Platform assets catalog
- After a data source is added, any DV Engineer or Admin can use it to create virtual tables, regardless of which user added it
- The CP4D documentation lists the supported data sources for DV at Supported data sources in Data Virtualization
About data used in this blog:
- This blog uses synthetic data based on the UCI German Credit dataset, "Statlog (German Credit Data)". Only a part of the data has been taken and split into two parts, uploaded to databases as "APPLICANT FINANCIAL DATA" and "APPLICANT PERSONAL DATA".
- This blog also uses data based on the OpenML dataset called "Titanic".
For example, adding a Db2 Warehouse platform connection as a data source:
- On the CPD navigation menu, click Data > Data virtualization.
The service menu opens to the Data sources page by default (since the user is a DV Admin)
- Click on Add connection > Existing platform connection
Filter the list, if required, and select the required connection of the Type 'IBM Db2 Warehouse'
- Click 'Add'
- To enhance parallelism, users can add a remote connector.
This step is optional and can be skipped by clicking 'Skip'
- Users should see the updated table of data sources with the Db2 Warehouse data source in 'Active' status.
To virtualize tables from the Db2 Warehouse connection/data source:
- Go to 'Virtualize' page from the DV service menu
- Add the tables to be virtualized to the cart
In CP4D 4.5, DV provides two ways to explore the data sources and select the tables to add to the cart:
- List type interface
- Explorer type interface
The List type interface, also available in CP4D 4.0, presents users with a list of all the tables available to virtualize, at once, from all the valid data sources. This list is updated whenever the page is loaded or there is a change in the data sources, and users often see a load process running while the list refreshes with the latest status of tables in the data sources.
Note:
- If the data source(s) contain a very large number of tables, it can take time to prime the metadata cache. Users might need to click the refresh button explicitly to force DV to update the tables from the data sources.
- Users must wait until all the results have loaded before proceeding with the virtualization steps
- Once loaded, users can filter the list on three parameters, namely Source type, Database, and Schema, to view the desired table(s) to virtualize
Users can view the data in the table by clicking the eye icon at the end of the row in the list
- After selecting the desired table(s), they can be added to the cart by clicking 'Add to cart' button
The Explorer type interface, available in CP4D 4.5, provides a more responsive and intuitive way to select tables for virtualization: users traverse a definite path to the table(s) instead of filtering the results.
DV's initial view shows the list of data sources (called connections):
- Users click a connection to select it
From the selected connection, users then select the desired schema and view the tables inside it
Users can view the data in the table by clicking the entry in the list
- Users can then either select the table(s) by clicking the checkbox beside the tables, then click 'Add to cart' button that appears or directly add them to cart by pressing the add-to-cart icon
- Configure the location and tables for virtualization
- Click 'View cart'
- Select 'Virtualized data' in the 'Assign to' section. The section gives options to assign the virtualized data to:
| Assign to | When to use this option |
| --- | --- |
| Data request | Select Data request if you created the virtual table in response to a data request. Then, choose the appropriate request. The table also appears in Virtualized data. |
| Project | Select Project if you created the virtual table to use in a specific analytics project. Then, choose the appropriate project. The table also appears in Virtualized data. |
| Virtualized data | Select Virtualized data if the table was not created in response to a data request or for use in a specific project. This setting is the default if no data requests or projects exist. |
Note: If the final product the user wishes to expose to the Project or Data Request will be a combination of multiple virtual tables via joins or other SQL operations, the underlying virtual tables do not need to be assigned at this stage and can remain private in "Virtualized data". This helps avoid extraneous assets in the project that users would not expect to access directly.
- Select 'Publish to catalog', if the virtualized data is needed to be published to a selected catalog.
It might happen that 'Publish to catalog' is already selected, with a catalog chosen in the dropdown and the options disabled. This is the "enforced publishing" feature, which can be modified or set up by a Data Virtualization Admin. See the section "Enforced Publishing" for details
- Edit the table and schema names, if required. Note that by default, the schema name is the same as the username of the user virtualizing the data.
- Edit the schema of the virtual table by selecting 'Edit columns' from the 3-dot menu at the end of the table entry.
Here, users can perform operations on the table structure, such as selecting columns, editing column names, and changing data types for the virtualized table. Once done, click 'Apply'
- Virtualize the table
- Click 'Virtualize'
- Click 'Continue' on the 'Confirm virtualization' pop up. Wait for the virtualization and publish (if catalog was selected) processes to complete
- Click 'View virtualized data' or navigate from the service menu. Observe the virtualized table in the list of virtual objects
Virtual Objects
There are two types of virtual objects:
- Virtualized table: Virtualizing a table in a data source creates a virtualized table
- (Join) View: Joining two or more virtualized tables creates a join view
To create a join view, users must have two or more virtualized tables in the DV instance. There are two ways to create a view in DV:
- Using DV 'Join' functionality
- Using DV SQL Editor
Data Virtualization Join Functionality
DV provides users a way to create a view from two virtualized tables in the instance that they have access to.
Users can follow these steps to create a view built by joining virtualized tables made from a table in Db2 Warehouse and a collection in MongoDB:
Note: The collection in MongoDB was virtualized the same way as the table from Db2 Warehouse, as described at the start of this section: the user adds a data source to Data Virtualization and then selects the table in DV to virtualize it.
This assumes that the user has already virtualized tables from both Db2 Warehouse and MongoDB
- On the CPD navigation menu, click Data > Data virtualization
- From the service menu, select Virtualized data
- Select the checkboxes for the virtualized tables
- Click 'Join'
- Select the columns to join the tables on by dragging a column from one section onto the matching column in the other; select the columns to include in the resulting view by checking or unchecking the boxes beside the individual column names
Note: Users can also use the 'Open in SQL editor' button to work with the SQL directly and express more complex SQL operations for this view. This is described in more detail in the section "Data Virtualization SQL Editor".
- Click 'Next'
The user sees a preview of the view that will be created and can change column names here as desired
- Click 'Next'
- Select 'Virtualized data' to generate and store the view in Data Virtualization, or pick another option; the same 'Assign to' choices described earlier (Data request, Project, Virtualized data) apply here as well.
Assign the data to 'Virtualized data', choose a catalog to publish to as applicable, and edit the table and schema names, if required.
- Click 'Create view'. Wait for the virtualization and publish (if selected/enforced) processes to complete
- Confirm the entry of the view in 'Virtualized data' page
- Confirm the entry of the view in the primary catalog, assuming the catalog was selected
Enforced Publishing
This DV feature enforces that every new virtualized object is published to a catalog known as the "primary catalog". The primary catalog must be a "governed" catalog. From the Cloud Pak for Data documentation:
“A governed catalog is a catalog that is enabled with the Enforce data policies setting when the catalog is created.”
When Enforce publishing to a governed catalog option is enabled:
- The "Publish to catalog" option is checked and disabled when users review their cart and virtualize tables. Hence, the publish process runs immediately after the virtualization process.
- In this mode, if a user publishes the virtualized data from DV, they can publish the data only to the primary catalog.
When Enforce publishing to a governed catalog option is disabled:
- The Publish to catalog option is unchecked and enabled, so users can choose it, along with the catalog to publish the table to after virtualization. If not selected, the data is only virtualized and assigned to the selected option ('Data request', 'Project', or 'Virtualized data'), but it is not published to any catalog.
- In this mode, if a user publishes the virtualized data from DV, they can publish it to any catalog they have Editor or Admin access to.
As a DV Admin, a user can enable or disable enforced publishing by following these steps:
- Go to Service settings from the service menu in DV
- Go to the 'Governance' tab
- Toggle the switch for 'Enforce publishing to a governed catalog'
Note: When turning it on, ensure that the user has 'Admin' access to a governed catalog to set it as the primary catalog.
Data Virtualization SQL Editor
DV provides users a way to query virtual objects using SQL syntax. It also lets users create views in SQL of whatever complexity they need, whether a view on a single virtualized table or on multiple virtualized tables and views.
To create the same view as in the previous section, follow these steps to have DV generate the SQL query:
- On the CP4D navigation menu, click Data > Data virtualization
- From the service menu, select Virtualized data
- Select the checkboxes for the virtualized tables
- Click the Join button
- Select the columns to join the tables on, by joining those columns through drag and drop of column from one section to other
- Click the 'Open in SQL editor' hyperlink
The user is taken to the SQL editor, populated with the query that creates the view.
Here, users can edit the view name and column names, remove or add columns, and add complex SQL expressions such as functions, case expressions, aggregations, and subqueries to fully express the desired logic. Finally, execute the query to create the view.
Note:
- Users can directly go to 'Run SQL' from the service menu and execute custom SQL queries to create views.
- The enforced publishing feature does not apply to views created from the SQL editor. Thus, the view is not published to the primary catalog automatically even if enforced publishing is enabled; the user must publish it explicitly from the 'Virtualized data' page.
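As a concrete sketch of the kind of SQL such a view might use, the snippet below joins the two applicant tables from this blog's dataset, renames a column, and adds a case expression. The column names and the ID join key are hypothetical, and SQLite stands in for the DV engine so the example is runnable on its own:

```python
import sqlite3

# SQLite stands in for the DV engine here; the table and column names
# are hypothetical stand-ins for the virtualized tables in this blog.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE APPLICANT_PERSONAL_DATA (ID INT, AGE INT, JOB TEXT)")
con.execute("CREATE TABLE APPLICANT_FINANCIAL_DATA (ID INT, CREDIT_AMOUNT INT)")
con.execute("INSERT INTO APPLICANT_PERSONAL_DATA VALUES (1, 35, 'skilled')")
con.execute("INSERT INTO APPLICANT_FINANCIAL_DATA VALUES (1, 2500)")

# The kind of statement the SQL editor lets you refine: a join across
# two virtualized tables, a renamed column, and a case expression.
con.execute("""
    CREATE VIEW APPLICANT_DATA AS
    SELECT p.ID AS APPLICANT_ID,
           p.AGE,
           p.JOB,
           f.CREDIT_AMOUNT,
           CASE WHEN f.CREDIT_AMOUNT > 2000 THEN 'high'
                ELSE 'low' END AS CREDIT_BAND
    FROM APPLICANT_PERSONAL_DATA p
    JOIN APPLICANT_FINANCIAL_DATA f ON f.ID = p.ID
""")
row = con.execute(
    "SELECT APPLICANT_ID, CREDIT_BAND FROM APPLICANT_DATA").fetchone()
print(row)  # (1, 'high')
```

In DV, the same CREATE VIEW text would be executed against the virtualized tables rather than local ones.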
Summary
This blog covered a brief introduction to Data Virtualization and the need for it. It explained how to get access to Data Virtualization in Cloud Pak for Data, described the four Data Virtualization roles, and showed how to virtualize data.
The next blog covers how users can work with and consume the virtualized data in Cloud Pak for Data: Simplified Data Management in Cloud Pak for Data – Working with Virtualized Data
References
Save on data movement and storage costs with IBM Cloud Pak for Data by Mukta Singh, Program Director of IBM Data and AI, IBM
Data Virtualization: The Evolution of the Data Lake by Jim Wankowski, Digital Technical Engagement Team
Managing roles for users and groups in Data Virtualization - Cloud Pak for Data official documentation
Supported data sources in Data Virtualization - Cloud Pak for Data official documentation