Simplified Data Management in Cloud Pak for Data – Working with Virtualized Data

By Tanishq Verma posted Thu November 17, 2022 01:18 AM


The previous blog in this topic gave a brief introduction to Data Virtualization and the need for it. It explained how to get access to Data Virtualization in Cloud Pak for Data and the four roles in Data Virtualization. Readers can refer to Simplified Data Management in Cloud Pak for Data – Introduction to Data Virtualization.

This second blog covers how to use the virtualized data from Data Virtualization in Cloud Pak for Data:

  • How to access virtualized data through connectors?
    • Using Watson Knowledge Catalog
    • Using Flight Service
  • Modification of data and its effect on virtualized data
  • Tech notes
    • Working with MongoDB
    • Working with S3 bucket

How to access virtualized data through connectors?

What are connectors, connections and connected data?

Connector: A connector is a piece of software that allows connecting to a data source.

Connection: A connection is the set of details that a connector uses to connect to a data source. From the UI perspective, the user provides the connection details required by the specific connector.

For example, the Data Virtualization connector requires details like the database name, hostname, port, instance ID of the DV instance, authentication method, credentials, SSL connectivity status and SSL certificate (see the sketch after these definitions).

Connected data: Once the connection is established, the data retrieved from the data sources using the connection is called connected data.
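
For illustration, the details listed above could be collected as a simple Python mapping. All values below are placeholders, not real credentials, and the hostname format is explained later in this blog:

# Placeholder values illustrating the details the DV connector asks for;
# the hostname and port are covered in the WKC section of this blog.
dv_connection_details = {
    "database": "bigsql",
    "hostname": "c-db2u-dv-db2u-engn-svc.datacp.svc.cluster.local",
    "port": 50001,
    "instance_id": "11111-1111-11111-1111",
    "authentication_method": "username and password",
    "username": "username",
    "password": "********",
    "ssl_enabled": True,
    "ssl_certificate": "-----BEGIN CERTIFICATE-----...",
}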

 

For more details on connections and connected data, please refer to Connectivity Introduction – Part I, by Virginie Grandhaye, Senior Product Manager, Cloud Pak for Data Connectivity.

Accessing virtualized data using Watson Knowledge Catalog

When data is virtualized in DV with the enforced publishing feature enabled, the data is published to the primary catalog along with a connection named 'Data Virtualization', because the published data is connected data that uses the DV connection.

On a similar note, whenever data is published from DV to WKC or to a project, the DV connection is added to that catalog/project along with the data, if not already present.


When data is virtualized in DV with the enforced publishing feature disabled, the data remains in DV, and the user must explicitly publish it to a catalog.

To add virtualized data from DV, the user can perform one of the following actions:

  1. Publish virtual objects from DV to WKC
  2. Add Data Virtualization connection manually to WKC to add virtual objects as connected data

Publish virtual objects from Data Virtualization console

From the Virtualized data page in DV, select the virtual object(s) and click 'Publish to catalog'.

If publishing is enforced, the user has no choice but to publish the virtualized data to the primary catalog. Otherwise, the user sees a list of available catalogs to publish the virtualized data to.

When data is published from DV to a catalog, the DV connection is added automatically along with the virtualized data, if not already present.

Data published from DV to catalog and presence/addition of DV connection (Enforced publishing is disabled)
Data published from DV to catalog and presence/addition of DV connection (Enforced publishing is enabled)

Add the Data Virtualization connection manually to WKC to add virtual objects as connected data

Adding the DV connection to WKC requires fields specific to DV, like the instance ID, SSL status and SSL certificate, which can be retrieved from Administration > Configure connection in the DV service menu:

Configure connection page in DV

For the hostname and port, the user can use the OpenShift service name for DV in the following format:

<service-name>.<namespace>.svc.cluster.local



The service name for DV is c-db2u-dv-db2u-engn-svc.



DV is always installed in the same namespace as the rest of CP4D, which can be found from the CP4D Monitoring page.

Note: To access the Monitoring page, the user must have the CP4D Administrator role.


  1. From the CP4D navigation menu, go to Administration > Monitoring

    Monitoring under Administration in CP4D navigation menu
  2. Open the 'Events' page

    Monitoring page in CP4D
  3. Search for the ‘Data Virtualization’ service and note the namespace

    Namespace for DV from Events and alerts page

Hence, the hostname becomes: c-db2u-dv-db2u-engn-svc.datacp.svc.cluster.local

And the port number would be the default port used by the service: 50001



This hostname convention is required to make the service visible to services/applications in other namespaces in the same OpenShift cluster.

However, when accessing it from within the same namespace of the OpenShift cluster, the user may specify only the service name (with or without the namespace), for example, in CP4D projects, catalogs and Jupyter notebooks (Flight Service).
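
As a quick sanity check, the endpoint can be probed from a notebook before configuring the connection. This is an illustrative sketch; it assumes the notebook runtime can resolve in-cluster DNS names and uses the hostname from the example above:

import socket

# Probe the DV service endpoint from inside the cluster; the namespace
# ('datacp' here) differs per installation.
host = "c-db2u-dv-db2u-engn-svc.datacp.svc.cluster.local"
port = 50001

with socket.create_connection((host, port), timeout=5):
    print(f"{host}:{port} is reachable")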

The user can create this connection in a catalog by adding a new connection of type ‘IBM Data Virtualization’:

  1. User clicks ‘Add to catalog’ > Connection and searches for the ‘Data Virtualization’ connection type

    Data Virtualization connection to add in catalog
  2. User fills in the details and clicks ‘Test connection’ to test the connectivity

    Filling up the details and testing the connection successfully
  3. User clicks ‘Add’ to add the connection to the catalog

    DV Connection added successfully to catalog
  4. User clicks 'Add to catalog' > 'Connected data', clicks 'Select source' and traverses to select the virtual object to add


    Selecting virtualized data from DV connection to add to catalog
  5. User clicks ‘Select’ to add the table to the catalog as a data asset

    Adding virtualized data successfully to catalog

To add more data from DV, the user can now either publish from DV or add connected data in WKC. From here on, the user can make use of virtualized data like any other data asset.

Note: The user should have Editor/Admin access to the catalog where the data is being published to be able to manage the virtualized data in WKC.

Accessing virtualized data using Flight Service

IBM has added its own flavor to Apache Arrow Flight, which is available through Python libraries in Cloud Pak for Data for faster and hassle-free access to data.

To access virtualized data from DV through Flight Service, the user performs the following actions in Python code in a Jupyter notebook in an analytics project in Cloud Pak for Data:

  1. Get the Flight Service client
  2. Create a data request in the form of a Python dictionary. The request parameters include:
    1. Data source type
    2. Connection properties
      • Database
      • Host
      • Port
      • Instance ID
      • SSL status
      • Username
      • Password
    3. Interaction properties
      • Select statement (SQL syntax)
  3. Send the request and obtain the response in a FlightInfo type object
  4. Read the data from the FlightInfo object into a Pandas DataFrame to present the data in a human-readable format

A sample Python snippet for the same is given below:

import itc_utils.flight_service as itcfs

# Connection details for the DV instance (placeholder values shown here)
DV_INSTANCE_ID = "11111-1111-11111-1111"
USERNAME       = "username"
PASSWORD       = "********"
SCHEMANAME     = "SAMPLE"
TABLENAME      = "VFDATA"
SSL_ENABLED    = "true"

# 1. Get the Flight Service client
flightClient = itcfs.get_flight_client()

# 2. Create the data request: data source type, connection properties
#    and interaction properties (the SELECT statement to run)
data_request = {
    'datasource_type': {
        'entity': {'name': "dv"}
    },
    'connection_properties': {
        'database': "bigsql",
        'port': "50001",
        'host': "c-db2u-dv-db2u-engn-svc",
        'instance_id': DV_INSTANCE_ID,
        'ssl': SSL_ENABLED,
        'username': USERNAME,
        'password': PASSWORD,
    },
    'interaction_properties': {
        'select_statement': f'''SELECT * FROM "{SCHEMANAME}"."{TABLENAME}"
                                ORDER BY CUSTOMERID
                                LIMIT 10;'''
    }
}

# 3. Send the request and obtain the response in a FlightInfo object
flightInfo = itcfs.get_flight_info(flightClient, data_request=data_request)

# 4. Read the data from the FlightInfo object into a Pandas DataFrame
read_vfdata = itcfs.read_pandas_and_concat(flightClient, flightInfo)
read_vfdata

The expected output would look like:

Output from accessing virtualized data from DV through Flight Service

For more details on how to use Flight Service, please refer to Data Transfer Made Faster Using Flight Service on IBM Cloud Pak for Data by Ravi Kumar Kappaganthu, Shivansh Gupta and Tanishq Verma (Strategic Business Partnerships, Palantir).

Modification of data and its effect on virtualized data

To update the data in Db2 Warehouse, the user can simply access the database through its UI and run a SQL query, or run the same query as a static statement through Flight Service. For information on modifying data in a data source using Flight Service, refer to Data Transfer Made Faster Using Flight Service on IBM Cloud Pak for Data by Ravi Kumar Kappaganthu, Shivansh Gupta and Tanishq Verma (Strategic Business Partnerships, Palantir).

The data was modified using the following queries:

  1. The INSERT query to insert two records (rows) into the APPLICANTFINANCIALDATA table of the DVTESTDB2WH schema:
    INSERT INTO "DVTESTDB2WH"."APPLICANTFINANCIALDATA"
    VALUES
    ('00023-06db2whcpd-add-2', 'no_checking', 2022, 'outstanding_credit', 'less_100', 'bank'),
    ('00023-06db2whcpd-add-3', 'less_0', 2022, 'credits_paid_to_date', '500_to_1000', 'stores');
  2. The UPDATE query to update a record (row) in the APPLICANTFINANCIALDATA table of the DVTESTDB2WH schema:
    UPDATE "DVTESTDB2WH"."APPLICANTFINANCIALDATA"
    SET "EXISTINGCREDITSCOUNT" = 2306
    WHERE CUSTOMERID='002d24cb-dd3e-***********';
  3. The DELETE query to delete a record (row) from the APPLICANTFINANCIALDATA table of the DVTESTDB2WH schema:
    DELETE FROM "DVTESTDB2WH"."APPLICANTFINANCIALDATA"
    WHERE CUSTOMERID='00458606-a2fd-***********';


Modifying table in Db2 Warehouse (Base Data Source)

Once the data has been modified, the user can navigate to the SQL Editor in Data Virtualization and run a SQL SELECT statement for the affected rows.

SELECT * FROM "SAMPLE"."VFDATA"
WHERE CUSTOMERID IN ('002d24cb-dd3e-***********',
                     '00023-06db2whcpd-add-2',
                     '00023-06db2whcpd-add-3',
                     '00458606-a2fd-***********')
ORDER BY CUSTOMERID;


Verifying the changes in virtualized table made from Db2 Warehouse table
Results of the query on virtualized table

The results show that the changes were reflected as soon as the data was modified in the base data source.
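
The same verification can also be run from a Jupyter notebook by reusing the Flight Service client from the earlier snippet; only the select_statement changes. A minimal sketch, shown here for the two inserted rows:

# Reuses flightClient and data_request from the Flight Service snippet above;
# only the SQL statement changes, to target the affected rows.
data_request['interaction_properties']['select_statement'] = '''
    SELECT * FROM "SAMPLE"."VFDATA"
    WHERE CUSTOMERID IN ('00023-06db2whcpd-add-2',
                         '00023-06db2whcpd-add-3')
    ORDER BY CUSTOMERID;'''

flightInfo = itcfs.get_flight_info(flightClient, data_request=data_request)
itcfs.read_pandas_and_concat(flightClient, flightInfo)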

Tech Notes

Working with MongoDB

Unable to see collections to virtualize

While virtualizing data from MongoDB, collections created in the database after MongoDB has been added as a data source are not read by DV and do not appear in the list of tables to virtualize.

Observed in: CPD 4.0 and CPD 4.5

Workaround: Delete the data source and add it again

Fix: In progress (IBM Cloud Pak for Data documentation)


Steps to reproduce:

  1. Create a new collection in MongoDB

    Creating new collection in MongoDB
    Inserted document in the newly created collection in MongoDB
  2. Refresh the data in DV and observe that the data for the existing MongoDB connection is not updated with an entry for the new collection

    Filtered tables from MongoDB in DV - no presence of newly created collection
    Available collections in database DVTESTMONGO - no presence of newly created collection
  3. Remove and re-add the MongoDB connection as data source in DV

    Removing existing MongoDB data source from DV
    Filtered list of existing Platform connections
    Successfully adding the MongoDB connection as data source in DV
  4. Explicitly refresh the data through the refresh button in DV, filter/traverse through, and observe the results for the MongoDB data source

    Newly created collection in MongoDB visible in filtered results in DV – List interface
    Newly created collection visible in database “DVTESTMONGO” of the MongoDB data source

Working with S3 bucket

Virtualizing flat files present in an S3 bucket is the same as virtualizing any table in a database. The user can either add a new connection or add an existing Platform connection. To virtualize, the user navigates to the ‘Files’ tab on the ‘Virtualize’ page of Data Virtualization, then drills down the file structure to locate and select the file to virtualize:

  1. On the 'Virtualize' page, user clicks the 'Files' tab to view the available sources to virtualize files from, and selects the endpoint

    Available sources to virtualize files
  2. User selects the root folder, named the same as the bucket, to view its contents

    Root folder (bucket name) of the S3 bucket
    Files/Folders at bucket level
  3. User traverses through the bucket and adds the file to the cart

    Selected CSV file inside a folder in the S3 bucket
  4. User modifies the structure of the virtualized table to be created, if required

    Options for the flat file in DV cart
    Columns and their types in virtualized table
  5. User starts the virtualization process by clicking the ‘Virtualize’ button

    Virtualization status for the file


However, the user must pay attention to the following points while selecting data from an S3 bucket:

Level of file location in S3 bucket

The file to virtualize should not be present at the root level of the bucket, but rather within a folder path. For example, a file at s3://mys3bucket/mydata.csv cannot be virtualized; the user must put that file in a path like s3://mys3bucket/myfilepath/mydata.csv

Alert when selecting file at bucket level to virtualize

Virtualizing a single file in S3 bucket

The URL for the file(s) resolves to the parent path, and everything under that path is virtualized. Hence, to virtualize only a single file, it must be present in a path that contains only that file.

For example, a file mydata.csv to be virtualized should have a path like s3://mys3bucket/myfilepath/mydata.csv, and no other file should share the same parent path, i.e., s3://mys3bucket/myfilepath/

Alert when virtualizing a single file
No alert displayed when selecting the only file present inside a filepath in S3 bucket
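
One way to confirm this before virtualizing is to list the objects under the folder prefix. A sketch, assuming the boto3 library and S3 credentials are configured, using the bucket and path names from the example above:

import boto3

# List objects under the folder prefix; expect exactly one key so that
# DV virtualizes only the intended file.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="mys3bucket", Prefix="myfilepath/")
keys = [obj["Key"] for obj in response.get("Contents", [])]
print(keys)  # expected: ['myfilepath/mydata.csv']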

Data in flat file

In CP4D 4.0 and 4.5, DV does not support flat-file format options such as quoteChar, separatorChar and escapeChar. The only supported separator character is the comma, so the data inside the flat file must not contain commas other than the separators.

This issue has been resolved in CP4D 4.6.

Data with string delimiters and commas in values present in the CSV file in S3 bucket
Error displayed while selecting file with string delimiters and commas in values

If a flat file that delimits strings is virtualized, then previewing the table or querying against it might not work as expected.

Data with no commas in values but string delimiters for values in file

So, users can either remove the string delimiters before virtualizing the table (see the sketch below) or modify their queries to recognize the delimiters as part of the strings.
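
A hedged example of the first option: a small Python script that strips the string delimiters and replaces in-value commas before the file is uploaded to S3. The file names are illustrative, and replacing commas with semicolons is just one possible choice:

import csv

# Read a CSV that quotes its string values, drop the quote characters,
# and replace commas inside values so that commas appear only as separators.
with open("mydata.csv", newline="") as src, \
     open("mydata_clean.csv", "w", newline="") as dst:
    reader = csv.reader(src)  # handles quoted values on read
    writer = csv.writer(dst, quoting=csv.QUOTE_NONE, escapechar="\\")
    for row in reader:
        writer.writerow([value.replace(",", ";") for value in row])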

For detailed information regarding this case, readers can refer to Unexpected preview or query results for virtualized tables over flat files in Data Virtualization in the CP4D documentation.

For more information, readers can refer to Creating a virtualized table from files in Cloud Object Storage in Data Virtualization in the CP4D documentation.

 

Summary

In this topic, “Simplified Data Management in Cloud Pak for Data”, spread across two blogs, we have covered the following points:

  • Data Virtualization enables faster, secure and reliable aggregation of data from different types of data sources without actually copying the data
  • The Enforced Publishing feature publishes every virtual object, whether newly created or already existing, to the primary catalog
  • Getting access to the service and how to use it
  • Data Virtualization updates the data in virtualized tables and any dependent tables as soon as the data in base data sources is modified
  • How to access the virtualized data through connectors in WKC and through Flight Service in Jupyter notebooks
  • Some points to consider while working with MongoDB and S3 data sources

References

Publishing virtual data to a catalog with Data Virtualization - Cloud Pak for Data official documentation

Connectivity Introduction – Part I by Virginie Grandhaye, Senior Product Manager, Cloud Pak for Data Connectivity

Data Transfer Made Faster Using Flight Service on IBM Cloud Pak for Data by Ravi Kumar Kappaganthu, Shivansh Gupta and Tanishq Verma (Strategic Business Partnerships, Palantir)

IBM Watson Data Platform Data Flows Service - Data Asset and Connection Properties by IBM

Unexpected preview or query results for virtualized tables over flat files in Data Virtualization - Cloud Pak for Data official documentation

Creating a virtualized table from files in Cloud Object Storage in Data Virtualization - Cloud Pak for Data official documentation


Special Thanks

For providing deep insight and reviewing the content for the blogs – Bob Neugebauer (Chief Architect, Watson Query and Data Virtualization, Senior Technical Staff Member, IBM Data and AI)

For providing the opportunity to document the findings – Michael Connor (Product Manager, MongoDB, EDB, DataStax OSDb)

For trusting the author with this responsibility – Meg Bernal (Senior Software Engineer, Strategic Build Partnerships)

For guiding and inspiring the team to always move forward – Maureen Kraft (Program Director - IBM Master Inventor, Strategic Build Partnerships, Data & AI)




