Netezza Performance Server



Enforcing data governance and virtualization in Netezza for Cloud Pak for Data

By Shruthi Subbaiah Machimada posted Wed November 25, 2020 01:31 PM

Data Virtualization in Cloud Pak for Data lets you replace an expensive data lake with a simpler alternative: it makes your heterogeneous, multi-cloud data enterprise look, act, and feel like a single database, and it can perform as well as traditional methods. You can also enforce data governance and protection across your whole data enterprise through the deep integration between Data Virtualization and Watson Knowledge Catalog, both of which are part of Cloud Pak for Data.
Customers typically have two options for bringing data together from across their hybrid multi-cloud data enterprise:
  • Building a data lake
They can build a data lake or warehouse, ETL data from multiple sources into it, and run all analytics against that single data lake (that is, a single database).
The SQL engine in the database can then combine data to provide insight, and reporting and data science tools can access the data through a single interface. However, a separate data lake is expensive to build and maintain, and it requires constant data movement or ETL as data is updated; otherwise the data in the lake is always out of date.
  • Accessing each source directly
The other option is to leave data where it is and draw it as needed from each source into a reporting tool or application.
This avoids the cost of building a data lake and the constant cost of maintaining ETL jobs.
But the complexity of managing many heterogeneous sources is pushed onto data analysts, data scientists, and application developers. Each project sees and uses data differently, which creates issues in governance, access, and interpretation. Also, reporting tools are typically poor at joining data from multiple sources efficiently.

Data Virtualization
We now provide a third option through Cloud Pak for Data: Data Virtualization. It supports 12 heterogeneous data sources, including Netezza, through a single powerful SQL interface (the Db2 SQL engine). It also integrates data protection and governance with your data sources, thanks to its tight integration with the full Cloud Pak for Data (CPD) platform.
By combining Watson Knowledge Catalog with Data Virtualization, you can access, curate, categorize, and share data, knowledge assets, and their relationships, wherever they reside. You can use data masking to avoid exposing sensitive data: when queried, masked columns return disguised data. Masking applies only to the result sets of queries; the original data in tables and columns remains untouched. However, data masking does not stop Data Virtualization users from connecting to the service and running queries against that data. Users can still join and group data, generate reports, perform analytics, and collect insights using the raw data, with masking applied only to the result set.
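The idea that masking touches only the result set, and never the stored data, can be illustrated with a small sketch. This is not how Watson Knowledge Catalog or Data Virtualization implement masking; it uses Python's built-in sqlite3 module as a stand-in source, and the table, the column names, and the `mask_value` rule are all hypothetical.

```python
import sqlite3


def mask_value(value: str) -> str:
    """Hypothetical masking rule: keep the last four characters, replace the rest with 'X'."""
    if len(value) <= 4:
        return "X" * len(value)
    return "X" * (len(value) - 4) + value[-4:]


def query_with_masking(conn, sql, masked_columns):
    """Run a query and mask the named columns in the result set only."""
    cursor = conn.execute(sql)
    names = [description[0] for description in cursor.description]
    rows = []
    for row in cursor:
        rows.append(tuple(
            mask_value(str(value)) if name in masked_columns else value
            for name, value in zip(names, row)
        ))
    return names, rows


# Stand-in for a source table; sqlite3 is used only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, card_number TEXT)")
conn.execute("INSERT INTO customers VALUES ('Ana', '4111111111111111')")
conn.commit()

sql = "SELECT name, card_number FROM customers"
names, masked_rows = query_with_masking(conn, sql, {"card_number"})

# The stored data is unchanged: a direct read still returns the original value.
raw_rows = conn.execute(sql).fetchall()

print(masked_rows)
print(raw_rows)
```

In this sketch the masked query returns a disguised card number while the direct read still returns the stored value, which mirrors the distinction drawn above between the result set and the underlying table.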
In addition, Data Virtualization provides these benefits over the other options for bringing data together:
  • Access to data through RESTful services or SQL
  • Avoids the high cost of building and maintaining a Data Warehouse or Data Lake and the related ETL
  • Simplifies data access for data analysts, data scientists, and application developers
  • Provides a single, simplified point of management for governance
In typical data federation or virtualization scenarios, query performance may not be as fast as with a data lake. Data Virtualization in Cloud Pak for Data addresses this through:
  • Caching (MQTs): Materialized query tables improve performance by caching query results rather than relying on constant ETL of source data
  • Query pushdown: This reduces the data that must be moved to only what is required to answer a question, and does not require constant movement and ETL of data into a lake
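The two techniques above can be sketched in a few lines. This is only an analogy, not the Data Virtualization implementation: sqlite3 stands in for a remote source, the `sales` table and the helper names are hypothetical, and the `ResultCache` class is a plain in-memory dictionary rather than a Db2 materialized query table.

```python
import sqlite3


def make_source():
    """Create a stand-in 'remote' source with a small, hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("EMEA", 100), ("APAC", 250), ("EMEA", 50), ("AMER", 300)],
    )
    conn.commit()
    return conn


def total_without_pushdown(conn, region):
    """Fetch every row, then filter and aggregate on the caller's side."""
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    total = sum(amount for row_region, amount in rows if row_region == region)
    return total, len(rows)  # len(rows) = rows moved from the source


def total_with_pushdown(conn, region):
    """Push the filter and the aggregate down so the source returns one row."""
    rows = conn.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ?", (region,)
    ).fetchall()
    return rows[0][0], len(rows)


class ResultCache:
    """Serve repeated queries from stored results instead of re-reading the source."""

    def __init__(self, conn):
        self.conn = conn
        self.source_reads = 0
        self._results = {}

    def run(self, sql, params=()):
        key = (sql, params)
        if key not in self._results:
            self.source_reads += 1
            self._results[key] = self.conn.execute(sql, params).fetchall()
        return self._results[key]


source = make_source()
plain_total, plain_rows_moved = total_without_pushdown(source, "EMEA")
pushed_total, pushed_rows_moved = total_with_pushdown(source, "EMEA")

cache = ResultCache(source)
first = cache.run("SELECT SUM(amount) FROM sales WHERE region = ?", ("EMEA",))
second = cache.run("SELECT SUM(amount) FROM sales WHERE region = ?", ("EMEA",))

print(plain_total, plain_rows_moved, pushed_total, pushed_rows_moved, cache.source_reads)
```

Both approaches compute the same total, but the pushdown version moves a single aggregated row instead of the whole table, and the cached query reads the source only once. A real cache also needs a refresh policy so that results do not go stale, which this sketch leaves out.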

Try it with the hands-on lab environment!
A hands-on lab (HOL) environment, running on a full live Cloud Pak for Data cluster, walks you through virtualizing tables from a Netezza data warehouse and adding a data protection rule that masks confidential data from a user. That user will not see the masked data when querying through the Data Virtualization environment or through an end-user application (in this case, a Jupyter notebook).
To get the step-by-step instructions and work in an interactive environment with data from multiple data sources, please request access to the hands-on lab environment.

To request access to the hands-on lab environment, contact your IBM representative.