Originally developed by Aniket Kulkarni and Michael DeRoy for Medium
Data lakes have become an essential tool for storing large amounts of structured and unstructured data on the cloud. Netezza Performance Server on Cloud makes it easy to perform analytics on this data using its robust massively parallel execution engine. This guide will walk through some real world scenarios, covering how to run queries against parquet data stored in a data lake, how to ingest this data into Netezza Performance Server on Cloud, and how to seamlessly query both local data and remote data sitting in the data lake together.
This exercise requires downloading the publicly available New York taxi trip record data (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) for Yellow taxis in January of 2021 and 2022 and uploading it to an accessible S3 bucket.
Querying Data From A Data Lake
In this example, we’ll be performing some basic analysis of taxi fares in New York City. This parquet dataset has already been uploaded onto our data lake in our ‘exampledatalakebucket’. External datasources allow an administrator to grant access to S3 and Azure Blob buckets without providing the keys directly to a user. For this example we’ll create an external datasource to access our ‘exampledatalakebucket’ (for more info on external datasources and how to use them see https://www.ibm.com/docs/en/netezza?topic=tables-create-external-datasource-command)
With our external datasource, we can create an external table that accesses the yellow taxi data from January of 2022. (for more information on how to create external tables, refer to https://www.ibm.com/docs/en/netezza?topic=tables-create-external-table-command)
External parquet format tables can be directly queried like you would any other table within Netezza Performance Server on Cloud without the need to load the data into the database. Note that parquet column names are case sensitive, so double quotes are required when querying specific columns. First, lets figure out how many total passengers were picked up in New York City in January of 2022
Now we’ll run a more complex query. Lets find out which vendor has the most passengers between 1:00am and 6:00am.
Ingesting Data From A Data Lake
We were able to do some basic analysis right in the data lake with no need to load any data, but we may want to load the data to get the performance benefits if we plan on regularly querying this data. Loading data from the data lake into Netezza Performance Server on Cloud is easy! Just do a simple CTAS (create table as select) from the external table to load the data into a native Netezza table
Now you can query the loaded data, taking advantage of improved read/write performance, zonemaps, etc.
You’re not limited to loading every column either. Since parquet is a columnar format, Netezza Performance Engine can load a subset of columns without needing to transfer the entire table over the internet. For large tables this may be a significant reduction in ingress traffic, and also result in faster load times. Note that our query engine will always fetch only the columns that are needed from a parquet table. This is true both when ingesting the data, as well as when querying a parquet table directly.
Performing Analysis With Both Local Data, And Data From The Lake
There will be times when some data is loaded locally, but even more data exists in the data lake. Imagine for example that you only keep the most recent data loaded in the database, and use the data lake as your long term storage. Netezza Performance Server on Cloud makes it easy to seamlessly query both local data and data from the lake together without the need to load it. In our example, we now have the January 2022 taxi data loaded into the database, but we want to compare it with the January 2021 taxi data that is sitting in the data lake. Our first step is to create an external table pointing to the 2021 data.
Now we can query both the local 2022 data that was loaded, and the 2021 data sitting in the data lake all in the same query to figure out which year had the most passengers.
As a more complex example, lets again take a look at passengers who were picked up between 1:00am and 6:00am.
You should now be able to connect to your own data lakes to query and ingest data. Netezza Performance Server on Cloud makes working with data lakes easy