File and Object Storage

 View Only

IBM Storage Ceph integration with Snowflake

By Daniel Alexander Parkes posted Fri June 14, 2024 02:50 AM

  

IBM Storage Ceph Snowflake Integration

 

Data Challenges with Cloud and On-prem Environments

Today, businesses integrate various datasets from different sources to derive valuable business insights promptly. Conventional analytics infrastructures, often reliant on specialized hardware, can lead to data silos, lack scalability, and result in escalating costs over time.

The rise of modern analytics architectures in public cloud-based SaaS environments has helped overcome many limitations, allowing for efficient operations and the ability to adapt dynamically to changing workload demands without compromising performance.

However, despite these advancements, not all organizations can realistically shift entirely to a cloud-based environment. Several crucial reasons exist for retaining data on-premises, such as regulatory compliance, security concerns, cost considerations, etc.

Consequently, many organizations are exploring the benefits of hybrid cloud architectures, making their datasets from on-premises object-based data lake environments available to Cloud SaaS data platforms like Snowflake.

 

A hybrid cloud solution with Snowflake and IBM Storage Ceph

Snowflake is a cloud-based data platform that enhances data-driven insights by allowing governed access to vast amounts of data for collaboration and analysis. Thanks to its native support for the S3 API, it can unify diverse data sources and integrate seamlessly with on-premises solutions like IBM Storage Ceph. This integration enables Snowflake to leverage Ceph's robust, scalable storage capabilities, effectively bringing cloud data warehouse functionalities into the on-premises environment while ensuring comprehensive data control and security.

IBM Storage Ceph is open-source, software-defined, runs on industry-standard hardware, and has best-in-class coverage of the lingua Franca of object storage, the Amazon S3 API. Ceph was designed from the ground up as an object store, contrasting with approaches that bolt on S3 API servers to a distributed file system. With Ceph, data placement is by algorithm instead of by lookup. This allows Ceph to scale well into the billions of objects, even on modestly sized clusters. Data stored in Ceph is protected with efficient erasure coding through in-flight and at-rest checksums encryption and robust access control that thoughtfully integrates with enterprise identity systems; it is the perfect complement to Snowflake for establishing a security-first hybrid cloud data lake environment. 

IBM Storage Ceph is now a Snowflake supported S3 compatible storage vendor; using Ceph's S3-compatible APIs, enterprises can configure Snowflake to access data stored on Ceph through external S3 stages or external S3 tables, enabling efficient queries without requiring data migration to and from the cloud.

IBM Storage Ceph Object: An ideal platform for data lakes

 

IBM Storage Ceph Storage is the perfect platform for creating data lakes or lakehouses with key advantages:

 

  • Cost-effectiveness: IBM Storage Ceph utilizes commodity hardware and open-source software to reduce the upfront infrastructure costs.

  • High scalability: IBM Storage Ceph allows horizontal scaling to accommodate large volumes of growing data in a data lake or lake house.

  • High flexibility: IBM Storage Ceph can handle various data types, including structured, semi-structured, and unstructured data (for example, text, images, videos, sensor data, and so forth), making it versatile and appropriate for data lakes.

  • High availability: IBM Storage Ceph is designed to provide durability and reliability for the data stored in a data lake or lake house. Data is always accessible despite hardware failures or disruptions in the network through data replication across multiple geographic locations, providing redundancy and fault tolerance to prevent data loss.

  • High performance: IBM Storage Ceph allows for parallel data access and processing through integration with data analytical frameworks to enable high throughput and low latency for data ingestion, processing, and analysis within a data lake or lake house. Ceph Object also provides a cache data accelerator (D3N) and Query pushdown with S3 Select.

  • Data governance: IBM Storage Ceph provides efficient management of metadata to enforce data governance policies, track data lineage, monitor data usage and provide valuable information about the data stored in the data lake, such as format, data source and so forth

  • Security: IBM Storage Ceph has a broad feature set regarding security: Encryption, external identity integration, Secure Token Service, IAM roles/policies, per-object granular authorization, Object Lock, Versioning, and MFA delete. 

 

Example of Connecting Snowflake with data residing on S3 On-Prem IBM Storage Ceph 

The most common way of accessing external S3 On-prem object storage from Snowflake is to create an External Stage and then use the Stage to copy the data into Snowflake or access it directly using an External Table.

Next, we will provide two simplistic examples for reference:

For context, our On-prem Ceph cluster.

Our IBM Storage Ceph Cluster has an S3 Object Gateway configured at “s3.cephlabs.blue”, we have a bucket called ecommtrans with a .csv file with the “transactions”

$ aws s3 ls s3://ecommtrans/transactions/                                       

2024-06-04 11:33:54   13096729 transaction_data_20240604112945.csv

The .csv file has the following format defined.

client_id,transaction_id,item_id,transaction_date,country,customer_type,item_description,category,quantity,total_amount,marketing_campaign,returned

799315,f47b56a5-2392-4d7c-a3fe-fad18c8b0901,a06210e5-217f-4c3d-8ab9-06e1d8f605e2,2024-03-17 20:35:26,DK,Returning,Smartwatch,Electronics,3,1790.2,,False

858067,9351638c-9d23-4d32-9218-69bbba6b258d,858aa970-9a95-4c99-8b64-d783129dd5cb,2024-02-13 16:18:42,ES,New,Dress,Clothing,4,196.96,,False

528665,7cc494c8-a19d-4771-9686-989d7dfa4c96,0bb7529b-59e8-4d15-adb8-c224b7d7d5b9,2024-03-04 

Example 1. Copy data from an S3 External Stage provided by IBM Storage Ceph

In the Snowflake UI, we open a new SQL worksheet and run the following SQL command:

-- Set the Role

USE ROLE accountadmin;

-- Set the Warehouse

USE WAREHOUSE compute_wh;

-- Create the Database

CREATE OR REPLACE DATABASE onprem_database_ingest;

-- Create the Schema

CREATE OR REPLACE SCHEMA onprem_database_ingest.raw;

-- Use the Database and Schema

USE DATABASE onprem_database_ingest;

USE SCHEMA onprem_database_ingest.raw;

-- Create the External Stage

CREATE OR REPLACE STAGE IBM_STORAGE_CEPH_INGEST_STAGE

  URL = 's3compat://ecommtrans/'

  ENDPOINT = 's3.cephlabs.blue'

  CREDENTIALS = (AWS_KEY_ID = 'XXXXX' AWS_SECRET_KEY = 'YYYYYYY')

 FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1 FIELD_DELIMITER = ',' NULL_IF = (''));

-- Create the External Table with defining expressions for each column

CREATE OR REPLACE TABLE onprem_database_ingest.raw.transactions

(

  client_id VARCHAR(16777216),

  transaction_id VARCHAR(16777216),

  item_id VARCHAR(16777216),

  transaction_date TIMESTAMP_NTZ(9),

  country VARCHAR(16777216),

  customer_type VARCHAR(16777216),

  item_description VARCHAR(16777216),

  category VARCHAR(16777216),

  quantity NUMBER(38,0),

  total_amount NUMBER(38,0),

  marketing_campaign VARCHAR(16777216),

  returned BOOLEAN

);

LIST @IBM_STORAGE_CEPH_INGEST_STAGE/transactions/;

---> copy the Menu file into the Menu table

COPY INTO onprem_database_ingest.raw.transactions

FROM @IBM_STORAGE_CEPH_INGEST_STAGE/transactions/;

-- Sample query to verify the setup

SELECT * FROM onprem_database_ingest.raw.transactions

LIMIT 10;

 

 

Example 2. Create an external table from an S3 External Stage provided by IBM Storage Ceph

 

In the Snowflake UI, we open a new SQL worksheet and run the following SQL command:

-- Set the Role

USE ROLE accountadmin;

-- Set the Warehouse

USE WAREHOUSE compute_wh;

-- Create the Database

CREATE OR REPLACE DATABASE onprem_database_ingest_trans;

-- Create the Schema

CREATE OR REPLACE SCHEMA onprem_database_ingest_trans.raw;

-- Use the Database and Schema

USE DATABASE onprem_database_ingest_trans;

USE SCHEMA onprem_database_ingest_trans.raw;

-- Create the External Stage

CREATE OR REPLACE STAGE IBM_STORAGE_CEPH_INGEST_STAGE_TRANS

  URL = 's3compat://ecommtrans/'

  ENDPOINT = 's3.cephlabs.blue'

  CREDENTIALS = (AWS_KEY_ID = 'XXXXX' AWS_SECRET_KEY = 'YYYYY')

  FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1 FIELD_DELIMITER = ',' NULL_IF = (''));

LIST @IBM_STORAGE_CEPH_INGEST_STAGE_TRANS/transactions/;

-- Create the External Table with defining expressions for each column

CREATE OR REPLACE EXTERNAL TABLE onprem_database_ingest_trans.raw.trans_external

(

  client_id STRING AS (VALUE:"c1"::STRING),

  transaction_id STRING AS (VALUE:"c2"::STRING),

  item_id STRING AS (VALUE:"c3"::STRING),

  transaction_date TIMESTAMP AS (VALUE:"c4"::TIMESTAMP),

  country STRING AS (VALUE:"c5"::STRING),

  customer_type STRING AS (VALUE:"c6"::STRING),

  item_description STRING AS (VALUE:"c7"::STRING),

  category STRING AS (VALUE:"c8"::STRING),

  quantity NUMBER AS (VALUE:"c9"::NUMBER),

  total_amount NUMBER AS (VALUE:"c10"::NUMBER),

  marketing_campaign STRING AS (VALUE:"c11"::STRING),

  returned BOOLEAN AS (VALUE:"c12"::BOOLEAN)

)

LOCATION = @IBM_STORAGE_CEPH_INGEST_STAGE_TRANS/transactions/

FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1 FIELD_DELIMITER = ',' NULL_IF = (''))

REFRESH_ON_CREATE = FALSE

AUTO_REFRESH = FALSE

PATTERN = '.*.csv';

-- Refresh the metadata for the external table

ALTER EXTERNAL TABLE onprem_database_ingest_trans.raw.trans_external REFRESH;

-- Sample query to verify the setup

SELECT * FROM onprem_database_ingest_trans.raw.trans_external

LIMIT 10;

 

 

Closing comments

Hybrid cloud architectures are increasingly being utilized, incorporating on-premises solutions like IBM Storage Ceph and cloud-based SAAS platforms like Snowflake. IBM Storage Ceph, which is now supported by Snowflake as an S3-compatible store, makes it possible to access on-premises datalake datasets enhancing Snowflake's data warehousing capabilities. This integration establishes a secure, scalable, cost-effective hybrid data lake environment.


#Highlights
#Highlights-home

0 comments
19 views

Permalink