IBM Storage Ceph integration with Hadoop/Spark

By Daniel Alexander Parkes posted Wed January 31, 2024 02:07 AM

Introduction

Modern cloud-based architectures have an intrinsic need for over-the-network storage, such as object storage. Object storage provides numerous benefits in cloud-native environments, including independent scaling of compute and storage hardware, streamlined operations for larger-scale applications, and strong performance across a wide range of object sizes.

Modern architectures access data over the network instead of relying on the older approach of moving the compute to the nodes where the data is stored.

S3 has quickly become the standard for accessing object storage. IBM Storage Ceph offers a high-performance object storage solution with a rich S3 dialect.

Object storage has become the default storage architecture for modern Hadoop/Spark deployments. This article will cover the benefits of using the IBM Storage Ceph S3 interface with analytical frameworks like Hadoop and Spark.

Why S3 Storage for Data Lakes?

Because data lakes aggregate data from various sources, they can quickly reach the petabyte scale and beyond. This data volume exceeds the capacity of traditional database technologies, such as relational database management systems (RDBMS), which were primarily designed to handle structured data.

S3 storage is reliable, efficient, cost-effective, and able to scale to very large capacities and object counts.

It's worth noting that data lakes can be vulnerable to capacity issues, especially given the vast amount of structured, semi-structured, and unstructured data they accumulate. Fortunately, S3 Object Storage is specifically designed to tackle this challenge by providing a scalable and high-performing solution that can seamlessly ingest and manage these diverse data types.

When it comes to access, more and more analytics applications use the S3 API and take advantage of its features. As a result, the number of tools that can work with object storage-based data lake repositories keeps growing.

Moving from HDFS to S3A

Almost all applications within the Hadoop ecosystem offer built-in support for S3-compatible object storage backends. This has been the case since 2006, when Hadoop first incorporated an S3 client implementation. Hadoop-related platforms use the hadoop-aws module and the aws-java-sdk-bundle to support the S3 API.

Applications can easily migrate between HDFS and S3 storage backends by specifying the appropriate protocol. For S3, the protocol scheme is s3a://, while for HDFS, it is hdfs://.
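As a rough illustration of the scheme switch described above, the following Python sketch maps an hdfs:// URI onto the same path in an s3a:// bucket. The function name to_s3a, the namenode address, and the bucket name datalake are hypothetical and only serve the example; real migrations may also need to change path layouts.

```python
from urllib.parse import urlsplit, urlunsplit


def to_s3a(hdfs_uri: str, bucket: str) -> str:
    """Map an hdfs:// URI onto the same path inside an s3a:// bucket.

    `bucket` is a hypothetical bucket name chosen for this example.
    """
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got: {hdfs_uri}")
    # The HDFS authority (namenode host and port) is replaced by the bucket
    # name; the path itself is kept unchanged.
    return urlunsplit(("s3a", bucket, parts.path, "", ""))


# Example with made-up names: an HDFS warehouse path and a bucket "datalake".
print(to_s3a("hdfs://namenode:8020/warehouse/sales/2024", "datalake"))
```

An application that reads its input location from configuration would then only need the rewritten URI, assuming the S3A client is configured for the target object store.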

The S3 client implementation in the Hadoop SDK has gone through various changes. The most widely used S3 client in the Hadoop ecosystem is S3A (the s3a:// scheme), which is designed for S3-compatible object storage backends.
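To point the S3A client at an S3-compatible endpoint, a handful of Hadoop properties are usually enough. The sketch below builds them as a plain dictionary and renders them as spark-submit arguments using Spark's spark.hadoop. prefix. The endpoint URL and the credentials are placeholders, and enabling path-style access is an assumption that often applies to non-AWS endpoints rather than a requirement stated in this article.

```python
def s3a_properties(endpoint: str, access_key: str, secret_key: str) -> dict[str, str]:
    """Build a minimal set of S3A properties for an S3-compatible endpoint."""
    use_tls = endpoint.startswith("https://")
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Assumption: the endpoint is addressed as host/bucket rather than
        # bucket.host, which is common for on-premises object stores.
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": str(use_tls).lower(),
    }


def as_spark_conf_args(props: dict[str, str]) -> list[str]:
    """Render Hadoop properties as spark-submit --conf arguments."""
    args: list[str] = []
    for key, value in props.items():
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


# Placeholder endpoint and credentials, for illustration only.
props = s3a_properties("http://rgw.example.com:8080", "ACCESS_KEY", "SECRET_KEY")
for arg in as_spark_conf_args(props):
    print(arg)
```

Passing secrets on the command line is shown only to keep the example short; a real deployment would more likely keep them in a protected configuration file or credential store.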

One way to transfer data between various storage backends is by using a Hadoop tool known as distcp. This tool, which stands for distributed copy, requires two parameters: the source and destination. Using any storage backend supported by Hadoop as the source or destination is possible.
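The two-parameter form of distcp can be sketched as a small Python helper that assembles the command line without running it. The source and destination URIs are made-up examples, and the helper name distcp_command is hypothetical; the S3A settings for the destination are assumed to be present in the cluster configuration.

```python
import shlex


def distcp_command(source: str, destination: str) -> list[str]:
    """Assemble a `hadoop distcp <source> <destination>` command line."""
    for uri in (source, destination):
        # Require an explicit scheme so the storage backend is unambiguous.
        if "://" not in uri:
            raise ValueError(f"expected a URI with a scheme, got: {uri}")
    return ["hadoop", "distcp", source, destination]


# Hypothetical source and destination, for illustration only.
cmd = distcp_command(
    "hdfs://namenode:8020/warehouse/sales",
    "s3a://datalake/warehouse/sales",
)
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run on a node where the hadoop client is installed; swapping the two arguments copies in the opposite direction.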

Why choose IBM Storage Ceph for S3 Object Storage?

IBM Storage Ceph provides a first-class, highly compatible S3 API for on-premises deployments.

IBM Storage Ceph confidently meets the needs of critical large-scale installations and the ever-growing demand for data. Its performance scales alongside capacity, resulting in substantial cost savings and the ability to manage exponential data growth.

IBM Storage Ceph offers first-class mission-critical support that exceeds enterprise SLA requirements, with IBM Level 2 support offering direct access to the engineers who write the code.

What enhancements does IBM Storage Ceph bring to Hadoop and Spark?

Hadoop and Spark use the popular S3A interface to access data in an S3 object store like IBM Storage Ceph, instantly unlocking all of the benefits that S3 object storage brings to the table:

Best-in-class implementation of the S3 and S3 adjacent APIs, including IAM and STS
Multi-cluster federated bucket namespaces
Scalability to billions of objects
Unmatched security, including:

FIPS 140-2 accredited in-transit and server-side encryption
IAM and bucket policy
STS with OIDC integration
Bucket versioning
Object lock
MFA delete

On top of the S3 features available in IBM Storage Ceph, there are other enormous benefits for Hadoop and Spark users when using the S3A integration.

Decoupling of storage and compute: with Hadoop and HDFS, storage and compute are tightly coupled, so the two cannot be scaled individually. With the S3A integration and IBM Storage Ceph, compute and storage are decoupled and can each be scaled independently as needed.

With HDFS, connecting different versions of Hadoop and Spark to query the same dataset is not practical. With S3A and object storage, different tooling versions can query the same dataset, which gives data scientists great flexibility.

Notably, IBM Storage Ceph provides erasure-coded pools that offer higher storage efficiency than the original HDFS replication scheme, which by default keeps three full copies of each block. This leads to a significant improvement in storage utilization and cost-effectiveness.
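A short worked calculation makes the efficiency difference concrete. The sketch below compares the raw capacity needed for the same usable capacity under replication and under erasure coding with k data chunks and m coding chunks. The 4+2 and 8+3 profiles and the 1000 TB figure are example values chosen for illustration, not recommendations from this article.

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per usable byte when keeping `copies` full replicas."""
    return float(copies)


def erasure_coding_overhead(k: int, m: int) -> float:
    """Raw bytes stored per usable byte with k data chunks and m coding chunks."""
    return (k + m) / k


def raw_capacity_tb(usable_tb: float, overhead: float) -> float:
    """Raw capacity required to hold `usable_tb` of data at a given overhead."""
    return usable_tb * overhead


usable = 1000.0  # example: 1000 TB of usable data
schemes = [
    ("3x replication", replication_overhead(3)),
    ("EC 4+2", erasure_coding_overhead(4, 2)),
    ("EC 8+3", erasure_coding_overhead(8, 3)),
]
for label, overhead in schemes:
    print(f"{label}: {overhead:.3f}x overhead, {raw_capacity_tb(usable, overhead):.0f} TB raw")
```

Under these example values, three-way replication needs 3000 TB of raw capacity, while the 4+2 profile needs 1500 TB and the 8+3 profile needs 1375 TB for the same usable data.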

IBM Storage Ceph offers out-of-the-box multi-site replication. Hadoop and Spark users can use this feature to add DR capabilities to their data lakes without third-party tools such as WANdisco, reducing the maintenance costs of DR.

IBM Storage Ceph Object Storage provides high-end security: encryption at rest and in transit, advanced authentication and authorization schemes through STS and IAM roles, server-side encryption (SSE-KMS, SSE-C, SSE-S3), MFA delete, audit logs, and Object Lock for immutability.

IBM Storage Ceph integration with S3A

IBM Storage Ceph provides tight integration with the S3A implementation. Several customers use IBM Storage Ceph Object Storage capabilities with S3A to run their analytical frameworks based on Hadoop and Spark successfully.

Joint efforts between IBM Storage Ceph Engineering and one of our customers have extended the S3A functionality in several areas. One example is a custom identity provider that enhances the integration of S3A with the Secure Token Service (STS) S3 feature. This enhancement allowed the customer to replace their RBAC authorization framework with full-fledged attribute-based access control (ABAC), provided by the STS/IAM features in IBM Storage Ceph.

The combination of Secure Token Service authentication with the flexibility of attribute-based access control dramatically simplified the application authentication policy, reducing the workload of the Hadoop platform operators.
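The custom identity provider itself is not described here, but the stock S3A client can already consume temporary STS credentials. As a minimal sketch, assuming a set of temporary credentials has been obtained from STS by some other means, the following Python helper builds the S3A properties that select the temporary-credentials provider shipped with hadoop-aws. The helper name and the credential values are hypothetical, and this is not the customer's implementation.

```python
def s3a_session_properties(access_key: str, secret_key: str, session_token: str) -> dict[str, str]:
    """Build S3A properties for temporary (STS-issued) credentials."""
    if not (access_key and secret_key and session_token):
        raise ValueError("temporary credentials need a key, a secret, and a session token")
    return {
        # Credential provider class shipped with the hadoop-aws module.
        "fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.session.token": session_token,
    }


# Placeholder values standing in for credentials returned by an STS call.
for key, value in s3a_session_properties("TEMP_KEY", "TEMP_SECRET", "TEMP_TOKEN").items():
    print(f"{key}={value}")
```

Because the credentials expire, a job configured this way has to be resubmitted or reconfigured with fresh values, which is one reason a custom provider that renews tokens automatically can be attractive.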

Which customers might be interested?

Customers that currently use community-maintained Hadoop or Spark and are looking to replace HDFS with S3A:

To reduce costs: separating storage from compute allows scaling the storage layer only when required.
To achieve OpEx efficiency: API access for automating operations, and cluster capacity that scales without adding staff.
To achieve CapEx efficiency: runs on low-cost x86 servers with flash drives and/or HDDs.
To increase flexibility: with IBM Storage Ceph and the data lake concept, you can connect different versions of Hadoop to the same dataset.
To modernize the storage stack: customers can vary their ingest scheme by moving to a data streaming approach instead of working only with batch ingest.

Object storage is necessary for customers who want to keep innovating their analytical tool stack and introduce a data lakehouse architecture. IBM Storage Ceph ticks all the boxes for a best-of-breed on-premises S3 object store.
For customers looking to standardize on object storage for all their data, on-premises or in the cloud, IBM Storage Ceph's highly compatible S3 API implementation makes it simple to connect analytical tools to AWS S3 in the cloud or to IBM Storage Ceph on-premises. Ceph is also a unified storage solution that provides block and file storage.

IBM Storage Ceph resources

Find out more about IBM Storage Ceph

IBM Storage Ceph website: https://www.ibm.com/products/ceph
IBM Storage Ceph documentation: http://docs.ceph.blue
IBM Storage Ceph video demos: http://easy.ceph.blue
IBM Redbook. Concepts and architecture of IBM Storage Ceph: https://www.redbooks.ibm.com/abstracts/redp5721.html
IBM Redbook. IBM Storage Ceph use cases: https://www.redbooks.ibm.com/abstracts/redp5715.html
