Authors: @CHINMAYA MISHRA @KEDAR KARMARKAR @GOPIKRISHNAN VARADARAJULU
This blog describes an IBM Data & AI solution using IBM watsonx.data and IBM Storage Scale. It highlights how IBM watsonx.data's advanced analytics features and IBM Storage Scale's robust enterprise storage capabilities come together to build a scalable and cost-effective Data and AI platform. The solution builds upon IBM watsonx.data v2.0.2+ and the new High Performance S3 protocol service (non-containerized) included in IBM Storage Scale 5.2.1+.
For a detailed description of the solution, see the IBM Redpaper https://www.redbooks.ibm.com/abstracts/redp5743.html
About IBM watsonx.data
IBM watsonx.data empowers enterprises to scale analytics and AI workloads. Built on an open lakehouse architecture, it allows open source technologies and proprietary products to coexist. Key features include:
- An architecture that fully separates compute, metadata, and storage, offering industry-leading flexibility and lower costs.
- Next-generation engines such as Presto and Spark that provide fast, reliable, and efficient processing of big data at scale.
- Open formats, such as Apache Iceberg, that allow different engines and IBM proprietary solutions (e.g. Db2 Warehouse and Netezza) to access and share the data at the same time.
- Leverages cost-effective, scalable object storage available across multi-clouds.
- Integration with a robust ecosystem of IBM’s best-in-class solutions, governance capabilities and third-party services.
The top use cases for IBM watsonx.data are as follows:
- Rapid analytics with data virtualization in Presto, with 35+ connectors to external databases, HDFS and object stores from various vendors.
- Data lake modernization: Modernize Hadoop data lakes with Apache Iceberg and object stores.
- Data warehouse optimization: Replace ETL jobs with Spark, and reduce costs of data warehouses by “right sizing” workloads.
- Streamline data engineering: Reduce data pipelines, simplify data transformation, and enrich data for consumption using Spark, SQL, Python, or an AI-infused conversational interface.
- Prepare Data for AI: Acquire, transform and prepare data efficiently for use by AI with Spark and Milvus Vector Database. Vectorized embedding capabilities in Milvus enable Retrieval Augmented Generation (RAG) use cases for AI inference.
watsonx.data supports the vendor-agnostic open table format Apache Iceberg, which lets different engines access the same data at the same time and so enables data sharing across multiple repositories (e.g. data warehouses and data lakes). Through metadata integration, new technology can be used with existing data, and users can migrate data and workloads at their own pace. For more information, see https://www.ibm.com/products/watsonx-data
About IBM Storage Scale
IBM Storage Scale (formerly known as IBM Spectrum Scale or IBM General Parallel File System (GPFS)) is industry-leading storage software for file and object storage. It offers market-leading performance, scalability and reliability, along with a wealth of sophisticated data management capabilities, to meet the demands of AI, big data, analytics, and HPC workloads. With IBM Storage Scale, customers can build a highly scalable Global Data Platform for their Lakehouse, offering higher performance and a cost advantage. The Global Data Platform, powered by IBM Storage Scale, offers the following differentiated data services:
- Data Access Services: Rich multi-protocol access to data, including NFS, SMB, Object, HDFS, CSI and POSIX protocols.
- Storage Abstraction and Acceleration Services: Abstract and virtualize remote data silos from any cloud, edge or legacy environment, whether in Object, File or HDFS format, so that they are managed under a common storage namespace and accelerated for high-performance data access.
- Data Management Services: Comprehensive Information Lifecycle Management (ILM) services that organize customers’ data in cost/performance optimized tiers, based on an organization's retention, archiving and data governance goals.
- Data Resiliency Services: Identify and detect threats to protect an organization's data. The data resilience services align with the NIST security framework, from practicing cyber hygiene before an event all the way through detection, response, and recovery.
For more information, see https://www.ibm.com/products/storage-scale
The Solution Architecture
The solution architecture, as shown in the figure below, consists of a compute layer with IBM watsonx.data software deployed on Red Hat OpenShift Container Platform. IBM Storage Scale provides the storage environment and is deployed outside OpenShift, in a non-containerized environment. The compute infrastructure consists of:
- Red Hat OpenShift container cluster running IBM watsonx.data applications, including Presto and Spark.
- A shared metadata service powered by Hive Metastore (HMS)
- Milvus vector database service
The storage infrastructure consists of:
- IBM Storage Scale file systems holding the data
- Active File Management (AFM) for storage abstraction and acceleration
- S3 Data Access protocol service for high performance object access.
Depending on the customer's use case, IBM Storage Scale is leveraged in this solution in either of the following two ways, or a combination of both:
- As the primary object storage layer for the Lakehouse. The data buckets reside locally on the Storage Scale file system itself.
- As a persistent cache and storage acceleration layer for accessing remote object stores globally dispersed across various clouds, data centers and locations.
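The second mode above can be pictured with a toy read-through cache. This is only a conceptual sketch in Python, not how AFM is implemented: the class name `ReadThroughCache`, the in-memory dictionaries standing in for the remote object store and the local file system, and the object key are all hypothetical.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy model of a persistent cache fronting a slower remote object store."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote   # stands in for the remote object store
        self._local: Dict[str, bytes] = {}  # stands in for the local file system
        self.remote_reads = 0               # how often the remote store was contacted

    def get(self, key: str) -> bytes:
        # On a miss, fetch the object once from the remote store and keep it locally.
        if key not in self._local:
            self._local[key] = self._fetch_remote(key)
            self.remote_reads += 1
        # Hits are served from the local copy without touching the remote store.
        return self._local[key]


# Hypothetical remote bucket holding a single object.
remote_store = {"sales/2024.parquet": b"remote-bytes"}
cache = ReadThroughCache(remote_store.__getitem__)

first = cache.get("sales/2024.parquet")   # miss: fetched from the remote store
second = cache.get("sales/2024.parquet")  # hit: served from the local copy
```

In this sketch, repeated reads of the same object reach the remote store only once, which is the effect the acceleration layer is meant to provide for query engines.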
The S3 service exposes the buckets (local or accelerated) to IBM watsonx.data for attachment to a query engine such as Presto or Spark. Multiple instances of Spark and Presto engines connect to the IBM Storage Scale S3 service using S3 protocol to access these data buckets.
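As a rough illustration of what connecting over the S3 protocol involves, the sketch below builds the standard Hadoop S3A properties that a generic Spark job would use to reach an S3-compatible endpoint. It is not the watsonx.data registration procedure, which is handled by the product itself. The helper name `s3a_options`, the endpoint URL and the credential placeholders are assumptions for illustration only.

```python
from typing import Dict


def s3a_options(endpoint: str, access_key: str, secret_key: str,
                use_tls: bool = True) -> Dict[str, str]:
    """Return Spark/Hadoop S3A settings for an S3-compatible endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style addressing is commonly needed for non-AWS S3 services.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_tls).lower(),
    }


# Hypothetical endpoint and placeholder credentials, not real values.
options = s3a_options(
    "https://scale-s3.example.com",
    "ACCESS_KEY_PLACEHOLDER",
    "SECRET_KEY_PLACEHOLDER",
)
```

Each key/value pair could then be passed to a Spark session builder through its `config(key, value)` method, after which data in the bucket is addressed with `s3a://` paths.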
This solution paves the way for a disaggregated architecture in which the compute and storage layers can be managed, operated, scaled and grown independently of each other.
Value proposition of IBM Storage Scale for watsonx.data
Although IBM watsonx.data can process data from various sources, data sprawl across multiple storage silos makes the management, visibility, governance and security of that data a challenge for enterprises. The Global Data Platform powered by IBM Storage Scale offers the following top benefits for watsonx.data applications:
- Storage abstraction and virtualization, eliminate silos
Virtualize and abstract dispersed storage silos all over the enterprise and manage them from a common storage namespace. Reduce unnecessary data copies and improve efficiency, security and governance.
- Accelerate storage where performance matters
Perform automatic, transparent caching of back-end storage. Accelerate data queries and improve economics by fronting lower-performance storage.
- Simplify data integration with multi-protocols
With multi-protocols, eliminate multiple copies of the same data for traditional applications, analytics and AI. This facilitates in-place analytics and simplifies enterprise-wide data workflows starting from data cleansing all the way to AI.
- A proven platform for Performance, Scalability and Growth
Proven performance for HPC, Analytics and AI workloads. Extreme performance for AI with GPU Direct Storage (GDS) for NVIDIA platforms. Support for billions of objects and exabyte-scale storage capacity, thanks to its distributed architecture.
Multiprotocol access to the same data, together with multiple performance tiers for storage, eliminates copies and improves storage economics. The IBM Storage Scale System provides industry-leading storage density, enabling savings in power, real estate and cooling costs.
- A Lakehouse optimized for AI
- With multi-protocol support, enable a unified data platform for analytics and AI, and simplify data workflows.
- Extreme performance for AI with GDS and NVIDIA, for training Generative AI models faster.
- Using Milvus, enable Retrieval Augmented Generation (RAG) use cases at scale across large datasets residing on IBM Storage Scale.
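The retrieval step behind RAG can be sketched as a nearest-neighbor search over embeddings. The plain-Python example below illustrates the idea only and does not use the Milvus API; the document IDs and the three-dimensional vectors are made up, and real embeddings would come from a model and have far more dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k documents whose embeddings are most similar to the query."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical embeddings for three documents.
index = {
    "doc-storage": [0.9, 0.1, 0.0],
    "doc-analytics": [0.1, 0.9, 0.1],
    "doc-ai": [0.0, 0.2, 0.9],
}

# The query vector is closest in direction to "doc-storage".
hits = top_k([0.8, 0.2, 0.0], index, k=2)
```

In a RAG pipeline, the text of the retrieved documents would be added to the prompt sent to the language model. A vector database takes over the indexing and search shown here so that it works at scale across large datasets.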
Conclusion
The accelerated growth of AI across enterprises, the ever-growing volume of data, and the multiple varieties, formats and locations of that data all add to the complexity and cost of building a modern AI and Analytics solution.
IBM watsonx.data, with its innovative and open-standards-based software, together with IBM Storage Scale’s proven leadership in high-performance and scalable storage, forms a modern and cost-effective solution for Data and AI. This helps organizations expand AI pilot projects to production by providing the right tools, platforms and software-defined storage on which to run it all.
Thank you for reading the blog. Your comments and feedback are appreciated.