
Learn how to deploy IBM Storage Ceph with watsonx.data and consolidate data lakes with object storage made easy

By David Wohlford posted Tue May 09, 2023 01:00 PM

  

The exponential growth of data is driving the adoption of object storage, whose growth rate continues to outpace primary storage solutions. Customers are finding they cannot afford to store massive amounts of data on their primary storage systems and do not want to keep all of their information in the cloud. The cloud is not always the most cost-effective way to store or access data, and customers are always looking to lower their costs and find easier ways to access and control their data. The problem is that not all vendors' object storage is created equal.

Object storage was designed for the cloud, with an architecture that provides a scalable and cost-effective solution for managing and storing vast amounts of data. Customers choose object storage for its scalability, durability, cost-effectiveness, accessibility, metadata management capabilities, and integration with cloud services. These features make object storage a versatile and efficient solution for storing and managing large volumes of unstructured data in use cases ranging from backups and archiving to big data analytics and content distribution. Many object storage solutions, however, have drifted away from the goals of efficiency and global accessibility, trying to make object storage something it was not designed to be.

If an object storage solution is going to make storing thousands, millions, or billions of objects and files cost effective, it needs to use commodity hardware and provide flexibility, ease of scalability, and data redundancy to ensure that the data is protected and available against hardware failures and corruption. This makes object storage suitable for long-term data retention, backups, and disaster recovery scenarios.

IBM Storage is a leader in distributed file and object storage, and with the recent announcement of IBM Storage Ceph it has solidified that position with a new open-source software option that enables customers to build more cost-effective solutions for their object storage requirements. IBM Storage Ceph was created to address four key areas:

  • Scalability: IBM Storage Ceph was designed to overcome the limitations of traditional storage systems by providing a scalable storage solution that can handle the ever-increasing amounts of data generated by modern applications. It allows for the seamless addition of storage nodes, enabling organizations to scale their storage infrastructure easily.

  • Fault Tolerance: IBM Storage Ceph prioritizes fault tolerance and data reliability. It uses a distributed architecture that replicates data across multiple storage nodes, ensuring that data remains accessible even in the event of hardware failures or network disruptions. By eliminating single points of failure and implementing data redundancy techniques, IBM Storage Ceph provides high levels of availability and durability.

  • Open Source: IBM Storage Ceph is an open-source project, which means that its source code is freely available to the public. The open-source nature of the project encourages community collaboration and fosters innovation. It allows developers and organizations to contribute to the project, improve its features, and customize it to their specific needs. The open-source model also promotes transparency and reduces vendor lock-in.

  • Software-Defined Storage: IBM Storage Ceph is a software-defined storage (SDS) solution, which means that it decouples storage management and control from the underlying hardware. This abstraction layer allows organizations to use commodity off-the-shelf hardware for storage, reducing costs and providing flexibility in hardware choices. It also enables the dynamic provisioning and management of storage resources, simplifying storage administration tasks.

In addition, IBM provides full software support for current and previous versions of IBM Storage Ceph and Red Hat Ceph Storage, and includes IBM Storage Insights software for centralized storage management, proactive monitoring and analytics, simplified troubleshooting, capacity planning, cost optimization, and multi-vendor support. IBM Storage Insights adds:

  1. Proactive Monitoring and Analytics: Storage Insights leverages advanced analytics and machine learning capabilities to proactively monitor storage systems and identify potential issues or performance bottlenecks. It provides real-time alerts and notifications to administrators, enabling them to take proactive actions to resolve issues before they impact operations. The analytics-driven insights also help organizations optimize storage performance and capacity planning.

  2. Simplified Troubleshooting and Support: The platform offers troubleshooting tools and recommendations to assist administrators in resolving storage-related problems quickly. It provides diagnostic information, suggested actions, and access to IBM support resources, streamlining the troubleshooting process. This simplifies the support experience and minimizes downtime.

  3. Capacity Planning and Optimization: IBM Storage Insights provides visibility into storage capacity utilization, trends, and forecasts. It helps organizations analyze historical data and predict future capacity requirements, facilitating effective capacity planning. By understanding storage utilization patterns, organizations can optimize their storage resources, avoid unnecessary purchases, and ensure efficient utilization of existing storage assets.
       

    There are a number of software-defined components that make IBM Storage Ceph work, including a monitor, a manager, and an Object Storage Daemon (OSD) that is responsible for storing objects on a local file system and providing access to them over the network. Ceph was built to have no single point of failure, so IBM recommends running a minimum of two to three of each of these services at any time; a cluster can run hundreds of services, and a single host can run more than one service.

    The IBM Storage Ceph "monitor" maintains a map of the entire cluster: it holds a copy of the OSD map, the monitor map, the manager map, and the data placement map, also known as the CRUSH map. These maps are critical for the storage components to coordinate with each other. IBM requires at least three monitors when building storage clusters to ensure high availability and redundancy and to allow the monitors to reach quorum.
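
    Once a cluster is up, these maps can be inspected from any admin host with the standard ceph CLI; the commands below are a quick illustration (output varies by cluster):

        ceph quorum_status      # monitor quorum status and the monitor map
        ceph mon dump           # monitor map
        ceph osd dump           # OSD map
        ceph osd crush dump     # CRUSH map governing data placement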

    IBM Storage Ceph managers keep track of runtime metrics and system utilization, such as CPU performance and disk load. The managers also host the dashboard web GUI, as well as perform most tasks required for configuration and management of the system. IBM also recommends three managers, although two will suffice.
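
    For example, the dashboard runs as a manager module. A minimal sketch of enabling it (exact flags vary by release, and a production deployment would use a real TLS certificate rather than a self-signed one):

        ceph mgr module enable dashboard
        ceph dashboard create-self-signed-cert
        ceph dashboard ac-user-create admin -i password.txt administrator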

    IBM Storage Ceph has an OSD, or "Object Storage Daemon", and it also has OSD nodes, the hosts that run those daemons. The minimum number of OSD nodes to begin with is 3 for test/dev clusters and 4 for production. The OSDs are where your data is stored, and they also handle tasks such as rebalancing and replication. OSDs also send information to your monitors and managers. In nearly all cases you will have one OSD per HDD or SSD in your cluster, so you will most likely have dozens, hundreds, or even thousands of OSDs depending on how big your cluster is; three is the minimum to ensure redundancy and high availability.
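
    To see how OSDs map to hosts and drives, for example:

        ceph osd stat    # how many OSDs exist and how many are up and in
        ceph osd tree    # each OSD with its host, weight, and status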

    The object-based interface, or RADOS Gateway (RGW), is meant to provide the end user with RESTful object storage into the cluster. The RGW currently supports two interfaces: S3 and Swift. One unique feature is that because both APIs operate under the same namespace, an end user can write an object with one API and read it with the other.
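
    Here is a sketch of that cross-API access, using an illustrative endpoint, credentials, and bucket name (Swift authentication details depend on how RGW is configured):

        # Write an object through the S3 API
        aws --endpoint-url http://rgw.example.com:8080 s3 cp report.csv s3://demo/report.csv

        # Read the same object back through the Swift API (shared namespace)
        swift -A http://rgw.example.com:8080/auth/1.0 -U demo:swift -K secretkey download demo report.csv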

    All these nodes have to communicate, and Ceph does this over the network, so IBM recommends a private network for your OSDs and a public network for the rest of the cluster. It is important that OSD traffic is not restricted: OSDs handle self-healing, replication, and moving data through the cluster, so it is best to keep that traffic off the public network. Ceph can run on 1 gigabit networks, but it typically performs best with a 10 gigabit network, and if desired you can use an LACP-bonded pair of 10 gigabit links to provide 20 gigabits of bandwidth on your private network.
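
    In configuration terms, that split is expressed with the public and cluster network options in ceph.conf (the subnets below are placeholders):

        [global]
        public_network  = 192.168.10.0/24   # clients, monitors, managers
        cluster_network = 10.10.10.0/24     # OSD replication, recovery, rebalancing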

What does a configuration of IBM Storage Ceph look like? It's not hard. Start with 3 nodes to get IBM Storage Ceph up and running and see how easy it is to configure and use S3 object storage. When you are ready to go to production, add another node or more, depending on how much data you have and how much performance you need to drive from your object storage. Here are the recommended configurations:

IBM Storage Ceph can run on inexpensive commodity hardware. Small production clusters and development clusters can run successfully with modest hardware.

Minimum recommended hardware, by process:

ceph-osd (STORAGE NODES)

  Processor
  • 1 core minimum
  • 1 core per 200-500 MB/s of throughput
  • 1 core per 1000-3000 IOPS
  • Results are before replication.
  • Results may vary with different CPU models and Ceph features (erasure coding, compression, etc.).
  • ARM processors specifically may require additional cores.
  • Actual performance depends on many factors, including drives, network, and client throughput and latency. Benchmarking is highly recommended.

  RAM
  • 4 GB+ per daemon (more is better)
  • 2-4 GB often functions (but may be slow)
  • Less than 2 GB is not recommended

  Volume storage: 1x storage drive per daemon
  DB/WAL: 1x SSD partition per daemon (optional)
  Network: 1x 1 GbE+ NICs (10 GbE+ recommended)

ceph-mon (MONITORS)

  Processor: 2 cores minimum
  RAM: 2-4 GB+ per daemon
  Disk space: 60 GB per daemon
  Network: 1x 1 GbE+ NICs

ceph-mds (METADATA SERVERS)

  Processor: 2 cores minimum
  RAM: 2 GB+ per daemon
  Disk space: 1 MB per daemon
  Network: 1x 1 GbE+ NICs



Start by using IBM Storage Ceph as a secure backup target for IBM Storage Protect, IBM Storage Defender, Veeam, Commvault, or any other backup application that can use object storage. Customers can use the "object lock" capability to create a cyber-secure backup that cannot be deleted, even by the storage administrator.

Customers can safely store backups and critical data, protecting them from ransomware and other malicious or accidental deletion. It's a simple 4-step process:

  1. Back up your data to IBM Storage Ceph.
  2. The backup application applies the object lock using the S3 Object Lock API.
  3. Ransomware or other forms of modification or deletion can no longer change the data.
  4. The data can be safely restored when necessary.
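
As a sketch of steps 1 and 2 using the standard S3 API against an RGW endpoint (the endpoint, bucket, key, and retention date are illustrative; Object Lock must be enabled when the bucket is created):

    # Create a bucket with Object Lock enabled
    aws --endpoint-url http://rgw.example.com:8080 s3api create-bucket \
        --bucket backups --object-lock-enabled-for-bucket

    # Upload a backup object
    aws --endpoint-url http://rgw.example.com:8080 s3api put-object \
        --bucket backups --key nightly.tar --body nightly.tar

    # Lock it in COMPLIANCE mode until the retention date; until then it cannot
    # be deleted or overwritten, even by the storage administrator
    aws --endpoint-url http://rgw.example.com:8080 s3api put-object-retention \
        --bucket backups --key nightly.tar \
        --retention '{"Mode":"COMPLIANCE","RetainUntilDate":"2026-01-01T00:00:00Z"}'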

Storage Infrastructure with IBM watsonx.data

IBM watsonx.data makes it possible for enterprises to scale AI workloads using all their data with a fit-for-purpose data lakehouse architecture optimized for governed data and AI workloads, supported by querying, governance, and open data formats to access and share data. It is based on open-source technologies, including Presto and Apache Iceberg. IBM Storage Ceph can provide the storage infrastructure for a watsonx.data on-premises deployment, and IBM watsonx.data includes a 768 TB IBM Storage Ceph software license with support. The easiest way to start with IBM Storage Ready Nodes is with 4 nodes or 7 nodes, depending on whether you want performance or storage efficiency. The 4-node configuration provides the smallest number of nodes and the fastest performance, with the ability to lose 2 nodes without incident. The 7-node configuration uses more nodes but yields greater storage efficiency, again with the ability to lose 2 nodes without incident, and it can scale one node at a time as capacity needs grow.
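
One plausible way to realize those two layouts in Ceph (an assumption for illustration, not necessarily the exact Ready Node configuration) is a 3-way replicated pool for the 4-node performance option and a 4+2 erasure-coded pool for the 7-node efficiency option; both tolerate the loss of 2 nodes:

    # 4-node performance layout: 3-way replication (3x raw capacity overhead)
    ceph osd pool create data-replicated 128 128 replicated

    # 7-node efficiency layout: 4+2 erasure coding (1.5x overhead, survives 2 host failures)
    ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
    ceph osd pool create data-ec 128 128 erasure ec-4-2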

With watsonx.data and IBM Storage Ceph, you can access all your data across both databases and data lakes, as each configuration can be optimized in the same Storage Ceph cluster. Share large volumes of data through open table formats such as Apache Iceberg, built for high-performance analytics and large-scale data processing, and at the same time store large amounts of data for other large data set analysis. IBM Storage Ceph also supports multiple open formats for analytic data sets, allowing different engines to access and share the same data using formats like Parquet, Avro, Apache ORC, and more.

To learn more about using IBM Storage Ready Nodes with IBM Storage Ceph, download and view: https://www.ibm.com/downloads/cas/OV6RDQX7
 
  1. Planning and Preparing:

    • Determine your deployment requirements, including the number of IBM Storage Ceph nodes, storage capacity, networking considerations, and hardware requirements.
    • Ensure that the operating system on each node meets the prerequisites for installation, including supported kernel versions and required packages.
  2. Setting up the IBM Storage Ceph Cluster:

    • Install the IBM Storage Ceph packages on each node of your cluster. 
    • Configure the cluster by creating an IBM Storage Ceph configuration file (ceph.conf) that specifies the cluster settings, network addresses, storage devices, and other parameters (a minimal example follows this list).
    • Generate and distribute unique authentication keys (cephx keys) for each component (monitor, OSD, MDS) to ensure secure communication within the cluster.
    • Start the IBM Storage Ceph monitor service on the designated monitor nodes using the ceph-mon command.
  3. Adding Storage Nodes (OSDs):

    • Prepare storage devices on each storage node to be used by IBM Storage Ceph Object Storage Daemons (OSDs). This may involve partitioning and formatting the disks.
    • Create OSDs by running the ceph-osd command on each storage node and specifying the storage devices to be used.
    • Monitor the status of OSDs using the IBM Storage Ceph management tools to ensure they are successfully added to the cluster.
  4. Testing and Validation:

    • Verify the health and status of the IBM Storage Ceph cluster using the Ceph management tools (ceph command) and monitoring utilities.
    • Perform tests and benchmarks to ensure proper functionality, data accessibility, and performance of the IBM Storage Ceph storage cluster.
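
A minimal sketch of the configuration file from step 2 and the verification commands from step 4, assuming a small three-monitor cluster (the fsid, hostnames, and addresses are placeholders):

    # /etc/ceph/ceph.conf
    [global]
    fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
    mon_initial_members = node1, node2, node3
    mon_host = 192.168.10.11, 192.168.10.12, 192.168.10.13
    public_network  = 192.168.10.0/24
    cluster_network = 10.10.10.0/24

    # Step 4: verify health and layout
    ceph -s              # overall health, monitor quorum, OSD counts
    ceph health detail   # details behind any warnings
    ceph osd tree        # OSD-to-host mapping and status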


Read the blog to learn more about how to build a data lakehouse with IBM Storage Ceph and IBM watsonx.data.
Are you ready to start?  Visit us at: IBM Storage Ceph  

 

