
Scaling IBM Storage Ceph Object Storage: A billion objects per bucket and Multisite sync

By Vidushi Mishra posted 2 days ago

  

One critical challenge when managing large-scale data with any object storage system is ensuring seamless scalability without paying a performance penalty. With IBM Storage Ceph, we constantly strive to push every dimension of scalability. As the IBM Storage Ceph Quality Engineering (QE) team, we work closely with the IBM Ceph Object and core RADOS engineering teams to rigorously pursue deterministic performance at scale and to drive enhancements that improve the experience of operating large Ceph estates.

In the past we demonstrated storing a billion, then 10 billion, objects in a single Ceph cluster. These tests were informed primarily by the needs of organizations running into HDFS's scalability limitations. Object storage is, after all, the preferred storage substrate for Data Lakehouse platforms.

Other object storage systems in the market impose limitations on the number of objects that can exist in a single bucket or under a single prefix. We want to ensure that data practitioners can confidently place massive data sets into Ceph without being forced to spread those data sets across buckets. It's worth noting that Ceph does not use an adjacent database to store bucket index information. Instead, we create a number of RADOS objects per bucket, which we refer to as bucket shards, and that bucket's index information is distributed across those shards as OMAP key-value pairs. Years ago, changing the number of shards per bucket was a manual process, and we added dynamic bucket resharding to improve the operational experience for Ceph administrators who provide storage services to big data teams. Several configuration parameters control the sharding behavior:

  • rgw_override_bucket_index_max_shards (poorly named): how many shards to create for each bucket initially.

  • rgw_dynamic_resharding: whether to scale the number of shards as the number of objects in the bucket increases.

  • rgw_max_objs_per_shard: the per-shard threshold that triggers dynamic resharding.

  • rgw_max_dynamic_shards: the maximum number of shards that will be dynamically created for any given bucket.

The cluster administrator can also manually adjust the number of shards for any given bucket using the radosgw-admin CLI utility.
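To make these parameters concrete, here is a minimal sketch of how they can be inspected and adjusted at runtime. The client.rgw scope applies to all gateways; narrower, per-instance scopes are also possible, and the new threshold value below is purely illustrative.

ceph config get client.rgw rgw_dynamic_resharding   # dynamic resharding on/off
ceph config get client.rgw rgw_max_objs_per_shard   # per-shard reshard threshold
ceph config get client.rgw rgw_max_dynamic_shards   # dynamic shard-count ceiling

# Example: raising the per-shard threshold (illustrative value)
ceph config set client.rgw rgw_max_objs_per_shard 200000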

Our test procedure involved uploading a billion objects to a single bucket without manually pre-sharding it with radosgw-admin, using the defaults for the configuration parameters mentioned above. For extra credit, we configured multisite replication to test index scaling robustness in that context. The choice of a billion objects per bucket was driven by increasing requests for multi-billion-object buckets from customers with huge catalogs in their Data Lakehouses.

This ongoing effort highlights our dedication to making IBM Storage Ceph an efficient and reliable solution for large-scale data management. The IBM Ceph Object engineering team has set an objective of 5 billion objects per bucket, removing the existing trade-offs between object ingestion and listing performance, and establishing 1 billion objects per bucket as a regular, everyday standard for IBM Storage Ceph operations.

 

The Test Setup

The Environment

Our testing ground consisted of a compact Ceph setup designed to push the limits of Ceph Object Gateway. The environment was a simple multisite configuration exercising active-active zone replication, with the zones hosted on two different Ceph clusters.
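Before driving load, replication health between the two zones can be verified from either cluster. A quick check looks like the following; the bucket name is a placeholder for the per-bucket variant.

radosgw-admin sync status                                  # metadata and data sync state for the local zone
radosgw-admin bucket sync status --bucket=analytics-data   # per-bucket sync detail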

Ceph Configuration for each cluster

  • Three bare-metal nodes handling the MON, OSD, and MGR roles.

    • Nine OSDs distributed evenly across the three nodes (three per node).

    • Hardware details for the MON/OSD/MGR nodes:

      • CPU: Intel Xeon Silver 4210, 20 threads (10 cores x 2 threads), 2.20 GHz base frequency, 3.2 GHz max.

      • Memory: 62 GiB

      • Storage: 21 TiB per node across three 7 TiB SSDs.

  • Six bare-metal RADOS Gateway nodes, each running one object gateway instance.

    • Half of these nodes were dedicated to multisite synchronization tasks, while the other half served client IO operations.

    • Hardware details for the RGW nodes:

      • CPU: Intel Xeon E5-2630, 20 threads (10 cores x 2 threads), 2.20 GHz base frequency, 3.1 GHz max.

      • Memory: 62 GiB

    • All clients were routed through an HAProxy load balancer, ensuring smooth traffic distribution.

  • Pools:

    • Data pools were 3x replicated.

    • Index pools were 3x replicated.

  • Non-default Configuration

    • To optimize our setup, we set rgw_run_sync_thread to false on the IO RGWs.
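As a sketch, the setting might be applied like this; the exact config targets depend on how the gateway instances are named in your deployment, so the names below are placeholders, and the gateways must be restarted for the change to take effect.

# Disable the replication sync thread on the client-facing (IO) gateways only:
ceph config set client.rgw.io-gw-1 rgw_run_sync_thread false
ceph config set client.rgw.io-gw-2 rgw_run_sync_thread false
ceph config set client.rgw.io-gw-3 rgw_run_sync_thread false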

Benchmark workload configuration

Multisite Bidirectional Workload

This refers to concurrent write operations to the same bucket from the two active sites participating in the replication.

Object Size

We varied object sizes to simulate a realistic workload:

  • 25% of objects were 1 to 2 KB.

  • 40% were 2 to 4 KB.

  • 25% were 4 to 8 KB.

  • 10% ranged from 8 to 256 KB.

Workload Type

Our workload was primarily PUT operations, stressing the write capabilities of Ceph RGW.
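Our actual runs used an S3 benchmarking tool; purely as an illustration, a shell sketch like the following reproduces a PUT-only workload with the size mix above (the endpoint and bucket name are placeholders, and credentials are assumed to be configured for the AWS CLI).

#!/usr/bin/env bash
# Illustrative PUT-only generator reproducing the object size mix.
ENDPOINT=http://rgw.example.com:8080   # placeholder RGW endpoint
BUCKET=analytics-data                  # placeholder bucket name

for i in $(seq 1 1000); do
  r=$((RANDOM % 100))
  if   [ "$r" -lt 25 ]; then kb=$((RANDOM % 2 + 1))     # 25%: 1-2 KB
  elif [ "$r" -lt 65 ]; then kb=$((RANDOM % 3 + 2))     # 40%: 2-4 KB
  elif [ "$r" -lt 90 ]; then kb=$((RANDOM % 5 + 4))     # 25%: 4-8 KB
  else                       kb=$((RANDOM % 249 + 8))   # 10%: 8-256 KB
  fi
  dd if=/dev/urandom of=/tmp/obj bs=1K count="$kb" 2>/dev/null
  aws --endpoint-url "$ENDPOINT" s3api put-object \
      --bucket "$BUCKET" --key "obj-$i" --body /tmp/obj
done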

The Objective

Our goal was simple: upload a billion objects to a single bucket and replicate these objects across an RGW multisite deployment. This endeavor would test the scalability and robustness of Ceph RGW in handling such massive data volumes.

The Methodology

We knew we would heavily exercise Ceph RGW's dynamic bucket index resharding capabilities to tackle the bucket scalability challenge, pushing the default parameters to the limit. Dynamic resharding will, by default, not exceed 1999 shards per bucket, and with 1 billion objects in a single bucket we are looking at a mean of roughly 500k index entries per shard, five times the rgw_max_objs_per_shard default. It's essential to take two critical details of RADOS into account.

Processes such as recovery and scrubbing progress through each placement group and iterate over every object. Timeouts may occur when an object grows large, due to the longer time required to read it. This is why all RADOS client interfaces, such as RBD, RGW, and CephFS, by default spread data across 4MB RADOS objects.

If you plan to upload more than 200 million objects to a single RGW bucket, the current wisdom is to pre-shard the bucket to avoid the issues discussed later in this document. The shard count depends on two variables: the maximum number of objects per shard (rgw_max_objs_per_shard, which defaults to 100,000) and the total number of objects you intend to ingest into the bucket. You can calculate the appropriate number of shards with the formula shards = expected number of objects / 100,000. This ensures that the bucket index is adequately prepared for the expected object count.
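As a worked example for an expected billion objects (the bucket name is a placeholder), the arithmetic and the pre-shard step look like this; radosgw-admin bucket limit check can then be run at any point to report per-shard fill against rgw_max_objs_per_shard.

# Expected objects / objects per shard => shard count; round up to a prime:
echo $(( 1000000000 / 100000 ))   # => 10000; a nearby prime is 10007

# Pre-shard the (ideally still empty) bucket before the bulk ingest:
radosgw-admin bucket reshard --bucket=analytics-data --num-shards=10007

# Report per-shard fill status for all buckets:
radosgw-admin bucket limit check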

 

The Workflow

Step 1: The Initial Upload

We began by uploading objects to the bucket bidirectionally, with an initial milestone of 400 million objects. The system performed admirably, with no multisite synchronization issues.

  1. Incremental upload to 400 million objects.

    1. Stage 1: 20M objects total in the bucket

      1. 20 million objects bi-directionally (10M ⇔ 10M)

    2. Stage 2: 60M objects total in the bucket

      1. 40 million objects bi-directionally

    3. Stage 3: 100M objects total in the bucket

      1. 40 million objects bi-directionally

    4. Stage 4: 150M objects total in the bucket

      1. 50 million objects bi-directionally

    5. Stage 5: 200M objects total in the bucket

      1. 50 million objects bi-directionally

      2. The bucket was dynamically resharded to the maximum limit of 1999 shards

    6. Stage 6: 250M objects total in the bucket

      1. 50 million objects bi-directionally

      2. Started seeing large omap warnings (see the monitoring sketch after this list)

    7. Stage 7: 300M objects total in the bucket

      1. 50 million objects bi-directionally

    8. Stage 8: 350M objects total in the bucket

      1. 50 million objects bi-directionally

    9. Stage 9: 400M objects total in the bucket

      1. 50 million objects bi-directionally

      2. Large omap warnings had increased in count
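Between stages we tracked bucket growth and shard activity. A minimal monitoring sketch (the bucket name is a placeholder, and the num_shards field in the stats output assumes a recent Ceph release):

radosgw-admin bucket stats --bucket=analytics-data   # object count, size, current shard count
radosgw-admin reshard list                           # dynamic reshard operations queued or in progress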

 

Step 2: Large Omaps After Reaching the Shard Limit

As we neared the 1999 shard limit, we closely monitored system performance. Here we encountered our first issues: large omap warnings (raised when an excessive number of index entries accumulate on a single bucket index shard) began to appear, indicating that the system was struggling against the shard limit.
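Large omap objects surface as a cluster-level health warning, and the reporting threshold is an OSD option; the default of 200,000 keys is the upstream default and worth confirming for your release.

ceph health detail   # look for LARGE_OMAP_OBJECTS warnings naming the index pool

# Keys-per-object threshold at which deep scrub flags a large omap object:
ceph config get osd osd_deep_scrub_large_omap_object_key_threshold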

Step 3: Manual Resharding

To reach 1 billion objects in the bucket, we manually resharded it. Dividing 1 billion objects by the maximum number of objects per shard (100,000) gives 10,000 shards; since using a prime number for the shard count is recommended, we chose 10,007 shards.
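The operation itself is a single CLI call whose progress can be polled from another terminal; the bucket name is a placeholder.

radosgw-admin bucket reshard --bucket=analytics-data --num-shards=10007   # blocks while index entries are migrated
radosgw-admin reshard status --bucket=analytics-data                      # run elsewhere to watch per-shard progress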

This process was necessary to continue our quest toward a billion objects, but it was challenging: we observed high CPU utilization and a prolonged IO quiesce during the transition, which took approximately 84 minutes in our test environment.

We opted for manual resharding instead of dynamic resharding because the dynamic method would have required us to increase rgw_max_dynamic_shards beyond its 1999 default and restart the gateways. However, we are also exploring the dynamic resharding approach.
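For reference, the dynamic route we considered would look roughly like the following; the config scope and the service name are deployment-specific placeholders.

# Raise the dynamic resharding ceiling beyond the 1999 default:
ceph config set client.rgw rgw_max_dynamic_shards 10007

# Restart the gateways to pick up the new ceiling (cephadm service name is a placeholder):
ceph orch restart rgw.mysvc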

Step 4: Observation and Insights

Throughout the process, we continuously monitored CPU utilization and downtime. Despite the challenges, we managed to achieve our objective, albeit with some critical observations that the Ceph Object Engineering team is working on:

  • OSD Utilization: We noted uneven distribution across OSDs, even with a moderately used cluster (60% capacity used after the upload of 1 billion objects).

  • Performance Comparison:

    • The system’s behavior changed significantly as we moved from 11 to 1999 shards.

    • Large Omaps were prevalent when the shard count reached 1999, especially with 400 million objects.

    • Transitioning from 1999 to 10,007 shards was resource-intensive but necessary.

    • After 10,007 shards, the system stabilized, and we did not observe any issues with the large volume of data in the bucket with multisite sync.

Open Thoughts and Issues seen

Through our efforts in scaling Ceph Object Gateway, we documented and opened multiple issues and feature enhancement requests in Bugzilla to improve product scalability. The RFEs and bugs are listed below.

Issues seen and under investigation

  • 2267632 [rgw/reshard]: 300% CPU utilization observed during manual resharding to 10K shards with 400 million objects.

  • 2269655: Uneven OSD utilization despite moderate cluster usage.

Future IBM Ceph Object Enhancement Requests

  • 2266547 [RFE]: Optimal num_shards required for resharding beyond 1999 shards for huge RGW buckets.

    • bug 2266547#c4

      • We opened a request for reduced or no IO quiesce during bucket resharding, having observed an 84-minute IO quiesce while resharding a bucket holding 400 million objects.

        • The IBM Ceph Object team is already working to expedite this feature; a PR is available upstream.

  • Bug 2272535

    • We propose an RFE for automatic alerts once a bucket reaches 1999 shards, prompting manual resharding before large object maps form. This would help in managing large-scale Ceph RGW deployments.

Dashboard Observations

  • 1 Billion Objects: The dashboard correctly reflected 1 billion objects, a reassuring sight from the UI.

  • Alerts: Whether the dashboard raised any alerts related to this testing remains an area for further investigation.

Command Reference

An ordered bucket listing, radosgw-admin bucket list --bucket <bucket_name>, of a billion+ objects took ~10 hours in our test environment.
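For interactive checks it is usually better to cap the listing than to walk the entire index; the bucket name below is a placeholder.

# Fetch only the first 1000 entries instead of listing the whole bucket:
radosgw-admin bucket list --bucket=analytics-data --max-entries=1000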

Conclusion

The collaboration between the IBM Storage Ceph Quality Engineering (QE) team and the IBM Ceph Engineering (ENG) teams is crucial for improving the scalability and performance of Ceph Object Gateway. Together, we are addressing the challenges of managing and replicating large volumes of data, making 1 billion objects in a single bucket our baseline. We aim to ensure that Ceph Object Gateway is a robust and scalable solution for large-scale object storage.


#ibmtechxchange-ai