Overview
One of the advantages of IBM Storage Ceph over other storage solutions is the flexibility to refresh or expand a cluster's hardware without service interruption. In fact, these operations can be orchestrated so that clients and users don't even notice!
This post was inspired by a valued customer who experienced issues when adding eighty (yes, eighty) nodes to an IBM Storage Ceph cluster.
Adding hosts or OSDs to an existing cluster can result in a "thundering herd" effect as IBM Storage Ceph brings new capacity online and redistributes data. This process includes built-in measures to limit the impact of rebalancing on clients. However, when a large number or percentage of hosts or OSDs are added at once, there can still be impact, including client performance and inactive placement groups.
The Manager's balancer module uses the pg-upmap facility to incrementally converge toward uniform data distribution across OSDs. We will use this facility and the balancer in our expansion process. Note that before proceeding, your cluster's balancer mode should be upmap:
# ceph balancer status | grep mode
"mode": "upmap",
If your balancer mode is set to crush-compat, do not proceed. This mode may be set due to prior (or current) very old clients that cannot handle pg-upmaps; investigate your complement of clients. If any do not report luminous or later, track them down and remove or upgrade them before switching the balancer mode to the default upmap. Adding pg-upmaps to a cluster with very old clients may result in those clients being unable to perform I/O operations.
# ceph features | jq .client
[
{
"features": "0x2f018fb87aa4aafe",
"release": "luminous",
"num": 35
},
{
"features": "0x3f01cfbffffdffff",
"release": "luminous",
"num": 34
}
]
Similarly, if any OSDs have a legacy REWEIGHT value lower than 1.0, we advise setting them to 1.0 and allowing the cluster to adjust before proceeding.
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 5577.10254 root default
-34 465.31158 host ngc1701
217 hdd 18.53969 osd.217 up 1.00000 1.00000
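To reset an override weight, a command of the following form may be used (the OSD ID here is only a placeholder); allow any resulting backfill to complete before proceeding:
# ceph osd reweight 217 1.0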
The Expansion Process
Expansion of an IBM Storage Ceph cluster at a high level progresses through several phases:
- Rack & stack the new systems or add drives to existing OSD nodes
- Provision Linux and Ceph prerequisites on new nodes
- Deploy OSDs on the new resources
- Rebalance for uniform utilization
Below we present a method for orchestrating the latter two of these phases.
Preparation
Before executing any expansion steps, run the ceph status command in a cephadm or admin node shell. If the overall health status is not HEALTH_OK, if there is any backfilling or recovery in progress, or if any OSDs are down or out, resolve these issues before proceeding. Executing cluster expansion or other maintenance when the cluster is not squeaky clean can result in all manner of problems.
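For example, the following commands provide a quick pre-flight check; any warnings, down or out OSDs, or in-flight backfill/recovery they report should be addressed first:
# ceph health detail
# ceph osd stat
# ceph pg stat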
Brand new hosts and storage drives are mostly 100% functional right out of the box.
Mostly.
Before adding new drives or hosts to your cluster, we recommend testing their integrity. First, we direct the orchestrator to not automatically detect new drives and deploy OSDs on them until we're ready. Determine the names of your orchestrator OSD service(s):
# ceph orch ls | grep osd
osd.cost_capacity
# ceph orch set-unmanaged osd.cost_capacity
Set unmanaged to True for service osd.cost_capacity
Your cluster may have different or additional OSD services; repeat the set-unmanaged command for each. Now inventory each host to ensure that the expected number and types of new OSD drives are visible. Depending on your hardware complement, tools for this include:
- lsblk
- lsscsi
- nvme list
- storcli64 /c0 show (or perccli64)
We advise testing each added drive before deploying an OSD, for example by running a command of the following form:
# /usr/bin/time dd if=/dev/urandom of=/dev/nvme111n1 bs=1M count=1000
Check dmesg and syslog logs for any reported errors, and investigate any drives that report errors or take unusually long to complete the test. Alternately, the stress-ng utility is a useful tool for consolidated testing of storage drives, RAM, CPUs, and other server components.
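As a sketch, a stress-ng invocation along the lines below exercises drives with verified writes. Note that the --hdd workers write scratch files to the current working directory, so run it from a throwaway filesystem created on the drive under test, and adjust the worker count, size, and timeout to suit:
# stress-ng --hdd 4 --hdd-bytes 16G --verify --timeout 30m --metrics-brief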
Next we pause automatic rebalancing until we're ready. As OSDs and hosts are added to a cluster, the topology maps maintained by the Monitors are updated multiple times. Each OSD as it comes online must catch up with this flurry of updates, peer with other OSDs, and in general get ready to rumble. By disabling rebalancing in advance, we forestall the movement of data onto new OSDs, and thus the need for them to serve client requests, until they are fully ready. If we do not do this, the cluster will begin rebalancing with each new OSD that comes online, which creates additional churn and results in data being moved multiple times.
# ceph osd set norebalance
# ceph balancer off
Deployment
Now, deploy the new host or OSD resources in the usual fashion, which for many clusters means letting the orchestrator detect them. Restore the orchestrator osd service(s) to the managed state:
# ceph orch set-managed osd.cost_capacity
...
You will see the OSDs progressively join the cluster and come up and in:
# ceph osd stat
324 osds: 324 up (since 8m), 324 in (since 6m); epoch: e1701
Ensure that all OSDs are deployed, up, and in, and that they appear in the appropriate CRUSH host, rack, etc. buckets:
# ceph osd tree
If the new OSDs / hosts do not appear as expected in the CRUSH topology, use the CLI tools to remediate or take other actions as appropriate.
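For example, if a new host landed outside its intended rack bucket, it can be moved into place; the host and rack names below are hypothetical:
# ceph osd crush move ngc1702 rack=rack3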
As a result of having temporarily disabled rebalancing, you should see lots of misplaced objects, and placement groups in the backfill_wait state, but there should be no active backfilling. Cluster status should look something like the below:
# ceph status
cluster:
id: 44928f74-9f90-1138-8862-ngc17017f06d07
health: HEALTH_OK
...
osd: 324 osds: 324 up (since 2h), 324 in (since 6m); 634 remapped pgs
rgw: 13 daemons active (13 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 13 pools, 6979 pgs
objects: 2.67G objects, 2.8 PiB
usage: 3.7 PiB used, 1.8 PiB / 5.4 PiB avail
pgs: 1104285280/19467663314 objects misplaced (75.401 %)
2315 active+clean
3666 active+remapped+backfill_wait
At this point our new OSDs are in the cluster, but since we've disabled balancing, they have not yet received any data. Our cluster has a great many misplaced objects, but we want to mitigate the thundering herd, including subtle dynamics such as PG overdose. When the cluster moves data around to balance distribution, it first makes a full new replica/shard, and only then removes the replica/shard from the old location. This "make before break" approach is part of IBM Storage Ceph's assiduous concern for data durability and availability. During very large cluster topology changes, though, there can be a race condition of sorts between moving PGs onto and off of a given OSD. Our approach here forestalls that dynamic by moving data in smaller, manageable batches.
Moreover, if a component fails during balancing, or if we need to perform an upgrade or other maintenance, we would like to be able to get the cluster back to a steady state quickly. A large expansion of a large cluster with erasure-coded pools on HDDs could easily take a metric month or longer to converge.
The Magic
The heart of IBM Storage Ceph's RADOS foundation is CRUSH, which can be thought of as a sophisticated hash function that distributes data across hosts and OSDs. The balancer module manages pg-upmap entries to fine-tune this placement, which inspires this process for managing expansion.
We will use the handy-dandy upmap-remapped.py script from the good folks at CERN to facilitate our process. Download a copy to a cephadm shell or a node with the admin key, and chmod +x the file so that it may be executed.
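For example, something along these lines fetches and prepares the script; the URL below is our best guess at the script's location within CERN's ceph-scripts repository, so verify it before downloading:
# curl -LO https://raw.githubusercontent.com/cernceph/ceph-scripts/master/tools/upmap/upmap-remapped.py
# chmod +x upmap-remapped.py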
Let's briefly show what the script does by giving it a whirl:
# ./upmap-remapped.py | head
while ceph status | grep -q "peering\|activating\|laggy"; do sleep 2; done
ceph osd pg-upmap-items 19.17f 306 1 271 43 0 84 293 66 163 94 &
ceph osd pg-upmap-items 19.21c 60 102 54 90 &
ceph osd pg-upmap-items 19.39d 41 166 17 90 279 102 &
ceph osd pg-upmap-items 19.70d 179 122 121 163 102 17 85 6 7 299 &
wait; sleep 4; while ceph status | grep -q "peering\|activating\|laggy"; do sleep 2; done
Now, fire up a second window with a cephadm or admin node shell, and run
watch ceph status
for a frequently-refreshed view of cluster state. You should see the bolus of misplaced RADOS objects and PGs waiting for backfill.
The script analyzes the cluster's topology and state and emits supported CLI commands, similar to how the Manager's balancer module works. For real-life use, we pipe it into a shell so that the commands are executed:
./upmap-remapped.py | sh
./upmap-remapped.py | sh
Due to certain subtleties and knock-on effects, we sometimes need to make two or three passes with the script. When the script is running and piped into a shell for processing, it emits batches of pg-upmap commands, waiting between each for the cluster to process the latest batch before piling on more. As each batch of commands is processed by the cluster, in your status window you will see brief spurts of PGs peering to exchange the new mappings, and the numbers of misplaced RADOS objects and backfill_wait placement groups will progressively drop. After two, perhaps three runs of the script, the pending backfill vanishes because we've effectively told the cluster that where the data currently sits is where it belongs, really, just trust us.
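You can confirm that the exception table was populated by counting the pg_upmap_items entries now present in the OSD map:
# ceph osd dump | grep -c pg_upmap_items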
We still need to let the cluster balance data across the new resources, so we'll do so now on our own terms: incrementally, so that if a problem arises we aren't stuck with no expedient way forward or back, and in manageable steps that do not result in the subtle problematic intermediate states discussed above.
Now we re-enable rebalancing and turn the balancer module back on:
ceph osd unset norebalance
ceph balancer on
The cluster immediately sees the imbalance that our new capacity has wrought, and endeavors to converge to uniformity. The balancer does this by incrementally removing or adjusting the pg-upmaps that the script emplaced or that already existed, taking a divide-and-conquer approach. The balancer will not make additional adjustments once a configurable fraction of data is considered misplaced; by default this is 5% of the cluster's total. This often works well, but for especially large adjustments, we additionally constrain how much change the balancer will bite off at once:
ceph config set mgr target_max_misplaced_ratio 0.02
Another setting bears adjustment, especially if your cluster comprises (or will one day comprise) OSDs of differing sizes:
ceph config set mgr mgr/balancer/upmap_max_deviation 1
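You can confirm that both values are in effect:
ceph config get mgr target_max_misplaced_ratio
ceph config get mgr mgr/balancer/upmap_max_deviation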
Unless your cluster has already crossed the nearfull warning threshold, you and your users will be best served by allowing the cluster to gradually and smoothly rebalance over time. You will periodically see a modest amount of backfill appear on the cluster as the balancer does its dance. Over time you can run
ceph df
to see the AVAIL capacity of your pools increase as the OSDs become more balanced. You can also run
ceph osd df
to see placement groups and data move to the new OSDs as their PGS values grow and their VAR values converge toward 1.00.
Note
The IBM Storage Ceph Monitors cannot compact while backfill is ongoing, so when executing this process we recommend settings like the below to give the cluster a periodic break to catch its breath:
ceph config set mgr mgr/balancer/begin_weekday 4
ceph config set mgr mgr/balancer/end_weekday 1
These values allow the balancer to make adjustments during each extended weekend: day 0 is Sunday and day 6 is Saturday. We give it several days off so that each batch of adjustment has time to backfill completely, which gives the Monitors a chance to compact their databases so they can remain lean and mean.
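If you also wish to confine balancing to off-peak hours, the balancer module has companion begin_time and end_time settings that take HHMM values; the hours below are only an example, and you should confirm in your release's documentation whether they are interpreted as UTC or local time:
ceph config set mgr mgr/balancer/begin_time 2200
ceph config set mgr mgr/balancer/end_time 0600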
Expanding non-OSD Daemons
IBM Storage Ceph clusters comprise a number of daemons and services in addition to OSDs. A detailed discussion of daemon placement is beyond the scope of this post but may be found in the documentation. A few notes though as pertain to expansions or refreshes:
Monitors
If your cluster is currently configured to only deploy three Monitors, we strongly urge you to scale up to five, to enhance availability and resilience. Many clusters today are converged: they colocate Monitors, Managers, RGWs, and other daemons with OSDs on the same nodes. Expanding the mon complement is straightforward:
Create a file called mymonservice.yml with the following service spec:
service_type: mon
service_name: mon
placement:
  count: 5
  label: mon
and ensure that at least five hosts, and ideally more, have the mon orchestrator label:
# ceph orch host ls
HOST ADDR LABELS
ceph0cfc 10.10.10.1 _admin,mgr,mon
ceph8ce2 10.10.10.2 _admin,mgr,mon
ceph6998 10.10.10.3 _admin,mgr,mon
cepha09a 10.10.10.4 _admin,mgr,mon
cephab92 10.10.10.5 _admin,mgr,mon
cephac0f 10.10.10.6 _admin,mgr,mon
cephad37 10.10.10.7 _admin,mgr,mon
cephbe73e 10.10.10.8 _admin,mgr,mon
8 hosts in cluster
If your cluster spans multiple data center racks with rack as the CRUSH failure domain, it is good practice to assign the mon label to at most two hosts in each rack. The Paxos consensus algorithm requires that more than half of the configured Monitors be available to form a quorum. Were the orchestrator to place three or more Monitors on hosts in a single rack, an outage of that rack would take down the mon quorum.
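Labels may be added or removed per host with the orchestrator; for example, to add the mon label to one of the hosts from the listing above:
# ceph orch host label add ceph0cfc mon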
After ensuring appropriate host labels, instruct the orchestrator to apply the changes:
# ceph orch apply -i mymonservice.yml --dry-run
You may need to repeat the dry run after a minute or two for the orchestrator to gather current cluster information. Once this command shows the expected addition of daemons, apply it for real:
# ceph orch apply -i mymonservice.yml
The orchestrator should deploy additional Monitors as necessary to meet the new policy, and after they peer with the existing daemons the ceph mon stat command should show all five up and in quorum.
RGW and Ingress services
These services scale horizontally, and when object storage service is provisioned it is often best practice to deploy at least one instance on every cluster node that has adequate network, CPU, and memory resources. When expanding such a cluster, manage the rgw and ingress service types in the above fashion, without the constraint of only two hosts per rack.
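As a sketch, an rgw service spec using label-based placement might look like the below; the service_id, label, count_per_host, and port are illustrative assumptions to adapt to your environment, and the spec is applied with ceph orch apply -i just like the mon spec above:
service_type: rgw
service_id: myzone   # hypothetical; use your own realm/zone naming
placement:
  label: rgw
  count_per_host: 1
spec:
  rgw_frontend_port: 8080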
Other Expansion and Hardware Lifecycle Considerations
This same technique can be used to smooth the curves of other tasks that might require significant data movement:
Changing the CRUSH rule for one or more RADOS pools or device class for a set of OSDs
By replacing the OSD deployment step above with CRUSH changes, we can also calm the thundering herd. This approach allows us to, for example, safely constrain a CRUSH rule to only SSD or HDD OSDs, which may entail moving as much as 100% of the data, a task that can otherwise be quite impactful.
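For instance, a pair of commands like the below creates a replicated rule constrained to the ssd device class and assigns it to a pool; the rule and pool names are hypothetical:
# ceph osd crush rule create-replicated replicated_ssd default host ssd
# ceph osd pool set mypool crush_rule replicated_ssd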
Adjusting CRUSH tunables
As IBM Storage Ceph evolved, successive refinements were made to the CRUSH algorithm, including values for various tunables that are stored as part of the CRUSH map. These have been stable for a number of releases now, but older clusters may find that their tunables profiles are outdated, which results in suboptimal data distribution:
# ceph osd crush dump | grep optimal
"optimal_tunables": 0,
In such a case, one would substitute the command that switches to the optimal tunables profile for the OSD deployment step in the above process; afterward, the CRUSH dump should reflect the change:
# ceph osd crush tunables optimal
# ceph osd crush dump | grep optimal
"optimal_tunables": 1,
Hardware Refresh
IBM Storage Ceph's flexibility means that you can not only expand your cluster without client downtime or significant impact, but also can replace 100% of the cluster's hardware, again without clients noticing. This can be accomplished by adding the below steps after the OSD deployment in the above procedure, but before re-enabling the balancer:
- Set the CRUSH weight of the outgoing OSDs to 0
The balancer will then gradually move all data to the new OSDs. When this is complete, and ceph osd df shows that the old OSDs and hosts are really, really empty and no more backfill remains, they may be removed from the cluster.
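Setting an outgoing OSD's CRUSH weight to zero looks like the below, here using the example OSD ID from earlier in this post; repeat (or script a loop) for each OSD being evacuated:
# ceph osd crush reweight osd.217 0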
Note that when using this technique, you must first ensure that the aggregate capacity and thus CRUSH weight of the new OSDs and hosts is greater than or equal to that of the outgoing hosts, or the balancing process will not be able to complete properly and your cluster will enter NEARFULL and possibly BACKFILLFULL or FULL states, in which case everyone has a very bad day.
Out of an abundance of caution, an adjusted process would be to let the cluster balance as in the original procedure above, and then set the CRUSH weights of outgoing hosts to 0 a few at a time. This will take somewhat longer, but offers additional checkpoint and auditing opportunities.
When all OSD data has been evacuated from the old hosts that are to be decommissioned, check the placement of Monitor daemons:
# ceph orch ps --service_name mon
NAME HOST STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
mon.dd13-25 dd13-25 running (2d) 13s ago 18M 758M 2048M 18.2.1 d2cdd87030d1 eb0cea7cf13f
mon.dd13-33 dd13-33 running (2d) 11s ago 19M 934M 2048M 18.2.1 d2cdd87030d1 d02d83e252c8
mon.dd13-37 dd13-37 running (3M) 10s ago 19M 1610M 2048M 18.2.1 d2cdd87030d1 fa11209be5f5
mon.i18-24 i18-24 running (8w) 9s ago 18M 1596M 2048M 18.2.1 d2cdd87030d1 129143135cd3
mon.i18-28 i18-28 running (8w) 9s ago 18M 1610M 2048M 18.2.1 d2cdd87030d1 8d60bf5a71c9
Say the i18-28 host is to be removed from service. Issue the following command to undeploy its daemons. If a Monitor, Manager, RGW, or MDS daemon is running there, the orchestrator will fire up a replacement on another node, though in the case of MDS it is best to first deploy the new instance and fail over to it explicitly.
# ceph orch host drain i18-28
# ceph orch ps i18-28
When the ps command shows all daemons evacuated from the host, and ceph status shows the cluster healthy, proceed to remove the host from the cluster:
# ceph orch host rm i18-28
Additional detail regarding removing hosts from service may be found in the IBM Storage Ceph documentation.
We hope that this blog post has been useful and informative, and welcome your feedback.