IBM Storage Ceph

Choosing the Right Data Protection Strategies For Your IBM Storage Ceph Deployments

By Anthony D'Atri posted 29 days ago

  

Choosing a data protection strategy can be complicated

Multiple factors and tradeoffs

  • Usable to raw capacity ratio
  • Replication vs erasure coding
  • EC profile values for k and m
  • Read and Write performance
  • Recovery performance
  • Failure domains
  • Fault tolerance
  • Media saturation
  • min_alloc_size vs media IU (indirection unit)

Replication vs Erasure Coding

Replication

  • Three (usually) copies of all data
  • Fast and simple
  • As few as 3 nodes or racks
  • Well-suited to small and/or hot S3 / CephFS objects and metadata pools
  • High space amplification
    • Ratio of raw capacity to usable capacity
    • Default 3 copies: space amp factor of 3
    • Other values are possible: 1 and 2 are dangerous if your data isn't disposable
    • Stretch mode clusters require 4 copies

Erasure Coding (EC)

  • Splits data into chunks
  • Computes and stores additional parity chunks
  • Akin to RAID6 / RAID60 but more flexible
  • Requires additional CPU for parity calculation
  • Usually requires more hosts/racks than Replication
  • Less space amp
  • EC is often selected to maximize usable capacity for a given amount of raw underlying storage
  • Or to minimize raw capacity needed for desired usable capacity
  • More TB/RU, TB/node, TB/watt
  • Extra IOPS on underlying drives: HDDs saturate
  • Slower recovery: client writes may see higher latency, and the cluster runs with reduced redundancy until recovery completes, which raises the risk of data unavailability or loss in the event of an overlapping failure
  • May burn SSD endurance more quickly (though this is mostly FUD)

EC Profile

  • EC pools have attributes: the most important are K and M
  • K is the number of data chunks: when K=4, 1MB of user data is split into 4 data chunks @ 256KB
  • M is the number of coding chunks, which are the same size as data chunks
  • When M=2, 1MB of user data generates 2 coding chunks @ 256KB
  • We call the above a 4+2 or 4,2 profile
  • Space amp factor is (K+M)/K: for the 4+2 profile above, just 1.5
  • Diminishing returns with large values of K: as K increases beyond, say, 6, the incremental space saving of larger values quickly declines
  • At the cost of more IOPS, slower recovery, and slower scrubbing
  • Subtly, consider relative vs absolute space amp. In the table below the absolute difference between 8+2 and 4+2 is 25 percentage points, but the relative difference is only about 17%.

    Profile    Space amp factor    Usable / raw
    2+2        2.00                50%
    4+2        1.50                67%
    6+2        1.33                75%
    8+2        1.25                80%
  • You can select a profile with M=1

  • But.  Don't.  Just ... Don't.
  • Unless you can afford to lose data
  • This is not facetious: sometimes data can be reconstructed or is only a scratchpad.
  • High risk of data loss
  • Data is unavailable if even one host / rack is down for maintenance
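
The space amp arithmetic above is easy to sanity-check. Here is a minimal Python sketch (an illustration only, not Ceph code; the list of profiles is arbitrary):

```python
# Space amplification factor: raw bytes consumed per byte of user data.
# Replication with N copies is the degenerate EC case K=1, M=N-1, so one
# formula covers both: (K + M) / K.

def space_amp(k: int, m: int) -> float:
    """Space amplification of a K+M EC profile (or K=1, M=N-1 replication)."""
    return (k + m) / k

for name, (k, m) in {"3x replication": (1, 2), "2+2": (2, 2),
                     "4+2": (4, 2), "6+2": (6, 2), "8+2": (8, 2)}.items():
    amp = space_amp(k, m)
    print(f"{name:15s} space amp {amp:.2f}   usable/raw {1 / amp:.0%}")

# Absolute vs relative difference between 4+2 and 8+2:
absolute = space_amp(4, 2) - space_amp(8, 2)   # 0.25
relative = absolute / space_amp(4, 2)          # ~0.167
```

Running this shows the diminishing returns: the step from 2+2 to 4+2 saves far more raw capacity than the step from 4+2 to 8+2.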

Selecting M

  • Tradeoff of space amp, performance, fault tolerance
  • Data is preserved if any K data or coding chunks survive
  • Data is available for reads and writes when any K+1 data or coding chunks are online
  • There are durability and availability benefits to larger values of M: M=3 offers a significant improvement over M=2; beyond 3 there are -- you guessed it -- rapidly diminishing returns, on the timescale of cosmological heat death.
  • Many admins choose M=2. As drive sizes increase, with a concomitant increase in MTTR, the risk of data unavailability or loss due to overlapping failures increases.
  • Consider how long a cluster will take to recover from the failure of a single 30TB HDD or 122TB SSD.
  • Now consider the recovery time when an entire host of those halts and catches fire.
  • 3+3, 6+3 are examples of profiles with higher fault tolerance at the expense of performance and space amp.
  • Decision factors include OSD size and media, network resources, and how existential a threat partial data loss would be.
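
The survivability rules above can be captured in a toy model (a sketch assuming the default min_size of K+1; `pool_state` is a hypothetical helper, not a Ceph API):

```python
# For a K+M EC pool: data survives while at least K of the K+M chunks
# survive; with the default min_size = K+1, the pool serves I/O only
# while at least K+1 chunks are online.

def pool_state(k: int, m: int, failed_chunks: int) -> str:
    alive = k + m - failed_chunks
    if alive < k:
        return "data lost"
    if alive < k + 1:
        return "data intact, I/O suspended"
    return "active"

for failed in range(4):
    print(f"4+2, {failed} overlapping failures: {pool_state(4, 2, failed)}")
for failed in range(5):
    print(f"6+3, {failed} overlapping failures: {pool_state(6, 3, failed)}")
```

With M=2 a second overlapping failure already suspends I/O; M=3 tolerates one more failure before that happens, which is the whole argument for larger M as drives (and thus recovery windows) grow.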

When to choose Replication or EC

  • EC is best across at least M+K+1 failure domains
  • You can do EC on fewer, but it's tricky
  • Strategy is configured at pool creation
  • Can't switch between the two later, so choose .... wisely
  • Some pools require replication:
    • RGW index, CephFS metadata
    • RBD pools are almost always replicated
    • RBD using EC is possible with recent IBM Storage Ceph releases
      • Requires a small adjunct replicated metadata pool
      • Inherently higher latency is usually unacceptable
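
The failure-domain rule of thumb above is simple enough to encode (`ec_comfortable` is a hypothetical helper for illustration, not a Ceph API):

```python
# Rule of thumb from above: an EC pool is best spread across at least
# K + M + 1 failure domains (hosts or racks), so that after a failure
# there is a spare domain to recover into.

def ec_comfortable(k: int, m: int, failure_domains: int) -> bool:
    return failure_domains >= k + m + 1

print(ec_comfortable(4, 2, 7))    # True: 4+2 across 7 hosts
print(ec_comfortable(8, 3, 10))   # False: 8+3 wants at least 12
```

You can run EC across fewer failure domains than this, but as noted above it's tricky, and a single domain outage leaves no room to re-protect the data.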

EC for CephFS and RGW data pools

  • Applications are often less sensitive to latency than with block storage
  • Driven by space amp: cost and density over performance
  • Read throughput can even be higher than with replication, in limited circumstances

Multiple RGW data pools

  • EC data pools can result in larger space amp for very small objects (this should improve in IBM Storage Ceph 9)
  • Small / hot RGW objects benefit from replicated pool performance
  • User objects in a secondary data pool still have a small HEAD RADOS object in the default pool
  • Larger / tepid / cold objects often are fine on an EC pool for efficiency
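
A rough model of the small-object penalty mentioned above (assumptions: a BlueStore min_alloc_size of 4 KiB and chunk-level padding only; real clusters add stripe-unit and metadata effects, so treat this as a sketch):

```python
import math

MIN_ALLOC = 4096  # bytes; assumed BlueStore min_alloc_size

def raw_bytes_ec(obj_size: int, k: int, m: int) -> int:
    """Raw bytes consumed by one object on a K+M EC pool (simplified)."""
    chunk = math.ceil(obj_size / k)
    per_chunk = max(1, math.ceil(chunk / MIN_ALLOC)) * MIN_ALLOC
    return (k + m) * per_chunk

def raw_bytes_replicated(obj_size: int, copies: int = 3) -> int:
    per_copy = max(1, math.ceil(obj_size / MIN_ALLOC)) * MIN_ALLOC
    return copies * per_copy

for size in (1024, 16 * 1024, 4 * 1024 * 1024):
    ec = raw_bytes_ec(size, 4, 2)
    rep = raw_bytes_replicated(size)
    print(f"{size:>8} B object: 4+2 EC uses {ec} B raw, 3x replication {rep} B raw")
```

In this model a 1 KiB object on a 4+2 pool consumes 24 KiB raw versus 12 KiB for triple replication; the EC advantage only appears once objects span several allocation units, which is why small / hot objects belong on a replicated pool.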

Mixed media RGW + CephFS data pools

  • Best of both worlds
  • Modest capacity of fast TLC SSDs for replicated metadata + tiny / hot objects
  • Possible inlining of small user objects / files
  • Dense, cost-effective HDD for EC data pools, but spinners gonna seek
  • Dense pTLC or QLC SSDs for EC data pools: up to 122TB today, larger on the horizon
  • When using coarse-IU pTLC or QLC SSDs, set bluestore_use_optimal_io_size_for_min_alloc_size = true before OSD creation
  • These media are fantastic for object and file storage, but not great choices for RBD pools or CephFS / RGW metadata pools
  • PCIe Gen 4/5/6 allow huge NVMe capacity without bottlenecks

May the Tentacles be with You

This post was requested by IBMer Greg Deffenbaugh and adapted from a deck presented at Ceph Day Seattle 2025
