IBM Storage Ceph

Connect, collaborate, and share expertise on IBM Storage Ceph

View Only

Back to Blog List

Ceph's Scheduling Algorithm Evolution: From WPQ to dmClock

By Srinivasa Bharath Kanta posted Thu February 13, 2025 05:09 AM

Prior to dmclock(WPQ):

The main operations in the ceph cluster are client I/O, background recovery, and background operations(scrubbing, backfilling, trimming, PG deletion..etc). Before the 6.1 CEPH version, the requests are scheduled by using the WPQ (Weighted Priority Queue) algorithm.WPQ prioritizes the tasks based on the assigned weights. Requests with higher weights are served before those with lower weights.

For example:

Requests	Arrival Time	Weight
I/O	t1	3
Recovery	t2	1
Trimming	t3	5
PG deletion	t4	2

Table-1

The final execution order will be: Trimming(5)->I/O(3)->PG deletion(2)->Recovery(1)

For any operation within the cluster, the system dynamically allocates and manages CPU, memory, network, and storage I/O resources. In WPQ, The system dynamically adjusts resource allocation (CPU, memory, network, and storage I/O) based on these weights.

To illustrate this let's take an example-

Requests	weights	Reservation
Recovery	1	250
PG deletion	2	250
I/O	3	1000

Table-2

The throughput allocated to requests is in proportion to their weights.

Throughput allocation ∝ Weight

Assume that the maximum throughput in the device is 1200 IOPS

The IOPS allocation is based on the formula - T × (wi/∑wi) where T = Total throughput w=Weight

So, for the recovery, the allocated throughput is = 1200 X ( 1/6) = 200 IOPS

Similarly for PG deletion = 1200 X ( 2/6) = 400 IOPS

for I/O = 1200 X ( 3/6) = 600 IOPS

The minimum requirement for Recovery operation is 250, but 200 IOPS are allocated. Similarly, for the I/O operations, the minimum requirement is 1000, but 600 are allocated. In both scenarios, the resources are less than what is required. For PG deletion, the minimum requirement is 250, and allocated resources are 400 which is more than the required resources.

Drawbacks of WPQ:

WPQ is not inherently a fairness algorithm. While it does provide a mechanism for prioritizing tasks based on weight.which means higher-weighted tasks will always receive more resources or attention than lower-weighted ones. This can lead to unfairness if low-priority tasks are consistently starved of resources.WPQ does not ensure that all tasks get an equal or proportional share of resources.

mclock

The mClock algorithm was developed by Ajay Gulati of VMware, Arif Merchant of HP Labs, and Peter J. Varman of Rice University.

Like WPQ the mClock doesn't pre-allocate fixed, dedicated portions of a resource (like bandwidth or IOPS) to specific services. Instead, mClock uses a more dynamic and flexible approach. Mclock uses the “Tag-based scheduling” approach, the “Tag-based scheduling” is a common foundation for many fair scheduling algorithms. The core idea is to associate each request (e.g., an I/O request, a task, etc.) with a tag. These tags are then used to determine the order in which requests are serviced. Specifically, requests are scheduled and processed according to the order of their tag values. The tag values themselves can be assigned based on various criteria, such as priority, fairness weights, deadlines, or other relevant factors.

Mclock introduced three tags to manage I/O resource allocation.These tags are the Reservation tag, Limit tag, and Shares tag.

Mclock Algorithm

Reservation is a guaranteed minimum resource allocation to an entity. It represents the amount of the resource the entity is guaranteed to receive, regardless of the overall system load or the demands of other entities. Even if the system is under heavy contention, the entity will still receive its reserved minimum.

A Limit is a defined maximum allocation of a resource that a service is permitted to consume. Reservations and limits are expressed in absolute units of the resource (e.g., bandwidth, IOPS)

Shares represent a relative allocation measure, defining the proportional distribution of a resource among competing entities. They specify the ratio in which different entities receive service. A higher share value indicates a larger allocation of the resource

These parameters are externally provided, the responsibility falls on the process providing them to choose values that are suitable for the system's needs.

The mClock, manages resources for both active and idle services. For instance, In CEPH multiple services like I/O, recover, backfill, scrub, timing, PG deletion..etc share resources, some services might be actively requesting resources, while others might be temporarily idle. It's important to keep track of the progress of both active and idle services in terms of resource allocation. Even though an idle service isn't actively making requests, resource share should still be accounted for idle services. If an idle service suddenly becomes active, it shouldn't be unfairly penalized because it was previously idle. To achieve this synchronization, the mclock scheduling algorithm maintains a global tag value, often called "global virtual time" or just "virtual time."

Algorithm

Mclock Algorithm provides pseudo code of various components of the algorithm.

The scheduler has three main components: (i) Tag Assignment (ii) Tag Adjustment and (iii) Request Scheduling.

Tag Assignment:

Tag assignment assigns values to the R, L, and P tags for each I/O request. When a request comes in from a specific service at a certain time, this routine determines the appropriate tag values based on its reservations, limits, and share. The scheduler then uses these tags to decide when and how to process the request.

Reservation Tag R_i^r = max{R_i^r-1 +1/r_i ,t}

Where R_i^r =Reservation tag of request r from v_i

Limit Tag L_i^r = max{L_i^r-1 +1/l_i ,t}

Where L_i^r =Limit tag of request r from v_i

Shares Tag P_i^r = max{P_i^r-1 +1/p_i ,t}

Where P_i^r =Share tag of request r from v_i

Tag Adjustment:

Tag adjustment is a mechanism employed to synchronize proportional share tags with actual real time. This calibration is essential when a previously idle service transitions to an active state. By adjusting the tags of the now-active service, the scheduler ensures that its resource allocation accurately reflects its fair share, preventing it from being unfairly penalized for its prior inactivity.

minPtag = minimum of all P tags;

foreach active VM vj do

P_j^r = P_j^r − ( minPtag − t);

Request Scheduling:

mClock's request scheduling uses three tags (Reservation, Limit, and Weight) instead of just one. It alternates between two phases: constraint-based and weight-based. In the constraint-based phase, it prioritizes requests from services whose Reservation tags are less than the current time, ensuring minimum guaranteed service. Once all services have met their reservations, it switches to the weight-based phase. Here, it selects requests based on the Weight tag (proportional share), but only from VMs that haven't reached their Limit. Crucially, during the weight-based phase, mClock adjusts the Reservation tags of outstanding requests to prevent weight-based allocations from interfering with guaranteed reservations.

Example

To illustrate let's take the example with the Max QueueDepth = 32

Services	Reservations(r)	Limit(l)	Shares(w)
I/O	10	50	30
Recovery	5	40	20
Best-effort(backfill, scrub, snap trim and PG deletion requests)	8	60	25

For t = 0 (Only I/O Request Arrives, All Services Idle)

Tag Adjustment

At t = 0, only an I/O request arrives, and all services (I/O, Recovery, Best-effort) are idle. The “minPtag” adjustment is required.

minPtag = minimum of all active P tags

Since no active requests exist, so minPtag = 0

Only the I/O service is active now so the “p” Tag of I/O needs to adjust

P_I/O¹ = 0−(0−0)=0

Tag Assignment

(i) Reservation Tag-

Since R_I/O^r-1= 0 and the reservation is 10

(ii) Limit Tag-

Since L_I/O^r-1= 0 and the limit is 50

(iii) Shares Tag-

Since P_I/O^r-1= 0 and the share is 30

Schedule Request

The number of currently active I/O operations is 0 which is greater than the maximum allowed queue depth of 32. The statement Active IOs ≥ Max QueueDepth is valid.

The set E is empty due to the absence of services satisfying the condition where the R tag ≤ t.Currently, the R tag value is 0.1, while t is 0.

Similarly, The set t E′ is empty due to the absence of services satisfying the condition where the L tag ≤ t.Currently, the L tag value is 0.02, while t is 0.

After execution of the method, The request remains in the queue and will be considered in the next scheduling cycle.

t=0 execution output

Services	Reservations(r)	Limit(l)	Shares(w)	Is Scheduled
I/O	0.1	0.02	0.033	Waiting
Recovery	idle	idle	idle	No Request
Best-effort(backfill, scrub, snap trim and PG deletion requests)	idle	idle	idle	No Request

For t = 1 (Only Recovery Request Arrives, Best-effort is Idle)

Tag Adjustment

At t = 1, only a Recovery request arrives. The I/O request from t = 0 is still in the queue but was not scheduled.The Recovery service was idle before this request and needs to perform minPtag adjustment.

At this time only I/O priority tag is available and P_I/O = 0.033

So, minPtag= 0.033

We need to adjust all active service priority tags. Now only I/O is active so,

Currently, the Recover is not active so not required to adjust.

Tag Assignment

(i) Reservation Tag-

Since R^r-1_Recovery= 0 and the reservation is 5

(ii) Limit Tag

Since L^r-1_Recovery= 0 and the limit is 40

(iii) Shares Tag

Since P^r-1_Recovery= 0 and the share is 20

Schedule Request

The Active_IOs are not greater than Max_QueueDepth so we proceed to the next step.

E is the set of requests with R tag ≤ t.

Now E = { R_I/O, R_Recovery } where R_I/O = 0.1 <1 and R_Recovery = 1 =1

Choose the request with the lowest R value, The I/O has the lowest R value so I/O is selected.

Now Active_IOs = 1

t=1 execution output

Services	Reservations(r)	Limit(l)	Shares(w)	Is Scheduled
I/O	0.1	0.02	1	Yes
Recovery	1	1	1	Waiting
Best-effort(backfill, scrub, snap trim and PG deletion requests)	idle	idle	idle	No Request

For t = 2 (Best-effort Request Arrives)

Tag Adjustment

Since Best-effort is idle, It required the minPtag adjustment.

minPtag = minimum of all active P tags = 1

It needs to adjust P tags for active VMs:

Tag Assignment

(i) Reservation Tag-

(ii) Limit Tag

(iii) Shares Tag

Schedule Request

The Active_IOs are not greater than Max_QueueDepth so we proceed to the next step.

E is the set of requests with R tag ≤ t.

Now E = { R_Recovery,R_BestEffort } where RRecovery = 1 <2 and RBestEfort = 2 =2

Choose the request with the lowest R value, The Recovery has the lowest R value so Recovery is selected.

Now Active_IOs =2

t=2 execution output

Services	Reservations(r)	Limit(l)	Shares(w)	Is Scheduled
I/O	0.1	0.02	2	Yes
Recovery	1	1	2	Yes
Best-effort(backfill, scrub, snap trim and PG deletion requests)	2	2	2	Waiting

Similarly at t=3 the I/O request arrived then the execution output is

Services	Reservations(r)	Limit(l)	Shares(w)	Is Scheduled
I/O	3	3	3	Yes
Recovery	1	1	2	Yes
Best-effort(backfill, scrub, snap trim and PG deletion requests)	2	2	2	Yes

The example illustrates that the algorithm effectively balances competing demands by prioritizing requests based on reservation guarantees, proportional shares, and limits. The dynamic adjustment of tag values is crucial for maintaining fairness in the face of varying workloads.

dmClock

The mClock algorithm was designed for implementation within a single server or node environment not on the distributed environments. The distributed version of mclock is dmClock.

To enhance the MClock algorithm to DMClock, a key modification is introduced in the Tag Assignment component to account for the distributed model. The original MClock algorithm assigns tags based on local resource availability and priorities within a single server. However, in the distributed environment of DMClock, the Tag-Assignment component is adapted to consider the global state of resources across multiple servers.

Two parameters, ρ (rho) and δ (delta) are introduced to account for global resource states across servers in the Tag Assignment component.

(i) Number of Requests Served Under Reservation (ρ)

ρ_i is the number of requests from vi that have been served as part of the reservation-based scheduling since the last request from v_i to the storage server.

For example-

If VM-1 has a reservation of 10 IOPS.VM-1 sends a request at t=0 and the system schedules 5 of its requests under reservation. At t=1 the reservations already served was 5.

So ρ₁ for VM-1 at t=1 is 5.

(ii) Total Requests Completed Across All Servers Since Last Request (δ)

δ_i is the number of total requests completed across all servers (for any services) between the previous request from v_i and the current request.

For example-

Suppose the system has three VMs (VM-1, VM-2, and VM-3).

From t=0 to t=3 the server completed-

10 requests for VM-1, 2 requests for VM-2 and 8 requests for VM-3

when VM-1 sends another request at t=4 the total number of requests completed across all servers is

δ₁ = 20 (10+2+8)

So δ₁ for VM-1 at t=4 is 20

These two new parameters ρ (rho) and δ (delta) are used in the “Tag assignment” method.

The three new equations in the Tag assignment method are-

Apart from the ρ (rho) and δ (delta) adjustments, which track global service across multiple servers, the rest of the dMClock algorithm operates the same as mClock in terms of “Tag Adjustment” and “Schedule Request”.

In conclusion, the analysis and implementation of the provided algorithm have significantly enhanced our understanding of the CEPH MClock scheduler's different profiles, namely balanced, high_client_ops, and high_recovery_ops.

0 comments

44 views

Permalink

https://community.ibm.com/community/user/blogs/srinivasa-bharath-kanta/2025/02/13/cephs-scheduling-algorithm-evolution-from-wpq-to-d

IBM Storage Ceph

IBM Storage Ceph

Ceph's Scheduling Algorithm Evolution: From WPQ to dmClock

By Srinivasa Bharath Kanta posted Thu February 13, 2025 05:09 AM