Storage

 View Only

Ceph's Scheduling Algorithm Evolution: From WPQ to dmClock

By Srinivasa Bharath Kanta posted 13 days ago

  

Prior to dmclock(WPQ):

The main operations in the ceph cluster are client I/O, background recovery, and background operations(scrubbing, backfilling, trimming, PG deletion..etc). Before the 6.1 CEPH  version, the requests are scheduled by using the WPQ (Weighted Priority Queue) algorithm.WPQ prioritizes the tasks based on the assigned weights. Requests with higher weights are served before those with lower weights.

For example:

Requests Arrival Time Weight

I/O

t1 3

Recovery

t2 1

Trimming

t3 5

PG deletion

t4 2

                                                                                       Table-1                             

The final execution order will be:  Trimming(5)->I/O(3)->PG deletion(2)->Recovery(1)

For any operation within the cluster, the system dynamically allocates and manages CPU, memory, network, and storage I/O resources. In WPQ, The system dynamically adjusts resource allocation (CPU, memory, network, and storage I/O) based on these weights.

To illustrate this let's take an example-

Requests

weights

Reservation 

Recovery

1

250

PG deletion

2

250

I/O

3

1000

                                                                           Table-2

The throughput allocated to requests is in proportion to their weights.

               Throughput allocation ∝  Weight   

Assume that the  maximum throughput in the device is  1200 IOPS

The IOPS allocation is based on the formula -  T × (wi/∑wi)  where T = Total throughput w=Weight                    

So, for the recovery, the allocated throughput is =  1200 X ( 1/6) = 200 IOPS

                                    Similarly for PG deletion      =  1200 X ( 2/6) = 400 IOPS

                                                   for  I/O                     =  1200 X ( 3/6) = 600 IOPS

The minimum requirement for Recovery operation is 250, but 200 IOPS are allocated. Similarly, for the I/O operations, the minimum requirement is 1000, but 600 are allocated. In both scenarios, the resources are less than what is required. For PG deletion, the minimum requirement is 250, and allocated resources are 400 which is more than the required resources.

Drawbacks of WPQ:

        WPQ is not inherently a fairness algorithm. While it does provide a mechanism for prioritizing tasks based on weight.which means higher-weighted tasks will always receive more resources or attention than lower-weighted ones. This can lead to unfairness if low-priority tasks are consistently starved of resources.WPQ does not ensure that all tasks get an equal or proportional share of resources. 

mclock

 

The mClock algorithm was developed by Ajay Gulati of VMware, Arif Merchant of HP Labs, and Peter J. Varman of Rice University.


Like WPQ the mClock doesn't pre-allocate fixed, dedicated portions of a resource (like bandwidth or IOPS) to specific services. Instead, mClock uses a more dynamic and flexible approach. Mclock uses the “Tag-based scheduling” approach,  the “Tag-based scheduling” is a common foundation for many fair scheduling algorithms. The core idea is to associate each request (e.g., an I/O request, a task, etc.) with a tag. These tags are then used to determine the order in which requests are serviced. Specifically, requests are scheduled and processed according to the order of their tag values. The tag values themselves can be assigned based on various criteria, such as priority, fairness weights, deadlines, or other relevant factors.  


Mclock introduced three tags to manage I/O resource allocation.These tags are the Reservation tag, Limit tag, and Shares tag.


                    Mclock Algorithm

Reservation is a guaranteed minimum resource allocation to an entity. It represents the amount of the resource the entity is guaranteed to receive, regardless of the overall system load or the demands of other entities. Even if the system is under heavy contention, the entity will still receive its reserved minimum. 

 A Limit is a defined maximum allocation of a resource that a service is permitted to consume. Reservations and limits are expressed in absolute units of the resource (e.g., bandwidth, IOPS)

Shares represent a relative allocation measure, defining the proportional distribution of a resource among competing entities. They specify the ratio in which different entities receive service. A higher share value indicates a larger allocation of the resource

These parameters are externally provided, the responsibility falls on the process providing them to choose values that are suitable for the system's needs.


 
The mClock, manages resources for both active and idle services. For instance,  In CEPH  multiple services like I/O, recover, backfill, scrub, timing, PG deletion..etc share resources, some services might be actively requesting resources, while others might be temporarily idle. It's important to keep track of the progress of both active and idle services in terms of resource allocation. Even though an idle service isn't actively making requests, resource share should still be accounted for idle services. If an idle service suddenly becomes active, it shouldn't be unfairly penalized because it was previously idle. To achieve this synchronization, the mclock scheduling algorithm maintains a global tag value, often called "global virtual time" or just "virtual time."

Algorithm

   Mclock Algorithm provides pseudo code of various components of the algorithm.

The scheduler has three main components: (i) Tag Assignment (ii) Tag Adjustment and (iii) Request Scheduling.

 

Tag Assignment:

    Tag assignment assigns values to the R, L, and P tags for each I/O request. When a request comes in from a specific service at a certain time, this routine determines the appropriate tag values based on its reservations, limits, and share. The scheduler then uses these tags to decide when and how to process the request.

                    

 Reservation Tag   Rir  = max{Rir-1 +1/ri ,t}            

                                    Where Rir  =Reservation tag of request r from vi

 Limit  Tag             Lir  = max{Lir-1 +1/li ,t}         

                                   Where Lir  =Limit tag of request r from vi

 Shares Tag          Pir  = max{Pir-1 +1/pi ,t}          

                               Where Pir  =Share tag  of request r from vi

Tag Adjustment:

Tag adjustment is a mechanism employed to synchronize proportional share tags with actual real time. This calibration is essential when a previously idle service transitions to an active state. By adjusting the tags of the now-active service, the scheduler ensures that its resource allocation accurately reflects its fair share, preventing it from being unfairly penalized for its prior inactivity.

 

minPtag = minimum of all P tags; 

foreach active VM vj do 

Pjr  = Pjr − ( minPtag − t);

 

Request Scheduling:

mClock's request scheduling uses three tags (Reservation, Limit, and Weight) instead of just one. It alternates between two phases: constraint-based and weight-based. In the constraint-based phase, it prioritizes requests from services whose Reservation tags are less than the current time, ensuring minimum guaranteed service. Once all services have met their reservations, it switches to the weight-based phase. Here, it selects requests based on the Weight tag (proportional share), but only from VMs that haven't reached their Limit. Crucially, during the weight-based phase, mClock adjusts the Reservation tags of outstanding requests to prevent weight-based allocations from interfering with guaranteed reservations.

Example

To illustrate let's take the example with the  Max QueueDepth = 32

Services

Reservations(r)

Limit(l)

Shares(w)

I/O

10

50

30

Recovery

40

20

Best-effort(backfill, scrub, snap trim and PG deletion requests)

8

60

25

For t = 0 (Only I/O Request Arrives, All Services Idle)

Tag Adjustment

At t = 0, only an I/O request arrives, and all services (I/O, Recovery, Best-effort) are idle.    The “minPtag” adjustment is required.

    minPtag = minimum of all active P tags

   Since no active requests exist, so minPtag = 0

 

Only the I/O service is active now so the “p” Tag of I/O needs to adjust 

                 PI/O1 = 0−(0−0)=0

Tag Assignment

 

         (i)  Reservation Tag-

                     

                                      Since RI/Or-1= 0  and the reservation is 10

          (ii) Limit Tag-

                 

                                     Since LI/Or-1= 0  and the limit is 50

         (iii) Shares Tag-

               

                                     Since PI/Or-1= 0  and the share is 30

  

Schedule Request

        The number of currently active I/O operations is 0 which is greater than the maximum allowed queue depth of 32. The statement Active IOs ≥ Max QueueDepth is valid.

The set E is empty due to the absence of services satisfying the condition where the R tag ≤ t.Currently, the R tag value is 0.1, while t is 0.

Similarly, The set t E′  is empty due to the absence of services satisfying the condition where the L tag ≤ t.Currently, the L tag value is 0.02, while t is 0.

After execution of the method, The request remains in the queue and will be considered in the next scheduling cycle. 

 t=0 execution output

Services

Reservations(r)

Limit(l)

Shares(w)

Is Scheduled

I/O

0.1

0.02

0.033

Waiting

Recovery

idle

idle

idle

No Request

Best-effort(backfill, scrub, snap trim and PG deletion requests)

idle

idle

idle

No Request

For t = 1 (Only Recovery Request Arrives, Best-effort is Idle)

Tag Adjustment

   

At t = 1, only a Recovery request arrives. The I/O request from t = 0 is still in the queue but was not scheduled.The Recovery service was idle before this request and needs to perform minPtag adjustment.

At this time only I/O priority tag is available  and PI/O = 0.033

  So,   minPtag= 0.033

We need to adjust all active service priority tags. Now only I/O is active so,

Currently, the Recover is not active so not required to adjust.

  

Tag Assignment

 

         (i)  Reservation Tag-

                        

  Since Rr-1Recovery= 0  and the reservation is 5               

(ii) Limit Tag                  

   Since Lr-1Recovery= 0  and the limit is 40               

           

             (iii)  Shares Tag

                             

 Since  Pr-1Recovery= 0  and the share is 20     

        

  Schedule Request       

              The  Active_IOs  are not greater than  Max_QueueDepth so we proceed to the next step.

              E is the set of requests with R tag ≤ t.   

             Now E = { RI/O, RRecovery } where  RI/O  = 0.1 <1 and RRecovery = 1 =1 

Choose the request with the lowest R value, The  I/O has the lowest R value so  I/O is selected. 

Now  Active_IOs = 1                        

  t=1 execution output

Services

Reservations(r)

Limit(l)

Shares(w)

Is Scheduled

I/O

0.1

0.02

1

Yes

Recovery

1

1

1

Waiting

Best-effort(backfill, scrub, snap trim and PG deletion requests)

idle

idle

idle

No Request

For t = 2 (Best-effort Request Arrives)

Tag Adjustment

         Since Best-effort is idle, It required the minPtag adjustment.

         minPtag = minimum of all active P tags = 1

          It needs to adjust P tags for active VMs:

 

Tag Assignment    

    (i) Reservation Tag-

           

     (ii) Limit Tag

                      

       (iii) Shares Tag

                                                    

     Schedule Request   

             The  Active_IOs  are not greater than  Max_QueueDepth so we proceed to the next step.

             E is the set of requests with R tag ≤ t.   

             Now E = { RRecovery,RBestEffort } where  RRecovery  = 1 <2 and RBestEfort = 2 =2 

Choose the request with the lowest R value, The Recovery has the lowest R value so  Recovery is selected. 

           Now  Active_IOs =2

 t=2 execution output

Services

Reservations(r)

Limit(l)

Shares(w)

Is Scheduled

I/O

0.1

0.02

2

Yes

Recovery

1

1

2

Yes

Best-effort(backfill, scrub, snap trim and PG deletion requests)

2

2

2

Waiting

Similarly at t=3 the I/O request arrived then the execution output is 

Services

Reservations(r)

Limit(l)

Shares(w)

Is Scheduled

I/O

3

3

3

Yes

Recovery

1

1

2

Yes

Best-effort(backfill, scrub, snap trim and PG deletion requests)

2

2

2

Yes

The example illustrates that the algorithm effectively balances competing demands by prioritizing requests based on reservation guarantees, proportional shares, and limits. The dynamic adjustment of tag values is crucial for maintaining fairness in the face of varying workloads.

dmClock

      The mClock algorithm was designed for implementation within a single server or node environment not on the distributed environments. The distributed version of mclock is dmClock.

To enhance the MClock algorithm to DMClock, a key modification is introduced in the Tag Assignment component to account for the distributed model. The original MClock algorithm assigns tags based on local resource availability and priorities within a single server. However, in the distributed environment of DMClock, the Tag-Assignment component is adapted to consider the global state of resources across multiple servers. 

Two parameters, ρ (rho) and δ (delta) are introduced to account for global resource states across servers in the Tag Assignment component.

(i)  Number of Requests Served Under Reservation (ρ)

      ρi is the number of requests from v​i that have been served as part of the reservation-based scheduling since the last request from v​i to the storage server.

    For example-

        If VM-1 has a reservation of 10 IOPS.VM-1 sends a request at t=0 and the system schedules 5 of its requests under reservation. At t=1 the reservations already served was 5.

                  So ρ1 for VM-1 at t=1 is 5.

 

(ii)  Total Requests Completed Across All Servers Since Last Request (δ)

         δi ​ is the number of total requests completed across all servers (for any services) between the previous request from v​i  and the current request.

    For example-

         Suppose the system has three VMs (VM-1, VM-2, and VM-3).

         From t=0 to t=3 the server completed-

            10 requests for VM-1, 2 requests for VM-2 and 8 requests for VM-3

              when VM-1 sends another request at t=4 the total number of requests completed across all   servers is

                     δ1   = 20 (10+2+8)

               So δ1  for VM-1 at t=4 is 20

  

These two new parameters ρ (rho) and δ (delta) are used in the “Tag assignment”  method.

The three new equations in the Tag assignment method are-     

 

Apart from the ρ (rho) and δ (delta) adjustments, which track global service across multiple servers, the rest of the dMClock algorithm operates the same as mClock in terms of “Tag Adjustment” and “Schedule Request”.

In conclusion, the analysis and implementation of the provided algorithm have significantly enhanced our understanding of the CEPH MClock scheduler's different profiles, namely balanced, high_client_ops, and high_recovery_ops. 

0 comments
44 views

Permalink