Scheduling algorithm

How does the thread scheduler pick a thread to run?

See the next answer.

How does the scheduling algorithm work?

The thread scheduler evaluates a merit function on each partition and chooses the partition with the highest merit. It then picks the highest-priority thread in that partition. A partition with budget has more merit than a partition that has exhausted its budget.

First, let's look at a few helper functions: COMPETING(p), HAS_BUDGET(p), MAY_RUN_CRITICAL(p), HIGHEST_PRIO(p), and RFF(p). These appear in the merit functions and mode definitions below.

The scheduler also defines some operating modes in terms of boolean expressions built from these helpers:

underload
When COMPETING(p) && (HAS_BUDGET(p)||MAY_RUN_CRITICAL(p)) == True for at least one partition p.
all_at_limit
When COMPETING(p) == True for all partitions p, and HAS_BUDGET(p)||MAY_RUN_CRITICAL(p) == False for all partitions p.
free_time
When COMPETING(p) == False for at least one partition p that has a nonzero budget.
idle
When COMPETING(p) == False for all partitions p.
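
Here's a minimal sketch, in C, of how these modes could be classified. The partition_t type, its fields, and classify_mode() are hypothetical names used only to illustrate the boolean expressions above; they aren't the actual kernel interfaces:

    #include <stdbool.h>

    /* Illustrative stand-ins for the helper functions above. */
    typedef struct {
        bool competing;          /* COMPETING(p)        */
        bool has_budget;         /* HAS_BUDGET(p)       */
        bool may_run_critical;   /* MAY_RUN_CRITICAL(p) */
    } partition_t;

    typedef enum { MODE_UNDERLOAD, MODE_ALL_AT_LIMIT, MODE_FREE_TIME, MODE_IDLE } op_mode_t;

    static op_mode_t classify_mode(const partition_t *part, int nparts)
    {
        bool any_competing = false, all_competing = true, any_within_budget = false;

        for (int i = 0; i < nparts; i++) {
            bool b = part[i].has_budget || part[i].may_run_critical;
            any_competing     = any_competing || part[i].competing;
            all_competing     = all_competing && part[i].competing;
            any_within_budget = any_within_budget || (part[i].competing && b);
        }

        if (!any_competing)
            return MODE_IDLE;           /* no partition is competing                        */
        if (any_within_budget)
            return MODE_UNDERLOAD;      /* a competing partition still has budget/critical  */
        if (all_competing)
            return MODE_ALL_AT_LIMIT;   /* everyone is competing, everyone is out of budget */
        /* Otherwise at least one partition isn't competing; this sketch treats that as
         * free_time, ignoring the nonzero-budget qualifier in the definition above. */
        return MODE_FREE_TIME;
    }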

The scheduler picks one of the merit functions, depending on the operating mode:

underload
merit(p) = (COMPETING(p), HAS_BUDGET(p)||MAY_RUN_CRITICAL(p), HIGHEST_PRIO(p), RFF(p) )
all_at_limit
merit(p) = (COMPETING(p), RFF(p))
free_time, default
merit(p) = (COMPETING(p), HAS_BUDGET(p)||MAY_RUN_CRITICAL(p), HIGHEST_PRIO(p), RFF(p) )
free_time, SCHEDPOL_RATIO
merit(p) = (COMPETING(p), HAS_BUDGET(p)||MAY_RUN_CRITICAL(p), RFF(p) )
idle
Undefined.

If the mode is idle, the scheduler chooses to run the idle thread in the System partition.

Otherwise, the scheduler chooses to run the highest-priority thread, with a runmask compatible with the CPU on which the scheduler was invoked, from the partition p such that:

merit(p) > merit(p')

for all p' not equal to p.

Merit functions return tuples, and tuples are compared lexicographically. For example:

(a,b) < (c,d) if (a<c) || ((a==c) && (b<d))
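
As a sketch of this ordering in C, here's how the underload/free_time merit tuple could be compared and the winning partition chosen. The merit_t struct, its fields, and the helper names are illustrative assumptions, not the kernel's actual code:

    #include <stdbool.h>

    /* One merit tuple: (COMPETING, HAS_BUDGET||MAY_RUN_CRITICAL, HIGHEST_PRIO, RFF). */
    typedef struct {
        bool competing;                /* COMPETING(p)                            */
        bool has_budget_or_critical;   /* HAS_BUDGET(p) || MAY_RUN_CRITICAL(p)    */
        int  highest_prio;             /* HIGHEST_PRIO(p)                         */
        int  rff;                      /* RFF(p), reduced to a comparable integer */
    } merit_t;

    /* Lexicographic comparison: the first field that differs decides the order. */
    static bool merit_less(merit_t a, merit_t b)
    {
        if (a.competing != b.competing)
            return a.competing < b.competing;
        if (a.has_budget_or_critical != b.has_budget_or_critical)
            return a.has_budget_or_critical < b.has_budget_or_critical;
        if (a.highest_prio != b.highest_prio)
            return a.highest_prio < b.highest_prio;
        return a.rff < b.rff;
    }

    /* The scheduler then runs a thread from the partition whose merit is greatest. */
    static int best_partition(const merit_t *merit, int nparts)
    {
        int best = 0;
        for (int p = 1; p < nparts; p++)
            if (merit_less(merit[best], merit[p]))
                best = p;
        return best;
    }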

How does the scheduler find the highest-priority thread in a partition?

Very quickly. Each partition has a bitmap that tracks the priority levels (0 to 255) that are in use by at least one ready-to-run thread in that partition.

Each time the scheduler makes a thread ready to run, it sets the bit corresponding to that thread's priority. When the scheduler runs a thread (that is, the thread leaves the ready-to-run state), it examines the queue of threads in that partition that are ready to run at the same priority. If there are no other threads at that priority, the scheduler clears the bit for that priority.

When the scheduler needs to know the highest-priority thread that is ready to run in a partition, it uses the bitmap to index a table that maps integers to the index of their highest set bit. This is done with a set of smaller tables to avoid the need for 2^255 table elements.
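
Here's a sketch, in C, of the bitmap technique: a 256-entry per-byte table is enough to find the highest set bit of the whole 256-bit map one byte at a time. The names below are illustrative, not the actual kernel macros:

    #include <stdint.h>

    static uint8_t prio_bitmap[256 / 8];      /* bit i set => a ready thread exists at priority i */
    static uint8_t highest_bit_in_byte[256];  /* index of the highest set bit in each byte value  */

    static void init_highest_bit_table(void)
    {
        for (int x = 1; x < 256; x++) {
            int b = 7;
            while (!(x & (1 << b)))
                b--;
            highest_bit_in_byte[x] = (uint8_t)b;
        }
    }

    /* Set or clear a priority's bit as threads become ready or stop being ready. */
    static void mark_ready(int prio)   { prio_bitmap[prio >> 3] |=  (uint8_t)(1u << (prio & 7)); }
    static void mark_unready(int prio) { prio_bitmap[prio >> 3] &= (uint8_t)~(1u << (prio & 7)); }

    /* Return the highest priority with a ready-to-run thread, or -1 if there is none. */
    static int highest_ready_prio(void)
    {
        for (int byte = (256 / 8) - 1; byte >= 0; byte--) {
            if (prio_bitmap[byte])
                return byte * 8 + highest_bit_in_byte[prio_bitmap[byte]];
        }
        return -1;
    }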

The same mechanism is also used in classic Neutrino scheduling.

How are RFFs (relative fraction free) computed?

For the scheduling algorithm, the computation of RFF() requires floating-point division. However, Neutrino doesn't perform floating-point operations inside the kernel, or even fixed-point division, because these operations are very slow on some platforms.

Neutrino computes a function equivalent to RFF() that requires only addition and multiplication.

How does the scheduler algorithm avoid division and floating-point mathematics?

For the scheduling algorithm, the computation of RFF() requires floating-point division. However, Neutrino doesn't need the absolute values of RFF(); it needs to know only the relative ordering of RFF(p1), RFF(p2), ..., RFF(pn).

Therefore, Neutrino computes a different function that has the same ordering properties as RFF(). This function is computable with only addition and 16×16 bit multiplication.

The idea is:

  1. relative_fraction_free(p), or RFF(p) = 1 - cycles_used/budget_cycles

    However, instead of finding the partition p such that RFF(p) > RFF(p') for all p' not equal to p, define relative_fraction_used(p) = RFU(p) = cycles_used/budget_cycles, and find the partition p such that RFU(p) < RFU(p') for all p' not equal to p.

  2. Then find a function that has the same ordering properties as RFU():
    • Find:

      used_cycles(p0)/budget_cycles(p0) < used_cycles(p1)/budget_cycles(p1) < ... < used_cycles(pn)/budget_cycles(pn)

    • let

      k = budget_cycles(p0) * budget_cycles(p1) * ... * budget_cycles(pn), then

    • k/budget_cycles(p0)*used_cycles(p0) < k/budget_cycles(p1)*used_cycles(p1) < ... < k/budget_cycles(pn)*used_cycles(pn), as long as all numbers are >0.
    • Values of c(p) = k/budget_cycles(p), for all p, are computed once, or whenever any partition's percentage budget is changed. The values are stored and aren't recalculated during scheduling.
    • At scheduling time, Neutrino computes f(p) = used_cycles(p) * c(p)

      and compares f(p) with f(p') to find which partition has the better RFF().
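
Here's a minimal sketch of that idea in C. The array names and the illustrative NPARTS constant are assumptions, and this naive form still uses 64-bit arithmetic; the complications described next deal with that:

    #include <stdint.h>

    #define NPARTS 4   /* illustrative partition count */

    static uint64_t budget_cycles[NPARTS];  /* per-partition budget, in cycles         */
    static uint64_t used_cycles[NPARTS];    /* cycles consumed in the averaging window */
    static uint64_t c[NPARTS];              /* c(p) = k / budget_cycles(p)             */

    /* Recomputed only when a partition's percentage budget changes. */
    static void compute_coefficients(void)
    {
        uint64_t k = 1;
        for (int p = 0; p < NPARTS; p++)
            if (budget_cycles[p] != 0)       /* zero-budget partitions are handled separately below */
                k *= budget_cycles[p];
        for (int p = 0; p < NPARTS; p++)
            c[p] = (budget_cycles[p] != 0) ? k / budget_cycles[p] : 0;
    }

    /* True if partition a has a better (larger) RFF than partition b, i.e. a smaller
     * RFU, without any division at scheduling time. */
    static int better_rff(int a, int b)
    {
        return used_cycles[a] * c[a] < used_cycles[b] * c[b];
    }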

However, there are two complications:

Running out of bits
So far, f(p) = used_cycles(p) * c(p) requires 64-bit multiplication. However, since the accuracy specification is 0.2%, Neutrino scales all values of c(p) by a common factor until the largest value fits in 16 bits. Neutrino also shifts used_cycles(p) until its largest possible value fits in 16 bits. Therefore, at scheduling time, Neutrino computes only:
f(p) = (used_cycles(p)>>scaling_factor) * scaled_c(p)
Zero-budget partitions
The above algorithm nominally requires Neutrino to multiply and divide everything by zero. However, RFF() of a zero-budget partition is defined to be a constant that is smaller than any nonzero-budget partition's possible value of RFF(), so Neutrino defines RFU(p) for a zero-budget partition as a constant greater than any other partition's RFU(). The largest value of f() for a nonzero-budget partition is window size * c(pm), where c(pm) > c(p') for all p' not equal to pm.

Therefore, Neutrino can set f() for a zero-budget partition as:

f_zero = 1 + window size*c(pm)

and then scale it as described for running out of bits.

Note: The window size is expressed in cycles.
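
Continuing the sketch above (and reusing its hypothetical NPARTS, c[], budget_cycles[], and used_cycles[] arrays), the two fix-ups might look like this; the exact scaling rule and the names are assumptions of the sketch:

    #include <stdint.h>

    static uint16_t scaled_c[NPARTS];   /* c(p) scaled so the largest value fits in 16 bits */
    static unsigned scaling_factor;     /* right shift applied to used_cycles(p)            */
    static uint32_t f_zero;             /* constant key used for zero-budget partitions     */

    static void scale_coefficients(uint64_t window_size_cycles)
    {
        uint64_t cmax = 0;
        for (int p = 0; p < NPARTS; p++)
            if (c[p] > cmax)
                cmax = c[p];

        /* Scale every c(p) by a common power of two until the largest fits in 16 bits. */
        unsigned shift = 0;
        while ((cmax >> shift) > 0xFFFF)
            shift++;
        for (int p = 0; p < NPARTS; p++)
            scaled_c[p] = (uint16_t)(c[p] >> shift);

        /* Shift used_cycles(p) so its largest possible value (the window size) fits in 16 bits. */
        scaling_factor = 0;
        while ((window_size_cycles >> scaling_factor) > 0xFFFF)
            scaling_factor++;

        /* f_zero = 1 + window size * c(pm), with both factors already scaled, so a
         * zero-budget partition always looks worse (larger RFU) than any other. */
        f_zero = 1 + (uint32_t)((window_size_cycles >> scaling_factor) * (cmax >> shift));
    }

    /* Scheduling-time key: one shift and one 16x16-bit multiply; smaller means better RFF. */
    static uint32_t f(int p)
    {
        if (budget_cycles[p] == 0)
            return f_zero;
        return (uint32_t)((used_cycles[p] >> scaling_factor) * scaled_c[p]);
    }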

How does the scheduler algorithm determine whether a thread that's allowed to run as critical should actually run as critical?

See the next answer.

How does the scheduler algorithm decide when to bill critical time?

When the scheduling algorithm picks a thread that's allowed to run as critical, it doesn't always charge that thread's CPU time to its partition's critical budget. A thread t charges its CPU time to the critical budget of its partition p only when all of the following are true at the time the scheduling algorithm is invoked:

For definitions of COMPETING(), HAS_BUDGET(), and MAY_RUN_CRITICAL(), see "How does the scheduling algorithm work?" in Scheduling behaviour.

What are the algorithm's size limitations?

The mathematics of the algorithm extend to any number of partitions. However, the current implementation has these limitations:

What are the algorithm's accuracy limitations?

Accuracy refers to how closely the scheduler can guarantee, or limit, a partition to consuming only its budget on a loaded system. For Neutrino, the accuracy is whichever is greater:

Note: When you change the averaging window size to x ms, the accuracy is undefined for the next x ms.

The first limitation comes from the precision with which the RFF() calculation is carried out: RFF() is computed to a limited number of bits, specifically to speed up the scheduling algorithm.

The second limitation comes from the uncertainty in predicting how long a thread will run before it voluntarily blocks, is preempted by a higher-priority thread, or the next tick interrupt occurs. This limitation comes from the fact that the thread scheduler is guaranteed to get control of the system only on every tick (though it may run more often).

In practice, the last limitation means that when the window size is changed, the scheduler clears its history of used CPU time, so the partition (p) with the highest-priority thread may run for budget(p) * window size (in ms) before another partition runs. After one window size (in milliseconds) has elapsed, all budgets are again guaranteed. So a partition configured for a budget of 40%, with a 100-millisecond averaging window, is considered to be scheduled accurately when its usage over the last 100 ms was 39 to 41 ms, provided the window size hasn't changed in the last 100 milliseconds. In practice, the scheduling accuracy is usually much better.

When is the scheduling algorithm approximated?

In order to save overhead, a very short version of the scheduling algorithm is used on some paths involved in message passing. This short version is implemented by internal scheduler functions such as ready_ppg(), block_and_ready_ppg(), and adjust_priority_ppg().