Scheduling algorithm

How does the thread scheduler pick a thread to run?

See the next answer.

How does the scheduling algorithm work?

The thread scheduler evaluates a merit function on each partition and chooses the partition with the highest merit. It then picks the highest-priority thread in that partition. A partition with budget has more merit than a partition that has exhausted its budget.

First, let's look at a few helper functions: COMPETING(p), HAS_BUDGET(p), MAY_RUN_CRITICAL(p), HIGHEST_PRIO(p), and RFF(p). These appear in the merit functions and mode definitions below.

The scheduler also defines some operating modes in terms of boolean expressions built from these helpers:

underload
When COMPETING(p) && (HAS_BUDGET(p)||MAY_RUN_CRITICAL(p)) == True for at least one partition p.
all_at_limit
When COMPETING(p) == True for all partitions p, and HAS_BUDGET(p)||MAY_RUN_CRITICAL(p) == False for all partitions p.
free_time
When COMPETING(p) == False for at least one partition p that has a nonzero budget.
idle
When COMPETING(p) == False for all partitions p.
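
Here's a minimal sketch, in C, of how these modes could be classified. The partition_t type, its fields, and classify_mode() are hypothetical names used only to illustrate the boolean expressions above; they aren't the actual kernel interfaces:

    #include <stdbool.h>

    /* Illustrative stand-ins for the helper functions above. */
    typedef struct {
        bool competing;          /* COMPETING(p)        */
        bool has_budget;         /* HAS_BUDGET(p)       */
        bool may_run_critical;   /* MAY_RUN_CRITICAL(p) */
    } partition_t;

    typedef enum { MODE_UNDERLOAD, MODE_ALL_AT_LIMIT, MODE_FREE_TIME, MODE_IDLE } op_mode_t;

    static op_mode_t classify_mode(const partition_t *part, int nparts)
    {
        bool any_competing = false, all_competing = true, any_within_budget = false;

        for (int i = 0; i < nparts; i++) {
            bool b = part[i].has_budget || part[i].may_run_critical;
            any_competing     = any_competing || part[i].competing;
            all_competing     = all_competing && part[i].competing;
            any_within_budget = any_within_budget || (part[i].competing && b);
        }

        if (!any_competing)
            return MODE_IDLE;           /* no partition is competing                        */
        if (any_within_budget)
            return MODE_UNDERLOAD;      /* a competing partition still has budget/critical  */
        if (all_competing)
            return MODE_ALL_AT_LIMIT;   /* everyone is competing, everyone is out of budget */
        /* Otherwise at least one partition isn't competing; this sketch treats that as
         * free_time, ignoring the nonzero-budget qualifier in the definition above. */
        return MODE_FREE_TIME;
    }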

The scheduler picks one of the merit functions, depending on the operating mode:

underload
merit(p) = (COMPETING(p), HAS_BUDGET(p)||MAY_RUN_CRITICAL(p), HIGHEST_PRIO(p), RFF(p) )
all_at_limit
merit(p) = (COMPETING(p), RFF(p))
free_time, default
merit(p) = (COMPETING(p), HAS_BUDGET(p)||MAY_RUN_CRITICAL(p), HIGHEST_PRIO(p), RFF(p) )
free_time, SCHEDPOL_RATIO
merit(p) = (COMPETING(p), HAS_BUDGET(p)||MAY_RUN_CRITICAL(p), RFF(p) )
idle
Undefined.

If the mode is idle, the scheduler chooses to run the idle thread in the System partition.

Otherwise, the scheduler chooses to run the highest-priority thread, with a runmask compatible with the CPU on which the scheduler was invoked, from the partition p such that:

merit(p) > merit(p')

for all p' not equal to p.

Merit functions return tuples, and tuples are compared lexicographically. For example:

(a,b) < (c,d) if (a<c) || ((a==c) && (b<d))
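
As a sketch of this ordering in C, here's how the underload/free_time merit tuple could be compared and the winning partition chosen. The merit_t struct, its fields, and the helper names are illustrative assumptions, not the kernel's actual code:

    #include <stdbool.h>

    /* One merit tuple: (COMPETING, HAS_BUDGET||MAY_RUN_CRITICAL, HIGHEST_PRIO, RFF). */
    typedef struct {
        bool competing;                /* COMPETING(p)                            */
        bool has_budget_or_critical;   /* HAS_BUDGET(p) || MAY_RUN_CRITICAL(p)    */
        int  highest_prio;             /* HIGHEST_PRIO(p)                         */
        int  rff;                      /* RFF(p), reduced to a comparable integer */
    } merit_t;

    /* Lexicographic comparison: the first field that differs decides the order. */
    static bool merit_less(merit_t a, merit_t b)
    {
        if (a.competing != b.competing)
            return a.competing < b.competing;
        if (a.has_budget_or_critical != b.has_budget_or_critical)
            return a.has_budget_or_critical < b.has_budget_or_critical;
        if (a.highest_prio != b.highest_prio)
            return a.highest_prio < b.highest_prio;
        return a.rff < b.rff;
    }

    /* The scheduler then runs a thread from the partition whose merit is greatest. */
    static int best_partition(const merit_t *merit, int nparts)
    {
        int best = 0;
        for (int p = 1; p < nparts; p++)
            if (merit_less(merit[best], merit[p]))
                best = p;
        return best;
    }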

How does the scheduler find the highest-priority thread in a partition?

Very quickly. Each partition has a bitmap that tracks the priority levels (0 to 255) that are in use by at least one ready-to-run thread in that partition.

Each time the scheduler makes a thread ready to run, it sets the bit corresponding to that thread's priority. When the scheduler runs a thread (that is, the thread leaves the ready-to-run state), it examines the queue of threads in that partition that are ready to run at the same priority. If there are no other threads at that priority, the scheduler clears the bit for that priority.

When the scheduler needs to know the highest-priority thread that is ready to run in a partition, it uses the bitmap to index a table that maps integers to the index of their highest set bit. This is done with a set of smaller tables to avoid the need for 2^255 table elements.
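
Here's a sketch, in C, of the bitmap technique: a 256-entry per-byte table is enough to find the highest set bit of the whole 256-bit map one byte at a time. The names below are illustrative, not the actual kernel macros:

    #include <stdint.h>

    static uint8_t prio_bitmap[256 / 8];      /* bit i set => a ready thread exists at priority i */
    static uint8_t highest_bit_in_byte[256];  /* index of the highest set bit in each byte value  */

    static void init_highest_bit_table(void)
    {
        for (int x = 1; x < 256; x++) {
            int b = 7;
            while (!(x & (1 << b)))
                b--;
            highest_bit_in_byte[x] = (uint8_t)b;
        }
    }

    /* Set or clear a priority's bit as threads become ready or stop being ready. */
    static void mark_ready(int prio)   { prio_bitmap[prio >> 3] |=  (uint8_t)(1u << (prio & 7)); }
    static void mark_unready(int prio) { prio_bitmap[prio >> 3] &= (uint8_t)~(1u << (prio & 7)); }

    /* Return the highest priority with a ready-to-run thread, or -1 if there is none. */
    static int highest_ready_prio(void)
    {
        for (int byte = (256 / 8) - 1; byte >= 0; byte--) {
            if (prio_bitmap[byte])
                return byte * 8 + highest_bit_in_byte[prio_bitmap[byte]];
        }
        return -1;
    }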

The same mechanism is also used in classic Neutrino scheduling.

How are RFFs (relative fraction free) computed?

For the scheduling algorithm, the computation of RFF() requires floating-point division. However, Neutrino doesn't perform floating-point operations inside the kernel, or even fixed-point division, because these operations are very slow on some platforms.

Neutrino computes a function equivalent to RFF() that requires only addition and multiplication.

How does the scheduler algorithm avoid division and floating-point mathematics?

For the scheduling algorithm, the computation of RFF() requires floating-point division. However, Neutrino doesn't need the absolute values of RFF(); it needs to know only the relative ordering of RFF(p1), RFF(p2), ..., RFF(pn).

Therefore, Neutrino computes a different function that has the same ordering properties as RFF(). This function is computable with only addition and 16×16 bit multiplication.

The idea is:

  1. relative_fraction_free(p), or RFF(p) = 1 - cycles_used/budget_cycles

    However, instead of finding the partition p such that RFF(p) > RFF(p') for all p' not equal to p, define relative_fraction_used(p) = RFU(p) = cycles_used/budget_cycles, and find the partition p such that RFU(p) < RFU(p') for all p' not equal to p.

  2. Then find a function that has the same ordering properties as RFU():
    • Find:

      used_cycles(p0)/budget_cycles(p0) < used_cycles(p1)/budget_cycles(p1) < ... < used_cycles(pn)/budget_cycles(pn)

    • let

      k = budget_cycles(p0) * budget_cycles(p1) * ... * budget_cycles(pn), then

    • k/budget_cycles(p0)*used_cycles(p0) < k/budget_cycles(p1)*used_cycles(p1) < ... < k/budget_cycles(pn)*used_cycles(pn), as long as all numbers are >0.
    • Values of c(p) = k/budget_cycles(p), for all p, are computed once, or whenever any partition's percentage budget is changed. The values are stored and aren't recalculated during scheduling.
    • At scheduling time, Neutrino computes f(p) = used_cycles(p) * c(p)

      and compares f(p) with f(p') to find which partition has the better RFF().
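
Here's a minimal sketch of that idea in C. The array names and the illustrative NPARTS constant are assumptions, and this naive form still uses 64-bit arithmetic; the complications described next deal with that:

    #include <stdint.h>

    #define NPARTS 4   /* illustrative partition count */

    static uint64_t budget_cycles[NPARTS];  /* per-partition budget, in cycles         */
    static uint64_t used_cycles[NPARTS];    /* cycles consumed in the averaging window */
    static uint64_t c[NPARTS];              /* c(p) = k / budget_cycles(p)             */

    /* Recomputed only when a partition's percentage budget changes. */
    static void compute_coefficients(void)
    {
        uint64_t k = 1;
        for (int p = 0; p < NPARTS; p++)
            if (budget_cycles[p] != 0)       /* zero-budget partitions are handled separately below */
                k *= budget_cycles[p];
        for (int p = 0; p < NPARTS; p++)
            c[p] = (budget_cycles[p] != 0) ? k / budget_cycles[p] : 0;
    }

    /* True if partition a has a better (larger) RFF than partition b, i.e. a smaller
     * RFU, without any division at scheduling time. */
    static int better_rff(int a, int b)
    {
        return used_cycles[a] * c[a] < used_cycles[b] * c[b];
    }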

However, there are two complications:

Running out of bits
So far, f(p) = used_cycles(p) * c(p) requires 64-bit multiplication. However, since the accuracy specification is 0.2%, Neutrino scales all values of c(p) by a common factor until the largest value fits in 16 bits. Neutrino also shifts used_cycles(p) until its largest possible value fits in 16 bits. Therefore, at scheduling time, Neutrino computes only:
f(p) = (used_cycles(p)>>scaling_factor) * scaled_c(p)
Zero-budget partitions
The above algorithm nominally requires Neutrino to multiply and divide everything by zero. However, RFF() of a zero-budget partition is defined to be a constant that is smaller than any nonzero-budget partition's possible value of RFF(), so Neutrino defines RFU(p) for a zero-budget partition as a constant greater than any other partition's RFU(). The largest value of f() for a nonzero-budget partition is window size * c(pm), where c(pm) > c(p') for all p' not equal to pm.

Therefore, Neutrino can set f() for a zero-budget partition as:

f_zero = 1 + window size*c(pm)

and then scale it as described for running out of bits.

Note: The window size is expressed in cycles.
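
Continuing the sketch above (and reusing its hypothetical NPARTS, c[], budget_cycles[], and used_cycles[] arrays), the two fix-ups might look like this; the exact scaling rule and the names are assumptions of the sketch:

    #include <stdint.h>

    static uint16_t scaled_c[NPARTS];   /* c(p) scaled so the largest value fits in 16 bits */
    static unsigned scaling_factor;     /* right shift applied to used_cycles(p)            */
    static uint32_t f_zero;             /* constant key used for zero-budget partitions     */

    static void scale_coefficients(uint64_t window_size_cycles)
    {
        uint64_t cmax = 0;
        for (int p = 0; p < NPARTS; p++)
            if (c[p] > cmax)
                cmax = c[p];

        /* Scale every c(p) by a common power of two until the largest fits in 16 bits. */
        unsigned shift = 0;
        while ((cmax >> shift) > 0xFFFF)
            shift++;
        for (int p = 0; p < NPARTS; p++)
            scaled_c[p] = (uint16_t)(c[p] >> shift);

        /* Shift used_cycles(p) so its largest possible value (the window size) fits in 16 bits. */
        scaling_factor = 0;
        while ((window_size_cycles >> scaling_factor) > 0xFFFF)
            scaling_factor++;

        /* f_zero = 1 + window size * c(pm), with both factors already scaled, so a
         * zero-budget partition always looks worse (larger RFU) than any other. */
        f_zero = 1 + (uint32_t)((window_size_cycles >> scaling_factor) * (cmax >> shift));
    }

    /* Scheduling-time key: one shift and one 16x16-bit multiply; smaller means better RFF. */
    static uint32_t f(int p)
    {
        if (budget_cycles[p] == 0)
            return f_zero;
        return (uint32_t)((used_cycles[p] >> scaling_factor) * scaled_c[p]);
    }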

How does the scheduler algorithm determine whether a thread that's allowed to run as critical should actually run as critical?

See the next answer.

How does the scheduler algorithm decide when to bill critical time?

When the scheduling algorithm picks a thread that's allowed to run as critical, it doesn't always charge that thread's CPU time to its partition's critical budget. A thread t charges its CPU time to the critical budget of its partition p only when all of the following are true at the time the scheduling algorithm is invoked:

For definitions of COMPETING(), HAS_BUDGET(), and MAY_RUN_CRITICAL(), see "How does the scheduling algorithm work?" in Scheduling behaviour.

What are the algorithm's size limitations?

The mathematics of the algorithm extend to any number of partitions. However, the current implementation has these limitations:

What are the algorithm's accuracy limitations?

Accuracy refers to how closely the scheduler can guarantee, or limit, a partition to consuming only its budget on a loaded system. For Neutrino, the accuracy is whichever is greater:

Note: When you change the averaging window size to x ms, the accuracy is undefined for the next x ms.

The first limitation comes from the precision with which the RFF() calculation is carried out: RFF() is computed to a limited number of bits, specifically to speed up the scheduling algorithm.

The second limitation comes from the uncertainty in predicting how long a thread will run before it voluntarily blocks, is preempted by a higher-priority thread, or the next tick interrupt occurs. This limitation comes from the fact that the thread scheduler is guaranteed to get control of the system only on every tick (though it may run more often).

In practice, the last limitation means that when the window size is changed, the scheduler clears its history of used CPU time, so the partition (p) with the highest-priority thread may run for budget(p) * window size (in ms) before another partition runs. After one window size (in milliseconds) has elapsed, all budgets are again guaranteed. So a partition configured for a budget of 40%, with a 100-millisecond averaging window, is considered to be scheduled accurately when its usage over the last 100 ms was 39 to 41 ms, provided the window size hasn't changed in the last 100 milliseconds. In practice, the scheduling accuracy is usually much better.

When is the scheduling algorithm approximated?

In order to save overhead, a very short version of the scheduling algorithm is used on some paths involved in message passing. This short version is implemented by internal scheduler functions such as ready_ppg(), block_and_ready_ppg(), and adjust_priority_ppg().