How does the thread scheduler pick a thread to run?
See the next answer.
How does the scheduling algorithm work?
The thread scheduler evaluates a merit function on each partition and chooses the partition with the highest merit. It then picks the highest-priority thread in that partition. A partition with budget has more merit than a partition that has exhausted its budget.
First, let's look at a few helper functions. The details are provided below:
Or:
where:
If the partition has a zero budget, then RFF(p) is defined to be a constant smaller than the smallest possible value of RFF() for all partitions with nonzero budgets.
Some operating modes, each defined by a boolean expression, are also used:
The scheduler picks one of the merit functions, depending on the operating mode:
If the mode is idle, the scheduler chooses to run the idle thread in the System partition.
Otherwise, the scheduler chooses to run the highest-priority thread that has a runmask compatible with the CPU on which the scheduler was invoked, from the partition p such that:
merit(p) > merit(p')
for all p' not equal to p.
Merit functions return tuples, and are compared as tuples. For example:
(a,b) < (c,d) if (a<c) || ((a==c) && (b<d))
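For illustration only, here's a minimal C sketch of that lexicographic tuple comparison; the two-element merit structure and its field names are assumptions made for this example, not the kernel's actual data types:

/* Sketch only: a two-element merit tuple compared lexicographically.
 * The meaning of the fields (e.g., "has budget" first, then a
 * tie-breaker) is an assumption for this example. */
typedef struct {
    int a;
    int b;
} merit_t;

/* Returns nonzero if x < y, i.e. (a,b) < (c,d) when
 * (a<c) || ((a==c) && (b<d)). */
static int merit_less(merit_t x, merit_t y)
{
    return (x.a < y.a) || ((x.a == y.a) && (x.b < y.b));
}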
How does the scheduler find the highest-priority thread in a partition?
It does this very quickly. Each partition has a bitmap that tracks the priority levels (from 0 to 255) that are in use by at least one ready-to-run thread in that partition.
Each time the scheduler makes a thread ready to run, it sets the bit corresponding to that thread's priority. When the scheduler runs a thread (its state changes from ready to running), it examines the queue of threads in that partition that are ready to run at the same priority. If there are no other threads at that priority, the scheduler clears the bit for that thread's priority.
When the scheduler needs to know the highest-priority thread that is ready to run in a partition, it uses the bitmap to index a table that maps integers to the position of their highest 1 bit. This is done with a set of tables to avoid the need for 2^255 table elements.
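As a rough illustration of the mechanism (not the kernel's actual code or macros), a 256-bit bitmap can be scanned one byte at a time through a 256-entry table that maps each byte value to the position of its highest 1 bit:

#include <stdint.h>

static uint8_t prio_bitmap[32];        /* one bit per priority level, 0..255  */
static uint8_t high_bit_of_byte[256];  /* byte value -> position of top 1 bit */

static void init_high_bit_table(void)
{
    for (int v = 1; v < 256; v++) {
        int b = 0;
        while ((v >> b) > 1) {
            b++;
        }
        high_bit_of_byte[v] = (uint8_t)b;
    }
}

/* A thread at priority prio became ready to run. */
static void set_ready(unsigned prio)
{
    prio_bitmap[prio >> 3] |= (uint8_t)(1u << (prio & 7));
}

/* The last ready thread at priority prio left the ready queue. */
static void clear_ready(unsigned prio)
{
    prio_bitmap[prio >> 3] &= (uint8_t)~(1u << (prio & 7));
}

/* Highest priority with a ready thread in this partition, or -1 if none. */
static int highest_ready_priority(void)
{
    for (int i = 31; i >= 0; i--) {
        if (prio_bitmap[i] != 0) {
            return i * 8 + high_bit_of_byte[prio_bitmap[i]];
        }
    }
    return -1;
}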
The same mechanism is also used in classic Neutrino scheduling. The macros are:
How are RFFs (relative fraction free) computed?
For the scheduling algorithm, the computation of RFF() requires floating-point division. However, Neutrino doesn't perform floating-point operations inside the kernel, or even fixed-point division; these operations are very slow on some platforms.
Neutrino computes a function equivalent to RFF() that requires only addition and multiplication.
How does the scheduler algorithm avoid division and floating-point mathematics?
For the scheduling algorithm, the computation of RFF() requires floating-point division. However, Neutrino doesn't need the absolute values of RFF(); it needs to know only the relative ordering of RFF(p1), RFF(p2), ..., RFF(pn).
Therefore, Neutrino computes a different function that has the same ordering properties as RFF(). This function is computable with only addition and 16×16 bit multiplication.
The idea is:
However, instead of finding the partition p such that RFF(p) > RFF(p') for all p' not equal to p, define relative_fraction_used(p) = RFU(p) = cycles_used/budget_cycles, and find the partition p such that RFU(p) < RFU(p') for all p' not equal to p.
used_cycles(p0)/budget_cycles(p0) < used_cycles(p1)/budget_cycles(p1) < ... < used_cycles(pn)/budget_cycles(pn)
If k = budget_cycles(p0) * budget_cycles(p1) * ... * budget_cycles(pn), then multiplying each term by k preserves the ordering, so define:
f(p) = used_cycles(p) * c(p), where c(p) = k/budget_cycles(p)
and compare f(p) to f(p') to find which has the better RFF(). The c(p) values are integer constants that change only when the partitions' budgets change.
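A hedged sketch in C of this idea, assuming c(p) = k/budget_cycles(p) has been precomputed when the partitions' budgets were set (the names and types here are illustrative, and the overflow and zero-budget complications discussed next are ignored):

#include <stdint.h>

typedef struct {
    uint64_t used_cycles;    /* CPU cycles consumed in the averaging window     */
    uint64_t budget_cycles;  /* cycles corresponding to this partition's budget */
    uint64_t c;              /* precomputed k/budget_cycles, i.e. the product   */
                             /* of every other partition's budget_cycles        */
} part_t;

/* f(p) = used_cycles(p) * c(p): no division at scheduling time. */
static uint64_t f(const part_t *p)
{
    return p->used_cycles * p->c;
}

/* Nonzero if x has the better (larger) RFF(), i.e. the smaller RFU(). */
static int better_rff(const part_t *x, const part_t *y)
{
    return f(x) < f(y);
}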
However, there are two complications. The first is running out of bits: the products can exceed the precision available for the kernel's integer arithmetic, so Neutrino scales the values and computes:
f(p) = (used_cycles(p) >> scaling_factor) * scaled_c(p)
The second is zero-budget partitions: c(p) = k/budget_cycles(p) is undefined when budget_cycles(p) is zero, and (as noted above) such a partition must always compare as having the worst RFF(). Therefore, Neutrino can set f() for a zero-budget partition as:
f_zero = 1 + window_size * c(pm)
and then scale it as described for running out of bits.
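A sketch of how the two complications might be folded into one computation; SCALING_FACTOR, scaled_c, window_cycles, and c_max_scaled are illustrative names assumed for this example, and c_max_scaled stands in for c(pm), which this sketch assumes is the largest c() value of any partition:

#include <stdint.h>

#define SCALING_FACTOR 16  /* assumed: shift that keeps products within range */

typedef struct {
    uint64_t used_cycles;    /* cycles consumed in the averaging window    */
    uint64_t budget_cycles;  /* 0 for a zero-budget partition              */
    uint64_t scaled_c;       /* c(p), pre-scaled when the budgets were set */
} scaled_part_t;

/* window_cycles: length of the averaging window, in cycles. */
static uint64_t f_scaled(const scaled_part_t *p, uint64_t window_cycles,
                         uint64_t c_max_scaled)
{
    if (p->budget_cycles == 0) {
        /* f_zero = 1 + window_size * c(pm), scaled like f(p), so a
         * zero-budget partition always compares worse (larger f) than a
         * partition that still has budget. */
        return 1 + (window_cycles >> SCALING_FACTOR) * c_max_scaled;
    }
    return (p->used_cycles >> SCALING_FACTOR) * p->scaled_c;
}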
How does the scheduler algorithm determine if a thread that's allowed to run as critical should actually run as critical?
See the next answer.
How does the scheduler algorithm decide when to bill critical time?
When the scheduling algorithm picks a thread that is allowed to run as critical, it doesn't always charge that thread's CPU time to its partition's critical budget. A thread t charges its CPU time to the critical budget of its partition p only when the following are true when the scheduling algorithm is invoked:
COMPETING(p') && (HAS_BUDGET(p') || MAY_RUN_CRITICAL(p')) == True
For definitions of COMPETING(), HAS_BUDGET(), and MAY_RUN_CRITICAL(), see "How does the scheduling algorithm work?" in Scheduling behaviour.
What are the algorithm's size limitations?
The mathematics of the algorithm is extendable to any number of partitions. However, these are the limitations of the current implementation:
What are the algorithm's accuracy limitations?
Accuracy refers to how closely the scheduler can guarantee or limit a partition to consuming only its budget on a loaded system. For Neutrino, the accuracy is measured as whichever is greater:
Or
The first limitation comes from the precision with which the RFF() calculation is carried out: RFF() is calculated to a limited number of bits, specifically to speed up the scheduling algorithm.
The second limitation comes from the uncertainty in predicting how long a thread will run before it voluntarily blocks, is preempted by a higher-priority thread, or the next tick interrupt occurs. This limitation comes from the fact that the thread scheduler is guaranteed control of the system only at every tick (but may run more often).
In practice, the last limitation implies that when the window size is changed, the scheduler clears its history of used CPU time. The partition (p) with the highest-priority thread then runs for budget(p) * window size (ms) before another partition runs. After the window size (in milliseconds) has elapsed, all budgets are again guaranteed. For example, a partition configured with a budget of 40% and a 100-millisecond averaging window is considered to be scheduled accurately when its usage over the last 100 ms was 39 to 41 ms, provided the window size hasn't changed in the last 100 milliseconds. In practice, the scheduling accuracy is usually much better.
When is the scheduling algorithm approximated?
In order to save overhead, a very short version of the scheduling algorithm is used on some paths involved in message passing. This short version is implemented with the internal scheduler functions, such as ready_ppg(), block_and_ready_ppg() and adjust_priority_ppg().