Appendix: Frequently Asked Questions: Adaptive Partitioning Thread Scheduler

This document contains a set of frequently asked questions (FAQ) and answers. It covers the implementation details of the scheduling extensions contained in QNX Neutrino Adaptive Partitioning, as well as any common questions related to partitions in general.

This chapter includes:

Scheduling behavior
Microbilling
Averaging window
Scheduling algorithm
Overhead
Critical threads and bankruptcy
Inheritance
Budgets
Joining a partition
Neutrino system considerations

The information contained in this Frequently Asked Questions section is subject to change at any time without notice. QSS makes no representation or warranty regarding the information and is not responsible whatsoever for reliance on the information contained herein.

Scheduling behavior

How does the thread scheduler guarantee a partition's minimum CPU budget?

The thread scheduler guarantees a minimum CPU budget by ensuring that other partitions don't overrun their budget. This determination is made every clock tick.

The clock interrupt handler invokes the thread scheduler. That means it runs a miniimum of every clock period (typically every millisecond). On each clock tick:

On a uni-processor, it examines the partition of the currently running thread to see if it should keep running. The adaptive partition (AP) thread scheduler will decide if a thread should stop running if its partition has less available time (budget-cycles minus used-cycles during this averaging window) than what is necessary to pay for the duration of the next clock period. If the currently running partition fails this test then the AP portion of the clock handler sets a “must examine all partitions” flag.
On an SMP processor, the AP scheduler's portion of the clock interrupt handler always sets the “must examine all partitions flag.”
If the currently running partition fails this test, then the adaptive partitioning portion of the clock handler sets a “must examine all partitions” flag.

On exit from the Neutrino clock interrupt handler, the handler examines the flag. If set, it causes the system to immediately enter the kernel and invoke the full scheduling algorithm.

The full thread scheduler algorithm examines all partitions. It stosp running the current partition if it is about to go out of budget (i.e. it no longer has enough to pay for another quarter clock period, in addition to one clock period for each additional CPU - if the system is multicore). In other words, the thread scheduler guarantees that budgets are met by forcing a paritition to temporarily stop running if it will run over its budget before the next time the scheduler is in control of the system. This also requires that some other partition has budget and threads that are ready to run.

The thread scheduler guarantees that budgets are met by forcing a partition to temporarily stop running if it runs over its budget before the next time when the scheduler is in control of the system.

When does the scheduler guarantee that a partition gets its budget?

The thread scheduler makes sure that a partition gets at least its budget in the current averaging window when:

The partition becomes ready to run often enough to consume at least its budget worth of time.
On multicore machines:
- let B(p) be the budget, in percent of partition p
- let R(p) be the number of ready to run threads in our partition, and
- let N be the number of CPUs
Then the thread scheduler guarantees that partition p gets B(p) percent of CPU time over the last averaging window if:
```
R(p) >= N * B(p)/100
```
In other words, it means that when the partition has enough ready to run threads to occupy the processors in the system.
The scheduler didn't bill any critical time to any partition.

In other words, budgets are guaranteed if the system is busy enough and no partition has used its critical budget.

Does a 100-ms window mean a CPU time-averaging occurs only once in every 100 ms?

See the next answer.

How often does the algorithm enforce partition budgets?

If you use a 100 ms averaging window, the scheduler doesnt' produce information about CPU usage only once every 100 ms. Rather, it stores a history of CPU usage, with details for each of the last 100 millisecond intervals. The window rotates, or slides forward in time, for every clock tick, so the window provides precise information about the average CPU consumption every millisecond (or clock period).

Between clock ticks, when the thread scheduler algorithm is called, CPU usage of each partition is approximated with the assumption that each partition will likely run continuously at least until the next clock tick.

In other words, the thread scheduler computes the used CPU time and enforces the budgets, many times per millisecond.

What system assumptions does the design of thread scheduler make?

In order to guarantee that the partitions get their guaranteed minimum CPU budgets, the design assumes:

The clock interrupt handler runs periodically. In other words, the users don't inhibit clock interrupts.
The ClockCycles() function is monotonic, except for 64-bit wraparound.
The ClockCycles() function increases at a constant rate.
Useful work done by the processor is proportional to ClockCycles().
The ClockCycles() functions, as seen by each CPU on an SMP. machine, increment at the same rate (though there may be a constant offset between each processor's ClockCycles()).
Each CPU works at the same rate on SMP machines.
The resolution of ClockCycles() is at least 1/200th of the clock period between timer ticks.
The user doesn't change the size of the averaging window often.

When does the thread scheduler calculate percentage CPU usage?

Never. It avoids doing division in order to execute quickly.

The scheduler only compares a partition's CPU usage with its budget, expressed as a total time over the last averaging window rather than as a percentage. To make a quick comparison, both usage and budgets are treated internally as counts of ClockCycles(), not as percentages.

How often does the thread scheduler compute CPU usage?

At least once every clock period (typically every millisecond). However, it also does it on kernel calls, such as message and pulse sending or mutex releases. For example, on a 733MHz x86 machine that performs a lot of I/O, the scheduler computes CPU usage around 50 times every millisecond.

When is the scheduler's behavior realtime?

Within a single partition, the thread scheduler always follows POSIX scheduling rules, i.e. preemptive priority-based scheduling with FIFO and sporadic policies. So a partition looks somewhat like a complete system in POSIX.

However the CPU time, seen by a partition, may be sliced by threads running in other partitions.

So the question remains: when does a partition get continuous realtime? Since our definition of realtime is to schedule strictly by priority, the answer is the thread scheduler schedules strictly by priority whenever a set of partitions has used less than their budgets over the last averaging window. This implies that all threads run by priority-preemption rules as long as their partitions have not exhausted their budget in the current averaging window. In brief, it's realtime when a partition is using less than its budget.

What is free-time mode?

See the next answer.

What is free time?

Free-time mode is a specific budget situation when at least one partition with a nonzero budget isn't using all of its budget. Free-time mode means other partitions may use up the free time even if they exceed their own budgets. This is one of the reasons why adaptive partitioning is adaptive.

The extra time a partition gets in free time mode is called “free time,” but it isn't always free; sometimes it must be paid back.

Do you have to repay free time?

Partly. In general, only the free time during the last averaging window needs to be paid back.

For example, suppose that partition p1 has exhausted its budget, and another partition p2 has available budget. Therefore partition p2 is running. Now assume that partition p2 becomes idle (i.e. goes to sleep) for 10 milliseconds. Because partition p2 has no competition and is in free-time mode, partition p1 begins running and exceeds its budget by 10 milliseconds.

Now, say partition p2 wakes up. The partition p2 won't run until the averaging window rotates enough to carry the history of its CPU over-usage past 100 milliseconds into the past. So, p2 may not run until window-size − budget milliseconds passes. This interval, where p2, is suspended is effectively paying back the free time.

In general, when free time is less than window size — budget must be paid back.

In a different example, suppose partition p2 goes to sleep for a minute. In this situation, partition p1 runs opportunistically and subsequently consumes 100% of the CPU. When partition p2 wakes up, it will have available budget, and partition p1 will be over budget, so partition p1 will run.

The partition p2 won't run again until window rotation removes history of its CPU usage past 100 milliseconds in the past. So in this case, partition p2 needs to pay back only window-size − budget milliseconds of the minute of CPU time that ran because partition p1 was asleep.

While the partition is over budget (because of the free time it received) — it won't run at all until enough time has passed to cause the total usage (recorded in the averaging window) to fall below budget. It implies that the partition has stopped running until its stopped time compensates for the free time it took earlier.

An exception is free time that occurred just before a call to SchedCtl(SCHED_APS_SET_PARMS,...) to change the window size. Changing the window size wipes the scheduler's memory so free time just before a change in window size isn't paid back.

How does the thread scheduler behave on HyperThreaded (HT) processors?

Adaptive partitioning treats a two-headed HT processor as a multicore system with two CPUs. It assumes that each virtual processor has equal and constant throughput. Whereas this is true for SMP machines, it's true on HT machines only when the system is sufficiently loaded to keep both pseudo-CPUs busy. Adaptive partitioning requires that a system's throughput be proportional to the ClockCycles() function.

How long can a round-robin thread run with the thread scheduler?

Without the thread scheduler (i.e. using classic Neutrino scheduling), a round-robin thread:

may be preempted at any time by a higher-priority thread
if not preempted or if there is no other thread at the same priority, it runs until it gives up control voluntarily
if not preempted and there is another thread at equal priority, it runs for 4 ticks (nominally 4 milliseconds) before being time-sliced with other thread

With the thread scheduler, a round-robin thread:

may be preempted at any time by a higher priority thread in the same scheduler partition
runs until it gives up control or its partition runs out of budget — if not preempted, and if there is no other thread of the same priority in that partition
may start running if its partition gets more budget on the next clock tick. This happens for a ready to run round-robin thread (in a partition that is out of budget). This also happens since the rotation of the window gives that available partition budget back.
runs for 4 ticks (nominally 4 milliseconds), before being time-sliced with the other thread — if its partition has
- at least 4 milliseconds of available budget
- not been preempted
- another thread of equal priority

The scheduler overrides the time slice for a round-robin thread. When a partition has more than 4 ticks of available time left in its budget, thread scheduler behavior is the same as the classic Neutrino scheduling. However on a loaded system, it's best to assume that a Round-Robin thread may be sliced every tick.

When a round-robin thread is preempted by the scheduler, it will be able to run a thread in a different partition. In other words, round-robin behavior is unchanged relative to the other threads in the same partition.

How long can a FIFO thread run with the thread scheduler?

Without the thread scheduler, if not preempted by a higher priority thread, a FIFO thread runs until it gives up control voluntarily.

With the thread scheduler, a FIFO thread runs if not preempted by a higher priority thread in the same partition until it gives up control voluntarily, or its partition runs out of budget.

FIFO behavior is unchanged as long as your partition has budget. On a loaded system, it's best to assume that a FIFO thread may be time sliced every millisecond with threads in other partitions. However, relative to all other threads in the same partition, FIFO behavior is the same as in classic Neutrino scheduling.

How long can a sporadic (SS) thread run with the thread scheduler?

Without the thread scheduler, if not preempted by a higher priority thread, an SS thread runs until it gives up control voluntarily. Since the priority of an SS thread alternates between normal and low priorities, it's likely to be preempted when running at its low priority.

With the thread scheduler, the SS thread runs if not preempted by a higher priority thread in the same partition until it gives up control voluntarily or its partition runs out of budget.

Some developers set the higher priority of a sporadic-scheduled thread to be the highest priority in the system, in order to make the thread nonpreemptible during its high-priority mode. With the thread scheduler, the thread is non-preemptible only as long as its partition hasn't exhausted its budget.

Sporadic scheduling behavior is unchanged as long as your partition has budget. On a loaded system, it's best to assume that an SS thread may be time-sliced every millisecond with threads in other partitions. However, relative to all other threads in the same partition, SS behavior is the same as in classic Neutrino scheduling.

How often does the thread scheduler algorithm run?

See the next answer.

How often does the thread scheduler enforce budgets?

The thread scheduler runs and enforces budgets:

every clock tick
every time a thread sleeps or blocks for a mutex
whenever a thread becomes ready after it has received an event, pulse, or message.

The frequency depends on how often messaging occurs.

How do power-saving modes affect scheduling?

If the system suspends, scheduler is unaware of the interruption. Upon resumption, partitions will have the same percentage consumption they had at suspension.

If the system varies the processor speed to conserve power, scheduler is unaware of the variation. Although the scheduler guarantees that all partitions get their budget percentages, it assumes that each millisecond has the same throughput. This means that partition budget enforcement is effectively inaccurate for the 100 milliseconds (or window size) after the CPU changes speed. Thereafter, it's accurate.

How does changing the clock period (using ClockPeriod()) affect scheduling?

If you change the clock period, the thread scheduler can't schedule accurately because it's unaware of the change in the size of the tick. However, calling SchedCtl(SET_APS_PARMS,...) with the existing window size causes the scheduler to recalculate all internal parameters that depend on the size of the clock period. Correspondingly this calling restores accuracy.

As described in the Adaptive Partitioning User's Guide, you should set the window size after changing the clock period.

Microbilling

How does microbilling work?

Microbilling refers to the accounting for the CPU time that is used by a thread to a much finer resolution than the clock period between tick interrupts.

The thread scheduler has been implemented where threads send or receive messages many times (as opposed to a single time) per clock period. Adaptive partitioning scheduling would not be possible if we were limited to counting integer ticks of CPU time. That's because most threads send or receive messages, or otherwise block, many times per clock period.

Microbilling works by taking a fine-resolution timestamp every time a thread changes state from ready to not-ready, and charging differences between sequential timestamps against that partition's used CPU cycles count.

Microbilling uses the system call ClockCycles() to get that fine-resolution timestamp.

How often does thread scheduler microbill?

The thread scheduler microbills each time that:

one thread stops running and another starts running
a clock tick occurs

How does ClockCycles() work?

The thread scheduler always depends on the processor being used. On x86 processors, Neutrino uses a free-running counter that is implemented on the CPU chip itself. This counter is read with a single instruction.

On PowerPC targets, Neutrino reads a similar free-running counter with just a few instructions. In these situations, ClockCycles() increments typically at about the processor's clock rates (i.e. ClockCycles() increases by 3 billion counts every second on a 3Ghz machine).

On both x86 and PowerPC processors, ClockCyles() increase by about 1 billion counts every second on a 1 GHz processor.

On processors that don't have a free-running counter for the purpose of being a fine-grained clock, Neutrino emulates ClockCyles(). For example, on ARM processors, Neutrino reads the intermediate value of the countdown timer that's used to trigger the clock interrupts. This value tells how far you're into the current clock tick. Neutrino further adds a scaled version of how far you're into the current clock tick to a constant determined at the last clock tick to get an emulated ClockCycles() value.

On some processors, such as ARM, the countdown timer used for emulating ClockCycles() is located off-chip and requires slow I/O operations to read it. On other processors, such as MIPS, the countdown timer is located on-chip, and can be quickly read.

How accurate is microbilling?

See the next answer.

How accurate is ClockCycles()?

The accuracy of microbilling or ClockCycles() is determined by the accuracy of the clock oscillator source used in the CPU. However, since the scheduling is relative between partitions, it doesn't require ClockCycles() be equal to the absolute time; it requires only that ClockCycles() be proportional to the work done by CPU. In fact, a wrongly calibrated ClockCycles() has no effect on the accuracy of the thread scheduler.

What is the resolution of thread timing?

It's the resolution of the ClockCycles() function. The resolution of clock cycles varies from platform to platform. In most cases, the resolution is much finer.

The thread scheduler requires 1/200 of a tick to meet its specification for accuracy. In some platforms, such as x86, the resolution is on the order of nanoseconds.

Averaging window

How does the averaging window work?

The averaging window consists of tables. There are two tables per partition, one for the CPU time spent while critical, and another for any CPU time spent. The tables have one slot per timer tick. So a 100-ms averaging window, with a 1-ms clock period, has 100 slots. Each slot is used to hold the CPU time spent during a particular tick interval. For example:

[99ms ago][98 ms ago][97 ms ago]....[1 ms ago][current ms]

The slots hold the total CPU times of all threads in that partition as measured by consecutive calls to ClockCycles(). Note that total CPU times are then scaled by a carefully chosen factor so that all numbers fit into a 32-bit unsigned integer register.

At any time, the sum of the elements of a table represents the total CPU time used by that partition over the averaging period.

When the scheduler stops a thread running, it adds the time spent by that thread since when it started, or since the last tick, into the current ms slot of the table. If the thread was running as critical, the scheduler also adds the time to the current ms slot of that partition's critical time table. The scheduler also does this when a clock tick occurs.

However, on a clock tick, after billing the current thread to its partition's [current ms] slot, the scheduler also rotates the table. To rotate the table, it does the following:

deletes the 99ms-ago slot
shifts the entire table to the left by one slot, moving the time in the 98ms-ago slot to the 99ms-ago slot etc.
creates a new current-ms slot, which the scheduler initializes to zero

This is called window rotation. Each rotation effectively provides available budget back to the partition that ran 99 ms ago. Window rotation is implemented without summing the entire table, shifting the table, or calls to the malloc() or free() functions.

What is the window-rotation algorithm?

The averaging window isn't physically rotated. It's logically rotated:

A separate field, used_cycles, is always maintained to contain the total of every slot in the table.
An integer, cur_hist_index, is an index into the table and points to the slot for the current ms.
On microbilling, the scheduler adds the CPU time of the current thread to both the current slot in the table, and also to the total field. For example:
```
usage_hist[cur_hist_index] += delta_time; 
used_cycles += delta_time;
```
On window rotation, the scheduler does the following:
- subtracts the oldest slot from the total:
```
used_cycles -= usage_hist[(cur_hist_index +1) MOD 100]
```
- increments the table index, modulo its table size (say 100):
```
cur_hist_index = (cur_hist_index+1) MOD 100
```
This is done for every partition, for both normal and critical CPU time.

Can I change the window size?

See the next answer.

How does changing the window size affect scheduling?

You can change the window size with SchedCtl(SCHED_APS_SET_PARMS,...) on the fly. The scheduler doesn't malloc() new tables, but it does zero the history in all tables, zeros all the totals, and zeros the table indexes.

The effect is to wipe the memory of the scheduler. Here the scheduler assumes that no partition has run in the last x ms, where x is the new window size.

We recommend you leave the window size at the default, or set it during startup. Also, you shouldn't change the window size often.

How do maximum latencies relate to the averaging window size?

In general, the longer the averaging window, the longer the partition has to wait before it gets the CPU time.

For example, with a 100 milliseconds averaging window and a partition p with a 10% budget, the partition p will exhaust its budget if it runs continuously for 10 milliseconds. It has to wait another 90 milliseconds before window rotations cause the averaging window to lose memory of its past execution. So, it will be 90 milliseconds before the partition p gets some available budget back and runs again.

However, in most real systems that engage in inter-partition interaction, partition p's 10 milliseconds of running time is likely to get spread out in the averaging window. So even if p exhausts the budget soon, it will most likely get available budget back in much less than 90 milliseconds.

The Adaptive Partitioning User's Guide describes an unlikely scenario where two interacting partitions result in a larger latency than the window size budget.

Scheduling algorithm

How does the thread scheduler pick a thread to run?

See the next answer.

How does the scheduling algorithm work?

The thread scheduler evaluates a merit function on each partition and chooses the partition with the highest merit. It then picks the highest-priority thread in that partition. A partition with budget has more merit than a partition that has exhausted its budget.

First, let's look at a few helper functions. The details are provided below:

The COMPETING(p) function is a boolean function of partition p. It returns True if:
- partition p is currently running a thread of priority greater than zero
  Or:
- partition p contains a thread that is ready to run, and has a priority greater than zero
The HAS_BUDGET(p) function is a boolean function of partition p. It returns True if
```
cycles_used(p) + cycles_left_in_current_tick <= budget_cycles(p)
```
where:

cycles_used(p)

The CPU time that the partition has used during the current averaging window.

budget_cycles(p)

The size of the averaging window, expressed in ClockCycles() (not milliseconds - ms) multiplied by the percentage budget of p.
The MAY_RUN_CRITICAL(p) function is a boolean function of partition p. It returns True if:
- partition p is configured with a critical budget that's greater than zero
- partition p has used, during the last averaging window, a critical time that is less than its critical budget minus 1/32 of a tick
- the highest-priority thread that's ready to run or is currently running in partition p is allowed to run as critical.
The HIGHEST_PRIO(p) function is the numerical priority of the highest priority thread that is either running or ready to run in partition p.
If the partition has a nonzero budget, then the relative function free (RFF(p)) function is:
```
1 - used_cycles(p)/budget_cycles(p)
```
If the partition has a zero budget, then RFF(p) is defined to be a constant smaller than the smallest possible value of RFF() for all other nonzero partitions.

Some operating modes, defined by these boolean expressions, are also defined:

underload

When COMPETING(p) && (HAS_BUDGET(p)||MAY_RUN_CRITICAL(p)) == True for at least one partition p.

all_at_load

When COMPETING(p) == True for all p, and HAS_BUDGET(p)||MAY_RUN_CRITICAL(p) == False, for all partitions p.

free_time

When COMPETING(p) == False for at least one partition p that has a nonzero budget.

idle

when COMPETING(p) == False, for all partitions p.

The scheduler picks up one of the merit functions, depending on the operating mode:

underload: merit(p) = (COMPETING(p), HAS_BUDGET(p)||MAY_RUN_CRITICAL(p), HIGHEST_PRIO(p), RFF(p) )
all_at_limit: merit(p) = (COMPETING(p), RFF(p))
free_time, default: merit(p) = (COMPETING(p), HAS_BUDGET(p)||MAY_RUN_CRITICAL(p), HIGHEST_PRIO(p), RFF(p) )
free_time, SCHEDPOL_RATIO: merit(p) = (COMPETING(p), HAS_BUDGET(p)||MAY_RUN_CRITICAL(p), RFF(p) )
idle: Undefined.

If the mode is idle, the scheduler chooses to run the idle thread in the System partition.

Otherwise, the scheduler chooses to run the highest-priority thread that has a compatible runmask for the CPU on which the scheduler was invoked from the partition p such that:

merit(p) > merit(p')

for all p' not equal to p.

Merit functions return tuples, and are compared as tuples. For example:

(a,b) < (c,d) if (a<c) || ( (a=c) && (b<d) )

How does the scheduler find the highest-priority thread in a partition?

It does it very quickly. Each partition has a bitmap that tracks the priority levels (between 0 to 255) that are in use by some ready to run thread in that partition.

Each time the scheduler makes a thread ready to run, it sets the bit corresponding to that thread's priority. When the scheduler runs a thread (its state changes from ready to run), the scheduler examines the queue of threads in that partition that are ready-to-run and at the same priority. If there are no other threads of that priority, the scheduler clears the bit for that thread's priority.

When the scheduler needs to know the highest priority thread that is ready to run in a partition, it uses the bitmap to index a table that maps integers to the number of their highest 1 bit. This is done with a set of tables to avoid the need for 2²⁵⁵ table elements.

The same mechanism is also used in classic Neutrino scheduling. The macros are:

DISPATCH_SET()
DISPATCH_CLEAR()
DISPATCH_HIGHEST_PRI()

How are RFFs (relative fraction free) computed?

For the scheduling algorithm, the computation of RFF() requires floating-point division. However, Neutrino doesn't perform floating-point operation inside the kernel or even fixed-point division; these operations are very slow on some platforms.

Neutrino computes a function equivalent to RFF() that requires only addition and multiplication.

How does the scheduler algorithm avoid division and floating-point mathematics?

For the scheduling algorithm, the computation of RFF() requires floating-point division. However, Neutrino doesn't need the absolute values of RFF(); it needs to know only the relative ordering of RFF(p1), RFF(p2), .... RFF(pn).

Therefore, Neutrino computes a different function that has the same ordering properties as RFF(). This function is computable with only addition and 16×16 bit multiplication.

The idea is:

relative_fraction_free(P), or
```
RFF(P) = 1 -
cycles_used/budget_cycles
```
However, instead of finding partition p, such that RFF(p) > RFF(p') for p' not equal p, define relative_fraction_used(p) = RFU(p) = cycles_used/budget_cycles , and find partition p such that RFU(p) < RFU(p') for p' not equal to p.
Then find a function that has the same ordering properties as RFU():
- Find:
  used_cycles(p0)/budget_cycles(p0) < used_cycles(p1)/budget_cycles(p2)< .... < used_cycles(pn)/budget_cycles(pn)
- let
  k = budget_cycles(p0) * budget_cycles(p1) * ... * budget_cycles(pn), then
- k/budget_cycles(p0)*used_cycles(p0) < k/budget_cycles(p1)*used_cycles(p1) < ... < k/budget_cycles(pn)*used_cycles(pn), as long as all numbers are >0.
- Values of c(p)=K/budget_cycles(p), for all p, are computed once, or whenever any partition's percentage budget is changed. The values are stored and aren't recalculated during scheduling
- At scheduling time, Neutrino computes
```
f(p) = used_cycles(p) * c(p)
```
  and compare f(p) to f(p') to find which has the better RFF().

However, there are two complications:

Running out of bits

So far, f(p) = used_cycles(p) * c(p) requires 64-bit multiplication. However, since the accuracy specification is 0.2%, Neutrino scales all values of c(p) by a common factor, until the largest value fits in 16 bits. Neutrino also shifts used_cycles(p) until its largest possible value fits in 16 bits. Therefore, at scheduling time, Neutrino computes only:

f(p) = (used_cycles(p)>>scaling_factor) * scaled_c(p)

Zero-budget partitions

The above algorithms nominally require Neutrino to multiply and divide everything by zero. However RFF() of a zero-budget partition is defined to be a constant that is smaller than any nonzero partition's possible value of RFF(). Neutrino defines RFU(p) for a zero budget partition as a constant that is greater than RFU(). The largest value of f() is window size * c(pm) where c(pm) > c(p') for all p' not equal to pm.

Therefore, Neutrino can set f() for a zero-budget partition as:

f_zero = 1 + window size*c(pm)

and then scale it as described for running out of bits.

The window size is expressed in cycles.

How does the scheduler algorithm determine if a thread that's allowed to run as critical, should actually run as critical?

See the next answer.

How does the scheduler algorithm decide when to bill critical time?

When the scheduler algorithm picks a thread that is allowed to run as critical to run, it doesn't always charge its CPU time to its partition's critical budget. A thread t charges its CPU time to the critical budget of its partition p only when the following are true when the scheduler algorithm is invoked:

thread t has the highest priority in the system
thread t is allowed to run as critical now
partition p has a critical budget configured to be greater than zero
CPU cycles used by all threads in partition p during the last averaging window are less than the critical budget of partition p
partition p has exhausted its normal CPU budget

at least one partition, p' not equal to p, has

COMPETING(p') &&(HAS_BUDGET(p')||MAY_RUN_CRITICAL(p')) == True

For definitions of COMPETING(), HAS_BUDGET(), and MAY_RUN_CRITICAL(), see the topic How does the scheduling algorithm work?.

What are the algorithm's size limitations?

The mathematics of the algorithm are extendable to any number of partitions. However, these are the limitations of the current implementation:

It has <= 32 partitions, because of use of bit sets and 32-bit integers.
It has <= 16 partitions, because of an internal step of RFF calculation limited to 16×16 bit multiplication.
It has <= 8 partitions, a practical limit to prevent the scheduler from consuming too much memory or CPU time.
You must specify budgets, in integer percentages, e.g. 30% or 31%, but not 30.5%.
There's no limit on the number of threads per partition.

What are the algorithm's accuracy limitations?

Accuracy refers to the closeness of the scheduler's guarantee or limit that a partition can consume only its budget on a loaded system. For Neutrino, the accuracy is measured based on whichever is greater:

0.5%
Or
Tick size (in ms) or window size (in ms). For a 100 milliseconds window with a default tick, this is 1%.

When you changes the averaging window size to x ms, the accuracy is undefined for the next x ms.

The first limitation comes from the accuracy in which the RFF() calculation is carried out. The accuracy of RFF() is calculated to a limited number of bits, specifically to speed up the scheduling algorithm.

The second limitation comes from the uncertainty in predicting how long a thread runs before it voluntarily blocks, is preempted by a higher-priority thread, or when the next tick interrupt occurs. This limitation comes from the fact that the thread scheduler has guaranteed control of the system only every tick (but may run more often).

In practice, the last limitation implies that when a window size is changed, the scheduler clears its history of used CPU time. So the partition (p) with the highest priority thread runs for budget(p)*window size (ms) before another partition runs. After the window size (in milliseconds) has elapsed, all budgets are again guaranteed. So a partition, configured for a budget of 40%, with a 100 milliseconds averaging window, is considered to be scheduled accurately when its usage over the last 100 ms was 39 to 41 ms. This happens when the window size hasn't changed in the last 100 milliseconds. In practice, the scheduling accuracy is usually much better.

When is the scheduling algorithm approximated?

In order to save overhead, a very short version of the scheduling algorithm is used on some paths involved in message passing. This short version is implemented with the internal scheduler functions, such as ready_ppg(), block_and_ready_ppg() and adjust_priority_ppg().

Overhead

Which partition is the overhead associated with scheduling charged to?

Let's consider all kernel calls, such as messaging and mutexting, that switch threads to be overhead. Call the initial running thread t1, and the next thread t2. Let's consider the kernel calls that are initiated by t1 and cause t1 to stop running and t2 to start running.

The overhead is split between t1 and t2, but mostly to t1 with the following details:

Time to:	Is charged to the partition of:
Enter the kernel	`t1`
Run the scheduling algorithm	`t1`
Do a context switch	`t2`
Exit the kernel	`t2`

Which partition is the overhead for processing interrupts charged to?

There are two parts of interrupt servicing: the interrupt handler and the interrupt thread.

If you service interrupts with an interrupt thread, most of the time spent servicing the interrupt is the thread's time, and only a small part of the time is spent in the interrupt handler. The interrupt handler determines the thread to which the interrupt event should be delivered.

If you service interrupts with an interrupt handler, all of the time spent servicing the interrupt is in the handler.

The time spent in an interrupt thread is charged against the partition of that thread.

The time spent in an interrupt handler is charged against the partition that's running at that time.

Since the interrupts occur in random, time spent in interrupt handler is spread evenly over all running partitions.

What is the CPU overhead with the thread scheduler?

Our results indicate that heavy compiling benchmark that involve a lot of filesystem-related messaging are about 1% slower on x86 platforms when using the thread scheduler.

What is the memory overhead with the thread scheduler?

Data: A few kilobytes of fixed overhead along with 2 KB per partition.
Code: About 18 KB.

Both of these are in the kernel space.

What factors increase the overhead for the thread scheduler?

In approximate order of importance, the cost of the thread scheduler increases with:

the number of scheduling operations, such as sending messages, events and signals sent, mutex operations, and sleeps
the platform — in particular, ARM is noticeably slower because of the I/O needed to implement ClockCycles()
the frequency of clock ticks
the number of partitions
the use of runmasks

In all the above cases, the increase is approximately linear.

The following factors don't affect the cost of scheduling at all:

the number of threads
the length of the averaging window (except for a very small effect when changing the window size)
the choice of percentage budgets
the choice of thread priorities
the choice of FIFO, round-robin, or sporadic thread policies

Critical threads and bankruptcy

How does the scheduler mark a thread as critical?

See the next answer.

How does the thread scheduler know that a thread is critical?

Neutrino maintains a data block, the thread_entry, representing the state of each thread. It contains three state bits for controlling the critical threads that indicate whether or not the thread is:

always allowed to run as critical
allowed to run as critical until it blocks
currently running as critical (and is consuming its partition's critical budget).

These state bits are turned on as follows:

Always allowed	When the user calls SchedCtl() with the SCHED_APS_MARK_CRITICAL command on that thread.
Until blocked	When the thread receives an event from an interrupt handler, a message from another thread marked either “always allowed to run critical”, or “allow critical until it blocks” an event, on which the user has previously called the macro, SIGEV_MAKE_CRITICAL()
Currently running as critical	When the scheduler algorithm decides that thread would not have been eligible to run if it hadn't been allowed to run as critical.

Do critical threads expose security?

No.

You can set your own thread to be critical, or receive a critically tagged event or message. This way, the thread gets the property of the “allowed to run critical” flag. You must configure the partition with a nonzero critical budget to:

affect the critical budget, and
cause its partition to run when it's out of budget (thereby taking time from some other partition).

Setting a nonzero critical budget on a partition is controlled. For the recommended scheduler partition security settings, only root, running in the parent partition of a target partition, can set a nonzero critical budget.

When does the scheduler check for bankruptcy?

To save time, the thread scheduler only polls partitions for bankruptcy only on each clock tick (rather than every scheduling operation). So typically, bankruptcy is detected one millisecond (or clock period) after a partition's critical budget has been exhausted.

How does the scheduler detect bankruptcy?

Neutrino compares the total critical time (over the last averaging window) to the partition's configured maximum critical time budget. Each partition maintains a separate rotating window for tracking critical time usage. The history window for this critical time identifies, for each millisecond of the last 100 milliseconds, which part of the total CPU time was considered to be critical time.

Inheritance

What is partition inheritance?

Partition inheritance occurs when the scheduler bills the CPU time of a thread not to its own partition, but to the partition of a different thread. This feature makes the thread scheduler adaptive.

When does partition inheritance occur?

Partition inheritance occurs under two situations:

when one thread is working on behalf of another
When a client thread sends a message to a server thread, that server thread is considered to be working on the client thread's behalf. In this case, Neutrino charges the execution time of the receiving thread, from the time it receives the message and up to the time it waits for the next message, to the partition of the sending thread.
This means that resource managers, such as filesystems, automatically bill their time to their appropriate clients. This implies that partitions containing only resource managers don't need to be reengineered every time a new client is added to the system.
when not inheriting would cause excessive delays (in a special case of mutex inheritance)

How does mutex partition and inheritance work?

When threads line up for access to a mutex, Neutrino doesn't consider the thread holding the mutex to be waiting on behalf of the threads waiting for the mutex. So, there is no inheritance of partitions.

However, there is a special scenario when the thread holding the mutex is in a partition that ran out of available budget. In this scenario, the thread can't run and release the mutex. All the threads waiting for that mutex are stalled until enough window rotations have occurred for mutex-holding partitions to regain some available budget. This is particularly nasty if the user has configured that partition to have a zero budget.

So, when a thread t1 holds a mutex in a partition that has exhausted its budget, and another thread t2 attempts to seize the mutex, Neutrino puts thread t2 to sleep until thread t1 releases the mutex (which is classic mutex handling), and then changes the partition of t1 to be that of t2 until it releases the mutex, provided the budget of partition t2 is nonzero. This prevents extended delays, should the current mutex holder run out of budget.

How fast is partition inheritance?

Very fast.

The data block that Neutrino keeps for each thread, the thread_entry, has a pointer to its containing partition. So inheritance is simply a matter of swapping the pointer. Often, Neutrino doesn't even need to update the microbilling because the same partition is executing before and after the inheritance.

Why is partition inheritance for message passing secure?

Sending a message to a process effectively gives the sender's partition budget to the receiver thread (temporarily). However, to receive threads in that manner, the receiver process must have been started under the root user.

Budgets

Can I change the budgets dynamically?

You can change a partition's budget any time.

How does a budget change affect scheduling?

See the next answer.

How quickly does a budget change take effect?

The operation is quick and doesn't reset the scheduler or cause any change to the partition's history of CPU usage that is stored in the averaging window.

However, if you change the budget of a partition from 90% to 10%, the partition could suddenly become over budget. In this situation, the partition may not run again until enough window rotations have occurred to lower the partition's used cycles below its budget.

When does a change in budgets take effect?

A change in budget takes effect at the next tick interrupt or next scheduling operation i.e. typically, in less than one millisecond.

What is a partition with zero budget?

Threads in a partition with a defined budget of zero runs if all nonzero partitions are sleeping. These threads also run if they inherit the partition of thread that sends a message. Zero-budget partitions are most useful for resource managers with no internal daemon threads. They're also useful for turning off unused partitions.

How does the scheduler guarantee that the sum of all partitions' budgets is 100%?

At startup, Neutrino creates the first partition (the System partition) with a budget of 100%. Thereafter, when a thread running in a partition creates a new partition, the current partition is considered as the parent and the new partition is the child. The budget of the child is always taken from the budget of the parent, and may never reduce the parent's budget below zero. So creating partitions produces a hierarchy of partitions that subdivide the System's original budget of 100%.

How does the scheduler prevent an untrusted thread from increasing its partition's budget?

For any change to occur, the scheduler partition security would have to be:

unlocked to permit budget changes
set to permit non-root users to modify budgets
set to permit a partition to modify its own budget (usually only the parent can modify a partition's budget)

Note that a thread in a partition can't increase its budget more than the budget of its parent partition.

How can I cheat to exceed my partition's budget?

You can:

change the window size often
provide your partition an infinite critical budget and set yourself to run as critical
As the root user, unlock the scheduler partition configuration and turn off the scheduler partition security.

In order to do either of these, you must be the root user, unlock the scheduler partition configuration and turn off the scheduler partition security.

The following ideas look promising, but:

Giving your own partition more budget (it can't exceed its parent's, even if security is off).
Setting your thread priority to 255 (you can starve everything else in your partition, but not another partition).
Setting your thread policy to FIFO and loop (you can starve everything else in your partition, but not another partition.)
Creating your own partition (the child partition's budget can't be greater than your own).

Joining a partition

How does joining a thread to a partition work?

See the next answer.

How fast is joining a thread to a partition?

Each thread_entry (the control block that Neutrino maintains for each thread) has a pointer to its containing partition. Joining a thread means only changing this pointer. The act of joining is very fast. Most of the time is spent in entering the kernel in order to swap the pointer.

QNX system considerations

Why doesn't Neutrino allow a partition to be deleted?

It's safer and much more efficient not to delete a partition. A suggested alternative is to set the partition's budget to zero.

To delete a partition, Neutrino would have to locate all threads (or assert that there are none) in a partition and move them to some other partition.

Threads are mapped to their partitions with a single pointer. There is no back pointer, as it would require a linked list to implement a many-to-one mapping to chain together all threads.

In addition, Neutrino would require additional kernel memory for a two-way queue through all thread_entry structures. In addition, Neutrino also have to do two-way queue extractions every time it (Neutrino) inherited partitions (e.g. message sending) while evading the simultaneous destruction of other threads.

How does the thread scheduler plug into `procnto`?

See the next answer.

Is the classic scheduler still present when the thread scheduler is active?

Adaptive partitioning scheduler is part of the kernel.

It is shipped as a library module (libmod) that is built into the image along with procnto. The procnto also contains the code for the classic Neutrino scheduler when the thread scheduler module is not present. However, when the thread scheduler module is present, procnto initializes the thread scheduler instead of the classic scheduler. The thread scheduler then directs a set of function pointers, one for each primitive scheduling operation (such as ready(), block(), etc.), to its own function constants. Subsequently, it creates the system partition, which it returns to procnto.

Does the thread scheduler inhibit I/O interrupts?

Yes. The thread scheduler calls InterruptDisable() for slightly longer than the time required to call ClockCycles() each time it must microbill. That includes not inhibiting interrupts to get mutual exclusion between the clock interrupt handler, scheduling algorithm, getting partition statistics, or changing budgets.

Is there a performance limitation on how often I can call `SchedCtl(SCHED_APS_PARTITION_STATS,...)` to get statistics?

Other than the cost of the SchedCtl() kernel call, the answer is no.

Getting statistics doesn't inhibit interrupts, or delay window rotations or the scheduling algorithm (on other SMP processors.) Consistent retrieval of statistics is accomplished by detecting collisions and having the API withdraw and retry. Note that the call to SchedCtl( SCHED_APS_PARTITION_STATS..) fails with EINTR only in the unlikely case of three consecutive collisions. In general, this can occur only if the user has set the clock period to such a short value that it's likely unsafe for the rest of the system.