Considerations for the Thread Scheduler

This chapter includes:

Determining the number of scheduler partitions and their contents
Choosing the percentage of CPU for each partition
Choosing the window size
Practical limits
Uncontrolled interactions between scheduler partitions

You typically use the thread scheduler to:

engineer a system to work in a predictable or defined manner when it's fully loaded
prevent unimportant or untrusted applications from monopolizing the system

In either case, you need to configure the parameters for the thread scheduler with the entire system in mind. The basic decisions are:

How many scheduler partitions should you create, and what software should go into each?
What guaranteed CPU percentage should each scheduler partition receive?
What should be the critical budget, if any, of each scheduler partition?
What size, in milliseconds, should the time-averaging window be?

Determining the number of scheduler partitions and their contents

It seems reasonable to put functionally-related software into the same scheduler partition, and frequently that's the right choice. But thread scheduling is a structured way of deciding when not to run software, so the real criterion is this: put software into different scheduler partitions if it should be starved of CPU time under different circumstances.

The maximum number of partitions you can create is eight.

For example, if the system is a packet router that:

routes packets
collects and logs statistics for packet routing
handles route-topology protocols with peer routers
collects and logs route-topology metrics

it may seem reasonable to have two scheduler partitions: one for routing, and one for topology. Certainly logging routing metrics is functionally related to packet routing.

When the system is overloaded, meaning there's more outstanding work than the machine can possibly accomplish, you need to decide what work to do slowly. In this example, when the router is overloaded with incoming packets, it's still important to route them. But you may decide that if you can't do everything, you'd rather route packets than collect the routing metrics. By the same analysis, you might conclude that route-topology protocols should continue to run, using much less of the machine than routing itself, but run quickly when they need to.

Such an analysis leads to three partitions:

a partition for routing packets, with a large share, say 80%
a partition for topology protocols, say 15%, but with maximum thread priorities that are higher than those for packet routing
a partition for logging both the routing metrics and topology-protocol metrics

In this case, we chose to separate the functionally-related components of routing and logging the routing metrics because we prefer to starve just one if we're forced to starve something. Similarly, we chose to group two functionally-unrelated components, the logging of routing metrics and the logging of topology metrics, because we want to starve them under the same circumstances.

Choosing the percentage of CPU for each partition

The amount of CPU time that each scheduler partition tends to use under unloaded conditions is a good indication of the budget you should assign to it. If your application is a transaction processor, it may be useful to measure CPU consumption under a few different loads and construct a graph of offered load versus the CPU consumed.

In general, the key to obtaining the right combination of partition budgets is to try them:

Leave security turned off.
Load a test machine with realistic loads.
Examine the latencies of your time-sensitive threads with the QNX IDE System Profiler tool.
Try different patterns of budgets, which you can easily change at run time with the aps command.

You can't delete a partition, but you can remove all of its processes, and then change that partition's budget to 0%.

Setting budgets to zero

It's possible to set the budget of a partition to zero as long as the SCHED_APS_SEC_NONZERO_BUDGETS security flag isn't set—see the SCHED_APS_ADD_SECURITY command for SchedCtl().

Threads in a zero-budget partition run only in these cases:

You're using the default scheduling policy (SCHED_APS_SCHEDPOL_DEFAULT), and the highest-priority thread in the system belongs to the zero-budget partition.
You're using SCHED_APS_SCHEDPOL_FREETIME_BY_RATIO, and all other nonzero-budget partitions are idle.
The zero-budget partition has a nonzero critical budget, in which case its critical threads run.
A thread receives a message from a partition with a nonzero budget, in which case the receiving thread runs temporarily in the sender's partition.

When is it useful to set the budget of a partition to zero?

It's useful to set the budget of a partition to zero when:

A partition is permanently empty of running threads; you can set its budget to zero to effectively turn it off. When a zero-budget partition is idle, it isn't considered to produce free time (see “Summary of scheduling behavior” in the “Using the Thread Scheduler” chapter of this guide). A partition with a nonzero budget that never runs threads puts the thread scheduler permanently in free-time mode, which may not be the desired behavior.
You want noncritical code to run only when some other partition is idle.
The partition is populated by resource managers, or other software, that runs only in response to receiving messages. Putting them in a zero-budget partition means you don't have to separately engineer a partition budget for them. (Those resource managers automatically bill their time to the partitions of their clients.)

Typically, setting a partition's budget to zero isn't recommended. (This is why the SCHED_APS_SEC_RECOMMENDED security setting doesn't permit partition budgets to be zero.) The main risk in placing code into a zero-budget partition is that it may run in response to a pulse or event (i.e., not a message), and therefore, not run in the sender's partition. As a result, when the system is loaded (i.e., there's no free time), those threads may simply not run; they might hang, or things might occur in the wrong order.

For example, it's hazardous to set the System partition's budget to zero. On a loaded machine with a zero-budget System partition, requests to procnto to create processes and threads may hang (for instance, when MAP_LAZY is used). In addition, if your system uses zero-budget partitions, you should carefully test it with all other partitions fully loaded with while(1) loops.

Setting budgets for resource managers

Ideally we'd like resource managers, such as filesystems, to run with a budget of zero. That way they'd always be billing time to their clients. But some device drivers realize too late which client a particular thread was doing work for. Consequently, some device drivers may have background threads for audits or maintenance that require CPU time that can't be attributed to a particular client. In those cases, you should measure the resource manager's background and unattributable loads, and then add that amount to its partition's budget.

If your server has maintenance threads that never serve clients, then it should be in a partition with a nonzero budget.
If your server communicates with its clients by sending messages, or by using mutexes or shared memory (i.e., anything other than receiving messages), then your server should be in a partition with a nonzero budget.

Choosing the window size

You can set the size of the time-averaging window to be from 8 to 400 milliseconds. This is the time over which the thread scheduler attempts to balance scheduler partitions to their guaranteed CPU limits. Different choices of window sizes affect both the accuracy of load balancing and, in extreme cases, the maximum delays seen by ready-to-run threads.

Accuracy

Some things to consider:

A small window size reduces the accuracy of CPU time balancing. The error is ±(tick_size / window_size). For example, with a 1 millisecond tick and a window size of 10 milliseconds, the accuracy is about 10 percentage points.
If a partition opportunistically goes over budget (because other partitions are using less than their guaranteed budget), it must pay back the borrowed time, but only as much as the thread scheduler “remembers” (i.e., only the borrowing that occurred in the last window).
A small window size means that a scheduler partition that opportunistically goes over budget might not have to pay the time back. If a partition sleeps for longer than the window size, it won't get the time back later. So load balancing won't be accurate over the long term if the system is loaded and some partitions sleep for longer than the window size.
If the window size is small enough that some partition's budget becomes less than one tick, the partition still gets to run for at least 1 tick during each window. That gives it 1 / window_size_in_ticks of the CPU time, which may be considerably larger than the partition's actual budget. As a result, other partitions may not get their CPU budgets.
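
The accuracy figures above can be worked through numerically. The following is a minimal sketch, assuming a 1 millisecond clock tick (the value that the 10 millisecond example implies). The function names are hypothetical and aren't part of any QNX API.

```python
def balancing_error(tick_ms, window_ms):
    """Approximate CPU-balancing error as a fraction: +/-(tick_size / window_size)."""
    return tick_ms / window_ms


def effective_share_percent(budget_percent, tick_ms, window_ms):
    """Share a ready partition can end up with when its budget is under one tick.

    A partition runs for at least 1 tick per window, so its share can't drop
    below 1 / window_size_in_ticks, whatever its configured budget is.
    """
    window_ticks = window_ms / tick_ms
    floor_percent = 100.0 / window_ticks
    return max(budget_percent, floor_percent)


# Assumed 1 ms tick throughout.
print(balancing_error(1, 10))             # 0.1  (about 10 percentage points)
print(balancing_error(1, 100))            # 0.01 (about 1 percentage point)
print(effective_share_percent(2, 1, 8))   # 12.5 (a 2% budget in an 8 ms window)
```

In the last call, a 2% budget in an 8 millisecond window is less than one tick, so the partition can consume one tick in eight, well above its configured share.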

Delays compared to priority scheduling

In an underload situation, the thread scheduler doesn't delay ready-to-run threads, but the highest-priority thread might not run if the thread scheduler is balancing budgets.

In very unlikely cases, a large window size can cause some scheduler partitions to experience runtime delays, but these delays are always less than what would occur without adaptive partitioning thread scheduling. There are two cases where this can occur.

Case 1

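If a scheduler partition's budget is budget milliseconds (its percentage share of the window size), then the delay is never longer than the following, where smallest_budget and largest_budget are the smallest and largest budgets among the partitions: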

window_size − smallest_budget + largest_budget

This upper bound is only ever reached when low-budget and low-priority scheduler partitions interact with two other scheduler partitions in a specific way, and then only when all threads in the system are ready to run for very long intervals. This maximum possible delay has an extremely low chance of occurring.

For example, given these scheduler partitions:

Partition A: 10% share; always ready to run at priority 10
Partition B: 10% share; when it runs, it runs at priority 20
Partition C: 80% share; when it runs, it runs at priority 30

This delay happens when the following occurs:

Let B and C sleep for a long time. A will run opportunistically and eventually run for 100 milliseconds (the size of the averaging window).
Then B wakes up. It has both available budget and a higher priority, so it runs. Let's call this time Ta, since it's the last time partition A ran. Since C continues to sleep, B runs opportunistically.
At Ta + 90 milliseconds, partition A has just paid back all the time it opportunistically used (the window size minus partition A's budget of 10%). Normally, it would run on the very next tick because that's when it would next have a budget of 1 millisecond, and B is over budget.
But let's say that, by coincidence, C chooses to wake at that exact time. Because it has budget and a higher priority than A, it runs. It proceeds to run for another 80 milliseconds, which is when it runs out of budget.
Only now, at Ta + 90 ms + 80 ms, or 170 milliseconds later, does A get to run again.

This scenario can't occur unless a high-priority partition wakes up exactly when a lower-priority partition just finishes paying back its opportunistic run time.
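
The arithmetic of this example can be checked against the Case 1 bound. This is a minimal sketch, assuming the 100 millisecond window used in the steps above; the function name is hypothetical.

```python
def case1_worst_delay_ms(window_ms, budgets_percent):
    """Case 1 upper bound: window_size - smallest_budget + largest_budget (ms)."""
    # Convert each percentage share into milliseconds of the averaging window.
    budgets_ms = [window_ms * p / 100 for p in budgets_percent]
    return window_ms - min(budgets_ms) + max(budgets_ms)


# Partitions A, B, and C with 10%, 10%, and 80% shares.
print(case1_worst_delay_ms(100, [10, 10, 80]))  # 170.0
```

The result matches the timeline: 90 milliseconds for A to pay back its opportunistic time, plus 80 milliseconds while C uses up its budget.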

Case 2

Still rare, but more common, is a delay of:

window_size − budget

milliseconds, which may happen to low-budget scheduler partitions whose priorities are, on average, equal to those of other partitions.

With a typical mix of thread priorities, each scheduler partition, when ready to run, typically experiences a maximum delay of much less than window_size milliseconds.

For example, let's suppose we have these scheduler partitions:

partition A: 10% share, always ready to run at priority 10
partition B: 90% share, always ready to run at priority 20, except that every 150 milliseconds, it sleeps for 50 milliseconds.

This delay occurs when the following happens:

When partition B sleeps, partition A is already at its budget limit of 10 milliseconds (10% of the window size).
But then A runs opportunistically for 50 milliseconds, which is when B wakes up. Let's call that time Ta, the last time partition A ran.
B runs continuously for 90 milliseconds, which is when it exhausts its budget. Only then does A run again; this is 90 milliseconds after Ta.

This pattern occurs only if the 10% application never suspends (which is exceedingly unlikely), and if there are no threads of other priorities (also exceedingly unlikely).

Approximating the delays

Because these scenarios are complicated, and the maximum delay time is a function of the partition shares, we approximate this rule by saying that the maximum ready-queue delay time is twice the window size.
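
To see how the two bounds compare with this rule of thumb, here is a minimal sketch that evaluates both for the partition shares used in the earlier examples. The function name and the 100 millisecond window are assumptions for illustration.

```python
def delay_bounds_ms(window_ms, budgets_percent, partition_percent):
    """Return (case 1 bound, case 2 delay, twice-the-window approximation) in ms."""
    budgets_ms = [window_ms * p / 100 for p in budgets_percent]
    # Case 1: window_size - smallest_budget + largest_budget
    case1 = window_ms - min(budgets_ms) + max(budgets_ms)
    # Case 2: window_size - budget, for the partition being delayed
    case2 = window_ms - window_ms * partition_percent / 100
    return case1, case2, 2 * window_ms


case1, case2, approx = delay_bounds_ms(100, [10, 10, 80], 10)
print(case1, case2, approx)  # 170.0 90.0 200
```

Because the largest budget can't exceed the window size, the Case 1 bound can't exceed twice the window size, which is why the approximation is a safe simplification.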

If you change the tick size of the system at runtime, do so before defining the window size of the adaptive partitioning thread scheduler, because Neutrino converts the window size from milliseconds to clock ticks for internal use.

The practical way to verify that your scheduling delays are correct is to load your system with stress loads, and use the System Profiler tool from the QNX IDE to monitor the delays. The aps command lets you change budgets dynamically, so you can quickly confirm that you have the correct configuration of budgets.

Practical limits

If you use adaptive partitions, you need to be aware of the following limitations:

The API allows a window size as short as 8 milliseconds, but practical window sizes may need to be larger. For example, in an eight-partition system, with all partitions busy, to reasonably expect all eight to run during every window, the window size needs to be at least 8 timeslices long, which for most systems is 32 milliseconds.
Overloads aren't reported to users. The Adaptive Partition scheduler does detect overload and acts to limit some partitions to guarantee the percentage shares of others, but it doesn't inform anything outside of the kernel that an overload was detected. The problem is that an overload might occur (or might not occur) on every scheduling operation, which can occur at a rate of 50,000 per second on a 200 MHz machine (an older, slower machine).
SCHED_RR threads might not round robin in partitions whose portion of the averaging window is smaller than one timeslice. For example, when the timeslice is 4 ms (the default) and the adaptive partitioning scheduler's window size is 100 ms (the default), then SCHED_RR threads in a 4% partition may not round-robin correctly.
If you use adaptive partitioning and bound multiprocessing (BMP), some combinations of budgets might not be met. Threads in a zero-budget partition should run only when all other nonzero-budget partitions are idle. On SMP machines, zero-budget partitions may incorrectly run when some other partitions are demanding time. At all times, all partitions' minimum budgets are still guaranteed, and zero-budget partitions won't run if all nonzero-budget partitions are ready to run. For detailed information, see “Scheduler partitions and BMP” in the Using the Thread Scheduler chapter.
So that the scheduler can calculate the total number of cycles used in a window, the product P × W × N must be less than 2,147,483,648 (2^31), where:
- P is the processor clock rate (in Hz)
- W is the APS window size (in seconds)
- N is the number of processors on the SMP device
The default value of W is 0.1 (100 milliseconds) and, given this value, the following constraints apply:
- 1 processor: maximum clock rate 21.5 GHz
- 2 processors: maximum clock rate 10.7 GHz
- 4 processors: maximum clock rate 5.4 GHz
- 8 processors: maximum clock rate 2.7 GHz
On ARM targets, the 10-window and 100-window averages reported by the aps show -v command are occasionally incorrect, but this has no effect on scheduling.
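
The P × W × N limit listed above can be checked for a given configuration. The following is a minimal sketch that reproduces the maximum clock rates for the default 100 millisecond window; the function names are hypothetical and aren't part of any QNX API.

```python
LIMIT = 2**31  # 2,147,483,648


def max_clock_hz(window_s, processors):
    """Largest clock rate P (Hz) for which P x W x N stays below the limit."""
    return LIMIT / (window_s * processors)


def within_limit(clock_hz, window_s, processors):
    """True if the configuration satisfies P x W x N < 2,147,483,648."""
    return clock_hz * window_s * processors < LIMIT


# Default window size W = 0.1 s.
for n in (1, 2, 4, 8):
    print(n, round(max_clock_hz(0.1, n) / 1e9, 1))
# 1 21.5
# 2 10.7
# 4 5.4
# 8 2.7
```

Since the limit applies to the product, a larger window lowers the permissible clock rate proportionally, so it's worth rechecking with within_limit() if you increase the window size on a multiprocessor system.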

Uncontrolled interactions between scheduler partitions

There are cases where a scheduler partition can prevent other applications from being given their guaranteed percentage of CPU:

Interrupt handlers: The time used in interrupt handlers is never throttled. That is, we always choose to execute the globally highest-priority interrupt handler, independent of its scheduler partition. This means that faulty hardware or software that causes too many interrupts can effectively limit the time available to other applications.
Time spent in interrupt threads (e.g., those that use InterruptAttachEvent()) is correctly charged to those threads' partitions.