Considerations for the Thread Scheduler

You typically use the thread scheduler to:

In either case, you need to configure the parameters for the thread scheduler with the entire system in mind. The basic decisions are:

Determining the number of scheduler partitions and their contents

It seems reasonable to put functionally related software into the same scheduler partition, and frequently that's the right choice. However, thread scheduling is a structured way of deciding when not to run software. So the actual rule is to put software into different scheduler partitions when it should be starved of CPU time under different circumstances.

Note: The maximum number of partitions you can create is eight.

For example, if the system is a packet router that:

it may seem reasonable to have two scheduler partitions: one for routing, and one for topology. Certainly logging routing metrics is functionally related to packet routing.

However, when the system is overloaded, meaning there's more outstanding work than the machine can possibly accomplish, you need to decide what work to do slowly. In this example, when the router is overloaded with incoming packets, it's still important to route them. But you may decide that if you can't do everything, you'd rather route packets than collect the routing metrics. By the same analysis, you might conclude that route-topology protocols should continue to run, using much less of the machine than routing itself, but run quickly when they need to.

Such an analysis leads to three partitions:

  • a partition for routing packets
  • a partition for the route-topology protocols
  • a partition for logging both the routing metrics and the topology metrics

In this case, we chose to separate the functionally-related components of routing and logging the routing metrics because we prefer to starve just one if we're forced to starve something. Similarly, we chose to group two functionally-unrelated components, the logging of routing metrics and the logging of topology metrics, because we want to starve them under the same circumstances.

Choosing the percentage of CPU for each partition

The amount of CPU time that each scheduler partition tends to use under unloaded conditions is a good indication of the budget you should assign to it. If your application is a transaction processor, it may be useful to measure CPU consumption under a few different loads and construct a graph of offered load versus the CPU consumed.
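
As a rough illustration of turning measurements into starting budgets, the sketch below scales per-partition CPU time (measured on an unloaded system) into whole-number percentages that sum to 100. The partition names, the measured values, and the use of whole-number percentages are assumptions made for this example; the resulting figures are only a starting point for the tuning steps that follow.

```python
def budgets_from_usage(usage_ms):
    """Scale measured per-partition CPU time to integer percentages summing to 100."""
    total = sum(usage_ms.values())
    if total <= 0:
        raise ValueError("no CPU usage measured")
    # Exact (fractional) share of each partition, in percent.
    raw = {name: 100 * ms / total for name, ms in usage_ms.items()}
    # Round down first, then hand the leftover points to the partitions
    # with the largest fractional remainders so the total is exactly 100.
    budgets = {name: int(share) for name, share in raw.items()}
    leftover = 100 - sum(budgets.values())
    by_remainder = sorted(raw, key=lambda n: raw[n] - budgets[n], reverse=True)
    for name in by_remainder[:leftover]:
        budgets[name] += 1
    return budgets

# Hypothetical measurements: CPU milliseconds consumed per second of wall time.
measured = {"routing": 640, "topology": 120, "logging": 40}
print(budgets_from_usage(measured))
```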

In general, the key to obtaining the right combination of partition budgets is to try them:

  1. Leave security turned off.
  2. Load a test machine with realistic loads.
  3. Examine the latencies of your time-sensitive threads with the QNX IDE System Profiler tool.
  4. Try different patterns of budgets, which you can easily change at run time with the aps command.

Note: You cannot delete a partition; however, you can remove all of its processes and then change that partition's budget to 0%.

Setting budgets to zero

It's possible to set the budget of a partition to zero as long as the SCHED_APS_SEC_NONZERO_BUDGETS security flag isn't set—see the SCHED_APS_ADD_SECURITY command for SchedCtl().

Threads in a zero-budget partition run only in these cases:

When is it useful to set the budget of a partition to zero?

It's useful to set the budget of a partition to zero when:

Note: Typically, setting a partition's budget to zero is not recommended. (This is why the SCHED_APS_SEC_RECOMMENDED security setting doesn't permit partition budgets to be zero.) The main risk in placing code into a zero-budget partition is that it may run in response to a pulse or event (i.e. not a message), and therefore, not run in the sender's partition. As a result, when the system is loaded (i.e. there's no free time), those threads may simply not run; they might hang, or things might occur in the wrong order.

For example, it's hazardous to set the System partition's budget to zero. On a loaded machine with a System partition of zero, requests to procnto to create processes and threads may hang (for instance, when MAP_LAZY is used). In addition, if your system uses zero-budget partitions, you should carefully test it with all other partitions fully loaded with while(1) loops.

Setting budgets for resource managers

Ideally we'd like resource managers, such as filesystems, to run with a budget of zero. That way they'd always be billing time to their clients. However, some device drivers realize too late which client a particular thread was doing work for. In addition, some device drivers may have background threads for audits or maintenance that require CPU time that can't be attributed to a particular client. In those cases, you should measure the resource manager's background and unattributable loads, and then add that amount to its partition's budget.

  • If your server has maintenance threads that never serve clients, then it should be in a partition with a nonzero budget.
  • If your server communicates with its clients by sending messages, or by using mutexes or shared memory (i.e. anything other than receiving messages), then your server should be in a partition with a nonzero budget.

Choosing the window size

You can set the size of the time-averaging window to be from 8 to 400 milliseconds. This is the time over which the thread scheduler attempts to balance scheduler partitions to their guaranteed CPU limits. Different choices of window sizes affect both the accuracy of load balancing and, in extreme cases, the maximum delays seen by ready-to-run threads.


Some things to consider:

Delays compared to priority scheduling

In an underload situation, the thread scheduler doesn't delay ready-to-run threads, but the highest-priority thread might not run if the thread scheduler is balancing budgets.

In very unlikely cases, a large window size can cause some scheduler partitions to experience runtime delays, but these delays are always less than what would occur without adaptive partitioning thread scheduling. There are two cases where this can occur.

Case 1

If a scheduler partition's budget is budget milliseconds, then the delay is never longer than:

window_size - smallest_budget + largest_budget

This upper bound is only ever reached when low-budget and low-priority scheduler partitions interact with two other scheduler partitions in a specific way, and then only when all threads in the system are ready to run for very long intervals. This maximum possible delay has an extremely low chance of occurring.
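
To get a feel for the size of this bound, the sketch below evaluates it for a set of partition budgets. It reads the bound as window_size minus the smallest budget plus the largest budget, and it assumes that each budget in milliseconds is the partition's percentage of the window. The function name and the sample budgets are hypothetical.

```python
def case1_max_delay_ms(window_ms, budgets_percent):
    """Worst-case Case 1 delay: window_size - smallest_budget + largest_budget."""
    # Assumption: a partition's budget in milliseconds is its share of the window.
    budgets_ms = [window_ms * pct / 100 for pct in budgets_percent]
    return window_ms - min(budgets_ms) + max(budgets_ms)

# Hypothetical system: 100 ms window, partitions with 10%, 10%, and 80% budgets.
print(case1_max_delay_ms(100, [10, 10, 80]))  # 100 - 10 + 80 = 170.0
```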

For example, given these scheduler partitions:

This delay happens when the following occurs:

Note: This scenario can't occur unless a high-priority partition wakes up exactly when a lower-priority partition just finishes paying back its opportunistic run time.

Case 2

Still rare, but more common, is a delay of:


milliseconds, which may occur to low-budget scheduler partitions with, on average, priorities equal to other partitions.

However, with a typical mix of thread priorities, a scheduler partition that's ready to run typically experiences a maximum delay of much less than window_size milliseconds.

For example, let's suppose we have these scheduler partitions:

This delay occurs when the following happens:

However, this pattern occurs only if the 10% application never suspends (which is exceedingly unlikely), and if there are no threads of other priorities (also exceedingly unlikely).

Approximating the delays

Because these scenarios are complicated, and the maximum delay time is a function of the partition shares, we approximate this rule by saying that the maximum ready-queue delay time is twice the window size.
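
The rule of thumb can be applied in either direction, as in the following sketch: estimate the worst-case ready-queue delay for a given window, or pick the largest window that keeps the estimate within a delay you can tolerate. The function names and the sample values are hypothetical; the 8 and 400 millisecond limits are the window-size limits given earlier.

```python
MIN_WINDOW_MS, MAX_WINDOW_MS = 8, 400  # allowed range of the averaging window

def approx_max_delay_ms(window_ms):
    """Rule of thumb: maximum ready-queue delay is about twice the window size."""
    if not MIN_WINDOW_MS <= window_ms <= MAX_WINDOW_MS:
        raise ValueError("window size must be between 8 and 400 ms")
    return 2 * window_ms

def largest_window_for(tolerated_delay_ms):
    """Largest window (in whole ms) whose estimated delay fits the tolerance."""
    window = min(tolerated_delay_ms // 2, MAX_WINDOW_MS)
    if window < MIN_WINDOW_MS:
        raise ValueError("tolerance is too small for any allowed window size")
    return window

print(approx_max_delay_ms(100))   # 200
print(largest_window_for(150))    # 75
```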

Note: If you change the tick size of the system at run time, do so before defining the window size of the partition thread scheduler, because Neutrino converts the window size from milliseconds to clock ticks for internal use.
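
The following sketch illustrates why the order matters. It assumes that the window is stored as a tick count computed once, when the window size is defined, and that the count isn't recomputed if the tick size changes later. The rounding used here is also an assumption, since the exact conversion isn't described; the function names are hypothetical.

```python
import math

def window_in_ticks(window_ms, tick_ms):
    """Convert a window size in milliseconds to clock ticks."""
    # Assumption: rounds up to a whole number of ticks.
    return math.ceil(window_ms / tick_ms)

def effective_window_ms(window_ms, tick_at_definition_ms, tick_now_ms):
    """Real duration of the window if the tick size changed after it was defined."""
    ticks = window_in_ticks(window_ms, tick_at_definition_ms)
    return ticks * tick_now_ms

# A 100 ms window defined with a 1 ms tick, then the tick is changed to 2 ms:
print(effective_window_ms(100, 1, 2))  # 200
```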

The practical way to verify that your scheduling delays are correct is to load your system with stress loads, and use the System Profiler tool from the QNX IDE to monitor the delays. The aps command lets you change budgets dynamically, so you can quickly confirm that you have the correct configuration of budgets.

Practical limits

The API allows a window size as short as 8 milliseconds. However, practical window sizes may need to be larger. For example, in an eight-partition system, with all partitions busy, to reasonably expect all eight to run during every window, the window size needs to be at least 8 timeslices long, which for most systems is 32 milliseconds.
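
The same arithmetic can be sketched for other partition counts. The 4 ms timeslice is inferred from the figures above (32 ms for eight timeslices) and may differ on your system; the function name is hypothetical.

```python
API_MIN_WINDOW_MS = 8  # shortest window size the API allows

def min_practical_window_ms(busy_partitions, timeslice_ms=4):
    """Smallest window in which every busy partition can get one timeslice."""
    return max(API_MIN_WINDOW_MS, busy_partitions * timeslice_ms)

print(min_practical_window_ms(8))  # 32
print(min_practical_window_ms(1))  # 8
```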

Uncontrolled interactions between scheduler partitions

There are cases where a scheduler partition can prevent other applications from being given their guaranteed percentage of CPU: