
Adaptive Partitioning Scheduling Details

This chapter includes:

- Introduction
- Keeping track of CPU time
- How CPU time is divided between partitions
- Partition inheritance
- Critical threads
- Bankruptcy
- Adaptive partitioning and other schedulers
- Using adaptive partitioning and multicore together

Introduction

The adaptive partitioning scheduler is an optional thread scheduler that lets you guarantee minimum percentages of the CPU's throughput to groups of threads, processes, or applications. The percentage of the CPU allotted to a partition is called a budget.

The adaptive partitioning scheduler was designed on top of the core QNX Neutrino architecture primarily to solve two problems in embedded systems design:

- guaranteeing specified groups of threads a minimum share of CPU time when the system is overloaded
- preventing unimportant or untrusted applications from monopolizing the CPU

We call our partitions "adaptive" because their contents are dynamic: you can change a partition's budget at run time, and threads can move between partitions, for example when a server works on behalf of a client in another partition.

Keeping track of CPU time

The adaptive partitioning scheduler throttles CPU usage by measuring each partition's average CPU usage. The average is computed over an averaging window, normally 100 ms; this value is configurable.

The adaptive partitioning scheduler, however, doesn't wait 100 ms to compute the average; it recalculates it frequently. As each millisecond passes, the usage for that 1 ms is added to the usage for the previous 99 ms to compute the total CPU usage over the averaging window (i.e. 100 ms).


Figure: Sliding averaging window. The averaging window moves forward as time advances.

The window size defines the averaging time over which the adaptive partitioning scheduler tries to balance the partitions to their guaranteed CPU limits. You can set the averaging window size to any value from 8 ms to 400 ms.
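
For example, here's a minimal sketch of changing the window size programmatically. It assumes the sched_aps_parms structure and the APS_INIT_DATA() initializer macro that accompany SchedCtl()'s SCHED_APS_SET_PARMS command; check the adaptive partitioning header on your system for the exact names. You can also set the window size with the aps utility.

    #include <sched.h>
    #include <sys/aps_sched.h>
    #include <stdio.h>

    int main(void)
    {
        sched_aps_parms parms;

        APS_INIT_DATA(&parms);      /* mark all other fields as "no change" */
        parms.windowsize_ms = 100;  /* new averaging window: 8..400 ms */

        if (SchedCtl(SCHED_APS_SET_PARMS, &parms, sizeof(parms)) == -1) {
            perror("SCHED_APS_SET_PARMS");
            return 1;
        }
        return 0;
    }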

Different choices of the window size affect both the accuracy of load balancing and, in extreme cases, the maximum delays seen by ready-to-run threads. For more information, see the System Considerations chapter.

Because the averaging window slides, it can be difficult for you to keep statistics over a longer period, so the scheduler keeps two other windows:

Window 2
Typically 10 times the window size.
Window 3
Typically 100 times the window size.

To view the statistics for these windows, use the aps show -v or aps show -vv command.

The scheduler accounts for CPU time to the actual fraction of clock ticks used, and it accounts for time spent in interrupt threads and in the kernel on behalf of user threads. We call this microbilling.

Microbilling may be approximated on SH and ARM targets if the board can't provide a micro clock.

How CPU time is divided between partitions

The adaptive partitioning scheduler is a fair-share scheduler. That means it guarantees that partitions receive a defined minimum amount of CPU time, their budget, whenever they demand it. The adaptive partitioning scheduler is also a realtime scheduler. That means it's a preemptive, priority-based scheduler. These two requirements appear to conflict.

The scheduler satisfies both requirements by scheduling strictly by priority whenever it doesn't need to limit a partition in order to guarantee some other partition its budget. Here are the details.

Underload

Underload occurs when partitions demand less CPU time than their defined budgets, over the averaging window. For example, suppose we have three partitions: the System partition with a budget of 70%, Pa with a budget of 20%, and Pb with a budget of 10%.

With light demand in all three partitions, the output of the aps show command may be something like this:

                    +---- CPU Time ----+-- Critical Time --
Partition name   id | Budget |    Used | Budget |      Used
--------------------+------------------+-------------------
System            0 |    70% |   6.23% |  200ms |   0.000ms
Pa                1 |    20% |  15.56% |    0ms |   0.000ms
Pb                2 |    10% |   5.23% |    0ms |   0.000ms
--------------------+------------------+-------------------
Total               |   100% |  27.02% |

In this case, all three partitions are demanding less than their budgets.

Whenever partitions are demanding less than their budgets, the scheduler chooses among them by picking the highest-priority ready thread. In other words, when underloaded, the adaptive partitioning scheduler is a strict realtime scheduler. This is simply normal Neutrino scheduling.


Figure: Normal load. The adaptive partitioning scheduler's behavior when underloaded.

Free time

Free time occurs when a partition isn't using all of its budget, for example because it has no ready threads. The scheduler then gives that partition's unused time to other running partitions. If the other running partitions demand enough time, they're allowed to run over budget.

If a partition opportunistically goes over budget, it must pay back the borrowed time, but only as much as the scheduler "remembers" (i.e. only the borrowing that occurred in the last window).

For example, suppose we have the same three partitions, where the System partition has nothing to run, and partitions Pa and Pb each run an infinite-loop thread, with Pb's thread at the higher priority.

Because the System partition is demanding no time, the scheduler hands out the remaining time to the highest-priority thread in the system. In this case, that's the infinite-loop thread in partition Pb. So the output of the aps show command may be:

                    +---- CPU Time ----+-- Critical Time --
Partition name   id | Budget |    Used | Budget |      Used
--------------------+------------------+-------------------
System            0 |    70% |   0.11% |  200ms |   0.000ms
Pa                1 |    20% |  20.02% |    0ms |   0.000ms
Pb                2 |    10% |  79.83% |    0ms |   0.000ms
--------------------+------------------+-------------------
Total               |   100% |  99.95% |

In this example, partition Pa is getting its guaranteed minimum of 20%, but all the free time is given to partition Pb.

This is a consequence of the principle that the scheduler chooses between partitions strictly by priority, as long as no partition is being limited to its budget. This strategy ensures the most realtime behavior.

However, there may be circumstances when you don't want partition Pb to get all the free time just because it has the highest-priority thread. That may occur when, say, you choose to use Pb to encapsulate an untrusted or third-party application, especially if you can't control its code.

In that case, you may wish to configure the scheduler to run a more restrictive algorithm that divides free time by the budgets of the busy partitions (rather than giving it all to the highest-priority thread). To do so, set the SCHED_APS_FREETIME_BY_RATIO flag with SchedCtl() (see the Neutrino Library Reference) or use the aps modify -S freetime_by_ratio command (see the Utilities Reference).
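
Here's a sketch of setting the flag programmatically. As above, the sched_aps_parms structure and APS_INIT_DATA() macro are assumptions about the adaptive partitioning header; verify the names against the Neutrino Library Reference.

    #include <sched.h>
    #include <sys/aps_sched.h>
    #include <stdio.h>

    /* Ask the scheduler to divide free time by the ratio of the busy
       partitions' budgets instead of giving it all to the
       highest-priority thread. */
    static int set_freetime_by_ratio(void)
    {
        sched_aps_parms parms;

        APS_INIT_DATA(&parms);  /* leave the window size unchanged */
        parms.scheduling_policy_flags = SCHED_APS_FREETIME_BY_RATIO;

        if (SchedCtl(SCHED_APS_SET_PARMS, &parms, sizeof(parms)) == -1) {
            perror("SCHED_APS_SET_PARMS");
            return -1;
        }
        return 0;
    }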

In our example, freetime-by-ratio mode might cause the aps show command to display:

                    +---- CPU Time ----+-- Critical Time --
Partition name   id | Budget |    Used | Budget |      Used
--------------------+------------------+-------------------
System            0 |    70% |   0.04% |  200ms |   0.000ms
Pa                1 |    20% |  65.96% |    0ms |   0.000ms
Pb                2 |    10% |  33.96% |    0ms |   0.000ms
--------------------+------------------+-------------------
Total               |   100% |  99.96% |

which indicates that in freetime-by-ratio mode, the scheduler divides free time between partitions Pa and Pb in roughly a 2:1 ratio, which is the ratio of their budgets.

Full load

Full load occurs when all partitions are demanding their full budget. A simple way to demonstrate this is to run while(1) loops in all of the sample partitions. In this case, the aps show command might display:

                    +---- CPU Time ----+-- Critical Time --
Partition name   id | Budget |    Used | Budget |      Used
--------------------+------------------+-------------------
System            0 |    70% |  69.80% |  200ms |   0.000ms
Pa                1 |    20% |  19.99% |    0ms |   0.000ms
Pb                2 |    10% |   9.81% |    0ms |   0.000ms
--------------------+------------------+-------------------
Total               |   100% |  99.61% |

In this case, the requirement to meet the partitions' guaranteed budgets takes precedence over priority.

In general, when partitions are at or over their budget, the scheduler divides time between them by the ratios of their budgets and balances usage to a few percentage points of partitions' budgets. (For more information on budget accuracy, see Choosing the window size in the System Considerations chapter of this guide.)

Even at full load, the scheduler can provide realtime latencies to an engineerable set of critical threads (see "Critical threads" later in this chapter). However, in that case, the scheduling of critical threads takes precedence over meeting budgets.


Figure: Full load. The adaptive partitioning scheduler's behavior under a full load.

Summary of scheduling behavior

The following table summarizes how the scheduler divides time in normal and freetime-by-ratio mode:

Partition state                            Normal                Freetime-by-ratio
Usage < budget                             By priority           By priority
Usage > budget, and there's free time      By priority           By ratio of budgets
Full load                                  By ratio of budgets   By ratio of budgets
Running a critical thread (at any load)    By priority           By priority

Note: The adaptive partitioning scheduler's overhead doesn't increase with the number of threads. But it may increase with the number of partitions, so you should use as few partitions as possible.

Partition inheritance

Whenever a server thread in the standard Neutrino send-receive-reply messaging scheme receives a message from a client, Neutrino considers the server thread to be working on behalf of the client, so it runs the server thread at the priority of the client. In other words, threads receiving messages inherit the priority of their senders.

With the adaptive partitioning scheduler, this concept is extended: we run server threads in the partition of their client thread while the server is working on behalf of that client. So the receiver's time is billed to the sender's adaptive partition.

What about any threads or processes that the server creates? Which partition do they run in?

New threads
If you receive a message from another partition, and you create a new thread in response, the child thread runs in the sender's partition until the child thread becomes receive-blocked. At that point, the child thread's partition is reset to be its creator's partition.
New processes
If you receive a message from another partition, and create a process in response, the process is created in the sender's partition. Any threads that the child process creates also run in the sender's partition.

Note: If you don't want the server or any threads or processes it creates to run in the client's partition, set the _NTO_CHF_FIXED_PRIORITY flag when the server creates its channel. For more information, see ChannelCreate() in the Neutrino Library Reference.
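
For example, a server that must never run in its clients' partitions might create its channel like this (a minimal sketch):

    #include <sys/neutrino.h>
    #include <stdio.h>

    int main(void)
    {
        /* Threads that receive on this channel keep their own priority
           and partition instead of inheriting the client's. */
        int chid = ChannelCreate(_NTO_CHF_FIXED_PRIORITY);

        if (chid == -1) {
            perror("ChannelCreate");
            return 1;
        }
        /* ... receive and reply to messages as usual ... */
        return 0;
    }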

Send-receive-reply message-passing is the only form of thread communication that automatically makes the server inherit the client's partition.


Figure: Inheriting the partition. Server threads temporarily join the partition of the threads they work for.

Pulses don't inherit the sender's partition. Instead, their handlers run in the process's pulse-processing partition, which by default is the partition that the process was initially created in. You can change the pulse-processing partition with the SCHED_APS_JOIN_PARTITION command to SchedCtl(), specifying the process ID, along with a thread ID of -1.
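
Here's a sketch of changing a process's pulse-processing partition. The sched_aps_join_parms structure with id, pid, and tid members is an assumption based on the description of SCHED_APS_JOIN_PARTITION; the partition ID would come from creating or looking up the partition.

    #include <sched.h>
    #include <sys/types.h>
    #include <sys/aps_sched.h>
    #include <stdio.h>

    /* Move a process's pulse handling into the given partition. */
    static int set_pulse_partition(pid_t pid, int partition_id)
    {
        sched_aps_join_parms join;

        APS_INIT_DATA(&join);
        join.id  = partition_id;  /* partition to join */
        join.pid = pid;           /* process whose pulse partition changes */
        join.tid = -1;            /* -1 selects the pulse-processing partition */

        if (SchedCtl(SCHED_APS_JOIN_PARTITION, &join, sizeof(join)) == -1) {
            perror("SCHED_APS_JOIN_PARTITION");
            return -1;
        }
        return 0;
    }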

Critical threads

A critical thread is one that's allowed to run even if its partition is over budget (provided that partition has a critical time budget). By default, Neutrino automatically identifies all threads that are initiated by an I/O interrupt as critical. However, you can use SchedCtl() to mark selected threads as critical.


Note: The ability to mark any thread as critical may require security control to prevent its abuse in a denial-of-service (DoS) attack. See "Security" in the System Considerations chapter of this guide.

Critical threads always see realtime latencies, even when the system is fully loaded or any time other threads in the same partition are being limited to meet budgets. The basic idea behind this is that a critical thread is allowed to violate the budget rules of its partition and run immediately, thereby getting the realtime response it needs. For this to work properly, there must not be many critical threads in the system.

You can use a sigevent to make a thread run as critical:

  1. Define and initialize the sigevent as normal. For example:
    struct sigevent my_event;
    SIGEV_PULSE_INIT (&my_event, coid, 10, 
                      MY_SIGNAL_CODE, 6969);
  2. Set the flag that will mark whatever thread receives your event as being critical:
    SIGEV_MAKE_CRITICAL(&my_event);

    before you send the event. This has an effect only if the thread receiving your event runs in a partition with a critical-time budget.

  3. Process the sigevent as normal in the thread that receives it. This thread doesn't have to do anything to make itself a critical thread; the kernel does that automatically.

To make a thread noncritical, you can use the SIGEV_CLEAR_CRITICAL() macro when you set up a sigevent.


Note: The SIGEV_CLEAR_CRITICAL() and SIGEV_MAKE_CRITICAL() macros set a hidden bit in the sigev_notify field. If you ever test the value of the sigev_notify field of your sigevent after creating it, and if you've ever used the SIGEV_MAKE_CRITICAL() macro, then use code like this:
switch (SIGEV_GET_TYPE(&my_event)) {

instead of this:

switch (my_event.sigev_notify) {

A thread that receives a message from a critical thread automatically becomes critical as well.

You can mark selected adaptive partitions as critical by giving each a critical-time budget. Critical time is specifically intended to allow critical interrupt threads to run over budget.

The critical time budget is specified in milliseconds. It's the amount of time all critical threads may use during an averaging window. A critical thread will run even if its adaptive partition is out of budget, as long as its partition still has critical budget.

Critical time is billed against a partition while all these conditions are met:

- the partition is running a thread that's marked critical (or that has inherited critical state)
- the partition's CPU usage over the averaging window has reached its normal budget
- some other partition with budget remaining is competing for CPU time

Otherwise, the time isn't billed as critical. Critical threads run whether or not their time is billed as critical. The only time a critical thread won't run is when its partition has exhausted its critical budget (see "Bankruptcy," below).

In order to be useful, the number of critical threads in the system must be small. If critical threads are the majority, the adaptive partitioning scheduler will rarely be able to guarantee all partitions their minimum CPU budgets. In other words, the system degrades to a priority-based scheduler when there are too many critical threads.

A critical thread remains critical until it becomes receive-blocked. A critical thread that's being billed for critical time will not be round-robin timesliced (even if its scheduling policy is round robin).


Note: Neutrino marks as critical all sigevent structures that are returned from a user's interrupt-event handler functions. This automatically makes all I/O handling threads critical, and minimizes the code changes you'd need to make for basic use of adaptive partitioning scheduling. If you don't want this behavior, specify the -c option to procnto in your buildfile.

To make a partition's critical budget effectively infinite, set it to the number of processors times the size of the averaging window; for example, on a two-processor system with a 100 ms window, a critical budget of 200 ms can never be exhausted. Do this with caution, as it can cause security problems; see "Security" in the System Considerations chapter of this guide.
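
Here's a sketch of doing this programmatically. The sched_aps_modify_partition structure and its new_budget_percent and new_critical_budget_ms members are assumptions based on the SCHED_APS_MODIFY_PARTITION command; verify them against the Neutrino Library Reference.

    #include <sched.h>
    #include <sys/aps_sched.h>
    #include <stdio.h>

    /* Give a partition an effectively infinite critical budget:
       number of CPUs x averaging window, e.g. 2 x 100 ms = 200 ms. */
    static int set_infinite_critical_budget(int partition_id,
                                            int num_cpus, int window_ms)
    {
        sched_aps_modify_partition mod;

        APS_INIT_DATA(&mod);
        mod.id = partition_id;
        mod.new_budget_percent     = -1;                   /* don't change */
        mod.new_critical_budget_ms = num_cpus * window_ms; /* can't be used up */

        if (SchedCtl(SCHED_APS_MODIFY_PARTITION, &mod, sizeof(mod)) == -1) {
            perror("SCHED_APS_MODIFY_PARTITION");
            return -1;
        }
        return 0;
    }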

Bankruptcy

Bankruptcy occurs when the critical CPU time billed to a partition exceeds its critical budget.


Note: The System partition's critical budget is infinite; this partition can never become bankrupt.

It's very important that you test your system under full load to make sure that everything works correctly, especially to ensure that you've chosen the correct critical budgets. An easy way to do this is to start a while(1) thread in each partition to use up all available time.

Bankruptcy is always considered to be a design error on the part of the application, but the system's response is configurable. Neutrino lets you set a recovery policy. The options are:

Default
Do the minimum. When a partition exhausts its critical budget, it isn't allowed to run again until the sliding averaging window recalculates that partition's average CPU consumption to be a bit less than its configured budget. At the very least, that means a number of milliseconds equal to the critical budget must pass before the partition is scheduled again.
Force a reboot
This is intended for your regression testing. It's a good way of making sure that code causing an unintended bankruptcy is never accidentally shipped to your customers. We recommend that you turn off this option before you ship.
Notify
The SchedCtl() function lets you attach a sigevent to each partition. The adaptive partitioning scheduler delivers that sigevent when the corresponding partition becomes bankrupt. To prevent a possible flood of sigevents, the scheduler delivers at most one sigevent per registration. If you want another notification, use SchedCtl() again to reattach the event. This way, Neutrino ensures that the rate of delivery of bankruptcy notifications never exceeds the application's ability to receive them.
Cancel
The cancel option sets the bankrupt partition's critical-time budget to zero. That prevents it from running as critical until you restore its critical-time budget, either through the SCHED_APS_MODIFY_PARTITION command to the SchedCtl() function or through the -B option to the aps modify command.

You can set the bankruptcy policy with the aps utility (see the Utilities Reference) or the SCHED_APS_SET_PARMS command to SchedCtl() (see the Neutrino Library Reference).

Whenever a partition's critical or normal budget is changed for any reason, bankruptcy handling is delayed by two windows. This prevents a false bankruptcy detection if, say, someone suddenly changes a partition's budget from 90% to 1%.


Note: Canceling the budget on bankruptcy changes the partition's critical budget, causing further bankruptcy detections to be suppressed for two window sizes.

In order to cause the whole system to stabilize after such an event, the scheduler gives all partitions a two-window grace period against declaring bankruptcy when one partition gets its budget canceled.

Adaptive partitioning and other schedulers

Priorities and scheduler policies are relative to one adaptive partition only; priority order is respected within a partition, but not between partitions if the adaptive partitioning scheduler needs to balance budgets.

You can use the adaptive partitioning scheduler with the existing FIFO, round-robin, and sporadic scheduling policies. The scheduler, however, may stop running a thread before the end of its timeslice in order to run a thread from another partition.

This occurs when the thread's partition runs out of budget and some other partition has budget; i.e. the scheduler doesn't wait for the end of a thread's timeslice to decide to run a thread from a different partition. The scheduler makes that decision every clock tick (where a tick is 1 ms on most machines). There are four ticks per timeslice.

Round-robin threads that are being billed for critical time aren't timesliced with threads of equal priority.

A caveat about FIFO scheduling

Be careful not to misuse the FIFO scheduling policy. There's a trick for getting mutual exclusion between a set of threads that are reading and writing shared data, without using a mutex: you can make all threads vying for the same shared data run at the same priority.

Since only one thread can run at a time (at least on a uniprocessor system), and since, under FIFO scheduling, one thread never preempts another of the same priority, each thread has a monopoly on the shared data while it runs.

This is fragile: any accidental change to the scheduling policy or priorities will likely cause one thread to preempt another in the middle of its critical section, breaking the code. If you accidentally put the threads using this trick into different partitions (or let them receive messages from different partitions), their critical sections will be broken.

If your application's threads depend on their priorities to control the order in which they run, always place those threads in the same partition, and don't send them messages from other partitions.

To solve this problem properly, use mutexes, barriers, or pulses to control thread order; this also prevents problems if you run your application on a multicore system. As a workaround, you can specify the _NTO_CHF_FIXED_PRIORITY flag to ChannelCreate(), which prevents the receiving thread from running in the sending thread's partition.
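
For example, replacing the FIFO trick with an ordinary mutex makes the critical section explicit and keeps it safe regardless of priorities, partitions, or the number of CPUs:

    #include <pthread.h>

    static pthread_mutex_t data_lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_counter;

    /* Safe under any scheduling policy, in any partition, on any
       number of processors. */
    void update_shared_data(void)
    {
        pthread_mutex_lock(&data_lock);
        shared_counter++;               /* the critical section */
        pthread_mutex_unlock(&data_lock);
    }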

In general, you should make sure your applications don't depend on FIFO scheduling or the length of the timeslice for mutual exclusion.

Using adaptive partitioning and multicore together

On a multicore system, you can use adaptive partitioning and symmetric multiprocessing (SMP) to reap the rewards of both. For more information, see the Multicore Processing User's Guide.

Note the following point: a single thread can run on only one CPU at a time, so a partition containing just one ready thread can never use more than one CPU's worth of time, and therefore may be unable to consume its entire budget.

It may seem unlikely to have only one thread per partition, since most systems have many threads. However, there is a way this situation can occur on a many-threaded system.

The runmask controls which CPUs a thread is allowed to run on. With deliberate (or careless) use of the runmask, it's possible to arrange things so that not enough threads are permitted to run on a particular processor for the scheduler to meet its budgets.
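
For reference, a thread typically restricts itself to one processor with the _NTO_TCTL_RUNMASK command to ThreadCtl(); the runmask has one bit per CPU. This minimal sketch locks the calling thread to CPU 0:

    #include <sys/neutrino.h>
    #include <stdio.h>

    int main(void)
    {
        /* Bit 0 set: this thread may run only on CPU 0. Restricting
           too many threads this way can starve partitions. */
        if (ThreadCtl(_NTO_TCTL_RUNMASK, (void *)0x1) == -1) {
            perror("ThreadCtl(_NTO_TCTL_RUNMASK)");
            return 1;
        }
        /* ... do CPU-0-bound work ... */
        return 0;
    }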

If there are several threads that are ready to run, and they're permitted to run on each CPU, then the scheduler correctly guarantees each partition's minimum budget.


Note: On a hyperthreaded machine, the actual throughput of partitions may not match the percentage of CPU time reported by the adaptive partitioning scheduler. This is because on a hyperthreaded machine, throughput isn't always proportional to time, regardless of what kind of scheduler is being used. This is most likely to occur when a partition doesn't contain enough ready threads to occupy all of the pseudo-processors on a hyperthreaded machine.

Adaptive partitioning and BMP

Certain combinations of runmasks and partition budgets can have surprising results.

For example, suppose we have a two-CPU SMP machine with these partitions: the System partition with a budget of 50%, and partition Pa with a budget of 50%. Suppose the system is idle.

If you run a priority-10 thread that's locked to CPU 1 and is in an infinite loop in partition Pa, the scheduler interprets this to mean that you intend Pa to monopolize CPU 1. That's because CPU 1 can provide only 50% of the whole machine's processing time.

If you run another thread at priority 9, also locked to CPU 1, but in the System partition, the adaptive partitioning scheduler interprets that to mean you also want the System partition to monopolize CPU 1.

The scheduler has a dilemma: it can't satisfy the requirements of both partitions. What it actually does is allow partition Pa to monopolize CPU 1.

Here's why: from an idle start, the scheduler observes that both partitions have available budget. When partitions have available budget, the scheduler schedules in realtime mode, which is strict priority scheduling, so partition Pa runs. But because CPU 1 alone can never supply more than partition Pa's budget, Pa never goes over budget. Therefore, the scheduler remains in realtime mode, and the lower-priority System partition never runs.

For this example, the aps show command might display:

                    +---- CPU Time ----+-- Critical Time --
Partition name   id | Budget |    Used | Budget |      Used
--------------------+------------------+-------------------
System            0 |    50% |   0.09% |  200ms |   0.000ms
Pa                1 |    50% |  49.93% |    0ms |   0.000ms
--------------------+------------------+-------------------
Total               |   100% |  50.02% |

The System partition is getting no CPU time even though it contains a thread that is ready to run.

Similar situations can occur when there are several partitions, each having a budget less than 50%, but whose budgets sum to 50% or more.

Avoiding infinite loops is a good way to avoid these situations. However, if you're running third-party software, you may not have control over the code.

To simplify the usage of runmasks with the adaptive partitioning scheduler, you may configure the scheduler to follow a more restrictive algorithm that prefers to meet budgets in some circumstances rather than schedule by priority.

To do so, set the SCHED_APS_SCHEDPOL_BMP_SAFETY flag with the SchedCtl() function; the flag is declared in <sys/aps_sched.h> (see the Neutrino Library Reference). Alternatively, use the aps modify -S bmp_safety command (see the entry for aps in the Utilities Reference).
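
In code, this follows the same pattern as the freetime-by-ratio setting shown earlier; again, the sched_aps_parms structure and APS_INIT_DATA() macro are assumptions to verify against your header:

    #include <sched.h>
    #include <sys/aps_sched.h>
    #include <stdio.h>

    /* Prefer meeting budgets over strict priority scheduling when
       runmasks restrict which CPUs threads may use. */
    static int set_bmp_safety(void)
    {
        sched_aps_parms parms;

        APS_INIT_DATA(&parms);
        parms.scheduling_policy_flags = SCHED_APS_SCHEDPOL_BMP_SAFETY;

        if (SchedCtl(SCHED_APS_SET_PARMS, &parms, sizeof(parms)) == -1) {
            perror("SCHED_APS_SET_PARMS");
            return -1;
        }
        return 0;
    }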

The following table shows how time is divided in normal mode (with its risk of monopolization) and BMP-safety mode on a 2-CPU machine:

Partition state                          Normal                BMP-safety
Usage < budget / 2                       By priority           By priority
Usage < budget                           By priority           By ratio of budgets
Usage > budget, but there's free time    By priority (a)       By ratio of budgets
Full load                                By ratio of budgets   By ratio of budgets

(a) When SCHED_APS_SCHEDPOL_FREETIME_BY_PRIORITY isn't set. For more information, see the SCHED_APS_SET_PARMS command in the entry for SchedCtl() in the Neutrino Library Reference.

In the example above, but with BMP-safety turned on, not only does the scheduler run both the System partition and partition Pa, it fairly divides time on CPU 1 by the ratio of the partitions' budgets. The aps show command displays usage something like this:

                    +---- CPU Time ----+-- Critical Time --
Partition name   id | Budget |    Used | Budget |      Used
--------------------+------------------+-------------------
System            0 |    50% |  25.03% |  200ms |   0.000ms
Pa                1 |    50% |  24.99% |    0ms |   0.000ms
--------------------+------------------+-------------------
Total               |   100% |  50.02% |

The BMP-safety mode provides an easier-to-analyze scheduling mode at the cost of reducing the circumstances when the scheduler will schedule strictly by priority.