Caution: This version of this document is no longer maintained. For the latest documentation, see http://www.qnx.com/developers/docs.

System Considerations

This chapter includes:

You typically use the adaptive partitioning scheduler to:

In either case, you need to configure the parameters for the adaptive partitioning scheduler with the whole system in mind. The basic decisions are:

Determining the number of adaptive partitions and their contents

It seems reasonable to put functionally-related software into the same adaptive partition, and frequently that's the right choice. However, adaptive partitioning scheduling is a structured way of deciding when not to run software. So the actual method is to separate the software into different adaptive partitions if it should be starved of CPU time under different circumstances.

For example, if the system is a packet router that:

it may seem reasonable to have two adaptive partitions: one for routing, and one for topology. Certainly logging routing metrics is functionally related to packet routing.

However, when the system is overloaded, meaning there's more outstanding work than the machine can possibly accomplish, you need to decide what work to do slowly. In this example, when the router is overloaded with incoming packets, it's still important to route them. But you may decide that if you can't do everything, you'd rather route packets than collect the routing metrics. By the same analysis, you might conclude that route-topology protocols should still run, using much less of the machine than routing itself, but run quickly when they need to.

Such an analysis leads to three partitions:

In this case, we chose to separate the functionally-related components of routing and logging the routing metrics because we prefer to starve just one if we're forced to starve something. Similarly, we chose to group two functionally-unrelated components, the logging of routing metrics and the logging of topology metrics, because we want to starve them under the same circumstances.

Choosing the percentage CPU for each partition

The amount of CPU time that each adaptive partition tends to use under unloaded conditions is a good indication of the budget you should assign to it. If your application is a transaction processor, it may be useful to measure CPU consumption under a few different loads and construct a graph of offered load versus CPU consumed.

In general, the key to getting the right combination of partition budgets is to try them:

  1. Leave security turned off.
  2. Load up a test machine with realistic loads.
  3. Examine the latencies of your time-sensitive threads with the IDE's System Profiler.
  4. Try different patterns of budgets, which you can easily change at run time with the aps command.

Setting budgets to zero

It's possible to set the budget of a partition to zero, as long as the SCHED_APS_SEC_NONZERO_BUDGETS security flag isn't set (see the SCHED_APS_ADD_SECURITY command for SchedCtl()).
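For example, you could create a zero-budget partition through SchedCtl(). This is only a sketch: the sched_aps_create_parms member names shown (name, budget_percent, critical_budget_ms, id) and the partition name "background" are assumptions to verify against <sys/sched_aps.h> on your target.

```c
#include <sys/sched_aps.h>
#include <stdio.h>

/* Sketch: create a zero-budget partition. The member names
 * (name, budget_percent, critical_budget_ms, id) are assumptions;
 * verify them against <sys/sched_aps.h> on your target. */
int create_zero_budget_partition(void)
{
    sched_aps_create_parms creation_data;

    APS_INIT_DATA( &creation_data );
    creation_data.name = "background";    /* hypothetical partition name */
    creation_data.budget_percent = 0;     /* fails if SCHED_APS_SEC_NONZERO_BUDGETS is set */
    creation_data.critical_budget_ms = 0;

    if (SchedCtl( SCHED_APS_CREATE_PARTITION, &creation_data,
                  sizeof(creation_data) ) == -1) {
        perror( "SchedCtl(SCHED_APS_CREATE_PARTITION)" );
        return -1;
    }
    return creation_data.id;  /* new partition's ID, filled in by the kernel */
}
```

The call fails (as intended) on systems where the nonzero_budgets security option is in force.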

Threads in a zero-budget partition run only in these cases:

When is it useful to set the budget of a partition to zero?

But in general, setting a partition's budget to zero is risky. (This is why the SCHED_APS_SEC_RECOMMENDED security setting doesn't permit partition budgets to be zero.) The main risk in placing code into a zero-budget partition is that it may run in response to a pulse or event (i.e. not a message), and hence not run in the sender's partition. So, when the system is loaded (i.e. there's no free time), those threads may simply not run; they might hang, or things might happen in the wrong order.

For example, it's hazardous to set the System partition's budget to zero. On a loaded machine with a System partition of zero, requests to procnto to create processes and threads may hang, for example, when MAP_LAZY is used.


Note: If your system uses zero-budget partitions, you should carefully test it with all other partitions fully loaded with while(1) loops.

Setting budgets for resource managers

Ideally we'd like resource managers, such as filesystems, to run with a budget of zero. That way they'd always be billing time to their clients. However, sometimes device drivers find out too late which client a particular thread has been doing work for. Some device drivers may have background threads for audits or maintenance that require CPU time that can't be attributed to a particular client.

In those cases, you should measure the resource manager's background and unattributable loads and add that amount to its partition's budget.
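For instance, if you measure a filesystem's audit and maintenance threads consuming about 5% CPU, you might raise its partition's budget by that amount. The sketch below assumes the SCHED_APS_MODIFY_PARTITION command takes a sched_aps_modify_parms structure with the members shown (id, new_budget_percent); verify both against <sys/sched_aps.h> on your target.

```c
#include <sys/sched_aps.h>
#include <stdio.h>

/* Sketch: set a resource manager's partition budget to cover its
 * measured unattributable background load. The member names
 * (id, new_budget_percent) are assumptions; verify them against
 * <sys/sched_aps.h>. */
int set_background_budget(int partition_id, int measured_background_percent)
{
    sched_aps_modify_parms mod;

    APS_INIT_DATA( &mod );
    mod.id = partition_id;
    mod.new_budget_percent = measured_background_percent; /* e.g. 5 */

    if (SchedCtl( SCHED_APS_MODIFY_PARTITION, &mod, sizeof(mod) ) == -1) {
        perror( "SchedCtl(SCHED_APS_MODIFY_PARTITION)" );
        return -1;
    }
    return 0;
}
```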


Note:
  • If your server has maintenance threads that never serve clients, then it should be in a partition with a nonzero budget.
  • If your server communicates with its clients by sending messages, or by using mutexes or shared memory (i.e. anything other than receiving messages), then your server should be in a partition with a nonzero budget.

Choosing the window size

You can set the size of the time-averaging window to be from 8 to 400 ms. This is the time over which the scheduler tries to balance adaptive partitions to their guaranteed CPU limits. Different choices of window sizes affect both the accuracy of load balancing and, in extreme cases, the maximum delays seen by ready-to-run threads.
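You set the window size with the SCHED_APS_SET_PARMS command to SchedCtl(). A minimal sketch, assuming the windowsize_ms member of sched_aps_parms (check <sys/sched_aps.h> on your target):

```c
#include <sys/sched_aps.h>
#include <stdio.h>

/* Sketch: set the time-averaging window to 100 ms. The windowsize_ms
 * member name is an assumption; verify it against <sys/sched_aps.h>. */
int set_window_size(void)
{
    sched_aps_parms parms;

    APS_INIT_DATA( &parms );
    parms.windowsize_ms = 100;   /* must be between 8 and 400 ms */

    if (SchedCtl( SCHED_APS_SET_PARMS, &parms, sizeof(parms) ) == -1) {
        perror( "SchedCtl(SCHED_APS_SET_PARMS)" );
        return -1;
    }
    return 0;
}
```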

Accuracy

Some things to consider:

Delays compared to priority scheduling

In an underload situation, the scheduler doesn't delay ready-to-run threads, but the highest-priority thread might not run if the adaptive partitioning scheduler is balancing budgets.

In very unlikely cases, a large window size can cause some adaptive partitions to experience runtime delays, but these delays are always less than what would occur without adaptive partitioning scheduling. There are two cases where this can occur.

Case 1

Expressing each adaptive partition's budget in milliseconds of the averaging window, the delay is never longer than:

window_size - smallest_budget + largest_budget

This upper bound is only ever reached when low-budget and low-priority adaptive partitions interact with two other adaptive partitions in a specific way, and then only when all threads in the system are ready to run for very long intervals. This maximum possible delay has an extremely low chance of occurring.

For example, let's suppose we have these adaptive partitions:

This delay occurs if the following sequence of events takes place:

Note that this scenario can't happen unless a high-priority partition wakes up exactly when a lower-priority partition has just finished paying back its opportunistic run time.

Case 2

Still rare, but more common, is a delay of:

window_size - budget

milliseconds, which low-budget adaptive partitions may experience when their average priorities are equal to those of the other partitions.

However, with a typical mix of thread priorities, each adaptive partition typically experiences a maximum delay, when ready to run, of much less than window_size milliseconds.

For example, let's suppose we have these adaptive partitions:

This delay occurs if the following sequence of events takes place:

However, this pattern occurs only if the 10% application never suspends (which is exceedingly unlikely) and if there are no threads of other priorities (also exceedingly unlikely).

Approximating the delays

Because these scenarios are complicated, and the maximum delay time is a function of the partition shares, we approximate this rule by saying that the maximum ready-queue delay time is twice the window size.


Note: If you change the tick size of the system at runtime, do so before defining the adaptive partitioning scheduler's window size. That's because Neutrino converts the window size from milliseconds to clock ticks for internal use.

The practical way to check that your scheduling delays are correct is to load your system with stress loads and use the IDE's System Profiler to study the delays. The aps command lets you change budgets dynamically, so you can quickly confirm that you have the right configuration of budgets.

Practical limits

The API allows a window size as short as 8 ms. However, practical window sizes may need to be larger. For example, in an eight-partition system with all partitions busy, for all eight to reasonably be expected to run during every window, the window size needs to be at least 8 timeslices long, which for most systems is 32 ms.

Uncontrolled interactions between adaptive partitions

There are cases where an adaptive partition can prevent other applications from being given their guaranteed percentage CPU:

Security

By default, anyone on the system can add partitions and modify their attributes. We recommend that you use the SCHED_APS_ADD_SECURITY command to SchedCtl(), or the aps modify command to specify the level of security that suits your system.

Here are the main security options, in increasing order of security. This list shows the aps command and the corresponding SchedCtl() flag:

none or the SCHED_APS_SEC_OFF flag
Anyone on the system can add partitions and modify their attributes.
basic or the SCHED_APS_SEC_BASIC flag
Only root in the System partition may change overall scheduling parameters and set critical budgets.
flexible or the SCHED_APS_SEC_FLEXIBLE flag
Only root in the System partition can change scheduling parameters or critical budgets, but root running in any partition can create subpartitions, join threads to its own subpartitions, and modify its subpartitions. This lets applications create their own local subpartitions out of their own budgets. The percentage budgets must not be zero.
recommended or the SCHED_APS_SEC_RECOMMENDED flag
Only root from the System partition may create partitions or change parameters. This arranges a 2-level hierarchy of partitions: the System partition and its children. Only root, running in the System partition, may join its own thread to partitions. The percentage budgets must not be zero.

Unless you're testing the partitioning and want to change all parameters without needing to restart, you should set at least basic security.

After setting up the partitions, you can use SCHED_APS_SEC_LOCK_PARTITIONS to prevent further unauthorized changes. For example:

#include <sys/sched_aps.h>

sched_aps_security_parms p;

APS_INIT_DATA( &p );
p.sec_flags = SCHED_APS_SEC_LOCK_PARTITIONS;
SchedCtl( SCHED_APS_ADD_SECURITY, &p, sizeof(p) );

Note: Before you call SchedCtl(), make sure you initialize all the members of the data structure associated with the command. You can use the APS_INIT_DATA() macro to do this.

The security options listed above are composed of the following options (but it's more convenient to use the compound options):

root0_overall or the SCHED_APS_SEC_ROOT0_OVERALL flag
You must be root running in the System partition in order to change the overall scheduling parameters, such as the averaging window size.
root_makes_partitions or the SCHED_APS_SEC_ROOT_MAKES_PARTITIONS flag
You must be root in order to create or modify partitions.
sys_makes_partitions or the SCHED_APS_SEC_SYS_MAKES_PARTITIONS flag
You must be running in the System partition in order to create or modify partitions.
parent_modifies or the SCHED_APS_SEC_PARENT_MODIFIES flag
Allows partitions to be modified (SCHED_APS_MODIFY_PARTITION), but you must be running in the parent partition of the partition being modified. "Modify" means to change a partition's percentage or critical budget or attach events with the SCHED_APS_ATTACH_EVENTS command.
nonzero_budgets or the SCHED_APS_SEC_NONZERO_BUDGETS flag
A partition may not be created with, or modified to have, a zero budget. Unless you know your partition needs to run only in response to client requests, i.e. receipt of messages, you should set this option.
root_makes_critical or the SCHED_APS_SEC_ROOT_MAKES_CRITICAL flag
You have to be root in order to create a nonzero critical budget or change an existing critical budget.
sys_makes_critical or the SCHED_APS_SEC_SYS_MAKES_CRITICAL flag
You must be running in the System partition to create a nonzero critical budget or change an existing critical budget.
root_joins or the SCHED_APS_SEC_ROOT_JOINS flag
You must be root in order to join a thread to a partition.
sys_joins or the SCHED_APS_SEC_SYS_JOINS flag
You must be running in the System partition in order to join a thread.
parent_joins or the SCHED_APS_SEC_PARENT_JOINS flag
You must be running in the parent partition of the partition you wish to join to.
join_self_only or the SCHED_APS_SEC_JOIN_SELF_ONLY flag
A process may join only itself to a partition.
partitions_locked or the SCHED_APS_SEC_PARTITIONS_LOCKED flag
Prevent further changes to any partition's budget, or overall scheduling parameters, such as the window size. Set this after you've set up your partitions.
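You can OR these primitive options together in sec_flags instead of using one of the compound settings. The particular combination below is illustrative only; it uses the same sched_aps_security_parms structure and SCHED_APS_ADD_SECURITY command shown earlier.

```c
#include <sys/sched_aps.h>
#include <stdio.h>

/* Sketch: compose primitive security options rather than using a
 * compound setting. This specific combination is an example, not a
 * recommendation. */
int set_custom_security(void)
{
    sched_aps_security_parms p;

    APS_INIT_DATA( &p );
    p.sec_flags = SCHED_APS_SEC_ROOT_MAKES_PARTITIONS
                | SCHED_APS_SEC_SYS_MAKES_PARTITIONS
                | SCHED_APS_SEC_NONZERO_BUDGETS;

    if (SchedCtl( SCHED_APS_ADD_SECURITY, &p, sizeof(p) ) == -1) {
        perror( "SchedCtl(SCHED_APS_ADD_SECURITY)" );
        return -1;
    }
    return 0;
}
```

Note that security options can only be added, so each call tightens, never loosens, the policy in effect.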

Security and critical threads

Any thread can make itself critical, and any designer can make any sigevent critical (meaning that it will cause the eventual receiver to run as critical), but this isn't a security hole. That's because a thread marked as critical has no effect on the scheduler unless the thread is in a partition that has a critical budget. The adaptive partitioning scheduler has security options that control who may set or change a partition's critical budget.

For the system to be secure against possible critical thread abuse, it's important to: