This chapter includes:
- Determining the number of adaptive partitions and their contents
- Choosing the percentage CPU for each partition
- Choosing the window size
- Practical limits
- Uncontrolled interactions between adaptive partitions
You typically use the adaptive partitioning scheduler to:
- engineer a system to work in a predictable or defined manner when it's fully loaded
- prevent unimportant or untrusted applications from monopolizing the system
In either case, you need to configure the parameters for the adaptive partitioning scheduler with the whole system in mind. The basic decisions are:
- How many adaptive partitions should you create and what software should go into each?
- What guaranteed CPU percentage should each adaptive partition get?
- What should be the critical budget, if any, of each adaptive partition?
- What size, in milliseconds, should the time-averaging window be?
It seems reasonable to put functionally-related software into the same adaptive partition, and frequently that's the right choice. However, adaptive partitioning scheduling is a structured way of deciding when not to run software. So the actual method is to separate the software into different adaptive partitions if it should be starved of CPU time under different circumstances.
For example, if the system is a packet router that:
- routes packets
- collects and logs statistics for packet routing
- handles route-topology protocols with peer routers
- collects and logs route-topology metrics
it may seem reasonable to have two adaptive partitions: one for routing, and one for topology. Certainly logging routing metrics is functionally related to packet routing.
However, when the system is overloaded, meaning there's more outstanding work than the machine can possibly accomplish, you need to decide what work to do slowly. In this example, when the router is overloaded with incoming packets, it's still important to route them. But you may decide that if you can't do everything, you'd rather route packets than collect the routing metrics. By the same analysis, you might conclude that route-topology protocols should still run, using much less of the machine than routing itself, but run quickly when they need to.
Such an analysis leads to three partitions:
- a partition for routing packets, with a large share, say 80%
- a partition for topology protocols, say 15%, but with maximum thread priorities that are higher than those for packet routing
- a partition for logging both the routing metrics and topology-protocol metrics
In this case, we chose to separate the functionally-related components of routing and logging the routing metrics because we prefer to starve just one if we're forced to starve something. Similarly, we chose to group two functionally-unrelated components, the logging of routing metrics and the logging of topology metrics, because we want to starve them under the same circumstances.
The amount of CPU time that each adaptive partition tends to use under unloaded conditions is a good indication of the budget you should assign to it. If your application is a transaction processor, it may be useful to measure CPU consumption under a few different loads and construct a graph of offered load versus CPU consumed.
In general, the key to getting the right combination of partition budgets is to try them:
- Leave security turned off.
- Load up a test machine with realistic loads.
- Examine the latencies of your time-sensitive threads with the IDE's System Profiler.
- Try different patterns of budgets, which you can easily change at run time with the aps command.
It's possible to set the budget of a partition to zero as long as the SCHED_APS_SEC_NONZERO_BUDGETS security flag isn't set (see the SCHED_APS_ADD_SECURITY command for SchedCtl()).
Threads in a zero-budget partition run only in these cases:
- All other nonzero-budget partitions are idle.
- The zero-budget partition has a nonzero critical budget, in which case its critical threads run.
- A thread receives a message from a partition with a nonzero budget, in which case the receiving thread runs temporarily in the sender's partition.
When is it useful to set the budget of a partition to zero?
- When a partition is permanently empty of running threads, you can set its budget to zero to effectively turn it off. When a zero-budget partition is idle, it isn't considered to cause free time (see "Summary of scheduling behavior" in the Adaptive Partitioning Scheduling Details chapter of this guide). A partition with a nonzero budget, that never runs threads, will put the scheduler permanently in free-time mode, which may not be the desired behavior.
- When you want noncritical code to run only when some other partition is idle.
- When the partition is populated by resource managers, or other software, that runs only in response to receiving messages. Putting them in a zero-budget partition means you don't have to separately engineer a partition budget for them; those resource managers automatically bill their time to the partitions of their clients.
But in general, setting a partition's budget to zero is risky. (This is why the SCHED_APS_SEC_RECOMMENDED security setting doesn't permit partition budgets to be zero.) The main risk in placing code into a zero-budget partition is that it may run in response to a pulse or event (i.e. not a message), and hence not run in the sender's partition. So, when the system is loaded (i.e. there's no free time), those threads may simply not run; they might hang, or things might happen in the wrong order.
For example, it's hazardous to set the System partition's budget to zero. On a loaded machine with a System partition of zero, requests to procnto to create processes and threads may hang, for example, when MAP_LAZY is used.
Note: If your system uses zero-budget partitions, you should carefully test it with all other partitions fully loaded with while(1) loops.
Ideally we'd like resource managers, such as filesystems, to run with a budget of zero. That way they'd always be billing time to their clients. However, sometimes device drivers find out too late which client a particular thread has been doing work for. Some device drivers may have background threads for audits or maintenance that require CPU time that can't be attributed to a particular client.
In those cases, you should measure the resource manager's background and unattributable loads and add that amount to its partition's budget.
You can set the size of the time-averaging window to be from 8 to 400 ms. This is the time over which the scheduler tries to balance adaptive partitions to their guaranteed CPU limits. Different choices of window sizes affect both the accuracy of load balancing and, in extreme cases, the maximum delays seen by ready-to-run threads.
Some things to consider:
- A small window size reduces the accuracy of CPU time balancing. The error is +/-(tick_size / window_size). For example, if the window size is 10 ms and the tick size is 1 ms, the error is about 10 percentage points.
- If a partition opportunistically goes over budget (because other partitions are using less than their guaranteed budget), it must pay back the borrowed time, but only as much as the scheduler "remembers" (i.e. only the borrowing that occurred in the last window). A small window size means that an adaptive partition that opportunistically goes over budget might not have to pay the time back. If a partition sleeps for longer than the window size, it won't get the time back later. So load balancing won't be accurate over the long term if both the system is loaded and some partitions sleep for longer than the window size.
- If the window size is small enough that some partition's percentage budget becomes less than a tick, the partition will get to run for at least 1 tick during each window, giving it 1 tick / window_size_in_ticks per cent of the CPU time, which may be considerably larger than the partition's actual budget. As a result, other partitions may not get their CPU budgets.
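The arithmetic behind the first and last considerations can be sketched in a few lines of C. This is only an illustration of the formulas quoted above, not part of the scheduler API: the function names are hypothetical, and the sample values in the note below assume a 1 ms tick.

```c
#include <assert.h>

/* Balancing error in percentage points: tick_size / window_size.
 * Both arguments are in microseconds. (Hypothetical helper.) */
double aps_balance_error_pct(long tick_us, long window_us)
{
    assert(tick_us > 0 && window_us >= tick_us);
    return 100.0 * (double)tick_us / (double)window_us;
}

/* Share a partition effectively receives when it is busy: if its
 * nominal budget is smaller than one tick per window, it still runs
 * for at least one tick in each window. (Hypothetical helper.) */
double aps_effective_share_pct(double budget_pct, long tick_us, long window_us)
{
    double one_tick_pct = aps_balance_error_pct(tick_us, window_us);
    return budget_pct < one_tick_pct ? one_tick_pct : budget_pct;
}
```

With an assumed 1 ms tick, aps_balance_error_pct(1000, 10000) gives 10 percentage points for a 10 ms window, and a partition with a nominal 2% budget in an 8 ms window would effectively get 12.5% of the CPU.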
In an underload situation, the scheduler doesn't delay ready-to-run threads, but the highest-priority thread might not run if the adaptive partitioning scheduler is balancing budgets.
In very unlikely cases, a large window size can cause some adaptive partitions to experience runtime delays, but these delays are always less than what would occur without adaptive partitioning scheduling. There are two cases where this can occur.
If an adaptive partition's budget is budget milliseconds, then the delay is never longer than:
window_size - smallest_budget + largest_budget
This upper bound is only ever reached when low-budget and low-priority adaptive partitions interact with two other adaptive partitions in a specific way, and then only when all threads in the system are ready to run for very long intervals. This maximum possible delay has an extremely low chance of occurring.
For example, let's suppose we have these adaptive partitions:
- Partition A: 10% share; always ready to run at priority 10
- Partition B: 10% share; when it runs, it runs at priority 20
- Partition C: 80% share; when it runs, it runs at priority 30
This delay occurs with the following sequence of events:
- Let B and C sleep for a long time. A will run opportunistically and eventually run for 100 ms (the size of the averaging window).
- Then B wakes up. It has both available budget and a higher priority, so it runs. Let's call this time Ta, since it's the last time partition A ran. Since C is still sleeping, B runs opportunistically.
- At Ta + 90 ms, partition A has just paid back all the time it opportunistically used (the window size minus partition A's budget of 10%). Normally, it would run on the very next tick because that's when it would next have a budget of 1 ms, and B is over budget.
- But let's say that, by coincidence, C chooses to wake at that exact time. Because it has budget and a higher priority than A, it runs. It proceeds to run for another 80 ms, which is when it runs out of budget.
- Only now, at Ta + 90 ms + 80 ms, or 170 ms later, does A get to run again.
Note that this scenario can't happen unless a high-priority partition wakes up exactly when a lower-priority partition just finishes paying back its opportunistic run time.
Still rare, but more common, is a delay of:
window_size - budget
milliseconds, which may affect low-budget adaptive partitions whose priorities are, on average, equal to those of other partitions.
However, with a typical mix of thread priorities, each adaptive partition typically experiences a maximum delay, when ready to run, of much less than window_size milliseconds.
For example, let's suppose we have these adaptive partitions:
- partition A: 10% share, always ready to run at priority 10
- partition B: 90% share, always ready to run at priority 20, except that every 150 ms, it sleeps for 50 ms.
This delay occurs with the following sequence of events:
- When partition B sleeps, partition A is already at its budget limit of 10 ms (10% of the window size).
- But then A runs opportunistically for 50 ms, which is when B wakes up. Let's call that time Ta, the last time partition A ran.
- B runs continuously for 90 ms, which is when it exhausts its budget. Only then does A run again. This is 90 ms after Ta.
However, this pattern occurs only if the 10% application never suspends (which is exceedingly unlikely) and if there are no threads of other priorities (also exceedingly unlikely).
Because these scenarios are complicated, and the maximum delay time is a function of the partition shares, we approximate this rule by saying that the maximum ready-queue delay time is twice the window size.
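To make the two bounds and the rule of thumb concrete, here is a minimal sketch in C that evaluates them for the example partitions above. The function names are hypothetical, and budgets are passed as percentages of the window, which is an assumption made for convenience.

```c
#include <assert.h>

/* Rare worst case: window_size - smallest_budget + largest_budget. */
int aps_worst_delay_ms(int window_ms, int smallest_pct, int largest_pct)
{
    assert(window_ms > 0 && smallest_pct <= largest_pct);
    int smallest_ms = window_ms * smallest_pct / 100;
    int largest_ms = window_ms * largest_pct / 100;
    return window_ms - smallest_ms + largest_ms;
}

/* More common (but still rare) case: window_size - budget. */
int aps_common_delay_ms(int window_ms, int budget_pct)
{
    assert(window_ms > 0 && budget_pct >= 0 && budget_pct <= 100);
    return window_ms - window_ms * budget_pct / 100;
}

/* Rule of thumb from the text: twice the window size. */
int aps_rule_of_thumb_delay_ms(int window_ms)
{
    return 2 * window_ms;
}
```

For a 100 ms window, aps_worst_delay_ms(100, 10, 80) yields the 170 ms of the first example, aps_common_delay_ms(100, 10) yields the 90 ms of the second, and both stay below the 200 ms rule of thumb.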
Note: If you change the tick size of the system at runtime, do so before defining the adaptive partitioning scheduler's window size. That's because Neutrino converts the window size from milliseconds to clock ticks for internal use.
The practical way to check that your scheduling delays are correct is to load your system with stress loads and use the IDE's System Profiler to study the delays. The aps command lets you change budgets dynamically, so you can quickly confirm that you have the right configuration of budgets.
The API allows a window size as short as 8 ms. However practical window sizes may need to be larger. For example, in an eight-partition system, with all partitions busy, to reasonably expect all eight to run during every window, the window size needs to be at least 8 timeslices long, which for most systems is 32 ms.
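The sizing rule in this paragraph can be written as a small helper. This is a sketch only: the function and constant names are hypothetical, the 8 ms and 400 ms limits are the window range given earlier in this chapter, and the timeslice length is a parameter because it depends on your system (the 32 ms figure above corresponds to 4 ms timeslices).

```c
#include <assert.h>

/* Window-size range stated in this chapter (hypothetical names). */
enum { APS_WINDOW_MIN_MS = 8, APS_WINDOW_MAX_MS = 400 };

/* Smallest window that lets every busy partition run for one
 * timeslice per window. Returns -1 if that would exceed the
 * largest window size the scheduler accepts. */
int aps_min_practical_window_ms(int busy_partitions, int timeslice_ms)
{
    assert(busy_partitions > 0 && timeslice_ms > 0);
    int needed_ms = busy_partitions * timeslice_ms;
    if (needed_ms < APS_WINDOW_MIN_MS)
        needed_ms = APS_WINDOW_MIN_MS;
    return needed_ms > APS_WINDOW_MAX_MS ? -1 : needed_ms;
}
```

For eight busy partitions and 4 ms timeslices, aps_min_practical_window_ms(8, 4) returns 32, matching the example in the text.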
There are cases where an adaptive partition can prevent other applications from being given their guaranteed percentage CPU:
- Interrupt handlers: The time used in interrupt handlers is never throttled. That is, we always choose to execute the globally highest-priority interrupt handler, independent of its adaptive partition. This means that faulty hardware or software that causes too many interrupts can effectively limit the time available to other applications.
However, time spent in interrupt threads (e.g. those that use InterruptAttachEvent()) is correctly charged to those threads' adaptive partitions.
By default, anyone on the system can add partitions and modify their attributes. We recommend that you use the SCHED_APS_ADD_SECURITY command to SchedCtl(), or the aps modify command, to specify the level of security that suits your system.
Here are the main security options, in increasing order of security. This list shows the aps command and the corresponding SchedCtl() flag:
- none or the SCHED_APS_SEC_OFF flag
- Anyone on the system can add partitions and modify their attributes.
- basic or the SCHED_APS_SEC_BASIC flag
- Only root in the System partition may change overall scheduling parameters and set critical budgets.
- flexible or the SCHED_APS_SEC_FLEXIBLE flag
- Only root in the System partition can change scheduling parameters or change critical budgets. But root running in any partition can create subpartitions, join threads into its own subpartitions and modify subpartitions. This lets applications create their own local subpartitions out of their own budgets. The percentage budgets must not be zero.
- recommended or the SCHED_APS_SEC_RECOMMENDED flag
- Only root from the System partition may create partitions or change parameters. This arranges a 2-level hierarchy of partitions: the System partition and its children. Only root, running in the System partition, may join its own thread to partitions. The percentage budgets must not be zero.
Unless you're testing the partitioning and want to change all parameters without needing to restart, you should set at least basic security.
After setting up the partitions, you can use SCHED_APS_SEC_PARTITIONS_LOCKED to prevent further unauthorized changes. For example:
    sched_aps_security_parms p;

    APS_INIT_DATA( &p );                            /* initialize every member first */
    p.sec_flags = SCHED_APS_SEC_PARTITIONS_LOCKED;  /* lock budgets and window size */
    SchedCtl( SCHED_APS_ADD_SECURITY, &p, sizeof(p) );
Note: Before you call SchedCtl(), make sure you initialize all the members of the data structure associated with the command. You can use the APS_INIT_DATA() macro to do this.
The security options listed above are composed of the following options (but it's more convenient to use the compound options):
- root0_overall or the SCHED_APS_SEC_ROOT0_OVERALL flag
- You must be root running in the System partition in order to change the overall scheduling parameters, such as the averaging window size.
- root_makes_partitions or the SCHED_APS_SEC_ROOT_MAKES_PARTITIONS flag
- You must be root in order to create or modify partitions.
- sys_makes_partitions or the SCHED_APS_SEC_SYS_MAKES_PARTITIONS flag
- You must be running in the System partition in order to create or modify partitions.
- parent_modifies or the SCHED_APS_SEC_PARENT_MODIFIES flag
- Allows partitions to be modified (SCHED_APS_MODIFY_PARTITION), but you must be running in the parent partition of the partition being modified. "Modify" means to change a partition's percentage or critical budget or attach events with the SCHED_APS_ATTACH_EVENTS command.
- nonzero_budgets or the SCHED_APS_SEC_NONZERO_BUDGETS flag
- A partition may not be created with, or modified to have, a zero budget. Unless you know your partition needs to run only in response to client requests, i.e. receipt of messages, you should set this option.
- root_makes_critical or the SCHED_APS_SEC_ROOT_MAKES_CRITICAL flag
- You have to be root in order to create a nonzero critical budget or change an existing critical budget.
- sys_makes_critical or the SCHED_APS_SEC_SYS_MAKES_CRITICAL flag
- You must be running in the System partition to create a nonzero critical budget or change an existing critical budget.
- root_joins or the SCHED_APS_SEC_ROOT_JOINS flag
- You must be root in order to join a thread to a partition.
- sys_joins or the SCHED_APS_SEC_SYS_JOINS flag
- You must be running in the System partition in order to join a thread.
- parent_joins or the SCHED_APS_SEC_PARENT_JOINS flag
- You must be running in the parent partition of the partition you wish to join to.
- join_self_only or the SCHED_APS_SEC_JOIN_SELF_ONLY flag
- A process may join only itself to a partition.
- partitions_locked or the SCHED_APS_SEC_PARTITIONS_LOCKED flag
- Prevent further changes to any partition's budget, or overall scheduling parameters, such as the window size. Set this after you've set up your partitions.
Any thread can make itself critical, and any designer can make any sigevent critical (meaning that it will cause the eventual receiver to run as critical), but this isn't a security hole. That's because a thread marked as critical has no effect on the scheduler unless the thread is in a partition that has a critical budget. The adaptive partitioning scheduler has security options that control who may set or change a partition's critical budget.
For the system to be secure against possible critical thread abuse, it's important to:
- give a critical budget only to the partitions that need one
- move as much application software as possible out of the System partition (which has an infinite critical budget)