Using the thread scheduler and multicore together

On a multicore system, you can use scheduler partitions and symmetric multiprocessing (SMP) to reap the rewards of both. For more information, see the Multicore Processing User's Guide.

Note the following facts:

It may seem unlikely to have only one thread per partition, since most systems have many threads. However, this situation can arise even on a multithreaded system.

The runmask controls which CPUs a thread is allowed to run on. With careful (or foolish) use of the runmask, it's possible to arrange things so that there aren't enough threads that are permitted to run on a particular processor for the scheduler to meet its budgets.
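To make the runmask idea concrete, here is a minimal sketch that treats a runmask as a bitmask and lists which threads are permitted on each CPU. It assumes that bit n of the mask corresponds to CPU n; the function names and thread names are hypothetical and are not part of any QNX API.

```python
def allowed_cpus(runmask, num_cpus):
    """Return the CPU indices whose bit is set in the runmask.

    Assumption: bit n of the mask corresponds to CPU n.
    """
    return [cpu for cpu in range(num_cpus) if runmask & (1 << cpu)]


def threads_permitted_per_cpu(runmasks, num_cpus):
    """Map each CPU index to the names of the threads allowed to run on it."""
    per_cpu = {cpu: [] for cpu in range(num_cpus)}
    for name, mask in runmasks.items():
        for cpu in allowed_cpus(mask, num_cpus):
            per_cpu[cpu].append(name)
    return per_cpu


# Hypothetical example: both threads are locked to CPU 1 of a two-CPU machine,
# so no thread is permitted to run on CPU 0.
print(threads_permitted_per_cpu({"worker_a": 0b10, "worker_b": 0b10}, 2))
```

A CPU with an empty list in the result has no thread that can use it, which is the kind of arrangement that can keep the scheduler from meeting budgets.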

If there are several threads that are ready to run, and they're permitted to run on each CPU, then the thread scheduler correctly guarantees each partition's minimum budget.

Note: On a hyperthreaded machine, actual throughput of partitions may not match the percentage of CPU time usage reported by the thread scheduler. This discrepancy occurs because on a hyperthreaded machine, throughput isn't always proportional to time, regardless of what kind of scheduler is being used. This scenario is most likely to occur when a partition doesn't contain enough ready threads to occupy all of the pseudo-processors on a hyperthreaded machine.

Scheduler partitions and BMP

When you use runmasks to bind threads to specific CPUs (bound multiprocessing, or BMP), certain combinations of runmasks and partition budgets can have surprising results.

For example, suppose we have a two-CPU SMP machine, with these partitions:

Partition   Budget
Pa          50%
System      50%

Now, suppose the system is idle. If you run a priority-10 thread that's locked to CPU 1 and is in an infinite loop in partition Pa, the thread scheduler interprets this to mean that you intend Pa to monopolize CPU 1. That's because CPU 1 can provide only 50% of the entire machine's processing time.

If you run another thread at priority 9, also locked to CPU 1, but in the System partition, the thread scheduler interprets that to mean you also want the System partition to monopolize CPU 1.

The thread scheduler has a dilemma: it can't satisfy the requirements of both partitions. What it actually does is allow partition Pa to monopolize CPU 1.

This is why: from an idle start, the thread scheduler observes that both partitions have available budget. When partitions have available budget, the thread scheduler schedules in realtime mode, which is strict priority scheduling. So partition Pa runs. However, because CPU 1 can never satisfy the budget of partition Pa, Pa never runs out of budget. Therefore, the thread scheduler remains in realtime mode and the lower-priority System partition never runs.

For this example, the aps show command might display:

                    +---- CPU Time ----+-- Critical Time --
Partition name   id | Budget |    Used | Budget |      Used
--------------------+------------------+-------------------
System            0 |    50% |   0.09% |  200ms |   0.000ms
Pa                1 |    50% |  49.93% |    0ms |   0.000ms
--------------------+------------------+-------------------
Total               |   100% |  50.02% |

The System partition receives essentially no CPU time (0.09% in the output above) even though it contains a thread that is ready to run.

Similar situations can occur when there are several partitions, each having a budget less than 50%, but whose budgets sum to 50% or more.
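The condition described above can be expressed as a rough arithmetic check. The sketch below assumes that every listed partition has a ready thread locked to the same single CPU and looping forever; the function names are hypothetical and the check is only an illustration, not how the thread scheduler itself decides anything.

```python
def cpu_share_percent(num_cpus):
    """Percentage of the whole machine's time that one CPU can supply."""
    return 100 / num_cpus


def budgets_exceed_one_cpu(budgets_percent, num_cpus):
    """Return True if the combined budgets meet or exceed what one CPU supplies.

    Assumption: all of these partitions run only on the same single CPU.
    When this returns True, that CPU cannot exhaust all of the budgets, so the
    scheduler can stay in priority mode and starve lower-priority partitions.
    """
    return sum(budgets_percent) >= cpu_share_percent(num_cpus)


print(budgets_exceed_one_cpu([50, 50], 2))      # Pa and System from the example
print(budgets_exceed_one_cpu([20, 20, 20], 2))  # each below 50%, sum above 50%
print(budgets_exceed_one_cpu([20, 20], 2))      # sum below 50%
```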

Avoiding infinite loops is a good way to avoid these situations. However, if you're running third-party software, you may not have control over the code.

To simplify the usage of runmasks with the thread scheduler, you may configure the scheduler to follow a more restrictive algorithm that prefers to meet budgets in some circumstances rather than schedule by priority.

To do so, set the SCHED_APS_SCHEDPOL_BMP_SAFETY flag (see "Scheduling policies" in the entry for SchedCtl() in the QNX Neutrino C Library Reference), or use the aps modify -S bmp_safety command (see the entry for aps in the Utilities Reference).

The following table shows how time is divided in normal mode (with its risk of monopolization), and BMP-safety mode on a 2-CPU machine:

Partition state                        | Normal              | BMP-safety
---------------------------------------+---------------------+--------------------
Usage < budget / 2                     | By priority         | By priority
Usage < budget                         | By priority         | By ratio of budgets
Usage > budget, but there's free time  | By priority*        | By ratio of budgets
Full load                              | By ratio of budgets | By ratio of budgets

* When SCHED_APS_SCHEDPOL_FREETIME_BY_PRIORITY is set. For more information, see the SCHED_APS_SET_PARMS command in the entry for SchedCtl() in the QNX Neutrino C Library Reference.
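The table can also be read as a small decision function. The sketch below is only a transcription of the table for a 2-CPU machine, with a hypothetical function name; it treats usage exactly equal to the budget as over budget, which is an assumption because the table doesn't cover that case, and it takes the "By priority*" cell at face value without modeling the footnoted flag.

```python
def division_rule(usage, budget, full_load=False, bmp_safety=False):
    """Return "priority" or "ratio" for a partition, following the table.

    usage and budget are percentages of total machine time.
    """
    if full_load:
        return "ratio"
    if usage < budget / 2:
        return "priority"
    if usage < budget:
        return "ratio" if bmp_safety else "priority"
    # Usage at or above budget while there is free time. In normal mode this
    # is the "By priority*" cell, which is subject to the table's footnote.
    return "ratio" if bmp_safety else "priority"


# A partition with a 50% budget that has used 30% of machine time:
print(division_rule(30, 50))                   # normal mode
print(division_rule(30, 50, bmp_safety=True))  # BMP-safety mode
```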

In the example above, but with BMP-safety turned on, not only does the thread scheduler run both the System partition and partition Pa, it reasonably divides time on CPU 1 by the ratio of the partitions' budgets. The aps show command displays usage something like this:

                    +---- CPU Time ----+-- Critical Time --
Partition name   id | Budget |    Used | Budget |      Used
--------------------+------------------+-------------------
System            0 |    50% |  25.03% |  200ms |   0.000ms
Pa                1 |    50% |  24.99% |    0ms |   0.000ms
--------------------+------------------+-------------------
Total               |   100% |  50.02% |
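The roughly 25% figures in the output above follow from dividing the time that CPU 1 can supply (50% of the machine) by the ratio of the two budgets. Here is a minimal sketch of that arithmetic; the function name is hypothetical, and the real output differs slightly from the ideal values because of measurement and overhead.

```python
def split_by_budget_ratio(available_percent, budgets_percent):
    """Divide the available machine time among partitions in proportion to budget."""
    total_budget = sum(budgets_percent.values())
    return {
        name: available_percent * budget / total_budget
        for name, budget in budgets_percent.items()
    }


# CPU 1 supplies 50% of a two-CPU machine; System and Pa each have a 50% budget.
print(split_by_budget_ratio(50, {"System": 50, "Pa": 50}))
```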

The BMP-safety mode provides an easier-to-analyze scheduling mode at the cost of reducing the circumstances when the thread scheduler will schedule strictly by priority.