Multicore Processing

Introduction
Asymmetric multiprocessing (AMP)
Symmetric multiprocessing (SMP)
Bound multiprocessing (BMP)
Choosing between AMP, SMP, and BMP

Introduction

“Two heads are better than one” goes the old saying, and the same is true for computer systems, where two (or more) processors can greatly improve performance. Multiprocessing systems can take these forms:

Discrete or traditional: A system that has separate physical processors hooked up in multiprocessing mode over a board-level bus.
Multicore: A chip that has one physical processor with multiple CPUs interconnected over a chip-level bus.
Multicore processors deliver greater computing power through concurrency, offer greater system density, and run at lower clock speeds than uniprocessor chips. Multicore processors also reduce thermal dissipation, power consumption, and board area (and hence the cost of the system).

Multiprocessing includes several operating modes:

Asymmetric multiprocessing (AMP): A separate OS, or a separate instantiation of the same OS, runs on each CPU.
Symmetric multiprocessing (SMP): A single instantiation of an OS manages all CPUs simultaneously, and applications can float to any of them.
Bound multiprocessing (BMP): A single instantiation of an OS manages all CPUs simultaneously, but each application is locked to a specific CPU.

To determine how many processors there are on your system, look at the num_cpu entry of the system page. For more information, see “Structure of the system page” in the Customizing Image Startup Programs chapter of Building Embedded Systems.
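As a minimal sketch of that lookup, the function below reads num_cpu through the _syspage_ptr pointer from <sys/syspage.h> when built for QNX Neutrino. The helper name cpu_count() is hypothetical, and the sysconf() fallback is an assumption added only so the sketch also builds on other hosts; it is not part of the system page mechanism.

```c
#ifdef __QNXNTO__
#include <sys/syspage.h>   /* _syspage_ptr and the system page layout */
#else
#include <unistd.h>        /* sysconf() fallback for non-QNX hosts */
#endif

/* Hypothetical helper: return the number of processors the OS reports. */
static long cpu_count(void)
{
#ifdef __QNXNTO__
    /* On QNX Neutrino, read num_cpu straight from the system page. */
    return (long)_syspage_ptr->num_cpu;
#else
    /* Assumption: the host supports this widely available sysconf() name. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
#endif
}
```

A program could call cpu_count() once at startup, for example to size a pool of worker threads.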

Asymmetric multiprocessing (AMP)

Asymmetric multiprocessing provides an execution environment that's similar to conventional uniprocessor systems. It offers a relatively straightforward path for porting legacy code and provides a direct mechanism for controlling how the CPUs are used. In most cases, it lets you work with standard debugging tools and techniques.

AMP can be:

homogeneous — each CPU runs the same type and version of the OS
heterogeneous — each CPU runs either a different OS or a different version of the same OS

Neutrino's distributed programming model lets you make the best use of the multiple CPUs in a homogeneous environment. Applications running on one CPU can communicate transparently with applications and system services (e.g. device drivers, protocol stacks) on other CPUs, without the high CPU utilization imposed by traditional forms of interprocessor communication.

In heterogeneous systems, you must either implement a proprietary communications scheme or choose two OSs that share a common infrastructure (likely IP based) for interprocessor communications. To help avoid resource conflicts, the OSs should also provide standardized mechanisms for accessing shared hardware components.

With AMP, you decide how the shared hardware resources used by applications are divided up between the CPUs. Normally, this resource allocation occurs statically during boot time and includes physical memory allocation, peripheral usage, and interrupt handling. While the system could allocate the resources dynamically, doing so would entail complex coordination between the CPUs.

In an AMP system, a process always runs on the same CPU, even when other CPUs run idle. As a result, one CPU can end up being under- or overutilized. To address the problem, the system could allow applications to migrate dynamically from one CPU to another. Doing so, however, can involve complex checkpointing of state information or a possible service interruption as the application is stopped on one CPU and restarted on another. Also, such migration is difficult, if not impossible, if the CPUs run different OSs.

Symmetric multiprocessing (SMP)

Allocating resources in a multicore design can be difficult, especially when multiple software components are unaware of how other components are employing those resources.

Symmetric multiprocessing addresses the issue by running only one copy of Neutrino on all of the system's CPUs. Because the OS has insight into all system elements at all times, it can allocate resources on the multiple CPUs with little or no input from the application designer. Moreover, Neutrino provides built-in standardized primitives, such as pthread_mutex_lock(), pthread_mutex_unlock(), pthread_spin_lock(), and pthread_spin_unlock(), that let multiple applications share these resources safely and easily.
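The following sketch shows one way the mutex primitives named above might protect a shared counter when several threads run at the same time on different CPUs. The names worker(), run_workers(), NUM_THREADS, and INCREMENTS are illustrative assumptions; only the pthread calls come from the text.

```c
#include <pthread.h>

#define NUM_THREADS 4
#define INCREMENTS  100000

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;                       /* shared by all workers */

/* Each worker adds INCREMENTS to the shared counter, one step at a time. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;                         /* only one thread at a time gets here */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Start the workers, wait for them, and return the final count. */
static long run_workers(void)
{
    pthread_t tid[NUM_THREADS];
    int started = 0;

    counter = 0;
    for (; started < NUM_THREADS; started++) {
        if (pthread_create(&tid[started], NULL, worker, NULL) != 0)
            break;                         /* join only the threads that started */
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return counter;
}
```

Without the lock, increments from threads on different CPUs could interleave and some updates would be lost; with it, the final count equals the number of threads multiplied by the increments per thread.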

By running only one copy of Neutrino, SMP can dynamically allocate resources to specific applications rather than to CPUs, thereby enabling greater utilization of available processing power. It also lets system tracing tools gather operating statistics and application interactions for the multiprocessing system as a whole, giving you valuable insight into how to optimize and debug applications.

For instance, the System Profiler in the IDE can track thread migration from one CPU to another, as well as OS primitive usage, scheduling events, application-to-application messaging, and other events, all with high-resolution timestamping. Application synchronization also becomes much easier since you use standard OS primitives rather than complex IPC mechanisms.

Neutrino lets the threads of execution within an application run concurrently on any CPU, making the entire computing power of the chip available to applications at all times. Neutrino's preemption and thread prioritization capabilities help you ensure that CPU cycles go to the application that needs them the most.

Neutrino's microkernel approach

SMP is typically associated with high-end operating systems such as Unix and Windows NT running on high-end servers. These large monolithic systems tend to be quite complex, the result of many person-years of development. Since these large kernels contain the bulk of all OS services, the changes to support SMP are extensive, usually requiring large numbers of modifications and the use of specialized spinlocks throughout the code.

QNX Neutrino, on the other hand, contains a very small microkernel surrounded by processes that act as resource managers, providing services such as filesystems, character I/O, and networking. By modifying the microkernel alone, all other OS services will gain full advantage of SMP without the need for coding changes. If these service-providing processes are multithreaded, their many threads will be scheduled among the available processors. Even a single-threaded server would benefit from an SMP system, because its thread would be scheduled on the available processors beside other servers and client processes.

As a testament to this microkernel approach, the SMP-enabled QNX Neutrino kernel/process manager adds only a few kilobytes of additional code. The SMP versions are designed for these main processor families:

PowerPC (e.g. procnto-600-smp)
MIPS (procnto-smp)
x86 (procnto-smp)

The x86 version can boot on any system that conforms to the Intel MultiProcessor Specification (MP Spec) with up to 32 Pentium (or better) processors. QNX Neutrino also supports Intel's Hyper-Threading Technology found in P4 and Xeon processors.

The procnto-smp manager will also function on a single-processor (non-SMP) system. With the cost of building a dual-processor Pentium motherboard very nearly the same as that for a single-processor motherboard, it's possible to deliver cost-effective solutions that can be scaled in the field by the simple addition of a second CPU. The fact that the OS itself is only a few kilobytes larger also allows SMP to be seriously considered for small CPU-intensive embedded systems, not just high-end servers.

The PowerPC and MIPS versions of the SMP-enabled kernel deliver full SMP support (e.g. cache coherency and interprocessor interrupts) on appropriate PPC and MIPS hardware. The PPC version supports any SMP system with 7xx or 74xx series processors, as in such reference design platforms as the Motorola MVP or the Marvell EV-64260-2XMPC7450 SMP Development System. The MIPS version supports such systems as the dual-core Broadcom BCM1250 processor.

Booting an x86 SMP system

The microkernel itself contains very little hardware- or system-specific code. The code that determines the capabilities of the system is isolated in a startup program, which is responsible for initializing the system, determining available memory, etc. Information gathered is placed into a memory table available to the microkernel and to all processes (on a read-only basis).

The startup-bios program is designed to work on systems compatible with the Intel MP Spec (version 1.4 or later). This startup program is responsible for:

determining the number of processors
determining the address of the local and I/O APIC
initializing each additional processor

After a reset, only one processor will be executing the reset code. This processor is called the boot processor (BP). For each additional processor found, the BP running the startup-bios code will:

initialize the processor
switch it to 32-bit protected mode
allocate the processor its own page directory
set the processor spinning with interrupts disabled, waiting to be released by the kernel

Booting a PowerPC or MIPS SMP system

On a PPC or MIPS SMP system, the boot sequence is similar to that of an x86, but a specific startup program (e.g. startup-mvp, startup-bcm1250) will be used instead. Specifically, the PPC-specific startup is responsible for:

determining the number of processors
initializing each additional processor
initializing the IRQ, IPI, system controller, etc.

For each additional processor found, the startup code will:

initialize the processor
initialize the MMU
initialize the caches
set the processor spinning with interrupts disabled, waiting to be released by the kernel

How the SMP microkernel works

Once the additional processors have been released and are running, all processors are considered peers for the scheduling of threads.

Scheduling

The scheduling policy follows the same rules as on a uniprocessor system. That is, the highest-priority thread will be running on an available processor. If a new thread becomes ready to run as the highest-priority thread in the system, it will be dispatched to the appropriate processor. If more than one processor is selected as a potential target, then the microkernel will try to dispatch the thread to the processor where it last ran. This affinity is used as an attempt to reduce thread migration from one processor to another, which can affect cache performance.

In an SMP system, the scheduler has some flexibility in deciding exactly how to schedule the other threads, with an eye towards optimizing cache usage and minimizing thread migration. This could mean that some processors will be running lower-priority threads while a higher-priority thread is waiting to run on the processor it last ran on. The next time a processor that's running a lower-priority thread makes a scheduling decision, it will choose the higher-priority one.

In any case, the realtime scheduling rules that were in place on a uniprocessor system are guaranteed to be upheld on an SMP system.
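The last-ran preference described above can be illustrated with a small sketch. This is not the microkernel's code: pick_cpu() is a hypothetical helper, the candidate processors are assumed to arrive as a 32-bit bitmask, and falling back to the lowest-numbered candidate is an assumption the text doesn't make.

```c
#include <stdint.h>

/* Hypothetical helper: choose a CPU from a bitmask of candidate CPUs,
 * preferring the one the thread last ran on. Returns -1 if there are
 * no candidates. */
static int pick_cpu(uint32_t candidates, int last_cpu)
{
    if (candidates == 0)
        return -1;

    /* Prefer the last CPU, where the thread's cache contents may still be warm. */
    if (last_cpu >= 0 && last_cpu < 32 &&
        (candidates & (UINT32_C(1) << last_cpu)))
        return last_cpu;

    /* Assumed fallback: take the lowest-numbered candidate. */
    for (int cpu = 0; cpu < 32; cpu++) {
        if (candidates & (UINT32_C(1) << cpu))
            return cpu;
    }
    return -1;
}
```

With candidates 0x6 (CPUs 1 and 2) and a thread that last ran on CPU 2, the helper returns 2; if the thread last ran on CPU 0, which isn't a candidate, it returns 1.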

Kernel locking

In a uniprocessor system, only one thread is allowed to execute within the microkernel at a time. Most kernel operations are short in duration (typically a few microseconds on a Pentium-class processor). The microkernel is also designed to be completely preemptible and restartable for those operations that take more time. This design keeps the microkernel lean and fast without the need for large numbers of fine-grained locks. Placing many locks in the main code path through a kernel noticeably slows the kernel down, because each lock typically involves processor bus transactions, which can cause processor stalls.

In an SMP system, QNX Neutrino maintains this philosophy of only one thread in a preemptible and restartable kernel. The microkernel may be entered on any processor, but only one processor will be granted access at a time.

For most systems, the time spent in the microkernel represents only a small fraction of the processor's workload. Therefore, while conflicts will occur, they should be more the exception than the norm. This is especially true for a microkernel where traditional OS services like filesystems are separate processes and not part of the kernel itself.

Interprocessor interrupts (IPIs)

The processors communicate with each other through IPIs (interprocessor interrupts). IPIs can effectively schedule and control threads over multiple processors. For example, an IPI to another processor is often needed when:

a higher-priority thread becomes ready
a thread running on another processor is hit with a signal
a thread running on another processor is canceled
a thread running on another processor is destroyed

Critical sections

To control access to data structures that are shared between them, threads and processes use the standard POSIX primitives of mutexes, condvars, and semaphores. These work without change in an SMP system.
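As an illustration of these primitives, the sketch below uses a mutex and a condvar to hand a value from one thread to another. The same code runs unchanged whether the two threads share one processor or run on different ones. All names (producer(), wait_for_value(), and the variables) are assumptions made for the example.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready_cv = PTHREAD_COND_INITIALIZER;
static bool data_ready;                 /* predicate guarded by lock */
static int  shared_value;               /* data guarded by lock */

/* Publish a value and wake the waiting thread. */
static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_value = 42;
    data_ready = true;
    pthread_cond_signal(&ready_cv);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Start the producer, block until it publishes, and return the value
 * (or -1 if the producer thread couldn't be created). */
static int wait_for_value(void)
{
    pthread_t tid;
    int value;

    data_ready = false;                 /* no other thread exists yet */
    if (pthread_create(&tid, NULL, producer, NULL) != 0)
        return -1;

    pthread_mutex_lock(&lock);
    while (!data_ready)                 /* loop guards against spurious wakeups */
        pthread_cond_wait(&ready_cv, &lock);
    value = shared_value;
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    return value;
}
```

The waiter rechecks the predicate in a loop, so the result is the same whether the producer signals before or after the waiter starts waiting.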

Many realtime systems also need to protect access to shared data structures between an interrupt handler and the thread that owns the handler. The traditional POSIX primitives used between threads aren't available for use by an interrupt handler. There are two solutions here:

One is to remove all work from the interrupt handler and do all the work at thread time instead. Given our fast thread scheduling, this is a very viable solution.
In a uniprocessor system running QNX Neutrino, an interrupt handler may preempt a thread, but a thread will never preempt an interrupt handler. This allows the thread to protect itself from the interrupt handler by disabling and enabling interrupts for very brief periods of time.

The thread on a non-SMP system protects itself with code of the form:

InterruptDisable();
// critical section
InterruptEnable();

Or:

InterruptMask(intr);
// critical section
InterruptUnmask(intr);

Unfortunately, this code will fail on an SMP system since the thread may be running on one processor while the interrupt handler is concurrently running on another processor!

One solution would be to lock the thread to a particular processor (see “Bound multiprocessing (BMP),” later in this chapter).

A better solution would be to use a new exclusion lock available to both the thread and the interrupt handler. This is provided by the following primitives, which work on both uniprocessor and SMP machines:

InterruptLock(intrspin_t* spinlock): Attempt to acquire a spinlock, a variable shared between the interrupt handler and thread. The call disables interrupts and then spins in a tight loop until the lock is acquired. The lock must be released as soon as possible (typically within a few lines of C code without any loops).
InterruptUnlock(intrspin_t* spinlock): Release a lock and reenable interrupts.

On a non-SMP system, there's no need for a spinlock.
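The sketch below shows how a thread and its interrupt handler might share a counter under these two primitives. The function and variable names are hypothetical. On hosts other than QNX Neutrino, the block substitutes a plain C11 spinlock so that it still compiles; that stand-in does not disable interrupts and is not the real implementation. On QNX Neutrino, the thread is assumed to have already obtained I/O privileges with ThreadCtl(_NTO_TCTL_IO, 0), and handler_record_event() is assumed to be called from the attached interrupt handler.

```c
#ifdef __QNXNTO__
#include <sys/neutrino.h>          /* InterruptLock(), InterruptUnlock() */
static intrspin_t event_lock;      /* static storage: zeroed, i.e. unlocked */
#else
/* Stand-in for non-QNX hosts only: a plain spinlock that does NOT
 * disable interrupts the way the real primitives do. */
#include <stdatomic.h>
typedef struct { atomic_flag flag; } intrspin_t;

static void InterruptLock(intrspin_t *s)
{
    while (atomic_flag_test_and_set_explicit(&s->flag, memory_order_acquire))
        ;                          /* spin until the flag is clear */
}

static void InterruptUnlock(intrspin_t *s)
{
    atomic_flag_clear_explicit(&s->flag, memory_order_release);
}

static intrspin_t event_lock = { ATOMIC_FLAG_INIT };
#endif

static volatile unsigned pending_events;   /* shared by handler and thread */

/* Handler side: record one event. Keep the locked region tiny. */
static void handler_record_event(void)
{
    InterruptLock(&event_lock);
    pending_events++;
    InterruptUnlock(&event_lock);
}

/* Thread side: take and reset the count in one short critical section. */
static unsigned thread_take_events(void)
{
    unsigned n;

    InterruptLock(&event_lock);
    n = pending_events;
    pending_events = 0;
    InterruptUnlock(&event_lock);
    return n;
}
```

Both sides hold the lock for only a few statements with no loops, in line with the advice to release it as soon as possible.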

For more information, see the Multicore Processing User's Guide.

Bound multiprocessing (BMP)

Bound multiprocessing provides the scheduling control of an asymmetric multiprocessing model, while preserving the hardware abstraction and management of symmetric multiprocessing. BMP is similar to SMP, but you can specify which processors a thread can run on. You can use both SMP and BMP on the same system, allowing some threads to migrate from one processor to another, while other threads are restricted to one or more processors.

As with SMP, a single copy of the OS maintains an overall view of all system resources, allowing them to be dynamically allocated and shared among applications. But, during application initialization, a setting determined by the system designer forces all of an application's threads to execute only on a specified CPU.

Compared to full, floating SMP operation, this approach offers several advantages:

It eliminates the cache thrashing that can reduce performance in an SMP system by allowing applications that share the same data set to run exclusively on the same CPU.
It offers simpler application debugging than SMP since all execution threads within an application run on a single CPU.
It helps legacy applications that use poor techniques for synchronizing shared data to run correctly, again by letting them run on a single CPU.

With BMP, an application locked to one CPU can't use other CPUs, even if they're idle. However, Neutrino lets you dynamically change the designated CPU, without having to checkpoint, and then stop and restart the application.

QNX Neutrino supports the concept of hard processor affinity through a runmask. Each bit that's set in the runmask represents a processor that a thread can run on. By default, a thread's runmask is set to all ones, allowing it to run on any processor. A value of 0x01 would allow a thread to execute only on the first processor.
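The bit layout can be made concrete with a short sketch. The helper names are hypothetical, and a single 32-bit mask is assumed, which covers systems with up to 32 processors.

```c
#include <stdint.h>

/* Runmask with only the bit for `cpu` set (CPU 0 is the first processor).
 * Assumes cpu < 32. */
static uint32_t runmask_for_cpu(unsigned cpu)
{
    return UINT32_C(1) << cpu;
}

/* Nonzero if `runmask` permits running on `cpu`. Assumes cpu < 32. */
static int runmask_allows(uint32_t runmask, unsigned cpu)
{
    return (int)((runmask >> cpu) & 1u);
}
```

For example, runmask_for_cpu(0) yields 0x01, the value mentioned above, and combining the first and third processors with a bitwise OR yields 0x05.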

By default, a process's or thread's children don't inherit the runmask; there's a separate inherit mask.

By careful use of these masks, a systems designer can further optimize the runtime performance of a system (e.g. by relegating nonrealtime processes to a specific processor). In general, however, this shouldn't be necessary, because our realtime scheduler will always preempt a lower-priority thread immediately when a higher-priority thread becomes ready. Processor locking will likely affect only the efficiency of the cache, since threads can be prevented from migrating.

You can specify the runmask for a new thread or process by:

setting the runmask member of the inheritance structure and specifying the SPAWN_EXPLICIT_CPU flag when you call spawn()
Or:
using the -C or -R option to the on utility when you launch a program. This also sets the process's inherit mask to the same value.

You can change the runmask for an existing thread or process by:

using the _NTO_TCTL_RUNMASK or _NTO_TCTL_RUNMASK_GET_AND_SET_INHERIT command to the ThreadCtl() kernel call
Or:
using the -C or -R option to the slay utility. If you also use the -i option, slay sets the inherit mask to the same value.
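As a sketch of the ThreadCtl() route, the helper below restricts the calling thread to the processors set in a runmask. The name set_own_runmask() is hypothetical. On hosts other than QNX Neutrino the helper does nothing and reports ENOSYS, which is an assumption added only so the sketch compiles elsewhere.

```c
#include <errno.h>
#include <stdint.h>

#ifdef __QNXNTO__
#include <sys/neutrino.h>   /* ThreadCtl(), _NTO_TCTL_RUNMASK */
#endif

/* Hypothetical helper: restrict the calling thread to the CPUs set in
 * runmask. Returns 0 on success, -1 with errno set on failure. */
static int set_own_runmask(unsigned runmask)
{
#ifdef __QNXNTO__
    /* The mask itself is passed in the pointer-sized data argument. */
    if (ThreadCtl(_NTO_TCTL_RUNMASK, (void *)(uintptr_t)runmask) == -1)
        return -1;
    return 0;
#else
    (void)runmask;
    errno = ENOSYS;         /* not QNX Neutrino: nothing to call */
    return -1;
#endif
}
```

Calling set_own_runmask(0x01) from a thread would confine that thread to the first processor; the inherit mask is left unchanged by this command.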

For more information, see the Multicore Processing User's Guide.

A viable migration strategy

As a midway point between AMP and SMP, BMP offers a viable migration strategy if you wish to move towards full SMP, but you're concerned that your existing code may operate incorrectly in a truly concurrent execution model.

You can port legacy code to a multicore system and initially bind it to a single CPU to ensure correct operation. By judiciously binding applications (and possibly single threads) to specific CPUs, you can isolate potential concurrency issues down to the application and thread level. Resolving these issues will allow the application to run fully concurrently, thereby maximizing the performance gains provided by the multiple processors.

Choosing between AMP, SMP, and BMP

The choice between AMP, SMP, and BMP depends on the problem you're trying to solve:

AMP works well with legacy applications, but has limited scalability beyond two CPUs.
SMP offers transparent resource management, but software that hasn't been properly designed for concurrency might have problems.
BMP offers many of the same benefits as SMP, but guarantees that uniprocessor applications will behave correctly, greatly simplifying the migration of legacy software.

As the following table illustrates, the flexibility to choose from any of these models lets you strike the optimal balance between performance, scalability, and ease of migration.

Feature	SMP	BMP	AMP
Seamless resource sharing	Yes	Yes	—
Scalable beyond dual CPU	Yes	Yes	Limited
Legacy application operation	In most cases	Yes	Yes
Mixed OS environment (e.g. Neutrino and Linux)	—	—	Yes
Dedicated processor by function	—	Yes	Yes
Intercore messaging	Fast (OS primitives)	Fast (OS primitives)	Slower (application)
Thread synchronization between CPUs	Yes	Yes	—
Load balancing	Yes	Yes	—
System-wide debugging and optimization	Yes	Yes	—