Caution: This version of this document is no longer maintained. For the latest documentation, see http://www.qnx.com/developers/docs.

Appendix: Developing SMP Systems

Introduction

As described in the System Architecture guide, there's an SMP (Symmetrical MultiProcessor) version of Neutrino that runs on supported multiprocessor systems.

If you have one of these systems, then you're probably itching to try it out, but are wondering what you have to do to get Neutrino running on it. Well, the answer is not much. The only part of Neutrino that's different for an SMP system is the microkernel -- another example of the advantages of a microkernel architecture!


Note: The SMP versions of procnto are available only in the Symmetric Multiprocessing Technology Development Kit (TDK).

Building an SMP image

Assuming you're already familiar with building a bootable image for a single-processor system (as described in the Making an OS Image chapter in Building Embedded Systems), let's look at what you have to change in the buildfile for an SMP system.

As we mentioned above, basically all you need to do is use the SMP kernel (procnto-smp) when building the image.

Here's an example of a buildfile:

#   A simple SMP buildfile

[virtual=x86,bios] .bootstrap = {
    startup-bios
    PATH=/proc/boot procnto-smp
}
[+script] .script = {
    devc-con -e &
    reopen /dev/con1
    [+session] PATH=/proc/boot esh
}

libc.so
[type=link] /usr/lib/ldqnx.so.2=/proc/boot/libc.so

[data=copy]
devc-con
esh
ls

After building the image, you proceed in the same way as you would with a single-processor system.

The impact of SMP

Although the actual changes to the way you set up the processor to run SMP are fairly minor, the fact that you're running on an SMP system can have a major impact on your software!

The main thing to keep in mind is this: in a single processor environment, it may be a nice "design abstraction" to pretend that threads execute in parallel; under an SMP system, they really do execute in parallel!

In this section, we'll examine the impact of SMP on your system design.

To SMP or not to SMP

It's possible to use the non-SMP kernel on an SMP box. In this case, only processor 0 will be used; the other processors won't run your code. This is a waste of additional processors, of course, but it does mean that you can run images from single-processor boxes on an SMP box. (You can also run SMP-ready images on single-processor boxes.)

It's also possible to run the SMP kernel on a uniprocessor system; on x86 architectures this requires a 486 or higher, while on PPC it requires an SMP-capable implementation.

Processor affinity

One issue that often arises in an SMP environment can be put like this: "Can I make it so that one processor handles the GUI, another handles the database, and the other two handle the realtime functions?"

The answer is: "Yes, absolutely."

This is done through the magic of processor affinity -- the ability to associate certain programs (or even threads within programs) with a particular processor or processors.

Processor affinity works like this. When a thread starts up, its processor affinity mask is set to allow it to run on all processors. This implies that there's no inheritance of the processor affinity mask, so it's up to the thread to use ThreadCtl() with the _NTO_TCTL_RUNMASK control flag to set the processor affinity mask.

The processor affinity mask is simply a bitmap; each bit position indicates a particular processor. For example, the processor affinity mask 0x05 (binary 00000101) allows the thread to run on processors 0 (the 0x01 bit) and 2 (the 0x04 bit).

SMP and synchronization primitives

Standard synchronization primitives (barriers, mutexes, condvars, semaphores, and all of their derivatives, e.g. sleepon locks) are safe to use on an SMP box. You don't have to do anything special here.

SMP and FIFO scheduling

A common single-processor "trick" for coordinated access to a shared memory region is to use FIFO scheduling between two threads running at the same priority. The idea is that one thread will access the region and then call SchedYield() to give up its use of the processor. Then, the second thread would run and access the region. When it was done, the second thread too would call SchedYield(), and the first thread would run again. Since there's only one processor, both threads would cooperatively share that processor.

This FIFO trick won't work on an SMP system, because both threads may run simultaneously on different processors. You'll have to use a "proper" thread synchronization primitive (e.g. a mutex) instead.

SMP and interrupts

The following method is closely related to the FIFO scheduling trick. On a single-processor system, a thread and an interrupt service routine were mutually exclusive, due to the fact that the ISR ran at a priority higher than that of any thread. Therefore, the ISR would be able to preempt the thread, but the thread would never be able to preempt the ISR. So the only "protection" required was for the thread to indicate that during a particular section of code (the critical section) interrupts should be disabled.

Obviously, this scheme breaks down in an SMP system, because again the thread and the ISR could be running on different processors.

The solution in this case is to use the InterruptLock() and InterruptUnlock() calls to ensure that the ISR won't preempt the thread at an unexpected point. But what if the thread preempts the ISR? The solution is the same -- use InterruptLock() and InterruptUnlock() in the ISR as well.


Note:

We recommend that you always use the InterruptLock() and InterruptUnlock() function calls, both in the thread and in the ISR. The small amount of extra overhead on a single-processor box is negligible.


SMP and atomic operations

Note that if you wish to perform simple atomic operations, such as adding a value to a memory location, it isn't necessary to turn off interrupts to ensure that the operation won't be preempted. Instead, use the functions provided in the C include file <atomic.h>, which allow you to perform the following operations with memory locations in an atomic manner:

Function                 Operation
--------                 ---------
atomic_add()             Add a number.
atomic_add_value()       Add a number and return the original value of *loc.
atomic_clr()             Clear bits.
atomic_clr_value()       Clear bits and return the original value of *loc.
atomic_set()             Set bits.
atomic_set_value()       Set bits and return the original value of *loc.
atomic_sub()             Subtract a number.
atomic_sub_value()       Subtract a number and return the original value of *loc.
atomic_toggle()          Toggle (complement) bits.
atomic_toggle_value()    Toggle (complement) bits and return the original value of *loc.

Note: The *_value() functions may be slower on some systems (e.g. 386) -- don't use them unless you really want the return value.

Designing with SMP in mind

You may not have an SMP system today, but wouldn't it be great if your software just ran faster on one when you or your customer upgrade the hardware?

While the general topic of how to design programs so that they can scale to N processors is still the topic of research, this section contains some general tips.

Use the SMP primitives

Don't assume that your program will run only on one processor. This means staying away from the FIFO synchronization trick mentioned above. Also, you should use the SMP-aware InterruptLock() and InterruptUnlock() functions.

By doing this, you'll be "SMP-ready" with little negative impact on a single-processor system.

Assume that threads really do run concurrently

As mentioned above, it's not merely a useful "programming abstraction" to pretend that threads run simultaneously; you should design as if they really do. That way, when you move to an SMP system, you won't have any nasty surprises.

Break the problem down

Most problems can be broken down into independent, parallel tasks. Some are easy to break down, some are hard, and some are impossible. Generally, you want to look at the data flow going through a particular problem. If the data flows are independent (i.e. one flow doesn't rely on the results of another), this can be a good candidate for parallelization within the process by starting multiple threads. Consider the following graphics program snippet:

void do_graphics (void)
{
    int     x;

    for (x = 0; x < XRESOLUTION; x++) {
        do_one_line (x);
    }
}

In the above example, we're doing ray-tracing. We've looked at the problem and decided that the function do_one_line() only generates output to the screen -- it doesn't rely on the results from any other invocation of do_one_line().

To make optimal use of an SMP system, you would start multiple threads, each running on one processor.

The question then becomes how many threads to start. Obviously, starting XRESOLUTION threads (where XRESOLUTION, perhaps 1024, is far greater than the number of processors, perhaps 4) is not a particularly good idea -- you're creating a lot of threads, all of which will consume stack resources and kernel resources as they compete for the limited pool of CPUs.

A simple solution would be to find out the number of CPUs that you have available to you (via the system page pointer) and divide the work up that way:

#include <stdlib.h>
#include <pthread.h>
#include <sys/syspage.h>

int     num_x_per_cpu;

void    *do_lines (void *arg);

void do_graphics (void)
{
    int     num_cpus;
    int     i;
    pthread_t *tids;

    // figure out how many CPUs there are...
    num_cpus = _syspage_ptr -> num_cpu;

    // allocate storage for the thread IDs
    tids = malloc (num_cpus * sizeof (pthread_t));

    // figure out how many X lines each CPU can do
    // (for simplicity, we assume XRESOLUTION divides evenly)
    num_x_per_cpu = XRESOLUTION / num_cpus;

    // start up one thread per CPU, passing it the CPU number
    for (i = 0; i < num_cpus; i++) {
        pthread_create (&tids[i], NULL, do_lines, (void *) i);
    }

    // now all the "do_lines" are off running on the processors

    // we need to wait for their termination
    for (i = 0; i < num_cpus; i++) {
        pthread_join (tids[i], NULL);
    }

    // now they are all done
    free (tids);
}

void *
do_lines (void *arg)
{
    int    cpunum = (int) arg;  // convert void * back to an integer
    int    x;

    for (x = cpunum * num_x_per_cpu;
         x < (cpunum + 1) * num_x_per_cpu; x++) {
        do_one_line (x);
    }
    return NULL;
}

The above approach will allow the maximum number of threads to run simultaneously on the SMP system. There's no point creating more threads than there are CPUs, because they'll simply compete with each other for CPU time.

An alternative approach is to use a semaphore. You could preload the semaphore with the count of available CPUs. Then, you create threads whenever the semaphore indicates that a CPU is available. This is conceptually simpler, but involves thread creation/destruction overhead for each iteration.
