Designing with multiprocessing in mind

This section contains some general tips on how to design programs so that they can scale to N processors.

Use the multicore primitives

Don't assume that your program will run only on one processor. This means staying away from the FIFO synchronization trick mentioned above. Also, you should use the multicore-aware InterruptLock() and InterruptUnlock() functions.
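
For example, code that shares data with an interrupt handler should use these calls because they take a spinlock in addition to disabling interrupts, so the data stays protected even while the other processors keep running. Here's a minimal sketch (the variable and function names are made up for illustration; the calling thread needs I/O privileges, obtained via ThreadCtl()):

#include <sys/neutrino.h>

static intrspin_t       spinlock;       // spinlock shared with the interrupt handler
static volatile int     shared_count;   // data shared with the interrupt handler

void
bump_shared_count (void)
{
    // disables interrupts on this CPU *and* takes the spinlock,
    // so the update is safe even on a multiprocessor system
    InterruptLock (&spinlock);
    shared_count++;
    InterruptUnlock (&spinlock);
}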

Assume that threads really do run concurrently

As mentioned above, it isn't merely a useful programming abstraction to pretend that threads run simultaneously; you should design as if they really do. That way, when you move to a multicore system, you won't have any nasty surprises (but you can use BMP if you have problems and don't want to modify the code).
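
For example, two threads that update the same total need real mutual exclusion; on a multicore system the updates genuinely can happen at the same instant, even if they never seemed to collide on a single processor. A minimal sketch:

#include <pthread.h>

static pthread_mutex_t  total_mutex = PTHREAD_MUTEX_INITIALIZER;
static int              shared_total;

void
add_to_total (int amount)
{
    // without the mutex, two threads on two CPUs could read, add, and
    // write back shared_total at the same time and lose an update
    pthread_mutex_lock (&total_mutex);
    shared_total += amount;
    pthread_mutex_unlock (&total_mutex);
}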

Break the problem down

Most problems can be broken down into independent, parallel tasks. Some are easy to break down, some are hard, and some are impossible. Generally, you want to look at the data flow going through a particular problem. If the data flows are independent (i.e., one flow doesn't rely on the results of another), the problem is a good candidate for parallelizing within the process by starting multiple threads. Consider the following graphics program snippet:
void
do_graphics ()
{
    int     x;

    for (x = 0; x < XRESOLUTION; x++) {
        do_one_line (x);
    }
}

In the above example, we're doing ray-tracing. We've looked at the problem and decided that the function do_one_line() only generates output to the screen—it doesn't rely on the results from any other invocation of do_one_line().

To make optimal use of a multicore system, you would start multiple threads, each running on one processor.

The question then becomes how many threads to start. Obviously, starting XRESOLUTION threads (perhaps 1024 of them, far more than the handful of processors available) isn't a particularly good idea: you'd be creating a lot of threads, all of which consume stack and kernel resources while they compete for the limited pool of CPUs.

A simple solution would be to find out the number of CPUs that you have available to you (via the system page pointer) and divide the work up that way:
#include <sys/syspage.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdint.h>

int     num_x_per_cpu;

void *do_lines (void *arg);

void
do_graphics ()
{
    int         num_cpus;
    int         i;
    pthread_t   *tids;

    // figure out how many CPUs there are...
    num_cpus = _syspage_ptr -> num_cpu;

    // allocate storage for the thread IDs
    tids = malloc (num_cpus * sizeof (pthread_t));

    // figure out how many X lines each CPU can do
    // (this assumes XRESOLUTION divides evenly by the number of CPUs)
    num_x_per_cpu = XRESOLUTION / num_cpus;

    // start up one thread per CPU, passing it the CPU number
    for (i = 0; i < num_cpus; i++) {
        pthread_create (&tids[i], NULL, do_lines, (void *) (uintptr_t) i);
    }

    // now all the "do_lines" are off running on the processors

    // we need to wait for their termination
    for (i = 0; i < num_cpus; i++) {
        pthread_join (tids[i], NULL);
    }

    // now they are all done
    free (tids);
}

void *
do_lines (void *arg)
{
    int    cpunum = (int) (uintptr_t) arg;  // convert void * back to the CPU number
    int    x;

    // render this CPU's share of the lines
    for (x = cpunum * num_x_per_cpu; x < (cpunum + 1) * num_x_per_cpu; x++) {
        do_one_line (x);
    }

    return NULL;
}

The above approach lets the maximum number of threads run simultaneously on the multicore system. There's no point creating more threads than there are CPUs, because they'll simply compete with each other for CPU time.

Note that in this example, we didn't specify which processor to run each thread on. We don't need to in this case, because the READY thread with the highest priority always runs on the next available processor. The threads will tend to run on different processors (depending on what else is running in the system). You typically use the same priority for all the worker threads if they're doing similar work.
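
If you want to be explicit about it, you can give every worker the same priority through the thread attributes at creation time. This is only a sketch; the priority value of 10 is arbitrary:

#include <pthread.h>
#include <sched.h>

// create a worker thread with an explicitly chosen priority
// (the value 10 here is just an example)
int
create_worker (pthread_t *tid, void *(*func) (void *), void *arg)
{
    pthread_attr_t      attr;
    struct sched_param  param;

    pthread_attr_init (&attr);
    pthread_attr_setinheritsched (&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy (&attr, SCHED_RR);
    pthread_attr_getschedparam (&attr, &param);
    param.sched_priority = 10;
    pthread_attr_setschedparam (&attr, &param);

    return pthread_create (tid, &attr, func, arg);
}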

An alternative approach is to use a semaphore. You could preload the semaphore with the count of available CPUs. Then, you create threads whenever the semaphore indicates that a CPU is available. This is conceptually simpler, but involves the overhead of creating and destroying threads for each iteration.
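
Here's a sketch of that approach (the function names and the way the work is split up are made up for illustration):

#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>

static sem_t    cpus_available;     // preloaded with the number of CPUs

static void *
worker (void *arg)
{
    int     piece = (int) (uintptr_t) arg;

    // ... do one piece of the work for 'piece' here ...
    (void) piece;

    sem_post (&cpus_available);     // this CPU is free again
    return NULL;
}

void
run_all_pieces (int num_cpus, int num_pieces)
{
    pthread_t   tid;
    int         i;

    sem_init (&cpus_available, 0, num_cpus);

    for (i = 0; i < num_pieces; i++) {
        sem_wait (&cpus_available);     // block until a CPU is available
        pthread_create (&tid, NULL, worker, (void *) (uintptr_t) i);
        pthread_detach (tid);           // no join; the thread cleans itself up
    }

    // wait for the last workers to finish by reclaiming every CPU token
    for (i = 0; i < num_cpus; i++) {
        sem_wait (&cpus_available);
    }
}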
