Threads in mathematical operations

Suppose that we have a graphics program that performs ray tracing. Each raster line on the screen is dependent on the main database (which describes the actual picture being generated). The key here is this: each raster line is independent of the others. This immediately causes the problem to stand out as a threadable program.

Here's the single-threaded version:

int
main (int argc, char **argv)
{
    int x1;

    ...    // perform initializations

    for (x1 = 0; x1 < num_x_lines; x1++) {
        do_one_line (x1);
    }

    ...    // display results
}

Here we see that the program will iterate x1 over all the raster lines that are to be calculated.

On an SMP system, this program will use only one of the CPUs. Why? Because we haven't told the operating system to do anything in parallel. The operating system isn't smart enough to look at the program and say, "Hey, hold on a second! We have 4 CPUs, and it looks like there are independent execution flows here. I'll run it on all 4 CPUs!"

So, it's up to the system designer (you) to tell QNX Neutrino which parts can be run in parallel. The easiest way to do that would be:

int
main (int argc, char **argv)
{
    int x1;

    ...    // perform initializations

    for (x1 = 0; x1 < num_x_lines; x1++) {
        pthread_create (NULL, NULL, do_one_line, (void *) x1);
    }

    ...    // display results
}

There are a number of problems with this simplistic approach. First of all (and this is most minor), the do_one_line() function would have to be modified to take a void * instead of an int as its argument. This is easily remedied with a typecast.

The second problem is a little bit trickier. Let's say that the screen resolution that you were computing the picture for was 1280 by 1024. We'd be creating 1280 threads! This is not a problem for QNX Neutrino—the OS "limits" you to 32767 threads per process! However, each thread must have a unique stack. If your stack is a reasonable size (say 8 KB), you'll have used 1280 × 8 KB (10 megabytes!) of stack. And for what? There are only 4 processors in your SMP system. This means that only 4 of the 1280 threads will run at a time—the other 1276 threads are waiting for a CPU. (In reality, the stack will "fault in," meaning that the space for it will be allocated only as required. Nonetheless, it's a waste—there are still other overheads.)

A much better solution to this would be to break the problem up into 4 pieces (one for each CPU), and start a thread for each piece:

int num_lines_per_cpu;
int num_cpus;

int
main (int argc, char **argv)
{
    int cpu;

    ...    // perform initializations

    // get the number of CPUs
    num_cpus = _syspage_ptr -> num_cpu;
    num_lines_per_cpu = num_x_lines / num_cpus;
    for (cpu = 0; cpu < num_cpus; cpu++) {
        pthread_create (NULL, NULL, 
                        do_one_batch, (void *) cpu);
    }

    ...    // display results
}

void *
do_one_batch (void *c)
{
    int cpu = (int) c;
    int x1;

    for (x1 = 0; x1 < num_lines_per_cpu; x1++) {
        do_line_line (x1 + cpu * num_lines_per_cpu);
    }
}

Here we're starting only num_cpus threads. Each thread will run on one CPU. And since we have only a small number of threads, we're not wasting memory with unnecessary stacks. Notice how we got the number of CPUs by dereferencing the "System Page" global variable _syspage_ptr. (For more information about what's in the system page, see the Customizing Image Startup Programs chapter of Building Embedded Systems, or the <sys/syspage.h> include file).