Threads in mathematical operations
Suppose that we have a graphics program that performs ray tracing. Each raster line on the screen is dependent on the main database (which describes the actual picture being generated). The key here is this: each raster line is independent of the others. This immediately causes the problem to stand out as a threadable program.
int main (int argc, char **argv)
{
int x1;
... // perform initializations
for (x1 = 0; x1 < num_x_lines; x1++) {
do_one_line (x1);
}
... // display results
}
Here we see that the program will iterate x1 over all the raster lines that are to be calculated.
On an SMP system, this program will use only one CPU.
Why?
Because we haven't told the operating system to do anything in parallel.
The operating system isn't smart enough to look at the program and say,
Hey, hold on a second! We have 4 CPUs, and it looks like there
are independent execution flows here. I'll run it on all 4 CPUs!
int main (int argc, char **argv)
{
int x1;
... // perform initializations
for (x1 = 0; x1 < num_x_lines; x1++) {
pthread_create (NULL, NULL, do_one_line, (void *) x1);
}
... // display results
}
There are a number of problems with this simplistic approach. First of all (and this is most minor), the do_one_line() function would have to be modified to take a void * instead of an int as its argument. This is easily remedied with a typecast.
The second problem is a little bit trickier.
Let's say that the screen resolution that you were computing the picture
for was 1280 by 1024.
We'd be creating 1280 threads!
This is not a problem for QNX OS—the OS limits
you to 32767 threads per process!
However, each thread must have a unique stack.
If your stack is a reasonable size (say 8 KB), you'll have used
1280 × 8 KB (10 megabytes!) of stack.
And for what?
There are only 4 processors in your SMP system.
This means that only 4 of the 1280 threads will run at a time—the other
1276 threads are waiting for a CPU.
(In reality, the stack will fault in,
meaning that the space for
it will be allocated only as required. Nonetheless, it's a waste—there are still other overheads.)
int num_lines_per_cpu;
int num_cpus;
int main (int argc, char **argv)
{
int cpu;
... // perform initializations
// get the number of CPUs
num_cpus = _syspage_ptr -> num_cpu;
num_lines_per_cpu = num_x_lines / num_cpus;
for (cpu = 0; cpu < num_cpus; cpu++) {
pthread_create (NULL, NULL, do_one_batch, (void *) cpu);
}
... // display results
}
void * do_one_batch (void *c)
{
int cpu = (int) c;
int x1;
for (x1 = 0; x1 < num_lines_per_cpu; x1++) {
do_line_line (x1 + cpu * num_lines_per_cpu);
}
}
Here we're starting only num_cpus threads.
Each thread will run on one CPU.
And since we have only a small number
of threads, we're not wasting memory with unnecessary stacks.
Notice how we got the number of CPUs by dereferencing the System Page
global variable _syspage_ptr.
(For more information about what's in the system page, see the
System Page chapter of Building Embedded Systems, or the
<sys/syspage.h> include file).