The simplest method of synchronization is to join the threads as they terminate. Joining really means waiting for termination.

Joining is accomplished by one thread waiting for the termination of another thread. The waiting thread calls pthread_join():

#include <pthread.h>

pthread_join (pthread_t thread, void **value_ptr);

To use pthread_join(), you pass it the thread ID of the thread that you wish to join, and an optional value_ptr, which can be used to store the termination return value from the joined thread. (You can pass in a NULL if you aren't interested in this value—we're not, in this case.)

Where did the thread ID came from? We ignored it in the pthread_create()—we passed in a NULL for the first parameter. Let's now correct our code:

int num_lines_per_cpu, num_cpus;

int main (int argc, char **argv)
    int cpu;
    pthread_t *thread_ids;

    ...    // perform initializations
    thread_ids = malloc (sizeof (pthread_t) * num_cpus);

    num_lines_per_cpu = num_x_lines / num_cpus;
    for (cpu = 0; cpu < num_cpus; cpu++) {
        pthread_create (&thread_ids [cpu], NULL,
                        do_one_batch, (void *) cpu);

    // synchronize to termination of all threads
    for (cpu = 0; cpu < num_cpus; cpu++) {
        pthread_join (thread_ids [cpu], NULL);

    ...    // display results

You'll notice that this time we passed the first argument to pthread_create() as a pointer to a pthread_t. This is where the thread ID of the newly created thread gets stored. After the first for loop finishes, we have num_cpus threads running, plus the thread that's running main(). We're not too concerned about the main() thread consuming all our CPU; it's going to spend its time waiting.

The waiting is accomplished by doing a pthread_join() to each of our threads in turn. First, we wait for thread_ids [0] to finish. When it completes, the pthread_join() will unblock. The next iteration of the for loop will cause us to wait for thread_ids [1] to finish, and so on, for all num_cpus threads.

A common question that arises at this point is, “What if the threads finish in the reverse order?” In other words, what if there are 4 CPUs, and, for whatever reason, the thread running on the last CPU (CPU 3) finishes first, and then the thread running on CPU 2 finishes next, and so on? Well, the beauty of this scheme is that nothing bad happens.

The first thing that's going to happen is that the pthread_join() will block on thread_ids [0]. Meanwhile, thread_ids [3] finishes. This has absolutely no impact on the main() thread, which is still waiting for the first thread to finish. Then thread_ids [2] finishes. Still no impact. And so on, until finally thread_ids [0] finishes, at which point, the pthread_join() unblocks, and we immediately proceed to the next iteration of the for loop. The second iteration of the for loop executes a pthread_join() on thread_ids [1], which will not block—it returns immediately. Why? Because the thread identified by thread_ids [1] is already finished. Therefore, our for loop will “whip” through the other threads, and then exit. At that point, we know that we've synched up with all the computational threads, so we can now display the results.