The process is made to heartbeat

Updated: April 19, 2023

Now consider the case where the client can be made to send heartbeats, so that a HAM automatically detects when the client becomes unresponsive and terminates it.

Thread 1                             Thread 2

...                                  ...
while true                           while true
do                                   do
  obtain lock a                        obtain lock b
    (compute section1)                   (compute section1)
    obtain lock b                        obtain lock a
      send heartbeat                       send heartbeat
      (compute section2)                   (compute section2)
    release lock b                       release lock a
  release lock a                       release lock b
done                                 done
...                                  ...
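
The pseudocode above might look as follows in C with POSIX mutexes. This is a minimal sketch, not the demo's actual source: `send_heartbeat()` and `heartbeats_sent` are hypothetical stand-ins for the HAM client's heartbeat call (on QNX, presumably `ham_heartbeat()` from `<ha/ham.h>`), and `locked_iteration()` is a name introduced here for one pass through the loop body.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the HAM heartbeat call: it only counts beats. */
static atomic_int heartbeats_sent;

static void send_heartbeat(void)
{
    atomic_fetch_add(&heartbeats_sent, 1);
}

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* One pass through the loop body: take `outer`, then `inner`, heartbeat
 * while holding both, and release in reverse order. */
static void locked_iteration(pthread_mutex_t *outer, pthread_mutex_t *inner)
{
    pthread_mutex_lock(outer);
    /* (compute section1) */
    pthread_mutex_lock(inner);   /* blocks forever once the threads deadlock */
    send_heartbeat();            /* never reached by a deadlocked thread */
    /* (compute section2) */
    pthread_mutex_unlock(inner);
    pthread_mutex_unlock(outer);
}

/* Thread 1 takes lock a, then lock b. */
void *thread1_main(void *arg)
{
    (void)arg;
    for (;;)
        locked_iteration(&lock_a, &lock_b);
    return NULL;
}

/* Thread 2 takes the same locks in the opposite order. */
void *thread2_main(void *arg)
{
    (void)arg;
    for (;;)
        locked_iteration(&lock_b, &lock_a);
    return NULL;
}
```

Passing `thread1_main` and `thread2_main` to `pthread_create()` reproduces the race: as soon as each thread holds its outer lock and waits for the other's, neither reaches `send_heartbeat()` again.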

Here the process is expected to send heartbeats to a HAM. Because the heartbeat call sits inside the loop, within the innermost locked section, a deadlock also stops the heartbeats, and so the deadlock condition is trapped. The HAM notices that the heartbeats have stopped and can then perform recovery.
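
To illustrate how a monitor can turn "heartbeats have stopped" into a decision, here is a simplified model of missed-heartbeat detection. It is an assumption-laden sketch, not the HAM's implementation: the type and function names (`hb_monitor_t`, `hb_record_beat()`, `hb_check()`) are hypothetical, and the rule that a state is reached once the number of whole periods without a heartbeat meets the corresponding mark is an assumed reading of the low and high marks.

```c
#include <stdint.h>

/* Simplified heartbeat states (hypothetical names for this sketch). */
typedef enum { HB_OK, HB_MISSEDLOW, HB_MISSEDHIGH } hb_state_t;

typedef struct {
    uint64_t period_ns;    /* expected interval between heartbeats */
    unsigned low_mark;     /* missed beats that trigger HB_MISSEDLOW */
    unsigned high_mark;    /* missed beats that trigger HB_MISSEDHIGH */
    uint64_t last_beat_ns; /* time of the most recent heartbeat */
} hb_monitor_t;

/* Called whenever the client sends a heartbeat. */
static void hb_record_beat(hb_monitor_t *m, uint64_t now_ns)
{
    m->last_beat_ns = now_ns;
}

/* Called periodically by the monitor to classify the client. */
static hb_state_t hb_check(const hb_monitor_t *m, uint64_t now_ns)
{
    uint64_t missed = 0;

    /* Count whole periods elapsed since the last heartbeat. */
    if (m->period_ns > 0 && now_ns > m->last_beat_ns)
        missed = (now_ns - m->last_beat_ns) / m->period_ns;

    if (missed >= m->high_mark)
        return HB_MISSEDHIGH;
    if (missed >= m->low_mark)
        return HB_MISSEDLOW;
    return HB_OK;
}
```

With a period of 1000000000 ns and both marks set to 5, this model stays in `HB_OK` until five full periods pass without a heartbeat and then moves straight to `HB_MISSEDHIGH`.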

Let's look at what happens now:

  1. Starting two-threaded process.

    The threads execute as described earlier and eventually deadlock. We wait a reasonable amount of time (a few seconds) for that to happen. The threads write a simple execution log to /dev/shmem/mutex-deadlock-heartbeat.log. The HAM detects that the threads have stopped heartbeating and, after saving the process's state for postmortem analysis, terminates the process.

  2. Waiting for them to deadlock.

    Here's the current state of the threads in process 462866, followed by the HAM's state for the mutex-deadlock entity at the time it missed heartbeats:

         pid tid name               prio STATE       Blocked
      462866   1 oot/mutex-deadlock  10r MUTEX       462866-03 #-2147
      462866   2 oot/mutex-deadlock  63r RECEIVE     1
      462866   3 oot/mutex-deadlock  10r MUTEX       462866-01 #-2147
    
    
        Entity state from HAM
    
    Path            : mutex-deadlock
    Entity Pid      : 462866
    Num conditions  : 1
    Condition type  : ATTACHEDSELF
    Stats:
    HeartBeat Period: 1000000000
    HB Low Mark     : 5
    HB High Mark    : 5
    Last Heartbeat  : 2001/09/03 14:40:41:406575120
    HeartBeat State : MISSEDHIGH
    Created         : 2001/09/03 14:40:40:391615720
    Num Restarts    : 0
    

    And here's the tail from the threads' log file:

    Thread 2: Obtained lock b
    Thread 2: Waiting for lock a
    Thread 2: Obtained lock a
    Thread 2: Performing computation
    Thread 2: Unlocking lock a
    Thread 2: Unlocking lock b
    Thread 2: Obtained lock b
    Thread 2: Waiting for lock a
    Thread 1: Obtained lock a
    Thread 1: Waiting for lock b
    
  3. Extracting current process information from the core file:
    /tmp/mutex-deadlock.core:
     processor=ARM num_cpus=2
      cpu 1 cpu=602370 name=604e speed=299
       flags=0xc0000001 FPU MMU EAR
      cpu 2 cpu=602370 name=604e speed=299
       flags=0xc0000001 FPU MMU EAR
     cyc/sec=16666666 tod_adj=999522656000000000 nsec=5390696363520 inc=999960
     boot=999522656 epoch=1970 intr=-2147483648
     rate=600000024 scale=-16 load=16666
       MACHINE="mtx604-smp" HOSTNAME="localhost"
     hwflags=0x000004  
     pretend_cpu=0 init_msr=36866 
     pid=462866 parent=434193 child=0 pgrp=462866 sid=1
     flags=0x000300 umask=0 base_addr=0x48040000 init_stack=0x4803f9f0
     ruid=0 euid=0 suid=0  rgid=0 egid=0 sgid=0
     ign=0000000006801000 queue=ff00000000000000 pending=0000000000000000
     fds=5 threads=3 timers=1 chans=4
     thread 1 REQUESTED
      ip=0xfe32f838 sp=0x4803f8f0 stkbase=0x47fbf000 stksize=528384
      state=MUTEX flags=0 last_cpu=2 timeout=00000000
      pri=10 realpri=10 policy=RR
     thread 2
      ip=0xfe32f1a8 sp=0x47fbef50 stkbase=0x47f9e000 stksize=135168
      state=RECEIVE flags=4000000 last_cpu=2 timeout=00000000
      pri=63 realpri=63 policy=RR
      blocked_chid=1
     thread 3
      ip=0xfe32f838 sp=0x47f9df80 stkbase=0x47f7d000 stksize=135168
      state=MUTEX flags=4020000 last_cpu=1 timeout=00000000
      pri=10 realpri=10 policy=RR
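
The thread listing in step 2 shows why this is a deadlock: threads 1 and 3 are both in the MUTEX state, and each one's Blocked column names the other (assuming the usual convention that, for a MUTEX-blocked thread, this column identifies the thread that owns the mutex). The following sketch checks for such a circular wait. The function name `in_wait_cycle()` and the `waits_for` array layout are hypothetical, introduced only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* waits_for[t] is the id of the thread that thread t is blocked on, or 0 if
 * thread t isn't blocked on a mutex. Index 0 is unused so that indices match
 * the thread ids in the listing; the array holds nthreads + 1 entries. */
bool in_wait_cycle(const int *waits_for, size_t nthreads, int start)
{
    int t = start;

    /* A cycle through `start` must close within nthreads hops. */
    for (size_t steps = 0; steps < nthreads; steps++) {
        if (t < 1 || (size_t)t > nthreads)
            return false;           /* invalid id: treat as not blocked */
        t = waits_for[t];
        if (t == 0)
            return false;           /* chain ends at a runnable thread */
        if (t == start)
            return true;            /* followed the chain back to start */
    }
    return false;
}
```

For the listing above, the array would be `{0, 3, 0, 1}`: thread 1 waits on thread 3, thread 2 isn't blocked on a mutex, and thread 3 waits on thread 1, so the check reports a cycle for threads 1 and 3 but not for thread 2.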