Mars Pathfinder Case Study
More than a decade ago, the Mars Pathfinder landed to a media fanfare and began to transmit data back to Earth. However, a few days later, the flow of information and images was interrupted by a series of total systems resets. How this problem was diagnosed and resolved still makes for a fascinating tale for software engineers.
The Pathfinder’s applications were scheduled by a real time embedded operating system providing pre-emptive priority scheduling of threads, tasks were executed as threads with priorities determined by their relative urgency. The meteorological data gathering task ran as an infrequent, low priority thread, and used the information bus synchronized with mutual exclusion locks (mutexes). Other higher priority threads took precedence when necessary, including a very high priority bus management task, which also accessed the bus with mutexes.
Unfortunately in this case, a long-running communications task, having higher priority than the meteorological task, but lower than the bus management task, prevented it from running. Soon, a watchdog timer noticed that the bus management task had not been executed for some time, concluded that something had gone wrong, and ordered a total system reset. Eventually, engineers diagnosed and fixed the problem, eventually spotting a priority inversion.
A priority inversion occurs when a high priority task is indirectly pre-emptied by a medium priority task “inverting” the relative priorities of the two tasks. This is a clear violation of the priority model which says high priority tasks can only be prevented from running by higher priority tasks and briefly by low priority tasks which will quickly complete their use of a resource shared by the high and low priority tasks.
Problems with Locking
In computer science, a lock or mutex (from mutual exclusion) is a synchronization mechanism for enforcing limits on access to a resource in an environment where there are many threads of execution. A lock is designed to enforce a mutual exclusion concurrency control policy.
The following problems exist when using locking in multi-threaded applications to protect shared resources. The characteristic of each of these multi-threaded lock issue is described in the following paragraphs.
- Priority Inversion.
A deadlock is a situation in which two computer programs sharing the same resource are effectively preventing each other from accessing the resource, resulting in both programs ceasing to function. The earliest computer operating systems ran only one program at a time. Priority inheritance is a possible solution to deadlocks.
A livelock is similar to a deadlock, except that the states of the processes involved in the livelock constantly change with regard to one another — none progressing. Livelock is a risk with some algorithms that detect and recover from deadlock. If more than one process takes action, the deadlock detection algorithm can be repeatedly triggered. This can be avoided by ensuring that only one process (chosen randomly or by priority) takes action.
In computer science, priority inversion is a problematic scenario in scheduling in which a high priority task is indirectly preempted by a lower priority task effectively “inverting” the relative priorities of the two tasks. Priority inversion was the error condition for the Mars Pathfinder mission described above. Priority inheritance mutexes are used to reduce the effect of priority inversions.
In computer science, a lock convoy is a performance problem that can occur when using locks for concurrency control in a multi-threaded application. A lock convoy occurs when multiple threads of equal priority contend repeatedly for the same lock. Unlike deadlock and live lock situations, the threads in a lock convoy do progress; however, each time a thread attempts to acquire the lock and fails, it relinquishes the remainder of its scheduling quantum and forces a context switch. The overhead of repeated context switches and under utilization of scheduling quanta degrade overall performance.
We pride ourselves in developing quality software solutions that operate reliably 24×7 on various operating system and hardware platforms. Nearly all of the applications we develop are multi-threaded and vary in complexity from moderate to complex. Most all of our custom solutions decompose a problem space into a multi-threaded application running under either an embedded RTOS distribution, embedded Linux, or a desktop operating system, with tasks of different relative priorities.
To avoid the issues illustrated by the Mars Pathfinder problem, we utilize lock-free programming and concurrency collections where ever possible to avoid common errors associated with resource locking in multi-threaded programming. By avoiding these multi-threaded issues, we can develop a reliable software product free of any unexpected and perhaps difficult to reproduce errant behaviors.
General Approach to Lock-Free Algorithms
Designing generalized lock-free algorithms is hard. Instead, lock free designs implement lock-free data structures. This can consist of lock-free buffers, lists, stacks, queues, maps, dequeues, etc. These data structures are often implemented in terms of simpler primitives such as “Multi-word Compare and Set’ (MCAS, CAS2, CASN).
We have written both a “lock-free” and “wait-free” collection to implement a simple queue data structure. We also leverage open source lock-free implementations for higher level data collections such as lists, stacks, and maps.
Lock Free Programming Techniques
Atomic operations are ones which manipulate memory in a way that appears indivisible: No thread can observe the operation half-complete. On modern processors, lots of operations are already atomic. For example, aligned reads and writes of simple data types (such as volatile or atomic booleans) are usually atomic.
The key in lock-free programming is to use hardware-intrinsic atomic operations. The locks themselves must use those atomic operations. The difference between locked and lock-free programming is that a lock-free program can never be stalled entirely by any single thread. By contrast, if in a locking program one thread acquires a lock and then gets suspended indefinitely, the entire program is blocked and cannot make progress. By contrast, a lock-free program can make progress even if individual threads are suspended indefinitely.
The new C and C++ standards (C11 and C++11) have threads, and shared between threads are atomic data types and operations. An atomic operation gives guarantees for operations that run into race conditions between threads. Once a thread returns from such an operation, it can be sure that the operation is effected in its entirety.
Typical processor support for such atomic operations exists on modern processors for compare and swap (CAS) or atomic increments. Additionally to being atomic, data type can have the “lock-free” property. This should perhaps have been coined “stateless”, since this property implies that an operation on such a type will never leave the object in an intermediate state, even when it is interrupted by an interrupt handler or a read of another thread falls in the middle of an update. Several atomic types may (or may not) be lock-free, there are macros to test for that property. There is always one type that is guaranteed to be lock-free.
Solutions where lock-free is not possible
Hard real time applications should be designed such that priority inversion does not happen in the first place. For platforms where lock free implementations are not possible due to the platform CPU, we use priority inheritance mutexes to minimize the effects of priority inversions. This mechanism is designed to ensure the higher priority task is kept in the blocked state for the shortest time possible, and in so doing minimize the ‘priority inversion’ that has already occurred.