Mars Pathfinder Case Study
More than a decade ago, the Mars Pathfinder landed amid great media fanfare and began to transmit data back to Earth. However, a few days later, the flow of information and images was interrupted by a series of total system resets. How this problem was diagnosed and resolved still makes a fascinating tale for software engineers.
The Pathfinder’s applications were scheduled by a real-time embedded operating system that provided pre-emptive priority scheduling of threads; tasks were executed as threads with priorities determined by their relative urgency. The meteorological data-gathering task ran as an infrequent, low-priority thread and used the information bus, synchronized with mutual exclusion locks (mutexes). Other, higher-priority threads took precedence when necessary, including a very high-priority bus management task, which also accessed the bus with mutexes.
Unfortunately, in this case a long-running communications task, with priority higher than the meteorological task but lower than the bus management task, prevented the meteorological task from running and releasing its mutex; the bus management task, blocked on that mutex, could not run at all. Soon, a watchdog timer noticed that the bus management task had not executed for some time, concluded that something had gone wrong, and ordered a total system reset. Engineers eventually diagnosed and fixed the problem, spotting a priority inversion.
A priority inversion occurs when a high-priority task is indirectly preempted by a medium-priority task, “inverting” the relative priorities of the two tasks. This is a clear violation of the priority model, which says high-priority tasks can only be prevented from running by higher-priority tasks, and briefly by low-priority tasks that will quickly complete their use of a resource shared by the high- and low-priority tasks.
Problems with Locking
In computer science, a lock or mutex (from mutual exclusion) is a synchronization mechanism for enforcing limits on access to a resource in an environment where there are many threads of execution. A lock is designed to enforce a mutual exclusion concurrency control policy.
What’s wrong with mutexes? Mutexes are perfectly fine, but you have a problem if there is lock contention. If you want your algorithm to be fast, you want to use the available cores as much as possible instead of letting them sleep. A thread can hold a mutex and be de-scheduled by the CPU (because of a cache miss or because its time slice is over). Then all the threads that want to acquire this mutex will be blocked. And if you have a lot of blocking, the OS also needs to perform more context switches, which are expensive because they clear the caches.
Other problems may arise if you do real time programming (priority inversion, convoying, etc.). Mutexes also cannot be used in signal handlers.
The following problems can arise when locking is used in multi-threaded applications to protect shared resources. Each of these issues is described in the paragraphs below.
- Deadlock
- Livelock
- Priority Inversion
- Lock Convoy
A deadlock is a situation in which two threads or programs sharing the same resources effectively prevent each other from accessing those resources, resulting in both ceasing to make progress. Deadlocks are typically prevented by always acquiring locks in a consistent global order.
As an example, suppose you split your program into different processes or threads so the whole application does not crash if one process or thread crashes. But if a process or thread crashes while holding a shared lock, every other thread that waits on that lock will block forever, likely hanging the main application.
A livelock is similar to a deadlock, except that the states of the processes involved in the livelock constantly change with regard to one another — none progressing. Livelock is a risk with some algorithms that detect and recover from deadlock. If more than one process takes action, the deadlock detection algorithm can be repeatedly triggered. This can be avoided by ensuring that only one process (chosen randomly or by priority) takes action.
In computer science, priority inversion is a problematic scheduling scenario in which a high-priority task is indirectly preempted by a lower-priority task, effectively “inverting” the relative priorities of the two tasks. Priority inversion was the error condition in the Mars Pathfinder mission described above. Priority-inheritance mutexes are used to reduce the effect of priority inversions.
In computer science, a lock convoy is a performance problem that can occur when using locks for concurrency control in a multi-threaded application. A lock convoy occurs when multiple threads of equal priority contend repeatedly for the same lock. Unlike deadlock and livelock situations, the threads in a lock convoy do make progress; however, each time a thread attempts to acquire the lock and fails, it relinquishes the remainder of its scheduling quantum and forces a context switch. The overhead of repeated context switches and under-utilization of scheduling quanta degrade overall performance.
We pride ourselves on developing quality software solutions that operate reliably 24×7 on a variety of operating system and hardware platforms. Nearly all of the applications we develop are multi-threaded and vary in complexity from moderate to complex. Almost all of our custom solutions decompose a problem space into a multi-threaded application running under an embedded RTOS distribution, embedded Linux, or a desktop operating system, with tasks of different relative priorities.
To avoid the issues illustrated by the Mars Pathfinder problem, we utilize lock-free programming and concurrent collections wherever possible, avoiding the common errors associated with resource locking in multi-threaded programming. By avoiding these multi-threaded issues, we can develop a reliable software product free of unexpected and difficult-to-reproduce errant behaviors.
General Approach to Lock-Free Algorithms
Designing generalized lock-free algorithms is hard. Instead, lock-free designs target specific lock-free data structures: buffers, lists, stacks, queues, maps, deques, etc. These data structures are often implemented in terms of simpler primitives such as “multi-word compare-and-swap” (MCAS, CAS2, CASN).
We have written both a “lock-free” and “wait-free” collection to implement a simple queue data structure. We also leverage open source lock-free implementations for higher level data collections such as lists, stacks, and maps.
The lock-free property guarantees that at least one thread is making progress on its work. The wait-free property guarantees that any given thread, provided with a time slice, will be able to make some progress and eventually complete. Wait-free extends lock-free, so wait-free is a stronger guarantee than lock-free, which in turn is stronger than blocking.
So how can we do it without locking?
Modern CPUs have something called atomic operations, and there are libraries whose APIs let you use them. Qt has two classes: QAtomicInt and QAtomicPointer. Other libraries or languages might have different primitives, but the principles are the same. For values that are atomic, if two threads operate on a value at the same time, it stays consistent. Atomic implementations typically use assembly-language instructions; Qt’s atomic classes are one of the very few places inside Qt implemented with assembly on each platform.
Lock Free Programming Techniques
Atomic operations are ones which manipulate memory in a way that appears indivisible: no thread can observe the operation half-complete. On modern processors, many operations are already atomic. For example, aligned reads and writes of simple, native-width data types (such as booleans) are usually atomic.
The key in lock-free programming is to use hardware-intrinsic atomic operations; indeed, even locks themselves are built from those atomic operations. The difference between locked and lock-free programming is that a lock-free program can never be stalled entirely by any single thread. If, in a locking program, one thread acquires a lock and then gets suspended indefinitely, the entire program is blocked and cannot make progress. By contrast, a lock-free program can make progress even if individual threads are suspended indefinitely.
The new C and C++ standards (C11 and C++11) have threads and, shared between threads, atomic data types and operations. An atomic operation gives guarantees for operations that would otherwise race between threads: once a thread returns from such an operation, it can be sure that the operation has taken effect in its entirety.
Typical processor support for such atomic operations exists on modern processors for compare-and-swap (CAS) or atomic increments. In addition to being atomic, a data type can have the “lock-free” property. This should perhaps have been coined “stateless”, since the property implies that an operation on such a type will never leave the object in an intermediate state, even when it is interrupted by an interrupt handler or a read from another thread falls in the middle of an update. Individual atomic types may or may not be lock-free; there are macros to test for that property, and there is always one type (the atomic flag) that is guaranteed to be lock-free.
Solutions where lock-free is not possible
Hard real-time applications should be designed such that priority inversion does not happen in the first place. For platforms where lock-free implementations are not possible due to the platform CPU, we use priority-inheritance mutexes to minimize the effects of priority inversions. This mechanism is designed to ensure the higher-priority task is kept in the blocked state for the shortest time possible, and in so doing minimize the priority inversion that has already occurred.
Developing lock-free algorithms requires much more thought than writing blocking algorithms, so traditional mutexes (and semaphores) will always have a place in multi-threaded applications, especially where lock-free implementations are not readily available or your application does not have much lock contention. However, lock-free programming has the potential upsides described above if you are looking for a challenge and a way of making your multi-threaded application more robust, predictable, and error-free.