# Lecture 11 Deadlock - Problems and Solutions Notes

# Slide Notes

## Deadlock

A deadlock is a situation where two entities have each locked a resource, and each needs the other's locked resource to continue. Neither will unlock without getting the other resource, so they block each other forever. This concept can be explained with the **dining philosophers problem.**

![Screenshot 2024-05-09 at 2.03.34 PM](https://hackmd.io/_uploads/HyM49n5MC.png)

It highlights how changing the rules in the right way can eliminate deadlocks.

Deadlocks are important to avoid because they are a major peril in cooperating parallel interpreters: they result in system failures. They are also very hard to find through debugging, and much easier to prevent at design time. Once you understand them you can avoid them; most deadlocks result from careless/ignorant design.

Deadlocks may not be obvious, because a process's resource needs are ever-changing. They depend on what data is in use, where the computation is, and what errors have happened. Modern software depends on many services, each needing numerous resources and interacting with the others, so it is very hard to pin down a failure. Modularity and encapsulation are a hindrance here: services encapsulate much complexity, so we do not know what resources they require or when/how they are serialized.

Note: deadlocks are not the only synchronization problem.

## Deadlocks and Different Resource Types

**Commodity resources:** clients need some amount of them (e.g. memory); deadlocks result from over-commitment. **Avoidance** can be done in the **resource manager**.

**General resources:** clients need a specific instance of something (a particular file, message, request, or semaphore); deadlocks result from specific dependency relationships. **Prevention** is usually done at **design time**.

## Four Basic Conditions for Deadlocks

1. **Mutual exclusion** - the resource in question can only be used by one entity at a time
2.
   **Incremental allocation** - processes/threads are allowed to ask for resources whenever they want. If they instead had to pre-allocate all resources, either:
   - they get everything they need and run to completion, or
   - they do not get everything they need and abort
3. **No preemption** - once an entity has reserved a resource, you cannot take it away. If you could, deadlocks would be simple to resolve: just take someone's resource away.
4. **Circular wait** - A waits on B, which waits on A. The cycle could involve many more than two entities. But if there is no such cycle, someone can complete without anyone releasing a resource (maybe not very fast, though).

All four must be true for a deadlock. A wait-for graph is useful for seeing whether all these conditions hold. Note that the graph only **indicates** a deadlock could occur; it is not guaranteed.

## Deadlock Avoidance

Use methods that, by their nature, guarantee no deadlock can occur. We can also use advance reservation techniques.

### Avoiding Deadlock Using Reservations

We can make advance reservations for commodity resources. The resource manager tracks outstanding reservations and only grants a reservation if the resources are available. With this method, over-subscription is detected before the process ever gets the resource, and clients can/must be prepared to deal with reservation failures in ways that do not result in deadlocks.

#### Overbooking vs. Under-Utilization

Processes generally cannot perfectly predict their resource needs, so to ensure they have enough, they tend to ask for more than they will ever use. The OS will either grant requests only until everything is reserved, or grant requests beyond the available amount, in which case someone may not get a resource they reserved. When handling reservations, we know that clients seldom need all their resources all the time, and that all clients won't need their maximum allocation at the same time. It is generally safe to over-book resources, just like an airline overbooks a flight.
Here "safe" means everyone will still be able to complete: some may have to wait for others to complete first, and we need to ensure that no deadlocks result.

### Commodity Resource Management in Real Systems

Advance reservation mechanisms are common: memory reservations, disk quotas, and quality-of-service contracts. Once granted, a reservation must be guaranteed by the system. This means allocation failures only happen at reservation time, before the new computation has begun; they will not happen at request time, so system behavior becomes easier to handle and more predictable. Remember, it is the resource reservation that eliminates the deadlock.

With this in mind, apps should be designed to handle reservation failures gracefully (refuse to perform new requests, but continue running) and must have a way of reporting the failure to the requester. Since the app must be able to continue running, all critical resources should be reserved at start-up time. A reservation denial is not great, but it is better than failing later; with advance notice, an app may be able to adjust its service so it does not need the unavailable resource.

- If an app is in the middle of servicing a request, everything must unwind and fail, which can be complex or even impossible to do correctly.

## Deadlock Prevention

The main method of prevention is attacking one of the four basic conditions, ensuring deadlock cannot occur.

### 1. Mutual Exclusion

You cannot deadlock over a shareable resource, perhaps one maintained with atomic instructions; even reader/writer locks can help. You also cannot deadlock on your own private resources. Can we give each process its own private resource?

### 2. Incremental Allocation

1. Allocate all of your resources in a single operation; if you cannot, the system returns failure (all or nothing).
2. Non-blocking requests - a request that cannot be satisfied immediately will fail.
3.
   Disallow blocking while holding resources - you must release all held locks prior to blocking and reacquire them after you return.

There is a lot to consider before releasing locks in order to block. You could be blocking for a reason unrelated to resource locking. We also have to consider that when you reacquire the locks, you must do so in a single all-or-none transaction. Such a transaction does not involve hold-and-block, and so cannot result in a deadlock.

- Note: deadlock solutions solve deadlock. They do not necessarily solve all your other problems, and they may even create new errors.

### 3. No Preemption

Deadlocks can be broken by resource confiscation: resources are seized and reallocated to a new client. Revocation must be enforced; it invalidates the previous owner's resource handle. If that is not possible, the previous owner must be killed. We have to be careful, because some resources may be damaged by lock breaking: the previous owner may have been in the middle of a critical section, and repairs may be needed. Resources must be designed with revocation in mind.

The OS can only "seize" a resource when it can revoke access by invalidating a process's resource handle:

- if the process has to use a system service to access the resource, that service can stop honoring requests
- the OS cannot revoke access when the process has direct access to the object (e.g. it is part of the process's address space); that would require killing the process

### 4. Circular Dependencies

To prevent these we use **total resource ordering**, in which all requesters allocate resources in the same order. This assumes we know how to order the resources.
We can order by resource type (groups before members) or by relationship (parents before children). This may require a **lock dance**, where we release lock 2, allocate lock 1, and reacquire lock 2.

![Screenshot 2024-05-09 at 3.06.19 PM](https://hackmd.io/_uploads/B1vJtaqMA.png)

Here, to delete a buffer we have to unlock the buffer (lock 2), allocate the list head lock (lock 1), and reacquire the buffer lock (lock 2) afterwards.

### Which Approach to Use

There is no one universal solution to all deadlocks. This is fine: we only need a solution for each resource, and we can solve each individual problem any way we want, including a mixture of solutions. The OS must prevent deadlocks in all system services; applications are responsible for their own behavior.

The last "solution" is to ignore the problem, since deadlocks are very improbable. This may be a good option, because doing anything else can be very expensive.

## Deadlock Detection and Recovery

If deadlocks do happen, we want to detect them once they have occurred and be able to recover the system (break or repair the deadlock).

### Implementing Deadlock Detection

To detect all deadlocks, we need to identify all resources that can be locked, which is not always clear in an OS. The detector must maintain a wait-for graph (or an equivalent structure) to figure it out. One method of detection: when a lock is requested, update the structure and check for deadlock.

- It may be better to just reject the lock request and not let the requester block.

### Deadlock Outside the OS

The problem grows bigger for detection, because locking is not found in only one location. Some applications use locking internally, built into their own code; database systems are a key example, where individual records are locked. The OS knows nothing of those locks and thus cannot offer any help. We would need a system that can detect both OS and non-OS deadlocks.

### Not All Synchronization Bugs Are Deadlocks

There are many reasons a system hangs and makes no progress.
Deadlock is one of those reasons, but other times it is livelock, flaws in the implementation, or simple bugs. Generally, an approach that handles the whole range of synchronization problems is preferred.

## Dealing with General Synchronization Bugs

Deadlock detection seldom makes sense: it is complex to implement, and it is not always clear what you should do if a deadlock is detected. Instead, we can have health-monitoring services that detect when a response takes too long, declaring the service "hung."

- This is easier to implement and can detect a wider range of problems (deadlocks, live-locks, infinite loops and waits, crashes).

### Related Problems That Health Monitoring Can Handle

**Live-lock** - a process is running but will not free lock 1 until it gets a message, and the process that would send the message is blocked waiting for lock 1.

**Sleeping Beauty, waiting for "Prince Charming"** - a process is blocked, awaiting some completion that will never happen (e.g. a sleep/wakeup race).

**Priority inversion hangs** - a lower-priority process keeps running because it holds a lock a higher-priority process needs (e.g. the Mars Pathfinder case).

None of these is a true deadlock, but they leave the system hung up just as a deadlock would.

### How to Monitor Process Health

- look for obvious failures - the process exits or dumps core
- passive observation to detect hangs - is the process consuming CPU time, blocked, or doing I/O or network activity?
- external health monitoring - "pings" / standard test requests to detect errors in the output
- internal instrumentation - white-box audits, exercisers, and monitoring, e.g. a heartbeat to show it is awake

### What to Do with "Unhealthy" Processes

The simple answer is to kill and restart "all of the affected software." We want to kill as few processes as possible, but enough to get back to a working state.

- Note: the hung processes may not be the ones that are actually broken.
Apps must be designed for cold/warm/partial restarts, which depends on the service APIs and/or protocols. Highly available systems define restart groups (groups of processes to be started/killed as a group), which makes it easier to kill unhealthy processes correctly.

- They also need to define inter-group dependencies (e.g. restart B after A).

### Failure Recovery Methodology

You should retry if possible, but not forever: clients cannot wait forever, and resources are being held while you wait. You can roll back failed operations and return an error, or continue with reduced capacity or functionality (accept the requests you can handle, reject those you cannot). We can also do automatic restarts (cold, warm, and partial). When we go for resets, we have to consider escalation, to account for the case where the "hung" process is not the one that needs correcting (go from one reset to a group reset, up to the whole system in the worst case).

## Making Synchronization Easier

Locks, semaphores, and mutexes are hard to use correctly, and we want to make synchronization easier for programmers. One approach is to identify shared resources (objects whose methods may require serialization), write code to operate on those objects, and assume all critical sections will be serialized. The compiler generates and handles the serialization, automatically emitting lock acquisitions and releases when the appropriate mechanisms are called.

### Monitors - Protected Classes

Each monitor object has a mutex, automatically acquired on any method invocation and released on method return. Monitors also have good encapsulation: developers do not need to identify critical sections, because protection is completely automatic, so the programmer can have high confidence of adequate protection.

![Screenshot 2024-05-09 at 3.38.12 PM](https://hackmd.io/_uploads/Hkgve0cM0.png)

Monitor locking is very conservative: it locks the entire object on any method.
Such coarse locking can create performance problems, because it eliminates conflicts by eliminating parallelism; if a thread blocks inside a monitor, a convoy can form. TANSTAAFL:

- fine-grained locking is difficult and error prone
- coarse-grained locking creates bottlenecks

### Java Synchronized Methods

Here too, each object has an associated mutex; the difference is that it is only acquired for specified methods. This addresses the fact that not all object methods need to be synchronized. Nested calls (recursion by the same thread) do not reacquire the lock, which is automatically released upon the final return. The other option is static synchronized methods, which lock the class mutex; all other threads using that class must wait.

The advantages are finer lock granularity and reduced deadlock risk. The cost is that the developer must identify the methods to be serialized.

![Screenshot 2024-05-09 at 3.46.00 PM](https://hackmd.io/_uploads/ryNVM05GA.png)

# Ch 32 Common Concurrency Problems

Crux: how to handle common concurrency bugs.

## Non-Deadlock Bugs

**Atomicity-violation bugs:** the desired serializability among multiple memory accesses is violated - i.e. a code region is intended to be atomic, but the atomicity is not enforced during execution. The bug occurs when an atomicity assumption is incorrect.

Ex: thread 1 does a check, pauses, thread 2 sets a value to null, thread 1 resumes and uses the null value its check was supposed to guard against. Solution: simply add a lock wrapping both the check and the use.

**Order-violation bugs:** the desired order between two (groups of) memory accesses is flipped - i.e. A should always execute before B, but the order is not enforced during execution. Solution: condition variables or semaphores.

**Conclusion:** these bugs should be accounted for in automated code-checking tools.

## Deadlock Bugs

**Deadlock** occurs when thread 1 is holding lock 1 and waiting for thread 2 to release lock 2, while thread 2 holds lock 2 and is waiting for thread 1 to release lock 1.
![Screenshot 2024-05-09 at 10.16.14 AM](https://hackmd.io/_uploads/S1uJHK5fA.png)

Deadlock may occur because circular dependencies arise naturally in large code bases. It may also occur due to the nature of encapsulation: modularity does not mesh well with locking.

### Conditions for Deadlock

- **Mutual exclusion:** threads claim exclusive control of resources they require
- **Hold-and-wait:** threads hold resources allocated to them while waiting for additional resources
- **No preemption:** resources cannot be forcibly removed from threads that are holding them
- **Circular wait:** there exists a circular chain of threads such that each thread holds one or more resources being requested by the next thread in the chain

### Prevention

The most practical prevention technique is to write your locking code so that you never induce a circular wait.

**Total ordering:** a strict ordering of lock acquisition that ensures no cyclic wait arises. This may be hard to achieve in a system with more than two locks.

**Partial ordering:** an ordering imposed only among related subsets of locks, rather than across all of them.

Another method is a **prevention lock**, which ensures that once it is acquired, no untimely thread switch can occur.

- The limitation is that this requires us to know exactly which locks must be held and to acquire them ahead of time; it also decreases concurrency.

Another method is `trylock()`, which grabs the lock if it is available, and otherwise returns an error so the acquisition can be retried later:

```c=
top:
    pthread_mutex_lock(L1);
    if (pthread_mutex_trylock(L2) != 0) {
        pthread_mutex_unlock(L1);
        goto top;
    }
```

A new problem arises, called **livelock**: two threads both repeatedly attempt this sequence and repeatedly fail to acquire both locks.
This is easily mitigated with a small random delay before retrying. Encapsulation is still an issue, though: the first lock acquisition could be buried in a routine, which makes the `goto top` difficult to make work as intended. Note that this method is not the OS preempting; the code gives up ownership gracefully.

The final prevention technique is to avoid the need for mutual exclusion at all. The code would have to be redesigned to be **lock-free** or **wait-free**.

- e.g. create a function that repeatedly tries to update the value to the new amount using compare-and-swap

![Screenshot 2024-05-09 at 10.44.33 AM](https://hackmd.io/_uploads/B1nYjF5MA.png)

#### Deadlock Avoidance via Scheduling

**Avoidance** is sometimes preferable. It requires global knowledge of which locks the various threads might grab during their execution; you then schedule the threads in such a way that there will be no contention.

![Screenshot 2024-05-09 at 10.47.33 AM](https://hackmd.io/_uploads/BJeB2Kcf0.png)

![Screenshot 2024-05-09 at 10.47.56 AM](https://hackmd.io/_uploads/SJv8nF5zC.png)

A famous approach is Dijkstra's Banker's Algorithm. It is not widely used, because it is only useful in limited environments and can limit concurrency.

#### Detect and Recover

The final solution is to detect when a deadlock has occurred and take some action to undo it, e.g. reboot.

# Resource Deadlock Avoidance

If we ensure that the conditions for deadlock never occur, then deadlock is impossible.
However, there are common situations in which:

- mutual exclusion is fundamental
- hold-and-block is inevitable
- preemption is unacceptable
- the resource dependency networks are imponderable

We have gone over solutions for these, but there are cases where new issues arise, e.g. if:

- main memory is exhausted
- we need to swap some processes out to secondary storage to free up memory
- swapping processes out involves creating new I/O request descriptors, which must be allocated from main memory

These problems result in the exhaustion of a critical resource, and the system ends up in a state where some processes would free up resources if they could complete, but need more resources in order to complete. The system is in a resource-depleted state. Systems are commonly designed with **deadlock avoidance** for this: keep track of free resources and refuse to grant resource requests that would exhaust them.

### Reservations

Because a declined request may be hard to handle gracefully, it is common for a process to reserve its resources before it actually needs them. This approach can be taken in many places:

- memory reservations, to determine whether memory would become over-taxed
- refusing to create new files when file system space gets low
- refusing to create new processes if we find ourselves thrashing due to pressure on main memory
- refusing to create or bind sockets when network traffic saturates our service level agreement

This failable-request method of reservation gives the program the opportunity to consider and avoid a resource-exhaustion deadlock.

### Over-Booking

It is often considered safe to grant somewhat more reservations than we actually have the resources to fulfill. The reward for overbooking is that we can get more work done with the resource; the danger is that there might be a demand we cannot gracefully handle.

- e.g. airlines will overbook a flight and offer cash and free trips to anyone willing to take a later flight
- in operating systems, the notion of killing random processes is so abhorrent that most operating systems simply refuse to overbook; they may even under-book

### Dealing with Rejection

Some programs are built to deal with resource rejection:

- a simple program might log an error message and exit
- a stubborn program might continue retrying the request
- a more robust program might return errors for requests that cannot be processed, but continue trying to serve new requests
- a more civic-minded program might attempt to reduce its resource use, perhaps lowering the number of requests it can serve

Ultimately, the program tries to manage the situation in the most graceful way possible.

# Health Monitoring and Recovery

Suppose the system has locked up. You would need to:

- identify all of the blocked processes
- identify the resource on which each process is blocked
- identify the owner of each blocking resource
- determine whether the implied dependency graph contains any loops

It may not be possible to identify the owners of all the resources, or even all the resources. Worse, the system may not actually be blocked, just waiting on a message. Even if we are sure it is a deadlock, we may not know the best course of action to get the system working properly again. Formal deadlock detection in real systems is difficult to perform, inadequate for diagnosing most hangs, and does not enable us to fix the problem.

### Health Monitoring

Health can be evaluated in several ways:

- an internal monitoring agent watching message traffic or a transaction log
- asking clients to submit failure reports to a central monitoring service
- having each server send periodic heart-beat messages to a central health-monitoring service
- having an external health-monitoring service send periodic test requests to the service being monitored

Weaknesses of these include:

- heart-beat messages can only tell us that the node and application are still up and running
  They cannot tell us whether the application is actually serving requests.
- clients or an external health-monitoring service can determine whether the monitored application is responding to requests, but this does not mean that some other requests have not deadlocked or otherwise wedged
- an internal monitoring agent might monitor logs or statistics to determine that the service is processing requests at a reasonable rate, but if the internal monitoring agent itself fails, it may not be able to detect and report errors

Many systems use a combination of these methods:

- the first line of defense is an internal monitoring agent that closely watches key applications to detect failures and hangs
- if the internal monitoring agent is responsible for sending heart-beats to a central monitoring agent, a failure of the internal monitoring agent will be noticed by the central monitoring agent
- an external test service that periodically generates test transactions provides an independent assessment that can include external factors (switches, load balancers, network connectivity) that would not be tested by the internal and central monitoring services

### Managed Recovery

Once a service has been determined to be hung or failed:

- the software should be designed so that any process in the system can be killed and restarted at any time
  When it restarts, it should be able to reestablish communication with the other processes and resume work with minimal disruption.
- the software should be designed to support multiple levels of restart: warm start, cold start, reset, and reboot
- the software might also be designed for a progressively escalating scope of restarts: restart a single process, restart a group, restart all software on a single node, restart a group of nodes, restart the entire system

You would want to escalate if, for example:

- process A failed as a result of an incorrect request received from process B
- the operation that caused process A to fail is still in the database
- the operation that caused the failure may have been mirrored to other systems

### False Reports

Suppose the monitoring service has not received a heart-beat from process A:

- it might mean that process A's node has failed
- it might mean that process A has failed
- it might mean that process A's system is loaded and the heart-beat message was delayed
- it might mean that a network error prevented or delayed the delivery of a heart-beat message
- it might mean there is a problem with the central monitoring service

Recovery might involve a great deal of network traffic and system activity that we do not want to trigger unless we are certain:

- the best option would be for a failing system to detect its own problems, inform its partners, and shut down cleanly
- if the failure is detected by a missing heart-beat, it may be wise to wait until multiple heart-beat messages have been missed before declaring the process to have failed
- to distinguish a problem with a monitored system from a problem in the monitoring infrastructure, we might want to wait for multiple other processes/nodes to notice and report the problem

The trade-off is:

- if we do not take the time to confirm suspected failures, we may suffer unnecessary service disruptions
- if we mis-diagnose the cause of the problem, we may make the problem worse
- if we wait too long before
initiating fail-overs, we prolong the service outage

Therefore we need tuned **mark-out thresholds** and complex decision algorithms to filter and reconcile potentially conflicting reports and deal with them effectively.

### Other Managed Restarts

**Non-disruptive rolling upgrades:** when a system is capable of operating without some of its nodes, we can take nodes down, upgrade each to a new software release, and then reintegrate them. This comes with two risks:

- the new software must be upward compatible with the old software, so that new nodes can interoperate with old ones
- if the rolling upgrade does not seem to be working, there needs to be an automatic fall-back option to return to the previous working release

**Prophylactic reboots:** restart the system at a regular interval (e.g. every few hours or days). This is a solution for software systems that become slower and more error-prone the longer they run. If a system can continue operating in the face of node failures, it should be fairly simple to shut down and restart nodes one at a time, on a regular schedule.

# Synchronized Methods

Synchronized methods enable a simple strategy for preventing thread interference and memory consistency errors: if an object is visible to more than one thread, all reads and writes to that object's variables are done through synchronized methods. (An important exception: final fields, which cannot be modified after the object is constructed, can be safely read through non-synchronized methods once the object is constructed.) This strategy is effective, but can present problems with liveness.

- It is not possible for two invocations of synchronized methods on the same object to interleave; all other threads suspend execution (block) until the first thread is done with the object.
- Exiting a synchronized method automatically establishes a happens-before relationship with any subsequent invocation of a synchronized method for the same object.
This guarantees that changes to the state of the object are visible to all threads.

To make a method synchronized, simply add the `synchronized` keyword to its declaration:

```java=
public class SynchronizedCounter {
    private int c = 0;

    public synchronized void increment() {
        c++;
    }

    public synchronized void decrement() {
        c--;
    }

    public synchronized int value() {
        return c;
    }
}
```

# Intrinsic Locks and Synchronization

Intrinsic locks play a role in both aspects of synchronization: enforcing exclusive access to an object's state, and establishing the happens-before relationships that are essential to visibility.

### Locks in Synchronized Methods

When a thread invokes a synchronized method, it automatically acquires the intrinsic lock for that method's object and releases it when the method returns. The lock release occurs even if the return was caused by an uncaught exception.

### Synchronized Statements

Another way to create synchronized code is with synchronized statements. Unlike synchronized methods, synchronized statements must specify the object that provides the intrinsic lock:

```java=
public void addName(String name) {
    synchronized(this) {
        lastName = name;
        nameCount++;
    }
    nameList.add(name);
}
```

Synchronized statements are also useful for improving concurrency with fine-grained synchronization: instead of using synchronized methods, or otherwise using the lock associated with `this`, we can create objects solely to provide locks.

### Reentrant Synchronization

Recall that a thread cannot acquire a lock owned by another thread, but it can acquire a lock that it already owns. Allowing a thread to acquire the same lock more than once enables reentrant synchronization. Without reentrant synchronization, synchronized code would have to take many additional precautions to avoid having a thread cause itself to block.