12: Concurrency S25

Concurrency ------ :::success What is concurrency? ::: <details> <summary>Think, then click! </summary> Systems with multiple program executions happening "at the same time". </details> :::success What challenges are there with using traditional testing to find bugs in systems with concurrency? ::: <details> <summary>Think, then click! </summary> - The system is non-deterministic, so the same test could sometimes pass and sometimes fail. - Some bugs rely on subtle timing that may be hard to capture in a unit test. </details> # Example: Workday registration Consider a server that implements Workday course list and registration functionality. The server needs to be able to communicate with multiple clients simultaneously (e.g., two students accessing Workday's student view via their laptops at the same time). :::success What does the implementation on the server need to do to service multiple clients at the same time? ::: <details> <summary>Think, then click! </summary> Use concurrency: specifically, either: 1. Processes. Heavier weight, typically distinct memory spaces, managed by the OS. 2. Threads. Multiple threads can run in the same process, threads share a single memory space within a process, typically managed by the language. Threads would be the more common case for this scenario. For simplicity, we will consider the case of there being just two student clients, each with one thread assigned to handle that client. </details> :::success When is it fine for the two students to do operations at the same time, without any special changes to the server code? ::: <details> <summary>Think, then click! </summary> If both students are just observing (reading but not writing/modifying) data, then it is fine for both threads to access global data at the same time. Problems happen when one or more students try to modify global data at the same time! </details> Now consider two people trying to register for the same class in Workday, say, `class_id = 240`. Imagine the server had the following code, run by each independent thread: ```rust if (class_is_not_full(class_id)) { allow_student_to_enroll(class_id, student_id); } else { present_error_to_student(class_id); } ``` :::success Group discussion: What problem can happen when two threads execute this code around the same time? Write out the code that executes for each student, line by line. Assume one student has `student_id = A` and the other has `student_id = B`. These would be stack variables that are distinct per thread. Assume there are shared (global or heap) variables `current_enrollment` and `cap` defined per class. ::: <details> <summary>Think, then click! </summary> Assume `current_enrollement` is 17 and cap is `18`. Both threads could get to the line `class_is_not_full` and get the answer `true`. Both threads then proceed to call `allow_student_to_enroll`, and we end up with 19 students in a course capped at 18! Student question: would the `current_enrollment` then be 18, or 19? Answer: it depends on how the code is specifically implemented! Could be either, depending on how `allow_student_to_enroll` is defined. </details> We need to add in some sort of synchronization or enforcement of mutual exclusion for these lines of code: ```rust if (class_is_not_full(class_id)) { allow_student_to_enroll(class_id, student_id); } else { ... ``` Our goal should be that at any moment in time, **at most one** thread is executing these lines of code. By convention, we call these areas where we desire no overlapping threads to be the [**"critical section"**](https://en.wikipedia.org/wiki/Critical_section). # Strategies for protecting critical sections The idea of protecting critical sections is something you have likely seen before, maybe with the name "locking". The rough idea for our example would be: ```rust get_lock(class_id).lock(); // loops unless thread with lock if (class_is_not_full(class_id)) { allow_student_to_enroll(class_id, student_id); success = true } get_lock(class_id).unlock(); if !success { ... ``` In many languages, locking is implemented with a "mutex", which stands for "mutual exclusion". In the rest of this lecture, we will consider algorithms for actually implementing mutual exclusion. We will assume: 1. Thread can access shared global information, and that thread actions can happen simultaneously and with any ordering between different threads' actions. 2. An abstract notion of the critical section, where each thread can be either in the critical section (CS), not in the critical section, or trying to enter the CS but waiting. # First attempt at a mutex algorithm Here is a first, initial attempt at a mutual exclusion algorithm. #### Alg 1 ```rust= if the other thread is in the CS: while (other_in_CS()) {} // Loop until CS is clear try again else: enter CS do work in CS leave CS ``` - On line 1, there is a transition to attempt to enter the critical section - Between lines 2 and 4, there is a transition to enter the critical section after waiting. - Between lines 6 and 7, there is a transition to leave the critical section. - Otherwise, there is a "do nothing" transition. Now, let's model Alg 1 in Alloy, which we'll flesh out more on Thursday: ```java // Model for Alg 1 abstract sig State {} one sig Uninterested, Waiting, InCriticalSection extends State {} // Uninterested: the student is just viewing data, not trying to enter the critical section // InCriticalSection: the thread is actually making some critical changes for the student // Waiting: another thread was already in the critical section when this one wanted to enter abstract sig Step {} one sig Attempt, EnterAfterW, Leave, Nothing extends Step {} abstract sig Thread { var state: State, var whichStep: Step, // step taken: per thread => multiple threads can take steps at once } one sig ThreadA, ThreadB extends Thread {} pred init { // Start all threads are uninterested all t: Thread | t.state = Uninterested } fact validTraces { init // outside of any temporal operator: applies to the first step (0) // Fill in validTraces always // Difference: *all* threads simultaneously get to choose what to do (all t: Thread | attempt[t] or enterCSAfterWaiting[t] or leave[t] or doesNothing[t] ) } run { some t: Thread | eventually leave[t] } ----- Transition predicates, for concurrency: choosing to make per thread // Different modeling style: we aren't globally talking about "state" and "state' " // Splitting the obligation to talk about the variable sig state/state' over each predicate // This works because validTraces says *all* threads take a transition, not some thread pred attempt[t: Thread] { -- Preconditions for this to apply t.state = Uninterested -- Actions that it does on the system // No threads at all currently in the critical section no (state.InCriticalSection) => // TODO // Fill this in t.state' = InCriticalSection some (state.InCriticalSection) => // TODO t.state' = Waiting t.whichStep = Attempt } run attempt for 5 pred enterCSAfterWaiting[t: Thread] { -- Preconditions for this to apply t.state = Waiting no state.InCriticalSection // No other threads in CS -- Actions that it does on the system t.state' = InCriticalSection t.whichStep = EnterAfterW } pred leave[t: Thread] { -- Precondition t.state = InCriticalSection -- Action t.state' = Uninterested t.whichStep = Leave } // Different: instead of global "doNothing", specific "doesNothing (given thread)" pred doesNothing[t: Thread] { t.state' = t.state t.whichStep = Nothing } ``` When we run `attempt`, we see that we already have a bad trace! Both threads are initially uninterested, then both take the `attempt` transition from state 0 to state 1, putting both threads into the critical section at the same time. :::warning Whenever we see a fishy instance like this, we need to decide whether it is caused by one of two distinct things: 1. The algorithm/system we are modeling is fine, but our model is incomplete or under-constrained, and allows bad behavior that is not actually present in the real system. 2. The algorithm/system we are modeling **has a real bug**, which our model has correctly uncovered. Deciding between these two root causes is a skill that develops with practice. To decide, we need to look at the specific instance (and trace) and compare it to the real algorithm/system to see if we think that trace is actually possible (and problematic). Here, nothing in Alg 1 stops both threads from simultaneously entering the critical section! This means we are in a Scenario 2: a real bug that our model correctly spotted! ::: :::danger **Alg 1 is unsound** (soundness here: at most 1 thread in the CS at once)!! Alg 1 essentially copies the problem of mutual exclusion from the specific application code (e.g., `class_is_not_full` timing) to the mutex implementation code, without actually doing anything to force threads to not overlap. ::: # Next attempt: polite threads The core idea behind how to fix Alg 1 comes from how we as humans negotiate access to shared resources, for example, speaking time in a Zoom meeting. In a Zoom meeting with lots of opinions, if everyone just begins speaking whenever they have a thought (and momentarily no one else is talking), we would have a lot of overlapping speech. :::success How do humans solve this problem? ::: <details> <summary>Think, then click! </summary> We raise our hands before we speak! If we raise our hand then check that no one else has their hand raised, we can be confident that its now our turn to speak. </details> The real algorithm that implements one type of mutex is based on this idea: each thread gets an extra bit of state to model a *flag* that it can have raised when it is interested in entering the CS. Consider the updated pseudocode below: #### Alg 2 ```rust= while(true) { [state: uninterested] this.flag = true; [state: waiting] while(other.flag == true); [state: in critical section] // Do work in the critical section this.flag = false; } ``` - Between lines 2 and 4, there is a transition to raise the flag. - Between lines 4 and 6, there is a transition to enter the critical section. - Between lines 6 and 9, there is a transition to leave the critical section. - Otherwise, there is a "do nothing" transition. When a thread becomes interested in entering the critical section, it sets its flag bit to true. The thread waits to enter the CS until the other thread's flag is false. The thread then enters the CS, does work, then once done executing the CS work, the thread lowers its flag. This version works, but can deadlock, which we will demonstrate next class. To start us off, our new model begins like this: ```java // Start of model for Alg 2 abstract sig State {} one sig Uninterested, Waiting, InCriticalSection extends State {} // New set of steps: Raise and Enter abstract sig Step {} one sig Raise, Enter, Leave, Nothing extends Step {} abstract sig Thread { var state: State, var whichStep: Step, // type is boolean => instead have a new external sig that models a subset, "in" } one sig ThreadA, ThreadB extends Thread {} // Instead of a per-thread boolean, we model the set of threads with their // flags raised as a non-exclusive subset. var sig FlagsRaised in Thread {} ----- Transition predicates, for concurrency: choosing to make per thread pred raiseFlag[t: Thread] { -- Preconditions for this to apply t.state = Uninterested -- Actions that it does on the system t.state' = Waiting t in FlagsRaised' t.whichStep = Raise } ``` # Completing our model of Alg 2 Here is the final model we got to in class for Alg 2: ```java // Alg 2 abstract sig State {} one sig Uninterested, Waiting, InCriticalSection extends State {} abstract sig Step {} one sig Raise, Enter, Leave, Nothing extends Step {} abstract sig Thread { var state: State, var whichStep: Step, // flag raised type is boolean // => instead have a new external sig that models a subset, "in", FlagsRaised } one sig ThreadA, ThreadB extends Thread {} var sig FlagsRaised in Thread {} pred init { // Start all threads are uninterested all t: Thread | t.state = Uninterested no FlagsRaised } fact validTraces { init // outside of any temporal operator: applies to the first step (0) // Fill in validTraces always // Difference: *all* threads simultaneously get to choose what to do (all t: Thread | raiseFlag[t] or enter[t] or leave[t] or doesNothing[t] ) } //run { some t: Thread | eventually leave[t] } ----- Transition predicates, for concurrency: choosing to make per thread pred raiseFlag[t: Thread] { -- Preconditions for this to apply t.state = Uninterested -- Actions that it does on the system t.state' = Waiting t in FlagsRaised' t.whichStep = Raise } pred enter[t: Thread] { // Preconditions for this to apply t.state = Waiting no FlagsRaised - t // no t2:Thread | t2 != t and t2 in FlagsRaised // Actions that it does on the system t.state' = InCriticalSection t in FlagsRaised' // t in FlagsRaised' <=> t in FlagsRaised t.whichStep = Enter } // Just "run enter" does not work because it's inconsistent with init // Need an "eventually" to allow this to be the 2nd step run {some t: Thread | eventually leave[t] } for 5 // Different modeling style: we aren't globally talking about "state" and "state' " // Splitting the obligation to talk about the variable sig state/state' over each predicate // This will work because validTraces will say *all* threads take a transition, not some thread pred leave[t: Thread] { // Precondition t.state = InCriticalSection -- Action t.state' = Uninterested t.whichStep = Leave t not in FlagsRaised' } // Different: instead of global "doNothing", specific "doesNothing (given thread)" pred doesNothing[t: Thread] { t.state' = t.state t.whichStep = Nothing t in FlagsRaised' <=> t in FlagsRaised // Don't want: FlagsRaised' = FlagsRaised } // Safety property : no bad things happen: now passes! Fails for Alg 1 assert noOverlappingInCriticalSection { // Critical section should never have more than one thread in it // 4 different but equivalent ways to say the same thing. Any 1 would work // on its own. always { lone InCriticalSection.~state } always { lone state.InCriticalSection } always {not (ThreadA.state = InCriticalSection and ThreadB.state = InCriticalSection)} always {no disj t1, t2 : Thread | t1.state = InCriticalSection and t2.state = InCriticalSection} } check noOverlappingInCriticalSection ``` # Problem: deadlock Alg 2 now satisfies the safety property of never having overlapping threads in the critical section, but we suspect it might have another type of problem: what happens if both processes raise their flags at once? Both processes being unable to make progress would be a state of *deadlock*. That is the violation of a *liveness* property, rather than the violation of a safety property. :::info A **Safety** property: a bad thing never happens. Examples: a garbage collector never allows a use-after-free, a mutual exclusion algorithm never allows overlapping threads in the critical section. A **Liveness** property: a good thing eventually happens. A mutual exclusion algorithm eventually allows some interested thread to enter the critical section, e.g., prevents deadlock. ::: A deadlock state is one where no further useful transitions are possible; it is a violation of a liveness property. How can we ask Alloy to find out if there is a reachable deadlock state? There are two challenges: 1. How do we phrase the constraint, in terms of the transition predicates? 2. Given that we already allow a `doNothing` transition predicate, how to we determine whether a trace ends in a `doNothing` loop to satisfy the lasso property despite other transitions being possible (*not* a deadlock), vs a true deadlock where threads are *stuck* in `doNothing` without any possibility of process being made? We'll first just consider how we can force a given model to always make progress. ## Distinction: ensuring progress vs checking for deadlock If you just want to ensure that your model always can make progress, one strategy is to just add to your `fact`s that progress is eventually made. For example, for a ring election you might write something like this: ```java always { // Until the election is over, every process // will eventually get some update. not someProcessWon => all p: Process | eventually not doNothing[p] } ``` Or this, depending on your model: ```java always { // Until the election is over, every process // will eventually get some update. all p: Process | some p2: Process | processWon[p2] or eventually receiveUpdate[p] } ``` ## Finding deadlocked instances Another more explicit way to find whether a model can deadlock is to draw a distinct of *when* our transition predicates are possible, instead of just when they occur. We already had begun to split up each predicate into *preconditions* and *actions*. We can actually use this to separate out the *preconditions* of each useful, non-do-nothing transition predicate to help us find cases where the model gets stuck in a deadlock. We can then add the following check for deadlock: ```java assert noDeadlock {always (always (some t: Thread | canRaiseFlag[t] or canEnterCriticalSection[t] or canLeaveCriticalSection[t])) } check noDeadlock for 5 ``` The full model of Alg 2 including the check for deadlock: ```java abstract sig Step {} one sig Raise, Enter, Leave, Nothing extends Step {} abstract sig State {} one sig InCS, Waiting, Uninterested extends State {} abstract sig Thread { var state: State, var whichStep: Step, // step taken: per thread // each Alloy state: multiple threads can have // different steps } one sig ThreadA, ThreadB extends Thread {} // modeling flag raising: either boolean, or a subset // more Alloy style: as a subset, but not an exclusive one, // tells us: use "in" var sig FlagRaised in Thread {} pred init { all t: Thread | t.state = Uninterested } fact validTraces { init // apply in initial state always { (all t: Thread | raiseFlag[t] or doesNothing[t] or enterCS[t] or leave[t]) } } pred canRaiseFlag[t : Thread] { -- precondition t.state = Uninterested } ---- Transition predicates, for concurrency ------- pred raiseFlag[t: Thread] { canRaiseFlag[t] -- action t.state' = Waiting t in FlagRaised' -- bookkeeping: step t.whichStep = Raise } // per thread, this thread to do nothing pred doesNothing[t: Thread] { t.state' = t.state t.whichStep = Nothing // Reminder: talk about every var sig // but because concurrent model: just for this t t in FlagRaised <=> t in FlagRaised' } pred canEnterCS[t:Thread] { -- precondition t.state = Waiting // precondition: no other thread has the flag raised // ?? no FlagRaised - t } pred enterCS[t: Thread] { canEnterCS[t] -- actions that it does on the system t.state' = InCS // this thread's flag does not change // t in FlagRaised <=> t in FlagRaised' t in FlagRaised' -- bookkeeping, update which step t.whichStep = Enter } pred canLeave[t: Thread] { -- precondition t.state = InCS } pred leave[t: Thread] { canLeave[t] -- action that it does on the system t.state' = Uninterested t not in FlagRaised' // discussion: // depending on our model, this might be overconstrained // FlagRaised' = FlagRaised - t // Better in concurrent models to just talk about how t // (or whatever actor in your model) changes the variable // sigs, rather than fully specifying the variable sig // Before: was good to use +, -, & (sequential models) // Now (concurrent models): usually better to use "in", // "not in" or use . dot joins to just talk about pieces --bookkeeping t.whichStep = Leave } // Because we want to know if we actually are avoiding this // saftey issue, we'll describe the "bad thing" in an assert, // checking that it never happens // inCS (which is a State) // state (which is a relation in Thread -> State) assert noOverlapInCS { // more intesting model //all t: Thread | eventually { t.state = InCS } lone state.InCS ; // recall: one says exactly one //; lone says 0 or 1 // lone InCS.~state // no disj t1, t2 : Thread | ts.state = InCS and t2.state = InCS // if we did the check like above, it would check the // *initial* state. Remember: without temporal operators, // preds/asserts/facts only apply to the initial state. always { lone state.InCS } } check noOverlapInCS run { all t: Thread | eventually { t.state = InCS } } assert noDeadlock { always { some t: Thread | canRaiseFlag[t] or canEnterCS[t] or canLeave[t] } } check noDeadlock ``` # Fixing deadlock To fix this deadlock, we'd need to model the final bit of state used by a real algorithm for implementing mutual exclusion, the [Peterson Lock](https://en.wikipedia.org/wiki/Peterson%27s_algorithm). The idea is to have one extra bit of global state, called `turn`, such that if both threads have their flags raised at the same time, the one whose turn it is goes first. Here is one possible model for this: ```java abstract sig Step {} one sig Raise, Enter, Leave, Nothing extends Step {} abstract sig State {} one sig InCS, Waiting, Uninterested extends State {} abstract sig Thread { var state: State, var whichStep: Step, // step taken: per thread // each Alloy state: multiple threads can have // different steps } one sig ThreadA, ThreadB extends Thread {} // modeling flag raising: either boolean, or a subset // more Alloy style: as a subset, but not an exclusive one, // tells us: use "in" var sig FlagRaised in Thread {} one var sig Turn in Int {} fact { always { // remember: without the always, would only // apply to the initial state Turn = 0 or Turn = 1 } } pred init { all t: Thread | t.state = Uninterested } fact validTraces { init // apply in initial state always { (all t: Thread | raiseFlag[t] or doesNothing[t] or enterCS[t] or leave[t]) } } pred canRaiseFlag[t : Thread] { -- precondition t.state = Uninterested } ---- Transition predicates, for concurrency ------- pred raiseFlag[t: Thread] { canRaiseFlag[t] -- action t.state' = Waiting t in FlagRaised' -- bookkeeping: step t.whichStep = Raise Turn' = Turn } // per thread, this thread to do nothing pred doesNothing[t: Thread] { t.state' = t.state t.whichStep = Nothing // Reminder: talk about every var sig // but because concurrent model: just for this t t in FlagRaised <=> t in FlagRaised' Turn = Turn' } pred canEnterCS[t:Thread] { -- precondition t.state = Waiting // precondition: no other thread has the flag raised // ?? no FlagRaised - t or // incorporate turns (t = ThreadA and Turn = 1) or (t = ThreadB and Turn = 0) } pred enterCS[t: Thread] { canEnterCS[t] -- actions that it does on the system t.state' = InCS // this thread's flag does not change // t in FlagRaised <=> t in FlagRaised' t in FlagRaised' -- prevent starvation by switching Turn back and forth t = ThreadA => Turn' = 0 t = ThreadB => Turn' = 1 -- bookkeeping, update which step t.whichStep = Enter } pred canLeave[t: Thread] { -- precondition t.state = InCS } pred leave[t: Thread] { canLeave[t] -- action that it does on the system t.state' = Uninterested t not in FlagRaised' Turn' = Turn // discussion: // depending on our model, this might be overconstrained // FlagRaised' = FlagRaised - t // Better in concurrent models to just talk about how t // (or whatever actor in your model) changes the variable // sigs, rather than fully specifying the variable sig // Before: was good to use +, -, & (sequential models) // Now (concurrent models): usually better to use "in", // "not in" or use . dot joins to just talk about pieces --bookkeeping t.whichStep = Leave } // Because we want to know if we actually are avoiding this // saftey issue, we'll describe the "bad thing" in an assert, // checking that it never happens // inCS (which is a State) // state (which is a relation in Thread -> State) assert noOverlapInCS { // more intesting model //all t: Thread | eventually { t.state = InCS } lone state.InCS ; // recall: one says exactly one //; lone says 0 or 1 // lone InCS.~state // no disj t1, t2 : Thread | ts.state = InCS and t2.state = InCS // if we did the check like above, it would check the // *initial* state. Remember: without temporal operators, // preds/asserts/facts only apply to the initial state. always { lone state.InCS } } check noOverlapInCS run { all t: Thread | eventually { t.state = InCS } } assert noDeadlock { always { some t: Thread | canRaiseFlag[t] or canEnterCS[t] or canLeave[t] } } check noDeadlock ``` ## Distinction: ensuring progress vs checking for deadlock If you just want to ensure that your model always can make progress, one strategy is to just add to your `fact`s that progress is eventually made. For example, for a ring election you might write something like this: ```java always { // Until the election is over, every process // will eventually get some update. not someProcessWon => all p: Process | eventually not doNothing[p] } ``` Or this, depending on your model: ```java always { // Until the election is over, every process // will eventually get some update. all p: Process | some p2: Process | processWon[p2] or eventually receiveUpdate[p] } ```