:::info
Announcements
- Help hours 3:45-5:15 in L037.
- Quiz 3 Monday
- A6 graded, A7/A8 back by Sunday
- A9 due following Monday (May 5)
:::
:::success
Today
- [Quiz 3 logistics](https://cs.wellesley.edu/~cs251/s25/notes/quiz3.html)
- Memory safety/management
    - What is memory safety? Why care?
    - What languages are memory safe?
    - How do safe languages handle memory?
- Automatic memory management
    - Automatic reference counting
    - Tracing garbage collection
:::
For our last topic of the semester, we'll focus on an important idea across programming languages: memory safety and memory management.
# Memory safety
## Press Release: Future Software Should Be Memory Safe
In 2024, the Biden White House released the following press release: [Press Release: Future Software Should Be Memory Safe](https://bidenwhitehouse.archives.gov/oncd/briefing-room/2024/02/26/press-release-technical-report/).
There has been a massive push in recent years for software systems to use **memory safe programming languages**. As we wrap up this first semester of studying programming languages, I want us all to understand what this means!
## Definition of memory safety
:::info
**Memory safety** is the property of a program that says that references to data must always point to data of the correct size and type, owned by that particular program.
:::
Said another way, a **memory safe** program does not contain memory errors: accessing memory that it should not, or attempting to access memory of the wrong type.
One simple memory error would be declaring an array `a` of size 3, then trying to access `a[4]`.
A *safe* language might not rule this out via type checking, but it will fail with a specific dynamic error, such as an array-out-of-bounds exception. In an *unsafe* language, like C, accessing `a[4]` might simply return some adjacent value in memory.
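For example, in Python (a safe language), this exact mistake fails with a dynamic check instead of silently reading adjacent memory:
```python
a = [1, 2, 3]    # a list of size 3
try:
    a[4]         # out-of-bounds access
except IndexError as e:
    print(e)     # "list index out of range": a dynamic error, not silent corruption
```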
:::info
A **programming language** is memory safe if it rules out memory unsafe programs (either by static analysis with the type system, or dynamic checks).
:::
Most languages you have used are memory safe: Racket, OCaml, Python, and Java are all memory safe.
The two primary languages that are **not** memory safe are C and C++. The list of unsafe languages is much shorter than the list of safe ones, but these languages are still in widespread use!
## Security implications of manual memory management
In computer security, there is a concept called ["CVEs", or Common Vulnerabilities and Exposures](https://www.cve.org/About/Overview).
CVEs are essentially a public, tracked list of known security vulnerabilities (bugs that could be exploited).
:::success
What percentage of CVEs listed by Microsoft do you think are from memory safety issues?
:::
<details>
<summary>(Think, then click!)</summary>
<br>
The answer is [around 70%](https://msrc.microsoft.com/blog/2019/07/a-proactive-approach-to-more-secure-code/)!
That's a huge number of bugs that could have been avoided by using a memory safe language.
</details>
<br>
# Memory management in general
To write almost any computer program that runs over a period of time, programmers need to consider how to manage memory use within the program. The phrase "memory management" typically refers to *dynamic* memory: memory that is requested while the program is running (as opposed to static memory, such as global variables, which is pre-allocated by the compiler or interpreter before the program runs). **Dynamic memory** primarily means memory on the heap.
In *any* program, there are 3 steps a programmer needs to do to use dynamic memory:
1. Allocate the memory needed for the specific size of object required.
2. Use that allocated memory for some period (e.g., initializing it, doing further calculations or computation with it).
3. Release, or free, the allocated memory when the task is done and the memory is no longer needed.
In *most* programming languages, like Racket, OCaml, Java and Python, steps (1) and (3) are done automatically for the programmer by the language implementation. In low-level languages like C and C++, the programmer must do all 3 steps manually.
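As a rough illustration in Python (`sum_squares` is a made-up function for this sketch), only step (2) is ever visible in the source code:
```python
def sum_squares(n):
    data = [i * i for i in range(n)]  # (1) allocation happens implicitly; the runtime picks the size
    return sum(data)                  # (2) use the memory
    # (3) the runtime frees `data` automatically once no references remain

print(sum_squares(10))  # 285
```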
There are a few different wide buckets that break down how these different languages and settings approach this problem.
## Explicit/manual memory management (e.g., C, C++)
When you learn to program in the C programming language (e.g., in CS240), you learn that you need to manually manage dynamic memory use in your program. We will mostly leave this to CS240 to cover, but for our purposes in CS251, just know that steps (1) and (3) involve more explicit programmer control in an unsafe language.
## Implicit/automatic memory management: "garbage collection" (e.g., Java, Python, most high-level languages)
In high-level languages like Racket, OCaml, Java, and Python, the **language implementation itself** handles allocating the right size of memory and freeing memory that is no longer in use by the program.
### Intuition for types of automatic memory management
The key idea of automatic memory management is to **not** make the programmer track the size of data and when data should be freed. Instead, the language implementation itself handles these details.
For example, consider a `Node` example in Python-ish pseudocode:
```python
def make_node():
    n = new Node()            # Don't need to worry about size
    result = n.do_something()
    return result             # No need to free the Node memory
```
:::success
VOTE: This brings up the core question in this lecture: when is it safe for the programming language to automatically free allocated memory?
Note that there is more than one reasonable/correct answer here!
1. When the function in which it was created returns.
2. When the thread in which it was created ends.
3. When the programmer can no longer reference it.
4. When there are no more pointers to it.
5. After the last reference to it in the code.
:::
<details>
<summary>(Think, then click!)</summary>
<br>
In class, we saw a range of answers, with most folks choosing 3, 4, or 5.
The two answers that **do not work** are:
:::danger
(1) When the function in which it was created returns.
(2) When the thread in which it was created ends.
:::
</details>
<br>
We can't just free memory when the function returns. To see why, consider this modification to the code:
```python
def call_make_node():
    l = []
    n = new Node()
    l.append(n)
    return l
# The programmer can still use the Node after call_make_node returns
```
`call_make_node` returns, but other functions can continue to use that memory!
Even though we don't *return* `n` directly, the `Node` data is still potentially being used because it was added to a list data structure that the function returns.
We also can't free data when the thread in which it was created ends for similar reasons: the data can be shared between threads, so the creator thread ending does not mean it is safe to free the data!
:::info
In the list from before, options (3) and (4) are the potentially-correct answers: they correspond to **tracing garbage collection** (3) and **reference counting** (4). Option (5) somewhat corresponds to Rust's borrow checking, but how this is accomplished is more complicated than we have time to go into today.
1. When the function in which it was created returns.
2. When the thread in which it was created ends.
3. **Correct (tracing GC):** When the programmer can no longer reference it.
4. **Correct (reference counting):** When there are no more pointers to it.
5. **Right idea for Rust, but needs more detail:** After the last reference to it in the code.
:::
There are two historically common flavors of automatic memory management we'll cover in today's lecture:
1. Tracing garbage collection (e.g., Java, C#).
2. Automatic reference counting (e.g., Swift, Objective-C, and Python, which primarily uses reference counting with a tracing GC as backup).
Some languages, like Rust, allow some forms of garbage collection (e.g., automatic reference counting) while also handling some of the manual components of explicit memory management for the programmer via the "borrow checker" (we'll cover this next week if there is time).
# Automatic reference counting
The overall idea of reference counting is that each dynamic allocation in the heap also tracks a bit of extra metadata: the number of incoming references to that object. The key idea is that when the count gets to 0, the language implementation can free that memory.
This proceeds as follows: the language implementation tracks a "root set" of references outside the heap that reference heap data. This includes global variables and local variables, which are stored in the stack frame of the function call where they are defined.
When a new heap allocation is created, the reference count is automatically set to 1 (assuming the data is stored in some variable). When a new reference is created (e.g., by passing the data into a new function call) the count is incremented. When a reference is destroyed (e.g., by a function call returning), the reference count is decremented.
In short, automatic reference counting tracks the number of incoming edges in the reference graph, and frees an object when its count reaches 0.
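To make the bookkeeping concrete, here is a minimal sketch in Python. The `RefCounted` class and its `incref`/`decref` methods are illustrative names for this sketch only; real implementations do this invisibly, inside the language runtime:
```python
class RefCounted:
    """An illustrative heap object that carries its own reference count."""
    def __init__(self, name):
        self.name = name
        self.rc = 1                        # created with one incoming reference

    def incref(self):                      # a new reference is created
        self.rc += 1

    def decref(self):                      # an existing reference is destroyed
        self.rc -= 1
        if self.rc == 0:
            print(f"freeing {self.name}")  # count hit 0: safe to free

node = RefCounted("n")  # RC = 1: stored in a local variable
node.incref()           # RC = 2: e.g., passed into a function call
node.decref()           # RC = 1: that function call returned
node.decref()           # RC = 0: last reference destroyed, memory freed
```
In CPython, you can observe real reference counts with `sys.getrefcount(obj)` (the reported count is one higher than you might expect, because the call itself temporarily creates a reference).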
:::success
Can you think of an example of a reference count where this strategy does not free data that could be freed?
:::
<details>
<summary>(Think, then click!)</summary>
<br>
In class, I used the following (new!) analogy: imagine you are tubing down a river with friends (as a midwesterner, this is close to my heart). Maybe you have a tube that holds your snacks: consider that tube the root set. If you and one friend are holding each other's tubes, you might still not be reachable from the snack tube! We can have **disconnected cycles** within a memory graph, which cause issues for automatic reference counting.
This happens when there is a cycle in the reference graph, but the entire cycle is no longer reachable from the root set!
</details>
<br>
## The problem of cycles in automatic reference counting
Consider how reference counting works on the following program modeling linked-list nodes:
```python
def nodes():
    n1 = new Node() # n1 RC = 1
    n2 = new Node() # n2 RC = 1
    n1.next = n2    # n2 RC = 2
    n2.next = n1    # n1 RC = 2
    return          # n1, n2 RC = 1, not 0
```
Note that when the function returns, the local variables `n1` and `n2` are destroyed because the call stack frame is popped. This decrements each RC by 1, to 1. However, without further action, this does not bring the reference count to 0, so this data is never freed!
:::danger
High-level note: cycles in the reference graph can cause these memory leaks in automatic reference counting! These are sometimes called "retain cycles". Alexa has gotten this as a software engineering interview question.
:::
In practice, languages like Python have additional mechanisms to try to find and free these cycles (like intermittently running a tracing GC).
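For example, in real (CPython) Python, the built-in `gc` module's cycle detector reclaims a retain cycle that reference counting alone cannot:
```python
import gc

class Node:
    def __init__(self):
        self.next = None

def nodes():
    n1 = Node()
    n2 = Node()
    n1.next = n2
    n2.next = n1     # cycle: each Node keeps the other's count above 0

nodes()              # both Nodes are now unreachable from the root set
print(gc.collect())  # the cycle detector reports the unreachable objects it freed
```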
:::info
Note that while retain cycles can cause worse memory utilization, they do not actually create any **memory safety** problems. In other words, they cause the collector to not collect all possible data, but they don't free data that should not be freed.
:::
### Efficiency of automatic reference counting
- On creating an object: $O(1)$ extra space (~a few extra ints) per object.
- On making/deleting references: local, $O(1)$ increment/decrement.
# Tracing garbage collection
Tracing garbage collection uses more complicated graph algorithms to get around this problem of collecting cycles. Instead of each object always tracking its incoming edges, a tracing GC periodically traces everything reachable from the root set using a graph traversal (either BFS or DFS works).
This is what you learned in CS111/CS230 as "garbage collection" for Java/Python!
The core algorithm is:
1. Starting at the root set, traverse the reference graph to get the reachable set of "live" objects.
2. Free objects that are not reachable ("dead").
## Mark and sweep (simple, but inefficient)
One core algorithm, called "mark and sweep", proceeds as follows: everything in the root set is marked reachable. Then, in the "mark" phase, the GC traverses the reference graph, setting a few metadata bits to indicate that an object is still "live". In the "sweep" phase, the GC does a linear scan of the entire heap, freeing any data that was not marked as live.
The downside of all tracing GCs is that they typically require the runtime to periodically "stop the world" while garbage collection takes place.
### Mark and sweep details
The simplest tracing GC implementation is mark-and-sweep.
1. Mark: keep an extra few bits of metadata on each heap block. Traverse starting at the root set, marking blocks that you visit.
2. Sweep: do a second, linear pass over the entire heap, freeing every block not marked in the previous step.
The downside of this simple implementation is that it requires traversing the entire heap, instead of just live objects. The runtime of mark is $O(|L| + |E|)$, where $L$ is the set of live objects and $E$ is the edges (references) between them. The runtime of sweep is worse: $O(|H|)$, where $H$ is the entire heap.
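Here is a toy mark-and-sweep in Python. The dict-of-edges heap model is made up for this sketch; a real GC works over actual heap blocks:
```python
# Model heap: block name -> names of blocks it references.
heap = {
    "A": ["B"],   # A references B
    "B": [],
    "C": ["C"],   # C references only itself: an unreachable cycle
    "D": ["B"],
}
root_set = ["A", "D"]

def mark(heap, roots):
    marked = set()
    stack = list(roots)
    while stack:                   # traversal visits only live blocks: O(|L| + |E|)
        block = stack.pop()
        if block not in marked:
            marked.add(block)
            stack.extend(heap[block])
    return marked

def sweep(heap, marked):
    for block in list(heap):       # linear scan over the *entire* heap: O(|H|)
        if block not in marked:
            del heap[block]        # free every unmarked block

sweep(heap, mark(heap, root_set))
print(sorted(heap))                # ['A', 'B', 'D']: the cycle at C was collected
```
Note that unlike reference counting, this collects the self-cycle at `"C"` with no special handling.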
### Semi-space GC: more efficient, used in practice
The high-level idea of more efficient garbage collectors, semi-space collectors, is this:
1. Find live objects via the same root set traversal idea as before.
2. Copy all live objects, eliminating any space between objects.
The simplest semi-space GC has such a strange idea that at first it seems like it would not work:
1. Divide the heap in half.
2. Use only the first half. Then run a moving GC step, copying and condensing live objects into the second half.
3. Consider the entire first half free.
To me, this idea is like deciding you don't like one room's decorations in your house, so you buy a new house and move all your furniture there, then burn down your old house!
It turns out that because this only uses twice the memory as before, a single constant multiplicative factor, this is often a reasonable tradeoff to make in practice.
This is especially true because the runtime for collection is now conceptually $O(|L|)$ where $L$ is the set of live objects, instead of $O(|H|)$ where $H$ is the entire heap.
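Here is a toy version of the copying step in Python; the `collect` helper and the list-index "addresses" are made up for this sketch (each object just holds the addresses it references):
```python
def collect(from_space, roots):
    """Copy everything reachable from `roots` into a fresh, compact to-space."""
    to_space = []
    forward = {}                       # forwarding table: old address -> new address

    def copy(addr):
        if addr not in forward:        # copy each live object exactly once
            forward[addr] = len(to_space)
            to_space.append(None)      # reserve the new slot before recursing (handles cycles)
            to_space[forward[addr]] = [copy(a) for a in from_space[addr]]
        return forward[addr]

    new_roots = [copy(r) for r in roots]
    return to_space, new_roots         # the old from-space can now be reused wholesale

# Object 0 references 1; object 2 references 0; objects 2 and 3 are unreachable.
from_space = [[1], [], [0], []]
to_space, roots = collect(from_space, roots=[0])
print(to_space, roots)                 # [[1], []] [0]: only live objects were touched
```
The work done is proportional to the live objects copied, which is where the $O(|L|)$ runtime comes from.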
### Generational collectors
The following tidbit is not in scope for the quiz, but is interesting!
More complicated semi-space collectors take advantage of the adage that "most objects die young". This captures the idea that many objects in a program are only used briefly, while a small proportion might be used for almost the entire duration of the program. "Generational" GCs take advantage of this by further subdividing the heap into different areas for objects expected to live for different durations, and collecting from them at varying frequencies. The "nursery" semi-space is for new objects, which are assumed to die more quickly (dark! :skull:).
Java and OCaml both use generational tracing garbage collectors.
# Group exercise
For these two OCaml programs:
a. Draw a graph of what the heap would look like throughout the program. Assume all `Node` data is on the heap.
b. Determine which data would be garbage collected, and which data would still be live after `f` returns.
```ocaml
(* Program 1 *)
type tree = Leaf of int | Node of tree * tree

let f x =
  let t1 = Node (Leaf 1, Leaf 2) in
  let t2 = Node (Leaf 3, Leaf 4) in
  let t3 = Node (t1, t2) in
  let t4 = Node (Leaf 5, Leaf x) in
  t3

(* Program 2 *)
type node = {
  value: int;
  mutable next: node option
}

let f x y =
  let a = { value = x; next = None } in
  let b = { value = y; next = Some a } in
  a.next <- Some b;
  ()
```
**Solution**: Program 1
The tree referenced by `t3` and its subtrees and their nodes/leaves (referenced by `t1`, `t2`) all remain alive, as does `x`.
The subtree referenced by `t4`, as well as its `Leaf` for 5 and `Leaf` for `x`, are all garbage and would be collected sometime after `f` returns.
**Solution**: Program 2
Even though the records referenced by `a` and `b` point to each other, a tracing GC would identify that neither is reachable once `f` returns.
`x` and `y` would still remain alive, but both records created by `f` (as well as their contained `option` data) are all garbage and would be collected sometime after `f` returns.