:::info
Announcements
- Help hours 3:45-5:15 in L037.
:::

:::success
Today:
- Discuss **The Urgent Need for Memory Safety**.
- Zooming out from Alloy: memory management, so we can model it in Alloy next week.
  - Stack vs. heap
  - Manual memory management vs. memory-safe languages
  - When is it safe to free/reclaim memory?
- Start (more next class):
  - Automatic reference counting
  - Tracing garbage collection
:::

In this lecture, we'll take a break from Alloy to cover a topic important in systems programming in general: the idea of memory management. This will prepare us to model memory management in next week's assignment.

# Discussion: The Urgent Need for Memory Safety

We discussed in groups the press release [The Urgent Need for Memory Safety in Software Products](https://www.cisa.gov/news-events/news/urgent-need-memory-safety-software-products). We went over memory-safe vs. unsafe languages, the tradeoffs of each, and each group's mental model for how dynamic memory is handled in each case.

# Memory management in general

To write almost any computer program that runs over a period of time, programmers need to consider how to manage memory use within the program. The phrase "memory management" typically refers to *dynamic* memory, or memory that is requested while the program is running (as opposed to static memory, such as global variables, which are pre-allocated by the compiler or interpreter before the program runs).

In *any* program, there are 3 steps a programmer needs to take to use dynamic memory:
1. Allocate the memory needed for the specific size of object required.
2. Use that allocated memory for some period (e.g., initializing it, doing further calculations or computation with it).
3. Release, or free, the allocated memory when the task is done and the memory is no longer needed.

In *some* programming languages, like Java and Python, steps (1) and (3) are done automatically for the programmer by the language implementation. In low-level languages like C, the programmer must do all 3 steps manually. There are a few broad buckets that describe how different languages and settings approach this problem.

## Explicit/manual memory management (e.g., C, C++)

When you first learned to program in C (e.g., in CS240), that was likely your first time needing to manually manage all parts of using dynamic memory in your program via calls to `malloc` and `free`. The high-level idea is that a memory allocator library implements `malloc` and `free`, but the programmer is responsible for calling `free` on allocations that are no longer needed.

## Implicit/automatic memory management: "garbage collection" (e.g., Java, Python, most high-level languages)

In high-level languages like Java and Python, the language implementation itself handles allocating the right size of memory and freeing memory that is no longer in use by the program.

There are two historically common flavors of automatic memory management we'll cover in today's lecture (see the sketch below for the programmer-visible effect):
1. Tracing garbage collection (e.g., Java, C#).
2. Automatic reference counting (e.g., Python (usually), Objective-C).
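In either flavor, the programmer-visible effect is the same. Here is a minimal sketch in ordinary Python (the `Node` class and its fields are purely illustrative) of the 3 steps from above: only step (2) appears in the code.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

def make_node():
    n = Node(42)     # Step 1 is implicit: the runtime sizes and allocates the object.
    n.value += 1     # Step 2: use the allocated memory.
    return n.value   # Step 3 is implicit: the runtime reclaims the Node
                     # at some point after it is no longer needed.

print(make_node())   # 43
```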
Some languages, like Rust, allow some forms of garbage collection (e.g., automatic reference counting) while also allowing programmers to have some of the manual components of explicit memory management handled for them via the "borrow checker" (we'll cover this if there is time).

---

# Factors for choosing different memory management strategies

The main factors/tradeoffs are:
1. Efficiency (speed) of the code. Explicit management (e.g., C) tends to be faster than automatic, and automatic reference counting tends to be faster than tracing GC.
2. Memory utilization. Different strategies have different amounts of overhead in how much "extra" memory they require.
3. Most importantly (I would argue), how "safe" each strategy is: that is, how easy it is for the programmer to make mistakes that can be costly or violate security properties.

# Downsides of explicit/manual memory management (e.g., C)

In C, a programmer wanting to create and use a `Node` struct would write something like:

```C
void makeNode() {
    Node *n = (Node *)malloc(sizeof(Node));
    // Initialize and use n
    free(n); // Programmer has to remember to free
    return;
}
```

:::success
What can go wrong in the use of manual memory management here?
:::

<details>
<summary>(Think, then click!)</summary>
<br>

Many potential answers:
1. Easy to mess up the size (which could lead to buffer overflows in the extreme case!).
2. Could forget to free, an instance of a "memory leak". "Memory leak" terminology is used especially when it would no longer be possible for the programmer to free the memory, because the local variable storing the pointer is now gone (popped off the call stack).
3. Freeing doesn't actually clear the bytes (this is not really handled by any of these strategies alone, though it is exacerbated by other problems like buffer overflows).
4. Pointers are just integers/raw bytes treated as memory addresses: they do not have any special protection within the language.
5. Programmers could accidentally call free multiple times on the same pointer ("double free").
6. Programmers could use memory after having called free ("use after free").

</details>
<br>

## Security implications of manual memory management

In computer security, there is a concept called ["CVEs", or Common Vulnerabilities and Exposures](https://www.cve.org/About/Overview). CVEs are essentially a public, tracked list of known security vulnerabilities (bugs that could be exploited).

:::success
What percentage of CVEs listed by Microsoft do you think are from memory safety issues?
:::

<details>
<summary>(Think, then click!)</summary>
<br>

The answer is [around 70%](https://msrc.microsoft.com/blog/2019/07/a-proactive-approach-to-more-secure-code/).

</details>
<br>

The reading for today is a [2023 press release from the US Cybersecurity and Infrastructure Security Agency](https://www.cisa.gov/news-events/news/urgent-need-memory-safety-software-products) describing the urgent risk posed by memory safety issues in modern software.

# Details on automatic memory management

The key idea of automatic memory management is to **not** make the programmer track the size of data and when data should be freed. Instead, the language implementation itself handles these details. For example, consider our `Node` example in Python:

```python
def make_node():
    c = Node()                 # Don't need to worry about size
    result = c.do_something()
    return result              # No need to free the Node's memory
```
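In CPython, you can even watch the runtime reclaim the `Node` behind the scenes. A minimal sketch using the standard library's `weakref` module (a weak reference observes an object without keeping it alive; the exact timing shown here is CPython-specific, and `Node` is assumed to be an ordinary class):

```python
import weakref

class Node:
    def do_something(self):
        return 0

def make_node():
    c = Node()
    return weakref.ref(c)  # A weak reference does not keep c alive.

w = make_node()
print(w())  # None: CPython reclaimed the Node once make_node returned.
```

Exactly *when* a language implementation is allowed to do this is the question we take up next.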
:::success
This brings up the core question of this lecture: when is it safe for the programming language to automatically free allocated memory? Note that there is more than one reasonable/correct answer here!

1. When the function in which it was created returns.
2. When the thread in which it was created ends.
3. When the programmer can no longer reference it.
4. When there are no more pointers to it.
5. After the last reference to it in the code.
:::

<details>
<summary>(Think, then click!)</summary>
<br>

In class, we had groups vote on the answers, and got a wide range! The two answers that **do not work** are these:

:::danger
1. When the function in which it was created returns.
2. When the thread in which it was created ends.
:::

</details>
<br>

We can't just free memory when the function returns; consider this modification to the code:

```python
def call_make_node():
    c2 = make_node()
    # Use c2 after make_node returns

def make_node():
    c = Node()
    return c
```

`make_node` returns, but other functions can continue to use that memory!

:::info
Student question: can we just track what memory is returned by the function?
:::

Answer: no, not quite. Consider the following:

```python
def make_node_list():
    l = []
    make_node(l)
    # Use the Node via the list l

def make_node(l2):
    c = Node()
    l2.append(c)
    return
```

Even though we don't *return* `c`, the `Node` data is still potentially being used, because it was added to a list data structure created outside of the function.

We also can't free data when the thread in which it was created ends, for similar reasons: the data can be shared between threads, so the creator thread ending does not mean it is safe to free the data!

:::info
In the list from before, options (3) and (4) are the potentially-correct answers: they correspond to tracing GC (3) and reference counting (4). Option (5) somewhat corresponds to Rust's borrow checking, but how this is accomplished is more complicated than we have time to go into today.

1. When the function in which it was created returns.
2. When the thread in which it was created ends.
3. **Correct (tracing GC):** When the programmer can no longer reference it.
4. **Correct (reference counting):** When there are no more pointers to it.
5. **Right idea, but needs more detail:** After the last reference to it in the code.
:::

# Automatic reference counting

The overall idea of reference counting is that each dynamic allocation in the heap also tracks a bit of extra metadata: the number of incoming references to that object. The key idea is that when the count gets to 0, the language implementation can free that memory.

This proceeds as follows: the language implementation tracks a "root set" of references outside the heap that reference heap data. This includes global variables and any local variables, which are stored in the stack frame of the relevant function call where that local variable is defined. When a new heap allocation is created, the reference count is automatically set to 1 (assuming the data is stored in some variable). When a new reference is created (e.g., by passing the data into a new function call), the count is incremented. When a reference is destroyed (e.g., by a function call returning), the reference count is decremented.

In short, automatic reference counting tracks the number of incoming edges in the reference graph, and frees an object when its count reaches 0.
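CPython uses exactly this scheme, and exposes the per-object counts via `sys.getrefcount`. A minimal sketch (CPython-specific; note that passing the object as an argument temporarily adds one extra reference, which is why the counts below are one higher than you might expect):

```python
import sys

class Node:
    pass

n = Node()
print(sys.getrefcount(n))  # 2: the variable n, plus the argument to getrefcount.

m = n                      # Creating a reference increments the count...
print(sys.getrefcount(n))  # 3

del m                      # ...and destroying a reference decrements it.
print(sys.getrefcount(n))  # 2
```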
:::success
Can you think of an example, using the graph properties we've been examining frequently in Alloy, of a reference graph where this strategy does not free data that could be freed?
:::

<details>
<summary>(Think, then click!)</summary>
<br>

When there is a cycle in the reference graph, but the entire cycle is no longer reachable from the root set!

</details>
<br>

## The problem of cycles in automatic reference counting

Consider how reference counting works on the following program modeling linked-list nodes:

```python
def nodes():
    n1 = Node()    # n1 RC = 1
    n2 = Node()    # n2 RC = 1
    n1.next = n2   # n2 RC = 2
    n2.next = n1   # n1 RC = 2
    return         # n1, n2 RC = 1, not 0
```

Note that when the function returns, the local variables `n1` and `n2` are destroyed because the call stack frame is popped. This decrements each RC by 1, to 1. However, without further action, this does not bring the reference counts to 0, so this data is never freed!

:::danger
High-level note: cycles in the reference graph can cause these memory leaks in automatic reference counting! These are sometimes called "retain cycles". Alexa has gotten this as an interview question.
:::

In practice, languages like Python have additional mechanisms to try to find and free these cycles, but these mechanisms are heuristic and can slow down automatic reference counting.

:::info
Note that while retain cycles can cause worse memory utilization, they do not actually create any **memory safety** problems. In other words, they cause the collector to not collect all possible data, but they don't free data that should not be freed.
:::

### Efficiency of code

- On creating an object: $O(1)$ extra space (~a few extra ints) per object.
- On making/deleting references: local, $O(1)$ increment/decrement.
- Detecting cycles: more like ~$O(n)$ or worse.

# Tracing garbage collection

Tracing garbage collection uses more complicated graph algorithms to get around this problem of collecting cycles. Instead of each object always tracking its incoming edges, a tracing GC periodically traces everything reachable from the root set.

## Mark and sweep

One core algorithm, called "mark and sweep", proceeds as follows: everything in the root set is marked reachable. Then, in the "mark" phase, the GC traverses the reference graph, setting a few metadata bits to indicate that an object is still "live". In the "sweep" phase, the GC does a linear scan of the entire heap, freeing any data that was not marked as live.

The downside of all tracing GCs is that they typically require the runtime to "stop the world" in order for garbage collection to periodically take place. More on this next time!
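As a preview, here is a toy sketch of mark and sweep. The "heap" is modeled as a Python dict mapping object names to the set of names they reference (purely illustrative; a real collector traverses actual memory and interleaves with the running program):

```python
def mark_and_sweep(heap, roots):
    # Mark phase: traverse the reference graph from the root set,
    # marking every reachable object as live.
    live = set()
    worklist = list(roots)
    while worklist:
        obj = worklist.pop()
        if obj not in live:
            live.add(obj)
            worklist.extend(heap[obj])
    # Sweep phase: linear scan of the entire heap,
    # freeing anything that was not marked live.
    for obj in list(heap):
        if obj not in live:
            del heap[obj]

# b and c form a cycle that is unreachable from the root set:
# reference counting would leak them, but mark and sweep collects them.
heap = {"a": set(), "b": {"c"}, "c": {"b"}}
mark_and_sweep(heap, roots=["a"])
print(heap)  # {'a': set()}
```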