# GopherCon 2021
## Title: Queues, Fairness and the Go Scheduler
## Abstract
How does Go let you seamlessly scale millions of goroutines atop your system's threads? Let's explore the workings of the Go scheduler and how it maps goroutines to OS threads, all while trying to ensure fairness, and answer the question: why is 61 hardcoded in the Go scheduler? All of this with demos!
## Description
### Introduction
Go is well known and loved for providing extremely minimal and easy-to-use concurrency primitives called goroutines. Some might even refer to them as _user space threads_, since the Go runtime is responsible for their creation. Because goroutines are created in user space and completely managed by the Go runtime, they are not _directly_ runnable on the hardware, which means that there needs to be some form of mapping from goroutines to entities that actually can run on hardware, namely threads!
While it is super interesting to think about how Go abstracts away most of these details so we can use these primitives out of the box, understanding how that abstraction is achieved helps explain how Go scales to thousands or even millions of goroutines running seamlessly atop these threads.
This mapping of user space threads to OS threads is done by the Go _scheduler_, and the decisions the scheduler makes to arrive at this mapping are crucial to the performance of the running application. We start by looking at how, in general, such mapping schemes (user space threads -> OS threads) can be achieved, what the Go way of achieving this mapping is, why it is preferred over some of the other ways we discuss, and most importantly, _how_ the Go scheduler achieves it.
While it is important to understand this, it's equally important to realise that the Go scheduler, while making scheduling decisions, tries to keep factors such as fairness among goroutines in mind.
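To make the premise concrete before diving in, here's a minimal sketch (illustrative only, not the talk's actual demo code) that parks a hundred thousand goroutines while capping the scheduler at two processors, so at most two OS threads execute Go code at any instant:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// Cap the scheduler at two processors (P's); at most two OS threads
	// execute Go code at any instant, regardless of the goroutine count.
	runtime.GOMAXPROCS(2)

	release := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < 100_000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-release // parked goroutines consume no OS thread while blocked
		}()
	}

	// All 100,000 goroutines are alive here, multiplexed onto a few threads.
	fmt.Println("goroutines alive:", runtime.NumGoroutine())

	close(release) // a closed channel unblocks every waiting receiver
	wg.Wait()
}
```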
### Outline
- What are the different ways we can map goroutines to OS threads?
- Have a single OS thread and multiplex multiple goroutines on this thread (M:1 scheduling)?
- Have a single OS thread created and destroyed _per_ goroutine (1:1 scheduling)?
- Let's try and multiplex multiple goroutines among multiple OS threads! (M:N scheduling)
- Would the above M:N suffice?
- Need for another level of indirection called _Processors_
- Introduction to the `P`, `G` and `M` structs.
- Can we make do with a _global_ run queue?
- What are the advantages of having this?
- What are the major disadvantages of this?
- Why does it take a hit on performance?
- Need for distributed or _local_ run queues
- Interlude: blocked goroutines and context switching, how do they happen?
- An example of running a set of goroutines in a local run queue.
- Room for further improvement? YES!
- Why insert a one element LIFO buffer in front of the local run queue?
- Concerns related to fairness in terms of convoy effects creeping into the local run queue.
- Introduction to co-operative pre-emption in the Go scheduler.
- Would pre-emption in its vanilla form work with the one element LIFO buffer? If not, what can we introduce to mitigate this?
- Introduction to inheriting time slices.
- Why does the need for a global run queue still persist?
- What goes into the global run queue?
- If we still make use of this queue, how would a processor know when to fetch goroutines from it?
- An introduction to the mysterious saga of `61` in the Go scheduler.
- Why the number `61` and not something else?
- An introduction to blocking channels and goroutines waiting on them (a small sketch follows this outline).
- Where do these goroutines end up?
- How does this mechanism interact with the different types of queues we've spoken about so far?
- How does the scheduler handle scenarios where processors have empty run queues but other processors have loaded run queues?
- Pull based re-balancing
- Push based re-balancing
- What does the Go scheduler prefer and why?
- All that's pretty cool, but let's run some code!
- Using our understanding of the scheduler to try and see if we can get performance improvements in our code
- Using `GODEBUG=schedtrace` to try and see what's going on under the hood of the scheduler (a sample invocation follows this outline).
- Looking at the scheduler-specific `runtime` API provided by Go to see how calls such as `Goexit` and `Gosched` let us incorporate our mental model of the scheduler into our code (a short sketch follows this outline).
- _Further_ speedups? Hmm
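To accompany the blocking-channels bullets above, here is a small sketch (hypothetical demo code, with `time.Sleep` used only to make the ordering visible) of a goroutine parked on a channel's wait queue:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	ch := make(chan int) // unbuffered: a receive blocks until a sender shows up

	go func() {
		// While blocked, this goroutine is parked on the channel's wait queue;
		// it sits on no run queue and consumes no OS thread.
		fmt.Println("received:", <-ch)
	}()

	time.Sleep(10 * time.Millisecond) // crude way to let the receiver park (demo only)
	fmt.Println("goroutines alive:", runtime.NumGoroutine())

	ch <- 42 // readies the parked goroutine; the scheduler decides where it resumes
	time.Sleep(10 * time.Millisecond) // let the print happen before main exits (demo only)
}
```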
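For the `schedtrace` demo, the invocation looks like the following; `./myprogram` is a placeholder for whatever binary you want to trace, and the numbers shown are purely illustrative. Each line reports, among other things, the global run queue length (`runqueue`) and the per-P local run queue lengths (the bracketed list):

```
$ GODEBUG=schedtrace=1000 ./myprogram
SCHED 1009ms: gomaxprocs=4 idleprocs=0 threads=8 spinningthreads=1 idlethreads=2 runqueue=3 [4 1 0 2]
```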
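And a short sketch (again illustrative, not the talk's demo code) of the two `runtime` calls mentioned above: `runtime.Gosched` voluntarily yields the processor so other goroutines can run, while `runtime.Goexit` terminates the calling goroutine after running its deferred functions:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	done := make(chan struct{})

	go func() {
		defer close(done) // deferred functions still run after Goexit
		fmt.Println("worker: signing off")
		runtime.Goexit()
		fmt.Println("worker: never reached") // Goexit never returns
	}()

	for i := 0; i < 3; i++ {
		fmt.Println("main: yielding")
		runtime.Gosched() // hint: let other runnable goroutines have the processor
	}
	<-done
}
```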
### Conclusion
Phew! That was a lot to take in. Let's step back and review what we learnt today.
- We learnt that an M:N scheme works best, along with an abstraction for Processors.
- We learnt why exactly having a global run queue is disadvantageous and how distributing the run queue helps!
- But we still need the global run queue as well!
- We got a good understanding of the different ways the scheduler tries to improve efficiency:
- One element LIFO buffer
- We learnt how the Go scheduler tries to maintain fairness in scheduling
- Co-operative pre-emption
- Time slice inheritance
- When to place something in the global run queue
- How often to check the global run queue (the magic of `61`)
- We briefly covered the different ways in which a goroutine can block and how these cases are handled by the scheduler.
- Syscalls
- Channels
- We learnt how the Go scheduler uses work stealing to re-balance the load and make sure resources are utilized as fully as possible.
- We also saw the Go scheduler in action using `schedtrace`, and what possibilities for performance improvements exist if we are mindful of the scheduler's inner workings.