# 2021-03-31 / 15:00 [#107](https://github.com/rust-lang/wg-async-foundations/issues/107)

* context
  * got folks doing large distributed systems
  * scientific computing
* projects
  * studying fancy physics stuff
    * solving partial differential equations on multicore, distributed systems
    * large data, high performance, parallelization
    * correctness and obviousness
      * high scientific integrity
      * determinism is really important
    * lots of code maintained by astrophysicists, not professional programmers, relatively little support
  * ingesting a high-volume stream of events, parsing it, storing it into a large database
    * or doing queries through a database and having to serve that up
    * don't need to implement your own distributed algorithms, but it does require a lot of concurrency to keep up
    * implemented in Go
      * runtime with goroutines etc. was really nice
      * but felt unclear how this mapped to hardware
        * would memory be on the stack? heap?
        * hard to predict performance
      * but it did largely work well in practice
    * implemented in F#
      * since ported to C#, Python
      * workflows
        * used "async workflows" in F#

## async workflows

https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/asynchronous-workflows

```fsharp
open System.Net
open Microsoft.FSharp.Control.WebExtensions

let urlList = [ "Microsoft.com", "http://www.microsoft.com/"
                "MSDN", "http://msdn.microsoft.com/"
                "Bing", "http://www.bing.com" ]

let fetchAsync(name, url:string) =
    async {
        try
            let uri = new System.Uri(url)
            let webClient = new WebClient()
            let! html = webClient.AsyncDownloadString(uri)
            printfn "Read %d characters for %s" html.Length name
        with
            | ex -> printfn "%s" (ex.Message)
    }

let runAll() =
    urlList
    |> Seq.map fetchAsync
    |> Async.Parallel
    |> Async.RunSynchronously
    |> ignore
```

* big challenge
  * interaction between many parallel tasks and one or two management tasks
  * if you are writing a system reading from a Kafka server
    * reading messages off
    * need to keep track of when you've fully processed a message
      * so you know, if there is a fault, how far you must retry
    * needs a coordinator tracking the state of ongoing processing
    * sometimes blocking or stopping concurrent tasks
      * to allow you to commit an offset

> So how we implemented the kafka client in Go was roughly: we had a goroutine which read messages from kafka, recorded the incoming message into a ring buffer with an “Inflight” state, then posted the message to a channel. Downstream go routines processed the message then posted the Success/Fail status to a channel and the Kafka client layer would read those messages and update the state in the Ring buffer.
>
> The ring buffer provided a max number of in flight messages and if that was hit then it would stop reading from Kafka
>
> The ring buffer would then take the largest contiguous block of Success messages and commit the offset.
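A rough sketch of the bookkeeping that quote describes, translated from the Go design into plain Rust (all names here are hypothetical, and wrap-around/advancing of the window is omitted): each slot is marked in-flight when a message is read, later marked success or fail, and the committable offset is the end of the longest contiguous prefix of successes.

```rust
// Minimal sketch, not the original Go code: a fixed-size buffer that tracks
// the state of in-flight Kafka messages and computes the safe commit offset.

#[derive(Clone, Copy, PartialEq)]
enum Slot {
    Empty,
    InFlight,
    Success,
    Fail,
}

struct OffsetTracker {
    base_offset: u64, // offset of the first slot in the window
    slots: Vec<Slot>, // capacity = max number of in-flight messages
    head: usize,      // next free slot
}

impl OffsetTracker {
    fn new(capacity: usize, base_offset: u64) -> Self {
        Self { base_offset, slots: vec![Slot::Empty; capacity], head: 0 }
    }

    /// Record a newly read message; `None` means the buffer is full, which is
    /// the signal to stop reading from Kafka for a while.
    fn start(&mut self) -> Option<u64> {
        if self.head == self.slots.len() {
            return None;
        }
        self.slots[self.head] = Slot::InFlight;
        let offset = self.base_offset + self.head as u64;
        self.head += 1;
        Some(offset)
    }

    /// Record the outcome reported by a downstream worker.
    fn finish(&mut self, offset: u64, ok: bool) {
        let i = (offset - self.base_offset) as usize;
        self.slots[i] = if ok { Slot::Success } else { Slot::Fail };
    }

    /// Offset safe to commit: everything before the first slot that is not
    /// yet Success (the "largest contiguous block" from the quote).
    fn committable(&self) -> u64 {
        let done = self.slots.iter().take_while(|s| **s == Slot::Success).count();
        self.base_offset + done as u64
    }
}

fn main() {
    let mut tracker = OffsetTracker::new(4, 100);
    let a = tracker.start().unwrap(); // offset 100
    let b = tracker.start().unwrap(); // offset 101
    tracker.finish(b, true);          // 101 done, but 100 still in flight
    assert_eq!(tracker.committable(), 100);
    tracker.finish(a, true);
    assert_eq!(tracker.committable(), 102);
    println!("commit up to offset {}", tracker.committable());
}
```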
## notes on physics applications

* https://www.github.com/clemson-cal
* data pattern
  * not as simple as mapping across all tasks
  * light DAG in rounds
* kind of browsed code
  * found smol, tokio
* first tried message passing and channels
  * launched 1 thread per task
  * had them sending synchronous messages
* for students learning and coming into this system
  * background: Fortran, Python
  * learning curve was steep
  * still "fighting" the language (~5 months in)
  * async not really the source of problems
* worked on error handling
  * plumbed errors through the async modules
  * took a bit to grok how things were working
  * figuring out how to work with `Ok` and `Err` etc.
    * used to just throwing exceptions in C++
    * read "The Book"
* key challenge
  * how to share the futures so that arbitrary other tasks can pull things out of a matrix and await them
  * use `shared` (see the first sketch at the end of these notes)

## story

Erich will take a stab at writing this PR.

* Character:
  * Barbara, early in her career
* Outline:
  * Started with launching a bunch of threads
  * Perf was subpar
  * Explored tokio and async
  * Replaced threads with `runtime.spawn`
  * Used async futures and `.await` to create dependency graphs between tasks
    * But realized you can't await a future more than once, and you needed access to arbitrary stages
    * made a [map](https://github.com/clemson-cal/app-kilonova/blob/eb38c6bb66d780e51954a52ddd469f14f08a0e16/src/scheme.rs#L32-L39) but realized cloning it was expensive
      * introduced `Arc` to get around that
    * used `shared`
    * don't love all the `.map` and `.unwrap` calls, noisy
  * Initially, when hitting errors, just had panics out of convenience
    * If a background thread panics, it crashes all the other threads, so there was this cacophony of stacktraces
    * People were Slack-messaging pages full of errors
    * Converted to `Result` because knew that was the right way
    * didn't know you could collect an `Iterator<Item = Result>` into a `Result<Collection<Item>>` (see the second sketch at the end of these notes)
      * tasked Grace with it
      * took 3 or 4 days of banging her head against it to thread all the types, get `?` in the right places, etc.
  * Things scale excellently, but
    * compilation time is a problem on simpler laptops
      * would like 2 or 3 seconds, but 30 seconds to a minute is common
      * could have quite a lot to do with async
    * the student Niklaus can't really hack on the core parts of the system
      * took 5 months to get up to speed
      * initial investment was non-trivial
* What are the morals of the story?
  * Writing async code to do DAGs of computation is doable, but the pattern is not necessarily obvious and takes some tinkering to discover.
  * The libraries are tuned for I/O and may not be optimal.
  * Error handling and fault handling were not obvious; panics in particular were difficult to manage.
  * Using async Rust brings a fair amount of "non-essential" complexity to the problem.
* What are the sources for this story?
  * Covers the experiences of Jonathan Zrake's group at Clemson.
* Why did you choose NAME to tell this story?
  * Fits the backgrounds. Barbara isn't a perfect fit, but we can't have two Graces, and Jonathan knew some Rust.
* How would this story have played out differently for the other characters?
  * ..
* Would Rayon cover this case?
  * Not quite; Rayon doesn't support arbitrary DAGs. You could do a full map pattern, but you would get uneven CPU utilization towards the end of a phase.
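The core trick the story describes (wrapping each stage's future with `shared()` so that any number of downstream tasks can await the same result) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the stage names (`primitive`, `flux`), the `Arc<Vec<f64>>` payload, and the use of `tokio::spawn` together with the `futures` crate are assumptions.

```rust
// Sketch of a tiny dependency graph built from Shared futures.
// Assumes the `futures` crate and `tokio` with the rt-multi-thread + macros features.

use std::{collections::HashMap, sync::Arc};

use futures::future::{BoxFuture, FutureExt, Shared};

// One stage of the computation; `Shared` lets many tasks await the same result.
type Stage = Shared<BoxFuture<'static, Arc<Vec<f64>>>>;

#[tokio::main]
async fn main() {
    // First stage: produce some data (stands in for one block of the mesh).
    let primitive: Stage = tokio::spawn(async { Arc::new(vec![1.0, 2.0, 3.0]) })
        .map(|res| res.expect("task panicked"))
        .boxed()
        .shared();

    // Keep the stages in a map so arbitrary other tasks can look them up.
    let mut stages: HashMap<&'static str, Stage> = HashMap::new();
    stages.insert("primitive", primitive.clone());

    // Second stage: depends on the first. Cloning the Shared handle is cheap;
    // awaiting it yields a clone of the (Arc-wrapped) output.
    let flux = {
        let upstream = primitive.clone();
        tokio::spawn(async move {
            let data = upstream.await;
            Arc::new(data.iter().map(|x| x * 2.0).collect::<Vec<_>>())
        })
        .map(|res| res.expect("task panicked"))
        .boxed()
        .shared()
    };

    println!("flux = {:?}", flux.await);
    // Unlike a plain future, a Shared stage can be awaited again elsewhere.
    println!("primitive again = {:?}", stages["primitive"].clone().await);
}
```

A plain future can only be awaited once, which is why the stages are wrapped in `Shared`; `Shared` hands a clone of the output to every awaiter, and wrapping the payload in `Arc` keeps those clones cheap, echoing the story's "cloning it was expensive" step.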
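For the `Result`-collecting point in the story, the relevant standard-library behavior is that collecting an iterator of `Result`s into a `Result<Vec<_>, _>` short-circuits on the first `Err`. A tiny self-contained illustration (the `parse_all` name is made up):

```rust
// Collecting Iterator<Item = Result<T, E>> into Result<Vec<T>, E>:
// stops at the first Err, otherwise returns the whole Vec.
fn parse_all(inputs: &[&str]) -> Result<Vec<i64>, std::num::ParseIntError> {
    inputs.iter().map(|s| s.parse::<i64>()).collect()
}

fn main() {
    assert_eq!(parse_all(&["1", "2", "3"]), Ok(vec![1, 2, 3]));
    assert!(parse_all(&["1", "oops", "3"]).is_err());
    println!("ok");
}
```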