A description of the TurboWish software architecture, where the pieces have been selected in a way to best take advantage of different skill sets across our team.
There's no schedule outlined here for when the different components will be delivered. I want to discuss the architecture with the team first, before trying to spell out how we get to a finished product using this architecture.
TurboWish gives our team's customers (namely, Rust developers) insight into what their programs are doing, in order to identify performance pitfalls.
The most immediate insight we seek to provide: Tools to answer the question "Why aren't my asynchronous tasks being scheduled in the manner I expect."
In order to answer this question, we want to provide the developer with an easy to digest view of the program's tasks, associated resources, the relationships between those tasks and resources, and the state of each.
TurboWish is broken into three parts:
Each of these parts is described in more detail below.
%% Note: `%%` at line start is a comment.
flowchart TD
Browser([Web Browser])
TwCollect -.- reqs -.-> TwIntrospect
reqs((introspection<br/>requests))
TracingEventBus -.- e1 .-> TwCollect
%% e1 is really a dummy node to make the rendering a bit nicer.
e1((event<br/>stream))
subgraph Developer [Developer Workstation ____]
TwTui <--> TwCollect
TwTui([TurboWish Console UI])
Browser <--> TwCollect
TwCollect[TurboWish Event Collector]
TwCollect --- EventLog
EventLog[(EventLog)]
end
subgraph Client [Client Service ____]
TwIntrospect[TurboWish Request Handler]
ClientCode[Client App Code]
Tokio
Rayon
TwIntrospect --- Tokio
TwIntrospect --- ClientCode
ClientCode -.-> TracingEventBus
Tokio -.-> TracingEventBus
Rayon -.-> TracingEventBus
TracingEventBus[Tracing Event Bus]
end
(Added post meeting): Client instrumentation should add minimal noise to timing measurement
(Added post meeting)
Event emission should not block application progress.
TurboWish is most useful when the full series of events is available to the collector. But: TurboWish should be somewhat useful even if the collector misses a prefix of the event stream.
Do not clog the event stream with events that the client does not need.
For Turbowish to be useful on an async program, one must use an async executor that has TurboWish instrumentation added to it. (Tokio is the executor used for current prototyping, so when you see "the executor", you can just think "Tokio" if you prefer.)
Client application code will benefit from adding extra instrumentation to their code as well, but developers should be able to use and benefit from TurboWish without going to extremes adding new instrumentation beyond what the executor has out-of-the-box.
Likewise, any linked crate that encapsulates state of interest (such as thread pools in the Rayon crate) may benefit from providing its own TurboWish instrumentation.
The added instrumentation takes the form of logs of events.
(For examples of the kind of instrumentation I expect us to add to client code and to tokio itself, see "Details: Client Service Instrumentation" below.)
These events may include program-internal details such as (partial) stack traces that will include memory addresses of program code.
#[track_caller]
, so in general this is the only way I expect to be able to infer calling context without putting undue burden on client code. See more discussion in appendix.)The fundamental piece of client service instrumentation is instrumenting the async executor itself. Tokio developers are a natural fit for this effort. However, Rust compiler engineers may be able to provide expert assistance on some of the details like stack backtracing (or finding other solutions to the problem of infering calling context).
After we have experience with the instrumented executor, we will be in a better position to evaluate what kinds of instumentation we might request the client code to add in order to make the developer experience delightful.
The Event Collector is responsible for receiving the events emitted from the Client Service Instrumentation, and using them to construct a model of the program's executor, its tasks, and the resources with which those tasks are interacting.
The Event Collector needs to be robust: It must efficiently process the stream of events coming from the client service, construct its internal model, and respond to user interface requests (which will usually take the form of queries on its constructed model). The event collector may need to run for a long time, processing many events, in order to monitor live services. This means it needs to avoid memory leaks or other resource consumption issues. (The Event Collector should consider either discarding past events or storing them onto disk, which is why there's a picture of a database in the diagram.)
Thus, the Event Collector is designed as an independent program that will run in its own process space and be easily monitored on its own, so that we isolate resource usage issues and can address them (potentially after delivery of Minimum Viable Product, but hopefully before such delivery).
The Event Collector will also need to access the text of the program and its associated debuginfo (to provide the user with a view of its machine code or source code, or to map any program memory addresses in events to the original calling functions, which are likely to be a much more customer-intelligible label for a calling context).
Finally, under the TurboWish architecture as currently envisioned, the Event Collector is also responsible for sending introspection requests (if any) to the client service.
I expect at least some of the development effort on the Event Collector to come from Rust compiler engineers, since they are the people who are in the best position to work with the emitted program text and associated debuginfo.
The TurboWish User Interface presents the Event Collector's model to the developer.
We will provide two user interfaces at launch time: a terminal console view, which will be optimized around providing a "bare-metal" interaction with the Event Collector, and a web browser interface, which provides a super-set of the features offered by the console (such as the rendered graph described in the user story.)
The main idea behind separating this out is that I want the Event Collector to be robust, while the User Interface can be developed in a more haphazard fashion
Anyone who wants to.
The console view development effort should probably be driven by the same people who own the Event Collector itself, since I expect it to be the quickest way for us to dogfood the Event Collector ourselves.
We should probably enlist 3rd party expertise on the web browser interface. Or at least, I don't know whom on the team is a web2.0/dhtml/ajax/whatever expert; I just know that its not my area of expertise at all.
The most basic functionality for the task/resource graph user story requires the executor to emit events whenever:
Supporting other user stories will likely require tracking other information as well (such as how many pending futures have had wake
called and are awaiting a call to poll
). This partly motivated the introspection request channel: The Event Collector can send a request for more detailed information, and that will toggle new event logging paths.
The emitted events should include unique identifiers (UID) for any referenced task/wakers.
The emitted events should also include some notion of the calling context for the event. This calling context should be meaningful from the viewpoint of the Client App Code.
<TimerFuture as Future>::poll
calls cx.waker().clone()
, we want the waker construction event to include (at minimum) that a waker was created from TimerFuture
, so that one can properly tell what kind of resource that waker
is associated with.These events may include program-internal details such as (partial) stack traces that will include memory addresses of program code
#[track_caller]
, so in general this is the only way I expect to be able to infer calling context without putting undue burden on client code.)cx.waker().clone()
will require either 1. client instrumentation that sets some thread- or task-local state, or 2. backtracing through the stack to find instruction addresses that the Event Collector can, via debuginfo, map back to the calling context.Question: Example of Introspect Requests?
Answer: e.g. Get list of current thread pool. Or turn on more fine-grained info.
Question: What is view from customer? How does customer interact with it?
Answer; Have to enable TW in service as a feature. (Maybe its an option, maybe not.) When I have a problem, I launch a separate program that starts collecting the events off the stream.
Example Question / User Story: Why am I getting 40% lower throughput than I expect?
Question: Don't know how to evaluate the architecture, because I don't know what TurboWish does.
Can believe the high-level architecutre:
Exmaple: Starting up service, HTTP response hangs
Question: Two differnt ways of interacting:
Observation: There are two problems this is trying to solve
Need to separate dev vs production?
perf
.Focus on problems seperately
How much do we try to integrate with existing (and/or improve) existing tools
Need more explicit user stories of the experience someone has using this
Our value-add can be expressing things in terms of the nouns that people are already using (like "tasks", "channels"), rather than system-wide vocabulary.
Aim: Shiny Future Stories by 8 April 2021
Aim: Design Documents for dedicated development efforts by 15 April 2021