---
tags: turbowish
---
# TurboWish Development Plan (old; April 1st)
## What this covers
A description of the TurboWish software architecture, with the pieces divided up to best take advantage of the different skill sets across our team.
## What this does *not* currently cover
There's no schedule outlined here for when the different components will be delivered. I want to discuss the architecture with the team first, before trying to spell out how we get to a finished product using this architecture.
## The TurboWish vision
TurboWish gives our team's customers (namely, Rust developers) insight into what their programs are doing, in order to identify performance pitfalls.
The most immediate insight we seek to provide: Tools to answer the question "Why aren't my asynchronous tasks being scheduled in the manner I expect?"
In order to answer this question, we want to provide the developer with an easy-to-digest view of the program's tasks, associated resources, the relationships between those tasks and resources, and the state of each.
* For example, as described in a [user story][], TurboWish can show the developer a directed graph depicting both 1.) what resources a task is waiting on, and 2.) what tasks are responsible for making those resources available for use.
[user story]: https://quip-amazon.com/CMYUAq1zIWQm/TurboWish-User-Stories
## The TurboWish Architecture, at a high level
TurboWish is broken into three parts:
1. Client Service Instrumentation, which is responsible for emitting events describing the current client status and relevant state transitions made by the client.
2. The Event Collector, which receives the emitted events. The Event Collector builds an internal model of the service from these events, and is able to answer queries about this model. The Event Collector is also able to send introspection requests to the client, which can result in more elaborate descriptions of the client state.
3. A User Interface, which presents the Event Collector's model to the developer.
Each of these parts is described in more detail below.
## The TurboWish Architecture, diagrammed
```mermaid
%% Note: `%%` at line start is a comment.
flowchart TD
Browser([Web Browser])
TwCollect -.- reqs -.-> TwIntrospect
reqs((introspection<br/>requests))
TracingEventBus -.- e1 -.-> TwCollect
%% e1 is really a dummy node to make the rendering a bit nicer.
e1((event<br/>stream))
subgraph Developer [Developer Workstation ____]
TwTui <--> TwCollect
TwTui([TurboWish Console UI])
Browser <--> TwCollect
TwCollect[TurboWish Event Collector]
TwCollect --- EventLog
EventLog[(EventLog)]
end
subgraph Client [Client Service ____]
TwIntrospect[TurboWish Request Handler]
ClientCode[Client App Code]
Tokio
Rayon
TwIntrospect --- Tokio
TwIntrospect --- ClientCode
ClientCode -.-> TracingEventBus
Tokio -.-> TracingEventBus
Rayon -.-> TracingEventBus
TracingEventBus[Tracing Event Bus]
end
```
## Principles
* (Added post meeting): Client instrumentation should add minimal noise to timing measurements.
* (Added post meeting)
* Event emission should not block application progress.
- If the user *wants* to see internal state that requires Θ(n) space to represent, then we should deliver it incrementally over a series of Θ(n) events (with each one requiring O(1) delivery effort).
* TurboWish is most useful when the full series of events is available to the collector. But: TurboWish should be somewhat useful even if the collector misses a prefix of the event stream.
- (In other words, one should be able to connect to a running service and still get utility out of TurboWish, potentially by making use of the introspection functionality to query the current state after the initial connection is made.)
* Do not clog the event stream with events that the Event Collector (and ultimately the developer) does not need.
    - Some events will be necessary for the Event Collector to maintain an accurate model of the executor state, and will be emitted unconditionally.
    - Other events that track more fine-grained details of service operation are off at the outset and enabled by an opt-in introspection request sent to the client service. (A minimal sketch of both the non-blocking and opt-in principles follows this list.)
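To make the last two principles concrete, here is a minimal sketch of what non-blocking, opt-in event emission could look like on the client side. Everything here (the `Event` type, the atomic toggle, the bounded std channel) is an illustrative assumption, not a committed design:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{sync_channel, SyncSender};

// Hypothetical event type; the real event schema is still to be designed.
#[derive(Debug)]
enum Event {
    TaskSpawned { task_id: u64 },
    FineGrainedDetail { task_id: u64, note: &'static str },
}

// Fine-grained events start out disabled and would be toggled on by an
// introspection request relayed from the Event Collector.
static FINE_GRAINED: AtomicBool = AtomicBool::new(false);

struct EventSink {
    // Bounded channel so emission never blocks the instrumented service.
    tx: SyncSender<Event>,
}

impl EventSink {
    fn emit(&self, ev: Event) {
        // `try_send` returns immediately; if the buffer is full we drop the
        // event (and could count the drop) rather than stall the application.
        let _ = self.tx.try_send(ev);
    }

    fn emit_fine_grained(&self, ev: Event) {
        if FINE_GRAINED.load(Ordering::Relaxed) {
            self.emit(ev);
        }
    }
}

fn main() {
    let (tx, rx) = sync_channel(1024);
    let sink = EventSink { tx };
    sink.emit(Event::TaskSpawned { task_id: 1 });
    // Dropped silently unless the Event Collector has opted in:
    sink.emit_fine_grained(Event::FineGrainedDetail { task_id: 1, note: "poll" });
    // A background thread would normally forward `rx` to the Event Collector.
    while let Ok(ev) = rx.try_recv() {
        println!("{ev:?}");
    }
}
```

Whether the real transport is an in-process channel, the tracing event bus, or a socket is an open question; the point is only that the emission path stays cheap and never blocks application progress.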
## Architectural Overview
### Client Service Instrumentation
For TurboWish to be useful on an async program, one must use an async executor that has TurboWish instrumentation added to it. (Tokio is the executor used for current prototyping, so when you see "the executor", you can just think "Tokio" if you prefer.)
Client application code will benefit from adding extra instrumentation as well, but developers should be able to use and benefit from TurboWish without adding extensive instrumentation of their own beyond what the executor provides out of the box.
Likewise, any linked crate that encapsulates state of interest (such as thread pools in the Rayon crate) may benefit from providing its own TurboWish instrumentation.
The added instrumentation takes the form of logs of events.
(For examples of the kind of instrumentation I expect us to add to client code and to tokio itself, see "Details: Client Service Instrumentation" below.)
These events may include program-internal details such as (partial) stack traces that will include memory addresses of program code.
* (We cannot change existing APIs to thread through details like file/line-number info in the style of `#[track_caller]`, so in general this is the only way I expect to be able to infer calling context without putting undue burden on client code. See more discussion in appendix.)
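If the "Tracing Event Bus" in the diagram ends up being the `tracing` crate (an assumption here, not a decision), executor-side emission might look roughly like this sketch; the event names, fields, and `target` string are illustrative:

```rust
// Sketch of executor-side instrumentation, assuming the `tracing` crate is
// the event bus shown in the diagram. Event names and fields are illustrative.
// [dependencies] tracing = "0.1", tracing-subscriber = "0.3"

fn spawn_instrumented(task_id: u64) {
    // Emitted unconditionally: the Event Collector needs this to keep an
    // accurate model of the executor's task set.
    tracing::trace!(target: "turbowish", task_id, "task_spawned");
}

fn waker_cloned(task_id: u64, waker_addr: usize) {
    tracing::trace!(target: "turbowish", task_id, waker_addr, "waker_cloned");
}

fn main() {
    // In the real system a dedicated subscriber/layer would forward
    // `target = "turbowish"` events off-process to the Event Collector;
    // the stock fmt subscriber here just prints them for demonstration.
    tracing_subscriber::fmt().with_max_level(tracing::Level::TRACE).init();
    spawn_instrumented(1);
    waker_cloned(1, 0xdead_beef);
}
```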
#### Who should own developing Client Service Instrumentation
The fundamental piece of client service instrumentation is instrumenting the async executor itself. Tokio developers are a natural fit for this effort. However, Rust compiler engineers may be able to provide expert assistance on some of the details like stack backtracing (or finding other solutions to the problem of inferring calling context).
After we have experience with the instrumented executor, we will be in a better position to evaluate what kinds of instrumentation we might request the client code to add in order to make the developer experience delightful.
## Event Collector
The Event Collector is responsible for receiving the events emitted from the Client Service Instrumentation, and using them to construct a model of the program's executor, its tasks, and the resources with which those tasks are interacting.
The Event Collector needs to be robust: It must efficiently process the stream of events coming from the client service, construct its internal model, and respond to user interface requests (which will usually take the form of queries on its constructed model). The event collector may need to run for a long time, processing many events, in order to monitor live services. This means it needs to avoid memory leaks or other resource consumption issues. (The Event Collector should consider either discarding past events or storing them onto disk, which is why there's a picture of a database in the diagram.)
Thus, the Event Collector is designed as an independent program that will run in its own process space and be easily monitored on its own, so that we isolate resource usage issues and can address them (potentially after delivery of Minimum Viable Product, but hopefully *before* such delivery).
The Event Collector will also need access to the program's executable and its associated debuginfo (to provide the user with a view of its machine code or source code, or to map any program memory addresses in events back to the original calling functions, which are likely to be a much more customer-intelligible label for a calling context).
Finally, under the TurboWish architecture as currently envisioned, the Event Collector is also responsible for sending introspection requests (if any) to the client service.
* The main motivation for this, rather than having the User Interface send such requests directly, is that if the Event Collector is initially attached to an already running service (and thus only sees a suffix of the event stream), the Event Collector will *already want* to make introspection requests as part of the construction of its internal model of the executor.
* However, it is possible that having all such requests go through the Event Collector adds a potential risk (point of failure) that is unwarranted. *Feedback welcome!*
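Stepping back to the model the Event Collector constructs, here is a deliberately simplified sketch of the kind of state it might maintain and the kind of query the user interfaces would run against it; every type, field, and method name here is hypothetical:

```rust
use std::collections::HashMap;

// Hypothetical identifiers; the appendix discusses deriving UIDs from
// memory addresses plus timestamps.
type TaskId = u64;
type WakerId = u64;
type ResourceId = u64;

/// Illustrative sketch (not a committed design) of the Event Collector's
/// internal model: tasks, wakers, and the resources those wakers belong to.
#[derive(Default)]
struct Model {
    tasks: HashMap<TaskId, TaskState>,
    wakers: HashMap<WakerId, WakerState>,
}

struct TaskState {
    /// Human-readable label, e.g. a spawn location recovered via debuginfo.
    label: String,
}

struct WakerState {
    /// The resource (timer, channel, socket, ...) that currently holds the waker.
    held_by: ResourceId,
    /// The task that will be woken when that resource becomes ready.
    wakes: TaskId,
}

impl Model {
    /// One query the UIs would ask, per the task/resource-graph user story:
    /// "which resources is this task waiting on?"
    fn resources_blocking(&self, task: TaskId) -> Vec<ResourceId> {
        self.wakers
            .values()
            .filter(|w| w.wakes == task)
            .map(|w| w.held_by)
            .collect()
    }
}

fn main() {
    let mut model = Model::default();
    model.tasks.insert(1, TaskState { label: "http handler".into() });
    model.wakers.insert(100, WakerState { held_by: 7, wakes: 1 });
    assert_eq!(model.resources_blocking(1), vec![7]);
    println!("task 1 ({}) waits on {:?}", model.tasks[&1].label, model.resources_blocking(1));
}
```

Real queries would of course be richer (the full task/resource graph from the user story, wait durations, and so on); the intent is only that they be answerable from state that is maintained incrementally as events arrive.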
#### Who should own developing Event Collector
I expect at least some of the development effort on the Event Collector to come from Rust compiler engineers, since they are the people who are in the best position to work with the emitted program text and associated debuginfo.
## User Interface
The TurboWish User Interface presents the Event Collector's model to the developer.
We will provide two user interfaces at launch time: a terminal console view, which will be optimized around providing a "bare-metal" interaction with the Event Collector, and a web browser interface, which provides a superset of the features offered by the console (such as the rendered graph described in the user story).
The main idea behind separating this out is that I want the Event Collector to be robust, while the User Interface can be developed in a more haphazard fashion.
* For example: it's okay if the User Interface leaks memory; just restart it, and let it reconnect to the Event Collector.
#### Who should own developing User Interface
Anyone who wants to.
The console view development effort should probably be driven by the same people who own the Event Collector itself, since I expect it to be the quickest way for us to dogfood the Event Collector ourselves.
We should probably enlist 3rd party expertise on the web browser interface. Or at least, I don't know who on the team is a web2.0/dhtml/ajax/whatever expert; I just know that it's not my area of expertise *at all*.
* I would personally prefer to *not* build a web-server into the Event Collector itself, unless we can do so in a manner where almost no interesting logic is associated with that web server.
## Appendix: Architectural Details
### Details: Client Service Instrumentation
The most basic functionality for the task/resource graph user story requires the executor to emit events whenever:
* a task is spawned,
* a task is dropped,
* a waker is created,
* a waker is cloned,
* a waker is dropped, or
* a waker is transferred from one task to another.
Supporting other user stories will likely require tracking other information as well (such as how many pending futures have had `wake` called and are awaiting a call to `poll`). This partly motivated the introspection request channel: The Event Collector can send a request for more detailed information, and that will toggle new event logging paths.
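Transcribed into code, the event list above might become something like the following hypothetical type; the variant fields and the placeholder `Uid` alias are assumptions:

```rust
// A direct transcription of the event list above into a hypothetical Rust
// type. Field names are assumptions; `Uid` is a placeholder for whatever
// unique-identifier scheme we settle on (see the discussion that follows).
type Uid = u64;

#[allow(dead_code)] // only one variant is constructed in this demo
#[derive(Debug, Clone, Copy)]
enum ExecutorEvent {
    TaskSpawned { task: Uid },
    TaskDropped { task: Uid },
    WakerCreated { waker: Uid, for_task: Uid },
    WakerCloned { original: Uid, clone: Uid },
    WakerDropped { waker: Uid },
    WakerTransferred { waker: Uid, from_task: Uid, to_task: Uid },
}

fn main() {
    // Each corresponding state transition in the executor would emit one of these.
    let ev = ExecutorEvent::WakerCreated { waker: 0x7f00_1234, for_task: 1 };
    println!("{ev:?}");
}
```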
The emitted events should include unique identifiers (UID) for any referenced task/wakers.
* For values that are themselves boxed or own a heap-allocated value, we should be able to use a memory address as a UID, as long as we also include some sort of timestamp with the events (and the Event Collector will infer when memory is being reused and update its internal model accordingly).
* (If we need to track values that do not have a uniquely associated heap value, then we may need to add some sort of unique-id generation for them. So far I haven't seen a need in tokio's types.)
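One way the Event Collector could put the address-plus-timestamp scheme into practice and detect allocator reuse is sketched below; the type names and the exact retirement behavior are assumptions:

```rust
use std::collections::HashMap;

/// Hypothetical UID: the heap address of the tracked value plus the
/// timestamp (nanoseconds) at which its creation event was emitted.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Uid {
    addr: usize,
    created_at: u64,
}

#[derive(Default)]
struct LiveSet {
    /// Maps a raw address to the UID currently occupying it.
    by_addr: HashMap<usize, Uid>,
}

impl LiveSet {
    /// Called when a "created" event arrives. If the address is already live,
    /// the allocator must have reused it, so the stale entry is retired
    /// (returned) and the model updated accordingly.
    fn on_created(&mut self, addr: usize, created_at: u64) -> Option<Uid> {
        self.by_addr.insert(addr, Uid { addr, created_at })
    }

    /// Called when a "dropped" event arrives.
    fn on_dropped(&mut self, addr: usize) -> Option<Uid> {
        self.by_addr.remove(&addr)
    }
}

fn main() {
    let mut live = LiveSet::default();
    assert!(live.on_created(0x1000, 10).is_none());
    // The same address shows up again without an intervening drop event:
    // the collector infers reuse and retires the stale UID.
    let stale = live.on_created(0x1000, 99);
    assert_eq!(stale, Some(Uid { addr: 0x1000, created_at: 10 }));
    let _ = live.on_dropped(0x1000);
}
```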
The emitted events should also include some notion of the calling context for the event. This calling context should be meaningful from the viewpoint of the Client App Code.
* For example, when `<TimerFuture as Future>::poll` calls `cx.waker().clone()`, we want the waker construction event to include (at minimum) that a waker was created from `TimerFuture`, so that one can properly tell what *kind of resource* that `waker` is associated with.
* (It would be even better to include enough information in the event for the Event Collector to know *which specific resource* is associated with the waker, rather than just its type.)
These events may include program-internal details such as (partial) stack traces that will include memory addresses of program code.
* (We cannot change existing APIs to thread through details like file/line-number info in the style of `#[track_caller]`, so in general this is the only way I expect to be able to infer calling context without putting undue burden on client code.)
* More specifically: Based on my understanding of the APIs available, providing contextual info about the calling context of `cx.waker().clone()` will require either 1. client instrumentation that sets some thread- or task-local state, or 2. backtracing through the stack to find instruction addresses that the Event Collector can, via debuginfo, map back to the calling context.
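Here is a minimal sketch of option 2 from the bullet above, using the `backtrace` crate to capture unresolved instruction pointers at the event site (symbol resolution would happen later in the Event Collector, which has the debuginfo); the frame limit and function name are arbitrary choices for illustration:

```rust
// Sketch of capturing a short, unresolved backtrace of instruction addresses
// to attach to an emitted event. [dependencies] backtrace = "0.3"

fn capture_calling_context(max_frames: usize) -> Vec<usize> {
    let mut addrs = Vec::with_capacity(max_frames);
    // `backtrace::trace` walks the stack without symbolizing, keeping the
    // cost at the event site low; mapping addresses back to functions is
    // done offline by the Event Collector via debuginfo.
    backtrace::trace(|frame| {
        addrs.push(frame.ip() as usize);
        addrs.len() < max_frames // keep walking until we have enough frames
    });
    addrs
}

fn main() {
    // In real instrumentation these addresses would ride along on the
    // waker-creation event; here we just print them.
    for addr in capture_calling_context(8) {
        println!("{addr:#x}");
    }
}
```

Option 1 (thread- or task-local context set by client instrumentation) avoids the backtrace cost entirely, but only works where that instrumentation exists.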
# Meeting Notes
## Big Picture Feedback
Question: Example of Introspect Requests?
Answer: e.g. Get list of current thread pool. Or turn on more fine-grained info.
Question: What is view from customer? How does customer interact with it?
Answer: Have to enable TW in the service as a feature. (Maybe it's an option, maybe not.) When I have a problem, I launch a separate program that starts collecting the events off the stream.
Example Question / User Story: Why am I getting 40% lower throughput than I expect?
Question: Don't know how to evaluate the architecture, because I don't know what TurboWish does.
Can believe the high-level architecture:
1. Need to instrument the service, and
2. Need to interpret that instrumentation.
Example: Starting up service, HTTP response hangs
Question: Two different ways of interacting:
* Interactive REPL, try to determine what current state is
* Post-processing the log events that have already been gathered
Observation: There are two problems this is trying to solve
* Huge obvious problem, trying to understand behavior. (This architecture seems to resolve that.)
* Nuanced performance issue, where the instrumentation itself adds expense, and thus disrupts the observed performance (and makes the customer mistrust the tool)
Need to separate dev vs production?
* If we assume dev-only as an upfront target, then we can e.g. leverage tools like `perf`.
Focus on problems separately
* Deadlock debugging, vs
* Better perf integration
How much do we try to integrate with (and/or improve) existing tools
* Zipkin, X-ray, perf, ftrace
Need more explicit user stories of the experience someone has using this
Our value-add can be expressing things in terms of the nouns that people are already using (like "tasks", "channels"), rather than system-wide vocabulary.
Aim: Shiny Future Stories by 8 April 2021
Aim: Design Documents for dedicated development efforts by 15 April 2021