or
or
By clicking below, you agree to our terms of service.
New to HackMD? Sign up
Syntax | Example | Reference | |
---|---|---|---|
# Header | Header | 基本排版 | |
- Unordered List |
|
||
1. Ordered List |
|
||
- [ ] Todo List |
|
||
> Blockquote | Blockquote |
||
**Bold font** | Bold font | ||
*Italics font* | Italics font | ||
~~Strikethrough~~ | |||
19^th^ | 19th | ||
H~2~O | H2O | ||
++Inserted text++ | Inserted text | ||
==Marked text== | Marked text | ||
[link text](https:// "title") | Link | ||
 | Image | ||
`Code` | Code |
在筆記中貼入程式碼 | |
```javascript var i = 0; ``` |
|
||
:smile: | ![]() |
Emoji list | |
{%youtube youtube_id %} | Externals | ||
$L^aT_eX$ | LaTeX | ||
:::info This is a alert area. ::: |
This is a alert area. |
On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?
Please give us some advice and help us improve HackMD.
Do you want to remove this version name and description?
Syncing
xxxxxxxxxx
carllerche's tokio-console thoughts
This tool is intended to be run locally during debugging and/or benchmarks.
Shiny future
All stories start with the developer enabling tokio-console in their application then opening the tokio-console terminal UI.
Connect using the default tokio-console port.
Alternatively, a
cargo
command exists to both run the process and switch to the tokio-console interface:Why is my task stuck?
Cause: an incorrectly implemented future returns
Pending
without registering the waker.Pending
is returned without registering the waker.Cause: Deadlock cycle. A set of tasks are awaiting on values produced by other tasks. The dependency chain creates a cycle, preventing any tasks from making progress.
A cycle may include a single task if a task is receving on a channel and holds all senders.
Cause: Resource starvation.
This is similar to the deadlock case above, but may not be a true deadlock. Either the task may advance (but slowly) or the system cannot detect the deadlock due to external systems being at play (reading from a TcpStream, but no data is arriving).
We will break it down further to a few specifics.
Variant 1: Blocked waiting on channel capacity.
my_channel.send(value).await
Variant 2: Blocked waiting on reqwest's connection pool.
http_request.await
Take away: libraries will need to expose higher level data to tokio-console.
Cause: logic bug in application
Variant 1:
When parsing a protocol frame, the frame has been fully received but the developer accidentally issues a read. The peer will send no more data.
read
frame.Variant 2: The protocol frame is 128 bytes but the developer does
socket.read_exact(256)
read_exact
frame.read_exact
future has already buffered 128 bytes and the frame is waiting for 128 more bytes.read_exact
buffer and sees it contains the protocol frame.read_exact(128)
Cause: Unexpected cancellation from not calling
.await
In both of these cases, the compiler will emit a warning as futures are annotated
#[must_use]
. Developers do not always catch warnings, especially when they are drowned out by many other warnings.Variant 1: not calling await on a Tokio fn.
The
read
operation is immediately cancelled without polling.It is unclear how
tokio-console
can handle this. At the very least, it can be included in the list of warnings.Variant 2: not calling await on a user fn.
This is similar variant 1, however
tokio-console
will not be able to detect the problem without the rust compiler providing runtime instrumentation. Alternatively, the developer could explicitly annotate the function making the flow identical to variant 1.Creating a future without ever awaiting it or explicitly cancelling it with a combinator is a developer mistake.
Cause: Unexpected cancelation from
select!
Source
Consider a task that parses in-bound protocol frames from a socket and also must handle configuration change updates (for example, change the max frame size).
When
config_rx.recv()
completes first, any in-progressparse_frame
state is lost, resulting in either the protocol parsing state to be corrupted or a frame is lost. Let's assume the observed symptom is that a frame is lost, this will appear to the developer as a hung task. The data has been received but nothing was acted on orparse_frame
hangs waiting for more data even though the frame was received (because part of the data was lost).The correct strategy is described in the Tokio docs, though this is a tricky problem to debug.
select!
, expand that frame for theparse_frame
branch.Cause: did not spawn a task
TcpListener
resource is ready.Cause: Accidental blocking call starves the runtime
Variant 1: Calling
block_on
from thefutures
crateVariant 2: Performing a high CPU operation on the runtime.
There is no instrumentation to catch this issue. It is detect by noticing frames being polled for long periods of time (where long is tbd).
Variant 3: Accidentally holding a
std
mutex across an.await
.This requires syscall level instrumentation. This could be instrumenting
std
, though that would not catch everything (e.g. ffi calls).ptrace
can provide hooks. This will allow catching mutex waits and other blocking syscalls.spawn_blocking
orblock_in_place
.Panics
At a high level, tasks in the task list will track they panicked and will allow the developer to browse the task events (logs).
Cause: Accidental future cancellation
Why is my app slow?
At a high level, a common root cause for slowness in an async app is not processing events ASAP when they come in. In these cases, the developer needs to "record" events to play them back after.
Cause: running a task off runtime
Given the following:
One would expect this to run efficiently, however, the main task is run off the main runtime and the spawned tasks are on runtime, which means the snippet will run much slower than:
The slowdown is caused by a higher time between the rx task being notified (in
tx.send()
) and the task being polled.Variant 2: using a Tokio resource off runtime.
A Tokio TcpStream is moved to a separate thread and is used via
block_on
block_on
task.Initially,
block_on
calls may not automatically translate to tasks in the task view. If not, the developer could do:Cause: not currently polling the stream
Source
And a simpler variant:
Expectation: the snippet takes 600ms to iterate because the sleeps in the stream run in the background and are immediately ready when polling the stream. In practice, the sleeps do not execute because nothing is polling the stream while sleeping in the
while
block.This comes from the developer having a misunderstanding of Rust's async model. The developer expects async calls to always start processing immediately in the background and not lazily when being
.await
ed.To help debug, tokio-console must expose the details of execution timing.
sleep(...).await
, which is higher than expected.Note: the question is, how does the developer make the leap from seeing sleeps happen sequentially to understanding Rust's async execution model? My best answer now is the view will have some legend explaining the difference between a "plain" await call and a "concurrent" await (select!, join!, FuturesUnordered, …). There can be a help link there to learn more.
Cause: significant time spent moving large values
perf
tokio-console
to interpret perf data and isolate it per task?Cause: unbuffered reads / writes.
The application is exhibiting sub-optimal performance due to issuing socket read and write calls with small buffers.
BufReader
to solve the problem.Question: What if the warning is a false-positive, how are they silenced?
Alternative:
Cause: overloaded scheduler
The application has more tasks being notified then can be processed. This could happen with excessive "fan-out" style messaging.
.await
calls that have high wake -> poll times.Cause: slow channel receiver
When message passing, if the message receiver is overloaded, it can be perceived as an app slow down.
This implementation introduces a bottleneck with processing requests. This can be fixed by concurrently processing requests.
response.await
Interestingly, tokio-console could provide better insight if a higher level "actor" library is used to model this pattern and the library is tokio-console-aware.
handle.send(request).await
Spurious wakeups / excessive yielding
How to debug this? How do we know spurious wakeups are wrong vs. just yielding.
Concepts
block_on
Views
Task list
Data
Pending
returned without registering waker.Columns
Similar to
top
, columns are configurable, sortable, and filterable.Ideas:
TODO
block_on
)spawn_blocking
)Task type
TODO
Data
select!
.await
point dataNotes:
* Aggregates metrics for all types w/ a given type
* Filter tasks within view. e.g.:
* only show currently active tasks
* Only show tasks matching query
Task instance
TODO
Data
.await
points can "mask" children. For example, libraries do not want to expose internals by default.select!
.await
points.await
point. e.gselect!
level?)Pending
returned without registering waker.Resource
TODO
Warnings (clippy, hints, …)
TODO
Logs
TODO: Hierarchical log stream w/ filter ability
Runtime / scheduler /
.await
pointsTODO: Should there be a runtime centric view?
Data
.await
points top-like interfaceData
Deadlock detected
A deadlock cycle has been detected.
Data
Data
Instrumenting
Getting rich data requires application and library developers to include instrumentation.
Spawning
Other options
Needed ecosystem improvements
.await
calls.Context
argument topoll
can include an.await
stack?std
fn calls, especially blocking ones.Open questions
Misc notes
poll
fns to prevent polling if not woken. This helps catch missed wakeups but the future was polled due to another part of the task being woken and polling the original future.select!
forks frames.Data to collect / show
block_in_place
Rough: More questions and thoughts on how to answer them
(e)BPF tracing
- define static trace point?
- https://lwn.net/Articles/753601/