---
# System prepended metadata

title: 'Bandwidth-Delay Product Estimations: Are we Doing It Right?'

---

$$
\newcommand{\rtt}{\text{rtt}}
\newcommand{\sbatch}{s_{\text{batch}}}
$$

# Bandwidth-Delay Product Estimations: Are we Doing It Right?

A key aspect of performance in a block protocol is ensuring channel saturation. The _bandwidth-delay product_ (BDP) is an abstract quantity that can be computed for a channel and measures its capacity. Fig. 1 depicts a channel and its bandwidth and delay as two separate dimensions: the amount of data that can be kept in flight at any given moment is given by their product.

![bdp](https://hackmd.io/_uploads/BJhXfpj2Wg.png)
**Figure 1.** A channel with its bandwidth and delay.

**Bandwidth and delay.** The _bandwidth_ of a channel represents the _rate_ at which it can transmit information. Even if a channel has zero delay, the rate is always constrained by the bandwidth. For instance, a channel with a $10\,\text{MBps}$ bandwidth capacity can transmit $1$ megabyte in $0.1$ seconds, or $100$ kilobytes in $0.01$ seconds.

_Delay_ refers to the time it takes for any byte to travel across the channel, and affects all bytes equally. If a channel has delay $d$, then any byte inserted into the channel will take $d$ to reach its destination.

**Effective bandwidth.** Protocols that perform request-response cycles or which rely on acknowledgements for flow control (e.g. TCP) will effectively _stop_ transmission because of delays. These pauses in transmission will reduce the _effective bandwidth_ of the channel under that protocol.

Reasoning about this can be tricky. Suppose a node transmits some payload of size $s$ over a channel with delay $d$ and bandwidth $b$, and waits for an acknowledgment before transmitting the next payload. What is the effective bandwidth?

![bdp-2](https://hackmd.io/_uploads/SkuFXTsh-g.png)

**Figure 2.** Effective bandwidth for a payload of size $s$.

This is depicted in Fig. 2: bandwidth constraints imply that it takes $t_e = s/b$ for the whole payload to enter the channel. At that point, the first byte entered the channel $s/b$ seconds ago, and therefore needs only $t_p = d - s/b$ more to reach the receiver.

After $d - s/b$ seconds, the receiver begins seeing the first byte. After another $t_x = s/b$ seconds, the receiver finally receives all bytes. The total time required is then $t_{\text{total}} = t_e + t_p + t_x = s/b + (d - s/b) + s/b = s/b + d$. Note that this holds even when $s/b > d$: the first byte always takes $d$ to travel and, after that, channel bandwidth dictates that it takes another $s/b$ for the payload to exit it.

The _effective bandwidth_ $e$ for a payload of size $s$ can then be expressed as:

$$
\begin{equation}
e = \frac{s}{\frac{s}{b} + d} = \frac{s}{s + d\times b} \times b \leq b
\end{equation} \tag{1} 
$$

Where Eq. $1$ holds as an equality when $d = 0$; i.e., the channel has no delay. Under these terms, our goal is to make payloads always large enough that $e$ is as close to $b$ as possible, regardless of delay.
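To get a feel for Eq. $1$, the following is a minimal sketch that evaluates the effective bandwidth for a given payload size. The function name `effective_bandwidth` and the choice of units (bytes, bytes per second, seconds) are assumptions made for illustration; they do not come from any of the codebases discussed in this document.

```cpp
#include <stdexcept>

// Effective bandwidth e = s / (s/b + d), as in Eq. 1.
// Units are an assumption of this sketch: payload in bytes, bandwidth in
// bytes per second, delay in seconds.
double effective_bandwidth(double payload_bytes, double bandwidth_bps, double delay_s)
{
  if (payload_bytes <= 0.0 || bandwidth_bps <= 0.0 || delay_s < 0.0)
    throw std::invalid_argument("payload and bandwidth must be positive, delay non-negative");

  // Time until the last byte of the payload reaches the receiver: s/b + d.
  double const total_time_s = payload_bytes / bandwidth_bps + delay_s;
  return payload_bytes / total_time_s;
}
```

Taking $b = 10^7$ bytes per second and $d = 3$ seconds, a payload of $10^6$ bytes yields roughly $0.32\,\text{MBps}$, whereas a payload of $3 \times 10^8$ bytes yields roughly $9.09\,\text{MBps}$: the larger the payload, the closer $e$ gets to $b$.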

## Maintaining Saturation in a Block Protocol

Block protocols operate by sending discrete requests for _blocks_ -- individually addressable, fixed-sized chunks of data of size $s_b$ that make up a larger dataset -- and then waiting for those blocks to be transmitted back. The payload $s$ from Eq. $1$, in the case of a block protocol, is given by $k \times s_b$, where $k$ is the number of blocks we ask for as part of a request.

To make terminology consistent, we refer to a peer that requests blocks from another peer as a _requester_, and the peer that replies to those requests with blocks as the _sender_.

**Request schedules.** When a requester issues a request for a block $u$ to a sender $g$, we say that $u$ has been _scheduled_ to $g$. A block is typically scheduled to a single peer at a time. It would be trivial for a requester to saturate the channel towards some sender $g$ by simply scheduling all the blocks in the dataset towards $g$. Indeed, as Fig. 3 demonstrates, the more blocks we request on a channel, the better (or closer to $b$) our effective bandwidth is.

Scheduling everything towards a single peer is not desirable, however: there might be multiple peers holding the dataset, both now and in the future, and we want to be able to schedule blocks towards those too. If we tie everything up with $g$ upfront, we will not be able to do that.

![image](https://hackmd.io/_uploads/B1g_Jxp3bl.png)

**Figure 3.** Effective bandwidth as a function of blocks requested in a $10\,\text{MBps}$ channel with a $3$ second delay and a block size $s_b = 65\,536$ bytes.

**Minimal request scheduling.** What we would like instead is to schedule some _minimal_ amount of blocks to $g$, enough to keep the channel saturated at all times, while keeping the remainder of blocks in a pool, free to schedule to other peers. Let us envision how we could do that.

The requester: _i)_ starts by picking a certain number $k_i$ of blocks from the unscheduled pool and scheduling those to the sender $g$; _ii)_ waits until a certain number $k_r$ of blocks are received from $g$; _iii)_ as blocks arrive, schedules an additional $k_a$ requests to $g$, ideally before $g$ runs out of blocks to send; _iv)_ continues from step _(ii)_.

This allows us to schedule blocks incrementally and according to $g$'s speed (we only schedule more when we get a certain number of blocks back), while maintaining the unscheduled blocks in a pool where they can be directed to other peers: if some other peer $h$ appears later and happens to have parts of the dataset, we can do the same with $h$. Furthermore, if $h$ turns out to be faster than $g$, we will schedule more blocks towards $h$ than we will towards $g$.

Assuming delays are symmetrical and we know $d$ and $b$, the optimal $k$s are not hard to compute. Once the sender starts transmitting, it takes $d$ seconds for the requester to see the first block, and an additional $d$ seconds for a new request to reach the sender. This means that the sender must have at least $2 \times d$ worth of blocks to transmit in $k_i$ so it does not stall before the next request for additional blocks arrives.

Since the bandwidth is $b$ and the block size is $s_b$, this is $k_i = 2 \times d \times b / s_b$ blocks. We refer to $c = d \times b / s_b$ as the channel's _block capacity_, and it turns out to be exactly the bandwidth-delay product divided by the block size.

As soon as the requester sees the first block, the sender has already transmitted half of the blocks in $k_i$, so it needs to request additional $k_a = c$ blocks immediately. The requester can then steadily: _i)_ wait for $k_r = c$ blocks to arrive, _ii)_ schedule additional $k_a = c$ blocks towards $g$, until it runs out of blocks in the unscheduled pool.
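The window sizes above can be sketched in a few lines. This is only an illustration under the same idealized assumptions (known, stable, symmetric $d$ and known $b$); the names `RequestWindows`, `block_capacity` and `request_windows` are hypothetical, and rounding $c$ up to a whole block is a choice made here, not something the derivation prescribes.

```cpp
#include <cmath>
#include <cstdint>

// Window sizes for the idealized schedule. Field names mirror the symbols
// used in the text and are otherwise hypothetical.
struct RequestWindows
{
  std::int64_t initial;   // k_i: blocks scheduled up front
  std::int64_t refill;    // k_a: blocks scheduled on every refill
  std::int64_t threshold; // k_r: blocks to receive before refilling
};

// Block capacity c = d * b / s_b, rounded up to a whole number of blocks
// (the rounding is an assumption of this sketch).
std::int64_t block_capacity(double delay_s, double bandwidth_bps, double block_bytes)
{
  return static_cast<std::int64_t>(std::ceil(delay_s * bandwidth_bps / block_bytes));
}

RequestWindows request_windows(double delay_s, double bandwidth_bps, double block_bytes)
{
  std::int64_t const c = block_capacity(delay_s, bandwidth_bps, block_bytes);
  // k_i = 2c covers one delay for the first block to arrive plus one delay
  // for the follow-up request to reach the sender; after that the requester
  // waits for c blocks and schedules c more.
  return RequestWindows{2 * c, c, c};
}
```

For the channel of Fig. 3, reading $10\,\text{MBps}$ as $10^7$ bytes per second, `block_capacity(3.0, 1e7, 65536.0)` gives $c = 458$ blocks, so the requester would start with $k_i = 916$ blocks and then refill $458$ at a time.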

The key issue with this protocol is that it requires knowledge about $b$ and $d$, and it assumes stability in those parameters to some degree, though as long as variations are not large and sudden, we should remain quite efficient. It also assumes that the delay $d$ is symmetric, though extending it to non-symmetric cases should be straightforward.

Practical protocols build direct and indirect estimates of these quantities, and might differ significantly from what we have described. We will look at two of those practical implementations next.

## Channel Saturation in libtorrent

libtorrent[^1] implements a simple _slow start_ protocol built around a single parameter, the Desired Queue Size (DQS), which tracks how many blocks we would like to keep scheduled towards a sender.

Every unchoked peer begins in slow start. In slow start, the DQS grows by one with every piece received from the peer:

```c++
void peer_connection::incoming_piece(peer_request const& p, char const* data) {
  ...
  if (m_slow_start)
    m_desired_queue_size += 1;
}
```

libtorrent then measures the one-second download rate, as follows. First, it collects the size of every received piece onto a stats object:

```c++
void bt_peer_connection::on_piece(int const received)
{
  ...
  else if (...) 
  {
    received_bytes(received, 0);
    ...
  }
  else
  {
    received_bytes(
      recv_pos - header_size
      , header_size - (recv_pos - received));
    ...

  }
  ...
}

void peer_connection::received_bytes(int const bytes_payload, int const bytes_protocol)
{
  m_statistics.received_bytes(bytes_payload, bytes_protocol);
  ...
}
```

The stats object is then cleared every second. At the boundary of a second, it therefore contains some flavour of an "instantaneous download rate":

```c++
void stat_channel::second_tick(int tick_interval_ms)
{
	std::int64_t sample = std::int64_t(m_counter) * 1000 / tick_interval_ms;
	TORRENT_ASSERT(sample >= 0);
	m_5_sec_average = std::int32_t(std::int64_t(m_5_sec_average) * 4 / 5 + sample / 5);
	m_counter = 0;
}
```

libtorrent then makes a decision to end or continue slow start by comparing the previous one-second download rate to the current download rate:

```c++
void peer_connection::second_tick(int const tick_interval_ms) {

  // if our download rate isn't increasing significantly anymore, end slow
  // start. The 10kB is to have some slack here.
  // we can't do this when we're choked, because we aren't sending any
  // requests yet, so there hasn't been an opportunity to ramp up the
  // connection yet.
  if (m_slow_start
        && !m_peer_choked
        && m_downloaded_last_second > 0
        && m_downloaded_last_second + 5000
          >= m_statistics.last_payload_downloaded())
  {
    m_slow_start = false;
    ...
  }
  
  m_downloaded_last_second = m_statistics.last_payload_downloaded();
}
```

i.e., if we let $r$ be the current download rate, and $r_p$ the previous download rate, the criterion seems to be to end slow start if $r - r_p \leq 5\,\text{KBps}$.

This appears to be a very brittle approach: we need to receive one or more pieces per second to maintain our download rate, and a growing queue needs to translate into a higher download rate _also_ within a couple of seconds, or slow start ends. Under mix-like latencies this is unlikely to work, and ordinary fluctuations in download speed could also end slow start prematurely.
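The brittleness is easy to see by replaying the exit criterion on its own. The sketch below is not libtorrent code: `slow_start_exit_tick` is a hypothetical helper that feeds a sequence of one-second payload counts through the same comparison as the snippet above (with the same $5000$-byte slack) and reports the tick at which slow start would end. Choking and everything else libtorrent does around this check are ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of the one-second sample at which slow start would end,
// or -1 if it never ends for the given samples.
int slow_start_exit_tick(std::vector<std::int64_t> const& bytes_per_second)
{
  std::int64_t const slack = 5000; // same constant as in the quoted snippet
  std::int64_t previous = 0;       // payload bytes seen in the previous second

  for (std::size_t i = 0; i < bytes_per_second.size(); ++i)
  {
    std::int64_t const current = bytes_per_second[i];
    // End slow start once the rate stops growing by more than the slack.
    if (previous > 0 && previous + slack >= current)
      return static_cast<int>(i);
    previous = current;
  }
  return -1;
}
```

A rate that doubles every second, such as `{16384, 32768, 65536, 131072}`, never triggers the exit. A single dip, as in `{16384, 65536, 60000, 200000}`, ends slow start at index $2$. On a high-latency path where data arrives in bursts, e.g. `{0, 0, 65536, 0}`, the first idle second after any progress ends slow start (index $3$).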

The upside is that this is an extremely simple measurement on the surface, and could be easily replicated if we wanted to do that.

## Channel Saturation in Logos Storage

**Batched requests.** Logos Storage always groups block requests in batches of fixed size $\sbatch$. Currently our blocks are $s_b = 2^{16} = 65\,536$ bytes ($64\,\text{KB}$) in length, and $\sbatch = 2^{22}$ bytes ($4\,\text{MB}$), making for a batch size of $64$ blocks.

Requesters then request blocks in batches, and track the number of _in-flight_ or outstanding batches; i.e., batches of blocks we have requested but have not yet received. There are some complications related to partial batch completions (e.g., if we get some blocks back for a batch but not others, is that batch still "in flight"?), but we will not discuss those here. For simplicity, we assume that all batches complete, so we can focus on the channel saturation mechanism instead.

Our approach uses a two-state state machine which alternates between STABLE and PROBING, with the initial state being STABLE:

```nim=
proc optimalPipelineDepth*(self: var PeerPerfStats, batchBytes: uint64): int =
  let now = Moment.now()

  case self.probeMode
  of Stable:
    let
      bdpDepth = self.computeBdpDepth(batchBytes, now)
      gracePassed = (now - self.lastDepthChangeTime) >= ThroughputWindow

    if bdpDepth < self.currentDepth and gracePassed:
      self.currentDepth = max(MinRequestsPerPeer, bdpDepth)
      self.lastDepthChangeTime = now

    let effectiveInterval =
      ProbeIntervalBatches * (1 shl min(self.consecutiveReverts, MaxProbeBackoffShift))
    if self.batchesSinceProbe >= effectiveInterval and
        self.currentDepth < MaxRequestsPerPeer:
      let baseline = self.avgThroughputBps(now)
      if baseline.isSome:
        self.probeBaselineBps = baseline.get()
        self.probeStartTotalBytes = self.totalBytesDelivered
        self.probeStartTime = now
        self.probeMode = Probing
        self.batchesInProbeWindow = 0
        self.currentDepth = self.currentDepth + 1
        self.lastDepthChangeTime = now

    return self.currentDepth
  of Probing:
    # ...
```

Here `bdpDepth` represents the current estimate for the optimal number of batches to keep in flight. Line $7$ does the actual work of computing it, while the grace-period check (lines $8$ and $10$) is in place to avoid constant changes to the parameter. `ThroughputWindow` is currently set to $3$ seconds. Lines $10$--$12$ update the parameter, and only ever downwards (when `bdpDepth` is below the current depth), with the minimum value bounded at `MinRequestsPerPeer = 2` batches.

Lines $14$--$26$ decide whether or not the machine should transition to `PROBING`. They start by calculating `effectiveInterval`, which is a backoff interval for probes, measured in batches. It grows with `consecutiveReverts`, which is the number of consecutive probes that resulted in a stable trend in throughput, and is bounded above by `2^(log2(ProbeIntervalBatches) + MaxProbeBackoffShift) = 2^7 = 128`, and below by `ProbeIntervalBatches = 16`.
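The backoff schedule can be tabulated with a small sketch. The constants below are taken from the bounds quoted above: `ProbeIntervalBatches = 16` and an upper bound of $128$, which implies `MaxProbeBackoffShift = 3`; that last value is inferred here rather than read from the source. The lower-case names are hypothetical stand-ins for the Nim constants.

```cpp
#include <algorithm>

// Stand-ins for ProbeIntervalBatches and MaxProbeBackoffShift. The shift of 3
// is inferred from the stated 128-batch upper bound, not read from the source.
constexpr int probe_interval_batches = 16;
constexpr int max_probe_backoff_shift = 3;

// Number of batches to spend in STABLE before the next probe is attempted.
int effective_interval(int consecutive_reverts)
{
  // Guard against negative input; the Nim code only applies the upper bound.
  int const shift =
      std::min(std::max(consecutive_reverts, 0), max_probe_backoff_shift);
  return probe_interval_batches * (1 << shift);
}
```

Under these assumptions, $0$, $1$, $2$ and $3$ consecutive reverts give intervals of $16$, $32$, $64$ and $128$ batches, and any further reverts stay at $128$.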

Lines $19$--$26$ move the machine into `PROBING`, recording the current baseline throughput, delivered byte count, and start time, and bumping the depth by one. Let us now get deeper into this.

### Estimation of the Number of in-flight Batches

```nim=

proc computeBdpDepth(self: var PeerPerfStats, batchBytes: uint64, now: Moment): int =
  if batchBytes == 0:
    return DefaultPipelineDepth

  let rttMicrosOpt = self.avgRttMicros()
  if rttMicrosOpt.isNone:
    return DefaultRequestsPerPeer

  let throughputOpt = self.avgThroughputBps(now)
  if throughputOpt.isNone:
    return DefaultRequestsPerPeer

  let
    rttMicros = rttMicrosOpt.get()
    throughput = throughputOpt.get()
    rttSecs = rttMicros.float64 / 1_000_000.0
    bdpBytes = throughput.float64 * rttSecs
    depth = ceil(bdpBytes / batchBytes.float64).int
  clamp(depth, MinRequestsPerPeer, MaxRequestsPerPeer)
```

There are two important quantities here:

1. `avgRttMicros`, which we refer to as $\overline{d_{\rtt}}$;
2. `avgThroughputBps`, which we refer to as $\overline{b}$.

$\overline{d_{\rtt}}$ is a windowed average in which each sample is the duration of a call to `requestWantBlocks`:

```nim=
proc sendWantBlocksRequest(
  ...
): ... =
  ...
  let
    requestStartTime = Moment.now()
    requestResult = await self.requestWantBlocks(
      peer.id, BlockRange(treeCid: treeCid, ranges: ranges)
    )
    rttMicros = (Moment.now() - requestStartTime).microseconds.uint64
  ...
```

Calls to `sendWantBlocksRequest` request a batch at a time, and the measurement includes the whole cycle. The timing breakdown for a whole `sendWantBlocksRequest` can be written down as in Fig. 4.

![bdp-3](https://hackmd.io/_uploads/ryAfiglpWx.png)

**Figure 4.** Timing breakdown for `requestWantBlocks`.

First, the requester serializes the block request and pushes it to the wire ($t_1$). Since the message is very small, this takes negligible time. The message then travels towards the sender, taking $d_{12}$. Once at the sender, the message is deserialized, the blocks are read from disk, and the response is serialized ($t_2$).

The response is then pushed onto the wire, which takes $t_3 = \sbatch/b$, and takes another $d_{21}$ to reach the requester, which puts the data into a buffer and deserializes the header ($t_4$). A single $d_{\rtt}$ sample is therefore:

$$
d_{\rtt}^{(i)} = t_1 + d_{12} + t_2 + \frac{\sbatch}{b} + d_{21} + t_4
$$

For simplicity we drop the $i$ index on the right hand side, but those are all samples as well and can vary across measurements (even $\sbatch$, as we might have a partial/sparse batch).

We can now group the terms which do not depend on the bandwidth under a $u^{(i)}$ term[^2], and denote the single term that depends on the bandwidth, $\sbatch/b$, as the _batch transmission time_ $t_b$:

$$
d_{\rtt}^{(i)} = (t_1 + d_{12} + t_2 + d_{21} + t_4) + \frac{\sbatch}{b} = u^{(i)} + t_b^{(i)}
$$

By linearity of the average, we have:

$$
\overline{d_{\rtt}} = \overline{u} + \overline{t_b}
$$

Next we have $\bar{b}$, which is a time-windowed average of the transfer rate. Data points are $(c_i, t_i)$ tuples in which $c_i$ represents the total amount of bytes transferred over the channel at time $t_i$. Those get placed into a list:

```nim=
proc recordRequest*(self: var PeerPerfStats, rttMicros: uint64, bytes: uint64) =
  ...
  let now = Moment.now()
  self.totalBytesDelivered += bytes
  self.throughputSamples.addLast(
    ThroughputSample(time: now, cumBytes: self.totalBytesDelivered)
  )
  self.trimThroughputWindow(now)
  ...
  
proc trimThroughputWindow(self: var PeerPerfStats, now: Moment) =
  while self.throughputSamples.len > 0 and
      (now - self.throughputSamples[0].time) > ThroughputWindow:
    discard self.throughputSamples.popFirst()
```

The list is always pruned to discard samples that are older than $3$ seconds. $\bar{b}$ is then computed by taking the last element in the list $(c_f, t_f)$, and the first element $(c_s, t_s)$, and computing:

$$
\bar{b} = \frac{c_f - c_s}{t_f - t_s}
$$

Finally, the bandwidth-delay product estimate is done as:

$$
\overline{\text{bdp}} = \overline{b} \times \overline{d_{\rtt}} = \overline{b}\times \overline{u} + \overline{b} \times \overline{t_b}
$$

Note that the second term is not a pure bandwidth-delay product: $\overline{t_b}$ itself depends on the bandwidth. When the channel saturates, however, we have that $\overline{b} \sim b$. If we further assume that on saturation $\overline{t_b} \sim \sbatch/b$[^3], we get that, at saturation:

$$
\overline{\text{bdp}} \sim b \times \overline{u} + b \times \frac{\sbatch}{b} = b \times \overline{u} + \sbatch
$$

The number of inflight batches we want to keep is given by:

$$
n_b = \overline{\text{bdp}} / \sbatch
$$

Plugging the approximation above, we have, on saturation, that:

$$
n_b \sim \frac{b \times \overline{u} + \sbatch}{\sbatch} = \frac{b \times \overline{u}}{\sbatch} + 1
$$

This is quite intuitive: the $b \times \overline{u}$ term is the product of the bandwidth and the bandwidth-independent time components, i.e., the intrinsic (queueing) delay of the channel. This is the definition of the bandwidth-delay product and represents channel capacity. The $\frac{b \times \overline{u}}{\sbatch}$ term then represents how many batches we can "fit" into the channel.

Finally, $n_b$ is off by one at saturation because the measured delay includes a bandwidth-dependent term, $\overline{t_b}$. It is nevertheless, in general, a good approximation of the number of batches that we should keep in flight.


[^1]: https://libtorrent.org/
[^2]: Having $t_2$ in here is a bit controversial to me as this _does_ depend on data size and a fluctuating ability of the node to process the batch, which is pretty much equivalent to bandwidth.
[^3]: This is definitely an iffy assumption. We are saying that $\mathbb{E}(X/Y) = \mathbb{E}(X)/\mathbb{E}(Y)$, which is typically not true.