## Floodgate: Taming Incast in Datacenter Networks
## 1. Abstract
### (1.1) Datacenter incast
When a large number of senders transmit data to a single receiver simultaneously, the last hop becomes the network bottleneck. This is called an incast. Incast scenarios occur frequently in datacenters when clients run web search applications, Spark-like data-parallel systems, TensorFlow-like machine learning systems, or large-scale storage backups.


### (1.2) Issues caused from incast
#### (1.2.1) Buffer overflow
A large amount of incast traffic can lead to buffer build-up and even overflow, inducing packet drops. Timeout retransmissions induced by packet drops hurt network throughput and result in long tail latency.
#### (1.2.2) Head-of-line (HOL) blocking
The large buffer occupancy induced by incast leads to head-of-line (HOL) blocking, which impairs the performance of other flows.

#### (1.2.3) Priority-based Flow Control (PFC) pause frame storm
PFC could lead to poor application performance due to its congestion-spreading characteristics, i.e., head-of-line (HOL) blocking, routing deadlocks, and PFC pause frame storms.
### (1.3) How Floodgate deals with incast
- Divide the packet load among the switches on the path rather than concentrating it at the bottleneck switch.
- Protect other flows from the incast traffic.
## 2. Divide up the packet load among the switches
Floodgate's switch uses a per-dst sending window to control the transmission of traffic for each receiver host.

- When does it take effect?
For non-incast flows, the sending window does not restrict transmission; data packets are always forwarded successfully by the downstream switch. For incast flows, the rate at which credits return is limited by the last-hop ToR, i.e., the network bottleneck.
- How is the value of the sending window decided?
The sending window can be initialized with a small value, i.e., α · BDP_nextHop, where BDP_nextHop is the bandwidth-delay product between the switch itself and its next-hop downstream switch, and α is a parameter that controls how aggressively Floodgate recognizes a flow as encountering an incast.
- Practical challenge
    - Challenge
When a data packet is forwarded to the switch's egress queue, a credit is sent back to its upstream switch immediately. However, it is costly for switches to generate credit packets at a high rate, because doing so consumes the switch's pps (packets per second) budget. Besides, per-packet credits also consume network bandwidth.
    - Solution
Floodgate's switch maintains a timer T for each ingress port, which records the elapsed time since credits were last sent back.
The switch also records, for each destination host and each ingress port, the number of packets that have already been forwarded but whose credits have not yet been returned. When the elapsed time reaches the preset value, a credit packet is sent back from the ingress port to the corresponding upstream switch, carrying the destination host's IP and the number of credits, i.e., <destination_IP, credits>.
With this aggregated-credit design, the switch's sending window is initialized to BDP_nextHop + C_out · T, where T is the transmission interval of the aggregated credit packets and C_out is the switch's egress port bandwidth. A minimal sketch of this window and credit logic is given after this list.
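Below is a minimal Python sketch of the per-dst sending window and aggregated credit return described above. It is an illustration rather than the authors' implementation: the link speed, packet size, credit interval, and the choice of showing both the upstream (window) and downstream (credit) roles in one class are assumptions made for brevity.

```python
from collections import defaultdict

# Hypothetical numbers, not from the paper: a 25 Gbps next-hop link with a
# ~5 us one-hop RTT gives BDP_nextHop ~= 15.6 KB, i.e., roughly 16 packets
# at ~1 KB per packet.
BDP_NEXT_HOP_PKTS = 16           # BDP to the next-hop switch, in packets
C_OUT_PPS = 3_000_000            # egress bandwidth C_out, in packets per second
CREDIT_INTERVAL_T = 20e-6        # aggregated-credit interval T, in seconds

class PerDstWindowSwitch:
    """One switch shown in both roles: consuming its per-dst window when
    forwarding downstream, and returning aggregated credits upstream."""

    def __init__(self):
        # With per-packet credits the window could instead start at
        # alpha * BDP_NEXT_HOP_PKTS (alpha < 1); with aggregated credits it
        # must also cover the C_out * T packets sent between credit returns.
        init_window = BDP_NEXT_HOP_PKTS + int(C_OUT_PPS * CREDIT_INTERVAL_T)
        self.window = defaultdict(lambda: init_window)   # per destination host
        self.uncredited = defaultdict(int)               # forwarded, credit not yet returned
        self.elapsed = 0.0                               # per-ingress-port timer T

    def try_forward(self, dst_ip: str) -> bool:
        """Consume one window unit before sending a packet toward dst_ip."""
        if self.window[dst_ip] > 0:
            self.window[dst_ip] -= 1
            return True
        return False        # no window left: the packet would wait in a VOQ

    def on_packet_to_egress(self, dst_ip: str) -> None:
        """Downstream role: remember that a credit is owed to the upstream switch."""
        self.uncredited[dst_ip] += 1

    def on_timer_tick(self, dt: float, send_credit_upstream) -> None:
        """Every T seconds, return one aggregated <destination_IP, credits> per host."""
        self.elapsed += dt
        if self.elapsed < CREDIT_INTERVAL_T:
            return
        for dst_ip, credits in self.uncredited.items():
            if credits > 0:
                send_credit_upstream({"destination_IP": dst_ip, "credits": credits})
        self.uncredited.clear()
        self.elapsed = 0.0

    def on_credit_received(self, dst_ip: str, credits: int) -> None:
        """Upstream role: replenish the per-dst window when credits come back."""
        self.window[dst_ip] += credits
```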
## 3. Incast isolation
### (3.1) Deal with HOL via virtual output queue (VOQ)
Instead of sending packets directly to the egress queue, a VOQ holds packets until the egress port has the resources to forward the traffic. In Floodgate, data packets destined for the same receiver host are assigned a dedicated VOQ, aiming to isolate traffic passing through different congestion points.

### (3.2) How Floodgate is co-designed with VOQs
When a data packet is received, the switch first checks whether the packet's destination host has already been allocated a VOQ. If so, the corresponding destination is encountering an incast, and the data packet is pushed into that VOQ. Otherwise, the switch checks whether there is sending window left; if the window is adequate, the data packet is forwarded to the egress queue directly. A sketch of this admission logic is shown below.
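The following Python sketch illustrates this admission logic under stated assumptions: what happens when the window is exhausted (a VOQ is allocated, which the text implies but does not spell out), the order in which returned credits drain the VOQ, and when a VOQ is released are illustrative choices rather than the paper's exact behavior.

```python
from collections import deque

class FloodgateAdmission:
    """Per-switch admission logic: existing VOQ -> queue; otherwise spend the
    per-dst window or fall back to a newly allocated VOQ (fallback assumed)."""

    def __init__(self, init_window: int):
        self.init_window = init_window
        self.window = {}     # remaining per-dst sending window, in packets
        self.voq = {}        # per-dst virtual output queues

    def on_data_packet(self, pkt: dict, forward_to_egress) -> None:
        dst = pkt["dst_ip"]
        if dst in self.voq:
            # A VOQ already exists: this destination is treated as an incast
            # aggregation point, so the packet waits there for credits.
            self.voq[dst].append(pkt)
            return
        remaining = self.window.setdefault(dst, self.init_window)
        if remaining > 0:
            # Window is adequate: forward directly to the egress queue.
            self.window[dst] = remaining - 1
            forward_to_egress(pkt)
        else:
            # Window exhausted (assumed): allocate a dedicated VOQ and buffer.
            self.voq[dst] = deque([pkt])

    def on_credit(self, dst: str, credits: int, forward_to_egress) -> None:
        # Returned credits first drain the destination's VOQ, then the leftover
        # replenishes the window (this draining order is an assumption).
        self.window[dst] = self.window.get(dst, 0) + credits
        q = self.voq.get(dst)
        while q and self.window[dst] > 0:
            self.window[dst] -= 1
            forward_to_egress(q.popleft())
        if q is not None and not q:
            del self.voq[dst]   # release the VOQ once drained (assumed)
```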
## 4. Testbed Evaluation
### (4.1) Performance of testbed experiments under incast-mix scenarios

- (a) FCT of non-incast flows
Without Floodgate, non-incast flows are HOL-blocked by incast flows and suffer large queuing delays. With Floodgate, their FCT is reduced.
- (b) Buffer occupancy
Without Floodgate, data packets are mainly buffered at the core switch and the destination ToR. With Floodgate, incast traffic is tamed by the per-dst window, so the buffer occupancy on ToR-Up ports is slightly larger, while the maximum buffer occupancy on ToR-Down and core switches is reduced.
## 5. Simulation evaluation
### (5.1) Average and 99th-percentile tail FCT of Poisson flows under incast-mix scenarios

- Floodgate improves flows' performance
The improvement comes from significantly reducing buffer occupancy, which eliminates PFC. Meanwhile, non-incast flows are no longer HOL-blocked by incast flows, thanks to VOQ isolation.
### (5.2) Traffic reallocation and queuing time analysis
#### (5.2.1) Max buffer

- Web server workload
The maximum buffer occupancy on ToR-Up ports is also large because of the PFC frame storm.
- Hadoop workload
Because of the non-blocking topology, the buffer occupancy on ToR-Up ports is nearly zero. The maximum buffer occupancy on ToR-Down ports is the largest among all kinds, followed by buffer occupancy on core switches. This is because ToR-Down ports and core switches are the aggregation point of incast traffic.
- With Floodgate
Thanks to the per-dst sending window, the buffer pressure put on the incast aggregation points is greatly reduced. Therefore, the buffer occupancy on core and ToR-Down ports drops significantly. Meanwhile, a relatively larger buffer occupancy on ToR-Up ports is observed, because incast traffic is first tamed by source ToRs to avoid injecting it into the downstream network aggressively.
- With per-dst PAUSE
When Floodgate additionally leverages the per-dst PAUSE mechanism, in-network traffic is reduced further, so the buffer occupancy on ToR-Up ports is smaller than with Floodgate alone.
#### (5.2.2) Queuing time
Figure 11(b) shows the per-hop average queuing time of non-incast flows.

- For DCQCN under the Web Server workload, packets spend much of their time on ToR-Up ports because of the PFC frame storm.
- Under the Hadoop workload, the time spent on core switches constitutes most of the queuing time because of HOL blocking caused by incast flows.
- When Floodgate is applied, the queuing time at each hop is greatly reduced. The slightly larger buffer occupancy on ToR-Up ports does not affect flows' queuing time, because incast traffic is isolated via VOQs.
### (5.3) Robustness of Floodgate
#### (5.3.1) Handling packet loss
Under a 5% packet loss rate, no obvious effect on throughput is observed. Under a 10% packet loss rate, throughput fluctuates within a small range. This indicates that the sending window of Floodgate's switch recovers quickly after packet loss.

#### (5.3.2) Three-tier topology
#### (5.3.3) When the number of ToRs scales up
When the number of ToRs reaches 20, PFC is triggered and the buffer occupancy reaches its maximum value. Floodgate is more robust in handling large-scale topologies: its buffer occupancy remains stable as the number of ToRs increases.
One may wonder why the buffer occupancy of core switches remains stable, given that they receive data packets from more ToRs as the number of ToRs increases. This is because Floodgate benefits from the delayCredit mechanism: core switches delay sending back credits when plenty of packets are buffered in VOQs.
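A minimal sketch of the delayCredit idea, assuming a simple per-destination VOQ-length threshold as the trigger (the actual condition Floodgate uses may differ):

```python
# VOQ_BACKLOG_THRESHOLD is a hypothetical value for illustration only.
VOQ_BACKLOG_THRESHOLD = 32   # packets

def credits_to_return(pending_credits: int, voq_length: int) -> int:
    """How many credits a core switch returns upstream for one destination."""
    if voq_length > VOQ_BACKLOG_THRESHOLD:
        # Plenty of packets are still buffered in this destination's VOQ:
        # delay the credits so upstream switches stop injecting more traffic.
        return 0
    return pending_credits
```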

### (5.4) When "per-dst PAUSE" is active
- What is "per-dst PAUSE"
When a Floodgate ToR receives a data packet whose VOQ length exceeds the threshold, a special PAUSE frame called dstPause is sent back to the source hosts, piggybacking the corresponding destination IP. Meanwhile, NICs maintain per-dst queues to pause the corresponding traffic. When a host receives a dstPause, it parses the frame and pauses the flow whose destination matches. A sketch of this mechanism is given after this list.
- Experiment
- Pure DCQCN
For DCQCN, the incast traffic quickly fills up ToR-Down ports' buffer, as well as core switches' buffer. When the incast occurs 12 times or more, the buffer occupancy on ToR-Up ports starts to increase, which indicates that the PFC pause frame storm happens.
- Floodgate
In Floodgate, the buffers of core switches and ToR-Down ports are stable because the ToR-Up ports are the first-hop gatekeepers of incast traffic. As a side effect, the ToR-Up ports' buffer occupancy increases at a rate proportional to the number of incasts when incast traffic arrives continuously. Nevertheless, it is notable that Floodgate can handle dozens of successive incasts well.
- Floodgate with "per-dst PAUSE"
With per-dst PAUSE, the buffer occupancy is extremely small. Source hosts are paused by source ToRs. Therefore the in-network traffic can be reduced significantly.
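Below is a minimal sketch of how the per-dst PAUSE mechanism described above could look at the source ToR and the host NIC. The frame fields, the pause threshold, the existence of a symmetric resume, and the scheduling policy over per-dst queues are illustrative assumptions, not the paper's exact design.

```python
from collections import defaultdict, deque

VOQ_PAUSE_THRESHOLD = 64   # packets; hypothetical ToR-side threshold

def tor_on_data_packet(voq_len: int, dst_ip: str, send_to_source_host) -> None:
    """Source-ToR side: emit a dstPause once the destination's VOQ exceeds the threshold."""
    if voq_len > VOQ_PAUSE_THRESHOLD:
        send_to_source_host({"type": "dstPause", "destination_IP": dst_ip})

class HostNic:
    """Host side: per-destination queues, pausing only the flagged destination."""

    def __init__(self):
        self.per_dst_queue = defaultdict(deque)   # outgoing packets per destination
        self.paused = set()                       # destinations currently paused

    def enqueue(self, pkt: dict) -> None:
        self.per_dst_queue[pkt["dst_ip"]].append(pkt)

    def on_dst_pause(self, frame: dict) -> None:
        # Parse the dstPause frame and pause only the matching destination.
        self.paused.add(frame["destination_IP"])

    def on_dst_resume(self, frame: dict) -> None:
        # A symmetric resume is assumed here; the text does not detail it.
        self.paused.discard(frame["destination_IP"])

    def dequeue_next(self):
        # Serve any unpaused destination (the scheduling policy is assumed).
        for dst, q in self.per_dst_queue.items():
            if q and dst not in self.paused:
                return q.popleft()
        return None
```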

## 6. Resource Overhead
### (6.1) Memory overhead
The memory overhead comes from the runtime state that Floodgate must maintain. Floodgate's switch needs to maintain a sending window for each active destination. In the worst case, the number of sending windows scales with the number of hosts in the network.
- For a datacenter network consisting of 100,000 servers, the number of sending-window entries is 100,000 in the worst case, which consumes less than 10% of the switch's dedicated stateful memory according to the discussion in BFC.
- Besides, given that incast traffic is isolated from non-incast traffic, small non-incast flows can be finished quickly. Therefore, the number of active destinations can be modest.
- In addition, destination IPs can be hashed into a hash table that stores the sending-window values, trading some precision for memory, as sketched below.
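A minimal sketch of this hashing idea, assuming a fixed-size table in which colliding destinations share one window entry (this sharing is the precision cost mentioned above); the table size, initial window, and hash function are hypothetical:

```python
import zlib

TABLE_SIZE = 65536        # number of window slots (hypothetical)
INIT_WINDOW = 32          # initial window per slot, in packets (hypothetical)

windows = [INIT_WINDOW] * TABLE_SIZE

def slot(dst_ip: str) -> int:
    # Any uniform hash works; CRC32 is just an example.
    return zlib.crc32(dst_ip.encode()) % TABLE_SIZE

def consume(dst_ip: str) -> bool:
    """Take one window unit before forwarding a packet toward dst_ip."""
    i = slot(dst_ip)
    if windows[i] > 0:
        windows[i] -= 1
        return True
    return False

def replenish(dst_ip: str, credits: int) -> None:
    """Add returned credits back to the (possibly shared) window slot."""
    windows[slot(dst_ip)] += credits
```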
### (6.2) Network overhead
As shown in Figure 18, control packets, i.e., ACKs or CNPs sent by receiver hosts, consume 4.5% of bandwidth in both approaches, the same as in DCQCN. For Floodgate, credit packets consume 0.175% of bandwidth, compared with 3.0% in the ideal design. This suggests that the network overhead induced by Floodgate is negligible.

## 7. Conclusion
This paper analyzes the consequences of incast and proposes Floodgate, a switch-based hop-by-hop flow control scheme, which handles incast through the following steps:
1. recognize incast flows quickly and accurately.
2. tame the transmission of incast flows.
3. isolate incast flows from non-incast flows.
Floodgate reduces the buffer occupancy significantly, thus reducing packet loss/PFC. Therefore, the performance of non-incast flows is greatly improved while the performance of incast flows is not compromised. With the rapid growth of switches' programmability, co-design of congestion control with switch-based hop-by-hop flow control is promising.