# Achieving 100Gbps Intrusion Prevention on a Single Server
###### tags: `Accelerators`
###### paper origin: OSDI 2020
###### papers: [link](http://users.ece.cmu.edu/~jhoe/distribution/2020/osdi2020.pdf)
###### video: `none`
###### slide: `none`
# Outline
* 1. Introduction
* 2. Background & Motivation
* IDS/IPS Functionality
* FPGA Basics
* FPGAs and IDS/IPS Performance
* 4 problems for baseline scheduling
* 3. System Overview
* An FPGA-First Design
* Pigasus Datapath
* Memory Resource Management
* 4. Reassembly
* Design Space for TCP Reassembly
* Pigasus TCP Reassembler
* Implementing the Flow Table
* Worst-Case Performance
* 5. Multi-String Pattern Matching
* MSPM in Pigasus
* 6. Evaluation
# 1. Introduction
* Background of this paper
* Intrusion Detection and Prevention Systems (IDS/IPS) are a critical part of any operational security deployment.
* A recurring theme in IDS/IPS literature is **the gap between the workloads they need to handle and the capabilities of existing hardware/software implementations**.
* Problem of this paper
* We are faced with the need to build IDS/IPSes that can support line rates on the order of **100Gbps, with hundreds of thousands of concurrent flows, while matching packets against tens of thousands of rules**.
* The shortcoming of prior research
* For the most part, prior work has focused on offloading only a specific piece of functionality to the FPGA. Unfortunately, **traditional offloading cannot close the order-of-magnitude performance gap** needed to offer 100Gbps IDS/IPS processing within the footprint of a single server.
* Snort 3.0 would still, on average, operate at 400 Mbps/core, requiring 250 cores to keep up with line rate!
* The subject of this paper
* This paper addresses this shortcoming with **Pigasus, an FPGA-based IDS/IPS that runs on a single server**.
* The challenges of building Pigasus, an FPGA-based IDS/IPS
* Multi-string pattern matching (MSPM) for payload matching
* NF-specific programming frameworks for FPGAs do not provide the necessary abstractions for searching bytestreams
* TCP bytestream reassembly must be performed at 100Gbps line rate
* A practical system needs to be able to track at least 100K flows and check at least 10K patterns at 100Gbps line rates
* Contributions
* How the above problems are solved:
1. Hierarchical Pattern Matching
* Inspiration from Hyperscan
* Pigasus uses a **set of hierarchical filters** to reduce the overall amount of memory required per pipeline replica, enabling search over 32 indices in parallel while consuming only about 2MB of BRAM.
2. Fast Common-Case Reassembly
* Pigasus adopts the **memory-dense linked list approach**, but only on an out-of-order slow path which runs in parallel with the primary fast path.
# 2. Background & Motivation
## 2.1 IDS/IPS Functionality
* The key goal of a signature-based IDS/IPS system is to identify when a network flow triggers any of up to tens of thousands of signatures, also known as rules.
* A given signature may specify one or several patterns. Patterns come in the following three categories:
* Header match
* String match
* Regular expression
* Signatures are detected at the granularity of a ‘Protocol Data Unit' (PDU); that is, a **signature is only triggered if all of its matches are found within the same PDU** (see the sketch after this list).
* One of the most widely known IDS/IPSes is **Snort**, and our work aims to be compatible with Snort rulesets
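A minimal sketch of this rule model, assuming a simplified layout (the `Rule` fields and `rule_matches_pdu` helper are hypothetical, not Snort's actual rule representation):

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Rule:
    """Hypothetical signature: a header constraint, exact strings, and an optional regex."""
    rule_id: int
    dst_port: int                                        # header match, simplified to one field
    strings: List[bytes] = field(default_factory=list)   # exact payload substrings
    regex: Optional[bytes] = None                        # optional regular expression

def rule_matches_pdu(rule: Rule, dst_port: int, pdu: bytes) -> bool:
    """A signature triggers only if *all* of its patterns match within the same PDU."""
    if dst_port != rule.dst_port:                                   # header match
        return False
    if not all(s in pdu for s in rule.strings):                     # string matches
        return False
    if rule.regex is not None and not re.search(rule.regex, pdu):   # regex match
        return False
    return True

# Toy example: the header and every payload pattern must hold for the same PDU.
r = Rule(rule_id=1, dst_port=80, strings=[b"POST ", b"Transfer-Encoding:"], regex=rb"chunked\s*$")
print(rule_matches_pdu(r, 80, b"POST /x HTTP/1.1\r\nTransfer-Encoding: chunked"))   # True
print(rule_matches_pdu(r, 443, b"POST /x HTTP/1.1\r\nTransfer-Encoding: chunked"))  # False (header)
```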

* Snort 3.0 cannot meet our goal of supporting 100Gbps, 100K flows, and 10K rules on a single server. This would require 125-667 cores to support 100Gbps of throughput, or 4-21 servers.
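A back-of-the-envelope check of these core counts (the 400 Mbps/core figure is the average quoted in the introduction; the 150-800 Mbps endpoints and the 32 cores/server figure are assumptions chosen to reproduce the 125-667 core and 4-21 server ranges, not numbers quoted directly):

```python
LINE_RATE_MBPS = 100 * 1000  # 100Gbps target

# Average case: ~400 Mbps per Snort 3.0 core.
print(LINE_RATE_MBPS / 400)  # -> 250.0 cores on average

# Trace-dependent extremes implied by the 125-667 core range.
for per_core_mbps in (800, 150):
    cores = LINE_RATE_MBPS / per_core_mbps
    print(f"{per_core_mbps} Mbps/core -> {cores:.0f} cores (~{cores / 32:.0f} servers at 32 cores/server)")
```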
## 2.2 FPGA Basics
* Why look to FPGAs to improve IDS/IPSes?
* We choose FPGAs because they are energy efficient and because they are conveniently deployed on SmartNICs where they are poised to operate on traffic **without PCIe latency or bandwidth overheads**.
* FPGAs allow programmers to specify custom circuits using code. Because FPGA clock rates are 5-20× slower than CPU clock rates, circuits must be designed with a high degree of parallelism
* FPGA Block RAM (BRAM) is split into many small, independently accessible blocks, which makes it very friendly to parallel access

## 2.3 FPGAs and IDS/IPS Performance

* Profiling the fraction of CPU time spent on each task in Snort 3.0 (MSPM, Full Matching, TCP Reassembly, and other tasks) shows that no single task dominates CPU time

* The idealized ‘speedup factor' from offloading any individual module is shown for each of our traces; no module even reaches a 2× speedup.
* The key lesson is simple: a much more drastic approach is needed to achieve line-rate throughput on a single server.
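This is essentially Amdahl's law: even an infinitely fast offload of a single module is bounded by the fraction of CPU time that module consumed. A quick sketch (the fractions below are illustrative, not measurements from the paper):

```python
def ideal_offload_speedup(fraction_offloaded: float) -> float:
    """Amdahl's law, assuming the offloaded fraction becomes free (infinitely fast accelerator)."""
    return 1.0 / (1.0 - fraction_offloaded)

# Unless one module consumes at least half of all CPU time, offloading
# only that module cannot even double end-to-end throughput.
for fraction in (0.30, 0.40, 0.50):
    print(f"offload {fraction:.0%} of CPU time -> at most {ideal_offload_speedup(fraction):.2f}x speedup")
```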
# 3. System Overview
## 3.1 An FPGA-First Design
* By FPGA-first, we mean that the FPGA is the primary compute platform performing the majority of work
* Pigasus leaves **regular expressions** and the **‘full match' stage** on traditional processors, because offloading them provides lower performance benefits at a higher cost in terms of memory. The reasons are below:
* the Full Match stage only interacts with ≈5% of packets in the Pigasus design.
* State-of-the-art hardware regular-expression engines do not meet Pigasus's performance and memory demands
* Therefore, **offloading regular expressions would exhaust our memory budget for little gain**.
## 3.2 Pigasus Datapath

* The Parser, Reassembler, and Multi-String Pattern Matcher (MSPM) are implemented in the FPGA while the Full Matcher is offloaded to the CPU
* **Initial packet processing** : Each packet first goes through a 100Gbps **Ethernet Core** that translates electric signals into raw Ethernet frames. These frames are temporarily stored in the **Packet Buffer**; each frame’s header is separately sent to the Parser which extracts TCP/IP header fields as metadata for use by the Reassembler and MSPM – and then **forwards the header to the Reassembler**.
* **Reassembler** : The Reassembler puts TCP packets in order so that the bytestream can be searched contiguously, while UDP packets are forwarded through the Reassembler without processing.
* **Data Mover** : Receives the packet metadata from the Reassembler and issues requests to fetch raw packets from the Packet Buffer so that they can be forwarded to the MSPM.
* **Multi-String Pattern Matcher** : Responsible for checking every packet against the header match for all 10,000 rules, and every index of every packet against all of the string-match filters for all 10,000 rules.
* **DMA Engine** : For each packet, the MSPM outputs the set of rule IDs that the packet partially matched.
* If the MSPM outputs the empty set, the packet is released to the network.
* Otherwise it is forwarded to the DMA Engine which transfers the packet to the CPU for Full Matching.
* **Full Matcher** :
* On the software side, **the Full Matcher polls a ring buffer** which is populated by the DMA Engine.
* the Full Matcher retrieves the complete rule and checks for a full match. It then writes its decision (forward or drop) to a transmission ring buffer which is polled by the DMA Engine on the FPGA side.
* If the decision is to forward, the DMA Engine forwards the packet to the network
* otherwise the packet is simply erased from the DMA Engine’s Check Packet Buffer
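A minimal software-side sketch of this Full Matcher loop, assuming hypothetical ring-buffer objects and a placeholder `full_match()` (Pigasus's actual DMA interface and rule evaluation are not shown):

```python
from collections import deque

# Hypothetical stand-ins for the rings shared with the FPGA's DMA Engine.
rx_ring: deque = deque()  # filled by the DMA Engine: (packet_id, payload, candidate_rule_ids)
tx_ring: deque = deque()  # drained by the DMA Engine: (packet_id, "forward" or "drop")

def full_match(payload: bytes, rule_id: int) -> bool:
    """Placeholder: retrieve the complete rule and evaluate all of its remaining patterns."""
    return False  # stub for the real string/regex checks

def full_matcher_poll_once() -> None:
    """One polling iteration: consume DMA'd packets and publish forward/drop decisions."""
    while rx_ring:
        packet_id, payload, rule_ids = rx_ring.popleft()
        # Treat any fully matching rule as a drop decision, for simplicity.
        malicious = any(full_match(payload, rid) for rid in rule_ids)
        tx_ring.append((packet_id, "drop" if malicious else "forward"))

# Example: the DMA Engine has enqueued one packet with two partially matched rules.
rx_ring.append((7, b"GET /index.html", [1001, 2044]))
full_matcher_poll_once()
print(tx_ring)  # deque([(7, 'forward')])
```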
## 3.3 Memory Resource Management
* The core obstacle to realizing an FPGA-first design is fitting all of the above functionality within the limited memory on the FPGA.
* BRAM is limited to only 16MB even on our high-end Intel Stratix 10 MX FPGA.
* The **Reassembler** and the **MSPM**, which read or write memory the most frequently, use **BRAM**
* The **Packet Buffer** holds packet data that is written and read only once, so its bandwidth demand is comparatively low; it uses **eSRAM**
* The **DMA Engine** uses **DRAM**, which has the highest (and variable) latency and the lowest bandwidth
# 4 Reassembly
* Reassembly refers to the process of reconstructing a TCP bytestream in the presence of packet fragmentation, loss, and out-of-order delivery.
* The key objective of our Reassembler is to perform this reordering for 100K's of flows, while operating at 100Gbps
## 4.1 Design Space for TCP Reassembly
* Traditional approach: hardware design often favors data structures that are **fixed-length**, **constant-time**, and **generally deterministic**
* Shortcoming: statically allocating a fixed-size buffer per flow both wastes memory and limits the number of out-of-order flows that can be tracked
* Improved method: use a more memory-dense data structure such as a **linked list**, where each arriving segment is allocated on demand and inserted into the list in order.
* Shortcoming: linked-list insertion is not constant-time, yet pipeline parallelism depends on each stage of the pipeline taking a fixed amount of time.
* **Pigasus TCP Reassembler** : combines the two approaches into a fast path (handling only in-order packets) and a slow path (handling the remaining out-of-order traffic); a sketch of the slow-path data structure follows.
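A sketch of the memory-dense slow-path structure, assuming a per-flow singly linked list keyed by sequence number (names are illustrative; the real out-of-order engine is implemented in hardware):

```python
from typing import List, Optional, Tuple

class Segment:
    """Node in a per-flow linked list of buffered out-of-order TCP segments."""
    def __init__(self, seq: int, data: bytes):
        self.seq = seq
        self.data = data
        self.next: Optional["Segment"] = None

def insert_ooo(head: Optional[Segment], seq: int, data: bytes) -> Segment:
    """Allocate a node on demand and insert it in sequence-number order.
    Insertion cost grows with the number of buffered segments, which is
    why this structure is kept off the constant-time fast path."""
    node = Segment(seq, data)
    if head is None or seq < head.seq:
        node.next = head
        return node
    cur = head
    while cur.next is not None and cur.next.seq < seq:
        cur = cur.next
    node.next = cur.next
    cur.next = node
    return head

def drain(head: Optional[Segment], expected_seq: int) -> Tuple[List[bytes], Optional[Segment], int]:
    """Once the hole is filled, release the now-contiguous prefix back to the in-order stream."""
    released: List[bytes] = []
    while head is not None and head.seq == expected_seq:
        released.append(head.data)
        expected_seq += len(head.data)
        head = head.next
    return released, head, expected_seq

# Example: segments 2000 and 1500 arrive early; once the stream reaches 1500, both drain in order.
head = insert_ooo(None, 2000, b"CCCCC")
head = insert_ooo(head, 1500, b"B" * 500)      # fills [1500, 2000)
released, head, next_seq = drain(head, 1500)
print(len(released), next_seq)                 # 2 2005
```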
## 4.2 Pigasus TCP Reassembler

* Pigasus uses three execution engines to manage reassembly state
* The **Fast Path** processes **in-order packets for established flows** (runs in constant time)
* The **Insertion Engine** handles **SYN packets for new flows** (does not run in constant time)
* The **OOO Engine** handles **out-of-order packets for existing flows** (does not run in constant time)
* Because Pigasus is implemented in hardware, these engines **can all run simultaneously without stalling each other** (see the dispatch sketch below)
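A dispatch sketch with simplified flow state and illustrative names; in hardware the three engines run concurrently, so this sequential code only shows which packets each engine handles:

```python
from typing import Dict, Optional

class FlowState:
    """Minimal per-flow state tracked by the Reassembler."""
    def __init__(self, expected_seq: int):
        self.expected_seq = expected_seq

flow_table: Dict[int, FlowState] = {}

def fast_path(flow: FlowState, payload_len: int) -> None:
    # In-order packet for an established flow: constant-time state update.
    flow.expected_seq += payload_len

def insertion_engine(flow_id: int, seq: int) -> None:
    # SYN packet / unseen flow: allocate flow state (not constant time, e.g. on hash conflicts).
    flow_table[flow_id] = FlowState(expected_seq=seq + 1)

def ooo_engine(flow: FlowState, seq: int, payload: bytes) -> None:
    # Out-of-order packet for an existing flow: buffer it on the slow path (see the 4.1 sketch).
    pass

def dispatch(flow_id: int, seq: int, payload: bytes, syn: bool = False) -> None:
    flow: Optional[FlowState] = flow_table.get(flow_id)
    if syn or flow is None:
        insertion_engine(flow_id, seq)
    elif seq == flow.expected_seq:
        fast_path(flow, len(payload))
    else:
        ooo_engine(flow, seq, payload)

# Example: a SYN establishes the flow, an in-order segment advances it, a gap goes out of order.
dispatch(flow_id=1, seq=1000, payload=b"", syn=True)
dispatch(flow_id=1, seq=1001, payload=b"hello")   # Fast Path: expected_seq -> 1006
dispatch(flow_id=1, seq=1200, payload=b"later")   # OOO Engine (gap at 1006)
print(flow_table[1].expected_seq)                 # 1006
```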
## 4.3 Implementing the Flow Table
* While the Fast Path, Insertion Engine, and OOO Engine all operate simultaneously, **they must synchronize over shared flow state**

* **The key to maintaining the hash table’s deamortization property is the Insertion Engine** , which is responsible for inserting:
* (a) new flows
* (b) flows that were previously evicted from the hash table
* To guarantee conflict-free flow table access, accesses are prioritized as follows ('>' means 'has priority over'):
* Fast Path > Insertion Engine (to ensure deterministic performance on the Fast Path)
* Insertion Engine > OOO Engine (to ensure that the queue drains and since, empirically, the OOO path is underutilized)
* Because **BRAM is dual-ported**, the Fast Path is given direct access to the Flow Table, while accesses originating from the OOO Engine or Insertion Engine are managed by an arbiter (sketched below)
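A sketch of this access scheme, assuming hypothetical request queues (the real arbiter is hardware on one BRAM port, while the Fast Path uses the other port directly and is never delayed):

```python
from collections import deque
from typing import Optional, Tuple

# Pending Flow Table requests from the two lower-priority engines.
insertion_requests: deque = deque()
ooo_requests: deque = deque()

def arbiter_grant() -> Optional[Tuple[str, str]]:
    """Fixed priority: Insertion Engine > OOO Engine.
    The Fast Path is not arbitrated at all; it uses the second BRAM port,
    which keeps its Flow Table access deterministic."""
    if insertion_requests:
        return ("insertion", insertion_requests.popleft())
    if ooo_requests:
        return ("ooo", ooo_requests.popleft())
    return None

# Example: both engines have a pending request; the Insertion Engine is granted first.
insertion_requests.append("insert flow 42")
ooo_requests.append("update flow 7")
print(arbiter_grant())  # ('insertion', 'insert flow 42')
print(arbiter_grant())  # ('ooo', 'update flow 7')
```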
## 4.4 Worst-Case Performance
* Since Pigasus serves on the front-lines of network defenses, it is a prime target for Denial-of-Service (DoS) attacks.
* While all in-order packets are guaranteed full throughput, **an attacker could potentially slow down the OOO path by injecting out-of-order flows into the system**.

# 5 Multi-String Pattern Matching
* Pigasus' MSPM checks for fast patterns, headers, and non-fast patterns, reducing the load on the CPU-side Full Matcher (**only packets that match both the header match and the fast pattern are forwarded to the Full Matcher**)
## 5.1 MSPM in Pigasus

* Fast Pattern String Matching (FPSM) :
* Pigasus has a filtering stage in which packets traverse two parallel filters
* shift-or
* (32×8) hash tables
* In theory, these filters can generate (32×8) matches per cycle; however, in the common case, most packets and **most indices do not match any rules**, and therefore require no further processing.
* **Rule Reduction** : selects non-zero rule matches from the filter’s 256-bit wide vector output and narrows it down to 8 values.
* Header Matching :
* Use the packet header data to determine whether the matches produced by the previous stage are consistent with the corresponding rule's Port Group.
* If the packet's port number falls within the rule's Port Group, the rule is considered a match
* otherwise, the rule is ignored
* Non-Fast Pattern String Matching (NFPSM):
* on average, only 11% of packets reach the Non-Fast Pattern Matcher
* Pigasus further filters down the packets and rules destined for the CPU to only 5% of packets

1. To compute the fingerprint, the set of (index, length) tuples generated by the 8 NFPSM hash tables is first represented as 16-bit vectors, setting the bit corresponding to each matched length bucket to ‘1'
2. For each bucket, all of the 16-bit vectors generated for a given packet are ORed together to create a 16-bit ‘sub-fingerprint' for that bucket
* As the last stage of our hierarchical filtering, the non-fast pattern matcher has the **lowest throughput capacity**.
* Conclusion: by hierarchically filtering out packets, the MSPM reduces the amount of traffic traversing each subsequent stage; a software sketch of two of these building blocks follows
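A software sketch of two building blocks in this cascade, Rule Reduction and the sub-fingerprint computation, with illustrative data layouts (the hardware operates on fixed-width vectors, and the exact length-to-bit mapping here is an assumption):

```python
from typing import List, Tuple

def rule_reduction(match_vector: List[int], max_out: int = 8) -> List[Tuple[int, int]]:
    """Select the (lane, value) pairs with non-zero matches from the filters' wide
    per-cycle output, narrowing 256 lanes down to at most 8 candidates."""
    hits = [(lane, value) for lane, value in enumerate(match_vector) if value != 0]
    return hits[:max_out]

def sub_fingerprints(nfpsm_hits: List[Tuple[int, int, int]], num_tables: int = 8) -> List[int]:
    """Build one 16-bit sub-fingerprint per NFPSM hash table: each (table, index, length)
    hit sets one bit chosen by its length bucket, and all hits for a packet are ORed together."""
    fps = [0] * num_tables
    for table, _index, length in nfpsm_hits:
        length_bucket = min(length, 15)   # assumed mapping from match length to bit position
        fps[table] |= 1 << length_bucket
    return fps

# Example: the filters flag two non-zero lanes, and the NFPSM reports three hits.
print(rule_reduction([0] * 100 + [3] + [0] * 50 + [7] + [0] * 104))  # [(100, 3), (151, 7)]
print(sub_fingerprints([(0, 12, 4), (0, 40, 9), (3, 5, 4)]))         # [528, 0, 0, 16, 0, 0, 0, 0]
```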
# 6 Evaluation



* Our experiments with a variety of traces show that Pigasus can support 100Gbps using an average of 5 cores and 1 FPGA, consuming 38× less power than a CPU-only approach.