# Achieving 100Gbps Intrusion Prevention on a Single Server
###### tags: `Accelerators`
###### paper origin: OSDI 2020
###### papers: [link](http://users.ece.cmu.edu/~jhoe/distribution/2020/osdi2020.pdf)
###### video: `none`
###### slide: `none`
# Outline
* 1. Introduction
* 2. Background & Motivation
* IDS/IPS Functionality
* FPGA Basics
* FPGAs and IDS/IPS Performance
* 4 problems for baseline scheduling
* 3. System Overview
* An FPGA-First Design
* Pigasus Datapath
* Memory Resource Management
* 4. Reassembly
* Design Space for TCP Reassembly
* Pigasus TCP Reassembler
* Implementing the Flow Table
* Worst-Case Performance
* 5. Multi-String Pattern Matching
* MSPM in Pigasus
* 6. Evaluation
# 1. Introduction
* Background of this paper
* Intrusion Detection and Prevention Systems (IDS/IPS) are a critical part of any operational security deployment.
* A recurring theme in IDS/IPS literature is **the gap between the workloads they need to handle and the capabilities of existing hardware/software implementations**.
* Problem of this paper
* We are faced with the need to build IDS/IPSes that can support line rates on the order of **100Gbps, with hundreds of thousands of concurrent flows, while matching packets against tens of thousands of rules**.
* The shortcoming of prior research
* For the most part, prior work has focused on offloading only a specific piece of functionality to the FPGA. Unfortunately, **traditional offloading cannot close the order-of-magnitude performance gap** needed to offer 100Gbps IDS/IPS processing within the footprint of a single server.
* Snort 3.0 would still, on average, operate at 400 Mbps/core, requiring 250 cores to keep up with line rate!
* The subject of this paper
* This paper addresses this shortcoming with **Pigasus, an FPGA-based IDS/IPS that runs on a single server**.
* The challenges of building Pigasus, an FPGA-based IDS/IPS
* Multi-string pattern matching (MSPM) for payload matching
* NF-specific programming frameworks for FPGAs do not provide the necessary abstractions for searching bytestreams
* TCP bytestream reassembly must be performed at 100Gbps line rate
* A practical system needs to be able to track at least 100K flows and check at least 10K patterns at 100Gbps line rates
* Contributions
* How the above problems are solved:
1. Hierarchical Pattern Matching
* Inspiration from Hyperscan
* Pigasus uses a **set of hierarchical filters** to reduce the overall amount of memory required per pipeline replica, enabling search over 32 indices in parallel while consuming only about 2MB of BRAM.
2. Fast Common-Case Reassembly
* Pigasus adopts the **memory-dense linked list approach**, but only on an out-of-order slow path which runs in parallel with the primary fast path.
# 2. Background & Motivation
## 2.1 IDS/IPS Functionality
* The key goal of a signature-based IDS/IPS system is to identify when a network flow triggers any of up to tens of thousands of signatures, also known as rules.
* A given signature may specify one or several patterns. Patterns come in the following three categories:
* Header match
* String match
* Regular expression
* Signatures are detected at the granularity of a ‘Protocol Data Unit' (PDU); that is, a **signature is only triggered if all of its matches are found within the same PDU** (see the sketch after this list).
* One of the most widely known IDS/IPSes is **Snort**, and our work aims to be compatible with Snort rulesets
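A minimal sketch of this rule model, assuming a simplified layout (the `Rule` fields and `rule_matches_pdu` helper are hypothetical, not Snort's actual rule representation):

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Rule:
    """Hypothetical signature: a header constraint, exact strings, and an optional regex."""
    rule_id: int
    dst_port: int                                        # header match, simplified to one field
    strings: List[bytes] = field(default_factory=list)   # exact payload substrings
    regex: Optional[bytes] = None                        # optional regular expression

def rule_matches_pdu(rule: Rule, dst_port: int, pdu: bytes) -> bool:
    """A signature triggers only if *all* of its patterns match within the same PDU."""
    if dst_port != rule.dst_port:                                   # header match
        return False
    if not all(s in pdu for s in rule.strings):                     # string matches
        return False
    if rule.regex is not None and not re.search(rule.regex, pdu):   # regex match
        return False
    return True

# Toy example: the header and every payload pattern must hold for the same PDU.
r = Rule(rule_id=1, dst_port=80, strings=[b"POST ", b"Transfer-Encoding:"], regex=rb"chunked\s*$")
print(rule_matches_pdu(r, 80, b"POST /x HTTP/1.1\r\nTransfer-Encoding: chunked"))   # True
print(rule_matches_pdu(r, 443, b"POST /x HTTP/1.1\r\nTransfer-Encoding: chunked"))  # False (header)
```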

* Snort 3.0 cannot meet our goal of supporting 100Gbps, 100K flows, and 10K rules on a single server. This would require 125-667 cores to support 100Gbps of throughput, or 4-21 servers.
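A back-of-the-envelope check of these core counts (the 400 Mbps/core figure is the average quoted in the introduction; the 150-800 Mbps endpoints and the 32 cores/server figure are assumptions chosen to reproduce the 125-667 core and 4-21 server ranges, not numbers quoted directly):

```python
LINE_RATE_MBPS = 100 * 1000  # 100Gbps target

# Average case: ~400 Mbps per Snort 3.0 core.
print(LINE_RATE_MBPS / 400)  # -> 250.0 cores on average

# Trace-dependent extremes implied by the 125-667 core range.
for per_core_mbps in (800, 150):
    cores = LINE_RATE_MBPS / per_core_mbps
    print(f"{per_core_mbps} Mbps/core -> {cores:.0f} cores (~{cores / 32:.0f} servers at 32 cores/server)")
```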
## 2.2 FPGA Basics
* Why look to FPGAs to improve IDS/IPSes?
* We choose FPGAs because they are energy efficient and because they are conveniently deployed on SmartNICs where they are poised to operate on traffic **without PCIe latency or bandwidth overheads**.
* FPGAs allow programmers to specify custom circuits using code. Because FPGA clock rates are 5-20× slower than CPU clock rates, circuits must be designed with a high degree of parallelism
* FPGA Block RAM (BRAM) is split into many small, independently accessible blocks, which makes it very friendly to parallel access

## 2.3 FPGAs and IDS/IPS Performance

* Profiling the fraction of CPU time spent on each task in Snort 3.0 (MSPM, Full Matching, TCP Reassembly, and other tasks) shows that no single task dominates CPU time

* The idealized ‘speedup factor' from offloading any individual module is shown for each of our traces; no module even reaches a 2× speedup.
* The key lesson is simple: a much more drastic approach is needed to achieve line-rate throughput on a single server.
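This is essentially Amdahl's law: even an infinitely fast offload of a single module is bounded by the fraction of CPU time that module consumed. A quick sketch (the fractions below are illustrative, not measurements from the paper):

```python
def ideal_offload_speedup(fraction_offloaded: float) -> float:
    """Amdahl's law, assuming the offloaded fraction becomes free (infinitely fast accelerator)."""
    return 1.0 / (1.0 - fraction_offloaded)

# Unless one module consumes at least half of all CPU time, offloading
# only that module cannot even double end-to-end throughput.
for fraction in (0.30, 0.40, 0.50):
    print(f"offload {fraction:.0%} of CPU time -> at most {ideal_offload_speedup(fraction):.2f}x speedup")
```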
# 3. System Overview
## 3.1 An FPGA-First Design
* By FPGA-first, we mean that the FPGA is the primary compute platform performing the majority of work
* Pigasus leaves **regular expressions** and the **‘full match' stage** on traditional processors, because offloading them provides lower performance benefits at a higher cost in terms of memory. The reasons are below:
* the Full Match stage only interacts with ≈5% of packets in the Pigasus design.
* State-of-the-art hardware regular-expression engines do not meet Pigasus's performance and memory demands
* Therefore, **offloading regular expressions would exhaust our memory budget for little gain**.
## 3.2 Pigasus Datapath

* The Parser, Reassembler, and Multi-String Pattern Matcher (MSPM) are implemented in the FPGA while the Full Matcher is offloaded to the CPU
* **Initial packet processing** : Each packet first goes through a 100Gbps **Ethernet Core** that translates electric signals into raw Ethernet frames. These frames are temporarily stored in the **Packet Buffer**; each frame’s header is separately sent to the Parser which extracts TCP/IP header fields as metadata for use by the Reassembler and MSPM – and then **forwards the header to the Reassembler**.
* **Reassembler** : The Reassembler puts TCP packets in order so that the bytestream can be searched contiguously, while UDP packets are forwarded through the Reassembler without processing.
* **Data Mover** : Receives the packet metadata from the Reassembler and issues requests to fetch raw packets from the Packet Buffer so that they can be forwarded to the MSPM.
* **Multi-String Pattern Matcher** : Responsible for checking every packet against the header match for all 10,000 rules, and every index of every packet against all of the string-match filters for all 10,000 rules.
* **DMA Engine** : For each packet, the MSPM outputs the set of rule IDs that the packet partially matched.
* If the MSPM outputs the empty set, the packet is released to the network.
* Otherwise it is forwarded to the DMA Engine which transfers the packet to the CPU for Full Matching.
* **Full Matcher** :
* On the software side, **the Full Matcher polls a ring buffer** which is populated by the DMA Engine.
* the Full Matcher retrieves the complete rule and checks for a full match. It then writes its decision (forward or drop) to a transmission ring buffer which is polled by the DMA Engine on the FPGA side.
* If the decision is to forward, the DMA Engine forwards the packet to the network
* otherwise the packet is simply erased from the DMA Engine’s Check Packet Buffer
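A minimal software-side sketch of this Full Matcher loop, assuming hypothetical ring-buffer objects and a placeholder `full_match()` (Pigasus's actual DMA interface and rule evaluation are not shown):

```python
from collections import deque

# Hypothetical stand-ins for the rings shared with the FPGA's DMA Engine.
rx_ring: deque = deque()  # filled by the DMA Engine: (packet_id, payload, candidate_rule_ids)
tx_ring: deque = deque()  # drained by the DMA Engine: (packet_id, "forward" or "drop")

def full_match(payload: bytes, rule_id: int) -> bool:
    """Placeholder: retrieve the complete rule and evaluate all of its remaining patterns."""
    return False  # stub for the real string/regex checks

def full_matcher_poll_once() -> None:
    """One polling iteration: consume DMA'd packets and publish forward/drop decisions."""
    while rx_ring:
        packet_id, payload, rule_ids = rx_ring.popleft()
        # Treat any fully matching rule as a drop decision, for simplicity.
        malicious = any(full_match(payload, rid) for rid in rule_ids)
        tx_ring.append((packet_id, "drop" if malicious else "forward"))

# Example: the DMA Engine has enqueued one packet with two partially matched rules.
rx_ring.append((7, b"GET /index.html", [1001, 2044]))
full_matcher_poll_once()
print(tx_ring)  # deque([(7, 'forward')])
```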
## 3.3 Memory Resource Management
* The core obstacle to realizing an FPGA-first design is fitting all of the above functionality within the limited memory on the FPGA.
* BRAM is limited to only 16MB even on our high-end Intel Stratix 10 MX FPGA.
* The **Reassembler** and the **MSPM**, which read or write memory the most frequently, use **BRAM**
* The **Packet Buffer** holds packet data that is written and read only once, so its bandwidth demand is comparatively low; it uses **eSRAM**
* The **DMA Engine** uses **DRAM**, which has the highest (and variable) latency and the lowest bandwidth
# 4 Reassembly
* Reassembly refers to the process of reconstructing a TCP bytestream in the presence of packet fragmentation, loss, and out-of-order delivery.
* The key objective of our Reassembler is to perform this reordering for 100K's of flows, while operating at 100Gbps
## 4.1 Design Space for TCP Reassembly
* Traditional approach: hardware design often favors data structures that are **fixed-length**, **constant-time**, and **generally deterministic**
* Shortcoming: statically allocating a fixed-size buffer per flow both wastes memory and limits the number of out-of-order flows that can be tracked
* Improved method: use a more memory-dense data structure such as a **linked list**, where each arriving segment is allocated on demand and inserted into the list in order.
* Shortcoming: linked-list insertion is not constant-time, yet pipeline parallelism depends on each stage of the pipeline taking a fixed amount of time.
* **Pigasus TCP Reassembler** : combines the two approaches into a fast path (handling only in-order packets) and a slow path (handling the remaining out-of-order traffic); a sketch of the slow-path data structure follows.
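A sketch of the memory-dense slow-path structure, assuming a per-flow singly linked list keyed by sequence number (names are illustrative; the real out-of-order engine is implemented in hardware):

```python
from typing import List, Optional, Tuple

class Segment:
    """Node in a per-flow linked list of buffered out-of-order TCP segments."""
    def __init__(self, seq: int, data: bytes):
        self.seq = seq
        self.data = data
        self.next: Optional["Segment"] = None

def insert_ooo(head: Optional[Segment], seq: int, data: bytes) -> Segment:
    """Allocate a node on demand and insert it in sequence-number order.
    Insertion cost grows with the number of buffered segments, which is
    why this structure is kept off the constant-time fast path."""
    node = Segment(seq, data)
    if head is None or seq < head.seq:
        node.next = head
        return node
    cur = head
    while cur.next is not None and cur.next.seq < seq:
        cur = cur.next
    node.next = cur.next
    cur.next = node
    return head

def drain(head: Optional[Segment], expected_seq: int) -> Tuple[List[bytes], Optional[Segment], int]:
    """Once the hole is filled, release the now-contiguous prefix back to the in-order stream."""
    released: List[bytes] = []
    while head is not None and head.seq == expected_seq:
        released.append(head.data)
        expected_seq += len(head.data)
        head = head.next
    return released, head, expected_seq

# Example: segments 2000 and 1500 arrive early; once the stream reaches 1500, both drain in order.
head = insert_ooo(None, 2000, b"CCCCC")
head = insert_ooo(head, 1500, b"B" * 500)      # fills [1500, 2000)
released, head, next_seq = drain(head, 1500)
print(len(released), next_seq)                 # 2 2005
```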
## 4.2 Pigasus TCP Reassembler

* Pigasus uses three execution engines to manage reassembly state
* The **Fast Path** processes **in-order packets for established flows** (runs in constant time)
* The **Insertion Engine** handles **SYN packets for new flows** (does not run in constant time)
* The **OOO Engine** handles **out-of-order packets for existing flows** (does not run in constant time)
* Because Pigasus is implemented in hardware, these engines **can all run simultaneously without stalling each other** (see the dispatch sketch below)
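A dispatch sketch with simplified flow state and illustrative names; in hardware the three engines run concurrently, so this sequential code only shows which packets each engine handles:

```python
from typing import Dict, Optional

class FlowState:
    """Minimal per-flow state tracked by the Reassembler."""
    def __init__(self, expected_seq: int):
        self.expected_seq = expected_seq

flow_table: Dict[int, FlowState] = {}

def fast_path(flow: FlowState, payload_len: int) -> None:
    # In-order packet for an established flow: constant-time state update.
    flow.expected_seq += payload_len

def insertion_engine(flow_id: int, seq: int) -> None:
    # SYN packet / unseen flow: allocate flow state (not constant time, e.g. on hash conflicts).
    flow_table[flow_id] = FlowState(expected_seq=seq + 1)

def ooo_engine(flow: FlowState, seq: int, payload: bytes) -> None:
    # Out-of-order packet for an existing flow: buffer it on the slow path (see the 4.1 sketch).
    pass

def dispatch(flow_id: int, seq: int, payload: bytes, syn: bool = False) -> None:
    flow: Optional[FlowState] = flow_table.get(flow_id)
    if syn or flow is None:
        insertion_engine(flow_id, seq)
    elif seq == flow.expected_seq:
        fast_path(flow, len(payload))
    else:
        ooo_engine(flow, seq, payload)

# Example: a SYN establishes the flow, an in-order segment advances it, a gap goes out of order.
dispatch(flow_id=1, seq=1000, payload=b"", syn=True)
dispatch(flow_id=1, seq=1001, payload=b"hello")   # Fast Path: expected_seq -> 1006
dispatch(flow_id=1, seq=1200, payload=b"later")   # OOO Engine (gap at 1006)
print(flow_table[1].expected_seq)                 # 1006
```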
## 4.3 Implementing the Flow Table
* While the Fast Path, Insertion Engine, and OOO Engine all operate simultaneously, **they must synchronize over shared flow state**

* **The key to maintaining the hash table’s deamortization property is the Insertion Engine** , which is responsible for inserting:
* (a) new flows
* (b) flows that were previously evicted from the hash table
* To guarantee conflict-free flow table access, accesses are prioritized as follows ('>' means 'has priority over'):
* Fast Path > Insertion Engine (to ensure deterministic performance on the Fast Path)
* Insertion Engine > OOO Engine (to ensure that the queue drains and since, empirically, the OOO path is underutilized)
* Because **BRAM is dual-ported**, the Fast Path is given direct access to the Flow Table, while accesses originating from the OOO Engine or Insertion Engine are managed by an arbiter (sketched below)
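A sketch of this access scheme, assuming hypothetical request queues (the real arbiter is hardware on one BRAM port, while the Fast Path uses the other port directly and is never delayed):

```python
from collections import deque
from typing import Optional, Tuple

# Pending Flow Table requests from the two lower-priority engines.
insertion_requests: deque = deque()
ooo_requests: deque = deque()

def arbiter_grant() -> Optional[Tuple[str, str]]:
    """Fixed priority: Insertion Engine > OOO Engine.
    The Fast Path is not arbitrated at all; it uses the second BRAM port,
    which keeps its Flow Table access deterministic."""
    if insertion_requests:
        return ("insertion", insertion_requests.popleft())
    if ooo_requests:
        return ("ooo", ooo_requests.popleft())
    return None

# Example: both engines have a pending request; the Insertion Engine is granted first.
insertion_requests.append("insert flow 42")
ooo_requests.append("update flow 7")
print(arbiter_grant())  # ('insertion', 'insert flow 42')
print(arbiter_grant())  # ('ooo', 'update flow 7')
```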
## 4.4 Worst-Case Performance
* Since Pigasus serves on the front-lines of network defenses, it is a prime target for Denial-of-Service (DoS) attacks.
* While all in-order packets are guaranteed full throughput, **an attacker could potentially slow down the OOO path by injecting out-of-order flows into the system**.

# 5 Multi-String Pattern Matching
* Pigasus' MSPM checks for fast patterns, headers, and non-fast patterns, reducing the load on the CPU-side Full Matcher (**only packets that match both the header match and the fast pattern are forwarded to the Full Matcher**)
## 5.1 MSPM in Pigasus

* Fast Pattern String Matching (FPSM) :
* Pigasus has a filtering stage in which packets traverse two parallel filters
* shift-or
* (32×8) hash tables
* In theory, these filters can generate (32×8) matches per cycle; however, in the common case, most packets and **most indices do not match any rules**, and therefore require no further processing.
* **Rule Reduction** : selects non-zero rule matches from the filter’s 256-bit wide vector output and narrows it down to 8 values.
* Header Matching :
* Use the packet header data to determine whether the matches produced by the previous stage are consistent with the corresponding rule's Port Group.
* If the packet's port number falls within the rule's Port Group, the rule is considered a match
* otherwise, the rule is ignored
* Non-Fast Pattern String Matching (NFPSM):
* on average, only 11% of packets reach the Non-Fast Pattern Matcher
* Pigasus further filters down the packets and rules destined for the CPU to only 5% of packets

1. To compute the fingerprint, the set of (index, length) tuples generated by the 8 NFPSM hash tables is first represented as 16-bit vectors, setting the bit corresponding to each matched length bucket to ‘1'
2. For each bucket, all of the 16-bit vectors generated for a given packet are ORed together to create a 16-bit ‘sub-fingerprint' for that bucket
* As the last stage of our hierarchical filtering, the non-fast pattern matcher has the **lowest throughput capacity**.
* Conclusion: by hierarchically filtering out packets, the MSPM reduces the amount of traffic traversing each subsequent stage; a software sketch of two of these building blocks follows
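A software sketch of two building blocks in this cascade, Rule Reduction and the sub-fingerprint computation, with illustrative data layouts (the hardware operates on fixed-width vectors, and the exact length-to-bit mapping here is an assumption):

```python
from typing import List, Tuple

def rule_reduction(match_vector: List[int], max_out: int = 8) -> List[Tuple[int, int]]:
    """Select the (lane, value) pairs with non-zero matches from the filters' wide
    per-cycle output, narrowing 256 lanes down to at most 8 candidates."""
    hits = [(lane, value) for lane, value in enumerate(match_vector) if value != 0]
    return hits[:max_out]

def sub_fingerprints(nfpsm_hits: List[Tuple[int, int, int]], num_tables: int = 8) -> List[int]:
    """Build one 16-bit sub-fingerprint per NFPSM hash table: each (table, index, length)
    hit sets one bit chosen by its length bucket, and all hits for a packet are ORed together."""
    fps = [0] * num_tables
    for table, _index, length in nfpsm_hits:
        length_bucket = min(length, 15)   # assumed mapping from match length to bit position
        fps[table] |= 1 << length_bucket
    return fps

# Example: the filters flag two non-zero lanes, and the NFPSM reports three hits.
print(rule_reduction([0] * 100 + [3] + [0] * 50 + [7] + [0] * 104))  # [(100, 3), (151, 7)]
print(sub_fingerprints([(0, 12, 4), (0, 40, 9), (3, 5, 4)]))         # [528, 0, 0, 16, 0, 0, 0, 0]
```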
# 6 Evaluation



* Our experiments with a variety of traces show that Pigasus can support 100Gbps using an average of 5 cores and 1 FPGA, consuming 38× less power than a CPU-only approach.