# CS 293N, Spring 2022
[Course Website](https://sites.cs.ucsb.edu/~arpitgupta/cs293n/spring22)
### Logistics
Please find the detailed [notes](https://sites.cs.ucsb.edu/~arpitgupta/cs293n/spring22/about/) on logistics here.
### Teams
| Team Name | Members| Proposal | Presentation (Proposal) | Final Presentation| Final Report |
| ----------- | ----------- | ----------- | ----------- |----------- | ----------- |
| Team 1 (Badcode Corporation) | Roman Beltiukov and Liu Kurafeeva| [PINOTheca: A Dataset for Quality of Experience Estimation from Encrypted YouTube Traffic](https://www.overleaf.com/read/bqvcdjncwmqv) | [Presentation](https://docs.google.com/presentation/d/1OxZ9M9obfVJCGL_W-wd1v-fX5uPx8ERUQ9y9CSo1vCk/edit?usp=sharing) |[Presentation](https://docs.google.com/presentation/d/1XNj4YuNwVmu3l3IF6gFb7yAZYRRxka-n6O5C8m_XlIo/edit?usp=sharing)|[Report](https://www.overleaf.com/read/fhzzxqvkpvnb)
| Team 2 (Neuralink)| Navya Battula| [A Look Behind the Curtain: Traffic Classification in an Increasingly Encrypted Web](https://www.overleaf.com/read/vmpzkccdmrzm) | [Presentation](https://docs.google.com/presentation/d/1gEhuhSvFnhFk9ZMb4gqQeC-Dy5mdBZUIEbsi04c-IK8/edit?usp=sharing) | [Final Presentation](https://drive.google.com/drive/folders/15HZjPWrPFwjw4wt68e-tPGeP88U-vIaf?usp=sharing) | [Report](https://www.overleaf.com/read/jkbbgqbxdtmp)
| Team 3 (Insignia) | Shubham Talbar, Nikunj Baid, Satyam Awasthi|[JitterNot: A Data-driven MPC scheme for ABR in VCAs](https://www.overleaf.com/read/dhxczmpfrzsw) | [Presentation](https://docs.google.com/presentation/d/1I-RFd3HZ-ZfWpmrxM5Z878I9usa9dtU00VC8lqm48us/edit?usp=sharing) | [Final Presentation](https://docs.google.com/presentation/d/1hbJClv6txokdfTwepcCxQnEFdsrtZjQgVP7Bh3bsM5o/edit?usp=sharing) | [Report](https://www.overleaf.com/read/gdzbwwmksgrf)
| Team 4 (Snack Overflow) | Samridhi Maheshwari, Deept Mahendiratta|[Network Intrusion Detection using Machine Learning and Deep Learning](https://www.overleaf.com/read/dfhjpsyryytx) | [Presentation](https://docs.google.com/presentation/d/1GBuf-ihJI8urkVzRX373cpBYwzu1voQZK__ia2XKcKw/edit?usp=sharing) | [Final Presentation](https://docs.google.com/presentation/d/1-Sn8lMcLkGrAg4f9S_eHpv75JirKmvO3LYjOKPzsAnk/edit?usp=sharing) |[Report](https://www.overleaf.com/read/dvxzwdsbsgpf)
| Team 5 (Team BAR) | Rhys Tracy, Aaron Jimenez, and Brian Chen|[An Open Source Chunk Size Predictor to Improve Video Streaming Efficiency](https://www.overleaf.com/read/ffjwkjhxrxhn) | [Presentation](https://docs.google.com/presentation/d/17AQ5ZYJOQN-bVrvjUf1Itr3Nd6UUOCRDjSimYlC6IZA/edit?usp=sharing) | [Final Presentation](https://docs.google.com/presentation/d/1kZLJRn8OSDQ4RaBaANnYlqv22Zs2jQ0Fc6i3gwfLcr4/edit?usp=sharing) |[Report](https://www.overleaf.com/read/ymvgqcfwqpkp)
| Team 6 | Pranjali Jain, Vinothini Gunasekaran |[QoE Estimation for Video Conferencing Applications](https://www.overleaf.com/read/drdnbmnbbxtp) | [Presentation](https://docs.google.com/presentation/d/1dHSgVaE9P2OwH9XiHsWJ1x_3WxcUEJzrroXDKLHtpII/edit?usp=sharing) | [Final Presentation](https://docs.google.com/presentation/d/1yKhXzBeie3LPnxlkZ5FUsDjgagJ2FSGR8dex6iu-2OU/edit?usp=sharing) | [Report](https://www.overleaf.com/read/kpczxkhvmvrh)
| Team 7 | Nagarjun Avaraddy, Apoorva Jakalannanavar, Alan Roddick |[Let There Be Light: Self Explainable Modeling for Network Traffic Analysis](https://www.overleaf.com/read/kwzkzgjgzdnx) | [Presentation](https://docs.google.com/presentation/d/1nSCx2zsoinpDDwdDqaX4psPJS68MMt91PkJdfBpBj4Y/edit?usp=sharing) | [Final Presentation](https://docs.google.com/presentation/d/1-7DRW3oee3SDsx4nvAtFKyN8uK-18tLV2OKXJSdOXEY/edit?usp=sharing) | [Report](https://www.overleaf.com/read/swszhbkjnyqr)
| Team 8 | Achintya Desai, Arjun Prakash, Ajit Jadhav |[Crypto-ransomware attack detection over encrypted file sharing traffic](https://www.overleaf.com/read/rbghdqqvxprg) | [Presentation](https://docs.google.com/presentation/d/1dVNpADRBCWzchiCUaNZ82-OYz3DRHJKz2JlENO5NbaQ/edit?usp=sharing) | [Final Presentation](https://docs.google.com/presentation/d/1XAbG3PBuZgmAz7ZjbbswQKIfuFaGpM4H/edit?usp=sharing&ouid=110678761124947988480&rtpof=true&sd=true) | [Report](https://www.overleaf.com/read/sctzhbgxywhg)
| Team 9 | Nawel Alioua, Jiamo Liu |[To Decipher The Myth Behind YouTube’s Bandwidth Estimation Strategy](https://www.overleaf.com/read/jdvdwnhrdtyy) | [Presentation](https://docs.google.com/presentation/d/1G7UBZyzcjMq9PL2c2dAOxsDLOMjJBRPHGLXEMXiOv_s/edit?usp=sharing) | [Final Presentation](https://docs.google.com/presentation/d/1i_rIcyKQo2hBR7OauRcDzd9ApherK0bb_eYG4_4cNPc/edit#slide=id.g130ba9e1078_0_0) | [Final Report](https://www.overleaf.com/read/bwsymcjzyhnc)
| Team 10 (Losers V2 )| Jaber Daneshamooz, Fahed Abudayyeh, Seif Ibrahim |[Using Feature Entropy to create a Data Collection Policy for ML in Networked Systems](https://docs.google.com/document/d/1rOLaSYBKcp5rMMXTteICoOBtljMYB15LMmg9ucMtq90/edit?usp=sharing) | [Presentation](https://docs.google.com/presentation/d/1sdVfQtg5k2j85PtPXOD8TDVrET-y3IgEEmM3WiDJxcs/edit?usp=sharing) | [Final Presentation](https://docs.google.com/presentation/d/1xuUb987N0bnyIsxZnQ8aO0Zl2ZOIYe3f7RX3BEoGOSk/edit?usp=sharing) | [Final Report](https://www.overleaf.com/read/wwwjdwzxbtpg)
| Team 11 | Punnal Ismail Khan |[Measure Last Mile RTT to detect latency inflammation events](https://www.overleaf.com/read/njpjjtkrmbvn) | [Presentation](https://docs.google.com/presentation/d/1QzF76kuFVhNfwAVVYVIbbr_x3PaUI01YaMnIG6eGHCc/edit?usp=sharing) | [Final Presentation](https://docs.google.com/presentation/d/1TA57ZcXfj1lATtvKEH75XkdGkFyLwyQ4Tqo2N3TmYL4/edit?usp=sharing) | [Report](https://www.overleaf.com/read/mnpfsdfrypgr)
| Team 12 | Shereen Elsayed|[Datzilla: Extention for PINOT to Cover Netflix/Twitch](https://www.overleaf.com/9668473319sphqnshgbmhk) | [Presentation](https://docs.google.com/presentation/d/10Q9oqIjqhp_nvMpDlR7OjobQm68v9MEK/edit?usp=sharing&ouid=103440494608772812383&rtpof=true&sd=true) | [Final Presentation](https://docs.google.com/presentation/d/1eTUeV7xry1p50eHdoejubNmjQrPK5nub/edit?usp=sharing&ouid=103440494608772812383&rtpof=true&sd=true) | [Report](https://www.overleaf.com/9782391173dzfqryvzmbvp)
<!-- | <Your team name> | <team members> | -->
### Self-driving Networks
In this course, we will learn about self-driving networks. More concretely, we will discuss the vision for self-driving networks, the recent progress in this research direction, and the exciting new problems in this space.
------------------------------------
## Lectures
* [Lecture 1](#Lecture-1)
* [Lecture 2](#Lecture-2)
* [Lecture 3](#Lecture-3)
* [Lecture 4](#Lecture-4)
* [Lecture 5](#Lecture-5)
* [Lecture 6](#Lecture-6)
* [Lecture 7](#Lecture-7)
* [Lecture 8](#Lecture-8)
* [Lecture 9](#Lecture-9)
* [Lecture 10](#Lecture-10)
* [Lecture 11](#Lecture-11)
* [Lecture 12](#Lecture-12)
* [Lecture 13](#Lecture-13)
* [Lecture 14](#Lecture-14)
* [Lecture 15](#Lecture-15)
* [Lecture 16](#Lecture-16)
* [Lecture 17](#Lecture-17)
* [Lecture 18](#Lecture-18)
* [Lecture 19](#Lecture-19)
* [Lecture 20](#Lecture-20)
## Lecture 1
This [lecture](https://www.cs.princeton.edu/~jrex/papers/self-driving-networks18.pdf) will be based on the report from a workshop in **self-driving networks**, organized by NSF in 2018.
### Motivation
* How we use the network has changed over time: more applications, more data, and more expectations from the network regarding performance (i.e., high speed, low latency), reliability (i.e., stay performant all the time), and security (i.e., stay safe from malicious players all the time).
* To meet these expectations, network operators need to chase the tails, which requires
* data collection (telemetry)
* flexible control (SDN)
* closing-the-loop (AI/ML)
![](https://i.imgur.com/cWkMJIG.png)
* What's a self-driving network?
* A network that can run by itself through a combination of query-driven network measurement, automated inference techniques, and programmatic control.
* Three key capabilities:
* Query-driven network measurement.
* Automated decision making
* Programmatic control
* Why all the excitement?
* resonance between `application pull` and `technological push`.
### Use Cases
#### QoE Optimization
* Network perspective
* Problems:
* accurately identify packets/flows for different applications
* infer QoE (what's QoE?)
* learn the packet (or flow) scheduling/routing policies
* Challenges
* Scale
* Encrypted payload
* limited control (for existing networks)
* End-host perspective
* Problems
* learn the policies to adapt
* bit rates,
* frame rates (video conferencing applications),
* encoding algorithms, etc.
* Challenges:
* Limited view of network conditions
**Fundamental research challenges**
* What is the relationship of network utilization to video quality of experience, and application quality of experience more generally?
* How accurately can application quality of experience (QoE) be diagnosed from passive network traffic monitoring?
* What features are most useful in diagnosing and predicting application QoE, for different applications?
* Can such inference be performed at high traffic rates?
* What techniques are applicable at different points along the end-to-end network path (e.g., in the home network, in the access network, at interconnection points, at the server)?
* In the event of degraded network conditions (e.g., congestion), to what extent should adaptation entail application adaptation (e.g., changes in video bitrate quality) vs. network adaptation (e.g., selection of an alternate route between the content and the user)?
* When network (re)action is required, how should reactive approaches be specified? For example, should network changes be automatically determined from optimization? To what extent should the operator be in the loop when executing these changes (both on longer, planning timescales and on shorter, operational timescales)?
#### Security
* Problem:
* Defend the network against a wide-range of (continuously evolving) attacks.
* Approach:
* Accurately detect attack patterns in the network, and mitigate their impact as effectively as possible
* Make the best use of all possible information for detection, i.e.,
* traffic patterns in the network data plane
* patterns in network control plane (BGP messages)
* patterns in DNS queries
* logs from existing network devices and security appliances (e.g., IDS logs)
* Challenges
* Scale: extracting features, deep-packet inspection, etc. at higher data rates is costly
* Privacy: Sharing information/data across organizations in a privacy-preserving manner
### Summary
Self-driving networks are a hot topic in the world of networking and telecommunications. These networks are designed to be able to run on their own, using a combination of query-driven network measurement, automated decision making, and programmatic control. The goal of self-driving networks is to meet the increasing demands for high performance, reliability, and security from network users, who are now using more applications, generating more data, and expecting more from their networks.
One of the key motivations behind the development of self-driving networks is the need for network operators to keep up with the rapidly changing landscape of network use. With more and more applications and data being transmitted over networks, and with users expecting high speeds and low latencies at all times, network operators are facing significant challenges in meeting these expectations. To address these challenges, self-driving networks have been designed to incorporate data collection, flexible control, and closed-loop AI/ML systems in order to better manage network performance and ensure that networks are able to meet the demands placed upon them.
Self-driving networks have several key capabilities that make them unique. First, they have the ability to gather data through query-driven network measurement. This allows them to constantly monitor the performance of the network and gather data on how it is being used. Second, self-driving networks are able to make automated decisions based on this data, allowing them to adapt to changing conditions in real-time. Finally, self-driving networks are able to programmatically control the network, allowing them to make changes to network configuration and operation as needed.
The excitement surrounding self-driving networks is due in part to the resonance between application pull and technological push. In other words, there is a strong demand from users for networks that are able to meet their needs and expectations, and self-driving networks offer a promising solution to this problem.
One key use case for self-driving networks is the optimization of Quality of Experience (QoE) for users. From a network perspective, this involves accurately identifying packets or flows for different applications, inferring QoE, and learning the packet scheduling and routing policies. However, this can be challenging due to the scale of modern networks, the use of encrypted payloads, and limited control for existing networks. From an end-host perspective, the challenges include learning the necessary policies to adapt bit rates, frame rates for video conferencing applications, and encoding algorithms, as well as having a limited view of network conditions.
There are also several fundamental research challenges that need to be addressed in order to optimize QoE. These include understanding the relationship between network utilization and QoE for different applications, accurately diagnosing QoE from passive network traffic monitoring, determining the most useful features for predicting QoE, and examining the applicability of different techniques at different points along the end-to-end network path. In the event of degraded network conditions, it is also important to consider both application and network adaptations and to determine how network changes should be specified and executed.
In addition to QoE optimization, self-driving networks also have the potential to improve network security. The goal here is to defend against a wide range of continuously evolving attacks, by accurately detecting attack patterns and mitigating their impact as effectively as possible. This requires the use of all available information, including traffic patterns in the network data plane, patterns in the network control plane, patterns in DNS queries, and logs from network devices and security appliances. However, this process is also faced with challenges, including the need to scale to handle large amounts of data and the need to preserve privacy when sharing information across organizations.
## Lecture 2
<!-- ### Closing the Loop -->
### Measurements for Self-Driving Networks
#### Data
##### Revisit QoE Optimization
Let's focus only on network's perspective on QoE optimization:
* Setting:
* Network operators for the last-mile networks (e.g., cable- and cellular-based ISPs, campus networks, etc.)
* Data rates: 1-100 Gbps
* Decision making happening at edge devices (wireless AP, modem, etc.), core/border routers
* Learning Problems:
* accurately identify packets/flows for different applications
* infer QoE (what's QoE?)
* learn the packet (or flow) scheduling/routing policies
* Requirements from the model
* `performance`: demonstrate that the model accurately classifies traffic, infers QoE, and improves QoE
* `generalizable`: model should demonstrate that it performs well in realistic conditions, especially ones that are different from the ones it observed in training data (i.e., immune to [inductive biases](https://en.wikipedia.org/wiki/Inductive_bias)). Let's unpack this information a bit:
* A trained model is required to not only predict well in the training domain, but also encode some essential structure of the underlying system.
* For some problems, the required structure corresponds to causal phenomena that remain invariant.
* However, for some problems the required structure is determined by the domain-specific insights.
* The ability to encode these **domain-specific insights** into the learning model determines whether it will generalize as expected (i.e., immune to inductive biases) in deployment scenarios or not (critical for establishing trust).
* Is this a data problem?
* mismatch in causal structures between training data and deployment settings
* selection bias
* Is this an algorithmic problem?
* Not being able to specify all the required domain-specific insights results in the **underspecification problem**, which plagues many existing learning models.
* `robust`: model should be robust to changes in traffic patterns and network conditions
* `explainable`: model should be able to explain how it made its decisions
* Preliminaries
* `Input`: $X$, `Label` (or Output): $Y$
* `Training data`: $D$ drawn from a training distribution $P$
* `Model` $f$, s.t. $f:X \rightarrow Y$
* It is specified by a function class $F$ from which a predictor $f(x)$ will be chosen
* `Pipeline/Algorithm`
* Select $f(x)$ from $F$ by minimizing the predictive risk on the training data $D$
* Evaluate the performance on a randomly selected iid test set $D'$ (hold-out set)
* `Underspecification problem`
* Distribution of $X$ in deployment settings $P'$ is different from that in the training settings $P$
* The ML pipeline outputs multiple functions ($f \in F^*$, where $F^* \subset F$) that return similar predictive risks --- the pipeline cannot discriminate between these equivalent functions, $f \in F^*$
* But each of these functions encodes substantially different inductive biases that result in different generalization behavior on distributions ($P'$) that differ from the training distribution ($P$). A toy sketch of this effect follows below.
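To make this concrete, here is a small synthetic sketch (not from the course material; it uses scikit-learn and made-up data) showing how two predictors that both look fine on the iid hold-out set can behave very differently once the deployment distribution $P'$ breaks a correlation present in $P$:

```python
# Toy illustration of underspecification / shortcut learning with synthetic data:
# two models perform well on the iid hold-out set, but the one that leans on a
# spurious feature degrades sharply once the deployment distribution changes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def sample(n, spurious=True):
    """x1 is causally related to y; x2 merely co-occurs with y during training."""
    x1 = rng.normal(size=n)
    y = (x1 + 0.5 * rng.normal(size=n) > 0).astype(int)
    x2 = y + 0.1 * rng.normal(size=n) if spurious else rng.normal(size=n)
    return np.column_stack([x1, x2]), y

X, y = sample(20_000, spurious=True)                # training distribution P
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
X_shift, y_shift = sample(20_000, spurious=False)   # deployment distribution P'

causal_only = LogisticRegression().fit(X_tr[:, :1], y_tr)   # uses x1 only
uses_shortcut = LogisticRegression().fit(X_tr, y_tr)        # free to lean on x2

print("iid hold-out:", causal_only.score(X_te[:, :1], y_te),
      uses_shortcut.score(X_te, y_te))
print("shifted P':  ", causal_only.score(X_shift[:, :1], y_shift),
      uses_shortcut.score(X_shift, y_shift))
```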
**Note**: We discussed the underspecification problem briefly in class. For those of you interested in learning more about it, I recommend this [paper](https://arxiv.org/pdf/2011.03395.pdf) from Google. It talks about the prevalence of underspecification in current machine learning pipelines, and its implications. We will be covering this topic in greater depth later in the class.
#### Data requirements
What requirements do we have from the data collection infrastructure? The collected data should be:
* `fine-grained`: packet-level granularity
* `high-quality labels`: Examples of labels: application-packet, QoS-QoE, policy(action)-QoE(reward). We need to ensure that these labels are immune to selection bias, which can arise, for example, if we only collect data when network conditions are good.
* `longitudinal`: order of hours
* `realistic`: minimize the differences in causal structure between training and test (deployment settings)
Is it possible to satisfy the data-collection requirements? Below are the challenges in satisfying the data requirements:
* scale
* privacy
* quality (of labels)
## Lecture 3
### Input from Survey
![](https://i.imgur.com/zlJANzG.png)
![](https://i.imgur.com/CakmWiy.png)
### Characteristics of Learning Problems in Networking
#### Stakeholders (Who?)
Before we understand the learning problems, we need to understand who is trying to solve the problem:
* Direct stakeholders
* End-user or eyeball
* Network service provider(s)
* Content (or compute service) providers
* Third-party
* policy makers
* researchers, etc.
#### Location (Where?)
We also need to understand where one is trying to solve the problem:
- End-users
- Network interface
- TCP stack
- applications (native apps, browser, etc.)
- last-mile networks
- `Enterprise`: AP, core routers, border routers, etc.
- `HFC`: AP/modem, [CMTS](https://en.wikipedia.org/wiki/Cable_modem_termination_system), core, etc.
- `Fiber`: AP/modem, [PONs](https://en.wikipedia.org/wiki/Passive_optical_network)
- Cellular: AP/modem, base station, [radio access network](https://en.wikipedia.org/wiki/Radio_access_network), core network
- transit providers -- not interesting
- Content delivery networks
- points of presence (PoPs)
- wide-area networks
- data center networks
- compute servers
- Network interface
- software switch
- virtual machines
- TCP stack
#### Types
- End hosts (eyeballs or content providers)
- TCP stack
- how to accurately communicate and infer network conditions, and how to learn the right packet scheduling policy (congestion control algorithms)
- [Remy](http://web.mit.edu/remy/)
- [Aurora](http://proceedings.mlr.press/v97/jay19a/jay19a.pdf)
- applications (native apps, browser, etc.)
- how to best capture user's quality of experience (what to measure and at what frequency)
- how to best communicate the current state to remote sender (useful for real-time applications)
- how to make application-specific decision locally (from where to request the data, what bit rate or resolution chunk to request)
- [RPC](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p325.pdf)
- [Pensieve](https://web.mit.edu/pensieve/content/pensieve-sigcomm17.pdf)
- last-mile networks (discussed extensively in last lecture)
- how to best map packets to different users/devices/applications?
- [Neural packet classification](https://www.cs.jhu.edu/~xinjin/files/SIGCOMM19_NeuroCuts.pdf)
- [A Look Behind the Curtain: Traffic Classification in an Increasingly Encrypted Web](https://dl.acm.org/doi/10.1145/3447382)
- [nPrintML](https://dl.acm.org/doi/pdf/10.1145/3460120.3484758)
- how to infer various QoS and QoE metrics?
- [NetMicroscope](https://arxiv.org/pdf/1901.05800.pdf)
- [Requet](http://www.columbia.edu/~ebk2141/papers/requet-mmsys19.pdf)
- how to drive routing and packet scheduling decisions?
- [QFlow](https://par.nsf.gov/servlets/purl/10163437)
- Content delivery networks
- points of presence (PoPs)
- how to assess QoE for different available routing paths
- how to select the routing path for a given application-user pair
- wide-area networks
- similar questions as PoP but different level of visibility and control
- data center networks
- similar questions as PoP but different level of visibility and control
#### Learning is intrinsic to Networking
Learning is intrinsic to networks. This is attributable to the distributed nature of networks, where each stakeholder (e.g., end-hosts or intermediate network devices) only has access to partial state and is required to use this partial state to make critical decisions.
##### Existing learning solutions in Networking
We will use two examples to understand the prevalence of **rule-based** learning in networks.
###### Transmission Control Protocol (TCP)
Networks work (and fail) because of different learning algorithms. Most of the existing learning models are `rule-based`. For example, consider the case of TCP. Here, the sender infers the network conditions using the information it can extract from the stream of acknowledgements. Different variants of TCP use different techniques to infer network bandwidth and delay from these packet streams. Each of these variants uses a different set of rules to determine how to send packets on the network --- congestion control algorithms.
<!-- The success of these variants depends on how well they generalize for the tails, i.e., anomalous network conditions. -->
Some of these variants are custom-designed for certain types of network conditions (e.g., low-latency, high-bandwidth data center networks or high-latency, low-bandwidth satellite networks) and don't generalize well. Similarly, some of these variants are designed for the most common network conditions (e.g., 5-100 ms latency, up to 10-1000 Mbps bandwidth, and 0-1 % loss rate) and don't generalize well for the tail of the network conditions. A meta-learning problem here is to determine which variant of TCP to use for a given network condition, and to select appropriate configuration parameters (e.g., [Configtron](https://arxiv.org/pdf/1908.04518.pdf)).
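As a concrete illustration of such rule-based control, here is a toy AIMD (additive-increase, multiplicative-decrease) window update in Python; it is a textbook sketch, not the logic of any specific production TCP variant:

```python
# Toy AIMD congestion-window update: the sender "learns" network state only from
# ACK/loss events and reacts with fixed, hand-crafted rules.
def aimd_update(cwnd, ssthresh, event, mss=1):
    """Return (cwnd, ssthresh) after one ACK or loss event."""
    if event == "loss":
        ssthresh = max(cwnd / 2, 2 * mss)   # multiplicative decrease
        cwnd = ssthresh
    elif cwnd < ssthresh:
        cwnd += mss                          # slow start: one MSS per ACK (doubles per RTT)
    else:
        cwnd += mss * mss / cwnd             # congestion avoidance: ~one MSS per RTT
    return cwnd, ssthresh

cwnd, ssthresh = 1, 64
for event in ["ack"] * 20 + ["loss"] + ["ack"] * 10:
    cwnd, ssthresh = aimd_update(cwnd, ssthresh, event)
print(round(cwnd, 2), ssthresh)
```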
###### Adaptive Bit Rate (ABR)
![](https://i.imgur.com/4grs3X9.png)
Abstract model of DASH Players ([reference](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p325.pdf))
![](https://i.imgur.com/77SVBe4.png)
Design space of ABR's learning problem. ABR solutions that only use a subset of these dimensions in the design space will suffer from the underspecification problem, i.e., they will converge on shortcut-learning solutions that will not generalize well in deployment.
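For a sense of what a rule-based point in this design space looks like, here is a toy buffer-based bitrate selection rule in Python. It is an illustrative BBA-style heuristic with made-up thresholds and bitrate ladder, not the MPC scheme from the referenced paper:

```python
# Toy buffer-based ABR rule: pick the next chunk's bitrate from the current
# playback buffer level alone. All constants are illustration values.
BITRATES_KBPS = [300, 750, 1200, 2350, 4300]   # hypothetical bitrate ladder

def select_bitrate(buffer_s, reservoir_s=5.0, cushion_s=20.0):
    """Map buffer occupancy (seconds) linearly onto the bitrate ladder."""
    if buffer_s <= reservoir_s:
        return BITRATES_KBPS[0]                   # protect against rebuffering
    if buffer_s >= reservoir_s + cushion_s:
        return BITRATES_KBPS[-1]                  # buffer is healthy, go max
    frac = (buffer_s - reservoir_s) / cushion_s   # interpolate in between
    idx = int(frac * (len(BITRATES_KBPS) - 1))
    return BITRATES_KBPS[idx]

for buf in [2, 8, 15, 30]:
    print(buf, "s buffer ->", select_bitrate(buf), "kbps")
```

A rule like this uses only the buffer-occupancy dimension of the design space, which is exactly the kind of narrow specification the figure warns about.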
<!-- ##### Phase 2: Narrowly-scoped ML models
##### Phase 3: Holistic ML models -->
<!-- ##### Phase 4: Roadtesting in realistic settings -->
## Lecture 4
Evolution trajectory of learning solutions (for networks)
- `Phase 1 (Rule-based learning)`: Use domain-specific insights to come up with rule-based learning models
- Take years to converge on a good set of rules that generalize well, i.e., not only perform well for the dominant distribution, but also for the tail.
- Require revisiting old problems because the underlying distributions keep changing
- Why do underlying distributions change?
- changes in requirements
- new applications (YouTube, Zoom), new usage patterns (COVID)
- changes in technology
- innovations in the control (e.g., OpenFlow) and data-plane (e.g., PISA switches) tools
- Few illustrative examples
- COVID increased the usage of video conferencing applications, such as Zoom. For good QoE, these applications often require more upstream data than most other popular applications (e.g., Netflix). This increased the demand for more upstream capacity, or better utilization of the limited upstream capacity, which requires revisiting the rules for upstream bandwidth allocation in HFC and cellular networks.
- In the 90s and 2000s, we witnessed tremendous community effort to develop the set of rules that make the best use of the available network capacity (of the bottleneck link). However, the focus was on TCP connections between end-hosts with (relatively) higher latencies and lower bandwidths. The exponential growth of content providers contributed to the development of large-scale data center networks. The servers in these data centers need to exchange vast amounts of data with each other over links with very high capacity (tens of Gbps) and low latencies (on the order of a few milliseconds). These network conditions are very different from the ones that the heuristics of most TCP variants (e.g., TCP Cubic) were designed for at that time. This motivated the research community to revisit the learning problem, and led to a new wave of innovation, i.e., a new set of rules to sense and react to network conditions to make the best use of available network resources.
- `Phase 2 (Narrowly-scoped ML Models)`
- When do we need ML?
- Whenever we are trying to chase the tail
- Either the tail has elongated, i.e., the gap between requirements and available resources has increased significantly
- examples?
- Or, sensitivity to tail has increased
- examples?
- Chasing the tail requires
- detecting more complex patterns in the data, i.e., capture the underlying causal structure of the problem more effectively
- machine learning algorithms can beat simpler rule-based heuristics in identifying such structure if they have the right data and problem specification
- What do we need from learning models?
- **performance**, generalizability, robustness, and interpretability/explainability
- Most learning models in this phase focus on performance, i.e., they demonstrate that the ML-based approach beats rule-based heuristics for the **given** dataset
- Currently, the networking community is in phase 2
- Most research papers follow this trend:
- identify a problem where a rule-based approach is used for automated decision making
- curate a dataset that captures network conditions different from the ones that existing heuristics assumed
- demonstrate the limitations of these heuristics on this new dataset
- develop a learning model that outperforms the existing rule-based approach
- end
- Though this is the final step, the journey cannot end here. Very few of these learning algorithms are getting deployed.
- `Phase 3 (Holistic ML Models)`
- What gets a narrowly-scoped learning model deployed?
- extremely strong performance
- low-risk environments
- the cost of model failure is relatively low
- Why don't we develop holistic learning models, if that would make them more impactful?
- lack of high-quality data
- data curation is challenging and time-consuming
- most works either use what's easily available --- can't argue much about generalizability
- or, spend a lot of time in data curation -- not enough time to explore the robustness, generalizability, interpretability aspects of the solution
- Why is curating high-quality data hard?
- scale (compute, storage), privacy, fidelity
- What can we do (in academia)?
- develop programmable research infrastructure to develop and test ML models
- Design principles
- minimal disruption
- enable democratization
- Approach
- Use PISA target to anonymize headers and strip payload at scale
![](https://i.imgur.com/1xdBrdO.png)
- Use existing network appliances to label a subset of traffic
![](https://i.imgur.com/jM8lkuI.png)
- Actively collect data and labels with programmable end-hosts (e.g., RasPis)
![](https://i.imgur.com/SI99w1s.png)
- Programmatically deploy ML models at different vantage points for road-testing
![](https://i.imgur.com/UQvovsT.png)
## PINOT
### What's PINOT?
PINOT is an abbreviation for programmable research infrastructure for NetAI.
It has two key components:
1. `Programmable interface`: A target-agnostic API that lets the programmer(s) express their data-processing pipeline, specifying what, when, and where to collect (or deploy) data, labels, and learning models.
![](https://hackmd.io/_uploads/SJETOHT-9.png)
2. `Driver(s)`: Translates target-agnostic programs into target-specific configuration/programs and commands.
![](https://hackmd.io/_uploads/SkVPcS6b9.png)
Currently, our focus is more on data collection. Here, we have two key modules: (1) one collects packet traces from different vantage points in the network, and (2) the other collects application-specific data from end-hosts. Here, the end-hosts are usually single-board computers (e.g., RasPis) that are sprinkled across the campus network.
### Closed-loop data collection
Last-mile latency
![](https://i.imgur.com/eTIyeUi.png)
Expect QoE degradations during latency inflation events
![](https://i.imgur.com/gCt2rND.png)
Last-mile latencies at UCSB
![](https://i.imgur.com/lf2eUFO.png)
25% of the time, latency > 20 ms
Collecting data at the right time and location is important
![](https://i.imgur.com/x6qrUJL.png)
Collect data for the red and orange time slots and IP addresses (co-located RasPis)
Closing the loop
![](https://i.imgur.com/XSzd5Wc.png)
Sonata is a network streaming analytics system
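The closed-loop idea can be sketched as a simple trigger: monitor last-mile latency per host and start collection when a window of samples exceeds a threshold. The snippet below is illustrative only; the callback, threshold, and window size are hypothetical and not part of PINOT or Sonata:

```python
# Illustrative closed-loop data-collection trigger: watch a stream of last-mile
# latency samples and fire a collection callback for hosts whose recent average
# latency exceeds a threshold.
from collections import defaultdict, deque

WINDOW, THRESHOLD_MS = 10, 20.0
recent = defaultdict(lambda: deque(maxlen=WINDOW))

def on_latency_sample(host_ip, latency_ms, start_capture):
    """start_capture(host_ip) stands in for whatever actually triggers collection."""
    recent[host_ip].append(latency_ms)
    window = recent[host_ip]
    if len(window) == WINDOW and sum(window) / WINDOW > THRESHOLD_MS:
        start_capture(host_ip)          # e.g., ask the co-located RasPi to record

# Example usage with a dummy callback:
samples = [("10.0.0.7", 25.0)] * 12 + [("10.0.0.9", 5.0)] * 12
for ip, ms in samples:
    on_latency_sample(ip, ms, start_capture=lambda ip: print("capture", ip))
```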
### How can you use PINOT for your term project?
Possible approaches:
- Contribute to PINOT system
- data-compression algorithms to scale storage requirements
- flexible and scalable streaming analytics pipeline to scale network streaming analytics (critical for closed-loop data collection)
- dockerization of the platform (critical for the democratization effort)
- new tools to support data collection for additional learning problems
- Collect data to reproduce results for learning algorithm
- identify skews in existing datasets
- strategize how to collect data whose distribution is as different as possible from P
- Extend existing solutions to make them more holistic
- collect new high-quality dataset
- identify robustness/underspecification problems in existing learning models
- new algorithms to make them more robust
- apply principles of active learning to improve the quality of curated datasets
- apply continual learning to make existing learning models more robust
- apply transfer learning to adapt existing models in new environments
## Lecture 5
### Possible Problems for Term Project
1. What heuristics/rules does YouTube use to infer network bandwidth? You can use the dataset from ViaSat for this problem, and also use PINOT to curate a similar dataset from the campus network.
1. Similarly, we can also try to explain the decision rules for YouTube's ABR. These problems are non-trivial as we don't know what exact set of features YT uses to make these inferences/decisions (underspecification problem). An alternative can be to apply explainability tools to black-box models that mimic these decisions. For example, finding the precise decision rules for YT is hard, but doing so for NetMicroscope is (relatively) easy as we know which features it uses. Similarly, we can first consider training an LSTM for bandwidth prediction (treating YT's estimations as ground truth), aiming for as high prediction accuracy as possible (a minimal model sketch appears after this list). We can then use explainability tools to explain the decision-making for this black box.
1. Develop a self-explainable learning model for the QoE estimation problem. We train two models: one determines which features to use as input, and the other makes the inferences. With this approach, we can explain which features contribute to the decision for every test data point. You can consider applying this tool to other problems as well, for example, the traffic classification problems that we will learn about from the [nPrintML](https://nprint.github.io/pcapml/) project.
1. Evaluate the performance of Netmicroscope for satellite networks. The student can use PINOT to curate a labeled dataset from the campus network to train NetMicroscope and apply this model to the satellite network dataset. Quantify the differences. The student can use domain-adaptation techniques to develop a variant of the original model that performs well for the satellite networks.
1. Curate an (unlabelled) YouTube dataset from the passive campus traces. Quantify the differences in distributions for different features. Apply the domain-adaptation techniques above to develop a more robust version of the original model trained using the PINOT dataset.
1. Learn a better closed-loop data collection policy. Curate the (unlabelled) youtube dataset as mentioned above. Explore if there is any relationship between the last-mile latency (see notes from last lecture) at different granularities (host, AP, building, etc.) and the entropy of feature space (for features considered in NetMicroscope). Suppose we can argue that collecting data when the latency is high results in higher entropy in the feature space. In that case, we have a solid motivation to use this metric to drive a closed-loop data collection policy. We can repeat this step with other metrics, such as the volume of background traffic, number of active hosts, etc. at different granularities (host, AP, building, etc.). For all the metrics where we observe a positive correlation, we can train the data-collection policy using RNNs.
1. Learn the active measurement policy for estimating coarse-grained QoE metrics for a video session (e.g., # of rebuffering events, average resolution, etc.). The intuition here is that these average metrics depend heavily on variability in download speeds. For networks that offer high-speed and low variability, very few measurements (download times for different chunk sizes) should suffice to tell that the video will have no rebuffering events and will operate at the highest possible resolution. For networks with low speeds and more variability, assessing the average QoE might take more data points. It should be possible to learn how many chunks to download from a measurement server of what size and when to report the average QoE accurately. Here, the goal is to maximize the prediction accuracy and minimize the number of chunks (or the number of bytes) downloaded.
1. For all problems discussed above, replace video streaming applications (e.g., Youtube) with video conferencing applications (Zoom, Meet, Teams, etc.).
2. You can also apply the ideas above to other problems we will discuss in this course:
* Compute the likelihood of a five-tuple flow triggering an alert at the IDS (leverage the multi-fractal and bursty features). You will have to curate an IDS dataset for this problem (see notes from previous lecture).
* RL-based ABRs (Pensieve)
* RL-based TCP congestion control (Aurora)
* RL-based packet scheduling (QFlow)
* ...
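Referring back to the LSTM bandwidth-prediction idea above, a minimal PyTorch sketch of such a model might look like the following. The feature count, window length, and architecture are illustrative assumptions, and the dummy tensors stand in for real per-second features labeled with the player's reported bandwidth estimates:

```python
# Minimal LSTM regressor: predict the next bandwidth estimate from a short
# window of per-second traffic features.
import torch
import torch.nn as nn

class BandwidthLSTM(nn.Module):
    def __init__(self, n_features=4, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)      # predicted bandwidth (e.g., Mbps)

    def forward(self, x):                     # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])       # use the last time step

model = BandwidthLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(16, 30, 4)                    # dummy windows of features
y = torch.rand(16, 1) * 50                    # dummy bandwidth labels
for _ in range(5):                            # a few illustrative training steps
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```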
System-level contributions
* Expand PINOT to collect ground truth labels for more applications
* `Video conferencing applications` (VCAs): Currently, we can collect the ground truth for Google Meet and MS Teams. You can talk to Irene to learn more about how the data collection works for these two apps. We use WebRTC stats from the Chrome browser and Selenium to automate the data-collection pipeline. Unfortunately, we cannot use the same pipeline for Zoom. Two possible approaches:
- make Zoom app work on RasPi, and then use the statistics window to collect the ground truth;
- use the approach described in the Salsify project, where different QR codes are embedded in different frames to explicitly measure the frame rate at the receiver.
* `Video Streaming applications` (VSAs): Currently, we only support curating the data for YouTube. We can extend the ideas from previous works to support data collection for other applications, such as NetFlix, Twitch, Amazon Prime, etc.
* data-compression algorithms to scale storage requirements
* flexible and scalable streaming analytics pipeline to scale network streaming analytics (critical for closed-loop data collection)
* dockerization of the platform (critical for the democratization effort)
* new tools to support data collection for additional learning problems
## Lecture 6
### Interpretable Learning Models
We will refer to this [book](https://christophm.github.io/interpretable-ml-book/) to learn about Interpretable ML
### Interpretability
#### What is interpretablility?
Interpretability is the degree to which a human can understand the cause of a decision ([1](https://arxiv.org/abs/1706.07269)). Interpretability is the degree to which a human can consistently predict the model’s result ([2](https://papers.nips.cc/paper/2016/hash/5680522b8e2bb01943234bce7bf84534-Abstract.html)).
#### Taxonomy
* Intrinsic or post hoc?
* Model-specific or model-agnostic?
* Local or global?
Different interpretation methods
1. **Feature summary statistic**: Many interpretation methods provide summary statistics for each feature. Some methods return a single number per feature, such as feature importance, or a more complex result, such as the pairwise feature interaction strengths, which consist of a number for each feature pair.
1. **Feature summary visualization**: Most of the feature summary statistics can also be visualized. Some feature summaries are actually only meaningful if they are visualized and a table would be a wrong choice. The partial dependence of a feature is such a case. Partial dependence plots are curves that show a feature and the average predicted outcome. The best way to present partial dependences is to actually draw the curve instead of printing the coordinates.
1. **Model internals (e.g. learned weights)**: The interpretation of intrinsically interpretable models falls into this category. Examples are the weights in linear models or the learned tree structure (the features and thresholds used for the splits) of decision trees. The lines are blurred between model internals and feature summary statistic in, for example, linear models, because the weights are both model internals and summary statistics for the features at the same time. Another method that outputs model internals is the visualization of feature detectors learned in convolutional neural networks. Interpretability methods that output model internals are by definition model-specific (see next criterion).
1. **Data point**: This category includes all methods that return data points (already existent or newly created) to make a model interpretable. One method is called counterfactual explanations. To explain the prediction of a data instance, the method finds a similar data point by changing some of the features for which the predicted outcome changes in a relevant way (e.g. a flip in the predicted class). Another example is the identification of prototypes of predicted classes. To be useful, interpretation methods that output new data points require that the data points themselves can be interpreted. This works well for images and texts, but is less useful for tabular data with hundreds of features.
1. **Intrinsically interpretable model**: One solution to interpreting black box models is to approximate them (either globally or locally) with an interpretable model. The interpretable model itself is interpreted by looking at internal model parameters or feature summary statistics.
#### Scope of Interpretability
* How does the algorithm create the model?
* How does the trained model make predictions? (global)
* Which features are important and what kind of interactions between them take place?
* How do parts of the model affect predictions? (global at modular level)
* Why did the model make a certain prediction for an instance? (local single)
* Why did the model make specific predictions for a group of instances? (local multi)
#### Intepretable Models
![](https://i.imgur.com/Na061XO.png)
##### Decision Trees
![](https://i.imgur.com/NwiJGwM.png)
There are multiple algorithms to grow a tree. The most popular is the classification and regression trees (CART) algorithm.
* CART takes a feature and determines which cut-off point minimizes the variance of y for a regression task or the Gini index of the class distribution of y for classification tasks.
* The variance tells us how much the y values in a node are spread around their mean value. The Gini index tells us how “impure” a node is, e.g. if all classes have the same frequency, the node is impure, if only one class is present, it is maximally pure.
* Variance and Gini index are minimized when the data points in the nodes have very similar values for y.
* As a consequence, the best cut-off point makes the two resulting subsets as different as possible with respect to the target outcome. For categorical features, the algorithm tries to create subsets by trying different groupings of categories.
* After the best cutoff per feature has been determined, the algorithm selects the feature for splitting that would result in the best partition in terms of the variance or Gini index and adds this split to the tree.
* The algorithm continues this search-and-split recursively in both new nodes until a stop criterion is reached. Possible criteria are: A minimum number of instances that have to be in a node before the split, or the minimum number of instances that have to be in a terminal node.
##### Analysing decision trees
`Feature importance`: Go through all the splits for which the feature was used and measure how much it has reduced the variance or Gini index compared to the parent node. The sum of all importances is scaled to 100. This means that each importance can be interpreted as share of the overall model importance.
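As a small illustration of this (using scikit-learn and a stock toy dataset, so it is not tied to any networking data), the snippet below grows a CART tree and reads off the impurity-based feature importances; note that scikit-learn normalizes them to sum to 1 rather than 100:

```python
# CART-style feature importance: the normalized total impurity reduction
# contributed by each feature across all splits that use it.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

for name, imp in zip(load_iris().feature_names, tree.feature_importances_):
    print(f"{name:25s} {imp:.3f}")     # importances sum to 1.0
```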
`Branch Importance` ?
What are the pros and cons of using decision trees to develop interpretable learning models?
#### Model-agnostic Methods
What?
![](https://i.imgur.com/jBpFqYp.png)
Why?
* `Model flexibility`: The interpretation method can work with any machine learning model, such as random forests and deep neural networks.
* `Explanation flexibility`: You are not limited to a certain form of explanation. In some cases it might be useful to have a linear formula, in other cases a graphic with feature importances.
* `Representation flexibility`: The explanation system should be able to use a different feature representation as the model being explained. For a text classifier that uses abstract word embedding vectors, it might be preferable to use the presence of individual words for the explanation.
#### Global Model-Agnostic Methods
##### [PDP-based Feature Importance](https://christophm.github.io/interpretable-ml-book/pdp.html)
The partial dependence plot (short PDP or PD plot) shows the marginal effect one or two features have on the predicted outcome of a machine learning model.
What?
![](https://i.imgur.com/N3KFpzN.png)
![](https://i.imgur.com/4fFF0Zb.png)
![](https://i.imgur.com/lMZx9JF.png)
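The computation behind a PDP is simple enough to write by hand (scikit-learn also provides `sklearn.inspection.partial_dependence`); the sketch below uses a stock regression dataset purely for illustration:

```python
# Hand-rolled partial dependence for one feature: fix the feature of interest at
# each grid value for *every* instance, average the model's predictions, and plot
# the average prediction against the grid value.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

feature = 2                                   # BMI column in this toy dataset
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 20)
pdp = []
for v in grid:
    X_mod = X.copy()
    X_mod[:, feature] = v                     # ignore whether v is plausible for every instance
    pdp.append(model.predict(X_mod).mean())   # average over the data distribution

print(list(zip(np.round(grid, 3), np.round(pdp, 1))))
```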
Cons:
* assumption of independence among features
* do not show the distribution of features --> correlation is not evidence of causation
##### [Accumulated Local Effects (ALE)](https://christophm.github.io/interpretable-ml-book/ale.html)
Accumulated local effects (ALE) describe how features influence the prediction of a machine learning model on average. ALE plots are a faster and unbiased alternative to partial dependence plots (PDPs).
To summarize how each type of plot (PDP, M, ALE) calculates the effect of a feature at a certain grid value v:
* `Partial Dependence Plots`: “Let me show you what the model predicts on average when each data instance has the value v for that feature. I ignore whether the value v makes sense for all data instances.”
* `M-Plots`: “Let me show you what the model predicts on average for data instances that have values close to v for that feature. The effect could be due to that feature, but also due to correlated features.”
* `ALE plots`: “Let me show you how the model predictions change in a small *window* of the feature around v for data instances in that window.”
<!-- ## Lecture 6 -->
#### Local Model-Agnostic Methods
##### LIME
##### SHAP
#### Detecting Concepts
#### Self-explained Learning
## Lecture 7
### Reading Assignment
Read the paper, [Beauty and the Burst: Remote Identification of Encrypted Video Streams](https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/schuster)
You have to answer the following questions as part of your pre-reading assignment.
* What problem is this paper solving?
* Why is that problem important?
* How does DASH standardize a leak?
* Explain Figures 2.2 and 2.3. Use the **WALTER** technique to describe the figures. Here, W=Why?, A=Axes, L=Lines, T=Trend, R=Recap/takeaway
* How is data collection automated?
* What's the Bento4 MPEG-DASH toolset doing?
* Why did the authors use CNN? What can be a better tool to use here (if any)?
* What's the input to the classifier?
* What are the detection cascades? How can they be useful for this problem?
* What other systems have leveraged VBR leaks? How is the proposed system better/worse than these systems?
* How do you compare this approach with the one in [44]?
Each of you has to answer these questions individually. You will add your response to this document itself. Since others' responses will be visible to you, please avoid copying them. I will know which response came first and which one simply copied another.
Please follow the template below to submit your reviews.
[Beauty and the Burst Reviews](/w3HLIqJNQb67drQi7xpXrA)
#### Post-lecture Blog --- Team 1
In general, the discussion follows the corresponding questions above, but there are some changes and additions in the question headings.
##### What problem?
The authors show that a fingerprint of a video can be created using just encrypted network traffic.
##### Why important?
It is a privacy issue: it means that traffic encryption alone is not enough to hide what a user is watching.
##### The leak explanation:
Usually, a video stream is a sequence of frames, where some frames are keyframes and all other frames are just differences between the current frame and the previous one. Because variable bitrate (VBR) encoding is used, the traffic volume is higher for more active scenes (more pixels change, so more differences need to be sent). So each frame difference theoretically has a distinguishable, stable size.
Let's say a whole keyframe takes 1 Mb. The next frames take less because the changes are insignificant, but when significant changes occur, the traffic volume increases again (see picture below).
![](https://i.imgur.com/eBNcsWQ.jpg)
So for each video, with its unique scene changes, we get a unique sequence of sizes for the successive frame changes. DASH sends these frame changes in a predictable way (every N seconds), and that's why you can create a (rather) unique video pattern, i.e., a fingerprint.
Audio can be in the same stream (reusing the same TCP connection, which for YouTube it usually is), but it is smaller and more stable than video, so we ignore it for now.
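To make the burst-pattern idea concrete, the toy sketch below (synthetic packet data, made-up window size) turns a list of (timestamp, size) observations into a bytes-per-window series, which is essentially the fingerprint being discussed:

```python
# Toy fingerprint: bucket observed (timestamp, size) pairs of a stream into fixed
# windows and compare the resulting bytes-per-window series between two captures.
import numpy as np

def burst_series(timestamps, sizes, window_s=0.25, duration_s=30):
    """Bytes delivered per time window -- the burst pattern of the stream."""
    bins = np.arange(0, duration_s + window_s, window_s)
    series, _ = np.histogram(timestamps, bins=bins, weights=sizes)
    return series

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 30, 2000))
s = rng.integers(400, 1500, size=2000)        # stand-in packet sizes (bytes)

a = burst_series(t, s)
b = burst_series(t + rng.normal(0, 0.01, t.shape), s)   # same video, slight jitter
print("L1 distance between the two traces:", int(np.abs(a - b).sum()))
```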
##### How to show that the pattern is unique before training the model
In the pictures (Figures 2.2 and 2.3 from the paper), the authors show a correlation between scene activity and bitrate. The authors created their own video with a combination of high-action and low-action scenes and showed that the bursts DASH produces for high-activity scenes are larger.
So we really see different bursts in a predictable pattern for different video chunks.
/* The authors selected a simple, basic situation that clearly supports their hypothesis; it is usually recommended to introduce and check your idea this way before extending it and making it more complicated. */
To check whether the same video can be re-identified, they showed that the accumulated differences (difference for each frame) between captures of the same video are never more than N KB, which is less than the smallest possible difference (e.g., if two frames are absolutely different) between two different videos, so we have a clear threshold (and recall is secured).
To secure precision, they took 3558 videos, calculated a fingerprint for each of them, and showed that about 20% of the videos have unique fingerprints that cannot be mistaken for another video.
![](https://i.imgur.com/1cDhchs.jpg)
##### How is data collection automated?
What we need: unique labels and the corresponding packet-trace statistics.
Where do we get the videos from? Let's imitate users' patterns and use YouTube's "recommended" videos.
The authors used tshark to collect the flow statistics.
Tools to collect data for our projects: tshark (packets and statistics), Wireshark (a GUI on top of tshark), and tcpdump (which does not compute packet statistics, just captures raw data). The authors needed only specific statistics (size per second, etc.), so tshark was the better choice for them (no need to spend much storage on raw traffic).
##### What’s the Bento4 MPEG-DASH toolset doing?
They used this tool to convert YouTube videos into segments served from a local streaming platform, so that the fingerprint calculation is more reliable and less noise is introduced, proving the theoretical idea without network problems.
The authors used this tool for the theoretical proof, but afterward used raw network data with noise.
##### How the attack could be implemented?
The malicious code only needs to know network packet sizes and timestamps, so it could sit on ISP equipment, a WiFi router, a WiFi device in promiscuous mode on the same network, or even in JavaScript code on another browser tab (such code can saturate the network with its own low-priority traffic and observe the bursts of the YouTube video through changes in its own traffic).
##### Why did the authors use CNN? What can be a better tool to use here (if any)?
They used a CNN because the focus is currently on short-term patterns, not the long-term ones that RNNs are good at capturing.
![](https://i.imgur.com/bRTB0E4.jpg)
Also, CNNs are simpler to start with. Instead of CNNs, it is possible to use methods that work with time-series data, such as recurrent networks or transformers.
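A minimal 1D CNN over such a bytes-per-window series could look like the PyTorch sketch below; the layer sizes and series length are illustrative choices, not the architecture from the paper:

```python
# Minimal 1D CNN classifier over a bytes-per-window burst series.
import torch
import torch.nn as nn

class BurstCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=16, stride=4), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=8, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # collapse the time dimension
        )
        self.classify = nn.Linear(32, n_classes)

    def forward(self, x):                     # x: (batch, 1, series_len)
        return self.classify(self.features(x).squeeze(-1))

model = BurstCNN()
dummy = torch.randn(8, 1, 240)                # 8 traces, 240 burst windows each
print(model(dummy).shape)                     # -> torch.Size([8, 10])
```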
##### Theoretical problems for this work:
Most streaming platforms use adaptive bitrate (ABR) to change video resolution on-the-fly according to the network state.
When network conditions change rapidly, the quality of the video can change as well, so the bitrate will change and there is a chance that the pattern will change and we won't observe the same unique signature.
Also, the length of the video segment needed to find that unique pattern is an open question (1 s? 5 s? 2 min? 10 min? the whole video?), and a lot depends on it (too long is bad, but too short can help with the ABR problem).
## Lecture 8
### Reading Assignment
Read the paper, [nPrintML](https://arxiv.org/pdf/2008.02695.pdf)
You have to answer the following questions as part of your pre-reading assignment.
* What problem is this paper solving?
* Why is that problem important?
* Why do we need a standard data representation for networking-related learning problems?
* Does increasing number of packets in nPrint vector improve performance for different learnings? If yes, why?
* What is a device fingerprinting problem? Which dataset did the paper use for this problem?
* What is the application identification problem considered in the paper? Which dataset did the paper use for this problem?
* How will you use PINOT to curate datasets considered in this paper?
* What are the limitations of nPrintML? How can we address these limitations?
#### Reviews
[nPrintML Reviews](https://hackmd.io/3hov0pqGTkqLFkslFJ7fYw)
#### Post-lecture Blog --- Team 2 (Navya)
##### Introduction
nPrintML is an interesting paper to explore: it is the first of its kind to propose a standardized representation of packet data, called nPrint, and combine it with AutoML to automate much of the traffic-analysis pipeline, such as model selection and hyperparameter tuning. In this blog we go through the discussion we had about this paper and summarize everyone's views.
##### Problem the paper is trying to address
We all agreed on the premise that this paper is trying to establish a standardized representation of networking data (packet data) and leverage the AutoGluon-Tabular AutoML tool for the next order of business: choosing the appropriate model and finding the appropriate hyperparameters to train it. We also agreed that this hypothesis rests on the proposition that a fixed-size, complete, normalized, and aligned representation of the data (nPrint) fits AutoML perfectly, since AutoML takes care of deciding which features are necessary and which are not.
##### Why is this problem important
We discussed why this problem could be considered important, and there were some interesting responses. Many of us argued that for any machine learning problem, automating certain tasks would save a lot of time at the end of the day. From a network operator's perspective, if most of the machine learning tasks could be automated, the overhead of understanding and debugging these programs would be reduced. However, this discussion led us to another interesting angle: is the representation proposed by the paper perfect, or could we change anything?
##### Why do we need a standard data representation for networking-related learning problems? If we can propose a standard representation how would it look?
We started off this discussion with the question of whether this format is good or not, and there were some really interesting responses. Some of us argued that this format is good enough, in that it encodes all the packet data in a uniform way (fill 1 if the bit is set, 0 if it is not, and -1 if the field does not exist in the packet), and that AutoML can take this representation and draw valid inferences from the data patterns. Some of us pushed back, bringing the TCP options field into the picture and pointing out how much unnecessary data we end up including in the vector. Some of us even argued that the many -1s included for absent fields add a lot of noise, and proposed using a variable-size representation instead. Then there was an interesting discussion on why we would need to remove noise or reduce size at all if AutoML can identify the essential features, which led us to dimensionality reduction and feature rejection. We discussed how the curse of dimensionality could lead to problems such as incorrect inferences or overfitting, and agreed that if the data is large and full of spurious patterns, the model may base its decisions on those patterns as well. We ended with an open-ended discussion on whether the model the paper proposes is overfitting or fitting just right.
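To ground the discussion, here is a toy version of that encoding idea in Python; the header sizes and field handling are heavily simplified compared to the real nPrint tool:

```python
# Toy nPrint-style encoding: every header bit becomes its own feature (1 or 0),
# and an absent header is filled with -1, so every packet maps to the same
# fixed-length, aligned vector.
def bits(raw, n_bits):
    """Expand raw header bytes into individual bit features (most significant bit first)."""
    out = [(byte >> (7 - i)) & 1 for byte in raw for i in range(8)]
    return out[:n_bits]

def encode_packet(ipv4_header=None, tcp_header=None, ipv4_bits=160, tcp_bits=160):
    """Fixed-length vector; headers missing from the packet are filled with -1."""
    vec = bits(ipv4_header, ipv4_bits) if ipv4_header is not None else [-1] * ipv4_bits
    vec += bits(tcp_header, tcp_bits) if tcp_header is not None else [-1] * tcp_bits
    return vec

v = encode_packet(ipv4_header=bytes(20))   # e.g., a non-TCP packet: no TCP header
print(len(v), v[:8], v[-4:])               # 320 features; the TCP half is all -1
```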
##### Does increasing number of packets in nPrint vector improve performance for different learnings? If yes, why?
There was a discussion on whether we really need to fix the number of packets fed to the model, and we agreed that if we do, it adds overhead for network operators, who have to decide on the number of packets in advance.
##### What is a device fingerprinting problem? Which dataset did the paper use for this problem?
We had a discussion on the device fingerprinting problem in the paper and agreed that this approach compares itself with the Nmap device-fingerprinting tool. We had a brief discussion on what Nmap is and then went through the statistics mentioned in the paper. After this, we discussed the accuracies reported in the paper and talked about the correlations between these statistics and the reported accuracies. We ended this discussion open-ended, taking it that the paper really was accurate in identifying devices with the accuracies mentioned.
##### What is the application identification problem considered in the paper? Which dataset did the paper use for this problem?
We all agreed that the paper proposes to expand on MacMillan et al.'s work on application identification using Snowflake, a pluggable transport for Tor that uses WebRTC to establish browser-to-browser connections. The authors propose to infer the browser together with the application as a pair, and report an ROC AUC of 99.8% and an F1 score of 99.8% for their approach.
##### How will you use PINOT to curate datasets considered in this paper?
We can leverage the PINOT testbed to curate an application identification dataset by having the minions communicate with each other through applications such as Zoom and Facebook while performing packet capture in the background.
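As a rough illustration of what such curation could look like (the interface name, capture duration, and session labels below are placeholders, not PINOT specifics), one could launch a background packet capture on each node while the application traffic is generated:

```python
# Hedged sketch: start a background tcpdump capture on a testbed node while
# application traffic (e.g., a video call) is generated, then stop and label it.
# Interface name, duration, and output file are placeholders, not PINOT specifics.
import subprocess, time

def capture(interface="eth0", out_file="zoom_session.pcap", seconds=300):
    proc = subprocess.Popen(
        ["tcpdump", "-i", interface, "-w", out_file],  # requires root privileges
    )
    time.sleep(seconds)        # meanwhile, the application session runs
    proc.terminate()
    proc.wait()
    return out_file

pcap = capture()
print("captured trace:", pcap)  # label with the application used, e.g. "zoom"
```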
##### What are the limitations of nPrintML? How can we address these limitations?
As we discussed, a major limitation of this approach is that it tries to fix a single vector size for a standard representation. This fixed size can lead to problems such as the curse of dimensionality, and the unnecessary inclusion of features increases the computation footprint and time cost. The paper itself acknowledges that the approach has limited performance when dealing with automated time-series analysis and with classification using multiple flows.
##### Conclusion
From the overall discussion above, we can conclude that nPrintML was successful in presenting a standard representation, nPrint, that leverages AutoML and was able to automate certain traffic analysis tasks as explained in the paper. However, the scope of applicability of these techniques remains a question, given that there are so many applications dealing with traffic analysis, each of which uses its own kind of data, sometimes needing anything from a little customization to a painstaking data pre-processing effort. If nPrintML, or similar work, can rise to a level that addresses these issues for the community, it would be a huge step toward streamlining the process of creating machine learning models for networking problems. However, a long line of research awaits before we reach that point. A lower-hanging fruit in this direction would be to expand the class of problems this approach can handle and to figure out how to take it from there.
## Lecture 9 -- Cancelled
## Lecture 10
### Proposal Presentation (Teams 1-6)
## Lecture 11
### Proposal Presentation (Teams 7-12)
## Lecture 12
### Reading Assignment
Read the paper, [NetMicroscope](https://arxiv.org/pdf/1901.05800.pdf)
You have to answer the following questions as part of your pre-reading assignment.
* What problem is this paper solving?
* Why is that problem important?
* How is it curating the QoE dataset? What are the fundamental challenges in the data-collection process? What are the limitations of this dataset?
* How will you use PINOT to curate datasets considered in this paper? Will usage of PINOT address the limitations discussed above?
* What's the precise learning problem considered in the paper? More concretely, what's the input, what's the output? What's the model selection pipeline?
* Is this problem vulnerable to underspecification? Is the current approach capturing the underlying causal structure of the problem?
* How is the paper analysing the feature importance? How is this approach different from different interpretability tools we discussed in previous lectures?
* What domain adaptation technique did the paper use?
* Explain Figure 15 and Figure 17 in the paper.
#### Reviews
[NetMicroscope Reviews](https://hackmd.io/zhPDHFHKSY2jTQdn05sunA)
#### Post-lecture Blog — Team Insignia
##### Introduction
Video traffic is by far the most dominant traffic on today's internet, and the total volume of traffic generated by these services keeps increasing. Reports such as the one published by Cisco, shown in **Figure 1**, predicted that by 2022 traffic would grow by more than 60%. Traffic grew so much during the lockdown that services like Netflix and YouTube reduced their streaming quality to avoid overrunning the resources available in the network. In the past, Internet Service Providers (ISPs) were able to cope with such resource issues by applying a number of optimizations to the video traffic traversing their networks. But why should only the content providers reduce quality; is there nothing the ISPs can do here? With the widespread adoption of encryption it is almost impossible for ISPs to apply the optimization techniques that were previously possible. When traffic is encrypted, operators are left to observe coarse features of the traffic, such as the **throughput** of the flows generated by a service, with no sense of the underlying quality of the content being transmitted.
![](https://i.imgur.com/vUxOC0W.png)
###### How to quantify video quality?
Inferring video streaming quality from encrypted network traffic consists of determining a number of metrics that impact the experience, collectively referred to as QoE. The authors targeted two specific metrics: **Startup Delay** and **Resolution**.
###### Startup Delay
The time from the moment a user clicks on the link of the video they are interested in watching to the moment the video actually starts playing.
###### Resolution
The number of pixels that constitute each frame of the video.
To infer these metrics from the traffic, a number of features can be collected by monitoring the traffic flowing through the network. The authors grouped these features into three categories:
- Network Layer Features :
Features that rely solely on information available from observing a network flow identified at the IP level.
- Transport Layer Features :
Features extracted from observing transport-layer headers and, possibly, from keeping track of the state of these protocols.
- Application Layer Features :
Any feature related to application data that can be deduced by observing patterns in the traffic.
![](https://i.imgur.com/NZM56aA.png)
##### Methodology and Model Validation
To design models that infer the aforementioned quality metrics, the authors collected the three categories of features for over **13,000** video sessions across four major streaming services: **Netflix, YouTube, Amazon Prime Video, and Twitch**. In particular, a controlled lab environment was used to collect data under different network conditions and to obtain the ground truth for the video sessions. The goal of the paper is to infer video quality metrics at 10-second intervals, a good trade-off between the precision of the inferred metrics and the likelihood that each time bin is long enough to contain a complete video segment. The startup delay is inferred in seconds using features extracted from the first 10 seconds of the video session. The resolution is classified into multiple classes, one for each of the following resolution values: 240p, 360p, 480p, 720p, and 1080p.
The authors trained the models using six sets of input features: network-layer features (Net), transport-layer features (Tran), application-layer features (App), as well as combinations of features from different layers: Net+Tran, Net+App, and all layers combined (All). For each target quality metric, 32 models were trained in total:
- varying across these six feature sets, and
- using six different datasets: one with sessions from each of the four aforementioned video services, plus two combined datasets, one with sessions from all services and one with sessions from three out of the four services.
For each target quality metric, models were evaluated using 10-fold cross-validation. To label traffic traces with the appropriate video quality metrics, the authors developed a Chrome extension that monitors application-level information for the four services. The extension supports any HTML5-based video and allows video quality metrics to be assigned to each stream as seen by the client. It collects browsing history by parsing events available from the Chrome WebRequest APIs. The data collection is further tailored for each service:
- Netflix - Parsing overlay text
- YouTube - iframe API
- Twitch and Amazon - HTML 5 tag parsing
In total, 11 machines were instrumented to generate video traffic and collect packet traces together with data from the Chrome extension: six laptops in residences connected to their home WiFi networks, four laptops located in a lab connected via the local WiFi network, and one desktop connected via Ethernet to the lab network. The authors experimented with different types of regression and classification methods but ultimately picked Random Forest, as it gave the best results.
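A minimal sketch of this modeling setup, assuming the per-10-second features have already been extracted into a table (the feature column names below are made up, not the paper's exact feature list): train a Random Forest per feature set and evaluate it with 10-fold cross-validation.

```python
# Sketch of the per-feature-set training loop (illustrative; feature names are
# placeholders, not the paper's actual feature list).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

FEATURE_SETS = {
    "Net":     ["throughput_down", "throughput_up", "pkt_count"],
    "Tran":    ["rtt_mean", "retransmissions", "bytes_in_flight"],
    "App":     ["segment_size_mean", "segment_count", "segment_interarrival"],
    "Net+App": ["throughput_down", "throughput_up", "pkt_count",
                "segment_size_mean", "segment_count", "segment_interarrival"],
}

def evaluate(df: pd.DataFrame, label_col: str = "resolution"):
    """Return mean 10-fold CV accuracy for each feature set."""
    results = {}
    for name, cols in FEATURE_SETS.items():
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        scores = cross_val_score(model, df[cols], df[label_col], cv=10)  # 10-fold CV
        results[name] = scores.mean()
    return results
```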
##### Results
**Figure 2** shows the results as precision-recall curves, where each line corresponds to a Random Forest classifier trained on one of the feature sets. As evident from the curves, the models that rely on Net+App features outperform those that rely on Net+Tran features across all services.
![](https://i.imgur.com/5eEMaDb.png)
The authors further used the Gini Index to identify feature importance. Figure 3 shows the feature importance for Netflix and YouTube; the same conclusions held for the other services as well.
![](https://i.imgur.com/eH5esfA.png)
We also observe that most features at the top of the ranking are related to segment sizes; these results confirm the general intuition that, for similar content, a higher resolution implies more pixels per frame and therefore more data delivered for each video segment. In fact, without segment-related information these models would not have achieved similar precision and recall. Figure 4 shows the accuracy for each video service using the best-performing model, the one using Net+App features. In general, the precision and recall are above 81% for all services.
![](https://i.imgur.com/f4X5g3v.png)
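For reference, the Gini-based ranking discussed above corresponds to the mean-decrease-in-impurity importances exposed by scikit-learn's Random Forest; a minimal sketch (the `X`/`y` names are placeholders for the feature table and labels) would look like:

```python
# Sketch: rank features by Gini importance (mean decrease in impurity) after
# fitting a Random Forest; `X` is a feature DataFrame and `y` the label column.
from sklearn.ensemble import RandomForestClassifier

def gini_ranking(X, y, top_k=10):
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    ranked = sorted(zip(X.columns, model.feature_importances_),
                    key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```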
##### Deployment
Using the trained models, the authors analyzed data collected during a year-long study in 66 homes across the United States and France. To gather the data used by the models, the authors developed a monitoring tool that collects the features used during inference. The dataset was diverse in that it included homes served by different operators, with downstream throughputs ranging from 1 Mbps to 1 Gbps. In total, 200,000 video sessions were collected from the aforementioned video streaming services.
##### Discussions
Video quality metrics for an encrypted stream, namely startup delay and video resolution, are invaluable to the ISP, as they provide feedback on its service. A user may gravitate toward an alternative ISP if it offers higher QoE for the price. During the class discussion, bitrate as a heuristic for predicting resolution was ruled out because:
1. Each video has a different VBR, and the ISP cannot know which video is being watched.
2. ABR may be used instead of a fixed resolution during playback, continuously adjusting the resolution.
The paper uses traffic control (tc, a Linux utility for configuring the kernel packet scheduler) to generate a dataset in the laboratory under emulated network conditions. Although the resulting dataset does not capture real-world network variance, it was needed to cover network conditions (such as low bandwidth) that might not be present in the data collected in the wild (mostly from reliable home networks in the US and France). However, we argued that mahimahi, a suite of user-space tools for network emulation and analysis, would have been a better choice than tc.
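For context, emulating a delayed, lossy link with tc/netem amounts to one qdisc change per condition; a hedged sketch of sweeping a few conditions follows (the interface name and the parameter grid are placeholders, and root privileges are required):

```python
# Hedged sketch of sweeping emulated network conditions with tc/netem.
# Interface and parameter values are placeholders; requires root privileges.
import subprocess

def set_condition(iface, delay_ms, loss_pct):
    subprocess.run(["tc", "qdisc", "replace", "dev", iface, "root", "netem",
                    "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"], check=True)

def clear(iface):
    subprocess.run(["tc", "qdisc", "del", "dev", iface, "root"], check=True)

for delay, loss in [(20, 0.0), (100, 1.0), (300, 5.0)]:
    set_condition("eth0", delay, loss)
    # ... run a video session and record the ground-truth QoE here ...
clear("eth0")
```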
The authors were able to collect all the application-layer features from an encrypted stream using the upstream GET requests. The paper does not give details on how the RTT feature for the transport layer was collected.
The high precision and recall in the results seem too good to be true, since the model described in the paper effectively uses only the features an ABR algorithm considers. And because segment size also depends on the VBR, inferring video resolution with high accuracy is challenging.
It seems unfeasible to maintain a single composite model general enough to infer video quality metrics for multiple services, even if it is trained on data from all possible streaming services, since each service has its own peculiarities in its content delivery logic: most use a different ABR scheme (Twitch, for example, almost always has a startup delay of 5 seconds), and a paid service like Netflix will differ from a massive free platform like YouTube, as the two have different objectives. A better approach would be a separate model for each service, unless constraints on the ISP's servers limit the number of models that can run simultaneously.
## Lecture 13
### Reading Assignment
Read the paper, [Traffic Refinery: Cost-Aware Data Representation for
Machine Learning on Network Traffic](https://arxiv.org/pdf/2010.14605.pdf)
You have to answer the following questions as part of your pre-reading assignment.
* What problem is this paper solving?
* Why is that problem important?
* Provide a brief description of proposed system.
* What types of costs did the paper consider?
* Describe the three case studies, and summarize their takeaways.
* Let's consider the QoE inference problem. Is inferring QoE for skewed data distributions (e.g., a well-provisioned campus network) simpler? Comment on model performance vs. low-cost representation tradeoffs for such skewed settings.
#### Reviews
[Traffic Refinery Reviews](/N8qUmaAqS5epwAGlQsx4fw)
#### Post-lecture Blog — Team 4
###### What problem is this paper solving?
The paper examines the costs of the machine learning pipeline: data collection, feature engineering, model selection, and evaluation. It also examines what the right feature-cost vs. performance trade-offs are for machine learning problems in networking, and introduces the notion of system-level costs, which are often overlooked.
###### Why is that problem important?
This paper explores system-level optimisation, leaving aside the optimisations that can be made at the ML model development level. This is an important problem because collecting and storing data is the groundwork for every machine learning problem, and if it can be done optimally, further optimising ML solutions will improve performance as a whole. In class we further discussed different structural representations of network packets, such as nPrint, and how they differ in their representation and their costs. We discussed how using raw packet traces or nPrintML incurs significant memory, CPU, and storage costs, and we talked about these costs in detail. For CPU cost, the paper measures how much processing power is required for feature engineering. For memory cost, RAM usage is profiled; the profiling interval puts a limit on how fine-grained this can be. Storage cost profiles the storage required over time, and memory is also what is used when loading those features into the training pipeline.
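A rough sketch of how one might profile the CPU and memory cost of a feature-extraction step in the spirit of this discussion (the `extract_features` function is a placeholder for whatever representation is being costed):

```python
# Sketch: profile processing time and peak memory of a feature-extraction step.
# `extract_features` is a placeholder for the representation being evaluated.
import time
import tracemalloc

def profile(extract_features, packets):
    tracemalloc.start()
    t0 = time.perf_counter()
    features = extract_features(packets)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return {"cpu_seconds": elapsed, "peak_mem_bytes": peak, "n_features": len(features)}
```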
###### Brief description of proposed system
The authors propose Traffic Refinery to explore network data representations and evaluate their systems-related costs. Traffic Refinery implements a processing pipeline that performs passive traffic monitoring and in-network feature transformations at traffic rates of up to 10 Gbps. The pipeline supports capture and real-time transformation into a variety of common feature representations for network traffic, and also exposes an API for adding new representations.
The proposed system is built in Go and has the following components:
Traffic Categorization - Analyzes the traffic to associate packets with the corresponding service and application. This is done by leveraging a cache that stores the mapping between remote IP addresses and services.
Packet Capture and Processing - This component has two sub-components. The first is the partitioned flow cache, where per-flow state is stored; this prevents redundant processing of packets. The second is feature extraction, where, depending on the use case, meaningful features are extracted from the captured packets.
Aggregation and Storage - The system exports high-level features at regular time intervals; the intervals are determined by the configuration file for each service. The features are saved in a temporary file and uploaded to a remote location to be fed to the models.
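A minimal Python sketch of the pipeline's core idea (the real system is written in Go and runs at line rate, so this is illustrative only): keep per-flow state in a cache keyed by the 5-tuple, update it per packet, and export aggregated features at a fixed interval.

```python
# Illustrative sketch of a flow cache keyed by the 5-tuple with periodic export.
# This mirrors the pipeline's structure, not its actual Go implementation.
import time
from collections import defaultdict

FLOWS = defaultdict(lambda: {"bytes": 0, "packets": 0})

def on_packet(src_ip, dst_ip, src_port, dst_port, proto, length):
    key = (src_ip, dst_ip, src_port, dst_port, proto)   # the 5-tuple
    FLOWS[key]["bytes"] += length
    FLOWS[key]["packets"] += 1

def export(interval_s=10):
    """Export per-flow aggregates every `interval_s` seconds, then reset the cache."""
    snapshot = {k: dict(v) for k, v in FLOWS.items()}
    FLOWS.clear()
    return {"timestamp": time.time(), "interval": interval_s, "flows": snapshot}
```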
###### Case Studies
In class we talked about the feature sets required for the QoE problem: application-layer, network-layer, and transport-layer features. We spoke about flow features in depth, i.e., the packets belonging to a single connection collected as one flow, and about the 5-tuple (source IP, source port, destination IP, destination port, protocol), which is commonly used to identify flows in such problems.
QoE Case Study - For this case study, the authors used different combinations of feature sets - network only, application only, transport only, and network + application - and evaluated the models separately. Network-layer features alone did not yield good results, but network + application features combined gave the best results. We discussed the performance of all these combinations for the two case studies, and how the paper used different interval windows for training (2 seconds, 10 seconds, 60 seconds). Smaller windows give more granularity but might not be meaningful if the training data does not have that window size, and the relationship between finer granularity and performance is not linear; the cost improvement from a larger window largely balances out the performance loss. We also discussed how transport-layer features use up the most state cost, i.e., the most in-use memory, so adding these features might not be fruitful given their high cost relative to the modest improvement in performance. The graphs below show the performance of the system for the QoE problem with different feature sets -
![](https://i.imgur.com/1Ffhkbr.png)
Malware Detection Case Study - The authors use the CICIDS2017 dataset and convert the data into a byte representation, like an image. They find that cost is correlated with the size of the PNGs. The authors consider header, payload, and header+payload features and train a CNN on this data. They report that using just the payload features yields an accuracy of only 0.36. We discussed in class why that could be: the payload is encrypted, so little can be deduced from it, and for the majority of malware attacks there is more to be learned from the header than from the payload. Hence, we can disregard the payload as an important feature for this problem. The following figures show the performance of different feature sets in this case study -
![](https://i.imgur.com/sqRTJEq.png)
![](https://i.imgur.com/a4j6W0k.png)
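Complementing the byte-to-image representation described above, here is a minimal sketch of the idea (the 32x32 shape and zero-padding are illustrative assumptions, not the paper's exact preprocessing): raw packet or flow bytes are laid out as a fixed-size grayscale array that a CNN can consume.

```python
# Sketch: turn raw packet/flow bytes into a fixed-size grayscale "image" for a CNN.
# The 32x32 shape and zero-padding are illustrative assumptions.
import numpy as np

def bytes_to_image(raw: bytes, side: int = 32) -> np.ndarray:
    buf = np.frombuffer(raw[: side * side], dtype=np.uint8)  # truncate to side*side bytes
    padded = np.zeros(side * side, dtype=np.uint8)
    padded[: len(buf)] = buf                                 # zero-pad short inputs
    return padded.reshape(side, side) / 255.0                # normalize to [0, 1]

img = bytes_to_image(b"\x45\x00\x00\x3c" * 100)  # e.g., header bytes repeated
print(img.shape)  # (32, 32)
```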
###### Further Discussion
In class we further discussed how future work can be done in this space.
We talked about using grid search to find optimal cost vs. performance trade-offs; there should be work toward an algorithm that codifies the right sequence of steps to minimise the exploration. The paper does not clearly distinguish between training-time and inference-time costs, so exploring these separately might be worth the effort. We discussed how there needs to be more information about inference costs (inference time, for example), since these matter a lot for production systems. We also spoke about how all of the reported costs relate to feature engineering tasks, and how the costs need to be made more specific to training or inference.
We talked about how a fixed algorithm for selecting feature sets may or may not work for some use cases. For a legacy application, we can rely on one fixed approach; if the application is dynamic, the feature sets can change. In dynamic use cases, network conditions can affect the data distribution, and the costs may depend on network conditions; the data rate, for example, might have more of an impact on the network or transport layer. Performance may change with network conditions as well, since the type of data also determines how performance shifts. Adapting the model to different distributions when needed is another direction that could be pursued as future work in this space.
## Lecture 14
### Reading Assignment
Read the paper, [Pensieve](https://people.csail.mit.edu/hongzi/content/publications/Pensieve-Sigcomm17.pdf)
Questions:
* What problem is this paper solving?
* Why is that problem important?
* Describe the four practical challenges in designing a good ABR algorithm.
* Why learning-based approach makes sense for ABR algorithms? Are you satisfied with the arguments presented in the paper? Explain.
* Describe the design of the simulator used in the paper for training?
* Describe the input s_t taken as input by the learning agent.
* Describe how policy gradient training works for this problem. How A(s,a) is estimated?
* Why enhancement is required to generalize the learning model across multiple videos? What enhancement techniques are used in this paper?
* What's the meta story for evaluations? Does it justifies all the design choices with empirical results?
* How is this paper using network traces and Mahimahi tool for evaluation?
* How is this paper demonstrating the generalizability of the proposed solution? Is it making a strong case?
* How will you improve this work? What questions are left unanswered in this paper?
#### Reviews
[Pensieve Reviews](/0Sbc_SOFQn208KqZd9L4hg)
#### Post-lecture Blog — Team 5
##### What problem is this paper solving?
The paper considers current heuristic-based approaches insufficient for ABR, as the problem is complex and depends on network conditions (packet loss, throughput, latency), buffer size, and previous chunk quality. Pensieve is proposed as a superior, reinforcement-learning-based model for ABR.
##### Why is that problem important?
Practically speaking, ABR matters because if it produces low QoE, users will simply leave the platform for better alternatives. Current heuristic-based ABR approaches use sets of previous chunks to try to infer bandwidth. While such heuristic approaches are not necessarily flawed, they still present opportunities for improvement. MPC, for example, is tuned for specific environments and does not generalize easily. Furthermore, heuristic approaches risk oversimplifying the complex relationships between inputs, especially when the tuning assumes a linear relationship. Another benefit of machine learning is that it is self-driven and does not require manual tuning.
##### The practical challenges of designing an ABR algorithm
There are certain inherent challenges in designing an ABR algorithm. For one, network conditions can be volatile and inconsistent, and inconsistency tends to hurt predictive ability. Another issue is that many of the factors that contribute to QoE conflict with each other; for example, minimizing buffering might reduce smoothness, so when designing an ABR algorithm it is necessary to define what is being optimized. A further challenge is that the features for this problem tend to be coarse-grained, making it difficult to make completely accurate predictions.
##### Reasons for adopting a learning-based approach for ABR algorithms
In summary, the biggest reason for adopting a learning-based approach is that learned models should tend to generalize better; at the very least, Pensieve was designed with generalization in mind. Another consideration is that learned models can perform better on new data, as they do not require external knowledge about network conditions.
##### Learning problem specification of Pensieve
The input to the model consists of network conditions, the previous state for k time periods, buffer information, the number of remaining chunks in the current video, and previous bitrates. The output is the selection of the bitrate deemed optimal. Selecting this bitrate is driven by a reward function that produces a value based on QoE metrics. For training, Pensieve uses data from simulations in order to generate large amounts of data faster than real time, since reinforcement learning models tend to require a lot of data to be effective.
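A sketch of the kind of QoE-based reward described here: per chunk, a bitrate utility minus a rebuffering penalty minus a smoothness penalty. The specific weights below are placeholders, not necessarily the values used in the paper.

```python
# Sketch of a per-chunk QoE reward: bitrate utility minus rebuffering and
# smoothness penalties. The penalty weights below are placeholders.
REBUF_PENALTY = 4.3    # penalty per second of rebuffering (assumed weight)
SMOOTH_PENALTY = 1.0   # penalty per Mbps of bitrate change (assumed weight)

def qoe_reward(bitrate_mbps, rebuffer_s, last_bitrate_mbps):
    return (bitrate_mbps
            - REBUF_PENALTY * rebuffer_s
            - SMOOTH_PENALTY * abs(bitrate_mbps - last_bitrate_mbps))
```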
##### Description of the simulator design and discussion of simulators
The simulator that Pensieve used was designed to use recorded network traces to simulate network conditions in accelerated time. This gives the user control over the network conditions, since the input to the simulator is designed by the user. Furthermore, the simulator can be used to minimize the effect of existing ABR systems and produce something closer to raw data.
The simulator worked by calculating the rate at which the buffer drains and doing so computationally, rather than by actually watching the video. Each chunk is assigned a download time based on its bitrate and the simulated network conditions; once downloaded, the chunk's playback duration is added to the video playback buffer. This allows the simulator to simulate video playback, including edge-case states such as running out of buffer.
While a simulator is useful for collecting large amounts of data quickly, that data can also be a bane for models: models trained with simulators are only as good as the simulators are accurate. This raises the issue of simulator design; simulators that involve many features tend to be more complicated to engineer. Furthermore, a model trained only on simulated data is less convincing than one trained on real data. Without experiments demonstrating that models trained on simulated data work well in real network environments, it is difficult to believe in a model's effectiveness, especially since simulated data might have unusual distributions that do not exist in deployment; such distributions could lead to shortcuts in learning and make the model inaccurate. Still, for Pensieve a simulator was essential, as reinforcement learning tends to have difficulty converging, and the large amount of data produced by the simulator would otherwise take far too long to collect by watching videos and measuring real network conditions.
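A minimal sketch of the chunk-level simulation step described above (chunk sizes and throughput values are placeholder inputs): compute the chunk's download time from the trace, drain the playback buffer during the download, record a stall if the buffer empties, and add the chunk's playback duration when the download completes.

```python
# Minimal sketch of a chunk-level ABR simulator step: download time comes from
# the throughput trace, the buffer drains during the download, and rebuffering
# occurs when the buffer empties. All inputs are placeholders.
CHUNK_SECONDS = 4.0  # playback duration of one chunk

def simulate_chunk(buffer_s, chunk_bytes, throughput_bytes_per_s):
    download_s = chunk_bytes / throughput_bytes_per_s
    rebuffer_s = max(download_s - buffer_s, 0.0)              # stall if buffer empties
    buffer_s = max(buffer_s - download_s, 0.0) + CHUNK_SECONDS
    return buffer_s, download_s, rebuffer_s

buf, dl, stall = simulate_chunk(buffer_s=8.0, chunk_bytes=2_000_000,
                                throughput_bytes_per_s=500_000)
print(buf, dl, stall)  # 8.0 4.0 0.0
```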
##### General summary of policy gradient training for this problem
The simple answer is that policy gradient training adjusts the policy in the direction that maximizes reward. This is done by estimating A(s,a): the reward obtained from the selected action, plus the expected future reward from the state that follows it, minus the expected future reward from the current state. In other words, the model measures how much better the selected action (and the actions that follow it) is compared to what would be expected from the current state.
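Written out, this is the standard advantage-based policy gradient used in A3C-style training (here $\gamma$ is the discount factor and $V$ the learned value estimate; the one-step advantage estimate appears on the right):

$$
\nabla_\theta \, \mathbb{E}_{\pi_\theta}\!\Big[\textstyle\sum_t \gamma^t r_t\Big]
= \mathbb{E}_{\pi_\theta}\!\big[\nabla_\theta \log \pi_\theta(s_t, a_t)\, A^{\pi_\theta}(s_t, a_t)\big],
\qquad
A^{\pi_\theta}(s_t, a_t) \approx r_t + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)
$$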
##### The necessity of enhancements to account for variable bitrates
Videos are not uniform by any means: bitrate varies within a video, and each video has a different set of available bitrate encodings. This can hurt the generalizability of a model. Consider, for example, a model trained on a high-action video that is then applied to a low-action video; its quality selection might be too conservative and waste bandwidth, because the high-action video would have had a higher bitrate at a given time, and the model does not realize that the low-action video has a lower bitrate. The inverse can also hold: a high-action video might pause to buffer if it uses an ABR model trained on a low-action video. It is therefore necessary for models to somehow take these circumstances into account.
##### How the paper utilized network traces and Mahimahi
This paper used public network trace datasets and sampled them to create finer-grained data. The original data was collected from real clients on a network at a granularity of, at best, 30 minutes. Since this data comes from paying customers, it would be unreasonable to constantly blast the host network with speed tests, which would consume excessive bandwidth and data; 30 minutes was the best that could be done without interfering too much with the hosts. The paper used Mahimahi to emulate the network conditions from the traces and included an 80 ms round-trip time between client and server.
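For reference, replaying such traces under an emulated 80 ms round-trip time with Mahimahi can be wrapped as below; the trace file names and the client command are placeholders, not the paper's actual setup.

```python
# Hedged sketch: run a client command inside Mahimahi shells that add 40 ms of
# one-way delay (80 ms RTT) and replay uplink/downlink throughput traces.
# Trace paths and the client command are placeholders.
import subprocess

def run_emulated(client_cmd, up_trace="uplink.trace", down_trace="downlink.trace"):
    cmd = ["mm-delay", "40", "mm-link", up_trace, down_trace, "--"] + client_cmd
    subprocess.run(cmd, check=True)

run_emulated(["python3", "abr_client.py"])  # hypothetical client script
```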
##### Problems with passive measurement
There are several reasons the dataset above was collected using active rather than passive measurements. For one, passive measurements naturally raise privacy issues; these can be mitigated by masking sensitive information such as IP addresses, but the concern remains. Another issue is that passive measurements rarely capture a saturated network, so they may not be representative. Yet another drawback is that clients may not be able to support passive measurement due to technical limitations.
##### General takeaway
The main takeaway from the end of lecture is that you need to understand a machine learning solution and why it works. With understanding, it becomes possible to go back and improve the heuristics, which then raises the bar for machine learning approaches. Only when this iterative cycle reaches a point where heuristics clearly oversimplify the problem and can no longer be improved would black-box machine learning models be the right choice.
[Overflow Document -- CS 293 N, Spring 22](/NaG7DV3qSASM05GghjK9Mw)