# Traffic Refinery Reviews

#### What problem is this paper solving?

###### Liu Kurafeeva
This paper explains why, in networking, measuring machine learning model accuracy (or other metrics) alone is not enough, and suggests a way to include the system-level costs of different representations of network traffic in the pipeline.

###### Seif Ibrahim
This paper examines the system costs of different feature representations used to represent network traffic. It explores the design space and tradeoffs between feature representation cost and model accuracy.

###### Aaron Jimenez
The purpose of this paper is to propose a proof-of-concept system that can be used to define what features to extract during passive network data collection, as well as to model the costs of different data representations when applied to machine learning.

###### Brian Chen
This paper is trying to solve the lack of emphasis placed on the correlation between model inputs and both accuracy and cost. More precisely, the paper is trying to draw attention to the fact that inputs to machine learning models influence not only accuracy but also the cost of deployment. The paper proposes a system with the intent of supporting future research into this correlation. It also seeks to address the relative rigidity of current data collection, to allow model designers more access to data representations.

###### Achintya Desai
The paper is attempting to solve the problem of suitable data representation for an ML model. It is known that raw packet traces are the best possible representation of the data in terms of flexibility. However, this is impractical when the cost of data collection is considered. On the other hand, the operational cost of deploying an ML model in practice also needs to be considered. Specifically, the paper is trying to resolve this tradeoff between the most suitable data representation for better ML model performance and the cost of deploying the ML model.

###### Ajit Jadhav
The paper highlights the importance of considering system cost, in addition to ML model accuracy, when deploying in operational networks. The paper develops a new system for evaluating both model performance metrics and the system-level costs of different representations of network traffic.

###### Pranjali Jain
The paper develops a new framework that can support a range of representations for network traffic and evaluate the costs associated with them. Their system both monitors network traffic and transforms traffic in real time to produce a variety of feature representations for machine learning. It enables exploring different representations for learning and balancing the system costs related to feature extraction.

###### Rhys Tracy
This paper is trying to address the issue of how to represent your network data for use in machine learning models. The paper keeps both model accuracy and the costs of extracting data and training models in mind when comparing data representations.

###### Shereen Elsayed
The authors emphasize the importance of feature engineering and of how features are represented for the chosen model, which is crucial for both the model's accuracy and system-level costs. They work on network management using machine learning.

###### Arjun Prakash
This paper presents a proof-of-concept reference system implementation that is designed to explore different network data representations and evaluate the related costs of feature extraction and model training with these representations.

###### Apoorva Jakalannanavar
The paper tries to bridge a gap by considering the tradeoffs between model performance and system-level costs during deployment. It mainly explores various feature representations of network data and how they impact model performance and system-level costs.

###### Samridhi Maheshwari
This paper focuses on two things. First, it aims to find out how the cost of feature representation changes with network conditions, using Traffic Refinery. Second, it focuses on answering whether ML system costs can be reduced without reducing the accuracy of the model by changing the network representations themselves.

###### Nawel Alioua
Evaluating the system-level costs of different representations of network traffic, including which features are used in the ML process and how they are represented, in the context of two practical network management tasks: video streaming quality inference and malware detection.

###### Navya Battula
The paper tries to evaluate the system-level costs associated with deploying ML algorithms in networking environments and explores better feature engineering techniques for learning, the system costs associated with feature extraction, and training versus accuracy.

###### Punnal Ismail Khan
This paper presents a system that enables evaluating both model accuracy and the system-level costs of different representations of network traffic, to see if the model can be deployed in practice.

###### Nagarjun Avaraddy
This paper highlights the correlation between model inputs and the real-world deployment cost of ML models, in a field that they believe is very much centered only on ML model accuracy.

###### Alan Roddick
The problem the paper is trying to solve is a better joint evaluation of machine learning model accuracy and the system-level cost of different representations of network data.

###### Satyam Awasthi
This paper focuses on the association between a machine learning model, its accuracy, and its deployment cost at the system level. It explores better feature engineering techniques for learning, the system costs associated with feature extraction, and training versus accuracy.

###### Nikunj Baid
The paper aims to highlight that performance alone cannot determine the efficacy of an ML model. We also need to consider the cost involved in different representations of the network traffic. Hence the authors have come up with a solution that enables a joint evaluation of the cost involved in a data representation and the performance of the resulting model.

###### Deept Mahendiratta
It examines better feature engineering strategies for learning, the system costs associated with feature extraction, and training versus accuracy in order to assess the system costs associated with deploying ML algorithms in networking contexts.

###### Shubham Talbar
The paper highlights the importance of feature and data engineering and the system-level costs associated with deploying machine learning models in operational networks. The authors aim to bring together the ideas of model accuracy and deployment cost of machine learning models, based on their representations, using a system called Traffic Refinery, which evaluates the effects of different representations on model performance.

###### Vinothini Gunasekaran
This paper focuses on the correlation between machine learning model accuracy and deployment cost at the system level.
It also experiments with different representations of data and how they are associated with ML model accuracy and system cost.

#### Why is that problem important?

###### Liu Kurafeeva
This problem is important because, unlike in other areas where ML is applied, in networking it is extremely hard to base a solution on a representative dataset: many researchers use datasets that were not created for the problem at hand, and the part of the general pipeline that collects and cleans data almost always includes aggregation, which can lead to the loss of essential examples.

###### Seif Ibrahim
When data is collected from the network, certain aggregations and statistics are run on the data early in the pipeline before it is provided to the ML model. This is because feeding raw packets into a model is impractical due to the massive storage and bandwidth overheads. Thus it is important that we find the correct statistics and aggregations to extract from the data in order to be able to train a high-fidelity model.

###### Aaron Jimenez
This problem is important because in many cases the first three steps of the model design process (data collection, data cleaning, and feature engineering) are left out of the designer's hands, even though in many ways these initial steps are critical for the performance of the trained model.

###### Brian Chen
This problem is important as, fundamentally, it does not matter how advanced a system is in terms of accuracy if such a model could never feasibly be deployed. By addressing the correlation between input and not only accuracy but also cost, the paper is making machine learning models more practical.

###### Ajit Jadhav
System-level costs for representing network traffic determine the feasibility of deploying a model in practice. So, apart from model accuracy, these costs also need to be factored into the design and implementation of ML models. This makes it an important problem to consider for the deployment of ML models in operational networks.

###### Rhys Tracy
A system or application is only good if it can actually be used. If a certain set of network features yields high accuracy but is too expensive to gather, it is useless. We have discussed this a little bit in class, but balancing cost and model accuracy (as well as model explainability) is very important when it comes to machine learning on network systems.

###### Achintya Desai
It is an important problem to solve because raw packet traces cannot always be the solution for data representation, as they are impractical. The cost of deploying an ML model over an operational network carries the most significance for the real-life feasibility of ML solutions, since a high cost of deployment makes them impractical.

###### Shereen Elsayed
The authors explained the importance of this problem by describing how network traffic measurement systems work. These systems ignore the data collection, cleaning, and feature engineering phases in model design, assuming the data is fixed/given because measuring, sampling, aggregating, and storing network traffic data is done based on system capabilities. This might negatively affect the accuracy of the model.

###### Arjun Prakash
Data collection, processing, and feature engineering play an important role in the performance of network predictions. So, it is important to identify the proper data representation by considering the system costs associated with it. If the features and the models are not properly selected, the costs associated with these tasks can get high without much gain in performance.

###### Pranjali Jain
Network management tasks that use machine learning do not allow the designer of the algorithm to determine the initial representation of the data provided to the model. These models assume that the data is fixed because data collection occurs well before modeling or prediction problems are considered. It is also important to consider the costs associated with processing, storage, and latency to evaluate whether a machine learning model can be deployed in practice. It is important to understand the relationship between the data representation for network traffic and its effect on model performance and system costs.

###### Apoorva Jakalannanavar
Often the best-performing ML models employ detailed feature representations and more complex modeling approaches to boost performance. But this doesn't necessarily translate to easier deployment of such models. Hence there is an important need to understand the tradeoff between model performance and deployment cost, so that the right feature representation and modeling approach can be chosen according to deployment needs.

###### Nawel Alioua
When it comes to applying ML to network management, the representation of the traffic is as important as the choice of the model and its performance.

###### Samridhi Maheshwari
This paper explores system-level optimisation, ignoring the optimisations that can be made at the ML pipeline development level. This is an important problem to solve because storing and collecting data is the groundwork for all machine learning problems, and if this can be solved optimally, then further optimising ML solutions will lead to better performance as a whole.

###### Navya Battula
This problem is important because comprehensive management of system costs is necessary not just while training and validating the model but also while procuring, processing, and feature-engineering the data. An overall optimization of system costs could prove extremely helpful when trying to deploy these solutions in real environments.

###### Punnal Ismail Khan
Having a highly accurate model is useless if it cannot be deployed in practice. This problem is important because it looks at both the accuracy and the system-level costs of using different representations of network traffic when evaluating a model.

###### Nagarjun Avaraddy
This problem is important because solving it would directly reduce the deployment cost of ML models that require some kind of networking data as input. There is no general approach to feature representation and data collection for networking data, due to its variability, size, and difficulty of collection.

###### Alan Roddick
System-level costs are a huge part of the feasibility of implementing a machine learning model in production. If the system-level costs are too large, this can be prohibitive to the usefulness of the model in real-world tasks.

###### Nikunj Baid
This problem is important because the initial steps of every ML problem involve data collection, data cleaning, and feature extraction to determine the representation of the data fed to the model, but not enough effort has been put into this area from a networking point of view. Also, unless the overall cost, including that of the representation, is considered, the decision about whether the model is feasible to deploy in the real world cannot be made.
###### Deept Mahendiratta
When evaluating a model, this problem is significant because it considers both the accuracy and the system-level costs of adopting multiple representations of network data. Not enough has been done in this regard earlier.

###### Shubham Talbar
Network management increasingly relies on machine learning to make predictions about performance and security from network traffic. Often, the representation of the traffic is as important as the choice of the model. The features that the model relies on, and the representation of those features, ultimately determine model accuracy, as well as where and whether the model can be deployed in practice. Thus, the design and evaluation of these models ultimately requires understanding not only model accuracy but also the system costs associated with deploying the model in an operational network.

###### Vinothini Gunasekaran
The practicality of using an ML model is tightly coupled with its system-level cost. If a model's performance is high but its deployment cost is also high, the chances of using that ML model are lower. So, understanding the correlation between them is important.

###### Satyam Awasthi
An ML model is only useful in practice if it has reasonable system-level costs. A balance between accuracy and operational cost is usually ideal for practical applications. The paper explores such a relationship when building an ML model.

#### Provide a brief description of the proposed system.

###### Liu Kurafeeva
The system provides a simple way to collect different representations of network traffic using customized data collection. This involves simultaneously collecting not only the target data but also the cost spent on that collection, so a tradeoff between model accuracy and these costs can be made. The system includes a traffic categorization module and a module that processes packets and works with profiling tools to store not only the data but its costs as well.

###### Seif Ibrahim
Traffic Refinery has a flexible feature extraction module which allows users to pick the features they want to extract from each network layer, perform aggregations, and finally evaluate the accuracy of an ML model trained on these features. The system is highly configurable and allows users to write their own feature extraction code. It also has a system cost analysis module that provides profiling tools to measure how much space and bandwidth is required to extract these features.

###### Aaron Jimenez
The system can be divided into two subsystems: data collection and feature engineering, and performance cost analysis. In the first, the user is allowed the flexibility to define what features to capture during network data collection. In addition, the system can extract new features based on the raw data packets contained within a network flow. In the second, the user is able to calculate the performance cost of the entire pipeline with respect to the final model performance, allowing more informed decisions about feature representation.

###### Brian Chen
Traffic Refinery has a three-part packet processing pipeline followed by a cost analysis stage. The packet processing pipeline consists of a traffic categorization module, a packet capture and processing module, and an aggregation and storage module. Currently, the categorization module identifies domains via IP addresses and a lookup cache.
The capture part utilizes current state-of-the-art libraries, while the processing portion is split into a flow cache and service-driven packet processing. Processing is scaled up by utilizing parallel processing and multiple worker processes. The aggregation and storage stage does essentially as implied: data is periodically extracted, transformed, and stored to a remote location. The cost analysis has a live setting that profiles live traffic across a given time interval, and an offline version that deals with recorded packets.

###### Ajit Jadhav
The proposed system, Traffic Refinery, is capable of performing passive traffic monitoring and in-network feature transformations (at rates of up to 10 Gbps) through a processing pipeline implemented for the task. A variety of common feature representations are obtainable by the pipeline through capture and real-time transformation of the features. An API is also provided to extend coverage to new representations. Traffic Refinery can also quantify the system costs of a transformation through profiling, which helps in understanding both the system costs and the accuracy of the representation.

###### Rhys Tracy
This paper proposes Traffic Refinery, a system that performs passive monitoring and in-network feature transformations. It can capture data in real time at rates of up to 10 Gbps, transform the captured data into a number of frequently used feature representations, and estimate costs. Traffic Refinery also includes an API that allows additional feature representations to be added.

###### Shereen Elsayed
They propose a systematic approach to explore the effect of network traffic data representation on model performance and the associated costs. Traffic Refinery (their proposed system) is a pipelined system that monitors traffic and transforms in-network features. The packet processing pipeline consists of three components: traffic categorization, packet capture and processing, and aggregation. Traffic categorization minimizes the overhead generated by processing the state of packets and flows that are irrelevant for computing the features of interest. Packet capture and processing collects network flow statistics and tracks their state at line rate. It has two main components: state storage (a flow cache) that stores a general data structure containing state and statistics related to a network flow, and feature extraction through service-driven packet processing.

###### Arjun Prakash
Traffic Refinery was designed to explore network data representations and evaluate the systems-related costs of these representations. It works both for data representation design, helping network operators explore the accuracy-cost tradeoffs of different data representations, and for customized data collection in production. The data representation design follows three steps: 1) a superset of features is defined for exploration, and Traffic Refinery is configured to collect all these features for a limited time period; 2) during this collection period, the system profiles the costs associated with collecting each individual feature; 3) the resulting data is analyzed for model accuracy versus traffic collection cost tradeoffs.

###### Nawel Alioua
Traffic Refinery explores different feature representations and their effect on both the performance of prediction models and collection cost. It is implemented in Go and is customizable through a JSON configuration file. The system consists of the following modules: (1) a traffic categorization module responsible for associating network traffic with applications; (2) a packet capture and processing module that collects network flow statistics and tracks their state at line rate; (3) an aggregation and storage module that queries the flow cache to obtain features and statistics about each traffic flow.

###### Navya Battula
The paper follows three guidelines for system design: 1) detect the flows and applications of interest at the earliest stage, as this removes unnecessary overhead; 2) support state-of-the-art packet processing while minimizing the entry cost of extending the feature space; 3) aggregate flow statistics at regular intervals and store them for future reference. Based on these guidelines they design their pipeline, which consists of three components: 1) a traffic categorization module responsible for associating network traffic with applications; 2) a packet capture and processing module that collects network flow statistics and tracks their state at line rate, and also implements a cache used to store flow state information; 3) an aggregation and storage module that queries the flow cache to obtain features and statistics about each traffic flow and stores higher-level features concerning the applications of interest for later processing.

###### Samridhi Maheshwari
The authors propose Traffic Refinery to explore network data representations and evaluate the systems-related costs of these representations. Traffic Refinery implements a processing pipeline that performs passive traffic monitoring and in-network feature transformations at traffic rates of up to 10 Gbps. The pipeline supports capture and real-time transformation into a variety of common feature representations for network traffic, and also exposes an API to allow for new representations.

###### Punnal Ismail Khan
The paper proposes a system called Traffic Refinery. The system has the following three modules: 1) a traffic categorization module, which associates network traffic with applications; 2) a packet capture and processing module, which collects network flow statistics at line rate and stores them using a cache; 3) an aggregation and storage module, which stores higher-level features concerning the applications of interest by querying the flow cache to obtain features and statistics about each traffic flow. The system's parameters are configurable through a JSON configuration file.

###### Achintya Desai
Traffic Refinery performs exploration for data representation design and customized data collection in production. The data representation design has three steps. In the first step, network operators define a superset of features worth exploring. In the second step, the system profiles the cost associated with collecting each individual feature. In the third step, the resulting data enables the analysis of model accuracy vs. traffic collection cost tradeoffs. The packet processing pipeline has three main components. The first is a traffic categorization module, which associates network traffic with applications. The second is a packet capture and processing module, which collects network flow statistics and tracks their state. The third is an aggregation and storage module that queries the flow cache and extracts features and statistics about every traffic flow; it stores the high-level features required by the application.
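
Several answers above describe the same pattern: a categorization step decides which feature classes a flow deserves, and a flow cache keeps per-flow extraction state. Below is a minimal Go sketch of that pattern (Go because the system is implemented in Go). All type and method names here are illustrative assumptions rather than Traffic Refinery's actual API; `PacketCounters` merely echoes a feature class name mentioned in the case study answers further down.

```go
package main

import (
	"fmt"
	"sync"
)

// FlowKey identifies a flow by the usual 5-tuple.
type FlowKey struct {
	SrcIP, DstIP     string
	SrcPort, DstPort uint16
	Proto            uint8
}

// FeatureSet is the per-service extraction hook: the categorization
// step decides which FeatureSet a flow gets, so packets belonging to
// uninteresting services pay no extraction cost.
type FeatureSet interface {
	AddPacket(payload []byte)    // update state for one packet
	Collect() map[string]float64 // export aggregated features
}

// PacketCounters is a lightweight network-layer feature class:
// just packet and byte counts, mirroring NetFlow-style counters.
type PacketCounters struct {
	packets, bytes float64
}

func (p *PacketCounters) AddPacket(payload []byte) {
	p.packets++
	p.bytes += float64(len(payload))
}

func (p *PacketCounters) Collect() map[string]float64 {
	return map[string]float64{"packets": p.packets, "bytes": p.bytes}
}

// FlowCache stores per-flow state, as in the capture/processing module.
type FlowCache struct {
	mu     sync.Mutex
	flows  map[FlowKey]FeatureSet
	newSet func() FeatureSet // chosen per service by categorization
}

func (c *FlowCache) Process(k FlowKey, payload []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	fs, ok := c.flows[k]
	if !ok {
		fs = c.newSet()
		c.flows[k] = fs
	}
	fs.AddPacket(payload)
}

func main() {
	cache := &FlowCache{
		flows:  make(map[FlowKey]FeatureSet),
		newSet: func() FeatureSet { return &PacketCounters{} },
	}
	k := FlowKey{SrcIP: "10.0.0.1", DstIP: "192.0.2.7", SrcPort: 5555, DstPort: 443, Proto: 6}
	cache.Process(k, make([]byte, 1400)) // two synthetic packets
	cache.Process(k, make([]byte, 900))
	fmt.Println(cache.flows[k].Collect()) // map[bytes:2300 packets:2]
}
```

The interface boundary is the point of the design several answers highlight: heavier feature classes (transport state, video segments) can implement the same hook, so their extra state cost is isolated per class and can be profiled in isolation.
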
###### Pranjali Jain
The proposed system, Traffic Refinery, is a proof-of-concept reference system implementation to explore network traffic representations and their system-related costs. The pipeline has three components: a traffic categorization module for associating network traffic with applications; a packet capture and processing module to collect network flow statistics, track state, and store flow state information in a cache; and an aggregation and storage module to query the flow cache, obtain features and statistics about each traffic flow, and store high-level features of the applications of interest for later. These components are used to detect flows and applications of interest early in the processing pipeline to avoid overhead, support state-of-the-art packet processing, and aggregate flow statistics periodically for future consumption.

###### Apoorva Jakalannanavar
The paper proposes a system, Traffic Refinery, that enables joint exploration of both system costs and model performance. Toward this end it provides a flexible feature extraction module to customize the data representation, and an integrated system cost analysis module to estimate how various system costs and performance are affected by the feature representations chosen. The system is implemented in Go, and the feature representations can be customized using a JSON file.

###### Nagarjun Avaraddy
Traffic Refinery is a complete data representation generation framework along with a cost analysis tool. The system allows users to specify the features to be collected; it then monitors traffic and extracts them. It also uses the cost analysis module to document the cost of extraction as well as the training and accuracy costs.

###### Alan Roddick
Traffic Refinery monitors network traffic at up to 10 Gbps and transforms this traffic in real time by performing different kinds of feature engineering. Traffic Refinery also calculates and balances the system costs of feature extraction and model training against model accuracy. The pipeline has three components: traffic categorization, packet capture and processing, and aggregation and storage.

###### Nikunj Baid
The proposed system is built in Go and has the following components:
- Traffic categorization: analyzes traffic to associate packets with the corresponding service/application, leveraging a cache that stores the mapping between remote IP addresses and the services packets belong to.
- Packet capture and processing: has two components. State storage is a partitioned flow cache where the computed state is stored; this prevents redundant processing of packets. Feature extraction first determines the flow's service and then extracts the relevant features for that service.
- Aggregation and storage: exports high-level features at regular time intervals, determined per service by the configuration file. The output is saved to a temporary file and uploaded to a remote location to be fed to the models.

###### Deept Mahendiratta
Traffic Refinery offers a configurable feature extraction module that allows users to select data from each network layer, aggregate it, and then assess the accuracy of an ML model trained on these features. It also makes use of the cost analysis module to track the expense of extraction, as well as the costs of training and accuracy.

###### Shubham Talbar
Traffic Refinery is a system designed to offer flexible, extensible network data representations. It provides the ability to assess the systems-related costs of these representations and the effects of different representations on model performance. Traffic Refinery is implemented in Go to exploit its performance and flexibility, as well as its built-in benchmarking tools. The system has three components: 1) a traffic categorization module responsible for associating network traffic with applications; 2) a packet capture and processing module that collects network flow statistics and tracks their state, and also implements a cache used to store flow state information; 3) an aggregation and storage module that queries the flow cache to obtain features and statistics about each traffic flow and stores higher-level features concerning the applications of interest for later processing.

###### Satyam Awasthi
Traffic Refinery, the proposed system, was designed to explore network traffic data representations and evaluate the costs associated with these representations. The system pipeline includes data collection and feature engineering, and operational cost analysis. During data collection, the user is given the ability to specify the features to capture. Moreover, the system can capture new features based on the raw data packets from the traffic flow. For the cost analysis, a user can calculate the performance cost of the entire pipeline and compare it to the final model. This allows the user to make informed decisions for feature selection.
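
Several answers note that collection is driven by a JSON configuration file. Here is a hedged sketch of what a Go-side configuration type for such a file could look like; the schema and field names are assumptions for illustration, not the real Traffic Refinery format. The feature class names and the ten-second interval echo the case study answers below, and the domain is just an example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ServiceConfig models one service of interest: how to recognize its
// traffic and which feature classes to compute for it. The schema and
// field names here are illustrative assumptions, not the real format.
type ServiceConfig struct {
	Name         string   `json:"name"`           // label for the service
	Domains      []string `json:"domains"`        // DNS names used for categorization
	Features     []string `json:"features"`       // feature classes to enable
	EmitEverySec int      `json:"emit_every_sec"` // aggregation/export interval
}

func main() {
	raw := []byte(`{
		"name": "video-streaming",
		"domains": ["nflxvideo.net"],
		"features": ["PacketCounters", "TCPCounters", "VideoSegment"],
		"emit_every_sec": 10
	}`)
	var cfg ServiceConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```

A config of this shape would let an operator enable or disable whole feature classes per service, which is exactly the knob the cost-performance exploration described above needs to turn.
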
#### What types of costs did the paper consider?

###### Liu Kurafeeva
State costs (in-use memory), processing costs (CPU usage per feature), storage costs (size to store; no optimization in the current version), and profiling analysis costs (costs of online or offline profiling).

###### Seif Ibrahim
They measure state cost (memory), processing cost (CPU), and storage cost (hard drive).

###### Aaron Jimenez
This paper measured the performance cost in three main areas: state cost (memory usage), processing cost (CPU usage), and storage cost, with respect to model performance metrics.

###### Brian Chen
The paper deals with hardware costs for data collection and did not consider the costs regarding the model itself. In particular, it considered the amount of in-use memory for each feature class, the CPU usage for each feature class, and the amount of storage necessary to store the features.

###### Ajit Jadhav
In-use memory (i.e., state), per-packet processing (i.e., compute), and data volume generated (i.e., storage) are the three costs considered by the paper.

###### Rhys Tracy
The paper considers in-use memory, CPU usage (the time cost to extract features), and storage use (for unoptimized JSON storage).

###### Apoorva Jakalannanavar
The paper considers the following three costs: 1) state cost, the amount of in-use memory over time for each feature class; 2) processing cost, the CPU usage for each feature class; 3) storage cost, the size of the output generated over time during the collection process.

###### Nawel Alioua
Three types of costs were considered. This information can be collected either from live traffic or from offline traces. State cost: the amount of in-use memory over time for each feature class independently, using Go's pprof profiling tool. Processing cost: CPU usage for each feature class, i.e., the amount of time required to extract the feature information from each packet, minus any operation that shares costs across all possible classes. Storage cost: observing the size of the output generated over time during the collection process. The current version of the system stores this file in JSON format without implementing any optimization of the representation of the extracted information.

###### Navya Battula
The paper makes use of three types of costs. 1) State costs: the paper collects the in-use memory over time for every feature and uses Go's pprof profiling tool to get an instant snapshot of the entire in-use memory of the system. 2) Processing costs: the paper calculates the CPU usage for each feature class by measuring the amount of time required to extract each feature's information, leaving out any operation that shares costs across all possible classes. 3) Storage costs: the paper notes that storage costs can be calculated by observing the size of the output generated over time during the collection process.

###### Shereen Elsayed
The authors worked on three types of costs:
- State cost: the in-use memory over time for each feature class independently.
- Processing cost: the CPU usage for each feature class; they monitor the amount of time required to extract the feature information from each packet.
- Storage cost: the size of the output generated over time during the collection process.

###### Arjun Prakash
The paper considers three different costs: 1) state cost, which is the amount of in-use memory over time for each feature class; 2) processing costs, associated with CPU usage for each feature class; and 3) storage costs, which are compared by observing the size of the output generated over time during the collection process.

###### Samridhi Maheshwari
The authors describe three costs: state costs (the memory in use by each feature class), processing costs (the CPU cost required to process each feature from network packets), and storage costs (the size of the feature classes over the course of the collection process).

###### Punnal Ismail Khan
State cost (memory usage to keep state), processing cost (CPU usage), and storage cost (hard drive).

###### Achintya Desai
This paper considers state costs (in terms of the amount of in-use memory), processing costs (in terms of CPU usage), and storage costs (in terms of the size of the output generated over time to be stored on the hard drive).

###### Nagarjun Avaraddy
This paper considers three types of costs: the state cost, relating to memory; the processing cost, relating to CPU; and the storage cost, relating to permanent storage.

###### Alan Roddick
The costs the paper considers are state costs (memory), processing costs (CPU), and storage costs (size of output).

###### Pranjali Jain
Traffic Refinery provides support for evaluating the system-related costs of user-defined data representations. Three cost metrics are considered: state, processing, and storage. State costs refer to the amount of in-use memory over time for each feature class, processing costs refer to the CPU usage for each feature class, and storage costs come from the size of the output generated over time during data collection.

###### Deept Mahendiratta
State costs, processing costs, and storage costs.

###### Shubham Talbar
At data representation design time, users employ the profiling method to quickly iterate through the collection of different features in isolation and provide a fair comparison across three cost metrics: state, processing, and storage.

###### Nikunj Baid
The paper considers the cost of the in-use memory (state cost), CPU (processing cost), and the storage needed for the various features.

###### Satyam Awasthi
Three costs were considered by the paper. Processing cost: CPU usage, the time cost to extract features. Storage cost: the size of the feature classes over the course of the data collection process. State cost: the in-use memory for each feature class.
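
Several answers mention that state cost is measured with Go's pprof and processing cost by timing per-packet feature extraction. Below is a rough, generic Go illustration of both kinds of measurement; it is not the paper's actual profiling code, and the stand-in extractor function is hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// measureState returns the heap currently in use, the kind of
// snapshot a pprof heap profile aggregates per allocation site.
func measureState() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse
}

// measureProcessing times one feature-extraction call, the basis for
// a per-packet CPU cost estimate (in practice averaged over many
// packets, minus work shared across all feature classes).
func measureProcessing(extract func([]byte), pkt []byte) time.Duration {
	start := time.Now()
	extract(pkt)
	return time.Since(start)
}

func main() {
	pkt := make([]byte, 1500)
	noop := func(p []byte) { _ = len(p) } // stand-in feature extractor
	fmt.Printf("heap in use: %d bytes\n", measureState())
	fmt.Printf("per-packet extraction: %v\n", measureProcessing(noop, pkt))
}
```

Storage cost needs no instrumentation of this kind: per the answers above, it is read off the size of the (JSON) output files produced over the collection period.
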
This is an example of costs no longer being worth the gains and demonstrates how cost profiling can be important. The other case study involves malware detection and analyzes size of input data and relevant layers in regards to costs. The result was that it was possible to determine that 20X20 was the cutoff point of accuracy gains that could be feasibly worthwhile. 28X28 had ballooning costs and no notable gains. Again, analyzing feature influence on costs provided useful information. ###### Ajit Jadhav Cases studies for video quality inference analysis and malware detection analysis are described in the paper. The first case study, video quality inference analysis showed that for both startup delay and resolution inferences, while using all features gave better results compared to just using network and application features, the system cost of using all features was disproportionately higher indicating the relative infeasibility of doing so in a system having limited computational resources. Similarly, in the case study for malware detection analysis, while the accuracy increases with increasing image size, beyond the size of 20x20, there is no significant accuracy gain while the memory costs continue to increase. Both these observations support the paper's argument of the importance of considering state costs for ML network model deployments. ###### Rhys Tracy The paper involves case studies for both video quality inference and malware detection. For video quality inference, some code was added to Traffic Refinery to allow capture of application layer features. 11 streaming services were analyzed using different combinations of the 3 sets of features (network, transport, and application layer data). The paper found that overall, using all 3 sets of features performed the best but had a very high cost, and using only network layer features was the fastest/lowest cost for almost everything but the worst performing. It looks like using network and application layer features was a solid intermediate with good performance but also relatively low cost for the most part. Traffic Refinery showed that accuracy improvements from adding in transport layer features might not outway the costs. For the malware detection, some code is added to allow Traffic Refinery to captured raw byte data from incoming packets and store the data in 2 different formats. The model then uses the CICIDS2017 malware dataset to analyze costs and performance. The dataset showed that as number of bytes of data collected increases, the PNG format starts to come closer to the raw data format when it comes to memory use and does about the same in processing time, but does a little better in overall storage use. They also found diminishing returns in accuracy as the size of images got larger. Unfortunately it doesn't look like they compared accuracies between raw data and PNG format. ###### Apoorva Jakalannanavar The authors explored 2 cases studies of using Traffic Refinery to prototype two common inference tasks: streaming video quality inference and malware detection. For video quality inference, they collected data from 11 different video services and performed feature selection, modeling and measured system costs for three classes of features i,e network,transport, and application features. They considered startup delay and resolution to measure QOE. They found that network features alone provide a lightweight solution but this yields the lowest model performance. 
Adding application layer features contributes to a very small memeory overhead but gives good performance boost whereas adding transport layer features provides limited benefits in terms of added performance. For malware detection they used CICIDS2017 dataset and mainly focused on deep learning based approaches and thier input data representation for experimentation. They looked at the how the data input size can beused to create the input image with raw traffic data and whether to perform online data transformations and how these choices affect model performance and system costs. They found that model performance flattens as the imagesize increases from a 20x20 to 28x28, whereas the cost in terms of memory maintained almost doubles. ###### Pranjali Jain Traffic Refinery is evaluated for two supervised learning problems in networking namely streaming video quality inference from encrypted traffic and malware detection. For the video quality inference study, the features from network, transport and application layer are considered across 11 video services. Traffic Refinery can be used to collect common features and specific features for a given inference task. In this case, state and storage requirements outpace processing requirements as traffic rates increase. Also, state cost can be significantly reduced without compromising model performance. Malware detection is a traffic classification task performed using deep neural networks like CNNs. Deep learning approaches require that the model learn the best data representation based on its input. Model designers need to consider the size of input data collected from the traffic for flow classification, layers to collect this data from and decide if data transformation should happen online or offline. For malware detection, the CICIDS 2017 dataset was used and it was observed that processing and storage costs dominate the state costs. This study also showed that a single best data representation might not exist for a given cost. The case studies indicate that fine-grained cost analysis can lead to better choice of traffic representation depending on model performance and network requirements. Also, in both cases data representation choices significantly lower system costs while preserving model performance. ###### Nawel Alioua The first case study is about video quality inference analysis. The models wre proposed by Bronzino et al. in “Inferring Streaming Video Quality from Encrypted Traffic: Practical Models and Deployment Experience.” Where the useful features fall in three classes: Network, Transport, and Application Layer features. The authors find that in terms of State costs, collecting transport features can require up to three orders of magnitude more memory compared to network and application features. Also, the application features require a median of a few hundred MB in memory on the monitored link, with a slightly larger memory footprint than network features. In terms of processing costs, no differences were noted for the different types of features, which shows that all feature classes considered for video inference are relatively lightweight in terms of processing requirements. In terms of storage costs, the same trend can be observed as for the state costs. As for the relationship between performance and representation, the authors find that network features alone can provide a lightweight solution to infer both startup delay and resolution but this yields the lowest model performance. 
Adding application features improves accuracy with little memory overhead, while adding transport features adds more memory costs with little performance improvement. The second case study is about malware detection analysis. The data collected for this task was stored in both raw bytes and in PNG files. The authors found that storing state in PNG costs double that of storing raw bytes alone. Similarly for processing costs, generating PNG images requires processing raw bytes through multiple filtering and compression stages, which result in a two orders of magnitude larger median of processing time. As for the storage costs, the authors found that compressing the data in 28x28 PNG saves up to 20% of space compared to saving in raw bytes, and does better than a 10x10 encoding. Then the authors explored the relationship between the representation and the model performance. The main takeaway is that as maintaining more information does not always lead to higher performance. ###### Shereen Elsayed The authors showed two case studies of the usage of Traffic Refinery: Streaming video quality inference and malware detection. - Video Quality Inference: This use case demonstrates how Traffic Refinery can be easily used to collect common features (e.g., flow counters collected in NetFlow) as well as extended to collect specific features useful for a given inference task. Data rep. cost for the three classes used (application features -> VideoSegment, network -> PacketCounters and transport -> TCPCounters). Their findings: "while some features add relatively little state (i.e., memory) and long-term storage costs, others require substantially more resources. Conversely, processing requirements are within the same order of magnitude for all three classes of features.". For model performance relationship with system cost, they found that the relationship between state cost and model performance is not proportional and that there you don't have to compromise the prediction performance by reducing the state-related requirements of model. For resolution inference, 10 sec. window turned out to be the worst performance. This emphasizes that the change of the window size changes how much information is used for prediction but it also affects the granularly of the inference. - Malware Detection Analysis: They found that there is one best data representation for the each cost mertics. They customized Traffic Refinery in two ways: byte copying and png copying. The used dataset is CICIDS2017 (5 days of - 8 hours - traffic sessions are 10 min). State-cost results shows that the Byte representation costs less than the PNG. Processing cost, the difference between bytes and png is not big, the flow size "greatly impacts the average packet processing time where the majority of packets of a short flow are retained for processing, whereas the majority of packets of a large flow are not retained, and thus have negligible processing cost." Storage costs the st ###### Arjun Prakash The paper presents two case studies 1) Video quality inference. Here we can see that state costs of transport features are 3 times more compared to network and application features. Storage costs follow similar trends as state costs as well. We can infer that the network features along with application features can provide a good model performance with a little memory overhead compared to just using the network features. 
Also, the ten-second windows can provide a good tradeoff between memory and prediction accuracy, achieving a minimum of 70 ms better predictions than all other time granularities. 2) Malware Detection analysis. Here the authors trained 15 models using different configurations of image sizes. From figure 9 we can see that the F1 score improves with an increase in image size and flattens after 20x20. Further increase in size only increases the cost and not the performance. ###### Navya Battula The paper proposes its approach using two use cases: Streaming video quality inference and malware detection. For both these case studies, the paper conducts three phases of data reprensentation design. 1. definition and implementation of a superset of candidate features 2. feature collection and evaluation of system costs 3. analysis of the cost-performance tradeof Video quality inference problem: In this the paper builds upon the Bronzo et. al. approach of collecting fetaures in regular time intervals of 10 secs upon which they tend to infer the start up delay in first ten seconds and then infer resolution upon next intervals. The paper modifies this by adding the 100 lines of GO code and using built in features to collect network and transport features. ###### Samridhi Maheshwari The paper includes 2 case studies -  video quality inference analysis and malware detection analysis. In the first case study, it is shown that while using all features gave better results than just using network and application features for both startup delay and resolution inferences, the system cost of using all features was disproportionately higher, indicating the toughness of doing so in a system with limited computational resources. Similarly, in the second case study, while accuracy grows with image size, there is no meaningful accuracy improvement beyond images of size 20 by 20. However, Memory costs to process and store these images continue to rise. Both of these observations support the paper's argument that state costs are critical for ML network models. ###### Punnal Ismail Khan The paper describes two case studies. Video Quality inference and Malware Detection Analysis. For Video Quality Inference All features(network, transport, and application features) were giving high accuracy but transport layer features were increasing the cost by a lot. The takeaway here is that although using all features increases accuracy but in practice, it is better to use network and application layer features as they are less costly. The gain in accuracy is not worth the extra cost when using all features. Similarly, for Malware Detection Analysis, there is no significant accuracy gain after the size of 20x20 for the image but the cost was increasing a lot. Hence it is not worth increasing the image size beyond that size. ###### Achintya Desai The paper describes streaming video quality inference task and malware detection task as two use cases. In each case, it performs the three phases of data representation design. For video quality inference task, authors extract application features by implementing feature calculation functions and network & transport features using built-in feature classes. This feature collection is performed over Netflix, Amazon Prime, Youtube, Twitch streaming services. Authors perform the data representation cost evaluation over three feature classes mentioned previously. At first, they use the profiling tools to quantify the fine-grained costs. 
Later, authors deploy the system in a 10 Gbps interconnect link to study the data collection at scale. For malware detection, authors implement transformation methods in Traffic Refinery. First one produces a copy using bytescopycounter and addpacket function. The second one implements a PNGcopycounter to produce a PNG image from raw bytes. In both the cases, the results show that the cost can be lowered while preserving the model performance. However, the dominant cost factors differ in both the cases. It also shows the use of data transformation which allows meaningful decisions to be taken at deployment, thereby affecting the cost and model performance. In the first case, authors found that the relationship between state cost and the model performance is not proportional instead they show that is possible to significantly reduce the state-related requirements of a model without compromising prediction performance. In the second case, the processing time is higher when PNG generation is integrated within the packet processing pipeline. Storage costs have the exact opposite trend for larger image configurations. The result shows that it might be preferable to perform the image conversion online when storage is a bigger concern. ###### Alan Roddick The two case studies are video quality inference analysis and malware detection analysis. For video quality inference analysis, the state costs for transport layer features require three orders of magnitude more than application and network level features due to requiring all packets in the flow instead of simple counters. The processing costs have very little impact on the overall cost, even though the transport features cost more to process than network counters. The storage cost is a similar story to state cost, with transport layer features costing more. The results show that adding transport layer features may give a small bump in performance, but the cost increases many times more. This shows that transport layer features may not be needed if low cost is required. For malware detection, the jump from 20x20 image sizes to 28x28 images sizes decreased the accuracy, but greatly increased the memory cost. In terms of CPU cost, there was a dramatic increase in cost in processing PNG images instead of the raw bytes. For storage, the type of representation had less of an impact and the size had a much larger impact. ###### Nagarjun Avaraddy This paper has two case studies described. The first one is the video QoE inference analysis. This case study showed that the cost analysis can be valuable using extensive experimentation with input feature engineering as well as the performance vs cost analysis. The feature set involved application, transport and network layer features. The tradeoff between the performance of model and cost of the data collection, processing and storage was significant for transport layer features. The second case study is related to malware detection. The memory cost analysis was done for the storage format of the image which favoured raw bytes, the processing which favored PNG and the cost analysis for permanent storage. ###### Nikunj Baid The study includes cases relevant to 1) video quality inference analysis and 2) malware detection analysis. 
###### Nikunj Baid

The study includes case studies for 1) video quality inference analysis and 2) malware detection analysis. Video quality inference analysis revealed that while using all features gave better results than using only network and application features, for both startup delay and resolution inference, the cost of doing so was not proportional to the gain in performance and is probably not feasible with limited resources. For malware detection analysis, accuracy grows with image size, but the advantage is insignificant beyond 20x20 while memory cost keeps piling up. Hence, it is evident that representation costs need to be considered when deploying a model in a real-life network setting; we can achieve similar performance with limited resources if due diligence is done in the cost analysis of the feature representation.

###### Deept Mahendiratta

The study includes case examples for video quality inference analysis and malware detection analysis. The first case study, video quality inference analysis, revealed that while using all features gave better results than using only network and application features for both startup delay and resolution inference, the system cost of using all features was disproportionately higher, making it impractical in a system with limited computational resources. Similarly, in the malware detection case study, accuracy grows with image size, but there is no meaningful gain beyond 20x20 while memory costs continue to rise. Both observations corroborate the paper's contention that state costs are critical for deploying ML network models.

###### Shubham Talbar

The authors perform two case studies using Traffic Refinery to prototype two common inference tasks: streaming video quality inference and malware detection. For each problem, they conduct the three phases of data representation design:

1. Definition and implementation of a superset of candidate features
2. Feature collection and evaluation of system costs
3. Analysis of the cost-performance tradeoffs

The first case study showed that for both startup delay and resolution inference, using all features gave better results than using only network and application features, but the system cost of using all features was disproportionately higher, indicating the relative infeasibility of doing so in a system with limited computational resources. Similarly, in the second case study, accuracy increases with image size, but beyond 20x20 there is no significant accuracy gain while memory costs continue to increase. Both observations support the paper's argument for considering state costs in machine learning network model deployments.

###### Satyam Awasthi

There are two studies that the paper describes: video quality inference and malware detection. In video quality inference, it is shown that network-layer features by themselves are not sufficient for an accurate video resolution model, but adding transport-layer features increases model accuracy significantly at a small memory cost. Adding application-layer features on top of network and transport features yields little accuracy improvement for high memory overhead, so it is not worthwhile. In malware detection, the effect of input data size and the relevant layers on performance and cost is analyzed: accuracy does increase with image size, but only up to a point (20x20), beyond which there are no significant accuracy gains even though memory costs continue to rise.
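To see why the state gap between feature classes spans orders of magnitude, a back-of-the-envelope sketch helps. The struct layouts below are assumptions for illustration, not the paper's actual data structures: counters cost a fixed few dozen bytes per flow, while transport-layer features that need every packet grow linearly with flow length.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Hypothetical per-flow state for cheap network-layer counters: a handful
// of fixed-size fields, independent of flow length.
type CounterState struct {
	Packets, Bytes, Retransmits uint64
}

// Hypothetical per-packet record kept when transport-layer features need
// every packet in the flow (timestamps, sizes, sequence numbers, ...).
type PacketRecord struct {
	TimestampNs uint64
	Size        uint32
	Seq         uint32
}

func main() {
	const flowPackets = 10000 // e.g., a few seconds of a busy video flow

	counterBytes := unsafe.Sizeof(CounterState{})
	transportBytes := flowPackets * unsafe.Sizeof(PacketRecord{})

	fmt.Printf("counter state:   %6d B per flow\n", counterBytes)
	fmt.Printf("transport state: %6d B per flow (%.0fx more)\n",
		transportBytes, float64(transportBytes)/float64(counterBytes))
}
```

With these made-up sizes, a 10,000-packet flow needs roughly 160 KB of per-packet state versus 24 B of counters, i.e. between three and four orders of magnitude, which matches the shape of the state-cost observation above.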
#### Let’s consider the QoE inference problem. Is inferring QoE for skewed data distributions (e.g., a well-provisioned campus network) simpler? Comment on model performance vs. low-cost representation tradeoffs for such skewed settings.

###### Liu Kurafeeva

Knowledge of the network structure, and the skew it brings to the data, somewhat simplifies the inference problem (any additional knowledge simplifies it), but the authors showed that transport features do not bring a significant improvement in model accuracy. Still, applying the suggested techniques and framework is not very costly in either case, and including these costs as one of the metrics for data collection is an interesting and promising idea.

###### Seif Ibrahim

Inferring QoE for a highly skewed data distribution is simpler, since high-quality, representative data can be collected from the network easily and quickly. Furthermore, skewed data has less entropy, leaving more opportunity for compression and low-cost representation while keeping model performance high.

###### Aaron Jimenez

I would expect inferring QoE for a skewed distribution to be easier, since there is less variation in network conditions than at a much larger scale (such as an ISP). The costs to generate the data representation would most likely be lower; however, the tradeoff would come in model performance on inputs that more closely match the real-world distribution.

###### Brian Chen

Presumably inferring QoE on skewed distributions is simpler. Since the data is skewed, it tends toward certain patterns, regardless of whether those patterns are representative of the actual population, and data with clear patterns should be easier to learn from. In skewed settings, however, the performance-versus-cost tradeoffs would probably not be accurate: if the tradeoffs are based on models that have an easier task, they would tend to suggest that simpler representations are more cost-effective while still maintaining accuracy. In a deployment setting, there may well be no such ideal logarithmic relationship between cost (x-axis) and performance (y-axis).

###### Ajit Jadhav

Inferring QoE for skewed data distributions could be easier, since the data skew leads to a more uniform QoE distribution. In such a case, we might get away with a relatively low-cost representation without losing significant model performance.

###### Rhys Tracy

Inferring QoE on skewed distributions should be simpler: the data is more consistent, so with less noise it is easier for a model to learn what influences QoE the most. Because QoE is easier to determine on a skewed distribution, much less costly models can likely give decent results. This also means that costs estimated this way would likely be underestimates for a real-life dataset.

###### Arjun Prakash

Inferring QoE for skewed distributions might be simpler. Since the data is skewed, it is easier for the model to learn and predict the results, so I believe it is possible to achieve high performance with a low-cost representation.
###### Apoorva Jakalannanavar

When skewed data is considered for the QoE inference problem, it is relatively easier to represent the data, and the feature representations can be made simpler since there are fewer patterns and variations in the data. Looking at model performance, however, this might not yield a well-performing, generalized model when new data representations are considered.

###### Pranjali Jain

Inferring QoE for a skewed data distribution like a campus network is possibly simpler: it is easier to collect data in such a setting and to create data representations at lower cost, since the feature space is smaller. The skewed feature representations are likely to allow training a model with high performance, as the model only needs to learn a limited number of features. In this way, skewed data leads to very low data representation costs while providing high model performance. However, such a model will not generalize to different network conditions or to arbitrary QoE metrics.

###### Nawel Alioua

Inferring QoE on skewed data could potentially be simpler, since we might need less costly features, collected from a smaller amount of data, to capture the most impactful traits of traffic that is more uniform.

###### Samridhi Maheshwari

Inferring QoE on skewed distributions might be simpler because the dataset does not vary much; with less noise, it is easier for a model to understand what factors influence quality. Because quality is easier to infer from a skewed dataset, building models that are optimized and not complex is easier as well. However, this also means that production-level costs might be underestimated during model and system development.

###### Alan Roddick

Intuitively, QoE on skewed data would fare better in terms of both performance and cost. With a dataset concentrated in a small part of the overall feature space, collecting lower-cost features while maintaining high accuracy is more feasible, because fewer features are needed to describe a small part of the distribution than the entire distribution. In this setting, striving for lower-cost features may not impact model performance as much as for a non-skewed dataset.

###### Nagarjun Avaraddy

Skewed data has a smaller feature space to model, which makes it easier for the model to fit the prediction samples, assuming they come from the same skewed distribution. Cost analysis over this feature space can be helpful, since the feature set can be reduced to the most effective features, lowering the cost of deployment.

###### Punnal Ismail Khan

Inferring QoE from highly skewed data should be easier, as we need to collect fewer features and hence incur less cost. Accuracy will also be high, as there are fewer factors to consider given the skewed nature of the data.

###### Nikunj Baid

For skewed data representations, inferring QoE might be relatively cheaper and simpler, as fewer features would be needed to distinguish the data points, so the representation would be much more compact. However, this could lead to the underestimation problem in a real-world setting.

###### Deept Mahendiratta

Because the dataset does not vary much, inferring QoE from skewed distributions may be easier; with less noise, it is easier for a model to recognize what elements influence quality.
Because a skewed dataset makes it easier to determine quality, creating models that are optimized and simple is easier as well. However, this could lead to model and system development costs being underestimated at the production level.

###### Shubham Talbar

QoE on skewed data could fare better in terms of both model performance and cost. With a skewed data distribution, it is easier to collect lower-cost features while maintaining high accuracy. For this QoE inference problem, seeking lower-cost features might not affect model performance as much as it would on a non-skewed dataset.

###### Achintya Desai

Skewed data implies that there is a simple learnable pattern present in the data, which makes the QoE inference problem simpler, although it might not be an accurate representation of the ground truth. Since the data is skewed, the feature space needed to represent it is smaller than the full feature space. This indicates that the cost of deployment can be reduced without affecting model performance by trimming the feature set.

###### Satyam Awasthi

A skewed data representation for the QoE inference problem makes it relatively simpler to represent the data and features, since there is less variation in the data. But a model trained on such a dataset will perform poorly and will not generalize when samples outside the training distribution are considered.
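One way to make the recurring intuition above concrete (skew makes inference look simpler, and cost estimates derived from it optimistic) is a trivial majority-class baseline: on a heavily skewed label distribution, always predicting the most common class already scores high, so even very cheap representations appear sufficient. The resolution shares below are made-up numbers for illustration only.

```go
package main

import "fmt"

// accuracyOfMajorityBaseline returns the label and accuracy of a predictor
// that always outputs the most common label in the given distribution.
func accuracyOfMajorityBaseline(classShare map[string]float64) (string, float64) {
	best, bestShare := "", 0.0
	for label, share := range classShare {
		if share > bestShare {
			best, bestShare = label, share
		}
	}
	return best, bestShare
}

func main() {
	// Hypothetical resolution distributions: a well-provisioned campus
	// network almost always streams at 1080p; an ISP-wide mix is more varied.
	campus := map[string]float64{"1080p": 0.90, "720p": 0.08, "480p": 0.02}
	ispWide := map[string]float64{"1080p": 0.40, "720p": 0.35, "480p": 0.25}

	for name, dist := range map[string]map[string]float64{"campus": campus, "isp": ispWide} {
		label, acc := accuracyOfMajorityBaseline(dist)
		fmt.Printf("%-6s majority baseline: always predict %s -> %.0f%% accuracy\n",
			name, label, acc*100)
	}
}
```

A learned model on the campus distribution only has to beat 90%, so a cheap representation looks good enough there; the same representation may fall well short on the broader ISP-wide mix, which is exactly the underestimation risk several responses flag.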
