arpitgupta
## nPrintML Reviews

#### Question 1 : What problem is this paper solving?

###### Fahed Abudayyeh
The paper aims to simplify the task of applying machine learning techniques to network traffic analysis problems. Feature selection and representation, model selection, and parameter tuning ultimately determine the performance of a machine learning model. These tasks require a lot of manual work; the method proposed in the paper automates them.

###### Alan Roddick
The problem this paper is trying to solve is the difficulty, in most machine learning tasks, of fine-tuning hyperparameters and selecting features. The paper introduces a packet representation combined with automated machine learning to help simplify the process of training models for networking tasks.

###### Nagarjun Avaraddy
The paper is trying to automate the process of feature extraction, data representation, and model selection, which is currently manual. It does so by proposing nPrint, a data representation format for network packets, and nPrintML, an extension of nPrint that adds AutoML to automate model selection and hyperparameter tuning.

###### Navya Battula
The paper presents approaches for automating various tasks in traffic analysis using nPrint and nPrintML (an integration of nPrint and AutoML), which makes it much easier to implement various traffic classification problems.

###### Samridhi Maheshwari
This paper introduces nPrintML, a method to automate many aspects of traffic analysis. nPrint is a tool that generates a unified packet representation that is amenable to learning and model training. The authors integrate nPrint with AutoML, which eliminates feature extraction and model tuning for a large part of traffic analysis tasks.

###### Aaron Jimenez
This paper presents a standardized representation of network packets along with a proposed pipeline to feed these nPrint representations into an ML model.
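Several answers above describe the nPrintML pipeline: fixed-size packet vectors handed to an AutoML step that picks a model and tunes its hyperparameters. As a rough, hypothetical stand-in for that step (the random data, feature layout, and candidate models below are invented for illustration; the actual tool drives an AutoML library over real nPrint output), a small scikit-learn search over two candidate models might look like:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical nPrint-style input: each row is one flow, each column one
# header bit (1/0), with -1 marking bits of fields absent from the packet.
rng = np.random.default_rng(0)
X = rng.choice([-1, 0, 1], size=(200, 64))
y = rng.integers(0, 2, size=200)  # e.g. a binary OS label per flow

# A minimal "model selection + hyperparameter tuning" search, standing in
# for the AutoML step the reviews describe.
candidates = [
    (RandomForestClassifier(random_state=0), {"n_estimators": [10, 50]}),
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
]
best = max(
    (GridSearchCV(model, grid, cv=3).fit(X, y) for model, grid in candidates),
    key=lambda search: search.best_score_,
)
print(type(best.best_estimator_).__name__, best.best_params_)
```

The winning estimator and its parameters play the role of the "automated model selection and hyperparameter tuning" that the reviews attribute to nPrintML.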
###### Apoorva Jakalannanavar
The paper highlights the need for standardization of the network traffic data format fed to machine learning models. Towards this end, the paper proposes nPrint, which generates unified packet representations. The paper also proposes nPrintML, an automated machine learning framework which takes packet representations as inputs and automatically performs feature creation, selection, and modeling.

###### Jaber Daneshamooz
Although there are many applications for machine learning over network traffic data, we do not have a standard representation of this data (packets). There is also a tedious human feature-engineering step. The paper aims to solve this problem by standardizing the data representation for a class of network analysis problems.

###### Rhys Tracy
The paper primarily aims to automate and simplify feature engineering and model selection for machine learning in networks, improving its real-world feasibility and degree of automation. To do this, the authors propose a standardized packet representation (nPrint).

###### Deept Mahendiratta
The paper is about making a standard representation of network packets, called nPrint, which can be used as input for various machine learning models. It also introduces nPrintML, an automated machine learning framework that takes packet representations as inputs and executes model development automatically.

###### Satyam Awasthi
This paper introduces nPrintML, a framework to automate the process of feature extraction, data representation, and model selection in the task of applying machine learning techniques to network traffic analysis problems.

###### Punnal Ismail Khan
It provides a standard data representation of packets for network traffic analysis problems.

###### Shereen Elsayed
This paper addresses the problem of human-driven feature engineering for ML in networking.
The paper created nPrint, which produces a unified packet representation for representation learning and model training. It helps ML models automatically detect the most effective features, without the need for manual feature extraction.

###### Seif Ibrahim
This paper tries to define a standardized packet representation for feeding into machine learning models. This would make feature extraction easier.

###### Brian Chen
The problem that this paper is solving is manual feature selection for network machine learning models. Currently, most models require careful curation of the data to extract suitable features for the model to train on. Furthermore, such features vary wildly and can rarely, if ever, be reused.

###### Liu Kurafeeva
The problem is the inconsistent presence of features across traffic packets in general. ML models accept a constant number of features, so it is hard to make them work when we cannot say whether a feature will or won't be there in the next packet.

###### Roman Beltiukov
This paper is trying to implement a unified solution for network packet representation.

###### Achintya Desai
It is well established that ML solutions are highly useful for network traffic analysis tasks. However, the performance-determining steps of ML pipelines must be done accurately and manually. In short, the paper solves the problem of manual feature selection, model selection, and parameter tuning by automating them. It proposes to make these tasks easier by releasing tools that generate a unified representation of traffic packets, which can then be used by AutoML pipelines.

###### Pranjali Jain
This paper presents nPrint and nPrintML, tools that generate a unified packet representation from raw packet inputs. This helps automate the process of feature extraction and model tuning for machine learning models, which can then be used for various traffic analysis tasks.
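Many of the reviews refer to nPrint's "unified packet representation". The core idea they describe is that every header bit becomes one feature, and bits of absent fields are filled with -1, so every packet maps to a vector of the same length with fields at fixed offsets. A toy sketch of that encoding (the helper name and the 4-byte slot width are made up for illustration; the real nPrint layout covers full IPv4/TCP/UDP/ICMP headers):

```python
def bits_with_padding(field_bytes, width_bytes):
    """Encode a header field as one bit per feature; pad missing
    trailing bytes with -1 so the field always occupies the same
    number of slots (toy version of nPrint's alignment idea)."""
    bits = []
    for i in range(width_bytes):
        if i < len(field_bytes):
            b = field_bytes[i]
            bits.extend((b >> shift) & 1 for shift in range(7, -1, -1))
        else:
            bits.extend([-1] * 8)  # field byte absent from this packet
    return bits

# A present 2-byte field padded to a 4-byte slot: 16 real bits + 16 fillers.
vec = bits_with_padding(b"\x00\x50", 4)
print(len(vec))
```

Because absent fields still occupy their slots, two packets with different option sets end up with features at identical offsets, which is the "alignment" property the reviews mention.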
###### Vinothini Gunasekaran
This paper focuses on a new representation of packets for machine learning models. The authors introduce a new tool called nPrint that generates a unified packet representation, and an AutoML system called nPrintML that reduces the manual effort needed in feature extraction and model tuning.

###### Nikunj Baid
This paper acknowledges the manual effort involved in feature selection/representation, model selection, and hyperparameter configuration, all of which are crucial to the performance of any ML solution. The paper attempts to automate these tasks and certain aspects of network analysis, making it more convenient to apply ML solutions to such problems.

###### Arjun Prakash
This paper introduces a tool that generates a unified packet representation for model training. This enables models to automatically discover features from the packets without the need for manual extraction. The representation was also integrated with AutoML, which enables automated model selection and hyperparameter tuning.

###### Shubham Talbar
The paper tries to automate several aspects of traffic analysis, such as feature selection and representation, model selection, and parameter tuning, making it easier to apply machine learning techniques to traffic analysis tasks.

###### Ajit Jadhav
The paper aims to simplify the process of applying ML to the spectrum of traffic analysis tasks by presenting nPrint and nPrintML, geared toward automating many aspects of traffic analysis.

###### Nawel Alioua
This paper automates the important tasks of feature selection and representation, model selection, and parameter tuning. It introduces nPrint, a tool that generates a unified packet representation, and integrates it with AutoML for automated machine learning, resulting in nPrintML.

#### Question 2 : Why is that problem important?
###### Fahed Abudayyeh
This is an important problem because the lack of a standard data representation for network traffic previously made feature engineering unnecessarily difficult and time-consuming. The solution in this paper allows the processing of network traffic data to be automated, yielding more accurate ML models that are built much more efficiently.

###### Nagarjun Avaraddy
The problem is important because many researchers and engineers reinvent the wheel of feature selection, data representation, and model selection/fine-tuning, which is time-consuming. A solution like this will save a lot of man-hours, and that time can be spent building solutions and products on top, leading to faster deployment.

###### Navya Battula
This is an important problem because, for any machine learning problem, the hardest part is data curation, pre-processing, formatting, and hyperparameter tuning. Most data scientists face a challenge getting data into shape before they can actually start training, and tools like nPrint and nPrintML show a lot of promise for making this step easier for researchers in the community.

###### Alan Roddick
This problem is important because researchers can spend many hours tweaking hyperparameters or engineering features and still not pick the most optimal combination. A solution allows researchers to spend more of their time getting their models into production in the real world.

###### Samridhi Maheshwari
In machine learning, manually engineering features and tuning models can omit features that may not seem important at first glance, or that have a nonlinear dependence on the output. This can reduce the performance of the ML model and force researchers to redo the model with different features every time. A standard representation of network packets that includes all possible features can help reduce such manual errors.
Also, offloading model tuning to AutoML takes away the manual work that has to be done to optimize ML models.

###### Aaron Jimenez
This problem is important because, up until this point, there has been no real standardized data representation that can be used for learning problems.

###### Apoorva Jakalannanavar
The framework takes away the painstaking process of feature engineering and hyperparameter tuning, and can easily scale the modeling process to multiple tasks. Having an automated machine learning framework can also help researchers with limited machine learning domain knowledge to quickly build a model and perform experiments.

###### Rhys Tracy
Feature engineering and model selection are time-consuming and take active human work. As the paper argues, removing these steps will not only improve the speed at which new machine learning strategies can be created, tested, and deployed on networks, but also further automate the process (which is the ultimate goal of applying ML to networks).

###### Deept Mahendiratta
This paper aims to remove the need to manually curate data, which is a very time-consuming process. Also, automating ML frameworks can help develop various ML models and solutions quickly and efficiently.

###### Jaber Daneshamooz
As discussed in class, whenever there is an overwhelming and tedious job for a human, we can think of it as a computer problem and solve it. Here, feature engineering is the overwhelming part (and it also requires a good background in the field). Also, there are many use cases for machine learning over this data, which makes the problem more important.

###### Seif Ibrahim
This problem is important because it automates the feature engineering portion of the machine learning process, so that researchers won't have to do it manually over multiple iterations to find the best set of features.

###### Satyam Awasthi
Feature engineering is a complicated and time-consuming process because there is no fixed standard for data representation.
This work provides a solution to that problem, as it seeks to automate network traffic data processing and thus achieve accurate machine learning models with less effort from researchers.

###### Brian Chen
This problem is important because solving it would not only reduce the time needed to set up models, but also create a standardized form that allows models to be compared more easily. As it stands, with different models having different input parameters, it is difficult to perform a one-to-one comparison of the effectiveness of different models. Furthermore, a standardized input format would allow models to adapt to network changes over long periods of time. Essentially, more heuristics would be removed from network machine learning.

###### Punnal Ismail Khan
It is important because it removes the time-consuming and tedious step of feature engineering from a normal machine learning workflow for classification problems.

###### Shereen Elsayed
The manual feature extraction process takes a lot of time, and if the relationships between features are complex (non-linear), it can be very hard to detect the important features. Engineers should not need to spend time engineering new features, selecting appropriate models, and tuning new parameters.

###### Liu Kurafeeva
Because the way features are present or absent in packets forces researchers to manually parse and select the features, which is time-consuming and inefficient.

###### Roman Beltiukov
A unified representation of network packets would allow spending less time repeating the same process, and would standardize the feature extraction process for any network-oriented ML research.

###### Pranjali Jain
Applying machine learning models to network analysis involves manually engineering features and selecting models and model parameters, which is a time-consuming and painstaking process.
Handcrafted features not only require specialized domain knowledge and engineering effort but can also become obsolete rather quickly. Also, every network analysis task can be different, requiring new features and a new model selection. Automating parts of this tedious task can reduce the time and engineering effort required to use machine learning for traffic analysis tasks, which is what this paper does.

###### Vinothini Gunasekaran
The data collection phase needs a lot of manual effort and is a time-consuming process. It becomes challenging for a few reasons, such as the lack of a standard packet representation, the process of selecting appropriate features, and choosing an effective machine learning model. In the current system, manual engineering is essential for feature extraction and model tuning. So the paper focuses on automating this process and reducing the need for human intervention.

###### Achintya Desai
Manual tasks such as feature engineering and model selection require specific domain knowledge as well as accurate choices. There is also a possibility that manual extraction omits features that are not immediately noticeable. Changing traffic patterns and conditions can also render these manually selected features useless. Additionally, every new network classification task requires coming up with a new feature selection and an appropriate model selection. Hence, the problem is worth solving to reduce manual effort and errors in the outcome of the ML model.

###### Nikunj Baid
This is because feature representation, feature selection, and the model they are used with drive the accuracy of the solution. Getting it all right requires tons of manual effort and configuration each time. Also, a manual process is more prone to errors, and we might end up missing certain features that are actually crucial to the given problem statement.
Since the same process is being repeated by various researchers, it makes sense to automate it as much as possible.

###### Arjun Prakash
Feature engineering and model selection are time-consuming tasks, and it's difficult to come up with a good model even for domain experts. nPrintML thus saves a lot of time for developers and helps them focus more on interpretability and deployment rather than spending time on building the model.

###### Shubham Talbar
Feature engineering and model selection are painstaking and time-consuming processes. Manual extraction of features may lead to the omission of features that are either not immediately apparent or involve complex relationships. Automating these tasks paves the way for faster iteration and deployment of machine learning algorithms for networking.

###### Ajit Jadhav
While ML is used for many network traffic analysis tasks, the components of the ML pipeline (feature selection, model selection, and parameter tuning) are extremely time-consuming, with a lot of manual work involved. Addressing this problem could lead to significant time savings for developers.

###### Nawel Alioua
The above-mentioned tasks are usually done manually, which is a "painstaking process" that requires extensive domain knowledge. In addition, the results are eventually imperfect, which can impact the model's performance.

#### Question 3 : Why do we need a standard data representation for networking-related learning problems?

###### Fahed Abudayyeh
A standard data representation for networking-related learning problems allows ML models to automatically extract important features from sets of packets, removing the need for careful and tedious data preprocessing before applying an ML algorithm.

###### Nagarjun Avaraddy
We need a standard data representation for networking-related learning problems because the process of feature extraction and representation is as important as model selection: it affects accuracy a lot.
For such an important part of the learning process, research on the best possible standard representation will help many projects avoid being held back by the manual process of feature engineering, which is hard even for domain experts.

###### Navya Battula
We need standard representations for networking data because, when researchers are trying to solve certain problems (say, traffic classification), there are a lot of things to get right before deciding on a specific approach. Choosing from a set of features to understand which would be the most suitable for our application really requires a lot of research into those features and the model. With standard representations handy, we could narrow the choice down based on our application and save time at the end of the day.

###### Alan Roddick
A standard data representation will allow the machine learning models themselves to decide which features are important for the given task. Instead of relying on experts to manually engineer the features, we can encode the packets in a way that preserves their semantics, so that the models can pick the features by assigning weights.

###### Samridhi Maheshwari
A standard data representation of network packets can serve as the building block for automating many network analysis tasks. A standard representation is amenable to many machine learning models, which require some type of structure as their input. Having a representation that is understandable and decodable to its original format gives researchers the freedom to experiment with it in many ways. It enables machine learning models to automatically discover important features from sets of packets for each distinct classification task, without the need for manual extraction of features from stateful, semantic network protocols.
###### Aaron Jimenez
It makes it easier to automate the collection of data into a format usable by ML models without much human-driven feature engineering. In addition, it stops engineers from having to reinvent the wheel for each application of ML to a network problem.

###### Apoorva Jakalannanavar
A standard packet representation reduces a lot of data cleaning and normalization effort, and also provides a uniform representation of the data that can be used for modeling. Uniform input data can help models learn better feature representations, reduces noise in the data, and leads to better modeling performance.

###### Rhys Tracy
Having a standard data representation removes the need to do any feature engineering (i.e., we already know what features our data will have as standard). A standard can also potentially allow easier deployment and better generalizability of certain ML strategies across different services, companies, and networks, since they will all use the same data representation as standard.

###### Deept Mahendiratta
It makes it easier to automate the gathering of data into a format that can be used by machine learning models without needing a lot of human intervention.

###### Satyam Awasthi
Standardizing the network packet representation can automate network analysis tasks by reducing the effort spent on data preprocessing. Also, a uniform representation allows for easier deployment, training, and generalization of ML models.

###### Brian Chen
Having a standard data representation would allow for much more automation in the development and training of network machine learning models. Furthermore, existing machine learning models tend to be tuned for standard benchmarks such as images, video, and audio, but are not as effective on network traffic, which does not easily conform to these benchmarks. A standardized data representation would also assist in formatting network data so that models can better receive it.
###### Punnal Ismail Khan
Deep learning techniques tend to be more effective with standard representations. For example, they perform really well on image classification tasks because images have a standard representation. We want to do something similar for networking-related learning problems.

###### Shereen Elsayed
A standard data representation opens an opportunity to automate model selection and hyperparameter tuning, enables fast iteration and deployment of ML, and enables the creation of complete traffic analysis pipelines.

###### Liu Kurafeeva
Different approaches to feature selection can make the work less reproducible. They also make applying the work to a new area more complicated and slow.

###### Roman Beltiukov
As most ML algorithms want a fixed feature set in a particular order, it is important to provide them this data and ensure that any preprocessing of network packets is repeatable, especially across different papers/research efforts.

###### Jaber Daneshamooz
We need it to automate the feature extraction part. By doing that, we eliminate the need for tedious feature extraction by researchers, and they can easily apply their model rather than spending a lot of time finding which features are relevant or not.

###### Vinothini Gunasekaran
Having a standard data representation will potentially lead to ways of automating some of the data collection tasks and will help reduce the manual effort in feature extraction. It may also help generalize machine learning models so they are applicable to a wide variety of problems.

###### Pranjali Jain
Standard data representations are useful because many machine learning algorithms require inputs to be of the same size and have the same feature distribution. If the inputs are of the same length and aligned uniformly, feature extraction becomes faster and easier. Standard representations also help prevent noise from being introduced into the network in the form of misaligned features.
Such a representation also makes it easier to use the same data as input to several ML models.

###### Nikunj Baid
Standardization of the network packet representation can help automate various tasks involved in traffic analysis. For instance, such a dataset can be fed to an ML model to analyze and extract the important features that should be considered, improving the overall efficiency of the model and saving the manual effort involved in doing so.

###### Arjun Prakash
A standard data representation reduces the effort of data preprocessing and helps the model learn features easily. If the data is misaligned, it might introduce noise into the model and affect its performance.

###### Achintya Desai
For each classification task, a standard representation can enable ML models to discover essential features from sets of packets without requiring manual extraction. It also allows faster iteration and practical deployment of ML algorithms for networking. Models trained on standard data representations have also been shown to achieve higher accuracy than existing tools, as in the device fingerprinting comparison with Nmap.

###### Shubham Talbar
A standard fingerprint data representation for network-related learning problems eliminates feature engineering for a wide variety of traffic analysis problems.

###### Ajit Jadhav
While there are many well-tuned ML models for standard benchmarks, they might not be suitable for a given requirement because network traffic does not have the same representation. Having a standard data representation removes the need for complex data processing and makes it easier to train ML models on the data.

###### Nawel Alioua
A standard representation would ensure:
- Completeness: include all header fields.
- Constant size: some models expect a fixed input size.
- Inherent normalization of features.
- Alignment: fields at fixed offsets.

#### Question 4: Does increasing the number of packets in the nPrint vector improve performance for different learnings? If yes, why?

###### Fahed Abudayyeh
Based on the performance comparison between nPrintML and passive OS fingerprinting, it doesn't appear that increasing the packet count in an nPrint vector has a substantial impact on the performance of ML models.

###### Jaber Daneshamooz
It does not (Table 6).

###### Nagarjun Avaraddy
Increasing the number of packets in the nPrint vector does not improve performance. This is because nPrint's data representation ensures that the major features, TTL and IPID, are important, as seen in the heatmap, and the granularity does not affect the inference much in comparison to just one sample.

###### Navya Battula
Based on Table 6 in the paper, it is apparent that adding more packets has essentially no effect on nPrint's performance. It gives approximately the same precision and recall for pipelines constructed with 1 packet, 10 packets, and 100 packets, showing that packet count is not a contributing factor to performance.

###### Seif Ibrahim
According to Table 6 in the paper, increasing the number of packets does not have a large impact on performance.

###### Alan Roddick
From the authors' experiments and results for OS fingerprinting, it looks like nPrint already managed >= 0.99 precision and recall for every operating system with one packet. Increasing the number of packets did not increase performance, but it also did not decrease it.

###### Samridhi Maheshwari
In fingerprinting problems (device, OS, and application fingerprinting), increasing the number of packets per nPrint may slightly increase performance (if at all, as shown in Table 6), since it might give the ML model more data and patterns to learn from to correctly fingerprint a device, OS, or application from its network traces.

###### Aaron Jimenez
In certain circumstances, adding more packets to the nPrint vector could be beneficial if each packet in the vector is part of a greater whole, for instance a video stream.
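The 1-, 10-, and 100-packet comparisons discussed above rely on every nPrint sample having the same size regardless of how many packets a flow actually contains; shorter flows are padded. A toy sketch of assembling such a fixed-size sample (the per-packet vector length and helper name are invented; only the -1 padding idea mirrors nPrint):

```python
def flow_to_sample(packet_vectors, n_packets, bits_per_packet):
    """Concatenate the first n_packets per-packet bit vectors and pad
    missing packets with -1 so every sample has identical length
    (toy version of building an n-packet nPrint sample)."""
    sample = []
    for i in range(n_packets):
        if i < len(packet_vectors):
            sample.extend(packet_vectors[i])
        else:
            sample.extend([-1] * bits_per_packet)  # flow shorter than n
    return sample

short_flow = [[1, 0, 1, 1], [0, 0, 1, 0]]       # 2 packets, 4 bits each
sample = flow_to_sample(short_flow, n_packets=5, bits_per_packet=4)
print(len(sample))  # always n_packets * bits_per_packet = 20
```

Raising the packet count only grows the vector; as the Table 6 answers note, the extra positions need not add discriminative information for tasks like OS fingerprinting.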
###### Apoorva Jakalannanavar
The paper details an experiment where p0f and nPrint are both run with 1-, 10-, and 100-packet samples of a device's traffic for OS classification. From the results in Table 6, we can see that even though increasing the number of packets improves performance for p0f, the same does not hold true for nPrint vectors.

###### Rhys Tracy
The results shown in Table 6 suggest that changing the number of packets in the nPrint vector does not have any noticeable effect on performance (nPrint has near-perfect precision and recall with 1, 10, and 100 packets for all OSes tested), whereas increasing the number of packets in p0f does improve performance.

###### Deept Mahendiratta
Using Table 6, it can be seen that while increasing the number of packets improves performance for p0f, this is not the case for nPrint vectors.

###### Satyam Awasthi
Table 6 shows that varying the packet count in the nPrint vector does not have a significant impact on performance. nPrint has high precision and recall for 1-100 packets for all OSes. This might be because compact packet representations are easier for ML models to read.

###### Brian Chen
Table 6 provides results for nPrintML vs. p0f. There is a marginal increase of 0.01 in recall when the number of provided packets increases from 1 to 10. Presumably, this is due to better formatting from compacting packets in nPrint vectors. Perhaps doing this frames the data in a way that is more easily processed by the models.

###### Punnal Ismail Khan
In the paper, the authors showed in Table 6 that increasing the number of packets in each nPrint does not significantly increase performance for passive OS fingerprinting tasks.

###### Shereen Elsayed
There is a very slight improvement in performance when the packet count increases. However, in Table 6, for Windows, increasing the packet count didn't show any improvement.
###### Roman Beltiukov Theoretically, for different tasks the presence or absence of certain packets (for example, the second one) could be beneficial and important for obtaining good performance. ###### Liu Kurafeeva It is proven in the paper by Table 6, but the core idea here is that more packets lead to a more representative and clear picture of the network (we might otherwise fail to include that really important packet from an underrepresented group). ###### Nikunj Baid In the context of passive OS fingerprinting, increasing the number of packets did not significantly improve the performance of the model. It already outperformed the existing solutions with just a single packet in the vector, in terms of both precision and recall. We do see a slight increase though, as more packets would mean the model has more features to choose and infer from. ###### Arjun Prakash From Table 6, related to OS fingerprinting, we can see there isn’t much performance difference between 1, 10, and 100 packets in nPrint. Precision and recall of nPrint are already >= 0.99 with just one packet. But it is possible that for other learning tasks, having more data can improve model performance. ###### Pranjali Jain Increasing the number of packets only very slightly improves the performance of nPrint for OS fingerprinting. It is able to achieve very high precision and recall with only a single packet (Table 6). However, for some examples in Table 1, the performance of nPrint improves on increasing the number of packets. In some cases, increasing the number of packets might enable better feature extraction and learning in the model, which leads to better performance. ###### Achintya Desai For passive OS fingerprinting, it can be seen from row 2 in Table 1 as well as Table 6 that the performance does not improve much with respect to an increasing number of packets. 
However, it can be seen from row 8 in Table 1 that the accuracy increases when the number of packets is increased from 10 to 25 in identifying the streaming video service via SYN packets. This shows that increasing the number of packets sometimes does improve performance and sometimes does not. This could be because, with a low number of packets, the data could be more skewed towards majority/baseline packets. In certain cases, such as OS fingerprinting, where the TTL and IPID fields are the major features for the task, the number of packets is not going to make a huge difference to the outcome. ###### Vinothini Gunasekaran As per the data provided in Table 6, increasing the number of packets does not make a considerable difference in the performance result (1, 10, and 100 packets have almost the same results on all three OSes). But for p0f, increasing the number of packets noticeably improves the performance. ###### Shubham Talbar As per one of the case studies performed in the paper, increasing the number of packets in the nPrint vector does not improve performance. Both the precision and recall for nPrint when used for passive OS fingerprinting are near perfect (>= 0.99). ###### Ajit Jadhav Based on the results of the passive OS fingerprinting experiments, we can say that increasing the number of packets in the nPrint vector does not correspond to improved performance across different learning tasks. ###### Nawel Alioua Increasing the number of packets has little impact on nPrint's performance. #### Question 5 : What is a device fingerprinting problem? Which dataset did the paper use for this problem? ###### Fahed Abudayyeh A device fingerprinting problem is one that employs a technique to distinguish network devices from each other using information from their hardware, software, or interactions with servers on the network. The information is aggregated into a unique identifier using some fingerprinting algorithm. 
This paper uses a downsampled subset of a dataset curated by Holland et al. that consists of labeled devices and fingerprints. This paper adds a new device category to the dataset for IoT devices. ###### Nagarjun Avaraddy The device fingerprinting problem is to predict the device based on some kind of information collected about it; the information can be derived from network packets. The data used in this paper is the Holland et al. dataset of Nmap output and raw packet data, with curated IoT device data added to it. ###### Samridhi Maheshwari Active fingerprinting sends traffic to a system and analyzes the responses. In active device fingerprinting, the system constantly interacts with the devices and collects the network traces, making them into a fingerprint and labeling each trace with the corresponding device name. For this paper, the authors use the dataset made by Holland et al., who used a subset of Nmap’s probes to fingerprint network device vendors at Internet scale. They curated a labeled dataset of network devices through an iterative clustering technique on SSH, Telnet, and SNMP banners. ###### Alan Roddick A device fingerprint is information that, when combined together, allows someone to determine which device is which. The information may consist of TCP information such as flags, ICMP response codes, length of packet, etc. The authors of this paper use the dataset curated by Holland et al. by downsampling it and adding a new IoT device category. ###### Navya Battula In the device fingerprinting case study, the paper compares the performance of the popular device fingerprinting tool Nmap and nPrint against the dataset curated in the initial experiments of Holland et al. on SSH, Telnet, and SNMP banners. ###### Jaber Daneshamooz It's the process of finding which device is used. Identifying these devices can be performed by looking into the network packets. Then, we assign a tag to each of these identifiable devices. The dataset of Holland et al. was used. 
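As a loose illustration of the fingerprint-matching idea in these answers (not the paper's nPrintML pipeline, which trains models via AutoML; the labels and bit patterns below are invented), a probe response can be classified by comparing its bits against stored fingerprints:

```python
# Toy fingerprint matching: labels and bit patterns are hypothetical.
fingerprints = {
    "router_a": [1, 0, 1, 1, 0, 0],  # invented probe-response bits
    "iot_cam":  [0, 1, 0, 0, 1, 1],
}

def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def classify(bits):
    """Label of the stored fingerprint closest in Hamming distance."""
    return min(fingerprints, key=lambda label: hamming(fingerprints[label], bits))

pred = classify([1, 0, 1, 1, 0, 1])  # one bit away from "router_a"
```

The ML approach in the paper generalizes this: instead of exact-match rules (as in classical Nmap signatures), a model learns which of the bit positions matter.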
###### Aaron Jimenez One of the issues with device fingerprinting in the manner described in Nmap is that it is reliant on the probes sent to devices and not any other data. The dataset used for this problem was adapted from Holland et al., which included the Nmap output and raw packet responses. ###### Apoorva Jakalannanavar Device fingerprinting is a way to combine certain attributes of a device (like which operating system it runs, or the type and version of the web browser being used) to identify it as a unique device. The paper uses a dataset of labeled network and IoT device fingerprints over various network vendors, which has inputs in both nPrint and Nmap formats. ###### Seif Ibrahim A fingerprinting problem involves identifying which device was used to send a certain packet based on its packet headers. The paper uses a dataset by Holland et al. ###### Rhys Tracy Device fingerprinting involves determining attributes of a device from its network activity. To do active device fingerprinting (determining the vendor who made a device from its network traffic), it looks like the paper used a downsampled version of a labeled network device dataset from a previous paper by Holland et al., with the addition of some extra data gathered from IoT devices. OS fingerprinting is a little different from device fingerprinting: this involves determining the OS of the device from its network activity. For OS fingerprinting the paper uses the CICIDS2017 intrusion detection evaluation dataset. ###### Deept Mahendiratta Device fingerprinting is a technique for identifying a device by combining particular traits that distinguish it. The Nmap output and raw packet labels were included in the dataset for this task, which was adapted from Holland et al. ###### Satyam Awasthi Device fingerprinting involves determining the attributes of a device from its network pattern. 
In active device fingerprinting, the system constantly interacts with the devices, collects the network traces, and labels them with the respective device names. The paper uses a downsampled version of the labeled network device dataset of Nmap probes (from Holland et al.) and expands the types of devices by adding an IoT category. ###### Brian Chen An active device fingerprinting problem is one where a model attempts to identify the particular device via its network patterns and transmitted data. The paper uses a downsampled version of the labeled network device dataset from Nmap. It also expands the types of devices by adding an IoT category. ###### Punnal Ismail Khan Device fingerprinting is a problem where we identify a device based on its network patterns. The paper used the labeled network device dataset from Holland et al. ###### Roman Beltiukov Device fingerprinting is a task of device identification via unique patterns of their network activity. The authors use their own active fingerprinting dataset, published earlier. ###### Liu Kurafeeva The paper authors created their own dataset of unique device activity traces; the discussed paper uses part of the initially created dataset. ###### Arjun Prakash Device fingerprinting is a way to identify network/IoT devices based on interaction with those devices. The authors used the dataset curated by Holland et al., downsampled it according to the requirements, and added a new device category for IoT devices. ###### Achintya Desai Device fingerprinting (active) identifies a target device based on the response it generates when interacted with. This can prevent valid services such as banks from being abused via identity theft. The paper uses the dataset used by Holland et al., which has a subset of Nmap’s probes to fingerprint network device vendors at an Internet scale. ###### Nikunj Baid Device fingerprinting involves identifying a device on the network based on the traffic flowing in/out of the device. 
The packets can be parsed to extract certain attributes that can hint at a pattern specific to a certain device type. This paper uses Holland et al.’s dataset, which is a labelled dataset of network devices. This dataset is downsampled (as the original dataset was curated for a problem at a much larger scale) and extended with certain IoT devices to test the performance across a wider range of device types. ###### Vinothini Gunasekaran Device fingerprinting is a process of identifying the source device based on the network packets. The authors use Holland et al.’s labeled dataset and add a new device category (Internet of Things devices) to the dataset. ###### Shubham Talbar Device fingerprinting is a way of uniquely identifying a remote device based on certain attributes such as the OS of the device, the web browser used, and the device’s IP address. It’s an imperfect method of identification. The paper uses the same dataset for active device fingerprinting as in Holland et al. The authors downsampled the labeled network device dataset to create a set of devices to compare the classification performance of nPrintML with Nmap. The authors also added a new device category to the dataset: Internet of Things (IoT) devices. ###### Pranjali Jain Device fingerprinting is a method to identify a device in the network using network information and data received from that device, like IP address, web browser version, etc. The paper is using the dataset curated by Holland et al. of labeled network devices. ###### Ajit Jadhav The device fingerprinting problem deals with using various collected device attributes to separate classes of network devices. The dataset made by Holland et al. was used, with the addition of curated IoT device data. ###### Shereen Elsayed The authors used the Holland et al. dataset and added their own new device category to it to include IoT devices. Device fingerprinting is used to identify a device on the network using unique patterns. 
It is unclear to me why this method is faulty. ###### Nawel Alioua Device fingerprinting is a process of identifying a device or browser by determining which technology, such as the OS, browser plugins, and other active settings, is used on the device. The authors used Holland et al.’s (paper: Classifying Network Vendors at Internet Scale) dataset and downsampled the labeled network device dataset to create a set of devices to compare the classification performance of nPrintML with Nmap. They also expand the types of devices tested, to test the adaptability of nPrintML across a larger range of device types, by adding the IoT device category. #### Question 6 : What is the application identification problem considered in the paper? Which dataset did the paper use for this problem? ###### Fahed Abudayyeh The application identification problem is the problem of identifying an application and browser that generated a DTLS handshake with nPrintML when provided with the handshake traffic. A dataset curated by MacMillan et al. is used, which is comprised of "almost 7,000 DTLS handshakes from four different services: Facebook Messenger, Discord, Google Hangouts, and Snowflake, across two browsers: Firefox and Chrome." ###### Jaber Daneshamooz ###### Nagarjun Avaraddy The application identification problem refers to identifying the application/browser which generated a DTLS handshake. The MacMillan et al. dataset, which comprises DTLS handshake data from different application/browser combinations, is used. ###### Samridhi Maheshwari The application identification problem in this paper is the identification of applications and the browsers that generated the DTLS (Datagram Transport Layer Security) protocol handshake. The dataset used in this paper is the one created by MacMillan et al. MacMillan et al. wanted to fingerprint Snowflake, a pluggable transport for Tor that is built to be indistinguishable from other WebRTC services. 
They collected 7,000 handshakes from 4 different applications (Facebook Messenger, Discord, Google Hangouts, and Snowflake) across 2 browsers (Firefox and Chrome). ###### Alan Roddick The application identification problem considered in the paper is to use DTLS handshakes in order to determine the application and browser. The authors were inspired by MacMillan et al., who were trying to fingerprint Snowflake, which uses WebRTC for Tor and is intended to be indistinguishable from other WebRTC services. The authors used the same dataset that MacMillan et al. used, which contains almost 7,000 DTLS handshakes. ###### Navya Battula In this case study the paper tries to expand on the initial research of MacMillan et al., which attempts to fingerprint Snowflake, a pluggable transport for Tor that uses WebRTC to establish browser-to-browser connections. The current work goes a step further and infers the (browser, application) pair using the given data, expanding the classes from four to seven. The paper is successful in this attempt, attaining a ROC AUC of 99.8% and an F1 score of 99.8%, which are some smashing numbers. ###### Seif Ibrahim The application identification problem is to identify a certain application or browser based on its DTLS handshake. The MacMillan et al. dataset provides a way of doing this by giving labels of which application or browser generated a certain handshake. ###### Aaron Jimenez The problem in question is about identifying a set of applications and browsers based on their DTLS handshakes using nPrintML. The dataset used was adapted from MacMillan et al. ###### Apoorva Jakalannanavar The paper considered identifying a set of applications and the browsers used to access them via their DTLS handshake traffic. The paper used the dataset from MacMillan et al., which contains almost 7,000 DTLS handshakes from four different services: Facebook Messenger, Discord, Google Hangouts, and Snowflake, across two browsers: Firefox and Chrome. 
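In the spirit of nPrint's bit-level encoding, the DTLS handshake traffic described above becomes ML-ready features bit by bit. The record header below is hand-constructed for illustration (content type 22 = handshake, version bytes 0xFE 0xFD = DTLS 1.2), not captured traffic:

```python
# Hand-constructed DTLS 1.2 record header prefix (not captured traffic):
# content type 22 = handshake, wire version 0xFE 0xFD = DTLS 1.2.
record = bytes([22, 0xFE, 0xFD])

def bits(data):
    """Flatten bytes into bit features, most significant bit first."""
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]

features = bits(record)  # 24 bit features for these 3 header bytes
```

Feeding every handshake byte through an encoding like this is what lets a classifier pick up the subtle header differences that distinguish Snowflake from the other WebRTC services.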
###### Rhys Tracy The application identification problem involves attempting to determine which application (and browser) generated a DTLS handshake from the network traffic. To do this, the paper used a dataset from a previous paper by MacMillan et al., including DTLS handshake data from 4 different applications and 2 different browsers. ###### Deept Mahendiratta Problem: identifying a group of applications and browsers depending on their DTLS handshakes. MacMillan et al. developed the dataset that was used in this study. Snowflake, a pluggable transport for Tor that is meant to be indistinguishable from other WebRTC services, was the focus of MacMillan et al.'s work. ###### Satyam Awasthi It is the identification of a group of applications and browsers that generate DTLS handshakes. The given paper uses the dataset developed by MacMillan et al., which consists of DTLS handshakes from four services across two browsers. ###### Brian Chen The specific identification problem that the paper considers is based around DTLS handshakes. The paper seeks to identify the application and browser that generated a DTLS handshake using handshake traffic. The paper uses the data curated from MacMillan et al., which consists of nearly 7,000 DTLS handshakes from four services across two browsers. ###### Punnal Ismail Khan This problem identifies what applications are used on the network according to their DTLS handshakes. The data used here is the MacMillan et al. data, which consists of DTLS handshakes. ###### Roman Beltiukov Application identification is a problem of identifying the application from a closed set of classes based on the network activity. The paper uses the dataset of DTLS handshakes by MacMillan et al. ###### Liu Kurafeeva The dataset of MacMillan et al. was used in order to identify the initial application that was the acceptor/initiator for specific traffic. ###### Nikunj Baid This problem involves identifying the application / browser that generated a DTLS handshake. 
MacMillan et al. curated a dataset by collecting almost 7,000 DTLS handshakes from four services (Facebook Messenger, Discord, Google Hangouts, and Snowflake for Tor) across two browsers (Chrome and Firefox). ###### Arjun Prakash It is a way to identify the application and browser through their DTLS handshakes. They collected almost 7,000 DTLS handshakes from four different services: Facebook Messenger, Discord, Google Hangouts, and Snowflake, across two browsers: Firefox and Chrome. ###### Shubham Talbar The authors test the ability of nPrintML to identify applications and browsers that generate a DTLS handshake when provided with the handshake traffic. For this purpose they used the dataset curated in MacMillan et al., comprising almost 7,000 DTLS handshakes from four different services (Facebook Messenger, Discord, Google Hangouts, and Snowflake) across two browsers (Firefox and Chrome). It identifies the application/browser that generates the DTLS handshake; the paper aims to identify these applications automatically. MacMillan et al. examined the feasibility of fingerprinting Snowflake, a pluggable transport for Tor that uses WebRTC to establish browser-to-browser connections, which is built to be indistinguishable from other WebRTC services. They collect almost 7,000 DTLS handshakes from four different services. ###### Achintya Desai Specifically, the problem of application identification in the paper is to identify a set of applications through their DTLS handshake traffic. A DTLS (Datagram TLS) handshake is performed to establish secure communication between client and server. The paper uses the dataset from MacMillan et al., who collected 7,000 DTLS handshakes from the Facebook Messenger, Discord, Google Hangouts, and Snowflake services as well as from the Chrome and Firefox browsers. ###### Vinothini Gunasekaran The authors focus on testing nPrintML’s ability to identify the application and browser through their DTLS handshakes, aiming to do so automatically. 
They use the data curated by MacMillan et al., which is collected from almost 7,000 DTLS handshakes for four popular applications across two web browsers. ###### Ajit Jadhav The application identification problem deals with using the DTLS handshake to identify the application/browser. The MacMillan et al. dataset was used, which has DTLS handshake data from a variety of application and browser combinations. ###### Pranjali Jain The paper uses nPrintML for application identification through DTLS handshakes. nPrintML is provided with the handshake traffic and identifies the application and browser that started the DTLS handshake. The dataset created by MacMillan et al. is used for this task, which contains around 7,000 handshakes from the Facebook Messenger, Discord, Google Hangouts, and Snowflake applications across the Firefox and Chrome browsers. ###### Shereen Elsayed DTLS application identification is using nPrintML to identify the application and browser that generated the DTLS handshake. They used the dataset that contains almost 7,000 DTLS handshakes from Facebook Messenger, Discord, Google Hangouts, etc. The dataset was initially created by MacMillan et al. ###### Nawel Alioua Distinguishing four applications (Snowflake, Facebook, Google and Discord) running in two different browsers (Firefox and Chrome). The dataset used was collected in a previous work by MacMillan et al. (paper: Evaluating Snowflake as an Indistinguishable Censorship Circumvention Tool) and consists of 7,000 DTLS handshakes. The authors further split the classification task into the specific (browser, application) pair. #### Question 7 : How will you use PINOT to curate datasets considered in this paper? ###### Nagarjun Avaraddy PINOT can be used to collect network data and create labeled datasets. We can then apply the nPrint data representation to the network data, which can lead to curation of the datasets considered in the paper, or even other datasets, based on the task we are trying to solve. 
###### Samridhi Maheshwari PINOT can be used to curate datasets on which nPrintML can be applied for various types of problems: IDS, device fingerprinting, video classification, QoE estimation, etc. Researchers can use PINOT to collect and create labeled datasets from campus networks or other customisable networks and then use nPrint to create standard representations of the data. Most of the datasets used in the machine learning + networking area are often simulated and not representative of the real world. PINOT allows us to customize how, when, and where we collect data. Using PINOT we can create more realistic datasets. ###### Aaron Jimenez PINOT can be used to curate the datasets by using its APIs to select what features to extract, in order to then obtain information such as device fingerprints or even the applications being used as labels to train with. ###### Alan Roddick PINOT can be used to curate these datasets by collecting network packets from various applications, browsers, and operating systems. After collecting a large enough sample, nPrint can be used to standardize the packets into a digestible format for ML models to train on. ###### Apoorva Jakalannanavar PINOT can be used for collecting the labeled data required for various learning tasks. The data collected using PINOT can be standardized using the nPrint format, and the nPrintML framework can be used for modeling. ###### Fahed Abudayyeh Applying nPrint standardization to labeled network data gathered by PINOT could allow experiments to be run using a variety of models that are much less painstaking to train than they'd be without nPrint. ###### Seif Ibrahim PINOT has a very general infrastructure that can be used to collect this type of dataset from the campus network. We can customize when, where, and what data we collect to generate a labeled dataset, which we can then feed into nPrint to standardize its representation. 
###### Rhys Tracy PINOT's target-agnostic APIs can be used to capture raw packet data and the necessary features; this data can then be converted to nPrint format for use as in this paper. ###### Deept Mahendiratta PINOT can be used to collect data by using its APIs, and this can later be used to train models by standardising the data using nPrint. ###### Jaber Daneshamooz PINOT collects and labels the data and then passes that data to nPrintML, which standardises the data representation. ###### Satyam Awasthi PINOT can be used to curate datasets. Using its APIs we can select the features to extract and get the fingerprints for labeling the dataset. Then, nPrint can be used to standardize the dataset. ###### Brian Chen PINOT can be used to monitor network traffic. As such, it would be possible to utilize PINOT to acquire varying information from the network. Perhaps simply running PINOT and filtering out the desired data would work. ###### Punnal Ismail Khan We can use PINOT APIs to generate similar traffic and collect the data, e.g., running applications and capturing the sent traffic for the application identification problem. ###### Roman Beltiukov For the application identification problem, PINOT could be used to run different applications and have them send traffic to each other to be captured, so that a bigger application identification dataset could be created. ###### Liu Kurafeeva PINOT can be used to collect similar datasets with the specific required features, because we can literally just save the traces of devices, or use the different applications and collect the data with the application (or its group) as the label. ###### Nikunj Baid PINOT offers a target-agnostic API that can be configured to select the features of interest that need to be collected from the network packets, tentatively labelled, which can then be transformed to nPrint format to finally generate the optimal model using nPrintML. 
###### Arjun Prakash PINOT’s API could be used to collect and organize application-specific data from the end-hosts or packet traces from different vantage points in the network. These data can then be converted into a standard representation using nPrint and passed to an ML model. ###### Shubham Talbar PINOT could provide the collection of application-specific data from end-hosts, which can be used to demonstrate the utility of nPrintML for tasks such as active device fingerprinting. ###### Achintya Desai PINOT's ability to collect packet traces from different vantage points in the network, as well as application-specific data, can be used to organize a large collection of useful data for various tasks. The data collected by PINOT can then be standardized and used to solve network problems such as malware detection for IoT traces, intrusion detection, etc. via nPrintML. This dataset would be much more valuable compared to synthesized datasets, as a dataset organized by PINOT is more likely to capture the rare network scenarios that might occur intentionally or unintentionally. ###### Ajit Jadhav PINOT can be used to collect data from networks with varying network conditions. We can then use nPrint to process the data into standardized data that can be used to train ML models. ###### Vinothini Gunasekaran We could use PINOT for the application identification problem that is described in this paper. And we could use PINOT’s API for collecting specific data and then apply nPrint to get a standardized format. ###### Pranjali Jain PINOT is an automated data collection mechanism for network traffic. It can be used for data collection in different network conditions easily. We can use PINOT to generate datasets that are application-specific, or simply packet traces, which can then be standardized using nPrint and fed to machine learning models. This pipeline enables automation of the whole process of network traffic analysis using machine learning. 
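The PINOT-then-nPrint pipeline sketched in these answers needs some glue to pair captures with labels before any feature extraction happens. A minimal sketch of that step (all file names and labels are invented for illustration):

```python
import csv
import io

# A labels file like a PINOT-style collection might produce
# (capture names and class labels are hypothetical):
labels_csv = io.StringIO(
    "capture,label\n"
    "device_a.pcap,router\n"
    "device_b.pcap,iot_camera\n"
)

# capture file -> class label, ready to pair with extracted nPrint features
dataset = {row["capture"]: row["label"] for row in csv.DictReader(labels_csv)}
```

With this mapping in hand, each capture can be run through nPrint and the resulting feature rows handed to an AutoML trainer under its label.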
###### Shereen Elsayed PINOT can be used to collect information about network traffic and packet traces. ###### Nawel Alioua We could use PINOT to gather traffic from different applications, in different network conditions. If it allows the distinction of parameters such as the browser used, the type of OS, etc., then the curated datasets can be used for tasks such as device fingerprinting and application identification. #### Question 8 : What are the limitations of nPrintML? How can we address these limitations? ###### Nagarjun Avaraddy Firstly, the idea of a standard representation and automated model selection reduces the configurability of the entire solution and the potential scope of experimentation for researchers. Secondly, as mentioned by the authors, there is still work that needs to be done when it comes to timeseries and temporal relationships between multiple network flows. ###### Samridhi Maheshwari nPrint is limited by not being very configurable, in the sense that engineers who use nPrint are limited by the one standard representation it generates and the AutoML framework that is used. It could be possible that for some problems this standard representation does not work and a different kind of input is needed. Also, since the authors use AutoGluon, it is possible that more advanced or new forms of machine learning algorithms are not yet integrated into it, which could help in some problems. Thus, having some more configurable options would be a good inclusion in nPrintML. ###### Navya Battula Some limitations that the paper addresses are dealing with automated timeseries analysis, classification involving multiple flows, and applications of nPrintML being limited only to supervised machine learning techniques. 
Also, my personal opinion on its limitations: nPrintML couldn't possibly automate all traffic classification use cases, as some applications deal with wide varieties of data, process things separately, and need elaborate data pre-processing before giving data to the model. However, given that it can automate certain traffic classification tasks, there is always scope for expansion, spanning a new area of research in itself. ###### Alan Roddick A potential limitation of nPrintML is the lack of demonstration on other classification tasks that use time series or multiple flows. Before nPrint can be a widely used standard for many classification tasks, there needs to be more experimentation. For example, IDSes that analyze packets at the aggregate level instead of the per-packet level may perform worse with the nPrint representation of packets. ###### Aaron Jimenez One potential limitation with nPrintML is the lack of granular control over the models to be used. You are not able to import custom models to train on that may serve your needs better than an AutoML library and integrate them as neatly into your pipeline. One potential solution is to add the ability to import custom saved models and run them directly from the command line. ###### Apoorva Jakalannanavar nPrintML does not provide support for automated timeseries analysis and classification involving multiple flows. Also, nPrintML uses AutoGluon, which has a far smaller set of model candidates than the multitude of candidates considered by other AutoML frameworks such as TPOT, Auto-WEKA, and auto-sklearn. ###### Seif Ibrahim One of the main limitations of nPrintML is its lack of options when it comes to choosing the type of model we are importing into its pipeline; however, this is more of a software engineering effort needed to increase adoptability. ###### Rhys Tracy I would say a primary limitation of nPrintML is that it is inflexible (which is also its main strength). 
nPrintML offers a standardized framework to do machine learning on network systems, so it requires limited knowledge and time to get it up and running. However, there are situations that may require different models or features than those supplied by nPrint and nPrintML, so this framework won't work (or won't work well) in those situations. For example, AutoML, the machine learning tool used in nPrintML, has been shown to have some difficulty with time series forecasting in some cases, so nPrintML might struggle with this. Additionally, AutoML is a black box, and the only explainability you get is the approximate relative importances of each feature, so nPrintML can have some explainability but not the greatest. These limitations could all be solved with improvements to AutoML to allow a wider range of models to perform better in more situations, and to improve explainability with more information related to the model being used. ###### Deept Mahendiratta The pipeline's lack of configurability and inability to easily integrate custom models in the pipeline are the potential drawbacks. It restricts the research to one standard format. In the future, nPrintML can be made more flexible to overcome these limitations. ###### Satyam Awasthi nPrintML lacks refined control over the ML models used, so importing custom models into the pipeline may be challenging. A solution could be to redesign the AutoML integration to allow easy use of custom models. ###### Brian Chen Limitations that the paper mentions regarding nPrintML are its inability to capture temporal relationships across multiple traffic flows and its inability to be applied to longer traffic sequences. The only solution that comes to mind would be adjusting either the conversion to be more efficient or modifying the nPrint format itself to somehow encode sequence values. Realtime encodings would be unrealistic, but nPrint could contain a counter somewhere in its format to represent sequence of arrival. 
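The arrival-counter idea in the preceding answer can be sketched as a toy extension of the per-packet encoding (not part of the real nPrint format): prepend each packet's feature vector with its position in the flow so ordering survives the flattening.

```python
def with_arrival_index(packet_vectors):
    """Prepend each per-packet feature vector with its arrival position
    (a toy extension; the real nPrint format has no such counter)."""
    return [[i] + vec for i, vec in enumerate(packet_vectors)]

# two packets with invented 3-bit feature vectors
tagged = with_arrival_index([[1, 0, 1], [0, 0, 1]])
```

Whether such an index actually helps would depend on the model; tree ensembles could split on it, but it only encodes order, not inter-arrival timing.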
###### Punnal Ismail Khan

nPrintML lacks configurability because it is a standardized representation. This can be problematic when an ML model needs other hand-crafted features. Also, the number of features might be too high for some ML tasks. Bringing some flexibility to nPrintML could help solve these problems.

###### Roman Beltiukov

nPrintML uses one standardized definition of a packet, which is not always well suited to machine learning because of the number of features it generates. The bit-level representation can cause models to overfit and introduces a curse of dimensionality on small datasets, and it does not provide explainable features (and therefore no explainable models).

###### Liu Kurafeeva

The number of features generated is questionable; a more advanced feature representation might suit ML purposes better. The AutoML options could also be better presented and explained (though that is probably a topic for another paper).

###### Nikunj Baid

Using a standardized approach for data representation and model selection makes nPrintML less configurable for research and out-of-the-box experimentation. The authors also mention that such a solution would struggle with automated time-series analysis or with classification involving multiple flows. One way to address this would be to let users choose from an explicit set of features and models, with the flexibility to incorporate additional models.

###### Shubham Talbar

nPrint is not amenable to many open problems such as automated time-series analysis and classification involving multiple flows. nPrint standardizes packet representations, removing configurability, and nPrintML automates model selection, which could also be changed to accept custom choices.

###### Ajit Jadhav

Lack of support for time-series analysis with multiple flows and restrictive model options without the ability to use custom models are some limitations of nPrintML.
A mechanism to incorporate a variety of models could help solve the model-usage problem.

###### Achintya Desai

For live traffic, nPrintML is compatible only with libpcap. It could be made compatible with optimized packet libraries such as zero-copy PF_RING, which are likely to improve nPrintML's performance, as the authors suggest. The paper focuses only on supervised learning; nPrint's standard representations could possibly be combined with unsupervised learning in the future. Another limitation is that the paper does not demonstrate nPrintML on classification tasks based on time series with multiple flows.

###### Vinothini Gunasekaran

As the authors point out, even though nPrint has demonstrated that network classification tasks can be automated, limitations remain for time-series analysis and multi-flow classification problems.

###### Pranjali Jain

nPrint and nPrintML show representative results on 8 benchmarks; it would be good to see classification results on more datasets. Tasks like automated time-series analysis and classification involving multiple flows have also not been explored in this context. In addition, because nPrintML works with AutoML, it is harder to test newer ML models with it.

###### Nawel Alioua

Several open problems remain for nPrint, such as automated time-series analysis and classification involving multiple flows.
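Several commenters raise the dimensionality concern. A quick back-of-envelope check makes it concrete: nPrint pads each protocol to its maximum header length, so even a handful of packets per sample yields thousands of bit features. The bit counts below follow the paper's IPv4/TCP/UDP defaults but should be treated as illustrative; the packets-per-sample figure is an assumption.

```python
# Back-of-envelope feature count for an nPrint-style bit representation.
IPV4_BITS = 480   # 60-byte maximum IPv4 header, bit per field position
TCP_BITS = 480    # 60-byte maximum TCP header
UDP_BITS = 64     # 8-byte UDP header

bits_per_packet = IPV4_BITS + TCP_BITS + UDP_BITS
packets_per_sample = 10   # packets kept per flow (illustrative assumption)

n_features = bits_per_packet * packets_per_sample
print(n_features)  # 10240 features from just 10 packets
```

With only a few hundred labeled flows, a 10,000-plus-dimensional input is exactly the small-dataset regime where overfitting and the curse of dimensionality that Roman and Liu mention become a real risk.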