# NSL-KDD Survey
## Attack Description
### DDoS-Smurf
#### Definition
- network-layer distributed denial of service
- sends a ```slew``` of ICMP Echo request packets
- Smurf: amplification attack vector ---> exploits characteristics of broadcast networks to boost damage potential
#### Step
1. Smurf malware is used to generate a ```fake Echo request``` containing a spoofed source IP, which is actually the **target server address**.
2. The request is sent to an intermediate IP broadcast network.
3. The request is transmitted to all of the network hosts on the network.
4. Each host sends an ICMP response to the spoofed source address.
5. With enough ICMP responses forwarded, the target server is brought down.
#### Parameter
- amplification factor: proportional to the ```number of hosts``` on the intermediate network
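The amplification factor can be made concrete with a back-of-the-envelope sketch (the host count and packet size below are hypothetical, chosen only for illustration):

```python
def smurf_amplification(num_hosts: int, request_bytes: int) -> int:
    """Total reply traffic aimed at the spoofed victim, in bytes.

    In a Smurf attack, one Echo request sent to the broadcast address
    triggers one Echo reply per live host, so the reply volume scales
    linearly with the number of hosts on the intermediate network.
    """
    return num_hosts * request_bytes

# Hypothetical example: 200 live hosts, 64-byte Echo requests.
# One spoofed request is amplified into 200 * 64 = 12800 bytes of replies.
print(smurf_amplification(200, 64))  # 12800
```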
### Probing-Port Sweep
#### Definition
- a ```sweep``` is used as a host detection method
- a port scan focused on a ```specific port``` across a ```wide number of hosts```
- used by malicious users seeking out network vulnerabilities
#### Step
1. send a message to a port and listen for an answer
2. the received response indicates the port's status and can help determine a host's operating system and other information relevant to launching a future attack
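A minimal sketch of these steps, assuming a simple TCP connect probe of one port across a host list (host list and port are illustrative):

```python
import socket

def port_sweep(hosts, port, timeout=0.5):
    """Check one specific port across many hosts (a port sweep).

    Returns the subset of hosts where a TCP connection to `port`
    succeeded, i.e. the port is open.
    """
    open_hosts = []
    for host in hosts:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising
            if s.connect_ex((host, port)) == 0:
                open_hosts.append(host)
    return open_hosts

# Sweep port 22 (SSH); real sweeps target far larger host ranges.
print(port_sweep(["127.0.0.1"], 22))
```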
### Probing-IP Sweep
#### Definition
- Similar to Probing-Port Sweep, but searches for ```IP addresses``` rather than a specific port
#### Step
1. attacker sends ICMP echo requests to multiple destination addresses
2. If a target host replies to these requests, the reply reveals the target's IP address to the attacker.
3. The attacker determines which range of IP addresses maps to live hosts.
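These steps can be sketched by shelling out to the system `ping` binary (raw ICMP sockets need root privileges); the address list is illustrative, and the `-W` timeout flag assumes a Linux-style `ping`:

```python
import shutil
import subprocess

def ip_sweep(addresses, timeout_s=1):
    """Send one ICMP echo request to each address and collect the
    addresses that reply, i.e. the live hosts."""
    live = []
    for addr in addresses:
        result = subprocess.run(
            # -c 1: one echo request; -W: reply timeout (Linux flag)
            ["ping", "-c", "1", "-W", str(timeout_s), addr],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:  # host answered the echo request
            live.append(addr)
    return live

# Only attempt a real ping when the binary is available.
if shutil.which("ping"):
    print(ip_sweep(["127.0.0.1"]))
else:
    print(ip_sweep([]))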
### Probing-Nmap
#### Definition
- network mapper: a network discovery tool
- find live hosts on a network, perform port scanning
- uses IP packets to identify all the devices connected to a network
#### Step
1. Nmap checks a network for hosts and services.
2. Once found, the software platform sends probes to those hosts and services, which then respond.
## Dataset
### KDD Cup 99
#### Characteristic
- built based on the data captured in DARPA’98 IDS evaluation program
- 4 gigabytes of compressed raw (binary) tcpdump data
- 7 weeks of network traffic
- processed into about 5 million connection records, each of about 100 bytes
#### training dataset
##### description
- consists of approximately 4,900,000 single connection vectors
- each contains 41 features
- is labeled as either normal or an attack
##### Attack Categories
- Denial of Service Attack (DoS):
- attack in which the attacker makes some computing or memory resource too busy or too full to handle legitimate requests
- denies legitimate users access to a machine
- User to Root Attack (U2R):
- a class of exploit in which the attacker starts out with access to a normal user account on the system
- able to exploit some vulnerability to gain root access to the system
- Remote to Local Attack (R2L):
- attacker
- has the ability to send packets to a machine over a network
- does not have an account on that machine
- exploits some vulnerability to gain local access as a user of that machine
- Probing Attack:
- attempt to gather information about a network of computers for circumventing its security controls
#### test dataset
##### description
- not from the same probability distribution as the training data
- includes specific attack types not present in the training data, which makes the task more realistic
- 24 attack types appear in the training data, with an additional 14 types in the test data only
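The train/test label mismatch can be illustrated by comparing label sets; the sets below are a small hand-picked subset of the real labels (mailbomb and apache2 are among the test-only attack types):

```python
# Toy label sets standing in for the real files (which hold millions of records):
train_labels = {"normal", "smurf", "neptune", "ipsweep"}   # illustrative subset
test_labels  = {"normal", "smurf", "mailbomb", "apache2"}  # mailbomb/apache2: test-only

# Attack types the classifier never saw during training:
unseen = test_labels - train_labels
print(sorted(unseen))  # ['apache2', 'mailbomb']
```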
#### features
- Basic features
- encapsulates all the attributes that can be extracted from a TCP/IP connection
- Most of these features lead to an implicit delay in detection.
- Traffic features: ```time-based```
- includes features that are computed with respect to a window interval
- ```same host```: examine only the connections in the past 2 seconds that have the same destination host as the current connection, and calculate statistics related to protocol behavior, service, etc.
- ```same service```: examine only the connections in the past 2 seconds that have the same service as the current connection.
- “```same host```” and “```same service```” features are also re-calculated over a window of the last 100 connections rather than a time window of 2 seconds. These features are called **connection-based traffic features**.
- Content features
- R2L and U2R attacks don't show frequent sequential intrusion patterns, because
- the DoS and Probing attacks involve many connections to some host(s) in a very short period of time, whereas
- the R2L and U2R attacks are embedded in the data portions of the packets and normally involve only a single connection.
- To detect these kinds of attacks, we need some features to be able to look for suspicious behavior in the data portion, e.g., number of failed login attempts.
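The time-based “```same host```” traffic feature described above can be sketched in a few lines; the connection log and field names here are illustrative, not the exact KDD schema:

```python
from collections import namedtuple

Conn = namedtuple("Conn", "timestamp dst_host service")

def same_host_count(connections, current, window=2.0):
    """Time-based 'same host' traffic feature: number of connections in
    the past `window` seconds with the same destination host as the
    current connection."""
    return sum(
        1 for c in connections
        if current.timestamp - window <= c.timestamp <= current.timestamp
        and c.dst_host == current.dst_host
    )

# Illustrative connection log: (timestamp, destination host, service)
log = [
    Conn(0.0, "10.0.0.5", "http"),
    Conn(1.2, "10.0.0.5", "http"),
    Conn(1.8, "10.0.0.9", "smtp"),
    Conn(2.5, "10.0.0.5", "http"),
]
# For the last connection, only entries in [0.5, 2.5] to 10.0.0.5 count.
print(same_host_count(log, log[-1]))  # 2 (the 1.2 s entry and itself)
```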
#### Deficiencies
1. every attack packet has a time-to-live (TTL) value of 126 or 253, whereas packets of the remaining traffic mostly have a TTL of 127 or 254
- however, TTL values 126 and 253 do not occur in the attack records of the training set
2. the probability distribution of the testing set is ```different``` from that of the training set, because of the ```new attack records added``` to the testing set
- this ```skews or biases``` classification methods toward some records rather than balancing between attack types and normal observations
3. the data set is not a comprehensive representation of recently reported low-footprint attacks
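Deficiency 1 means a trivial rule keyed on TTL alone can separate attack packets in the test data, which signals a dataset artifact rather than real attack behavior; a sketch:

```python
def ttl_rule(ttl: int) -> str:
    """Degenerate 'classifier' exploiting the TTL artifact: in the KDD99
    test data, attack packets carry TTL 126 or 253, while the remaining
    traffic mostly carries 127 or 254. A model can latch onto this
    artifact instead of learning real attack behavior."""
    return "attack" if ttl in (126, 253) else "normal"

print(ttl_rule(253), ttl_rule(254))  # attack normal
```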
### NSL-KDD
#### Description
- suggested to solve some of the problems in the original KDD99 dataset
- deficiency of KDD99: huge number of redundant records ---> causes learning algorithms to be biased toward the frequent records
- otherwise the same as KDD99, but with a smaller number of records
- removes duplicate records from the training and test sets of the KDDCUP99 data set to eliminate classifiers' bias toward more repeated records
- selects a variety of records from different parts of the original KDD data set to achieve reliable results from classifier systems
- eliminates the imbalance between the number of records in the training and testing phases to decrease False Alarm Rates (FARs)
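The deduplication step can be sketched as an order-preserving filter over toy records (the labels below only stand in for full 41-feature connection vectors):

```python
def deduplicate(records):
    """Remove duplicate connection records while preserving order, the
    core of NSL-KDD's cleanup, which keeps learning algorithms from
    being biased toward heavily repeated records."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique

# Toy records: 'smurf' dominates through repetition, as in raw KDD99.
records = ["smurf", "smurf", "smurf", "normal", "smurf", "neptune"]
print(deduplicate(records))  # ['smurf', 'normal', 'neptune']
```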
#### Distribution

#### Deficiencies
- does not represent modern low-footprint attack scenarios
- few features, not realistic
### UNSW-NB15
#### Description
- NIDS: Network Intrusion Detection System
- misuse/signature-based: matches signatures of known attacks to detect intrusions
- anomaly-based: a normal profile is created from the normal behavior of the network, and any deviation from it is considered an attack
- label: ```0``` for normal records, ```1``` for abnormal/attack records
- environment: IXIA tool Testbed
- Servers 1 and 3 are configured to generate the normal spread of traffic
- Server 2 generated the abnormal/malicious activities in the network traffic
- Dataset statistic

- Dataset features

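The anomaly-based approach described above can be sketched by fitting a normal profile from benign observations and flagging large deviations; the single feature and the packet-rate numbers are hypothetical:

```python
import statistics

def fit_profile(normal_values):
    """Build a 'normal profile' from benign observations: the mean and
    standard deviation of one numeric feature (e.g. packets per second)."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def is_anomalous(value, profile, threshold=3.0):
    """Anomaly-based NIDS rule: flag any observation that deviates from
    the normal profile by more than `threshold` standard deviations."""
    mean, std = profile
    return abs(value - mean) > threshold * std

# Hypothetical benign packet rates observed from the normal servers:
profile = fit_profile([100, 105, 98, 102, 101, 99])
print(is_anomalous(103, profile), is_anomalous(500, profile))  # False True
```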
### CICIDS-2017
#### Description
- obtained from the ISCX Consortium
- label: normal traffic is labeled ```Benign``` and anomalous traffic is labeled ```Attack```
- 14 types of attacks in this dataset
- features that exist in CICIDS-2017 but are not available in NSL-KDD:
- Subflow Fwd Bytes and Total Length Fwd Packets: to detect Infiltration and Bot attack types
- Bwd Packet Length Std: to detect DDoS, DoS Hulk, DoS GoldenEye, and Heartbleed attacks
- Init Win Fwd Bytes: to detect Web-Attack, SSH-Patator, and FTP-Patator attacks
- Min Bwd Packet Length, Fwd Average Packet Length: to recognize normal traffic

#### Deficiencies
- the dataset is huge, spanning eight files that cover five days of traffic captured by the Canadian Institute for Cybersecurity
- contains many redundant records, which seem irrelevant for training any IDS
- the dataset is highly class-imbalanced in nature
- Scattered Presence
- the data of the CICIDS2017 dataset is scattered across eight files
- Huge Volume of Data
- contains data for all the possible recent attack labels in one place
- this incurs more overhead for loading and processing
- Missing Values
- 288,602 instances have a missing class label and 203 instances have missing information
- High class imbalance
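Cleaning the missing values can be sketched as a row filter; the column names below are illustrative stand-ins, not the exact CICIDS-2017 headers:

```python
def drop_incomplete(rows, label_key="Label"):
    """Drop instances with a missing class label or missing feature
    values. CICIDS-2017 ships 288,602 unlabeled instances and 203 with
    missing information, which must be cleaned before training."""
    return [
        r for r in rows
        if r.get(label_key) not in (None, "")
        and all(v is not None for v in r.values())
    ]

rows = [
    {"Flow Duration": 12,   "Label": "BENIGN"},
    {"Flow Duration": 7,    "Label": None},     # missing class label
    {"Flow Duration": None, "Label": "DDoS"},   # missing feature value
]
print(drop_incomplete(rows))  # keeps only the first row
```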

## Reference
1. Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A. A., "A Detailed Analysis of the KDD CUP 99 Data Set"
2. "Improving the Performance of Multi-class Intrusion Detection Systems using Feature Reduction"
3. Moustafa, N.; Slay, J., "UNSW-NB15: A Comprehensive Data Set for Network Intrusion Detection Systems (UNSW-NB15 Network Data Set)"
4. Kurniabudi; Stiawan, D.; Darmawijoyo; Idris, M. Y. B.; Bamhdi, A. M.; Budiarto, R., "CICIDS-2017 Dataset Feature Analysis With Information Gain for Anomaly Detection"