| Name | Email |
|-------------|-------------------------|
| Boni Filippo | f.boni7@studenti.unipi.it |
| Braccini Giovanni | g.braccini8@studenti.unipi.it |
This project aims to visualize and analyze the logs generated by the [Cowrie](https://github.com/cowrie/cowrie) and [Dionaea](https://github.com/DinoTools/dionaea) honeypots and to monitor the environment in which they run. Additionally, it uses Z-score and forecasting techniques for effective anomaly detection.
# Prometheus Collectors
The Prometheus collectors use custom logic for the monitoring and retention of current attacks, malware downloads, and so on.
Additionally, the value of the Z-score is computed for each metric in real-time, exposed to Prometheus, and visualized within Grafana, allowing for an intuitive way of detecting outliers.
<div style="display: flex; justify-content: center;">
<img src="https://hackmd.io/_uploads/rJQkggP_h.png" alt="z-Score" style="width: 400px;">
</div>
In our analysis of the honeypot log data, we used the Z-score to measure the deviation of four key metrics over 30-second intervals:
1. Total number of connections
2. Number of accepted connections
3. Number of run commands
4. Number of files downloaded
By computing the Z-score for each metric, we measured how far individual data points deviated from the mean. This made it possible to identify instances with substantial deviations, highlighting distinct outlier attack techniques such as DDoS attacks, scanner activity, or automated attack scripts, and provided valuable insight into these attack methodologies.
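As a rough illustration of this computation, the sketch below shows how a per-window count could be scored against the previously observed windows and exposed as a Prometheus gauge. The metric name and the update loop are assumptions made for the example, not the collectors' actual implementation.
```
import statistics
from prometheus_client import Gauge, start_http_server

# Hypothetical metric name; the real collector defines its own metrics.
zscore_gauge = Gauge('cowrie_connections_zscore',
                     'Z-score of total connections per 30-second window')

history = []  # per-window counts collected so far

def update_zscore(current_count):
    """Score the latest 30-second count against all previously seen windows."""
    history.append(current_count)
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    zscore_gauge.set(0.0 if stdev == 0 else (current_count - mean) / stdev)

if __name__ == '__main__':
    start_http_server(8001)  # port scraped by the 'Cowrie Honeypot' job
    # In the real collector, Cowrie's logs would be parsed every 30 seconds
    # and update_zscore() called with the fresh count.
```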
Once the Prometheus server is running, Z-score values can be visualized within Grafana using a heatmap, as shown below in the *Z-score* dashboard:

The presented figure depicts the Z-score values of various metrics recorded every 30 seconds over a 12-hour period on the Cowrie honeypot. Most values cluster around 0.5 standard deviations from the mean, indicating a relatively moderate deviation from the average. However, a number of outliers can be observed, with Z-scores ranging from 2.5 up to 25, particularly in the case of executed SSH commands. No exceptionally low negative values are encountered, which is consistent with the nature of the analyzed metrics, since negative values are not possible in this context.
Overall, this visualization method proves to be a valuable tool for detecting and understanding outlier values within the dataset, contributing to a deeper understanding of the observed metrics and the underlying attack patterns.
# SAD (Simple Anomaly Detector)
## What is it?
SAD is a time series analyzer tool specifically developed to detect anomalies within the time series data. Its primary objective is to offer valuable insights and predictions derived from historical data, employing two distinct algorithms: Prophet and Holt-Winters.
In the context of honeypot data analysis, it has been deployed to monitor the number of attacks over time on the Cowrie honeypot, in an attempt to shed light on the various attack patterns and techniques.
## How does it work?
Upon launching SAD, the user is prompted to select the desired prediction methods:
1. Prophet
2. Holt-Winters
## Forecasting with Prophet
The Prophet forecasting method relies on Meta's Prophet library to facilitate accurate predictions. To train the model, data sourced from Prometheus is used.
After the training phase, the Prophet model is updated at fixed intervals (every hour) so that it stays up to date with the most recent data.
Once the model is sufficiently trained, it is capable of providing forecasts for the next 30 seconds. These forecasts serve as predictions for the expected values within this time frame.
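The sketch below illustrates this flow under a few assumptions: the samples are presumed to have already been queried from Prometheus (for example through its `/api/v1/query_range` HTTP API), synthetic data stands in for them here, and the library is imported as `prophet` (older releases ship it as `fbprophet`).
```
import pandas as pd
from prophet import Prophet

# Stand-in for (unix_timestamp, value) pairs queried from Prometheus.
samples = [(1700000000 + 30 * i, float(i % 10)) for i in range(2880)]  # one synthetic day

df = pd.DataFrame(samples, columns=['ds', 'y'])
df['ds'] = pd.to_datetime(df['ds'], unit='s')

model = Prophet()
model.fit(df)

# Forecast one 30-second step into the future.
future = model.make_future_dataframe(periods=1, freq='30s')
forecast = model.predict(future)
predicted_value = forecast['yhat'].iloc[-1]
print(predicted_value)
```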
## Forecasting with the Holt-Winters algorithm
The Holt-Winters method offers advantages over Prophet when forecasting over short periods of time: its ability to capture seasonal patterns and adapt to changing trends makes it the preferred choice.
In our specific use case, through a thorough evaluation, it was determined that the values of alpha, beta, and gamma (0.14, 0.18, and 0.16, respectively) yielded the most accurate predictions.
These values were carefully chosen by considering the presence of excessive or insufficient seasonality, adherence to the underlying trend, and adaptability to dynamic and sudden changes in the data. It is worth noting that these parameters can be adjusted within the function's code if deemed necessary, allowing for further fine-tuning to meet specific requirements.
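A minimal sketch of such a fit with `statsmodels` is shown below; the additive trend and seasonality components, the daily `seasonal_periods`, and the synthetic input series are assumptions made for illustration rather than SAD's exact code.
```
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic stand-in for two days of 30-second attack counts pulled from Prometheus.
index = pd.date_range('2023-01-01', periods=5760, freq='30s')
series = pd.Series(np.random.poisson(5, size=5760).astype(float), index=index)

# Additive components and a daily season (2880 samples of 30s) are assumptions.
model = ExponentialSmoothing(series, trend='add', seasonal='add', seasonal_periods=2880)
fit = model.fit(smoothing_level=0.14, smoothing_trend=0.18, smoothing_seasonal=0.16)

# Forecast the next 30-second step.
predicted_value = fit.forecast(1).iloc[0]
print(predicted_value)
```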
## Forecasts
Both algorithms predict what the value will be in 30 seconds.
Afterwards, a *delta* value is computed by subtracting the two and taking the absolute value.
## Training
SAD **training** is divided into **two phases**:
1. Basic
2. Advanced
Training is needed in order to collect the values that will be referred to as *delta values*.
## Delta values
A **delta value** is a measure of the absolute **difference** between a **forecasted** value and the corresponding **actual value**. It represents how much the prediction deviated from the real outcome.
This calculation starts by subtracting the forecasted value from the actual value. By taking the absolute value of this difference, we obtain the delta value. The forecast is made for a point in time 30 seconds in the future, and once the actual value becomes available, it is compared to the forecasted value to determine the delta value.
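In code, this reduces to a single absolute difference, computed once the observation for the forecasted timestamp arrives:
```
def delta_value(actual_value, forecasted_value):
    # Absolute difference between the observed value and the forecast made 30s earlier.
    return abs(actual_value - forecasted_value)

print(delta_value(42, 37.5))  # -> 4.5
```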
In order to detect anomalies, the SAD tool includes a *Basic* training phase in which it asks for a training time. This represents the duration during which the tool will collect delta values to gain insight into their distribution over time.
In the subsequent *Advanced* training phase, the tool offers the option to select a **confidence level**. This choice determines the **precision** of the anomaly detection process by estimating the number of delta values required to compute a reliable threshold value.
The determination of this **threshold** value takes into account various factors, including the mean, standard deviation, and variance of the gathered delta values. By considering these statistical measures, the tool can establish an accurate threshold for identifying anomalies with the desired level of confidence.
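As an illustration only, a common way to derive such a threshold is to take the mean of the collected deltas plus a confidence-dependent multiple of their standard deviation; the helper below follows that rule, but SAD's exact formula may differ.
```
from statistics import NormalDist, mean, pstdev

def compute_threshold(deltas, confidence=0.95):
    """Hypothetical rule: mean of the deltas plus a confidence-dependent
    multiple of their standard deviation (one-sided normal quantile)."""
    k = NormalDist().inv_cdf(confidence)  # ~1.645 for 95%
    return mean(deltas) + k * pstdev(deltas)

# Example with deltas gathered during the Basic training phase.
print(compute_threshold([0.4, 1.2, 0.8, 0.9, 1.5, 0.7]))
```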
Throughout the process, the tool provides updates on the training progress and estimates the remaining time required for each training phase to be completed.
## Detecting anomalies
In the anomaly detection process, once the tool has accumulated a sufficient number of delta values, it computes a threshold. This **threshold serves as a benchmark for identifying anomalies** and is determined by analyzing the distribution of the collected delta values.
Furthermore, SAD runs a Prometheus server on port 8003, enabling the collected delta values and threshold to be exposed as metrics.
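A minimal sketch of this exposure with `prometheus_client` is shown below; the metric names and the placeholder update loop are assumptions, not necessarily what SAD registers.
```
import random
import time
from prometheus_client import Gauge, start_http_server

# Hypothetical metric names; the ones SAD actually exposes may differ.
delta_gauge = Gauge('sad_delta', 'Absolute difference between forecast and observed value')
threshold_gauge = Gauge('sad_threshold', 'Current anomaly detection threshold')

start_http_server(8003)  # the port scraped by the 'SAD' Prometheus job

while True:
    # In SAD these would be the real delta and threshold computed every 30 seconds.
    delta_gauge.set(random.random())
    threshold_gauge.set(1.5)
    time.sleep(30)
```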
## Visualize the results
By adding the Prometheus input to Grafana, users can monitor the behavior of the delta values over time and easily identify anomalies through a graphical representation.
This integration enhances the overall user experience by providing a convenient and intuitive way to track and analyze the anomaly detection process.
Below is an example output of the script using the *Anomalies* dashboard, which compares the collected delta values with the computed threshold:

# Testing
To test the scripts without setting up the honeypots, the files in the *test* folder can be used.
The scripts also provide CLI output, so they can be run without setting up Grafana or Prometheus by following the guide below:
## Testing Cowrie's Collector
1. Install dependencies:
```
pip install prometheus_client
```
2. Extract the *test* folder
3. Run:
```
cd test
./testCowrie.sh
```
4. The script emulates Cowrie's logs in order to showcase the log parsing function. The script needs to collect several values in order to output an accurate Z-score.
The test will emulate a day of attacks (it will not use live data).
CLI example output:

## Testing Dionaea's Collector
1. Install dependencies:
```
pip install prometheus_client
```
2. Extract the *test* folder
3. Run:
```
cd test
./testDionaea.sh
```
4. The script emulates Dionaea's logs in order to showcase the parsing function and Z-score computation. The script needs to collect several values in order to output an accurate Z-score.
The test will emulate a day of attacks (it will not use live data).
CLI example output:

## Testing SAD
For showcasing purposes, SAD will access a provided active Prometheus database and show live data. In order to use custom instances, the *PROMETHEUS_ADRESS* and *COLLECTOR_ADDRESS* variables can be edited.
1. Extract the *test* folder
2. Install dependencies:
```
cd test
pip install -r requirements.txt
```
3. Run:
```
./SAD/testDionaea.sh
```
4. Input the desired configuration:
* The time range, in days, of the data that will be used to train the forecasting model
* Forecasting Algorithm:
- Facebook's Prophet
- Holt Winters Method
* Days of data to feed to the model:
- For Facebook's Prophet, the maximum is 7 days.
- For Holt Winters Method, the maximum is 3 days.
* Time in minutes to collect delta values in order to calculate the threshold.
* Whether or not to increase the threshold's accuracy based on the distribution of the collected values. This is recommended if a very short time was chosen in the previous step.
* The anomaly detection precision, as a percentage. The greater this value, the more values will be collected.
# Visualize results with Grafana and Prometheus
All scripts can be run in the background and monitored in Grafana using the *Timeseries_Dashboard_Test* dashboard.
## Requirements
* Prometheus
* [Grafana](https://grafana.com/grafana/download/8.2.4?edition=oss)
## Setup Prometheus
1. In the prometheus.yml configuration file, add the three clients:
```
  - job_name: 'Cowrie Honeypot'
    scrape_interval: 30s
    static_configs:
      - targets: ['127.0.0.1:8001']

  - job_name: 'Dionaea Honeypot'
    scrape_interval: 30s
    static_configs:
      - targets: ['127.0.0.1:8002']

  - job_name: 'SAD'
    scrape_interval: 30s
    static_configs:
      - targets: ['127.0.0.1:8003']
```
## Setup Grafana
1. Setup Prometheus as a data source in Grafana.
2. Import in Grafana the *Timeseries_Dashboard_Test* provided in this repository.
## Running Cowrie's Collector Headless
To run Cowrie's Collector in the background, run:
```
./testCowrie.sh -h
```

In Grafana, Z-score values are exposed in the *Z-score* dashboard, while the remaining metrics can be visualized in the *Timeseries_Dashboard_Test* dashboard, as shown below:
## Running Dionaea's Collector Headless
To run Dionaea's Collector in the background, run:
```
./testDionaea.sh -h
```
In Grafana, Z-score values are exposed in the *Z-score* dashboard, while the remaining metrics can be visualized in the *Timeseries_Dashboard_Test* dashboard, as shown below:

## Running SAD Headless
In order to run SAD in the background, the *SAD.config* file has to be used; it serves as a substitute for user input.
An example config can be found in the *SAD* folder.
Once the config file has been set up, the tool can be run in the background using:
```
nohup python3 SAD.py &
```
### Training data persistence
Backups of the collected delta values are automatically created when a SIGINT signal is received. Therefore, when the script is running in the background, the backup function can be triggered with:
```
kill -SIGINT "SAD-process-pid"
```
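For reference, a minimal sketch of such a signal handler in Python is shown below; the backup file name and the `delta_values` list are assumptions, not SAD's actual implementation.
```
import json
import signal
import time

delta_values = []  # stands in for the deltas SAD accumulates at runtime

def backup_deltas(signum, frame):
    # Hypothetical handler: the real backup format and location may differ.
    with open('delta_backup.json', 'w') as f:
        json.dump(delta_values, f)

signal.signal(signal.SIGINT, backup_deltas)

while True:
    time.sleep(1)  # keep the process alive so the handler can be triggered
```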