# File Hashing
This document proposes a new feature in Falco: the ability to perform file hashing (in particular hashing of executed binaries) and to check hashes against a list, to detect the execution or access of known malicious files.
## Motivation
Thanks to its rich system call support and its powerful state-based enrichment, Falco is a fantastic behavioral monitoring tool. In other words, it can trace application activity and report behavior that looks suspicious or plain bad.
File hashing complements such a behavioral approach and expands Falco's domain of applicability. Hash-based detection is well suited to quickly determining whether malware, viruses or malicious applications are run on the system, by checking the hash of every executed program against a malware list.
## Solution description
Traditionally, hash-based malware scanning is implemented by periodically traversing the file system, generating hashes for all of the executables and suspicious files, and then matching the hashes against a (potentially large) set of signatures. Threat intelligence providers offer both signature lists and APIs to enrich detections. This approach detects the presence of malware even if it's not executed, but it is resource intensive, noisy and not real time.
The solution proposed in this document approaches the problem from the angle of runtime security, by leveraging Falco's architecture and DNA. Specifically, hashes are computed in real time, when particular system call events (namely `execve`) are detected in the sinsp stream.
This approach has multiple benefits:
- it provides real time detection of malware and other types of attacks, with millisecond latencies. This makes it suitable to be integrated with response engines
- it reduces false positives by reporting only malware that is actually executed
- it minimizes overhead by focusing the hashing on relevant processes and files only, instead of requiring scanning the whole file system
- it integrates with Falco's filtering language for beautiful flexibility
### Changes in Falco
This proposal involves extending Falco in the following ways:
1. adding a new scap event type called `exehash`, generated when an executable is run. This event type includes the hash of the executable
2. adding the ability to load a (potentially large) set of checksums from the file system and quickly check if a hash belongs to the set
3. implementing a set of new filter checks to expose the hashing functionality
Let's look at each of these in detail.
#### 1. The exehash event
An `exehash` event is injected in the stream from the `execve` parser. This means that, when hashing is enabled, you will see an `exehash` event immediately following every `execve`. `exehash` contains the hash of the executed binary. The format is the following:
/* PPME_EXE_HASH_E */{"exehash", EC_METAEVENT, EF_SKIPPARSERESET, 3, {{"res", PT_ERRNO, PF_DEC}, {"exepath", PT_CHARBUF, PF_NA}, {"hash", PT_CHARBUF, PF_NA} } },
The event has three fields:
- **res**: a value of type `errno`. It lets the user know what the result of the hashing process is and, in case of failure, it explains the cause (e.g. file not found). If the hashing is successful, this field's value is 0.
- **exepath**: the full path of the hashed file. This is usually the path of the executed process (`proc.exepath`), but it can differ from it for scripting languages like python or for shell scripts (more on this below in this document).
- **hash**: the SHA256 checksum of the file.
#### 2. The hash files
Falco (and sysdig) have been extended to load files with the following CSV-like format:
```
6d1471d316fa6e7308034533268e7b3d430195fb5a87eacb8adb7670c9af2834 , trojan.linux/mirai
4147719e4750ccf259f7167c38fbe370463fc6e88d1d8f3fe9d73b35dbaefbe5 , trojan.linux/mirai
00ae07c9fe63b080181b8a6d59c6b3b6f9913938858829e5a42ab90fb72edf7a , miner.linux/camelot
087b20fb4b4885eecbc58e92e1bfa52fb095c4e8db735fd2ebbc500fd2c77af9 , trojan.linux/gafgyt
5f4abbfb617677fa31b44bbab236ef357e67b26e2a9bfac45788af191e92933c , trojan.linux/prometei
4b459ac044bd4635f2a019e2794918dd3ff49cf3fb2d37563fb17abb9734b405 , trojan.linux/xorddos
```
The files have two columns: the first one is the hash, while the second one is a threat intelligence provider-compatible category. One or multiple files can be loaded by Falco. Once loaded, the hash list stays in memory, so big files might have a substantial impact on Falco's memory footprint.
#### 3. New filter checks
The following new basic hashing-related fields are available. They work with `exehash` events only and they expect at least one hash file to be loaded in memory (see the previous section):
- **proc.hash.has_match**: this field is 'true' if the executable hash matches an entry in the hashes list.
- **proc.hash.category**: the threat category corresponding to the hash, e.g. trojan.linux/kinsing
These fields allow creating rules such as:
```yaml
- rule: Malware detection
  desc: hash-based malware detection
  condition: evt.type=exehash and proc.hash.has_match=true
  output: detected the execution of malware %proc.hash.category, file=%evt.arg.exepath, cmdline=%proc.cmdline
  priority: INFO
```
In addition, a handful of more experimental FD-related fields have been added. They allow performing hashing operations against file descriptors of type `file`:
- **fd.file.sha256**: for file FDs, the SHA256 checksum of the file.
- **fd.file.md5**: for file FDs, the MD5 checksum of the file.
- **fd.file.hash.has_match**: for file FDs, if malware hashes are available, this field is 'true' if the file hash matches an entry in the hashes list.
- **fd.file.hash.category**: for file FDs, if malware hashes are available, the threat category corresponding to the file hash, e.g. trojan.linux/kinsing.
With these fields you can be more creative with rule creation, for example:
```yaml
- rule: wget miner download with hash
  desc: miner file downloaded by wget, include hash information
  condition: evt.type=close and proc.name=wget and fd.name contains miner
  output: attempt to download a miner (file=%fd.name hash=%fd.file.sha256, category=%fd.file.hash.category)
  priority: WARNING
```
Note that the FD fields should be used with extreme care, and their use should be aggressively restricted through filtering, as they can cause substantial CPU overhead.
### Configuring hashing in Falco
Configuring hashing involves using the following two new fields in `falco.yaml`:
- **hash_executables** (type: boolean, default: false): needs to be set to true for `exehash` events to be generated by the engine
- **hashing_checksum_files**: the list of hash files to load
Example:
```yaml
hash_executables: true
hashing_checksum_files:
- falco_signatures1.txt
- /home/loris/falco_signatures2.txt
```
### Container support
The hashing engine has been designed to offer seamless support for containers. In particular, we use `/proc/<pid>/root` to navigate into the process file system and read the executable. This works for containers as well since, for a container, `/proc/<pid>/root` lets us access the container file system.
A problem with this approach is that, if the target process has already exited when we receive the execve event, `/proc/<pid>` will not exist any more and reading the file will be impossible. We work around this problem by navigating the parent process chain until we find an ancestor inside the same container that we can use to access the FS. This means that, unless the container is gone, we will be able to find the file.
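The ancestor walk described above can be sketched as follows. This is a simplified illustration, not the code in the libs: `proc_info`, `resolve_exe_path` and the two injected callbacks (`lookup` for the thread table, `dir_exists` for the file system check) are hypothetical names, and the hop limit is an assumption added to keep the walk bounded.

```cpp
#include <functional>
#include <string>

// Hypothetical view of a thread-table entry; field names are illustrative.
struct proc_info {
    long pid;
    long ppid;
    std::string container_id;
};

// Walk from `start` up the parent chain and return "/proc/<pid>/root" plus
// `exepath` for the first process in the same container whose root directory
// still exists. Returns an empty string when no usable process is found.
std::string resolve_exe_path(
    const proc_info& start, const std::string& exepath,
    const std::function<bool(long, proc_info&)>& lookup,
    const std::function<bool(const std::string&)>& dir_exists) {
    proc_info cur = start;
    for (int hops = 0; hops < 64; ++hops) {  // assumed bound on the walk
        // Stop as soon as we leave the container of the executed process.
        if (cur.container_id != start.container_id) break;
        const std::string root = "/proc/" + std::to_string(cur.pid) + "/root";
        if (dir_exists(root)) return root + exepath;
        proc_info parent;
        if (cur.ppid <= 0 || !lookup(cur.ppid, parent)) break;
        cur = parent;
    }
    return "";
}
```

Injecting the two callbacks keeps the sketch independent of a live `/proc`, which also makes the fallback logic easy to exercise in isolation.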
## Performance
Hashing is inherently resource intensive, especially when the files are big, so there is no way to make this feature truly low overhead. However, a lot of care has been put into making sure its impact is as low as possible. Here are some strategies that we employ to achieve good performance.
### Disabled by default
Hashing is currently disabled by default in Falco and needs to be explicitly turned on using the `hash_executables` field in `falco.yaml`:
```yaml
hash_executables: true
```
When `hash_executables` is not set or is `false`, the impact of hashing will be zero.
### Caching
The hashing engine includes a caching system to store the computed hashes after calculating them. This means that, after the execution of a program has been detected once, every subsequent execution will cost almost nothing in terms of hashing overhead.
*NOTE*: currently, cache entries don't have an expiration, so there is a chance the engine could emit the wrong hash if an executable is changed.
*NOTE*: hashes created for **fd.file.*** fields are not cached, because arbitrary files are too susceptible to change and using the cache with them would be inaccurate. One more reason to be careful with those fields.
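To make the caching behavior concrete, here is a minimal sketch of a bounded path-to-hash cache. The class name `checksum_cache` and its methods are hypothetical. The source does not describe what happens when the cache is full, so the policy used here (new paths are simply not cached once the limit is reached) is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical bounded cache keyed by file system path.
class checksum_cache {
public:
    explicit checksum_cache(std::size_t max_entries) : m_max(max_entries) {}

    // Returns true and fills `hash` when `path` has been hashed before.
    bool get(const std::string& path, std::string& hash) const {
        const auto it = m_entries.find(path);
        if (it == m_entries.end()) return false;
        hash = it->second;
        return true;
    }

    // Stores a freshly computed hash. Assumption: once the cache is full,
    // new paths are not cached; entries never expire, as noted above.
    void put(const std::string& path, const std::string& hash) {
        if (m_entries.size() >= m_max && m_entries.count(path) == 0) return;
        m_entries[path] = hash;
    }

    std::size_t size() const { return m_entries.size(); }

private:
    std::size_t m_max;
    std::unordered_map<std::string, std::string> m_entries;
};
```

Because the key is the path and entries never expire, a binary that is replaced in place would keep returning its old hash, which is exactly the limitation called out in the first note.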
### Limiting resource usage
sinsp's `settings.h` includes three compile-time parameters that can be used to limit the hashing engine's CPU and memory usage:
```cpp
// Maximum size that an executable can have to be hashed.
// If the file is bigger than this, it won't be hashed.
#define HASHING_MAX_EXE_SIZE 300 * 1024 * 1024
// Maximum time that the hashing engine can spend hashing a file.
// If the hashing takes longer than this, it will be aborted.
#define HASHING_MAX_HASHING_TIME_NS 5LL * 1000000000LL
// If this is set, the hashing engine will attempt to hash the first argument
// instead of the executable binary for executables like python, perl, bash, etc.
// Currently disabled as suggested by threat researchers.
#undef HASHING_HASH_SCRIPTS
// Maximum size of the executable checksum cache.
// Each entry in the cache has a string containing a FS path as the key, and around 40 bytes of string hash as the value.
#define MAX_CHECKSUM_CACHE_ENTRIES 1024
```
### Using a separate file for hashes
On paper, a hash feed can be implemented in Falco as one or more rule files. In such a scenario, each rule would look more or less like this:
```yaml
- rule: Single hash rule
  desc: single malware detection
  condition: evt.type=exehash and evt.arg.hash=6d1471d316fa6e7308034533268e7b3d430195fb5a87eacb8adb7670c9af2834
  output: detected the execution of the trojan.linux/mirai malware
  priority: INFO
```
Rule files could be programmatically generated from threat feeds and could be easily distributed to the Falco engines by leveraging the upcoming rules feed functionality.
The disadvantage of such an approach is that it generates a huge number of Falco rules. It is not uncommon, in fact, to have thousands, or even tens of thousands, of hashes in a threat feed. Falco is currently not equipped to handle such a big number of rules. The picture below shows how many `exehash` events per second Falco can process as a function of the number of hashes it needs to check, when every hash is a separate rule. As you can see, performance degrades very quickly and becomes unacceptable at 100 to 200 hashes.

A way to mitigate this problem is packing more hash checks in a single rule. The two pictures below show the result for the same test, when 5 hash checks are packed in a single rule, using `or` or using `in`.
The improvement is noticeable, but this is still far from being acceptable at tens of thousands of hashes.


The solution we adopted instead keeps the hashes in a separate file that Falco loads at startup, and checks all of them in a single rule using the `proc.hash.has_match` filtercheck:
```yaml
- rule: Malware detection
  desc: hash-based malware detection
  condition: evt.type=exehash and proc.hash.has_match=true
  output: detected the execution of malware %proc.hash.category, file=%evt.arg.exepath, cmdline=%proc.cmdline
  priority: INFO
```
The results, as you can see, are dramatically better and nicely support the use case of hundreds of thousands of hashes:

## Design decisions
While implementing the hashing engine, we made some potentially consequential design decisions. Here are the main ones.
### Using a new scap event (exehash)
At the architectural level, we don't really need a separate event to carry the hash of an executed program. We could, instead, implement it as a new field in the execve event, or just as a set of filter checks, both of which would be simpler in terms of code and user experience.
The advantage of a separate event is that it gives us the option to delegate the hash calculation to a separate thread. Doing that brings benefits in terms of CPU usage and, more importantly, in terms of reducing drops (calculating the hash for a big file can block the event processing pipeline for a long time and potentially cause substantial drops).
A second advantage of using a separate event type is that the hash gets stored in trace files, supporting offline use cases.
NOTE that asynchronous hash calculation is currently **NOT** implemented, so the drop issue described above is currently present. However, the design is structurally ready for a multi-thread implementation in the future.
### MD5 vs SHA256
Generating a SHA256 hash uses more or less twice the CPU compared to generating an MD5 hash for the same file. After consulting threat research teams, we decided to implement SHA256 for the `exehash` event, even if it's heavier. The reasons are:
- it's less collision prone
- it's more prevalent in threat feeds
In case we change our mind, the engine is equipped for MD5 hashing and `exehash` can easily be switched to MD5 in the future.
### Hashing interpreted languages and scripts (currently disabled)
**Note**: this feature is currently disabled. To enable it, you need to set the `HASHING_HASH_SCRIPTS` flag in `settings.h` and recompile Falco.
If the feature is enabled, it applies when the name of the executed process is one of the following:
- python, python3
- java
- ruby
- perl
- node
- sh, bash, zsh, csh, tcsh

For these processes, the `exehash` event carries the hash of the first argument instead of the hash of the executable binary file. In that case, the `exepath` argument of the `exehash` event contains the name of the file that has actually been hashed.
The reason for doing this is obvious: hashing the actual executed script is more useful than hashing `python` over and over again.
The downside is that we will miss malware that replaces one of these interpreters or shells.
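The selection logic described in this section can be sketched as below. The function name `path_to_hash` is hypothetical, and the sketch is deliberately simplified: it takes the first argument as-is, so it does not handle option flags such as `-c` or resolve relative paths, which a real implementation would need to consider.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Returns the path that would be hashed for a process called `name`, whose
// binary is `exepath` and whose arguments (excluding argv[0]) are `args`.
std::string path_to_hash(const std::string& name, const std::string& exepath,
                         const std::vector<std::string>& args) {
    // Interpreter and shell names listed in this section.
    static const std::unordered_set<std::string> interpreters = {
        "python", "python3", "java", "ruby", "perl", "node",
        "sh", "bash", "zsh", "csh", "tcsh"};
    // Simplification: the first argument is assumed to be the script path.
    if (interpreters.count(name) != 0 && !args.empty()) return args.front();
    return exepath;
}
```

For example, `path_to_hash("python3", "/usr/bin/python3", {"/tmp/x.py"})` selects `/tmp/x.py`, while an interpreter started without arguments falls back to its own binary.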
## Future work
A couple of possible future enhancements have been left in the todo list. They are described in this section.
### Add the hash to the thread info
Currently the hash of an executable is just an argument of the `exehash` event. As a future evolution, we might consider including it in the thread table entry. That would allow us to create rules that can use the hash at any time. For example:
```yaml
- rule: Malware establishing a connection
  desc: malware establishing a connection
  condition: evt.type=connect and proc.hash.has_match=true
  output: malware making a network connection!
  priority: INFO
```
This has not been implemented yet because we would like to validate that the increase in memory and CPU usage is justified by use cases that are useful in real life.
### Support for a mix of MD5 and SHA256 checksums in the hashes file
Currently, the hashes file only supports SHA256 checksums. If needed, we can expand it to support MD5 checksums as well.
Again, this can be done if real use cases arise.
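One possible way to support both checksum types in the same file is to tell them apart by length, since an MD5 digest is 32 hex characters and a SHA256 digest is 64. The sketch below illustrates that idea only; `classify_checksum` and `checksum_kind` are hypothetical names and nothing like this exists in the current implementation.

```cpp
#include <cctype>
#include <string>

enum class checksum_kind { md5, sha256, invalid };

// Classify a hex string by its length: 32 characters for MD5, 64 for SHA256.
checksum_kind classify_checksum(const std::string& s) {
    for (unsigned char c : s) {
        if (!std::isxdigit(c)) return checksum_kind::invalid;
    }
    if (s.size() == 32) return checksum_kind::md5;
    if (s.size() == 64) return checksum_kind::sha256;
    return checksum_kind::invalid;
}
```

A loader could use the result to route each entry to a per-algorithm table, at the cost of computing both digests for every hashed file.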
### Better access to files in containers with support for fuse-overlayfs
The current container FS access code (based on `/proc/<pid>/root`) can potentially have issues with fuse-overlayfs, for example failing to read files with some versions of podman. This PR has introduced a solution to the problem:
https://github.com/falcosecurity/libs/pull/677.
We should leverage it in the future.
## Show me the code
Feature implementation in the libs:
https://github.com/falcosecurity/libs/tree/hashes
Falco's patch to add hashes support in `falco.yaml`:
https://github.com/falcosecurity/falco/tree/exec-hashes
Sysdig's patch to load hashes from the chisels directory:
https://github.com/draios/sysdig/tree/exec-hashes