# DevSecOps Lab 3
Ilya Kolomin i.kolomin@innopolis.university
Kirill Ivanov k.ivanov@innopolis.university
Anatoliy Baskakov a.baskakov@innopolis.university
### Task 1 - Theory
1. What are the differences between source code scanners and binary scanners?
    * Source code scanners analyze the source code (essentially text) to find possible vulnerabilities in it. Binary scanners, on the other hand, analyze compiled binaries using disassembly and pattern recognition.
    * Binary scanners have two advantages: they can be used when the source code is unavailable, for example when a third-party library is distributed only as a binary (e.g. a `.so` file), and they can find vulnerabilities introduced by the compiler itself.
3. Explain how Abstract Syntax Trees (AST) can help to find vulnerabilities and what kinds of vulnerabilities can be found more effectively.
    * Compilers represent source programs as ASTs. An AST lets a tool analyze the structure of a program without worrying about surface syntax (whitespace, comments, formatting) and focus on the meaningful constructs. If an analysis tool knows patterns that occur in the AST of a vulnerable application, it can search for the same patterns in other applications and notify users when a match is found.
    * ASTs help find vulnerabilities that show up as recognizable patterns in program structure. Such patterns are hard to match reliably with hand-written text parsing, so reusing an existing parser for the language is easier. The approach applies to any input that can be parsed into an AST.
5. Can we consider secret detection tools as SAST tools?
    * SAST stands for Static Application Security Testing, in which application code is analyzed to discover vulnerabilities. Secret detection analyzes source code for the presence of secrets, which pose a security threat. Thus, we believe that secret detection is a part of SAST by definition.
    * If yes, what kind of rules can be used (explain algorithms)?
        * One of the simplest approaches is pattern matching on known locations. Many frameworks and tools have well-defined places where they accept tokens and other secrets, so these places can be scanned to detect whether a hardcoded value is passed.
        * Another approach is to search for secret-specific formats. For example, one can detect `-----BEGIN OPENSSH PRIVATE KEY-----` or `-----END OPENSSH PRIVATE KEY-----`.
    * If no, how can we classify secret detection tools? When should we use these tools (please use arguments)?
        * Even if we do not classify secret detection tools as SAST, they should be used nonetheless. Otherwise, anyone with access to the repository, even without the right to view the secrets, could extract them and use them with malicious intent.
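
The two rule types described above (known locations and secret-specific formats) can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the list of secret-looking variable names, the minimum value length, and the function name `find_secrets` are hypothetical and far simpler than what real secret scanners use.

```python
import re

# Format-specific rule: PEM-style private key headers.
PRIVATE_KEY_RE = re.compile(r"-----BEGIN (?:OPENSSH|RSA|EC|DSA) PRIVATE KEY-----")

# Location-specific rule: a string literal assigned to a secret-looking name.
# The name list and the 8-character minimum are illustrative assumptions.
ASSIGNMENT_RE = re.compile(
    r"""(?i)\b(\w*(?:password|secret|token|api_key)\w*)\s*[:=]\s*["']([^"']{8,})["']"""
)


def find_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line number, reason) pairs for lines that look like secrets."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if PRIVATE_KEY_RE.search(line):
            findings.append((lineno, "private key header"))
        match = ASSIGNMENT_RE.search(line)
        if match:
            findings.append((lineno, f"hardcoded value assigned to {match.group(1)}"))
    return findings


sample = '''\
-----BEGIN OPENSSH PRIVATE KEY-----
db_password = "hunter2hunter2"
greeting = "hello world"
'''

for lineno, reason in find_secrets(sample):
    print(f"line {lineno}: {reason}")
```

The third line of the sample is not reported because `greeting` does not look like a secret name, which shows how location-based rules keep false positives down compared to flagging every string literal.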
### Task 2 - Gitlab SAST with Semgrep
#### Configuring Semgrep to CI
We used a standalone Semgrep CI job (without the Semgrep App).
Here is our `.gitlab-ci.yml`:
```yaml
semgrep:
  # A Docker image with Semgrep installed.
  image: returntocorp/semgrep
  rules:
    # Scan changed files in MRs (diff-aware scanning):
    - if: $CI_MERGE_REQUEST_IID
    # Scan all files on the default branch and report any findings:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  variables:
    # Add the rules that Semgrep uses by setting the SEMGREP_RULES environment variable.
    SEMGREP_RULES: p/owasp-top-ten # See more rules at semgrep.dev/explore.
    # Uncomment SEMGREP_TIMEOUT to set this job's timeout (in seconds):
    # Default timeout is 1800 seconds (30 minutes).
    # Set to 0 to disable the timeout.
    # SEMGREP_TIMEOUT: 300
    # Upload findings to GitLab SAST Dashboard
    SEMGREP_GITLAB_JSON: "1"
  script: semgrep ci --gitlab-sast > gl-sast-report.json || true
  artifacts:
    reports:
      sast: gl-sast-report.json
```
We ran this job on the full code base on the master branch, and here is the result:

A snippet from the report artifact:
```json
{
"$schema": "https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/master/dist/sast-report-format.json",
"version": "14.1.2",
"vulnerabilities": [
{
"category": "sast",
"confidence": "High",
"cve": "vulnerabilities/csp/source/jsonp.php:8da894fa6f3d1fc255201e52cbf766cc930e6c66e729d33b207f4c1b17131fee:php.lang.security.injection.echoed-request.echoed-request",
"id": "a932ab9b-cd21-8f45-92fe-ac0bdebd4e68",
"identifiers": [
{
"name": "Semgrep - php.lang.security.injection.echoed-request.echoed-request",
"type": "semgrep_type",
"url": "https://semgrep.dev/r/php.lang.security.injection.echoed-request",
"value": "php.lang.security.injection.echoed-request.echoed-request"
}
],
"location": {
"end_line": 12,
"file": "vulnerabilities/csp/source/jsonp.php",
"start_line": 12
},
"message": "`Echo`ing user input risks cross-site scripting vulnerability. You should use `htmlentities()` when showing data to users.",
"scanner": {
"id": "semgrep",
"name": "Semgrep",
"vendor": {
"name": "Semgrep"
}
},
"severity": "High"
},
...
]
}
```
We got the report in JSON format and wrote a small Python script to parse it. The results are the following:
* 25 vulnerabilities were found.
* All of them were found with high confidence.
* 10 have high severity and 15 have medium severity.
* All of them are in PHP code.
* 24 of them are related to unsafe usage of user input: SQLi, SSRF, or XSS. The remaining one is about exposing sensitive environment information with the `phpinfo` function.

These findings show that this application is indeed "Damn Vulnerable", since it contains many vulnerabilities that even static analysis can find.
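
For illustration, counts like the ones above could be computed as follows. This is a minimal sketch and not our actual script: the function name `summarize` is hypothetical, and it relies only on the fields visible in the report snippet (`confidence`, `severity`, `location.file`).

```python
from collections import Counter
from pathlib import PurePosixPath


def summarize(report: dict) -> dict:
    """Aggregate a GitLab SAST report by confidence, severity and file extension."""
    vulns = report.get("vulnerabilities", [])
    return {
        "total": len(vulns),
        "confidence": Counter(v["confidence"] for v in vulns),
        "severity": Counter(v["severity"] for v in vulns),
        "extension": Counter(
            PurePosixPath(v["location"]["file"]).suffix for v in vulns
        ),
    }


# A one-entry stand-in for the real report; only the fields used above are kept.
sample_report = {
    "vulnerabilities": [
        {
            "confidence": "High",
            "severity": "High",
            "location": {"file": "vulnerabilities/csp/source/jsonp.php"},
        }
    ]
}

summary = summarize(sample_report)
print(summary["total"], dict(summary["severity"]), dict(summary["extension"]))
```

To run it on the real artifact, the dictionary would be loaded with `json.load` from `gl-sast-report.json` instead of being written inline.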
* Static analyzers
    * Perform source code analysis without executing the code. They use various techniques to automatically detect common security-related programming mistakes and notify developers so that they can fix them.
    * They must balance the number of false positives and false negatives. With too many false negatives, many problems go unnoticed. With too many false positives, developers stop reading the oversized reports, which defeats their purpose.
    * Their advantages
        * over manual code review: such tools scan all the code they support, do not miss known mistakes due to inattentiveness, and do their job very quickly.
        * over dynamic analyzers: they find the sources of problems rather than their symptoms.
    * Nevertheless, these techniques are not mutually exclusive and should be used together to find as many security issues as possible.
* Static analyzer features
    * Pattern Matching
        * Matches common patterns in a grep-like manner.
    * Abstract Syntax Tree
        * Finds more complex problems independently of surface syntax (described in detail in Task 1 of this report).
    * Data-Flow Analysis
        * Analyzes the flow of input data to find where it ends up and with what modifications. For example, if user input is saved in `a` and `b` is assigned a substring of it (`b = a[3..10]`), then both values are user-controlled, and saving either of them or returning it in a response can lead to reflected or stored XSS. This technique is usually called "taint analysis".
    * Inter-procedural Data-Flow Analysis
        * Same as data-flow analysis, but across multiple procedures, hence slower.
* Post analyzers
    * Do not perform code analysis. Instead, they enhance the reports of static analyzers with additional information.
    * For example, they can add Common Weakness Enumeration (CWE) identifiers to the report as well as check findings for false positives.
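
The taint analysis idea from the data-flow item can be sketched on top of Python's `ast` module. This is a deliberately small illustration with assumptions of our own: `input` is the only taint source, `eval` and `exec` stand in for dangerous sinks (instead of an HTTP response), and only top-level straight-line code is handled, so there are no branches, loops, or function calls across procedures.

```python
import ast

SOURCES = {"input"}        # calls whose result is user-controlled (assumption)
SINKS = {"eval", "exec"}   # calls that must not receive tainted data (assumption)


def _call_name(node):
    """Return the function name for simple calls like f(...), else None."""
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        return node.func.id
    return None


def _is_tainted(expr, tainted):
    """An expression is tainted if it reads a tainted name or calls a source."""
    for node in ast.walk(expr):
        if isinstance(node, ast.Name) and node.id in tainted:
            return True
        if _call_name(node) in SOURCES:
            return True
    return False


def find_tainted_sinks(source):
    """Return (line, sink name) pairs where tainted data reaches a sink."""
    tainted, findings = set(), []
    for stmt in ast.parse(source).body:  # top-level statements, in order
        if isinstance(stmt, ast.Assign):
            targets = [t.id for t in stmt.targets if isinstance(t, ast.Name)]
            if _is_tainted(stmt.value, tainted):
                tainted.update(targets)             # taint propagates
            else:
                tainted.difference_update(targets)  # overwritten with clean data
        for node in ast.walk(stmt):
            if _call_name(node) in SINKS and any(
                _is_tainted(arg, tainted) for arg in node.args
            ):
                findings.append((node.lineno, _call_name(node)))
    return findings


program = '''\
a = input()
b = a[3:10]
c = "constant"
eval(c)
eval(b)
'''

print(find_tainted_sinks(program))  # reports the eval call on line 5 only
```

The slice `a[3:10]` keeps the taint because it still reads `a`, while `eval(c)` is not reported because `c` never received user input. An inter-procedural analysis would additionally have to follow taint through function parameters and return values.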
### Task 3 - Analysis with Semgrep
We ran *semgrep* on the two case files with the recommended Semgrep Registry rules.

* In the first file no vulnerabilities were found. However, the */semper* endpoint may still be vulnerable. It sets a cookie that is sent over plain HTTP (`secure=True` would restrict it to HTTPS) and is readable from JavaScript (`httponly` is not set to `True`). Although this is the default configuration in Flask (both arguments default to `False`), we consider it an issue.
* To address this issue, we introduced a custom rule:
```yaml
rules:
  - id: avoid-insecure-cookies
    languages:
      - python
    message: secure and httponly arguments are set to False by default. Consider setting them to True
    patterns:
      - pattern: ... .set_cookie(...)
      - pattern-not: ... .set_cookie(..., secure=True, ..., httponly=True, ...)
    severity: WARNING
```

* Now this issue is found in both places: lines 8 and 13-15
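
To show what the two flags requested by the rule change on the wire, here is a small sketch that builds a `Set-Cookie` header with Python's standard `http.cookies` module. It is used instead of Flask only to keep the example self-contained; the helper name `build_cookie_header` and the cookie name and value are hypothetical.

```python
from http.cookies import SimpleCookie


def build_cookie_header(name: str, value: str) -> str:
    """Build a Set-Cookie header with the Secure and HttpOnly attributes set."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["secure"] = True    # browser sends the cookie over HTTPS only
    cookie[name]["httponly"] = True  # cookie is hidden from document.cookie
    return cookie.output()


header = build_cookie_header("session", "abc123")
print(header)
```

In Flask the same effect is achieved by passing `secure=True` and `httponly=True` to `set_cookie`, which is exactly the call shape that the `pattern-not` clause of the rule accepts.
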
* In the second file, the default rules detect that the JWT token is not checked for integrity in the `accept_request` function. This is a vulnerability because we would fail to notice that someone has tampered with the token.
* However, we are also worried about the hardcoded JWT secret, so we made a custom rule:
```yaml
rules:
  - id: avoid-hardcode-jwt
    languages:
      - python
    message: Do not hardcode jwt secrets
    pattern: ... .decode(..., "...", ...)
    severity: ERROR
```

We have made the rules publicly available through https://semgrep.dev/s/z4Gn and https://semgrep.dev/s/pv2e .
Here is our CI job that runs these rules:
```yaml
semgrep:
  # A Docker image with Semgrep installed.
  image: returntocorp/semgrep
  rules:
    # Scan changed files in MRs (diff-aware scanning):
    - if: $CI_MERGE_REQUEST_IID
    # Scan all files on the default branch and report any findings:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  variables:
    SEMGREP_GITLAB_JSON: "1"
  script: semgrep ci --gitlab-sast --config s/z4Gn --config s/pv2e > gl-sast-report.json || true
  artifacts:
    reports:
      sast: gl-sast-report.json
```
The job runs successfully.

The resulting artifact indeed shows findings from our new rules (we parsed it with a custom script and extracted the important fields for a concise report):
```
$ python3 parse.py
Entry0:
confidence - High
severity - Medium
location - File first.py, line 8
message - secure and httponly arguments are set to False by default. Consider setting them to True
Entry1:
confidence - High
severity - Medium
location - File first.py, lines 13-15
message - secure and httponly arguments are set to False by default. Consider setting them to True
Entry2:
confidence - High
severity - High
location - File second.py, line 5
message - Do not hardcode jwt secrets
Entry3:
confidence - High
severity - High
location - File second.py, line 9
message - Do not hardcode jwt secrets
Total found 4 filtered vulnerabilities
```
For reference, the original report looks like this:
```json
{
"$schema": "https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/master/dist/sast-report-format.json",
"version": "14.1.2",
"vulnerabilities": [
{
"category": "sast",
"confidence": "High",
"cve": "first.py:0786115947f0be571ac5970f9bfbcc209cdaaa97dc9f5e22e53b39d2760ffcff:avoid-insecure-cookies",
"id": "4b782276-a0be-399a-c3d4-eada85cc83c1",
"identifiers": [
{
"name": "Semgrep - avoid-insecure-cookies",
"type": "semgrep_type",
"url": "https://semgrep.dev/r/avoid-insecure-cookies",
"value": "avoid-insecure-cookies"
}
],
"location": {
"end_line": 8,
"file": "first.py",
"start_line": 8
},
"message": "secure and httponly arguments are set to False by default. Consider setting them to True",
"scanner": {
"id": "semgrep",
"name": "Semgrep",
"vendor": {
"name": "Semgrep"
}
},
"severity": "Medium"
},
...
]
}
```
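* Semgrep vs grep
    * Grep is a simple command-line utility that performs a regular expression search over the lines of the given input. It has no notion of the language being searched.
    * Semgrep, on the other hand, parses the code and matches patterns against its structure. It lets us define rules with the patterns we would like to avoid, severity levels for these rules, and suggestions in their messages.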
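
The practical difference can be demonstrated with Python's `re` and `ast` modules as rough stand-ins for grep and Semgrep. This is only an analogy under our own assumptions: Semgrep's real matching engine is more elaborate, and the function names and the sample program are hypothetical.

```python
import ast
import re

SOURCE = '''\
# eval(user_input) was removed in the last refactoring
note = "never call eval(user_input) here"
result = eval(user_input)
'''


def grep_like(source, regex=r"eval\("):
    """Line-based regular expression search, as grep would do."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if re.search(regex, line)
    ]


def ast_like(source, func_name="eval"):
    """Structure-based search: report only real calls to func_name."""
    return [
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == func_name
    ]


print("regex matches on lines:", grep_like(SOURCE))
print("AST matches on lines:", ast_like(SOURCE))
```

The regular expression also matches the comment and the string literal, while the structural search reports only the actual call on the last line. This is the kind of false positive that a syntax-aware tool avoids.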