My name is Srikavin Ramkumar, and I am an undergraduate student at the University of Maryland studying Computer Science and Mathematics.
I have taken courses on Algorithms, Programming Language Theory, and Computer Systems. I am currently taking courses on Databases and Networking. I have written projects for these classes in C, OCaml, and Python, but I can't share these publicly; if needed, I can send samples.
I also participate in cybersecurity capture-the-flag competitions. Some challenges require reverse engineering and exploiting unknown binaries and kernel drivers, so I am familiar with debugging tools as well as certain aspects of the Linux kernel.
In 2019, I contributed to the Fedora Project by writing tests and completing other tasks as part of Google Code-in.
I have experience with Linux, CLI tools, Make, and Git, although I'm still learning Automake. I am comfortable writing code in C, Python, Java, and TypeScript. Most of my recent C code has been for class projects, so I can't share it publicly. I have public repositories on my GitHub.
At this time, I do not have any commitments during the GSoC work period, and I will be able to work full-time on this project.
Generate strace system call decoders from descriptions similar to syzkaller system call definitions.
This project aims to incorporate into strace's build system a system for generating syscall decoders from a modified version of syzkaller's system call description language (syzlang). This would allow the strace project to leverage the large number of system calls and ioctls that are already described in a similar format.
This proposal consists of two parts:
Since strace is covered by extensive test cases, the correctness of generated decoders can be ensured by running existing test cases on generated decoders (in addition to new tests covering both the parser and code generation modules).
Many system call decoders in strace are very similar in structure. Generating these decoders from a declarative definition could ensure consistency and reduce the difficulty of implementing new syscall and ioctl decoders.
The expected deliverables are:
All code written should be modular, easy-to-extend, and well-documented.
Future projects could extend the code generation to support printing structured output. Future projects could also generate test cases using these descriptions to ensure consistent decoding of similar types across all existing decoders and increase the coverage of certain edge cases.
Syzkaller includes tools for compiling syzlang descriptions into Go code. However, since we are making extensive changes to the syzlang grammar, it may be easier to implement a new parser specifically for the purpose of generating strace decoders. The Go implementation and the documented syzlang grammar can still be used as a reference.
Syzlang was designed to improve kernel fuzzing results. It provides a good base for encoding syscall descriptions, but to describe syscalls in enough detail to generate decoders, we need to make a few modifications to syzlang:
Syzlang doesn't differentiate between (enum-like) mutually exclusive flags and OR-able bit flags. To work around this limitation, we can extend syzlang by adding new types (such as flag_enum and flag_bit) to differentiate between these types of flags.
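The difference between the two proposed flag types shows up directly in how a value would be rendered. The following is a minimal sketch, not strace's actual xlat machinery: `struct named_val`, `format_flag_enum`, and `format_flag_bit` are hypothetical names, and output goes to a buffer rather than to the trace so the logic is easy to inspect.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical name table entry; not part of strace or syzkaller. */
struct named_val {
	unsigned long val;
	const char *name;
};

/* flag_enum: the whole value must match exactly one entry. */
static void
format_flag_enum(char *out, size_t size, const struct named_val *tab,
		 size_t n, unsigned long val)
{
	for (size_t i = 0; i < n; i++) {
		if (tab[i].val == val) {
			snprintf(out, size, "%s", tab[i].name);
			return;
		}
	}
	/* Unknown value: fall back to the raw number. */
	snprintf(out, size, "%#lx", val);
}

/* flag_bit: each known bit is matched separately and joined with '|'. */
static void
format_flag_bit(char *out, size_t size, const struct named_val *tab,
		size_t n, unsigned long val)
{
	size_t len = 0;

	if (size == 0)
		return;
	out[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		if (tab[i].val == 0 || (val & tab[i].val) != tab[i].val)
			continue;
		int w = snprintf(out + len, size - len, "%s%s",
				 len ? "|" : "", tab[i].name);
		if (w < 0 || (len += (size_t) w) >= size)
			return;	/* error or truncated output */
		val &= ~tab[i].val;
	}
	/* Print leftover unknown bits, or 0 if nothing matched at all. */
	if (val || len == 0)
		snprintf(out + len, size - len, "%s%#lx", len ? "|" : "", val);
}
```

With a flag_enum table, a value that matches no entry is printed as a raw number, whereas a flag_bit value is decomposed bit by bit and only the unrecognized remainder is printed numerically.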
Syzlang supports specifying syscall variants. Supporting subvariants of syscalls will make it easier to describe certain syscalls (especially ioctls) where the value of some argument changes how the remaining arguments should be decoded.
Syzlang has certain features that are only applicable for fuzzing purposes. We can ignore/remove these features.
Integer types in syzlang are limited to sized integers. Adding platform-dependent integer types such as long, int, kernel_ulong_t, and kernel_long_t may be useful.
While syzlang allows importing C header files, adding support for including other syzlang files may reduce duplicate definitions of some types (especially common flags). This can be done through an import operator that essentially prepends the imported file to the current file.
All structure types need to be defined entirely. To maintain consistency with kernel headers, we can use an attribute to indicate that a struct should not be defined, and should instead reference the type from an included header.
Throughout the rest of this document, I will refer to our extended version simply as syzlang. A new name should be given to this version to avoid confusion.
The general idea is the following:
The Lexer/Parser will be implemented using flex/bison in C.
Tokenize the input file and then convert the token stream into an AST-like structure using the documented pseudo-grammar. The AST structure shouldn't be too complicated since syzlang is declarative and its grammar isn't recursive.
The syzlang descriptions also contain extra information, such as the range of valid values for certain syscall variants and attributes only relevant to fuzzing. We can ignore these values.
For example, the syscall read(fd fd, buf string[out], count int) len[buf] can be processed in the following manner:
First, it is lexed into the following stream of tokens:
identifier("read")
open_paren
identifier("fd")
identifier("fd")
comma
identifier("buf")
identifier("string")
open_bracket
identifier("out")
close_bracket
comma
identifier("count")
identifier("int")
close_paren
identifier("len")
open_bracket
identifier("buf")
close_bracket
end_of_line
end_of_file
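To make the token classes concrete, here is a small hand-written tokenizer that yields the same kind of stream for a single description. It is only a stand-in for illustration: the proposal is to use flex, and the names `tok_kind`, `struct token`, and `tokenize` are assumptions, not existing code.

```c
#include <ctype.h>
#include <stddef.h>

enum tok_kind {
	TOK_IDENT, TOK_OPEN_PAREN, TOK_CLOSE_PAREN, TOK_OPEN_BRACKET,
	TOK_CLOSE_BRACKET, TOK_COMMA, TOK_EOL, TOK_EOF, TOK_ERROR
};

struct token {
	enum tok_kind kind;
	char text[32];	/* identifier text; empty for other kinds */
};

/*
 * Splits a NUL-terminated description into at most max tokens and
 * returns how many were written. An unterminated last line still gets
 * an end_of_line token before end_of_file.
 */
static size_t
tokenize(const char *s, struct token *out, size_t max)
{
	size_t n = 0;

	for (;;) {
		while (*s == ' ' || *s == '\t')
			s++;
		if (n == max)
			return n;

		struct token *t = &out[n++];
		t->text[0] = '\0';

		if (*s == '\0') {
			if (n >= 2 && out[n - 2].kind != TOK_EOL) {
				t->kind = TOK_EOL;
				continue;
			}
			t->kind = TOK_EOF;
			return n;
		}
		if (isalpha((unsigned char) *s) || *s == '_') {
			size_t len = 0;

			/* '$' is allowed so variant names stay one token. */
			while (isalnum((unsigned char) *s) || *s == '_' ||
			       *s == '$') {
				if (len + 1 < sizeof(t->text))
					t->text[len++] = *s;
				s++;
			}
			t->text[len] = '\0';
			t->kind = TOK_IDENT;
			continue;
		}
		switch (*s++) {
		case '(': t->kind = TOK_OPEN_PAREN; break;
		case ')': t->kind = TOK_CLOSE_PAREN; break;
		case '[': t->kind = TOK_OPEN_BRACKET; break;
		case ']': t->kind = TOK_CLOSE_BRACKET; break;
		case ',': t->kind = TOK_COMMA; break;
		case '\n': t->kind = TOK_EOL; break;
		default: t->kind = TOK_ERROR; return n;
		}
	}
}
```

Integer literals, comments, and string literals are left out of this sketch; a flex specification would cover them with additional rules.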
Then, this token stream can be parsed into a structure similar to the following representation:
root:
  - name: read
    type: SyscallNode
    lineno: 1
    args:
      - name: fd
        type:
          name: fd
          options: []
      - name: buf
        type:
          name: string
          options: [out]
      - name: count
        type:
          name: int
          options: []
    ret:
      type:
        name: len
        options: [buf]
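In C, such a tree could be held in a few plain structures. The sketch below is one possible layout under simplifying assumptions (fixed-size arrays, string options only); `struct ast_type`, `struct ast_arg`, `struct ast_syscall`, and `find_arg` are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

#define MAX_OPTIONS 4	/* arbitrary limits to keep the sketch simple */
#define MAX_ARGS 6

struct ast_type {
	const char *name;
	const char *options[MAX_OPTIONS];
	size_t noptions;
};

struct ast_arg {
	const char *name;
	struct ast_type type;
};

struct ast_syscall {
	const char *name;
	int lineno;
	struct ast_arg args[MAX_ARGS];
	size_t nargs;
	struct ast_type ret;
};

/* The node a parser could produce for the read description. */
static const struct ast_syscall read_node = {
	.name = "read",
	.lineno = 1,
	.args = {
		{ "fd",    { "fd",     { NULL },  0 } },
		{ "buf",   { "string", { "out" }, 1 } },
		{ "count", { "int",    { NULL },  0 } },
	},
	.nargs = 3,
	.ret = { "len", { "buf" }, 1 },
};

/*
 * Resolves a reference such as the "buf" in len[buf] to the argument
 * it names; returns NULL if the syscall has no such argument.
 */
static const struct ast_arg *
find_arg(const struct ast_syscall *sc, const char *name)
{
	for (size_t i = 0; i < sc->nargs; i++) {
		if (strcmp(sc->args[i].name, name) == 0)
			return &sc->args[i];
	}
	return NULL;
}
```

A lookup like `find_arg` is what the code generator would need in order to connect the return type len[buf] with the buf argument it describes.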
The generation of syscall decoders depends on a variety of factors.
Syzkaller supports generic type templates. Every instance of these type templates could be represented as a different C structure.
For example, given the syzlang type template definition
type nlattr[TYPE, PAYLOAD] {
    nla_len     len[parent, int16]
    nla_type    const[TYPE, int16]
    payload     PAYLOAD
} [align_4]
The type nlattr[FOO, int32] can be represented in C as
struct nlattr__foo__int32 {
    int16_t nla_len;
    int16_t nla_type; /* always the constant FOO */
    int32_t payload;
} __attribute__((aligned(4)));
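Each template instance needs a unique C identifier. One way to derive it, following the nlattr__foo__int32 naming used in the example, is sketched below. The function name `mangle_template_name` and the exact scheme (lowercase parts joined by double underscores) are assumptions, and nested template arguments containing brackets would need extra sanitizing that is not shown.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Writes "<base>__<arg0>__<arg1>..." in lowercase into out.
 * Returns 0 on success or -1 if out is too small.
 */
static int
mangle_template_name(char *out, size_t size, const char *base,
		     const char *const *args, size_t nargs)
{
	size_t len = 0;

	for (size_t i = 0; i <= nargs; i++) {
		const char *part = i == 0 ? base : args[i - 1];

		if (i > 0) {
			/* separator between the base and each argument */
			if (len + 2 >= size)
				return -1;
			out[len++] = '_';
			out[len++] = '_';
		}
		for (; *part; part++) {
			if (len + 1 >= size)
				return -1;
			out[len++] = (char) tolower((unsigned char) *part);
		}
	}
	if (len >= size)
		return -1;
	out[len] = '\0';
	return 0;
}
```

Keeping the mangling deterministic means that two uses of the same instance in different description files map to the same structure name, so the generator can emit each structure once.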
Type aliases can either be transliterated directly as C typedefs or be evaluated before emitting source code.
Decoding structures can be complicated. We would need to generate a function that is able to print out all of the fields of a structure. More complicated structures (fields with bitmasks, etc.) may be too difficult to express declaratively. Syscalls using those structs (at first) should be written by hand.
For structures with simple types a print function could be generated. Consider
Syzkaller descriptions can reference Linux Kernel header files. These are easy enough to translate directly into C-style includes.
For example, include <linux/sched/coredump.h> can be converted into #include <linux/sched/coredump.h>.
Constants can be specified as decimal, hex, or character literals, as well as C-style #define constants. These can be represented in C as a constant or the defined literal. When decoding syscalls, the literal name should be preferred.
In order to print decoded arguments, we can delegate to existing decoders for each type (e.g. printfd, printstrn). A common header file (included by default in all generated code) can provide functions/macros to decode default types.
When dealing with pointer arguments, we can use the direction annotations (in, out, inout) in the description to decide how to decode them.
If we have an in argument, then the value pointed to is unchanged by the syscall, so we just decode the argument based on the type pointed to.
If we have an out argument, then the initial value is irrelevant and we need to decode the argument when exiting the syscall. If the syscall fails, we print the address of the argument.
If we have an inout argument, then we need to print the value pointed to both when entering and exiting the syscall. If the syscall fails, we should print just the initial value.
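The three rules above can be condensed into a single decision function. This is a sketch of the logic only; the enum and function names are hypothetical, and a generated decoder would more likely inline these decisions than call a helper at run time.

```c
enum direction { DIR_IN, DIR_OUT, DIR_INOUT };

enum action {
	ACT_NONE,		/* print nothing at this stop */
	ACT_PRINT_VALUE,	/* decode the pointed-to value */
	ACT_PRINT_ADDR		/* print only the pointer itself */
};

/*
 * Decides what to print for a pointer argument.
 * exiting: non-zero at syscall exit, zero at syscall entry.
 * failed:  non-zero if the syscall returned an error (exit only).
 */
static enum action
pointer_action(enum direction dir, int exiting, int failed)
{
	switch (dir) {
	case DIR_IN:
		/* value is not modified, so decode it once on entry */
		return exiting ? ACT_NONE : ACT_PRINT_VALUE;
	case DIR_OUT:
		/* initial contents are irrelevant; wait for the exit */
		if (!exiting)
			return ACT_NONE;
		return failed ? ACT_PRINT_ADDR : ACT_PRINT_VALUE;
	case DIR_INOUT:
		/* initial value on entry, updated value on success */
		if (!exiting)
			return ACT_PRINT_VALUE;
		return failed ? ACT_NONE : ACT_PRINT_VALUE;
	}
	return ACT_NONE;
}
```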
Syzkaller allows defining variants of syscalls. This is especially useful when the first argument denotes the operation of the syscall (e.g. prctl, ioctl).
If there is more than one variant syscall, we need to identify which variant is being called, and if we can't identify the variant, fall back to a generic variant.
To make the implementation simpler, it may make sense to only allow variants that depend only on a single argument.
Consider the following syzlang definition (line breaks for clarity):
// assume that this file has definitions for the flag `caps`
import "defs/caps.txt"
include <linux/prctl.h>
// the enum type argument that we defined earlier indicates
// that these values are to be treated as mutually exclusive
prctl$PR_CAP_AMBIENT(option const[PR_CAP_AMBIENT],
                     mode flags[enum, prctl_cap_ambient], arg3 ulong,
                     arg4 ulong, arg5 ulong)
prctl$PR_CAP_AMBIENT$PR_CAP_AMBIENT_RAISE(option const[PR_CAP_AMBIENT],
                     mode const[PR_CAP_AMBIENT_RAISE], cap flags[enum, caps],
                     arg4 ulong, arg5 ulong)
// these constants come from the header <linux/prctl.h>
prctl_cap_ambient = PR_CAP_AMBIENT_RAISE, PR_CAP_AMBIENT_LOWER,
                    PR_CAP_AMBIENT_IS_SET, PR_CAP_AMBIENT_CLEAR_ALL
Then we need to generate functions for each syscall variant:
void prctl__pr_cap_ambient(struct tcb *tcp) {
    // there are sub variants, so check if this invocation
    // matches one of those
    if (tcp->u_arg[0] == PR_CAP_AMBIENT && tcp->u_arg[1] == PR_CAP_AMBIENT_RAISE) {
        prctl__pr_cap_ambient__pr_cap_ambient_raise(tcp);
    } else {
        // default case
        // decoding logic here
    }
}
Generated code for complex syscalls and ioctls may be less performant than hand-written code. Care must be taken to ensure that appropriate data structures and algorithms are used in the generated decoders.
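As one illustration of such a data structure, variant selection for a syscall with many variants could use a sorted table and a binary search instead of a long if/else chain. The sketch below uses the standard `bsearch`; the table contents, the numeric keys, and the decoder names are placeholders made up for this example rather than values taken from kernel headers.

```c
#include <stddef.h>
#include <stdlib.h>

struct variant {
	unsigned long key;	/* value of the discriminating argument */
	const char *decoder;	/* name of the generated decoder */
};

/* Must be sorted by key; a generator can emit the rows in order. */
static const struct variant example_variants[] = {
	{ 10, "example__first" },
	{ 20, "example__second" },
	{ 47, "example__third" },
};

/* bsearch passes the search key first and the table element second. */
static int
variant_cmp(const void *a, const void *b)
{
	const unsigned long key = *(const unsigned long *) a;
	const struct variant *v = b;

	return (key > v->key) - (key < v->key);
}

/* Returns the decoder for key, or the generic fallback if unknown. */
static const char *
find_variant(unsigned long key)
{
	const struct variant *v =
		bsearch(&key, example_variants,
			sizeof(example_variants) / sizeof(example_variants[0]),
			sizeof(example_variants[0]), variant_cmp);

	return v ? v->decoder : "example__generic";
}
```

In generated code the table would more naturally hold function pointers than names; strings are used here only to keep the sketch easy to check. For syscalls with just a handful of variants, a plain switch statement is likely sufficient.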
The description language may not be flexible enough to implement some decoders. We can make modifications to accommodate those decoders, or we can keep the current hand-written versions of those syscall decoders.
read Syscall
The following is a possible example conversion from syzlang into a strace decoder.
File decls/read.txt
include <linux/unistd.h>
resource fd[int]
read(fd fd, buffer ptr[out, int8], count ssize_t) len[buffer, ssize_t]
File gen/read.c
// AUTOMATICALLY GENERATED FILE - DO NOT EDIT
// header containing shared macros like prints, etc
// can map to already existing functions
#include "gen/common.h"
#include <linux/unistd.h>
SYS_FUNC(read) {
    // the out pointer can only be decoded after the syscall returns,
    // so all arguments are printed on syscall exit
    if (exiting(tcp)) {
        // print a file descriptor arg
        PRINT_FD(tcp->u_arg[0]);
        // print separator ", "
        PRINT_ARG_SEP();
        // only print value of out ptr if syscall was successful
        if (!syserror(tcp)) {
            // print string with length rval (since its type is len[buffer])
            PRINT_NSTRING(tcp, tcp->u_arg[1], tcp->u_rval);
        } else {
            PRINT_PTR(tcp->u_arg[1]);
        }
        PRINT_ARG_SEP();
        // print uint
        PRINT_UINT(tcp->u_arg[2]);
    }
    // SYS_FUNC decoders return int; 0 means no special return flags
    return 0;
}
File decls/ioctl_btrfs.txt
include <uapi/linux/btrfs.h>
// define ioctl variants for each cmd
ioctl$BTRFS_IOC_SNAP_CREATE(fd fd, cmd const[BTRFS_IOC_SNAP_CREATE],
                            arg ptr[in, btrfs_ioctl_vol_args])
ioctl$BTRFS_IOC_SUBVOL_CREATE(fd fd, cmd const[BTRFS_IOC_SUBVOL_CREATE],
                              arg ptr[in, btrfs_ioctl_vol_args])
btrfs_ioctl_vol_args {
    fd      fd
    name    string[BTRFS_PATH_MAX]
}
// AUTOMATICALLY GENERATED FILE - DO NOT EDIT
#include "gen/common.h"
#include <uapi/linux/btrfs.h>
int ioctl_BTRFS_IOC_SNAP_CREATE(struct tcb *tcp) {
    PRINT("BTRFS_IOC_SNAP_CREATE");
    PRINT_ARG_SEP();
    // could be generated, or manually defined
    PRINT_STRUCT_BTRFS_IOCTL_VOL_ARGS(tcp, tcp->u_arg[2]);
    // tell the ioctl dispatcher that the arguments were decoded
    return RVAL_IOCTL_DECODED;
}
int ioctl_BTRFS_IOC_SUBVOL_CREATE(struct tcb *tcp) {
    PRINT("BTRFS_IOC_SUBVOL_CREATE");
    PRINT_ARG_SEP();
    // could be generated, or manually defined
    PRINT_STRUCT_BTRFS_IOCTL_VOL_ARGS(tcp, tcp->u_arg[2]);
    // tell the ioctl dispatcher that the arguments were decoded
    return RVAL_IOCTL_DECODED;
}
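A structure printer such as PRINT_STRUCT_BTRFS_IOCTL_VOL_ARGS could itself be generated from the struct description by emitting one formatting step per field. The sketch below shows roughly what that output might look like. It uses a local stand-in structure with an arbitrary 16-byte name field so that it compiles on its own, writes to a buffer instead of the trace, and omits the step of copying the structure out of the tracee's memory; all names are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the kernel structure; the field sizes are assumptions. */
struct vol_args_example {
	long long fd;
	char name[16];
};

/*
 * Formats the structure as {fd=..., name="..."} and returns the
 * snprintf result (the length the full output needs).
 */
static int
format_vol_args(char *out, size_t size, const struct vol_args_example *args)
{
	/*
	 * name may lack a terminating NUL, so the precision limits the
	 * read to the size of the array.
	 */
	return snprintf(out, size, "{fd=%lld, name=\"%.*s\"}",
			args->fd, (int) sizeof(args->name), args->name);
}
```

A real generated printer would delegate each field to the decoder for its type, for example printing fd through the file descriptor decoder so that it can be annotated with a path.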
May 17 – June 7: | Community Bonding Period: Finalize implementation ideas and communicate any issues/updates |
W01 (Jun. 07 – Jun. 11): | Work on implementing a syzlang lexer using flex (with tests) and update strace's CI script |
W02 (Jun. 14 – Jun. 18): | Implement syzlang parser with bison |
W03 (Jun. 21 – Jun. 25): | Buffer Week - Finish up remaining lexer/parser tasks and make sure it is well-documented and tested. |
W04 (Jun. 28 – Jul. 02): | Work on the foundation for code generation: aim to generate basic decoders (ptrs, strings, ints, etc.) |
W05 (Jul. 05 – Jul. 09): | Continue with W04 |
W06 (Jul. 12 – Jul. 16): | Add remaining code generation features (structs, union types, variants, type templates). |
July 16: | Phase 1 Evaluation |
W07 (Jul. 19 – Jul. 23): | Continue with W06. |
W08 (Jul. 26 – Jul. 30): | Continue with W06. Ensure code generation is documented with tests and is incorporated into the build system. |
W09 (Aug. 02 – Aug. 06): | Buffer Week - Finish up remaining code generation and ensure everything is well-documented and fix any issues with patches found in review |
W10 (Aug. 09 – Aug. 13): | Buffer Week - Finish up remaining code generation and ensure everything is well-documented and fix any issues with patches found in review |
W11 (Aug. 16 – Aug. 23): | Final Week: Buffer Week - Finish up anything remaining and ensure everything done is well-documented |
I have included four buffer weeks to account for any unforeseen issues. If earlier work is finished without needing a buffer week, that week can be skipped and work on the next task can begin.