--- title: "GSoC 2021 Proposal: Generating Syscall Decoders from Declarative Definitions" --- <!-- # Table of Contents [TOC] --> # About Me My name is Srikavin Ramkumar and I am an undergraduate student at University of Maryland studying Computer Science and Mathematics. ## Relevant Experiences I have taken courses on Algorithms, Programming Language Theory, and Computer Systems. I am currently taking courses on Databases and Networking. I have written projects for these classes in C, OCaml, and Python, but I can’t share these publicly; if needed, I can send samples. I also participate in cyber security capture-the-flag competitions. Some challenges require reverse engineering and exploiting unknown binaries and kernel drivers. So I am familiar with debugging tools as well as certain aspects of the linux kernel. In 2019, I made contributions to the Fedora Project in the form of writing tests and other tasks as part of Google Code-In. ## Relevant Skills I have experience with Linux, CLI tools, Make, and Git, although I’m still learning Automake. I am comfortable writing code in C, Python, Java, and Typescript. Most of my recent C code has been for class projects, so I can’t share these publicly. I have public repositories on my [Github][1]. [1]: https://github.com/srikavin ## Other Commitments At this time, I do not have any commitments during the GSOC work period, and I will be able to work full-time on this project. # Proposal Generate strace system call decoders from descriptions similar to syzkaller system call definitions. ## Abstract This project aims to incorporate a system for generating syscall decoders from a modified [syzkaller][2]'s system call description language [(syzlang)][3] into strace’s build system. This project would allow the strace project to leverage descriptions of a [large number of system calls][5] and ioctls that are already described in a similar format. This proposal consists of two parts: * Parsing syzlang descriptions into an in-memory representation * Generating decoders based on the in-memory representation Since strace is covered by extensive test cases, the correctness of generated decoders can be ensured by running existing test cases on generated decoders (in addition to new tests covering both the parser and code generation modules). [2]: https://github.com/google/syzkaller/ [3]: https://github.com/google/syzkaller/blob/master/docs/syscall_descriptions.md [4]: https://github.com/google/syzkaller/blob/master/docs/syscall_descriptions_syntax.md [5]: https://github.com/google/syzkaller/tree/master/sys/linux ## Expected Results Many system call decoders in `strace` are very similar in structure. Generating these decoders from a declarative definition could ensure consistency as well as reducing the difficulty of implementing new syscall and ioctl decoders. The expected deliverables are: * A lexer/parser for the syzlang description language * Code generation based on the parser to generate syscall/ioctl decoders * Incorporating the above (with tests) into the build script and updated CI scripts All code written should be modular, easy-to-extend, and well-documented. Future projects could extend the code generation to support printing structured output. Future projects could also generate test cases using these descriptions to ensure consistent decoding of similar types across all existing decoders and increase the coverage of certain edge cases. ## Existing Work Syzkaller includes tools for compiling syzlang descriptions into Go code. However, we are making extensive changes to the syszlang grammar. It may be easier to implement a new parser specifically for the purposes of generating strace decoders. The Go implementation and the [documented syzlang grammar][7] can be still be used as reference. [6]: https://github.com/golang/go/wiki/PortingPolicy#first-class-ports [7]: https://github.com/google/syzkaller/blob/master/docs/syscall_descriptions_syntax.md ## Modifications to Syzlang Syzlang was designed to improve kernel fuzzing results. It provides a good base to encode syscall descriptions, but in order to describe syscalls sufficiently enough to generate decoders, we need to make a few modifications to syzlang: 1. Syzlang [doesn't differentiate between (enum-like) mutually exclusive flags and OR-able bit flags][8]. To work around this limitation with syslang, we can extend syzlang by adding new types (such as `flag_enum` and `flag_bit` ) to differentiate between these types of flags. 2. Syzlang supports specifying syscall variants. Supporting subvariants of syscalls will make it easier to describe certain syscalls (especially ioctls) where the value of some argument changes how the remaining arguments should be decoded. 3. Syzlang has certain features that are only applicable for fuzzing purposes. We can ignore/remove these features. 4. Integer types in syzlang are limited to sized integers. Adding platform dependent integer types such as `long`, `int`, `kernel_ulong_t`, `kernel_long_t` may be useful. 5. While syzlang allows importing C header files, adding support for including other syzlang files may reduce duplicate definition of some types (especially common flags). This can be done through an `import` operator that essentially prepends the imported file to the current file. 6. All structure types need to be defined entirely. To maintain consistency with kernel headers, we can use an attribute to indicate that a struct should not defined, and should instead reference the type from an included header. Throughout the rest of this document, I will be referring to our extended version of `syzlang`. A new name should be given to this version to avoid confusion. <!-- Additionally, some strace parsers need to correlate between arguments. Consider `int read(fd, buf, n)`. The return value indicates how many bytes were succesfully read. In order to print the buffer we need to use the return value. We can extend syzlang by referencing other arguments and return values by prefixing the name with a `!`. The return value could be specified by the special value `!!ret` to avoid collisions with existing names. In the `read` example, we could write the following modified-syzlang description: ``` read(fd fd, buf ptr[out, string[!!ret]], ) ``` --> [8]: https://github.com/google/syzkaller/blob/master/docs/syscall_descriptions.md#flagsenums ## Tentative Implementation Ideas The general idea is the following; ```mermaid graph TB descs[Declarative syscall/ioctl Descriptions] --> lexer subgraph lexer[Lexer] -->|Token Stream| parser[Parser] end subgraph parser -->|AST| gen[Code Generator] end gen -->|C Code| c[Generated Source] ``` ### Lexer/Parser The Lexer/Parser will be implemented using [flex](https://github.com/westes/flex)/[bison](https://github.com/akimd/bison) in C. Tokenize the input file and then convert the token stream into an AST-like structure using the documented pseudo-grammar. The AST structure shouldn’t be too complicated since syzlang is declarative and its grammar isn’t recursive. The syslang descriptions also contain extra information such as the range of valid values for certain syscall variants and attributes only relevant to fuzzing. We can ignore these values. For example, the syscall `read(fd fd, buf string[out], count int) len[buf]` can be processed in the following manner: First, it is lexed into the following stream of tokens: ``` identifier("read") open_paren identifier("fd") identifier("fd") comma identifier("buf") identifier("string") open_bracket identifier("out") close_bracket comma identifier("count") identifier("int") close_paren identifier("len") open_bracket identifier("buf") close_bracket end_of_line end_of_file ``` Then, this token stream can be parsed into a structure similar to the following representation: ```yaml root: - name: read type: SyscallNode lineno: 1 args: - name: fd type: name: fd options: [] - name: buf type: name: string options: [out] - name: count type: name: int options: [] ret: type: name: len options: [buf] ``` ## Decoder Code Generation Rules The generation of syscall decoders depends on a variety of factors. #### Type Templates Syzkaller supports generic type templates. Every instance of these type templates could be represented as a different C structure. For example, given the syzlang type template definition ```clike=1 type nlattr[TYPE, PAYLOAD] { nla_len len[parent, int16] nla_type const[TYPE, int16] payload PAYLOAD } [align_4] ``` The type `nlattr[FOO, int32]` can be represented in C as ```c=1 struct nlattr__foo__int32 { int16_t nla_len; FOO nla_type; int32_t payload; }__attribute__((aligned(4)); ``` #### Structs/Unions/Type Aliases Type aliases can either be transliterated directly as C `typedef`s or they can be evaluted before emitting source code. Decoding structures can be complicated. We would need to generate a function that is able to print out all of the fields of a structure. More complicated structures (fields with bitmasks, etc.) may be too difficult to express declaratively. Syscalls using those structs (at first) should be written by hand. For structures with simple types a print function could be generated. Consider ### Includes Syzkaller descriptions can reference Linux Kernel header files. These are easy enough to translate directly into C-style includes. For example, `include <linux/sched/coredump.h>` can be converted into `#include <linux/sched/coredump.h>`. ### Constant values Constants can be specified as decimal, hex, or character literals; as well as C-style #define constants. These can be represented in C as a constant or the defined literal. When decoding syscalls, the literal name should be preferred. ### Arguments In order to print decoded arguments, we can delegate to currently existing decoders for each type (e.g. `printfd, printstrn`). A common header file (included by default for all generated code) can include functions/macros to decode default types. ### Pointer Arguments When dealing with pointer arguments, we can use the direction annotations (`in`, `out`, `inout`) in the description to decide how to decode these. If we have an `in` argument, then the value pointed to is unchanged by the syscall. So we just decode the argument based on the type pointed to. If we have an `out` arguement, then the initial value is irrelevant and we need to decode the argument when exiting the syscall. If the syscall errors, we print the address of the argument. If we have an `inout` argument, then we need to print the value pointed to both when entering and exiting the syscall. If the syscall errors, we should print just the initial value. ### Variants Syskaller allows defining variants of syscalls. This is espcially useful when the first argument denotes the operation of the syscall (e.g. prctl, ioctl). If there is more than one variant syscall, we need to identify which variant is being called, and if we can't identify the variant, fall back to a generic variant. To make the implementation simpler, it may make sense to only allow variants that depend only on a single argument. Consider the following syzlang definition (line breaks for clarity): ```clike= // assume that this file has definitions for the flag `caps` import "defs/caps.txt" include <linux/prctl.h> // the enum type argument indicates that we defined earlier indicates // that these values are to be treated as mutually exclusive prctl$PR_CAP_AMBIENT(option const[PR_CAP_AMBIENT], mode flags[enum, prctl_cap_ambient], arg3 ulong, arg4 ulong, arg5 ulong) prctl$PR_CAP_AMBIENT$PR_CAP_AMBIENT_RAISE(option const[PR_CAP_AMBIENT], mode const[PR_CAP_AMBIENT_RAISE], cap flags[enum, caps], arg4 ulong, arg5 ulong) // these constants come from the header <linux/prctl.h> prctl_cap_ambient = PR_CAP_AMBIENT_RAISE, PR_CAP_AMBIENT_LOWER, PR_CAP_AMBIENT_IS_SET, PR_CAP_AMBIENT_CLEAR_ALL ``` Then we need to generate methods for each syscall variant: ```c void prctl__pr_cap_ambient(struct tcb *tcp) { // there are sub variants, so check if this invocation // matches one of those if (tcp->u_arg[0] == PR_CAP_AMBIENT && tcp->u_arg[1] == PR_CAP_AMBIENT_RAISE) { prctl__pr_cap_ambient__pr_cap_ambient_raise(tcp); } else { // default case // decoding logic here } } ``` ## Possible Issues Generated code for complex syscalls and ioctls may be less performant that hand-written code. Care must be taken to ensure that appropriate data structures and algorithms are used in the generated decoders. The description language may not be flexible enough to implement some decoders. We can make modifications to accomodate those decoders, or we can maintain the current version of those syscall decoders. ## Examples ### `read` Syscall The following is a possible example conversion from syzlang into a strace decoder. File `decls/read.txt` ```clike=1 include <linux/unistd.h> resource fd[int] read(fd fd, buffer ptr[out, int8], count ssize_t) len[buffer, ssize_t]) ``` File `gen/read.c` ```c=1 // AUTOMATICALLY GENERATED FILE - DO NOT EDIT // header containing shared macros like prints, etc // can map to already existing functions #include "gen/common.h" #include <linux/unistd.h> SYS_FUNC(read) { // print all non-in arguments on syscall exit if (exiting(tcp)) { // print a file descriptor arg PRINT_FD(tcp->u_arg[0]); // print seperator ", " PRINT_ARG_SEP(); // only print value of out ptr if syscall was successful if (!syserror(tcp)) { // print string with length rval (since its type is len[buffer]) PRINT_NSTRING(tcp, tcp->u_arg[1], tcp->u_rval); } else { PRINT_PTR(tcp->u_arg[1]); } PRINT_ARG_SEP(); // print uint PRINT_UINT(tcp->u_arg[2]); } } ``` ### Ioctls File `decls/ioctl_ext4.txt` ```clike include <uapi/linux/btrfs.h> // define ioctl variants for each cmd ioctl$BTRFS_IOC_SNAP_CREATE(fd fd, cmd const[BTRFS_IOC_SNAP_CREATE], arg ptr[in, btrfs_ioctl_vol_args]) ioctl$BTRFS_IOC_SUBVOL_CREATE(fd fd, cmd const[BTRFS_IOC_SUBVOL_CREATE], arg ptr[in, btrfs_ioctl_vol_args]) btrfs_ioctl_vol_args { fd fd name string[BTRFS_PATH_MAX] } ``` ```c // AUTOMATICALLY GENERATED FILE - DO NOT EDIT #include "gen/common.h" #include <uapi/linux/btrfs.h> int ioctl_BTRFS_IOC_SNAP_CREATE(struct tcb *tcp) { PRINT("BTRFS_IOC_SNAP_CREATE"); PRINT_ARG_SEP(); // could be generated, or manually defined PRINT_STRUCT_BTRFS_IOCTL_VOL_ARGS(tcp, tcp->u_arg[2]); } int ioctl_BTRFS_IOC_SUBVOL_CREATE(struct tcb *tcp) { PRINT("BTRFS_IOC_SUBVOL_CREATE"); PRINT_ARG_SEP(); // could be generated, or manually defined PRINT_STRUCT_BTRFS_IOCTL_VOL_ARGS(tcp, tcp->u_arg[2]); } ``` ## Tentative Timeline | <!-- --> | <!-- --> | | ----------------------- | ----------------------------- | May 17 -- June 7: | **Community Bonding Period:** Finalize implementation ideas and communicate any issues/updates W01 (Jun. 07 -- Jun. 11): | Work on implementing a syzlang lexer using flex (with tests) and update strace’s CI script W02 (Jun. 14 -- Jun. 18): | Implement syzlang parser with bison W03 (Jun. 21 -- Jun. 25): | Buffer Week -- Finish up remaining lexer/parser tasks and make sure it is well-documented and tested. W04 (Jun. 28 -- Jul. 02): | Work on the foundation for code generation: aim to generate basic decoders (ptrs, strings, ints, etc.) W05 (Jul. 05 -- Jul. 09): | Continue with W04 W06 (Jul. 12 -- Jul. 16): | Add remaining code generation features (structs, union types, variants, type templates). July 16: | **Phase 1 Evaluation** W07 (Jul. 19 -- Jul. 23): | Continue with W06. W08 (Jul. 26 -- Jul. 30): | Continue with W06. Ensure code generation is documented with tests and is incorporated into the build system. W09 (Aug. 02 -- Aug. 06): | Buffer Week -- Finish up remaining code generation and ensure everything is well-documented and fix any issues with patches found in review W10 (Aug. 09 -- Aug. 13): | Buffer Week -- Finish up remaining code generation and ensure everything is well-documented and fix any issues with patches found in review W11 (Aug. 16 -- Aug. 23): | **Final Week:** Buffer Week - Finish up anything remaining and ensure everything done is well-documented I have included four buffer weeks to account for any unforseen issues. If previous work is finished without using the buffer weeks, they can be skipped and the next task can be worked on.