Try   HackMD

About Me

My name is Srikavin Ramkumar and I am an undergraduate student at University of Maryland studying Computer Science and Mathematics.

Relevant Experiences

I have taken courses on Algorithms, Programming Language Theory, and Computer Systems. I am currently taking courses on Databases and Networking. I have written projects for these classes in C, OCaml, and Python, but I can’t share these publicly; if needed, I can send samples.

I also participate in cyber security capture-the-flag competitions. Some challenges require reverse engineering and exploiting unknown binaries and kernel drivers. So I am familiar with debugging tools as well as certain aspects of the linux kernel.

In 2019, I made contributions to the Fedora Project in the form of writing tests and other tasks as part of Google Code-In.

Relevant Skills

I have experience with Linux, CLI tools, Make, and Git, although I’m still learning Automake. I am comfortable writing code in C, Python, Java, and Typescript. Most of my recent C code has been for class projects, so I can’t share these publicly. I have public repositories on my Github.

Other Commitments

At this time, I do not have any commitments during the GSOC work period, and I will be able to work full-time on this project.

Proposal

Generate strace system call decoders from descriptions similar to syzkaller system call definitions.

Abstract

This project aims to incorporate a system for generating syscall decoders from a modified syzkaller's system call description language (syzlang) into strace’s build system. This project would allow the strace project to leverage descriptions of a large number of system calls and ioctls that are already described in a similar format.

This proposal consists of two parts:

  • Parsing syzlang descriptions into an in-memory representation
  • Generating decoders based on the in-memory representation

Since strace is covered by extensive test cases, the correctness of generated decoders can be ensured by running existing test cases on generated decoders (in addition to new tests covering both the parser and code generation modules).

Expected Results

Many system call decoders in strace are very similar in structure. Generating these decoders from a declarative definition could ensure consistency as well as reducing the difficulty of implementing new syscall and ioctl decoders.

The expected deliverables are:

  • A lexer/parser for the syzlang description language
  • Code generation based on the parser to generate syscall/ioctl decoders
  • Incorporating the above (with tests) into the build script and updated CI scripts

All code written should be modular, easy-to-extend, and well-documented.

Future projects could extend the code generation to support printing structured output. Future projects could also generate test cases using these descriptions to ensure consistent decoding of similar types across all existing decoders and increase the coverage of certain edge cases.

Existing Work

Syzkaller includes tools for compiling syzlang descriptions into Go code. However, we are making extensive changes to the syszlang grammar. It may be easier to implement a new parser specifically for the purposes of generating strace decoders. The Go implementation and the documented syzlang grammar can be still be used as reference.

Modifications to Syzlang

Syzlang was designed to improve kernel fuzzing results. It provides a good base to encode syscall descriptions, but in order to describe syscalls sufficiently enough to generate decoders, we need to make a few modifications to syzlang:

  1. Syzlang doesn't differentiate between (enum-like) mutually exclusive flags and OR-able bit flags. To work around this limitation with syslang, we can extend syzlang by adding new types (such as flag_enum and flag_bit ) to differentiate between these types of flags.

  2. Syzlang supports specifying syscall variants. Supporting subvariants of syscalls will make it easier to describe certain syscalls (especially ioctls) where the value of some argument changes how the remaining arguments should be decoded.

  3. Syzlang has certain features that are only applicable for fuzzing purposes. We can ignore/remove these features.

  4. Integer types in syzlang are limited to sized integers. Adding platform dependent integer types such as long, int, kernel_ulong_t, kernel_long_t may be useful.

  5. While syzlang allows importing C header files, adding support for including other syzlang files may reduce duplicate definition of some types (especially common flags). This can be done through an import operator that essentially prepends the imported file to the current file.

  6. All structure types need to be defined entirely. To maintain consistency with kernel headers, we can use an attribute to indicate that a struct should not defined, and should instead reference the type from an included header.

Throughout the rest of this document, I will be referring to our extended version of syzlang. A new name should be given to this version to avoid confusion.

Tentative Implementation Ideas

The general idea is the following;

Token Stream
AST
C Code
Code Generator
Parser
Lexer
Declarative syscall/ioctl Descriptions
Generated Source

Lexer/Parser

The Lexer/Parser will be implemented using flex/bison in C.

Tokenize the input file and then convert the token stream into an AST-like structure using the documented pseudo-grammar. The AST structure shouldn’t be too complicated since syzlang is declarative and its grammar isn’t recursive.

The syslang descriptions also contain extra information such as the range of valid values for certain syscall variants and attributes only relevant to fuzzing. We can ignore these values.

For example, the syscall read(fd fd, buf string[out], count int) len[buf] can be processed in the following manner:

First, it is lexed into the following stream of tokens:

identifier("read")
open_paren
    identifier("fd")
    identifier("fd")
comma
    identifier("buf")
    identifier("string")
    open_bracket
        identifier("out")
    close_bracket
comma
    identifier("count")
    identifier("int")
close_paren
identifier("len")
open_bracket
    identifier("buf")
close_bracket
end_of_line
end_of_file

Then, this token stream can be parsed into a structure similar to the following representation:

root:
    - name: read
      type: SyscallNode
      lineno: 1
      args:
          - name: fd
            type:
                name: fd
                options: []
          - name: buf
            type: 
                name: string
                options: [out]
          - name: count
            type:
                name: int
                options: []
      ret:
          type:
              name: len
              options: [buf]

Decoder Code Generation Rules

The generation of syscall decoders depends on a variety of factors.

Type Templates

Syzkaller supports generic type templates. Every instance of these type templates could be represented as a different C structure.

For example, given the syzlang type template definition

type nlattr[TYPE, PAYLOAD] { nla_len len[parent, int16] nla_type const[TYPE, int16] payload PAYLOAD } [align_4]

The type nlattr[FOO, int32] can be represented in C as

struct nlattr__foo__int32 { int16_t nla_len; FOO nla_type; int32_t payload; }__attribute__((aligned(4));

Structs/Unions/Type Aliases

Type aliases can either be transliterated directly as C typedefs or they can be evaluted before emitting source code.

Decoding structures can be complicated. We would need to generate a function that is able to print out all of the fields of a structure. More complicated structures (fields with bitmasks, etc.) may be too difficult to express declaratively. Syscalls using those structs (at first) should be written by hand.

For structures with simple types a print function could be generated. Consider

Includes

Syzkaller descriptions can reference Linux Kernel header files. These are easy enough to translate directly into C-style includes.

For example, include <linux/sched/coredump.h> can be converted into #include <linux/sched/coredump.h>.

Constant values

Constants can be specified as decimal, hex, or character literals; as well as C-style #define constants. These can be represented in C as a constant or the defined literal. When decoding syscalls, the literal name should be preferred.

Arguments

In order to print decoded arguments, we can delegate to currently existing decoders for each type (e.g. printfd, printstrn). A common header file (included by default for all generated code) can include functions/macros to decode default types.

Pointer Arguments

When dealing with pointer arguments, we can use the direction annotations (in, out, inout) in the description to decide how to decode these.

If we have an in argument, then the value pointed to is unchanged by the syscall. So we just decode the argument based on the type pointed to.

If we have an out arguement, then the initial value is irrelevant and we need to decode the argument when exiting the syscall. If the syscall errors, we print the address of the argument.

If we have an inout argument, then we need to print the value pointed to both when entering and exiting the syscall. If the syscall errors, we should print just the initial value.

Variants

Syskaller allows defining variants of syscalls. This is espcially useful when the first argument denotes the operation of the syscall (e.g. prctl, ioctl).

If there is more than one variant syscall, we need to identify which variant is being called, and if we can't identify the variant, fall back to a generic variant.

To make the implementation simpler, it may make sense to only allow variants that depend only on a single argument.

Consider the following syzlang definition (line breaks for clarity):

// assume that this file has definitions for the flag `caps` import "defs/caps.txt" include <linux/prctl.h> // the enum type argument indicates that we defined earlier indicates // that these values are to be treated as mutually exclusive prctl$PR_CAP_AMBIENT(option const[PR_CAP_AMBIENT], mode flags[enum, prctl_cap_ambient], arg3 ulong, arg4 ulong, arg5 ulong) prctl$PR_CAP_AMBIENT$PR_CAP_AMBIENT_RAISE(option const[PR_CAP_AMBIENT], mode const[PR_CAP_AMBIENT_RAISE], cap flags[enum, caps], arg4 ulong, arg5 ulong) // these constants come from the header <linux/prctl.h> prctl_cap_ambient = PR_CAP_AMBIENT_RAISE, PR_CAP_AMBIENT_LOWER, PR_CAP_AMBIENT_IS_SET, PR_CAP_AMBIENT_CLEAR_ALL

Then we need to generate methods for each syscall variant:

void prctl__pr_cap_ambient(struct tcb *tcp) {
    // there are sub variants, so check if this invocation
    // matches one of those
    if (tcp->u_arg[0] == PR_CAP_AMBIENT && tcp->u_arg[1] == PR_CAP_AMBIENT_RAISE) {
        prctl__pr_cap_ambient__pr_cap_ambient_raise(tcp);
    } else {
        // default case
        // decoding logic here
    }
}

Possible Issues

Generated code for complex syscalls and ioctls may be less performant that hand-written code. Care must be taken to ensure that appropriate data structures and algorithms are used in the generated decoders.

The description language may not be flexible enough to implement some decoders. We can make modifications to accomodate those decoders, or we can maintain the current version of those syscall decoders.

Examples

read Syscall

The following is a possible example conversion from syzlang into a strace decoder.

File decls/read.txt

include <linux/unistd.h> resource fd[int] read(fd fd, buffer ptr[out, int8], count ssize_t) len[buffer, ssize_t])

File gen/read.c

// AUTOMATICALLY GENERATED FILE - DO NOT EDIT // header containing shared macros like prints, etc // can map to already existing functions #include "gen/common.h" #include <linux/unistd.h> SYS_FUNC(read) { // print all non-in arguments on syscall exit if (exiting(tcp)) { // print a file descriptor arg PRINT_FD(tcp->u_arg[0]); // print seperator ", " PRINT_ARG_SEP(); // only print value of out ptr if syscall was successful if (!syserror(tcp)) { // print string with length rval (since its type is len[buffer]) PRINT_NSTRING(tcp, tcp->u_arg[1], tcp->u_rval); } else { PRINT_PTR(tcp->u_arg[1]); } PRINT_ARG_SEP(); // print uint PRINT_UINT(tcp->u_arg[2]); } }

Ioctls

File decls/ioctl_ext4.txt

include <uapi/linux/btrfs.h>

// define ioctl variants for each cmd
ioctl$BTRFS_IOC_SNAP_CREATE(fd fd, cmd const[BTRFS_IOC_SNAP_CREATE], 
arg ptr[in, btrfs_ioctl_vol_args])

ioctl$BTRFS_IOC_SUBVOL_CREATE(fd fd, cmd const[BTRFS_IOC_SUBVOL_CREATE], 
arg ptr[in, btrfs_ioctl_vol_args])

btrfs_ioctl_vol_args {
    fd      fd
    name    string[BTRFS_PATH_MAX]
}
// AUTOMATICALLY GENERATED FILE - DO NOT EDIT
#include "gen/common.h"

#include <uapi/linux/btrfs.h>

int ioctl_BTRFS_IOC_SNAP_CREATE(struct tcb *tcp) {
    PRINT("BTRFS_IOC_SNAP_CREATE");
    PRINT_ARG_SEP();
    
    // could be generated, or manually defined
    PRINT_STRUCT_BTRFS_IOCTL_VOL_ARGS(tcp, tcp->u_arg[2]); 
}

int ioctl_BTRFS_IOC_SUBVOL_CREATE(struct tcb *tcp) {
    PRINT("BTRFS_IOC_SUBVOL_CREATE");
    PRINT_ARG_SEP();
    
    // could be generated, or manually defined
    PRINT_STRUCT_BTRFS_IOCTL_VOL_ARGS(tcp, tcp->u_arg[2]); 
}

Tentative Timeline

May 17 – June 7: Community Bonding Period: Finalize implementation ideas and communicate any issues/updates
W01 (Jun. 07 – Jun. 11): Work on implementing a syzlang lexer using flex (with tests) and update strace’s CI script
W02 (Jun. 14 – Jun. 18): Implement syzlang parser with bison
W03 (Jun. 21 – Jun. 25): Buffer Week – Finish up remaining lexer/parser tasks and make sure it is well-documented and tested.
W04 (Jun. 28 – Jul. 02): Work on the foundation for code generation: aim to generate basic decoders (ptrs, strings, ints, etc.)
W05 (Jul. 05 – Jul. 09): Continue with W04
W06 (Jul. 12 – Jul. 16): Add remaining code generation features (structs, union types, variants, type templates).
July 16: Phase 1 Evaluation
W07 (Jul. 19 – Jul. 23): Continue with W06.
W08 (Jul. 26 – Jul. 30): Continue with W06. Ensure code generation is documented with tests and is incorporated into the build system.
W09 (Aug. 02 – Aug. 06): Buffer Week – Finish up remaining code generation and ensure everything is well-documented and fix any issues with patches found in review
W10 (Aug. 09 – Aug. 13): Buffer Week – Finish up remaining code generation and ensure everything is well-documented and fix any issues with patches found in review
W11 (Aug. 16 – Aug. 23): Final Week: Buffer Week - Finish up anything remaining and ensure everything done is well-documented

I have included four buffer weeks to account for any unforseen issues. If previous work is finished without using the buffer weeks, they can be skipped and the next task can be worked on.