
# Types

Overview of pending tasks in radare2 types

## API

https://github.com/radareorg/radare2/issues/16624

Also see "Solution: Base Types + C Types" below; this API is specifically for **Base Types**.

## General Representation of Types

**Problem:** Currently, in r2, references to types (for example the type field of a variable) are just strings. This means that every time the semantics of these types are needed (for example to know which type is meant, whether it's a pointer, etc.), they have to be parsed. This can be done with https://github.com/radareorg/radare2/blob/f1e113ca8feee3e7caae85f535ab8ca1ef86dd6a/libr/include/r_parse.h#L78-L119, but there are also places where no correct parsing is done, which leads to broken behavior:

```
[0x004003b0]> s main
[0x004004a6]> "td struct S1 { int a; int b; int c; char d[256]; short e; }"
[0x004004a6]> afvn s1 var_110h
[0x004004a6]> afvt s1 S1
[0x004004a6]> pd 2 @ 0x004004be
│           0x004004be      c785f0feffff.  mov dword [s1.a], 1
│           0x004004c8      c785f4feffff.  mov dword [s1.b], 2
[0x004004a6]> afvt s1 struct S1 # <- this should be fine too
[0x004004a6]> pd 2 @ 0x004004be
│           0x004004be      c785f0feffff.  mov dword [s1], 1
│           0x004004c8      c785f4feffff.  mov dword [rbp - 0x10c], 2
```

### Solution: Base Types + C Types

The idea here is that there are two distinct meanings of "type":

#### Base Types

... are named definitions of primitive integer types, structs (with their fields), enums (with their possible values), unions and typedefs that are saved in the database. They can NOT be pointers, arrays or anything of the like. They would correspond to the definitions of base types in the API above:

*xvilka: we should consider to make R_ANAL_BASE_TYPE_KIND_CLASS for structure with methods*

*thestr4ng3r: classes aren't necessarily types. For example you could have an abstract class or one with only static members if your language is so shitty that it forces you to use classes (java). Instead I think we should have regular structs and anal classes that refer to structs.*

*hound: they are not necessarily types, but I think they should be separated from structs because classes are often associated with their methods and it'd help with the separation I think.*

```c
typedef enum {
    R_ANAL_BASE_TYPE_KIND_STRUCT,
    R_ANAL_BASE_TYPE_KIND_UNION,
    R_ANAL_BASE_TYPE_KIND_ENUM,
    ...
} RAnalBaseTypeKind;

typedef struct r_anal_base_type_struct {
    RVector members; // describe the struct in some way
} RAnalBaseTypeStruct;

typedef struct r_anal_base_type_enum {
    RVector cases; // list of all the enum cases
} RAnalBaseTypeEnum;

typedef struct r_anal_base_type {
    RAnalBaseTypeKind kind;
    union {
        RAnalBaseTypeStruct ztruct;
        RAnalBaseTypeEnum inum;
        ...
    };
} RAnalBaseType;

R_API RAnalBaseType *r_anal_get_base_type(const char *name);
```

Examples:

* `struct MyStruct { uint32_t a; uint32_t b; };`
* `typedef struct MyStruct MyTypedef;`

#### C Types

... are built on top of base types and can be pointers, arrays or just a direct reference to a base type, possibly with qualifiers such as `const`, `volatile`, etc. They correspond to the definitions of a CType in the current string parser: https://github.com/radareorg/radare2/blob/f1e113ca8feee3e7caae85f535ab8ca1ef86dd6a/libr/include/r_parse.h#L78-L119

```c
typedef enum {
    R_PARSE_CTYPE_TYPE_KIND_IDENTIFIER,
    R_PARSE_CTYPE_TYPE_KIND_POINTER,
    R_PARSE_CTYPE_TYPE_KIND_ARRAY
} RParseCTypeTypeKind;

// ...

typedef struct r_parse_ctype_type_t RParseCTypeType;
struct r_parse_ctype_type_t {
    RParseCTypeTypeKind kind;
    union {
        struct {
            // ...
            char *name;
            bool is_const;
        } identifier;
        struct {
            RParseCTypeType *type;
            bool is_const;
        } pointer;
        struct {
            RParseCTypeType *type;
            ut64 count;
        } array;
    };
};
```

These are the kinds of types that a variable, for example, would hold, or that you link to an address to specify a global.
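
To illustrate how code consuming such a C type might look, here is a minimal sketch. It uses a simplified local copy of the struct above rather than the real r2 definitions: `CType`, `ctype_is_pointer` and `ctype_base_name` are hypothetical names for this note only, and the elided fields are simply left out. The point is that questions like "is this a pointer?" or "which base type does this refer to?" become a walk over the structure instead of a string parse.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical mirror of the RParseCTypeType layout shown above. */
typedef enum {
    CTYPE_KIND_IDENTIFIER,
    CTYPE_KIND_POINTER,
    CTYPE_KIND_ARRAY
} CTypeKind;

typedef struct ctype_t CType;
struct ctype_t {
    CTypeKind kind;
    union {
        struct { const char *name; bool is_const; } identifier;
        struct { const CType *type; bool is_const; } pointer;
        struct { const CType *type; uint64_t count; } array;
    };
};

/* True if the outermost layer of the type is a pointer. */
bool ctype_is_pointer(const CType *t) {
    assert(t != NULL);
    return t->kind == CTYPE_KIND_POINTER;
}

/* Strip all pointer and array layers and return the name of the
 * base type that the C type ultimately refers to. */
const char *ctype_base_name(const CType *t) {
    assert(t != NULL);
    while (t->kind != CTYPE_KIND_IDENTIFIER) {
        t = (t->kind == CTYPE_KIND_POINTER) ? t->pointer.type : t->array.type;
        assert(t != NULL);
    }
    return t->identifier.name;
}
```

Under these assumptions, `const struct MyStruct *` would be an identifier node named `MyStruct` with `is_const` set, wrapped in a pointer node; `ctype_base_name` on it yields the name to look up with the Base Types API.
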

Examples:

* `const struct MyStruct *` => pointer to the base type called "MyStruct", which is constant
* `MyStruct` => a direct reference to the "MyStruct" base type

#### Conclusion

This approach fits relatively well with how types are currently saved in r2. One could simply replace the places where strings are used (`RAnalVar.type` for example) with these C types plus the Base Types API specified above.

A problem here is that not only do C types refer to base types, but also the other way around. For example, a typedef (which is a base type) maps a name to a C type. However, base types are currently stored in sdb. As long as this is the case, one would still have to use a string representation for the C types in these sdb places.

Another minor issue is that this approach can't represent anonymous types, for example assigning `struct { int a; int b; } *` as the type of a variable. But that is only a small convenience problem, since you can always use named structs instead.

### Alternative Solution: Unifying these types

One could also dismiss this distinction and unify "base type" and "C type" under just "type". However, this would first require refactoring the base types out of sdb.

The Ghidra Decompiler uses such an approach. All types are stored in a `TypeFactory` and they can be pointers, structs, ...: https://github.com/NationalSecurityAgency/ghidra/blob/bcb825fb029232175625bc85653ec0e810b1252e/Ghidra/Features/Decompiler/src/decompile/cpp/type.hh#L380-L449

All types are pooled in such an object, no matter whether they are named or not.

This solution is very different from the current implementation of types in r2. It should be easier to do once the above solution with base types and C types is implemented, so I would suggest this only as an option for later. However, it is still important imo to think about this and also to read the Ghidra code, since it solves many issues quite nicely.

## Sizes of Types

In many places, it is critical to know the size of a given type, including both base types and complex C types. The main problem here is that there are many dependencies involved.

Here is an example for linking a type to an address (effectively defining a global variable). Let's say I have these base types saved in the db:

```
struct A {
    uint32_t a;
};

struct B {
    struct A suba;
    uint64_t b;
};
```

Then I link `struct B` to the address `0x100` and run `pd @ 0x104`. `pd` has to determine whether `0x104` is just supposed to be disassembled or whether it is part of some linked type, and it must be able to do this efficiently to show me something like this:

```
0x104 <the value of B.b here>
0x10c mov rax, 42
...
```

It is definitely not enough to take a snapshot of the type's size at the time I linked it to the address, since I could edit the type later. So if the size were stored at the link, it would have to be updated whenever `struct B` is edited, i.e. `struct B` needs some kind of reference to all the places where it is linked. Moreover, if I edit `struct A`, this will also change my link. So `struct A` needs to know that it is referenced by `struct B`, which in turn needs to know that it is linked at `0x100`. This could also lead to cycles, which would have to be dealt with somehow.

Alternatively, one could store only the C type at the link, not the size, and re-evaluate the size whenever needed. Links could be saved in a (non-augmented) RBTree, so `pd @ 0x104` would do this:

* Find the node in the link tree with the highest address less than or equal to 0x104
* Get the size of the type there
* If node->addr + node->size > 0x104, then we are inside this link

This would imply that we are oblivious to overlapping links. That may or may not be acceptable. One might consider overlapping variables an error and always use only the one at the highest address.

Additionally, it must be possible to efficiently get the size of a C type.
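
The lookup steps listed above can be sketched as follows. This is only an illustration under stated assumptions, not r2 code: `TypeLink`, `TypeSizeFn`, `link_at` and `demo_size_of` are hypothetical names, a sorted array with binary search stands in for the RBTree, and `demo_size_of` hard-codes the sizes implied by the example (`struct B` occupying 0x100 to 0x10b) in place of a real size evaluation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical link record: stores the address and the type, never the size. */
typedef struct {
    uint64_t addr;
    const char *type; /* C type kept as a string for this sketch */
} TypeLink;

/* Hypothetical callback that re-evaluates the size of a type on demand. */
typedef uint64_t (*TypeSizeFn)(const char *type);

/* Stand-in size evaluation for the struct A / struct B example above. */
uint64_t demo_size_of(const char *type) {
    if (strcmp(type, "struct A") == 0) {
        return 4;
    }
    if (strcmp(type, "struct B") == 0) {
        return 12;
    }
    return 0; /* unknown type: treated as occupying no bytes */
}

/* Return the link that contains addr, or NULL if there is none.
 * links must be sorted by ascending addr. Overlapping links are not
 * detected: only the link with the highest addr <= target is checked. */
const TypeLink *link_at(const TypeLink *links, size_t n, uint64_t addr,
                        TypeSizeFn size_of) {
    size_t lo = 0, hi = n;
    /* Binary search for the first link whose addr is greater than addr. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (links[mid].addr <= addr) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if (lo == 0) {
        return NULL; /* every link starts above addr */
    }
    const TypeLink *link = &links[lo - 1];
    uint64_t size = size_of(link->type);
    /* link->addr <= addr holds here, so the subtraction cannot underflow. */
    return (addr - link->addr < size) ? link : NULL;
}
```

Because the size is evaluated at lookup time, editing `struct A` or `struct B` needs no bookkeeping at the link; the cost moves into `size_of`, which is why an efficient size evaluation matters.
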
For the plain C type part, this should be possible, but getting the size of an underlying base type is trickier. Possible solutions include refactoring base types out of sdb and/or tracking dependencies only between base types and caching the size per base type.

## Type constraints

Currently, every variable can be linked with value constraints, e.g. have type `int` but be limited to the values [0;5], or to a set of such ranges like [0;6]U[8;12]U[15]. I think they should be provided by the types API out of the box; see https://github.com/radareorg/radare2/issues/11828 for more details.

## Consider distinguishing mutable from immutable types

In a simple case it's an integer or a float constant, see https://github.com/radareorg/radare2/issues/4508

A more complex case is the class structure, which is immutable most of the time.

*thestr4ng3r: could a const qualifier specified at c type level be enough for this?*

*xvilka: probably, just a thing to keep in mind though*