# CS 61C Notes
# Lecture 1: 8/26/20
Intro class, no notes
# Lecture 2: 8/28/20
* Refer to table for easy hex to binary conversion
* Numerals have infinite preceding zeroes
* Numerals are representations of numbers
* Numbers are theoretical
* For a set number of bits, if the result of addition (or sub, mul, div) cannot be represented, there is overflow
* Not called underflow; that would just be negative overflow
## Number representations
* Positive numbers are encoded identically in all of these representations
* Sign and magnitude
* Leftmost bit is the sign bit (0 is +, 1 is -)
* Problem 1: Making a number smaller than 0 actually makes its binary representation bigger
* Problem 2: Also there are 2 zeroes
* Problem 3: Complicated circuitry
* One's complement
* Flip all the bits for a negative number
* Problem: there are still 2 0s (0000 and 1111)
* Solution: shift everything over by 1 (eliminating -0), which leads to two's complement
* Two's complement
* Flip all bits and add $1$
* Or multiply leftmost bit by $-1$
* All 1s represents -1
* $2^{N-1}$ non-negatives, $2^{N-1}$ negatives
* There is one more negative than positive value, since zero takes a non-negative slot (C sketch at the end of this list)
* Bias encoding
* Add a bias to bring the center of the range down
* Bias for N bits chosen as $-(2^{N-1}-1)$
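
A quick C sketch of the two's complement rules above (a minimal check, using the fixed-width types from `<stdint.h>`):
```cpp=
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int8_t x = 5;
    int8_t neg = ~x + 1;           // flip all bits and add 1
    printf("%d\n", neg);           // -5
    printf("%d\n", (int8_t)0xFF);  // all 1s represents -1
    // 8 bits: 2^7 = 128 non-negatives (0..127), 128 negatives (-128..-1)
    return 0;
}
```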
# Lecture 3: 8/31/20
* ENIAC: First Electronic General-Purpose Computer
* Long time to program (patch cords and switches)
* EDSAC: First General Stored-Program Computer
* Programs held as numbers/bits in memory
* Compile changed source files into .o files (machine code object files) which are then linked into a .out file (machine code executable file)
* Executables must be rebuilt on each new system (porting)
* Parallel compiling: `make -j` to rebuild changed pieces
* Amdahl's Law: Linking is sequential, forming a bottleneck
* C Pre-Processor (CPP): Begins with `#`
* Ex. `#define` or `#include`
* Macros: preprocessor text substitution; the body replaces the macro name everywhere below the definition
* However, this means a macro argument's expression can be evaluated multiple times (unlike a real function call)
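
A sketch of that pitfall (`SQUARE` is a hypothetical macro, not one from lecture):
```cpp=
#include <stdio.h>
#define SQUARE(x) ((x) * (x))  // text substitution, not a function call

int main(void) {
    int i = 3;
    int s = SQUARE(i++);      // expands to ((i++) * (i++)): i++ runs twice (undefined behavior!)
    printf("%d %d\n", s, i);  // unpredictable; a real function would print 9 4
    return 0;
}
```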
## C vs Java
C main function
```cpp=
#include <stdio.h>
int main(void) {
    return 0;
}
```
| | C | Java|
| -------- | -------- | -------- |
| Type of Language | Function Oriented | Object Oriented |
| Programming Unit | Function | Class = Abstract Data Type |
| Compilation | `gcc hello.c` | `javac Hello.java` |
| Execution | `a.out` loads and executes | `java Hello` interprets |
| Storage | Manual (`malloc`, `free`) | `new` allocates & initializes, automatic garbage collection |
| Comments | `/* ... */` | `/* ... */` or `//` |
| Constants | `#define, const` | `final` |
| Preprocessor | Yes | No |
| Variable declaration | Beginning of block (C99 same as Java) | Before you use it |
| Variable naming conventions | under_scores | camelCase |
| Access a library | `#include <stdio.h>` | `import java.io.File;` |
* ANSI C was updated to C99 or C9x
* Added variable-length arrays
* Int types and booleans
* C11 and C18 are newer fixes
* Multi-threading, unicode, removed `gets()`
* Passing args into main: `int main(int argc, char *argv[])`
* `argc` contains the number of strings on the command line
* Executable counts as one, plus one for each argument
* `argv` is a pointer to an array containing the arguments as strings
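
A minimal sketch of the `argc`/`argv` convention:
```cpp=
#include <stdio.h>

int main(int argc, char *argv[]) {
    // ./a.out hi there  =>  argc == 3, argv[0] == "./a.out"
    for (int i = 0; i < argc; i++)
        printf("argv[%d] = %s\n", i, argv[i]);
    return 0;
}
```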
## C Basics
* Booleans
* False: `0` (int), `NULL` pointer; C99 adds a real boolean type (`<stdbool.h>`)
* True: Everything else
* Types: `int`, `unsigned int`, `float`, `double`, `char`, `long`, `long long`
* `long` is a longer int (at least 32b), `long long` is even longer (at least 64b)
* `int` is at least 16b (commonly 32b)
* `float` is floating point decimal, `double` is equal or higher precision floating point
* Preferable to use `intN_t` and `uintN_t` from `<stdint.h>` for explicit sizes
* Constants: `const int days_in_week = 7;`
* Enums: `enum color {RED, GREEN, BLUE}`
* Typedef: `typedef uint8_t BYTE;` Custom types
* Struct: Custom structured groups of variables
```cpp=
typedef struct {
    int length_in_seconds;
    int year_recorded;
} SONG;
```
* `switch` statements fall through cases until a `break` is reached
* `goto` is nonlocal control flow, reserved for advanced use cases
* Included headers share one global namespace, so it's `atan`, not `math.atan`
* Never control a `while` loop with a floating point value (rounding error breaks exact comparisons)
# Lecture 4: 9/2/20
## Pointers
* Variables have undefined values if not initialized in declaration
* Heisenbugs: bugs from undefined behavior that appear nondeterministically
* Address refers to a memory location (not value)
* Pointer contains the address of a memory location
* Syntax:
* `int *p;` p is a pointer (an address of an int)
* `p = &y;` assign the address of y to p
* Dereferencing:
* `z = *p;` assign value at address in p to z
* `*p = z;` assign the value of z to what p is pointing to
* Must initialize pointers first
* `int *ptr; *ptr=5;` does not work since `*ptr` never had an initial value
* Benefits: cleaner code, and passing a pointer to a large struct/array is faster than copying it (sketch below)
* Drawback is bugs (dynamic memory management, dangling references, memory leaks)
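
Putting the pointer syntax above together in one minimal sketch:
```cpp=
#include <stdio.h>

int main(void) {
    int y = 3, z;
    int *p;  // p is a pointer (an address of an int)
    p = &y;  // assign the address of y to p
    z = *p;  // dereference: z gets 3
    *p = 7;  // write through the pointer: y is now 7
    printf("y=%d z=%d\n", y, z);  // y=7 z=3
    return 0;
}
```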
## Pointer Usage
* Pointers can point to any data type
* `void*` is a generic pointer (not recommended)
* Pointers can point to functions
* `int (*fn)(void*, void*) = &foo;`
* `fn` is a pointer to a function that accepts 2 `void*` pointers and returns an int; it initially points to function `foo`
* `(*fn)(x, y)` is equivalent to `foo(x, y)` (see the sketch after this list)
* Struct arrow notation: `int h = paddr->x;` equivalent to `int h = (*paddr).x;`
* Arrow notation is more commonly used
* Struct assignment copies all member values; assigning pointers copies only the address
* NULL pointer: The pointer of all 0s
* Writing or reading to null pointer causes crash
* Since 0 is false, simple test for null pointer `if(q)`
* A pointer's declared type tells how many bytes to fetch on dereference
* 32bit int is stored in 4 consecutive 8-bit bytes
* Word alignment: word accesses must fall on 4-byte boundaries
* `sizeof(type)` returns num of bytes in object
* By C99 definition, `sizeof(char) == 1`
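
A sketch of the function pointer syntax above (`foo` here is a stand-in comparison function, not from lecture):
```cpp=
#include <stdio.h>

int foo(void *a, void *b) {  // compare two ints passed through void*
    return *(int *)a - *(int *)b;
}

int main(void) {
    int (*fn)(void*, void*) = &foo;  // fn initially points to foo
    int x = 5, y = 3;
    printf("%d\n", (*fn)(&x, &y));   // same as foo(&x, &y): prints 2
    return 0;
}
```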
## Arrays
* A contiguous block of memory
* Declaration: `int ar[] = {1, 2};`
* Arrays are almost identical to pointers
* `char *string` and `char string[]` are nearly identical
* Cannot increment an array
* `ar[0]` is same as `*ar`, `ar[2]` is same as `*(ar+2)`
* Declared arrays are only allocated while scope is valid
* Declare a constant like `const int ARRAY_SIZE = 10;` instead of hardcoding the size (sketch at the end of this section)
* Array does not know its own length and bounds not checked
* Exception is strings, the null terminator signifies end
* Common errors
* Segmentation fault - read/write to memory you don't have access to
* Bus error - alignment is wrong (bad word alignment)
* `pointer + n` adds n * sizeof(type)
* Handle: pointer to a pointer `**h`
* Needed when a function must change the caller's pointer, e.g. to advance it through an array
* Typically no more than 2 stars
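
A sketch tying the array/pointer rules above together:
```cpp=
#include <stdio.h>

int main(void) {
    const int ARRAY_SIZE = 3;           // don't hardcode the size
    int ar[] = {10, 20, 30};
    int *p = ar;                        // array name acts like a pointer
    printf("%d %d\n", *ar, *(ar + 2));  // ar[0] and ar[2]: 10 30
    p++;                                // legal: p is a variable...
    // ar++;                            // ...but incrementing an array is not
    printf("%d\n", *p);                 // 20; p moved by 1 * sizeof(int) bytes
    for (int i = 0; i < ARRAY_SIZE; i++) ar[i] = 0;  // bounds are on us
    return 0;
}
```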
# Lecture 5: 9/4/20
## Dynamic Memory Allocation
* `sizeof` gives size in bytes
* Also knows size of arrays (in bytes)
* Allocate room to point to: `malloc()`
* Ex. `ptr = (int *) malloc(n*sizeof(int))`
* This will cast the `void*` returned by malloc to `int*`
* `malloc` location contains garbage initially
* Dynamically allocated space must be freed
* When main returns, the OS reclaims all of it, but relying on that is bad practice
* Memory leak: memory that is never freed after its pointer is lost; a real problem for long-running programs
* Malloc returns `NULL` if request cannot be granted
* Common errors
* `free()`ing the same place of memory twice
* `free()`ing something that you didn't `malloc()`
* Resizing
* `realloc(p, size)`, which returns a new pointer and may move the data (`NULL` if it fails); see the sketch below
* `realloc(ip, 0)` is same as `free(ip)`
* Array names are not variables
* `*a`: first value of the array; `a`: pointer to the array; `&a`: also the array's address (same value as `a`, different type)
* 3 Rules of pointers
* 1) Need to set up the pointee (separate from pointer)
* 2) Dereference a pointer to access its pointee
* 3) Assignment (=) between pointers makes them point to the same pointee
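
A minimal allocation sketch following the rules above (error handling limited to the `NULL` checks the notes mention):
```cpp=
#include <stdlib.h>

int main(void) {
    int n = 4;
    int *ptr = (int *) malloc(n * sizeof(int));   // contents start as garbage
    if (ptr == NULL) return 1;                    // malloc can fail
    for (int i = 0; i < n; i++) ptr[i] = i;
    int *bigger = realloc(ptr, 2 * n * sizeof(int));
    if (bigger == NULL) { free(ptr); return 1; }  // old block still valid on failure
    ptr = bigger;                                 // realloc may have moved the data
    free(ptr);                                    // exactly one free per malloc
    return 0;
}
```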
## Linked List
```cpp=
struct Node {
    char *value;
    struct Node *next;
};
typedef struct Node *List; // List is a pointer to struct Node
List ListNew(void) { return NULL; } // Create new empty list
```
* When you `malloc` a string, need `strlen(string) + 1` for null terminator `\0`
* `strcpy(destination, original)`
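
A sketch of prepending a copied string to the `List` above (the function name `ListPrepend` is illustrative):
```cpp=
#include <stdlib.h>
#include <string.h>

struct Node { char *value; struct Node *next; };  // as defined above
typedef struct Node *List;

List ListPrepend(List list, const char *string) {
    List node = (List) malloc(sizeof(struct Node));
    node->value = (char *) malloc(strlen(string) + 1);  // +1 for '\0'
    strcpy(node->value, string);                        // strcpy(destination, original)
    node->next = list;
    return node;
}
```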
## Memory locations
* Struct declaration does NOT allocate memory
* Variable declaration does allocate memory
* Methods of allocating memory
* Declaration of local variable
* Dynamic allocation at runtime like `malloc`
* Global variable (above `main()`) which has global scope
* 3 Pools of memory
* 1) Static storage: global variable storage (permanent)
* 2) Stack: Local variable storage, params, return address
* 3) Heap: Dynamic malloc storage: data lives until deallocated by programmer
* Program Address space
* Stack: local vars, grows downward
* Big stack: Deep recursion/fractals
* Heap: Space from malloc, grows upward
* Just big mallocs without function calls
* Static data: Declared outside of main. Does not grow/shrink
* Code: Loaded when program starts. Does not grow/shrink
![](https://i.imgur.com/cBqquil.png)
### Stack
* Stack frame (LIFO)
* Return instruction address, params, space for other local variables
* Always contiguous blocks of memory; the stack pointer tells where the top of the stack is
* When procedure ends, stack frame is tossed off stack freeing memory for future stack frames
* Don't erase memory even if not needed, just move stack pointer up
### Heap
* Large pool of memory, not allocated in contiguous order
* Specify the num of bytes to allocate in `malloc`
* Tricky to manage unlike others
* Want `malloc()`, `free()` to be fast, minimal memory overhead
* Want to avoid (external) fragmentation (free memory separated in small chunks)
* K&R Implementation
* Each block of memory is preceded by a header with size of block and pointer to next block
* Free blocks are kept in a circular linked list
* Pointer field unused in allocated blocks
* `malloc()`: Search free list for a block large enough
* `free()`: Check if blocks adjacent to freed block are free
* Coalesce into 1 bigger free block, otherwise add new freed block to free list
* `free` is very quick, `malloc` can search the full list worst case
* `malloc` block choosing
* best-fit: choose smallest block that is big enough
* first-fit: choose first block that is big enough
* next-fit: same as first-fit but resume searching where we finished last time
* distributes the small slivers of memory
## When Memory goes Bad
* Pointer problems
* Dangling reference: using ptr before malloc
* Writing off the end of arrays
* Cannot return pointer to variable in local frame
* Use after free
* Stale pointers after `realloc` moves data
* Freeing the wrong stuff (incremented pointer)
* Double free
* Memory leaks: tardy free, lose the ptr
* Incrementing a pointer
* Valgrind catches memory leaks, misuse of `free`, and writing over the end of arrays
* Helpful for debugging
# Lecture 6: 9/9/20
* Very large/small numbers and fractional numbers need to be stored
* Agree where to put the binary point to sep integer and fractional parts
* Store where the binary point is in a diff variable
* Binary point can 'float' as the binary point is no longer fixed
* Normalized form: No leading 0s (like scientific notation)
## Floating Point Representation
* First 1 is not recorded (since always 1)
* Sign: 1 bit (use sign-magnitude form)
* Exponent: 8 bits
* Significand: 23 bits (LSB represents $2^{-23}$)
* Can represent magnitudes in the range $(1.2 * 10^{-38}, 3.4 * 10^{38})$
* Overflow: result needs an exponent larger than the 8 bits can represent ($>3.4 * 10^{38}$, $< -3.4 * 10^{38}$)
* Underflow: result needs a negative exponent smaller than the 8 bits can represent
* Ends up storing a small fraction as 0
* More precision: double (52 bit significand)
* The stored significand (fraction) is always in $[0, 1)$ for normalized numbers; the full $1.f$ is in $[1, 2)$
* Representing 0: Exponent value of 0
* 0 has no leading 1
* Results in 2 zeroes because of the sign bit
* Biased exponent representation (IEEE Standard 754)
* 2's complement would make negative numbers look bigger
* Bias: 127
* $(-1)^{\text{Sign}} * (1 + \text{Significand}) * 2^{\text{Exponent} - 127}$
![](https://i.imgur.com/hyKuS0x.png)
## Special Numbers
* How do we represent $\pm$ infinity
* Use exponent 255 with 0 significand
* Representing NaN
* Use exponent 255 with nonzero significand
* NaNs contaminate results as they percolate up
* $2^{23} - 1$ possible NaN patterns can specify each error (sqrt of a negative number, 0/0, etc.) and even encode which line it occurred on
* Problem of normalization and implicit 1
* Gap between FP numbers around 0
* Smallest pos num: $2^{-126}$
* Second smallest: $(1 + 2^{-23}) * 2^{-126} = 2^{-126} + 2^{-149}$
* Denorm: Solves implicit 1 problem
* Exponent 0 with nonzero significand
* No implicit leading 1, implicit exponent -126
* Smallest: $2^{-149}$
* Second smallest: $2^{-148}$
* Even steps until normalized region and the stride doubles with higher exponents
* Have $2^{23} \approx$ 8 million slots between each power of 2
* $2^{24} + 1$ is the smallest int that cannot be represented since the stride would be 2
* Example: $\frac{1}{3}$
* Sign: 0
* Exponent: $-2 + 127 = 125 = 0b01111101$
* Significand: $0b010101 \ldots$
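
A sketch that checks this encoding by viewing the bits of a `float` (assumes 32-bit IEEE 754 floats; uses `memcpy` to avoid aliasing issues):
```cpp=
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 1.0f / 3.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);       // view the raw bit pattern
    uint32_t sign = bits >> 31;           // 0
    uint32_t exp  = (bits >> 23) & 0xFF;  // 125 = 0b01111101
    uint32_t sig  = bits & 0x7FFFFF;      // 0b0101... (last bit rounded)
    printf("sign=%u exp=%u sig=0x%06X\n",
           (unsigned)sign, (unsigned)exp, (unsigned)sig);
    return 0;
}
```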
## Fallacies
* Associativity is lost
* Small numbers can be lost when added to big numbers
* Precision: count of the number of bits used to represent a value
* Accuracy: Difference between actual value and comp representation
* High precision permits high accuracy but no guarantee
* Rounding: 2 extra bits of precision to round
* Converting double to single precision (or FP to int)
* 1) Round to $+\infty$
* 2) Round to $-\infty$
* 3) Truncate (round to 0)
* 4) Unbiased: Round to even number
* Decimal examples: 2.5 rounds to 2, 3.5 rounds to 4
* FP Addition
* De-normalize to match exponents
* Add significands
* Keep same exponent
* Normalize again
* Casting
* `int -> float -> int` is not always same
* `float -> int ->float` is not always same
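
A quick check of both casting pitfalls (relies on `float` having a 24-bit significand, per the stride discussion above):
```cpp=
#include <stdio.h>

int main(void) {
    int i = (1 << 24) + 1;        // 2^24 + 1: stride near 2^24 is 2
    float f = (float) i;          // rounds to 2^24
    printf("%d\n", (int) f == i); // 0: int -> float -> int changed the value
    float g = 0.1f;
    printf("%d\n", (float)(int) g == g);  // 0: float -> int truncates the fraction
    return 0;
}
```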
## Alternate FP Representations
* float: standard float (8exp, 23 sig)
* binary64: double precision (11exp, 52 sig)
* Quad-precision: binary128 (15exp, 112 sig)
* Oct-precision: binary256 (19 exp, 236 sig)
* Half-precision: binary16 or fp16 (5exp, 10 sig)
* bfloat16 (brain FP): (8exp, 7sig)
* unum: Tells how many bits are for exp/sig (customizable)
* u-bit indicates if the number is exact
# Lecture 7: 9/11/20
* CPU executes a lot of instructions
* Instruction Set Architecture (ISA): set of instructions a CPU implements
* RISC: Keep instruction set small and simple to build fast hardware
## RISC-V Basics
* Registers instead of variables
* Registers are inside the processor (not memory)
* Predetermined number of registers in hardware (32)
* Each RISC-V register is 32 bits wide (independent of variant)
* Numbered from `x0` to `x31`
* `x0` is special, always holds value 0
* Registers can be referred to by number or name
* Registers have no type
* Comments start with `#`
* Instruction: one statement/command
* Each line has at most 1 instruction
* One operation per instruction
## RISC-V add/sub
* `add x1, x2, x3`
* one: operation by name
* two: operand getting result (destination)
* three: first operand
* four: second operand
* Break longer sequences into multiple instructions
* Immediates: numerical constants
* `addi x3, x4, 10`
* No `subi`, just `addi` a negative value
* Use `x0` for the number zero instead of an immediate
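
A sketch of this style, assuming a lives in `x10`, b in `x11`, and c in `x12` (register choices are illustrative):
```assembly
# C: a = b + c - 5;
add  x10, x11, x12   # a = b + c
addi x10, x10, -5    # no subi: add the negative immediate instead
add  x13, x10, x0    # copy a into x13 using the always-zero register
```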
# Lecture 8: 9/14/20
* Some registers have designated use (usually < 30 registers available)
* Optimizing compilers minimize register usage
* Register footprint - number of registers needed for a compute kernel
* Memory is like a large 1D array
* Processor gives an address (base pointer plus offset) to a point in memory
* Load from (read) and store to memory (write)
* Memory addresses are in bytes
* Word addresses are 4 bytes apart
* Little-endian: Least significant byte is smallest address
* Bits are always stored with most significant in upper position
* DRAM: Dynamic Random Access Memory (fast but not as fast as registers, medium capacity)
* registers are 50-500x faster than memory
## Loading from and storing Memory
* Load Word: `lw destination, offset(base_ptr)`
* Offset is in bytes (multiple of 4)
* Data flow moves from right to left
* C Code: `int A[100]; g = h + A[3];`
* Equiv: `lw x10, 12(x15); add x10, x12, x10`
* Store Word: `sw current, offset(base_ptr)`
* Data flow moves from left to right
* `int A[100]; A[10] = h + A[3];`
* `lw x10, 12(x15)`
* `add x10, x12, x10`
* `sw x10, 40(x15)`
* Load byte/store byte: `lb` `sb` for byte data transfers
* `lb x10, 3(x11)` copies content of `x11+3` to low byte pos of `x10`
* Sign-extend the sign bit to all higher bits
* Unsigned byte load: `lbu`
* No sign-extension needed
* No `sbu` (nonsensical)
* `addi` on a register value is much faster than the equivalent `lw`, `add`, `sw` sequence on memory
* Immediates must be <= 32 bits
* Need to load longer immediates from memory
## Decision Making
* if statement: `beq reg1, reg2, L1`
* Go to statement labeled L1 if (value reg1) == (value reg2)
* Go to next statement otherwise
* Counterpart: `bne` branch not equal
* conditional branch:
* Less than: `blt` (unsigned: `bltu`)
* Branch greater: `bge` (unsigned: `bgeu`)
* Unconditional branch (always)
* Jump: `j label`
* `==` usually translates to `bne` (branch around the body), so the next instruction executes if the condition is true
* Need a jump like `j Exit` at the end of the `bne` block to skip the else code (sketch at the end of this list)
* No `bgt`, `ble`
* Need to increment memory pointers by num of bytes
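
A sketch of the if/else pattern above (labels and register assignments are illustrative):
```assembly
# C: if (i == j) f = g + h; else f = g - h;
      bne x10, x11, Else   # i in x10, j in x11; skip body when i != j
      add x12, x13, x14    # f = g + h
      j Exit               # don't fall into the else block
Else: sub x12, x13, x14    # f = g - h
Exit:
```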
# Lecture 9: 9/16/20
* Logical instructions: isolate or pack bytes/nibbles into words
* 2 variants: register and immediate
* No logical NOT instruction; just XOR with all 1s (immediate -1)
* Logical shifting:
* Shift logical left: `sll` (shift n bits left and insert 0s on the right; bits fall off the left end)
* Shift logical left immediate: `slli`
* Shift logical right: `srl` fills left with 0s
* Arithmetic shifting: logical shift for signed int
* Shift right arithmetic (`sra`, `srai`): moves n bits to the right, inserting the sign bit into the emptied bits
* Not equivalent to divide by $2^n$ (fails for odd neg numbers)
* The C standard says division should round toward 0
## Registers
* Programs are stored in memory
* Program lives in an area, data lives in an area (In practice, program above data)
* 1 RISC-V instruction is 32 bits
* Instruction fetched from memory using Program Counter (PC)
* Next instruction is 4 bytes away (incr PC +4)
* Symbolic register names
* `x10 - x17` are `a0-a7` for params, `a0, a1` for return values
* `x1` is `ra` (return address to point of origin)
* `x8-x9`: `s0-s1` and `x18-x27`: `s2-s11`: saved registers
* `x0` is `zero`
* Pseudo-instructions: shorthand for common assembly
* `mv rd, rs = addi rd, rs, 0` (move value to another register)
* `li rd, 13 = addi rd, x0, 13` (load immediate value into register)
* `nop = addi x0, x0, 0` (no operation, processor just waits)
## Function calls
* 6 fundamental steps
* 1) Put arguments in place where func can access them
* 2) Transfer control to Function
* 3) Acquire local storage resources needed for func
* 4) Perform desired task of func
* 5) Put return value where code can access it, restore used registers, release local storage
* 6) Return control to origin
* Use `jr` not `j` since `jr` allows us to use a variable to return (`jr ra`)
* Jump and link: `jal rd, Label` jumps to the label and saves the return address (PC+4) in `rd` (equiv to addi then j)
* Essentially link then jump
* Pseudo-instruction is `ret = jr ra`
* `jalr rd, rs, imm`: Jump and link register
* `j`, `jr`, `ret` are pseudo-instructions for `jal`/`jalr` with `rd = x0` (discarding the return address makes it a plain jump)
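
A sketch of the six steps for a simple leaf function (names and registers are illustrative):
```assembly
main:
    li   a0, 10        # 1) put the argument in a0
    jal  ra, double    # 2) jump, saving PC+4 in ra
    mv   s0, a0        # back here afterward; a0 holds the result (20)
    j    done          # skip over the function body

double:                # 3)-4) do the work
    add  a0, a0, a0    # 5) return value goes in a0
    ret                # 6) ret = jr ra: return control to origin

done:
```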
# Lecture 10: 9/18/20
* Use stack to save old values before calling func
* Stack pointer: `sp` (x2)
* Stack grows downwards (stack address decreases when we push)
* Push decrements sp, pop increments sp
* Stack frame includes:
* return instruction address
* Params (args)
* Space for local vars
* Stack pointer points to bottom of stack frame
```assembly
addi sp, sp, -8 # adjust stack for 2 items
sw s1, 4(sp)    # save s1 for use after the call
sw s0, 0(sp)    # save s0
...
lw s0, 0(sp)    # restore s0 for caller
lw s1, 4(sp)    # restore s1
addi sp, sp, 8  # adjust stack to delete 2 items
jr ra           # return (pseudo-instruction: ret)
```
## Nested function calls
* `ra` would be overwritten in future calls
* Register conventions: a set of rules for which registers are unchanged after `jal`
* Preserved: `sp, gp, tp, s0-s11` (s0 is fp)
* Not preserved: `a0-a7, ra, t0-6` (arg and temp registers)
* Caller saves `a0-a7`, `t0-t6`, and `ra` if it needs them after the call
* Callee must preserve saved registers and sp
## Memory Allocation
* Types of storage classes:
* Automatic: local to func, discarded when func exits
* Use stack for local vars
* Procedure frame/activation record: segment of stack with saved registers and local vars
* Static: exist across procedures
* Stack: high end `0xbfff_fff0`
* Text segment in low end: `0x0001_0000`
* Static: gp points to static (`0x1000_0000`)
* Heap: above static, grows up to high addresses
* The callee saves them on the stack at the beginning of the function call and restores the stack right before returning
# Lecture 11: 9/21/20
* Everything has memory addresses (instructions, data words)
* Branches and jumps to different sections
* One register keeps address of the instruction being executed (Program counter: PC) aka Instruction Pointer (IP)
* Programs are distributed as binaries but are different versions based on device
* Want backwards compatibility
* Data is in words(32 bit chunks)
* Each register is a word (lw, sw to access 1 word of memory)
* Divide instruction into fields
* R- register-register arithmetic ops
* I- register-immediate arithmetic ops, loads
* S- stores
* B- branches (minor variant of S)
* U- 20-bit upper immediate instructions
* J - jumps (minor variant of U)
## R-Format
![](https://i.imgur.com/tzDHERd.png)
* opcode: what format it is
* `0b0110011` for all R format
* funct7+funct3: tell what operation to perform
* `rs1`, `rs2`: source registers, specifying the registers holding the first and second operands
* `rd`: destination register: register that receives result of computation
* Each register field holds a 5-bit uint corresp to reg number (`x0-x31`)
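
A worked encoding of `add x1, x2, x3` under the field layout above:
```assembly
# funct7   rs2    rs1    funct3  rd     opcode
# 0000000  00011  00010  000     00001  0110011
# concatenated: 0b00000000001100010000000010110011 = 0x003100B3
add x1, x2, x3
```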
## I-Format
![](https://i.imgur.com/wttA7k3.png)
* `imm[11:0]` covers range $[-2048, 2047]$
* Sign-extend immediate
* `opcode`: `0b0010011`
* Shift amount is limited to 5 bits, since larger shifts of a 32-bit value are meaningless
* Loads:
* 12bit signed immediate for the memory address (offset)
* `rd` contains value loaded from memory
* `rs1` holds the base address to load from
## S-format
![](https://i.imgur.com/KzbMVst.png)
* Store needs 2 registers (`rs1, rs2`) but no `rd`
# Lecture 12: 9/23/20
## B-format (SB)
![](https://i.imgur.com/5Icv0bD.png)
* Need to read 2 registers but don't write to register
* Branches used for loops (func calls and unconditional jumps handled by jump instructions)
* Can only jump relative to PC
* Largest branch dist limited by size of code
* PC-Relative Addressing: Use immediate as 2s complement offset to PC
* 12 bits gives $\pm 2^{11}$ 32-bit words of reach
* Can only branch to beginning of word
* Branch calculation
* No branch: `PC = PC + 4`
* Branch: `PC = PC + immediate*4`
* Support for 16-bit (compressed) instructions means branch reach is reduced by half
* Otherwise half of the possible targets would be errors
* $\pm 2^{10}$ 32-bit words of reach
* Branch calculation
* No branch: `PC = PC + 4`
* Branch: `PC = PC + immediate*2`
* The 12-bit immediate really encodes a 13-bit offset whose LSB is always 0, so the 0 isn't stored
* beq with offset 0 is infinite loop
## U-Format
![](https://i.imgur.com/y16TyRH.png)
* Need to branch to destinations farther than $2^{10}$ words away
* Solution: invert the branch (replace `beq` with `bne`) and follow it with a longer-range `j`
* 20 bit immediate for upper 20 bits of 32 bit word
* lui: Load upper immediate
* lui then addi to set all 32 bits
* But `addi` sign-extends its 12-bit immediate, which can corrupt the upper bits
* Add 1 to the `lui` value when bit 11 of the low part is set to take that into account
* li: pseudo-op that handles all this (expands to `lui` and `addi`; worked sketch after this list)
* auipc: Add upper immediate to PC
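
A worked sketch of the `lui`/`addi` fix-up, loading the illustrative constant 0xDEADBEEF into `x10`:
```assembly
# Low 12 bits 0xEEF have bit 11 set, so addi sign-extends them to -273;
# compensate by adding 1 to the upper 20 bits: 0xDEADB + 1 = 0xDEADC
lui  x10, 0xDEADC    # x10 = 0xDEADC000
addi x10, x10, -273  # 0xDEADC000 - 0x111 = 0xDEADBEEF
# li x10, 0xDEADBEEF would expand to this pair
```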
## J-Format (UJ)
![](https://i.imgur.com/zvYwHhQ.png)
* `jal` saves PC+4 to rd (return address)
* The immediate is the offset in halfwords; the always-0 LSB is removed (not stored)
* PC relative jump to $\pm 2^{19}$ locations $=\pm 2^{18}$ 32bit instructions
* `jalr rd, rs, immediate` uses I-format
* Writes `rd = PC+4`, `PC = rs + immediate`
* Way to jump to absolute address
* Cannot assume an LSB of 0 here, so the immediate is a plain byte offset and range is reduced
* Use auipc, jalr to jump to pc-relative 32-bit offset
# Lecture 13: 9/25/20
* Interpreter: Directly executes a program in source language
* Higher level, better error msgs
* Slower (10x), code smaller (2x)
* Machine-independent: the same source runs on any machine with an interpreter
* Translator: converts prog from 1 source lang to another
* Translates to an executable; end users never need to see the source code
* Compilation, Assembly, Linking, Loading
* Compiler
* Input: High level lang code (`foo.c`)
* Output: Assembly code: (`foo.s`)
* Can contain pseudo-instructions
## Assembler
* Input: Assembly code (`foo.s`)
* Output: Object code (`foo.o`)
* Reads and uses directives
* Ex. `.text` (code), `.data` (static), `.globl sym`, `.string str` `.word w1...wn`
* Replace pseudo-instructions, create machine code, create object file
* Need to sign extend immediates
* Branch immediates count halfwords
* PC-relative branches/jumps: count half words between curr and where you want to go
* Forward reference problem: branching to a label defined further down
* Solution: Take 2 passes
* Pass 0: Replace pseudo-instructions
* Pass 1: Remember position of label
* Pass 2: Use label positions to generate code
* Pc-relative jumps (jal, beq, bne)
* `j offset` expands to `jal zero, offset`
* Count half-words between target and jump for position-independent code (PIC)
* PIC is good - relative to location
* Static data refs:
* `lui` `addi` require full 32b address
* `auipc` `addi` uses PIC
* Symbol table
* List of items that can be used by other files
* Labels: function calling
* Data: `.data` variables can be accessed across files
* Relocation table
* List of items whose address the file needs
* Absolute labels (`jal, jalr`)
* Static data (`la`)
* Object file format (ex. ELF)
* Object file header: size, pos of other pieces
* text segment: machine code
* Data segment: static data (in binary)
* Relocation info: Has lines of code that need to be fixed later
* Symbol table: List of the file's labels and static data that can be referenced
* Debugging info
## Linker
* Input: Object code files, info tables (`foo.o`)
* Output: Executable code (`a.out`)
* Combines several object files into a single executable
* Throws error for duplicate or missing symbols
* Types of addresses
* PC-Relative (beq, bne, jal, auipc/addi): Never need to relocate
* Absolute or external func address (`auipc`/`jalr`): Always relocate
* Static data reference (lui/addi): Always relocate
* Relocation editing:
* J-format: jal address
* I,S format: static pointers
* No conditional branches
* Text starts at `0x0001_0000`
* Statically-linked approach: include entire library even if not all of it is used (self-contained)
* Dynamically-linked (DLL): include library at runtime
* Smaller storage, requires less memory
* Requires time to link at runtime
* Link at machine code level
## Loader
* Input: Executable code (`a.out`)
* Output: program run
* Part of the OS
* Read header to know size of text and data
* Create address space for text, data, stack
* Copy instructions and data from executable file to new address space
* Copy CLI args to stack
* Initialize machine registers (mostly cleared)
* Jump to start-up routine that copies program args from stack to register, set PC
# Lecture 14: 9/28/20
* Synchronous digital system
* Synchronous: operations coordinated by a central clock
* Digital: Values represented by discrete values
* Transistor: semiconductor device to amplify or switch signals
* CMOS: Complementary metal oxide semiconductor
* Drain, Gate, Source
* Current flows from source to drain depending on the gate
* Can create AND, OR, NOR out of NAND
* Asserted switch = closed
![](https://i.imgur.com/Cnk1daj.png)
* Signals are transmitted instantly and continuously
* Only contains 1 value at a time
* [Block] propagation delay: Delay between clock start and when you can see the output
* Combinational Logic (CL) circuits
* Output is a function of inputs
* State elements
* circuits that store information (registers, cache)
# Lecture 15: 9/30/20
* State can store values for time (registers, caches)
* Control flow of info between logic blocks
* Register = n "flip-flops" aka D-type Flip-Flop
* D: data, q: output
![](https://i.imgur.com/MUTg8iv.png)
* At rising edge of the clock, sample input d.
* After a time delay, output q
* Setup time: Time before rising edge that input must be stable
* Hold time: Time after rising edge that input must be stable
* Clk-to-q delay: Time until q gets updated and is stable
![](https://i.imgur.com/ruN1tg7.png)
* Max delay = CLK-to-Q delay + CL delay + Setup time (worked example at the end of this lecture's notes)
* Extra registers added to speed up clock rate
* Clock period limited by propagation delay of adder/shifter
* Finite State Machine: Represent with a state transition diagram
* Use CL + register
* $S_{i-1}$ is the stable output of a register which holds result of $i-1$ iteration
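
A worked instance of the max-delay formula above, with hypothetical component delays:

$$t_{clk\text{-}q} = 1\,ns,\ t_{CL} = 5\,ns,\ t_{setup} = 1\,ns \implies T \geq 7\,ns,\ f_{max} = \frac{1}{7\,ns} \approx 143\,MHz$$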
# Lecture 16: 10/2/20
* XOR: 1 if odd number of 1s
* Boolean algebra: + is OR, $\cdot$ is AND, $\bar{x}$ is NOT
* Circuit verification: If they are equivalent in boolean algebra
* Uniting theorem: $xy + x = x, (x+y)x = x$
* DeMorgan's Law: $\overline{x \cdot y} = \bar{x} + \bar{y}$, $\overline{x + y} = \bar{x} \cdot \bar{y}$
* Distribute the "bar" to every term and swap the operator
* Sum-of-products: ORs of ANDs
* OR together one AND (product) term per input combination whose output is 1, complementing the inputs that are 0 in that combination
# Lecture 17: 10/5/20
Multiplexor
![](https://i.imgur.com/5GV4C6T.png)
Arithmetic and Logic Unit (ALU)
* 00: A + B
* 01: A - B
* 10: A & B
* 11: A | B
![](https://i.imgur.com/kt4n1zC.png)
![](https://i.imgur.com/5gd2nMB.png)
## Adders
* LSB: $s_0 = a_0 \oplus b_0$, $c_1 = a_0 b_0$
* ith bit: $s_i = a_i \oplus b_i \oplus c_i$, $c_{i+1} = MAJ(a_i, b_i, c_i) = a_ib_i + b_ic_i + a_ic_i$ (C sketch at the end of this lecture's notes)
* N 1-bit adders cascaded form one N-bit (ripple-carry) adder
* For unsigned numbers, existence of carry indicates overflow
* For signed numbers, carry is not indicative
* For highest adder: overflow = $c_n \oplus c_{n-1}$
* No c_out, no c_in: no overflow
* c_in and c_out: no overflow
* c_in but no c_out: overflow (A, B > 0)
* c_out but not c_in: overflow
* Subtractor: A-B = A + (-B)
* Does an XOR to flip the bits
* if SUB is high, then we just add that 1
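
A bit-level C sketch of the adder/overflow equations above (an illustrative 8-bit ripple-carry model, not how real hardware is described):
```cpp=
#include <stdio.h>
#include <stdint.h>

// s_i = a_i XOR b_i XOR c_i;  c_{i+1} = MAJ(a_i, b_i, c_i)
uint8_t ripple_add(uint8_t a, uint8_t b, int *overflow) {
    uint8_t sum = 0;
    unsigned c = 0, c_prev = 0;
    for (int i = 0; i < 8; i++) {
        unsigned ai = (a >> i) & 1, bi = (b >> i) & 1;
        sum |= (uint8_t)((ai ^ bi ^ c) << i);
        c_prev = c;                           // carry into this bit
        c = (ai & bi) | (ai & c) | (bi & c);  // majority: carry out
    }
    *overflow = (int)(c ^ c_prev);  // signed overflow = c_n XOR c_{n-1}
    return sum;
}

int main(void) {
    int ovf;
    uint8_t s = ripple_add(100, 100, &ovf);
    printf("sum=%d overflow=%d\n", (int8_t)s, ovf);  // sum=-56 overflow=1
    return 0;
}
```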
# Lecture 18: 10/7/20
* Hardware is inherently parallel
* CPU - active part of computer that does work (data manipulation and decision making)
* Datapath: Part of processor that has hardware to perform operations
* Computer executes 1 instruction per tick
* State outputs drive inputs to combinational logic to make next state
* All state elements updated at rising clock edge
* Problem: single monolithic block is too bulky
* Solution: Break up process of executing instruction into steps
* Stage 1: Instruction Fetch (IF)
* Stage 2: Instruction Decode (ID)
* Stage 3: Execute (EX) ALU
* Stage 4: Memory Access (MEM)
* Stage 5: Write Back to Register (WB)
* All 5 stages occur within 1 clock cycle (rising edge fetch, write on the next rising edge)
![](https://i.imgur.com/WjbpPlw.png)
![](https://i.imgur.com/s9VxNJP.png)
# Lecture 19: 10/9/20
See lecture diagrams
![](https://i.imgur.com/g39nwYV.png)
# Lecture 20: 10/12/20
* Control and status register (CSR): diff from register file
* Monitor status and performance
* Up to 4096 CSRs
* Necessary for peripherals
* `csrrw` swaps the value of a CSR and an integer register
* `csrw csr, rs1` is `csrrw x0, csr, rs1`
* rd=x0 means just write rs1 to CSR
* `csrwi csr, uimm` is `csrrwi x0, csr, uimm`
* rd=x0 just writes uimm to CSR
![](https://i.imgur.com/1eNBk2A.png)
* LSB is stable first
* Critical path lw = $t_{clk-q} + t_{IMEM} + t_{Imm} + t_{ALU} + t_{DMEM} + t_{mux} + t_{setup}$
* Certain instructions take longer
* Most blocks will be idle for most of the time
* ROM: Read Only Memory
![](https://i.imgur.com/vUyZdhh.png)
# Lecture 21: 10/14/20
* Performance is often quoted as maximum clock frequency, but better measures exist
* Performance measures:
* Program execution time
* Throughput (num of server requests handled per hour)
* Energy per task
* Iron law of processor performance (worked example below):
* $\frac{\text{Time}}{\text{Program}} =\frac{\text{Instructions}}{\text{Program}} * \frac{\text{Cycles}}{\text{Instruction}} * \frac{\text{Time}}{\text{Cycle}}$
* CPI: Cycles per instruction
* Instructions per program
* Determined by task, algorithm, language, compiler, ISA
* Average CPI:
* Determined by ISA, processor implementation, complex instructions (like strcpy, CPI >> 1), superscalar processors (CPI < 1)
* Time per cycle:
* Processor microarchitecture (critical path), technology, power budget
* 30% of energy is leaked in CMOS
* $\frac{\text{Energy}}{\text{Program}} =\frac{\text{Instructions}}{\text{Program}} * \frac{\text{Energy}}{\text{Instruction}}$
* $\frac{\text{Energy}}{\text{Program}} \propto \frac{\text{Instructions}}{\text{Program}} * CV^2$
* Capacitance depends on technology and processor features
* Supply voltage (run slower with less supply voltage)
* Energy Iron Law: Performance = Power * Energy Efficiency
* Performance: tasks/second
* Energy efficiency: tasks/joule
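
A worked instance of the iron law with hypothetical numbers ($10^9$ instructions, average CPI 1.5, 1 GHz clock):

$$\frac{\text{Time}}{\text{Program}} = 10^9\ \text{inst} \times 1.5\ \frac{\text{cycles}}{\text{inst}} \times 10^{-9}\ \frac{\text{s}}{\text{cycle}} = 1.5\ \text{s}$$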
## Pipelining
* Does not help latency of single task
* Helps throughput of entire workload
* Multiple tasks operate simultaneously using diff resources
* (theoretical) Speedup = num of pipe stages
* Time to fill and drain pipeline in reality
* Pipeline rate is limited by slowest pipeline stage
# Lecture 22: 10/16/20
* Put registers in between the 5 stages to pipeline
![](https://i.imgur.com/qft6eUl.png)
* Time to process 1 instruction can be longer so worse latency, but better throughput
* Clock rate can be dramatically higher
* Pipelining hazards: situations that prevent starting the next instruction in the next clock cycle
* Structural hazard: Required resource is busy (needed in multiple stages)
* 2+ instructions compete for single physical resource
* Sol1: Instructions take turns, causing some to stall
* Sol2: Add more hardware (can always solve)
* 2 independent read ports and 1 write port
* 2 separate memories (instruction cache and data cache)
* Data hazard: Data dependency between instructions, need to wait for prev inst to complete its read/write
* Write in the first half of cycle, read in the second half of cycle (registers are fast)
* Sol1: Stall using bubble (`nop`) when it depends on previous inst, but reduces performance
* Compiler will try to rearrange code that is not dependent
* Sol2: Forwarding/bypassing - grab operand from pipeline stage rather than register file (output of ALU to input of ALU)
![](https://i.imgur.com/98F1UjL.png)
* Control hazard: Flow of execution depends on previous instruction (for branches)
# Lecture 23: 10/18/20
## Load Data Hazard
* Load first, then do operations on it
* Load is only available after DMEM, but needed 1 cycle earlier
* Sol 1: Load delay slot: Need 1 stall, convert next command into a nop
* Just set `RegWEn=0` and `MemRW=Read` (no state written), and do not update the PC
* Cycle after that execute the original command
* Forward from DMEM output to ALU
* Sol2: Put unrelated instruction into load delay slot
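
A sketch of the two solutions above (registers illustrative):
```assembly
# Hazard: loaded value used immediately; needs one bubble even with forwarding
lw   x10, 0(x13)
add  x11, x10, x12   # must stall: x10 arrives from DMEM too late

# Sol2: fill the load delay slot with an independent instruction
lw   x10, 0(x13)
addi x14, x14, 4     # unrelated work hides the load latency
add  x11, x10, x12   # DMEM output can now be forwarded in time
```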
## Control Hazards
* Branches are conditional, so the next instruction isn't known until the branch resolves
* Earliest to know branch is at end of execution (ALU)
* 2 subsequent instructions are executed regardless of the branch outcome
* Sol1: Use 2 stall cycles after a branch
* Sol2: If branch/jump taken, convert the 2 next to `nop`
* Use branch prediction to guess
* Branches usually guard loops, where the exit branch is rarely taken
* Ex. Keep a bit of if the branch was taken last time
* Moving branch comparator to ID stage would incur data hazard and forwarding could fail
## Superscalar Processors
* Clock rate limited by tech and power dissipation
* Deeper pipelines (10, 15 stages)
* Less work per stage, so shorter clock cycle
* More potential for hazards (CPI > 1)
* Multi-issue superscalar
* Replicate pipeline stages: multiple pipelines (CPI < 1)
* Multiple execution units, start multiple instructions per clock cycle
* Out of order execution to reduce hazards
* Arrive at commit unit which reorders
* Get CPI using iron law (time program, count instructions, know time/cycle)