# CS 61C Notes

# Lecture 1: 8/26/20
Intro class, no notes

# Lecture 2: 8/28/20
* Refer to table for easy hex to binary conversion
* Numerals have infinite preceding zeroes
* Numerals are representations of numbers
* Numbers are theoretical
* For a set number of bits, if the result of addition (or sub, mul, div) cannot be represented, there is overflow
    * Not underflow; that would be negative overflow

## Number representations
* Positive numbers look the same in all of the representations below
* Sign and magnitude
    * Leftmost bit is the sign (0 is +, 1 is -)
    * Problem 1: Making a number more negative actually makes its binary representation bigger
    * Problem 2: There are 2 zeroes (+0 and -0)
    * Problem 3: Complicated circuitry
* One's complement
    * Flip all the bits for a negative number
    * Problem: there are still 2 zeroes (all 0s and all 1s)
    * Solution: shift everything over by 1 (leads to two's complement)
* Two's complement
    * Flip all bits and add $1$
    * Or: give the leftmost bit the weight $-2^{N-1}$ (all other bits keep positive weights)
    * All 1s represents $-1$
    * $2^{N-1}$ non-negatives, $2^{N-1}$ negatives
    * There is 1 more negative number than positive
* Bias encoding
    * Add a bias to bring the center of the range down to zero
    * Bias for N bits chosen as $-(2^{N-1}-1)$

# Lecture 3: 8/31/20
* ENIAC: First Electronic General-Purpose Computer
    * Long time to program (patch cords and switches)
* EDSAC: First General Stored-Program Computer
    * Programs held as numbers/bits in memory
* Compile changed source files into .o files (machine code object files), which are then linked into a .out file (machine code executable file)
    * Executables must be rebuilt on each new system (porting)
    * Parallel compiling: `make -j` to rebuild changed pieces in parallel
    * Amdahl's Law: Linking is sequential, forming a bottleneck
* C Pre-Processor (CPP): directives begin with `#`
    * Ex. `#define` or `#include`
    * Macros: small function-like definitions whose text replaces every use below them
    * However, this means a macro argument's expression can be evaluated multiple times
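A minimal sketch of that pitfall; the `SQUARE` macro and `counter` function are hypothetical examples, not from lecture:

```cpp=
#include <stdio.h>

// text substitution: SQUARE(x) becomes ((x) * (x)) before compilation
#define SQUARE(x) ((x) * (x))

int counter(void) {
    static int n = 0;
    return ++n;   // side effect: increments on every call
}

int main(void) {
    // expands to ((counter()) * (counter())): the argument runs twice
    printf("%d\n", SQUARE(counter()));   // prints 2 (1 * 2), not 1
    return 0;
}
```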
## C vs Java
C main function
```cpp=
#include <stdio.h>

int main(void) {
    return 0;
}
```

| | C | Java |
| -------- | -------- | -------- |
| Type of language | Function oriented | Object oriented |
| Programming unit | Function | Class = Abstract Data Type |
| Compilation | `gcc hello.c` | `javac Hello.java` |
| Execution | `a.out` loads and executes | `java Hello` interprets |
| Storage | Manual (`malloc`, `free`) | `new` allocates & initializes, automatic garbage collection |
| Comments | `/* ... */` | `/* ... */` or `//` |
| Constants | `#define`, `const` | `final` |
| Preprocessor | Yes | No |
| Variable declaration | Beginning of block (C99 same as Java) | Before you use it |
| Variable naming conventions | under_scores | camelCase |
| Access a library | `#include <stdio.h>` | `import java.io.File;` |

* ANSI C was updated to C99 (a.k.a. C9x)
    * Added variable-length arrays
    * Added fixed-width int types and booleans
* C11 and C18 are newer fixes
    * Multi-threading, Unicode, removed `gets()`
* Passing args into main: `int main(int argc, char *argv[])`
    * `argc` contains the number of strings on the command line
        * The executable counts as one, plus one for each argument
    * `argv` is a pointer to an array containing the arguments as strings

## C Basics
* Booleans
    * False: `0` (int), `NULL` (a real boolean type was only added in C99)
    * True: everything else
* Types: `int`, `unsigned int`, `float`, `double`, `char`, `long`, `long long`
    * `long` is a longer int (at least 32b), `long long` is even longer (at least 64b)
    * `int` is at least 16b
    * `float` is a floating point decimal; `double` is equal or higher precision floating point
    * Preferable to use `intN_t` and `uintN_t` for exact widths
* Constants: `const int days_in_week`
* Enums: `enum color {RED, GREEN, BLUE};`
* Typedef: `typedef uint8_t BYTE;` creates custom type names
* Struct: custom structured groups of variables

```cpp=
typedef struct {
    int length_in_seconds;
    int year_recorded;
} SONG;
```

* Switch cases fall through (keep evaluating) until a `break` is seen
* `goto` is nonlocal control flow, reserved for advanced use cases
* Includes share one namespace, so it is `atan`, not `math.atan`
* Never make a while loop's termination depend on a floating point equality test

# Lecture 4: 9/2/20
## Pointers
* Variables have undefined values if not initialized in the declaration
* Heisenbugs: bugs from undefined behavior that appear nondeterministically
* An address refers to a memory location (not its value)
* A pointer contains the address of a memory location
* Syntax:
    * `int *p;`: p is a pointer (an address of an int)
    * `p = &y;`: assign the address of y to p
* Dereferencing:
    * `z = *p;`: assign the value at the address in p to z
    * `*p = z;`: assign the value of z to what p is pointing to
* Must initialize pointers first
    * `int *ptr; *ptr = 5;` does not work since `ptr` was never given a valid address
* Benefits: cleaner code, and passing a pointer to a large struct/array is faster than copying it
* Drawback: bugs (dynamic memory management, dangling references, memory leaks)

## Pointer Usage
* Pointers can point to any data type
* `void *` is a generic pointer (not recommended)
* Pointers can point to functions
    * `int (*fn)(void *, void *) = &foo;`
    * fn is a pointer to a function that accepts 2 `void *` pointers and returns an int; initially it points to function `foo`
    * `(*fn)(x, y)` is equivalent to `foo(x, y)`
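A minimal sketch of the function-pointer syntax above; `cmp_int` is a hypothetical function, not from lecture:

```cpp=
#include <stdio.h>

// a function matching the signature described above
int cmp_int(void *a, void *b) {
    return *(int *)a - *(int *)b;   // interpret both pointers as int*
}

int main(void) {
    int x = 3, y = 7;
    int (*fn)(void *, void *) = &cmp_int;  // fn initially points to cmp_int
    printf("%d\n", (*fn)(&x, &y));         // same as cmp_int(&x, &y): -4
    return 0;
}
```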
* Struct arrow notation: `int h = paddr->x;` is equivalent to `int h = (*paddr).x;`
    * Arrow notation is more commonly used
* Struct assignment (`p1 = p2`) copies all the values (every field), not pointers
* NULL pointer: the pointer of all 0s
    * Writing to or reading from a null pointer causes a crash
    * Since 0 is false, a simple test for a non-null pointer is `if (q)`
* The type declaration tells how many bytes to fetch on each access through the pointer
    * A 32-bit int is stored in 4 consecutive 8-bit bytes
* Word alignment: words can only be accessed at 4-byte boundaries
* `sizeof(type)` returns the number of bytes in the object
    * By the C99 definition, `sizeof(char) == 1`

## Arrays
* A contiguous block of memory
* Declaration: `int ar[] = {1, 2};`
* Arrays are almost identical to pointers
    * `char *string` and `char string[]` are nearly identical declarations
    * Cannot increment an array name
* `ar[0]` is the same as `*ar`; `ar[2]` is the same as `*(ar+2)`
* Declared arrays are only allocated while their scope is valid
* Declare a constant like `int ARRAY_SIZE` instead of hardcoding the size
* An array does not know its own length, and bounds are not checked
    * Exception: strings, where the null terminator signifies the end
* Common errors
    * Segmentation fault: read/write to memory you don't have access to
    * Bus error: alignment is wrong (bad word alignment)
* `pointer + n` adds `n * sizeof(type)` to the address
* Handle: pointer to a pointer (`**h`)
    * Needed so a function can advance the caller's pointer into an array
    * Typically no more than 2 stars

# Lecture 5: 9/4/20
## Dynamic Memory Allocation
* `sizeof` gives size in bytes
    * Also knows the size of (declared) arrays, in bytes
* Allocate room to point to: `malloc()`
    * Ex. `ptr = (int *) malloc(n * sizeof(int));`
    * This will cast the `void *` returned by malloc to `int *`
    * A `malloc`ed location contains garbage initially
* Dynamically allocated space must be freed
    * When main returns it is auto-freed, but relying on that is bad practice
* Memory leak: allocated memory that is never freed while the program runs (e.g., the last pointer to it is lost)
* Malloc returns `NULL` if the request cannot be granted
* Common errors
    * `free()`ing the same piece of memory twice
    * `free()`ing something that you didn't `malloc()`
* Resizing
    * `realloc(p, size)` returns a new pointer (`NULL` if it fails)
    * `realloc(ip, 0)` is the same as `free(ip)`
* Array names are not variables
    * `*a`: first value of the array; `a`: pointer to the array; `&a`: pointer to the array (just like `a`)
* 3 rules of pointers
    * 1) Set up the pointee (separate from the pointer)
    * 2) Dereference a pointer to access its pointee
    * 3) Assignment (`=`) between pointers makes them point to the same pointee

## Linked List
```cpp=
struct Node {
    char *value;
    struct Node *next;
};
typedef struct Node *List;  // List is a pointer to a struct Node

List ListNew(void) { return NULL; }  // Create a new empty list
```
* When you `malloc` space for a string, you need `strlen(string) + 1` bytes for the null terminator `\0`
* `strcpy(destination, original)`
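A sketch of how a node might be added to the front of this list, using the `strlen(string) + 1` and `strcpy` points above; `ListPrepend` is my own name, and the declarations are repeated so the sketch is self-contained:

```cpp=
#include <stdlib.h>
#include <string.h>

struct Node {
    char *value;
    struct Node *next;
};
typedef struct Node *List;

// allocate a new node holding a copy of s, and make it the new head
List ListPrepend(List list, const char *s) {
    struct Node *node = malloc(sizeof(struct Node));
    if (node == NULL) return list;             // malloc can fail
    node->value = malloc(strlen(s) + 1);       // +1 for the '\0'
    if (node->value == NULL) {
        free(node);
        return list;
    }
    strcpy(node->value, s);                    // copy into the new space
    node->next = list;                         // old head follows new node
    return node;                               // new head of the list
}
```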
## Memory locations
* A struct declaration does NOT allocate memory
* A variable declaration does allocate memory
* Methods of allocating memory
    * Declaration of a local variable
    * Dynamic allocation at runtime, like `malloc`
    * Global variable (declared above `main()`), which has global scope
* 3 pools of memory
    * 1) Static storage: global variable storage (permanent)
    * 2) Stack: local variable storage, params, return address
    * 3) Heap: dynamic malloc storage; data lives until deallocated by the programmer
* Program address space
    * Stack: local vars, grows downward
        * Big stack: deep recursion/fractals
    * Heap: space from malloc, grows upward
        * Big heap: just big mallocs without deep function calls
    * Static data: declared outside of main. Does not grow/shrink
    * Code: loaded when the program starts. Does not grow/shrink

![](https://i.imgur.com/cBqquil.png)

### Stack
* Stack frame (LIFO)
    * Return instruction address, params, space for other local variables
    * Frames are always contiguous blocks of memory; the stack pointer tells where the top stack frame is
    * When a procedure ends, its stack frame is tossed off the stack, freeing memory for future stack frames
    * Memory isn't erased even when no longer needed; the stack pointer just moves back up

### Heap
* Large pool of memory, not allocated in contiguous order
* Specify the number of bytes to allocate in `malloc`
* Tricky to manage, unlike the other pools
    * Want `malloc()` and `free()` to be fast with minimal memory overhead
    * Want to avoid (external) fragmentation: free memory separated into small chunks
* K&R implementation
    * Each block of memory is preceded by a header with the size of the block and a pointer to the next block
    * Free blocks are kept in a circular linked list; the pointer field is unused in allocated blocks
    * `malloc()`: search the free list for a block large enough
    * `free()`: check if the blocks adjacent to the freed block are also free
        * If so, coalesce them into 1 bigger free block; otherwise add the newly freed block to the free list
    * `free` is very quick; `malloc` can search the full list in the worst case
* `malloc` block choosing
    * Best-fit: choose the smallest block that is big enough
    * First-fit: choose the first block that is big enough
    * Next-fit: same as first-fit, but resume searching where we finished last time
        * Distributes the small slivers of leftover memory

## When Memory Goes Bad
* Pointer problems
    * Dangling reference: using a pointer before `malloc` has set it up
    * Writing off the end of arrays
    * Cannot return a pointer to a variable in a local frame
    * Use after free
    * Stale pointers after `realloc` moves the data
    * Freeing the wrong thing (e.g., an incremented pointer)
    * Double free
    * Memory leaks: tardy free, losing the pointer
    * Incrementing the only pointer to a block
* Valgrind catches memory leaks, misuses of free, and writing over the end of arrays
    * Helpful for debugging

# Lecture 6: 9/9/20
* Very large/small numbers and fractional numbers need to be stored
* Agree where to put the binary point to separate the integer and fractional parts
    * Store where the binary point is in a different variable
    * The binary point can 'float' since it is no longer fixed
* Normalized form: no leading 0s (like scientific notation)

## Floating Point Representation
* The first 1 is not recorded (since it is always 1 in normalized form)
* Sign: 1 bit (uses sign-magnitude form)
* Exponent: 8 bits
* Significand: 23 bits (the LSB represents $2^{-23}$)
* Can represent a range of about $(1.2 \times 10^{-38}, 3.4 \times 10^{38})$ in magnitude
* Overflow: the exponent is too large for 8 bits ($> 3.4 \times 10^{38}$ or $< -3.4 \times 10^{38}$)
* Underflow: the negative exponent is too large for 8 bits
    * Ends up storing a small fraction as 0
* More precision: double (52-bit significand)
* The stored significand (fraction field) is always between 0 and 1 (for normalized numbers)
* Representing 0: exponent and significand fields of 0
    * 0 has no leading 1
    * Results in 2 zeroes because of the sign bit
* Biased exponent representation (IEEE Standard 754)
    * 2's complement exponents would make numbers with negative exponents look bigger
    * Bias: 127
    * $(-1)^{\text{Sign}} \times (1 + \text{Significand}) \times 2^{\text{Exponent} - 127}$

![](https://i.imgur.com/hyKuS0x.png)
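A sketch that pulls the three fields out of a `float` using the layout and bias just described (the bits are copied out with `memcpy`, since C's aliasing rules forbid a direct pointer cast):

```cpp=
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float f = -0.75f;                 // = (-1)^1 * (1 + 0.5) * 2^(126 - 127)
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   // reinterpret the 32 bits safely

    uint32_t sign        = bits >> 31;           // 1 bit
    uint32_t exponent    = (bits >> 23) & 0xFF;  // 8 bits, biased by 127
    uint32_t significand = bits & 0x7FFFFF;      // 23 bits, implicit leading 1

    printf("sign=%u exp=%u (unbiased %d) sig=0x%06X\n",
           sign, exponent, (int)exponent - 127, significand);
    // prints: sign=1 exp=126 (unbiased -1) sig=0x400000
    return 0;
}
```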
## Special Numbers
* How do we represent $\pm$ infinity?
    * Use exponent 255 with a 0 significand
* Representing NaN
    * Use exponent 255 with a nonzero significand
    * NaNs contaminate results as they percolate up
    * $2^{23} - 1$ possible NaNs can be used to specify each error (sqrt of a negative number, 0/0, etc.) and include which line it occurred on
* Problem of normalization and the implicit 1
    * Gap between FP numbers around 0
    * Smallest positive normalized number: $2^{-126}$
    * Second smallest: $(1 + 2^{-23}) \times 2^{-126} = 2^{-126} + 2^{-149}$
* Denorms: solve the implicit-1 problem
    * Exponent 0 with a nonzero significand
    * No implicit leading 1; implicit exponent of -126
    * Smallest: $2^{-149}$
    * Second smallest: $2^{-148}$
* Even steps until the normalized region, and then the stride doubles with each higher exponent
    * Have $2^{23} \approx$ 8 million slots between each power of 2
    * $2^{24} + 1$ is the smallest positive integer that cannot be represented, since the stride there is 2
* Example: $\frac{1}{3}$
    * Sign: 0
    * Exponent: $-2 + 127 = 125 = 0b01111101$
    * Significand: $0b010101\ldots$

## Fallacies
* Associativity is lost
    * Small numbers can be lost when added to big numbers
* Precision: the count of bits used to represent a value
* Accuracy: the difference between the actual value and its computer representation
    * High precision permits high accuracy but does not guarantee it
* Rounding: keep 2 extra bits of precision while computing in order to round
* Converting double to single precision (or FP to int), the rounding modes:
    * 1) Round to $+\infty$
    * 2) Round to $-\infty$
    * 3) Truncate (round to 0)
    * 4) Unbiased: round to the nearest, ties go to the even number
        * Decimal examples: 2.5 rounds to 2, 3.5 rounds to 4
* FP addition
    * 1) De-normalize to match exponents
    * 2) Add significands
    * 3) Keep the same exponent
    * 4) Normalize again
* Casting
    * `int -> float -> int` is not always the same
    * `float -> int -> float` is not always the same

## Alternate FP Representations
* float: standard single precision (8 exp, 23 sig)
* binary64: double precision (11 exp, 52 sig)
* Quad precision: binary128 (15 exp, 112 sig)
* Oct precision: binary256 (19 exp, 236 sig)
* Half precision: binary16, or fp16 (5 exp, 10 sig)
* bfloat16 (brain FP): (8 exp, 7 sig)
* unum: tells how many bits are for exp/sig (customizable)
    * The u-bit indicates if the number is exact

# Lecture 7: 9/11/20
* A CPU executes a lot of instructions
* Instruction Set Architecture (ISA): the set of instructions a CPU implements
* RISC: keep the instruction set small and simple in order to build fast hardware

## RISC-V Basics
* Registers instead of variables
* Registers are inside the processor (not memory)
* Predetermined number of registers in hardware (32)
* Each register is 32 bits wide in the 32-bit variant of RISC-V
* Numbered from `x0` to `x31`
    * `x0` is special, always holds the value 0
* Registers can be referred to by number or name
* Registers have no type
* Comments start with `#`
* Instruction: one statement/command
    * Each line has at most 1 instruction
    * One operation per instruction

## RISC-V add/sub
* `add x1, x2, x3`
    * First: the operation, by name
    * Second: the operand getting the result (destination)
    * Third: the first source operand
    * Fourth: the second source operand
* Break longer sequences into multiple instructions
* Immediates: numerical constants
    * `addi x3, x4, 10`
    * No `subi`, just `addi` a negative value
* Use `x0` for the number zero instead of an immediate

# Lecture 8: 9/14/20
* Some registers have designated uses (usually < 30 registers freely available)
* An optimizing compiler minimizes register usage
    * Register footprint: the number of registers needed for a compute kernel
* Memory is like a large 1D array
    * The processor gives an address (an offset to a base pointer) to a point in memory
    * Load from memory (read) and store to memory (write)
* Memory addresses are in bytes
    * Word addresses are 4 bytes apart
* Little-endian: the least significant byte is at the smallest address
    * Bits are always stored with the most significant in the upper position
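The little-endian layout above can be checked from C by looking at the first byte of a word; a minimal sketch:

```cpp=
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t word = 0x11223344;
    uint8_t *byte = (uint8_t *)&word;   // view the word one byte at a time

    // on a little-endian machine the least significant byte (0x44)
    // sits at the lowest address
    printf("byte[0] = 0x%02X -> %s-endian\n", byte[0],
           byte[0] == 0x44 ? "little" : "big");
    return 0;
}
```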
* DRAM: Dynamic Random Access Memory (fast, but not as fast as registers; medium capacity)
    * Registers are 50-500x faster than memory

## Loading from and Storing to Memory
* Load word: `lw destination, offset(base_ptr)`
    * The offset is in bytes (a multiple of 4)
    * Data flow moves from right to left
    * C code: `int A[100]; g = h + A[3];`
    * Equivalent: `lw x10, 12(x15)` then `add x10, x12, x10`
* Store word: `sw source, offset(base_ptr)`
    * Data flow moves from left to right
    * `int A[100]; A[10] = h + A[3];`
        * `lw x10, 12(x15)`
        * `add x10, x12, x10`
        * `sw x10, 40(x15)`
* Load byte/store byte: `lb`, `sb` for byte data transfers
    * `lb x10, 3(x11)` copies the content at address `x11 + 3` to the low byte position of `x10`
        * Sign-extends the sign bit to all higher bits
    * Unsigned byte load: `lbu`
        * No sign-extension needed
    * No `sbu` (nonsensical)
* `addi` achieves what a `lw`/`add`/`sw` sequence (with the constant kept in memory) would, but `addi` is much faster
* Immediates are limited in size (they must fit in the instruction)
    * Longer constants need to be loaded from memory

## Decision Making
* If statement: `beq reg1, reg2, L1`
    * Go to the statement labeled L1 if (value in reg1) == (value in reg2)
    * Go to the next statement otherwise
    * Counterpart: `bne`, branch not equal
* Conditional branches:
    * Branch if less than: `blt` (unsigned: `bltu`)
    * Branch if greater than or equal: `bge` (unsigned: `bgeu`)
* Unconditional branch (always taken)
    * Jump: `j label`
* `==` usually translates to `bne` so that the next instruction executes if the condition is true
    * Need a jump like `j Exit` at the end of the `bne` block
* No `bgt`, `ble` (swap the operands instead)
* Need to increment memory pointers by the number of bytes

# Lecture 9: 9/16/20
* Logical instructions: isolate or pack bytes/nibbles into words
    * 2 variants: register and immediate
    * No logical not; just xor with all 1s
* Logical shifting:
    * Shift left logical: `sll` (e.g., shift 2 bits left and insert 0s on the right; bits fall off the left end)
    * Shift left logical immediate: `slli`
    * Shift right logical: `srl` fills the left with 0s
* Arithmetic shifting: the logical shift for signed ints
    * Shift right arithmetic (`sra`, `srai`): moves n bits to the right, inserting copies of the sign bit into the emptied bits
    * Not equivalent to dividing by $2^n$ (fails for odd negative numbers)
        * The C standard is that division should round toward 0

## Registers
* Programs are stored in memory
    * The program lives in one area and data lives in another (in practice, the program sits below the data)
* 1 RISC-V instruction is 32 bits
* Instructions are fetched from memory using the Program Counter (PC)
    * The next instruction is 4 bytes away (increment PC by 4)
* Symbolic register names
    * `x10`-`x17` are `a0`-`a7` for params, with `a0`, `a1` for return values
    * `x1` is `ra` (return address to the point of origin)
    * `x8`-`x9` are `s0`-`s1`, and `x18`-`x27` are `s2`-`s11`: saved registers
    * `x0` is `zero`
* Pseudo-instructions: shorthand for common assembly
    * `mv rd, rs` = `addi rd, rs, 0` (move a value to another register)
    * `li rd, 13` = `addi rd, x0, 13` (load an immediate value into a register)
    * `nop` = `addi x0, x0, 0` (no operation; the processor just waits)

## Function Calls
* 6 fundamental steps
    * 1) Put arguments in a place where the function can access them
    * 2) Transfer control to the function
    * 3) Acquire the local storage resources needed by the function
    * 4) Perform the desired task of the function
    * 5) Put the return value where the calling code can access it, restore used registers, release local storage
    * 6) Return control to the point of origin
* Use `jr`, not `j`, since `jr` allows us to use a variable to return (`jr ra`)
* Jump and link: `jal rd, Label` will jump and save the return address (equivalent to an `addi` then a `j`)
    * Essentially: link, then jump
    * Pseudo-instruction: `ret` = `jr ra`
* `jalr rd, rs, imm`: jump and link register
* `j`, `jr`, `ret` are pseudo-instructions built on `jal`/`jalr` (with no return address saved, it is just a jump)
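A minimal sketch of a call to a leaf function using `jal` and `ret`; the labels `sum` and `main` are my own:

```assembly
sum:                 # a leaf function: returns a0 + a1
    add a0, a0, a1   # put the return value where the caller expects it
    ret              # pseudo-instruction for jr ra

main:
    li  a0, 3        # first argument in a0
    li  a1, 4        # second argument in a1
    jal ra, sum      # link: save PC+4 in ra, then jump to sum
    # execution resumes here after ret, with 7 in a0
```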
# Lecture 10: 9/18/20
* Use the stack to save old register values before calling a function
* Stack pointer: `sp` (`x2`)
* The stack grows downward (the stack address decreases when we push)
    * Push decrements `sp`, pop increments `sp`
* A stack frame includes:
    * The return instruction address
    * Params (args)
    * Space for local vars
* The stack pointer points to the bottom of the stack frame

```assembly
addi sp, sp, -8  # adjust stack for 2 items
sw s1, 4(sp)     # save s1 for use after the call
sw s0, 0(sp)     # save s0
...
lw s0, 0(sp)     # restore s0 for the caller
lw s1, 4(sp)     # restore s1
addi sp, sp, 8   # adjust stack to delete 2 items
jr ra            # return (equivalently, ret)
```

## Nested function calls
* `ra` would be overwritten by future calls
* Register conventions: a set of rules for which registers are unchanged after `jal`
    * Preserved: `sp`, `gp`, `tp`, `s0`-`s11` (`s0` is also `fp`)
    * Not preserved: `a0`-`a7`, `ra`, `t0`-`t6` (argument and temporary registers)
    * The caller must save the arg/return-value registers, temporaries, and `ra` that it still needs
    * The callee must preserve the saved registers and `sp`

## Memory Allocation
* Types of storage classes:
    * Automatic: local to a function, discarded when the function exits
        * Uses the stack for local vars
        * Procedure frame/activation record: the segment of the stack with saved registers and local vars
    * Static: exists across procedures
* Stack: high end, at `0xBFFF_FFF0`
* Text segment at the low end: `0x0001_0000`
* Static data: `gp` points to the static area (`0x1000_0000`)
* Heap: above static, grows up toward high addresses
* The callee saves registers on the stack at the beginning of the function call and restores the stack right before returning

# Lecture 11: 9/21/20
* Everything has a memory address (instructions, data words)
    * Branches and jumps go to different sections
* One register keeps the address of the instruction being executed: the Program Counter (PC), a.k.a. Instruction Pointer (IP)
* Programs are distributed as binaries, but as different versions based on the device
    * Want backwards compatibility
* Data is in words (32-bit chunks)
    * Each register holds a word (`lw`, `sw` access 1 word of memory)
* Divide the instruction into fields; the formats are:
    * R: register-register arithmetic ops
    * I: register-immediate arithmetic ops, loads
    * S: stores
    * B: branches (minor variant of S)
    * U: 20-bit upper immediate instructions
    * J: jumps (minor variant of U)

## R-Format
![](https://i.imgur.com/tzDHERd.png)
* opcode: identifies the format
    * `0b0110011` for all R-format instructions
* funct7 + funct3: tell what operation to perform
* `rs1`, `rs2`: source registers; specify the registers with the first and second operands
* `rd`: destination register; the register that receives the result of the computation
* Each register field holds a 5-bit uint corresponding to a register number (`x0`-`x31`)

## I-Format
![](https://i.imgur.com/wttA7k3.png)
* `imm[11:0]` covers the range $[-2048, 2047]$
    * The immediate is sign-extended
* `opcode`: `0b0010011`
* Shift amounts are limited to 5 bits since larger shifts are meaningless
* Loads:
    * A 12-bit signed immediate for the memory address (the offset)
    * `rd` receives the value loaded from memory
    * `rs1` is the base address to load from

## S-Format
![](https://i.imgur.com/KzbMVst.png)
* A store needs 2 source registers (`rs1` for the base address, `rs2` for the data) but no `rd`
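To make the field layout concrete, a sketch that packs R-format fields into a 32-bit word (positions as in the diagram above); `encode_rtype` is my own helper:

```cpp=
#include <stdint.h>
#include <stdio.h>

// R-format, high bits to low: funct7 | rs2 | rs1 | funct3 | rd | opcode
uint32_t encode_rtype(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                      uint32_t funct3, uint32_t rd, uint32_t opcode) {
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

int main(void) {
    // add x1, x2, x3: funct7 = 0, funct3 = 0, opcode = 0b0110011
    printf("0x%08X\n", encode_rtype(0, 3, 2, 0, 1, 0x33));
    // prints 0x003100B3
    return 0;
}
```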
# Lecture 12: 9/23/20
## B-Format (SB)
![](https://i.imgur.com/5Icv0bD.png)
* Need to read 2 registers but don't write to a register
* Branches are used for loops (function calls and unconditional jumps are handled by jump instructions)
* Can only jump relative to the PC
    * The largest branch distance needed is limited by the size of the code
* PC-relative addressing: use the immediate as a 2's complement offset from the PC
    * 12 bits gives a reach of $\pm 2^{11}$ 32-bit instructions
    * Can only branch to the beginning of a word
* Branch calculation (word-scaled)
    * No branch: `PC = PC + 4`
    * Branch: `PC = PC + immediate*4`
* Support for 16-bit instructions means the branch reach is cut in half
    * Otherwise half of the 16-bit targets would be errors
    * $\pm 2^{10}$ 32-bit instructions
* Branch calculation (actual, half-word-scaled)
    * No branch: `PC = PC + 4`
    * Branch: `PC = PC + immediate*2`
* The 12-bit immediate field actually encodes a 13-bit offset whose LSB is always 0 (so it is not stored)
* `beq` with offset 0 is an infinite loop

## U-Format
![](https://i.imgur.com/y16TyRH.png)
* Need to branch to destinations farther than $2^{10}$ instructions away
    * Replace the `beq` with an inverted `bne` followed by a `j`
* 20-bit immediate for the upper 20 bits of a 32-bit word
* `lui`: load upper immediate
    * `lui` then `addi` sets all 32 bits
    * `addi` sign-extends, though, which can cause problems
        * Add 1 to the `lui` value to take that into account
    * `li`: pseudo-op that handles this (a `lui` and an `addi`)
* `auipc`: add upper immediate to PC

## J-Format (UJ)
![](https://i.imgur.com/zvYwHhQ.png)
* `jal` saves PC+4 to `rd` (the return address)
* The immediate is the offset, with the always-0 LSB removed
* PC-relative jump to $\pm 2^{19}$ locations $= \pm 2^{18}$ 32-bit instructions
* `jalr rd, rs, immediate` uses the I-format
    * Writes `rd = PC+4`, then `PC = rs + immediate`
    * A way to jump to an absolute address
    * Cannot assume the LSB is 0, so the immediate is not scaled and the range is reduced
* Use `auipc` + `jalr` to jump to a PC-relative 32-bit offset
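A sketch of the `lui`/`addi` fix-up described under U-Format above, using the arbitrary constant `0xDEADBEEF`, whose low 12 bits have bit 11 set:

```assembly
# goal: x10 = 0xDEADBEEF
# upper 20 bits are 0xDEADB, lower 12 bits are 0xEEF
# 0xEEF as a signed 12-bit value is -273, so addi will subtract;
# compensate by adding 1 to the lui constant: 0xDEADB + 1 = 0xDEADC
lui  x10, 0xDEADC     # x10 = 0xDEADC000
addi x10, x10, -273   # x10 = 0xDEADC000 - 0x111 = 0xDEADBEEF
# li x10, 0xDEADBEEF expands to this kind of pair
```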
# Lecture 13: 9/25/20
* Interpreter: directly executes a program in the source language
    * Higher level, better error messages
    * Slower (~10x), code smaller (~2x)
    * Machine-independent instructions (can run on any machine)
* Translator: converts a program from one source language to another
    * Translates to an executable; users don't need to see the source code
* Compilation, Assembly, Linking, Loading
* Compiler
    * Input: high-level language code (`foo.c`)
    * Output: assembly code (`foo.s`)
    * Can contain pseudo-instructions

## Assembler
* Input: assembly code (`foo.s`)
* Output: object code (`foo.o`)
* Reads and uses directives
    * Ex. `.text` (code), `.data` (static data), `.globl sym`, `.string str`, `.word w1...wn`
* Replaces pseudo-instructions, creates machine code, creates the object file
    * Needs to sign-extend immediates
    * Branch immediates count half-words
    * PC-relative branches/jumps: count the half-words between the current instruction and where you want to go
* Forward reference problem: branching to a label defined below
    * Solution: take 2 passes
        * Pass 0: replace pseudo-instructions
        * Pass 1: remember the positions of labels
        * Pass 2: use the label positions to generate code
* PC-relative jumps (`jal`, `beq`, `bne`)
    * `j offset` expands to `jal zero, offset`
    * Counting half-words between the jump and its target gives position-independent code (PIC)
    * PIC is good: everything is relative to the current location
* Static data references:
    * `lui` + `addi` require the full 32-bit address
    * `auipc` + `addi` is PIC
* Symbol table
    * A list of this file's items that can be used by other files
    * Labels: for function calling
    * Data: `.data` variables that can be accessed across files
* Relocation table
    * A list of items whose addresses this file needs
    * Absolute labels (`jal`, `jalr`)
    * Static data (`la`)
* Object file format (ex. ELF)
    * Object file header: size and position of the other pieces
    * Text segment: machine code
    * Data segment: static data (in binary)
    * Relocation info: lines of code that need to be fixed up later
    * Symbol table: a list of the file's labels and static data that can be referenced
    * Debugging info

## Linker
* Input: object code files with info tables (`foo.o`)
* Output: executable code (`a.out`)
* Combines several object files into a single executable
* Throws an error for duplicate or missing symbols
* Types of addresses
    * PC-relative (`beq`, `bne`, `jal`, `auipc`/`addi`): never need to relocate
    * Absolute or external function addresses (`auipc`/`jalr`): always relocate
    * Static data references (`lui`/`addi`): always relocate
* Relocation editing:
    * J-format: `jal` addresses
    * I-, S-format: static pointers
    * No conditional branches need editing
* Text starts at `0x0001_0000` (as in the address space layout above)
* Statically-linked approach: include the entire library even if not all of it is used (self-contained)
* Dynamically-linked (DLL): include the library at runtime
    * Smaller storage, requires less memory
    * Requires time to link at runtime
    * Links at the machine code level

## Loader
* Input: executable code (`a.out`)
* Output: a running program
* Part of the OS
* Reads the header to know the sizes of the text and data segments
* Creates an address space for text, data, and stack
* Copies instructions and data from the executable file into the new address space
* Copies CLI args onto the stack
* Initializes machine registers (mostly cleared)
* Jumps to a start-up routine that copies the program args from the stack to registers and sets the PC

# Lecture 14: 9/28/20
* Synchronous digital system
    * Synchronous: operations coordinated by a central clock
    * Digital: values represented by discrete values
* Transistor: a semiconductor device to amplify or switch signals
* CMOS: Complementary Metal-Oxide Semiconductor
    * Drain, gate, source
    * Current flows from source to drain depending on the gate
    * Can create AND, OR, NOR out of NAND
    * Asserted switch = closed

![](https://i.imgur.com/Cnk1daj.png)
* Signals are transmitted instantly and continuously
    * A wire contains only 1 value at a time
* [Block] propagation delay: the delay between the clock edge and when you can see the output
* Combinational Logic (CL) circuits
    * Output is a function of the inputs only
* State elements
    * Circuits that store information (registers, cache)
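The claim that other gates can be built out of NAND can be sanity-checked in C with bitwise operators; a minimal sketch (helper names mine):

```cpp=
#include <stdio.h>

// NAND on a single logical bit; the & 1u keeps the result to one bit
unsigned nand(unsigned a, unsigned b) { return ~(a & b) & 1u; }

unsigned not_(unsigned a)             { return nand(a, a); }
unsigned and_(unsigned a, unsigned b) { return nand(nand(a, b), nand(a, b)); }
unsigned or_(unsigned a, unsigned b)  { return nand(nand(a, a), nand(b, b)); }
unsigned nor_(unsigned a, unsigned b) { return not_(or_(a, b)); }

int main(void) {
    // print the full truth table to confirm the derivations
    for (unsigned a = 0; a <= 1; a++)
        for (unsigned b = 0; b <= 1; b++)
            printf("a=%u b=%u  AND=%u OR=%u NOR=%u\n",
                   a, b, and_(a, b), or_(a, b), nor_(a, b));
    return 0;
}
```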
# Lecture 15: 9/30/20
* State elements can store values over time (registers, caches)
    * They control the flow of information between logic blocks
* Register = n "flip-flops", a.k.a. D-type flip-flops
    * d: data (input), q: output

![](https://i.imgur.com/MUTg8iv.png)
* At the rising edge of the clock, sample the input d
    * After a time delay, output it on q
* Setup time: the time before the rising edge that the input must be stable
* Hold time: the time after the rising edge that the input must be stable
* Clk-to-q delay: the time until q gets updated and is stable

![](https://i.imgur.com/ruN1tg7.png)
* Max delay = clk-to-q delay + CL delay + setup time (this sets the minimum clock period)
* Extra registers are added to speed up the clock rate
* The clock period is limited by the propagation delay of the adder/shifter
* Finite State Machine: represented with a state transition diagram
    * Built with CL + a register
    * $S_{i-1}$ is the stable output of a register which holds the result of iteration $i-1$

# Lecture 16: 10/2/20
* XOR: 1 if there is an odd number of 1s
* Boolean algebra: $+$ is OR, $\cdot$ is AND, $\bar{x}$ is NOT
* Circuit verification: circuits are equivalent if their boolean algebra expressions are equivalent
* Uniting theorem: $xy + x = x$, $(x+y)x = x$
* DeMorgan's Law: $\overline{x \cdot y} = \bar{x} + \bar{y}$, $\overline{x + y} = \bar{x} \cdot \bar{y}$
    * Distribute the "bar" to every term (and flip the operator)
* Sum-of-products: an OR of ANDs
    * OR together one AND term for each input combination whose output is 1 (complementing the input bits that are 0 in that row)

# Lecture 17: 10/5/20
Multiplexor
![](https://i.imgur.com/5GV4C6T.png)

Arithmetic and Logic Unit (ALU)
* 00: A + B
* 01: A - B
* 10: A & B
* 11: A | B

![](https://i.imgur.com/kt4n1zC.png)
![](https://i.imgur.com/5gd2nMB.png)

## Adders
* LSB: $s_0 = a_0 \oplus b_0$, $c_1 = a_0 b_0$
* ith bit: $s_i = \text{XOR}(a_i, b_i, c_i)$, $c_{i+1} = \text{MAJ}(a_i, b_i, c_i) = a_i b_i + b_i c_i + a_i c_i$
* N 1-bit adders cascaded make 1 N-bit adder
* For unsigned numbers, the existence of a carry out indicates overflow
* For signed numbers, the carry out is not indicative
    * For the highest adder: overflow $= c_n \oplus c_{n-1}$
    * No c_out, no c_in: no overflow
    * c_in and c_out: no overflow
    * c_in but no c_out: overflow (A, B > 0)
    * c_out but no c_in: overflow (A, B < 0)
* Subtractor: A - B = A + (-B)
    * An XOR with the SUB signal flips the bits of B
    * If SUB is high, the carry-in adds the extra 1
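The sum and carry equations above, written out as a 1-bit full adder sketch (the struct and names are mine):

```cpp=
#include <stdio.h>

// s_i = XOR(a, b, c); c_{i+1} = MAJ(a, b, c) = ab + bc + ac
typedef struct { unsigned sum; unsigned carry; } FullAdd;

FullAdd full_adder(unsigned a, unsigned b, unsigned c) {
    FullAdd r;
    r.sum   = a ^ b ^ c;                    // 1 iff an odd number of 1s
    r.carry = (a & b) | (b & c) | (a & c);  // majority of the 3 inputs
    return r;
}

int main(void) {
    // cascading N of these (carry out -> next carry in) gives an N-bit adder
    FullAdd r = full_adder(1, 1, 1);
    printf("sum=%u carry=%u\n", r.sum, r.carry);  // 1 + 1 + 1 = 0b11
    return 0;
}
```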
# Lecture 18: 10/7/20
* Hardware is inherently parallel
* CPU: the active part of the computer that does the work (data manipulation and decision making)
* Datapath: the part of the processor that has the hardware to perform operations
* The computer executes 1 instruction per clock tick
* State outputs drive the inputs to combinational logic, which computes the next state
* All state elements are updated at the rising clock edge
* Problem: a single monolithic block is too bulky
* Solution: break up the process of executing an instruction into stages
    * Stage 1: Instruction Fetch (IF)
    * Stage 2: Instruction Decode (ID)
    * Stage 3: Execute (EX), the ALU
    * Stage 4: Memory Access (MEM)
    * Stage 5: Write Back to Register (WB)
* All 5 stages occur within 1 clock cycle (fetch at one rising edge, write at the next rising edge)

![](https://i.imgur.com/WjbpPlw.png)
![](https://i.imgur.com/s9VxNJP.png)

# Lecture 19: 10/9/20
See lecture diagrams
![](https://i.imgur.com/g39nwYV.png)

# Lecture 20: 10/12/20
* Control and status registers (CSRs): different from the register file
    * Monitor status and performance
    * Up to 4096 CSRs
    * Necessary for peripherals
* `csrrw` swaps the values of a CSR and an integer register
    * `csrw csr, rs1` is `csrrw x0, csr, rs1`
        * rd = x0 means just write rs1 to the CSR
    * `csrwi csr, uimm` is `csrrwi x0, csr, uimm`
        * rd = x0 just writes uimm to the CSR

![](https://i.imgur.com/1eNBk2A.png)
* The LSB is stable first
* Critical path for `lw` $= t_{clk-q} + t_{IMEM} + t_{Imm} + t_{ALU} + t_{DMEM} + t_{mux} + t_{setup}$
* Certain instructions take longer than others
* Most blocks will be idle most of the time
* ROM: Read-Only Memory
![](https://i.imgur.com/vUyZdhh.png)

# Lecture 21: 10/14/20
* One crude performance measure: maximum clock frequency
* Better performance measures:
    * Program execution time
    * Throughput (e.g., the number of server requests handled per hour)
    * Energy per task
* Iron law of processor performance:
    * $\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$
* CPI: cycles per instruction
* Instructions per program:
    * Determined by the task, algorithm, language, compiler, and ISA
* Average CPI:
    * Determined by the ISA, processor implementation, complex instructions (like `strcpy`, CPI >> 1), superscalar processors (CPI < 1)
* Time per cycle:
    * Determined by the processor microarchitecture (critical path), technology, and power budget
* ~30% of energy is leaked in CMOS
* $\frac{\text{Energy}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Energy}}{\text{Instruction}}$
* $\frac{\text{Energy}}{\text{Program}} \propto \frac{\text{Instructions}}{\text{Program}} \times CV^2$
    * Capacitance depends on the technology and processor features
    * Supply voltage: run slower with less supply voltage to save energy
* Energy iron law: Performance = Power $\times$ Energy Efficiency
    * Performance: tasks/second
    * Energy efficiency: tasks/joule

## Pipelining
* Does not help the latency of a single task
* Helps the throughput of the entire workload
* Multiple tasks operate simultaneously using different resources
* (Theoretical) speedup = number of pipe stages
    * In reality, there is time to fill and drain the pipeline
* The pipeline rate is limited by the slowest pipeline stage

# Lecture 22: 10/16/20
* Put registers in between the 5 stages to pipeline

![](https://i.imgur.com/qft6eUl.png)
* The time to process 1 instruction can be longer, so latency is worse, but throughput is better
* The clock rate can be dramatically higher
* Pipelining hazards: situations that prevent starting the next instruction in the next clock cycle
* Structural hazard: a required resource is busy (needed in multiple stages)
    * 2+ instructions compete for a single physical resource
    * Sol 1: instructions take turns, causing some to stall
    * Sol 2: add more hardware (can always solve structural hazards)
        * 2 independent read ports and 1 write port
        * 2 separate memories (instruction and data caches)
* Data hazard: a data dependency between instructions; need to wait for the previous instruction to complete its read/write
    * Register file trick: write in the first half of the cycle, read in the second half (registers are fast)
    * Sol 1: stall using a bubble (`nop`) when an instruction depends on the previous one, but this reduces performance
        * The compiler will try to rearrange code that is not dependent
    * Sol 2: forwarding/bypassing: grab the operand from a pipeline stage rather than the register file (output of the ALU to the input of the ALU)

![](https://i.imgur.com/98F1UjL.png)
* Control hazard: the flow of execution depends on a previous instruction (for branches)
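A two-instruction sketch of the data hazard and the forwarding fix described above (registers chosen arbitrarily):

```assembly
add t1, t0, t0   # t1 is computed in EX, written back in WB
or  t2, t1, t0   # needs t1 in its own EX stage, before the add's WB
# without forwarding: insert bubbles (nops) until the write completes
# with forwarding: the ALU output of add is fed straight into the
# ALU input of or, so no stall is needed
```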
# Lecture 23: 10/18/20
## Load Data Hazard
* Load first, then do operations on the result
* The loaded value is only available after DMEM, but it is needed 1 cycle earlier
* Sol 1: load delay slot: need 1 stall, so convert the next instruction into a `nop`
    * Just deassert regWEn and MemRW, and do not update the PC
    * The cycle after that, execute the original instruction
    * Forward from the DMEM output to the ALU
* Sol 2: put an unrelated instruction into the load delay slot

## Control Hazards
* Branches are conditional, so we don't know the next instruction
* The earliest we know the branch outcome is at the end of Execute (the ALU)
* 2 subsequent instructions are executed regardless of the branch outcome
* Sol 1: use 2 stall cycles after a branch
* Sol 2: if the branch/jump is taken, convert the next 2 instructions to `nop`s
    * Use branch prediction to guess
        * Generally in a loop, so the branch is rarely taken
        * Ex. keep a bit recording whether the branch was taken last time
* Moving the branch comparator to the ID stage would incur a data hazard, and forwarding could fail

## Superscalar Processors
* Clock rate is limited by technology and power dissipation
* Deeper pipelines (10, 15 stages)
    * Less work per stage, so a shorter clock cycle
    * More potential for hazards (CPI > 1)
* Multi-issue superscalar
    * Replicate pipeline stages: multiple pipelines (CPI < 1)
    * Multiple execution units; start multiple instructions per clock cycle
    * Out-of-order execution to reduce hazards
        * Instructions arrive at a commit unit, which reorders the results
* Get CPI using the iron law (time the program, count the instructions, know the time/cycle)
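A worked example of that last point, with made-up numbers: a program that runs in 1 s, executes $5 \times 10^8$ instructions, and has a 1 GHz clock (1 ns per cycle):

$$\text{CPI} = \frac{\text{Time}/\text{Program}}{\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Time}}{\text{Cycle}}} = \frac{1\ \text{s}}{(5 \times 10^8)(10^{-9}\ \text{s})} = 2$$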