CS 61C Notes
Lecture 1: 8/26/20
Intro class, no notes
Lecture 2: 8/28/20
- Refer to table for easy hex to binary conversion
- Numerals have infinite preceeding zeroes
- Numbers are representations of numerals
- For a set number of bits, if addition (sub, mul, or div) cannot be represented, there is overflow
- Not underflow, this would be negative overflow
Number representations
- Positive number representations are always the same for all representations
- Sign and magnitude
- Leftmost sign is int (0 is +, 1 is -)
- Problem 1: Making numbers smaller than 0 actually makes the number bigger in binary representation
- Problem 2: Also there are 2 zeroes
- Problem 3: Complicated circuitry
- One's complement
- Flip all the bits for a negative number
- Problem: there are still 2 0s (0000 and 1111)
- Solution: shift everything over left by 1
- Two's complement
- Flip all bits and add
- Or multiply leftmost bit by
- All 1s represents -1
- non-negatives, negatives
- There are 1 more negative than positive
- Bias encoding
- Add a bias to bring the center of the range down
- Bias for N bits chosen as
Lecture 3: 8/31/20
- ENIAC: First Electronic General-Purpose Computer
- Long time to program (patch cords and switches)
- EDSAC: First General Stored-Program Computer
- Programs held as numbers/bits in memory
- Compile changed source files into .o files (machine code object files) which are then linked into a .out file (machine code executable file)
- Executables must be rebuilt on each new system (porting)
- Parallel compiling:
make -j
to rebuild changed pieces
- Amdahl's Law: Linking is sequential, forming a bottleneck
- C Pre-Processor (CPP): Begins with
#
- Ex.
#define
or #include
- Macros: small functions that text replaces everything below
- However, this means functions would be called multiple times
C vs Java
C main function
|
C |
Java |
Type of Language |
Function Oriented |
Object Oriented |
Programming Unit |
Function |
Class = Abstract Data Type |
Compilation |
gcc hello.c |
javac Hello.java |
Execution |
a.out loads and executes |
java Hello interprets |
Storage |
Manual(malloc , free ) |
new allocates & initializes, automatic garbage collection |
Comments |
/* ... */ |
/* ... */ or // |
Constants |
#define, const |
final |
Preprocessor |
Yes |
No |
Variable declaration |
Beginning of block (C99 same as Java) |
Before you use it |
Variable naming conventions |
under_scores |
camelCase |
Access a library |
#include <stdio.h> |
import java.io.File; |
- ANSI C was updated to C99 or C9x
- Added variable-length arrays
- Int types and booleans
- C11 and C18 are newer fixes
- Multi-threading, unicode, removed
gets()
- Passing args into main:
int main(int argc, char *argv[])
argc
containst he number of strings on command line
- Executable counts as one, plus one for each argument
argv
is a pointer to an array containing the arguments as strings
C Basics
- Booleans
- False:
0
(int), NULL
, booleans (C99)
- True: Everything else
- Types:
int
, unsigned int
, float
, float
, double
, char
, long
, long long
long
is a longer int (32b), long long
is even longer (64b)
int
is 16b
float
is floating point decimal, double
is equal or higher precision floating point
- Preferrable to use
intN_t
and uintN_t
- Constants:
const int days_in week
- Enums:
enum color {RED, GREEN, BLUE}
- Typedef:
typedef uint8_t BYTE;
Custom types
- Struct: Custom structured groups of variables
- switches evaluate cases until it sees a break
goto
is nonlocal control flow, reserved for advanced use cases
- includes are a shared namespace so no
math.atan
- Never while loop based on a floating point value
Lecture 4: 9/2/20
Pointers
- Variables have undefined values if not initialized in declaration
- Heisenbugs: Undefined behavior that run nondeterministically
- Address refers to a memory location (not value)
- Pointer contains the address of a memory location
- Syntax:
int *p;
p is a pointer (an address of an int)
p = &y;
assign the address of y to p
- Dereferencing:
z = *p;
assign value at adress in p to z
*p = z;
assign the value of z to what p is pointing to
- Must initialize pointers first
int *ptr; *ptr=5;
does not work since *ptr
never had an initial value
- Benefit of cleaner code, faster to pass pointer to large struct/array
- Drawback is bugs (dynamic memory management, dangling references, memory leaks)
Pointer Usage
- Pointers can point to any data type
void*
is a generic pointer (not recommended)
- Pointers can point to functions
int (*fn) (void*, void*) = &foo
- fn is a function that accepts 2
void*
pointers and returns an int. Initially points to function foo
(*fn)(x, y)
is equivalent to foo(x,y)
- Struct arrow notation:
int h = paddr->x;
equivalent to int h = (*paddr).x;
- Arrow notation is more commonly used
p1 = p2
copies all values, not pointers
- NULL pointer: The pointer of all 0s
- Writing or reading to null pointer causes crash
- Since 0 is false, simple test for null pointer
if(q)
- Type declaration tells how many bytes to fetch for each pointer
- 32bit int is stored in 4 consecutive 8-bit bytes
- Word alignment: Can only access 4 byte boundaries
sizeof(type)
returns num of bytes in object
- By C99 definition,
sizeof(char) == 1
Arrays
- A contiguous block of memory
- Declaration:
int ar[] = {1, 2};
- Arrays are almost identical to pointers
char *string
and char string[]
are nearly identical
- Cannot increment an array
ar[0]
is same as *ar
, ar[2]
is same as *(ar+2)
- Declared arrays are only allocated while scope is valid
- Declare
int ARRAY_SIZE
instead of hardcoding
- Array does not know its own length and bounds not checked
- Exception is strings, the null terminator signifies end
- Common errors
- Segmentation fault - read/write to memory you don't have access to
- Bus error - alignment is wrong (bad word alignment)
pointer + n
adds n * sizeof(type)
- Handle: pointer to a pointer
**h
- Needed to advance a pointer for an array
- Typically no more than 2 stars
Lecture 5: 9/4/20
Dynamic Memory Allocation
sizeof
gives size in bytes
- Also knows size of arrays (in bytes)
- Allocate room to point to:
malloc()
- Ex.
ptr = (int *) malloc(n*sizeof(int))
- This will cast the
void*
returned by malloc to int*
malloc
location contains garbage initially
- Dynamically allocated space must be freed
- When main returns it auto-frees it up, but bad practice
- Memory leak: If memory is not freed and the program exits
- Malloc returns
NULL
if request cannot be granted
- Common errors
free()
ing the same place of memory twice
free()
ing something that you didn't malloc()
- Resizing
realloc(p, size)
which returns a new pointer (NULL
if fails)
realloc(ip, 0)
is same as free(ip)
- Array names are not variables
*a
: first value of array, a
: pointer of array, &a
: pointer of array (just like a
)
- 3 Rules of pointers
-
- Need to set up the pointee (separate from pointer)
-
- Dereference a pointer to access its pointee
-
- Assignment (=) between pointers makes them point to the same pointee
Linked List
- When you
malloc
a string, need strlen(string) + 1
for null terminator \0
strcpy(destination, original)
Memory locations
- Struct declaration does NOT allocate memory
- Variable declaration does allocate memory
- Methods of allocating memory
- Declaration of local variable
- Dynamic allocation at runtime like
malloc
- Global variable (above
main()
) which has global scope
- 3 Pools of memory
-
- Static storage: global variable storage (permanent)
-
- Stack: Local variable storage, params, return address
-
- Heap: Dynamic malloc storage: data lives until dallocated by programmer
- Program Address space
- Stack: local vars, grows downward
- Big stack: Deep recursion/fractals
- Heap: Space from malloc, grows upward
- Just big mallocs without function calls
- Static data: Declared outside of main. Does not grow/shirnk
- Code: Loaded when program starts. Does not grow/shrink

Heap
- Stack frame (LIFO)
- Return instruction address, params, space for other local variables
- Always contiguous blocks of memory, stack pointer tells where top stack is
- When procedure ends, stack frame is tossed off stack freeing memory for future stack frames
- Don't erase memory even if not needed, just move stack pointer up
Heap
- Large pool of memory, not allocated in contiguous order
- Specify the num of bytes to allocate in
malloc
- Tricky to manage unlike others
- Want
malloc()
, free()
to be fast, minimal memory overhead
- Want to avoid (external) fragmentation (free memory separated in small chunks)
- K&R Implementation
- Each block of memory is preceded by a header with size of block and pointer to next block
- Free blocks are kept in a circular linked list
- Pointer field unused in allocated blocks
malloc()
: Search free list for a block large enough
free()
: Check if blocks adjacent to freed block are free
- Coalesce into 1 bigger free block, otherwise add new freed block to free list
free
is very quick, malloc
can search the full list worst case
malloc
block choosing
- best-fit: choose smallest block that is big enough
- first-fit: choose first block that is big enough
- next-fit: same as first-fit but resume seraching where we finished last time
- distributes the small slivers of memory
When Memory goes Bad
- Pointer problems
- Dangling reference: using ptr before malloc
- Writing off the end of arrays
- Cannot return pointer to variable in local frame
- Use after free
- Stale pointers after
realloc
moves data
- Freeing the wrong stuff (incremented pointer)
- Double free
- Memory leaks: tardy free, lose the ptr
- Valgrind catches memory leaks, misues of free, writing over end of arrays
Lecture 6: 9/9/20
- Very large/small numbers and fractional numbers need to be stored
- Agree where to put the binary point to sep integer and fractional parts
- Store where the binary point is in a diff variable
- Binary point can 'float' as the binary point is no longer fixed
- Normalized form: No leading 0s (like scientific notation)
FLoating Point Representation
- First 1 is not recorded (since always 1)
- Sign: 1 bit (use sign-magnitude form)
- Exponent: 8 bits
- Significand: 23 bits (LSB represents )
- Can represent range of )
- Overflow: Exponent is larger than 8 bits ()
- Underflow: Negative exponent is larger than 8 bits
- Ends up storing small fraction as 0
- More precision: double (52 bit significand)
- Significand is always between 0, 1 (for normalized numbers)
- Representing 0: Exponent value of 0
- 0 has no leading 1
- Results in 2 zeroes because of the sign bit
- Biased exponent representation (IEEE Standard 754)
- 2's complement would make negative numbers look bigger
- Bias: 127

Special Numbers
- How do we represent infinity
- Use exponent 255 with 0 significand
- Representing NaN
- Use exponent 255 with nonzero signifcand
- NaNs contaminate results as they percolate up
- possible NaNs used to specify each error (sqrt neg number, 0/0, etc.) and include which line it occured on
- Problem of normalization and implicit 1
- Gap between FP numbers around 0
- Smallest pos num:
- Second smallest:
- Denorm: Solves implicit 1 problem
- Exponent 0 with nonzero significand
- No implicit leading 1, implicit exponent -126
- Smallest:
- Second smallest:
- Even steps until normalized region and the stride doubles with higher exponents
- Have 8 million slots between each power of 2
- is the smallest int that cannot be represented since the stride would be 2
- Example:
- Sign: 0
- Exponent:
- Significand:
Fallacies
- Associativity is lost
- Small numbers can be lost when added to big numbers
- Precision: count of the number bits used to represent a value
- Accuracy: Difference between actual value and comp representation
- High precision permits high accuracy but no guarantee
- Rounding: 2 extra bits of precision to round
- Converting double to single precision (or FP to int)
-
- Round to
-
- Round to
-
- Truncate (round to 0)
-
- Unbiased: Round to even number
- Decimal examples: 2.5 rounds to 2, 3.5 rounds to 4
- FP Addition
- De-normalize to match exponents
- Add significands
- Keep same exponent
- Normalize again
- Casting
int -> float -> int
is not always same
float -> int ->float
is not always same
Alternate FP Representations
- float: standard float (8exp, 23 sig)
- binary64: double precision (11exp, 52 sig)
- Quad-precision: binary128 (15exp, 112 sig)
- Oct-precision: binary256 (19 exp, 236 sig)
- Half-precision: binary16 or fp16 (5exp, 10 sig)
- bfloat16 (brain FP): (8exp, 7sig)
- unum: Tells how many bits are for exp/sig (customizable)
- u-bit indicates if the number is exact
Lecture 7: 9/11/20
- CPU executes a lot of instructions
- Instruction Set Architecture (ISA): set of instructions a CPU implements
- RISC: Keep instruction set small and simple to build fast hardware
RISC-V Basics
- Registers instead of variables
- Registers are inside the processor (not memory)
- Predetermined number of registers in hardware (32)
- Each RISC-V register is 32 bits wide (independent of variant)
- Numbered from
x0
to x31
x0
is special, always holds value 0
- Registers can be referred to by number or name
- Registers have no type
- Comments start with
#
- Instruction: one statement/command
- Each line has at most 1 instruction
- One operation per instruction
RISC-V add/sub
add x1, x2, x3
- one: operation by name
- two: operand getting result (destination)
- three: first operand
- four: second operand
- Break longer sequences into multiple instructions
- Intermediates: numerical constants
addi x3, x4, 10
- No
subi
, just addi
a negative value
- Use
x0
for number zero to avoid using intermediates
Lecture 8: 9/14/20
- Some registers have designated use (usually < 30 registers available)
- Optimizer compiler minimizes register usage
- Register footprint - number of registers needed for a compute kernel
- Memory is like a large 1D array
- Processor gives address(offset to base pointer) to point in memory
- Load from (read) and store to memory (write)
- Memory addresses are in bytes
- Word addresses are 4 bytes apart
- Little-endian: Least significant byte is smallest address
- Bits are always stored with most significant in upper position
- DRAM: Dynamic Random Access Memory (fast but not as fast as registers, medium capacity)
- registers are 50-500x faster than memory
Loading from and storing Memory
- Load Word:
lw destination, offset(base_ptr)
- Offset is in bytes (multiple of 4)
- Data flow moves from right to left
- C Code:
int A[100]; g = h + A[3];
- Equiv:
lw x10, 12(x15); add x10, x12, x10
- Store Word:
sw current, offset(base_ptr)
- Data flow moves from left to right
int A[100]; A[10] = h + A[3];
lw x10, 12(x15)
add x10, x12, x10
sw x10, 40(x15)
- Load byte/store byte:
lb
sb
for byte data transfers
lb x10, 3(x11)
copies content of x11+3
to low byte pos of x10
- Sign-extend the sign bit to all higher bits
- Unsigned byte load:
lbu
- No sign-extension needed
- No
sbu
(nonsensical)
addi
equiv to sw, lw, add
, but addi
is much faster
- Immediates must be <= 32 bits
- Need to load longer immediates from memory
Decision Making
- if statement:
beq reg1, reg2, L1
- Go to statement labeled L1 if (value reg1) == (value reg2)
- Go to next statement otherwise
- Counterpart:
bne
branch not equal
- conditional branch:
- Less than:
blt
(unsigned: bltu
)
- Branch greater:
bge
(unsigned: bgeu
)
- Unconditional branch (always)
==
usually translates to bne
so next instruction executes if condition true
- Need a jump like
j Exit
at the end of bne
block
- No
bgt
, ble
- Need to increment memory pointers by num of bytes
Lecture 9: 9/16/20
- Logical instructions: isolate or packing bytes/nibles into words
- 2 variants: register and immediate
- No logical not, just use xor with 11111111
- Logical shifting:
- Shift logical left:
sll
(shift 2 bits left and insert 0s on right, bits fall off left end)
- Shift logical left immediate:
slli
- Shift logical right:
slr
fills left with 0s
- Arithmetic shifting: logical shift for signed int
- Shift right arithmetic (
sra
, `srai): moves n bits to the right, inserts sign bit into empty bits
- Not equivalent to divide by (fails for odd neg numbers)
- C standards is division should round to 0
Registers
- Programs are stored in memory
- Program lives in an area, data lives in an area (In practice, program above data)
- 1 RISC-V instruction is 32 bits
- Instruction fetched from memory using Program Counter (PC)
- Next instruction is 4 bytes away (incr PC +4)
- Symbolic register names
x10 - x17
are a0-a7
for params, a0, a1
for return values
x1
is ra
(return address to point of origin)
x8-x9
: s0-s1
and x18-x27
: s2-s11
: saved registers
x0
is zero
- Pseudo-instructions: shorthand for common assembly
mv rd, rs = addi rd, rs, 0
(move value to another register)
li rd, 13 = addi rd, x0, 13
(load immediate value into register)
nop = addi x0, x0, 0
(no operation, processor just waits)
Function calls
- 6 fundamental steps
-
- Put arguments in place where func can access them
-
- Transfer control to Function
-
- Acquire local storage resources needed for func
-
- Perform desired task of func
-
- Put return value where code can access it, restore used registers, release local storage
-
- Return control to origin
- Use
jr
not j
since jr
allows us to use a variable to return (jr ra
)
- Jump and link:
jal rd, Label
will jump and return save address (equiv to addi then j)
- Essentially link then jump
- Pseudo-instruction is
ret = jr ra
jalr rd, rs, imm
: Jump and link register
j, jr, ret
are pseudo-instructions for jal
(no ret address is just a jump)
Lecture 10: 9/18/20
- Use stack to save old values before calling func
- Stack pointer:
sp
(x2)
- Stack grows downwards (stack address decreases when we push)
- Push decrements sp, pop increments sp
- Stack frame includes:
- return instruction address
- Params (args)
- Space for local vars
- Stack pointer points to bottom of stack frame
Nested function calls
ra
would be overwritten in future calls
- Register conventions: a set of rules for which registers are unchanged after
jal
- Preserved:
sp, gp, tp, s0-s11
(s0 is fp)
- Not preserved:
a0-a7, ra, t0-6
(arg and temp registers)
- Caller must preserve args/rv, temp, ra
- Callee must preserve saved registers and sp
Memory Allocation
- Types of storage classes:
- Automatic: local to func, discarded when func exits
- Use stack for local vars
- Procedure frame/activation record: segment of stack with saved registers and local vars
- Static: exist across procedures
- Stack: high end
0xbfff_fff0
- Text segment in low end:
0x0001_0000
- Static: gp points to static (
0x1000_000
)
- Heap: above static, grows up to high addresses
- The callee saves them on the stack at the beginning of the function call and restores the stack right before returning
Lecture 11: 9/21/20
- Everything has memory addreses (instructions, data words)
- Branches and jumps to different sections
- One register keeps address of the instruction being executed (Program counter: PC) aka Instruction Pointer (IP)
- Programs are distributed as binaries but are different versions based on device
- Want backwards compatibility
- Data is in words(32 bit chunks)
- Each register is a word (lw, sw to access 1 word of memory)
- Divide instruction into fields
- R- register-register arithmetic ops
- I- register-immediate arithmeti cops, loads
- S- stores
- B- branches (minor variant of S)
- U- 20-bit upper immediate instructions
- J - jumps (minor variant of U)

- opcode: what format it is
0b0110011
for all R format
- funct7+funct3: tell what operation to perform
rs1, rs2
: source register: specifes register w/ first operand
rd
: destination register: register that receives result of computation
- Each register field holds a 5-bit uint corresp to reg number (
x0-x31
)

imm[11:0]
covers range
opcode
: 0b0010011
- shift amounts is limited to 5 bits since larget shifts are meaningless
- Loads:
- 12bit signed immediate for the memory address (offset)
rd
contains value loaded from memory
rs1
is load from

- Store needs 2 registers (
rs1, rs2
) but no rd
Lecture 12: 9/23/20

- Need to read 2 registers but don't write to register
- Branches used for loops (func calls and unconditional jumps handled by jump instructions)
- Can only jump relative to PC
- Largest branch dist limited by size of code
- PC-Relative Addressing: Use immediate as 2s complement offset to PC
- 12 bits gives
- Can only branch to beginning of word
- Branch calculation
- No branch:
PC = PC + 4
- Branch:
PC = PC + immediate*4
- Support for 16-bit words means branch size reduced by half
- Half of targets will be errors
- Branch calculation
- No branch:
PC = PC + 4
- Branch:
PC = PC + immediate*2
- The 12bit immediate is actually 13bits but always has LSB = 0
- beq with offset 0 is infinite loop

- Need to branch to destination
- Replace beq with bne and then j
- 20 bit immediate for upper 20 bits of 32 bit word
- lui: Load upper immediate
- lui then addi to set all 32 bits
- addi sign extends though which can cause problems
- Add 1 to LUI value to take that into account
- li: pseudo-op to handle this (lui and addi)
- auipc: Add upper immediate to PC

jal
saves PC+4 to rd (return address)
- The immediate is the offset, remove the LSB 0
- PC relative jump to locations 32bit instructions
jalr rd, rs, immediate
uses I-format
- Writes
rd = PC+4
, PC = rs + immediate
- Way to jump to absolute address
- Cannot assume LSB is 0 so range is reduced
- Use auipc, jalr to jump to pc-relative 32-bit offset
Lecture 13: 9/25/20
- Interpreter: Directly executes a program in source language
- Higher level, better error msgs
- Slower (10x), code smaller (2x)
- Independent instructions (can run on any machine)
- Translator: converts prog from 1 source lang to another
- Translates to executable, don't know the source code
- Compilation, Assembly, Linking, Loading
- Compiler
- Input: High level lang code (
foo.c
)
- Output: Assembly code: (
foo.s
)
- Can contain pseudo-instructions
Assembler
- Input: Assembly code (
foo.s
)
- Output: Object code (
foo.o
)
- Reads and uses directives
- Ex.
.text
(code), .data
(static), .globl sym
, .string str
.word w1...wn
- Replace psuedo-instructions, create machine code, create object file
- Need to sign extend immediates
- Branch immediates count halfwords
- PC-relative branches/jumps: count half words between curr and where you want to go
- Foward reference problem: branching below
- Solution: Take 2 passes
- Pass 0: Replace pseudo-instructions
- Pass 1: Remember position of label
- Pass 2: Use label positions to generate code
- Pc-relative jumps (jal, beq, bne)
j ofset
expands to jal, zero, offset
- Count half-words between target and jump for position-independent code (PIC)
- PIC is good - relative to location
- Static data refs:
lui
addi
require full 32b address
auipc
addi
uses PIC
- Symbol table
- List of items that can be used by other files
- Labels: function calling
- Data:
.data
variables can be accessed across files
- Relocation table
- List of items whose address the file needs
- Absolute labels (
jal, jalr
)
- Static data (
la
)
- Object file format (ex. ELF)
- Object file header: size, pos of other pieces
- text segment: machine code
- Data segment: static data (in binary)
- Relocation info: Has lines of code that need to be fixed later
- Symbol table: List of files labels and static data that can be referenced
- Debugging info
Linker
- Input: Object code files, info tables (
foo.o
)
- Output: Executable code (
a.out
)
- Combines several object files into a single executable
- Throws error for duplicate or missing symbols
- Types of addresses
- PC-Relative (beq, bne, jal, auipc/addi): Never need to relocate
- Absolute or external func address (quipc/jalr): Always relocate
- Static data reference (lui/addi): Always relocate
- Relocation editing:
- J-format: jal address
- I,S format: static pointers
- No conditional branches
- Text starts at
0x1000
- Statically-linked approach: include entire library even if not all of it is used (self-contained)
- Dynamically-linked (DLL): include library at runtime
- Smaller storage, requires less memory
- Requires time to link at runtime
- Link at machine code level
Loader
- Input: Executable code (
a.out
)
- Output: program run
- Part of the OS
- Read header to know size of text and data
- Create address space for text, data, stack
- Copy instructions and data from executable file to new address space
- Copy CLI args to stack
- Initialize machine registers (mostly cleared)
- Jump to start-up routine that copies program args from stack to register, set PC
Lecture 14: 9/28/20
- Synchronous digital system
- Synchronous: operations coordinated by a central clock
- Digital: Values represented by discrete values
- Transistor: semiconductor device to amplify or switch signals
- CMOS: Complementary metal oxide semiconductor
- Drain, Gate, Source
- Current flows from source to drain depending on the gate
- Can crete AND, OR, NOR out of NAND
- Asserted switch = closed

- Signals are transmitted instantly and continuously
- Only contains 1 value at a time
- [Block] propagation delay: Delay between clock start and when you can see the output
- Combinational Logic (CL) circuits
- Output is a function of inputs
- State elements
- circuits that store information (registers, cache)
Lecture 15: 9/30/20
- State can store values for time (registers, caches)
- Control flow of info between logic blocks
- Register = n "flip-flops" aka D-type Flip-Flop
- D: data, q: output

- At rising edge of the clock, sample input d.
- After a time delay, output q
- Setup time: Time before rising edge that input must be stable
- Hold time: Time after rising edge that input must be stable
- Clk-to-q delay: Time until q gets updated and is stable

- Max delay = CLK to Q delay + CL delay + Setup time
- Extra registers added to speed up clock rate
- Clock period limited by propagation delay of adder/shifter
- Finite State Machine: Represent with a state transition diagram
- Use CL + register
- is the stable output of a register which holds result of iteration
Lecture 15: 10/2/20
- XOR: 1 if odd number of 1s
- Boolean algebra: + is OR, is AND, is NOT
- Circuit verification: If they are equivalent in boolean algebra
- Uniting theorem:
- DeMorgan's Law: ,
- Distribute the "bar" to all 3 terms
- Sum-of-products: ORs of ANDs
- Sum together all possible outcomes of 1 (not for 0 for each input bit)
Lecture 16: 10/5/20
Multiplexor

Arithmetic and Logic Unit (ALU)
- 00: A + B
- 01: A - B
- 10: A & B
- 11: A | B


Adders
- LSB: ,
- ith bit: ,
- N 1-bit adders cascaded is a 1 N-bit adder
- For unsigned numbers, existence of carry indicates overflow
- For signed numbers, carry is not indicative
- For highest adder: overflow =
- No c_out, no c_in: no overflow
- c_in and c_out: no overflow
- c_in but no c_out: overflow (A, B > 0)
- c_out but not c_in: overflow
- Subtractor: A-B = A + (-B)
- Does an XOR to flip the bits
- if SUB is high, then we just add that 1
Lecture 17: 10/7/20
- Hardware is inherently parallel
- CPU - active part of computer that does work (data manipulation and decision making)
- Datapath: Part of processor that has hardware to perform operations
- Computer executes 1 instruction per tick
- State outputs drive inputs to combinational logic to make next state
- All state elements updated at rising clock edge
- Problem: single monolithic block is too bulky
- Solution: Break up process of executing instruction into steps
- Stage 1: Instruction Fetch (IF)
- Stage 2: Instruction Decode (ID)
- Stage 3: Execute (EX) ALU
- Stage 4: Memory Access (MEM)
- Stage 5: Write Back to Register (WB)
- All 5 stages occur within 1 clock cycle (rising edge fetch, write on the next rising edge)


Lecture 18: 10/9/20
See lecture diagrams

Lecture 19: 10/12/20
- Control and status register (CSR): diff from register file
- Monitor status and performance
- Up to 4096 CSRs
- Necessary for peripherals
- csrrw swaps value of a CSR and a integer register
csrw csr, rs1
is csrrw x0, csr, rs1
- rd=x0 meand just write rs1 to CSR
csrwi csr, uimm
is csrrwi x0, csr, uimm
- rd=x0 just writes uimm to CSR

- LSB is stable first
- Critical path lw =
- Certain instructions take longer
- Most blocks will be idle for most of the time
- ROM: Read Only Memory

Lecture 20: 10/14/20
- Measure performance in maximum frequency
- Performance measures:
- Program execution time
- Throughput (num of server requests handled per hour)
- Energy per task
- Iron law of processor performance:
- CPI: Cycles per instruction
- Instructions per program
- Determined by task, algorithm, language, compiler, ISA
- Average CPI:
- Determined by ISA, processor implementation, complex instructions (like strcpy, CPI >> 1), superscalar processors (CPI < 1)
- Time per cycle:
- Procesor microarchitecture (critical path), technology, power budget
- 30% of energy is leaked in CMOS
-
- Capacitance depends on technology processor features
- Supply voltage (run slower with less supply voltage)
- Energy Iron Law: Performance = Power * Energy Efficiency
- Performance: tasks/second
- Energy efficiency: tasks/joule
Pipelining
- Does not help latency of single task
- Helps throughput of entire worload
- Multiple tasks operate simultaneously using diff resources
- (theoretical) Speedup = num of pipe stages
- Time to fill and drain pipeline in reality
- Pipeline rate is limited by slowest pipeline stage
Lecture 21: 10/16/20
- Put registers in between the 5 stages to pipeline

- Time to process 1 instruction can be longer so worse latency, but better throughput
- Clock rate can be dramatically higher
- Pipelining hazards: situations that prevent startingthe next instruction in the next clock cycle
- Structural hazard: Required resource is busy (needed in multiple stages)
- 2+ instructions compete for single physical resource
- Sol1: Instructions take turns, causing some to stall
- Sol2: Add more hardware (can always solve)
- 2 independent read ports and 1 write port
- 2 separate memories (cache)
- Data hazard: Data dependency between instructions, need to wait for prev inst to complete its read/write
- Write in the first half of cycle, read in the second half of cycle (registers are fast)
- Sol1: Stall using bubble (
nop
) when it depends on previous inst, but reduces performance
- Compiler will try to rearrange code that is not dependent
- Sol2: Forwarding/bypassing - grab operand from pipeline stage rather than register file (output of ALU to input of ALU)

- Control hazard: Flow of execution depends on previous instruction (for branches)
Lecture 23: 10/18/20
Load Data Hazard
- Load first, then do operations on it
- Load is only available after DMEM, but needed 1 cycle earlier
- Sol 1: Load delay slot: Need 1 stall, convert next command into a nop
- Just set regWEn, MemRW, and do not update PC
- Cycle after that execute the original command
- Forward from DMEM output to ALU
- Sol2: Put unrelated instruction into load delay slot
Control Hazards
- Branches are condition so don't know next inst
- Earliest to know branch is at end of execution (ALU)
- 2 subsequent instructions are executed regardless of the branch outcome
- Sol1: Use 2 stall cycles after a branch
- Sol2: If branch/jump taken, convert the 2 next to
nop
- Use branch prediction to guess
- Generally loop, so branch is rarely taken
- Ex. Keep a bit of if the branch was taken last time
- Moving branch comparator to ID stage would incur data hazard and forwarding could fail
Superscalar Processors
- Clock rate limited by tech and power dissipation
- Deeper pipelines (10, 15 stages)
- Less work per stage, so shorter clock cycle
- More potential for hazards (CPI > 1)
- Multi-issue superscalar
- Replicate pipeline stages: multiple pipelines (CPI < 1)
- Multiple execution units, start multiple instructions per clock cycle
- Out of order execution to reduce hazards
- Arrive at commit unit which reorders
- Get CPI using iron law (time program, count instructions, know time/cycle)