CS 61C Notes

Lecture 1: 8/26/20

Intro class, no notes

Lecture 2: 8/28/20

  • Refer to table for easy hex to binary conversion
  • Numerals have infinite preceeding zeroes
  • Numbers are representations of numerals
    • Numbers are theoretical
  • For a set number of bits, if addition (sub, mul, or div) cannot be represented, there is overflow
    • Not underflow, this would be negative overflow

Number representations

  • Positive number representations are always the same for all representations
  • Sign and magnitude
    • Leftmost sign is int (0 is +, 1 is -)
    • Problem 1: Making numbers smaller than 0 actually makes the number bigger in binary representation
    • Problem 2: Also there are 2 zeroes
    • Problem 3: Complicated circuitry
  • One's complement
    • Flip all the bits for a negative number
    • Problem: there are still 2 0s (0000 and 1111)
    • Solution: shift everything over left by 1
  • Two's complement
    • Flip all bits and add
      1
      • Or multiply leftmost bit by
        1
    • All 1s represents -1
    • 2N1
      non-negatives,
      2N1
      negatives
    • There are 1 more negative than positive
  • Bias encoding
    • Add a bias to bring the center of the range down
    • Bias for N bits chosen as
      (2N11)

Lecture 3: 8/31/20

  • ENIAC: First Electronic General-Purpose Computer
    • Long time to program (patch cords and switches)
  • EDSAC: First General Stored-Program Computer
    • Programs held as numbers/bits in memory
  • Compile changed source files into .o files (machine code object files) which are then linked into a .out file (machine code executable file)
  • Executables must be rebuilt on each new system (porting)
  • Parallel compiling: make -j to rebuild changed pieces
  • Amdahl's Law: Linking is sequential, forming a bottleneck
  • C Pre-Processor (CPP): Begins with #
    • Ex. #define or #include
    • Macros: small functions that text replaces everything below
    • However, this means functions would be called multiple times

C vs Java

C main function

#include <stdio.h> int main(void) { return 0; }
C Java
Type of Language Function Oriented Object Oriented
Programming Unit Function Class = Abstract Data Type
Compilation gcc hello.c javac Hello.java
Execution a.out loads and executes java Hello interprets
Storage Manual(malloc, free) new allocates & initializes, automatic garbage collection
Comments /* ... */ /* ... */ or //
Constants #define, const final
Preprocessor Yes No
Variable declaration Beginning of block (C99 same as Java) Before you use it
Variable naming conventions under_scores camelCase
Access a library #include <stdio.h> import java.io.File;
  • ANSI C was updated to C99 or C9x
    • Added variable-length arrays
    • Int types and booleans
  • C11 and C18 are newer fixes
    • Multi-threading, unicode, removed gets()
  • Passing args into main: int main(int argc, char *argv[])
    • argc containst he number of strings on command line
      • Executable counts as one, plus one for each argument
    • argv is a pointer to an array containing the arguments as strings

C Basics

  • Booleans
    • False: 0 (int), NULL, booleans (C99)
    • True: Everything else
  • Types: int, unsigned int, float, float, double, char, long, long long
    • long is a longer int (32b), long long is even longer (64b)
    • int is 16b
    • float is floating point decimal, double is equal or higher precision floating point
  • Preferrable to use intN_t and uintN_t
  • Constants: const int days_in week
  • Enums: enum color {RED, GREEN, BLUE}
  • Typedef: typedef uint8_t BYTE; Custom types
  • Struct: Custom structured groups of variables
typedef struct{ int length_in_seconds; int year_recorded; } SONG;
  • switches evaluate cases until it sees a break
  • goto is nonlocal control flow, reserved for advanced use cases
  • includes are a shared namespace so no math.atan
  • Never while loop based on a floating point value

Lecture 4: 9/2/20

Pointers

  • Variables have undefined values if not initialized in declaration
  • Heisenbugs: Undefined behavior that run nondeterministically
  • Address refers to a memory location (not value)
  • Pointer contains the address of a memory location
  • Syntax:
    • int *p; p is a pointer (an address of an int)
    • p = &y; assign the address of y to p
  • Dereferencing:
    • z = *p; assign value at adress in p to z
    • *p = z; assign the value of z to what p is pointing to
  • Must initialize pointers first
    • int *ptr; *ptr=5; does not work since *ptr never had an initial value
  • Benefit of cleaner code, faster to pass pointer to large struct/array
  • Drawback is bugs (dynamic memory management, dangling references, memory leaks)

Pointer Usage

  • Pointers can point to any data type
  • void* is a generic pointer (not recommended)
  • Pointers can point to functions
    • int (*fn) (void*, void*) = &foo
    • fn is a function that accepts 2 void* pointers and returns an int. Initially points to function foo
    • (*fn)(x, y) is equivalent to foo(x,y)
  • Struct arrow notation: int h = paddr->x; equivalent to int h = (*paddr).x;
    • Arrow notation is more commonly used
    • p1 = p2 copies all values, not pointers
  • NULL pointer: The pointer of all 0s
    • Writing or reading to null pointer causes crash
    • Since 0 is false, simple test for null pointer if(q)
  • Type declaration tells how many bytes to fetch for each pointer
    • 32bit int is stored in 4 consecutive 8-bit bytes
    • Word alignment: Can only access 4 byte boundaries
  • sizeof(type) returns num of bytes in object
    • By C99 definition, sizeof(char) == 1

Arrays

  • A contiguous block of memory
  • Declaration: int ar[] = {1, 2};
  • Arrays are almost identical to pointers
    • char *string and char string[] are nearly identical
    • Cannot increment an array
  • ar[0] is same as *ar, ar[2] is same as *(ar+2)
  • Declared arrays are only allocated while scope is valid
  • Declare int ARRAY_SIZE instead of hardcoding
  • Array does not know its own length and bounds not checked
    • Exception is strings, the null terminator signifies end
  • Common errors
    • Segmentation fault - read/write to memory you don't have access to
    • Bus error - alignment is wrong (bad word alignment)
  • pointer + n adds n * sizeof(type)
  • Handle: pointer to a pointer **h
    • Needed to advance a pointer for an array
    • Typically no more than 2 stars

Lecture 5: 9/4/20

Dynamic Memory Allocation

  • sizeof gives size in bytes
    • Also knows size of arrays (in bytes)
  • Allocate room to point to: malloc()
    • Ex. ptr = (int *) malloc(n*sizeof(int))
      • This will cast the void* returned by malloc to int*
    • malloc location contains garbage initially
    • Dynamically allocated space must be freed
    • When main returns it auto-frees it up, but bad practice
    • Memory leak: If memory is not freed and the program exits
    • Malloc returns NULL if request cannot be granted
  • Common errors
    • free()ing the same place of memory twice
    • free()ing something that you didn't malloc()
  • Resizing
    • realloc(p, size) which returns a new pointer (NULL if fails)
    • realloc(ip, 0) is same as free(ip)
  • Array names are not variables
    • *a: first value of array, a: pointer of array, &a: pointer of array (just like a)
  • 3 Rules of pointers
      1. Need to set up the pointee (separate from pointer)
      1. Dereference a pointer to access its pointee
      1. Assignment (=) between pointers makes them point to the same pointee

Linked List

struct Node { char *value; struct Node *next; }; typedef struct Node *List; // List is pointer to struct Node List ListNew(void) { return NULL; } //Create new empty list
  • When you malloc a string, need strlen(string) + 1 for null terminator \0
  • strcpy(destination, original)

Memory locations

  • Struct declaration does NOT allocate memory
  • Variable declaration does allocate memory
  • Methods of allocating memory
    • Declaration of local variable
    • Dynamic allocation at runtime like malloc
    • Global variable (above main()) which has global scope
  • 3 Pools of memory
      1. Static storage: global variable storage (permanent)
      1. Stack: Local variable storage, params, return address
      1. Heap: Dynamic malloc storage: data lives until dallocated by programmer
  • Program Address space
    • Stack: local vars, grows downward
      • Big stack: Deep recursion/fractals
    • Heap: Space from malloc, grows upward
      • Just big mallocs without function calls
    • Static data: Declared outside of main. Does not grow/shirnk
    • Code: Loaded when program starts. Does not grow/shrink

Heap

  • Stack frame (LIFO)
    • Return instruction address, params, space for other local variables
    • Always contiguous blocks of memory, stack pointer tells where top stack is
    • When procedure ends, stack frame is tossed off stack freeing memory for future stack frames
    • Don't erase memory even if not needed, just move stack pointer up

Heap

  • Large pool of memory, not allocated in contiguous order
  • Specify the num of bytes to allocate in malloc
  • Tricky to manage unlike others
  • Want malloc(), free() to be fast, minimal memory overhead
  • Want to avoid (external) fragmentation (free memory separated in small chunks)
  • K&R Implementation
    • Each block of memory is preceded by a header with size of block and pointer to next block
    • Free blocks are kept in a circular linked list
      • Pointer field unused in allocated blocks
    • malloc(): Search free list for a block large enough
    • free(): Check if blocks adjacent to freed block are free
      • Coalesce into 1 bigger free block, otherwise add new freed block to free list
    • free is very quick, malloc can search the full list worst case
  • malloc block choosing
    • best-fit: choose smallest block that is big enough
    • first-fit: choose first block that is big enough
    • next-fit: same as first-fit but resume seraching where we finished last time
      • distributes the small slivers of memory

When Memory goes Bad

  • Pointer problems
    • Dangling reference: using ptr before malloc
      • Writing off the end of arrays
      • Cannot return pointer to variable in local frame
      • Use after free
      • Stale pointers after realloc moves data
      • Freeing the wrong stuff (incremented pointer)
      • Double free
    • Memory leaks: tardy free, lose the ptr
      • Incrementing a pointer
  • Valgrind catches memory leaks, misues of free, writing over end of arrays
    • Helpful for debugging

Lecture 6: 9/9/20

  • Very large/small numbers and fractional numbers need to be stored
  • Agree where to put the binary point to sep integer and fractional parts
  • Store where the binary point is in a diff variable
  • Binary point can 'float' as the binary point is no longer fixed
  • Normalized form: No leading 0s (like scientific notation)

FLoating Point Representation

  • First 1 is not recorded (since always 1)
  • Sign: 1 bit (use sign-magnitude form)
  • Exponent: 8 bits
  • Significand: 23 bits (LSB represents
    1023
    )
    • Can represent range of
      (1.21038,3.41038
      )
    • Overflow: Exponent is larger than 8 bits (
      >3.41038,<3.41038
      )
    • Underflow: Negative exponent is larger than 8 bits
      • Ends up storing small fraction as 0
  • More precision: double (52 bit significand)
  • Significand is always between 0, 1 (for normalized numbers)
  • Representing 0: Exponent value of 0
    • 0 has no leading 1
    • Results in 2 zeroes because of the sign bit
  • Biased exponent representation (IEEE Standard 754)
    • 2's complement would make negative numbers look bigger
    • Bias: 127
    • (1)Sign(1+Significand)2Exponent127

Special Numbers

  • How do we represent
    ±
    infinity
    • Use exponent 255 with 0 significand
  • Representing NaN
    • Use exponent 255 with nonzero signifcand
    • NaNs contaminate results as they percolate up
    • 2231
      possible NaNs used to specify each error (sqrt neg number, 0/0, etc.) and include which line it occured on
  • Problem of normalization and implicit 1
    • Gap between FP numbers around 0
    • Smallest pos num:
      2126
    • Second smallest:
      (1+223)2126=2126+2149
  • Denorm: Solves implicit 1 problem
    • Exponent 0 with nonzero significand
    • No implicit leading 1, implicit exponent -126
    • Smallest:
      2149
    • Second smallest:
      2148
    • Even steps until normalized region and the stride doubles with higher exponents
    • Have
      223
      8 million slots between each power of 2
    • 224+1
      is the smallest int that cannot be represented since the stride would be 2
  • Example:
    13
    • Sign: 0
    • Exponent:
      2+127=125=0b01111101
    • Significand:
      0b010101

Fallacies

  • Associativity is lost
    • Small numbers can be lost when added to big numbers
  • Precision: count of the number bits used to represent a value
  • Accuracy: Difference between actual value and comp representation
  • High precision permits high accuracy but no guarantee
  • Rounding: 2 extra bits of precision to round
    • Converting double to single precision (or FP to int)
      1. Round to
        +
      1. Round to
      1. Truncate (round to 0)
      1. Unbiased: Round to even number
      • Decimal examples: 2.5 rounds to 2, 3.5 rounds to 4
  • FP Addition
    • De-normalize to match exponents
    • Add significands
    • Keep same exponent
    • Normalize again
  • Casting
    • int -> float -> int is not always same
    • float -> int ->float is not always same

Alternate FP Representations

  • float: standard float (8exp, 23 sig)
  • binary64: double precision (11exp, 52 sig)
  • Quad-precision: binary128 (15exp, 112 sig)
  • Oct-precision: binary256 (19 exp, 236 sig)
  • Half-precision: binary16 or fp16 (5exp, 10 sig)
  • bfloat16 (brain FP): (8exp, 7sig)
  • unum: Tells how many bits are for exp/sig (customizable)
    • u-bit indicates if the number is exact

Lecture 7: 9/11/20

  • CPU executes a lot of instructions
  • Instruction Set Architecture (ISA): set of instructions a CPU implements
  • RISC: Keep instruction set small and simple to build fast hardware

RISC-V Basics

  • Registers instead of variables
  • Registers are inside the processor (not memory)
  • Predetermined number of registers in hardware (32)
    • Each RISC-V register is 32 bits wide (independent of variant)
    • Numbered from x0 to x31
    • x0 is special, always holds value 0
    • Registers can be referred to by number or name
    • Registers have no type
  • Comments start with #
  • Instruction: one statement/command
    • Each line has at most 1 instruction
    • One operation per instruction

RISC-V add/sub

  • add x1, x2, x3
    • one: operation by name
    • two: operand getting result (destination)
    • three: first operand
    • four: second operand
  • Break longer sequences into multiple instructions
  • Intermediates: numerical constants
    • addi x3, x4, 10
    • No subi, just addi a negative value
    • Use x0 for number zero to avoid using intermediates

Lecture 8: 9/14/20

  • Some registers have designated use (usually < 30 registers available)
  • Optimizer compiler minimizes register usage
  • Register footprint - number of registers needed for a compute kernel
  • Memory is like a large 1D array
  • Processor gives address(offset to base pointer) to point in memory
  • Load from (read) and store to memory (write)
  • Memory addresses are in bytes
    • Word addresses are 4 bytes apart
  • Little-endian: Least significant byte is smallest address
    • Bits are always stored with most significant in upper position
  • DRAM: Dynamic Random Access Memory (fast but not as fast as registers, medium capacity)
    • registers are 50-500x faster than memory

Loading from and storing Memory

  • Load Word: lw destination, offset(base_ptr)
    • Offset is in bytes (multiple of 4)
    • Data flow moves from right to left
    • C Code: int A[100]; g = h + A[3];
    • Equiv: lw x10, 12(x15); add x10, x12, x10
  • Store Word: sw current, offset(base_ptr)
    • Data flow moves from left to right
    • int A[100]; A[10] = h + A[3];
    • lw x10, 12(x15)
    • add x10, x12, x10
    • sw x10, 40(x15)
  • Load byte/store byte: lb sb for byte data transfers
    • lb x10, 3(x11) copies content of x11+3 to low byte pos of x10
    • Sign-extend the sign bit to all higher bits
  • Unsigned byte load: lbu
    • No sign-extension needed
    • No sbu (nonsensical)
  • addi equiv to sw, lw, add, but addi is much faster
    • Immediates must be <= 32 bits
    • Need to load longer immediates from memory

Decision Making

  • if statement: beq reg1, reg2, L1
    • Go to statement labeled L1 if (value reg1) == (value reg2)
    • Go to next statement otherwise
    • Counterpart: bne branch not equal
  • conditional branch:
    • Less than: blt (unsigned: bltu)
    • Branch greater: bge (unsigned: bgeu)
  • Unconditional branch (always)
    • Jump: j label
  • == usually translates to bne so next instruction executes if condition true
  • Need a jump like j Exit at the end of bne block
  • No bgt, ble
  • Need to increment memory pointers by num of bytes

Lecture 9: 9/16/20

  • Logical instructions: isolate or packing bytes/nibles into words
  • 2 variants: register and immediate
  • No logical not, just use xor with 11111111
  • Logical shifting:
    • Shift logical left: sll (shift 2 bits left and insert 0s on right, bits fall off left end)
    • Shift logical left immediate: slli
    • Shift logical right: slr fills left with 0s
  • Arithmetic shifting: logical shift for signed int
    • Shift right arithmetic (sra, `srai): moves n bits to the right, inserts sign bit into empty bits
    • Not equivalent to divide by
      2n
      (fails for odd neg numbers)
    • C standards is division should round to 0

Registers

  • Programs are stored in memory
  • Program lives in an area, data lives in an area (In practice, program above data)
  • 1 RISC-V instruction is 32 bits
  • Instruction fetched from memory using Program Counter (PC)
    • Next instruction is 4 bytes away (incr PC +4)
  • Symbolic register names
    • x10 - x17 are a0-a7 for params, a0, a1 for return values
    • x1 is ra (return address to point of origin)
    • x8-x9: s0-s1 and x18-x27: s2-s11: saved registers
    • x0 is zero
    • Pseudo-instructions: shorthand for common assembly
      • mv rd, rs = addi rd, rs, 0 (move value to another register)
      • li rd, 13 = addi rd, x0, 13 (load immediate value into register)
      • nop = addi x0, x0, 0 (no operation, processor just waits)

Function calls

  • 6 fundamental steps
      1. Put arguments in place where func can access them
      1. Transfer control to Function
      1. Acquire local storage resources needed for func
      1. Perform desired task of func
      1. Put return value where code can access it, restore used registers, release local storage
      1. Return control to origin
  • Use jr not j since jr allows us to use a variable to return (jr ra)
  • Jump and link: jal rd, Label will jump and return save address (equiv to addi then j)
    • Essentially link then jump
    • Pseudo-instruction is ret = jr ra
  • jalr rd, rs, imm: Jump and link register
    • j, jr, ret are pseudo-instructions for jal (no ret address is just a jump)

Lecture 10: 9/18/20

  • Use stack to save old values before calling func
  • Stack pointer: sp (x2)
    • Stack grows downwards (stack address decreases when we push)
    • Push decrements sp, pop increments sp
  • Stack frame includes:
    • return instruction address
    • Params (args)
    • Space for local vars
  • Stack pointer points to bottom of stack frame
addi sp, sp -8 # adjust stack for 2 items
sw s1, 4(sp) # Save s1 for use after
sw s0, 0(sp)
...
lw s0, 0(sp) # restore s0 for caller
lw s1, 4(sp)
addi sp, sp, 8 # adjust stack to delete 2 items
jr ra (ret)

Nested function calls

  • ra would be overwritten in future calls
  • Register conventions: a set of rules for which registers are unchanged after jal
    • Preserved: sp, gp, tp, s0-s11 (s0 is fp)
    • Not preserved: a0-a7, ra, t0-6 (arg and temp registers)
    • Caller must preserve args/rv, temp, ra
    • Callee must preserve saved registers and sp

Memory Allocation

  • Types of storage classes:
    • Automatic: local to func, discarded when func exits
      • Use stack for local vars
      • Procedure frame/activation record: segment of stack with saved registers and local vars
    • Static: exist across procedures
  • Stack: high end 0xbfff_fff0
  • Text segment in low end: 0x0001_0000
  • Static: gp points to static (0x1000_000)
  • Heap: above static, grows up to high addresses
  • The callee saves them on the stack at the beginning of the function call and restores the stack right before returning

Lecture 11: 9/21/20

  • Everything has memory addreses (instructions, data words)
    • Branches and jumps to different sections
    • One register keeps address of the instruction being executed (Program counter: PC) aka Instruction Pointer (IP)
  • Programs are distributed as binaries but are different versions based on device
    • Want backwards compatibility
  • Data is in words(32 bit chunks)
    • Each register is a word (lw, sw to access 1 word of memory)
    • Divide instruction into fields
    • R- register-register arithmetic ops
    • I- register-immediate arithmeti cops, loads
    • S- stores
    • B- branches (minor variant of S)
    • U- 20-bit upper immediate instructions
    • J - jumps (minor variant of U)

R-Format

  • opcode: what format it is
    • 0b0110011 for all R format
  • funct7+funct3: tell what operation to perform
  • rs1, rs2: source register: specifes register w/ first operand
  • rd: destination register: register that receives result of computation
  • Each register field holds a 5-bit uint corresp to reg number (x0-x31)

I-Format

  • imm[11:0] covers range
    [2048,2047]
    • Sign-extend immediate
  • opcode: 0b0010011
  • shift amounts is limited to 5 bits since larget shifts are meaningless
  • Loads:
    • 12bit signed immediate for the memory address (offset)
    • rd contains value loaded from memory
    • rs1 is load from

S-format

  • Store needs 2 registers (rs1, rs2) but no rd

Lecture 12: 9/23/20

B-format (SB)

  • Need to read 2 registers but don't write to register
  • Branches used for loops (func calls and unconditional jumps handled by jump instructions)
    • Can only jump relative to PC
  • Largest branch dist limited by size of code
  • PC-Relative Addressing: Use immediate as 2s complement offset to PC
  • 12 bits gives
    ±21132
    • Can only branch to beginning of word
    • Branch calculation
      • No branch: PC = PC + 4
      • Branch: PC = PC + immediate*4
  • Support for 16-bit words means branch size reduced by half
    • Half of targets will be errors
    • ±21032
    • Branch calculation
      • No branch: PC = PC + 4
      • Branch: PC = PC + immediate*2
  • The 12bit immediate is actually 13bits but always has LSB = 0
  • beq with offset 0 is infinite loop

U-Format

  • Need to branch to destination
    >210
    • Replace beq with bne and then j
  • 20 bit immediate for upper 20 bits of 32 bit word
  • lui: Load upper immediate
    • lui then addi to set all 32 bits
    • addi sign extends though which can cause problems
    • Add 1 to LUI value to take that into account
    • li: pseudo-op to handle this (lui and addi)
  • auipc: Add upper immediate to PC

J-Format (UJ)

  • jal saves PC+4 to rd (return address)
    • The immediate is the offset, remove the LSB 0
  • PC relative jump to
    ±219
    locations
    =±218
    32bit instructions
  • jalr rd, rs, immediate uses I-format
    • Writes rd = PC+4, PC = rs + immediate
    • Way to jump to absolute address
    • Cannot assume LSB is 0 so range is reduced
    • Use auipc, jalr to jump to pc-relative 32-bit offset

Lecture 13: 9/25/20

  • Interpreter: Directly executes a program in source language
    • Higher level, better error msgs
    • Slower (10x), code smaller (2x)
    • Independent instructions (can run on any machine)
  • Translator: converts prog from 1 source lang to another
    • Translates to executable, don't know the source code
  • Compilation, Assembly, Linking, Loading
  • Compiler
    • Input: High level lang code (foo.c)
    • Output: Assembly code: (foo.s)
      • Can contain pseudo-instructions

Assembler

  • Input: Assembly code (foo.s)
  • Output: Object code (foo.o)
  • Reads and uses directives
    • Ex. .text (code), .data (static), .globl sym, .string str .word w1...wn
  • Replace psuedo-instructions, create machine code, create object file
  • Need to sign extend immediates
  • Branch immediates count halfwords
  • PC-relative branches/jumps: count half words between curr and where you want to go
    • Foward reference problem: branching below
  • Solution: Take 2 passes
    • Pass 0: Replace pseudo-instructions
    • Pass 1: Remember position of label
    • Pass 2: Use label positions to generate code
  • Pc-relative jumps (jal, beq, bne)
    • j ofset expands to jal, zero, offset
    • Count half-words between target and jump for position-independent code (PIC)
      • PIC is good - relative to location
  • Static data refs:
    • lui addi require full 32b address
    • auipc addi uses PIC
  • Symbol table
    • List of items that can be used by other files
    • Labels: function calling
    • Data: .data variables can be accessed across files
  • Relocation table
    • List of items whose address the file needs
    • Absolute labels (jal, jalr)
    • Static data (la)
  • Object file format (ex. ELF)
    • Object file header: size, pos of other pieces
    • text segment: machine code
    • Data segment: static data (in binary)
    • Relocation info: Has lines of code that need to be fixed later
    • Symbol table: List of files labels and static data that can be referenced
    • Debugging info

Linker

  • Input: Object code files, info tables (foo.o)
  • Output: Executable code (a.out)
  • Combines several object files into a single executable
  • Throws error for duplicate or missing symbols
  • Types of addresses
    • PC-Relative (beq, bne, jal, auipc/addi): Never need to relocate
    • Absolute or external func address (quipc/jalr): Always relocate
    • Static data reference (lui/addi): Always relocate
  • Relocation editing:
    • J-format: jal address
    • I,S format: static pointers
    • No conditional branches
  • Text starts at 0x1000
  • Statically-linked approach: include entire library even if not all of it is used (self-contained)
  • Dynamically-linked (DLL): include library at runtime
    • Smaller storage, requires less memory
    • Requires time to link at runtime
    • Link at machine code level

Loader

  • Input: Executable code (a.out)
  • Output: program run
  • Part of the OS
  • Read header to know size of text and data
  • Create address space for text, data, stack
  • Copy instructions and data from executable file to new address space
  • Copy CLI args to stack
  • Initialize machine registers (mostly cleared)
  • Jump to start-up routine that copies program args from stack to register, set PC

Lecture 14: 9/28/20

  • Synchronous digital system
    • Synchronous: operations coordinated by a central clock
    • Digital: Values represented by discrete values
  • Transistor: semiconductor device to amplify or switch signals
    • CMOS: Complementary metal oxide semiconductor
    • Drain, Gate, Source
    • Current flows from source to drain depending on the gate
  • Can crete AND, OR, NOR out of NAND
  • Asserted switch = closed
  • Signals are transmitted instantly and continuously
    • Only contains 1 value at a time
  • [Block] propagation delay: Delay between clock start and when you can see the output
  • Combinational Logic (CL) circuits
    • Output is a function of inputs
  • State elements
    • circuits that store information (registers, cache)

Lecture 15: 9/30/20

  • State can store values for time (registers, caches)
  • Control flow of info between logic blocks
  • Register = n "flip-flops" aka D-type Flip-Flop
  • D: data, q: output
  • At rising edge of the clock, sample input d.
    • After a time delay, output q
    • Setup time: Time before rising edge that input must be stable
    • Hold time: Time after rising edge that input must be stable
    • Clk-to-q delay: Time until q gets updated and is stable
  • Max delay = CLK to Q delay + CL delay + Setup time
  • Extra registers added to speed up clock rate
  • Clock period limited by propagation delay of adder/shifter
  • Finite State Machine: Represent with a state transition diagram
  • Use CL + register
  • Si1
    is the stable output of a register which holds result of
    i1
    iteration

Lecture 15: 10/2/20

  • XOR: 1 if odd number of 1s
  • Boolean algebra: + is OR,
    is AND,
    x¯
    is NOT
  • Circuit verification: If they are equivalent in boolean algebra
  • Uniting theorem:
    xy+x=x,(x+y)x=x
  • DeMorgan's Law:
    (xy)¯=x¯+y¯
    ,
    (x+y)¯=x¯y¯
    • Distribute the "bar" to all 3 terms
  • Sum-of-products: ORs of ANDs
    • Sum together all possible outcomes of 1 (not for 0 for each input bit)

Lecture 16: 10/5/20

Multiplexor

Arithmetic and Logic Unit (ALU)

  • 00: A + B
  • 01: A - B
  • 10: A & B
  • 11: A | B

Adders

  • LSB:
    s0=a0XORb0
    ,
    c1=a0&b0
  • ith bit:
    si=XOR(ai,bi,ci)
    ,
    ci+1=MAJ(bi,bi,ci)=aibi+bici+aici
  • N 1-bit adders cascaded is a 1 N-bit adder
  • For unsigned numbers, existence of carry indicates overflow
  • For signed numbers, carry is not indicative
    • For highest adder: overflow =
      cnXORcn1
    • No c_out, no c_in: no overflow
    • c_in and c_out: no overflow
    • c_in but no c_out: overflow (A, B > 0)
    • c_out but not c_in: overflow
  • Subtractor: A-B = A + (-B)
    • Does an XOR to flip the bits
    • if SUB is high, then we just add that 1

Lecture 17: 10/7/20

  • Hardware is inherently parallel
  • CPU - active part of computer that does work (data manipulation and decision making)
  • Datapath: Part of processor that has hardware to perform operations
  • Computer executes 1 instruction per tick
  • State outputs drive inputs to combinational logic to make next state
  • All state elements updated at rising clock edge
  • Problem: single monolithic block is too bulky
  • Solution: Break up process of executing instruction into steps
    • Stage 1: Instruction Fetch (IF)
    • Stage 2: Instruction Decode (ID)
    • Stage 3: Execute (EX) ALU
    • Stage 4: Memory Access (MEM)
    • Stage 5: Write Back to Register (WB)
  • All 5 stages occur within 1 clock cycle (rising edge fetch, write on the next rising edge)

Lecture 18: 10/9/20

See lecture diagrams

Lecture 19: 10/12/20

  • Control and status register (CSR): diff from register file
    • Monitor status and performance
    • Up to 4096 CSRs
    • Necessary for peripherals
  • csrrw swaps value of a CSR and a integer register
  • csrw csr, rs1 is csrrw x0, csr, rs1
    • rd=x0 meand just write rs1 to CSR
  • csrwi csr, uimm is csrrwi x0, csr, uimm
    • rd=x0 just writes uimm to CSR

  • LSB is stable first
  • Critical path lw =
    tclkq+tIMEM+tImm+tALU+tDMEM+tmux+tsetup
  • Certain instructions take longer
    • Most blocks will be idle for most of the time
  • ROM: Read Only Memory

Lecture 20: 10/14/20

  • Measure performance in maximum frequency
  • Performance measures:
    • Program execution time
    • Throughput (num of server requests handled per hour)
    • Energy per task
  • Iron law of processor performance:
    • TimeProgram=InstructionsProgramCyclesInstructionTimeCycle
    • CPI: Cycles per instruction
    • Instructions per program
      • Determined by task, algorithm, language, compiler, ISA
    • Average CPI:
      • Determined by ISA, processor implementation, complex instructions (like strcpy, CPI >> 1), superscalar processors (CPI < 1)
    • Time per cycle:
      • Procesor microarchitecture (critical path), technology, power budget
  • 30% of energy is leaked in CMOS
  • EnergyProgram=InstructionsProgramEnergyInstruction
  • EnergyProgramInstructionsProgramCV2
    • Capacitance depends on technology processor features
    • Supply voltage (run slower with less supply voltage)
  • Energy Iron Law: Performance = Power * Energy Efficiency
    • Performance: tasks/second
    • Energy efficiency: tasks/joule

Pipelining

  • Does not help latency of single task
  • Helps throughput of entire worload
  • Multiple tasks operate simultaneously using diff resources
  • (theoretical) Speedup = num of pipe stages
    • Time to fill and drain pipeline in reality
    • Pipeline rate is limited by slowest pipeline stage

Lecture 21: 10/16/20

  • Put registers in between the 5 stages to pipeline
  • Time to process 1 instruction can be longer so worse latency, but better throughput
    • Clock rate can be dramatically higher
  • Pipelining hazards: situations that prevent startingthe next instruction in the next clock cycle
    • Structural hazard: Required resource is busy (needed in multiple stages)
      • 2+ instructions compete for single physical resource
      • Sol1: Instructions take turns, causing some to stall
      • Sol2: Add more hardware (can always solve)
      • 2 independent read ports and 1 write port
      • 2 separate memories (cache)
    • Data hazard: Data dependency between instructions, need to wait for prev inst to complete its read/write
      • Write in the first half of cycle, read in the second half of cycle (registers are fast)
      • Sol1: Stall using bubble (nop) when it depends on previous inst, but reduces performance
      • Compiler will try to rearrange code that is not dependent
      • Sol2: Forwarding/bypassing - grab operand from pipeline stage rather than register file (output of ALU to input of ALU)
    • Control hazard: Flow of execution depends on previous instruction (for branches)

Lecture 23: 10/18/20

Load Data Hazard

  • Load first, then do operations on it
    • Load is only available after DMEM, but needed 1 cycle earlier
    • Sol 1: Load delay slot: Need 1 stall, convert next command into a nop
    • Just set regWEn, MemRW, and do not update PC
    • Cycle after that execute the original command
    • Forward from DMEM output to ALU
    • Sol2: Put unrelated instruction into load delay slot

Control Hazards

  • Branches are condition so don't know next inst
  • Earliest to know branch is at end of execution (ALU)
  • 2 subsequent instructions are executed regardless of the branch outcome
    • Sol1: Use 2 stall cycles after a branch
    • Sol2: If branch/jump taken, convert the 2 next to nop
    • Use branch prediction to guess
      • Generally loop, so branch is rarely taken
      • Ex. Keep a bit of if the branch was taken last time
    • Moving branch comparator to ID stage would incur data hazard and forwarding could fail

Superscalar Processors

  • Clock rate limited by tech and power dissipation
  • Deeper pipelines (10, 15 stages)
    • Less work per stage, so shorter clock cycle
    • More potential for hazards (CPI > 1)
  • Multi-issue superscalar
    • Replicate pipeline stages: multiple pipelines (CPI < 1)
    • Multiple execution units, start multiple instructions per clock cycle
    • Out of order execution to reduce hazards
    • Arrive at commit unit which reorders
    • Get CPI using iron law (time program, count instructions, know time/cycle)