# CS186 MT1 Notes
# Lecture 1: 1/22/20
* SQL is declarative (say what you want, not how to get it)
* Turing complete, not functional
* Can call to other languages to extend SQL
* Database: set of named relations
* Relation/table:
* Schema: description/metadata
* Instance: set of data satisfying the schema
* Attribute/Column/Field
* Tuple/Record/Row
* Schema is fixed (unique attribute names, atomic types)
* Instance can change often
## SQL Language
* 2 sublanguages:
* DDL - Data Definition Language: Define and modify schema
* DML - Data Manipulation Language: Queries are written intuitively
* RDBMS (Relational database management system): chooses and runs algorithms to give answers
* Primary key columns: provide unique lookup keys
* Cannot have any duplicate values
* Can be made of >1 column
* Foreign keys: pointers to reference other tables
* References the primary key
## Query Syntax
* `CREATE TABLE name(col1 text, col2 integer)`
* `CREATE TABLE pets(name text, age integer, type text, cuteness integer)`
* `UPDATE table_name SET attribute=new_value WHERE attribute=something`
* `UPDATE pets SET age = age+1 WHERE name = 'Skippy';`
* `DELETE FROM table_name WHERE attribute=something`
* `DELETE FROM pets WHERE name = 'Peter';`
* `SELECT [DISTINCT] <col expr> FROM <single table> [WHERE <predicate>] [ORDER BY <attr>]`
* Name output columns with `AS`
* Can order by `DESC` or `ASC` (Default is ASC)
* Use `LIMIT` to limit the number of results
* Use with `ORDER BY` or else the answer will be nondeterministic
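A quick illustrative query over the `pets` table created above (the filter value `'cat'` is made up):

```sql
-- The two oldest cats, with a renamed output column.
SELECT name AS pet_name, age
FROM pets
WHERE type = 'cat'
ORDER BY age DESC
LIMIT 2;
```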
## Aggregates
* Produce a summary of data (1 row of output)
* Ex. `SUM, COUNT, MAX, MIN`
* `GROUP BY` column values to partition table into groups
* Produce aggregate result for each group
* Can put grouping columns into select
* `HAVING` filters after grouping and aggregation
* Ex. `HAVING COUNT(*) > 2`
* `DISTINCT` is used inside aggregate functions or right after `SELECT` to remove duplicates from output
* Ex. `SELECT COUNT(DISTINCT S.name)`
* For aggregate queries, distinct outside does nothing since no duplicates
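Illustrative aggregate query over the same `pets` table (the threshold of 2 is arbitrary):

```sql
-- Average cuteness per pet type, keeping only types with more than 2 pets.
SELECT type, COUNT(*) AS num_pets, AVG(cuteness) AS avg_cuteness
FROM pets
GROUP BY type
HAVING COUNT(*) > 2;
```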
# Lecture 2: 1/23/20
* Evaluation:
* 1) FROM: Identify table
* 2) WHERE: Apply selection (eliminate rows)
* 3) SELECT: Project away columns (keep those in the SELECT list plus any used in GROUP BY or HAVING)
* 4) GROUP BY: Form groups & aggregate
* 5) HAVING: Eliminate groups
* 6) DISTINCT: Eliminate duplicates
* FROM multiple tables creates cross product
* Self-joins need aliases (`AS`) to distinguish the copies
* String comparison
* `WHERE S.sname LIKE 'B_%'` (regex-style equivalent: `'B.*'`)
* `_` matches exactly one character; `%` matches any number of characters
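Illustrative self-join plus `LIKE` sketch, assuming the usual Sailors(sid, sname, rating, age) schema behind the `S.sname` reference above:

```sql
-- Pairs of distinct sailors with the same rating whose names start with 'B'.
SELECT s1.sname, s2.sname
FROM Sailors AS s1, Sailors AS s2
WHERE s1.rating = s2.rating
  AND s1.sid < s2.sid        -- avoid pairing a sailor with itself / duplicate pairs
  AND s1.sname LIKE 'B_%';
```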
## Set Operations
* `UNION`, `INTERSECT`, `EXCEPT`
* Tuples are elements in sets
* `UNION ALL` sums cardinalities of both sets
* `INTERSECT ALL` is min of cardinalities
* `EXCEPT ALL` is the difference of cardinalities (counts that would go below 0 become 0, i.e., no copies in the output)
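Illustrative sketch, assuming Sailors(sid, rating, ...) and Reserves(sid, bid, day):

```sql
-- sids that appear in Reserves but are not highly-rated sailors.
SELECT sid FROM Reserves
EXCEPT
SELECT sid FROM Sailors WHERE rating > 8;
```

With the `ALL` variants the counts add/subtract per value: e.g., a sid appearing 3 times on the left and once on the right survives `EXCEPT ALL` with count 2.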
## Nested queries
* Use `IN` for a subquery (or `NOT IN`)
* `ANY` or `ALL` for comparison
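Illustrative `IN` subquery, same assumed Sailors/Reserves schema (the boat id is made up):

```sql
-- Names of sailors whose sid appears in the subquery's result.
SELECT S.sname
FROM Sailors S
WHERE S.sid IN (SELECT R.sid FROM Reserves R WHERE R.bid = 102);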
## Joins
* Inner joins: `FROM Sailors s INNER JOIN Reserves r`
* Inner, natural, or (left, right, full) outer joins
* Inner joins are default
* Natural joins: equality on matching column name joins
* Left outer join: return all matched rows and preserve unmatched rows from left table
* Use NULLs for fields of unmatched tuples
* Right outer join: preserve unmatched rows from right table
* Full outer join: preserve unmatched rows from both sides
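Illustrative left outer join, same assumed Sailors/Reserves schema:

```sql
-- Every sailor appears; bid is NULL for sailors with no reservations.
SELECT S.sname, R.bid
FROM Sailors S LEFT OUTER JOIN Reserves R ON S.sid = R.sid;
```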
## Views
* `CREATE VIEW view_name AS select_statement`
* Subroutines
* Views are evaluated whenever you reference them (not snapshots of stored results)
* Can just create subqueries in the `FROM` clause
* Common table expression: `WITH <view> AS () SELECT`
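Illustrative sketch (same assumed Reserves schema; the view/CTE names and the threshold are made up), showing a view and the same logic as a common table expression:

```sql
-- A named view: evaluated whenever referenced, not a stored snapshot.
CREATE VIEW ReservationCounts AS
  SELECT bid, COUNT(*) AS num_res
  FROM Reserves
  GROUP BY bid;

-- The equivalent logic inlined as a CTE.
WITH Counts AS (
  SELECT bid, COUNT(*) AS num_res
  FROM Reserves
  GROUP BY bid
)
SELECT bid FROM Counts WHERE num_res > 5;
```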
## Nulls
* Come naturally from outer joins
* `SELECT rating = NULL` returns NULL for every row (comparisons with NULL evaluate to NULL)
* Explicit checks `WHERE rating IS NOT NULL`
* NULL is treated as false for WHERE
* Three valued logic with T,F,N
* NOT N = N
* NULL could be T or F, so the answer must accommodate either value (think of short-circuiting: TRUE OR NULL = TRUE, FALSE AND NULL = FALSE)
* NULL column values are ignored by aggregate functions
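Illustrative NULL sketch (same assumed Sailors schema):

```sql
-- '= NULL' evaluates to NULL for every row, so the first query returns nothing;
-- IS NOT NULL is the explicit check.
SELECT * FROM Sailors WHERE rating = NULL;      -- always empty
SELECT * FROM Sailors WHERE rating IS NOT NULL; -- rows with a known rating
```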
# Lecture 3: 1/28/20
## DBMS Architecture
* Query parsing: checks and verifies the SQL
* Relational operators are the algorithms that answer queries
* File & index management: organize tables and pages into logical files
* Buffer management layer: map disk blocks into memory (RAM)
* Disk space management: translate page requests into physical bytes on SSDs
* Concurrency control: multiple users use DB
* Recovery: if DB crashes, we can recover the data
## Storage Media
* No pointer derefs
* Instead we have READ and WRITE APIs (slow)
* Mag Disks are most economical (compared to SSD, RAM)
* Disk components:
* Platters spin at ~15,000 RPM
* Track "spins" under the disk head
* Platters make a cylinder
* Only 1 head reads/writes at any time
* Block/page size is a multiple of fixed sector size
* Accessing a disk page:
* Seek time (2-3ms) - move arm
* Rotational delay(0-4ms) - block rotates
* Transfer time (.25ms) - move data off
* Minimize seek/rotation delays
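* Worked example (illustrative, using the delays above): a random page access costs roughly $2.5 + 2 + 0.25 \approx 4.75$ms, while the next sequential block on the same track pays only the $\approx 0.25$ms transfer time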
## Flash/SSD
* NAND flash: fine-grained reads (4-8KB pages), coarse-grained writes (1-2MB erase units)
* Cell fails after 2-3k erasures
* Wear leveling - don't rewrite to same location every time
* Write amplification - you end up writing big units and have to re-organize
* Writes (random and sequential) are slower than reads
* Up to 10x the bandwidth of ideal HDDs
* Locality matters for flash and disk writes
* Smaller DBs can even fit in RAM
## Disk Space Management
* Accessing sequential blocks is faster
* Amortize seek delays (HDDs), writes (SSDs)
* Cache: keep popular blocks
* Pre-fetch blocks that are likely to be accessed
* Buffer writes to sequential blocks
* Block/Page = unit of transfer for read/write
* 64-128KB
* Next block:
* sequential blocks on same track
* Blocks on same cylinder
* Blocks on adjacent cylinder
* Arrange file pages sequentially
* Lowest layer of DBMS
* Map pages to locations on disk
* Load pages from disk to memory
* Save pages back to disk, ensure writes
* Higher levels get requests to read/write pages and allocate pages
* Implementation
* A) Talk to the storage device directly (but device can change)
* B) Run over filesystem (FS)
* Allocate 1 large contiguous file
* An FS file is typically restricted to a single disk drive, but a DBMS may span multiple file systems/disks
## Database files
* Tables stored as files which contain pages
* Pages contain a collection of records
* Disk space manager for disk page management
* Buffer manager: access pages in RAM (higher levels)
* API for higher levels:
* Insert/delete/modify record
* Fetch particular record by record id
* Scan all records
* Files can span multiple machines
### DB File Structures
* Unordered heap files - arbitrary order of records
* Pages are allocated as file grows
* Keep track of pages in a file, free space on pages, records on a page
* Page directory lists page IDs and their free space (directory pages likely stay cached)
* Clustered heap files - records/pages are grouped
* Sorted files - pages and records in sorted order
* Index files - disk-based data structures
* B+ Trees, Linear hashing
## Page Structure
* Header:
* Number of records, free space, next/last pointer, bitmaps, slot table
* Page layouts:
* Record length can be fixed or variable
* Page packing (packed or unpacked)
* Fixed Length Records, Packed
* Store record number as the offset from start
* Empty space calculated from number of records
* Deletion means we need to rearrange
* Record ID pointers need to be updated
* Fixed Length Records, Unpacked
* Bitmap denotes slots with records
* Insert - find first empty slot
* Delete - clear bit
### Variable Length Records
* Hard to know where each record begins
* Relocate metadata to footer
* Slot directory in footer
* First entry (end of the page) is pointer to free space
* Each other entry is length of record and pointer/offset to start
* Deletion: null length, null pointer
* Insert: Place record in free space, put pointer/length pair in next open slot, update free space pointer
* Problem: free space fragmentation not consolidated
* Pack all records together and update pointers
* Reorganizing data is safe
* Eager - always reorganize
* Lazy - only when full
* Track number of slots in slot directory (total)
* Slots grow from the end inward, records grow from the beginning; when they meet, the page is full
## Record Layout
* Each record in table has a fixed type
* Fixed length: all fields are fixed-length types
* Disk byte representation same as memory
* Variable length (var char):
* Don't just pad to be fixed length
* Use delimiters (commas)
* Problem: the data itself might contain the delimiter
* Scanning for delimiters adds in-memory cost
* Record header: point to end of variable length fields
* Move variable length fields to the end
* Direct access to any field
* Handles nulls
# Lecture 4: 1/31/20
* Usually have trade-offs between techniques
* Heap files: good for full scans
* Sorted files: retrieval in order or range
* Clustered files: Group data into blocks
## Cost Model
* B: Number of data blocks in file
* R: Number of records per block
* D: Average time to read/write disk block
* Ignore: Sequential vs random I/O, pre-fetching, in-memory costs
* Assumptions:
* Single record insert/delete
* Equality selection - exactly 1 match
* Heap file inserts always appends to end of file
* Sorted files are packed
## File Scan
* B=5, R=2, D=5ms
| | Heap | Sorted |
| -------- | -------- | -------- |
| Scan | B*D | B*D |
| Equality search | $\frac{1}{2}B*D$ | $D\log_2(B)$ |
| Range search | B*D | $D(\log_2(B)+ Pages)$ |
| Insert | 2D | $D(\log_2(B)+ B)$ |
| Delete | (B/2 + 1)D | $D(\log_2(B)+ B)$ |
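* Worked example with the numbers above ($B=5$, $D=5$ms): heap scan $= B \cdot D = 25$ms; heap equality search $\approx \frac{1}{2}B \cdot D = 12.5$ms; sorted equality search $\approx D\log_2 B \approx 11.6$ms; heap insert $= 2D = 10$ms; sorted insert $\approx D(\log_2 B + B) \approx 36.6$ms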
* Equality search (heap):
* P(i) = 1/B, T(i) = i
* Pages touched on average = B/2
* Equality search (sorted):
* Worst case $\log_2(B)$
* Expected cost: $\log_2(B) - \frac{B-1}{B}$
* Range search (heap):
* Always touch all blocks
* Range search (sorted):
* Binary search
* Insert (heap):
* Stick at end of file = 2D
* Read the page and then write the page
* Double cost for read and write
* Insert (sorted):
* Binary search to find the position, then shift the rest of the file
* Delete (heap):
* B/2 reads to find the record, plus 1 write
* Delete (sorted):
* Have to shift/repack the rest of the file after the deleted record
# Lecture 5: 2/8/20
* Want to look things up by value
* Index - data structure for fast lookup and modification of data entries by search key
* Lookup: multiple operations (equality, 1d range)
* Search key: any subset of the columns of the relation
* Items as (key, recordId) pair
* Want fast insert/delete
* Ex. B+ Tree, Hash, R-Tree
## ISAM
* Static structure
* Sorted (key, page id) file
* Recursively build (key, page id) index files on top of the index file until it fits in a single page
* Key invariant: all tuples with $K_L \leq K \leq K_R$ are in the subtree pointed to by $P_L$
* Can just drop all the left keys
* ISAM
* Idea: have linked list of overflow pages
* Good search and locality, bad for lots of inserts
## B+ Tree
* Same (Key, Page pointer) pairs & key invariant
* Same search routine
* $\log_F(N)$ cost
* Dynamic tree index keeps it always balanced
* Efficient insertion/deletion
* Grow at root not leaves
* All data stored at leaves
* Occupancy invariant:
* Occupancy must be at least 1/2 of the node
* d = order of the tree
* $d \leq$ occupancy $\leq 2d$ entries per node
* Max fan out: 2d + 1 (1 for leftmost pointer)
* Root does not need to be half full
* Previous and next pointers allow traversal between leaves
* Entries grow by a factor of the fan-out per level: capacity $\approx$ fan-out$^{\text{height}} \times$ leaf size (worked example after this list)
* Insertion:
* Split leaf if not enough room and redistribute
* Fix next/prev pointers
* Take lowest value on leaf and place in parent
* Recursively split parent
* If a leaf splits, value is copied up
* If internal node splits, push a value up instead of copying
* Most of the time the tree gets wider
* Deletion is not often implemented
* Occupancy invariant not enforced
* If page becomes completely empty, can delete
* Usually size of tree remains same
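* Worked example (illustrative numbers): with order $d=2$, fan-out is at most $2d+1 = 5$; two levels of internal nodes can reach up to $5^2 = 25$ leaves of up to $2d = 4$ entries each, i.e., up to $100$ entries, so capacity grows roughly as fan-out$^{\text{height}} \times$ leaf size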
## B+ Tree Bulk Loading
* Naive way makes leaves/internal nodes end up mostly half-empty
* 1) Sort input records by key
* 2) Fill pages to some fill factor (ex. 75%)
# Lecture 6: 2/9/20
* Query support
* Choice of search key
* Data entry storage
* Variable-length key tricks
* Cost model for Index
## Query support
* B+ trees provide equality and range
* 2d boxes/circles (R-tree, KD-tree)
* Hard to separate points by distance for more dimensions
* Nearest-neighbor queries
* GIST - generalized search tree
* Focus on 1d range search, equality is a special case
## Search key/ordering
* Order by 1 column by search key
* Break ties on 2nd column, etc.
* A composite search key on $(k_1,\dots,k_n)$ matches a query with equality conjuncts on a prefix $k_1,\dots,k_m$ plus optionally a range ($<$ or $>$) on $k_{m+1}$
## Data Entry Storage
* Alternatives for data entry:
* 1) By value
* Store tuples of (key,value) in leaves
* 2) By reference
* Store (key, rid of matching data) tuples
* 3) By list of references
* Compressed version of #2
* Store (key, list of rid matching data records)
* Clustered
* Heap file records are kept mostly ordered by search keys in index
* Sort the heap file first so records are more packed
* Might need new blocks for inserts
* Clustered Pros:
* Efficient for range searches
* Potential locality benefits
* Supports compression
* Cons:
* More expensive to maintain (periodically resort)
* Heap file packed to 2/3 to accommodate inserts
## Variable Length Keys
* Redefine occupancy invariant for B+ trees
* Index pages often hold many more entries than leaves
* Occupancy in terms of bytes
* Prefix compression for keys
* Determine min splitting prefix wrt keys on right
* Suffix Key Compression
* All keys have largest common prefix
* Leave only suffix
* Very useful for composite keys
## Cost Model
| | Heap | Sorted | Clustered Index |
| -------- | -------- | -------- | -------- |
| Scan | B*D | B*D | 1.5B * D |
| Equality search | $\frac{1}{2}B*D$ | $D\log_2(B)$ | $D(\log_F(BR/E) + 2)$ |
| Range search | B*D | $D(\log_2(B)+ Pages)$ |$D(\log_F(BR/E) + 3*pages)$ |
| Insert | 2D | $D(\log_2(B)+ B)$ | $D(\log_F(BR/E) + 4)$ |
| Delete | (B/2 + 1)D | $D(\log_2(B)+ B)$ | $D(\log_F(BR/E) + 4)$ |
B = Num of data blocks
R = Num of records per block
D = Avg time to read/write
F = Avg internal node fanout
E = Avg num of data entries per leaf
# Lecture 7: 2/11/20
## Buffer Management
* Manages buffer pool
* Move pages from disk into/out of RAM
* Dirty pages - pages modified in RAM
* Buffer manager needs to find out page is dirty
* Large amount of memory malloc'd for the buffer pool
* Metadata - small array malloc'd by the DBMS (stores each frame's dirty bit and pin count)
* Pin count - number of current requestors (queries/tasks) using the page
## Page Replacement
* If buffer is full, page replacement policy
* If requested page not in pool:
* Choose un-pinned frame for replacement
* If frame is dirty, write page to disk and mark clean
* read requested page into frame
* Pin page, return its address
* After requestor finishes:
* Set dirty bit if modified
* Unpin the page
* Increment pin count for each req, decrement for unpin
## LRU Policy
* Least recently used
* Add a "last used" column to track when each frame was last used
* Replace frame with lowest value in last used
* Temporal locality: good for repeated accesses to popular pages
* Expensive in CPU cost to find lowest last used (linear scan or heap for log time)
* Approximate LRU: CLOCK policy
* clock hand points to next page to consider
* Reference bits for each page
* If reference bit is set, skip page and unset the bit
* Basically a second chance
* Better time performance
* Repeated scans of big files are bad for LRU and CLOCK
* Sequential flooding - flood the buffer pool so 0% hit rate in cache
## MRU Scans
* $\frac{B-1}{N-1}$ hit rate (seq scans)
* Window of pages slides left
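* Worked example (illustrative numbers): repeatedly scanning an $N = 100$ page file with $B = 10$ buffers gives an MRU hit rate of about $\frac{B-1}{N-1} = \frac{9}{99} \approx 9\%$, versus $0\%$ under LRU/CLOCK (sequential flooding)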
## Prefetching
* Prefetch pages for a run of sequential pages
* Hybrid of LRU(hot/cold) and MRU(repeated seq)
* Use DBMS info to hint to BufMgr
* Predict I/O patterns for big queries
* Fancier stochastic policies (2Q, LRU-2, ARC)
* Can't just use the filesystem (OS) buffer cache
* Bad portability for multiple FS
* DBMS forces pages to disk for recovery purposes
# Lecture 8: 2/13/20
* Relational algebra = tree of operations
* Operational (SQL is declarative)
* Codd's theorem: anything expressible in relational calculus can be expressed in relational algebra (equivalent in power)
* Results of relational operators are relations, so operators compose (closure)
## Operators
* Operators on single relation
* Projection $\pi$: Retain only desired cols (vertical)
* $\pi_{\text{name,age}}(S2)$
* SELECT clause (duplicate elimination)
* Selection $\sigma$: Select subset of rows (horizontal)
* $\sigma_{\text{rating>8}}(S2)$
* WHERE clause
* Renaming: $\rho$: Rename attributes and relations
* $\rho(Temp1(1\rightarrow sid1, 4 \rightarrow sid2), R1 \times R2)$
## Binary operators: (pair of relations)
* Union: $\cup$: tuples in r1 or r2
* r1, r2 must be compatible (same num of fields)
* UNION clause
* Remove duplicate rows
* Set difference: $-$: Tuples in r1, but not r2
* r1, r2 must be compatible
* EXCEPT clause
* Don't need duplicate elimination
* Not commutative
* Cross product: $\times$: Combine 2 relations
* Cardinality: $|R_1||R_2|$
* Don't need compatibility or duplicate elimination
* Group by/aggregation $\gamma$:
* $\gamma_{\text{age, AVG(rating)}}(Sailors)$
* $\gamma_{\text{age, AVG(rating), COUNT(*)>2}}(Sailors)$
## Compound operators: macros
* Intersection: $\cap$: tuples in r1 and r2
* Requires compatibility
* $S1 \cap S2 = S1 - (S1 - S2)$
* Joins: $\bowtie_\theta, \bowtie$: Combine relations that satisfy predicates
* Theta Join $\bowtie_\theta$: join on logical expression $\theta$
* Equi-join: Theta join with theta being a conjunction of equalities
* Natural Join $\bowtie$ equi-join on all matching column names
* $R \bowtie S = \pi_{\text{unique fields}}(\sigma_{\text{equality on matching fields}}(R \times S))$
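* Worked example (assuming the Sailors/Reserves schema from the earlier examples; the boat id is made up): $\pi_{\text{sname}}(\sigma_{\text{bid}=102}(\text{Sailors} \bowtie \text{Reserves}))$ is the algebra form of `SELECT sname FROM Sailors NATURAL JOIN Reserves WHERE bid = 102`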
# Lecture 9: 2/18/20
## Out of Core Algs
* Need to sort 100GB of data with 1GB of RAM
* Single-pass stream:
* Read chunk from input to input buffer
* When output buffer fills, write to output
* Double buffering:
* Swap between I/O thread and compute thread
* Assumes you have RAM buffers to spare
* Sorting: Store contents in order
* Hashing: group matching records together (like values end up in the same partition/bucket); no ordering guarantee
## Sorting: 2-Way Strawman
* Pass 0: conquer a batch
* Read, sort, write in 1 buffer page
* Repeated batch job
* Pass 1,2,3
* Requires 3 buffer pages (2 input, 1 output)
* Read 2 sorted runs; repeatedly pick the smallest value and move it to the output buffer
* Each pass merges pairs of runs into sorted runs twice as long
## Full External Sort
* Sort a file with N pages, B buffer pages
* Pass 0: use B buffer pages to produce $\lceil N/B \rceil$ sorted runs of B pages each
* Pass 1,2,... merge B-1 runs at a time
* Num of passes: $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$
* Cost: $2N *$ num of passes
* Sort N pages of data in $\sqrt{N}$ space
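* Worked example (illustrative numbers): $N = 108$ pages, $B = 5$ buffers: Pass 0 makes $\lceil 108/5 \rceil = 22$ sorted runs of 5 pages; each later pass merges $B-1 = 4$ runs, so passes $= 1 + \lceil \log_4 22 \rceil = 1 + 3 = 4$ and cost $= 2N \times 4 = 864$ I/Os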
## Alternative: Hashing
* Remove duplicates without ordering
* Streaming partition (divide):
* Hash function $h_p$ to stream records to disk partitions
* Matches go to same partition on disk
* ReHash (conquer):
* Read each partition into RAM one at a time and hash it into an in-memory table using a second hash function $h_r$
* Each bucket contains small num of distinct values
* Write each bucket to disk (make sure duplicate values are contiguous)
* Cost = $2N \times$ num of passes = 4N I/Os (2 passes)
* B-1 partitions from Pass 1, each no more than B pages in size
* $\sqrt{N}$ space
* Cost:
* Hashing: 4N I/Os (divide then conquer)
* Don't have to worry as much about skew data
* Better for duplicate elimination
* Scales with num of distinct values instead of rows
* Sorting: 4N I/Os (conquer then divide)
* Good if output needs to be sorted
* Not sensitive to duplicates or bad hash functions
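* Worked example (illustrative numbers): with $B = 10$ buffers the partition pass makes $B-1 = 9$ partitions; as long as $N \leq B(B-1) = 90$ pages (and the hash spreads values evenly), each partition fits in the $B$-page buffer for the rehash pass, giving $2N + 2N = 4N$ I/Os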
## Parallelization
* Hashing:
* Shuffle data across machines ($h_n$ partitions data across machines)
* Each machine partitions using $h_p$ and then conquer $h_r$
* Still 4N I/Os total, but parallelized across n machines, so roughly n× faster
* Sorting:
* Choose which machine based on value range
* Each machine sorts locally
* Pass 1 merges to create large local runs
* Hard to choose ranges that give each machine equal work (pages); otherwise data skew