# CS186 MT1 Notes

# Lecture 1: 1/22/20

* SQL is declarative (say what you want, not how to get it)
  * Turing complete, not functional
  * Can call out to other languages to extend SQL
* Database: a set of named relations
* Relation/table:
  * Schema: description/metadata
  * Instance: set of data satisfying the schema
  * Attribute/Column/Field
  * Tuple/Record/Row
  * Schema is fixed (unique attribute names, atomic types)
  * Instance can change often

## SQL Language

* 2 sublanguages:
  * DDL - Data Definition Language: define and modify the schema
  * DML - Data Manipulation Language: queries are written intuitively
* RDBMS (Relational database management system): chooses and runs algorithms to give answers
* Primary key columns: provide unique lookup keys
  * Cannot have any duplicate values
  * Can be made of >1 column
* Foreign keys: pointers to reference other tables
  * Reference the primary key

## Query Syntax

* `CREATE TABLE name(col1 text, col2 integer)`
  * `CREATE TABLE pets(name text, age integer, type text, cuteness integer)`
* `UPDATE table_name SET attribute=new_value WHERE attribute=something`
  * `UPDATE pets SET age = age+1 WHERE name = 'Skippy';`
* `DELETE FROM table_name WHERE attribute=something`
  * `DELETE FROM pets WHERE name = 'Peter';`
* `SELECT [DISTINCT] <col expr> FROM <single table> [WHERE <predicate>] [ORDER BY <attr>]`
  * Name output columns with `AS`
  * Can order by `DESC` or `ASC` (default is `ASC`)
  * Use `LIMIT` to limit the number of results
    * Use with `ORDER BY` or else the answer will be nondeterministic

## Aggregates

* Produce a summary of the data (1 row of output)
  * Ex. `SUM, COUNT, MAX, MIN`
* `GROUP BY` column values to partition the table into groups
  * Produce an aggregate result for each group
  * Can put grouping columns into the `SELECT`
* `HAVING` filters after grouping and aggregation
  * Ex. `HAVING COUNT(*) > 2`
* `DISTINCT` is used inside aggregate functions or right after `SELECT` to remove duplicates from the output
  * Ex. `SELECT COUNT(DISTINCT S.name)`
  * For aggregate queries, `DISTINCT` outside the aggregate does nothing, since the single output row has no duplicates
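Putting this section together against the `pets` table created above (a small illustrative query; the threshold and `LIMIT` value are arbitrary):

```sql
-- For each pet type with more than 2 pets, report how many distinct names it
-- has and its average cuteness, cutest types first.
SELECT type,
       COUNT(DISTINCT name) AS num_names,
       AVG(cuteness)        AS avg_cuteness
FROM pets
GROUP BY type
HAVING COUNT(*) > 2
ORDER BY avg_cuteness DESC
LIMIT 3;
```

`WHERE` would filter individual rows before grouping, while `HAVING COUNT(*) > 2` filters whole groups after aggregation.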
# Lecture 2: 1/23/20

* Evaluation order:
  * 1) FROM: identify the table(s)
  * 2) WHERE: apply selection (eliminate rows)
  * 3) SELECT: project away columns (keep only the ones used in SELECT, GROUP BY, or HAVING)
  * 4) GROUP BY: form groups & aggregate
  * 5) HAVING: eliminate groups
  * 6) DISTINCT: eliminate duplicates
* FROM multiple tables creates a cross product
* Self-joins need aliases (`AS`) to distinguish the copies
* String comparison
  * `WHERE S.sname LIKE 'B_%'`, or `'B.*'` in regex-style matching
  * `_` matches a single character; `%` matches any number of characters

## Set Operations

* `UNION`, `INTERSECT`, `EXCEPT`
  * Tuples are elements in sets
* `UNION ALL` sums the cardinalities of both sides
* `INTERSECT ALL` takes the min of the cardinalities
* `EXCEPT ALL` takes the difference of the cardinalities (< 0 just means none in the output)

## Nested Queries

* Use `IN` for a subquery (or `NOT IN`)
* `ANY` or `ALL` for comparisons against a subquery

## Joins

* Inner joins: `FROM Sailors s INNER JOIN Reserves r`
* Inner, natural, or (left, right, full) outer joins
  * Inner joins are the default
* Natural joins: equality join on all matching column names
* Left outer join: return all matched rows and preserve unmatched rows from the left table
  * Use NULLs for the fields of non-matched tuples
* Right outer join: preserve unmatched rows from the right table
* Full outer join: preserve unmatched rows from both sides

## Views

* `CREATE VIEW view_name AS select_statement`
  * Act like subroutines
  * Views are evaluated whenever you reference them (not snapshots of stored results)
* Can also just write subqueries in the `FROM` clause
* Common table expression: `WITH <view> AS (...) SELECT ...`

## Nulls

* Arise naturally from outer joins
* `SELECT rating = NULL` will return NULL for everything
  * Use explicit checks: `WHERE rating IS NOT NULL`
* NULL is treated as false by WHERE
* Three-valued logic with T, F, N
  * NOT N = N
  * NULL could be T or F, so the answer must accommodate either value (think of short-circuiting)
* NULL column values are ignored by aggregate functions
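A small example tying the outer-join and NULL ideas together (a sketch; the `sid`/`bid` columns come from the usual Sailors/Reserves toy schema and are assumed here, since the notes only name `sname` and `rating`):

```sql
-- Sailors with no reservations: the left outer join pads unmatched sailors
-- with NULLs on the Reserves side, and IS NULL keeps exactly those rows.
SELECT s.sid, s.sname
FROM Sailors s LEFT OUTER JOIN Reserves r ON s.sid = r.sid
WHERE r.sid IS NULL;
```

Writing `WHERE r.sid = NULL` instead would return nothing, since the comparison evaluates to NULL and WHERE treats NULL as false.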
# Lecture 3: 1/28/20

## DBMS Architecture

* Query parsing: checks and verifies the SQL
* Relational operators are the algorithms that answer queries
* File & index management: organize tables and pages in a logical file
* Buffer management layer: map disk blocks into memory (RAM)
* Disk space management: translate page requests into physical bytes on the storage device
* Concurrency control: multiple users can use the DB
* Recovery: if the DB crashes, we can recover the data

## Storage Media

* No pointer dereferences
  * Instead we have READ and WRITE APIs (slow)
* Magnetic disks are the most economical (compared to SSD, RAM)
* Disk components:
  * Platters spin at ~15,000 RPM
  * The track "spins" under the disk head
  * The same track across platters forms a cylinder
  * Only 1 head reads/writes at any time
  * Block/page size is a multiple of the fixed sector size
* Accessing a disk page:
  * Seek time (2-3 ms) - move the arm
  * Rotational delay (0-4 ms) - wait for the block to rotate under the head
  * Transfer time (~0.25 ms) - move the data off the disk
  * Minimize seek/rotation delays

## Flash/SSD

* NAND flash: fine-grained reads (4-8 KB), coarse-grained writes (1-2 MB)
* A cell fails after 2-3k erasures
  * Wear leveling - don't rewrite to the same location every time
  * Write amplification - you end up writing big units and have to re-organize
* Slower for random and sequential writes
* Up to 10x the bandwidth of ideal HDDs
* Locality matters for both flash and disk writes
* Smaller DBs can even fit in RAM

## Disk Space Management

* Sequential blocks are faster
  * Amortize seek delays (HDDs) and writes (SSDs)
* Cache: keep popular blocks
* Pre-fetch blocks likely to be accessed
* Buffer writes to sequential blocks
* Block/page = unit of transfer for read/write
  * 64-128 KB
* "Next" block:
  * Sequential blocks on the same track
  * Blocks on the same cylinder
  * Blocks on an adjacent cylinder
* Arrange file pages sequentially
* Lowest layer of the DBMS
  * Map pages to locations on disk
  * Load pages from disk to memory
  * Save pages back to disk, ensure the writes happen
* Handles requests from higher levels to read/write pages and to allocate pages
* Implementation
  * A) Talk to the storage device directly (but the device can change)
  * B) Run over the filesystem (FS)
    * Allocate 1 large contiguous file
    * The FS requires a file to be on a single disk drive, but a DBMS can span multiple filesystems

## Database Files

* Tables are stored as files, which contain pages
  * Pages contain a collection of records
* Disk space manager handles disk page management
* Buffer manager: access pages in RAM (for higher levels)
* API for higher levels:
  * Insert/delete/modify record
  * Fetch a particular record by record id
  * Scan all records
* Files can span multiple machines

### DB File Structures

* Unordered heap files - arbitrary order of records
  * Pages are allocated as the file grows
  * Keep track of the pages in a file, the free space on pages, and the records on a page
  * Page directory lists pages' free space and their IDs (likely in cache)
* Clustered heap files - records/pages are grouped
* Sorted files - pages and records in sorted order
* Index files - disk-based data structures
  * B+ trees, linear hashing

## Page Structure

* Header:
  * Number of records, free space, next/last pointer, bitmaps, slot table
* Page layouts vary along two axes:
  * Record length can be fixed or variable
  * Page packing (packed or unpacked)
* Fixed-Length Records, Packed
  * Record number gives the offset from the start of the page
  * Free space location is calculated from the number of records
  * Deletion means we need to rearrange the remaining records
    * Record ID pointers then need to be updated
* Fixed-Length Records, Unpacked
  * Bitmap denotes which slots hold records
  * Insert - find the first empty slot
  * Delete - clear the bit

### Variable-Length Records

* Hard to know where each record begins
* Relocate the metadata to the footer
  * Slot directory in the footer
  * First entry (at the end of the page) is a pointer to the free space
  * Each other entry is the length of a record and a pointer/offset to its start
* Deletion: null out the length and the pointer
* Insert: place the record in free space, put a (pointer, length) pair in the next open slot, update the free space pointer
* Problem: free space fragmentation is not consolidated
  * Pack all records together and update the pointers
  * Reorganizing data is safe
  * Eager - always reorganize
  * Lazy - only when full
* Track the total number of slots in the slot directory
* Slots grow from the end inward, records grow from the beginning; when they meet, the page is full
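A rough capacity check for the slotted-page layout (the 8 KB page, 16-byte header, and 4-byte slot entries are assumed for illustration; only the structure comes from the notes):

$$\text{used} = \text{header} + \sum_i \text{len}(r_i) + n_{\text{slots}} \cdot \text{slot size}$$

The page is full when the next record plus its slot entry no longer fit, e.g. with an 8192-byte page, a 16-byte header, 4-byte slots, and 100-byte records, at most $\lfloor (8192-16)/(100+4) \rfloor = 78$ records fit.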
## Record Layout

* Each record in a table has a fixed type
* Fixed length: field types are all fixed-size
  * Disk byte representation is the same as in memory
* Variable length (e.g. varchar):
  * Don't just pad to a fixed length
  * Could use delimiters (commas), but you might want the delimiter character in the data, and it costs a scan in memory
  * Record header: pointers to the end of each variable-length field
    * Move the variable-length fields to the end
    * Direct access to any field
    * Handles NULLs

# Lecture 4: 1/31/20

* There are usually trade-offs between techniques
  * Heap files: good for full scans
  * Sorted files: good for retrieval in order or by range
  * Clustered files: group data into blocks

## Cost Model

* B: number of data blocks in the file
* R: number of records per block
* D: average time to read/write a disk block
* Ignore: sequential vs. random I/O, pre-fetching, in-memory costs
* Assumptions:
  * Single record insert/delete
  * Equality selection - exactly 1 match
  * Heap file inserts always append to the end of the file
  * Sorted files are packed

## File Scan

* Example parameters: B=5, R=2, D=5ms

| | Heap | Sorted |
| -------- | -------- | -------- |
| Scan | $B \cdot D$ | $B \cdot D$ |
| Equality search | $\frac{1}{2}B \cdot D$ | $D\log_2 B$ |
| Range search | $B \cdot D$ | $D(\log_2 B + \text{pages})$ |
| Insert | $2D$ | $D(\log_2 B + B)$ |
| Delete | $(\frac{B}{2} + 1)D$ | $D(\log_2 B + B)$ |

* Equality search (heap):
  * P(i) = 1/B, T(i) = i
  * Pages touched on average = B/2
* Equality search (sorted):
  * Worst case $\log_2 B$
  * Expected cost: $\log_2 B - \frac{B-1}{B}$
* Range search (heap):
  * Always touch all blocks
* Range search (sorted):
  * Binary search, then scan the pages in range
* Insert (heap):
  * Stick it at the end of the file = 2D
  * Read the page and then write the page (double cost for read and write)
* Insert (sorted):
  * Binary search, then shift the rest of the file
* Delete (heap):
  * B/2 reads on average, 1 write
* Delete (sorted):
  * Have to repack the rest of the file

# Lecture 5: 2/8/20

* Want to look things up by value
* Index - data structure for fast lookup and modification of data entries by search key
  * Lookup: multiple operations (equality, 1-d range)
  * Search key: any subset of the columns in the relation
  * Items stored as (key, record id) pairs
  * Want fast insert/delete
  * Ex. B+ tree, hash, R-tree

## ISAM

* Static structure
* Sorted (key, page id) file
  * Recursively build index files on top of index files
* Key invariant: all tuples with keys in the range $K_L \leq K < K_R$ are in the subtree pointed to by $P_L$
  * Can just drop all the left keys
* ISAM idea: have a linked list of overflow pages for inserts
  * Good search and locality, bad if there are lots of inserts

## B+ Tree

* Same (key, page pointer) pairs & same key invariant
* Same search routine
  * $\log_F(N)$ cost
* Dynamic tree index keeps it always balanced
  * Efficient insertion/deletion
  * Grows at the root, not the leaves
* All data stored at the leaves
* Occupancy invariant:
  * Occupancy must be at least half of each node
  * d = order of the tree
  * $d \leq$ occupancy $\leq 2d$
  * Max fan-out: 2d + 1 (the +1 for the leftmost pointer); see the worked example at the end of this lecture
  * The root does not need to be half full
* Previous and next pointers allow traversal between leaves
* Total capacity scales with the fan-out and the leaf size
* Insertion:
  * Split the leaf if there is not enough room and redistribute
  * Fix the next/prev pointers
  * Take the lowest value on the new leaf and place it in the parent
  * Recursively split the parent if needed
  * If a leaf splits, the value is copied up
  * If an internal node splits, a value is pushed up instead of copied
  * Most of the time the tree gets wider, not taller
* Deletion is often not fully implemented
  * The occupancy invariant is not enforced
  * If a page becomes completely empty, it can be deleted
  * Usually the size of the tree stays about the same

## B+ Tree Bulk Loading

* The naive way (repeated inserts) leaves the leaves/internal nodes mostly half-empty
* 1) Sort the input records by key
* 2) Fill pages to some fill factor (ex. 75%)
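To make the order/fan-out definitions concrete (a worked example; the choice of $d = 2$ and a 2-level tree is illustrative):

With order $d = 2$, the max fan-out is $F = 2d + 1 = 5$ and each leaf holds at most $2d = 4$ entries. A root plus one full internal level can point to $F^2 = 25$ leaves, i.e. up to $25 \cdot 4 = 100$ entries, and a lookup reads root $\rightarrow$ internal node $\rightarrow$ leaf, matching $\lceil \log_F N \rceil = \lceil \log_5 100 \rceil = 3$ node accesses.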
# Lecture 6: 2/9/20

* Query support
* Choice of search key
* Data entry storage
* Variable-length key tricks
* Cost model for indexes

## Query Support

* B+ trees provide equality and range search
* 2-d boxes/circles (R-tree, KD-tree)
  * Hard to separate points by distance in more dimensions
  * Nearest-neighbor queries
  * GiST - generalized search tree
* Focus on 1-d range search; equality is a special case

## Search Key / Ordering

* Order by 1 column of the search key
  * Break ties on the 2nd column, etc.
* A composite search key matches a query if the query is a conjunction of m equality clauses plus at most one < or > clause on $k_{m+1}$

## Data Entry Storage

* Alternatives for data entries:
  * 1) By value: store the (key, tuple) pairs themselves in the leaves
  * 2) By reference: store (key, rid of the matching data record)
  * 3) By list of references: compressed version of #2; store (key, list of rids of matching data records)
* Clustered index
  * Heap file records are kept mostly ordered by the index's search key
  * Sort the heap file first so records are more packed
  * Might need new blocks for inserts
* Clustered pros:
  * Efficient for range searches
  * Potential locality benefits
  * Supports compression
* Cons:
  * More expensive to maintain (periodically re-sort)
  * Heap file is only packed to 2/3 to accommodate inserts

## Variable-Length Keys

* Redefine the occupancy invariant for B+ trees
  * Index pages often hold many more entries than leaves
  * Measure occupancy in terms of bytes
* Prefix compression for keys
  * Determine the minimum splitting prefix w.r.t. the keys on the right
* Suffix key compression
  * All keys share a largest common prefix; keep only the suffix
  * Very useful for composite keys

## Cost Model

| | Heap | Sorted | Clustered Index |
| -------- | -------- | -------- | -------- |
| Scan | $B \cdot D$ | $B \cdot D$ | $1.5B \cdot D$ |
| Equality search | $\frac{1}{2}B \cdot D$ | $D\log_2 B$ | $D(\log_F(BR/E) + 2)$ |
| Range search | $B \cdot D$ | $D(\log_2 B + \text{pages})$ | $D(\log_F(BR/E) + 3 \cdot \text{pages})$ |
| Insert | $2D$ | $D(\log_2 B + B)$ | $D(\log_F(BR/E) + 4)$ |
| Delete | $(\frac{B}{2} + 1)D$ | $D(\log_2 B + B)$ | $D(\log_F(BR/E) + 4)$ |

* B = number of data blocks
* R = number of records per block
* D = average time to read/write a disk block
* F = average internal node fan-out
* E = average number of data entries per leaf
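Plugging illustrative numbers into the table (not from the lecture): with $B = 10^6$, $R = 100$, $D = 5$ ms, $F = 100$, $E = 100$, an equality search costs about

$$\tfrac{1}{2}BD = 2500\,\text{s (heap)}, \qquad D\log_2 B \approx 100\,\text{ms (sorted)}, \qquad D(\log_F(BR/E) + 2) = 5\,\text{ms} \cdot (3 + 2) = 25\,\text{ms (clustered index)}$$

Here $BR/E = 10^6$ is the number of leaf pages, and $\log_{100} 10^6 = 3$ is the number of levels traversed to reach a leaf.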
# Lecture 7: 2/11/20

## Buffer Management

* Manages the buffer pool
  * Moves pages from disk into/out of RAM
* Dirty pages - pages modified in RAM
  * The buffer manager needs to know when a page is dirty
* A large amount of memory is malloc'd for the pool
* Metadata - a small array in memory, malloc'd by the DBMS (stores the dirty bit and pin count)
* Pin count - number of current requestors of the page

## Page Replacement

* If the buffer is full, apply the page replacement policy
* If the requested page is not in the pool:
  * Choose an un-pinned frame for replacement
  * If the frame is dirty, write its page to disk and mark it clean
  * Read the requested page into the frame
  * Pin the page and return its address
* After the requestor finishes:
  * Set the dirty bit if the page was modified
  * Unpin the page
* Increment the pin count for each request, decrement on unpin

## LRU Policy

* Least recently used
* Add 1 column, last-used, to track the time when each frame was last used
  * Replace the frame with the lowest last-used value
* Temporal locality: good for repeated accesses to popular pages
* Expensive in CPU cost to find the lowest last-used value (linear scan, or a heap for log time)
* Approximate LRU: CLOCK policy
  * The clock hand points to the next page to consider
  * Reference bit for each page
  * If the reference bit is set, skip the page and unset the bit
    * Basically a second chance
  * Better time performance
* Repeated scans of big files are bad for LRU and CLOCK
  * Sequential flooding - the scan floods the buffer pool, giving a 0% cache hit rate

## MRU Scans

* Most recently used: $\frac{B-1}{N-1}$ hit rate for repeated sequential scans
* A window of pages slides left

## Prefetching

* Prefetch pages for a run of sequential pages
* Hybrid of LRU (hot/cold pages) and MRU (repeated sequential scans)
* Use DBMS info to hint to the buffer manager
  * Predict I/O patterns for big queries
* Fancier stochastic policies exist (2Q, LRU-2, ARC)
* Can't just use the filesystem (OS) buffer cache
  * Bad portability across filesystems
  * The DBMS must force pages to disk for recovery purposes

# Lecture 8: 2/13/20

* Relational algebra = a tree of operators
  * Operational (SQL is declarative)
* Codd's theorem: all relational calculus can be expressed in relational algebra
* The results of operating on relations are themselves relations

## Operators

* Operators on a single relation:
* Projection $\pi$: retain only the desired columns (vertical)
  * $\pi_{\text{name,age}}(S2)$
  * SELECT clause (with duplicate elimination)
* Selection $\sigma$: select a subset of rows (horizontal)
  * $\sigma_{\text{rating>8}}(S2)$
  * WHERE clause
* Renaming $\rho$: rename attributes and relations
  * $\rho(Temp1(1\rightarrow sid1, 4 \rightarrow sid2), R1 \times R2)$

## Binary Operators (pair of relations)

* Union $\cup$: tuples in r1 or r2
  * r1, r2 must be compatible (same number of fields)
  * UNION clause
  * Removes duplicate rows
* Set difference $-$: tuples in r1, but not r2
  * r1, r2 must be compatible
  * EXCEPT clause
  * No duplicate elimination needed
  * Not commutative
* Cross product $\times$: combine 2 relations
  * Cardinality: $|R_1||R_2|$
  * Needs neither compatibility nor duplicate elimination
* Group by/aggregation $\gamma$:
  * $\gamma_{\text{age, AVG(rating)}}(Sailors)$
  * $\gamma_{\text{age, AVG(rating), COUNT(*)>2}}(Sailors)$

## Compound Operators (macros)

* Intersection $\cap$: tuples in r1 and r2
  * Requires compatibility
  * $S1 \cap S2 = S1 - (S1 - S2)$
* Joins $\bowtie_\theta, \bowtie$: combine relations with tuples that satisfy a predicate
  * Theta join $\bowtie_\theta$: join on a logical expression $\theta$
  * Equi-join: theta join where $\theta$ is a conjunction of equalities
  * Natural join $\bowtie$: equi-join on all matching column names
    * $R \bowtie S = \pi_{\text{unique fields}}\,\sigma_{\text{equality on matching fields}}(R \times S)$

# Lecture 9: 2/18/20

## Out-of-Core Algs

* Need to sort 100 GB of data with 1 GB of RAM
* Single-pass streaming:
  * Read a chunk from input into the input buffer
  * When the output buffer fills, write it to output
* Double buffering:
  * Swap between an I/O thread and a compute thread
  * Assumes you have RAM buffers to spare
* Sorting: store contents in order
* Hashing: no hash collisions

## Sorting: 2-Way Strawman

* Pass 0: conquer a batch
  * Read, sort, and write one buffer page at a time
  * Repeated batch job
* Passes 1, 2, 3, ...
  * Require 3 buffer pages
  * Read 2 sorted runs, repeatedly pick the smaller head element and put it in the output buffer
  * Merge 2 sorted runs into 1 output run that is twice as long

## Full External Sort

* Sort a file with N pages using B buffer pages
* Pass 0: use B buffer pages to produce $\lceil N/B \rceil$ sorted runs of B pages each
* Passes 1, 2, ...: merge B-1 runs at a time
* Number of passes: $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$
* Cost: $2N \times$ (number of passes)
* Can sort N pages of data in about $\sqrt{N}$ buffer space (in two passes)
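Applying the pass formula to the 100 GB / 1 GB example from the start of this lecture (the 8 KB page size is an assumption for illustration):

With 8 KB pages, $B = 1\,\text{GB}/8\,\text{KB} = 131{,}072$ buffer pages and $N = 100\,\text{GB}/8\,\text{KB} \approx 1.3\times 10^7$ pages. Pass 0 produces $\lceil N/B \rceil = 100$ sorted runs, and a single merge pass can combine up to $B-1$ runs, so

$$\text{passes} = 1 + \lceil \log_{B-1} 100 \rceil = 1 + 1 = 2, \qquad \text{cost} = 2N \cdot 2 = 4N \text{ I/Os}$$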
## Alternative: Hashing

* Remove duplicates without ordering
* Streaming partition (divide):
  * Use a hash function $h_p$ to stream records into disk partitions
  * Matching values go to the same partition on disk
* ReHash (conquer):
  * Read each partition into a RAM hash table, one at a time, using a second hash function $h_r$
  * Each bucket contains a small number of distinct values
  * Write each bucket to disk (this makes duplicate values contiguous)
* Cost = 2N $\times$ number of passes = 4N I/Os
* B-1 partitions come out of pass 1, each no more than B pages in size
  * So hashing also needs only about $\sqrt{N}$ buffer space
* Cost comparison:
  * Hashing: 4N I/Os (divide, then conquer)
    * Don't have to worry as much about skewed data
    * Better for duplicate elimination
      * Scales with the number of distinct values instead of the number of rows
  * Sorting: 4N I/Os (conquer, then divide)
    * Good if the output needs to be sorted anyway
    * Not sensitive to duplicates or bad hash functions

## Parallelization

* Hashing:
  * Shuffle data across machines ($h_n$ partitions the data across machines)
  * Each machine then partitions locally with $h_p$ and conquers with $h_r$
  * Still 4N I/Os in total, but parallelized across n machines, so roughly n× faster
* Sorting:
  * Choose the machine based on value ranges
  * Each machine sorts locally
  * Pass 1 merges to create large local runs
  * Hard to choose ranges that give each machine equal work (pages); otherwise there is data skew
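A quick sanity check on the hashing numbers (the buffer and machine counts are illustrative): with $B$ buffers, pass 1 produces $B-1$ partitions of at most $B$ pages each, so one divide-and-conquer round covers up to $B(B-1) \approx B^2$ pages, which is where the $\sqrt{N}$ space bound comes from. For example, $B = 1000$ pages of 8 KB (8 MB of buffer) handles roughly $10^6$ pages ($\approx 8$ GB) at a cost of $4N = 4\times 10^6$ I/Os, and spreading the work over $n = 10$ machines with $h_n$ leaves each machine about $4N/n = 4\times 10^5$ I/Os.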