{%hackmd SybccZ6XD %}
# database (NTUST ECE)
###### tags: `database`
<style>
img{
width: 60%;
}
</style>
## Databases and Database Users
### Basic Definitions
- Database: A collection of related data.
- Data: Known facts that can be recorded and have an implicit meaning.
- Mini-world: Some part of the real world
- Database Management System (DBMS): A software create and maintain a computerized database.
- Database System: The DBMS software, data and application
### Typical DBMS Functionality
- Define data types, structures, and constraints
- Construct or Load the initial database contents on a secondary storage medium
> secondary storage: hard disk drives (HDDs), solid-state drives (SSDs), optical disks, USB flash drives, floppy disks or other devices.
- Manipulating the database:
- Retrieval: Querying, generating reports
- Modification: Insertions, deletions and updates to its content
- Accessing the database through Web applications
- Processing and Sharing by a set of concurrent users and application programs
### Application Activities Against a Database
- Applications interact with a database by generating
- Queries: read data
- Transactions: read data and update data
- Applications must not allow unauthorized users to access data
### Main Characteristics of the Database Approach
- Self-describing nature of a database system:
- A DBMS catalog stores the description of a particular database (e.g. data structures, types, and constraints)
- The description is called meta-data*.
- Insulation between programs and data:
- Called program-data independence.
- Allows changing data structures and storage organization without having to change the DBMS access programs.
- Data Abstraction:
- Programs refer to the data model constructs rather than data storage details
- Support of multiple views of the data:
- Each user may see a different view of the database
- Sharing of data and multi-user transaction processing:
- Allowing a set of concurrent users to retrieve from and to update the database.
- OLTP (Online Transaction Processing) is a major part of database applications. This allows hundreds of concurrent transactions to execute per second.
### Database Users
- Actors on the Scene
- Those who actually use and control the database content, and those who design, develop and maintain database applications
- Workers Behind the Scene
- Those who design and develop the DBMS software and related tools, and the computer systems operators
### When not to use a DBMS
- Main inhibitors (costs) of using a DBMS:
- High initial investment and possible need for additional hardware.
- Overhead for providing generality, security, concurrency control, recovery, and integrity functions.
- When a DBMS may be unnecessary:
- If the database and applications are simple, well defined, and not expected to change.
- If access to data by multiple users is not required.
- When a DBMS may be infeasible:
- In embedded systems where a general purpose DBMS may not fit in available storage
- When no DBMS may suffice:
- If there are stringent real-time requirements that may not be met because of DBMS overhead (e.g., telephone switching systems)
- If the database system is not able to handle the complexity of data because of modeling limitations
- If the database users need special operations not supported by the DBMS
## Database System Concepts and Architecture
### Data Models
- Data Model: describe the structure of a database, the operations for manipulating these structures, and certain constraints that the database should obey.
- Data Model Structure and Constraints:
- Structure: elements (and their data types) as well as groups of elements (e.g. entity, record, table), and relationships among such groups
- Constraints: restrictions
- Data Model Operations:
- basic model operations: generic insert, delete, update
- user-defined operations: compute_student_gpa, update_inventory
### Schemas versus Instances
- Database Schema: database structure, data types, and the constraints on the database.
- Schema Construct: A component of the schema
- Database State(database instance): The actual data stored in a database at a particular moment in time
### Three-Schema Architecture
- Internal schema: physical storage structures and access paths
- Conceptual schema: the structure and constraints for the whole database
- External schemas: various user views.
### DBMS Languages
- Data Definition Language (DDL): define the schema of a database. ex. create
- Data Manipulation Language (DML): manipulate data. ex. insert, delete and update
- High Level or Non-procedural Language: the user has to specify only “what to do” and not “how to do”.
- Low Level or Procedural Language: the program code is written as a sequence of instructions.
### Data dictionary / repository
- Active data dictionary: they describe and automatically reflect any updates or changes in their host databases
- Passive data dictionary: separately from the databases they describe to act as a repository for data information.

### Centralized DBMS
- Combines everything into single system including- DBMS software, hardware, application programs, and user interface processing software.
- User can still connect through a remote terminal – however, all processing is done at centralized site.
### Basic 2-tier Client-Server Architectures
- Clients can access the specialized servers as needed

### Three Tier Client-Server Architecture
- Intermediate Layer called Application Server or Web Server:
- Stores the web connectivity software and the business logic part of the application used to access the corresponding data from the database server
- Acts like a conduit for sending partially processed data between the database server and the client.

### Data Independence
- Logical Data Independence: change the conceptual schema without having to change the external schemas
- Physical Data Independence: change the internal schema without having to change the conceptual schema.
## Data Modeling Using the Entity-Relationship (ER) Model
### ER Model Concepts
- Entities: Entities are specific things or objects in the mini-world
- Attributes: Attributes are properties used to describe an entity.
- Relationships (and their relationship types and relationship sets)
### Types of Attributes
- Simple: Each entity has a single atomic value for the attribute.
- Composite: The attribute may be composed of several components.
- Multi-valued: An entity may have multiple values for that attribute.
### Entity Types
- Entities with the same basic attributes are grouped or typed into an entity type.
### Key Attributes
- An attribute of an entity type for which each entity must have a unique value is called a key attribute of the entity type.
- Each key is underlined
- **this is different from the relational schema where only one “primary key is underlined**
### Entity Set
- Each entity type will have a collection of entities stored in the database
### Value Sets (Domains) of Attributes
- Each simple attribute is associated with a value set
- Lastname has a value which is a character string of upto 15 characters, say
- Date has a value consisting of MM-DD-YYYY where each letter is an integer
### Entity Type CAR with two keys and a corresponding Entity Set

### Relationships
- A relationship relates two or more distinct entities with a specific meaning.
### Relationship Types
- Relationships of the same type are grouped or typed into a relationship type.

### Relationship Set
- A set of relationships of the same type is known as a relationship set

### Constraints(ratio constraints) on Relationships
- Cardinality Ratio
- One-to-one (1:1)
- One-to-many (1:N) or Many-to-one (N:1)
- Many-to-many (M:N)
- Existence Dependency Constraint: whether an entity in a relationship is optional or mandatory.
- zero (optional participation, not existence-dependent)
- one or more (mandatory participation, existence-dependent)
- Specified on each participation of an entity type E in a relationship type R
- Specifies that each entity e in E participates in at least min and at most max relationship instances in R
### Recursive (self-referencing) Relationship Type
- A relationship type between the same participating entity type in distinct roles
### Weak Entity Types
- An entity that does not have a key attribute and that is identification-dependent on another entity type.
## Enhanced Entity-Relationship (EER) Modeling
### Subclasses and Superclasses
- A superclass is the class from which many subclasses can be created.
- The subclasses inherit the characteristics of a superclass.
### Specialization
- the process of defining a set of subclasses of a superclass
- ex. {SECRETARY, ENGINEER, TECHNICIAN} is a specialization of EMPLOYEE based upon job type.
### Generalization
- Several classes with common features are generalized into a superclass
### Types of Specialization
- Predicate-defined ( or condition-defined) : based on some predicate
- Attribute-defined: shows the name of the attribute next to the line drawn from the superclass toward the subclasses
- User-defined: membership is defined by the user on an entity by entity basis
### disjointness constraint
- Disjoint: an entity should not belong to no more than one lower-level entity set.
- Overlapping: the same entity may belong to more than one lower-level entity set.
### Complereness constraint
- Total: every entity in the superclass must be a member of some subclass
- Partial: an entity not to belong to any of the subclasses
### Hierarchy and lattice
- Hierarchy: every subclass has only one superclass (called single inheritance); this is basically a tree structure
- lattice: a subclass can be subclass of more than one superclass (called multiple inheritance)
### Categories (UNION TYPES)
- a single superclass/subclass relationship with more than one superclass
- ex. vehicle registration, a vehicle owner can be a PERSON, a BANK (holding a lien on a vehicle) or a COMPANY.
### KR
### Onlogies
## The Relational Data Model and Relational Database Constraints
### Informal Definitions
- a relation looks like a table of values.
- contains a set of rows(tuples).
- each row represent certain facts that correspond to a real-world entity or relationship
- Each column has a column header (attribute name) that gives an indication of the meaning of the data items in that column
- key
- Each row has a value of a data item (or set of items) that uniquely identifies that row in the table
- Sometimes row-ids or sequential numbers are assigned as keys, called artificial key or surrogate key
### Formal Definitions - Schema
- The Schema (or description) of a Relation
- Denoted by R(A1, A2, .....An)
- R is the name of the relation
- The attributes of the relation are A1, A2, ..., An
- Example: CUSTOMER (Cust-id, Cust-name, Address, Phone#)
- CUSTOMER is the relation name
- Defined over the four attributes: Cust-id, Cust-name, Address, Phone#
### Formal Definitions - Tuple
- A tuple is an ordered set of values (enclosed in angled brackets ‘< … >’)
- A row in the CUSTOMER relation is a 4-tuple and would consist of four values,
### Formal Definitions - Domain
- A domain has a logical definition:
- A domain also has a data-type or a format defined for it.
### Formal Definitions - State
- The relation state is a subset of the Cartesian product of the domains of its attributes
### Key Constraints
- Superkey of R:
- Is a set of attributes SK of R
- No two tuples in any valid relation state r(R) will have the same value for SK
- That is, for any distinct tuples t1 and t2 in r(R), t1[SK] $\neq$ t2[SK]
- This condition must hold in any valid state r(R)
- Any key is a superkey (but not vice versa)
- Any set of attributes that includes a key is a superkey
- A minimal superkey is also a key
- Candidate Key
- minimal super key.
- The minimal set of attributes that can uniquely identify a record.
- Primary Key
- It is a unique key.
- It can identify only one tuple (a record) at a time.
### Relational Database Schema
- A set S of relation schemas that belong to the same database.
- S is the name of the whole database schema
- S = {R1, R2, ..., Rn} and a set IC of integrity constraints.
- R1, R2, …, Rn are the names of the individual relation schemas within the database S
### Entity Integrity
- The primary key attributes PK of each relation schema R in S cannot have null values in any tuple of r(R).
### Referential Integrity
- The referencing relation and the referenced relation.
- Tuples in the referencing relation R1 have attributes FK (called foreign key attributes) that reference the primary key attributes PK of the referenced relation R2.
- A referential integrity constraint can be displayed in a relational database schema as a directed arc from R1.FK to R2.

## Basic SQL(Structured Query language)
### SQL Data Definition, Data Types, Standards
- Terminology: Table, row, and column used for relational model terms relation, tuple, and attribute
- data definition: CREATE statement
### Schema and Catalog Concepts in SQL
- SQL schema
- Identified by a schema name
- Includes an authorization identifier and descriptors for each element
- Schema elements include
- Tables, constraints, views, domains, and other constructs
- CREATE SCHEMA statement
- CREATE SCHEMA COMPANY AUTHORIZATION ‘Jsmith’;
- Catalog
- Named collection of schemas in an SQL environment
### The CREATE TABLE Command in SQL
CREATE TABLE EMPLOYEE
CREATE TABLE COMPANY.EMPLOYEE
- Specifying a new relation
- Provide name of table
- Specify attributes, their types and initial constraints
### Attribute Data Types and Domains in SQL
- Basic data types
- Numeric data types
- Integer numbers: INTEGER, INT, and SMALLINT
- Floating-point (real) numbers: FLOAT or REAL, and DOUBLE PRECISION
- Character-string data types
- Fixed length: CHAR(n), CHARACTER(n)
- Varying length: VARCHAR(n), CHAR VARYING(n), CHARACTER VARYING(n)
- Bit-string data types
- Fixed length: BIT(n)
- Varying length: BIT VARYING(n)
- Boolean data type
- Values of TRUE or FALSE or NULL
- DATE data type
- Ten positions
- Components are YEAR, MONTH, and DAY in the form YYYY-MM-DD
- Additional data types
- Timestamp data type
- Includes the DATE and TIME fields
- INTERVAL data type
- Specifies a relative value that can be used to increment or decrement an absolute value of a date, time, or timestamp
- Domain
- CREATE DOMAIN SSN_TYPE AS CHAR(9);
- Name used with the attribute specification
- Makes it easier to change the data type for a domain that is used by numerous attributes
- TYPE
- User Defined Types (UDTs) are supported for object-oriented applications. (See Ch.12) Uses the command: CREATE TYPE
## SQL assertion and trigger檔案
### General constraint
What is General constraint
> constraints that do not fit in the basic SQL categories
Mechanism (for oracle)
> CREATE ASSERTION
Components
> A constraint name
> followed by CHECK
> followed by a condition
Example
> 
### SQL Triggers
What is SQL Triggers?
> to monitor a database and take initiate action when a condition occurs
Components
> Event
> Condition
> Action
Example
> 
Trigger in Postgresql
> 
### Views in SQL
What is view
> virtual table
Limitation
> In the UPDATE operation
Component (CREATE VIEW)
> table name
> list of attribute names
> specify the table contents
Example
> 
How to drop virtual table
> Drop virtual_table_name
### View implementation
Strategy1: Query modification approach
> Compute the view as and when needed. Do not store permanently
> Drawback: time-consuming
Strategy2: View materialization
> Physically create a temporary view table when the view is first queried
> Keep that table on the assumpltion that other queries on the view will follow
Multiple ways to handle materialization
> immediate update: updates a view as the base tables are changed
> lazy update: updates the view when needed
> periodic update: updates the view periodically
Un-updatable views
> aggregate function
> multiple tables using joins
View as authorization mechanism
> 
## Embedded SQL
## Transaction
### Transaction
What is Transaction?
> local unit of database processing
Multiprogramming
> allow OS to execute multiple processes concurrently
Two types of Multiprogramming
> Interleaved processing
> Parallel processing
> 
Read item
> find the address of the disk block that cintains item X
> Copy that disk block into a buffer in main memory
> Copy item X from the buffer to the program variable named X
Write item
> find the address of the disk block that cintains item X
> Copy that disk block into a buffer in main memory
> Copy item X from the buffer to the program variable named X
> Copy item X from the program variable named X into its correct location in the buffer
> Store the updated block from the buffer back to disk
Example of transaction
> 
### Concurrency control
Lost update problem
> Occurs when two transactions that access the same database items have operations interleaved
> 
Temporary update problem
> Occurs when one transacation updates a database item and the transaction fails
> 
Incorrect summary problem
> If one transaction is calculating an aggregate summary function on a number of records while other transactions are updating some of these records
> 
Unrepeatable read problem
> transaction reads the same item twice
> 
Phantom read problem
> the variable does not exist
> 
### Recovery
Types of transaction failures
> Computer failure
> Transaction or system error
> Local errors or exception conditions detected by the transaction
> Concurrency control enforcement
> Disk failure
> Physical problem
Transaction diagram
> 
System log (Recovery)
> keep track of transaction operations
> is backed up periodically
The information of system log
> T: unique transaction-id
>
> [start_transaction,T]: Records that transaction T has started execution.
> [write_item,T,X,old_value,new_value]: Records that transaction T has changed the value of database item X from old_value to new_value.
> [read_item,T,X]: Records that transaction T has read the value of database item X.
> [commit,T]: Records that transaction T has completed successfully, and affirms that its effect can be committed (recorded permanently) to the database.
> [abort,T]: Records that transaction T has been aborted.
Commit of a transaction
> Occurs when operations have completed successfully
> Force-writing the log buffer to disk
ACID properties
> Atomicity: recovery occurs when the transaction is not complete
> Consistency preservation: recovery occurs when breaking the rule
> Isolation: avoid race condition
> Durability or permanency: Changes must persist in the database
Conflict operation
> Operations access the same item X
> Operations belong to different transactions
> At least one of the operations is a write_item(X)
> 
Recoverable schedule
> One where no committed transaction needs to be rolled back
> before T2 is committed, it can check whether the T11 and T12 are committed

Unrepeatable read
> 
Dirty read
> T1 is failed, but T2 read the uncommitted X
> 
Overwriting uncommitted data
> 
Conflict serializability
> 
Cascading rollback
> 
Cascadeless schedule
> 
Strict schedule
> A schedule in which a transaction can neither read or write an item X until the last transaction that wrote X has committed
> 
Serializable schedule
> always correct when concurrent transactions are executing
> EX. transaction T1 before T2
Result equivalent schedule
> produce the same final state, but may be accidental
> 
Dependency graphs
> 
> 
> 
Ioslation level
> 
## concurrency control
Executing with locks
> 
Lock
> control concurrent access to a data item
> One lock for each item in the database
Binary Lock (two states)
> Locked(1) : item cannot be accessed
> Unlocked(0) : item can be accessed when requested
Lock table
> specifies items that have locks
Lock manager subsystem
> keeps track of and controls access to locks
> rules enforced by lock manager module
Lock type
> Shared locks: When a statement reads data without making any modifications, its transaction obtains a shared lock on the data. Another transaction that tries to read the same data is permitted to read.
> Exclusive locks: When a statement modifies data, its transaction holds an exclusive lock on data that prevents other transactions from accessing the data.
Lock compatibility matrix
> 
Exclusive lock implementation
> 
Exclusive and shared lock implementation
> 
Two-phase locking
> concurrency control protocol that determines whether a txn can access an object in the database on the fly.
>
> sufficient to guarantee conflict serializability and precedence graph is acyclic
>
> draw back: cascafing absort
Two phase
> Growing
> shrinking
Transaction lifetime
> The transaction is not allowed to acquire/upgrade locks after the growing phase finishes.
> 
The solution Two-phase locking problem
> dirty reads: strong strict Two-phase locking
> deadlocks: detection or prevention
Strong strict Two-phase locking
> Allows only conflict serializable schedules, but it is often stronger than needed for some apps.
> 
Deadlock
> is a cycle of transactions waiting for locks to be released by each other
Deadlock detection
> create a wait-for graph
> Edge from Ti to Tj if Ti is waiting for Tj to release a lock.
> 
Dead lock handling
> When the DBMS detects a deadlock, it will select a "victim" txn to rollback to break the cycle.
>The victim txn will either restart or abort(more common) depending on how it was invoked.
Victim selection
>By age (lowest timestamp)
> By progress (least/most queries executed)
> By the # of items already locked
> By the # of txns that we have to rollback with it
> the # of times a txn has been restarted in the past to prevent starvation
Dead lock handling: rollback length
> decide on how far to rollback the txn's changes.
> - Completely
> - Minimally
Deadlock prevention
> When a txn tries to acquire a lock that is held by another txn, the DBMS kills one of them to prevent a deadlock.
Conditions for deadlock
> Mutual exclusion: processes require exclusive control of its resources (not sharing)
> Hold and wait: process may wait for a resource while holding others.
> No preemption: process will not give up a resource until it is finished with it. Also, processes are irreversible: unable to reset to an earlier state where resources not held.
> Circular wait: each process in the chain holds a resource requested by another
Conservative (static) Two-phase locking
> Requires a transaction to lock all the items it accesses before the transaction begins
Dealing with Deadlock and starvation
> Deadlock prevention protocols
> - Every transaction locks all items it needs in advance
> - Ordering all items in the database
> - Both approaches impractical
>
> Protocols based on a timestamp
> - Wait-die
> - Wound-wait
Deadlock prevention
> Assign priorities based on timestamps:
> Older Timestamp = Higher Priority (e.g., T1 > T2)
> Wait-Die ("Old Waits for Young")
> - If requesting txn has higher priority than holding txn, then requesting txn waits for holding txn.
> - Otherwise requesting txn aborts.
>
> Wound-Wait ("Young Waits for Old")
> - If requesting txn has higher priority than holding txn, then
holding txn aborts and releases lock.
> - Otherwise requesting txn waits.
>
> 
Deadlock Prevention Without Timestamps
> No waiting: If T needs X and X is already locked, immediately abort T, restart later.
> Cautious waiting: Suppose Ti needs X, and X is already locked by Tj.
> - If Tj is waiting for anything, then abort Tj
> - else abort Ti.
>
> Timeout: If T waits longer than some fixed time, abort and restart T.
Starvation and livelock
> A T might be aborted and restarted indefinitely: Starvation
> A T might wait indefinitely, not for the same other T (deadlock), but for different reasons at different times: Livelock
Database Lock Hierarchy
> 
Intention locks
> An intention lock allows a higher level node to be locked in shared or exclusive mode without having to check all descendent nodes.
> If a node is in an intention mode, then explicit locking is being done at a lower level in the tree.
The types of intention locks
> Intention-Shared (IS): Indicates explicit locking at a lower level with shared locks.
> Intention-Exclusive (IX): Indicates locking at lower level with exclusive or shared locks.
> Shared+Intention-Exclusive (SIX): The subtree rooted by that node is locked explicitly in shared mode and explicit locking is being done at a lower level with exclusive-mode locks.
Compatibility matrix
> 
Lockint protocol
> Each txn obtains appropriate lock at highest level of the database hierarchy.
> To get S or IS lock on a node, the txn must hold at least IS on parent node.
> To get X, IX, or SIX on a node, must hold at least IX on parent node.
Example
> T1: Read in R
> T2: update in R
> 
Example
> T1 – Scan R and update a few tuples.
> T2 – Read a single tuple in R.
> T3 – Scan all tuples in R.
> 
Lock Escalation
> Lock escalation is the process of converting many fine-grain locks to fewer coarse-grain locks, which reduces memory overhead at the cost of decreasing concurrency.
## Timestamp ordering concurrency control
Concurrency control approaches
> Two-phase locking: determine serializability order of conflicting operations at runtime while transactions execute
> Timestamp ordering: determine serializability order of transactions before they execute
Timestamp ordering concurrency control
> use timestamps to determine the serializability order of transactions
> If TS(Ti)<TS(Tj), then DBMS must ensure that the execution schedule is equivalent to a serial schedule where Ti appears before Tj
Timestamp allocation
> Each transaction Ti is assigned a unique fixed timestamp that is monotonically increasing.
Basic Timestamp ordering
> Every object X is tagged with timestamp of the last transaction that successfully did read/write
> - W-TS(X): write timestamp on X
> - R-TS(X): read timestamp on X
>
> check timestamps for every operation: if transaction tried to access an object "from the future", it absorts and restarts
READS
> 
Writes
> 
Example
> 
> 
> 
> 
> 
> 
> 
Example
> 
Thomax write rule
> 
> 
> 
Recoverable schedule
> 
> 
Optimistic concurrency control
> 
OCC phases
> 1. read phase
> 2. validation phase
> 3. write phase
> 
Example
> 
> 
> 
> 
> 
validation phase
> 
> 
> 
Serial validation
> 
read phase
> 
Backward validation
> 
forward validation
> 
Validation step 1
> 
> 
Validation step 2
> 
> 
Validation step 3
> 
> 
> 
> 