# PostgreSQL
# Installation
```bash=
# APT repository
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" >> /etc/apt/sources.list.d/pgdg.list'
wget -q https://www.postgresql.org/media/keys/ACCC4CF8.asc -O - | sudo apt-key add -
# Install Postgres
sudo apt-get update
sudo apt-get install postgresql postgresql-contrib postgresql-client libpq-dev
```
## Data Location
```=
echo $PGDATA
psql -c "SHOW data_directory"
# e.g. /path/to/postgres/data/pgdata
```
## Database Version
```sql=
SELECT version();
```
## Forgot Password
- [Reset password](https://wiki.postgresql.org/wiki/FAQ#I_lost_the_database_password._What_can_I_do_to_recover_it.3F)
# Schema Design
## Structured
Keys + Fields
Accumulated
- Partition
Frequently accessed records should be stored in a small partition that is quick to retrieve; the rest go into other partitions, while keeping the total partition count low enough to avoid the overhead of many partition tables.
Hot targets
- number of views
- number of notifications
Personal data
- Hash partition
Frequently accessed data
- Cache
- O(1) lookup
```
// Query small/cold objects directly, cache hot objects locally
if (object < threshold) {
    return db.search(object);
} else if (!cache.contains(object)) {
    cache.put(object, db.load(object));
}
return cache.get(object);
```
```
// Trigger
db.markDirty();
```
## Non-structured
Keys + File
# Create Database
```bash=
# List databases
psql
\l
```
```bash=
# Create database
su - postgres -c "createdb -O role_owner database_name"
psql
CREATE DATABASE database_name OWNER role_owner;
# Change owner
ALTER DATABASE database_name OWNER TO new_role_owner;
# Drop database
su - postgres -c "dropdb --if-exists database_name"
psql
DROP DATABASE IF EXISTS database_name;
```
## Change Database
```
psql -U <user>
\c <database>
```
# `CREATE TABLE`
```bash=
psql
\dt[+]
```
## Column Identifier
Unless an identifier is double-quoted, it is case-insensitive (folded to lower case). Double-quoted identifiers are case-sensitive.
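A minimal illustration (hypothetical table and column names):
```sql=
CREATE TABLE demo ("MixedCase" int, plain int);
SELECT mixedcase FROM demo;    -- ERROR: column "mixedcase" does not exist
SELECT "MixedCase" FROM demo;  -- OK: quoted identifiers keep their case
SELECT PLAIN FROM demo;        -- OK: unquoted identifiers fold to lower case
```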
## Serial
```sql=
CREATE TABLE table_ (
    a SERIAL,
    b BIGSERIAL
);
-- Under the hood, SERIAL creates a sequence and reads it with nextval():
CREATE SEQUENCE seq;
SELECT nextval('seq');
```
## Tablespace
Tablespaces let you store different objects on different disks: frequently accessed data can be placed in a tablespace on an SSD, while the rest stays on an HDD.
```sql=
CREATE TABLESPACE fastspace LOCATION '/ssd1/postgresql/data';
DROP TABLESPACE fastspace;
-- Create an object to other space
CREATE TABLE foo(i int) TABLESPACE space1;
-- Use a variable to reduce declarations
SET default_tablespace = space1;
CREATE TABLE foo(i int);
```
The directory must already exist, be empty, and be owned by the database user. If a directory that is part of the cluster goes missing or is lost, the database may crash.
A database's default tablespace stores its system catalogs, as well as any tables created without an explicit tablespace.
The `pg_global` tablespace is used for shared system catalogs. The `pg_default` tablespace is the default tablespace of the `template1` and `template0` databases (and, therefore, will be the default tablespace for other databases as well).
Display existing tablespaces
```sql=
SELECT spcname FROM pg_tablespace;
\db
```
The directory `$PGDATA/pg_tblspc` contains symbolic links that point to each of the non-built-in tablespaces defined in the cluster.
### Move Tablespace from One Filesystem to Another Filesystem
- [Moving Tablespaces in PostgreSQL](https://www.enterprisedb.com/blog/moving-tablespaces-postgresql)
## Partition Table
Partitioning effectively substitutes for the upper tree levels of indexes, making it more likely that the heavily-used parts of the indexes fit in memory.
Seldom-used data can be migrated to cheaper and slower storage media.
### Criterion
Partition keys are used frequently in `WHERE` clause.
Having too few partitions may mean that indexes remain large (much as with a regular indexed table) and that data locality stays poor, which can result in low cache hit ratios. Too many partitions can mean longer query planning times and higher memory consumption during both planning and execution, because each partition's metadata must be loaded into the local memory of every session that touches it.
Never just assume that more partitions are better than fewer partitions, nor vice-versa.
### Usage
```sql=
CREATE TABLE partition_table (
-- fields
) PARTITION BY RANGE (field) | LIST (field) | HASH (field);
```
It is a virtual table that doesn't store any physical data.
Updating the partition key of a row will cause it to be moved into a different partition if it no longer satisfies the partition bounds of its original partition.
Write a script that adds new partitions regularly.
```sql=
CREATE TABLE partition_table_n PARTITION OF partition_table
FOR VALUES FROM ('2008-01-01') TO ('2008-02-01')
TABLESPACE fasttablespace;
```
The lower bound is inclusive and the upper bound is exclusive.
Bounds cannot overlap with those of an existing partition.
Rows inserted into `partition_table` are routed automatically to the matching partition based on the partition key; a row that maps to no partition raises an error (see the `DEFAULT` partition sketch below).
The partition bounds become implicit constraints on each partition.
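A hedged sketch: from PostgreSQL 11 on, a `DEFAULT` partition can catch rows that match no other partition instead of raising an error.
```sql=
CREATE TABLE partition_table_default PARTITION OF partition_table DEFAULT;
```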
### Maintenance
Remove old data
```sql=
DROP TABLE partition_table_n
```
The method above takes more time and locks the table; the one below does not:
```sql=
ALTER TABLE partition_table DETACH PARTITION partition_table_n[ CONCURRENTLY];
```
Adding the `CONCURRENTLY` qualifier as in the second form allows the detach operation to require only *SHARE UPDATE EXCLUSIVE* lock on the **parent table**.
Add partition
```sql=
-- Inherit properties from another table by LIKE
CREATE TABLE partition_table_n
(LIKE partition_table INCLUDING DEFAULTS INCLUDING CONSTRAINTS)
TABLESPACE fasttablespace;
-- Without the CHECK constraint, the ATTACH step has to scan every row.
ALTER TABLE partition_table_n ADD CONSTRAINT time_range
CHECK ( date_field >= DATE '2008-02-01' AND date_field < DATE '2008-03-01' );
-- Loading data before attaching reduces the time the lock is held.
\copy partition_table_n from 'partition_table_n_x'
ALTER TABLE partition_table ATTACH PARTITION partition_table_n
FOR VALUES FROM ('2008-02-01') TO ('2008-03-01' );
-- After attaching, the now-redundant check can be removed.
ALTER TABLE partition_table_n DROP CONSTRAINT time_range;
```
`ATTACH` takes *SHARE UPDATE EXCLUSIVE* lock.
### Pruning
Without *pruning*, the planner scans all partitions.
```
SET enable_partition_pruning = on;
```
Note that partition pruning is driven only by the constraints defined implicitly by the partition keys, not by the presence of indexes.
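To check that pruning happens, inspect the plan; only the matching partitions should be scanned (a sketch reusing the hypothetical `partition_table` above).
```sql=
EXPLAIN SELECT count(*) FROM partition_table
WHERE field >= DATE '2008-01-01' AND field < DATE '2008-02-01';
-- The plan should mention only partition_table_n, not every partition.
```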
# Alter Column Type
```sql=
BEGIN;
ALTER TABLE tab ADD COLUMN new_col new_data_type;
UPDATE tab SET new_col = CAST(old_col AS new_data_type);
ALTER TABLE tab DROP COLUMN old_col;
COMMIT;
```
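Alternatively, the column can be converted in place with `ALTER COLUMN ... TYPE ... USING`; note this rewrites the table under an exclusive lock (a sketch with the same hypothetical names).
```sql=
ALTER TABLE tab
    ALTER COLUMN old_col TYPE new_data_type
    USING old_col::new_data_type;
```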
# Index
Without an index, the database has to scan the entire table to find rows matching the `WHERE` clause. With an index, it can scan the much smaller indexed data to learn the positions of the matching rows.
Creating an index can take a long time (possibly hours). A plain `CREATE INDEX` does not block `SELECT`, but it does block `INSERT`, `UPDATE`, and `DELETE`. `CREATE INDEX CONCURRENTLY` performs two table scans in separate transactions: the build waits for ongoing transactions on the same table, and after the second scan it waits for transactions that started before the scan to terminate. If a problem such as a deadlock occurs, the index is left in an invalid state; to recover, either `DROP INDEX; CREATE INDEX CONCURRENTLY;` again or run `REINDEX INDEX CONCURRENTLY`.
Once an index exists, it has to be kept in sync with the table, so data-modifying operations (`INSERT`, `UPDATE`, `DELETE`, and so on) become more expensive.
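A minimal sketch of building an index without blocking writes, and of recovering when a failed build leaves it invalid (hypothetical names; `REINDEX ... CONCURRENTLY` needs PostgreSQL 12+).
```sql=
CREATE INDEX CONCURRENTLY idx_tab_col ON tab (col);
-- If the build failed and the index is marked INVALID:
REINDEX INDEX CONCURRENTLY idx_tab_col;
-- or rebuild it from scratch:
DROP INDEX idx_tab_col;
CREATE INDEX CONCURRENTLY idx_tab_col ON tab (col);
```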
## Index Types
```
CREATE INDEX <NAME> ON <TABLE>[ USING <TYPE>] (<COL>)
```
### B-Tree (default)
```
< <= = >= >
LIKE 'foo%'
col ~ '^foo'
```
`LIKE '%foo'` cannot use a B-Tree index, because the tree is ordered by the value's prefix.
### Hash
Stores a 32-bit hash code derived from the value of the indexed column.
```
=
```
### GiST
GiST operator classes for several two-dimensional geometric data types.
```
<< &< &> >> <<| &<| |&> |>> @> <@ ~= &&
```
[Geometric Operators](https://www.postgresql.org/docs/current/functions-geometry.html)
[GiST Operator Classes](https://www.postgresql.org/docs/current/gist-builtin-opclasses.html#GIST-BUILTIN-OPCLASSES-TABLE)
### SP-GiST
SP-GiST operator classes for two-dimensional points.
```
<< >> ~= <@ <<| |>>
```
[Geometric Operators](https://www.postgresql.org/docs/current/functions-geometry.html)
[SP-GiST Operator Classes](https://www.postgresql.org/docs/current/spgist-builtin-opclasses.html#SPGIST-BUILTIN-OPCLASSES-TABLE)
### GIN
For data values that contain multiple component values, such as arrays.
[Array Operators](https://www.postgresql.org/docs/current/functions-array.html)
[GIN Operator Classes](https://www.postgresql.org/docs/current/gin-builtin-opclasses.html#GIN-BUILTIN-OPCLASSES-TABLE)
# Regular Expression
```sql=
where col ~ 'regex'
```
# Parse Order
1. `JOIN`
2. `WHERE`
3. `GROUP BY`
   - Aliases defined in the `SELECT` list can be referenced here
4. `HAVING`
5. window functions (`SELECT`)
6. `ORDER BY`
   - Aliases defined in the `SELECT` list can be referenced here
# Prepared Query and Normal Query
A normal query knows the literal values in the `WHERE` clause, so the planner can use column statistics to build the execution plan. A prepared query may be planned before the values are known, so the planner falls back to a more generic, conservative plan.
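A quick sketch with a hypothetical `users` table; at `PREPARE` time the parameter value is unknown, so the planner may fall back to a generic plan.
```sql=
PREPARE find_user(int) AS
    SELECT * FROM users WHERE id = $1;
EXECUTE find_user(42);
DEALLOCATE find_user;
```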
# `JOIN`
```sql=
FROM table_a AS a
JOIN table_b AS b ON (a.col = b.col);
FROM table_a AS a
JOIN table_b AS b USING (col);
```
## Index
*B-tree* indexes are recommended for scalar data in concurrent applications.
*GIN*, *GiST*, and *SP-GiST* indexes are recommended for non-scalar data in concurrent applications.
## System Catalogs
Access to system catalogs (for example, looking up newly created tables) is not done under the transaction's isolation level.
# Window Function
```=
invoked_func(args) OVER (
[
PARTITION BY cols
ORDER BY col ASC|DESC
NULLS FIRST|LAST
]
)
```
## Window Alias
```=
invoked_func() OVER alias
...
WINDOW alias AS (PARTITION BY ...)
```
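A concrete usage sketch (hypothetical `sales` table): rank rows within each group through a named window.
```sql=
SELECT region, amount,
       rank() OVER w AS region_rank
FROM sales
WINDOW w AS (PARTITION BY region ORDER BY amount DESC);
```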
# Common Table Expressions (CTE)
```sql=
WITH
table_a (col_1, col_2) AS (
select a, b from table_real
),
table_b (col_1) AS (
values (1), (2)
)
select * from table_a, table_b;
```
# Recursive CTE
```
non-recursive statement -> working_table
while (working_table is not empty)
    append working_table to the result (UNION ALL)
    recursive statement joining working_table with other tables -> intermediate_table
    working_table = intermediate_table
End conditions
- a bounded range (e.g. n < 10)
- no more matching rows
```
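A minimal working example that sums the integers 1 to 10 (the classic pattern from the PostgreSQL documentation):
```sql=
WITH RECURSIVE t(n) AS (
    VALUES (1)          -- non-recursive term seeds the working table
  UNION ALL
    SELECT n + 1 FROM t -- recursive term runs until the WHERE stops matching
    WHERE n < 10
)
SELECT sum(n) FROM t;
```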
# Function
The result of the last query statement is returned when the function declares a return type.
Statements that can return rows are `SELECT`, or `INSERT`/`UPDATE`/`DELETE` with a `RETURNING` clause. The returned column is named after the function.
```sql=
CREATE OR REPLACE FUNCTION func_name([para ]type, OUT col1 type) RETURNS type AS $$
DECLARE
rename ALIAS FOR $1; -- $1 is the anonymous parameter
name [CONSTANT] type [NOT NULL] [{DEFAULT | := | =} expression];
var type := value;
BEGIN
var := new_value;
IF condition THEN
-- Do something
ELSEIF condition THEN
-- Do something
ELSE
-- DO something
END IF;
CASE
WHEN condition THEN
WHEN condition THEN
ELSE
END CASE;
[WHILE condition] LOOP
EXIT when condition;
CONTINUE when condition;
END LOOP;
FOR i IN [REVERSE] from..to [BY step] LOOP
END LOOP;
FOR row IN query LOOP
END LOOP;
RETURN val | col1 := value | RETURN QUERY statement;
EXCEPTION
WHEN condition THEN statements;
END;
$$ language plpgsql;
SELECT func_name(para); -- do not pass an argument for the OUT parameter col1
-- the result is returned with the column name `col1`
```
type
- `integer`
- `real`
- `varchar`
- `RECORD` for any row type
- `TABLE_NAME`
- `TABLE(col1 type, col1 type, ...)`
- `para%TYPE` for referring type
- `table_para%ROWTYPE` to declare a table variable
Statements inside the function body must either return no rows or capture their rows with `INTO`. Use `PERFORM` instead of a bare `SELECT` whose result is discarded; otherwise the call fails with `ERROR: query has no destination for result data`.
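A small sketch of both forms, with hypothetical names; `pg_sleep` and `pg_class` are built-ins used here only as stand-ins.
```sql=
CREATE OR REPLACE FUNCTION touch_stats() RETURNS integer AS $$
DECLARE
    cnt integer;
BEGIN
    PERFORM pg_sleep(0);                     -- result deliberately discarded
    SELECT count(*) INTO cnt FROM pg_class;  -- result captured with INTO
    RETURN cnt;
END;
$$ LANGUAGE plpgsql;
SELECT touch_stats();
```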
# Procedure
```sql=
CREATE OR REPLACE PROCEDURE func_name(OUT col1 type) AS $$
DECLARE
var type := value;
BEGIN
-- can ROLLBACK or COMMIT;
RAISE NOTICE 'notice msg %', var;
RAISE EXCEPTION 'error msg'; -- with rollback
END;
$$ language plpgsql;
CALL func_name(col1); -- an argument must be supplied for col1
```
# Query Plan
```sql=
-- Only print plan
EXPLAIN query
-- (cost=start-up-cost..total-cost rows=num_of_rows width=avg_row_byte)
-- Print plan with executing
EXPLAIN ANALYZE query
-- Refresh statistics after inserting, updating, or deleting a lot of data
ANALYZE [table_name [(col1,col2,...)]]
```
The statistics are stored in
```sql=
SELECT relname, relkind, reltuples, relpages
FROM pg_class
where relname like 'table_name%'
-- relkind
-- r for table
-- i for index
```
They are updated by `VACUUM`, `ANALYZE`, and a few DDL commands such as `CREATE INDEX`.
# Performance Tips
Disable auto-commit when doing lots of `INSERT`.
Remove indexes and foreign key constraints before doing lots of `INSERT`s; building an index over the loaded data in one pass is faster than updating it row by row.
Increase `maintenance_work_mem` for those tasks needing lots of memory such as `CREATE INDEX` and `ALTER TABLE ADD FOREIGN KEY`
Increase `max_wal_size`, so checkpoints (which flush dirty pages to disk) happen less often during the load.
Disable WAL archiving and minimize WAL logging by setting `archive_mode` to `off`, `wal_level` to `minimal`, and `max_wal_senders` to `0`.
Use `pg_restore --jobs=N` for concurrent data loading and index creation.
After loading a large amount of data, run `ANALYZE` so the planner has up-to-date statistics about the table. A combined sketch follows.
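A sketch of a bulk-load sequence that follows these tips (hypothetical table and file path):
```sql=
BEGIN;                                                -- one transaction instead of per-row autocommit
COPY big_table FROM '/path/to/data.csv' WITH (FORMAT csv);
COMMIT;
CREATE INDEX idx_big_table_col ON big_table (col);    -- build indexes after loading
ANALYZE big_table;                                    -- refresh planner statistics
```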
# Scenarios
## Find the max of value in rows
```sql=
-- Caveat: misses ties when two ids share the same sum
select id, sum(val) as s
from tableA
group by id
order by s DESC -- DESC for max, ASC for min
LIMIT 1;
-- Ties-safe version:
with sums as (
select id, sum(val) as s
from tableA
group by id
)
select *
from sums
where s = (select max(s) from sums)
```
## Equally split ranks to N categories as much as possible
```sql=
ntile(n) OVER (ORDER BY col)
```
## To String
```sql=
TO_CHAR(expression, format)
-- expression
timestamp
interval
integer
double precision
numeric
-- format
-- numeric format
.|D -- Decimal point
,|G -- Thousands separator
9 -- Unmatched length without padding
0 -- Unmatched length with zero padding
S -- Negative with -, Positive with +
L -- Dollar sign
-- time format
Y,YYY
YYYY
YYY
YY
Y
MONTH|MON -- JANUARY|JAN
Month|Mon -- January|Jan
month|mon -- january|jan
MM -- 01 ~ 12
DAY|DY -- SUNDAY to SATURDAY | SUN to SAT
Day|Dy
day|dy
DDD -- 001 ~ 366
DD -- 01 ~ 31
D -- Sunday (1) to Saturday (7)
ID -- Monday (1) to Sunday (7)
W -- week of month (1-5)
WW -- week of year (1-53)
CC -- Century
HH -- 0 ~ 12
HH24 -- 0 ~ 23
MI -- 0 ~ 59
SS -- 0 ~ 59
MS -- Millisecond
US -- Microsecond
AM|A.M|PM|P.M|lower cases
```
## Casting
```sql=
type ''
''::type
CAST('' AS type)
```
## Date Difference
```sql=
-- Diff days
timestamp '' - timestamp ''
-- Date before or after an INTERVAL
timestamp + interval 'term'
-- term
'1 month'
'1 day'
```
## Find the part of date
```sql=
EXTRACT(part FROM time_related_type)
-- part
MILLENNIUM -- year/1000
CENTURY -- year/100
DECADE -- year/10
QUARTER -- (month - 1)/3 + 1
YEAR
MONTH
*WEEK -- week of year
DAY
*DOW -- day of week Sunday(0) ~ Saturday(6)
*ISODOW -- Monday (1) to Sunday (7)
*DOY -- day of year, 1 to 366
HOUR
MINUTE
SECOND
MILLISECONDS
EPOCH -- seconds since 1970-01-01 00:00:00 UTC
*TIMEZONE
*TIMEZONE_HOUR
*TIMEZONE_MINUTE
-- time_related_type
timestamp
interval
```
> `*` marks parts that are not valid for the `interval` type.
## Generate a Series
```sql=
generate_series(type '', type '', interval '')
```
## Number of Day in Month
```sql=
(date_trunc('month', date) + interval '1 month') - date_trunc('month', date)
```
## Reset a Date to Start of YEAR, MONTH, DAY
```sql=
date_trunc('year|month|day', date)
```
## Rest of Day from a Date
```sql=
```
## Trim
```sql=
trim(string)
```
## Divide by Integer or Float
```sql=
col / 2 -- for integer
col / 2.0 -- for float
```
## Round
```sql=
ROUND(NUMERIC[, to_digit])
```
## Padding
```sql=
lpad(CHAR(n), len, 'with')
```
## Substring
```sql=
substr(string, from_index, len) -- index starts from 1
```
## Replace
```sql=
translate(string, 'from_chars', 'to_chars') -- character-by-character mapping
regexp_replace(string, 'pattern', 'as_string', 'g')
```
## Current Time
```sql=
CREATE TABLE test (modtime TIMESTAMP DEFAULT CURRENT_TIMESTAMP);
```
## Get with Default Value
```sql=
COALESCE(col1, 'DEFAULT')
```
## Audit Trail
- [Using trigger](https://wiki.postgresql.org/wiki/FAQ#Is_there_a_way_to_leave_an_audit_trail_of_database_operations.3F)
# Trigger
```sql=
CREATE OR REPLACE FUNCTION func_name() RETURNS TRIGGER AS $$
BEGIN
-- TG_OP returns INSERT, UPDATE, DELETE, or TRUNCATE;
-- TG_WHEN returns BEFORE, AFTER, or INSTEAD OF
-- TG_TABLE_NAME returns the table that fired the trigger
-- TG_TABLE_SCHEMA
-- TG_ARGV[]
-- TG_LEVEL returns ROW or STATEMENT
-- NEW
-- OLD
RAISE EXCEPTION 'error msg'; -- with rollback
RETURN NEW|OLD|NULL;
END;
$$ language plpgsql;
DROP TRIGGER IF EXISTS trigger_name ON table_name;
CREATE TRIGGER trigger_name BEFORE|AFTER|INSTEAD OF INSERT OR UPDATE OR DELETE ON table_name FOR EACH ROW EXECUTE FUNCTION func_name();
```
Row-level `BEFORE` and `INSTEAD OF` triggers must return `NEW` (the row to be used). `AFTER` triggers can return `NULL`.
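A concrete sketch: keep a `modtime` column current on every update (hypothetical `test` table as in the Current Time scenario; `EXECUTE FUNCTION` needs PostgreSQL 11+).
```sql=
CREATE OR REPLACE FUNCTION set_modtime() RETURNS TRIGGER AS $$
BEGIN
    NEW.modtime := CURRENT_TIMESTAMP;  -- modify the row before it is written
    RETURN NEW;                        -- a BEFORE trigger must return the row to use
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_set_modtime
    BEFORE UPDATE ON test
    FOR EACH ROW EXECUTE FUNCTION set_modtime();
```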
# Server Configuration
```bash=
# Show location of configuration file
su - postgres -c psql
> SHOW config_file;
```
## Configurations
- `max_connections=<num>` - If set too high, memory runs out and the system starts swapping.
- `shared_buffers=<num>` - Shared buffer cache for data pages; roughly 25% of total memory is a common starting point.
- `effective_cache_size=<num>` - The planner's estimate of memory available for caching (shared buffers plus the OS cache), used when costing index scans; roughly 50% of total memory.
- `work_mem=<num>` - Memory per sort/hash operation. If set too high, it can cause swapping.
- `max_fsm_pages=<num>` - Pages tracked in the free space map, which lets new rows reuse freed space.
- `fsync=<boolean>` - If true, the WAL is flushed to disk before a transaction commits. It guarantees recoverability at the cost of write speed.
- `commit_delay=<num>` & `commit_siblings=<num>` - Improve performance by flushing several concurrently committing transactions together.
- `random_page_cost=<num>` - A higher value makes the planner prefer sequential scans over index scans; lower it if the server has very fast disks (e.g. SSDs).
- `wal_buffers` - A good start is around 32-64 (8 kB pages), i.e. roughly 256-512 kB.
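Parameters can also be changed without editing the file by using `ALTER SYSTEM` (PostgreSQL 9.4+); reloadable parameters take effect after `pg_reload_conf()`, others still need a restart. A sketch:
```sql=
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();   -- apply reloadable parameters
SHOW work_mem;
```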
## Restart Server
```bash=
/etc/init.d/postgresql restart
systemctl restart postgresql-<ver>.service
systemctl reload postgresql-<ver>.service
```
## Check Effect
```bash=
su - postgres -c psql
> SHOW config_name
> SHOW max_connections;
```
# Authentication
```bash=
# Change user from OS
su - user_name
# Change password of root of database (postgres)
su - postgres -c psql
> \password
su - postgres -c "psql -c '\password'"
```
# Create Role (Login User)
```bash=
# List roles
psql
\du
```
```bash=
su - postgres
createuser new_role_name options
# options
-c limit # connection limit for the role
-d # can create database (default is not)
-P # password
-l # can login
-L # cannot login
psql
CREATE ROLE new_role_name WITH LOGIN CREATEDB PASSWORD 'password';
```
```sql=
-- Change password
ALTER ROLE new_role_name WITH PASSWORD 'new_password';
-- Grant permission to create databases
ALTER USER new_role_name WITH CREATEDB;
```
# Back-up Database
```bash=
# Back-up data
# pg_dump dumps a single database; use pg_dumpall to dump all databases
pg_dump -U role_name [db_name] -Fc > backup.dump
# Back-up to statements instead of data
pg_dump -U role_name db_name -s > schema.sql
# Restore
PGPASSWORD=<password> pg_restore -Fc --no-acl --no-owner -U role_name -d db_name backup.dump
```
# Show clusters
```bash=
pg_lsclusters
```
# High Availability (Replication)
Before using `CREATE TABLESPACE`, the path used in the tablespace has to be created in both the master and slaves.
## Architecture
- read/write server (master)
- read server (slave)
## Mechanism
- Synchronous commit: a commit succeeds only after all synchronous standby servers have confirmed it.
- Asynchronous commit: a commit succeeds on the master first and is then propagated to the slaves, so a `SELECT` on a slave may run before the change arrives (see the parameter sketch below).
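The behaviour is controlled by `synchronous_commit` and `synchronous_standby_names`; a hedged sketch of inspecting and switching the mode per session:
```sql=
SHOW synchronous_standby_names;        -- which standbys must confirm a commit
SET synchronous_commit = remote_apply; -- wait until standbys have applied the commit
SET synchronous_commit = off;          -- return before WAL is flushed locally (fastest, least safe)
```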
# Upgrade Database Version
```bash=
# Drop the (empty) cluster created by the new package
pg_dropcluster new_ver cluster_name --stop
# Upgrade the old cluster to the new version
pg_upgradecluster old_ver cluster_name
# Remove the old cluster once the upgrade is verified
pg_dropcluster old_ver cluster_name
```
# Hardware Influences
## CPU
Each database connection is served by a separate process, not a thread.
## Memory
Too little memory causes *swapping*, which makes memory access slow. Memory is used for caching tables and for sorting in operations such as `ORDER BY`, `CREATE INDEX`, or `JOIN`. The *shared buffer* can be set with `shared_buffers` and the *sort memory* with `sort_mem`, both in `postgresql.conf`.
## Disk
Use a journaling file system; PostgreSQL reads and writes in 8 kB blocks.
A checkpoint forces all dirty data to disk when the write-ahead log files fill up, or every few minutes. If the write load is high, checkpoints happen frequently; to reduce the effect, increase the number of write-ahead log files via `wal_files`, or increase `checkpoint_segments` (default 3) in `postgresql.conf`.
A `TABLESPACE` can place an object such as a database or table on a specific disk drive. Write-ahead log files can be moved to another drive with `initdb -X` or symbolic links. Note that separate partitions on the same physical disk do not help. For six or more drives, consider RAID 1+0, RAID 0+1, or RAID 5.
# System Overview
- Postmaster - For connection and authentication
- Postgres - Query process
- Shared memory - High level cache by database
- Kernel buffer - Low level cache by operating system
- Disk
# Data Limitations
- Database - unlimited
- Table - 32TB
- Row - 400GB
- Column - 1GB
- Number of rows - unlimited
- Number of cols - 250-1600
- Number of indexes - unlimited
# `ERROR: Memory exhausted in AllocSetAlloc()`
The backend has probably run out of virtual memory (address space) on your system. Run one of the commands below, depending on which shell you use.
```
ulimit -d 262144      # sh / bash
limit datasize 256m   # csh / tcsh
```
# Non-overwriting Update
Updated tuples are appended to the table; the older versions are removed later by periodically running the `VACUUM` command.
# Recycle Table Leaks
`UPDATE` and `DELETE` do not change the data in place. Instead, a new version of the row is appended to the end of the table and the old one is marked as dead.
The `VACUUM` command records the space used by old row versions in the free space map, so on the next `INSERT` or `UPDATE` the backend checks the free space map first and only appends to the end of the table if no free area is available.
```sql=
VACUUM [VERBOSE] [table_name]
```
# Multiversion Concurrency Control (MVCC)
While maintaining strict data integrity, MVCC lets two or more sessions access the same data at the same time efficiently.
Each SQL statement sees a snapshot of data instead of locking the data. As a result, the query(reading) doesn't block the writing, and the writing doesn't block the reading as well.
Table- and row-level locks are still available for applications that don't need full transaction isolation but prefer to lock particular points of conflict explicitly.
## `TRUNCATE` and `ALTER TABLE`
`TRUNCATE` and (some forms of) `ALTER TABLE` are not MVCC-safe; they are only protected by table-level locking, so the DDL is blocked until transactions holding a lock on the table (even `ACCESS SHARE`) have completed.
# Four Isolation Level
Four phenomena
- *dirty read* - A transaction reads data written by an uncommitted transaction
- *nonrepeatable read* - A transaction re-reads data and finds it has been modified by another transaction that committed in the meantime
- *phantom read* - A transaction re-executes a query and finds the set of matching rows has been changed by another recently committed transaction
- *serialization anomaly* - The result of the committed transactions is inconsistent with every possible serial ordering of those transactions
Isolation level
||dirty read|nonrepeatable read|phantom read|serialization anomaly|
|:---:|:---:|:---:|:---:|:---:|
|Read uncommitted|X(A)|O|O|O|
|Read committed (**default**)|X|O|O|O|
|Repeatable read|X|X|X(A)|O|
|Serializable|X|X|X|X|
> X = not possible, O = possible; (A) = allowed by the SQL standard but does not occur in PostgreSQL. Because of MVCC, *Read uncommitted* behaves the same as *Read committed*.
>
> [nonrepeatable v.s. phantom](https://stackoverflow.com/questions/11043712/what-is-the-difference-between-non-repeatable-read-and-phantom-read)
## Read Committed Isolation Level
Two `SELECT`s in the same transaction can see different data, so the snapshot at this level is not consistent across other transactions' commits.
When an `UPDATE`/`DELETE` finds a row that a concurrent transaction has already modified, it waits for that transaction:
- If the first transaction rolls back, the second proceeds with the original row.
- If the first transaction committed and deleted the row, the second skips that row.
- If the first transaction committed an update, the second re-evaluates the `WHERE` clause; if the new version of the row still matches, it operates on that new version.
## Repeatable Read Isolation Level
When transaction A begins, a snapshot is taken and used for its subsequent reads and writes. Suppose another transaction B, which started earlier, updates data that A is also using: A waits until B commits or rolls back.
If B commits, A fails with
```
ERROR: could not serialize access due to concurrent update
```
If B rolls back, A can commit directly, since there is no longer a conflict on the same data.
## Serializable Isolation Level
The result of the Serializable isolation level is equivalent to running the transactions serially, one after another.
This level checks the dependencies among all concurrent transactions. If the transactions could produce different results depending on their commit order, you'll get an error
```
ERROR: could not serialize access due to read/write dependencies among transactions
```
**The application should handle this error with a retry mechanism.**
## SQL
```sql=
BEGIN TRANSACTION ISOLATION LEVEL {SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED}
ROLLBACK;
END;
```
[ref](https://www.postgresql.org/docs/current/sql-set-transaction.html)
# Lock
The currently held locks are shown in the `pg_locks` system view.
## Table Level
Locks are released at the end of the transaction, whether it commits or rolls back.

## Row Level
Locks are released at the end of the transaction, whether it commits or rolls back.

# File Formats
`PGDATA` is an environment variable pointing to the data directory, which stores the cluster's metadata and tables. A common location is `/var/lib/pgsql/data`.
```
-- Print the data directory
SHOW data_directory;
```
To print files storing the table
```
SELECT pg_relation_filepath('<table>')
-- example output: base/<database oid>/<table filenode>
```
For a large table (> 1GB), the file is split into 1GB segments with numeric suffixes from 1 to N.
```
1234.1
1234.2
...
1234.N
```
# Layout of File for Table
Every table and index is stored as an array of *pages*, split across multiple files once it exceeds 1GB. The default page size is `8kB`.
The database reads and writes data one page at a time.
## Page
| | |
|:-------------:|:---------------------------------:|
| Page header | includes free space pointers |
| ItemId | (offset, length) in 4 bytes |
| Free space | unallocated area |
| Item | data bytes |
| Special space | Index access method specific data |
In a table, an *item* is a *row*; in an index, an *item* is an *index entry*.
If a field in a row has a variable size, its actual length is written in the field header so the reader knows how many bytes to read. If a field is too large to fit in the row, the solution is *TOAST*, which splits the large value into many small chunks and stores them in a separate table called the TOAST table.
### The Oversized Attribute Storage Technique(TOAST)
A TOAST table uses pages holding four rows each, i.e. roughly 2 kB per row (8 kB / 4). A value larger than that is split into chunks until the last chunk is under 2 kB, and the chunks are stored in the TOAST table. The TOAST table has three columns: `chunk_id`, `chunk_seq`, and `chunk_data`. `chunk_id` identifies the TOASTed value (for example, a large article); `chunk_seq` gives the ordering of the chunks (part 1, part 2, ..., part N).
To print the TOAST table for the specific table
```sql=
SELECT reltoastrelid::regclass
FROM pg_class
WHERE relname = '<table>';
-- pg_toast.pg_toast_<id>
```
TOAST table
```sql=
SELECT * FROM pg_toast.pg_toast_<id>;
-- | chunk_id | chunk_seq | chunk_data |
```
### Page header
| Field | Description |
| ------------------- |:-------------------------------------------------------------------------:|
| pd_lsn | LSN: next byte after last byte of WAL record for last change to this page |
| pd_checksum | Page checksum |
| pd_flags | Flag bits |
| pd_lower | Offset to start of free space(for ItemId) |
| pd_upper | Offset to end of free space(for Item) |
| pd_special | Offset to start of special space |
| pd_pagesize_version | Page size and layout version number information |
| pd_prune_xid | |
### Item
`item header | null bitmap | object ID | user data`
# Ref
- [Hardware tips](https://momjian.us/main/writings/pgsql/hw_performance/index.html)
- [Online exercise](https://pgexercises.com/)
- [Common operations](https://zaiste.net/posts/postgresql-primer-for-busy-people/#user)
###### tags: `Database`