
PostgreSQL Training: Part 1 // 2021

tags: database,postgresql

See part 2.

Prep

The items in this section are only available to GitLab team members.

General

Let's Get Set Up!

How about a container with Postgres all ready to go?

docker run --name postgres_training \
-e POSTGRES_PASSWORD=mysecretpassword \
-d postgres:13

Use the credentials in pass:

docker run --name postgres_training -e POSTGRES_PASSWORD=$(pass postgres/container/user) -td postgres:13

Cool!

# docker exec -it postgres_training psql -U postgres
psql (11.13 (Debian 11.13-1.pgdg90+1))
Type "help" for help.

postgres=# \c
You are now connected to database "postgres" as user "postgres".

Installing lsb-core on its own requires an interactive setup step to specify timezone info. Our friend DEBIAN_FRONTEND comes in handy; try this instead:

docker run --name ubuntutty  -e DEBIAN_FRONTEND=noninteractive    -td ubuntu:20.04
docker exec -it ubuntutty /bin/bash
apt-get -y update
apt-get -y install gnupg less lsb-core lsb-release sudo vim wget

I recommend doing the above before class starts to avoid being annoyed.

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -

Day 1: Indexing / Transactions and Locking

Sharding will be out of scope for the training. We will talk a bit about partitioning.

Docs, the greatest resource.

The PostgreSQL Wiki > Good for things in the PostgreSQL ecosystem, broader scope than the docs.

About the Project

Free, open source software.
License: PostgreSQL is released under the PostgreSQL License, a liberal Open Source license, similar to the BSD or MIT licenses. There are a great many forks of PostgreSQL.

There are somewhere between 40-500 contributors to PostgreSQL, depending on how you define a contributor.

Versioning

See the versioning policy.

Major versions are released yearly, usually in the fall. We expect 14 in fall 2021. Releases are supported for 5 years.

Supported means you'll get new minor releases (always bug fixes, never new behavior).

For minor releases, the community considers not upgrading to be riskier than upgrading.

The goal is for people to be able to upgrade to a new minor release without needing to test their code. They focus on not introducing bugs in minor releases to facilitate this.

  • Q: How risky is it to upgrade while skipping a major version?
  • A: Never a problem to skip major versions when upgrading. Upgrade with either pg_dump and pg_restore or an in-place upgrade (pg_upgrade). Both are fine options. The former is super simple ("unparalleled simplicity") and the latter is quicker. In-place upgrades are a bit stricter (you can't make some modifications). There is no requirement to follow a certain upgrade path (as there is with GitLab). This could change one day; crystal ball required. In-place upgrades may leave you without the benefit of some index improvements that require changes to the physical, on-disk representation.

Let's do something!

Install PostgreSQL

There are a few ways to do it (compile from source, repos from distro, straight from PG, Docker, etc). The docs have more information on installing it on Ubuntu. Installing it via apt creates a PostgreSQL cluster. Nice but what if I want to control how it's created?

dpkg -l | grep postgres

Install debugging symbols to make analysis with perf less annoying:

apt install -y postgresql-13 postgresql-13-dbgsym

Let's become the user postgres to proceed from here. (su - postgres)

Let's take a look:

# pwd && ls
/usr/lib/postgresql/13
bin  lib
# bin/postgres  --version
postgres (PostgreSQL) 13.4 (Ubuntu 13.4-1.pgdg20.04+1)

There are more goodies in /usr/share/postgresql/13/ on Ubuntu.

Omnibus: /opt/gitlab/embedded/postgresql/[version]/lib

11:52 Break!
12:03 Back!

Postgres Cluster and Creating a Database

Run initdb to create a database cluster.

Try /usr/lib/postgresql/13/bin/initdb -? | less

Consider adding /usr/lib/postgresql/13/bin to your $PATH.

initdb

It creates a DATADIR. There is no default location but the standard is /var/lib/postgresql/13/main. That's familiar! The install put stuff there but we are nuking it. Only a DATADIR must be passed to initdb.

Specific options to consider:

-E, --encoding=ENCODING set default encoding for new databases
--locale=LOCALE set default locale for new databases
--lc-messages=LOCALE controls the language for error messages, use **English** for better search results
--lc-monetary=LOCALE set default locale for monetary amount formatting

PostgreSQL encoding options: consider UTF-8. The default is taken from the environment (shell running initdb or whatever), don't leave the default encoding mechanism to fate. ;-)

An option for calling initdb:

initdb -E UTF-8

Set --locale and leave things like lc-collate or lc-monetary alone unless you want to override them. A few of these options can not be changed later, like --lc-collate and --lc-ctype. This controls how databases are created in your cluster moving forward. Once the database has been created the character type and collation can not be changed.

--lc-ctype controls character type, determines what class a character belongs to (punctuation, number, letter, space, etc).
--lc-collate determines the sort order of strings.

NOTE: PostgreSQL uses the collations provided by the C library. Upgrading the C library could therefore change collation order in PostgreSQL. Not great: a changed collation can corrupt indexes. Consider rebuilding all indexes after such an upgrade.

Try locale -a to see what collation options are available.

# locale -a
C
C.UTF-8
POSIX

Using C collations means that upgrading C libraries is not a concern: the code points won't change.

-k, --data-checksums use data page checksums

In databases, everything is stored in a block. Blocks are the same size. PostgreSQL block size: 8K. With -k, checksums are written/read to disk to alert of possible storage problems. (Something changed between when the block was written and when it was later read.)
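Putting the options from this section together, a possible initdb call might look like this (the data directory matches the Ubuntu layout used below; adjust as needed):

/usr/lib/postgresql/13/bin/initdb -D /var/lib/postgresql/13/main -E UTF-8 --locale=C.UTF-8 -k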

PostgreSQL Cluster

What? Why?

The term cluster is somewhat unfortunate. A cluster contains several databases. A better word would be instance.

A cluster contains:

  • multiple databases
  • info outside of a single database like
    • users
    • tablespaces (determine physical placement of the data, irrelevant when using Docker)

We have the same set of users for all databases in a cluster. The permissions can be different for one user across different databases.

Default Port: 5432

When connecting to PostgreSQL, you always connect to one particular database. Once you are connected to that database, that's the one. You can't query a different database from that connection.

Let's create a cluster!

Forget about initdb and use Ubuntu's thing.

  1. Remove cluster created during install with /usr/bin/pg_dropcluster 13 main
  2. Create cluster with pg_createcluster which accepts initdb options:
pg_createcluster  13 main -- -E UTF-8   --locale=C.UTF-8     -k

Docker: pg_ctlcluster 13 main start
Ubuntu: systemctl start postgresql && systemctl enable postgresql

Now that it's up and running:

  • Config files: /etc/postgresql/13/main/
  • Data directory: /var/lib/postgresql/13/main
  • Processes:
$ ps  x  -u postgres
  PID TTY      STAT   TIME COMMAND
 6214 ?        Ss     0:00 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file
 6216 ?        Ss     0:00 postgres: 13/main: checkpointer
 6217 ?        Ss     0:00 postgres: 13/main: background writer
 6218 ?        Ss     0:00 postgres: 13/main: walwriter
 6219 ?        Ss     0:00 postgres: 13/main: autovacuum launcher
 6220 ?        Ss     0:00 postgres: 13/main: stats collector
 6221 ?        Ss     0:00 postgres: 13/main: logical replication launcher

Postgres uses a multi-process architecture and does not use threading. It's designed to be portable and run on a range of OSs which will implement multithreading differently. That leaves us with a dedicated process on the PostgreSQL server for each database connection.

Connecting to a database in a cluster

There are three methods we will cover:

  • TCP socket
  • Unix socket
  • connection strings

The only client shipped is psql, our friend. See the list of PostgreSQL clients on the wiki.

It's like vi. You might not like it but you might need to use it one day so try to get along. :-)

There are four things we must specify with each connection:

  • host
  • port
  • database name
  • database user
psql -h db.brie.dev -p 5432 -U postgres -d nameofdatabase 

You'll be prompted for a password with the above so we'll have to try another way. Thankfully, PostgreSQL listens on TCP and Unix sockets. Yay!


psql -h /var/run/postgresql -p 5432 -U postgres -d postgres

The psql client is smart enough to know what to do with the directory with the command above. Now, we have a prompt:

$ psql -h /var/run/postgresql -p 5432 -U postgres -d postgres
psql (13.4 (Ubuntu 13.4-1.pgdg20.04+1))
Type "help" for help.

postgres=#

How do I bail? (It's not :wq!.) Bail with \q or Ctrl + d.

Commands starting with \ are client commands; all other commands are server commands.

Defaults with psql:

  • -h > local Unix socket
  • -p > 5432
  • -U > user calling psql
  • -d > same as database username

So, we can just say psql.


We can use connection strings:

psql 'host=/var/run/postgresql port=5432 user=postgres dbname=postgres'

Support for this comes from the shared libraries psql links against (specifically libpq.so.5).

HELP: \?

export PAGER='less -RS'

Consider less -RS as the pager.

Get a list of databases with \l.

Maybe you don't need an interactive session:

psql -d template1 -l

The above just says connect to the template1 database and list all databases.

Creating a database

You do have to be connected to a database (like postgres) in order to create a database.

Get help for SQL statements with \h! That's kind of awesome


\h CREATE DATABASE
postgres=# \h CREATE DATABASE
Command:     CREATE DATABASE
Description: create a new database
Syntax:
CREATE DATABASE name
    [ [ WITH ] [ OWNER [=] user_name ]
           [ TEMPLATE [=] template ]
           [ ENCODING [=] encoding ]
           [ LOCALE [=] locale ]
           [ LC_COLLATE [=] lc_collate ]
           [ LC_CTYPE [=] lc_ctype ]
           [ TABLESPACE [=] tablespace_name ]
           [ ALLOW_CONNECTIONS [=] allowconn ]
           [ CONNECTION LIMIT [=] connlimit ]
           [ IS_TEMPLATE [=] istemplate ] ]

URL: https://www.postgresql.org/docs/13/sql-createdatabase.html

Creating a database in PostgreSQL is essentially copying an existing database (and then doing a little bit of magic). You can use any database as a template. (Permits cloning or duplicating a database.)

You'll get the default locales and collation information when you create a database unless you specify them differently when creating the database.

A database can only be used as a template if no one else is connected to it. (To assist with this, you can toggle ALLOW_CONNECTIONS to prevent anyone from connecting to a database.)

You get tab completion inside the psql prompt. Double tab for all options. Terminate statements with ;. The prompt will change to indicate that the server is still waiting on you. Observe:

postgres=# CREATE DATABASE cutecat
postgres-#
postgres-# ;
CREATE DATABASE
postgres=#

We need a new connection in order to connect to our new database. We can do this from within the same psql session!


Use \c cutecat for this:

postgres=# \c cutecat
You are now connected to database "cutecat" as user "postgres".
cutecat=#

Change the default database info with environment variables like PGHOST, PGPORT, PGDATABASE and PGUSER.
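For example, a rough sketch of a shell setup so that a bare psql connects to cutecat over the local socket (values are just the ones used in this walkthrough):

export PGHOST=/var/run/postgresql
export PGPORT=5432
export PGUSER=postgres
export PGDATABASE=cutecat
psql   # no flags needed; connection details come from the environment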

13:02: Break!
13:19: Back!

Case in Commands

We wrote CREATE DATABASE course; to create a database called course. The SQL standard demands that identifiers are case folded to upper case, but PostgreSQL case folds to lower case. We could do cReAtE dAtAbAsE course; and the same thing would happen. Prevent case folding with double quotes. Do CREATE DATABASE "COURSE"; to really create a database called COURSE; otherwise CREATE DATABASE COURSE; would create a database called course.

Exercise caution with double-quoted identifiers. (You could use them to create a database called USER, but maybe don't do that, because user is a reserved keyword and DROP DATABASE USER; won't do what you would expect. It'll nuke user, not USER.)
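A quick illustration of the quoting behavior described above:

CREATE DATABASE "COURSE";   -- really named COURSE, case preserved
DROP DATABASE "COURSE";     -- quotes needed here too, otherwise you'd be targeting course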

See Capitalization/case-folding from Stanford's NLP group or what the W3 has to say about case mapping and case folding.

Template Databases

cutecat=# \c template0
FATAL:  database "template0" is not currently accepting connections
Previous connection kept
cutecat=# \c template1
You are now connected to database "template1" as user "postgres".
  • Q: Why can't I connect to the template0 database?
  • A: It is not allowing connections. Databases with open connections can not be used as templates.
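The table below presumably came from a query along these lines (my reconstruction, not shown in the original notes):

SELECT datname, datallowconn FROM pg_database;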
  datname  | datallowconn
-----------+--------------
 postgres  | t
 cutecat   | t
 template1 | t
 template0 | f
 course    | t

We see that template0 is not allowing connections. The docs have more info about the template databases.

Having multiple databases in a single cluster comes with some pros and cons:

  • Databases in a single cluster must be stopped and started together. Yikes! Don't put databases belonging to different services in the same cluster.
  • Databases in a single cluster (obviously) must be using the same version of PostgreSQL.
  • There are some cluster-wide configuration options that must be applied to all databases in a cluster.
  • If there's any doubt, use a different cluster. There's additional overhead but grants flexibility.

Indexing

We need a bunch of rows to make the database sweat.


CREATE TABLE test (id integer GENERATED ALWAYS AS IDENTITY, name text);

Describe a table with \d table_name.

cutecat=# \d test
                         Table "public.test"
 Column |  Type   | Collation | Nullable |           Default
--------+---------+-----------+----------+------------------------------
 id     | integer |           | not null | generated always as identity
 name   | text    |           |          |

That query would be akin to the following in MySQL:

CREATE TABLE test (
    id INT NOT NULL AUTO_INCREMENT,
    name CHAR(30) NOT NULL,
    PRIMARY KEY (id)
);

GENERATED ALWAYS AS IDENTITY has been part of PostgreSQL since version 10. Text columns are limited to a size of 1 GB in PostgreSQL. (What are you doing?!?) Use text if you don't have a limit that needs to be enforced. (Don't add a limit on the database side when there is no such limit in the application.)

Let's add a few rows to our table:

cutecat=# INSERT INTO test (name) VALUES ('plop'), ('boo radley');
INSERT 0 2

Duplicate the rows in the table:

cutecat=# INSERT INTO test (name) SELECT name FROM test;
INSERT 0 3
cutecat=# SELECT * FROM test;
 id |    name
----+------------
  1 | plop
  2 | boo radley
  3 | fluffy
  4 | plop
  5 | boo radley
  6 | fluffy
(6 rows)

Run the query above until you are happy with how large and slow your database will be.
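If the doubling feels slow, a hedged alternative (not from the class) is to generate filler rows directly with generate_series:

INSERT INTO test (name) SELECT 'cat_' || i FROM generate_series(1, 1000000) AS i;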


Measure time from client:

cutecat=# \timing
Timing is on.
cutecat=#

OK, now we have a super slow query:

cutecat=# SELECT * FROM test WHERE id = 42;
 id | name
----+-------
 42 | tiger
(1 row)

Time: 679.481 ms

What happens with a query you provide?

  • It's parsed. The query rewriter updates the * with the names of the columns.
  • Query optimizer: how can I calculate the result in the best way out of all the possible ways? This plans the query.
  • Query executor: runs the chosen plan!

Use EXPLAIN to get the execution plan for your query.

# EXPLAIN SELECT * FROM test WHERE id = 42;
                               QUERY PLAN
------------------------------------------------------------------------
 Gather  (cost=1000.00..82407.93 rows=1 width=11)
   Workers Planned: 2
   ->  Parallel Seq Scan on test  (cost=0.00..81407.83 rows=1 width=11)
         Filter: (id = 42)
(4 rows)

Time: 0.734 ms

Let's turn parallelism off to make the plan simpler: SET max_parallel_workers_per_gather = 0;. Here's the updated EXPLAIN result with parallel workers disabled:

# EXPLAIN SELECT * FROM test WHERE id = 42;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..137334.80 rows=1 width=11)
   Filter: (id = 42)
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)

Time: 34.111 ms

The value of the cost is not in any unit, it's primarily useful for comparison. You get two numbers. From above:

  • 0.00 start up cost, cost to deliver first row
  • 137334.80 total cost, cost for retrieving all results
  • rows=1 a guess at how many rows will be returned.

We don't care much about the width value.

Analogy: A table is a library. A book is a table row. If you go to the library with no information about how the books are organized, you'll have to do a sequential scan, looking at each book, spine-by-spine. Thankfully, that's not reality! Today, you consult a card catalog. This is akin to a database index.

Tables are also referred to as heaps in PostgreSQL. Tables have no particular order (not ordered by id). An index is ordered. Each entry in an index points to the corresponding row in the table. (OK, think about it that way. It's actually implemented in a tree structure.)

Create an Index

# CREATE INDEX test_id_idx ON test (id);
CREATE INDEX
Time: 7084.656 ms (00:07.085)

Let's look at that SELECT again:

cutecat=# EXPLAIN SELECT * FROM test WHERE id = 42;
                               QUERY PLAN
-------------------------------------------------------------------------
 Index Scan using test_id_idx on test  (cost=0.43..8.45 rows=1 width=11)
   Index Cond: (id = 42)
(2 rows)

Time: 1.027 ms

The start up cost is higher but wow is the total cost lower. We did it.


  • Q: What happens when the column ID reaches the maximum number allowed by the type integer?
  • A: You have a big problem. You can change the datatype to bigint, bringing you from a four-byte integer to an eight-byte one. Doing this would rewrite the entire table (lots of downtime, resource usage). There are tricks you can do to do this but it's a pain. Don't ever use integer. See the Convert ci_builds.id to bigint - Step 1: Add new columns and sync data issue for the pain GitLab is experiencing with this very topic.

No matter what: stay safe, use bigint.
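For a brand-new table, that simply means declaring the identity column as bigint from the start; a minimal sketch (hypothetical table name):

CREATE TABLE test_big (id bigint GENERATED ALWAYS AS IDENTITY, name text);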

Laurenz Albe

If you didn't stay safe and you have this problem, here is (from memory) the guidance on converting from int to bigint (a rough SQL sketch follows the list):

  • Add another nullable bigint column
  • Create a TRIGGER that copies the existing primary key into the new column
  • Run a series of updates that populate existing rows in batches
  • add a non-validated check constraint that ensures NOT NULL
  • Validate the constraint
  • CREATE UNIQUE INDEX CONCURRENTLY
  • Convert the index into a constraint
  • Drop the old column
  • Accounting for foreign keys is an exercise left to the reader
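A rough SQL sketch of those steps, assuming a hypothetical events table with an integer primary key id (names are purely illustrative; the real-world migrations are more involved):

ALTER TABLE events ADD COLUMN id_new bigint;

-- keep new and old columns in sync for incoming writes
CREATE FUNCTION events_sync_id() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    NEW.id_new := NEW.id;
    RETURN NEW;
END;
$$;

CREATE TRIGGER events_sync_id_trg
    BEFORE INSERT OR UPDATE ON events
    FOR EACH ROW EXECUTE FUNCTION events_sync_id();

-- backfill existing rows in batches (repeat with further ranges)
UPDATE events SET id_new = id WHERE id BETWEEN 1 AND 100000 AND id_new IS NULL;

-- enforce NOT NULL cheaply: unvalidated check constraint first, then validate
ALTER TABLE events ADD CONSTRAINT events_id_new_not_null CHECK (id_new IS NOT NULL) NOT VALID;
ALTER TABLE events VALIDATE CONSTRAINT events_id_new_not_null;

-- build the unique index without blocking writes
CREATE UNIQUE INDEX CONCURRENTLY events_id_new_key ON events (id_new);
-- remaining steps: convert the index into the primary key constraint,
-- drop the old column, and fix up any foreign keys that referenced it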

Please just use bigint.

14:00 Break!
14:09 Back!

Using an index, looking at query plans

Using EXPLAIN to understand what PostgreSQL does when we give it queries.

SELECT reltuples, relpages FROM pg_class WHERE relname = 'test';

Tell me how many rows (reltuples) and 8K pages (relpages). These are estimates (they'll be in the right order of magnitude, "close to the truth".)

seq_page_cost and cpu_tuple_cost are two of the parameters the planner uses to compute these costs:

  • seq_page_cost - Sets the planner's estimate of the cost of a disk page fetch that is part of a series of sequential fetches. The default is 1.0. This value can be overridden for tables and indexes in a particular tablespace by setting the tablespace parameter of the same name (see ALTER TABLESPACE).

  • cpu_tuple_cost cost to process one row. The default is 0.01.
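For a plain sequential scan, the total cost is roughly relpages * seq_page_cost + reltuples * cpu_tuple_cost, plus a cpu_operator_cost per row for the WHERE filter. A rough sanity check against the plan above (it will land slightly below EXPLAIN's number because the filter cost is ignored):

SELECT relpages * current_setting('seq_page_cost')::float8
     + reltuples * current_setting('cpu_tuple_cost')::float8 AS rough_seq_scan_cost
  FROM pg_class
 WHERE relname = 'test';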

Picking the plan with the lowest cost is easy; estimating those costs accurately is not so easy.

Let's look at a more interesting query:

EXPLAIN SELECT * FROM test WHERE id < 42;

When scanning more than one row with an index, PostgreSQL may need to read more than one index page:

  • Q: What if the index consists of more than one 8K page?
  • A: More I/O. Size of index does not impact performance as much as number of reads from index.

The real cost in an index scan is not in reading the index. The expensive part is in reading the table row that the index points to. The index can be read from the beginning to the end or from the end to the beginning.

With a query like SELECT * FROM test WHERE id > 42;, the query optimizer could choose to do an index scan but we see it did a sequential scan instead:

cutecat=# EXPLAIN SELECT * FROM test WHERE id > 42;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..137335.20 rows=7669977 width=11)
   Filter: (id > 42)
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)

Time: 1.158 ms

With a query like this, it's back to the index scan:

cutecat=# EXPLAIN SELECT * FROM test WHERE id > 4000000;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Index Scan using test_id_idx on test  (cost=0.43..121539.97 rows=3587802 width=11)
   Index Cond: (id > 4000000)
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)

Time: 1.070 ms

We can be reasonably sure that the optimizer is doing the fastest thing when choosing between index scan and a sequential scan.

The index is useful for doing things like EXPLAIN SELECT max(id) FROM test;:

 Result  (cost=0.46..0.47 rows=1 width=4)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..0.46 rows=1 width=4)
           ->  Index Only Scan Backward using test_id_idx on test  (cost=0.43..218357.71 rows=7670016 width=4)
                 Index Cond: (id IS NOT NULL)
(5 rows)

The index is useful also for sorting:

cutecat=# EXPLAIN SELECT * FROM test ORDER BY id;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Index Scan using test_id_idx on test  (cost=0.43..240645.67 rows=7670016 width=11)
(1 row)

Time: 0.598 ms

I can forbid PostgreSQL from using an index scan with SET enable_indexscan = off;. Use RESET enable_indexscan; to revert that change.

The index is clearly super helpful, but there are associated downsides to be aware of.

  • The size of the index on-disk can be larger than the size of the table. (This doesn't necessarily mean that you did something wrong.) This is redundant data. A small annoyance.
  • A big annoyance: each data modification updates the index. Inserting a row into the table is quick and cheap, inserting a row into the index can be expensive because the sorted order is maintained. A single table could have multiple indexes. Updating the table means updating each of those indices.

Inserting a row into the heap is cheap.

As a result: it is not always necessarily a good idea to have an index. You need to balance speed when querying vs speed when updating. (If you are querying monthly but inserting daily, maybe don't bother with the index.)

  • Q: How can I tell how much time is spent updating each index?
  • A: No straightforward approach. Observe and measure.

An index you don't use is a bad thing: it degrades performance and provides no benefit.
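One rough way to spot candidates (my addition, not from the class) is to look at how often each index has actually been scanned:

SELECT relname, indexrelname, idx_scan
  FROM pg_stat_user_indexes
 ORDER BY idx_scan;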

What if I create an index on the text column?

CREATE INDEX test_name_idx ON test (name);

Will the index even be used?

# EXPLAIN SELECT * FROM test WHERE name = 'plop';
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=24286.17..92968.34 rows=2177773 width=11)
   Recheck Cond: (name = 'plop'::text)
   ->  Bitmap Index Scan on test_name_idx  (cost=0.00..23741.73 rows=2177773 width=0)
         Index Cond: (name = 'plop'::text)
(4 rows)

Time: 1.197 ms

Yes, but it's also doing a new thing called a bitmap heap scan.

Each table block is only fetched once. The entire bitmap might not fit into memory; PostgreSQL will degrade to one bit per block. Less efficient, but there's only so much memory, so that's the tradeoff.

You can drop an index like you'd drop a database.
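For example:

DROP INDEX test_name_idx;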

For some purposes, having an index is super useful.

Partial Index

Let's use a WHERE to create a partial index.

CREATE INDEX test_name_idx ON test (name) WHERE name NOT IN ('plop');

The index will only be created for rows that satisfy the WHERE condition. A partial index is also referred to as a conditional index.

In addition to having a partial index, you can have a unique index. You can also have a unique partial index.

You might wish to create a partial index that only includes conditions you are frequently querying for. Like:

CREATE INDEX on test (name) WHERE account_status NOT IN ('active');

Completely and quickly made up to demonstrate: querying only for users that are not active.

It's really important to understand your particular application and dataset to know how to decide whether to make an index, whether a partial index would make sense and how to scope the partial index.

Great segue

Who creates indexes?

Ideally, the developer. Don't cheat by only testing against datasets that are too small to be relevant to production usage. Consider index creation when designing the query. Have realistic amounts of test data at some point between development and production.

Please!

  1. Analyze query performance
  2. Observe results
  3. Improve query (within reason)
  4. Build/design index

Day 2: Functions / JSON / Partitioning

Watch the day 2 recording until Laurenz' machine says Aug 31 17:48.

Indexing continued.

Having an index on name won't help you with a query that filters on WHERE upper(name) = ....

What do we do?
Create an index on a function!

CREATE INDEX ON test (upper(name));

For an index to be useful with a WHERE expression the query should look something like:

<indexed expression> <operator> <constant>

The operator must be an operator understood by the optimizer. Use any constant that is constant during the query. To demonstrate how serious this is, consider:

EXPLAIN SELECT * FROM test WHERE id + 0 = 42;

We have an index on id but the database isn't smart enough to know that anything plus 0 is that thing.

With a query on an indexed expression, we see that a sequential scan is no longer used. In our example, the estimates were way off and the optimizer thought it would find 20,972 rows when the real answer is 0.

PostgreSQL collects table statistics regularly:

  • how big
  • value distribution
  • most common values
  • histograms to approximate distribution

Recalculation of these statistics is triggered by data modification or when you tell it to with ANALYZE.
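To trigger the recalculation by hand and peek at what was collected (pg_stats is the view over the gathered statistics):

ANALYZE test;
SELECT attname, n_distinct, most_common_vals FROM pg_stats WHERE tablename = 'test';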

CREATE TABLE person (id bigint PRIMARY KEY, name text NOT NULL, bday date NOT NULL);

Store dates and times as dates and times. Don't store dates as numbers (something something number of seconds since the epoch).


SELECT * FROM person where age(bday) > INTERVAL '80 years';
postgres=# CREATE INDEX ON person (age(bday));
ERROR:  functions in index expression must be marked IMMUTABLE
Time: 0.980 ms

The "volatility" of a function.

An immutable function always returns the same result for the same arguments, regardless of things like time of day or phase of the moon (upper() qualifies). We got this error because age() is a function of the current time.
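A common workaround (my addition, not something shown in class): index the column itself and move the time-dependent arithmetic to the other side of the comparison, which is roughly equivalent for this query:

CREATE INDEX ON person (bday);
SELECT * FROM person WHERE bday < current_date - INTERVAL '80 years';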

Stan provides an example of GitLab being hit by function volatility:

We got bit by this immutable index problem recently because comparing timestamps without time zone against timestamps WITH time zone doesn't work, and some schemas had a mix of different types: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/68784

12:02 Break!
12:12 Back!

Timestamps in PostgreSQL

<digression>
\dTS
SELECT current_timestamp;
SELECT localtimestamp;

Timestamp without time zone: easy and straightforward. Often, you want timestamp with time zone.

CREATE TABLE ts (t timestamp with time zone);

If you store a timestamp with time zone, it's converted to UTC and stored that way. When converted to a string, the conversion happens according to parameters in the database session.

SHOW timezone;
CREATE TABLE ts (t timestamp with time zone);
INSERT INTO ts VALUES (current_timestamp);
postgres=# TABLE ts;
               t
-------------------------------
 2021-08-31 16:15:10.902239+00
(1 row)

Time: 0.372 ms
postgres=# INSERT INTO ts VALUES ('2021-08-31 00:00:00+05');
postgres=# TABLE ts;
postgres=# SHOW timezone;
postgres=# SET timezone = 'Europe/Vienna';
postgres=# TABLE ts;

  1. Make sure timezone is set properly.
  2. Use timestamp without time zone if you are doing time zone management in your application

</digression>

Indexes, continued

On btree indexes and LIKE

EXPLAIN SELECT * FROM test WHERE name LIKE 'fluff%';

The query above won't use the index. Why? The default collation does not compare strings character-by-character, so the btree order can't be used for prefix matching.

With an operator class, we can fix this. Specifically, we want the text pattern operator class. We just create the index a little differently:

CREATE INDEX on test (name text_pattern_ops);

Take a look at the Index Cond and then the Filter:

# EXPLAIN SELECT * FROM test WHERE name LIKE 'brie%';
                                QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using test_name_idx on test  (cost=0.43..8.45 rows=1 width=11)
   Index Cond: ((name ~>=~ 'brie'::text) AND (name ~<~ 'brif'::text))
   Filter: (name ~~ 'brie%'::text)
(3 rows)

The index we just created can also be used for equality comparisons (WHERE name = 'fluffy'). So, we don't need two indexes.

ORDER BY name is kind of the only use for a straight-up normal index.

Access methods: there are others:

SELECT amname, amhandler FROM pg_am;
 amname |      amhandler
--------+----------------------
 heap   | heap_tableam_handler
 btree  | bthandler
 hash   | hashhandler
 gist   | gisthandler
 gin    | ginhandler
 spgist | spghandler
 brin   | brinhandler

Indexing is one of the strong points of PostgreSQL.

  • hash - Hash indexes are hash tables persisted on disk. Can only be used for equality searches, which a btree could also handle. A hash index can't do anything that a btree can not. Possible speed advantages. Corner case: indexing very large values to save space.
  • spgist and brin very special case indices.
  • gist - perfect for weird data types. exotic data types that can't be used with a btree because they don't have an order. Consider: geographic data points or range types.
  • gin Most useful for composite data types. Arrays, full text search, JSON
  • btree - data types must have a way of being ordered/compared. If not, you can't use btree. DEFAULT.

Things we won't talk much about:

  • support functions
  • spgist and brin

Range Types

These are intervals: floating point ranges, timestamp ranges. Think hotel reservations.

CREATE TABLE reservations (id bigint PRIMARY KEY, name text NOT NULL, t_from timestamp NOT NULL, t_to timestamp NOT NULL);
  • Q: Is the room free from 9a to 10a?

There are a few different ways that an overlap could occur, and the query to check for all of them with two timestamp columns would be annoying. Let's use a timestamp range instead:

CREATE TABLE reservations (id bigint PRIMARY KEY, name text NOT NULL, res tsrange NOT NULL);

Look at this query:

INSERT INTO reservations VALUES(1, 'room 1', '[2021-08-31 09:00:0,2021-08-31 11:00:00)');

The [ says that 9a is in the range and the ) at the end says that 11a is not in the range and permits adjacent reservations.

postgres=# SELECT * FROM reservations WHERE res && '[2021-08-31 10:00:0,2021-08-31 11:00:00)';
 id |  name  |                      res
----+--------+-----------------------------------------------
  1 | room 1 | ["2021-08-31 09:00:00","2021-08-31 11:00:00")

These operators are supported by the gist index:

  • && overlaps operator
  • @> contains operator

The execution plan says a sequential scan will be happening. Lame. Let's make a gist index:

CREATE INDEX ON reservations USING gist (res);

It's still using a Sequential Scan so let's do SET enable_seqscan = off; to force it to use the gist index that we created.

a btree index is a special kind of gist index but don't worry about it.

An exclusion constraint can be implemented using a gist index:

ALTER TABLE reservations ADD EXCLUDE USING gist (res WITH &&);

The thing above will prevent an overlapping entry! Whoa. Here's how it looks:

postgres=# INSERT INTO reservations VALUES(2, 'room 1', '[2021-08-31 09:20:0,2021-08-31 10:45:00)');
ERROR:  conflicting key value violates exclusion constraint "reservations_res_excl"
DETAIL:  Key (res)=(["2021-08-31 09:20:00","2021-08-31 10:45:00")) conflicts with existing key (res)=(["2021-08-31 09:00:00","2021-08-31 11:00:00")).
postgres=# INSERT INTO reservations VALUES(2, 'room 1', '[2021-08-31 08:20:0,2021-08-31 08:59:59)');
INSERT 0 1

That's the coolest thing we've seen thus far, IMO.

Gin Indexes and JSON in a Database

There are two data types for JSON:

  • json > stored as a string, preserves all formatting. use if it's not being touched in the database, only stored and retrieved.
  • jsonb > to do in-database processing, prefer jsonb. Use unless you have a reason not to.

Use :: as the special operator for casting. The right way is:

SELECT CAST ('{ "a": 42, "b": [ "x": true, "y": [1, 2, 3] } }' AS json);

The shorter and more fun ways:

SELECT '{ "a": 42, "b": [ "x": true, "y": [1, 2, 3] } }':: json;
SELECT '{ "a": 42, "b": [ "x": true, "y": [1, 2, 3] } }':: jsonb;

Inserting JSON into the database
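The notes don't show the table definition; presumably something along these lines:

CREATE TABLE json (id bigint PRIMARY KEY, j jsonb);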

INSERT INTO json VALUES (1, '{ "a": 42, "b": { "x": true, "y": [1, 2, 3] } }');
INSERT INTO json VALUES (2, '{ "a": 1, "b": {  "y": [3, 4, 5] } }');

Querying JSON in the database!

postgres=# SELECT * FROM json WHERE j ->> 'a' = '42';
 id |                      j
----+---------------------------------------------
  1 | {"a": 42, "b": {"x": true, "y": [1, 2, 3]}}
(1 row)
postgres=# SELECT * FROM json where j @> '{ "a": 42 }';
 id |                      j
----+---------------------------------------------
  1 | {"a": 42, "b": {"x": true, "y": [1, 2, 3]}}
(1 row)
postgres=# CREATE INDEX ON json USING gin (j);
CREATE INDEX
postgres=# EXPLAIN SELECT * FROM json where j @> '{ "a": 42 }';
                                QUERY PLAN
--------------------------------------------------------------------------
 Bitmap Heap Scan on json  (cost=12.00..16.01 rows=1 width=40)
   Recheck Cond: (j @> '{"a": 42}'::jsonb)
   ->  Bitmap Index Scan on json_j_idx  (cost=0.00..12.00 rows=1 width=0)
         Index Cond: (j @> '{"a": 42}'::jsonb)
(4 rows)

So cool!

Words on JSON in the database:

  • Temptation to store everything in JSON is strong: why would I even create columns in a table? Or multiple tables? Just pop everything into JSON.

  • Queries can become very slow and complicated if you do this. Constraints can't be set on parts of the JSON. That's lame.

  • Use JSON in the database very sparingly. In 98% of all cases, it's better to avoid it. Think twice before proceeding.

  • READ http://patshaughnessy.net/2017/12/15/looking-inside-postgres-at-a-gist-index

13:03 Break!
13:17 Back!

Sometimes you want to perform a similarity search.

Instead of SELECT * FROM names WHERE name = 'lorenz';
Let's create an extension with CREATE EXTENSION pg_trgm;.
The pg_trgm extension gives us the % operator that lets us do:

SET enable_seqscan = on;
SELECT * FROM TEST WHERE name % 'lorenz';

The btree index does not support the % operator. OMG.

CREATE INDEX ON test USING gin (name gin_trgm_ops); 

A trigram index supports queries like LIKE '%rie%';

Use ILIKE instead of LIKE for case-insensitivity. Neat-o!

  • trgm - trigram. All that stuff from the cryptanalysis section of Cryptography is useful!

  • READ English Letter Frequencies on practicalcryptography.com, served via http.


# SELECT show_trgm('brie');
          show_trgm
-----------------------------
 {"  b"," br",bri,"ie ",rie}
(1 row)

How cool is that! My name is a little short to make playing with trigrams very fun. (Why trigrams and not quadgrams?)

While trigrams are amazing, they are not ideal for full text search.

We see it matching for two spaces at the beginning of the string but only one at the end because matches at the beginning of the string are emphasized.

End of indexes!


Transactions, Concurrency, Locking

Consider a transaction as a group of statements bundled together, like statements within a set of parentheses.

Let's have some ACID.


  • Atomicity > all or nothing: it all succeeds or it all fails.
  • Consistency > the database goes from one consistent state to another (constraints keep holding).
  • Isolation > Pretend you are the only person working on the database.
  • Durability > Once a transaction has been committed, it sticks. We won't discuss this much.

In PostgreSQL, statements can only be run in a transaction. By default they run in autocommit mode:
if you don't explicitly start a transaction, PostgreSQL puts the statement in one for you.

This gives us interesting advantages:

Either the whole statement runs or none of it runs. If you update 20 rows and get an error (say, a network outage) before the 21st, none of the updates stick.
If you want a transaction that spans multiple statements, you have to start one explicitly:

  • START TRANSACTION;
  • BEGIN;

Transactions can not be nested. You can only have one transaction at a time. To end the transaction:

  • COMMIT;
  • ROLLBACK;

The PostgreSQL prompt will show you when you are in a transaction by updating the prompt:

postgres=# BEGIN;
BEGIN
postgres=*#

In the spirit of all or nothing:

postgres=*# SELECT 1 / 0 ;
ERROR:  division by zero
postgres=!# SELECT 42;
ERROR:  current transaction is aborted, commands ignored until end of transaction block

Once it hits a problem, the transaction is done. You have to start over.

You can use SAVEPOINT to create a subtransaction so that you can ROLLBACK to the save point instead of the entire transaction. Use SAVEPOINT but don't overuse it. Performance and concurrency can suffer. Don't do something like one SAVEPOINT for each statement.

Take a look:

postgres=# BEGIN;
BEGIN
postgres=*# SELECT 42;
 ?column?
----------
       42
(1 row)

postgres=*# SAVEPOINT alpha;
SAVEPOINT
postgres=*# SELECT 1 / 0;
ERROR:  division by zero
postgres=!# SELECT 42;
ERROR:  current transaction is aborted, commands ignored until end of transaction block
postgres=!# ROLLBACK TO SAVEPOINT a;
ERROR:  savepoint "a" does not exist
postgres=!# ROLLBACK TO SAVEPOINT alpha;
ROLLBACK
postgres=*# SELECT 42;
 ?column?
----------
       42
(1 row)

postgres=*# COMMIT;
COMMIT

Readers never block writers and writers never block readers. In our money transfer example, the "old" data is shown. The UPDATE we are running puts an exclusive lock, preventing any further writes. Locks are held until the end of the transaction. Writers can block writers.
This is an example of Isolation at play. Any subsequent writes that can't proceed due to the lock can run once the locks are removed at the end of the transaction.

There are 4 isolation levels in the SQL standard:

  • READ UNCOMMITTED No way to see uncommitted data in PostgreSQL. If you request this, you'll actually get READ COMMITTED.
  • READ COMMITTED I never see dirty data from uncommitted transactions. Committed value is always shown. This is the default in PostgreSQL.
  • REPEATABLE READ Nothing changes. I see the same values, it's like having a snapshot of the data.
  • SERIALIZABLE Completely absolutely guarantees no anomalies. Rarely used. We won't talk about it.

Use repeatable read for complicated reports to make sure you get consistent results, no matter how long it takes the report to be generated.

13:54 Break!
14:06 Back!

We might have the impression that isolation always works: it doesn't, it's an illusion. There's a good reason for the different isolation levels. The "lower" the isolation level, the fewer guarantees.

  • anomaly weird things that happen when transactions run concurrently and that would never happen if they ran independently.

There's a specific kind of anomaly called lost update anomaly. Two options for preventing this:

  • pessimistic locking Stay at the lower isolation level and take care of preventing bad things from happening yourself. You do this with pessimistic locking. Place a row lock by appending FOR UPDATE or (even better) FOR NO KEY UPDATE. Better if a conflict is likely. Tradeoff: you introduce locks which are bad for concurrency.
  • optimistic locking Pretend everything is OK. If the value is modified between when you read and when you wrote, eh, you'll have to start over. This doesn't degrade concurrency but increases the likelihood of needing to repeat a transaction.

You can do either of these but the first is preferred as it will lock less.

SELECT amount FROM accounts WHERE id = 1 FOR NO KEY UPDATE;
SELECT amount FROM accounts WHERE id = 1 FOR UPDATE;

There are other ways:

SELECT amount FROM accounts WHERE id = 1 FOR NO KEY UPDATE NOWAIT; <-- Returns an error right away instead of waiting.
SELECT amount FROM accounts WHERE id = 1 FOR NO KEY UPDATE SKIP LOCKED; <-- Won't wait but will pretend that locked rows DO NOT EXIST.

You might not use SKIP LOCKED super often.

Setting the Isolation Level

You can set the isolation level with something like:

BEGIN ISOLATION LEVEL REPEATABLE READ;

The Infamous Dead Lock

There is a deadlock checker that looks for deadlock cycles. It will cancel one of the transactions. A deadlock is a kind of serialization error.

  • Q: What should I do if I hit a deadlock?
  • A: Repeat the transaction that failed.

Two deadlocks per day: NBD. Deal with it in the application


Five deadlocks per minute: worry!

Completely avoiding deadlocks: not possible. Work to avoid/reduce.

The bow and arrow analogy:

  • Brie grabs the bow.
  • Plop grabs the arrow.
  • Brie and Plop are each waiting for the other to shoot the arrow in order to proceed.

Hmm. What if we both followed the rule "grab the bow first"? No more deadlock. Translated to the database: consider a rule like always update the account with the lower ID first.

Debug Locks

Recall that locks are always held until the end of the transaction. You might be running into lots of locks because your transactions are too long.

There are two kinds of databases:

  • Those with short transactions
  • Those with trouble.

Keep transactions short.
Keep transactions small. Don't modify more in a transaction than is necessary for consistency.

Let's learn more:

postgres=# SELECT * FROM pg_stat_activity WHERE datname IS NOT NULL;
postgres=# SELECT * FROM pg_locks;

Look for locks where granted is f.

From that first query, we get process IDs!

 datid | datname  | pid  | leader_pid | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_sta
-------+----------+------+------------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+------------------
 13395 | postgres | 3734 |            |       10 | postgres | psql             |             |                 |          -1 | 2021-08-31 17:42:07.755783+00 | 2021-08-31 18:28:
 13395 | postgres | 3782 |            |       10 | postgres | psql             |             |                 |          -1 | 2021-08-31 17:48:58.460694+00 | 2021-08-31 18:28:
 13395 | postgres | 3932 |            |       10 | postgres | psql             |             |                 |          -1 | 2021-08-31 18:29:19.932291+00 |
 13395 | postgres | 3968 |            |       10 | postgres | psql             |             |                 |          -1 | 2021-08-31 18:31:18.042713+00 | 2021-08-31 18:32:
(4 rows)

Get more info with SELECT pg_blocking_pids(3968);.

The blocking transaction must be terminated. There are two ways to do this!

  • SELECT pg_cancel_backend(PID); This is when you see ERROR: canceling statement due to user request.
  • SELECT pg_terminate_backend(PID); That kills the whole session. psql will try (and probably succeed) to reconnect.

The notes above are to get you out of a bind. The right thing is to fix the application so it doesn't make you have to do this.

Transactions

You can prevent long-running transactions but be super careful. Observe:

postgres=# SET idle_in_transaction_session_timeout = '2s';
SET
postgres=# BEGIN;
BEGIN
postgres=*# SELECT 42;
FATAL:  terminating connection due to idle-in-transaction timeout
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

Consider a value like 1 minute but IDK, up to you!

Use statement_timeout to prevent runaway queries.
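For example (the value is just an illustration):

SET statement_timeout = '30s';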

MultiVersioning

postgres=# SELECT ctid, xmin,xmax, * FROM accounts;
  ctid  | xmin | xmax | id | name |  amount
--------+------+------+----+------+----------
 (0,11) |  561 |    0 |  1 | Plop |  9800.00
 (0,12) |  562 |  563 |  2 | Brie | 20200.00
(2 rows)

The ctid is the current tuple id, the physical storage location in the table for that row:

Plop is the eleventh entry in the first block.

An UPDATE doesn't really update, it makes a copy. Observe the change in ctid:

postgres=*# SELECT ctid, xmin, xmax, * FROM accounts;
  ctid  | xmin | xmax | id | name |  amount
--------+------+------+----+------+----------
 (0,11) |  561 |    0 |  1 | Plop |  9800.00
 (0,14) |  566 |    0 |  2 | Brie | 20300.00
(2 rows)

postgres=*# UPDATE accounts SET amount = amount + 100 WHERE id = 2;
UPDATE 1
postgres=*# SELECT ctid, xmin, xmax, * FROM accounts;
  ctid  | xmin | xmax | id | name |  amount
--------+------+------+----+------+----------
 (0,11) |  561 |    0 |  1 | Plop |  9800.00
 (0,15) |  566 |    0 |  2 | Brie | 20400.00
(2 rows)

I missed it a bit, but xmin and xmax give us info about which transaction introduced a row version and which invalidated it. There will be multiple versions of a row in a table. Check my screenshot from 14:46. Deleting a row doesn't remove it, it just sets the xmax. These old versions of the row will eventually need to be deleted. We'll talk about garbage collection tomorrow. Doing it at commit time would be too soon. Recall how REPEATABLE READ works.

  • Q: Can one query for the invisible rows?
  • A: No, not really.
CREATE EXTENSION pageinspect;

A low-level extension that lets you read the raw data. Useful for debugging data corruption:

SELECT * FROM heap_page_item_attrs(get_raw_page('accounts', 0), 'accounts');

Vacuuming the table cleans up the multi-versioned rows. Each version of a row has its own entry in the index. You can wind up with duplicate entries in a unique index, but only one row version will be visible. Indexes don't have the xmin and xmax information.

Questions

  • Q: Is it performant to query pg_stat_activity?

  • A: Yes. That view is not indexed; the data is stored in memory. No I/O pain.

  • Q: I have a query taking too long, locking everything up. How do we find the culprit query?

  • A: Looking for a hanging backend or a hanging query. Use the pg_blocking_pids function, find who is blocking that backend.

    • Q: What if I want to get to the root of the tree?
    • A: No ready-made answer. You have to find it yourself. It's usually a queue, not a tree.

Table Locks

Let's learn about table-level locks. Take a look at Table 13.2. Conflicting Lock Modes. If the cell has an X, those locking modes are not compatible. For example ROW EXCLUSIVE locks are compatible with one another and that's why we can INSERT multiple times in the same table concurrently.

Avoid SELECTing from a table at the same time as DROPping it; the locks conflict.

Day 3: Memory and cache management / Backup and restore

Today, we will discuss:

  • vacuuming
  • logical backup
  • persistence (esp during crash)

The visibility info is in the table, the index does not know about it.

Vacuum:

  • scan table, identify rows you want to get rid of
  • scan index, remove corresponding index entries
  • return to heap and remove offending rows

Illustration to remember:

postgres=# SELECT ctid,xmin,xmax,* FROM accounts;
 ctid  | xmin | xmax | id | name |  amount
-------+------+------+----+------+----------
 (0,5) |  490 |    0 |  1 | brie |  9800.00
 (0,6) |  491 |    0 |  2 | plop | 10200.00
(2 rows)

postgres=# BEGIN;
BEGIN
postgres=*#  UPDATE accounts SET amount = amount - 100 WHERE id = 1;
UPDATE 1
postgres=*# UPDATE accounts SET amount = amount + 100 WHERE id = 2;
UPDATE 1
postgres=*# COMMIT;
COMMIT
postgres=# SELECT ctid,xmin,xmax,* FROM accounts;
 ctid  | xmin | xmax | id | name |  amount
-------+------+------+----+------+----------
 (0,7) |  492 |    0 |  1 | brie |  9700.00
 (0,8) |  492 |    0 |  2 | plop | 10300.00
(2 rows)

Run VACUUM (VERBOSE) accounts; for more info:

postgres=# VACUUM (VERBOSE) accounts;
INFO:  vacuuming "public.accounts"
INFO:  "accounts": found 6 removable, 2 nonremovable row versions in 1 out of 1 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 493
There were 4 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  vacuuming "pg_toast.pg_toast_16384"
INFO:  "pg_toast_16384": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 493
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

No table row can be bigger than 8K because everything is stored in 8K blocks and a row must fit within a single block.

OK, what if I have column values bigger than 8K? Enter the TOAST technique.

When a table is created with at least one column of a variable-width data type, a TOAST table is created to store oversized values.

When a row is added:

  • if the table row is less than roughly 2000 bytes, it is stored as is.
  • if it is larger, the large values are compressed. If the compressed row fits under roughly 2000 bytes, it is stored compressed.
  • if the compressed row is still bigger than roughly 2000 bytes, the large values are chunked and stored in the TOAST table. The main table contains a reference to the TOAST table.

The TOAST machinery happens behind the scenes; reading the values back out again is called detoasting.

postgres=# CREATE TABLE vactest (id integer);
CREATE TABLE
postgres=# SELECT * FROM generate_series(1, 10);

generate_series() is a set-returning function, a table function. It behaves like a table:

postgres=# SELECT * FROM generate_series(1, 4);
 generate_series
-----------------
               1
               2
               3
               4
(4 rows)

So cool!

INSERT INTO vactest SELECT * FROM generate_series(1,100000);

Modular arithmetic is useful:

DELETE FROM vactest WHERE id % 2 = 0;

Vacuuming Example

Do this:

postgres=# BEGIN;
BEGIN
postgres=*# COMMIT;
COMMIT
postgres=# BEGIN ISOLATION LEVEL REPEATABLE READ;
BEGIN
postgres=*# SELECT * FROM accounts;
 id | name |  amount
----+------+----------
  1 | brie |  9700.00
  2 | plop | 10300.00
(2 rows)

Delete 50,000 rows and then observe that vacuum can't get rid of the old row versions. (They might be needed by the other transaction, although the tables are unrelated.)

postgres=# DELETE FROM vactest WHERE id % 2 = 0;
DELETE 50000
postgres=# VACUUM (VERBOSE) vactest;
INFO:  vacuuming "public.vactest"
INFO:  "vactest": found 0 removable, 100000 nonremovable row versions in 443 out of 443 pages
DETAIL:  50000 dead row versions cannot be removed yet, oldest xmin: 495
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

Once we commit the transaction, vacuum can do its thing.

postgres=# VACUUM (VERBOSE) vactest;
INFO:  vacuuming "public.vactest"
INFO:  "vactest": removed 50000 row versions in 443 pages
INFO:  "vactest": found 50000 removable, 50000 nonremovable row versions in 443 out of 443 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 498
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

Long-running transactions in PostgreSQL are doubly bad:

  • they can block other transactions
  • they prevent vacuum from removing dead rows

Auto Vacuum

The autovacuum launcher runs in the background. It awakes periodically, checks for whether any tables need to be vacuumed and sends a worker to perform the vacuuming if that's the case. Vacuum is designed to not create problems: no problem with concurrent data modifications. Vacuum won't block this. It's still resource-intensive (lots of reads, possibly some writes, CPU time). Auto vacuum is designed to be gentle.

Vacuum removes the data but doesn't shrink: there's just more free space in the 8K block. The empty space is not fragmented. Empty space is always in one spot and ready for reuse with the next INSERT or UPDATE.

  • Vacuum isn't for freeing up disk space.

Configuring Autovacuum

Always keep autovacuum on.

Reduce resource usage by autovacuum

Modify autovacuum_max_workers to address load due to autovacuum. Additionally, autovacuum_vacuum_cost_delay and autovacuum_vacuum_cost_limit also help to make sure autovacuum is not too intensive.

When will autovacuum run?

This formula determines whether autovacuum runs: autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × number of rows < number of dead rows.

By default: if 20% of a table is dead rows, autovacuum runs.
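
For example, with the defaults (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2), a table with 1,000,000 rows is autovacuumed once it accumulates more than 50 + 0.2 × 1,000,000 = 200,050 dead rows.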

More Problems with Autovacuum

Imagine you update like nobody's business: you could be creating dead rows more frequently than autovacuum can take them out. You'd see autovacuum running all the time. (It's going too slowly.)

  • Q: What would you tune to address this?
  • A: Reduce autovacuum_vacuum_cost_delay (down to 0 if needed) or increase autovacuum_vacuum_cost_limit.

Each worker can only work on one table at a time. You can change these settings in the config file for all tables in all databases, but importantly: YOU CAN CHANGE THESE SETTINGS ON A PER-TABLE BASIS.


ALTER TABLE test SET (autovacuum_vacuum_cost_delay = 1);

The bit above tells autovacuum on just that table to take shorter breaks!
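
To see which tables carry per-table overrides like that (a small sketch):

SELECT relname, reloptions FROM pg_class WHERE reloptions IS NOT NULL;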

AUTO ANALYZE

ANALYZE is part of autovacuum and has similar bits that can be toggled.

ANALYZE is much less resource-intensive than autovacuum. Make sure it's running often enough and you're good. The default is usually good except for things like:

  • Each row you add has a newer timestamp, so the statistics for the most recent rows will be out of date. You have an inaccuracy on exactly the rows you need the most if you are querying recent data. This is when you might want ANALYZE to run more often. Recall, you can do this on a per-table basis.
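
For example, a hypothetical per-table tweak so analyze kicks in after roughly 1% of rows change instead of the default 10% (table name made up):

ALTER TABLE measurements SET (autovacuum_analyze_scale_factor = 0.01);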

Taking a step back: this seems obvious but it's really important to understand your data, your application and how they are used to make the best decisions.

Index Only Scan

Imagine a query like SELECT id FROM tab WHERE id < 1000. If we have an index based on the id: we don't even need to go to the table and bear the pain of fetching rows from the table. That info is right in the index.

but

It's not that simple. The index doesn't know about the visibility information of an entry in the index. (The xmin and xmax stuff.)

Because we have to fetch the table row to get the xmin and xmax info, doing an index only scan is not possible, or hard (we'll see).

To solve this problem, PostgreSQL introduces the visibility map:

There are 2 bits for each 8K block; one bit is "all visible". When it's set to 1, it means that all rows in that 8K block are visible to all transactions.

The index only scan finds an index entry, which tells it which block the row lives in. It looks at the visibility map (which is in memory) and checks the "all visible" bit for that block. This is how we increase the efficiency of index only scans.

The vacuum process creates and maintains the visibility map. For efficient index only scans, make sure that vacuum runs frequently enough.

postgres=# CREATE TABLE test2 (LIKE accounts) WITH  (autovacuum_enabled = off);
CREATE TABLE
postgres=# INSERT INTO test2 SELECT * FROM accounts;
INSERT 0 2
postgres=# ANALYZE test2;
ANALYZE
postgres=# CREATE INDEX ON test2 (id);
CREATE INDEX
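
A sketch of how you might observe the effect (on a two-row table the planner prefers a sequential scan, so we nudge it; exact plan output varies):

SET enable_seqscan = off;
EXPLAIN (ANALYZE) SELECT id FROM test2 WHERE id < 2;   -- look for "Index Only Scan" and its "Heap Fetches" count
VACUUM test2;                                          -- builds the visibility map (autovacuum is off for this table)
EXPLAIN (ANALYZE) SELECT id FROM test2 WHERE id < 2;   -- Heap Fetches should now be 0
RESET enable_seqscan;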

Vacuum makes the table emptier but not smaller.

What if I want to shrink the table?

Use VACUUM (FULL) vactest;.

Note that VACUUM (FULL) is completely different from VACUUM. It makes a copy, removes the old stuff, and moves the new table in place of the original.

Running VACUUM (FULL) means downtime: it takes an ACCESS EXCLUSIVE lock. Consider this to be an emergency measure. Don't run it out of cron.

Because it makes a copy, don't use VACUUM (FULL) when you are on the verge of out of disk space and looking to free some up.

  • Q: What happens if you run out of disk space during a VACUUM (FULL)?
  • A: You'll get an I/O error, which will terminate the statement and roll it back. The new file is deleted and you'll be back to how it looked beforehand. One exception: if PostgreSQL runs out of space while creating a transaction log file, it will crash. We'll learn more later. A crash encountered during a VACUUM (FULL) will leave you with orphaned files which are hard to identify. Dump and restore becomes your best option.

Squeeze

Our tables are too big for us to do VACUUM (FULL), so we use pg_repack. Laurenz recommends pg_squeeze, which is nicer: it lives entirely in the database (no client).

pg_squeeze - builds compacted version in the background, grabs an access exclusive lock where possible, catches up on updates and shoves the new compacted version into place. Read more about pg_squeeze. This should be avoided if possible. Don't design a situation where you need to do this.

Other things to do instead:

  • Give autovacuum more memory.

  • Consider HOT updates so you don't need to run vacuum so frequently.

  • READ about TOAST on the PostgreSQL wiki

12:04 Break!
12:14 Back!

Is my table bloated? Do I need to take action?

  • Is it growing inordinately?
  • What does pg_stat_user_tables say?
SELECT * FROM pg_stat_user_tables WHERE relname = 'vactest';
\x

The \x turns on Extended display, which is beautiful:

You might have lots of dead tuples but vacuum just ran.

Do you want the full truth?


Use pgstattuple:

postgres=# CREATE EXTENSION pgstattuple;
CREATE EXTENSION
Time: 8.281 ms
postgres=# SELECT * FROM pgstattuple('vactest');
-[ RECORD 1 ]------+---------
table_len          | 21749760
tuple_count        | 150000
tuple_len          | 4200000
tuple_percent      | 19.31
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 14474892
free_percent       | 66.55

Time: 16.186 ms

Run VACUUM vactest; and then SELECT * FROM pgstattuple('vactest'); again.

Note that pgstattuple is expensive because it does a sequential scan of the table. Consider SELECT * FROM pgstattuple_approx('vactest'); instead: it takes a sample of the table and gives you an estimate. (Don't run it every 10 minutes; maybe daily.)

There are other tools that aim to estimate bloat. They are less expensive, but still check their results against pgstattuple now and then.

Logical Backup

Our friend pg_dump comes in:

# pg_dump -d database_name > backup.sql
# psql -d copy -f backup.sql

This is simple. A few downsides:

  • Requires a per-database approach
  • Misses global info (like users)
  • Doesn't capture configuration files

Use pg_dumpall to dump the entire cluster (database instance). Intended for use with upgrades.

Both pg_dump and pg_dumpall are client-side tools. Run pg_dump somewhere other than the database machine. (I mean, you can keep your backups on the server. You do you)

pg_dumpall > hey.sql

Use pg_dumpall -g to get just the global info. This may be a good complement to a single database backup with pg_dump.
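
A sketch of that combination (file names are just examples):

pg_dumpall -g > globals.sql          # roles, tablespaces and other globals
pg_dump -F c -f course.dmp course    # one database, custom format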

  • Q: How consistent is a dump? Data could be modified during this time.
  • A: Thankfully, pg_dump uses a read-only repeatable read transaction. No matter how long it takes, pg_dump will always get a consistent snapshot.

Table-wise Restore

I want to restore "less" than the entire database.

Dump just a table: pg_dump -t accounts course

Use -s to get the schema. There are lots of options to pg_dump worth exploring.

Use pg_dump -F c -f course.dmp course to tell pg_dump to use the custom format. It's a binary format. Use pg_restore to work with this file.

postgres@4b6f50d57295:~$ pg_dump -F c -f howdy.dmp copy
postgres@4b6f50d57295:~$ file howdy.dmp
howdy.dmp: PostgreSQL custom database dump - v1.14-0

Use something like pg_restore -d copy howdy.dmp.

Ooooh! You can restore a .dmp to a file!!!

You can generate a .sql from a custom dump. This is so cool:

pg_restore -f - howdy.dmp  | less

With the custom format dump, the data are compressed.

Maybe always use the custom format and convert to .sql if you need to. With a custom format, you can restore only part of what you dumped.

pg_restore -f - -t accounts course.dmp

To only restore the schema: pg_restore -f - -s course.dmp

Alternately, you can use the directory format: pg_dump -F d -f course.dir course

We have looked at a few formats for backups:

  • .sql > What we are all used to.
  • custom format > can't dump in parallel
  • directory format > can use parallel jobs, possibly the coolest format, IMO.
  • tar format > no advantages over the others

The directory format enables the -j flag, which lets you have multiple dump/restore processes.

postgres@4b6f50d57295:~$ pg_dump -F d -f backup copy
postgres@4b6f50d57295:~$ tree backup/
backup/
|-- 2965.dat.gz
|-- 2966.dat.gz
|-- 2967.dat.gz
|-- 2969.dat.gz
|-- 2970.dat.gz
`-- toc.dat

0 directories, 6 files
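
A sketch of parallel dump and restore with the directory format (the target database copy2 is hypothetical and must already exist):

pg_dump -F d -j 4 -f backup_dir copy
pg_restore -j 4 -d copy2 backup_dir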

So, pg_dump is great: simple, good but there's one decisive disadvantage:

  • It only captures one point in time. (You lose everything between when you took the backup and when the bad thing happened.)

How PostgreSQL Handles Persistence

This is useful background so we can understand how alternative methods of backing up PostgreSQL work.

Below the line: persistence world, disk
Above the line: memory and processes

There are data files on disk that persist the tables: 8K blocks in files containing data.

Above the line we have a cache (shared buffers in Postgres). Used to cache blocks for all databases, shared by all db processes. We also have a worker backend process working on behalf of a client connection to execute queries.

  • We want to read.
  • Block is copied up to shared buffers, read and returned.

A contradiction, conflicting interests:

  • We want to cache writes but we also want Durability.

A simple (but poor) approach: write out all involved tables when I COMMIT;. It works, but performance is poor.

How do we resolve this?

There's a single good architecture for this, and it goes by various names.

Below the line, we have the Write Ahead Log (WAL). This file (on disk) is written to sequentially and contains a log of all changes that have happened. All DBs have this (also called transaction log, binary log, redo log).

You must always write to the log first and then modify the data. This is low-level information (not the SQL statement): write whatever it takes to modify the files from before to after.

Using WAL

  1. backend writes to WAL buffers (in memory)
  2. backend is then free to modify the block
  3. WAL writer takes info from WAL buffers to Write Ahead Log (This could come above step 2.)
    a. Confirm WAL info is persisted, then COMMIT can proceed
  4. Way after COMMIT;, the dirty blocks in shared buffers are written down to disk (to the data files, not the WAL). The checkpointer and background writer take care of this.

Writing to the Write Ahead Log does not necessarily mean that the info has been written to disk because PostgreSQL uses buffered I/O; the info could be with the kernel. We use fsync at COMMIT; time to force WAL down to disk.


A checkpoint is a point in time when all dirty info in shared buffers is written down to disk and cleaned. This is marked in the WAL, so PostgreSQL knows anything older than this sequence number has already been written to disk. You only need to replay everything that came after the latest checkpoint. (This is how crash recovery works.)

Most non-WAL activity is written to disk by the checkpointer. After the checkpointer has completed, buffers are clean. As time passes, buffers get dirty again.

Clean buffers are vital for good performance. If there are none, the backend must write a dirty buffer down and then read into the now-clean buffer in order to get its query handled. OMG. We introduce the background writer for this.

The background writer looks around in the dirty buffers, cleans them, and aims to make sure we have enough clean buffers in between checkpoints so no one has to wait for a clean buffer.

An overeager background writer means more writes than are necessary: no thanks!

The background writer goes front to back through the entire shared buffer, slowly, over time. It shouldn't do any more work than is required to make sure buffer is clean when needed. (The background writer is not overly important. )

Terminology

We'll use page, block and buffer interchangeably but we'll always mean those 8K blocks. When on disk, it's a block; when in memory, a page or buffer.

12:59 Break!
13:17 Back!

Data Directory Backups

The name of a table isn't the real name; the object ID is the right identifier. (The table name can be changed.)

SELECT oid FROM pg_class WHERE relname = 'ts';

Try SELECT oid, relfilenode FROM pg_class WHERE relname = 'ts';

A VACUUM (FULL) will cause the relfilenode to change (the oid stays the same).

postgres=# SELECT oid,datname FROM pg_database;
  oid  |  datname
-------+-----------
 13395 | postgres
     1 | template1
 13394 | template0
 16470 | copy
(4 rows)
$ pwd && ls
/var/lib/postgresql/13/main/base
1  13394  13395  16470  pgsql_tmp

The _vm files inside 16470 are the visibility maps.

You can't just copy around the contents of that data directory and expect the right things to happen.

The pg_control file contains metadata about the database. It's useful for crash recovery. The pg_controldata command is used to read pg_control.

Ack, I don't have pg_controldata here, but this is what would work:

pg_controldata -D /var/lib/postgresql/13/main/
-bash: pg_controldata: command not found

Let's look at the write-ahead log in /var/lib/postgresql/13/main/pg_wal. It's a bunch of files, not just one. One of those files is active; you can identify it by looking for the most recent timestamp.

There are other WALs for future use, these are reserves to improve performance. When the active WAL is full, we switch to the next one and write to it.

A WAL segment that is older than the latest checkpoint is useless. At the end of a checkpoint, we delete unneeded WAL segments. The database will crash if it hits an I/O error while writing the next WAL segment.

In production, consider having pg_wal on a different filesystem. While the database is down, move the data and symlink to the new place.

The default WAL segment size is 16 MB. If you have lots of data modification, you'll have lots of WAL switches. Not expensive, but you may wish to increase the size of WAL segments.

Use pg_waldump to take a look. So freaking cool!

​root@4b6f50d57295:~# /usr/lib/postgresql/13/bin/pg_waldump /var/lib/postgresql/13/main/pg_wal/00000001000000000000002C  | tail
pg_waldump: fatal: error in WAL record at 0/2C774E98: invalid record length at 0/2C774ED0: wanted 24, got 0
rmgr: Heap2       len (rec/tot):     56/    56, tx:          0, lsn: 0/2C774C40, prev 0/2C774C00, desc: CLEAN remxid 0, blkref #0: rel 1663/16470/1255 blk 44
rmgr: Heap2       len (rec/tot):     56/    56, tx:          0, lsn: 0/2C774C78, prev 0/2C774C40, desc: CLEAN remxid 0, blkref #0: rel 1663/16470/1255 blk 67
rmgr: Heap2       len (rec/tot):     56/    56, tx:          0, lsn: 0/2C774CB0, prev 0/2C774C78, desc: CLEAN remxid 0, blkref #0: rel 1663/16470/1255 blk 71
rmgr: Heap2       len (rec/tot):     62/    62, tx:          0, lsn: 0/2C774CE8, prev 0/2C774CB0, desc: CLEAN remxid 0, blkref #0: rel 1663/16470/1255 blk 76
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 0/2C774D28, prev 0/2C774CE8, desc: CLEAN remxid 0, blkref #0: rel 1663/16470/1255 blk 77
rmgr: Heap2       len (rec/tot):     70/    70, tx:          0, lsn: 0/2C774D68, prev 0/2C774D28, desc: CLEAN remxid 0, blkref #0: rel 1663/16470/1255 blk 80
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/2C774DB0, prev 0/2C774D68, desc: RUNNING_XACTS nextXid 590 latestCompletedXid 589 oldestRunningXid 590
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/2C774DE8, prev 0/2C774DB0, desc: RUNNING_XACTS nextXid 590 latestCompletedXid 589 oldestRunningXid 590
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/2C774E20, prev 0/2C774DE8, desc: CHECKPOINT_ONLINE redo 0/2C774DE8; tli 1; prev tli 1; fpw true; xid 0:590; oid 24576; multi 1; offset 0; oldest xid 478 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 590; online
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/2C774E98, prev 0/2C774E20, desc: RUNNING_XACTS nextXid 590 latestCompletedXid 589 oldestRunningXid 590

An online backup creates a checkpoint from which you recover. WAL segments are useful for recovering from an old checkpoint. You can recover everything but you don't need to.

Let's get started.

  1. Archive transaction logs.
#wal_level = replica                    # minimal, replica, or logical

minimal is enough for crash recovery.

Do not turn off fsync. Yes, your database is faster but you are toast (not in the good way) if it crashes.


Let's try something:

postgres@4b6f50d57295:~$ time psql  -d copy  -q -f insert.sql

real    0m12.576s
user    0m0.433s
sys     0m0.909s

This takes forever because fsync runs after each commit. PostgreSQL is running in auto-commit mode, so it's flushing after every statement.

insert.sql is just 30,001 INSERT statements. Put BEGIN; at the beginning and COMMIT; at the end and observe the improvement:

:~$ time psql  -d copy  -q -f insert.sql

real    0m3.836s
user    0m0.478s
sys     0m0.872s

Much better!


Let's look at synchronous_commit to see what impact it has. With it off, not every commit syncs to disk. It is still safe to use because syncs will occur: the WAL is still synced/flushed to disk periodically (just not automatically with every commit).

Let's set it

# grep synchronous_commit /etc/postgresql/13/main/postgresql.conf
synchronous_commit = off                # synchronization level;

Let's use pg_ctl reload to tell the daemon to check for a new config.

OK, I removed the single transaction from insert.sql and the performance with synchronous_commit = off is much better:

time psql  -d copy  -q -f insert.sql

real    0m4.033s
user    0m0.365s
sys     0m0.660s

However, we don't want to lose data so let's turn synchronous_commit = on again. Let's consider commit_delay and commit_siblings instead.

Check for other sibling transactions and sync them together. This means fewer syncs. Commits will take longer (a commit won't return until the flush is done). It's tough to tune commit_delay and commit_siblings well, so it won't be as blazingly fast as with synchronous_commit = off, but still better.

Don't change full_page_writes = on. Partial pages are difficult to recover from. These are called torn pages. You may wish to compress the full-page writes: IDK, wanna trade CPU time for I/O? You must make a decision somewhere.

Checkpoints :checkmark:

If more than max_wal_size has been written since the last checkpoint, PostgreSQL triggers a checkpoint. Additionally, a checkpoint is triggered when checkpoint_timeout elapses.

See checkpoint_completion_target. PostgreSQL can spread the checkpoint process out. With the default of 0.5, every checkpoint should be done halfway to the next checkpoint. It's customary to increase this parameter, and with 14 the default will be increased to 0.9. You'll have checkpoint activity spread over more time (up until 90% of the way to the next checkpoint). If you have checkpoints every 10 minutes:

  • old default: a checkpoint can't take more than 5 minutes
  • new default: up to 9 minutes
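
Purely illustrative postgresql.conf values, just to show where these knobs live:

max_wal_size = 2GB
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9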

I hope that helps!

13:51 Break!
14:03 Back!

Great filesystem-level backup:

  • Stop server
  • Copy data directory
  • Start server

But I don't want to stop the server!

Fair!

Let's configure PostgreSQL to archive WAL so we can back up without stopping the server.

Set archive_mode = on; this requires restarting the database.

archive_command is nice but scary. You just put any command there and go.

archive_mode = on               # enables archiving; off, on, or always
                                # (change requires restart)
#archive_command = ''           # command to use to archive a logfile segment
                                # placeholders: %p = path of file to archive
                                #               %f = file name only
                                # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
archive_command = 'cp %p /var/lib/postgresql/walarchive/%f'             # command to use to archive a logfile segment

PostgreSQL only checks the return code. You could add /bin/true and PG would be like, yeah, great job, archiving completed.

Archival failure does not mean that the segment is deleted. PG will keep trying, back off, try again, it won't just skip and try for the next one. Why? We need an unbroken stream of WAL segments.

Activate Extended display for a single command with:

SELECT * FROM pg_stat_archiver  \gx

Yep, instead of the ;.

  • TODO Take screenshots from when I broke and fixed the WAL archiver stuff.

pg_basebackup -c fast -D backup

Copy files from server to client. Cool, but OK. Some things will be missing, like postmaster.pid (I don't care about the pid from the server). The results:

In addition, we have some new files like backup/backup_manifest. This is used by pg_verifybackup. We also have backup/backup_label.

We could just start up from the contents of pg_basebackup, but things will change while pg_basebackup is running. Don't do it, please; for you, not for me.


Anything after the WAL LOCATION and CHECKPOINT LOCATION could be consistent, but might not be. That's why we then recover things (with WAL) until we get to a consistent point. Thankfully, we can choose a specific point in time and restore to it. So, 3 hours after the pg_basebackup ran, or 5 hours after.

The CHECKPOINT LOCATION tells us where in the WAL the checkpoint is located.

We recover from the WAL LOCATION.

  • Q: When is the earliest point that we can safely stop recovering?
  • A: The time the online backup process ended. (You'd restore the backup and then replay WAL for everything after that; in our example the backup finished around 12:04.) You can continue recovering after that point, but you don't have to.

^^^ This was tricky.

pg_dump is granular; with pg_basebackup it's all or nothing for backups and restores. Sorry!

We can specify restore_command to tell PG what command to use to restore.

With no Recovery Target info, it'll keep recovering until either there is no file or it encounters garbage. Let's specify a Recovery Target.

Setting it to immediate says: stop recovering as soon as the backup is consistent. Huh?

Set recovery_target_time = '2020-09-01 14:20:00' to tell it the timestamp to recover up to.

Set recovery_target_action. If you set it to pause, it'll bring the DB up to recovery mode so you can check the recovery looks OK. Otherwise, set it to promote to just kick it into production.

recovery.signal > is it gone? Then recovery has finished and everything is OK.
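
A sketch of what the recovery settings might look like in postgresql.conf (values illustrative; the archive path matches the archive_command above):

restore_command = 'cp /var/lib/postgresql/walarchive/%f %p'
recovery_target_time = '2021-09-01 14:20:00'
recovery_target_action = 'pause'
# plus: touch recovery.signal in the data directory before starting the server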

The recovery process will rename backup_label to backup_label.old.

Cool! I did the rm too soon, so here's how the recovery looked on his machine.

pg_dump vs pg_basebackup

pg_dump is more flexible, but it has to do sequential scans and a bunch of other work, so it's slower than pg_basebackup.

pg_basebackup is generally faster than pg_dump, but produces a larger backup [bloat, indexes (in pg_dump an index is just a CREATE INDEX statement, but with pg_basebackup it's the whole physical index)]. It's single-threaded, so it's no good with huge databases (10TB).

Parallel Universes

Restore to a certain point: a new incarnation of the database is created just before the bad thing you want to recover from happened. This is how you get timelines 1 and 2.

The timeline is part of the name of the WAL segments.

Observe:

Say TL2 was created too soon. TL3 recovers later on:

Each recovery attempt creates a new timeline. There is always only one current timeline. When recovering with latest, always follow the path to the newest timeline; with current, stay on the current timeline. Or specify a timeline ID (follow each fork to get to this timeline). You shouldn't need to worry too much about this unless you are in a really bad place. Time to call or hire a DBA?


The .history files describe the history of a particular timeline.

There was a branch at this point and we created timeline 2:


There are existing non-core tools for automating elements of this. Laurenz recommends pgBackRest. See more at pgbackrest.org. You have a dedicated pgBackRest server, tell postgresql.conf a command to talk to that server. It handles retention as well.

Add this above where it belongs:

pg_stat_user_tables has statistical data that accumulate over time. It is cheap to query. pgstattuple takes a momentary snapshot of the table and scans it. That is something different, and it is expensive. But of course nobody keeps you from joining both in a query to present a combined result.

  • Laurenz Albe

TEST YOUR BACKUPS. OMG.

  • Q: A customer experiences corruption when running a SELECT. They don't have a backup. What options are available?
  • A: It depends. First, make a faithful copy of the files and experiment on the copy. Find someone who knows a lot about PostgreSQL. It won't be fun. Hire an expert or cut your losses.

We probably won't cover Replication. A few words:

  • Replication is basically just recovery. The differences of note:
    • It doesn't stop recovering, it just waits for more.
    • WAL archives are not restored; they are fetched straight from the main server.

Say PostgreSQL won't start because you are out of disk space. The correct answer is to increase disk space.

Before major version 10, pg_wal used to be pg_xlog (transaction log), and pg_xact used to be called pg_clog (commit log).

People would think these were just unnecessary log files and delete them. YIKES! :scared:

Day 4: Replication / High Availability / "Complex" Architectures

When recovering, WAL is retrieved from the archive. You could do replication that way but it would be super annoying. Instead, enter streaming replication.

Replication is recovery (of a type) that never ends.

Streaming replication introduces two processes:

  • WAL sender (on the primary)
  • WAL receiver (on the standby)

When something hits the WAL buffers, it goes to the WAL sender, is received by the WAL receiver, and is then written to the WAL on the standby. First to the WAL, and then to the data, remember.

The startup process that helps with recovery replays the data into the standby. This is a fundamentally async process.

WAL sender and WAL receiver

The standby acts as a client to the primary. PG has two protocols, both on 5432, it seems:

  • QUERY
  • replication > optimized for streaming from primary to standby, used by pg_basebackup

The config on the primary and the secondary should be as similar as possible.

pg_createcluster standby
rm -rf /var/lib/postgresql/13/standby/
pg_basebackup -D 13/standby   # replace the freshly created data directory with the pg_basebackup output

Modify postgresql.conf.

max_wal_senders > more than 0.

wal_level can not be minimal, must be replica or logical.

Configure primary_conninfo = '', this is a connection string.

With the streaming replication protocol, you connect to the cluster, no database needed.
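
A hypothetical entry in the standby's postgresql.conf (adjust host, port and user to your setup):

primary_conninfo = 'host=localhost port=5432 user=postgres application_name=standby1'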

touch 13/standby/standby.signal

Replicate, and then stay in recovery mode until I tell you to promote.

systemctl start postgresql@13-standby

Observe walsender and walreceiver processes:

sent_lsn > last WAL position sent to the standby
write_lsn > position written to disk by the standby
flush_lsn > what's flushed to disk and definitely safe on the standby
replay_lsn > what position in the WAL has already been replayed and is visible on the standby server
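
These columns come from pg_stat_replication on the primary; a minimal sketch:

SELECT application_name, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;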

Log Sequence Numbers (LSN)

In a WAL segment file name: the first 8 hex digits denote the timeline; the second 8 hex digits correspond to the part before the slash in an _lsn value.

In an _lsn value, the first two hex digits after the slash are the last two digits of the WAL segment name; the remaining six digits are the offset within that segment.

SELECT pg_current_wal_lsn(); to get the current LSN.
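
A small sketch tying an LSN to its WAL file name (example values borrowed from the pg_waldump output above):

SELECT pg_current_wal_lsn();                      -- e.g. 0/2C774E98
SELECT pg_walfile_name(pg_current_wal_lsn());     -- e.g. 00000001000000000000002C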

psql has watch!

SELECT * FROM whatever \watch 1

Measure replication lag in bytes, not time. Time is unreliable for this purpose.

docker run --name postgres_training \
-e POSTGRES_PASSWORD=mysecretpassword \
-d postgres:13

Stop the standby.

There are checkpoints (actually called restart points) on a recovering database. This makes an interrupted recovery less annoying. Resume recovery from that point, rather than the beginning. This is clearly essential for replication (it super needs to recover from where it left off, not the beginning, where possible).

CREATE TABLE accounts (id bigint PRIMARY KEY, name text NOT NULL, amount numeric(15,2) NOT NULL);

If you delete a WAL segment needed by the secondary (please don't), rebuilding the secondary becomes an option.

The restore command can not be changed without restarting the server.

Use wal_keep_size to retain otherwise-unneeded WAL so it stays available for the standby, making things easier if the standby falls too far behind. This is usually unnecessary. If standby downtime is longer than wal_keep_size accounts for, you'll still have trouble. Enter the replication slot. It marks a position in the WAL stream. The primary doesn't otherwise have any memory of the standby. (It's essentially a client.)

The primary should not discard any WAL younger than the replication slot. The replication slot is an indicator of how much WAL the secondary has consumed.

SELECT pg_create_physical_replication_slot('slot');
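
On the standby, a sketch of pointing it at that slot (postgresql.conf):

primary_slot_name = 'slot'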

Consider max_slot_wal_keep_size > it overrides the replication slot to avoid disk space piling up. It tells PostgreSQL to nuke that WAL even if the standby has not caught up.

With synchronous replication, COMMIT does not return until the synchronous standby has received the WAL. Doing this across continents: everything will take forever. This only works when the primary and the standby are close together in the network. The other downside of synchronous replication: the availability of the entire system goes down. A failure on the standby means a failure on the primary.

Use synchronous replication if you can not afford to lose a single transaction (fail over occurs before sync completes). Most people are using async.

The application_name field can be used to tell the PG server who I (the client) am. This is handy for replicas. Well-behaved applications set this name. This is used as a convention to name the standby server. You can set this in the connection string.

There are a range of options for synchronous_commit when it's on.

on > remote flush: wait for the standby to write the WAL to its WAL file and flush it to disk.
remote_apply > wait until the standby has replayed the information and that info is now visible there.

If you query the standby immediately after making a change, it may not be available with the default settings for synchronous_commit. You want remote_apply for that.
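
A hypothetical primary-side configuration (postgresql.conf); 'standby1' must match the standby's application_name:

synchronous_standby_names = 'standby1'
synchronous_commit = remote_apply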

12:01 Break!
12:12 Back!

Good use cases for streaming replication:

  • HA/failover additional software required to determine that the primary is down.
  • offloading queries that you might not want to run on production (resource management)
  • problematic but possible: scaling (for reads, not writes). The application must be able to write to one DB and read from the other. You have to account for the time required for the data that was written to be replayed. You could do synchronous commit at the remote_apply level. Yeah, it would work, but enter the replication conflict: VACUUM on the primary during a long-running query on the secondary.

max_standby_streaming_delay governs how long you wait before canceling a query and resuming streaming of WAL. The default value is 30s. Consider your report that runs on this replicated database. Please please please don't cancel it 30 seconds in. You can make this indefinite with -1.

Primary goal: replication doesn't lag too much.

hot_standby Setting this to off disallows queries during recovery.

The name of the game here is preventing conflicts.

hot_standby_feedback The standby reports its oldest running query to the primary so VACUUM won't remove rows that query still needs. No more conflicts due to VACUUM. It's not on by default because a very long-running query on the replica could cause bloat on the primary server. The primary will know about the standby's oldest query.

  • READ about hot_standby_feedback

remote_apply > Please don't, says LA. You'll have lots of waiting around all the time.

You can have an intentionally delayed standby server. This is helpful to recover from logical problems, assuming you notice the problem within the intentional delay period. A possible approach for quick recovery.

Fun with Functions

Every second, tell Postgres to sleep for 10 seconds.

SELECT pg_sleep(10) \watch 1

Temporarily Suspend Replication

Temporarily suspend replication: SELECT pg_wal_replay_pause();

resume with SELECT pg_wal_replay_resume();

High Availability (continued) and Failover

There is no cluster-ware in PostgreSQL but the tools to make it happen are available. Here's a manual failover:

  • Make the primary go down.
  • Make sure that the primary is actually down. (Avoid split-brain mode. This isn't fun.)
  • A few quick steps required:
    • Standby must be promoted out of recovery mode. Three ways to do this:
      • #promote_trigger_file = '' # file name whose presence ends recovery
      • pg_ctl promote
      • SELECT pg_promote();
    • Redirect the clients. Swizzle IP. Swizzle DNS (LOL). Load balancer HAproxy or similar. Alternately: this can be done with connection strings. If I can't get to the first, talk to the second.
  • Now I am back up and running, but in a degraded mode; there is no standby any longer. We can work on bringing the primary back unless it's really gone. The old primary server may be ahead: transactions committed on the primary but not yet sent to the secondary. How do we handle the deviation? We are respecting the timeline of the old standby, which is now the primary. What if we could just roll back the transactions that the old primary has that are ahead? Well! Enter pg_rewind.
psql 'host=/var/run/postgresql,/var/run/postgresql port=5432,5433 user=postgres'

Connect with a string like the above and observe the output of \conninfo to see which one you are connected to. That approach is cool, but it makes it possible to wind up connected to the standby, and why would you want that? It's read-only. Enter target_session_attrs, an addition to the connection string:

psql 'host=/var/run/postgresql,/var/run/postgresql port=5432,5433 user=postgres target_session_attrs=read-write'

With the above, you'll only connect to a database that is read-write. The clients do the handovers. Any client that links with libpq can do this. Possible with different syntax in JDBC.

A few ways to stop the server:

  • pg_ctl stop: there are three shutdown modes: smart, fast, immediate. The default is fast.

  • fast: all transactions are cancelled, client sessions are interrupted, a checkpoint is made, the server goes down. Default.

  • smart: no new client connections, existing connections are not interrupted. Once the last client connection closes, the DB shuts down. Not used too often.

  • immediate: evil mode. Go down as quickly as you can, no clean up. Crash the server. Crash recovery will be required on startup. Bad for production, fun for testing. No fsync.

pg_ctl stop -m immediate -D /etc/postgresql

See part 2.

Day 5: Execution Plans / Configuration and Resource Management


Post-Training

These notes will live here on work.carranza.engineer once the training is complete.

Clean Up

Clean up after yourself!

docker stop postgres_training
docker container ls -a
docker container rm $(docker container ls -a | grep postgres_training | cut -d" " -f1)
docker image ls
docker image rm <image_id>

Get the container ID you want to clean up: docker container ls -a | grep postgres_training | cut -d" " -f1

Daily Set Up

docker run --name pg_training -td postgres:13
docker exec -it pg_training /bin/bash
apt update &&  apt install lsb-release wget gnupg less man sudo vim postgresql-13 postgresql-13-dbgsym
su - postgres
pg_createcluster  13 main -- -E UTF-8   --locale=C.UTF-8     -k
pg_ctlcluster 13 main start
psql

Tools

External tools that might be helpful when poking at Postgres.

  • DbSchema available across all three platforms, looks cool, haven't used it.
  • DataFiller generate random data from database schema. Looks cool, haven't tested.
  • sqlsmith A random SQL query generator.

Links
