
PostgreSQL Training: Part 1 // 2021

tags: database,postgresql

See part 2.

Prep

The items in this section are only available to GitLab team members.

General

Let's Get Set Up!

How about a container with Postgres all ready to go?

docker run --name postgres_training \
-e POSTGRES_PASSWORD=mysecretpassword \
-d postgres:13

Use the credentials in pass:

docker run --name postgres_training -e POSTGRES_PASSWORD=$(pass postgres/container/user) -td postgres:13

Cool!

# docker exec -it postgres_training psql -U postgres
psql (11.13 (Debian 11.13-1.pgdg90+1))
Type "help" for help.

postgres=# \c
You are now connected to database "postgres" as user "postgres".

Installing lsb-core on its own requires an interactive setup step to specify timezone info. Our friend DEBIAN_FRONTEND comes in handy; try this instead:

docker run --name ubuntutty  -e DEBIAN_FRONTEND=noninteractive    -td ubuntu:20.04
docker exec -it ubuntutty /bin/bash
apt-get -y update
apt-get -y install gnupg less lsb-core lsb-release sudo vim wget

I recommend doing the above before class starts to avoid being annoyed.

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -

Day 1: Indexing / Transactions and Locking

Sharding will be out of scope for the training. We will talk a bit about partitioning.

Docs, the greatest resource.

The PostgreSQL Wiki > Good for things in the PostgreSQL ecosystem, broader scope than the docs.

About the Project

Free, open source software.
License: PostgreSQL is released under the PostgreSQL License, a liberal Open Source license, similar to the BSD or MIT licenses. There are a great many forks of PostgreSQL.

There are somewhere between 40-500 contributors to PostgreSQL, depending on how you define a contributor.

Versioning

See the versioning policy.

Major versions are released yearly, usually in the fall. We expect 14 in fall 2021. Releases are supported for 5 years.

Supported means you'll get new minor releases (always bug fixes, never new behavior).

For minor releases, the community considers not upgrading to be riskier than upgrading.

The goal is for people to be able to upgrade to a new minor release without needing to test their code. They focus on not introducing bugs in minor releases to facilitate this.

  • Q: How risky is it to upgrade while skipping a major version?
  • A: Never a problem to skip major versions when upgrading. Upgrade with either pg_dump and pg_restore or an in-place upgrade (pg_upgrade). Both are fine options. The former is super simple ("unparalleled simplicity") and the latter is quicker. In-place upgrades are a bit stricter (you can't make some modifications). There is no requirement to follow a certain upgrade path (as there is with GitLab). This could change one day; crystal ball required. In-place upgrades may leave you without the benefit of some index improvements that require changes to the physical, on-disk representation.

Let's do something!

Install PostgreSQL

There are a few ways to do it (compile from source, repos from distro, straight from PG, Docker, etc). The docs have more information on installing it on Ubuntu. Installing it via apt creates a PostgreSQL cluster. Nice but what if I want to control how it's created?

dpkg -l | grep postgres

Install debugging symbols to make analysis with perf less annoying:

apt install -y postgresql-13 postgresql-13-dbgsym

Let's become the user postgres to proceed from here. (su - postgres)

Let's take a look:

# pwd && ls
/usr/lib/postgresql/13
bin  lib
# bin/postgres  --version
postgres (PostgreSQL) 13.4 (Ubuntu 13.4-1.pgdg20.04+1)

There are more goodies in /usr/share/postgresql/13/ on Ubuntu.

Omnibus: /opt/gitlab/embedded/postgresql/[version]/lib

11:52 Break!
12:03 Back!

Postgres Cluster and Creating a Database

Run initdb to create a database cluster.

Try /usr/lib/postgresql/13/bin/initdb -? | less

Consider adding /usr/lib/postgresql/13/bin to your $PATH.

initdb

It creates a DATADIR. There is no default location but the standard is /var/lib/postgresql/13/main. That's familiar! The install put stuff there but we are nuking it. Only a DATADIR must be passed to initdb.

Specific options to consider:

-E, --encoding=ENCODING set default encoding for new databases
--locale=LOCALE set default locale for new databases
--lc-messages=LOCALE controls the language for error messages, use **English** for better search results
--lc-monetary=LOCALE set default locale for monetary amount formatting

PostgreSQL encoding options: consider UTF-8. The default is taken from the environment (shell running initdb or whatever), don't leave the default encoding mechanism to fate. ;-)

An option for calling initdb:

initdb -E UTF-8

Set --locale and leave things like lc-collate or lc-monetary alone unless you want to override them. A few of these options can not be changed later, like --lc-collate and --lc-ctype. This controls how databases are created in your cluster moving forward. Once the database has been created the character type and collation can not be changed.

--lc-ctype controls character type, determines what class a character belongs to (punctuation, number, letter, space, etc).
--lc-collate determines the sort order of strings.

NOTE: PostgreSQL uses the collations provided by the C library. Upgrading the C library could therefore change collation order in PostgreSQL. Not great: a changed collation can corrupt indexes. Consider rebuilding all indexes after such an upgrade.

Try locale -a to see what collation options are available.

# locale -a
C
C.UTF-8
POSIX

Using C collations means that upgrading C libraries is not a concern: the code points won't change.

-k, --data-checksums use data page checksums

In databases, everything is stored in a block. Blocks are the same size. PostgreSQL block size: 8K. With -k, checksums are written/read to disk to alert of possible storage problems. (Something changed between when the block was written and when it was later read.)
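Putting the options from this section together, a possible initdb call might look like this (the data directory matches the Ubuntu layout used below; adjust as needed):

/usr/lib/postgresql/13/bin/initdb -D /var/lib/postgresql/13/main -E UTF-8 --locale=C.UTF-8 -k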

PostgreSQL Cluster

What? Why?

The term cluster is somewhat unfortunate. A cluster contains several databases. A better word would be instance.

A cluster contains:

  • multiple databases
  • info outside of a single database like
    • users
    • tablespaces (determine physical placement of the data, irrelevant when using Docker)

We have the same set of users for all databases in a cluster. The permissions can be different for one user across different databases.

Default Port: 5432

When connecting to PostgreSQL, you always connect to one particular database. Once you are connected to that database, that's the one. You can't query a different database from that connection.

Let's create a cluster!

Forget about initdb and use Ubuntu's thing.

  1. Remove cluster created during install with /usr/bin/pg_dropcluster 13 main
  2. Create cluster with pg_createcluster which accepts initdb options:
pg_createcluster  13 main -- -E UTF-8   --locale=C.UTF-8     -k

Docker: pg_ctlcluster 13 main start
Ubuntu: systemctl start postgresql && systemctl enable postgresql

Now that it's up and running:

  • Config files: /etc/postgresql/13/main/
  • Data directory: /var/lib/postgresql/13/main
  • Processes:
$ ps  x  -u postgres
  PID TTY      STAT   TIME COMMAND
 6214 ?        Ss     0:00 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file
 6216 ?        Ss     0:00 postgres: 13/main: checkpointer
 6217 ?        Ss     0:00 postgres: 13/main: background writer
 6218 ?        Ss     0:00 postgres: 13/main: walwriter
 6219 ?        Ss     0:00 postgres: 13/main: autovacuum launcher
 6220 ?        Ss     0:00 postgres: 13/main: stats collector
 6221 ?        Ss     0:00 postgres: 13/main: logical replication launcher

Postgres uses a multi-process architecture and does not use threading. It's designed to be portable and run on a range of OSs which will implement multithreading differently. That leaves us with a dedicated process on the PostgreSQL server for each database connection.

Connecting to a database in a cluster

There are three methods we will cover:

  • TCP socket
  • Unix socket
  • connection strings

The only client shipped is psql, our friend. See the list of PostgreSQL clients on the wiki.

It's like vi. You might not like it but you might need to use it one day so try to get along. :-)

There are four things we must specify with each connection:

  • host
  • port
  • database name
  • database user
psql -h db.brie.dev -p 5432 -U postgres -d nameofdatabase 

You'll be prompted for a password with the above so we'll have to try another way. Thankfully, PostgreSQL listens on TCP and Unix sockets. Yay!


psql -h /var/run/postgresql -p 5432 -U postgres -d postgres

The psql client is smart enough to know what to do with the directory with the command above. Now, we have a prompt:

$ psql -h /var/run/postgresql -p 5432 -U postgres -d postgres
psql (13.4 (Ubuntu 13.4-1.pgdg20.04+1))
Type "help" for help.

postgres=#

How do I bail? (It's not :wq!.) Bail with \q or Ctrl + d.

Commands starting with \ are client commands; all other commands are server commands.

Defaults with psql:

  • -h > local Unix socket
  • -p > 5432
  • -U > user calling psql
  • -d > same as database username

So, we can just say psql.


We can use connection strings:

psql 'host=/var/run/postgresql port=5432 user=postgres dbname=postgres'

Support for this comes from the shared libraries psql links against (specifically libpq.so.5).

HELP: \?

export PAGER='less -RS'

Consider less -RS as the pager.

Get a list of databases with \l.

Maybe you don't need an interactive session:

psql -d template1 -l

The above just says connect to the template1 database and list all databases.

Creating a database

You do have to be connected to a database (like postgres) in order to create a database.

Get help for SQL statements with \h! That's kind of awesome


\h CREATE DATABASE
postgres=# \h CREATE DATABASE
Command:     CREATE DATABASE
Description: create a new database
Syntax:
CREATE DATABASE name
    [ [ WITH ] [ OWNER [=] user_name ]
           [ TEMPLATE [=] template ]
           [ ENCODING [=] encoding ]
           [ LOCALE [=] locale ]
           [ LC_COLLATE [=] lc_collate ]
           [ LC_CTYPE [=] lc_ctype ]
           [ TABLESPACE [=] tablespace_name ]
           [ ALLOW_CONNECTIONS [=] allowconn ]
           [ CONNECTION LIMIT [=] connlimit ]
           [ IS_TEMPLATE [=] istemplate ] ]

URL: https://www.postgresql.org/docs/13/sql-createdatabase.html

Creating a database in PostgreSQL is essentially copying an existing database (and then doing a little bit of magic). You can use any database as a template. (Permits cloning or duplicating a database.)

You'll get the default locales and collation information when you create a database unless you specify them differently when creating the database.

A database can only be used as a template if no one else is connected to it. (To assist with this, you can toggle ALLOW_CONNECTIONS to prevent anyone from connecting to a database.)

You get tab completion inside the psql prompt. Double tab for all options. Terminate statements with ;. The prompt will change to indicate that the server is still waiting on you. Observe:

postgres=# CREATE DATABASE cutecat
postgres-#
postgres-# ;
CREATE DATABASE
postgres=#

We need a new connection in order to connect to our new database. We can do this from within the same psql session!


Use \c cutecat for this:

postgres=# \c cutecat
You are now connected to database "cutecat" as user "postgres".
cutecat=#

Change the default database info with environment variables like PGHOST, PGPORT, PGDATABASE and PGUSER.
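For example, a rough sketch of a shell setup so that a bare psql connects to cutecat over the local socket (values are just the ones used in this walkthrough):

export PGHOST=/var/run/postgresql
export PGPORT=5432
export PGUSER=postgres
export PGDATABASE=cutecat
psql   # no flags needed; connection details come from the environment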

13:02: Break!
13:19: Back!

Case in Commands

We wrote CREATE DATABASE course; to create a database called course. The SQL standard demands that identifiers are case folded to upper case, but PostgreSQL case folds to lower case. We could do cReAtE dAtAbAsE course; and the same thing would happen. Prevent case folding with double quotes. Do CREATE DATABASE "COURSE"; to really create a database called COURSE; otherwise CREATE DATABASE COURSE; would create a database called course.

Exercise caution with double-quoted identifiers. (You could use them to create a database called USER, but maybe don't do that, because user is a reserved keyword and DROP DATABASE USER; won't do what you would expect. It'll nuke user, not USER.)
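A quick illustration of the quoting behavior described above:

CREATE DATABASE "COURSE";   -- really named COURSE, case preserved
DROP DATABASE "COURSE";     -- quotes needed here too, otherwise you'd be targeting course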

See Capitalization/case-folding from Stanford's NLP group or what the W3 has to say about case mapping and case folding.

Template Databases

cutecat=# \c template0
FATAL:  database "template0" is not currently accepting connections
Previous connection kept
cutecat=# \c template1
You are now connected to database "template1" as user "postgres".
  • Q: Why can't I connect to the template0 database?
  • A: It is not allowing connections. Databases with open connections can not be used as templates.
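The table below presumably came from a query along these lines (my reconstruction, not shown in the original notes):

SELECT datname, datallowconn FROM pg_database;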
  datname  | datallowconn
-----------+--------------
 postgres  | t
 cutecat   | t
 template1 | t
 template0 | f
 course    | t

We see that template0 is not allowing connections. The docs have more info about the template databases.

Having multiple databases in a single cluster comes with some pros and cons:

  • Databases in a single cluster must be stopped and started together. Yikes! Don't put databases belonging to different services in the same cluster.
  • Databases in a single cluster (obviously) must be using the same version of PostgreSQL.
  • There are some cluster-wide configuration options that must be applied to all databases in a cluster.
  • If there's any doubt, use a different cluster. There's additional overhead but grants flexibility.

Indexing

We need a bunch of rows to make the database sweat.


CREATE TABLE test (id integer GENERATED ALWAYS AS IDENTITY, name text);

Describe a table with \d table_name.

cutecat=# \d test
                         Table "public.test"
 Column |  Type   | Collation | Nullable |           Default
--------+---------+-----------+----------+------------------------------
 id     | integer |           | not null | generated always as identity
 name   | text    |           |          |

That query would be akin to the following in MySQL:

CREATE TABLE test (
    id INT NOT NULL AUTO_INCREMENT,
    name CHAR(30) NOT NULL,
    PRIMARY KEY (id)
);

GENERATED ALWAYS AS IDENTITY has been part of PostgreSQL since version 10. Text columns are limited to a size of 1 GB in PostgreSQL. (What are you doing?!?) Use text if you don't have a limit that needs to be enforced. (Don't add a limit on the database side when there is no such limit in the application.)

Let's add a few rows to our table:

cutecat=# INSERT INTO test (name) VALUES ('plop'), ('boo radley');
INSERT 0 2

Duplicate the rows in the table:

cutecat=# INSERT INTO test (name) SELECT name FROM test;
INSERT 0 3
cutecat=# SELECT * FROM test;
 id |    name
----+------------
  1 | plop
  2 | boo radley
  3 | fluffy
  4 | plop
  5 | boo radley
  6 | fluffy
(6 rows)

Run the query above until you are happy with how large and slow your database will be.
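If the doubling feels slow, a hedged alternative (not from the class) is to generate filler rows directly with generate_series:

INSERT INTO test (name) SELECT 'cat_' || i FROM generate_series(1, 1000000) AS i;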


Measure time from client:

cutecat=# \timing
Timing is on.
cutecat=#

OK, now we have a super slow query:

cutecat=# SELECT * FROM test WHERE id = 42;
 id | name
----+-------
 42 | tiger
(1 row)

Time: 679.481 ms

What happens with a query you provide?

  • It's parsed. The query rewriter updates the * with the names of the columns.
  • Query optimizer: how can I calculate the result in the best way out of all the possible ways? This plans the query.
  • Query executor: runs the chosen plan!

Use EXPLAIN to get the execution plan for your query.

# EXPLAIN SELECT * FROM test WHERE id = 42;
                               QUERY PLAN
------------------------------------------------------------------------
 Gather  (cost=1000.00..82407.93 rows=1 width=11)
   Workers Planned: 2
   ->  Parallel Seq Scan on test  (cost=0.00..81407.83 rows=1 width=11)
         Filter: (id = 42)
(4 rows)

Time: 0.734 ms

Let's turn parallelism off to make the plan simpler: SET max_parallel_workers_per_gather = 0;. Here's the updated EXPLAIN result with parallel workers disabled:

# EXPLAIN SELECT * FROM test WHERE id = 42;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..137334.80 rows=1 width=11)
   Filter: (id = 42)
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)

Time: 34.111 ms

The value of the cost is not in any unit, it's primarily useful for comparison. You get two numbers. From above:

  • 0.00 start up cost, cost to deliver first row
  • 137334.80 total cost, cost for retrieving all results
  • rows=1 a guess at how many rows will be returned.

We don't care much about the width value.

Analogy: A table is a library. A book is a table row. If you go to the library with no information about how the books are organized, you'll have to do a sequential scan, looking at each book, spine-by-spine. Thankfully, that's not reality! Today, you consult a card catalog. This is akin to a database index.

Tables are also referred to as heaps in PostgreSQL. Tables have no particular order (not ordered by id). An index is ordered. Each entry in an index points to the corresponding row in the table. (OK, think about it that way. It's actually implemented in a tree structure.)

Create an Index

# CREATE INDEX test_id_idx ON test (id);
CREATE INDEX
Time: 7084.656 ms (00:07.085)

Let's look at that SELECT again:

cutecat=# EXPLAIN SELECT * FROM test WHERE id = 42;
                               QUERY PLAN
-------------------------------------------------------------------------
 Index Scan using test_id_idx on test  (cost=0.43..8.45 rows=1 width=11)
   Index Cond: (id = 42)
(2 rows)

Time: 1.027 ms

The start up cost is higher but wow is the total cost lower. We did it.


  • Q: What happens when the column ID reaches the maximum number allowed by the type integer?
  • A: You have a big problem. You can change the datatype to bigint, bringing you from a four-byte integer to an eight-byte one. Doing this would rewrite the entire table (lots of downtime, resource usage). There are tricks you can do to do this but it's a pain. Don't ever use integer. See the Convert ci_builds.id to bigint - Step 1: Add new columns and sync data issue for the pain GitLab is experiencing with this very topic.

No matter what: stay safe, use bigint.
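For a brand-new table, that simply means declaring the identity column as bigint from the start; a minimal sketch (hypothetical table name):

CREATE TABLE test_big (id bigint GENERATED ALWAYS AS IDENTITY, name text);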

Laurenz Albe

If you didn't stay safe and you have this problem, here is (from memory) the guidance on converting from int to bigint (a rough SQL sketch follows the list):

  • Add another nullable bigint column
  • Create a TRIGGER that copies the existing primary key into the new column
  • Run a series of updates that populate existing rows in batches
  • add a non-validated check constraint that ensures NOT NULL
  • Validate the constraint
  • CREATE UNIQUE INDEX CONCURRENTLY
  • Convert the index into a constraint
  • Drop the old column
  • Accounting for foreign keys is an exercise left to the reader
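A rough SQL sketch of those steps, assuming a hypothetical events table with an integer primary key id (names are purely illustrative; the real-world migrations are more involved):

ALTER TABLE events ADD COLUMN id_new bigint;

-- keep new and old columns in sync for incoming writes
CREATE FUNCTION events_sync_id() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    NEW.id_new := NEW.id;
    RETURN NEW;
END;
$$;

CREATE TRIGGER events_sync_id_trg
    BEFORE INSERT OR UPDATE ON events
    FOR EACH ROW EXECUTE FUNCTION events_sync_id();

-- backfill existing rows in batches (repeat with further ranges)
UPDATE events SET id_new = id WHERE id BETWEEN 1 AND 100000 AND id_new IS NULL;

-- enforce NOT NULL cheaply: unvalidated check constraint first, then validate
ALTER TABLE events ADD CONSTRAINT events_id_new_not_null CHECK (id_new IS NOT NULL) NOT VALID;
ALTER TABLE events VALIDATE CONSTRAINT events_id_new_not_null;

-- build the unique index without blocking writes
CREATE UNIQUE INDEX CONCURRENTLY events_id_new_key ON events (id_new);
-- remaining steps: convert the index into the primary key constraint,
-- drop the old column, and fix up any foreign keys that referenced it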

Please just use bigint.

14:00 Break!
14:09 Back!

Using an index, looking at query plans

Using EXPLAIN to understand what PostgreSQL does when we give it queries.

SELECT reltuples, relpages FROM pg_class WHERE relname = 'test';

Tell me how many rows (reltuples) and 8K pages (relpages). These are estimates (they'll be in the right order of magnitude, "close to the truth".)

seq_page_cost and cpu_tuple_cost are two of the parameters the planner uses to compute these costs:

  • seq_page_cost - Sets the planner's estimate of the cost of a disk page fetch that is part of a series of sequential fetches. The default is 1.0. This value can be overridden for tables and indexes in a particular tablespace by setting the tablespace parameter of the same name (see ALTER TABLESPACE).

  • cpu_tuple_cost cost to process one row. The default is 0.01.
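For a plain sequential scan, the total cost is roughly relpages * seq_page_cost + reltuples * cpu_tuple_cost, plus a cpu_operator_cost per row for the WHERE filter. A rough sanity check against the plan above (it will land slightly below EXPLAIN's number because the filter cost is ignored):

SELECT relpages * current_setting('seq_page_cost')::float8
     + reltuples * current_setting('cpu_tuple_cost')::float8 AS rough_seq_scan_cost
  FROM pg_class
 WHERE relname = 'test';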

Picking the plan with the lowest cost is easy; estimating those costs accurately is not so easy.

Let's look at a more interesting query:

EXPLAIN SELECT * FROM test WHERE id < 42;

When scanning more than one row with an index, PostgreSQL may need to read more than one index page:

  • Q: What if the index consists of more than one 8K page?
  • A: More I/O. Size of index does not impact performance as much as number of reads from index.

The real cost in an index scan is not in reading the index. The expensive part is in reading the table row that the index points to. The index can be read from the beginning to the end or from the end to the beginning.

With a query like SELECT * FROM test WHERE id > 42;, the query optimizer could choose to do an index scan but we see it did a sequential scan instead:

cutecat=# EXPLAIN SELECT * FROM test WHERE id > 42;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..137335.20 rows=7669977 width=11)
   Filter: (id > 42)
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)

Time: 1.158 ms

With a query like this, it's back to the index scan:

cutecat=# EXPLAIN SELECT * FROM test WHERE id > 4000000;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Index Scan using test_id_idx on test  (cost=0.43..121539.97 rows=3587802 width=11)
   Index Cond: (id > 4000000)
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)

Time: 1.070 ms

We can be reasonably sure that the optimizer is doing the fastest thing when choosing between index scan and a sequential scan.

The index is useful for doing things like EXPLAIN SELECT max(id) FROM test;:

 Result  (cost=0.46..0.47 rows=1 width=4)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..0.46 rows=1 width=4)
           ->  Index Only Scan Backward using test_id_idx on test  (cost=0.43..218357.71 rows=7670016 width=4)
                 Index Cond: (id IS NOT NULL)
(5 rows)

The index is useful also for sorting:

cutecat=# EXPLAIN SELECT * FROM test ORDER BY id;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Index Scan using test_id_idx on test  (cost=0.43..240645.67 rows=7670016 width=11)
(1 row)

Time: 0.598 ms

I can forbid PostgreSQL from using an index scan with SET enable_indexscan = off;. Use RESET enable_indexscan; to revert that change.

The index is clearly super helpful, but there are associated downsides to be aware of.

  • The size of the index on-disk can be larger than the size of the table. (This doesn't necessarily mean that you did something wrong.) This is redundant data. A small annoyance.
  • A big annoyance: each data modification updates the index. Inserting a row into the table is quick and cheap, inserting a row into the index can be expensive because the sorted order is maintained. A single table could have multiple indexes. Updating the table means updating each of those indices.

Inserting a row into the heap is cheap.

As a result: it is not always necessarily a good idea to have an index. You need to balance speed when querying vs speed when updating. (If you are querying monthly but inserting daily, maybe don't bother with the index.)

  • Q: How can I tell how much time is spent updating each index?
  • A: No straightforward approach. Observe and measure.

An index you don't use is a bad thing: it degrades performance and provides no benefit.
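One rough way to spot candidates (my addition, not from the class) is to look at how often each index has actually been scanned:

SELECT relname, indexrelname, idx_scan
  FROM pg_stat_user_indexes
 ORDER BY idx_scan;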

What if I create an index on the text column?

CREATE INDEX test_name_idx ON test (name);

Will the index even be used?

# EXPLAIN SELECT * FROM test WHERE name = 'plop';
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=24286.17..92968.34 rows=2177773 width=11)
   Recheck Cond: (name = 'plop'::text)
   ->  Bitmap Index Scan on test_name_idx  (cost=0.00..23741.73 rows=2177773 width=0)
         Index Cond: (name = 'plop'::text)
(4 rows)

Time: 1.197 ms

Yes, but it's also doing a new thing called a bitmap heap scan.

Each table block is only fetched once. The entire bitmap might not fit into memory; PostgreSQL will degrade to one bit per block. Less efficient, but there's only so much memory, so that's the tradeoff.

You can drop an index like you'd drop a database.
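For example:

DROP INDEX test_name_idx;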

For some purposes, having an index is super useful.

Partial Index

Let's use a WHERE to create a partial index.

CREATE INDEX test_name_idx ON test (name) WHERE name NOT IN ('plop');

The index will only be created for rows that satisfy the WHERE condition. A partial index is also referred to as a conditional index.

In addition to having a partial index, you can have a unique index. You can also have a unique partial index.

You might wish to create a partial index that only includes conditions you are frequently querying for. Like:

CREATE INDEX on test (name) WHERE account_status NOT IN ('active');

Completely and quickly made up to demonstrate: querying only for users that are not active.

It's really important to understand your particular application and dataset to know how to decide whether to make an index, whether a partial index would make sense and how to scope the partial index.

Great segue

Who creates indexes?

Ideally, the developer. Don't cheat by only testing against datasets that are too small to be relevant to production usage. Consider index creation when designing the query. Have realistic amounts of test data at some point between development and production.

Please!

  1. Analyze query performance
  2. Observe results
  3. Improve query (within reason)
  4. Build/design index

Day 2: Functions / JSON / Partitioning

Watch the day 2 recording until Laurenz' machine says Aug 31 17:48.

Indexing continued.

Having an index on name won't help you with a query that filters on WHERE upper(name) = ....

What do we do?
Create an index on a function!

CREATE INDEX ON test (upper(name));

For an index to be useful with a WHERE expression the query should look something like:

<indexed expression> <operator> <constant>

The operator must be an operator understood by the optimizer. Use any constant that is constant during the query. To demonstrate how serious this is, consider:

EXPLAIN SELECT * FROM test WHERE id + 0 = 42;

We have an index on id but the database isn't smart enough to know that anything plus 0 is that thing.

With a query on an indexed expression, we see that a sequential scan is no longer used. In our example, the estimates were way off and the optimizer thought it would find 20,972 rows when the real answer is 0.

PostgreSQL collects table statistics regularly:

  • how big
  • value distribution
  • most common values
  • histograms to approximate distribution

Recalculation of these statistics is triggered by data modification or when you tell it to with ANALYZE.
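To trigger the recalculation by hand and peek at what was collected (pg_stats is the view over the gathered statistics):

ANALYZE test;
SELECT attname, n_distinct, most_common_vals FROM pg_stats WHERE tablename = 'test';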

CREATE TABLE person (id bigint PRIMARY KEY, name text NOT NULL, bday date NOT NULL);

Store dates and times as dates and times. Don't store dates as numbers (something something number of seconds since the epoch).


SELECT * FROM person where age(bday) > INTERVAL '80 years';
postgres=# CREATE INDEX ON person (age(bday));
ERROR:  functions in index expression must be marked IMMUTABLE
Time: 0.980 ms

The "volatility" of a function.

An immutable function always returns the same result for the same arguments, regardless of things like time of day or phase of the moon (upper() qualifies). We got this error because age() is a function of the current time.
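A common workaround (my addition, not something shown in class): index the column itself and move the time-dependent arithmetic to the other side of the comparison, which is roughly equivalent for this query:

CREATE INDEX ON person (bday);
SELECT * FROM person WHERE bday < current_date - INTERVAL '80 years';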

Stan provides an example of GitLab being hit by function volatility:

We got bit by this immutable index problem recently because comparing timestamps without time zone against timestamps WITH time zone doesn't work, and some schemas had a mix of different types: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/68784

12:02 Break!
12:12 Back!

Timestamps in PostgreSQL

<digression>
\dTS
SELECT current_timestamp;
SELECT localtimestamp;

Timestamp without time zone: easy and straightforward. Often, you want timestamp with time zone.

CREATE TABLE ts (t timestamp with time zone);

If you store a timestamp with time zone, it's converted to UTC and stored that way. When converted to a string, the conversion happens according to parameters in the database session.

SHOW timezone;
CREATE TABLE ts (t timestamp with time zone);
INSERT INTO ts VALUES (current_timestamp);
postgres=# TABLE ts;
               t
-------------------------------
 2021-08-31 16:15:10.902239+00
(1 row)

Time: 0.372 ms
postgres=# INSERT INTO ts VALUES ('2021-08-31 00:00:00+05');
postgres=# TABLE ts;
postgres=# SHOW timezone;
postgres=# SET timezone = 'Europe/Vienna';
postgres=# TABLE ts;

  1. Make sure timezone is set properly.
  2. Use timestamp without time zone if you are doing time zone management in your application

</digression>

Indexes, continued

On btree indexes and LIKE

EXPLAIN SELECT * FROM test WHERE name LIKE 'fluff%';

The query above won't use the index. Why? The default collation does not compare strings character-by-character, so the btree order can't be used for prefix matching.

With an operator class, we can fix this. Specifically, we want the text pattern operator class. We just create the index a little differently:

CREATE INDEX on test (name text_pattern_ops);

Take a look at the Index Cond and then the Filter:

# EXPLAIN SELECT * FROM test WHERE name LIKE 'brie%';
                                QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using test_name_idx on test  (cost=0.43..8.45 rows=1 width=11)
   Index Cond: ((name ~>=~ 'brie'::text) AND (name ~<~ 'brif'::text))
   Filter: (name ~~ 'brie%'::text)
(3 rows)

The index we just created can also be used for equality comparisons (WHERE name = 'fluffy'). So, we don't need two indexes.

ORDER BY name is kind of the only use for a straight-up normal index.

Access methods: there are others:

SELECT amname, amhandler FROM pg_am;
 amname |      amhandler
--------+----------------------
 heap   | heap_tableam_handler
 btree  | bthandler
 hash   | hashhandler
 gist   | gisthandler
 gin    | ginhandler
 spgist | spghandler
 brin   | brinhandler

Indexing is one of the strong points of PostgreSQL.

  • hash - Hash indexes are hash tables persisted on disk. Can only be used for equality searches, which a btree could also handle. A hash index can't do anything that a btree can not. Possible speed advantages. Corner case: indexing very large values to save space.
  • spgist and brin very special case indices.
  • gist - perfect for weird data types. exotic data types that can't be used with a btree because they don't have an order. Consider: geographic data points or range types.
  • gin Most useful for composite data types. Arrays, full text search, JSON
  • btree - data types must have a way of being ordered/compared. If not, you can't use btree. DEFAULT.

Things we won't talk much about:

  • support functions
  • spgist and brin

Range Types

These are intervals: floating point ranges, timestamp ranges. Think hotel reservations.

CREATE TABLE reservations (id bigint PRIMARY KEY, name text NOT NULL, t_from timestamp NOT NULL, t_to timestamp NOT NULL);
  • Q: Is the room free from 9a to 10a?

There are a few different ways that an overlap could occur, and the query to check for all of them with two timestamp columns would be annoying. Let's use a timestamp range instead:

CREATE TABLE reservations (id bigint PRIMARY KEY, name text NOT NULL, res tsrange NOT NULL);

Look at this query:

INSERT INTO reservations VALUES(1, 'room 1', '[2021-08-31 09:00:0,2021-08-31 11:00:00)');

The [ says that 9a is in the range and the ) at the end says that 11a is not in the range and permits adjacent reservations.

postgres=# SELECT * FROM reservations WHERE res && '[2021-08-31 10:00:0,2021-08-31 11:00:00)';
 id |  name  |                      res
----+--------+-----------------------------------------------
  1 | room 1 | ["2021-08-31 09:00:00","2021-08-31 11:00:00")

These operators are supported by the gist index:

  • && overlaps operator
  • @> contains operator

The execution plan says a sequential scan will be happening. Lame. Let's make a gist index:

CREATE INDEX ON reservations USING gist (res);

It's still using a Sequential Scan so let's do SET enable_seqscan = off; to force it to use the gist index that we created.

a btree index is a special kind of gist index but don't worry about it.

An exclusion constraint can be implemented using a gist index:

ALTER TABLE reservations ADD EXCLUDE USING gist (res WITH &&);

The thing above will prevent an overlapping entry! Whoa. Here's how it looks:

postgres=# INSERT INTO reservations VALUES(2, 'room 1', '[2021-08-31 09:20:0,2021-08-31 10:45:00)');
ERROR:  conflicting key value violates exclusion constraint "reservations_res_excl"
DETAIL:  Key (res)=(["2021-08-31 09:20:00","2021-08-31 10:45:00")) conflicts with existing key (res)=(["2021-08-31 09:00:00","2021-08-31 11:00:00")).
postgres=# INSERT INTO reservations VALUES(2, 'room 1', '[2021-08-31 08:20:0,2021-08-31 08:59:59)');
INSERT 0 1

That's the coolest thing we've seen thus far, IMO.

Gin Indexes and JSON in a Database

There are two data types for JSON:

  • json > stored as a string, preserves all formatting. use if it's not being touched in the database, only stored and retrieved.
  • jsonb > to do in-database processing, prefer jsonb. Use unless you have a reason not to.

Use :: as the special operator for casting. The right way is:

SELECT CAST ('{ "a": 42, "b": [ "x": true, "y": [1, 2, 3] } }' AS json);

The shorter and more fun ways:

SELECT '{ "a": 42, "b": [ "x": true, "y": [1, 2, 3] } }':: json;
SELECT '{ "a": 42, "b": [ "x": true, "y": [1, 2, 3] } }':: jsonb;

Inserting JSON into the database
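The notes don't show the table definition; presumably something along these lines:

CREATE TABLE json (id bigint PRIMARY KEY, j jsonb);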

INSERT INTO json VALUES (1, '{ "a": 42, "b": { "x": true, "y": [1, 2, 3] } }');
INSERT INTO json VALUES (2, '{ "a": 1, "b": {  "y": [3, 4, 5] } }');

Querying JSON in the database!

postgres=# SELECT * FROM json WHERE j ->> 'a' = '42';
 id |                      j
----+---------------------------------------------
  1 | {"a": 42, "b": {"x": true, "y": [1, 2, 3]}}
(1 row)
postgres=# SELECT * FROM json where j @> '{ "a": 42 }';
 id |                      j
----+---------------------------------------------
  1 | {"a": 42, "b": {"x": true, "y": [1, 2, 3]}}
(1 row)
postgres=# CREATE INDEX ON json USING gin (j);
CREATE INDEX
postgres=# EXPLAIN SELECT * FROM json where j @> '{ "a": 42 }';
                                QUERY PLAN
--------------------------------------------------------------------------
 Bitmap Heap Scan on json  (cost=12.00..16.01 rows=1 width=40)
   Recheck Cond: (j @> '{"a": 42}'::jsonb)
   ->  Bitmap Index Scan on json_j_idx  (cost=0.00..12.00 rows=1 width=0)
         Index Cond: (j @> '{"a": 42}'::jsonb)
(4 rows)

So cool!

Words on JSON in the database:

  • Temptation to store everything in JSON is strong: why would I even create columns in a table? Or multiple tables? Just pop everything into JSON.

  • Queries can become very slow and complicated if you do this. Constraints can't be set on parts of the JSON. That's lame.

  • Use JSON in the database very sparingly. In 98% of all cases, it's better to avoid it. Think twice before proceeding.

  • READ http://patshaughnessy.net/2017/12/15/looking-inside-postgres-at-a-gist-index

13:03 Break!
13:17 Back!

Sometimes you want to perform a similarity search.

Instead of SELECT * FROM names WHERE name = 'lorenz';
Let's create an extension with CREATE EXTENSION pg_trgm;.
The pg_trgm extension gives us the % operator that lets us do:

SET enable_seqscan = on;
SELECT * FROM TEST WHERE name % 'lorenz';

The btree index does not support the % operator. OMG.

CREATE INDEX ON test USING gin (name gin_trgm_ops); 

A trigram index supports queries like LIKE '%rie%';

Use ILIKE instead of LIKE for case-insensitivity. Neat-o!

  • trgm - trigram. All that stuff from the cryptanalysis section of Cryptography is useful!

  • READ English Letter Frequencies on practicalcryptography.com, served via http.


# SELECT show_trgm('brie');
          show_trgm
-----------------------------
 {"  b"," br",bri,"ie ",rie}
(1 row)

How cool is that! My name is a little short to make playing with trigrams very fun. (Why trigrams and not quadgrams?)

While trigrams are amazing, they are not ideal for full text search.

We see it matching for two spaces at the beginning of the string but only one at the end because matches at the beginning of the string are emphasized.

End of indexes!


Transactions, Concurrency, Locking

Consider a transaction as a group of statements bundled together, like statements within a set of parentheses.

Let's have some ACID.


  • Atomicity > all or nothing: it all succeeds or it all fails.
  • Consistency > the database goes from one consistent state to another (constraints keep holding).
  • Isolation > Pretend you are the only person working on the database.
  • Durability > Once a transaction has been committed, it sticks. We won't discuss this much.

In PostgreSQL, statements can only be run in a transaction. By default they run in autocommit mode:
if you don't explicitly start a transaction, PostgreSQL puts the statement in one for you.

This gives us interesting advantages:

Either the whole statement runs or none of it runs. If you update 20 rows and get an error (say, a network outage) before the 21st, none of the updates stick.
If you want a transaction that spans multiple statements, you have to start one explicitly:

  • START TRANSACTION;
  • BEGIN;

Transactions can not be nested. You can only have one transaction at a time. To end the transaction:

  • COMMIT;
  • ROLLBACK;

The PostgreSQL prompt will show you when you are in a transaction by updating the prompt:

postgres=# BEGIN;
BEGIN
postgres=*#

In the spirit of all or nothing:

postgres=*# SELECT 1 / 0 ;
ERROR:  division by zero
postgres=!# SELECT 42;
ERROR:  current transaction is aborted, commands ignored until end of transaction block

Once it hits a problem, the transaction is done. You have to start over.

You can use SAVEPOINT to create a subtransaction so that you can ROLLBACK to the save point instead of the entire transaction. Use SAVEPOINT but don't overuse it. Performance and concurrency can suffer. Don't do something like one SAVEPOINT for each statement.

Take a look:

postgres=# BEGIN;
BEGIN
postgres=*# SELECT 42;
 ?column?
----------
       42
(1 row)

postgres=*# SAVEPOINT alpha;
SAVEPOINT
postgres=*# SELECT 1 / 0;
ERROR:  division by zero
postgres=!# SELECT 42;
ERROR:  current transaction is aborted, commands ignored until end of transaction block
postgres=!# ROLLBACK TO SAVEPOINT a;
ERROR:  savepoint "a" does not exist
postgres=!# ROLLBACK TO SAVEPOINT alpha;
ROLLBACK
postgres=*# SELECT 42;
 ?column?
----------
       42
(1 row)

postgres=*# COMMIT;
COMMIT

Readers never block writers and writers never block readers. In our money transfer example, the "old" data is shown. The UPDATE we are running puts an exclusive lock, preventing any further writes. Locks are held until the end of the transaction. Writers can block writers.
This is an example of Isolation at play. Any subsequent writes that can't proceed due to the lock can run once the locks are removed at the end of the transaction.

There are 4 isolation levels in the SQL standard:

  • READ UNCOMMITTED No way to see uncommitted data in PostgreSQL. If you request this, you'll actually get READ COMMITTED.
  • READ COMMITTED I never see dirty data from uncommitted transactions. Committed value is always shown. This is the default in PostgreSQL.
  • REPEATABLE READ Nothing changes. I see the same values, it's like having a snapshot of the data.
  • SERIALIZABLE Completely absolutely guarantees no anomalies. Rarely used. We won't talk about it.

Use repeatable read for complicated reports to make sure you get consistent results, no matter how long it takes the report to be generated.

13:54 Break!
14:06 Back!

We might have the impression that isolation always works: it doesn't, it's an illusion. There's a good reason for the different isolation levels. The "lower" the isolation level, the fewer guarantees.

  • anomaly weird things that happen when transactions run concurrently and that would never happen if they ran independently.

There's a specific kind of anomaly called lost update anomaly. Two options for preventing this:

  • pessimistic locking Stay at the lower isolation level and take care of preventing bad things from happening yourself. You do this with pessimistic locking. Place a row lock by appending FOR UPDATE or (even better) FOR NO KEY UPDATE. Better if a conflict is likely. Tradeoff: you introduce locks which are bad for concurrency.
  • optimistic locking Pretend everything is OK. If the value is modified between when you read and when you wrote, eh, you'll have to start over. This doesn't degrade concurrency but increases the likelihood of needing to repeat a transaction.

You can do either of these but the first is preferred as it will lock less.

SELECT amount FROM accounts WHERE id = 1 FOR NO KEY UPDATE;
SELECT amount FROM accounts WHERE id = 1 FOR UPDATE;

There are other ways:

SELECT amount FROM accounts WHERE id = 1 FOR NO KEY UPDATE NOWAIT; <-- Returns an error right away instead of waiting.
SELECT amount FROM accounts WHERE id = 1 FOR NO KEY UPDATE SKIP LOCKED; <-- Won't wait but will pretend that locked rows DO NOT EXIST.

You might not use SKIP LOCKED super often.

Setting the Isolation Level

You can set the isolation level with something like:

BEGIN ISOLATION LEVEL REPEATABLE READ;

The Infamous Dead Lock

There is a deadlock checker that looks for deadlock cycles. It will cancel one of the transactions. A deadlock is a kind of serialization error.

  • Q: What should I do if I hit a deadlock?
  • A: Repeat the transaction that failed.

Two deadlocks per day: NBD. Deal with it in the application


Five deadlocks per minute: worry!

Completely avoiding deadlocks: not possible. Work to avoid/reduce.

The bow and arrow analogy:

  • Brie grabs the bow.
  • Plop grabs the arrow.
  • Brie and Plop are each waiting for the other to shoot the arrow in order to proceed.

Hmm. What if we both followed the rule "grab the bow first"? No more deadlock. Translated to the database: consider a rule like always update the account with the lower ID first.

Debug Locks

Recall that locks are always held until the end of the transaction. You might be running into lots of locks because your transactions are too long.

There are two kinds of databases:

  • Those with short transactions
  • Those with trouble.

Keep transactions short.
Keep transactions small. Don't modify more in a transaction than is necessary for consistency.

Let's learn more:

postgres=# SELECT * FROM pg_stat_activity WHERE datname IS NOT NULL;
postgres=# SELECT * FROM pg_locks;

Look for locks where granted is f.

From that first query, we get process IDs!

 datid | datname  | pid  | leader_pid | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_sta
-------+----------+------+------------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+------------------
 13395 | postgres | 3734 |            |       10 | postgres | psql             |             |                 |          -1 | 2021-08-31 17:42:07.755783+00 | 2021-08-31 18:28:
 13395 | postgres | 3782 |            |       10 | postgres | psql             |             |                 |          -1 | 2021-08-31 17:48:58.460694+00 | 2021-08-31 18:28:
 13395 | postgres | 3932 |            |       10 | postgres | psql             |             |                 |          -1 | 2021-08-31 18:29:19.932291+00 |
 13395 | postgres | 3968 |            |       10 | postgres | psql             |             |                 |          -1 | 2021-08-31 18:31:18.042713+00 | 2021-08-31 18:32:
(4 rows)

Get more info with SELECT pg_blocking_pids(3968);.

The blocking transaction must be terminated. There are two ways to do this!

  • SELECT pg_cancel_backend(PID); This is when you see ERROR: canceling statement due to user request.
  • SELECT pg_terminate_backend(PID); That kills the whole session. psql will try (and probably succeed) to reconnect.

The notes above are to get you out of a bind. The right thing is to fix the application so it doesn't make you have to do this.

Transactions

You can prevent long-running transactions but be super careful. Observe:

postgres=# SET idle_in_transaction_session_timeout = '2s';
SET
postgres=# BEGIN;
BEGIN
postgres=*# SELECT 42;
FATAL:  terminating connection due to idle-in-transaction timeout
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

Consider a value like 1 minute but IDK, up to you!

Use statement_timeout to prevent runaway queries.
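For example (the value is just an illustration):

SET statement_timeout = '30s';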

MultiVersioning

postgres=# SELECT ctid, xmin,xmax, * FROM accounts;
  ctid  | xmin | xmax | id | name |  amount
--------+------+------+----+------+----------
 (0,11) |  561 |    0 |  1 | Plop |  9800.00
 (0,12) |  562 |  563 |  2 | Brie | 20200.00
(2 rows)

The ctid is the current tuple id, the physical storage location in the table for that row:

Plop is the eleventh entry in the first block.

An UPDATE doesn't really update, it makes a copy. Observe the change in ctid:

postgres=*# SELECT ctid, xmin, xmax, * FROM accounts;
  ctid  | xmin | xmax | id | name |  amount
--------+------+------+----+------+----------
 (0,11) |  561 |    0 |  1 | Plop |  9800.00
 (0,14) |  566 |    0 |  2 | Brie | 20300.00
(2 rows)

postgres=*# UPDATE accounts SET amount = amount + 100 WHERE id = 2;
UPDATE 1
postgres=*# SELECT ctid, xmin, xmax, * FROM accounts;
  ctid  | xmin | xmax | id | name |  amount
--------+------+------+----+------+----------
 (0,11) |  561 |    0 |  1 | Plop |  9800.00
 (0,15) |  566 |    0 |  2 | Brie | 20400.00
(2 rows)

I missed it a bit, but xmin and xmax give us info about which transaction introduced a row version and which invalidated it. There will be multiple versions of a row in a table. Check my screenshot from 14:46. Deleting a row doesn't remove it, it just sets the xmax. These old versions of the row will eventually need to be deleted. We'll talk about garbage collection tomorrow. Doing it at commit time would be too soon. Recall how REPEATABLE READ works.

  • Q: Can one query for the invisible rows?
  • A: No, not really.
CREATE EXTENSION pageinspect;

A low-level extension that lets you read the raw data. Useful for debugging data corruption:

SELECT * FROM heap_page_item_attrs(get_raw_page('accounts', 0), 'accounts');

Vacuuming the table cleans up the multi-versioned rows. Each version of a row has its own entry in the index. You can wind up with duplicate entries in a unique index, but only one row version will be visible. Indexes don't have the xmin and xmax information.

Questions

  • Q: Is it performant to query pg_stat_activity?

  • A: Yes. That view is not indexed; the data is stored in memory. No I/O pain.

  • Q: I have a query taking too long, locking everything up. How do we find the culprit query?

  • A: Looking for a hanging backend or a hanging query. Use the pg_blocking_pids function, find who is blocking that backend.

    • Q: What if I want to get to the root of the tree?
    • A: No ready-made answer. You have to find it yourself. It's usually a queue, not a tree.

Table Locks

Let's learn about table-level locks. Take a look at Table 13.2. Conflicting Lock Modes. If the cell has an X, those locking modes are not compatible. For example ROW EXCLUSIVE locks are compatible with one another and that's why we can INSERT multiple times in the same table concurrently.

Avoid SELECTing from a table at the same time as DROPping it; the locks conflict.

Day 3: Memory and cache management / Backup and restore

Today, we will discuss:

  • vacuuming
  • logical backup
  • persistence (esp during crash)

The visibility info is in the table, the index does not know about it.

Vacuum:

  • scan table, identify rows you want to get rid of
  • scan index, remove corresponding index entries
  • return to heap and remove offending rows

Illustration to remember:

postgres=# SELECT ctid,xmin,xmax,* FROM accounts;
 ctid  | xmin | xmax | id | name |  amount
-------+------+------+----+------+----------
 (0,5) |  490 |    0 |  1 | brie |  9800.00
 (0,6) |  491 |    0 |  2 | plop | 10200.00
(2 rows)

postgres=# BEGIN;
BEGIN
postgres=*#  UPDATE accounts SET amount = amount - 100 WHERE id = 1;
UPDATE 1
postgres=*# UPDATE accounts SET amount = amount + 100 WHERE id = 2;
UPDATE 1
postgres=*# COMMIT;
COMMIT
postgres=# SELECT ctid,xmin,xmax,* FROM accounts;
 ctid  | xmin | xmax | id | name |  amount
-------+------+------+----+------+----------
 (0,7) |  492 |    0 |  1 | brie |  9700.00
 (0,8) |  492 |    0 |  2 | plop | 10300.00
(2 rows)

Run VACUUM (VERBOSE) accounts; for more info:

postgres=# VACUUM (VERBOSE) accounts;
INFO:  vacuuming "public.accounts"
INFO:  "accounts": found 6 removable, 2 nonremovable row versions in 1 out of 1 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 493
There were 4 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  vacuuming "pg_toast.pg_toast_16384"
INFO:  "pg_toast_16384": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 493
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

No table row can be bigger than 8K because everything is stored in 8K blocks and a row must fit within a single block.

OK, what if I have column values bigger than 8K? Enter the TOAST technique.

When a table is created with at least one column of a variable-width data type, a TOAST table is created to store oversized values.

When a row is added:

  • if the table row is less than roughly 2000 bytes, it is stored as is.
  • if it is larger, the large values are compressed. If the compressed row fits under roughly 2000 bytes, it is stored compressed.
  • if the compressed row is still bigger than roughly 2000 bytes, the large values are chunked and stored in the TOAST table. The main table contains a reference to the TOAST table.

The TOAST machinery happens behind the scenes; reading the values back out again is called detoasting.

postgres=# CREATE TABLE vactest (id integer);
CREATE TABLE
postgres=# SELECT * FROM generate_series(1, 10);

generate_series() is a set-returning function, a table function. It behaves like a table:

postgres=# SELECT * FROM generate_series(1, 4);
 generate_series
-----------------
               1
               2
               3
               4
(4 rows)

So cool!

INSERT INTO vactest SELECT * FROM generate_series(1,100000);

Modular arithmetic is useful:

DELETE FROM vactest WHERE id % 2 = 0;

Vacuuming Example

Do this:

postgres=# BEGIN;
BEGIN
postgres=*# COMMIT;
COMMIT
postgres=# BEGIN ISOLATION LEVEL REPEATABLE READ;
BEGIN
postgres=*# SELECT * FROM accounts;
 id | name |  amount
----+------+----------
  1 | brie |  9700.00
  2 | plop | 10300.00
(2 rows)

Delete 50,000 rows and then observe that vacuum can't get rid of the old row versions. (They might be needed by the other transaction, although the tables are unrelated.)

postgres=# DELETE FROM vactest WHERE id % 2 = 0;
DELETE 50000
postgres=# VACUUM (VERBOSE) vactest;
INFO:  vacuuming "public.vactest"
INFO:  "vactest": found 0 removable, 100000 nonremovable row versions in 443 out of 443 pages
DETAIL:  50000 dead row versions cannot be removed yet, oldest xmin: 495
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

Once we commit the transaction, vacuum can do its thing.

postgres=# VACUUM (VERBOSE) vactest;
INFO:  vacuuming "public.vactest"
INFO:  "vactest": removed 50000 row versions in 443 pages
INFO:  "vactest": found 50000 removable, 50000 nonremovable row versions in 443 out of 443 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 498
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

Long-running transactions in PostgreSQL are doubly bad:

  • they can block other transactions
  • they prevent vacuum from removing dead rows

Auto Vacuum

The autovacuum launcher runs in the background. It awakes periodically, checks for whether any tables need to be vacuumed and sends a worker to perform the vacuuming if that's the case. Vacuum is designed to not create problems: no problem with concurrent data modifications. Vacuum won't block this. It's still resource-intensive (lots of reads, possibly some writes, CPU time). Auto vacuum is designed to be gentle.

Vacuum removes the data but doesn't shrink: there's just more free space in the 8K block. The empty space is not fragmented. Empty space is always in one spot and ready for reuse with the next INSERT or UPDATE.

  • Vacuum isn't for freeing up disk space.

Configuring Autovacuum

Always keep autovacuum on.

Reduce resource usage by autovacuum

Modify autovacuum_max_workers to address load due to autovacuum. Additionally, autovacuum_vacuum_cost_delay and autovacuum_vacuum_cost_limit also help to make sure autovacuum is not too intensive.

When will autovacuum run?

This formula determines whether autovacuum runs: autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × number of rows < number of dead rows.

By default: if 20% of a table is dead rows, autovacuum runs.
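
For example, with the defaults (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2), a table with 1,000,000 rows is autovacuumed once it accumulates more than 50 + 0.2 × 1,000,000 = 200,050 dead rows.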

More Problems with Autovacuum

Imagine you update like nobody's business: you could be creating dead rows more frequently than autovacuum can take them out. You'd see autovacuum running all the time. (It's going too slowly.)

  • Q: What would you tune to address this?
  • A: Reduce autovacuum_vacuum_cost_delay (down to 0 if needed) or increase autovacuum_vacuum_cost_limit.

Each worker can only work on one table at a time. You can change these settings in the config file for all tables in all databases, but importantly: YOU CAN CHANGE THESE SETTINGS ON A PER-TABLE BASIS.


ALTER TABLE test SET (autovacuum_vacuum_cost_delay = 1);

The bit above tells autovacuum on just that table to take shorter breaks!
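
To see which tables carry per-table overrides like that (a small sketch):

SELECT relname, reloptions FROM pg_class WHERE reloptions IS NOT NULL;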

AUTO ANALYZE

ANALYZE is part of autovacuum and has similar bits that can be toggled.

ANALYZE is much less resource-intensive than autovacuum. Make sure it's running often enough and you're good. The default is usually good except for things like:

  • Each row you add has a newer timestamp, so the statistics for the most recent rows will be out of date. You have an inaccuracy on exactly the rows you need the most if you are querying recent data. This is when you might want ANALYZE to run more often. Recall, you can do this on a per-table basis.
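
For example, a hypothetical per-table tweak so analyze kicks in after roughly 1% of rows change instead of the default 10% (table name made up):

ALTER TABLE measurements SET (autovacuum_analyze_scale_factor = 0.01);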

Taking a step back: this seems obvious but it's really important to understand your data, your application and how they are used to make the best decisions.

Index Only Scan

Imagine a query like SELECT id FROM tab WHERE id < 1000. If we have an index based on the id: we don't even need to go to the table and bear the pain of fetching rows from the table. That info is right in the index.

but

It's not that simple. The index doesn't know about the visibility information of an entry in the index. (The xmin and xmax stuff.)

Because we have to fetch the table row to get the xmin and xmax info, doing an index only scan is not possible, or hard (we'll see).

To solve this problem, PostgreSQL introduces the visibility map:

There are 2 bits for each 8K block; one bit is "all visible". When it's set to 1, it means that all rows in that 8K block are visible to all transactions.

The index only scan finds an index entry, which tells it which block the row lives in. It looks at the visibility map (which is in memory) and checks the "all visible" bit for that block. This is how we increase the efficiency of index only scans.

The vacuum process creates and maintains the visibility map. For efficient index only scans, make sure that vacuum runs frequently enough.

postgres=# CREATE TABLE test2 (LIKE accounts) WITH  (autovacuum_enabled = off);
CREATE TABLE
postgres=# INSERT INTO test2 SELECT * FROM accounts;
INSERT 0 2
postgres=# ANALYZE test2;
ANALYZE
postgres=# CREATE INDEX ON test2 (id);
CREATE INDEX
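
A sketch of how you might observe the effect (on a two-row table the planner prefers a sequential scan, so we nudge it; exact plan output varies):

SET enable_seqscan = off;
EXPLAIN (ANALYZE) SELECT id FROM test2 WHERE id < 2;   -- look for "Index Only Scan" and its "Heap Fetches" count
VACUUM test2;                                          -- builds the visibility map (autovacuum is off for this table)
EXPLAIN (ANALYZE) SELECT id FROM test2 WHERE id < 2;   -- Heap Fetches should now be 0
RESET enable_seqscan;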

Vacuum makes the table emptier but not smaller.

What if I want to shrink the table?

Use VACUUM (FULL) vactest;.

Note that VACUUM (FULL) is completely different from VACUUM. It makes a copy, removes the old stuff, and moves the new table in place of the original.

Running VACUUM (FULL) means downtime: it takes an ACCESS EXCLUSIVE lock. Consider this to be an emergency measure. Don't run it out of cron.

Because it makes a copy, don't use VACUUM (FULL) when you are on the verge of out of disk space and looking to free some up.

  • Q: What happens if you run out of disk space during a VACUUM (FULL)?
  • A: You'll get an I/O error, which will terminate the statement and roll it back. The new file is deleted and you'll be back to how it looked beforehand. One exception: if PostgreSQL runs out of space while creating a transaction log file, it will crash. We'll learn more later. A crash encountered during a VACUUM (FULL) will leave you with orphaned files which are hard to identify. Dump and restore becomes your best option.

Squeeze

Our tables are too big for us to do VACUUM (FULL), so we use pg_repack. Laurenz recommends pg_squeeze, which is nicer: it lives entirely in the database (no client).

pg_squeeze - builds compacted version in the background, grabs an access exclusive lock where possible, catches up on updates and shoves the new compacted version into place. Read more about pg_squeeze. This should be avoided if possible. Don't design a situation where you need to do this.

Other things to do instead:

  • Give autovacuum more memory.

  • Consider HOT updates so you don't need to run vacuum so frequently.

  • READ about TOAST on the PostgreSQL wiki

12:04 Break!
12:14 Back!

Is my table bloated? Do I need to take action?

  • Is it growing inordinately?
  • What does pg_stat_user_tables say?
SELECT * FROM pg_stat_user_tables WHERE relname = 'vactest';
\x

The \x turns on Extended display, which is beautiful:

You might have lots of dead tuples but vacuum just ran.

Do you want the full truth?


Use pgstattuple:

postgres=# CREATE EXTENSION pgstattuple;
CREATE EXTENSION
Time: 8.281 ms
postgres=# SELECT * FROM pgstattuple('vactest');
-[ RECORD 1 ]------+---------
table_len          | 21749760
tuple_count        | 150000
tuple_len          | 4200000
tuple_percent      | 19.31
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 14474892
free_percent       | 66.55

Time: 16.186 ms

Run VACUUM vactest; and then SELECT * FROM pgstattuple('vactest'); again.

Note that pgstattuple is expensive because it does a sequential scan of the table. Consider SELECT * FROM pgstattuple_approx('vactest'); instead: it takes a sample of the table and gives you an estimate. (Don't run it every 10 minutes; maybe daily.)

There are other tools that aim to estimate bloat. They are less expensive, but still check their results against pgstattuple now and then.

Logical Backup

Our friend pg_dump comes in:

# pg_dump -d database_name > backup.sql
# psql -d copy -f backup.sql

This is simple. A few downsides:

  • Requires a per-database approach
  • Misses global info (like users)
  • Doesn't capture configuration files

Use pg_dumpall to dump the entire cluster (database instance). Intended for use with upgrades.

Both pg_dump and pg_dumpall are client-side tools. Run pg_dump somewhere other than the database machine. (I mean, you can keep your backups on the server. You do you)

pg_dumpall > hey.sql

Use pg_dumpall -g to get just the global info. This may be a good complement to a single database backup with pg_dump.
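
A sketch of that combination (file names are just examples):

pg_dumpall -g > globals.sql          # roles, tablespaces and other globals
pg_dump -F c -f course.dmp course    # one database, custom format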

  • Q: How consistent is a dump? Data could be modified during this time.
  • A: Thankfully, pg_dump uses a read-only repeatable read transaction. No matter how long it takes, pg_dump will always get a consistent snapshot.

Table-wise Restore

I want to restore "less" than the entire database.

Dump just a table: pg_dump -t accounts course

Use -s to get the schema. There are lots of options to pg_dump worth exploring.

Use pg_dump -F c -f course.dmp course to tell pg_dump to use the custom format. It's a binary format. Use pg_restore to work with this file.

postgres@4b6f50d57295:~$ pg_dump -F c -f howdy.dmp copy
postgres@4b6f50d57295:~$ file howdy.dmp
howdy.dmp: PostgreSQL custom database dump - v1.14-0

Use something like pg_restore -d copy howdy.dmp.

Ooooh! You can restore a .dmp to a file!!!

You can generate a .sql from a custom dump. This is so cool:

pg_restore -f - howdy.dmp  | less

With the custom format dump, the data are compressed.

Maybe always use the custom format and convert to .sql if you need to. With a custom format, you can restore only part of what you dumped.

pg_restore -f - -t accounts course.dmp

To only restore the schema: pg_restore -f - -s course.dmp

Alternately, you can use the directory format: pg_dump -F d -f course.dir course

We have looked at a few formats for backups:

  • .sql > What we are all used to.
  • custom format > can't dump in parallel
  • directory format > can use parallel jobs, possibly the coolest format, IMO.
  • tar format > no advantages over the others

The directory format enables the -j flag, which lets you have multiple dump/restore processes.

postgres@4b6f50d57295:~$ pg_dump -F d -f backup copy
postgres@4b6f50d57295:~$ tree backup/
backup/
|-- 2965.dat.gz
|-- 2966.dat.gz
|-- 2967.dat.gz
|-- 2969.dat.gz
|-- 2970.dat.gz
`-- toc.dat

0 directories, 6 files
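
A sketch of parallel dump and restore with the directory format (the target database copy2 is hypothetical and must already exist):

pg_dump -F d -j 4 -f backup_dir copy
pg_restore -j 4 -d copy2 backup_dir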

So, pg_dump is great: simple, good but there's one decisive disadvantage:

  • It only captures one point in time. (You lose everything between when you took the backup and when the bad thing happened.)

How PostgreSQL Handles Persistence

This is useful background so we can understand how alternative methods of backing up PostgreSQL work.

Below the line: persistence world, disk
Above the line: memory and processes

There are data files on disk that persist the tables: 8K blocks in files containing data.

Above the line we have a cache (shared buffers in Postgres). Used to cache blocks for all databases, shared by all db processes. We also have a worker backend process working on behalf of a client connection to execute queries.

  • We want to read.
  • Block is copied up to shared buffers, read and returned.

A contradiction, conflicting interests:

  • We want to cache writes but we also want Durability.

A simple (but poor) approach: write out all involved tables when I COMMIT;. It works, but performance is poor.

How do we resolve this?

There's a single good architecture for this, and it goes by various names.

Below the line, we have the Write Ahead Log (WAL). This file (on disk) is written to sequentially and contains a log of all changes that have happened. All DBs have this (also called transaction log, binary log, redo log).

You must always write to the log first and then modify the data. This is low-level information (not the SQL statement): write whatever it takes to modify the files from before to after.

Using WAL

  1. backend writes to WAL buffers (in memory)
  2. backend is then free to modify the block
  3. WAL writer takes info from WAL buffers to Write Ahead Log (This could come above step 2.)
    a. Confirm WAL info is persisted, then COMMIT can proceed
  4. Way after COMMIT;, the dirty blocks in shared buffers are written down to disk (to the data files, not the WAL). The checkpointer and background writer take care of this.

Writing to the Write Ahead Log does not necessarily mean that the info has been written to disk because PostgreSQL uses buffered I/O; the info could be with the kernel. We use fsync at COMMIT; time to force WAL down to disk.


A checkpoint is a point in time when all dirty info in shared buffers is written down to disk and cleaned. This is marked in the WAL, so PostgreSQL knows anything older than this sequence number has already been written to disk. You only need to replay everything that came after the latest checkpoint. (This is how crash recovery works.)

Most non-WAL activity is written to disk by the checkpointer. After the checkpointer has completed, buffers are clean. As time passes, buffers get dirty again.

Clean buffers are vital for good performance. If there are none, the backend must write a dirty buffer down and then read into the now-clean buffer in order to get its query handled. OMG. We introduce the background writer for this.

The background writer looks around in the dirty buffers, cleans them, and aims to make sure we have enough clean buffers in between checkpoints so no one has to wait for a clean buffer.

An overeager background writer means more writes than are necessary: no thanks!

The background writer goes front to back through the entire shared buffer, slowly, over time. It shouldn't do any more work than is required to make sure buffer is clean when needed. (The background writer is not overly important. )

Terminology

We'll use page, block and buffer interchangeably but we'll always mean those 8K blocks. When on disk, it's a block; when in memory, a page or buffer.

12:59 Break!
13:17 Back!

Data Directory Backups

The name of a table isn't the real name; the object ID is the right identifier. (The table name can be changed.)

SELECT oid FROM pg_class WHERE relname = 'ts';

Try SELECT oid, relfilenode FROM pg_class WHERE relname = 'ts';

A VACUUM (FULL) will cause the relfilenode to change (the oid stays the same).

postgres=# SELECT oid,datname FROM pg_database;
  oid  |  datname
-------+-----------
 13395 | postgres
     1 | template1
 13394 | template0
 16470 | copy
(4 rows)
$ pwd && ls
/var/lib/postgresql/13/main/base
1  13394  13395  16470  pgsql_tmp

The _vm files inside 16470 are the visibility maps.

You can't just copy around the contents of that data directory and expect the right things to happen.

The pg_control file contains metadata about the database. It's useful for crash recovery. The pg_controldata command is used to read pg_control.

Ack, I don't have pg_controldata here, but this is what would work:

pg_controldata -D /var/lib/postgresql/13/main/
-bash: pg_controldata: command not found

Let's look at the write-ahead log in /var/lib/postgresql/13/main/pg_wal. It's a bunch of files, not just one. One of those files is active; you can identify it by looking for the most recent timestamp.

There are other WALs for future use, these are reserves to improve performance. When the active WAL is full, we switch to the next one and write to it.

A WAL segment that is older than the latest checkpoint is useless. At the end of a checkpoint, we delete unneeded WAL segments. The database will crash if it hits an I/O error while writing the next WAL segment.

In production, consider having pg_wal on a different filesystem. While the database is down, move the data and symlink to the new place.

The default WAL segment size is 16 MB. If you have lots of data modification, you'll have lots of WAL switches. Not expensive, but you may wish to increase the size of WAL segments.

Use pg_waldump to take a look. So freaking cool!

​root@4b6f50d57295:~# /usr/lib/postgresql/13/bin/pg_waldump /var/lib/postgresql/13/main/pg_wal/00000001000000000000002C  | tail
pg_waldump: fatal: error in WAL record at 0/2C774E98: invalid record length at 0/2C774ED0: wanted 24, got 0
rmgr: Heap2       len (rec/tot):     56/    56, tx:          0, lsn: 0/2C774C40, prev 0/2C774C00, desc: CLEAN remxid 0, blkref #0: rel 1663/16470/1255 blk 44
rmgr: Heap2       len (rec/tot):     56/    56, tx:          0, lsn: 0/2C774C78, prev 0/2C774C40, desc: CLEAN remxid 0, blkref #0: rel 1663/16470/1255 blk 67
rmgr: Heap2       len (rec/tot):     56/    56, tx:          0, lsn: 0/2C774CB0, prev 0/2C774C78, desc: CLEAN remxid 0, blkref #0: rel 1663/16470/1255 blk 71
rmgr: Heap2       len (rec/tot):     62/    62, tx:          0, lsn: 0/2C774CE8, prev 0/2C774CB0, desc: CLEAN remxid 0, blkref #0: rel 1663/16470/1255 blk 76
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 0/2C774D28, prev 0/2C774CE8, desc: CLEAN remxid 0, blkref #0: rel 1663/16470/1255 blk 77
rmgr: Heap2       len (rec/tot):     70/    70, tx:          0, lsn: 0/2C774D68, prev 0/2C774D28, desc: CLEAN remxid 0, blkref #0: rel 1663/16470/1255 blk 80
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/2C774DB0, prev 0/2C774D68, desc: RUNNING_XACTS nextXid 590 latestCompletedXid 589 oldestRunningXid 590
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/2C774DE8, prev 0/2C774DB0, desc: RUNNING_XACTS nextXid 590 latestCompletedXid 589 oldestRunningXid 590
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/2C774E20, prev 0/2C774DE8, desc: CHECKPOINT_ONLINE redo 0/2C774DE8; tli 1; prev tli 1; fpw true; xid 0:590; oid 24576; multi 1; offset 0; oldest xid 478 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 590; online
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/2C774E98, prev 0/2C774E20, desc: RUNNING_XACTS nextXid 590 latestCompletedXid 589 oldestRunningXid 590

An online backup creates a checkpoint from which you recover. WAL segments are useful for recovering from an old checkpoint. You can recover everything but you don't need to.

Let's get started.

  1. Archive transaction logs.
#wal_level = replica                    # minimal, replica, or logical

minimal is enough for crash recovery.

Do not turn off fsync. Yes, your database is faster but you are toast (not in the good way) if it crashes.


Let's try something:

postgres@4b6f50d57295:~$ time psql  -d copy  -q -f insert.sql

real    0m12.576s
user    0m0.433s
sys     0m0.909s

This takes forever because fsync runs after each commit. PostgreSQL is running in auto-commit mode, so it's flushing after every statement.

insert.sql is just 30,001 INSERT statements. Put BEGIN; at the beginning and COMMIT; at the end and observe the improvement:

:~$ time psql  -d copy  -q -f insert.sql

real    0m3.836s
user    0m0.478s
sys     0m0.872s

Much better!


Let's look at synchronous_commit to see what impact it has. With it off, not every commit syncs to disk. It is still safe to use because syncs will occur: the WAL is still synced/flushed to disk periodically (just not automatically with every commit).

Let's set it

# grep synchronous_commit /etc/postgresql/13/main/postgresql.conf
synchronous_commit = off                # synchronization level;

Let's use pg_ctl reload to tell the daemon to check for a new config.

OK, I removed the single transaction from insert.sql and the performance with synchronous_commit = off is much better:

time psql  -d copy  -q -f insert.sql

real    0m4.033s
user    0m0.365s
sys     0m0.660s

However, we don't want to lose data so let's turn synchronous_commit = on again. Let's consider commit_delay and commit_siblings instead.

Check for other sibling transactions and sync them together. This means fewer syncs. Commits will take longer (a commit won't return until the flush is done). It's tough to tune commit_delay and commit_siblings well, so it won't be as blazingly fast as with synchronous_commit = off, but still better.

Don't change full_page_writes = on. Partial pages are difficult to recover from. These are called torn pages. You may wish to compress the full-page writes: IDK, wanna trade CPU time for I/O? You must make a decision somewhere.

Checkpoints :checkmark:

If more than max_wal_size has been written since the last checkpoint, PostgreSQL triggers a checkpoint. Additionally, a checkpoint is triggered when checkpoint_timeout elapses.

See checkpoint_completion_target. PostgreSQL can spread the checkpoint process out. With the default of 0.5, every checkpoint should be done halfway to the next checkpoint. It's customary to increase this parameter, and with 14 the default will be increased to 0.9. You'll have checkpoint activity spread over more time (up until 90% of the way to the next checkpoint). If you have checkpoints every 10 minutes:

  • old default: a checkpoint can't take more than 5 minutes
  • new default: up to 9 minutes
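
Purely illustrative postgresql.conf values, just to show where these knobs live:

max_wal_size = 2GB
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9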

I hope that helps!

13:51 Break!
14:03 Back!

Great filesystem-level backup:

  • Stop server
  • Copy data directory
  • Start server

But I don't want to stop the server!

Fair!

Let's configure PostgreSQL to archive WAL so we can back up without stopping the server.

Set archive_mode = on; this requires restarting the database.

archive_command is nice but scary. You just put any command there and go.

archive_mode = on               # enables archiving; off, on, or always
                                # (change requires restart)
#archive_command = ''           # command to use to archive a logfile segment
                                # placeholders: %p = path of file to archive
                                #               %f = file name only
                                # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
archive_command = 'cp %p /var/lib/postgresql/walarchive/%f'             # command to use to archive a logfile segment

PostgreSQL only checks the return code. You could add /bin/true and PG would be like, yeah, great job, archiving completed.

Archival failure does not mean that the segment is deleted. PG will keep trying, back off, try again, it won't just skip and try for the next one. Why? We need an unbroken stream of WAL segments.

Activate Extended display for a single command with:

SELECT * FROM pg_stat_archiver  \gx

Yep, instead of the ;.

  • TODO Take screenshots from when I broke and fixed the WAL archiver stuff.

pg_basebackup -c fast -D backup

Copy files from server to client. Cool, but OK. Some things will be missing, like postmaster.pid (I don't care about the pid from the server). The results:

In addition, we have some new files like backup/backup_manifest. This is used by pg_verifybackup. We also have backup/backup_label.

We could just start up from the contents of pg_basebackup, but things will change while pg_basebackup is running. Don't do it, please; for you, not for me.


Anything after the WAL LOCATION and CHECKPOINT LOCATION could be consistent, but might not be. That's why we then recover things (with WAL) until we get to a consistent point. Thankfully, we can choose a specific point in time and restore to it. So, 3 hours after the pg_basebackup ran, or 5 hours after.

The CHECKPOINT LOCATION tells us where in the WAL the checkpoint is located.

We recover from the WAL LOCATION.

  • Q: When is the earliest point that we can safely stop recovering?
  • A: The time the online backup process ended. (You'd restore the backup and then replay WAL for everything after that; in our example the backup finished around 12:04.) You can continue recovering after that point, but you don't have to.

^^^ This was tricky.

pg_dump is granular; with pg_basebackup it's all or nothing for backups and restores. Sorry!

We can specify restore_command to tell PG what command to use to restore.

With no Recovery Target info, it'll keep recovering until either there is no file or it encounters garbage. Let's specify a Recovery Target.

Setting it to immediate says: stop recovering as soon as the backup is consistent. Huh?

Set recovery_target_time = '2020-09-01 14:20:00' to tell it the timestamp to recover up to.

Set recovery_target_action. If you set it to pause, it'll bring the DB up to recovery mode so you can check the recovery looks OK. Otherwise, set it to promote to just kick it into production.

recovery.signal > is it gone? Then recovery has finished and everything is OK.
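
A sketch of what the recovery settings might look like in postgresql.conf (values illustrative; the archive path matches the archive_command above):

restore_command = 'cp /var/lib/postgresql/walarchive/%f %p'
recovery_target_time = '2021-09-01 14:20:00'
recovery_target_action = 'pause'
# plus: touch recovery.signal in the data directory before starting the server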

The recovery process will rename backup_label to backup_label.old.

Cool! I did the rm too soon, so here's how the recovery looked on his machine.

pg_dump vs pg_basebackup

pg_dump is more flexible, but it has to do sequential scans and a bunch of other work, so it's slower than pg_basebackup.

pg_basebackup is generally faster than pg_dump, but produces a larger backup [bloat, indexes (in pg_dump an index is just a CREATE INDEX statement, but with pg_basebackup it's the whole physical index)]. It's single-threaded, so it's no good with huge databases (10TB).

Parallel Universes

Restore to a certain point: a new incarnation of the database is created just before the bad thing you want to recover from happened. This is how you get timelines 1 and 2.

The timeline is part of the name of the WAL segments.

Observe:

Say TL2 was created too soon. TL3 recovers later on:

Each recovery attempt creates a new timeline. There is always only one current timeline. When recovering with latest, always follow the path to the newest timeline; with current, stay on the current timeline. Or specify a timeline ID (follow each fork to get to this timeline). You shouldn't need to worry too much about this unless you are in a really bad place. Time to call or hire a DBA?


The .history files describe the history of a particular timeline.

There was a branch at this point and we created timeline 2:


There are existing non-core tools for automating elements of this. Laurenz recommends pgBackRest. See more at pgbackrest.org. You have a dedicated pgBackRest server, tell postgresql.conf a command to talk to that server. It handles retention as well.

Add this above where it belongs:

pg_stat_user_tables has statistical data that accumulate over time. It is cheap to query. pgstattuple takes a momentary snapshot of the table and scans it. That is something different, and it is expensive. But of course nobody keeps you from joining both in a query to present a combined result.

  • Laurenz Albe

TEST YOUR BACKUPS. OMG.

  • Q: A customer experiences corruption when running a SELECT. They don't have a backup. What options are available?
  • A: It depends. First, make a faithful copy of the files and experiment on the copy. Find someone who knows a lot about PostgreSQL. It won't be fun. Hire an expert or cut your losses.

We probably won't cover Replication. A few words:

  • Replication is basically just recovery. The differences of note:
    • It doesn't stop recovering, it just waits for more.
    • WAL archives are not restored; they are fetched straight from the main server.

Say PostgreSQL won't start because you are out of disk space. The correct answer is to increase disk space.

Before major version 10, pg_wal used to be pg_xlog (transaction log), and pg_xact used to be called pg_clog (commit log).

People would think these were just unnecessary log files and delete them. YIKES! :scared:

Day 4: Replication / High Availability / "Complex" Architectures

When recovering, WAL is retrieved from the archive. You could do replication that way but it would be super annoying. Instead, enter streaming replication.

Replication is recovery (of a type) that never ends.

Streaming replication introduces two processes:

  • WAL sender (on the primary)
  • WAL receiver (on the standby)

When something hits the WAL buffers, it goes to the WAL sender, is received by the WAL receiver, and is then written to the WAL on the standby. First to the WAL, and then to the data, remember.

The startup process that helps with recovery replays the data into the standby. This is a fundamentally async process.

WAL sender and WAL receiver

The standby acts as a client to the primary. PG has two protocols, both on 5432, it seems:

  • QUERY
  • replication > optimized for streaming from primary to standby, used by pg_basebackup

The config on the primary and the secondary should be as similar as possible.

pg_createcluster standby
rm -rf /var/lib/postgresql/13/standby/
pg_basebackup -D 13/standby   # replace the freshly created data directory with the pg_basebackup output

Modify postgresql.conf.

max_wal_senders > more than 0.

wal_level can not be minimal, must be replica or logical.

Configure primary_conninfo = '', this is a connection string.

With the streaming replication protocol, you connect to the cluster, no database needed.
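
A hypothetical entry in the standby's postgresql.conf (adjust host, port and user to your setup):

primary_conninfo = 'host=localhost port=5432 user=postgres application_name=standby1'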

touch 13/standby/standby.signal

Replicate, and then stay in recovery mode until I tell you to promote.

systemctl start postgresql@13-standby

Observe walsender and walreceiver processes:

sent_lsn > last WAL position sent to the standby
write_lsn > position written to disk by the standby
flush_lsn > what's flushed to disk and definitely safe on the standby
replay_lsn > what position in the WAL has already been replayed and is visible on the standby server
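
These columns come from pg_stat_replication on the primary; a minimal sketch:

SELECT application_name, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;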

Log Sequence Numbers (LSN)

In a WAL segment file name: the first 8 hex digits denote the timeline; the second 8 hex digits correspond to the part before the slash in an _lsn value.

In an _lsn value, the first two hex digits after the slash are the last two digits of the WAL segment name; the remaining six digits are the offset within that segment.

SELECT pg_current_wal_lsn(); to get the current LSN.
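
A small sketch tying an LSN to its WAL file name (example values borrowed from the pg_waldump output above):

SELECT pg_current_wal_lsn();                      -- e.g. 0/2C774E98
SELECT pg_walfile_name(pg_current_wal_lsn());     -- e.g. 00000001000000000000002C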

psql has watch!

SELECT * FROM whatever \watch 1

Measure replication lag in bytes, not time. Time is unreliable for this purpose.

docker run --name postgres_training \
-e POSTGRES_PASSWORD=mysecretpassword \
-d postgres:13

Stop the standby.

There are checkpoints (actually called restart points) on a recovering database. This makes an interrupted recovery less annoying. Resume recovery from that point, rather than the beginning. This is clearly essential for replication (it super needs to recover from where it left off, not the beginning, where possible).

CREATE TABLE accounts (id bigint PRIMARY KEY, name text NOT NULL, amount numeric(15,2) NOT NULL);

If you delete a WAL segment needed by the secondary (please don't), rebuilding the secondary becomes an option.

The restore command can not be changed without restarting the server.

Use wal_keep_size to retain otherwise-unneeded WAL so it stays available for the standby, making things easier if the standby falls too far behind. This is usually unnecessary. If standby downtime is longer than wal_keep_size accounts for, you'll still have trouble. Enter the replication slot. It marks a position in the WAL stream. The primary doesn't otherwise have any memory of the standby. (It's essentially a client.)

The primary should not discard any WAL younger than the replication slot. The replication slot is an indicator of how much WAL the secondary has consumed.

SELECT pg_create_physical_replication_slot('slot');
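
On the standby, a sketch of pointing it at that slot (postgresql.conf):

primary_slot_name = 'slot'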

Consider max_slot_wal_keep_size > it overrides the replication slot to avoid disk space piling up. It tells PostgreSQL to nuke that WAL even if the standby has not caught up.

With synchronous replication, COMMIT does not return until the synchronous standby has received the WAL. Doing this across continents: everything will take forever. This only works when the primary and the standby are close together in the network. The other downside of synchronous replication: the availability of the entire system goes down. A failure on the standby means a failure on the primary.

Use synchronous replication if you can not afford to lose a single transaction (fail over occurs before sync completes). Most people are using async.

The application_name field can be used to tell the PG server who I (the client) am. This is handy for replicas. Well-behaved applications set this name. This is used as a convention to name the standby server. You can set this in the connection string.

There are a range of options for synchronous_commit when it's on.

on > remote flush: wait for the standby to write the WAL to its WAL file and flush it to disk.
remote_apply > wait until the standby has replayed the information and that info is now visible there.

If you query the standby immediately after making a change, it may not be available with the default settings for synchronous_commit. You want remote_apply for that.
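
A hypothetical primary-side configuration (postgresql.conf); 'standby1' must match the standby's application_name:

synchronous_standby_names = 'standby1'
synchronous_commit = remote_apply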

12:01 Break!
12:12 Back!

Good use cases for streaming replication:

  • HA/failover additional software required to determine that the primary is down.
  • offloading queries that you might not want to run on production (resource management)
  • problematic but possible: scaling (for reads, not writes). The application must be able to write to one DB and read from the other. You have to account for the time required for the data that was written to be replayed. You could do synchronous commit at the remote_apply level. Yeah, it would work, but enter the replication conflict: VACUUM on the primary during a long-running query on the secondary.

max_standby_streaming_delay governs how long you wait before canceling a query and resuming streaming of WAL. The default value is 30s. Consider your report that runs on this replicated database. Please please please don't cancel it 30 seconds in. You can make this indefinite with -1.

Primary goal: replication doesn't lag too much.

hot_standby Setting this to off disallows queries during recovery.

The name of the game here is preventing conflicts.

hot_standby_feedback The standby reports its oldest running query to the primary so VACUUM won't remove rows that query still needs. No more conflicts due to VACUUM. It's not on by default because a very long-running query on the replica could cause bloat on the primary server. The primary will know about the standby's oldest query.

  • READ about hot_standby_feedback

remote_apply > Please don't, says LA. You'll have lots of waiting around all the time.

You can have an intentionally delayed standby server. This is helpful to recover from logical problems, assuming you notice the problem within the intentional delay period. A possible approach for quick recovery.

Fun with Functions

Every second, tell Postgres to sleep for 10 seconds.

SELECT pg_sleep(10) \watch 1

Temporarily Suspend Replication

Temporarily suspend replication: SELECT pg_wal_replay_pause();

resume with SELECT pg_wal_replay_resume();

High Availability (continued) and Failover

There is no cluster-ware in PostgreSQL but the tools to make it happen are available. Here's a manual failover:

  • Make the primary go down.
  • Make sure that the primary is actually down. (Avoid split-brain mode. This isn't fun.)
  • A few quick steps required:
    • Standby must be promoted out of recovery mode. Three ways to do this:
      • #promote_trigger_file = '' # file name whose presence ends recovery
      • pg_ctl promote
      • SELECT pg_promote();
    • Redirect the clients. Swizzle IP. Swizzle DNS (LOL). Load balancer HAproxy or similar. Alternately: this can be done with connection strings. If I can't get to the first, talk to the second.
  • Now I am back up and running, but in a degraded mode; there is no standby any longer. We can work on bringing the primary back unless it's really gone. The old primary server may be ahead: transactions committed on the primary but not yet sent to the secondary. How do we handle the deviation? We are respecting the timeline of the old standby, which is now the primary. What if we could just roll back the transactions that the old primary has that are ahead? Well! Enter pg_rewind.
psql 'host=/var/run/postgresql,/var/run/postgresql port=5432,5433 user=postgres'

Connect with a string like the above and observe the output of \conninfo to see which one you are connected to. That approach is cool, but it makes it possible to wind up connected to the standby, and why would you want that? It's read-only. Enter target_session_attrs, an addition to the connection string:

psql 'host=/var/run/postgresql,/var/run/postgresql port=5432,5433 user=postgres target_session_attrs=read-write'

With the above, you'll only connect to a database that is read-write. The clients do the handovers. Any client that links with libpq can do this. Possible with different syntax in JDBC.

A few ways to stop the server:

  • pg_ctl stop: there are three shutdown modes: smart, fast, immediate. The default is fast.

  • fast: all transactions are cancelled, client sessions are interrupted, a checkpoint is made, the server goes down. Default.

  • smart: no new client connections, existing connections are not interrupted. Once the last client connection closes, the DB shuts down. Not used too often.

  • immediate: evil mode. Go down as quickly as you can, no clean up. Crash the server. Crash recovery will be required on startup. Bad for production, fun for testing. No fsync.

pg_ctl stop -m immediate -D /etc/postgresql

See part 2.

Day 5: Execution Plans / Configuration and Resource Management


Post-Training

These notes will live here on work.carranza.engineer once the training is complete.

Clean Up

Clean up after yourself!

docker stop postgres_training
docker container ls -a
docker container rm $(docker container ls -a | grep postgres_training | cut -d" " -f1)
docker image ls
docker image rm <image_id>

Get the container ID you want to clean up: docker container ls -a | grep postgres_training | cut -d" " -f1

Daily Set Up

docker run --name pg_training -td postgres:13
docker exec -it pg_training /bin/bash
apt update &&  apt install lsb-release wget gnupg less man sudo vim postgresql-13 postgresql-13-dbgsym
su - postgres
pg_createcluster  13 main -- -E UTF-8   --locale=C.UTF-8     -k
pg_ctlcluster 13 main start
psql

Tools

External tools that might be helpful when poking at Postgres.

  • DbSchema available across all three platforms, looks cool, haven't used it.
  • DataFiller generate random data from database schema. Looks cool, haven't tested.
  • sqlsmith A random SQL query generator.

Links
