PostgreSQL Training: Part 1 // 2021
tags: `database`, `postgresql`
See part 2.
Prep
Links
GitLab-related
General
Let's Get Set Up!
How about a container with Postgres all ready to go?
Use the credentials in `pass`. Cool!
Installing `lsb-core` alone will require interactive setup to specify timezone info. Our friend `DEBIAN_FRONTEND` comes in handy; try this instead:

I recommend doing the above before class starts to avoid being annoyed.
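The command itself was lost in the export; a sketch of what it presumably looked like (package list is an assumption based on the note above):

```shell
# Preseed a non-interactive install so tzdata doesn't prompt for timezone info
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y lsb-core
```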
Day 1: Indexing / Transactions and Locking
Sharding will be out of scope for the training. We will talk a bit about partitioning.
Docs, the greatest resource.
The PostgreSQL Wiki –> Good for things in the PostgreSQL ecosystem, broader scope than the docs.
About the Project
PostgreSQL is free, open source software.
License: PostgreSQL is released under the PostgreSQL License, a liberal Open Source license, similar to the BSD or MIT licenses. A great many forks of PostgreSQL:
There are somewhere between 40-500 contributors to PostgreSQL, depending on how you define a contributor.
Versioning
See the versioning policy.
Major versions are released yearly, usually in the fall. We expect 14 in fall 2021. Releases are supported for 5 years.
Supported means you'll get new minor releases (always bug fixes, never new behavior).

The goal is for people to be able to upgrade to a new minor release without needing to test their code. They focus on not introducing bugs in minor releases to facilitate this.
Upgrades: `pg_dump` and `pg_restore`, or an in-place upgrade. Both fine options. The former is super simple ("unparalleled simplicity") and the other is quicker. In-place upgrades are a bit stricter (can't make some modifications). There is no requirement to follow a certain upgrade path (as there is with GitLab). This could change one day; crystal ball required. In-place upgrades may leave you without the benefit of some index updates that require changes to physical, on-disk representations.

Let's do something!
Install PostgreSQL
There are a few ways to do it (compile from source, repos from distro, straight from PG, Docker, etc). The docs have more information on installing it on Ubuntu. Installing it via `apt` creates a PostgreSQL cluster. Nice, but what if I want to control how it's created?

Install debugging symbols to make analysis with `perf` less annoying.

Let's become the user `postgres` to proceed from here (`su - postgres`). Let's take a look:

There are more goodies in `/usr/share/postgresql/13/` on Ubuntu.

Omnibus: `/opt/gitlab/embedded/postgresql/[version]/lib`
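A sketch of the install steps implied above (exact Ubuntu package names are assumptions, check your distro and repos):

```shell
# PostgreSQL 13 server plus debugging symbols for perf analysis
sudo apt-get install -y postgresql-13 postgresql-13-dbgsym
sudo su - postgres
```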
11:52 Break!
12:03 Back!
Postgres Cluster and Creating a Database
Run `initdb` to create a database cluster. Try `/usr/lib/postgresql/13/bin/initdb -? | less`. Consider adding `/usr/lib/postgresql/13/bin` to your `$PATH`.

`initdb` creates a `DATADIR`. There is no default location, but the standard is `/var/lib/postgresql/13/main`. That's familiar! The install put stuff there, but we are nuking it. Only a `DATADIR` must be passed to `initdb`.

Specific options to consider:

PostgreSQL encoding options: consider `UTF-8`. The default is taken from the environment (shell running `initdb` or whatever); don't leave the default encoding mechanism to fate. ;-)
An option for calling `initdb`: set `--locale` and leave things like `lc-collate` or `lc-monetary` alone unless you want to override them. A few of these options can not be changed later, like `--lc-collate` and `--lc-ctype`. This controls how databases are created in your cluster moving forward. Once the database has been created, the character type and collation can not be changed.

- `--lc-ctype` controls character type: determines what class a character belongs to (punctuation, number, letter, space, etc).
- `--lc-collate` determines the sort order of strings.

NOTE: PostgreSQL uses the collations from the C library when left to its own devices. Upgrading the C library could change collation order in PostgreSQL. Not great: it could corrupt indices by changing collation. Consider rebuilding all indices.

Try `locale -a` to see what collation options are available.

Using C collations means that upgrading C libraries is not a concern: the code points won't change.
In databases, everything is stored in a block. Blocks are the same size. PostgreSQL block size: 8K. With `-k`, checksums are written/read to disk to alert of possible storage problems. (Something changed between when the block was written and when it was later read.)

PostgreSQL Cluster
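Pulling the options above together, an `initdb` call might look like this (paths and locale choice are assumptions):

```shell
/usr/lib/postgresql/13/bin/initdb \
    --encoding=UTF-8 \
    --locale=en_US.UTF-8 \
    -k \
    -D /var/lib/postgresql/13/main
```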
What? Why?
The term cluster is somewhat unfortunate. A cluster contains several databases. A better word would be instance.
A cluster contains:
We have the same set of users for all databases in a cluster. The permissions can be different for one user across different databases.
Default Port: 5432
When connecting to `postgres`, you must be connected to a database. Once you are connected to that database, that's the one. You can't query a different database from that connection.

Let's create a cluster!
Forget about `initdb` and use Ubuntu's thing: `/usr/bin/pg_dropcluster 13 main`, then `pg_createcluster`, which accepts `initdb` options:

- Docker: `pg_ctlcluster 13 main start`
- Ubuntu: `systemctl start postgresql && systemctl enable postgresql`

Now that it's up and running:

- Config: `/etc/postgresql/13/main/`
- Data: `/var/lib/postgresql/13/main`
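A sketch of that Debian/Ubuntu wrapper workflow, passing `initdb` options through after `--` (the specific flags are assumptions):

```shell
sudo pg_dropcluster --stop 13 main
sudo pg_createcluster 13 main -- --encoding=UTF-8 --data-checksums
sudo pg_ctlcluster 13 main start
```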
Postgres uses a multi-process architecture and does not use threading. It's designed to be portable and run on a range of OSs which will implement multithreading differently. That leaves us with a dedicated process on the PostgreSQL server for each database connection.
Connecting to a database in a cluster
There are three methods we will cover:
The only client shipped is `psql`, our friend. See the list of PostgreSQL clients on the wiki. It's like `vi`: you might not like it, but you might need to use it one day, so try to get along. :-)
There are four things we must specify with each connection:
You'll be prompted for a password with the above so we'll have to try another way. Thankfully, PostgreSQL listens on TCP and Unix sockets. Yay!
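A fully spelled-out connection might look like this (values shown are the defaults discussed below):

```shell
psql -h /var/run/postgresql -p 5432 -U postgres -d postgres
```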
The `psql` client is smart enough to know what to do with the directory with the command above. Now, we have a prompt.

How do I bail? (It's not `:wq!`.) Bail with `\q` or `Ctrl`+`d`.

Commands starting with `\` are client commands; all other commands are server commands.

Defaults with `psql`:

- `-h` –> local Unix socket
- `-p` –> 5432
- `-U` –> user calling `psql`
- `-d` –> same as database username

So, we can just say `psql`.

We can use connection strings. Support for this comes from the shared linked libraries (specifically `libpq.so.5`).

HELP: `\?`

Consider `less -RS` as the pager.

Get a list of databases with `\l`.

Maybe you don't need an interactive session:
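The non-interactive command was elided in the export; it presumably looked something like:

```shell
psql -d template1 -c '\l'
```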
The above just says connect to the `template1` database and list all databases.

Creating a database
You do have to be connected to a database (like `postgres`) in order to create a database.

Get help for SQL statements with `\h`! That's kind of awesome…

Creating a database in PostgreSQL is essentially copying an existing database (and then doing a little bit of magic). You can use any database as a template. (Permits cloning or duplicating a database.)

You'll get the default locales and collation information when you create a database unless you specify them differently when creating the database.

A database can only be used as a template if no one else is connected to it. (To assist with this, you can toggle `ALLOW_CONNECTIONS` to prevent anyone from connecting to a database.)

You get tab completion inside the `psql` prompt. Double tab for all options. Terminate statements with `;`. The prompt will change to indicate that the server is still waiting on you. Observe:

We need a new connection in order to connect to our new database. We can do this from within the same `psql` session! Use `\c cutecat` for this.

Change the default database info with environment variables like `PGHOST`, `PGPORT`, `PGDATABASE` and `PGUSER`.
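A sketch of the creation flow described above (`cutecat` is the database name used in class; the clone name is made up):

```sql
CREATE DATABASE cutecat;
-- Any database can serve as a template:
CREATE DATABASE cutecat_copy TEMPLATE cutecat;
-- Fence a database off so it can safely be used as a template:
ALTER DATABASE cutecat WITH ALLOW_CONNECTIONS false;
```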
.13:02: Break!
13:19: Back!
Case in Commands
We wrote `CREATE DATABASE course;` to create a database called `course`. SQL demands that everything is case folded to upper case, but PostgreSQL case folds to lower case. We could do `cReAtE dAtAbAsE course;` and the same thing would happen. Prevent case folding with double quotes. Do `CREATE DATABASE "COURSE";` to really create a database called `COURSE`; otherwise `CREATE DATABASE COURSE;` would create a database called `course`.
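A sketch of that case-folding behavior:

```sql
CREATE DATABASE "COURSE";  -- quoted: really named COURSE
CREATE DATABASE COURSE;    -- unquoted: folded, really named course
DROP DATABASE "COURSE";    -- quotes needed to target the upper-case one
```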
(You could use double quotes to create a database called `USER`, but maybe don't do that because it's a reserved keyword and `DROP DATABASE USER;` won't do what you would expect. It'll nuke `user`, not `USER`.)

See Capitalization/case-folding from Stanford's NLP group or what the W3 has to say about case mapping and case folding.
Template Databases

What is the `template0` database? We see that `template0` is not allowing connections. The docs have more info about the template databases.

Having multiple databases in a single cluster comes with some pros and cons:
Indexing
We need a bunch of rows to make the database sweat.
Describe a table with `\d table_name`.

That query would be akin to the following in MySQL:

`GENERATED ALWAYS AS IDENTITY` has been part of PostgreSQL since version 10. Text columns are limited to a size of 1 GB in PostgreSQL. (What are you doing?!?) Use `text` if you don't have a limit that needs to be enforced. (Don't add a limit from the database side when there is no such limit in the application.)

Let's add a few rows to our table:
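The table definition and test data were images in the export; a plausible sketch of what was shown (names and row count are assumptions):

```sql
CREATE TABLE test (
    id   bigint GENERATED ALWAYS AS IDENTITY,
    name text
);

INSERT INTO test (name)
SELECT 'cat number ' || i
FROM generate_series(1, 100000) AS s(i);
```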
Duplicate the database:
Run the query above until you are happy with how large and slow your database will be.
Measure time from client:
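In `psql`, client-side timing is toggled like this:

```sql
\timing on
SELECT count(*) FROM test;  -- psql now reports elapsed time per statement
```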
OK, now we have a super slow query:
What happens with a query you provide? `*` is replaced with the names of the columns.

Use `EXPLAIN` to get the execution plan for your query.

Let's turn parallelism off to make the plan simpler: `SET max_parallel_workers_per_gather = 0;`. Here's the updated `EXPLAIN` result with this change to the parallel workers:

The value of the cost is not in any unit; it's primarily useful for comparison. You get two numbers. From above:

- `0.00` – start up cost, cost to deliver the first row
- `75100.80` – total cost, cost for retrieving all results
- `rows=1` – a guess at how many rows will be returned

We don't care much about the `width` value.

Analogy: A table is a library. A book is a table row. If you go to the library with no information about how the books are organized, you'll have to do a sequential scan, looking at each book, spine-by-spine. Thankfully, that's not reality! Today, you consult a card catalog. This is akin to a database index.
Tables are also referred to as heaps in PostgreSQL. Tables have no particular order (not ordered by `id`). An index is ordered. Each entry in an index points to the corresponding row in the table. (OK, think about it that way. It's actually implemented in a tree structure.)

Create an Index
Let's look at that `SELECT` again: the start up cost is higher, but wow, the total cost is lower. We did it.
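The index-creation step itself was an image; a sketch (index name assumed):

```sql
CREATE INDEX test_id_idx ON test (id);
EXPLAIN SELECT * FROM test WHERE id = 42;  -- expect an Index Scan now
```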
What if you outgrow `integer`? Convert to `bigint`, bringing you from a four-byte integer to an eight-byte one. Doing this would rewrite the entire table (lots of downtime, resource usage). There are tricks you can do to do this, but it's a pain. Don't ever use `integer`. See the "Convert ci_builds.id to bigint - Step 1: Add new columns and sync data" issue for the pain GitLab is experiencing with this very topic.

– Laurenz Albe
If you didn't stay safe and you have this problem, here is (from memory) guidance on converting from `int` to `bigint`:

- Add a new `bigint` column
- Add a `TRIGGER` that copies the existing primary key into the new column
- Backfill the old rows, then switch the primary key over to the new `bigint` column

14:00 Break!
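A sketch of those steps (table, column, and trigger names are made up):

```sql
ALTER TABLE accounts ADD COLUMN id_new bigint;

CREATE FUNCTION accounts_copy_id() RETURNS trigger AS $$
BEGIN
    NEW.id_new := NEW.id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_copy_id
    BEFORE INSERT OR UPDATE ON accounts
    FOR EACH ROW EXECUTE FUNCTION accounts_copy_id();

-- Backfill existing rows (in batches, in real life), then swap the
-- primary key over to id_new during a short lock.
UPDATE accounts SET id_new = id WHERE id_new IS NULL;
```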
14:09 Back!
Using an index, looking at query plans
Using `EXPLAIN` to understand what PostgreSQL does when we give it queries.

Tell me how many rows (`reltuples`) and 8K pages (`relpages`). These are estimates (they'll be in the right order of magnitude, "close to the truth").

`seq_page_cost` and `cpu_tuple_cost` give us the costs the planner works with:

- `seq_page_cost` – sets the planner's estimate of the cost of a disk page fetch that is part of a series of sequential fetches. The default is 1.0. This value can be overridden for tables and indexes in a particular tablespace by setting the tablespace parameter of the same name (see ALTER TABLESPACE).
- `cpu_tuple_cost` – the cost to process one row. The default is 0.01.

Picking the lowest cost plan is easy; generating accurate cost estimates is…not so easy.
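Those two parameters let you reproduce the planner's sequential-scan estimate by hand, which is a nice sanity check:

```sql
SELECT relpages, reltuples,
       relpages  * current_setting('seq_page_cost')::float8
     + reltuples * current_setting('cpu_tuple_cost')::float8 AS est_seq_scan_cost
FROM pg_class
WHERE relname = 'test';
```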
Let's look at a more interesting query:
When scanning more than one row with an index: PostgreSQL
The real cost in an index scan is not in reading the index. The expensive part is in reading the table row that the index points to. The index can be read from the beginning to the end or from the end to the beginning.
With a query like `SELECT * FROM test WHERE id > 42;`, the query optimizer could choose to do an index scan, but we see it did a sequential scan instead:

With a query like this, it's back to the index scan:
We can be reasonably sure that the optimizer is doing the fastest thing when choosing between index scan and a sequential scan.
The index is useful for doing things like `EXPLAIN SELECT max(id) FROM test;`.

The index is also useful for sorting:
I can forbid PostgreSQL from using an index scan with `SET enable_indexscan = off;`. Use `RESET enable_indexscan;` to revert that change.
to revert that change.- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →As a result: it is not always necessarily a good idea to have an index. You need to balance speed when querying vs speed when updating. (If you are querying monthly but inserting daily…maybe don't bother with the index.)
An index you don't use is a bad thing: it degrades performance and provides no benefit.
What if I create an index on the text column?
Will the index even be used?
Yes, but it's also doing a new thing called a bitmap heap scan.
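A sketch of that experiment (index name assumed):

```sql
CREATE INDEX test_name_idx ON test (name);
EXPLAIN SELECT * FROM test WHERE name = 'cat number 42';
-- May show a Bitmap Index Scan feeding a Bitmap Heap Scan
```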
Each table block is only fetched once. The entire bitmap might not fit into memory. PostgreSQL will degrade to one bit per block. Less efficient but there's only so much memory so…
You can drop an index like you'd drop a database.
For some purposes, having an index is super useful.
Partial Index
Let's use a `WHERE` clause to create a partial index. The index will only be created for rows that satisfy the `WHERE` condition. A partial index is also referred to as a conditional index.

In addition to having a partial index, you can have a unique index. You can also have a unique partial index.
You might wish to create a partial index that only includes conditions you are frequently querying for. Like:
Completely and quickly made up to demonstrate: querying only for users that are not active.
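A sketch of such a partial index (the `users` table here is hypothetical):

```sql
CREATE INDEX users_inactive_idx ON users (id) WHERE NOT active;
-- Only rows with active = false land in the index, so it stays small.
```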
Great segue…
Who creates indexes?
Ideally, the developer. Don't cheat and only consider datasets that are too small and not relevant to production usage. Consider index creation when designing the query. Have realistic amounts of test data at some point between development and production.
Day 2: Functions / JSON / Partitioning
Watch the day 2 recording until Laurenz' machine says Aug 31 17:48.
Indexing continued.
Having an index on `name` won't help you with a query that's looking for `WHERE upper(name)`.

For an index to be useful with a `WHERE` expression, the query should look something like:

`<indexed expression> <operator> <constant>`
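The fix for the `upper(name)` case above is an index on the expression itself, sketched here:

```sql
CREATE INDEX test_upper_name_idx ON test (upper(name));
EXPLAIN SELECT * FROM test WHERE upper(name) = 'CAT NUMBER 42';
```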
The operator must be an operator understood by the optimizer. Use any constant that is constant during the query. To demonstrate how serious this is, consider:
We have an index on `id`, but the database isn't smart enough to know that anything plus `0` is…that thing. With a query on an indexed expression, we see that a sequential scan is no longer used. In our example, the estimates were way off and the optimizer thought it would find 20,972 rows when the real answer is `0`.

PostgreSQL collects table statistics regularly. Recalculation of these statistics is triggered by data modification…or when you tell it to with `ANALYZE`.

Store dates and times as…dates and times. Don't store dates as numbers (something something number of seconds since the epoch).
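For example:

```sql
ANALYZE test;  -- refresh the statistics by hand
SELECT reltuples, relpages FROM pg_class WHERE relname = 'test';
```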
The "volatility" of a function.
An immutable function will return the same result always, regardless of things like time of day, phase of moon, upper casing. We got this feedback because `age` is a function of time.

Stan provides an example of GitLab being hit by function volatility:
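A sketch of how volatility bites when indexing (the `age()` line is the kind of thing that fails):

```sql
CREATE INDEX ok_idx ON test (upper(name));  -- fine: upper() is IMMUTABLE
-- CREATE INDEX bad_idx ON ts (age(t));
-- ERROR: functions in index expression must be marked IMMUTABLE
```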
12:02 Break!
12:12 Back!
Timestamps in PostgreSQL
Timestamp without time zone: easy, straightforward. Often, you want timestamp with time zone.

`CREATE TABLE ts (t timestamp with time zone);`

If you store a timestamp with time zone, it's converted to UTC and stored that way. When converted to a string, the conversion happens according to parameters in the database. Make sure `timezone` is set properly. </digression>
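A sketch of that conversion behavior:

```sql
CREATE TABLE ts (t timestamp with time zone);
INSERT INTO ts VALUES ('2021-08-31 12:00:00+02');  -- stored internally as UTC
SET timezone = 'UTC';
SELECT t FROM ts;  -- rendered per the timezone setting: 2021-08-31 10:00:00+00
```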
Indexes, continued
On `btree` indexes and `LIKE`:

The query above won't use the index. Why? Collation, which does not support comparison character-by-character (character-wise).

With an operator class, we can do this. Specifically, we want the text pattern operator class (`text_pattern_ops`). We just create the index a little differently:
Take a look at the Index Cond and then the Filter:
The index we just created can also be used for equality comparisons (`WHERE name = 'fluffy'`). So, we don't need two indexes.
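A sketch of the pattern-ops index described above:

```sql
CREATE INDEX test_name_pattern_idx ON test (name text_pattern_ops);
EXPLAIN SELECT * FROM test WHERE name LIKE 'cat%';  -- prefix searches can use it
```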
`ORDER BY name` is kind of the only use for a straight-up normal index.

Access methods: there are others:
- `hash` – hash tables persisted on disk. Can only be used for equality searches, which a `btree` could also handle. A hash index can't do anything that a `btree` can not. Possible speed advantages. Corner case: indexing very large values to save space.
- `spgist` and `brin` – very special case indices.
- `gist` – perfect for weird data types: exotic data types that can't be used with a `btree` because they don't have an order. Consider geographic data points or range types.
- `gin` – most useful for composite data types: arrays, full text search, JSON.
- `btree` – data types must have a way of being ordered/compared. If not, you can't use `btree`. DEFAULT.

Things we won't talk much about: `spgist` and `brin`.
Range Types
These are intervals. Floating point ranges, timestamp ranges: hotel reservations.
There are a few different ways that the overlap could occur. The query to do to make this happen would be annoying. Let's use a timestamp range instead:
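A sketch of the hotel-reservation idea with a timestamp range (table layout assumed):

```sql
CREATE TABLE reservations (room integer, during tsrange);
INSERT INTO reservations VALUES
    (1, '[2021-09-01 09:00, 2021-09-01 11:00)');

-- Does anything overlap a proposed booking?
SELECT * FROM reservations
WHERE during && tsrange('2021-09-01 10:00', '2021-09-01 12:00');
```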
Look at this query:
The `[` says that 9a is in the range, and the `)` at the end says that 11a is not in the range, which permits adjacent reservations.

These operators are supported by the gist index:

- `&&` – overlaps operator
- `@>` – contains operator

The execution plan says a sequential scan will be happening. Lame. Let's make a `gist` index:

It's still using a Sequential Scan, so let's do `SET enable_seqscan = off;` to force it to use the `gist` index that we created.

A `btree` index is a special kind of `gist` index, but don't worry about it.

An exclusion constraint can be implemented using a `gist` index only:

The thing above will prevent an overlapping entry! Whoa. Here's how it looks:
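The exclusion constraint from the demo, sketched (it needs `btree_gist` for the `room WITH =` part; constraint name is made up):

```sql
CREATE EXTENSION btree_gist;
ALTER TABLE reservations
    ADD CONSTRAINT no_overlap
    EXCLUDE USING gist (room WITH =, during WITH &&);
-- A conflicting INSERT now fails:
-- ERROR: conflicting key value violates exclusion constraint "no_overlap"
```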
That's the coolest thing we've seen thus far, IMO.
Gin Indexes and JSON in a Database
There are two data types for JSON:

- `json` –> stored as a string, preserves all formatting. Use if it's not being touched in the database, only stored and retrieved.
- `jsonb` –> for in-database processing, prefer `jsonb`. Use it unless you have a reason not to.

Use `::` as the special operator for casting. The right way is:

The shorter and more fun ways:
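Both casting spellings, sketched:

```sql
SELECT CAST('{"a": 1}' AS jsonb);  -- standard SQL
SELECT '{"a": 1}'::jsonb;          -- PostgreSQL shorthand
```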
Inserting JSON into the database
Querying JSON in the database!
So cool!
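The insert/query examples were images; a sketch of the kind of thing shown (table name and keys are assumptions):

```sql
CREATE TABLE docs (id bigint GENERATED ALWAYS AS IDENTITY, doc jsonb);
INSERT INTO docs (doc)
VALUES ('{"name": "fluffy", "toys": ["ball", "string"]}');

SELECT doc->>'name' AS name         -- ->> extracts a field as text
FROM docs
WHERE doc @> '{"name": "fluffy"}';  -- containment, indexable with gin

CREATE INDEX docs_doc_idx ON docs USING gin (doc);
```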
Words on JSON in the database:
Temptation to store everything in JSON is strong: why would I even create columns in a table? Or multiple tables? Just pop everything into JSON.
Queries can become very slow and complicated if you do this. Constraints can't be set on parts of the JSON. That's lame.
Use JSON in the database very sparingly. In 98% of all cases, it's better to avoid it. Think twice before proceeding.
READ http://patshaughnessy.net/2017/12/15/looking-inside-postgres-at-a-gist-index
13:03 Break!
13:17 Back!
Sometimes you want to perform a similarity search.
Instead of `SELECT * FROM names WHERE name = 'lorenz';`

Let's create an extension with `CREATE EXTENSION pg_trgm;`. The `pg_trgm` extension gives us the `%` operator that lets us do similarity matching.

The `btree` index does not support the `%` operator. OMG.

A trigram index supports queries like `LIKE '%rie%';`

Use `ILIKE` instead of `LIKE` for case-insensitivity. Neat-o!

`trgm` – trigram. All that stuff from the cryptanalysis section of Cryptography is useful!

READ English Letter Frequencies on practicalcryptography.com, served via `http`.
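A sketch of trigram matching in action (index name assumed):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX names_trgm_idx ON names USING gin (name gin_trgm_ops);

SELECT name, similarity(name, 'lorenz')
FROM names
WHERE name % 'lorenz'  -- "is similar to"
ORDER BY similarity(name, 'lorenz') DESC;
```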
How cool is that! My name is a little short to make playing with trigrams very fun. (Why trigrams and not quadgrams?)
While trigrams are amazing, they are not ideal for full text search.
We see it matching for two spaces at the beginning of the string but only one at the end, because matches at the beginning of the string are emphasized.
End of indexes!
Transactions, Concurrency, Locking
Consider a transaction as a group of statements bundled together, as if within a set of parentheses.
Let's have some ACID.
In PostgreSQL, statements can only be run in a transaction. They run in auto commit mode.
If you don't specifically start a transaction, PostgreSQL puts it in one for you.
This gives us interesting advantages:
Either the whole statement runs or none of it runs. You update 20 rows, get an error due to network outage on the 21st, sorry!
If you want a transaction that spans multiple statements, you have to make a transaction:

- `START TRANSACTION;` or `BEGIN;`

Transactions can not be nested. You can only have one transaction at a time. To end the transaction:

- `COMMIT;` or `ROLLBACK;`

The PostgreSQL prompt will show you when you are in a transaction by updating the prompt:
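A sketch of the classic all-or-nothing money transfer:

```sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- or ROLLBACK; to undo both updates together
```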
In the spirit of all or nothing:
Once it hits a problem, the transaction is done. You have to start over.
You can use `SAVEPOINT` to create a subtransaction so that you can `ROLLBACK` to the save point or roll back the entire transaction. Use `SAVEPOINT`, but don't overuse it. Performance and concurrency can suffer. Don't do something like…one `SAVEPOINT` for each statement.

Take a look:
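A sketch of a savepoint rescuing a transaction:

```sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
SAVEPOINT before_risky;
UPDATE accounts SET balance = balance / 0 WHERE id = 2;  -- fails
ROLLBACK TO SAVEPOINT before_risky;  -- only the failed part is undone
COMMIT;
```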
Readers never block writers and writers never block readers. In our money transfer example, the "old" data is shown. The `UPDATE` we are running takes an exclusive lock, preventing any further writes. Locks are held until the end of the transaction. Writers can block writers.
There are 4 isolation levels in the SQL standard:
- `READ UNCOMMITTED` – No way to see uncommitted data in PostgreSQL. If you request this, you'll actually get `READ COMMITTED`.
- `READ COMMITTED` – I never see dirty data from uncommitted transactions. The committed value is always shown. This is the default in PostgreSQL.
- `REPEATABLE READ` – Nothing changes. I see the same values; it's like having a snapshot of the data.
- `SERIALIZABLE` – Completely, absolutely guarantees no anomalies. Rarely used. We won't talk about it.

Use repeatable read for complicated reports to make sure you get consistent results, no matter how long it takes the report to be generated.
13:54 Break!
14:06 Back!
We might have the impression that isolation always works: it doesn't, it's an illusion. There's a good reason for the different isolation levels. The "lower" the isolation level, the fewer guarantees.
There's a specific kind of anomaly called the lost update anomaly. Two options for preventing this: `FOR UPDATE` or (even better) `FOR NO KEY UPDATE`. Better if a conflict is likely. Tradeoff: you introduce locks, which are bad for concurrency. You can do either of these, but `FOR NO KEY UPDATE` is preferred as it will lock less.
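The row-locking variant, sketched:

```sql
BEGIN;
SELECT balance FROM accounts WHERE id = 1 FOR NO KEY UPDATE;
-- compute the new value client-side, then:
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
COMMIT;
```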
There are other ways:
You might not use `SKIP LOCKED` super often.

Setting the Isolation Level
You can set the isolation level with something like:
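For example:

```sql
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT sum(balance) FROM accounts;  -- every read sees the same snapshot
COMMIT;
```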
The Infamous Dead Lock
Two deadlocks per day: NBD. Deal with it in the application.
Five deadlocks per minute: worry!
Completely avoiding deadlocks: not possible. Work to avoid/reduce.
Hmm. What if we both followed the rule "grab the bow first"? No more deadlock!
Debug Locks
Recall that locks are always held until the end of the transaction. You might be running into lots of locks because your transactions are too long.
There are two kinds of databases:
Let's learn more:
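The queries shown in class were images; a sketch of the usual starting point:

```sql
SELECT pid, locktype, mode, granted
FROM pg_locks
WHERE NOT granted;
```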
Look for locks where `granted` is `f`. From that first query, we get process IDs! Get more info with `SELECT pg_blocking_pids(3968);`.

The blocking transaction must be terminated. There are two ways to do this!
- `SELECT pg_cancel_backend(PID);` – This is when you see `ERROR: canceling statement due to user request.`
- `SELECT pg_terminate_backend(PID);` – That kills the whole session. `psql` will try (and probably succeed) to reconnect.
Transactions
You can prevent long-running transactions but be super careful. Observe:
Consider a value like 1 minute but IDK, up to you!
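A sketch of the settings plausibly in play here (the idle-transaction timeout is my assumption about which knob was meant; values match the in-class suggestion):

```sql
SET idle_in_transaction_session_timeout = '1min';
SET statement_timeout = '30s';
```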
Use `statement_timeout` to prevent runaway queries.

MultiVersioning
The `ctid` is the current tuple id: the physical storage location in the table for that row. Plop is the eleventh entry in the first block.
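A sketch of peeking at the system columns:

```sql
SELECT ctid, xmin, xmax, * FROM test LIMIT 3;
-- a ctid of '(0,11)' means: first 8K block, eleventh item
```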
An `UPDATE` doesn't really update; it makes a copy. Observe the change in `ctid`.

I missed it a bit, but `xmin` and `xmax` give us info about which transaction introduced a value and which invalidated it. There will be multiple versions of a row in a table. Check my screenshot from 14:46. Deleting a row doesn't remove it; it just changes the `xmax`. These old versions of the row will eventually need to be deleted. We'll talk about garbage collection tomorrow. Doing it at commit time would be too soon. Recall how `REPEATABLE READ` works.

A low-level extension lets you read the raw data, useful for debugging data corruption:
We get the `xmin` and `xmax` information.

Q: Is it performant to query `pg_stat_activity`?
A: Yes. That table is not indexed; it is stored in memory. No I/O pain.
Q: I have a query taking too long, locking everything up. How do we find the culprit query?
A: Look for a hanging backend or a hanging query. Use the `pg_blocking_pids` function and find who is blocking that backend.

Table Locks
Let's learn about table-level locks. Take a look at Table 13.2, "Conflicting Lock Modes". If the cell has an X, those locking modes are not compatible. For example, `ROW EXCLUSIVE` locks are compatible with one another, and that's why we can `INSERT` into the same table multiple times concurrently.

Avoid `SELECT`-ing from a table while `DROP`-ping it.

Day 3: Memory and cache management / Backup and restore
Today, we will discuss:
The visibility info is in the table, the index does not know about it.
Vacuum:
Illustration to remember:
Run
VACUUM (VERBOSE) accounts;
for more info:

No table row can be bigger than 8K because everything is stored in 8K blocks and a row must fit within a single block.
OK, I have columns bigger than 8K? Enter TOAST.
When a table is created with at least one column with a data type of variable width/size, a
toast
table is created to store overlength variables.When a row is added:
The
toast
stuff happens behind the scenes. This is calleddetoasting
.postgres=# CREATE TABLE vactest (id integer);
CREATE TABLE
postgres=# SELECT * FROM generate_series(1, 10);
generate_series()
is a set-returning function, a table function. It behaves like a table:So cool!
Modular arithmetic is useful:
Vacuuming Example
Do this:
Delete 50,000 rows and then observe that vacuum can't get rid of the old row versions. (They might be needed by the other transaction, although the tables are unrelated.)
Once we commit the transaction,
vacuum
can do its thing.Long-running transactions in PostgreSQL are doubly bad:
Auto Vacuum
The
autovacuum launcher
runs in the background. It awakes periodically, checks for whether any tables need to be vacuumed and sends a worker to perform the vacuuming if that's the case. Vacuum is designed to not create problems: no problem with concurrent data modifications. Vacuum won't block this. It's still resource-intensive (lots of reads, possibly some writes, CPU time). Auto vacuum is designed to be gentle.Vacuum removes the data but doesn't shrink: there's just more free space in the 8K block. The empty space is not fragmented. Empty space is always in one spot and ready for reuse with the next
INSERT
orUPDATE
.Configuring Autovacuum
Always keep
autovacuum
on.Reduce resource usage by autovacuum
Modify
autovacuum_max_workers
to address load due toautovacuum
. Additionally,autovacuum_vacuum_cost_delay
andautovacuum_vacuum_cost_limit
also help to make sureautovacuum
is not too intensive.When will autovacuum run?
This formula determines whether
autovacuum
runs:autovacuum_threshold
+autovacuum_scale_factor
*number of rows
<number of dead rows
.By default: if 20% of a table is dead rows,
autovacuum
runs.More Problems with Autovacuum
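To make the trigger formula above concrete, here is a quick sketch (using the shipped defaults, threshold 50 and scale factor 0.2, which are assumptions, not values from the training system). Note how a big table piles up a lot of dead rows before autovacuum fires:

```python
# Hedged sketch of the autovacuum trigger formula:
# autovacuum runs when dead rows exceed threshold + scale_factor * live rows.

def vacuum_trigger_point(n_live_rows: int,
                         threshold: int = 50,
                         scale_factor: float = 0.2) -> float:
    """Number of dead rows at which autovacuum kicks in for a table."""
    return threshold + scale_factor * n_live_rows

# A small table fires quickly...
print(vacuum_trigger_point(10_000))       # 2050.0 dead rows
# ...but a 100M-row table waits for ~20M dead rows first.
print(vacuum_trigger_point(100_000_000))  # 20000050.0 dead rows
```

This is exactly why heavily updated big tables often want a much smaller per-table scale factor.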
Imagine you update like nobody's business: you could be creating dead rows more frequently than
autovacuum
can take them out. You'd seeautovacuum
running all the time. (It's going too slowly.)autovacuum_vacuum_cost_delay
to 0, or raise `autovacuum_vacuum_cost_limit`.
.Each worker can only work on one table at a time. While you can change these settings in the config file to change settings for all tables in all databases: YOU CAN CHANGE THESE SETTINGS ON A PER-TABLE BASIS.
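A sketch of what such a per-table override can look like (the table name and values here are invented, not from the training):

```sql
-- Hypothetical: make autovacuum fire sooner and rest less on one hot table only
ALTER TABLE accounts SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_cost_delay   = 1
);
```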
The bit above tells
autovacuum
on just that table to take shorter breaks!
ANALYZE
is part ofautovacuum
and has similar bits that can be toggled.ANALYZE
is much less resource-intensive thanautovacuum
. Make sure it's running often enough and you're good. The default is usually good except for things like:ANALYZE
to run more often. Recall, you can do this on a per-table basis.Taking a step back: this seems obvious but it's really important to understand your data, your application and how they are used to make the best decisions.
Index Only Scan
Imagine a query like
SELECT id FROM tab WHERE id < 1000
. If we have an index based on theid
: we don't even need to go to the table and bear the pain of fetching rows from the table. That info is right in the index.…but…
It's not that simple. The index doesn't know about the visibility information of an entry in the index. (The
xmin
andxmax
stuff.)Because we have to fetch the table row to get the
xmin
andxmax
info, doing an index only scan is not possible, or at least hard (we'll see…).

To solve this problem, PostgreSQL introduces the visibility map:
2 bits for each 8K block, one bit is "all visible" when set to 1, it means that all rows in that 8K block are visible for all transactions.
The index only scan finds an index entry, which tells it info about the block. It looks to the visibility map which is in memory and checks the "all visible" block. This is how we increase the efficiency of index only scans.
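A back-of-the-envelope sketch of how small the visibility map is (my own arithmetic, not from the training):

```python
# 2 bits per 8K heap block (one "all visible" bit plus a second bit).
BLOCK_SIZE = 8 * 1024
BITS_PER_BLOCK = 2

def visibility_map_bytes(table_bytes: int) -> int:
    """Approximate size of the visibility map for a table of the given size."""
    blocks = table_bytes // BLOCK_SIZE
    return blocks * BITS_PER_BLOCK // 8   # bits -> bytes

# A 1 GiB table has 131072 blocks -> a 32 KiB visibility map.
print(visibility_map_bytes(1024**3))  # 32768
```

That's why checking the map in memory is so much cheaper than fetching heap rows.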
The
vacuum
process creates and maintains the visibility map. For efficient index only scans, make sure thatvacuum
runs frequently enough.pg_visibility
to get more information https://www.postgresql.org/docs/current/pgvisibility.htmlVacuum makes the table emptier but not smaller.
What if I want to shrink the table?
Use
VACUUM (FULL) vactest;
.Note that
VACUUM (FULL)
is completely different fromVACUUM
. It makes a copy, removes the old stuff and moves the new table in place of the original.
VACUUM (FULL)
means downtime: it's in an access-exclusive mode. Consider this to be an emergency measure. Don't run it out ofcron
.
Because it makes a copy, don't use
VACUUM (FULL)
when you are on the verge of out of disk space and looking to free some up.VACUUM (FULL)
? An I/O
error, which will terminate the statement and roll it back. The new file is deleted and you'll be back to how it looked beforehand. One exception: if PostgreSQL runs out of space creating a transaction log file, it will crash. We'll learn more later. A crash encountered during a `VACUUM (FULL)`
will leave you with orphaned files which are hard to identify. Dump and restore becomes your best option.
Our tables are too big for us to do
VACUUM (FULL)
so we use repack. Laurenz recommendspg_squeeze
, which is nicer, entirely in the database (no client).pg_squeeze
- builds compacted version in the background, grabs an access exclusive lock where possible, catches up on updates and shoves the new compacted version into place. Read more about pg_squeeze. This should be avoided if possible. Don't design a situation where you need to do this.Other things to do instead:
Give
autovacuum
more memory.Consider
HOT updates
so you don't need to runvacuum
so frequently.READ about TOAST on the PostgreSQL wiki
12:04 Break!
12:14 Back!
Is my table bloated? Do I need to take action?
pg_stat_user_tables
say?The
\x
turns on Extended display, which is beautiful:You might have lots of dead tuples but
vacuum
just ran.Do you want the full truth?
Use
pgstattuple
:Run
VACUUM vactest;
and thenSELECT * FROM pgstattuple('vactest');
again.Note that
pgstattuple
is expensive. ConsiderSELECT * FROM pgstattuple_approx('vactest');
instead. This takes a sample of the table and gives you a guess becausepgstattuple
will be using a sequential scan. (Don't run it every 10 minutes; maybe daily.)There are other tools that aim to estimate bloat. They are less expensive but still check in with
pgstattuple
.Logical Backup
Our friend
pg_dump
comes in:This is simple. A few downsides:
Use
pg_dumpall
to dump the entire cluster (database instance). Intended for use with upgrades.Both
pg_dump
andpg_dumpall
are client-side tools. Runpg_dump
somewhere other than the database machine. (I mean, you can keep your backups on the server. You do you…)Use
pg_dumpall -g
to get just the global info. This may be a good complement to a single database backup withpg_dump
.pg_dump
uses a read-only repeatable read transaction. No matter how long it takes,pg_dump
will always get a consistent snapshot.Table-wise Restore
I want to restore "less" than the entire database.
Dump just a table:
pg_dump -t accounts course
Use
-s
to get the schema. There are lots of options topg_dump
worth exploring.Use
pg_dump -F c -f course.dmp course
to tellpg_dump
to use a custom format. It's a binary format. Usepg_restore
tow ork with this file.Use something like
pg_restore -d copy howdy.dmp
.Ooooh! You can restore a
.dmp
to a file!!!You can generate a
.sql
from a custom dump. This is so cool:With the custom format dump, the data are compressed.
Maybe always use the custom format and convert to
.sql
if you need to. With a custom format, you can restore only part of what you dmped.To only restore the schema:
pg_restore -f - -s course.dmp
Alternately, you can use the directory format:
pg_dump _F d -f course.dir course
We have looked at a few formats for backups:
.sql
–> What we are all used to.The directory format enables the
-j
flag which lets you have multiple resotre processes.So,
pg_dump
is great: simple, good but there's one decisive disadvantage:How PostgreSQL Handles Persistence
Below the line: persistence world, disk
Above the line: memory and processes
There are data files on disk that persist the tables: 8K blocks in files containing data.
Above the line we have a cache (shared buffers in Postgres). Used to cache blocks for all databases, shared by all db processes. We also have a worker backend process working on behalf of a client connection to execute queries.
A contradiction, conflicting interests:
A simple (but poor approach) write all involved tables when I
COMMIT;
. Works but poor performance.How do we resolve this?
There's a single good architecture that goes by other names.
Below the line, we have the Write Ahead Log (WAL). This file (on disk) is written to sequentially and contains a log of all changes that have happened. All DBs have this (also called transaction log, binary log, redo log).
You must always put the change in the log first and then write the data files. This is low-level information (not the SQL statement): write what it takes to modify the files from before to after.
Using WAL
a. Confirm WAL info is persisted, then
COMMIT
can proceed
fsync
atCOMMIT;
time to force WAL down to disk.Way after
COMMIT;
, the dirty data pages in shared buffers are written down to the data files (not to the WAL; that was already flushed at commit time). The `checkpointer`
andbackground writer
take care of this.

A checkpoint is a point in time when all dirty info in shared buffers is written down to disk and cleaned. This is marked in the WAL so PostgreSQL knows anything older than this sequence number has already been written to disk. You only need to replay everything that came after the latest checkpoint. (This is how crash recovery works.)
Most non-WAL activities are written to disk by the
checkpointer
. After the checkpointer has completed, buffers are clean. As time passes, buffers get dirty.
background writer
for this.background writer
looks around in dirty buffers, cleans them, and aims. tomake sure we have enough clean buffers in between checkpoitns so no one has to wait for a clean buffer.An overeager
background writer
means mroe writes than are necessary: no thanks!The
background writer
goes front to back through the entire shared buffer, slowly, over time. It shouldn't do any more work than is required to make sure buffer is clean when needed. (The background writer is not overly important. )Terminology
We'll use page, block and buffer interchangeably but we'll always mean those 8K blocks. When on disk, it's a block; when in memory, a page or buffer.
12:59 Break!
13:17 Back!
Data Directory Backups
The name of a table isn't the real name, the object ID is the right identifier. (The table name can be changed).
Try
SELECT oid, relfilenode FROM pg_class WHERE relname = 'ts';
A full vacuum will cause a change.
The
_vm
files inside 16470 are the visibility maps.
The
pg_control
file contains metadata about the database. It's useful for crash recovery. Thepg_controldata
command is used to readpg_control
.Ack, I don't have
pg_controldata
but this is what would work:
/var/lib/postgresql/13/main/pg_wal
. It's a bunch of files, not just one. One of those files is active, you can identify it by looking for the most recent timestamp.There are other WALs for future use, these are reserves to improve performance. When the active WAL is full, we switch to the next one and write to it.
WAL segments older than the latest checkpoint are no longer needed. At the end of a checkpoint, we delete (or recycle) unneeded WAL segments. The database will crash if it hits an I/O error writing the next WAL segment.
In production, consider having
pg_wal
on a different filesystem. While the database is down, move the data and symlink to the new place.
Use
pg_waldump
to take a look. So freaking cool!An online backup creates a checkpoint from which you recover. WAL segments are useful for recovering from an old checkpoint. You can recover everything but you don't need to.
Let's get started.
minimal
is enough for crash recovery.Do not turn off
fsync
. Yes, your database is faster but you are toast (not in the good way) if it crashes.- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →Let's try something
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →This takes forever because
fsync
runs after each commit. PostgreSQL is running in auto-commit mode so it's writing after every commit.

`insert.sql`
is just 30,001INSERT
statements. PutBEGIN;
at the beginning andCOMMIT;
at the end and observe the improvement:

Much better!
Let's look at
synchronous_commit
to see what impact it has. With it off, not every commit syncs to disk. It's relatively safe to use because syncs still occur and the WAL is still flushed to disk occasionally (just not automatically with every commit).
Let's use
pg_ctl reload
to tell the daemon to check for a new config.OK, I removed the single transaction from
insert.sql
and the performance withsynchronous_commit
is much better:However, we don't want to lose data so let's turn
synchronous_commit = on
again. Let's considercommit_delay
andcommit_siblings
instead.Check for other siblings transactions and sync them together. This means fewer syncs. Commits will take longer (it won't return until the flush. isdone). It's tough to tune
commit_delay
andcommit_siblings
well so it won't be as- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →synchronsou_commit = off
but still better.Don't change
full_page_writes = on
. Partial pages are difficult to recover from. These are called torn pages. You may wish to compress the full-page writes: IDK, wanna trade CPU time for I/O? You must make a decision somewhere.Checkpoints :checkmark:
If more than
max_wal_size
has been written since the last checkpiont, PostgreSQL triggers a checkpoint. Additionally,checkpoint_timeout
.See
checkpoint_completion_target
. PostgreSQL can spread the checkpoint process out. With the default of 0.5, every checkpoint should be done halfway to the next checkpoint. It's customary to increase this parameter, and with version 14 the default will be increased to 0.9. You'll have checkpoint activity spread out much more (up until 90% of the way to the next checkpoint). If you have checkpoints every 10 minutes:
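The arithmetic for that 10-minute example, sketched out (numbers assumed, not measured):

```python
# Sketch of the checkpoint_completion_target arithmetic: the target controls
# how much of the checkpoint interval the checkpoint's writes are spread over.

def write_phase_seconds(checkpoint_interval_s: float,
                        completion_target: float) -> float:
    """Seconds the checkpoint spends writing, out of each interval."""
    return checkpoint_interval_s * completion_target

print(write_phase_seconds(600, 0.5))  # 300.0 -> default: done halfway through
print(write_phase_seconds(600, 0.9))  # 540.0 -> gentler, spread over 9 minutes
```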
13:51 Break!
14:03 Back!
Great filesystem-level backup:
But I don't want to stop the server!
Fair!
Let's configure PostgreSQL to
Turn
archive_mode = on
, this requires restarting the database.archive_command
is nice but scary. You just put any command there and go.PostgreSQL only checks the return code. You could add
/bin/true
and PG would be like, yeah, great job, archiving completed.Archival failure does not mean that the segment is deleted. PG will keep trying, back off, try again, it won't just skip and try for the next one. Why? We need an unbroken stream of WAL segments.
Activate Extended display for a single command with:
Yep, instead of the
;
.pg_basebackup -c fast -D backup
Copy files from server to client. Cool but also…OK. Some things will be pissing like
postmaster.pid
(I don't care about thepid
from the server). The results:In addition, we have some new files like backup/bbackup_manifest. This is used by
pg_verifybackup
. We also havebackup/backup_label
.We could just start up from the contents of
pg_basebackup
but…things will change whilepg_basebackup
is running. Don't do it, please, for you, not for me.
Anything after the
WAL LOCATION
andCHECKPOINT LOCATION
– could be consistent, might not be. That's why we then recover things (with WAL) until we get to the point. Thankfully, we can choose a specific point of time and restore to it. So, 3 hours after thepg_basebackup
ran or…5 hours after.The
CHECKPOINT LOCATION
tells us where in theWAL
the checkpoint is located.We recover from the
WAL LOCATION
.^^^ This was tricky.
pg_dump
is granular butpg_basebackup
: it's all or nothing for backups and restores. Sorry!We can specify
restore_command
to tell PG what command to use to restore.With no Recovery Target info, it'll keep recovering until either there is no file or it encounters garbage. Let's specify a Recovery Target.
Setting it to
immediate
says make sure the backup is consistent. Huh?Set
recovery_target_time = 2020-09-01 14:20:00
to tell it the timestamp to recover up to.Set
recovery_target_action
. If you set it topause
, it'll bring the DB up to recovery mode so you can check the recovery looks OK. Otherwise, set it topromote
to just kick it into production.recovery.signal
–> is it gone? Everything is OK?The recovery process will rename
backup_label
tobackup_label.old
.Cool! I did the
rm
too soon so…here's how the recovery looked on his machine.pg_dump vs pg_basebackup
pg_dump is more flexible, has to do a sequential scan, do a bunch of stuff that's slower than pg_basebackup
pg_basebackup – generally faster than
pg_dump
, larger backup [bloat, indexes (inpg_dump
an index is aCREATE INDEX
statement but withpg_basebackup
, it is the whole actual index)]. Single-threaded, no good with huge databases (10TB).

Parallel Universes
Restore to a certain point, then new incarnation of database is created before the bad thing you want to recover from happened. This is how you get timelines 1 and 2.
The timeline is part of the name of the WAL segments.
Observe:
Say TL2 was created too soon. TL3 recovers later on:
Each recovery attempt creates a new timeline. There is always only one current timeline. When recovering, always follow the path to a new timeline. With
current
, stay on the current timeline. Or specify a timeline ID (following each fork to get to this timeline). You shouldn't need to worry too much about this unless you are in a really bad place. Time to call or hire a DBA?
The
.history
files describe the history of a particular timeline. There was a branch at this point and we created timeline 2:
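The format, roughly, is one line per fork: parent timeline ID, switch-point LSN, and a reason. A sketch of what a `00000002.history` file can contain (values invented):

```
1	0/B000060	no recovery target specified
```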
There are existing non-core tools for automating elements of this. Laurenz recommends
pgBackRest
. See more at pgbackrest.org. You have a dedicatedpgBackRest
server, tellpostgresql.conf
a command to talk to that server. It handles retention as well.Add this above where it belongs:
TEST YOUR BACKUPS. OMG.
SELECT
. They don't have a backup. What options are available?- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →We probably won't cover Replication. A few words:
Say PostgreSQL won't start because you are out of disk space. The correct answer is to increase disk space.
Before major version 10, pg_wal used to be pg_xlog (transaction log). sacct used to be called clog for commit log.
People would delete those things because people would think they were just unnecessary log files and delete them. YIKES! :scared:
Day 4: Replication / High Availability / "Complex" Architectures
When recovering, WAL is retrieved from the archive. You could do replication that way but it would be super annoying. Instead, enter streaming replication.
Replication is recovery (of a type) that never ends.
Streaming replication introduces two proceses:
When something hits the WAL buffers, it goes to the WAL sender, is received by the WAL receiver, and then sent to the WAL on the standby. First to the WAL, and then to the data, remember.
The
startup
process that helps with recovery replays the data into the standby. This is a fundamentally async process.WAL sender and WAL receiver
The standby acts as a client to the primary. PG has two protocols, both on
5432
, it seems:pg_basebackup
The config on the primary and the secondary should be as similar as possible.
Modify
postgresql.conf
.max_wal_senders
–> more than0
.wal_level
can not be minimal, must be replica or logical.Configure
primary_conninfo = ''
, this is a connection string.With the streaming replication protocol, you connect to the cluster, no database needed.
Replicate and then recovery mode until I tell you to
promote
.systemctl start postgresql@13-standby
Observe
walsender
andwalreceiver
processes:sent_lsn
position received by the stand bywrite_lsn
psition written by standbyflush_lsn
what's persisted and definitely safereplay_lsn
what position in the WAL has already been replayed and is visible on the standby server.Log Sequence Numbers (LSN)
first 8 hex digits –> denotes the timeline
second 8 hex digits –> a number that corresponds to the one before the slash in the
_lsn
first two hex digits after slash in
_lsn
are the last two digits of the WAL segment nameremaining six digits in
_lsn
are the offset.SELECT pg_current_wal_lsn();
to get the current LSN.- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
psql
haswatch
!Measure replication lag in bytes, not time. Time is unreliable for this purpose.
docker run –name postgres_training
-e POSTGRES_PASSWORD=mysecretpassword
-d postgres:13
Stop the standby.
There are checkpoints (actually called restart points) on a recovering database. This makes an interrupted recovery less annoying. Resume recovery from that point, rather than the beginning. This is clearly essential for replication (it super needs to recover from where it left off, not the beginning, where possible).
If you delete a WAL segment needed by the secondary…please don't.
Rebuilding the secondary becomes an option.
The restore command can not be changed without restarting the server.
Use
wal_keep_size
to keep otherwise-unneeded WAL around so it's available for the standby, making things easier if it falls too far behind. This is usually unnecessary. If standby downtime is longer than wal_keep_size
accounts for, you'll still have trouble. Enter the replication slot. It marks a position in the WAL stream. The primary doesn't have any memory of the standby. (It's essentially a client.)The primary should not discard any WAL younger than the replication slot. The replication slot is an indicator of ho much WAL the secondary has.
Consider
max_slot_wal_keep_size
–> override the replication slot indicator to avoid disk space pile up. This will tell PostgreSQL to nuke that WAL even if the standby has not caught up.With synchronous replication,
COMMIT
does not return until the synchronous standby has received theWAL
. Doing this across continents: everything will take forever. This only works when the primary and the standby are close together in the network. the other downside of synchronous replication: the availability of the entire system goes down. A failure on the standby means a failur eon the primary.Use synchronous replication if you can not afford to lose a single transaction (fail over occurs before sync completes). Most people are using async.
The
application_name
field can be used to tell the PG server who I (the client) am. This is handy for replicas. Well-behaved applications set this name. This is used as a convention to name the standby server. You can set this in the connection string.There are a range of options for
synchronous_commit
when it's on.on
– remote flush, waits for standby to write to WAL file and flush to diskremote_apply
–> wait until standby has replayed information and that info is now visible.If you query the standby immediately after making a change, it may not be available with the default settings for
synchronous_commit
. You wantremote_apply
for that.12:01 Break!
12:12 Back!
Good use cases for streaming replication:
remote_apply
levels. Yeah, it would work, but…enter the replication conflict. VACUUM on the primary during a long running query on the second.max_standby_streaming_delay
governs how long you wait before canceling a query and resuming streaming of WAL. The default value is 30s. Consider your report that runs on this replicated database. Please please please don't cancel it 30 seconds in. You can make this indefinite with-1
.Primary goal: replication doesn't lag too much.
hot_standby
Setting this to off disallows queries during recovery.The name of the game here is preventing conflicts.
hot_standby_feedback
– Send oldest query blockingVACUUM
. No more conflicts due toVACUUM
. It's not on by default because a very long running query on the replica could cause bloat on the primary server. The primary will know the oldest query.hot_standby_feedback
remote_apply
–> Please don't, says LA. You'll have lots of waiting around all the time.You can have an intentionally delayed standby server. This is helpful to recover from logical problems, assuming you notice the problem within the intentional delay period. Possible appraoch for quick recovery.
Fun with Functions
Every second, tell Postgres to sleep for 10 seconds.
Temporarily Suspend Replication
Temporarily suspend replication:
SELECT pg_wal_replay_pause();
resume with
SELECT pg_wal_replay_resume();
High Availability (continued) and Failover
There is no cluster-ware in PostgreSQL but the tools to make it happen are available. Here's a manual failover:
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →#promote_trigger_file = '' # file name whose presence ends recovery
psql
commandSELECT pg_promote();
HAproxy
or similar. Alternately: this can be done with connection strings. If I can't get to the first, talk to the second.- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →pg_rewind
.Connect with a string like the above and observe the output of
\conninfo
to see which you are connected to. That approach is cool but makes it possible to wind up connected. tothe stand by and why? It's read-only. Entertarget_session_attrs
, an addition to the connection string:With the above, you'll only connect to a database that is
read-write
. The clients do the handovers. Any client that links withlibpq
can do this. Possible with different syntax with GDBC.A few ways to stop the server:
pg_ctl stop
there are three shutdown modes: smart, fast immediate. The default isfast
.fast
transactions. areall cancelled, client sessions are interrupted, checkpoint made, goes down. Default.smart
no new client connections, existing connections not interrupted. Once last client connection closes, DB shuts down. Not used too often.immediate
evil mode. Go down as quickly as you can, no clean up. Crash the server. Crash recovery will be required ons tartup. Bad for production, fun for testing. Nofsync
.See part 2.
Day 5: Execution Plans / Configuration and Resource Management
Post-Training
These notes will live here on
work.carranza.engineer
once the training is complete.Clean Up
Clean up after yourself!
Get the container ID you want to clean up
docker container ls -a | grep postgres_training | cut -d" " -f1
Tools
Links