# PostgreSQL Training: Part 2 // 2021
Continued from [part 1](https://hackmd.io/2bHfFb8dScWioiNZ8L3wIw)
# Day 4, continued.
- [ ] **READ** [pg_rewind](https://www.postgresql.org/docs/current/app-pgrewind.html) from the PostgreSQL docs
Run `pg_rewind` locally on the failed primary.
Running `pg_rewind`:
```
pg_rewind --target-pgdata=13/main --source-server='port=5433 user=postgres'
```
![pg-rewind00.png](pg-rewind00.png)
Replication slots exist only on the primary; they are not replicated. You'll need to create them again when failing over to a new primary!
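A minimal sketch of re-creating a physical replication slot on the newly promoted primary (the slot name is just an example):
```
-- run on the new primary after promotion
SELECT pg_create_physical_replication_slot('standby1_slot');
-- verify
SELECT slot_name, slot_type, active FROM pg_replication_slots;
```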
**12:55** Break!
**13:10** Back!
## High Availability and Patroni
> What we use and they recommend for HA and cluster-ware.
- Project: [zalando/patroni](https://github.com/zalando/patroni)
- Docs: [Intro](https://patroni.readthedocs.io/en/latest/)
Written by Zalando for their own purposes; it has since taken off and is now highly recommended.
The setup is a shared-nothing architecture, which means there are no shared components. Good for reducing single points of failure.
Patroni runs on both the primary and the secondary. Patroni handles starting and stopping PostgreSQL for you. It does the failover and the configs. No more `postgresql.conf` editing. You tell Patroni what to do and Patroni does it for you.
Use `etcd` [or Consul (if you're GitLab)]. There is a third box: three instances of `etcd` implement the **Raft** protocol, which requires an odd number of nodes.
Patroni uses the key-value store (provided by `etcd` or whatever). See diagram. If it can't talk to the `etcd` running on itself, that's OK, it'll talk to another.
One Patroni is the **Leader** (primary) and the others are **Followers** (secondaries). They vote regularly. Config is distributed across all instances in the key-value store.
`etcd` doesn't need to run on the same machine.
The solo `etcd` host can be a small EC2 instance or similar; it just needs fast disk so it isn't seen as nonresponsive.
![patroni-arch.png](patroni-arch.png)
- [ ] **READ** [PostgreSQL replication and failover with Omnibus GitLab](https://docs.gitlab.com/ee/administration/postgresql/replication_and_failover.html)
## Security: Users, Permissions and More
### The User
View users:
```
postgres=# \du
List of roles
Role name | Attributes | Member of
-----------+------------------------------------------------------------+-----------
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
```
Think of it as `root`. There's nothing preventing access to the underlying filesystem. Applications should not **ever** connect as `postgres`.
There is no big difference between users and user groups: users can log in and groups do not. Either way, users and groups are called **roles**. Use `CREATE ROLE` to create a user.
```
postgres=# \h CREATE ROLE
Command: CREATE ROLE
Description: define a new database role
Syntax:
CREATE ROLE name [ [ WITH ] option [ ... ] ]
where option can be:
SUPERUSER | NOSUPERUSER
| CREATEDB | NOCREATEDB
| CREATEROLE | NOCREATEROLE
| INHERIT | NOINHERIT
| LOGIN | NOLOGIN
| REPLICATION | NOREPLICATION
| BYPASSRLS | NOBYPASSRLS
| CONNECTION LIMIT connlimit
| [ ENCRYPTED ] PASSWORD 'password' | PASSWORD NULL
| VALID UNTIL 'timestamp'
| IN ROLE role_name [, ...]
| IN GROUP role_name [, ...]
| ROLE role_name [, ...]
| ADMIN role_name [, ...]
| USER role_name [, ...]
| SYSID uid
URL: https://www.postgresql.org/docs/13/sql-createrole.html
```
```
CREATE ROLE plop LOGIN PASSWORD 'cutecats';
```
Don't do this:
- the plain-text password ends up in `.psql_history`
Instead, use `\password plop` after `CREATE ROLE plop LOGIN;`:
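A minimal sketch of the safer flow:
```
CREATE ROLE plop LOGIN;
\password plop
-- psql prompts for the password and sends it already hashed,
-- so nothing sensitive lands in .psql_history
```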
Get **all** roles:
```
postgres=# \duS
List of roles
Role name | Attributes | Member of
---------------------------+------------------------------------------------------------+--------------------------------------------------------------
pg_execute_server_program | Cannot login | {}
pg_monitor | Cannot login | {pg_read_all_settings,pg_read_all_stats,pg_stat_scan_tables}
pg_read_all_settings | Cannot login | {}
pg_read_all_stats | Cannot login | {}
pg_read_server_files | Cannot login | {}
pg_signal_backend | Cannot login | {}
pg_stat_scan_tables | Cannot login | {}
pg_write_server_files | Cannot login | {}
plop | | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
```
The `Cannot login` entries are group **roles**. Grant membership in these system roles _instead_ of granting `superuser`.
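For example, to give the `plop` role read access to monitoring views without making it a superuser:
```
GRANT pg_monitor TO plop;
-- plop can now read pg_stat_* views and all settings, but is not a superuser
```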
Experiment with `pg_authid` to get more info:
```
postgres=# SELECT rolname, rolpassword FROM pg_authid WHERE rolname LIKE 'plop';
rolname | rolpassword
---------+-------------------------------------
plop | md5edd3bc15c6025ae3a4570fdf0061e20b
(1 row)
```
MD5. Sad. :cry: :sad:
With 14, `scram-sha-256` will be the default. :tada:
Every user can see the `pg_` catalog tables. People don't like it. There are exceptions like `pg_authid`, which can only be _read_ by the **superuser**.
`listen_addresses = 'localhost'`
That's the server interface. It accepts a comma-separated list of interfaces. With the above, PostgreSQL only listens on the loopback interface (TCP connections are only allowed from `localhost` --> no remote access).
Update to `listen_addresses = '*'` to listen on all available interfaces; this requires a restart.
When a client connects:
- Client IP address, desired database user and desired database name are known to the server.
- The server consults the HBA config file, which governs how the incoming client will be authenticated.
Entries are read top-to-bottom. The first matching entry is chosen and determines how the client must authenticate.
- `local` socket connections
- `host` TCP
There are dedicated entries for `replication`.
The HBA file has these columns:
- TYPE
- DATABASE
- USER
- ADDRESS
- METHOD
This shows why `psql` doesn't need a password over local sockets on `localhost` but does over TCP:
![hba-intro.png](hba-intro.png)
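A hedged example of what such entries might look like (the addresses and methods are illustrative, not the training VM's actual file; the methods are explained just below):
```
# TYPE  DATABASE     USER      ADDRESS         METHOD
local   all          postgres                  peer
host    all          all       127.0.0.1/32    scram-sha-256
host    replication  all       10.0.0.0/24     scram-sha-256
```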
_This is the bit of PostgreSQL I never fully grokked._
You can have an authentication method of `reject` to specifically reject someone. Also, no matching entries means that a rejection will occur.
- [ ] **READ** [Client Authentication](https://www.postgresql.org/docs/current/client-authentication.html)
- `trust` --> come on in! no checking
- `peer` --> come on in! Only for local connections and only when the requested database user matches the local OS username. (`postgres` can run `psql` no problem but `root` can not.)
- [password authentication](https://www.postgresql.org/docs/current/auth-password.html) has three variants:
  - `password` --> bad, sends the password in clear text
  - the others are `md5` and `scram-sha-256`: challenge/response, where the client hashes the password with a known salt so the password itself is never transmitted.
Users are shared by all databases but not every user can connect to every database. `pg_hba.conf` permits control of this. Password authentication is second-rate among the other available options. The [available options](https://www.postgresql.org/docs/current/client-authentication.html) permit GSSAPI, PAM, RADIUS, LDAP, all the good stuff!
- [ ] **READ** [Configure PostgreSQL block](https://docs.gitlab.com/omnibus/settings/database.html#configure-postgresql-block)
Every object in PostgreSQL has an owner (the role that created it). The owner can do anything with the object; other users must be granted access. However, only the owner can `ALTER` or `DROP` it, regardless of `GRANT`. Permissions are stored on the object.
```
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+------------+------------+-----------------------
postgres | postgres | UTF8 | en_US.utf8 | en_US.utf8 |
template0 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
```
PostgreSQL's permission model is inspired by Unix file system permissions.
To that end, a schema is like a directory (a place where a file, i.e. a table, can go). No nested schemas. Each database object is always in some schema.
```
cutecats=# \dn+ myapp
List of schemas
Name | Owner | Access privileges | Description
-------+----------+-------------------+-------------
myapp | postgres | |
(1 row)
```
Nothing shown in **Access privileges** means it has the defaults. Sad.
Schema privileges:
- create (`C`) --> roughly "write"
- usage (`U`) --> read and execute
```
GRANT USAGE ON SCHEMA myapp TO plop;
```
```
cutecats=# \dn+ myapp
List of schemas
Name | Owner | Access privileges | Description
-------+----------+----------------------+-------------
myapp | postgres | postgres=UC/postgres+|
| | plop=U/postgres |
(1 row)
```
After the slash is who granted the privilege, the bit before the `=` is who received it, and the bit between `=` and `/` is the privilege itself:
```
plop=U/postgres
```
The user `postgres` granted the user `plop` the `U` (`USAGE`) privilege.
Now, let's look at the `public` schema:
```
cutecats=# \dn+ public
List of schemas
Name | Owner | Access privileges | Description
--------+----------+----------------------+------------------------
public | postgres | postgres=UC/postgres+| standard public schema
| | =UC/postgres |
(1 row)
```
The empty (_<null>_) grantee before the `=` is PostgreSQL's way of saying `PUBLIC`: anybody can do it.
One of the first things you do after creating a database should be to lock down the `public` schema:
```
cutecats=# REVOKE CREATE ON SCHEMA public FROM PUBLIC;
REVOKE
cutecats=# \dn+ public
List of schemas
Name | Owner | Access privileges | Description
--------+----------+----------------------+------------------------
public | postgres | postgres=UC/postgres+| standard public schema
| | =U/postgres |
(1 row)
```
Do the same thing on `template1` so new DBs won't have the bad lame public schema.
```
\c template1
REVOKE CREATE ON SCHEMA public FROM PUBLIC;
```
Database Cluster
--> Databases
--> Schema
Each database has its own schemas.
What MySQL calls a database is a schema elsewhere. WHAT?
**13:58** Break!
**14:11** Back!
There are also system schemas. View them with `\dnS`.
- `pg_catalog` --> preexisting system types, operators, system functions, catalog tables all live here.
- `pg_toast` --> Toast tables for overlength attributes.
- `pg_temp_2` and `pg_toast_temp_2` --> implementation details
- `information_schema` --> Interesting for the developer: a database-independent, standard-conforming way to access database metadata.
Use this, not that:
- **THAT**: `template1=# SELECT * FROM pg_attribute \gx`
- **THIS**: `template1=# SELECT * FROM information_schema.columns;`
- **Q**: Why did our tables get created in the `public` schema?
- **A**: This is governed by the `search_path` parameter.
```
template1=# SHOW search_path;
-[ RECORD 1 ]----------------
search_path | "$user", public
Time: 0.283 ms
```
Relative path: `SELECT * FROM accounts;`
Absolute path: `SELECT * FROM public.accounts;`
If you write the relative form, PostgreSQL uses the first matching table found in a schema on the `search_path`.
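A self-contained sketch of how that resolution plays out (the schema and tables are created just for illustration):
```
CREATE SCHEMA IF NOT EXISTS myapp;
CREATE TABLE myapp.accounts (id int);
CREATE TABLE public.accounts (id int);

SET search_path = myapp, public;
SELECT * FROM accounts;          -- relative: resolves to myapp.accounts, the first match on the path
SELECT * FROM public.accounts;   -- absolute: always the table in public
```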
- [ ] **READ** [the docs about schemas](https://www.postgresql.org/docs/current/ddl-schemas.html)
When querying, you have these options:
- always schema-qualify names (a bad `search_path` never gets in your way)
- leave out the schema and make sure `search_path` is right
_This is answering a lot of questions! The most useful day so far._
### Permissions on tables
```
GRANT { { SELECT | INSERT | UPDATE | DELETE | TRUNCATE | REFERENCES | TRIGGER }
```
- `REFERENCES` allows others to create a foreign key that points to your table
- `TRIGGER` -- don't use; it lets someone else create a trigger on your table
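A minimal example of granting typical application privileges on a table (the schema, table and role names reuse earlier examples and are otherwise illustrative):
```
GRANT SELECT, INSERT, UPDATE, DELETE ON myapp.accounts TO plop;
-- inspect the resulting table privileges
\dp myapp.accounts
```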
### Row-Level Security
We have a two-user architecture.
A three-user architecture:
- Super user
- User that owns everything
- Application user that can just access (principle of least privilege)
This three-user architecture would make migrations more difficult.
To achieve row-level security, a policy must be defined.
- [ ] **READ** [Row Security Policies](https://www.postgresql.org/docs/current/ddl-rowsecurity.html)
```
ALTER TABLE rls ENABLE ROW LEVEL SECURITY;
CREATE POLICY coolstuff ON rls FOR SELECT TO joe USING (id % 2 = 0);
```
The user `joe` will only see rows for which the `USING` expression above evaluates to true; the filter is applied automatically to their queries.
![row-level.png](row-level.png)
## Bypassing Row Level Security
Superuser and table owner are not subject to row-level security.
There is a `BYPASSRLS` option when creating a role.
You can also turn row security off via the `row_security` parameter:
```
SHOW row_security;
```
## Upgrade
Minor upgrades are easy: always do them. Install the new software, restart the database. Happiness! A `pg_dump` is not required (do it anyway, probably).
Due diligence: read the [release notes](https://www.postgresql.org/docs/current/release.html).
- [ ] **READ** [Migrate to version 13](https://www.postgresql.org/docs/current/release-13.html#id-1.11.6.9.4)
The simplest method:
- `pg_dumpall`
- `initdb`
- restore dump with `psql`
This leaves out configuration, remember!
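A hedged shell sketch of that dump-and-restore path (ports and paths are assumptions; adjust them for your packaging):
```
# dump everything from the old cluster, ideally with the *new* version's pg_dumpall
pg_dumpall -p 5432 > all.sql

# create the new cluster with the new version's initdb (path is an example)
/usr/lib/postgresql/13/bin/initdb -D /var/lib/postgresql/13/main

# restore into the new cluster
psql -p 5433 -f all.sql postgres
```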
You can just copy `pg_hba.conf` around between versions. This is not advised with `postgresql.conf`:
- parameters get removed (and new ones are introduced) --> What if DB doesn't start up because parameter is unknown?
- you lose info about the default config for that option
LA recommends comparing old and new, looking them up and configuring the new. It's an opportunity to review your config.
Use the newer `pg_dump` where possible.
Another way of upgrading...
`pg_upgrade` --> introduced in 8.x
- Create a new cluster with the new version (e.g. 13).
- Run `pg_upgrade`, which extracts metadata from the old database (only the catalog tables), then copies the data files over as-is.
Faster.
Only works if the storage format is the same.
Things that influence how data is stored must also be the same (e.g. `-k` data checksums during `initdb`) --> otherwise you would _have_ to use `pg_dump` and restore.
This approach can still be painfully slow. Enter `pg_upgrade -?`.
With `-k, --link` passed to `pg_upgrade`, it hard-links instead of copying files to the new cluster. Up to you whether you want that, but with this approach you don't have a known-good old database to go back to: the old server **can not** be started once the new one has run. Falling back from a `pg_upgrade --link` upgrade means recovering from a restore, failing over to a replica, or something similar.
```
-k, --link link instead of copying files to new cluster
```
![pg_upgrade.png](pg_upgrade.png)
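A hedged example invocation (the directories and binary paths are assumptions; they depend on your packaging, and both servers must be stopped):
```
pg_upgrade \
  --old-bindir=/usr/lib/postgresql/12/bin \
  --new-bindir=/usr/lib/postgresql/13/bin \
  --old-datadir=/var/lib/postgresql/12/main \
  --new-datadir=/var/lib/postgresql/13/main \
  --link    # optional: hard-link instead of copy, see the caveats above
```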
The statistics are not moved over. Instead, you get an `analyze_new_cluster.sh` script to help guide that process for you.
So, `pg_upgrade` is super fast (never more than 2-3 minutes, he says, but huh???)
Anyway, say you don't even want those 2-3 minutes of downtime: **logical replication**.
- An extension (or, since 10, a built-in feature) that ships logical information to the standby; the WAL is reverse-engineered to figure out the right thing to do.
- Usually used for replicating part of a database.
- Useful if physical differences are OK.
- Works across PostgreSQL versions.
This option reduces downtime as much as possible, but at the cost of simplicity.
---
- **Q**: Why is `smart` rarely used?
- **A**: Because you normally have a connection pool, so you have long database sessions (which is a good idea; more tomorrow).
---
# Day 5: Partitioning, Configuration, Performance Tuning, Query Tuning
## Partitioning
### Views
> So we can learn about partitions.
```
CREATE DATABASE cat;
\c cat
CREATE TABLE person (id bigint, name text, bday date);
INSERT INTO person VALUES(1, 'oscar', '1996-11-23');
INSERT INTO person VALUES(2, 'brie', '2000-11-23');
SELECT id, name FROM person WHERE bday > '2000-01-01';
id | name
----+------
2 | brie
(1 row)
```
Say we use that query frequently... We create a view. A view is a `SELECT` with a name. It looks and acts like a table but isn't one.
```
cat=# CREATE VIEW youngsters AS SELECT id, name FROM person WHERE bday > '2000-01-01';
CREATE VIEW
cat=# \d youngsters
View "public.youngsters"
Column | Type | Collation | Nullable | Default
--------+--------+-----------+----------+---------
id | bigint | | |
name | text | | |
cat=# SELECT * FROM youngsters;
id | name
----+------
2 | brie
(1 row)
```
Explore `EXPLAIN` using the view and the underlying query:
```
cat=# EXPLAIN SELECT * FROM youngsters;
QUERY PLAN
----------------------------------------------------------
Seq Scan on person (cost=0.00..24.12 rows=377 width=40)
Filter: (bday > '2000-01-01'::date)
(2 rows)
cat=# EXPLAIN SELECT id, name FROM person WHERE bday > '2000-01-01';
QUERY PLAN
----------------------------------------------------------
Seq Scan on person (cost=0.00..24.12 rows=377 width=40)
Filter: (bday > '2000-01-01'::date)
(2 rows)
```
Useful for:
- shortcut/nicknames --> don't keep typing the same thing
- abstraction layer --> Maybe your application just deals with the `VIEW`. This makes application updates much less annoying: even if you change the tables around, just update the view and the application gets the same results.
As with everything, use in moderation. Views on views on views on views. Please don't. :turtle: :turtle: :turtle:
### Partition Table
Think of partitions as a kind of view.
The partitioning key determines which partition a row belongs in.
- [ ] **READ** [Table Partitioning](https://www.postgresql.org/docs/current/ddl-partitioning.html)
On `INSERT`, each row goes into exactly one partition. `SELECT` collects info from all partitions as if you had done `UNION ALL`.
:arrow_right: `INSERT`
:arrow_left: `SELECT`
![](https://i.imgur.com/xemSD71.png)
There are three partitioning strategies:
- range partitioning
- list partitioning
- hash partitioning
**Range partitioning** -- everything between x and y goes in a given partition. No overlaps; the partition boundaries must not intersect. **List partitioning** is also possible (a list of values is specified instead of a range). **Hash partitioning** -- the partitioning key is run through the hash function appropriate for its data type, the result is divided by the number of partitions, and the remainder determines the resulting partition. Mostly useless. Good for randomly splitting a big table into many, with an I/O balancing advantage, but beyond that -- eh, why?
#### Range Partitioning
```
CREATE TABLE data (id bigint GENERATED ALWAYS AS IDENTITY NOT NULL, created timestamp NOT NULL, catload text) PARTITION BY RANGE (created);
```
The `(created)` value specifies the partitioning key.
```
cat=# INSERT INTO data (created, catload) VALUES (current_timestamp, 'Cats are so extremely cute, OMG.');
ERROR: no partition of relation "data" found for row
DETAIL: Partition key of the failing row contains (created) = (2021-09-03 15:23:32.195025).
cat=#
```
Automatic partition creation == NO. Not currently available in PostgreSQL. Your application or `cron` or something has to create the partitions. We create it with `CREATE TABLE`:
```
CREATE TABLE data_2021 PARTITION OF data FOR VALUES FROM ('2021-01-01 00:00:00') TO ('2022-01-01 00:00:00');
CREATE TABLE data_2020 PARTITION OF data FOR VALUES FROM ('2020-01-01 00:00:00') TO ('2021-01-01 00:00:00');
CREATE TABLE data_2019 PARTITION OF data FOR VALUES FROM ('2019-01-01 00:00:00') TO ('2020-01-01 00:00:00');
CREATE TABLE data_2018 PARTITION OF data FOR VALUES FROM ('2018-01-01 00:00:00') TO ('2019-01-01 00:00:00');
```
The upper boundary is always excluded to avoid overlaps on New Year's Day at midnight with the statements above. If you try, you'll get an error:
```
cat=# CREATE TABLE data_dogs PARTITION OF data FOR VALUES FROM ('2018-01-01 00:00:00') TO ('2019-01-01 00:00:00');
ERROR: partition "data_dogs" would overlap partition "data_2018"
```
```
cat=# INSERT INTO data (created, catload) VALUES (current_timestamp, 'something');
INSERT 0 1
cat=# TABLE data_2021;
id | created | catload
----+----------------------------+-----------
2 | 2021-09-03 15:28:12.284289 | something
(1 row)
cat=# INSERT INTO data_2018 (created, catload) VALUES (current_timestamp, 'something');
ERROR: null value in column "id" of relation "data_2018" violates not-null constraint
DETAIL: Failing row contains (null, 2021-09-03 15:28:28.176442, something).
```
In use:
```
cat=# EXPLAIN SELECT * FROM data;
QUERY PLAN
---------------------------------------------------------------------------
Append (cost=0.00..104.20 rows=4280 width=48)
-> Seq Scan on data_2018 data_1 (cost=0.00..20.70 rows=1070 width=48)
-> Seq Scan on data_2019 data_2 (cost=0.00..20.70 rows=1070 width=48)
-> Seq Scan on data_2020 data_3 (cost=0.00..20.70 rows=1070 width=48)
-> Seq Scan on data_2021 data_4 (cost=0.00..20.70 rows=1070 width=48)
(5 rows)
```
Updates that move data between partitions regularly probably mean you should have chosen a better partitioning strategy.
```
cat=# UPDATE data SET created = created - INTERVAL '1 year' WHERE id = '2' ;
UPDATE 1
cat=# TABLE data_2021;
id | created | catload
----+---------+---------
(0 rows)
cat=# TABLE data_2020;
id | created | catload
----+----------------------------+-----------
2 | 2020-09-03 15:28:12.284289 | something
(1 row)
```
Let's add 1,000,000+ rows that will be in different partitions:
```
INSERT INTO data (created, catload) SELECT t, 'random value' FROM generate_series(TIMESTAMP '2018-12-29 00:00:00', TIMESTAMP '2021-01-06 00:00:00', INTERVAL '1 minute') AS gs(t);
INSERT 0 1064161
```
Partition pruning in action:
```
cat=# EXPLAIN SELECT * FROM data WHERE id = 42;
QUERY PLAN
------------------------------------------------------------------------------------------
Gather (cost=1000.00..27785.85 rows=4 width=29)
Workers Planned: 2
-> Parallel Append (cost=0.00..26785.45 rows=4 width=29)
-> Parallel Seq Scan on data_2020 data_3 (cost=0.00..13241.01 rows=1 width=29)
Filter: (id = 42)
-> Parallel Seq Scan on data_2019 data_2 (cost=0.00..13205.00 rows=1 width=29)
Filter: (id = 42)
-> Parallel Seq Scan on data_2021 data_4 (cost=0.00..211.90 rows=1 width=29)
Filter: (id = 42)
-> Parallel Seq Scan on data_2018 data_1 (cost=0.00..127.53 rows=1 width=29)
Filter: (id = 42)
(11 rows)
cat=# EXPLAIN SELECT * FROM data WHERE created = '2020-07-12 12:00:00';
QUERY PLAN
----------------------------------------------------------------------------------
Gather (cost=1000.00..14241.21 rows=2 width=29)
Workers Planned: 2
-> Parallel Seq Scan on data_2020 data (cost=0.00..13241.01 rows=1 width=29)
Filter: (created = '2020-07-12 12:00:00'::timestamp without time zone)
(4 rows)
```
Filtering on `created` lets partition pruning work.
```
BEGIN;
DROP TABLE data_2018;
COMMIT;
```
You can **detach** a partition with `ALTER TABLE data DETACH PARTITION data_2018;`. The table persists but is no longer part of the partitioned table. The reverse operation is possible by attaching a partition:
```
ALTER TABLE data ATTACH PARTITION data_2018 FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
```
Or:
```
CREATE TABLE data_2022 (LIKE data_2021);
-- populate data_2022 here
ALTER TABLE data ATTACH PARTITION data_2022 FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
```
### Default Partition
`CREATE TABLE data_def PARTITION OF data DEFAULT;`
If it doesn't belong elsewhere, it goes there.
```
INSERT INTO data (created, catload) VALUES ('2039-02-14 04:19:00', 'SO very cute');
```
Yikes! Exercise caution when using a default partition. In the example above, you won't be able to create the `data_2039` partition later. This might be fine depending on _how_ you are defining your partitions:
```
cat=# CREATE TABLE data_2023 PARTITION OF data FOR VALUES FROM ('2039-01-01 00:00:00') TO ('2040-01-01 00:00:00');
ERROR: updated partition constraint for default partition "data_def" would be violated by some row
```
That's because of the row we `INSERT`ed above. With a fixed set of partitions, a default partition is OK, like users by alphabetical order or similar.
Let's create an index...
Creating an index on `data` creates a partitioned index, observe: :eyeglasses:
```
cat=# \d data
Partitioned table "public.data"
Column | Type | Collation | Nullable | Default
---------+-----------------------------+-----------+----------+------------------------------
id | bigint | | not null | generated always as identity
created | timestamp without time zone | | not null |
catload | text | | |
Partition key: RANGE (created)
Indexes:
"data_created_idx" btree (created)
Number of partitions: 5 (Use \d+ to list them.)
cat=# \d data_2018
Table "public.data_2018"
Column | Type | Collation | Nullable | Default
---------+-----------------------------+-----------+----------+---------
id | bigint | | not null |
created | timestamp without time zone | | not null |
catload | text | | |
Partition of: data FOR VALUES FROM ('2018-01-01 00:00:00') TO ('2019-01-01 00:00:00')
Indexes:
"data_2018_created_idx" btree (created)
```
```
cat=# DROP INDEX data_2018_created_idx;
ERROR: cannot drop index data_2018_created_idx because index data_created_idx requires it
HINT: You can drop index data_created_idx instead.
```
PostgreSQL prevents you from doing silly things:
```
cat=# ALTER TABLE data_2019 ADD newcol inet;
ERROR: cannot add column to a partition
```
The optimizer creates an execution plan individually for each partition. Partitioning creates more work for the query optimizer.
Let's _try_ to create a primary key:
```
cat=# ALTER TABLE data ADD PRIMARY KEY (id);
ERROR: unique constraint on partitioned table must include all partitioning columns
DETAIL: PRIMARY KEY constraint on table "data" lacks column "created" which is part of the partition key.
```
:game_die: No dice, because there is no guarantee of uniqueness _across_ partitions. We need the partitioning key in the constraint so that uniqueness across partitions follows from it.
- Uniqueness within partition guaranteed by the index.
- Uniqueness across partitions added by the partitioning key.
```
cat=# ALTER TABLE data ADD PRIMARY KEY (id, created);
ALTER TABLE
```
The above works but is unfortunate.
A global index that does the desired thing _could_ happen but would deprive us of the performance benefits of partitioning. Not possible today with PostgreSQL.
Subpartitioning is possible and could theoretically be arbitrarily deep. Beyond two layers:
> WHY:question:
- **Q**: What's a good approach for partitioning a live table after the fact?
- **A**: Partition it in advance, think ahead. Fallacy: I don't need to worry about that now, I'll deal with it when I have lots of data.
- See [Initiate the web_hook_logs partitioning](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/55938) where we rolled our own with `TRIGGER`.
- With PostgreSQL 13+, set up logical replication somewhere with a partition table, replicate over.
**11:55** Break!
**12:04** Back!
### Performance, Partitioning and (Query) Planning
With a query like `EXPLAIN SELECT catload, count(*) FROM data GROUP BY catload;`:
```
SET enable_partitionwise_aggregate = on;
```
If two tables are partitioned with the same partitioning key and along the same boundaries, use `enable_partitionwise_join`.
The `enable_partitionwise_aggregate` and `enable_partitionwise_join` options are **off** by default because PostgreSQL tries to keep planning costs down. Only enable them if your setup would benefit.
## When would I partition a table? When shouldn't I partition a table?
Situations when you might want to use partitioning:
- Getting rid of data becomes more performant: retention policies become a quick `DROP TABLE` of the old partition (see the `DROP TABLE data_2018;` example above) instead of a long-running, I/O-intensive `DELETE` that requires a long `VACUUM` and introduces bloat.
- Backloading: load a partition's worth of data into a standalone table with `COPY FREEZE`, create the indexes, then attach it as a partition. Recall `DETACH` and `ATTACH` from above.
- Performance: mixed bag. :handbag: Imagine a query that requires a sequential scan and has a `WHERE` clause: the scan can run on just the matching partitions rather than everywhere. :warning: Narrow use case. Could be more expensive. The moral of many of these stories: it depends on your data, your use case, your tolerance for different behaviors. The average query may become slower, but some queries will be much faster.
- `VACUUM` performance: only one `VACUUM` can run per table, which can cause performance problems. Partitioning permits simultaneous `VACUUM`s across the partitions (multiple allowed).
An example:
```
cat=# EXPLAIN SELECT * FROM data WHERE id = 42;
QUERY PLAN
----------------------------------------------------------------------------------------------
Append (cost=0.29..41.66 rows=5 width=33)
-> Index Scan using data_2018_pkey on data_2018 data_1 (cost=0.29..8.30 rows=1 width=29)
Index Cond: (id = 42)
-> Index Scan using data_2019_pkey on data_2019 data_2 (cost=0.43..8.45 rows=1 width=29)
Index Cond: (id = 42)
-> Index Scan using data_2020_pkey on data_2020 data_3 (cost=0.43..8.45 rows=1 width=29)
Index Cond: (id = 42)
-> Index Scan using data_2021_pkey on data_2021 data_4 (cost=0.29..8.30 rows=1 width=29)
Index Cond: (id = 42)
-> Index Scan using data_def_pkey on data_def data_5 (cost=0.12..8.14 rows=1 width=48)
Index Cond: (id = 42)
(11 rows)
```
With partitioning, you have to do way more scans than you otherwise would so what even is the point.
Big tables and slow queries --> the answer is not **necessarily** partition! Partitioning is not considered to be a performance improvement. Makes the most sense in a data warehouse or similar. :factory:
- **Myth**: Partitioning will make all queries faster.
- The average query will probably be slower. Some queries will be faster.
- **Myth**: If table is very large, you must partition.
- Have a better reason: autovacuum behavior, specific queries that require a sequential scan.
Partitioning was introduced in major version 10. Use at least **12** to take advantage of benefits. Essentially feature-complete with **11** and performance improvements began in **12**.
### Sharding
Split data into parts that go in different databases on different hosts to permit scaling, load distribution. Sensible with data that lends itself to being split in this way. Retain some data on _all_ shards to make sharding less painful.
Shard:
- in the application
- consider PG fork
- [PL/Proxy](https://plproxy.github.io/)
- PostgreSQL 14+ -> built-in options
- Foreign Tables (looks and feels like a regular table but acts as client to another database, presents received data as if local)
- Use a [foreign table](https://www.postgresql.org/docs/current/sql-createforeigntable.html) as a partition -> kinda sharding. With 14+, parallel append makes this approach more reasonable (before, each 'shard' was queried sequentially, lame). See the sketch below.
Not every data set can be sharded successfully.
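A minimal sketch of the foreign-table-as-partition idea with `postgres_fdw`, assuming a partitioned table like `data` from above (the server name, host, credentials and remote table are all made up):
```
CREATE EXTENSION postgres_fdw;

CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'cat');

CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
    OPTIONS (user 'app', password 'secret');

-- data for 2017 lives on the remote host; locally it just looks like another partition
CREATE FOREIGN TABLE data_2017 PARTITION OF data
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01')
    SERVER shard1 OPTIONS (table_name 'data_2017');
```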
`/` :bookmark_tabs:
## Configuration
**GUC** - Grand Unified Configuration
If you add trash to `postgresql.conf` and tell it to reload, the reload fails and the server keeps running with the old config. If you then do `systemctl restart` or a hard reboot, you'll really be sad: the broken config is on disk and the server will try to start with it.
There is no `--validate-config` flag: you just `reload` and look for errors. :sign_of_the_horns:
The **GUC** (Grand Unified Configuration):
```mermaid
graph TD
defaults --> postgresql.conf;
postgresql.conf --> C[ALTER SYSTEM];
C-->D[start server, override params, see Patroni]
D-->E[ALTER DATABASE, per-database settings]
E-->F[ALTER ROLE]
F-->G[ALTER ROLE IN DATABASE]
G-->H[SET]
H-->I[SET LOCAL]
```
There are so many options: you could have fun. Say one DBA sets a change in `postgresql.conf` and another with `ALTER SYSTEM`. The person changing `postgresql.conf` will be sad. :crying_cat_face: Talk to your fellow DBAs.
```
SELECT * FROM pg_settings ;
```
Explore the output, a few notes:
- Contexts: `postmaster`-context settings must be changed at startup (restart required).
- Use the `source` field to figure out _where_ a specific setting has been set!
- The `pending_restart` column flags settings that have been changed but still need a restart to take effect. So cool!
```
postgres=# SELECT name,setting,category,context FROM pg_settings WHERE name = 'search_path' ;
name | setting | category | context
-------------+-----------------+-------------------------------------------------+---------
search_path | "$user", public | Client Connection Defaults / Statement Behavior | user
(1 row)
```
You might want a different schema search path for different databases:
```
ALTER DATABASE course SET search_path = myapp, public;
```
That change applies to _new_ connections only; existing sessions keep their current setting.
The things lower down will override the things that are higher up.
Design principle: do configuration changes from the client, as you may not have shell access. That's why there are so many of these little things. Super cool.
`postgresql.auto.conf` is modified **automatically** when using `ALTER SYSTEM`:
```
ALTER SYSTEM SET enable_partitionwise_aggregate = on;
```
You still need to `reload` for changes set via `SQL`. Thankfully, you can reload via SQL:
```
SELECT pg_reload_conf();
```
Not _all_ settings can be changed via SQL, like... `SET port = 5555;`.
1. Why?
1. No.
1. Change it in `postgresql.conf` and restart (`port` has `postmaster` context).
```
SHOW server_version;
```
Use `\drds` to list the per-database role settings (stored in the `pg_db_role_setting` catalog).
https://www.postgresql.org/message-id/flat/20091008163401.23D8C753FB7%40cvs.postgresql.org
Use `SET LOCAL` to make a change for a single transaction. Cool! (Increase memory, timeouts, whatever.)
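A small sketch of `SET LOCAL` for a one-off expensive statement (the value is just an example):
```
BEGIN;
SET LOCAL work_mem = '256MB';   -- only affects this transaction
-- run the expensive query here
COMMIT;                         -- work_mem reverts automatically afterwards
```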
- **Challenge**: Improve performance by changing a single value from the defaults!
- **Response**: Increase `shared_buffers` beyond the default (see the sketch after this list). The guideline for determining a reasonable value: 1/4 of available RAM, no more than **8GB**. Why so little? More than that _could_ be beneficial, but try it for **your** workload. (Bigger isn't always better; consider a representative load test.) PostgreSQL uses buffered I/O: reads may not go to disk, so you can still benefit from data being cached in memory even if it's not in PG's shared buffers (thanks, kernel! :popcorn:).
- Avoid caching the same things twice: use less than half of the memory, or more than half, but not exactly half. :one: :heavy_division_sign: :two:
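A minimal sketch of actually changing it (the value assumes a machine with roughly 32 GB of RAM):
```
ALTER SYSTEM SET shared_buffers = '8GB';
-- shared_buffers has postmaster context: a reload is not enough,
-- the server must be restarted for this to take effect
```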
### Look at the Buffer Cache
```
CREATE EXTENSION IF NOT EXISTS pg_buffercache;  -- the view comes from this extension
\d pg_buffercache
SELECT usagecount, count(*) FROM pg_buffercache GROUP BY usagecount ORDER BY usagecount;
```
So, `usagecount` is a number between `0` and `5`. Each use of a buffer increases the count by 1, up to 5. A buffer can _only_ be evicted when its `usagecount` is `0`, and that happens when a backend passes by looking for an empty buffer. There's also a background process that walks through and counts usage counts down; a backend looking for an empty buffer counts them down too. Usage increases the `usagecount`, someone skipping past it decreases it: the clock-sweep algorithm.
- [ ] **READ** [The LRU/Clock sweep cache algorithm](https://madusudanan.com/blog/understanding-postgres-caching-in-depth/#LRU)
### :cooking: Cache Warm-Up
Without this, it takes a while after a restart for the cache to warm up. If you know what should be cached for good performance, you can prewarm it to load things into cache:
```
CREATE EXTENSION IF NOT EXISTS pg_prewarm;  -- provides the pg_prewarm() function
SELECT pg_prewarm('test_upper_idx');
SELECT pg_prewarm('test_id_idx');
```
Data that has been prewarmed is still subject to the `usagecount` stuff. You can do some pre-warming at start-up with a parameter called `shared_preload_libraries`:
```
shared_preload_libraries = 'pg_prewarm'
```
Restart PostgreSQL and you'll see a custom background worker in `ps`:
```
postgres: 13/main: autoprewarm master
```
It regularly checks what's in shared buffers, persists this in a file on disk in the data directory called....`something`.
---
> It's not as bad as Apache configs, which are science.
:joy:
**13:01** Break!
**13:17** Back!
## Tuning Parameters
### Parameter: work_mem
`work_mem` is private memory for the backend processes, which calculate queries on behalf of the client.
```
postgres=# EXPLAIN SELECT * FROM pg_stats;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=126.24..153.50 rows=5 width=401)
-> Hash Join (cost=126.11..152.20 rows=5 width=475)
Hash Cond: ((s.starelid = c.oid) AND (s.staattnum = a.attnum))
-> Seq Scan on pg_statistic s (cost=0.00..22.02 rows=402 width=349)
-> Hash (cost=111.80..111.80 rows=954 width=142)
-> Hash Join (cost=22.65..111.80 rows=954 width=142)
Hash Cond: (a.attrelid = c.oid)
Join Filter: has_column_privilege(c.oid, a.attnum, 'select'::text)
-> Seq Scan on pg_attribute a (cost=0.00..81.61 rows=2861 width=70)
Filter: (NOT attisdropped)
-> Hash (cost=17.82..17.82 rows=386 width=72)
-> Seq Scan on pg_class c (cost=0.00..17.82 rows=386 width=72)
Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
-> Index Scan using pg_namespace_oid_index on pg_namespace n (cost=0.13..0.18 rows=1 width=68)
Index Cond: (oid = c.relnamespace)
(15 rows)
```
There are a great many **nodes** or **steps** in the execution plan. Each node or step could use up to `work_mem` of memory. Most nodes won't use a lot of memory (sequential and index scans). The three types of nodes that consume memory:
- Hash
- Bitmap
- Sort --> with insufficient memory, the sort spills to disk and runs in batches. Avoid.
The above makes it tricky to calculate what `work_mem` should be. LA recommends this formula as a _guideline_:
- `shared_buffers` + `work_mem` × `max_connections` ≤ RAM
Consider `max_connections` an upper limit; in reality you might have far fewer active connections.
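Plugging in made-up numbers shows how the guideline is used (all values are assumptions):
```
RAM             = 64 GB
shared_buffers  =  8 GB
max_connections = 200

work_mem ≤ (64 GB - 8 GB) / 200 ≈ 280 MB   -- leave headroom for the OS page cache
```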
Example:
```
postgres=# CREATE DATABASE cats;
CREATE DATABASE
postgres=# \c cats
You are now connected to database "cats" as user "postgres".
cats=# CREATE TABLE work (id integer GENERATED ALWAYS AS IDENTITY, name text);
CREATE TABLE
cats=# INSERT INTO work (name) SELECT 'brie and plop' FROM generate_series(1, 1000000);
INSERT 0 1000000
cats=# INSERT INTO work (name) SELECT 'plop so cute' FROM generate_series(1, 1000000);
INSERT 0 1000000
cats=# ANALYZE work;
ANALYZE
cats=# EXPLAIN SELECT name, count(*) FROM work GROUP BY name;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=26239.05..26239.56 rows=2 width=21)
Group Key: name
-> Gather Merge (cost=26239.05..26239.52 rows=4 width=21)
Workers Planned: 2
-> Sort (cost=25239.03..25239.03 rows=2 width=21)
Sort Key: name
-> Partial HashAggregate (cost=25239.00..25239.02 rows=2 width=21)
Group Key: name
-> Parallel Seq Scan on work (cost=0.00..21072.33 rows=833333 width=13)
(9 rows)
cats=# SET max_parallel_workers_per_gather = 0;
SET
cats=# EXPLAIN SELECT name, count(*) FROM work GROUP BY name;
QUERY PLAN
---------------------------------------------------------------------
HashAggregate (cost=42739.00..42739.02 rows=2 width=21)
Group Key: name
-> Seq Scan on work (cost=0.00..32739.00 rows=2000000 width=13)
(3 rows)
```
Just the statements:
```
CREATE DATABASE cats;
\c cats
```
The `HashAggregate` takes care of the `GROUP BY`.
If we `SELECT` id instead:
```
cats=# EXPLAIN SELECT id, count(*) FROM work GROUP BY id;
QUERY PLAN
---------------------------------------------------------------------------------
HashAggregate (cost=145239.00..180864.00 rows=2000000 width=12)
Group Key: id
Planned Partitions: 64
-> Seq Scan on work (cost=0.00..32739.00 rows=2000000 width=4)
JIT:
Functions: 3
Options: Inlining false, Optimization false, Expressions true, Deforming true
(7 rows)
```
This will spill temporary partitions to disk because `work_mem` is too small. :crying_cat_face: Let's fix that:
```
cats=# SET work_mem = '1GB';
SET
cats=# EXPLAIN SELECT id, count(*) FROM work GROUP BY id;
QUERY PLAN
--------------------------------------------------------------------
HashAggregate (cost=42739.00..62739.00 rows=2000000 width=12)
Group Key: id
-> Seq Scan on work (cost=0.00..32739.00 rows=2000000 width=4)
(3 rows)
```
The minimum possible `work_mem` is **64kB**. Let's set `work_mem` to that minimum (`SET work_mem = '64kB';`) and see how the query plan changes:
```
cats=# EXPLAIN SELECT id, count(*) FROM work GROUP BY id;
QUERY PLAN
---------------------------------------------------------------------------------
GroupAggregate (cost=351436.69..386436.69 rows=2000000 width=12)
Group Key: id
-> Sort (cost=351436.69..356436.69 rows=2000000 width=4)
Sort Key: id
-> Seq Scan on work (cost=0.00..32739.00 rows=2000000 width=4)
JIT:
Functions: 6
Options: Inlining false, Optimization false, Expressions true, Deforming true
(8 rows)
```
We see that `GroupAggregate` is used instead. The `cost` values start to be useful: we can see the impact of a lower `work_mem`. :ram:
#### Choosing work_mem
Watch how many temp files are being written for a database called `cats`:
```
cats=# SELECT temp_files, temp_bytes FROM pg_stat_database WHERE datname = 'cats' \watch 4
Fri Sep 3 17:31:55 2021 (every 4s)
temp_files | temp_bytes
------------+------------
2 | 28000000
(1 row)
```
- :musical_note: Don't give more memory than you can afford.
#### About work_mem
From [the docs](https://www.postgresql.org/docs/current/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY):
> work_mem (integer)
>
> Sets the base maximum amount of memory to be used by a query operation (such as a sort or hash table) before writing to temporary disk files. If this value is specified without units, it is taken as kilobytes. The default value is four megabytes (4MB). Note that for a complex query, several sort or hash operations might be running in parallel; each operation will generally be allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.
>
> Hash-based operations are generally more sensitive to memory availability than equivalent sort-based operations. The memory available for hash tables is computed by multiplying work_mem by hash_mem_multiplier. This makes it possible for hash-based operations to use an amount of memory that exceeds the usual work_mem base amount.
## Parameter: max_connections
We care about the connections that are actually consuming system resources. A sufficiently high `max_connections` value can (and will) degrade performance. Also consider `work_mem` when setting `max_connections`; see the guideline above.
Use connection pools to manage the number of active connections.
- [ ] **READ** [Client-side connection-pool](https://docs.gitlab.com/ee/development/database/client_side_connection_pool.html) from GitLab development docs
## Parameter: maintenance_work_mem, autovacuum_work_mem and logical_decoding_work_mem
The friends of `work_mem`.
- `maintenance_work_mem` -- private memory for index creation, `CLUSTER` and `VACUUM`. You probably don't do this _too_ often, so feel free to be generous.
- `autovacuum_work_mem` takes the value of `maintenance_work_mem` by default; it limits how many dead rows can be processed at a time. Be generous; raise it to speed up `vacuum`. (Example values below.)
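A minimal config sketch (the values are illustrative, not recommendations):
```
maintenance_work_mem = 1GB    # index builds, CLUSTER, VACUUM
autovacuum_work_mem = 256MB   # the default of -1 means "use maintenance_work_mem"
```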
## Parameter: `temp_file_limit`
Let's look at temp files.
1. Do `SELECT * FROM generate_series(1, 10000000000000000000000);`
1. Elsewhere: `while true ; do date ; ls -ahls ; sleep 1 ; done`
1. :skull: Kill the `SELECT` from above unless you want to live on the wild side.
Runaway queries could exhaust your disk space.
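The `temp_file_limit` parameter caps how much temporary-file space a single session can use; a query that exceeds it is cancelled instead of filling the disk. A small sketch (the value is arbitrary):
```
SET temp_file_limit = '5GB';   -- per-process cap on temp file space
-- re-running the runaway SELECT above now fails with an error once it
-- exceeds the limit, instead of exhausting disk space
```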
## Query Tuning Parameters
**Planner Cost Constraints**
Tune to match hardware.
- `random_page_cost` -- most important, tuned for spinning disks by default.
- `seq_page_cost` -- cost to read one 8K page during a sequential scan.
With SSDs, bring `random_page_cost` down to (or close to) `seq_page_cost`.
`effective_cache_size` -- impact is moderate, easy to set: it tells the optimizer how much memory is available for caching data. PG doesn't otherwise know how much RAM your machine has, lol. Index scans become cheaper with a larger `effective_cache_size`.
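A hedged config sketch for an SSD-backed machine (the numbers are common starting points, not universal truths):
```
random_page_cost = 1.1        # default is 4.0, which assumes spinning disks
seq_page_cost = 1.0           # default
effective_cache_size = 48GB   # roughly RAM minus what everything else needs
```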
:warning: Avoid the temptation to tune a bunch of parameters all at once just because you have a theory or read somewhere on the Internet that a parameter is super awesome.
- Change one parameter at a time. (Duh.)
- Observe the results.
- Proceed.
## Query Tuning and Analysis: Find Problematic Queries
:hospital: Essentially med school for database doctors. You can learn so much and then you see a bunch of queries (patients) and can apply suggestions from your experience.
## Parallel Queries
The `Gather` in an execution plan denotes parallelization.
_Screenshot around 13:52 local time_
![](https://i.imgur.com/HjWtyqX.png)
Everything under the **Gather** node is parallelized.
### Configuring Parallelization
`max_parallel_workers` = limit on parallel workers across the _entire_ cluster. Take the number of cores into consideration: more parallel workers than cores --> no. Use some number less than the number of cores.
```
#max_parallel_workers = 8 # maximum number of max_worker_processes that
# can be used in parallel operations
```
Tune for conflicting ends: throughput vs response time.
- Parallel query is good for response time but a loss for throughput. Don't use parallel queries if you want to optimize for throughput.
- Data warehouse, many long queries: set `max_parallel_workers` and `max_parallel_workers_per_gather` a bit higher.
You may see a difference between **Workers Planned** and **Workers Launched** depending on what's going on when the query actually runs. The query won't wait for the planned number of workers to become available; it runs with whatever it gets.
- [ ] **READ** [Parallel queries in PostgreSQL](https://www.percona.com/blog/2019/02/21/parallel-queries-in-postgresql/)
Hmm: observe that parallelization is not used here, observe the `rows` value to understand why:
![](https://i.imgur.com/BD71vLv.png)
_screenshot around 13:59 local time_
`parallel_setup_cost` --> cost for setting up parallelization
`parallel_tuple_cost` --> estimate for transferring rows between parallel processes.
`min_parallel_table_scan_size` --> if the table exceeds this size, one worker is considered; if it exceeds double the size, another worker is considered, and so on (increased logarithmically).
```
EXPLAIN SELECT count(*) FROM test;
SET max_parallel_workers_per_gather = 8 ;
EXPLAIN SELECT count(*) FROM test;
```
:moneybag: Never mind the cost, make it fast:
```
ALTER TABLE test SET (parallel_workers = 6);
```
**14:04** Break!
**14:12** Back!
`log_min_duration_statement` - log queries that take longer than a certain amount of time.
- **Q**: Which queries are causing load and hurting the most?
- **A**: Short, ultra-frequent queries won't show up via `log_min_duration_statement`. Instead, we want `pg_stat_statements`.
New `shared_preload_libraries`:
```
shared_preload_libraries = 'pg_prewarm,pg_stat_statements'
```
LA recommends always having this in production. It creates an extra shared memory segment with statistics about executed queries. It's not a record of every individual query; statements are lumped together, and queries that only differ by constants are treated as identical.
There's a built in benchmarking tool: `pgbench`.
```
$ pgbench -c 2 -T 30 cats
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 10
query mode: simple
number of clients: 2
number of threads: 1
duration: 30 s
number of transactions actually processed: 29922
latency average = 2.005 ms
tps = 997.304049 (including connections establishing)
tps = 997.414817 (excluding connections establishing)
```
A high standard deviation: query is sometimes slow and sometimes fast.
```
CREATE EXTENSION pg_stat_statements;
\d pg_stat_statements
```
```
SELECT calls, total_exec_time, query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
```
Here's how that looks after running `pgbench`:
```
postgres=# SELECT calls, total_exec_time, query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
calls | total_exec_time | query
-------+--------------------+------------------------------------------------------------------------------------------------------
29019 | 1472.7642000000155 | UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2
29019 | 1397.9302000000264 | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2
29019 | 1044.9512000000009 | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2
1 | 918.3737 | INSERT INTO large SELECT i, i / $1 + $2 FROM generate_series($3, $4) AS i
29019 | 648.4597000000005 | SELECT abalance FROM pgbench_accounts WHERE aid = $1
29019 | 557.028700000003 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP)
29019 | 181.55990000000943 | BEGIN
29019 | 176.7948000000034 | END
1 | 57.3338 | CREATE EXTENSION pg_stat_statements
1 | 10.5965 | vacuum pgbench_branches
(10 rows)
```
```
cats=# CREATE TABLE large (id integer, x integer);
CREATE TABLE
cats=# INSERT INTO large SELECT i, i / 10000 + 1 FROM generate_series(1, 1000000) AS i;
INSERT 0 1000000
```
This will make blocks of rows with the same value for `x`.
```
CREATE TABLE small (x integer);
INSERT INTO small VALUES (1);
\timing
```
Observe that `SELECT * FROM large JOIN small USING (x);` takes a while (**Time: 233.530 ms**).
```
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large JOIN small USING (x);
```
So much info:
```
# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large JOIN small USING (x);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=30832.00..198533.19 rows=12750000 width=8) (actual time=10265.965..10330.206 rows=9999 loops=1)
Hash Cond: (small.x = large.x)
Buffers: shared hit=4426, temp read=2845 written=4218
-> Seq Scan on small (cost=0.00..35.50 rows=2550 width=4) (actual time=0.009..0.020 rows=1 loops=1)
Buffers: shared hit=1
-> Hash (cost=14425.00..14425.00 rows=1000000 width=8) (actual time=10217.431..10217.450 rows=1000000 loops=1)
Buckets: 131072 (originally 131072) Batches: 32 (originally 16) Memory Usage: 3073kB
Buffers: shared hit=4425, temp written=3276
-> Seq Scan on large (cost=0.00..14425.00 rows=1000000 width=8) (actual time=4.940..5004.856 rows=1000000 loops=1)
Buffers: shared hit=4425
Planning Time: 0.189 ms
JIT:
Functions: 10
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.847 ms, Inlining 0.000 ms, Optimization 0.298 ms, Emission 4.485 ms, Total 5.630 ms
Execution Time: 10380.233 ms
(16 rows)
Time: 10380.995 ms (00:10.381)
```
If the differences between estimate and actual are small, you are probably good.
- [ ] **READ** [pgbench](https://www.postgresql.org/docs/current/pgbench.html)
## `JOIN` in PostgreSQL
> To understand misestimates.
There are three ways that a `JOIN` could be calculated:
1. nested loop `JOIN` (for each row in the first table, look for a match in the second; the obvious and somewhat naive approach). Efficient with some setups: a small table used in the outer loop. An index on the inner table will help.
1. hash `JOIN`
1. merge `JOIN`
#### :three: Types of `JOIN` Strategies
| | Nested Loop `JOIN` | Hash `JOIN`| Merge `JOIN`|
|----|----|----|--|
| How it works | For each row of the outer table, look up matches in the inner table. Useful when one table is smaller. | Sequential scan of the inner table builds a hash keyed on the `JOIN` condition; a sequential scan of the outer table then probes the hash instead of the entire table. The startup cost of building the hash is repaid by later performance. | Sort both tables on the `JOIN` condition and then merge them. |
| good? | "outer table" small | Bigger tables, if the hash fits into `work_mem`. If the hash of the inner table doesn't fit in `work_mem`, less good. | Large tables when a hash `JOIN` is no fun (too big for `work_mem`). |
| index? | On the `JOIN` condition, on the inner table. | An index is probably not helpful. | On the `JOIN` condition, on both tables. |
A poor estimate could lead to PostgreSQL choosing the wrong kind of `JOIN`.
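You can experiment with the planner's choice by disabling individual strategies in a session (for exploration only, never in production config); a sketch using the `large` and `small` tables from above:
```
SET enable_hashjoin = off;   -- force the planner to consider another strategy
EXPLAIN SELECT * FROM large JOIN small USING (x);
RESET enable_hashjoin;
```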
FUN!
- [ ] :bookmark: https://explain.depesz.com/
- [ ] :bookmark: https://explain.dalibo.com
The results from [explain.dalibo.com](https://explain.dalibo.com/plan#) are pretty:
![](https://i.imgur.com/nyqPoQD.png)
```
EXPLAIN (
ANALYZE,
BUFFERS
)
SELECT
*
FROM
large
JOIN small USING (x);
```
![](https://i.imgur.com/TAgypPx.png)
![](https://i.imgur.com/QXme5sG.png)
**Problem**: The estimate is **WAY** higher than the actual -- why? The statistics are off.
**Fix**: `ANALYZE small;`
It's true!
```
# ANALYZE small;
ANALYZE
Time: 1.875 ms
cats=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large JOIN small USING (x);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
-------
Gather (cost=1001.02..12196.86 rows=10000 width=8) (actual time=0.379..3302.646 rows=9999 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=4623 read=1
-> Hash Join (cost=1.02..10196.86 rows=4167 width=8) (actual time=0.327..3271.277 rows=3333 loops=3)
Hash Cond: (large.x = small.x)
Buffers: shared hit=4623 read=1
-> Parallel Seq Scan on large (cost=0.00..8591.67 rows=416667 width=8) (actual time=0.046..1624.942 rows=333333 lo
ops=3)
Buffers: shared hit=4425
-> Hash (cost=1.01..1.01 rows=1 width=4) (actual time=0.055..0.069 rows=1 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=3
-> Seq Scan on small (cost=0.00..1.01 rows=1 width=4) (actual time=0.017..0.028 rows=1 loops=3)
Buffers: shared hit=3
Planning:
Buffers: shared hit=5
Planning Time: 0.208 ms
Execution Time: 3358.110 ms
(18 rows)
Time: 3358.938 ms (00:03.359)
```
- **WHY?**
- The `small` table is tiny and auto-analyze never ran because we never reached the threshold. That totally makes sense.
```
# CREATE INDEX ON large(x);
CREATE INDEX
Time: 327.184 ms
cats=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large JOIN small USING (x);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
------
Nested Loop (cost=0.42..360.44 rows=10000 width=8) (actual time=0.065..167.526 rows=9999 loops=1)
Buffers: shared hit=46 read=11
-> Seq Scan on small (cost=0.00..1.01 rows=1 width=4) (actual time=0.006..0.018 rows=1 loops=1)
Buffers: shared hit=1
-> Index Scan using large_x_idx on large (cost=0.42..259.43 rows=10000 width=8) (actual time=0.037..58.590 rows=9999 loo
ps=1)
Index Cond: (x = small.x)
Buffers: shared hit=45 read=11
Planning:
Buffers: shared hit=15 read=1
Planning Time: 0.306 ms
Execution Time: 220.724 ms
(11 rows)
Time: 221.788 ms
cats=#
```
- **Q**: Why isn't it doing a Nested Loop JOIN when we know that would be better?
- **A**: No index!
Scroll :up: to see how we reduced the time of the query with a few steps.
### More Query Tuning
If you have a sequential scan with a filter and the majority of the rows are filtered out: use an index.
![](https://i.imgur.com/AQthne7.png)
# Agenda
_Mindmap with markmap goes here._
---
# Setup
```
docker run --name postgres_training5 \
  -e POSTGRES_PASSWORD=mysecretpassword -d postgres:13 \
  && docker exec -it postgres_training5 /bin/bash
```
# :bookmark: Links and the Like
- [awesome-postgres](https://github.com/dhamaniasad/awesome-postgres)
- [pg_flame](https://github.com/mgartner/pg_flame) - A flamegraph generator for Postgres `EXPLAIN ANALYZE` output.