Use the index aka PostgreSQL indexing

# Use the index aka PostgreSQL indexing ## Introduction - SQL is the most sucessfull [4GL language](https://en.wikipedia.org/wiki/Fourth-generation_programming_language) - SoC and abstraction are very good in SQL but reach their limits when it comes to performance - We need indexing - Developer task, indexing depends on quries. ### What is an index? - "What makes the query run faster" <img style="max-height: 50px; display: inline-block; padding-bottom: 15px; padding-left: 10px" src="https://freepngimg.com/save/98529-meme-photos-you-dont-say/506x368" /> - Distinct structure that duplicates data stored in tables (disk space) - We need an ordered structure with a quick delete / insert operations, thats why the most database indexes use balanced trees ## PostgreSQL index types ### B-Tree ```pgsql CREATE INDEX "name" ON "table" ("column"); ``` - Balanced Search Tree - [simulator](https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html) - The Leaf Nodes of the index tree are sorted values of the index containing reference to the original row on disk (`ROW ID`, in postgre [`ctid`](https://www.postgresql.org/docs/current/ddl-system-columns.html)) - They are stored in small blocks / pages (the smallest unit of DB) <img style="display: block; padding: 20px 0;" src="https://use-the-index-luke.com/static/fig01_01_index_leaf_nodes.en.5nWCgZxM.png" /> - The first values of each page are taken to the level up and point to each page. This is repeated until we have one root page. - Balanced tree because the depth is equal on every position <img style="display: block; padding: 20px 0;" src="https://use-the-index-luke.com/static/fig01_02_tree_structure.en.SIIhx7If.png" /> - The search is done by comparison from root to the leaft (**index traversal**) - Postgre 13 can [deduplicate](https://www.postgresql.org/docs/13/btree-implementation.html) leafs to save space which can lead to more effective queries. ([space test comparison](https://mydbops.wordpress.com/2020/09/25/deduplication-of-b-tree-indexes-in-postgresql-13/)) <img style="display: block; padding: 20px 0;" src="https://use-the-index-luke.com/static/fig01_03_tree_traversal.en.bhRJyIWe.png" /> #### Fill factor - Fill factor define how full the pages in leaft nodes will be after creation of the index. - Percentage value (default 90) - When the pages are full they are split. (increases memory, requires time) - For static table use 100 - For dynamic and highly updated table use lower amount ```pgsql CREATE INDEX "name" ON "table" ("column") WITH (fillfactor = 100); ``` #### When to use B-Tree - Almost everytime :) - Good for comparisons `<`, `>`, `<=`, `>=`, `=`, `BETWEEN`, `IN` - Can help for obtaining sorted results (not every time) - Can help with `IS NULL` - Can be used for `LIKE` queries that are trying to match prefix), but requires specific sorting inside of pages. Use [operator `text_pattern_ops`](https://www.postgresql.org/docs/current/indexes-opclass.html) ```pgsql CREATE INDEX test_index ON "table" ("column" text_pattern_ops); ``` ```pgsql SELECT * FROM "table" WHERE "column" LIKE 'Prefix%' ``` ### Hash index ```pgsql CREATE INDEX "name" ON "table" USING hash ("column"); ``` - Can be used only for `=` - Should have quicker build and lookup than B-Tree indexes - **Needs to be recreated after crash** of the database ### GIN index - Generalized inverted index ```pgsql CREATE INDEX "name" ON "table" USING gin ("column"); ``` - Used to index multidimensional values - array - jsonb - hStore (XML) - [range types](https://www.postgresql.org/docs/9.3/rangetypes.html) - "NoSQL index for SQL database" - [Implemented as nested B-Trees](https://www.postgresql.org/docs/10/gin-implementation.html) pointing to multiple places in original table - In general for [array operations](https://www.postgresql.org/docs/9.1/functions-array.html): - `=` - `<@` - contains - `@>` - is contained by - `&&` - overlap - Created for extensibility and plugins. - One of main reasons was also [full text search](https://gitlab.ack.ee/Backend/artdeal-userapi/-/merge_requests/147/diffs?commit_id=fa9c68d5d35d1a9ddfd604fd7aa1d45466c57a76) - [Tricks and tips section](https://www.postgresql.org/docs/9.1/gin-tips.html) ### GIST index ```pgsql CREATE INDEX "name" ON "table" USING gist ("column"); ``` - Generalized search tree (internally again B-Tree with large improvements) - [Great article about implementation](https://medium.com/postgres-professional/indexes-in-postgresql-5-gist-86e19781b5db) - Used to index geometric data (points), intervals and full text - Works for geometric [operators](https://www.postgresql.org/docs/9.1/functions-geometry.html) - `@>` - contains (circle point) - `<@` - contained in or on (point in circle) - `&&` - intersection - Great example of usage: K-Nearest neighbours ```sql SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10; ``` - [GIN vs GIST for full text in docs](https://www.postgresql.org/docs/9.1/textsearch-indexes.html) + [multicolumn performance differences](https://www.postgresql.org/docs/9.1/indexes-multicolumn.html) # Execution plans - Basics ## How to search an index? - **Sequential scan** - No usage of index - **Index Only Scan** - Goes over the index and returns data directly stored in index (B-Tree only) - **Index Scan** - Searches the index, then goes to the referenced place on disk and fetches data there - **Bitmap scan** - Goes over the index, creates a bitmap for blocks and reads them sequentionally ## Join operations ### Nested loops - Two nested loops - Good for small sets ```typescript const movies = knex.table('movies') for(const m of movies) { const movieActors = knex.table('actors').where('movie', '=', m.id) // use index for (const a of movieActors) { // Process actors... } } ``` ### Hash Join - Do not need indexes on joined columns - Loads rows from one table to hash table - Needs more space = optimal only if the rows fits in the memory - Selecting only some columns can help! ### Sort-Merge Join - Both sides are sorted - Usefull for outer joins and already ordered colections: <img src="https://use-the-index-luke.com/static/sort-merge.2hg7gOBL.gif"> # How to avoid bad indexing ## Slow indices - What if we have many non-unique values? - Many leaft nodes in B-Tree are same - Index traversal + sequential scan of leaf nodes + referencing to original values in heap - It is not optimal to use index when large dataset is returned or we are searching for very non unique value. ## Primary index - No problem for primary keys - Index scan with one `ROW ID` acess (unique) ## Multicolumns and partial indexing - When using multicolumns, only the first column can be used in query to use this multicolumn index. - Always think about the order of columns, the first one should be the most selective - BUT should be not. Index should be used effectively - By as many queries as possible > 💡 **Example** - Multicolumn index vs single column > - [no index](https://explain.dalibo.com/plan/J3k), [double index](https://explain.dalibo.com/plan/fh4), [year started mmulticolumn](https://explain.dalibo.com/plan/nqL), [month started multicolumn](https://explain.dalibo.com/plan/sDe) - Partial indexing is usefull when we are always using same condition - e.g. `state = "UNPROCESSED"` - Partial index are small in size and more effective when used properly than multi column indexing > 💡 **Example** - Partial indexing for borrowings > - [no index](https://explain.dalibo.com/plan/JDKE), [multicolumn](https://explain.dalibo.com/plan/yg2), [partial](https://explain.dalibo.com/plan/Ioj) > - ~12 million rows, multicolumn(~80MB) vs partial (~20MB) ## Index functions - If using functions in query (e.g. `UPPER`/`LOWER`), you should use them also in index (decreases number of non-unique values), BUT only deterministic functions = e.g. not functions containing `NOW()` or RNG functions ```pgsql CREATE INDEX title_lower ON books (LOWER(title)) ``` - Use math trasnformation: ```pgsql -- Query SELECT a, b FROM table_name WHERE 3 * a - b = -5 -- Transform query into equation SELECT a, b FROM table_name WHERE 3 * a + 5 = b -- Now we can create an index CREATE INDEX math ON table_name (3 * a - b) ``` ### ## Binding parameters - Always rather use it (knex does it by default) - It does protect againts SQL injections but Postgre can also use same execution plans to run with multiple queries. - When binding is not used execution plan is always recalculated. - Sometimes it is better (exec plan can deside based on value), but in general better approach is to always use the binding - Always ask only for what is needed right now, not create logicall hacks to avoid complications in code. e.g. ```pgql SELECT first_name, last_name, subsidiary_id, employee_id FROM employees WHERE ( subsidiary_id = :sub_id OR :sub_id IS NULL ) AND ( employee_id = :emp_id OR :emp_id IS NULL ) AND ( UPPER(last_name) = :name OR :name IS NULL ) ``` ## Ranges and inequality - The biggest performance deal - Leaft Node Traversal - Once we meet the condition we need to go over all leaf nodes ```pgsql WHERE date_of_birth >= ? AND date_of_birth <= ? AND subsidiary_id = ? ``` <img src="https://use-the-index-luke.com/static/fig02_02_range_scan.en.nRlIqWfx.png" /> - **Combining indices rule** - always use the most selective one on the first place ### Like filters - Often transform to more inequality searches in index - Based on prefixes - longer prefix = more selective for indices - => Avoid leading wildcards <img src="https://use-the-index-luke.com/static/fig02_05_like.en.9him98+S.png" /> ## Merging indexes - **Rule of thumb**: It is always better to use one index for more columns that single index per each column - Sometimes there is no good solution e.g. index for `name` and `date_of_birth` ## Index changes and statistics - Be carefull when changing the indices - It can delete / change the database statistics, so the planner can't run the query the best way - Use the query execution plan (e.g. tool [dalibo](https://explain.dalibo.com/)) - Run PostgreSQL [`ANALYZE`](https://www.postgresql.org/docs/9.3/sql-analyze.html) to recalculate database statistics ## Ordering - Index can help with ordering, but it can be expensive to do lots of table fetches (after index traversal) - planner avoids it in that case - Index can be read by both sides, but you can specify the order ```pgsql CREATE INDEX sales_dt_pr ON sales (sale_date ASC, product_id DESC) ``` - Clustering index - indirect measure of the probability that two succeeding index entries refer to the same table block - You can even use `NULLS FIRST` - An indexed order by execution not only saves the sorting effort, however; it is also able to return the first results without processing all input data - Can help also with `GROUP BY` - it is using `ORDER` > 💡 **Example** - Order index > - [index](https://explain.dalibo.com/plan/tiD), [asc & desc index](https://explain.dalibo.com/plan/UuV) ## Pagination - Vždycky `LIMIT`! - Do not use `OFFSET` if you dont need it - databse need to count all values before the offset - use the where clause. - Specific page issue - Backwards pagging <img src="https://use-the-index-luke.com/static/fig07_04_paging_scalability.en.vlMZSR3z.png" /> ## Null in indexes ### Oracle indexes curiosity - Oracle by default doesn't index null values, so every index is technically partial index ### Postgres takes a different approach - Dumb when you try and make IS NULL or IS NOT NULL index, at least until v11 it wasn't able to flip the condition for the opposite - If we're making two indexes, one for null and one for not null, we're not much better off than just going with default B-Tree (https://stackoverflow.com/a/31966621/16494105) - This is in term of data how postgres index looks like ![](https://i.imgur.com/GcJxUfl.png) (source: https://patshaughnessy.net/2014/11/11/discovering-the-computer-science-behind-postgres-indexes) - Index tuple data is composed from two parts - either a link to another node or database record and 16bit information about the tuple and what it links to - If the data contains null (indicated in the info), next up is bitmap for Null values, this way postgres saves value on the index while indexing all values https://github.com/postgres/postgres/blob/master/src/include/access/itup.h https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/nbtree.c