# Use the index aka PostgreSQL indexing
## Introduction
- SQL is the most sucessfull [4GL language](https://en.wikipedia.org/wiki/Fourth-generation_programming_language)
- SoC and abstraction are very good in SQL but reach their limits when it comes to performance
- We need indexing - Developer task, indexing depends on quries.
### What is an index?
- "What makes the query run faster" <img style="max-height: 50px; display: inline-block; padding-bottom: 15px; padding-left: 10px" src="https://freepngimg.com/save/98529-meme-photos-you-dont-say/506x368" />
- Distinct structure that duplicates data stored in tables (disk space)
- We need an ordered structure with a quick delete / insert operations, thats why the most database indexes use balanced trees
## PostgreSQL index types
### B-Tree
```pgsql
CREATE INDEX "name" ON "table" ("column");
```
- Balanced Search Tree - [simulator](https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html)
- The Leaf Nodes of the index tree are sorted values of the index containing reference to the original row on disk (`ROW ID`, in postgre [`ctid`](https://www.postgresql.org/docs/current/ddl-system-columns.html))
- They are stored in small blocks / pages (the smallest unit of DB)
<img style="display: block; padding: 20px 0;" src="https://use-the-index-luke.com/static/fig01_01_index_leaf_nodes.en.5nWCgZxM.png" />
- The first values of each page are taken to the level up and point to each page. This is repeated until we have one root page.
- Balanced tree because the depth is equal on every position
<img style="display: block; padding: 20px 0;" src="https://use-the-index-luke.com/static/fig01_02_tree_structure.en.SIIhx7If.png" />
- The search is done by comparison from root to the leaft (**index traversal**)
- Postgre 13 can [deduplicate](https://www.postgresql.org/docs/13/btree-implementation.html) leafs to save space which can lead to more effective queries. ([space test comparison](https://mydbops.wordpress.com/2020/09/25/deduplication-of-b-tree-indexes-in-postgresql-13/))
<img style="display: block; padding: 20px 0;" src="https://use-the-index-luke.com/static/fig01_03_tree_traversal.en.bhRJyIWe.png" />
#### Fill factor
- Fill factor define how full the pages in leaft nodes will be after creation of the index.
- Percentage value (default 90)
- When the pages are full they are split. (increases memory, requires time)
- For static table use 100
- For dynamic and highly updated table use lower amount
```pgsql
CREATE INDEX "name" ON "table" ("column") WITH (fillfactor = 100);
```
#### When to use B-Tree
- Almost everytime :)
- Good for comparisons `<`, `>`, `<=`, `>=`, `=`, `BETWEEN`, `IN`
- Can help for obtaining sorted results (not every time)
- Can help with `IS NULL`
- Can be used for `LIKE` queries that are trying to match prefix), but requires specific sorting inside of pages. Use [operator `text_pattern_ops`](https://www.postgresql.org/docs/current/indexes-opclass.html)
```pgsql
CREATE INDEX test_index ON "table" ("column" text_pattern_ops);
```
```pgsql
SELECT * FROM "table" WHERE "column" LIKE 'Prefix%'
```
### Hash index
```pgsql
CREATE INDEX "name" ON "table" USING hash ("column");
```
- Can be used only for `=`
- Should have quicker build and lookup than B-Tree indexes
- **Needs to be recreated after crash** of the database
### GIN index
- Generalized inverted index
```pgsql
CREATE INDEX "name" ON "table" USING gin ("column");
```
- Used to index multidimensional values
- array
- jsonb
- hStore (XML)
- [range types](https://www.postgresql.org/docs/9.3/rangetypes.html)
- "NoSQL index for SQL database"
- [Implemented as nested B-Trees](https://www.postgresql.org/docs/10/gin-implementation.html) pointing to multiple places in original table
- In general for [array operations](https://www.postgresql.org/docs/9.1/functions-array.html):
- `=`
- `<@` - contains
- `@>` - is contained by
- `&&` - overlap
- Created for extensibility and plugins.
- One of main reasons was also [full text search](https://gitlab.ack.ee/Backend/artdeal-userapi/-/merge_requests/147/diffs?commit_id=fa9c68d5d35d1a9ddfd604fd7aa1d45466c57a76)
- [Tricks and tips section](https://www.postgresql.org/docs/9.1/gin-tips.html)
### GIST index
```pgsql
CREATE INDEX "name" ON "table" USING gist ("column");
```
- Generalized search tree (internally again B-Tree with large improvements)
- [Great article about implementation](https://medium.com/postgres-professional/indexes-in-postgresql-5-gist-86e19781b5db)
- Used to index geometric data (points), intervals and full text
- Works for geometric [operators](https://www.postgresql.org/docs/9.1/functions-geometry.html)
- `@>` - contains (circle point)
- `<@` - contained in or on (point in circle)
- `&&` - intersection
- Great example of usage: K-Nearest neighbours
```sql
SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10;
```
- [GIN vs GIST for full text in docs](https://www.postgresql.org/docs/9.1/textsearch-indexes.html) + [multicolumn performance differences](https://www.postgresql.org/docs/9.1/indexes-multicolumn.html)
# Execution plans - Basics
## How to search an index?
- **Sequential scan** - No usage of index
- **Index Only Scan** - Goes over the index and returns data directly stored in index (B-Tree only)
- **Index Scan** - Searches the index, then goes to the referenced place on disk and fetches data there
- **Bitmap scan** - Goes over the index, creates a bitmap for blocks and reads them sequentionally
## Join operations
### Nested loops
- Two nested loops
- Good for small sets
```typescript
const movies = knex.table('movies')
for(const m of movies) {
const movieActors = knex.table('actors').where('movie', '=', m.id) // use index
for (const a of movieActors) {
// Process actors...
}
}
```
### Hash Join
- Do not need indexes on joined columns
- Loads rows from one table to hash table
- Needs more space = optimal only if the rows fits in the memory - Selecting only some columns can help!
### Sort-Merge Join
- Both sides are sorted
- Usefull for outer joins and already ordered colections:
<img src="https://use-the-index-luke.com/static/sort-merge.2hg7gOBL.gif">
# How to avoid bad indexing
## Slow indices
- What if we have many non-unique values?
- Many leaft nodes in B-Tree are same
- Index traversal + sequential scan of leaf nodes + referencing to original values in heap
- It is not optimal to use index when large dataset is returned or we are searching for very non unique value.
## Primary index
- No problem for primary keys - Index scan with one `ROW ID` acess (unique)
## Multicolumns and partial indexing
- When using multicolumns, only the first column can be used in query to use this multicolumn index.
- Always think about the order of columns, the first one should be the most selective
- BUT should be not. Index should be used effectively - By as many queries as possible
> 💡 **Example** - Multicolumn index vs single column
> - [no index](https://explain.dalibo.com/plan/J3k), [double index](https://explain.dalibo.com/plan/fh4), [year started mmulticolumn](https://explain.dalibo.com/plan/nqL), [month started multicolumn](https://explain.dalibo.com/plan/sDe)
- Partial indexing is usefull when we are always using same condition - e.g. `state = "UNPROCESSED"`
- Partial index are small in size and more effective when used properly than multi column indexing
> 💡 **Example** - Partial indexing for borrowings
> - [no index](https://explain.dalibo.com/plan/JDKE), [multicolumn](https://explain.dalibo.com/plan/yg2), [partial](https://explain.dalibo.com/plan/Ioj)
> - ~12 million rows, multicolumn(~80MB) vs partial (~20MB)
## Index functions
- If using functions in query (e.g. `UPPER`/`LOWER`), you should use them also in index (decreases number of non-unique values), BUT only deterministic functions = e.g. not functions containing `NOW()` or RNG functions
```pgsql
CREATE INDEX title_lower ON books (LOWER(title))
```
- Use math trasnformation:
```pgsql
-- Query
SELECT a, b
FROM table_name
WHERE 3 * a - b = -5
-- Transform query into equation
SELECT a, b
FROM table_name
WHERE 3 * a + 5 = b
-- Now we can create an index
CREATE INDEX math ON table_name (3 * a - b)
```
###
## Binding parameters
- Always rather use it (knex does it by default) - It does protect againts SQL injections but Postgre can also use same execution plans to run with multiple queries.
- When binding is not used execution plan is always recalculated.
- Sometimes it is better (exec plan can deside based on value), but in general better approach is to always use the binding
- Always ask only for what is needed right now, not create logicall hacks to avoid complications in code.
e.g.
```pgql
SELECT first_name, last_name, subsidiary_id, employee_id
FROM employees
WHERE ( subsidiary_id = :sub_id OR :sub_id IS NULL )
AND ( employee_id = :emp_id OR :emp_id IS NULL )
AND ( UPPER(last_name) = :name OR :name IS NULL )
```
## Ranges and inequality
- The biggest performance deal - Leaft Node Traversal
- Once we meet the condition we need to go over all leaf nodes
```pgsql
WHERE date_of_birth >= ?
AND date_of_birth <= ?
AND subsidiary_id = ?
```
<img src="https://use-the-index-luke.com/static/fig02_02_range_scan.en.nRlIqWfx.png" />
- **Combining indices rule** - always use the most selective one on the first place
### Like filters
- Often transform to more inequality searches in index
- Based on prefixes - longer prefix = more selective for indices
- => Avoid leading wildcards
<img src="https://use-the-index-luke.com/static/fig02_05_like.en.9him98+S.png" />
## Merging indexes
- **Rule of thumb**: It is always better to use one index for more columns that single index per each column
- Sometimes there is no good solution e.g. index for `name` and `date_of_birth`
## Index changes and statistics
- Be carefull when changing the indices - It can delete / change the database statistics, so the planner can't run the query the best way - Use the query execution plan (e.g. tool [dalibo](https://explain.dalibo.com/))
- Run PostgreSQL [`ANALYZE`](https://www.postgresql.org/docs/9.3/sql-analyze.html) to recalculate database statistics
## Ordering
- Index can help with ordering, but it can be expensive to do lots of table fetches (after index traversal) - planner avoids it in that case
- Index can be read by both sides, but you can specify the order
```pgsql
CREATE INDEX sales_dt_pr
ON sales (sale_date ASC, product_id DESC)
```
- Clustering index - indirect measure of the probability that two succeeding index entries refer to the same table block
- You can even use `NULLS FIRST`
- An indexed order by execution not only saves the sorting effort, however; it is also able to return the first results without processing all input data
- Can help also with `GROUP BY` - it is using `ORDER`
> 💡 **Example** - Order index
> - [index](https://explain.dalibo.com/plan/tiD), [asc & desc index](https://explain.dalibo.com/plan/UuV)
## Pagination
- Vždycky `LIMIT`!
- Do not use `OFFSET` if you dont need it - databse need to count all values before the offset - use the where clause.
- Specific page issue
- Backwards pagging
<img src="https://use-the-index-luke.com/static/fig07_04_paging_scalability.en.vlMZSR3z.png" />
## Null in indexes
### Oracle indexes curiosity
- Oracle by default doesn't index null values, so every index is technically partial index
### Postgres takes a different approach
- Dumb when you try and make IS NULL or IS NOT NULL index, at least until v11 it wasn't able to flip the condition for the opposite
- If we're making two indexes, one for null and one for not null, we're not much better off than just going with default B-Tree (https://stackoverflow.com/a/31966621/16494105)
- This is in term of data how postgres index looks like  (source: https://patshaughnessy.net/2014/11/11/discovering-the-computer-science-behind-postgres-indexes)
- Index tuple data is composed from two parts - either a link to another node or database record and 16bit information about the tuple and what it links to
- If the data contains null (indicated in the info), next up is bitmap for Null values, this way postgres saves value on the index while indexing all values
https://github.com/postgres/postgres/blob/master/src/include/access/itup.h
https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/nbtree.c
{"metaMigratedAt":"2023-06-16T20:47:15.469Z","metaMigratedFrom":"Content","title":"Use the index aka PostgreSQL indexing","breaks":true,"contributors":"[{\"id\":\"1af662b8-c090-44ba-a627-279f50d30f12\",\"add\":1263,\"del\":100},{\"id\":\"21000801-897e-483f-9f51-87af6405d79f\",\"add\":6107,\"del\":2320},{\"id\":\"a8215e64-4805-4d65-ba0a-fd85dc19c2fa\",\"add\":9529,\"del\":1904}]"}