# Database Indexes
tl;dr Binary Search Lookups on not-the-primary-key for tables that improve lookup performance. Probably more complex than you think at high scales.
## What's an Index?
Remember how Binary Search runs in O(log(N))? Basically, this allows us to find elements fast from a sorted array.
Database entries are stored in the order they are inserted (except in key-value stores, like Redis, which are basically big hash maps), and so when a query is run, the database server must scan the entire table to find the right rows.
If you run a query frequently on a large table, this can really slow down the database. An index solves this by sacrificing space (often in memory) for time.
For example, if you often look up users by their username, having a list of usernames sorted alphabetically would be helpful in locating the exact row needed. That index would essentially be a sorted list of usernames, with their corresponding row number. This way you don't need to start with usernames that begin with `A` when the username you want starts with a `V`.
The Primary Key is automatically indexed in a database, and serves to tell the database server exactly where to look on the disk for the data it needs, which is much faster than reading through the enire file line by line and doing string comparison on each row.
## At High Scales
Indexes start to become a problem. If all the data won't fit on the disk, your indexes are probably large, and won't fit in memory. Indexes _will still work_ if they won't fit in memory, but if a table is frequently accessed, that one index will tend to crowd out your memory and not allow other indexes to be used, which creates latency across the whole system.
## Types of Index
### Single-field Index
If your table is really only searched along one field (like a datestamp), creating an index on that field is probably the best approach. These indexes are small (usually).
### Compound Indexes
A multi-field index that will sort along the first field, and then further sort along a second field, and a third, etc. So, Customers by First and Last name, further indexed by date would be a common query, because the customer in front of you is likely to be within the most recently active customer. This query is probably run by all customer service reps whenever they look up a customer account, so it's gotta be fast. A compound index would help speed up this query.
### The ESR Rule for Compound Indexes
```
|LASTNAME-FIRSTNAME-LASTACCESSED-WITHIN90DAYS|
|STEINER-LAURA-01262023-TRUE
```
For compound indexes, this rule of thumb is helpful in deciding the order of fields in the index:
1. First, add those fields against which **Equality queries** are run.
2. The next fields to be indexed should reflect the **Sort order of the query**.
3. The last fields represent the **Range of data to be accessed**- like "users that have logged in within the last 90 days".
## Resources & References
* [Here's where I learned my lesson on Database Indexes with MongoDB](https://medium.com/p/cb2addb59fe7)
* [Performance Best Practices](https://www.mongodb.com/blog/post/performance-best-practices-indexing)
* [Ensure Indexes Fit In RAM](https://www.mongodb.com/docs/manual/tutorial/ensure-indexes-fit-ram/)
* [MongoDB Indexes](https://www.mongodb.com/docs/manual/indexes/)
* [Large Mongo Indexes are a Problem](https://stackoverflow.com/questions/10617283/why-are-my-mongodb-indexes-so-large)
* [Database Indexing 101](https://www.freecodecamp.org/news/database-indexing-at-a-glance-bb50809d48bd/)
* [Indexing Very Large Tables](https://towardsdatascience.com/indexing-very-large-tables-569811421ee0)
* [B Tree on wikipedia](https://en.wikipedia.org/wiki/B-tree#:~:text=In%20computer%20science%2C%20a%20B,with%20more%20than%20two%20children.)
* [Partial Indexes](https://en.wikipedia.org/wiki/Partial_index)
## Discussion Group
Today we're going to talk about Compound Indexes in MongoDB in a discussion group.
The following is a data model for a media company that has hundreds of websites and millions of viewers to track their media sites, writers, articles, subscribers for each site, and views for each article per subscriber. They have millions of subscribers.
There are several tables we'll want to add compound indexes to later, but for now let's just focus on defining the tables:
websites
---
id int
name text
writers
---
id int
name text
users
---
id int
name text
articles
---
id int
website_id int
writer_id int
title text
url text
body text
index: writer_id, website_id
subscription
---
id int
user_id int
website_id int
subscription_type int
index: website_id
views
---
id int
subscription_id int
article_id int
viewed_at datetime
index: subscription_id
articles.writer_id, views.article_id
writer_id from articles - equality
views.viewed_at - sort
article_id from views
writer_name from writers
## Expensive queries using this dataset
Let's look at some expensive queries:
1. Finding all articles written by a particular writer
2. Finding all articles viewed by a particular subscriber
3. Finding all subscribers for a particular website
4. Writers with the most viewers
5. Writers with the most viewers in the last X
For each of these queries, we can create a compound index to improve performance.
Here's another data model likely to generate issues with mongodb indexes at really high scales:
a streaming media website. Let's examine the parts of their data model most likely to cause problems
in a large distributed system:
movies
---
id int
name text
users
---
id int
name text
views
---
id int
movie_id int
user_id int
index: views.user_id,views.movie_id
like_dislike
---
id int # like ID
movie_id # ID associated with the movie, tv episode, or music album track
user_id int * index this
created_at datetime
last_updated datetime
is_active | boolean # allows user to deselect response
like_type_id | foreign key
like_dislike.user_id,like_type_id="2"
like_type
---
id
name | like or dislike - we made a table in case we want to expand to "meh"
media_type
----
id int
media str | type of media, typically a movie or tv episode. ex: MOVIE
## Expensive queries using this dataset
Let's look at some expensive queries:
1. Finding all users who have liked a particular show
2. Finding all movies watched by a particular user
```
Spoilers Below!!!
For each of these queries, we can create a compound index to improve performance.
#### media sites answers
For query 1, we can create a compound index on the writers table with the writer_id and website_id fields.
This will allow us to quickly find all articles written by a particular writer for a particular website.
For query 2, we can create a compound index on the views table with the subscriber_id and article_id fields.
This will allow us to quickly find all articles viewed by a particular subscriber.
For query 3, we can create a compound index on the subscribers table with the website_id and name fields.
This will allow us to quickly find all subscribers for a particular website.
#### streaming media answers
For query 1, we can create a compound index on the likes table with the show_id and user_id fields.
This will allow us to quickly find all users who have liked a particular show.
For query 2, we can create a compound index on the shows table with the user_id and movie_id fields.
This will allow us to quickly find all movies watched by a particular user.
For query 3, we can create a compound index on the likes table with the track_id and user_id fields.
This will allow us to quickly find all users who have liked a particular track.