Database Indexes

# Database Indexes tl;dr Binary Search Lookups on not-the-primary-key for tables that improve lookup performance. Probably more complex than you think at high scales. ## What's an Index? Remember how Binary Search runs in O(log(N))? Basically, this allows us to find elements fast from a sorted array. Database entries are stored in the order they are inserted (except in key-value stores, like Redis, which are basically big hash maps), and so when a query is run, the database server must scan the entire table to find the right rows. If you run a query frequently on a large table, this can really slow down the database. An index solves this by sacrificing space (often in memory) for time. For example, if you often look up users by their username, having a list of usernames sorted alphabetically would be helpful in locating the exact row needed. That index would essentially be a sorted list of usernames, with their corresponding row number. This way you don't need to start with usernames that begin with `A` when the username you want starts with a `V`. The Primary Key is automatically indexed in a database, and serves to tell the database server exactly where to look on the disk for the data it needs, which is much faster than reading through the enire file line by line and doing string comparison on each row. ## At High Scales Indexes start to become a problem. If all the data won't fit on the disk, your indexes are probably large, and won't fit in memory. Indexes _will still work_ if they won't fit in memory, but if a table is frequently accessed, that one index will tend to crowd out your memory and not allow other indexes to be used, which creates latency across the whole system. ## Types of Index ### Single-field Index If your table is really only searched along one field (like a datestamp), creating an index on that field is probably the best approach. These indexes are small (usually). ### Compound Indexes A multi-field index that will sort along the first field, and then further sort along a second field, and a third, etc. So, Customers by First and Last name, further indexed by date would be a common query, because the customer in front of you is likely to be within the most recently active customer. This query is probably run by all customer service reps whenever they look up a customer account, so it's gotta be fast. A compound index would help speed up this query. ### The ESR Rule for Compound Indexes ``` |LASTNAME-FIRSTNAME-LASTACCESSED-WITHIN90DAYS| |STEINER-LAURA-01262023-TRUE ``` For compound indexes, this rule of thumb is helpful in deciding the order of fields in the index: 1. First, add those fields against which **Equality queries** are run. 2. The next fields to be indexed should reflect the **Sort order of the query**. 3. The last fields represent the **Range of data to be accessed**- like "users that have logged in within the last 90 days". ## Resources & References * [Here's where I learned my lesson on Database Indexes with MongoDB](https://medium.com/p/cb2addb59fe7) * [Performance Best Practices](https://www.mongodb.com/blog/post/performance-best-practices-indexing) * [Ensure Indexes Fit In RAM](https://www.mongodb.com/docs/manual/tutorial/ensure-indexes-fit-ram/) * [MongoDB Indexes](https://www.mongodb.com/docs/manual/indexes/) * [Large Mongo Indexes are a Problem](https://stackoverflow.com/questions/10617283/why-are-my-mongodb-indexes-so-large) * [Database Indexing 101](https://www.freecodecamp.org/news/database-indexing-at-a-glance-bb50809d48bd/) * [Indexing Very Large Tables](https://towardsdatascience.com/indexing-very-large-tables-569811421ee0) * [B Tree on wikipedia](https://en.wikipedia.org/wiki/B-tree#:~:text=In%20computer%20science%2C%20a%20B,with%20more%20than%20two%20children.) * [Partial Indexes](https://en.wikipedia.org/wiki/Partial_index) ## Discussion Group Today we're going to talk about Compound Indexes in MongoDB in a discussion group. The following is a data model for a media company that has hundreds of websites and millions of viewers to track their media sites, writers, articles, subscribers for each site, and views for each article per subscriber. They have millions of subscribers. There are several tables we'll want to add compound indexes to later, but for now let's just focus on defining the tables: websites --- id int name text writers --- id int name text users --- id int name text articles --- id int website_id int writer_id int title text url text body text index: writer_id, website_id subscription --- id int user_id int website_id int subscription_type int index: website_id views --- id int subscription_id int article_id int viewed_at datetime index: subscription_id articles.writer_id, views.article_id writer_id from articles - equality views.viewed_at - sort article_id from views writer_name from writers ## Expensive queries using this dataset Let's look at some expensive queries: 1. Finding all articles written by a particular writer 2. Finding all articles viewed by a particular subscriber 3. Finding all subscribers for a particular website 4. Writers with the most viewers 5. Writers with the most viewers in the last X For each of these queries, we can create a compound index to improve performance. Here's another data model likely to generate issues with mongodb indexes at really high scales: a streaming media website. Let's examine the parts of their data model most likely to cause problems in a large distributed system: movies --- id int name text users --- id int name text views --- id int movie_id int user_id int index: views.user_id,views.movie_id like_dislike --- id int # like ID movie_id # ID associated with the movie, tv episode, or music album track user_id int * index this created_at datetime last_updated datetime is_active | boolean # allows user to deselect response like_type_id | foreign key like_dislike.user_id,like_type_id="2" like_type --- id name | like or dislike - we made a table in case we want to expand to "meh" media_type ---- id int media str | type of media, typically a movie or tv episode. ex: MOVIE ## Expensive queries using this dataset Let's look at some expensive queries: 1. Finding all users who have liked a particular show 2. Finding all movies watched by a particular user ``` Spoilers Below!!! For each of these queries, we can create a compound index to improve performance. #### media sites answers For query 1, we can create a compound index on the writers table with the writer_id and website_id fields. This will allow us to quickly find all articles written by a particular writer for a particular website. For query 2, we can create a compound index on the views table with the subscriber_id and article_id fields. This will allow us to quickly find all articles viewed by a particular subscriber. For query 3, we can create a compound index on the subscribers table with the website_id and name fields. This will allow us to quickly find all subscribers for a particular website. #### streaming media answers For query 1, we can create a compound index on the likes table with the show_id and user_id fields. This will allow us to quickly find all users who have liked a particular show. For query 2, we can create a compound index on the shows table with the user_id and movie_id fields. This will allow us to quickly find all movies watched by a particular user. For query 3, we can create a compound index on the likes table with the track_id and user_id fields. This will allow us to quickly find all users who have liked a particular track.