# Mongo Index Optimization Summary

Created: June 6, 2022 11:16 AM

## Cleanup Script

- The aggregation step was taking ~15 minutes, due primarily to the sorting step.
- Since we match by status first, it made more sense to use an index that starts with a sorted status.
- What was done: restructured the aggregation so that we sort by status first, and created a new index that starts with a sorted status.
- Result: the query was significantly faster (seconds rather than around 15 minutes).

## Running Explain on all queries

- I started by trying to make all the `match` steps in aggregations and `filters` in finds match/filter by `status` first instead of `project` (since there are fewer permutations of status than of project). I had a query that looks like this:
    ```
    col.find({'a': 'value', 'b': 'value', 'c': 'value'})
    ```
    but it seems mongo is using an index with a key pattern of:
    ```
    {b: 1, a: 1, c: 1}
    ```
    even though I have an index that I thought would be more efficient:
    ```
    {a: 1, b: 1, c: 1}
    ```
    Given that there are fewer possible values of 'a' than of 'b', I thought it would be faster. However, while this ordering was helpful for queries involving a sort step, it was actually worse for plain finds, since having more distinct values of 'b' lets mongo cut out significantly more documents right off the bat.
- I used `.explain()` to investigate each of the queries inside `build_index.py` to see if they were all using an index scan (they were) and also monitored the speed (all were under 5 minutes, based on triplefive, when not iterating).
- All the aggregation pipelines and queries used in this file were already matching an existing index (no collection scans).
- Since there is no sort involved, the new index added by the cleanup script was not a better index; mongo continued using the existing index (as indicated by the plan).
- To map out which parts of the index-building process were actually the slowest, I created a simple timer to log the elapsed time of each query.
- The slowest steps are:
    - `es_dynamic_mapping` (46.7 seconds on average over 10 runs)
    - `_get_latest_times_for_all_feeds` (37 seconds on average over 10 runs)
    - `_get_data` - this was the single slowest step, as each batch required a new round trip to the server (each server trip took 26.7 seconds on average).
        - Strangely, the batches seemed to arrive in groups of 1000 regardless of the specified batch size. Even though the logs show the batch size was the specified value, the server round trip happened every 1000 documents.

![](https://i.imgur.com/QoFR7IF.png)

## Investigating Usage Stats

- The speed of queries depends on whether we can fit our database's working set in memory. For us, that means the working set needs to fit inside the memory allocated to the WiredTiger cache.
- Run `db.serverStatus().wiredTiger.cache` to investigate.
- Of the 16 GiB of memory allocated to the mongo pod, mongo automatically allocated almost 7.5 GiB to this cache. It is recommended to let mongo do this automatically.
- Here are the stats:
    ```jsx
    // TLDR;
    {
      "maximum bytes configured": 7481589760, // ~7.5GiB
      "bytes currently in the cache": 5975961357, // we've used 80% of the cache
      "tracked dirty bytes in the cache": 133137, // less than 5% of cache
    }
    // see bottom of documentation for full output
    ```
- Our usage really only becomes an issue when we run out of space in this cache and evictions start happening, removing the least recently used pages.
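- For reference, a minimal pymongo sketch of the same cache check (a rough sketch, not from the codebase; the connection string is a placeholder):
    ```
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder URI

    # serverStatus exposes the same wiredTiger.cache stats shown above
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    configured = cache["maximum bytes configured"]
    used = cache["bytes currently in the cache"]
    dirty = cache["tracked dirty bytes in the cache"]
    print(f"cache usage: {used / configured:.1%}, dirty: {dirty / configured:.4%}")
    ```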
- The WiredTiger cache starts evictions at around 80% usage, which means we are likely already experiencing evictions.
- So I did more digging to see what our working set size really is, given that the working set is composed of the most frequently accessed documents and indices.
- I totaled the documents that we do not delete via the `clean_final_cache` script; this is our working set of documents:
    - `final_data_en_us` — 4,246,091 documents * 20.19kB / document = ~85GiB
    - `final_data` — 1,812,038 documents * 20.1kB / document = ~37GiB
- Total size of the indices in these two collections = ~1.1GiB
- We certainly do not have enough memory to fit our working set.

## Other Approaches

- I tried using `keyhole`, a third-party mongodb analysis tool. It did not work and I could not find any documentation on the errors I was facing.
- The full output of the stats for our WiredTiger cache:

```jsx
{ "application threads page read from disk to cache count": 725128232, "application threads page read from disk to cache time (usecs)": 276149765871, "application threads page write from cache to disk count": 11620288, "application threads page write from cache to disk time (usecs)": 652256828, "bytes allocated for updates": 154624016, "bytes belonging to page images in the cache": 5740776239, "bytes belonging to the history store table in the cache": 486, "bytes currently in the cache": 5975961357, "bytes dirty in the cache cumulative": Long("4288828859575"), "bytes not belonging to page images in the cache": 235185117, "bytes read into cache": Long("83449149626567"), "bytes written from cache": Long("3444779020619"), "cache overflow score": 0, "checkpoint blocked page eviction": 182279, "checkpoint of history store file blocked non-history store page eviction": 0, "eviction calls to get a page": 789912009, "eviction calls to get a page found queue empty": 31378434, "eviction calls to get a page found queue empty after locking": 7256717, "eviction currently operating in aggressive mode": 0, "eviction empty score": 0, "eviction gave up due to detecting an out of order on disk value behind the last update on the chain": 0, "eviction gave up due to detecting an out of order tombstone ahead of the selected on disk update": 0, "eviction gave up due to detecting an out of order tombstone ahead of the selected on disk update after validating the update chain": 0, "eviction gave up due to detecting out of order timestamps on the update chain after the selected on disk update": 0, "eviction passes of a file": 1652005717, "eviction server candidate queue empty when topping up": 6622753, "eviction server candidate queue not empty when topping up": 1243851, "eviction server evicting pages": 0, "eviction server slept, because we did not make progress with eviction": 37252247, "eviction server unable to reach eviction goal": 0, "eviction server waiting for a leaf page": 6709781, "eviction state": 64, "eviction walk most recent sleeps for checkpoint handle gathering": 337429, "eviction walk target pages histogram - 0-9": 1627929856, "eviction walk target pages histogram - 10-31": 14934843, "eviction walk target pages histogram - 128 and higher": 0, "eviction walk target pages histogram - 32-63": 2263571, "eviction walk target pages histogram - 64-128": 6877447, "eviction walk target pages reduced due to history store cache pressure": 0, "eviction walk target strategy both clean and dirty pages": 94202264, "eviction walk target strategy only clean pages": 1045725003, "eviction walk target strategy only dirty pages": 512078450, "eviction 
walks abandoned": 1748084, "eviction walks gave up because they restarted their walk twice": 1635495899, "eviction walks gave up because they saw too many pages and found no candidates": 3054377, "eviction walks gave up because they saw too many pages and found too few candidates": 118747, "eviction walks reached end of tree": 3274718890, "eviction walks restarted": 0, "eviction walks started from root of tree": 1639794224, "eviction walks started from saved location in tree": 12211493, "eviction worker thread active": 4, "eviction worker thread created": 0, "eviction worker thread evicting pages": 751171173, "eviction worker thread removed": 0, "eviction worker thread stable number": 0, "files with active eviction walks": 0, "files with new eviction walks started": 1639222991, "force re-tuning of eviction workers once in a while": 0, "forced eviction - history store pages failed to evict while session has history store cursor open": 56, "forced eviction - history store pages selected while session has history store cursor open": 983, "forced eviction - history store pages successfully evicted while session has history store cursor open": 0, "forced eviction - pages evicted that were clean count": 2914123, "forced eviction - pages evicted that were clean time (usecs)": 7737568, "forced eviction - pages evicted that were dirty count": 44341, "forced eviction - pages evicted that were dirty time (usecs)": 744861903, "forced eviction - pages selected because of a large number of updates to a single item": 0, "forced eviction - pages selected because of too many deleted items count": 4398374, "forced eviction - pages selected count": 3230405, "forced eviction - pages selected unable to be evicted count": 49359, "forced eviction - pages selected unable to be evicted time": 88876, "hazard pointer blocked page eviction": 5412330, "hazard pointer check calls": 754409696, "hazard pointer check entries walked": 6790982355, "hazard pointer maximum array length": 2, "history store score": 0, "history store table insert calls": 529768, "history store table insert calls that returned restart": 0, "history store table max on-disk size": 0, "history store table on-disk size": 12288, "history store table out-of-order resolved updates that lose their durable timestamp": 0, "history store table out-of-order updates that were fixed up by reinserting with the fixed timestamp": 0, "history store table reads": 0, "history store table reads missed": 0, "history store table reads requiring squashed modifies": 0, "history store table truncation by rollback to stable to remove an unstable update": 0, "history store table truncation by rollback to stable to remove an update": 0, "history store table truncation to remove an update": 0, "history store table truncation to remove range of updates due to key being removed from the data page during reconciliation": 621333719, "history store table truncation to remove range of updates due to out-of-order timestamp update on data page": 0, "history store table writes requiring squashed modifies": 2, "in-memory page passed criteria to be split": 452349, "in-memory page splits": 222898, "internal pages evicted": 3412498, "internal pages queued for eviction": 3619100, "internal pages seen by eviction walk": 435623120, "internal pages seen by eviction walk that are already queued": 1914989, "internal pages split during eviction": 455, "leaf pages split during eviction": 533567, "maximum bytes configured": 7481589760, "maximum page size at eviction": 360, "modified pages 
evicted": 29074206, "modified pages evicted by application threads": 842, "operations timed out waiting for space in cache": 0, "overflow pages read into cache": 0, "page split during eviction deepened the tree": 122, "page written requiring history store records": 32450, "pages currently held in the cache": 62286, "pages evicted by application threads": 8116, "pages evicted in parallel with checkpoint": 11212431, "pages queued for eviction": 771455762, "pages queued for eviction post lru sorting": 772685146, "pages queued for urgent eviction": 19515881, "pages queued for urgent eviction during walk": 2251894, "pages queued for urgent eviction from history store due to high dirty content": 0, "pages read into cache": 725151680, "pages read into cache after truncate": 71477, "pages read into cache after truncate in prepare state": 0, "pages requested from the cache": 14882139441, "pages seen by eviction walk": 4128748781, "pages seen by eviction walk that are already queued": 99389964, "pages selected for eviction unable to be evicted": 5559574, "pages selected for eviction unable to be evicted as the parent page has overflow items": 0, "pages selected for eviction unable to be evicted because of active children on an internal page": 109884, "pages selected for eviction unable to be evicted because of failure in reconciliation": 240, "pages selected for eviction unable to be evicted because of race between checkpoint and out of order timestamps handling": 0, "pages walked for eviction": 225576173533, "pages written from cache": 33579608, "pages written requiring in-memory restoration": 24290334, "percentage overhead": 8, "the number of times full update inserted to history store": 529264, "the number of times reverse modify inserted to history store": 504, "tracked bytes belonging to internal pages in the cache": 49304667, "tracked bytes belonging to leaf pages in the cache": 5926656690, "tracked dirty bytes in the cache": 133137, "tracked dirty pages in the cache": 4, "unmodified pages evicted": 719553082 } ``` # Let's Try Postgres 🐘 Since the data held within mongo is quite tabular, we hypothesized that using postgres to query the data would be faster. The data from the `mall` database (three collections - `final_data`(3m docs), `final_data_en_us`(7m docs), `final_data_fr_fr`(0.5M docs)) was migrated to a postgres database to test this. To start I added some relevant indices (the same ones as the ones used by mongo). According to [this](https://www.postgresql.org/message-id/12553.1135634231@sss.pgh.pa.us), our optimal scan should be what is known as the bitmap index scan due to the size of our queries. * note: the postgres instance has the same resources and the mongo (32Gi Ram) ## TLDR A metric that I will be referencing is the *average execution time per document*. I run the query for roughly 300K documents and find how long it took for the database to return each document. A lower number is better. This value will be given in miliseconds. 
Baseline Queries

- Mongo: `0.51ms`
- Postgres:
    - Trial 2 (client-side, batch): `0.97ms`
    - Trial 3 (server-side, batch): `0.68ms`
    - Trial 4 (server-side, default): `0.44ms`
    - Trial 5 (server-side, iterator=750): `0.44ms`

Querying by JSONB

- Mongo: `0.525ms`
- Postgres:
    - Trial 6 (server-side, iterator=750, query by jsonb): `0.44ms`
    - Trial 7 (server-side, iterator=750, query by jsonb, parse rows to dict): `0.46ms`

Finding Distinct Values (Trial 8)

- Mongo: `163.8 seconds`
- Postgres: `1.427 seconds`

Checking For Usable Final Document (Trial 9)

- Mongo: `0.011 seconds`
- Postgres: `0.0024 seconds`

Overall, postgres is still slightly faster than mongo. To put these metrics in perspective, postgres will be roughly one minute faster for every one million documents returned.

## Trial No. 1

```
query = '''
    SELECT * FROM final_data_en_us
    WHERE feed='macys' AND project='triplefive' AND status='completed'
'''
cur.execute(query)
records = cur.fetchall()
...
```

Queries where I tried to select everything never even returned. It seems that psycopg's execute retrieves the full result set before returning at all; this is different from how the mongo python client works. Also, running an explain on this query showed that it was using a bitmap heap scan (an optimized version of a sequential scan that ANDs and ORs the relevant indexes together to visit only the disk pages it needs), which means the largest issue is the sheer size of the result set.

## Baseline Mongo Measurements

Using this basic query, I found how much time it takes mongo to collect 300K documents.

```
import time

filter = {
    "feed": "macys",
    "project": "triplefive",
    "status": "completed",
}
cursor = col2.find(filter).limit(300000)

START = time.time()
count = 0
for doc in cursor:
    count += 1

print("TOTAL TIME:")
print('count: ', count)
print(time.time() - START)
```

This yields a time/document of **0.51ms** for mongo.

## Trial No. 2

```
amount = 1000
page = 0
# loop while there are still documents...
query = f"""
    SELECT * FROM final_data_en_us
    WHERE feed='macys' AND project='triplefive' AND status='completed'
    ORDER BY id LIMIT {amount} OFFSET {page * amount};
"""
cur.execute(query)
records = cur.fetchall()
page += 1
...
```

Using pagination to break up the select statement into smaller portions allowed results to actually return.

```
import time

amount = 1000
page = 0
start = time.time()
total_doc = 0

while True:
    query = f"""
        SELECT * FROM final_data_en_us
        WHERE feed='macys' AND project='triplefive' AND status='completed'
        ORDER BY id LIMIT {amount} OFFSET {page * amount};
    """
    cur.execute(query)
    records = cur.fetchall()
    page += 1
    total_doc += len(records)
    if total_doc > 300000:
        break

print('total time:', time.time() - start)
print('total docs:', total_doc)
```

Dividing the total time by the total docs (roughly 300K), the average time taken to collect each document was **0.97ms**. Changing the `amount` did not significantly affect this (a greater amount meant fewer queries overall, but each query took longer to execute). Comparing the two values, mongo takes about half as much time! It seems postgres actually doesn't perform as well here. I looked at the `EXPLAIN` output to see why the postgres query was executing so poorly, and it turned out that by applying the `ORDER BY id`, `LIMIT` and `OFFSET`, postgres had essentially reverted to a sequential scan (it was no longer making use of our indices).
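For reference, this is roughly how the plan was inspected (a sketch; it assumes the same open client-side cursor `cur` as in the snippets above, and the exact plan text will vary):

```
# Sketch: run EXPLAIN on the paginated query and look at the chosen plan.
cur.execute("""
    EXPLAIN
    SELECT * FROM final_data_en_us
    WHERE feed='macys' AND project='triplefive' AND status='completed'
    ORDER BY id LIMIT 1000 OFFSET 100000;
""")
for (line,) in cur.fetchall():
    print(line)  # e.g. "Seq Scan on final_data_en_us ..." vs a Bitmap/Index Scan
```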
## Trial No. 3

After reading more of the docs, one possible improvement was to use server-side cursors, which work more like mongo cursors (query results are streamed to the client instead of materialized up front). Modifying the code slightly to take advantage of a named cursor, we have:

```
query = """
    SELECT * FROM final_data_en_us
    WHERE feed='macys' AND project='triplefive' AND status='completed';
"""
cur.execute(query)  # negligible time now since this query isn't actually being performed here

start = time.time()
amount = 1000
total = 0
while True:
    rows = cur.fetchmany(amount)
    total += len(rows)
    if total > 300000:
        break

print('total time: ', time.time() - start)
print('total docs: ', total)
```

The time/document metric is now **0.68ms**. While this is a good improvement over the default client-side cursor, it is still slower than mongo. Out of curiosity I tried to see if the batch size made a difference; it does not seem to matter much.

![](https://i.imgur.com/iZDvPHV.png)

## Trial No. 4

Still using the server-side cursor, I found another example online that was a lot simpler: instead of managing the batch sizes, we just let the cursor handle things.

```
query = """
    SELECT * FROM final_data_en_us
    WHERE feed='macys' AND project='triplefive' AND status='completed';
"""
cur.execute(query)

count = 0
start = time.time()
for row in cur:
    count += 1

print('total time: ', time.time() - start)
print('total docs: ', count)
```

Using the defaults, we seem to get the best performance! The time/document is **0.44ms**, which is indeed slightly better than mongo!

## Trial No. 5

After further investigation, trial 4 was actually using a default `itersize` of 2000 documents at a time. This is different from manually setting the batch size: **using fetchmany does not actually affect how many documents psycopg fetches per network call**. This means that trial 3's batches did not work as intended. With this knowledge, the query becomes:

```
query = """
    SELECT * FROM final_data_en_us
    WHERE feed='macys' AND project='triplefive' AND status='completed';
"""

for size in [10, 20, 100, 250, 500, 750, 1000]:
    with conn:
        with conn.cursor(name=f"server_side_name_{size}") as cur:
            cur.itersize = size
            cur.execute(query)
            count = 0
            start = time.time()
            for row in cur:
                count += 1
            print('size: ', size)
            print('total time: ', time.time() - start)
            print('total docs: ', count)
```

The resulting plot:

![](https://i.imgur.com/dXagJAJ.png)

With an iterator size of 750, we have the fastest time/document of `0.4367ms`.

## Trial No. 6

I also wanted to see how postgres performs when doing jsonb queries.

```
query = """
    SELECT * FROM final_data_en_us
    WHERE feed='macys' AND project='triplefive' AND status='completed'
      AND meta->>'scrape_version' = '2022_01_16_16_31_09_gen45';
"""

with conn:
    with conn.cursor(name="server_side_name") as cur:
        cur.itersize = 750
        cur.execute(query)
        count = 0
        start = time.time()
        for row in cur:
            count += 1
        print('total time: ', time.time() - start)
        print('total docs: ', count)
```

Using this query, the time/doc is **0.4424ms**. The equivalent query in mongo has a time/doc of **0.525ms**:

```
filter = {
    "feed": "macys",
    "project": "triplefive",
    "status": "completed",
    "meta.scrape_version": "2022_01_16_16_31_09_gen45",
}
cursor = col2.find(filter).limit(300000)

START = time.time()
count = 0
for doc in cursor:
    count += 1

print("TOTAL TIME:")
print('count: ', count)
print(time.time() - START)
```
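As an aside, the same jsonb filter can be written with psycopg2 bind parameters instead of an f-string, so values are escaped by the driver (a sketch; the values and cursor settings are the same ones used above, the cursor name is arbitrary):

```
query = """
    SELECT * FROM final_data_en_us
    WHERE feed = %s AND project = %s AND status = %s
      AND meta->>'scrape_version' = %s;
"""
params = ("macys", "triplefive", "completed", "2022_01_16_16_31_09_gen45")

with conn:
    with conn.cursor(name="server_side_name_params") as cur:
        cur.itersize = 750
        cur.execute(query, params)
        for row in cur:
            pass  # same counting/timing loop as above
```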
## Trial No. 7

Mongo returns documents as dicts directly; psycopg does not. Instead, it returns each row as a tuple in the order of the table's columns. Our app consumes the returned documents as dictionaries, not tuples, so these tuples need to be parsed into the dicts the app expects. Luckily this can be done easily by setting the `cursor_factory` parameter:

```
conn.cursor(name="server_side_name", cursor_factory=psycopg2.extras.DictCursor)
```

Performing Trial No. 6 with this cursor, the time/doc increased to **0.46ms** (up from 0.44ms). Postgres is still faster!

## Trial No. 8

To check which database is able to return distinct values faster, the following queries were used:

Postgres:

```
SELECT DISTINCT project FROM final_data_en_us WHERE "STORE"='FOOT_LOCKER';
```

Mongo:

```
col2.distinct("project", {"STORE": "FOOT_LOCKER"})
```

Postgres returned the 4 distinct projects in **1.427 seconds**, while mongo took an astoundingly long **163.78 seconds**. Postgres performed significantly better!

## Trial No. 9

Checking the performance of the query in `is_final_catalog_cached`, which checks whether a final cache is available.

Postgres:

```
SELECT * FROM final_data_en_us WHERE feed='apple' AND locale='en-US' AND status='completed' AND date='20220706' LIMIT 1
```

Mongo:

```
col2.find_one({'feed': 'apple', 'locale': 'en-US', 'date': '20220706', 'status': 'completed'})
```

Postgres took just **0.00243 seconds** to complete this query, while mongo took **0.011 seconds**.
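For completeness, a sketch of how these two existence checks can be timed side by side (it assumes the open psycopg2 connection `conn` and the pymongo collection `col2` from the earlier trials; the `timed` helper is illustrative, not from the codebase):

```
import time

def timed(fn):
    """Return (elapsed_seconds, result) for a zero-argument callable."""
    start = time.time()
    result = fn()
    return time.time() - start, result

def pg_check():
    # Postgres: is there at least one usable final document?
    with conn.cursor() as cur:
        cur.execute("""
            SELECT * FROM final_data_en_us
            WHERE feed='apple' AND locale='en-US'
              AND status='completed' AND date='20220706'
            LIMIT 1
        """)
        return cur.fetchone() is not None

def mongo_check():
    # Mongo: same check via find_one
    return col2.find_one({'feed': 'apple', 'locale': 'en-US',
                          'date': '20220706', 'status': 'completed'}) is not None

for name, check in [("postgres", pg_check), ("mongo", mongo_check)]:
    elapsed, found = timed(check)
    print(f"{name}: found={found} in {elapsed:.5f}s")
```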