# Follow-up of tasks: SolrCloud tuning
**Open questions / issues**
- [x] Did you have the chance to capture the necessary metrics on Datadog to monitor the number of shards in degraded / down state? I'd like to create monitors on them asap
WIP: exploring the addition of new metrics fetched from the JSON API (see the sketch after this list).
- [x] Related to the previous point, it is a fact that at least a couple of times the cluster crashed somehow after you ran a very high load test against it. Last Friday you provided us with the procedure to follow when it happens (check the metrics and do a rolling restart), but I am a bit worried about the lack of "self-healing" behaviour of SolrCloud. Why is it necessary to restart the nodes? Should the cluster self-heal in these scenarios?
- [ ] We've got another application (our Marketplace) that we are already testing on SolrCloud, i.e., its test collections are already indexed there. Our next move is to index a copy of its prod collection, which is way bigger than the one we are using in our load testing (libraryE0596). I'll sync with you when we launch this indexation process so you can monitor it; there will be a lot of write ops and I do not know if the cluster is optimized / ready for this. In summary: we need to work on sizing this collection as well (number of replicas and shards, maybe adjust caches, heap...). Below are some stats about this collection on prod:
  - Current size: 112 GB
  - Num docs: 4,261,700
  - Max doc: 6,532,336
  - Deleted docs: 2,270,636
  - Segment count: 41
  - Instance specs: 16 GB RAM, 4 CPU
  - Heap: 12 GB
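For the degraded / down shard question above, here is a minimal sketch of the check we could feed into Datadog as a custom metric, assuming the Collections API CLUSTERSTATUS action is reachable on one of the nodes (the host/port is a placeholder):
```python
import requests

SOLR_URL = "http://localhost:8983/solr"  # assumption: point at any live node of the cluster

def replica_state_counts():
    """Count replicas per state (active, recovering, down, recovery_failed) for each collection."""
    resp = requests.get(f"{SOLR_URL}/admin/collections",
                        params={"action": "CLUSTERSTATUS", "wt": "json"}, timeout=10)
    resp.raise_for_status()
    collections = resp.json()["cluster"]["collections"]

    counts = {}
    for coll_name, coll in collections.items():
        per_state = {}
        for shard in coll["shards"].values():
            for replica in shard["replicas"].values():
                state = replica.get("state", "unknown")
                per_state[state] = per_state.get(state, 0) + 1
        counts[coll_name] = per_state
    return counts

if __name__ == "__main__":
    for coll, states in replica_state_counts().items():
        degraded = sum(n for s, n in states.items() if s != "active")
        # A Datadog custom metric / monitor could be driven from `degraded` here.
        print(coll, states, "degraded replicas:", degraded)
```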
# Notable work done
* **Aug, 27th:** Applied the first week's (W34) recommendations to the cluster. The softCommit setting was left untouched due to the business case's requirements.
{%pdf https://public99.s3.eu-west-1.amazonaws.com/hackmd-uploads/Week+34+-+Odilo+-+Amrit+Sarkar.pdf%}
* **Oct, 13th:** Sync-up call. Fixed load testing pipeline.
* **Oct, 14th:** Load testing report for 1 shard, 2 replicas on libraryE0596.
{%pdf https://public99.s3.eu-west-1.amazonaws.com/hackmd-uploads/Odilo+Solrcloud+Test+Runs+1S2R+-+Draft+2.pdf%}
* **Oct, 15th:** Reindexed collection libraryE0596 using 2 shards, 2 replicas.
* **Oct, 15th:** Load testing data amended to not use queries with rows=2xxxxxxx. This won't happen on prod, as those queries were already fixed by the dev team on the dev branch.
* **Oct, 18th:** Cluster down after a load test.
* **Oct, 19th:** New DD agent config uploaded, cluster nodes upgraded to t3.xlarge.
* **Oct, 19th:** Created widget on DD dashboard to monitor used cache heap per core.
* **Oct, 20th:** New set of tests. Ran multiple tests today with better servers and got the QPS up to 120 smoothly, the 99th percentile being 1.9 s, which is decent compared to previous test runs. Still greatly concerned with thread management, but now we are in a position to do capacity sizing. I will bring all the tables together in a single spreadsheet. Let me know if we can speak on Friday and conclude our findings. I can see in the APM metrics that we are getting a peak of 180 QPS; we need to be in a position to handle the highest QPS + 30% buffer, that would be 225 QPS smooth sailing with latency around 300 ms. I will work on this in more detail tomorrow and prepare an estimation chart.
{%pdf https://public99.s3.eu-west-1.amazonaws.com/hackmd-uploads/Odilo+Solrcloud+Test+Runs+2S2R+T4+Machine.pdf%}
* **Oct, 20th:** Estimated that the sample collection's traffic represents about 75% of the overall Solr QPS.
* **Oct, 21st:** Cluster down due to shards in an unrecoverable / down state. Applied the settings found in a comment on [https://issues.apache.org/jira/browse/SOLR-15139](https://issues.apache.org/jira/browse/SOLR-15139).
* **Oct, 21st:** New load testing report with the heap incremented to 6 GB.
{%pdf https://public99.s3.eu-west-1.amazonaws.com/hackmd-uploads/Odilo+Solrcloud+Test+Runs+2S2R+6GB+Heap+T4+Machine.pdf%}
**Summary**: If we still have a bit of time, I would like to test 1 shard, 3 replicas. Let me know if that is something we can entertain. The index size is 512 MB for libraryE0596 and I don't think we need shards for this collection; a single shard with multiple replicas is the way to go.
* **Oct, 26th:** New load test report with 1S3R. Excellent latency figures and a 0% error rate at up to 240 QPS sustained for 10 min.
{%pdf https://public99.s3.eu-west-1.amazonaws.com/hackmd-uploads/Odilo+Solrcloud+Test+Runs+1S3R+6GB+Heap+T4+Machine.pdf%}
# Cheatsheet / Knowledge base
* **Cluster crashes after a high load test**
```
Here's the rule of thumb if Solr gets flaky:
1. Check the number of threads in the DD dashboard. If it ranges around 10-20K, that means high-QPS traffic was applied and didn't finish well.
2. Check the GC widget in the DD dashboard to see if there are too many spikes.
In both cases, do rolling restarts of the nodes; with replication there will be no downtime.
```
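As a complement to eyeballing the DD dashboard, a hedged sketch that pulls the same two signals (live JVM thread count and GC counts) straight from Solr's Metrics API on each node. Node addresses are placeholders and the exact GC metric names depend on the collector in use, so treat those keys as assumptions:
```python
import requests

NODES = ["http://solr-node-1:8983", "http://solr-node-2:8983"]  # assumption: real node addresses
THREAD_ALARM = 10_000  # rule-of-thumb threshold from the cheatsheet above

def jvm_health(node):
    """Fetch live thread count and GC collection counts from the node's JVM metrics."""
    resp = requests.get(f"{node}/solr/admin/metrics",
                        params={"group": "jvm", "wt": "json"}, timeout=10)
    resp.raise_for_status()
    jvm = resp.json()["metrics"]["solr.jvm"]
    return {
        "threads": jvm.get("threads.count"),
        # GC metric names depend on the collector; these assume G1.
        "gc_young_count": jvm.get("gc.G1-Young-Generation.count"),
        "gc_old_count": jvm.get("gc.G1-Old-Generation.count"),
    }

if __name__ == "__main__":
    for node in NODES:
        health = jvm_health(node)
        flaky = (health["threads"] or 0) >= THREAD_ALARM
        print(node, health, "-> consider a rolling restart" if flaky else "-> OK")
```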
* **Cluster scale-out (a new node is added in the future). Should we change the number of replicas to match the node count? Do you know of any existing tool to automate this process across all the working collections?**
```
There are no open-source tools available to automatically scale out a cluster on a logical basis, i.e., adding or
moving replicas to each shard of a collection according to some business logic. The reason is that it is too big a
responsibility for automation to scale out collections while live requests are coming in. When replicas are
added, some processing is required at the Zookeeper ensemble and the corresponding leader replicas to register the new
replicas.
In most cases, collections with smaller index sizes (comparatively fewer documents than the rest) don't
need more replicas to serve the defined SLAs; the problematic requests (long-running / tail queries) arise from
bigger collections (collections with a higher document count / index size). It is not necessary to have one replica of
each shard on every node; some replicas need not be hosted on every node.
To identify problematic / non-SLA-sufficing collections, monitoring dashboards are used. Based on the
metrics, collections are marked and the corresponding ADDREPLICA commands are run. Common scenarios for replica
movement, in this case addition:
1. Solr query response time is exceeding SLAs for a few collections: mark those collections and move/add
replicas to new nodes. In most cases, moving replicas (adding new replicas on new nodes and
deleting existing ones) is enough. Too many replicas (in Solr standalone terms, cores) can slow down
query processing.
2. Multiple GCs occurring on a few (not all) nodes. This behaviour indicates heavily read replicas hosted
on them, and some should be moved/added to a new node.
3. CPU contention. When nodes (AWS servers) are overwhelmed by the incoming request rate, a few of the
replicas need to be moved to new nodes. Along similar lines, an equal distribution of replicas with respect to
request rate is preferred.
If any other bad / deteriorating case comes up, it needs to be handled on a case-by-case basis, identifying the
culprit replicas or nodes.
```
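A minimal sketch of the ADDREPLICA call referenced above, using the Collections API; the collection, shard and target node values are placeholders:
```python
import requests

SOLR_URL = "http://localhost:8983/solr"  # assumption: any live node of the cluster

def add_replica(collection, shard, target_node=None):
    """Add one replica to `shard`; Solr picks a node unless `target_node` is given."""
    params = {"action": "ADDREPLICA", "collection": collection, "shard": shard, "wt": "json"}
    if target_node:
        params["node"] = target_node  # e.g. "10.0.0.12:8983_solr" (placeholder)
    resp = requests.get(f"{SOLR_URL}/admin/collections", params=params, timeout=120)
    resp.raise_for_status()
    return resp.json()

# Example: react to SLA breaches on libraryE0596 by adding a replica on a freshly added node.
# add_replica("libraryE0596", "shard1", target_node="new-node:8983_solr")
```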
* **Cluster scale-in (a node is removed from the cluster). How should we proceed? Is there any tool to automate the process of moving the hosted shards / replicas over to another node? Is this a real/common scenario that might happen in the future?**
```
Just like above, there are no open-source tools available to automatically scale in a cluster on a logical basis, i.e., adding or moving replicas of each shard of a collection according to some business logic. The whole purpose of
adopting SolrCloud is to handle situations when one or more nodes go down: the live traffic should still be
served with responses. Nodes can go down for various reasons, e.g., OOM, CPU contention, disk space
exhaustion, etc., and proper monitoring and alerts can help us avoid such situations.
All the reasons for a node going down can be avoided in a timely manner; unforced human errors are not considered.
```
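For scale-in, draining a node manually boils down to listing the replicas it hosts (CLUSTERSTATUS) and relocating them with MOVEREPLICA. A hedged sketch with placeholder node names:
```python
import requests

SOLR_URL = "http://localhost:8983/solr"  # assumption: any live node

def replicas_on_node(node_name):
    """List (collection, shard, replica) triples hosted on the node being decommissioned."""
    resp = requests.get(f"{SOLR_URL}/admin/collections",
                        params={"action": "CLUSTERSTATUS", "wt": "json"}, timeout=10)
    resp.raise_for_status()
    found = []
    for coll, cdata in resp.json()["cluster"]["collections"].items():
        for shard, sdata in cdata["shards"].items():
            for rname, rdata in sdata["replicas"].items():
                if rdata["node_name"] == node_name:
                    found.append((coll, shard, rname))
    return found

def move_replica(collection, replica, target_node):
    """MOVEREPLICA relocates the replica and removes the old core once the new one is active."""
    resp = requests.get(f"{SOLR_URL}/admin/collections",
                        params={"action": "MOVEREPLICA", "collection": collection,
                                "replica": replica, "targetNode": target_node, "wt": "json"},
                        timeout=600)
    resp.raise_for_status()
    return resp.json()

# for coll, shard, replica in replicas_on_node("old-node:8983_solr"):
#     move_replica(coll, replica, "surviving-node:8983_solr")
```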
* **What would be the key indicators/metrics that lead us to start using sharding for a specific collection (provided we previously chose to create only one shard)? I mean, when the collection grows, we need an alarm warning us that the collection should increase its number of shards, and in that case, how are we supposed to do this "shard splitting"? Is there any way to automate it so we have zero downtime?**
```
There is no rule of thumb for when a shard should be split, but we can create our own indicators. As the data grows
in a particular collection (a particular client) and the average response time has increased by some limit, say 30%, the
likely reason is the growing data and the shard can be split. It is not guaranteed that increasing the number of
shards will improve performance, hence a load test in a lower environment with identical data is a must.
Splitting shards is the last resort to improve performance; other factors like cache, heap, RAM, etc. are
tuned first where possible. We will decide very carefully on the number of shards for each collection, based on the
number of documents in each collection compared to the others at the time of the load test and initial setup.
We start with one shard per collection to create a baseline, then quickly split larger collections into 2
shards and run the test again. If response time improves, we split into 3 shards and test again, and so on.
If performance worsens, we select the best logical combination and finalise it for production.
```
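When the indicators above do point at splitting, the operation itself is the Collections API SPLITSHARD action; the parent shard keeps serving queries until the new sub-shards are active, so there is no read downtime. A minimal sketch that runs it asynchronously and polls REQUESTSTATUS (collection and shard names are placeholders):
```python
import time
import requests

SOLR_URL = "http://localhost:8983/solr"  # assumption: any live node

def split_shard(collection, shard, async_id):
    """Kick off SPLITSHARD asynchronously; the parent shard keeps serving reads meanwhile."""
    requests.get(f"{SOLR_URL}/admin/collections",
                 params={"action": "SPLITSHARD", "collection": collection,
                         "shard": shard, "async": async_id, "wt": "json"},
                 timeout=30).raise_for_status()

def wait_for(async_id, poll_seconds=30):
    """Poll REQUESTSTATUS until the async operation completes or fails."""
    while True:
        resp = requests.get(f"{SOLR_URL}/admin/collections",
                            params={"action": "REQUESTSTATUS", "requestid": async_id, "wt": "json"},
                            timeout=10)
        state = resp.json()["status"]["state"]
        if state in ("completed", "failed"):
            return state
        time.sleep(poll_seconds)

# split_shard("libraryE0596", "shard1", async_id="split-libraryE0596-shard1")
# print(wait_for("split-libraryE0596-shard1"))
```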
* **Backup and restore: useful doc links**
```
https://solr.apache.org/guide/8_8/making-and-restoring-backups.html#restore-api
https://solr.apache.org/guide/8_8/collection-management.html#backup
https://solr.apache.org/guide/8_8/collection-management.html#restore
```
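A hedged sketch of the backup / restore calls those docs describe, assuming a backup location that is shared (or mounted) across all nodes; the paths and names are placeholders:
```python
import requests

SOLR_URL = "http://localhost:8983/solr"      # assumption: any live node
BACKUP_LOCATION = "/mnt/solr-backups"        # assumption: shared filesystem visible to every node

def backup(collection, name):
    """Take a named backup of the whole collection into the shared location."""
    resp = requests.get(f"{SOLR_URL}/admin/collections",
                        params={"action": "BACKUP", "collection": collection,
                                "name": name, "location": BACKUP_LOCATION, "wt": "json"},
                        timeout=3600)
    resp.raise_for_status()
    return resp.json()

def restore(name, target_collection):
    """Restore into a *new* collection; it must not already exist."""
    resp = requests.get(f"{SOLR_URL}/admin/collections",
                        params={"action": "RESTORE", "name": name,
                                "collection": target_collection,
                                "location": BACKUP_LOCATION, "wt": "json"},
                        timeout=3600)
    resp.raise_for_status()
    return resp.json()

# backup("libraryE0596", "libraryE0596-2021-10-26")
# restore("libraryE0596-2021-10-26", "libraryE0596_restored")
```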
* **How did you do the math for stating that each core uses 64 MB of cache? Is this an optimal value for our collections? Should we tweak any cache setting?**
```
It is an assumption that each core will take 64 MB, since there is a provision to put an upper cap on cache consumption: maxRamMB is a per-cache parameter in solrconfig.xml that we can regulate for each core. Yes, we need to tweak the cache settings; they are not set to 64 MB right now.
```
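To validate the 64 MB assumption against reality, a sketch that reads the per-core searcher cache statistics from the Metrics API. The exact field names (e.g. ramBytesUsed) vary with the Solr version and cache implementation, so treat them as assumptions:
```python
import requests

NODE = "http://localhost:8983"  # assumption: run against each node in turn

def cache_ram_per_core():
    """Report RAM used by the searcher caches of every core hosted on this node, in MB."""
    resp = requests.get(f"{NODE}/solr/admin/metrics",
                        params={"group": "core", "prefix": "CACHE.searcher", "wt": "json"},
                        timeout=10)
    resp.raise_for_status()
    report = {}
    for core, metrics in resp.json()["metrics"].items():   # keys look like "solr.core.<coll>.<shard>.<replica>"
        for name, stats in metrics.items():                 # e.g. "CACHE.searcher.filterCache"
            if isinstance(stats, dict) and "ramBytesUsed" in stats:
                report.setdefault(core, {})[name] = stats["ramBytesUsed"] / (1024 * 1024)
    return report

if __name__ == "__main__":
    for core, caches in cache_ram_per_core().items():
        print(core, {cache: f"{mb:.1f} MB" for cache, mb in caches.items()})
```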
* **Threads management in Single vs Multiple sharded Collection at Queries**
```
A single-sharded collection has to spin up X threads for as many concurrent requests as are received. 80 QPS against a single shard with two replicas needs 40 threads (httpExecutors) per node to be spun up every second, whether or not the previous threads have finished. Under constant high-pressure traffic this can grow 10-20 times, hence we see 800-1000 threads being active at one point in time. Thanks to Java 9's improved concurrency model, 90% of the threads are in the TIMED_WAITING state, waiting patiently to be executed without disrupting background Lucene-based and Solr-based activities. It is perfectly normal to spin up multiple threads as long as they are not in a BLOCKED / WAITING state, which would indicate the Java process is overwhelmed.
A multi-sharded collection has to spin up even more threads for shard management. At 80 QPS, a two-shard collection fires a sub-query per shard, plus an httpShardExecutor thread that aggregates the results from both shards, meaning roughly 3x the threads compared to a single-sharded collection. Here the server configuration (number of cores) plays a major role: the divide-and-conquer approach (parallel execution) ends up using more threads and only pays off if the servers have enough cores to handle the load. In our case, 2-core servers were not enough to handle 3K-4K threads per node and the nodes gloriously crashed at 80 QPS itself. If enough cores are available, parallel execution improves response time; otherwise high traffic can crash the Java processes.
```
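Since the distinction between TIMED_WAITING and BLOCKED threads is the important one here, a small sketch that reads the per-state thread counts from the JVM metrics group (the node address is a placeholder):
```python
import requests

NODE = "http://localhost:8983"  # assumption: run once per node

def thread_states():
    """Break down live JVM threads by state to separate healthy waits from contention."""
    resp = requests.get(f"{NODE}/solr/admin/metrics",
                        params={"group": "jvm", "prefix": "threads", "wt": "json"}, timeout=10)
    resp.raise_for_status()
    jvm = resp.json()["metrics"]["solr.jvm"]
    return {
        "total": jvm.get("threads.count"),
        "timed_waiting": jvm.get("threads.timed_waiting.count"),  # normal under load
        "waiting": jvm.get("threads.waiting.count"),
        "blocked": jvm.get("threads.blocked.count"),              # contention: worth alerting on
        "runnable": jvm.get("threads.runnable.count"),
    }

if __name__ == "__main__":
    print(thread_states())
```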
* **OneMinuteRate metric**
```
So here's the thing: QPS is not equivalent to the number of /select requests seen by Solr, for the following reason:
1 query made to Solr is distributed into N internal queries, one for each shard. For 2 shards, 2 more local /select queries are fired; in total 1 + 2 = 3 queries are listed in APM.
OneMinuteRate is the true source of the QPS KPI for any SolrCloud collection.
```
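A sketch of pulling that rate from the Metrics API for the /select handler of every core on a node; in the JSON output the Dropwizard OneMinuteRate is exposed as `1minRate`, which should be verified against the Solr version in use:
```python
import requests

NODE = "http://localhost:8983"  # assumption: repeat per node and aggregate per collection

def select_one_minute_rates():
    """Return the one-minute request rate of the /select handler for every core on this node."""
    resp = requests.get(f"{NODE}/solr/admin/metrics",
                        params={"group": "core", "prefix": "QUERY./select.requestTimes", "wt": "json"},
                        timeout=10)
    resp.raise_for_status()
    rates = {}
    for core, metrics in resp.json()["metrics"].items():
        timer = metrics.get("QUERY./select.requestTimes", {})
        rates[core] = timer.get("1minRate")   # Dropwizard's OneMinuteRate
    return rates

if __name__ == "__main__":
    print(select_one_minute_rates())
```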
* **Cluster self-healing and Java threads**
```
At the heart of SolrCloud there are Java processes running the search engine.
Up to a certain point a Java process can handle the management of threads and GCs. The other factor is the out-of-the-box Solr resiliency: retries due to network packet drops and mis-communication with ZK are all handled.
Where Solr as a singleton service, without an orchestration tool like Kubernetes, fails is when the Java process is stuck: if more than 20K live threads are spawned, that number of threads is not going down anytime soon unless the traffic (live requests) drops by 50%.
That's how the Java concurrency framework is written by Oracle.
We built a solution at LW previously where Solr processes are forcefully restarted if they are not available for a certain interval, i.e., a liveness probe, since there is no other way to write an ExecutorThreadPool in Java 11 and beyond.
Kubernetes solved it out of the box for us in the later stages.
What we can do is perform the best possible capacity planning, and restart the Solr processes when a situation like Friday's happens.
And in the future, if possible, move to Docker and Kubernetes and get rid of manual restarts altogether.
```
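A very rough sketch of the watchdog described above (forcefully restart Solr when it stops answering for a while), assuming systemd manages a service called `solr` and that `/solr/admin/info/system` is an acceptable liveness signal; both are assumptions, and a Kubernetes liveness probe would replace this entirely:
```python
import subprocess
import time
import requests

NODE = "http://localhost:8983"      # assumption: run one watchdog per node
CHECK_EVERY = 30                    # seconds between probes
FAILURES_BEFORE_RESTART = 4         # i.e. roughly 2 minutes of unavailability

def alive():
    """Return True if the node answers the system info endpoint in time."""
    try:
        resp = requests.get(f"{NODE}/solr/admin/info/system", params={"wt": "json"}, timeout=10)
        return resp.status_code == 200
    except requests.RequestException:
        return False

def watchdog():
    failures = 0
    while True:
        if alive():
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_RESTART:
                # Assumption: Solr is installed as a systemd unit called "solr".
                subprocess.run(["systemctl", "restart", "solr"], check=False)
                failures = 0
        time.sleep(CHECK_EVERY)

if __name__ == "__main__":
    watchdog()
```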