# Prod API Server Down Incident Timeline

**15 Aug (Tue)**
- Change made: changed the logging file name granularity to per minute
- Production issues started occurring on a daily basis (refer to Hercules's list "4. Past Week Production Issues")
![](https://hackmd.io/_uploads/rk9MY2NT3.png)
- Temporarily resolved by either restarting the API service or self-recovery

**22 Aug (Tue)**
- Change made 1: changed the logging file name granularity to per hour
- Change made 2: increased the NGINX server [open files] limit setting to 90k, with the intention of raising the threshold (before this change the field was left empty; at the time we believed the open-file default was 1024, as per our findings)
- Reference: https://phoenixnap.com/kb/ulimit-linux-command
- Previous [open files] value was empty

**23 Aug**

6:30a.m UE9 issue reported
- Found the same cause triggering the API server to die -> error message showed "too many open files"; the server hit the 90k limit, got stuck, and eventually went down.
- Redeployed the API to restart the API service (open file usage is reset on restart) [Gary]
![](https://hackmd.io/_uploads/HJpm3nVan.jpg)

2:31p.m ASAA issue reported
- Restarted API service [Sebastian]

5:00p.m
- Changed logging file to daily after discussion [Sebastian]

7:44p.m ASAA issue reported
- Restarted API service [Simon Lim]

8:20p.m EV29 issue reported
- Unable to transfer (seems unrelated; suspect the vendor was unable to return the balance)

9:23p.m
- Closed Filebeat for CM2, ASAA, and 88Pro only [Sebastian]; docker restarted CM2 and 88Pro

9:27p.m
- Disabled Filebeat for the remaining brands (UE9, UDO, EZ12, PH9, EV29, SAM)

10:03p.m ASAA issue reported
- Restarted API service [Simon Lim]

**24 Aug**

12:16a.m ASAA
- Restarted API service [Sebastian]
- Preparing the "Revert old logging style" deployment; GitLab maintenance caused a delay

12:39a.m CM2
- Restarted API service [Sebastian]
- GitLab still in maintenance

12:56a.m UDP
- Restarted API service [Sebastian]
- GitLab still in maintenance

1:48a.m
- Deployed "Revert old logging style"

9:11a.m CM2
- Restarted API service [Avishai]

11:34a.m 88Pro
- Restarted API service [Avishai]

**12:00p.m - 1:00p.m**
- Finding: before the NGINX open file limit was changed to 90k, the chance of the API server going down was lower, and most of the time it only happened on ASAA (before 15/8)
- **Tested reverting NGINX [open files] to an empty field (default) on UE9; after the change, the open file limit was confirmed to be 1,048,576 (1024 x 1024). Since the actual default is far higher than 90k, the 22 Aug change effectively lowered the limit rather than raising it, which explains why the issues above happened.**
- Apply the same revert of NGINX [open files] to an empty field (default) to all brands once the new open file limit on UE9 is verified to be at a safe level.
- POC to add Body.Close() for io.ReadCloser, to close the open connections that consume open files (as at 1pm the open file count had not reduced; POC still ongoing; see the sketch after the timeline)
- Finding: checked GA; no unusual traffic spike on 23 Aug

**6:17pm:**
- Open file usage as at 24/8, 6:17pm (one hypothetical way to collect these counts is sketched after the timeline):

| Brand | Open files |
| --- | ---: |
| ASAA | 240,803 |
| CM2 | 182,103 |
| Udompet | 111,791 |
| UE9 | 70,606 |
| 88Pro | 67,554 |
| EV29 | 52,865 |
| EZ | 15,567 |
| Samruai | 3,460 |
| PH9 | 278 |

**6:28pm:**
- Implemented a daily 5am auto restart of the API server Docker container for ALL brands (purpose: to refresh open file usage)
- Continue monitoring to ensure open file usage is refreshed at 5am every day and the open file limit is not hit
- Monitor until 25/8 10am; will re-enable Filebeat if the above issue does not trigger again
- Continue the POC to reduce and kill the open connections that consume open files in NGINX
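
**Sketch: Body.Close() POC (24/8, 12:00p.m - 1:00p.m item)**

A minimal sketch of the kind of fix the POC describes, assuming the API makes outbound HTTP calls with Go's `net/http`. The function name `fetchBalance` and the endpoint URL are hypothetical and not taken from the actual codebase. When an `http.Response` body (an `io.ReadCloser`) is never closed, the underlying TCP connection keeps holding a file descriptor, which is one way the open file count can climb until the limit is hit.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchBalance is a hypothetical outbound call; the real API's handlers are not shown here.
func fetchBalance(client *http.Client, url string) ([]byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	// Without this deferred Close, the response body (an io.ReadCloser) stays open
	// and the underlying connection keeps holding a file descriptor.
	defer resp.Body.Close()

	// Draining the body also lets the transport reuse the connection
	// instead of leaving it dangling.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return body, nil
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	body, err := fetchBalance(client, "https://example.com/balance") // hypothetical endpoint
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Printf("got %d bytes\n", len(body))
}
```

Descriptors that leaked before such a fix is deployed are only reclaimed when the process restarts, which may be why the open file count had not dropped yet as at 1pm.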
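
**Sketch: collecting per-process open file counts (24/8 6:17pm figures)**

The notes do not say how the 6:17pm per-brand figures were gathered (tools such as `lsof` or reading `/proc` directly are common). Below is one hypothetical way to do it in Go on Linux, by counting the entries under `/proc/<pid>/fd` for the API server process; the PID handling is a placeholder only.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// openFDCount returns the number of open file descriptors for the given PID
// by counting entries under /proc/<pid>/fd (Linux only).
func openFDCount(pid int) (int, error) {
	entries, err := os.ReadDir(filepath.Join("/proc", fmt.Sprint(pid), "fd"))
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	// PID of the API server process; 1 is just a placeholder
	// (inside a container the main process is often PID 1).
	pid := 1
	if len(os.Args) > 1 {
		fmt.Sscanf(os.Args[1], "%d", &pid)
	}

	n, err := openFDCount(pid)
	if err != nil {
		fmt.Println("failed to read /proc:", err)
		return
	}
	fmt.Printf("pid %d has %d open file descriptors\n", pid, n)
}
```

Running something like this periodically against each brand's API container would show whether the 5am restart actually resets usage and how quickly the count climbs back.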