My first postmortem - ALX

Issue summary

The issue arose because the server had a default inactivity timeout, which made it unreachable when a client tried to communicate with it.

Timeline:

  • 2023-07-04, 6:00 AM EAT: Project released.
  • 2023-07-04, 9:00 AM EAT: Began the project.
  • 2023-07-04, 9:20 AM EAT: Everything was working fine. Took a 30-minute break.
  • 2023-07-04, 9:50 AM EAT: Tried to reach my server using curl and received an unreachable status. Ping returned destination unknown.
  • 2023-07-04, 10:00 AM EAT: Logged in to the Ubuntu server and checked the nginx status and logs, only to find that the web server was down.

Root cause and resolution

After analysing the nginx error logs, I realised that the server had gone to sleep after 20 minutes of inactivity. This caused the error when a client tried to communicate with it. To resolve it, I wrote a Puppet script that sets up a cron job to ensure the web service is active every 15 minutes, avoiding this kind of downtime.
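
The check that such a cron job runs could look roughly like the sketch below. This is not the Puppet script I used. It is a minimal Python illustration that assumes the server uses systemd (so `systemctl` is available) and that the service is named `nginx`. The function name `ensure_service_active` and its `run` parameter are hypothetical, chosen only for this example.

```python
import subprocess


def ensure_service_active(service="nginx", run=subprocess.run):
    """Start `service` if it is not active.

    Returns True if the service was already active, False if a start
    was attempted. `run` is injectable so the logic can be exercised
    without touching a real server (an assumption of this sketch).
    """
    # `systemctl is-active --quiet` exits with 0 only when the unit is active.
    check = run(["systemctl", "is-active", "--quiet", service])
    if check.returncode == 0:
        return True
    # The service is down: try to start it (this needs root privileges).
    run(["systemctl", "start", service])
    return False
```

To run it every 15 minutes, the script would call `ensure_service_active()` at the bottom and be scheduled from root's crontab with an entry such as `*/15 * * * * /usr/bin/python3 /usr/local/bin/ensure_nginx.py`, where the script path is only a placeholder.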

Corrective and preventative measures:

At the time of this downtime, only one web server was serving the site, which made it a single point of failure (SPOF). A good preventive measure is therefore to use a load balancer to distribute traffic across multiple servers, so that one server going down does not cause a total outage.
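
To illustrate the idea, here is a small sketch of round-robin selection that skips servers that are down. It is a toy model, not how a real load balancer is configured. The class name `RoundRobinBalancer`, the `is_healthy` callback, and the server names `web-01` and `web-02` are all assumptions made for this example.

```python
class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that fail a health check."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = 0  # index of the backend to try first on the next pick

    def pick(self, is_healthy):
        """Return the next healthy backend, or None if every backend is down."""
        # Try each backend at most once, starting where the last pick stopped.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if is_healthy(backend):
                return backend
        return None


# Hypothetical server names and health states, for illustration only.
status = {"web-01": True, "web-02": True}
balancer = RoundRobinBalancer(["web-01", "web-02"])
first = balancer.pick(status.get)   # traffic alternates while both are up
status["web-01"] = False            # simulate the outage from this postmortem
second = balancer.pick(status.get)  # requests keep going to the healthy server
```

With two servers behind the balancer, losing one of them reduces capacity but no longer takes the whole site offline.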

Thank you for reading my article.