
My first postmortem - ALX

Issue summary

The issue arose because the server had a default timeout, which made it inaccessible when a client tried to communicate with it.

Timeline:

  • 2023-07-04, 6:00 AM EAT: Project released.
  • 2023-07-04, 9:00 AM EAT: Began the project.
  • 2023-07-04, 9:20 AM EAT: Everything was working fine. Went on a 30-minute break.
  • 2023-07-04, 9:50 AM EAT: Tried to reach my server using curl and received an unreachable status. Ping returned destination unknown.
  • 2023-07-04, 10:00 AM EAT: Logged in to the Ubuntu server and checked the nginx status logs, only to find that the web server was down.

Root cause and resolution

After analysing the nginx error logs, I realised that the server had gone to sleep after 20 minutes of inactivity. This caused the error a client got when it tried to communicate with the server. To resolve it, I created a Puppet script that sets up a cron job to ensure that the web service is active every 15 minutes, avoiding this kind of downtime.
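The Puppet script itself is not reproduced here. As a rough sketch in Python (swapped in for illustration, not the Puppet code), this is the kind of check such a cron job could run. It assumes nginx is managed by systemd. The function name `ensure_service_running` and the injectable `run` parameter are hypothetical, added only to make the logic easy to follow and test.

```python
import subprocess


def ensure_service_running(service="nginx", run=subprocess.run):
    """Start `service` if systemd reports it as not active.

    Returns True if a start was attempted, False if the service was
    already active. `run` is injectable so the logic can be exercised
    without touching a real system (an assumption of this sketch).
    """
    # `systemctl is-active --quiet` exits with 0 only when the unit is active.
    status = run(["systemctl", "is-active", "--quiet", service])
    if status.returncode == 0:
        return False
    # check=True raises CalledProcessError if the start command fails.
    run(["systemctl", "start", service], check=True)
    return True


if __name__ == "__main__":
    ensure_service_running()
```

Saved under a hypothetical path such as `/usr/local/bin/ensure_nginx.py`, it could be scheduled from root's crontab with an entry like `*/15 * * * * /usr/bin/python3 /usr/local/bin/ensure_nginx.py`, matching the 15-minute interval described above.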

Corrective and preventative measures:

At the time of this downtime, only one web server was serving the site, which made it a single point of failure (SPOF). A good preventive measure is therefore to use a load balancer to distribute traffic across multiple servers, so that one server going down does not cause total downtime.
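To illustrate why a load balancer removes the single point of failure, here is a minimal sketch of round-robin selection that skips servers reported as down. It is a toy model, not a real load balancer configuration. The class `RoundRobinBalancer`, the `is_healthy` callback, and the server names `web-01` and `web-02` are all hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy round-robin selector that skips unhealthy servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._order = cycle(self.servers)

    def pick(self, is_healthy):
        """Return the next healthy server, or None if every server is down."""
        # Try each server at most once per call, continuing the rotation.
        for _ in range(len(self.servers)):
            server = next(self._order)
            if is_healthy(server):
                return server
        return None


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["web-01", "web-02"])
    down = {"web-01"}  # pretend web-01 has gone to sleep
    # Requests keep being served by the remaining healthy server.
    print(balancer.pick(lambda server: server not in down))
```

With a single server in the list, `pick` returns `None` as soon as that server is down, which is the situation described in this postmortem.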

Thank you for reading my article.