How Grafana Cloud and Prometheus Can Help on Resource Monitoring

---
title: 'How Grafana Cloud and Prometheus Can Help on Resource Monitoring'
disqus: hackmd
tags: Monitoring,Grafana,Prometheus
---

# How Grafana Cloud and Prometheus Can Help on Resource Monitoring

![grafana-logo](https://hackmd.io/_uploads/HyX0Z9SXR.png)

Ensuring the reliability of your application in production is not an easy task. There are many variables and metrics to consider and measure. You might be tempted to write your own scripts, cron jobs, or other automation to monitor the resources in use, but such home-grown tooling tends to be incomplete and buggy.

To resolve this issue, we can leverage two of the most popular "**FREE**" monitoring tools, Grafana and Prometheus. Prometheus acts as the "agent" that collects metric data in near real time and stores it as time series, while Grafana queries, analyzes, and serves that data on a dashboard. Pretty neat, isn't it?

In this article we won't discuss how to set up Grafana and Prometheus. Instead, we'll discuss the metrics and insights these two tools provide to help you plan and analyze your system resources for better reliability.

## 💻CPU and System Usage

![image](https://hackmd.io/_uploads/S1j7B5SQA.png)

The most important factor in your system's or application's performance is undisputedly its compute device, the CPU. Fortunately, Grafana and Prometheus provide a CPU monitoring preset right out of the box. The "CPU and System Usage" panel gives you three important metrics:

1. CPU Usage
2. Load Average
3. Time (Synchronization) Drift

CPU Usage and Load Average are fairly self-explanatory and correlate with each other. They describe the percentage of CPU time used by all running processes and the average number of processes that are running or waiting for CPU cycles. The third metric, however, is crucial in certain environments.
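
To make the first two metrics more concrete, here is a minimal Python sketch of how a CPU usage percentage and a per-core load figure can be derived. It assumes you already have two samples of cumulative idle CPU seconds (summed over all cores) and a load average value. The function names `cpu_usage_percent` and `load_per_core` and the sample numbers are hypothetical and only serve as illustration; they are not part of Grafana or Prometheus.

```python
def cpu_usage_percent(idle_start: float, idle_end: float,
                      interval_seconds: float, cores: int) -> float:
    """Percentage of CPU time spent non-idle between two samples.

    idle_start and idle_end are cumulative idle seconds summed over all
    cores (an assumed input format for this sketch).
    """
    if interval_seconds <= 0 or cores <= 0:
        raise ValueError("interval_seconds and cores must be positive")
    # Total CPU time available in the window is interval * number of cores.
    idle_fraction = (idle_end - idle_start) / (interval_seconds * cores)
    # Clamp to [0, 1] to absorb small sampling jitter.
    idle_fraction = min(max(idle_fraction, 0.0), 1.0)
    return (1.0 - idle_fraction) * 100.0


def load_per_core(load_average: float, cores: int) -> float:
    """Load average normalized by core count, for comparison across hosts."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


if __name__ == "__main__":
    # 4 cores over a 60 s window: 180 of 240 core-seconds were idle.
    print(cpu_usage_percent(1000.0, 1180.0, 60.0, 4))  # 25.0
    print(load_per_core(6.0, 4))                       # 1.5
```

A per-core load that stays above 1.0 suggests that more processes want CPU time than there are cores to serve them, which is why the two metrics are usually read together.
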
Time drift is paramount in almost any distributed system or infrastructure that needs millisecond or even nanosecond precision. It tells you how far your system clock deviates from real-world time. Imagine a complex process that chains invocations of other functions based on message ordering: if your system timestamps a certain message incorrectly, the ordering breaks and so does your system's behaviour. Yikes.

Conveniently, the "CPU and System Usage" panel already includes the *Time Synchronization Drift* metric, which gives you the actual error and the maximum error of the time drift. Using this information, you can improve how your system calculates time, or adjust running processes based on custom time calculation methods.

## 🧮Memory

![image](https://hackmd.io/_uploads/r1UPF5SXC.png)

The second important aspect is memory, the place where all running processes reside. Grafana and Prometheus provide numerous memory metrics, but in this article I will only address the two most common and crucial ones:

1. Memory Usage
2. Memory Page Faults

Memory usage is straightforward: it tells you how much memory is used by all running processes. Usage above 85% tells you that the system has barely enough memory to run all of its processes. You can also use this metric to investigate memory leaks, since usage that keeps increasing over time is a typical symptom.

Memory page faults are a little more complex. Whenever a process (and the CPU) needs to access data in memory, it looks up the data's virtual address in a structure called the page table, which maps virtual addresses to pages of physical memory. If the lookup finds no valid mapping for that virtual address, a page fault occurs.
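
The two memory metrics above can be sketched in a few lines of Python. This is a minimal illustration, not how Grafana or Prometheus compute their panels: the helper names (`memory_usage_percent`, `looks_like_leak`, `page_fault_rate`) and the sample values are hypothetical, and the 85% threshold is simply the rule of thumb mentioned in the text.

```python
from typing import Sequence

HIGH_USAGE_THRESHOLD = 85.0  # percent; rule of thumb, not a hard limit


def memory_usage_percent(total_bytes: int, available_bytes: int) -> float:
    """Share of memory in use, as a percentage of total memory."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (total_bytes - available_bytes) / total_bytes * 100.0


def looks_like_leak(usage_samples: Sequence[float]) -> bool:
    """Naive leak hint: every sample is higher than the one before it.

    Real workloads are noisy, so treat this only as a prompt to investigate.
    """
    if len(usage_samples) < 3:
        return False
    return all(later > earlier
               for earlier, later in zip(usage_samples, usage_samples[1:]))


def page_fault_rate(faults_start: int, faults_end: int,
                    interval_seconds: float) -> float:
    """Page faults per second from two readings of a cumulative counter.

    Counter resets (for example after a reboot) are not handled here.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (faults_end - faults_start) / interval_seconds


if __name__ == "__main__":
    usage = memory_usage_percent(16_000, 2_000)
    print(usage, usage > HIGH_USAGE_THRESHOLD)         # 87.5 True
    print(looks_like_leak([40.0, 45.5, 52.0, 60.1]))   # True
    print(page_fault_rate(10_000, 16_000, 60.0))       # 100.0
```

Turning the raw fault counter into a per-second rate is what makes the trend readable: a counter always grows, while a rate shows whether faults are becoming more frequent.
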
Whenever you see the number of page faults increasing over time, or holding steady at a high value, it means your processes frequently cannot find the data they need in mapped memory, and resolving each fault adds overhead. This could indicate that your system's cache implementation or your process itself is not optimized and needs to be fixed.

## 📡Network

The next most important aspect is the network. It has so many metrics that this article cannot address them all in detail. Grafana and Prometheus provide network metrics divided into three subcategories:

1. Network (overview)
2. Network Socket
3. Network netstat

All three correlate with each other and describe the same activity in different representations.

### Network Overview

![image](https://hackmd.io/_uploads/SyO5a9S7C.png)

In this category you can monitor the status of every connected network interface. It shows the carrier up/down status, the link speed, and the amount of data transmitted and received. It also shows the network traffic for all related interfaces, giving you general insight into the reliability and connectivity of your network.

### Network Socket

![image](https://hackmd.io/_uploads/H1xaksrXA.png)

The Network Socket category gives you more detailed information at the socket level. You can see the number of sockets in use for both TCP and UDP, as well as the socket memory usage. Many open sockets with little or no activity might indicate too many idle connections or connections that were not closed correctly, which means the socket handling in your application needs to be improved. The socket memory metric can also help you adjust socket (buffer) sizes automatically when needed. Both can be achieved through advanced socket programming.

### Network netstat

![image](https://hackmd.io/_uploads/S1lu-irm0.png)

The Network netstat category provides data on the transport layer.
It shows both TCP and UDP segment transmission as well as the error rate. You can use this information to diagnose or troubleshoot the network your system uses.

## 🎞Filesystem and Disk

![image](https://hackmd.io/_uploads/BksbmsH7C.png)

Filesystem and disk I/O might not seem too important for a small-scale app or simple processes, but they matter a great deal for anything larger. Luckily, Grafana and Prometheus provide this information too. The filesystem information might be boring, but the disk information is very useful. High disk I/O significantly impacts your system's performance, because processes stall while waiting for reads and writes to complete. It usually means that your process reads or writes too much data and needs optimization. The alternative, if your processes are meant to do that much disk reading and writing, is to change the system resources to meet your requirements.

## ⚙Logs

![image](https://hackmd.io/_uploads/ryQmNsHXC.png)

Logs are important for tracing issues and monitoring incidents for security purposes. By implementing logging, you can trace unexpected runtime errors in depth and gain insight whenever there is suspicious process activity or access from or to your system, e.g. SSH logins, system starts/stops, and sensitive data access.

## 📚Conclusion

Monitoring is important to ensure your system's reliability. Luckily, you can use free, existing features from Grafana and Prometheus that provide comprehensive metrics and information to help you achieve system reliability.