Prometheus is a tool for systems and service monitoring. The code we write runs on real hardware, a physical resource that costs money to operate, so we want to use it as efficiently as possible. This is where systems monitoring comes into the picture. Running complex applications on real servers is complicated, and things can go wrong for many reasons. Some of the problems that can occur are:
Disk Full -> No new data can be stored
Software Bug -> Request Errors
High Temperature -> Hardware Failure
Network Outage -> Services cannot communicate
Low Memory Utilization -> Money wasted
These problems occur more often than you might think, so you need to monitor your systems and services to keep track of their health.
Monitoring means collecting information from your systems, turning it into insights, and acting on those insights. There are several ways to do this:
Check-Based Monitoring -> Run scripts periodically to check the health of servers. Very static, and the context is local to individual machines.
Logs/Events -> Record full details about each event. They can be structured or unstructured, and need further analysis (Loki, InfluxDB). They lack inter-service correlation.
Metrics/Time Series -> Numeric values sampled over time (OpenTSDB, Prometheus). Good for aggregate health monitoring, though you still need logs for detailed analysis.
For inter-service correlation we use tracing, which tracks single requests through the entire stack (Jaeger).
Prometheus is a metrics-based monitoring and alerting stack made for dynamic cloud environments. It doesn’t do logging or tracing.
Architecture of Prometheus:
Prometheus pulls time-series metrics from things called targets at regular intervals and stores them on local disk.
Targets come in two types. The first is something whose source code you control, such as your web app or API, which is the case we will cover. Here you use a Prometheus client library to expose an endpoint from which Prometheus can gather the data.
The second is something you don’t necessarily control, such as a Linux VM or a SQL DB instance. In this case you use an exporter, which sits alongside the system, collects its metrics, and exposes them for Prometheus to scrape.
The Prometheus server is then configured to pull, or scrape, data from these targets and store the resulting time series on local disk.
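To make the pull model concrete, here is a minimal sketch of what a target looks like from the Prometheus side. It uses only the Python standard library rather than an official client library, purely to show the kind of work such a library does for you. The metric name http_requests_total, the /ping endpoint, and port 8000 are assumptions made for illustration, not part of the service built in this post.

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric name, chosen for illustration only.
METRIC = "http_requests_total"
REQUESTS = Counter()  # (endpoint, status) -> number of requests handled


def record(endpoint: str, status: int) -> None:
    """Count one handled request for the given endpoint and status code."""
    REQUESTS[(endpoint, status)] += 1


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        f"# HELP {METRIC} Total HTTP requests handled.",
        f"# TYPE {METRIC} counter",
    ]
    for (endpoint, status), value in sorted(REQUESTS.items()):
        lines.append(f'{METRIC}{{endpoint="{endpoint}",status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Prometheus scrapes this path; it is not counted as app traffic.
            body = render_metrics().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # Toy application logic: /ping succeeds, everything else is a 404.
            # A real service should label by route, not raw path, to keep
            # the number of distinct label values small.
            status = 200 if self.path == "/ping" else 404
            record(self.path, status)
            self.send_response(status)
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```

If you run this and point a scrape job at localhost:8000, each scrape reads the current counter values from /metrics. In a real service you would normally let a client library handle the bookkeeping and formatting.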
Now that we have covered the what and how of Prometheus, let us create our service and monitor it with Prometheus. You can check out the GitHub repository, which contains all the code.
You can now enter a PromQL expression and query the metrics. You can find a guide on querying with PromQL here. You can do a lot of useful things with this alone, but we will go further and connect Prometheus to Grafana to build graphs that give us useful insights about our service.
Grafana is open-source visualization and analytics software. It allows you to query, visualize, alert on, and explore your metrics no matter where they are stored. In plain English, it provides you with tools to turn your time-series database (TSDB) data into beautiful graphs and visualizations.
You can install Grafana by following the instructions mentioned here. Once installed you can run a Grafana server by executing
This will start a Grafana server on http://localhost:3000. The default username and password are both admin. Before you start creating graphs, you first need to add Prometheus as a data source. You can find the option to Add Datasource on the home screen itself. Select Prometheus as the data source and you will be taken to a Settings page, where you need to fill in the appropriate details as shown below.
Now click Save & Test. You should see a notification that says Data source is working.
After adding Prometheus as a data source, we will make a new dashboard. To do that, press the + icon in the left pane and click Dashboard. You will be greeted by the screen below.
Click Add Query and you will be taken to the New Dashboard screen, where we can build graphs using PromQL queries as discussed above.
The query we will run gives the rate at which a particular endpoint receives requests, broken down by the response status returned. The [15s] means that the rate is measured over the last 15 seconds.
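To see roughly what a rate over [15s] means, here is a simplified sketch of the calculation in Python. It is not how Prometheus implements rate(): the real function also extrapolates towards the edges of the window, and this version only approximates counter-reset handling. The function name simple_rate and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample], now: float, window: float) -> Optional[float]:
    """Average per-second increase of a counter over the last `window` seconds.

    Returns None when fewer than two samples fall inside the window,
    because a rate cannot be computed from a single point.
    """
    in_window = [s for s in samples if now - window <= s[0] <= now]
    if len(in_window) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop in a counter is treated as a restart from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = in_window[-1][0] - in_window[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


if __name__ == "__main__":
    # Counter scraped every 5 seconds, growing by 10 requests per scrape.
    samples = [(0.0, 0.0), (5.0, 10.0), (10.0, 20.0), (15.0, 30.0)]
    print(simple_rate(samples, now=15.0, window=15.0))  # 2.0 requests per second
```

One practical consequence: with a 15 second window, the target has to be scraped at least twice within those 15 seconds, otherwise there is nothing to compute a rate from.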
Just like the rate query above for 200 responses, we do the same for 403 and 404 responses.
You can now do the same for the other endpoint. A sample for the other endpoint is shown below.
To generate traffic on our custom server, I am using a tool called Vegeta, which simulates traffic on a given endpoint. You can install it from its GitHub repository. To simulate traffic with Vegeta, run the command.
These commands will hit the given endpoints on the server for 600s, and you can watch the results on the Grafana dashboard you just created.
You may wonder why we need to check which status codes a particular endpoint returns. Take a payments gateway service as an example: it is important to track the successful and unsuccessful responses returned while users are paying through the service. A sudden spike in unsuccessful responses usually means something is wrong with the server, and you should investigate and fix the problem immediately.
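The payments example boils down to watching the share of unsuccessful responses. Below is a minimal sketch of that check, assuming you already have per-status request rates, such as the 200, 403, and 404 rates graphed earlier. The function names and the 5% threshold are hypothetical and would need tuning for a real service.

```python
from typing import Dict


def error_ratio(rate_by_status: Dict[int, float]) -> float:
    """Fraction of requests whose status code is 400 or above."""
    total = sum(rate_by_status.values())
    if total == 0:
        return 0.0  # no traffic, so nothing to flag
    errors = sum(rate for status, rate in rate_by_status.items() if status >= 400)
    return errors / total


def should_alert(rate_by_status: Dict[int, float], threshold: float = 0.05) -> bool:
    """True when the error ratio exceeds the (hypothetical) threshold."""
    return error_ratio(rate_by_status) > threshold


if __name__ == "__main__":
    # Requests per second by status code, as a dashboard panel might show them.
    rates = {200: 90.0, 403: 5.0, 404: 5.0}
    print(error_ratio(rates), should_alert(rates))
```

Tracking the ratio rather than the raw error count keeps the check meaningful when overall traffic rises or falls.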
The final dashboard will look something like the one below.
I hope you were able to understand and follow along on making a Monitoring Dashboard with Prometheus + Grafana. Thanks for reading. If you have any questions, feel free to leave a response.