While Prometheus is an excellent and capable monitoring system, one aspect I find very frustrating is its resource consumption. If this frustrates you as much as it does me, let’s break down the causes of this issue and see how to address it.
But first, let’s take a short overview of Prometheus: its design, how its components consume resources, and which of them might be wasting them.
Prometheus architecture and resource consumption
In 2012, SoundCloud, inspired by Borgmon (the monitoring system for Borg used inside Google), decided to create something similar but open source. That’s how Prometheus was born; its first public version was released in 2015. Prometheus is written in Go and ships as a single binary. Several interrelated components run inside it, and you can easily tell them apart by their responsibilities.
Let’s see what Prometheus has under the hood:

- First of all, the monitoring system needs to know where it will scrape the metrics from. Service discovery takes care of this. It interacts with external APIs, retrieves metadata from them, and builds a list of data sources called targets.
- Next comes the data retrieval process: scraping. It collects data from targets using the pull model over HTTP. Scraping obtains data from exporters, which extract metrics and convert them into a Prometheus-compatible format. Another source is the applications themselves: you can use the standard client libraries to expose metrics right from your code (see the instrumentation sketch after this list).
- Once the metrics are collected, they need to be saved somewhere. Prometheus uses TSDB (Time Series Database) as a storage.
- Then, the user interface lets you read the stored metrics. Prometheus has its own UI, but it is fairly limited, so a third-party one (usually Grafana) is used instead.
- Prometheus has two types of rules: recording and alerting.
- Recording rules allow you to execute a query and save its result as a new metric.
- Alerting rules check if certain conditions are met and, as the name implies, send notifications to external systems such as Alertmanager.
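To make the scraping part more concrete, here is a minimal instrumentation sketch using the official Go client library, github.com/prometheus/client_golang; the metric name and port are arbitrary examples. The application registers a counter and exposes it at /metrics, where Prometheus can scrape it.

// main.go: a tiny service exposing a Prometheus-compatible /metrics endpoint.
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter registered with the default registry; the name is an example.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
    Name: "myapp_requests_total",
    Help: "Total number of handled requests.",
})

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        requestsTotal.Inc() // increment on every request
        w.Write([]byte("ok"))
    })
    // Prometheus scrapes this endpoint over HTTP (the pull model).
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}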
Now that we have looked at the Prometheus architecture and the external elements it interacts with, finding out which component consumes the most resources is pretty simple. All you actually need is this simplified diagram:

Since Prometheus is written in Go, you can use the built-in pprof tooling to find out how much memory each component consumes. I collected statistics on a dozen or so installations and got the following approximate memory usage distribution:

This memory consumption distribution shows us that:
- Service discovery is a low-overhead operation that consumes virtually no resources.
- Rules are evaluated periodically over relatively small windows of recent data, so they don’t require a lot of data or resources.
- Surprisingly, UI requests only consume about 10% of the resources, too.
- Scraping’s 30% share is reasonable: it receives data as plain text and converts it into internal structures, and string operations in Go are notoriously costly in computational terms.
- Storage (TSDB) definitely comes out on top.
This is undoubtedly what requires our further investigation.
Looking for a culprit
#1: TSDB performance peculiarities
It’s worth mentioning that in a previous article, I covered the concept of TSDBs with a special focus on the Prometheus implementation. Below are the key takeaways from that piece.
Suppose you have a bunch of thermometers that are data sources. Every 30 or 60 seconds, their values are collected and saved along with their timestamps. Since there are a lot of thermometers, you have to identify them somehow. You can do this by using so-called label sets, which are key-value pairs. Below is an example of a label set:
{
  name: cpu_usage,
  node: curiosity,
  core: 0
}
In Prometheus, as in most monitoring systems, label sets are usually written in curly brackets. Moreover, since these are long strings, you don’t deal with them directly: each label set is assigned an ID. The collected values are then stored together with that ID, and this sequence of values identified by a label set is what is called a time series (or simply a series):

Let’s say you have a million data sources in Prometheus, a hard drive, and a certain amount of memory.
Note: TSDB has an active block. In the case of Prometheus, it usually spans two hours. Prometheus stores it in memory and flushes it to disk after some time.
In this case, here’s how the data will be stored (a simplified code sketch follows the list):
1. Scrape the value from a source and save the label set’s mapping to its ID in memory.
2. Save the same mapping in the log (the write-ahead log, or WAL). In case of an incident (server reboot, memory shortage), the log allows you to restore all the data to memory.
3. Write the value and the time it was collected to the log.
4. Write the value and the time it was collected to memory.
5. Repeat operations 1-4 for the subsequent sources.
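To illustrate these steps, below is a deliberately naive Go sketch of such a write path. It is not Prometheus code (the real TSDB uses a segmented, checksummed WAL and compressed in-memory chunks), and the wal.log file name is made up.

// toytsdb.go: a toy model of the write path described above.
package main

import (
    "fmt"
    "os"
)

type sample struct {
    ts  int64
    val float64
}

type toyTSDB struct {
    nextID uint64
    ids    map[string]uint64   // step 1: label set -> series ID, kept in memory
    series map[uint64][]sample // step 4: samples per series, kept in memory
    wal    *os.File            // steps 2 and 3: append-only log on disk
}

func (db *toyTSDB) add(labels string, ts int64, val float64) {
    id, ok := db.ids[labels]
    if !ok {
        db.nextID++
        id = db.nextID
        db.ids[labels] = id                               // 1. mapping in memory
        fmt.Fprintf(db.wal, "series %d %s\n", id, labels) // 2. mapping in the log
    }
    fmt.Fprintf(db.wal, "sample %d %d %g\n", id, ts, val)  // 3. sample in the log
    db.series[id] = append(db.series[id], sample{ts, val}) // 4. sample in memory
}

func main() {
    wal, err := os.OpenFile("wal.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
    if err != nil {
        panic(err)
    }
    defer wal.Close()

    db := &toyTSDB{ids: map[string]uint64{}, series: map[uint64][]sample{}, wal: wal}
    db.add(`{name="cpu_usage", node="curiosity", core="0"}`, 1700000000, 42)
    db.add(`{name="cpu_usage", node="curiosity", core="0"}`, 1700000060, 43)
}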
In the end, you’ll have some data accumulated. Periodically, this data is dumped to disk as a separate block that includes both the metric values and the full range of label sets. Essentially, a block is a mini-database that contains all the information necessary to determine what value a particular metric had in the time period that this block spans:

Now, the data in memory can be roughly divided into two pieces: the first one contains the data, and the second one works with the label sets. Let’s take a look at the first one and try to figure out what may cause excessive memory consumption.
#2: Processing data
Let’s go back to our thermometer example. We retrieve values from it and save timestamps. The data in memory is encoded and compressed:
- The values are encoded using the GORILLA algorithm, which stores time series in memory very compactly. There is an important caveat: to read any point, the encoded data must first be decoded from the beginning, i.e., in our case, the whole two-hour stretch.
- Timestamps are encoded using the Delta-Delta algorithm. Assuming the values are retrieved at regular intervals, e.g., every 30 or 60 seconds, only one bit is needed to store a timestamp (see the sketch right after this list).
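To make the Delta-Delta idea tangible, here is a small Go sketch with made-up timestamps. It only computes the second differences; a real Gorilla-style encoder then packs those values into variable-width bit fields, so a run of zeros produced by a regular scrape interval costs one bit per sample.

// deltadelta.go: how delta-of-delta encoding collapses regular timestamps.
package main

import "fmt"

// deltaOfDeltas returns the second differences of a timestamp sequence.
// With a perfectly regular scrape interval, every value after the first two
// is zero, and a Gorilla-style encoder stores each zero as a single '0' bit.
func deltaOfDeltas(ts []int64) []int64 {
    out := make([]int64, 0, len(ts))
    var prev, prevDelta int64
    for i, t := range ts {
        switch i {
        case 0:
            out = append(out, t) // the first timestamp is stored as-is
        case 1:
            prevDelta = t - prev
            out = append(out, prevDelta) // the first delta is stored as-is
        default:
            delta := t - prev
            out = append(out, delta-prevDelta) // usually 0 for regular scrapes
            prevDelta = delta
        }
        prev = t
    }
    return out
}

func main() {
    // Samples scraped every 60 seconds.
    ts := []int64{1700000000, 1700000060, 1700000120, 1700000180, 1700000240}
    fmt.Println(deltaOfDeltas(ts)) // [1700000000 60 0 0 0]
}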

The sequence of timestamps and values is called a chunk. The default chunk in Prometheus covers two hours’ worth of data. After two hours, the chunk is flushed to disk, and a new data collection begins.
If there are a million sources, a million chunks will be created (one source = one chunk). In two hours, you’ll end up with a million files on disk. That’s not too much, but in 24 hours, their number would rise to 12 million. With a large number of data sources (e.g., millions) and a long retention period (e.g., a month instead of a day), the data files would be scattered across the disk. This would result in significantly slower reads, as the system would have to access many scattered files for each request.
To get around this bottleneck, Prometheus writes chunks to disk sequentially into a single file. This reduces the number of files, keeps the data consistent, and makes read operations fast and convenient. In addition, all the offsets in the file are known, which means you can easily find the right chunk without reading the whole file. The maximum file size is 128 MB.
The recording algorithm is as follows:
- Initially, the first chunks from all the sources are written to the file.
- Then, if there is space left, the second chunks are written, and so on.
- When space in the file runs out, another one is opened, and writing continues.

The main drawback is that the chunk size is limited only by the time interval and not by the number of data points. For example, with a standard data collection interval of 60 seconds, a two-hour chunk would contain 120 data points. However, if you reduce the data collection interval to 15 seconds, the number of data points in the same two-hour segment would increase to 480.
This Prometheus feature raises two challenges:
- You cannot predict the amount of memory required to store a chunk.
- Since reading any data point requires decoding the entire chunk, the overall memory consumption of the system increases.
Prometheus addresses the issue of growing data volumes by limiting the size of each chunk to 120 data points. When a chunk reaches this limit, it is automatically saved to disk.
At the same time, it is essential to keep reads fast even for chunks that have already been flushed to disk. To do so, Prometheus leverages memory mapping (mmap), which works in much the same way as swap memory:
- Data structures are created in RAM.
- When such a structure is accessed, the operating system quickly reads the data from the disk and places it into memory.
- From the application perspective, it looks like the data has always been in memory.
This approach significantly reduces RAM usage, since the data is physically stored on the disk and only read into memory when needed.
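As an illustration, here is a minimal Go sketch of reading a file via mmap on a Unix-like system. The chunk file name is hypothetical and this is not how Prometheus itself is structured, but the mechanism is the same: creating the mapping is cheap, and pages are loaded from disk only when the slice is actually touched.

//go:build linux || darwin

// mmapread.go: memory-mapping a file and reading it like an in-memory slice.
package main

import (
    "fmt"
    "os"
    "syscall"
)

func main() {
    f, err := os.Open("chunks_head_000001") // hypothetical chunk file name
    if err != nil {
        panic(err)
    }
    defer f.Close()

    st, err := f.Stat()
    if err != nil {
        panic(err)
    }

    // Map the file into the address space: nothing is read yet; the kernel
    // pages data in lazily the first time a region is accessed.
    data, err := syscall.Mmap(int(f.Fd()), 0, int(st.Size()),
        syscall.PROT_READ, syscall.MAP_SHARED)
    if err != nil {
        panic(err)
    }
    defer syscall.Munmap(data)

    // To the program, this looks like an ordinary in-memory byte slice:
    // data[0], data[42:64], etc. trigger page-ins transparently.
    fmt.Printf("mapped %d bytes\n", len(data))
}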
After three hours (one full two-hour block and half of the next one), the system groups the chunks belonging to the same time block. With a scrape interval of 15 seconds, each series in the block contributes four chunks of 120 data points each. An index is then added to these chunks, turning them into a complete, self-contained metrics database, and the process repeats.

Let us recap what we have covered so far.
- All data from each source is written to individual chunks. The size of the chunk is 120 points. The chunks are sequentially written to a single file until a maximum size of 128 MB is reached. When the file reaches this limit, the system creates a new file and continues writing to it. Every three hours, the system “cuts off” the accumulated chunks and merges them into a separate data block.
- For efficient memory utilisation, all data is stored in compressed and encoded form. Prometheus leverages two encoding methods: GORILLA encoding for metric values and Delta-Delta encoding for timestamps. The distinctive feature of GORILLA encoding is that to read any single value, you have to decode the whole chunk. Delta-Delta encoding, on the other hand, is very efficient at compressing timestamps — only one bit is used to store each timestamp.
It looks like the data storage and processing system is quite efficient. Perhaps the problem of excessive memory consumption is not related to the data but to the label sets storage system? Let’s look at it in more detail.
#3: Working with label sets
Below is an example of a label set:
{__name__="http_requests_total", job="ingress", method="GET"}
In this example, the label set consists of three components: __name__, job, and method.
All label sets are stored in a special data structure called an Index, which works similarly to indexes in relational databases. In Prometheus, however, you don’t usually search for a full label set, but for one or more labels. This way, you can find all the time series containing these labels.
Note: I’ll present a simplified perspective on how Index works (the actual implementation in the source code is more complex).
The Index structure in Prometheus uses a dedicated memory area containing several components.
The first one is Symbols. In this component, every label key and label value is stored as a separate string, and each string is assigned a unique identifier (ID).

This approach ensures efficient memory utilization because the same keys and values are often repeated in label sets.
The next entity, Postings, lets you quickly search by key-value pairs. For each such pair, the system stores a list of the time series containing it. However, it stores not the strings themselves, but their identifiers (IDs), each mapped to an array of series IDs:

For example, to find all series that have the job label with the ingress value, you must first convert the key-value pair into symbol IDs. In our example, these are symbols 3 and 4. Next, you look up the pair (3, 4) in Postings and read its series array.
The Series entity associates label sets with their data. It contains series IDs and an array of IDs describing the complete label set. The system converts the IDs back to text values to extract the original label set. Series also stores references to the data as offsets in the chunk files, allowing you to access the chunks themselves. Since there are usually multiple chunks for each series, the structure stores multiple such references:
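To summarize the three entities, here is a deliberately simplified Go model: maps instead of the compact binary layout Prometheus actually uses, with illustrative symbol and series IDs. The seriesWith function mirrors the Postings lookup described above.

// index.go: a simplified model of the Symbols, Postings, and Series entities.
package main

import "fmt"

type labelPair struct{ name, value uint32 } // symbol IDs, not strings

type seriesEntry struct {
    labels []labelPair // the full label set as symbol IDs
    chunks []uint64    // offsets of this series in the chunk files
}

type index struct {
    symbols  map[string]uint32      // Symbols: each distinct string stored once
    postings map[labelPair][]uint64 // Postings: (name, value) -> series IDs
    series   map[uint64]seriesEntry // Series: ID -> label set + chunk references
}

// seriesWith mirrors the Postings lookup: resolve the strings to symbol IDs,
// then read the series array stored for that pair.
func (ix index) seriesWith(name, value string) []uint64 {
    n, ok1 := ix.symbols[name]
    v, ok2 := ix.symbols[value]
    if !ok1 || !ok2 {
        return nil
    }
    return ix.postings[labelPair{n, v}]
}

func main() {
    ix := index{
        symbols: map[string]uint32{
            "__name__": 1, "http_requests_total": 2, "job": 3, "ingress": 4,
            "method": 5, "GET": 6,
        },
        postings: map[labelPair][]uint64{
            {1, 2}: {1}, {3, 4}: {1}, {5, 6}: {1},
        },
        series: map[uint64]seriesEntry{
            1: {labels: []labelPair{{1, 2}, {3, 4}, {5, 6}}, chunks: []uint64{0x2000}},
        },
    }
    fmt.Println(ix.seriesWith("job", "ingress")) // [1]
}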

Let’s now look at how these entities will change if we add a new series that differs from the original one only by the value of the method label (POST instead of GET):
{__name__="http_requests_total", job="ingress", method="POST"}
Let’s break down the steps of how a series is added to the Index.
Step 1. Add a POST entry to Symbols:

Step 2. Add to Postings a new key-value pair (method="POST") and the series in which it occurs. Update the records for the existing key-value pairs, since the pairs __name__="http_requests_total" and job="ingress" are already there:

Step 3. Add a new entry to Series, since this is a new series and it has its own references to chunks and its own label set:

As you can see, Symbols, Postings, and Series enable label sets to be stored efficiently. They eliminate duplicated strings, rely on numeric IDs and chunk offsets, and, at the same time, keep lookups fast. Therefore, label set processing is unlikely to be the cause of the system’s high resource consumption.
However, two other characteristics usually go unnoticed but can greatly affect system performance—the number of unique label sets (cardinality) and the rate at which new time series emerge (churn). Let’s examine them.
#4: Cardinality
Cardinality is the number of unique label sets for a metric.
Let’s explore it with a basic example. Suppose you have the http_requests_total metric with two labels: instance (has three different values) and job (has one value).

To calculate cardinality, you need to multiply the number of values of each label: 1 (metric name) × 3 (instance values) × 1 (job values) = 3. This means that there are three different label combinations, so the system will create three chunks to store the data.
If you throw another label (method) with five possible values into the mix, the cardinality increases to 1 × 3 × 1 × 5 = 15.

The number of chunks would grow as well. Still, the difference between 3 and 15 is not that large. However, the situation can change drastically when new labels are added.
For example, if you add an endpoint label with, say, 1000 different values, the cardinality (and the number of chunks) would increase to 1 × 3 × 1 × 5 × 1000 = 15,000:

Obviously, storing that much data in chunks requires a large amount of memory. Inserting data into those chunks and reading them back also takes far more resources. For example, if you want to count how many total requests were received, you’ll need to decode all 15,000 chunks.
A cardinality of 15,000 means that there are as many new series and label sets in the Index. In other words, you must store 15,000 new label sets in Symbols and Postings, with these structures growing in both number of records and size as a result. Series arrays would also grow as key-value pairs are repeated. On top of that, it takes more resources to update the Index as well as read and resolve the request.
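To see how quickly this happens in practice, here is a small Go sketch using the official client_golang library; the metric and the endpoint values are invented. Two methods multiplied by a thousand endpoint paths produce 2,000 series from a single metric, before a single user ID or request ID is even added.

// cardinality.go: how one unbounded label multiplies the number of series.
package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    reg := prometheus.NewRegistry()

    requests := prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests.",
    }, []string{"method", "endpoint"})
    reg.MustRegister(requests)

    // Every distinct (method, endpoint) combination becomes its own series.
    // Unbounded values such as raw request paths multiply cardinality fast.
    for _, method := range []string{"GET", "POST"} {
        for i := 0; i < 1000; i++ {
            requests.WithLabelValues(method, fmt.Sprintf("/service/action/%d", i)).Inc()
        }
    }

    mfs, err := reg.Gather()
    if err != nil {
        panic(err)
    }
    fmt.Println("series created:", len(mfs[0].GetMetric())) // 2000
}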
Finally, it looks like we have found one of the culprits for high resource consumption in Prometheus. However, let’s consider one more suspect: churn.
#5: Churn
Churn is a measure of the rate at which time series are added to or removed from a monitoring system. This parameter helps you figure out how rapidly the metrics’ composition in the system is changing.
To better understand this term, here’s a real-life illustration of churn. Imagine you produce cups and pack them into batches of 100 cups each. You can put 100 cups into a box, but they all should have the same design. Having a high churn means you produce 100 different designs for your cups, and each will require a separate box. In this case, you’ll end up with 100 boxes and hiring a truck to deliver all of them. If your churn was 0, you can pack all your cups into a single box and even use public transport for delivery.
Let’s come back to Prometheus. Consider an example with four thermometers. Only one of them (the fourth) sends readings regularly; the rest work erratically. When 120 measurements are accumulated, Prometheus creates a chunk and saves it to disk. Then, a new cycle of data collection commences.

After three hours, the data is written to the disk. However, during those three hours, Prometheus continues to store information from the other three sources that have stopped sending data. Since chunks from erratic sources contain few values, data compression works less efficiently compared to full chunks. This results in increased consumption of system resources.
Symbols, Postings, and Series are also increasing in size, consuming more resources for Index updates and reading.

Prometheus has to store in the Index information about all time series that have ever existed in the system, as this historical data may be required for analysis. This means that the Index has to keep metadata about inactive series, too.
While Prometheus’s chunk and index handling mechanisms are quite efficient, high cardinality and churn are still the main reasons for excessive resource consumption. Churn has yet another annoying trait: Prometheus doesn’t explicitly track which series appear and which disappear, so the initial 1,000 unique series in a block can turn into 3,000 within a few hours, and this is hard to detect.
#6 (Bonus): remote_write
The well-known remote_write mechanism in Prometheus allows you to export collected metrics to external storage systems.
This component can also affect the system’s resource consumption, but its effect depends on the configuration. A single remote_write endpoint has minimal impact on system resource consumption. However, if there are multiple endpoints, the situation can deteriorate quickly, especially if those external storage systems become unavailable. In this case, each configured remote_write can consume between 500 and 700 megabytes of RAM, with the exact value depending on the total number of metrics in the cluster.
While remote_write does affect resource consumption, its impact is usually much smaller than that caused by high cardinality or churn. This is why remote_write is seen as a minor factor in the overall Prometheus performance analysis.
Optimizing Prometheus: Analyzing resource consumption
Now that we’ve identified the main causes of elevated resource consumption in Prometheus, let’s take a look at the tools that help you track down the offending metrics.
Dealing with churn
First of all, let’s focus on churn, as it is quite difficult to track and its impact is not always obvious.
Prometheus features a ./tsdb tool. If you run it with the analyze command, it will analyze the completed data blocks and help you find the key-value pairs that cause the most churn:
./tsdb analyze
Label pairs most involved in churning:
17 job=node
11 __name__=node_systemd_unit_state
6 instance=foo:9100
4 instance=bar:9100
3 instance=baz:9100
Starting with version 2.14, Prometheus exposes a dedicated metric, scrape_series_added. Using the query below, you can find which scrape jobs have added the most new series over the last hour:
topk(10, sum without(instance)(sum_over_time(scrape_series_added[1h])))
Dealing with cardinality
The selection of tools for analyzing cardinality is wider. In addition to ./tsdb analyze, you can use the Status page in the Prometheus web UI, which shows the labels with the highest cardinality. Another utility worth noting is mimirtool. Let’s look at it in more detail.
Mimirtool gathers information from two sources:
- It extracts from Grafana a list of metrics that are used in the dashboards.
- It retrieves from Prometheus the full list of stored metrics, as well as the metrics used in its rules.
Once the data is collected, mimirtool analyzes it and determines which metrics are not being used anywhere. While finding unused metrics does not mean that they should be removed without any consideration, it is still a good reason to assess whether they are really in demand for the monitoring system.

Optimizing Prometheus: Getting rid of unnecessary stuff
Let’s look at a real-life example of optimizing a monitoring system. There was a cluster where Prometheus was scraping about 10 million metrics, consuming a whopping 64 gigabytes of RAM:

Once the optimization was performed and redundant metrics and label sets were removed, the system’s performance improved significantly: the number of metrics shrank more than 11-fold to 877K, while memory consumption dropped below 5 gigabytes (an impressive 92% reduction!).

There are a couple of proven methods to optimize Prometheus memory consumption and performance:
- Avoid adding unnecessary labels in the first place. Modern exporters and instrumenting libraries feature flexible settings that allow you to control this process.
- Delete unused labels or metrics. Prometheus features a special relabeling mechanism (relabel_configs for targets and metric_relabel_configs for scraped samples) that allows you to remove unnecessary elements or convert them to a more efficient format:
- action: labeldrop
  regex: requestID
...
- source_labels: [__name__]
  regex: not_important_metric
  action: drop
- Set limits on the number of samples to collect from a source (the sample_limit setting in the scrape configuration). This will prevent Prometheus from “getting hot” if some application suddenly tries to expose too many metrics.
- Set limits on the number of targets (target_limit). This will help in case someone decides to scale their Pod exposing 50,000 metrics to 1,000 Pods, for example.
Effective use of Prometheus after optimizations
You may now be wondering how to effectively investigate incidents after all the unnecessary stuff has been removed.
Suppose there is an HTTP 500 error. Before optimization, its associated metric used to look like this:
{__name__="http_requests_total", code="500", user_agent="Mozilla/5.0", uri="/service/action/42", method="POST", request_id="XXXX-XX-XX", backend="backend-42", user_id="17"}
This metric provided comprehensive information such as server-side (backend) identifier, request identifier, request destination (endpoint), and even user identifier.
And here’s what that same metric looks like after the “general cleanup”:
{__name__="http_requests_total", code="500", method="POST"}
The main idea behind monitoring is to notify you when something has gone wrong so that you can quickly figure things out and correct the error. At first glance, the “cleaned up” metric above contains fewer details, but this is not really a problem, since you can get detailed information for analyzing incidents from system logs and traces. Logs usually contain all the necessary extra information, including the user_agent, the backend used, and other vital details.
The exemplars mechanism in Prometheus allows you to link metrics to the corresponding logs and traces. The typical Prometheus metric looks as follows:
http_requests_total{method="POST", code="500"} 3
And here’s what that same metric looks like with exemplars:
http_requests_total{method="POST", code="500"} 3 # {trace_id="foo123"}
Essentially, exemplars are about being able to attach metadata to metrics. A trace_id is most commonly used as such metadata, allowing for more detailed analysis by associating metrics with the corresponding traces.
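As an illustration, here is a hedged sketch of how an application might attach an exemplar using the Go client library, client_golang; the trace ID and port are placeholders. Note that exemplars are only exposed when the OpenMetrics format is enabled on the metrics handler.

// exemplar.go: attaching a trace ID to a counter increment.
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var httpRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
    Name: "http_requests_total",
    Help: "Total HTTP requests.",
}, []string{"method", "code"})

func main() {
    reg := prometheus.NewRegistry()
    reg.MustRegister(httpRequests)

    // Record a failed request and attach the trace ID as an exemplar.
    // "foo123" stands in for a real trace ID taken from the request context.
    httpRequests.WithLabelValues("POST", "500").(prometheus.ExemplarAdder).
        AddWithExemplar(1, prometheus.Labels{"trace_id": "foo123"})

    http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{
        EnableOpenMetrics: true, // exemplars are part of the OpenMetrics format
    }))
    log.Fatal(http.ListenAndServe(":8080", nil))
}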
In Prometheus, exemplars are stored separately from the main metrics, thus consuming additional system resources. So, when dealing with exemplars, you should follow the same optimization principles as when dealing with conventional metrics, i.e., you should carefully evaluate the need to add each exemplar on a case-by-case basis.
For example, if the system logged several dozen 500 errors for a single endpoint over a 30-second period, it is likely that all of them were caused by the same thing. In that case, you only need to attach an exemplar to a single sample (or, if you are super meticulous, to every tenth or hundredth one). Thus, you have to think ahead about where to add exemplars; otherwise, you risk ending up back where it all started: resource scarcity.
Real-world experience shows the effectiveness of this approach: the team behind the example above confirmed that removing redundant metrics and label sets did not reduce the efficiency of the incident monitoring and handling. They continue to use the same monitoring dashboards and alerts as before, and the exemplars mechanism they have been using for a long time now remains just as beneficial, providing all the detail they need when investigating incidents.
Takeaways
Prometheus is an advanced monitoring system with many tools and features. Its known drawback is excessive resource consumption, but as we have found, this is often due to a lack of understanding of the tool and thus its misuse.
The solution lies in the proper management of metrics and labels: you can delete unnecessary ones and set limits on the number of samples per source or on the number of sources themselves. You can use both the built-in Prometheus tools (./tsdb analyze and the Status page) and the third-party mimirtool to scan for unused metrics.
At the same time, deleting redundancies should not reduce the effectiveness of monitoring and incident handling, so carefully analyze both used and seemingly unused metrics before deleting them.