This article discusses generating Prometheus metrics using logs as a data source. This approach can help if existing exporters do not provide the metrics you need and, for whatever reason, you are reluctant to develop your own exporter.
One of our customers was interested in obtaining timely information on failed queries sent by an application to a PostgreSQL v14 database. At the same time, we had to implement this monitoring without making any changes to the application code. Upon looking through the existing PostgreSQL exporters, we failed to find any suitable metrics for reporting errors, so we decided to create our own.
Preparing logs for further use
From a technical perspective, the task boiled down to parsing the log file, extracting a metric from the data, exporting it to Prometheus, and setting up alerts. Unfortunately, PostgreSQL versions prior to 15 do not support the popular structured JSON log format out of the box. Installing third-party modules in the production environment was not an option, nor was storing log data in CSV format. However, we were able to use the configuration file to format the output as we needed. Our customer used the following log format:
# postgresql.conf
log_line_prefix = '%m %p %u@%d from %h [vxid:%v txid:%x] [%i] '
Where:
- %m is the timestamp, including milliseconds;
- %p is the process ID;
- %u is the username;
- %d is the database name.
For more information about what the other variables are used for and what other information can be written to the log, see the PostgreSQL documentation. The result was something like this:
2022-05-12 07:33:54.285 UTC 2672031 @ from [vxid: txid:0] [] LOG: checkpoint complete: wrote 64 buffers (0.0%); 0 WAL file(s) added, 0 removed, 10 recycled; write=6.266 s, sync=0.004 s, total=6.285 s; sync files=10, longest=0.003 s, average=0.001 s; distance=163840 kB, estimate=163844 kB
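Vector will need these records in a plain log file; one common way to get them there is PostgreSQL's own logging collector. A minimal sketch of the related postgresql.conf settings might look like this (the directory and file name pattern are assumptions, not the customer's actual values):
# postgresql.conf (a sketch; adjust to your environment)
logging_collector = on                     # capture stderr output into log files
log_destination = 'stderr'                 # plain-text lines formatted with log_line_prefix
log_directory = 'log'                      # relative to the data directory
log_filename = 'postgresql-%Y-%m-%d.log'   # one file per day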
Searching for the best solution
We tried fluentd, Promtail, and Vector to retrieve the metrics from the logs and send them to Prometheus.
Fluentd has an extensive plugin system to suit all use cases. For example, fluent-plugin-prometheus converts data into the Prometheus metrics format and delivers it to Prometheus. However, since the release of v2.0.0 early last year, the project has evolved at a snail's pace. For that reason, we decided to use something else this time, even though we really like fluentd and use it a lot.
The first tool we tried was Promtail. It has the ability to parse text logs (including multi-line logs), process them based on the task (including extracting metrics), and filter the content based on specified parameters.
For testing, we created a configuration file to count the log lines. Here is an example of the log processing stage:
pipeline_stages:
  - multiline:
      firstline: '^\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\]'
  - regex:
      expression: '^(?P<time>\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\]) (?P<content>(?s:.*))$'
  - metrics:
      log_lines_total:
        type: Counter
        description: "total number of log lines"
        prefix: pg_custom_
        max_idle_duration: 24h
        config:
          match_all: true
          action: inc
      log_bytes_total:
        type: Counter
        description: "total bytes of log lines"
        prefix: pg_custom_
        max_idle_duration: 24h
        config:
          match_all: true
          count_entry_bytes: true
          action: add
The configuration worked fine; however, we missed one crucial point. Promtail requires the address of Loki (the log aggregation tool developed by Grafana) as a mandatory configuration parameter. It simply doesn’t start without this parameter set. We felt it would be impractical to install Loki as well.
NB. If you still want to use Promtail, you can replace the Loki address with that of any web server (e.g., nginx) capable of serving a 200 response code on any request. However, we do not recommend using this workaround in a production environment.
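For reference, the workaround amounts to pointing Promtail's client at the stub; a minimal sketch, assuming nginx answers on 127.0.0.1:8080:
# promtail-config.yaml (fragment)
clients:
  - url: http://127.0.0.1:8080/loki/api/v1/push   # any endpoint that always answers 200
On the nginx side, a location that simply does return 200; for every request is enough.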
Finally, it was Vector's turn. Spoiler: it worked out well for us.
Vector: parsing logs and exporting them to Prometheus
To export the logs to Prometheus, Vector has to be installed on every host where the log files to be parsed are located. After the installation is complete, configure it accordingly; you can use, e.g., Ansible to do just that:
# playbook-vector.yaml
---
- name: Setup vector
  hosts:
    - pg
  become: yes
  vars:
    arch: amd64
    version: 0.18.1
    vector_template: files/40-vector.toml
    vector_config_file: /etc/vector/vector.toml
  tasks:
    - name: Install vector
      become: yes
      apt:
        deb: "https://packages.timber.io/vector/{{ version }}/vector-{{ version }}-{{ arch }}.deb"
        install_recommends: yes
      notify:
        - restart vector
    - name: Copy config
      copy:
        src: "{{ vector_template }}"
        dest: "{{ vector_config_file }}"
        mode: 0644
        owner: vector
        group: vector
      notify: restart vector
    - name: Start Vector
      service:
        state: started
        enabled: yes
        name: vector
  handlers:
    - name: restart vector
      service:
        state: restarted
        daemon_reload: yes
        name: vector
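Rolling it out is then an ordinary playbook run (the inventory file name is just an example):
ansible-playbook -i inventory.ini playbook-vector.yaml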
The Vector configuration is stored in a TOML file. Specify the location of the log file and its type in this file:
# vector.toml
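# The file source itself would look roughly like this
# (the log path below is an assumption, adjust it to your setup):
[sources.postgres_logs]
type = "file"
include = [ "/var/log/postgresql/*.log" ]
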
[sources.postgres_logs.multiline]
start_pattern = '^\d{4}-[0-1]\d-[0-3]\d \d+:\d+:\d+\.\d+ [A-Z]{3}'
mode = "halt_before"
condition_pattern = '^\d{4}-[0-1]\d-[0-3]\d \d+:\d+:\d+\.\d+ [A-Z]{3}'
timeout_ms = 1000
Note that the halt_before mode means that Vector will treat all the lines that follow a line matching the condition_pattern (and do not match it themselves) as a single message. You can use other multiline.mode values as well. For example, the halt_with mode includes in the message all consecutive lines up to and including the first line matching the condition_pattern.
The message is then parsed using the Vector Remap Language (VRL):
# vector.toml
[transforms.postgres_remap]
type = "remap"
inputs = [ "postgres_logs" ]
source = """
. |= parse_regex!(.message, r'^(?P<timestamp>\\d{4}-[0-1]\\d-[0-3]\\d \\d+:\\d+:\\d+\\.\\d+ [A-Z]{3}) (?P<pid>\\d+) (?P<user>(\\[\\w+\\]@\\w+|@|\\w+@\\w+)) from (?P<client>(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}|\\[\\w+\\]|\\s*)) (?P<vxid>\\[\\w+:.+:\\d+\\]) (?P<tag>(\\[\\]|\\[\\w.+\\])) (?P<level>.*[A-Z]): (?P<message>.*)$')
del(.timestamp)
message_parts, err = split(.message, ", ", limit: 2)
structured = parse_key_value(message_parts[1], key_value_delimiter: ":", field_delimiter: ",") ?? {}
message = message_parts[0]
. = merge(., structured)
del(."please try the setup again")
del(.message)"""
Here, we:
- specified the log source;
- set a regular expression to parse the log message with;
- deleted the unnecessary fields;
- split the message using the ", " delimiter;
- saved the result to a map, processed it, and got JSON output so we could continue manipulating its fields (see the sketch below).
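To make this easier to picture, here is roughly what a parsed event looks like for a single-line error entry, using the capture group names from the regex above (all values are made up for illustration; the host field is added by Vector's file source itself):
{
  "pid": "2672150",
  "user": "app@shop",
  "client": "10.0.0.5",
  "vxid": "[vxid:4/12 txid:0]",
  "tag": "[SELECT]",
  "level": "ERROR",
  "host": "pg-node-1"
}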
Now let’s filter out the error messages:
# vector.toml
[transforms.postgres_filter]
type = "filter"
inputs = [ "postgres_remap" ]
condition = '.level == "ERROR" || .level == "FATAL"'
This configuration will include ERROR and FATAL messages in the metric.
Next, create a metric based on the filtered log messages. Set the metric type and the field to use, name it appropriately, and attach additional labels.
# vector.toml
[transforms.postgres_metric]
type = "log_to_metric"
inputs = [ "postgres_filter" ]
[[transforms.postgres_metric.metrics]]
type = "counter"
field = "level"
name = "error_total"
namespace = "pg_log"
[transforms.postgres_metric.metrics.tags]
level = "{{level}}"
host = "{{host}}"
The last step is to publish an exporter. Prometheus will use it to scrape metrics.
[sinks.postgres_export_metric]
type = "prometheus_exporter"
inputs = [ "postgres_metric" ]
address = "0.0.0.0:9598"
default_namespace = "pg_log"
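Once Vector is running, the endpoint can be checked by hand; the output below is only an illustration with made-up values:
$ curl -s http://127.0.0.1:9598/metrics
# TYPE pg_log_error_total counter
pg_log_error_total{host="pg-node-1",level="ERROR"} 3
pg_log_error_total{host="pg-node-1",level="FATAL"} 1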
Setting up alerts based on retrieved metrics
For Prometheus to be able to scrape metrics from the new exporter, we now have to set up regular targets:
scrape_configs:
  - job_name: custom-pg-log-exporter
    static_configs:
      - targets: ['10.10.10.2:9598', '10.10.10.3:9598', '10.10.10.4:9598']
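After Prometheus picks up this configuration, a quick way to confirm the new targets are being scraped is the standard up metric filtered by the job name above:
up{job="custom-pg-log-exporter"}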
The next step is to create a metric-based rule, which Alertmanager will handle according to its routing settings:
- alert: PgErrorCountChangeWarning
  expr: |
    increase(pg_log_error_total{level="ERROR"}[30m]) > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: The error count in the pg host {{$labels.host}} log has increased by {{$value}}
    description: |
      There are errors in the PostgreSQL logs on the {{$labels.host}} server.
- alert: PgErrorCountChangeCritical
  expr: |
    increase(pg_log_error_total{level="FATAL"}[30m]) > 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: The fatal error count in the pg host {{$labels.host}} log has increased by {{$value}}
    description: |
      There are fatal errors in the PostgreSQL logs on the {{$labels.host}} server.
Here, the increase of the pg_log_error_total time series over 30 minutes is calculated. If its value is greater than zero, the counter has changed (new errors have been logged), and an alert is sent to the user. The user can then examine the PostgreSQL logs to find the cause of the problem.
Conclusion
Thus, we used a simple error counter to illustrate setting up metrics collection based on the log file. If the existing exporters do not provide the metrics you desire and there is no way to change the application code, this can serve as a relatively simple solution.
Our choice was influenced by Vector's easy configuration and extensive capabilities, which make it ideal for a variety of tasks. In addition, its high performance means that the available resources are used efficiently. The vector top CLI command helps debug log pipelines and view Vector's internal metrics, displaying the information in a nice TUI.
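Note that vector top talks to Vector's own API, which is disabled by default, so it needs a small addition to vector.toml (the address shown is Vector's default):
# vector.toml
[api]
enabled = true
address = "127.0.0.1:8686"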
Vector is not limited to basic counter functionality – it can also parse logs looking for slow database queries. You can then process the results using VRL, aggregate them, convert them into a metric, and display, say, the top 10 queries in Grafana. It can also prove useful for security auditing by processing connection logs and creating metrics based on the data obtained. In other words, there are many potential uses for Vector – it all depends on what your needs are.