Posts by tags
  • Popular
  • Kubernetes 73
  • tools 25
  • databases 24
  • migrations 13
  • observability 12
  • A-Z
  • AIOps 1
  • ARM 1
  • AWS 3
  • benchmarking 2
  • best practices 7
  • business 4
  • caching 3
  • Calico 1
  • Cassandra 2
  • Ceph 5
  • cert-manager 1
  • CI/CD 9
  • CLI 4
  • ClickHouse 3
  • CNI 2
  • CockroachDB 1
  • comparison 9
  • databases 24
  • eBPF 2
  • Elasticsearch 5
  • etcd 4
  • failures 11
  • FinOps 1
  • Fluentd 1
  • GitLab 4
  • Helm 5
  • hyperconvergence 1
  • Ingress 3
  • Kafka 2
  • Keycloak 1
  • KeyDB 3
  • Kubernetes 73
  • Kubernetes operators 11
  • Linux 4
  • logging 5
  • Logstash 1
  • market 5
  • memcached 1
  • migrations 13
  • MongoDB 2
  • MySQL 2
  • networking 7
  • nginx 1
  • observability 12
  • Palark 7
  • PHP 1
  • PostgreSQL 10
  • Prometheus 4
  • Python 4
  • RabbitMQ 1
  • Redis 4
  • Rook 3
  • security 7
  • serverless 2
  • software development 2
  • SSL 1
  • storage 10
  • success stories 2
  • Terraform 3
  • tools 25
  • troubleshooting 8
  • Vault 1
  • Vector 2
  • virtualization 1
  • VPN 1
  • werf 3
  • YAML 2
  • ZooKeeper 1

Exporting PostgreSQL logs into Prometheus metrics using Vector

This article discusses generating Prometheus metrics using logs as a data source. This approach can help if existing exporters do not provide the metrics you need and, for whatever reason, you are reluctant to develop your own exporter.

One of our customers was interested in obtaining timely information on failed queries from an application to a PostgreSQL v14 database. At the same time, we had to implement this monitoring without making any changes to the application code. Upon looking through the existing PostgreSQL exporters, we failed to find any suitable metrics capable of sending error reports, so we decided to create one ourselves.

Preparing logs for further use

From a technical perspective, the task boiled down to parsing the log file, extracting a metric from the data, exporting it to Prometheus, and setting up alerts. Unfortunately, earlier PostgreSQL versions (up to 15) do not support the popular JSON structured log format out-of-the-box. Installing third-party modules in the production environment was not an option, nor was storing log data in the CSV format. However, we were able to use a configuration file to format the output as needed. Our customer used the following log format:

# postgresql.conf
log_line_prefix = '%m  %p %u@%d from %h [vxid:%v txid:%x] [%i] '

Where:

  • %m is a timestamp, including milliseconds;
  • %p – Process ID;
  • %u – username;
  • %d – database name.

For more information about what the other variables are used for and what other information can be written to the log, see the PostgreSQL documentation. The result was something like this:

2022-05-12 07:33:54.285 UTC  2672031 @ from  [vxid: txid:0] [] LOG:  checkpoint complete: wrote 64 buffers (0.0%); 0 WAL file(s) added, 0 removed, 10 recycled; write=6.266
s, sync=0.004 s, total=6.285 s; sync files=10, longest=0.003 s, average=0.001 s; distance=163840 kB, estimate=163844 kB

Searching for the best solution

We tried fluentd, Promtail, and Vector to retrieve the metrics from the logs and send them to Prometheus.

Fluentd has an extensive plugin system to suit all use cases. For example, fluent-plugin-prometheus converts data into metrics’ format and delivers them to Prometheus. However, since the release of v2.0.0 early last year, the project has evolved at a snail’s pace. For that reason, we decided to use something else this time — even though we really like fluentd and use it a lot.

The first tool we tried was Promtail. It has the ability to parse text logs (including multi-line logs), process them based on the task (including extracting metrics), and filter the content based on specified parameters.

For testing, we created a configuration file to count the log lines. Here is an example of the log processing stage:

pipeline_stages:
  - multiline:
      firstline: '^\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\]'
  - regex:
      expression: '^(?P\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\]) (?P(?s:.*))$'
  - metrics:
     log_lines_total:
       type: Counter
       description: "total number of log lines"
       prefix: pg_custom_
       max_idle_duration: 24h
       config:
         match_all: true
         action: inc
     log_bytes_total:
       type: Counter
       description: "total bytes of log lines"
       prefix: pg_custom_
       max_idle_duration: 24h
       config:
         match_all: true
         count_entry_bytes: true
         action: add

The configuration worked fine; however, we missed one crucial point. Promtail requires the address of Loki (the log aggregation tool developed by Grafana) as a mandatory configuration parameter. It simply doesn’t start without this parameter set. We felt it would be impractical to install Loki as well.

NB. If you still want to use Promtail, you can replace the Loki address with that of any web server (e.g., nginx) capable of serving a 200 response code on any request. However, we do not recommend using this workaround in a production environment.

Finally, it was the Vector’s turn. Spoiler: it worked out well for us.

Vector: parsing logs and exporting them to Prometheus

Vector has to be installed on the hosts where the log files to be parsed are located in order to export the logs to Prometheus. After the installation is complete, configure it accordingly. You can use, e.g., Ansible to do just that:

# playbook-vector.yaml
---
- name: Setup vector
 hosts:
   - pg
 become: yes
 vars:
   arch: amd64
   version: 0.18.1
   vector_template: files/40-vector.toml
   vector_config_file: /etc/vector/vector.toml
 tasks:
   - name: Setup install vector
     become: yes
     apt:
       deb: "https://packages.timber.io/vector/{{ version }}/vector-{{ version }}-{{ arch }}.deb"
       install_recommends: yes
     notify:
       - restart vector

   - name: Copy config
     copy:
       src: "{{ vector_template }}"
       dest: "{{ vector_config_file }}"
       mode: 0644
       owner: vector
       group: vector
     notify: restart vector

   - name: Start Vector
     service:
       state: started
       enabled: yes
       name: vector

 handlers:
   - name: restart vector
     service:
       state: restarted
       daemon_reload: yes
       name: vector

The Vector configuration is stored in a TOML file. Specify the location of the log file and its type in this file:

# vector.toml
[sources.postgres_logs.multiline]
start_pattern = '^\d{4}-[0-1]\d-[0-3]\d \d+:\d+:\d+\.\d+ [A-Z]{3}'
mode = "halt_before"
condition_pattern = '^\d{4}-[0-1]\d-[0-3]\d \d+:\d+:\d+\.\d+ [A-Z]{3}'
timeout_ms = 1000

Note that halt_before mode means that Vector will treat all the lines that follow the condition_pattern (and do not begin with the latter) as a single message.

You can use other multiline.mode values as well. For example, the half_with mode includes all consecutive lines up to and including the first line matching the condition_pattern in the message.

The message is then parsed using the Vector Remap Language (VRL):

# vector.toml
[transforms.postgres_remap]
type   = "remap"
inputs = [ "postgres_logs" ]
source = """
. |= parse_regex!(.message, r'^(?P\\d{4}-[0-1]\\d-[0-3]\\d \\d+:\\d+:\\d+\\.\\d+ [A-Z]{3})  (?P\\d+) (?P(\\[\\w+\\]@\\w+|@|\\w+@\\w+)) from (?P(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}|\\[\\w+\\]|\\s*)) (?P\\[\\w+:.+:\\d+\\]) (?P(\\[\\]|\\[\\w.+\\])) (?P.*[A-Z]):  (?P.*)$')

del(.timestamp)

message_parts, err = split(.message, ", ", limit: 2)
structured = parse_key_value(message_parts[1], key_value_delimiter: ":", field_delimiter: ",") ?? {}
message = message_parts[0]
. = merge(., structured)
del(."please try the setup again")
del(.message)"""

Here, we:

  • specified the log source;
  • set a regular expression to parse the log message with;
  • deleted the unnecessary fields;
  • split the message using the “,” delimiter;
  • saved the result to the map array, processed it, and got JSON output so we could continue manipulating its fields.

Now let’s filter out the error messages:

# vector.toml
[transforms.postgres_filter]
type      = "filter"
inputs    = [ "postgres_remap" ]
condition = '.level == "ERROR" || .level == "FATAL"'

This configuration will include ERROR and FATAL messages in the metric.

Next, create a metric based on the filtered log messages. Set the metric type and the field to use, name it appropriately, and attach additional labels.

# vector.toml
[transforms.postgres_metric]
type   = "log_to_metric"
inputs = [ "postgres_filter" ]

  [[transforms.postgres_metric.metrics]]
  type      = "counter"
  field     = "level"
  name      = "error_total"
  namespace = "pg_log"

    [transforms.postgres_metric.metrics.tags]
    level = "{{level}}"
    host  = "{{host}}"

The last step is to publish an exporter. Prometheus will use it to scrape metrics.

[sinks.postgres_export_metric]
type              = "prometheus_exporter"
inputs            = [ "postgres_metric" ]
address           = "0.0.0.0:9598"
default_namespace = "pg_log"

Setting up alerts based on retrieved metrics

For Prometheus to be able to scrape metrics from the new exporter, we now have to set up regular targets:

scrape_configs:
  - job_name: custom-pg-log-exporter
    static_configs:
      - targets: ['10.10.10.2:9598', '10.10.10.3:9598', '10.10.10.4:9598']

The next step is to create a metric-based rule, which Alertmanager will handle according to its routing settings:

- alert: PgErrorCountChangeWarning
    expr: |
        increase(pg_log_error_total{level="ERROR"}[30m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: The amount of errors in pg host {{$labels.host}} log has changed to {{$value}}
      description: |
       There are errors in the PostgreSQL logs on the {{$labels.host}} server.
- alert: PgErrorCountChangeCritical
    expr: |
        increase(pg_log_error_total{level="FATAL"}[30m]) > 0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: The amount of fatal errors in pg host {{$labels.host}} log has changed to {{$value}}
      description: |
        There are fatal errors in the PostgreSQL logs on the {{$labels.host}} server.

Here, the increment of the pg_log_error_total vector time series for 30 minutes is calculated. Its value being greater than zero means that the counter has changed, resulting in an alert being sent to the user. The user can then examine the PostgreSQL logs to find the cause of the problem.

Conclusion

Thus, we used a simple error counter to illustrate setting up metrics collection based on the log file. If the existing exporters do not provide the metrics you desire and there is no way to change the application code, this can serve as a relatively simple solution.

Our choice was influenced by Vector’s easy configuration and extensive capabilities, rendering it ideal for various tasks. In addition, its high performance means that the available resources are used efficiently. The vector top CLI tool helps debug log pipelines and view Vector’s metrics. It displays the information in a nice TUI interface.

Vector is not limited to basic counter functionality – it can also parse logs looking for slow database queries. You can then process the results using VRL, aggregate them, convert them into a metric, and display, say, the top 10 queries in Grafana. It can also prove useful for security auditing by processing connection logs and creating metrics based on the data obtained. In other words, there are many potential uses for Vector – it all depends on what your needs are.

Comments

Your email address will not be published. Required fields are marked *