Recent troubleshooting cases from our SREs, part 2

It’s been a while since the previous article, in which we shared several captivating stories about our real-life experience of operating Kubernetes clusters and the applications/services running in them. Here is another episode in the adventures of our engineers, who keep sailing the turbulent waters of large-scale (and not-so-large) web applications.

Case #1. Kafka and Docker variables in Kubernetes

Container-based systems are gradually becoming an industry standard, and many tools have already taken root in this environment. However, you can still run into ridiculous bugs related to integrating software with Docker/Kubernetes.

One day, we were going to deploy the standard ZooKeeper + Kafka bundle for Sentry, possibly adding other components later on. At first sight, the process looks easy: take a (rather hard-to-read) Kafka Helm chart as a basis (you can find it in this well-known GitHub repo), chip off all the unnecessary stuff, add some tweaks for your needs, et voilà!

However, we got the following error while trying to run it:

===> Configuring ...
port is deprecated. Please use KAFKA_ADVERTISED_LISTENERS instead.

That’s strange. Okay, let’s add this:

env:
  - name: KAFKA_ADVERTISED_LISTENERS
    value: {{ printf "%s:%s" .Chart.Name  .Values.port._default }}

Still the same problem… Perhaps it makes sense to find out which script produces the above message, and why? The next question then is: “What image do we use, and what’s inside?”

This chart is based on a quite popular confluentinc/cp-kafka image that boasts a multipurpose entrypoint (as is the case with many other Confluent Community images). If you analyze what scripts are run and in what order, you will find the one that hides the source of our problems:

if [[ -n "${KAFKA_PORT-}" ]]
then
  echo "port is deprecated. Please use KAFKA_ADVERTISED_LISTENERS instead."
  exit 1
fi

It’s a no-brainer, right? But wait, we did not declare that variable (or anything like it)!

However, if you look at the actual environment variables declared inside the container, you will find such a variable… as well as some similar others:

root@kafka-0:/# env |grep KAFKA_PORT
KAFKA_PORT_9092_TCP_PORT=9092
KAFKA_PORT_9092_TCP_PROTO=tcp
KAFKA_PORT=tcp://192.168.252.192:9092
KAFKA_PORT_9092_TCP=tcp://192.168.252.192:9092
KAFKA_PORT_9092_TCP_ADDR=192.168.252.192

The thing is that kubelet adds a set of environment variables when starting a pod. Their names are composed of service names, hosts, and ports. This is how applications running in the pod can find the address of a service they need; the mechanism is briefly described in the Kubernetes documentation.
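
For the sake of illustration, a Service like the following is enough to produce exactly the set of variables shown above (the selector is our assumption, not taken from the chart; only the name and the port matter here):

apiVersion: v1
kind: Service
metadata:
  name: kafka                 # this name becomes the KAFKA_ prefix of the generated variables
spec:
  selector:
    app: kafka                # illustrative label
  ports:
    - port: 9092
      protocol: TCP
# Pods started after this Service exists get, among others:
#   KAFKA_SERVICE_HOST=<ClusterIP>, KAFKA_SERVICE_PORT=9092,
#   KAFKA_PORT=tcp://<ClusterIP>:9092, KAFKA_PORT_9092_TCP_ADDR=<ClusterIP>, ...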

The obvious way to solve the problem is simply to rename the service in Kubernetes to something more distinctive (and less likely than the conventional kafka to produce a colliding variable prefix).
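
For completeness: if renaming is undesirable, Kubernetes can also be told not to inject these variables at all via the enableServiceLinks field of the pod spec. A minimal sketch that lets you verify the effect (the pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: env-test              # illustrative name
spec:
  enableServiceLinks: false   # kubelet will not add *_SERVICE_* / *_PORT_* variables for existing Services
  restartPolicy: Never
  containers:
    - name: main
      image: busybox
      command: ["env"]        # print the environment and exit

Keep in mind that some workloads may rely on these variables, which is why we went for the rename in this case.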

Let’s run it with the renamed service:

$ kubectl -n sentry logs sentry-kafka-0
===> ENV Variables ...
...
SENTRY_KAFKA_PORT=tcp://192.168.224.150:9092
...
===> Configuring ...
===> Running preflight checks ...
===> Check if /var/lib/kafka/data is writable ...

Nice one!

Case #2. How a mere 100 bytes led to a disaster

We are fond of ClickHouse and, among other things, actively use it in many projects as the primary DBMS for collecting statistics. We also keep our K8s clusters’ logs there since it is very convenient and efficient. So the fact that ClickHouse-related incidents occur from time to time does not surprise us at all. We’ve seen many of them… but still, this case is special.

For many years, we have been maintaining a pretty large ClickHouse cluster consisting of 5 shards spread over 10 bare-metal nodes with 45 TB of data on each. The cluster contains a large number of replicated tables, and ZooKeeper, running on 3 VM nodes, is responsible for the replication.

One night, one of the hypervisors hosting the ZooKeeper VMs froze. We decided to restart the hypervisor, after which the ZK cluster lost its quorum and stopped working and serving ClickHouse. Our first thought was: “Well, something got corrupted,” so we decided to restore the metadata from a snapshot.

Dictum factum… but to no avail. The problem persisted when we were trying to run ZooKeeper and restore data:

2020-07-03 18:19:50,681 [myid:2] - WARN  [QuorumPeer[myid=2](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):Follower@129] - Exception when following the leader
java.net.SocketTimeoutException: Read timed out
....

We even tried to leave the snapshot aside and create the tables right in ClickHouse, but no luck again:

CREATE TABLE database.storage1 (`date` Date, `created` DateTime, `value` Int32) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/storage1', '{replica}', date, value, 8192);
....
Received exception from server (version 19.14.3):
Code: 999. DB::Exception: Received from 127.0.0.1:9000. Coordination::Exception. Coordination::Exception: All connection tries failed while connecting to ZooKeeper. Addresses: 192.168.0.60:2181, 192.168.0.61:2181, 192.168.0.62:2181
....
Code: 209, e.displayText() = DB::NetException: Timeout exceeded while reading from socket (192.168.0.62:2181): while receiving handshake from ZooKeeper (version 19.14.3.3 (official build)), 192.168.0.62:2181
(Connection loss).

That was strange since we had not changed any settings, either on the servers or in the CH/ZK configuration. However, all these problems suggested that something was wrong with the network, and it was clearly related to ZK.

I will not dive into excessive detail about everything we did to track down the source of the problems; suffice it to say that we were quite stressed. The culprit was… the MTU. Yeah, you got that right: the MTU!

2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:e5:6d:fc brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.60/24 brd 192.168.0.255 scope global ens3
       valid_lft forever preferred_lft forever

The thing is that the network on the ZooKeeper VMs had been configured quite a while ago, back in 2018, right after the release of Ubuntu 18.04, using the netplan configuration utility (which was rather new and poorly understood at the time). While netplan sets the MTU to 1500 bytes by default, the default MTU in Hetzner vSwitch is just 1400 bytes.

We simply had not used packets of that size until that moment, so the problem had never manifested itself. That night, those extra 100 bytes proved fatal. We reduced the MTU, after which the cluster started and kept on working normally, and we could carry on with cleaning up the mess created by the incident…
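
For the record, the fix itself boils down to a single line in the netplan configuration, followed by netplan apply. A sketch (the file name is illustrative; the interface and address are taken from the output above):

# /etc/netplan/01-netcfg.yaml (illustrative file name)
network:
  version: 2
  ethernets:
    ens3:
      addresses: [192.168.0.60/24]
      mtu: 1400               # match the Hetzner vSwitch MTU instead of netplan's default 1500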

Bonus!

When restoring metadata in ZooKeeper, we tried, among other things, to recreate tables manually and discovered an interesting peculiarity.

The following table creation syntax (with the partition key and primary key specified directly in the ENGINE section) is deprecated starting from ClickHouse version 20:

CREATE TABLE database.storage1foo (`date` Date, `created` DateTime, `value` Int32)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/storage1foo', '{replica}', date, value, 8192);

Here is the correct way to do it in the latest versions of ClickHouse:

CREATE TABLE database.storage1bar
(
 `date` Date,
 `created` DateTime,
 `value` Int32
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/storage1bar', '{replica}')
PARTITION BY date
ORDER BY value
SETTINGS index_granularity = 8192;

Take a closer look: both statements do exactly the same thing and create identical tables with the same parameters. However, the metadata generated by these statements is different:

[zk: localhost:2181(CONNECTED) 5] get /clickhouse/tables/1/storage1foo/metadata
metadata format version: 1
date column: date
sampling expression:
index granularity: 8192
mode: 0
sign column:
primary key: value
granularity bytes: 10485760

[zk: localhost:2181(CONNECTED) 6] get /clickhouse/tables/1/storage1bar/metadata
metadata format version: 1
date column:
sampling expression:
index granularity: 8192
mode: 0
sign column:
primary key: value
data format version: 1
partition key: date
granularity bytes: 10485760

As you can see, there is an obvious difference in the date column and partition key fields, even though the input data appear to be the same. As a result, the ClickHouse metadata, and thus the tables created by the two query formats, are incompatible with each other.
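
By the way, the very same metadata can be inspected right from ClickHouse via the system.zookeeper table, without resorting to zkCli; a sketch (the path follows the example above):

SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/tables/1/storage1foo';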

Case #3. The story of one overheated… K8s

One day, we received a message about a malfunction of one of the services. When examining the service pods, we noticed that they were periodically restarting and not operating as expected.

The service was running on several Kubernetes nodes under an enormous CPU load. The load spikes did not occur on all nodes simultaneously, but quite close in time. The situation was critical: at some point, the kubelet stopped responding altogether.

At the same time, the cluster workloads consumed only a small part of the nodes’ computing power: the bulk of the CPU load was attributed to the kidle_inj (forced idle injection) process.

$ dmesg -T | grep intel_powercl
[Sat Jun 13 06:29:00 2020] intel_powerclamp: Start idle injection to reduce power
[Sat Jun 13 06:29:02 2020] intel_powerclamp: Stop forced idle injection
[Sat Jun 13 06:29:05 2020] intel_powerclamp: Start idle injection to reduce power
[Sat Jun 13 06:29:07 2020] intel_powerclamp: Stop forced idle injection
[Sat Jun 13 06:29:08 2020] intel_powerclamp: Start idle injection to reduce power

This cluster runs on bare-metal nodes, which raises the question: what about their temperature?

$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +92.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:        +90.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:        +91.0°C  (high = +80.0°C, crit = +100.0°C)
Core 2:        +92.0°C  (high = +80.0°C, crit = +100.0°C)
Core 3:        +91.0°C  (high = +80.0°C, crit = +100.0°C)

90 degrees Celsius! (= 194 °F)

It was the perfect time to drop a message to the data center’s technical support asking them to check the machines. Meanwhile, we had to find somewhere else to route the traffic.

The technical support team said they needed to turn the machines off for 30 minutes to carry out the maintenance. To make that possible, we drained the first problematic node. Soon, we received the following update from the data center technicians:

Dear Client,

unfortunately the CPU fan was faulty, we’ve replaced this and started the server again.

And this totally proved our point! The further course of action was clear: drain the affected nodes one by one, coordinating with the data center staff replacing the coolers. Needless to say, the situation returned to normal right away.
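
For reference, taking a node out for such maintenance and bringing it back afterwards is a routine two-command procedure (the node name is illustrative):

# evict the pods and mark the node unschedulable before handing it over to the technicians
kubectl drain node-01 --ignore-daemonsets
# (pods with emptyDir volumes may additionally require --delete-emptydir-data,
#  or --delete-local-data in older kubectl versions)

# after the cooler is replaced and the node is back online:
kubectl uncordon node-01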

Case #4. Ode to pg_repack, a PostgreSQL extension

Usually, we do not provide extensive DBA services. Still, we can analyze a database, create a couple of indexes, or optimize some queries. These things often happen by chance: our engineers notice some strange behavior, proceed to explore it, and discover something (you can always find something if you look closely!).

One day, our engineer on duty noticed an increase in the number of long transactions in the PostgreSQL database. To be more precise, it was detected by our monitoring system: the scheduled task of clearing outdated data from the database started to take more and more time and exceeded the threshold set for long transactions.
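
The exact check is built into our monitoring system, but the idea can be sketched with a query against pg_stat_activity like the one below (the 5-minute threshold is illustrative):

-- sessions whose current transaction has been running for more than 5 minutes
SELECT pid,
       now() - xact_start AS xact_duration,
       state,
       left(query, 60)     AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_duration DESC;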

Despite our initial efforts to solve the problem, it became clear that a deeper analysis of the situation was required.

That is why we evaluated the fragmentation of tables using the pgstattuple module:

SELECT schemaname,
       tablename,
       dead_tuple_count(stats) AS dead_tuples,
       pg_size_pretty(dead_tuple_len(stats)) AS dead_space,
       pg_size_pretty(free_space(stats)) AS free_space,
       last_autovacuum,
       last_autoanalyze
FROM
  (SELECT pgt.schemaname AS schemaname,
          pgt.tablename AS tablename,
          pgstattuple(pgt.schemaname || '.' || pgt.tablename) AS stats,
          uts.last_autovacuum AS last_autovacuum,
          uts.last_autoanalyze AS last_autoanalyze
   FROM pg_tables AS pgt
   LEFT JOIN pg_stat_user_tables AS uts
          ON pgt.schemaname = uts.schemaname
         AND pgt.tablename = uts.relname
   WHERE pgt.schemaname NOT IN ('repack', 'pg_catalog')) AS r
ORDER BY dead_tuples DESC;

… as well as the fragmentation of indexes:

SELECT schemaname,
       indexname,
       tablename,
       dead_tuple_count(stats) AS dead_tuples,
       pg_size_pretty(dead_tuple_len(stats)) AS dead_space,
       pg_size_pretty(free_space(stats)) AS free_space
FROM
  (SELECT pgt.schemaname AS schemaname,
          pgt.indexname AS indexname,
          pgt.tablename AS tablename,
          pgstattuple(pgt.schemaname || '.' || pgt.indexname) AS stats
   FROM pg_indexes AS pgt
   WHERE pgt.schemaname NOT IN ('repack', 'pg_catalog')
     AND pgt.indexname NOT IN ('some', 'important', 'triggers')) AS r
ORDER BY dead_tuples DESC;

This study revealed a large number of fragmented tables and indexes. Therefore, we decided to run pg_repack to remove the bloat from the tables and indexes and restore their physical order.
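
The invocation itself is simple; here is a sketch of what we mean (the database and table names are illustrative, the flags are standard pg_repack options):

# repack a single bloated table together with its indexes;
# unlike VACUUM FULL, pg_repack needs only brief exclusive locks
pg_repack --dbname=appdb --table=public.events --jobs=4

# or process every eligible table in the database
pg_repack --dbname=appdb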

It took us a whole night and the following morning to repack a 620 GB database. This operation freed up about 150 GB of disk space, optimized the database, and somewhat increased its performance.

Our next step will be to migrate from PostgreSQL v11 to v12, since that version includes enhancements that reduce data fragmentation and make the problem less of an issue.

The takeaway is clear: keep an eye on your databases, optimize them wisely, and do not turn them into a trash can filled with useless indexes.

Conclusion

Even well-known and well-documented tools may surprise you (both pleasantly and unpleasantly), which makes our eventful work even more fascinating. Have you had similar cases? Please feel free to share them in the comments! As for us, we still have plenty of breathtaking stories, so stay tuned!

Comments 1


  1. Dimitrios Nomikos

    The first case just ate 3 hours of my time and would eat at least 5 more if you hadn’t written this article. <3