Recent troubleshooting cases from our SREs, part 4

The next set of troubleshooting cases from our SREs includes renewing Let’s Encrypt root certificates for legacy CentOS, handling an error in DNS records with Ingress, dealing with a tricky sharding problem in Elasticsearch, and more.

Caution! None of the solutions provided below should be considered universal. Apply them carefully: the outcome depends on the characteristics of the particular project, so we recommend exploring all possible alternatives.

Story #1: Package troubles in CentOS 6

Support for CentOS 6 expired in 2020. Everyone knows that the OS needs to be updated regularly, but sometimes it is simply not feasible. One of the clusters we maintained had several servers running CentOS 6.5, and we needed to install several packages on them quickly. Typically, it comes down to running yum install <foobar>, but in our case, several obstacles prevented us from taking the easy path.

First of all, the /etc/yum.repos.d/CentOS-Base.repo file contains a list of CentOS repository mirrors and their URLs. Since the system was no longer supported, this file was out of date (assuming it retained its original contents). Consequently, it was no longer possible to download anything from these mirrors.

So we decided to try switching to the main repository, but to no avail: the CentOS release policy moves distributions of older versions to a separate vault repository, so they are no longer available in the main one.

The next obvious step was to disable CentOS-Base.repo (you can delete this file from /etc/yum.repos.d, or set enabled=0 for each repository it contains). Then we enabled all the repositories in the CentOS-Vault.repo file by setting enabled=1.
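These two steps can be sketched in shell. This is a sketch, not a drop-in script: the paths default to the stock CentOS 6 locations, and BASE/VAULT can be overridden to try it on copies of the files first.

```shell
# Sketch: disable every repository in CentOS-Base.repo and enable every
# repository in CentOS-Vault.repo. Override BASE/VAULT to dry-run on copies.
BASE="${BASE:-/etc/yum.repos.d/CentOS-Base.repo}"
VAULT="${VAULT:-/etc/yum.repos.d/CentOS-Vault.repo}"

disable_repos() { sed -i 's/^enabled=1/enabled=0/' "$1"; }
enable_repos()  { sed -i 's/^enabled=0/enabled=1/' "$1"; }

[ -f "$BASE" ]  && disable_repos "$BASE"   # turn the dead mirrors off
[ -f "$VAULT" ] && enable_repos "$VAULT"   # turn the vault repos on
```

After flipping the flags, run `yum clean all` so yum forgets the cached metadata of the disabled mirrors.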

After that, it was time to run yum install (no luck again). While the configuration specified HTTP as the protocol for retrieving files from the repository, the server redirected all HTTP requests to HTTPS. In 2011, when CentOS 6 was released, HTTPS was not considered mandatory; nowadays, redirecting to HTTPS is the rule rather than the exception due to security requirements. As a result, our obsolete CentOS simply couldn’t connect to the server protected by the new certificate:

curl -L
curl: (35) SSL connect error

The cause of the error was clear. Further analysis revealed that it was all about the NSS (Network Security Services), a set of libraries for developing security-enabled client and server applications. Applications built with NSS support SSLv3, TLS, and other security standards.

To fix the error, we needed to update the NSS. But not so fast: we still couldn’t retrieve the new packages, since they are only accessible over HTTPS. And in our case, HTTPS only partially worked: some encryption algorithms were missing.

Fortunately, there are several workarounds:

  • You can download packages to the local computer over HTTPS and then upload them over SCP/Rsync to the remote machine. This is probably the best course of action.
  • Use either of the two official mirrors that have not yet been upgraded from HTTP to HTTPS, e.g., mirror.nsc.liu (the third official mirror is not suitable, as it upgrades connections to HTTPS).
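The first option can be sketched as a pair of helper functions. The mirror URL, package name, and host in the usage example are hypothetical, shown only to illustrate the flow.

```shell
# Sketch: fetch a package over HTTPS on a modern machine, then push it to the
# legacy host over SCP and install it there. Names below are hypothetical.
fetch_pkg() {  # fetch_pkg <url> <dest-dir>
  curl -fsSL -o "$2/$(basename "$1")" "$1"
}

push_pkg() {   # push_pkg <file> <user@legacy-host>
  scp "$1" "$2:/tmp/" && ssh "$2" "sudo rpm -Uvh /tmp/$(basename "$1")"
}

# Example (hypothetical mirror URL and host):
# fetch_pkg "https://vault.centos.org/6.10/os/x86_64/Packages/nss-util-3.44.0-1.el6_10.x86_64.rpm" .
# push_pkg nss-util-3.44.0-1.el6_10.x86_64.rpm admin@legacy-centos6
```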

Here is a quick “recipe” with a list of packages (and their dependencies) that need to be updated to use HTTPS:

mkdir update && cd update
curl -O
curl -O
curl -O
curl -O
curl -O
curl -O
curl -O

sudo rpm -Uvh *.rpm

And we’re done! Let’s try to install the required package once again.

No luck: the SSL error is still plaguing us. At that point, you start to think that it would be easier to just install a new version of the OS… (But that’s actually not the case, so we continue.)

The latter problem is due to an invalid root certificate: the domain is secured by a Let’s Encrypt (LE)* certificate, and since September 30, 2021, older operating systems no longer trust LE-signed certificates.

At the time of publishing this article, the point about LE and ca-bundle patching had lost its relevance, since the site is now secured by a certificate signed by Amazon Root CA 1. But you still have to update the NSS.

curl: (60) Peer certificate cannot be authenticated with known CA certificates
More details here:

To fix this problem on CentOS 6, you need to: a) update the list of root certificates, and b) change the validity period of the expired certificate in the local secure storage.

The changed certificate will still be treated as valid by OpenSSL. That is because CentOS 6 uses OpenSSL 1.0.1e, which does not check the signatures of certificates in the local secure storage.

curl -O
sudo rpm -Uvh ca-certificates-2020.2.41-65.1.el6_10.noarch.rpm

sudo sed -i "s/xMDkzMDE0MDExNVow/0MDkzMDE4MTQwM1ow/g" /etc/ssl/certs/ca-bundle.crt
sudo update-ca-trust

Access to the official CentOS repositories for yum has been restored, and HTTPS is working. You can now add a repository with the latest CentOS version, 6.10 (if necessary).

cat <<EOF >/etc/yum.repos.d/CentOS-6.10-Vault.repo
name=CentOS-6.10 - Base
name=CentOS-6.10 - Updates
name=CentOS-6.10 - Extras
EOF

Congratulations! yum install <foobar> now works exactly as expected.

Story #2: Sudden DNS modification and Ingress

It was a quiet and peaceful Thursday night. Suddenly, one of our clients messaged us: they had made “a little” mistake in the DNS settings, and a domain that was never supposed to point to the Kubernetes cluster now did. Unfortunately, there was no way to quickly change the DNS record due to bureaucratic and other delays. On top of that, propagating a DNS update takes time on its own.

As a result, as of Thursday evening, we had:

  • the real-host DNS record pointing to the cluster, with
  • a DNS update time of about 3 days.

At first glance, the solution is pretty obvious: set up an nginx proxy in the cluster and forward requests to the right service. But why deploy a separate nginx if an NGINX Ingress controller is already running in the cluster? Besides, a Kubernetes Service with manually defined Endpoints can be used to proxy requests to an external address.

As a result, we added the following resources to the cluster:

apiVersion: v1
kind: Service
metadata:
  name: real-external
spec:
  ports:
  - port: 443
    protocol: TCP
    targetPort: 443
  type: ClusterIP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: real-external
subsets:
- addresses:
  - ip: 
  ports:
  - port: 443
    protocol: TCP
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
  name: real-external
spec:
  rules:
  - host: real-host
    http:
      paths:
      - backend:
          serviceName: real-external
          servicePort: 443
        path: /
  tls:
  - hosts:
    - real-host
    secretName: real-tls

Note that you will also need a real-tls SSL certificate (you can issue it with cert-manager).
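Had the upstream been reachable by a hostname rather than a raw IP, a simpler sketch would be a Service of type ExternalName (the FQDN below is hypothetical):

```yaml
# Hypothetical alternative: a DNS-level alias instead of manual Endpoints.
apiVersion: v1
kind: Service
metadata:
  name: real-external
spec:
  type: ExternalName
  externalName: real-host.example.com  # hypothetical upstream FQDN
```

Keep in mind that ExternalName only creates a DNS CNAME: it does not proxy traffic or terminate TLS, which is why the Service-plus-Endpoints approach fits the raw-IP case better.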

This approach is quite common and acceptable. It is helpful when someone has made a mistake or if you need to make resources accessible outside the cluster (never mind the question of why this might be necessary).

Case #3: Shards charade

One day, there was a need to split a large Elasticsearch index into shards. That raised the question: how many shards should we use?

The Elasticsearch documentation stipulates two conditions on the new number of shards (number_of_shards):

  • the new number must be a multiple of the current number of shards;
  • the internal number of sub-shards (set during the index creation) — number_of_routing_shards — must be a multiple of the new number of shards (we use the default value in our example).

Unfortunately, number_of_routing_shards is not exposed via /index/_settings. If you do not know the parameters the index was created with, determining the number of sub-shards seems infeasible.

But there is a workaround: you can call the split API and specify a number of shards that number_of_routing_shards cannot possibly be divisible by (e.g., a large prime number). The number_of_routing_shards value will then be returned along with the error.

POST index/_split/target-index
{
  "settings": {
    "index.number_of_shards": this_number
  }
}

The error in the response reveals the value:

  "error": {
    "reason": "the number of routing shards [421] must be a multiple of the target shards [15]"
  }

…or not? Somehow, the index got divided into 421 shards, which spread across the nodes, and the cluster was left “blushing” for our mistakes. But what happened?

It turns out that Elasticsearch can split a single-shard index into any number of new shards. You can roll back the changes by making the index read-only and using the shrink index API (in our case, we simply deleted all 421 shards, since the data they stored was not essential).
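Had we needed that rollback, the shrink path might have looked roughly like this (a sketch with hypothetical index names, in the same console style as the split request above):

```
PUT target-index/_settings
{
  "index.blocks.write": true
}

POST target-index/_shrink/target-index-shrunk
{
  "settings": {
    "index.number_of_shards": 1
  }
}
```

The write block is mandatory: Elasticsearch refuses to shrink an index that is still accepting writes.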

Case #4: Restoring a table from a backup is easy, right?

Picture this: it’s Wednesday morning, the middle of the workweek. One customer asks for some help: “Do you guys have backups of the ‘Somename’ database (all names are fictitious)? We need Table 1 as we accidentally wiped out all the data from it older than 1/1/2022. The data is not critical: it would be great if you could get it today.”

Normally, we use BorgBackup for backups (including DB backups). For example, here is how we back up PostgreSQL data to BorgBackup:

pg_basebackup --checkpoint=fast --format=tar --label=backup --wal-method=fetch --pgdata=-

If we need to retrieve some data from a backup (e.g., a single table), we usually:

  • pull the necessary pgdata dump;
  • extract its contents to a temporary directory;
  • run Postgres in a Docker container mounting the directory containing the extracted pgdata as the container’s volume.

You can do that as follows:

borg extract --stdout /backup/REPONAME::ARCHIVE_NAME | tar -xC /backup/PG-DATA
docker run -d -it -e POSTGRES_HOST_AUTH_METHOD=trust -v /backup/PG-DATA:/var/lib/postgresql/data postgres:11

Next, connect to Postgres and retrieve the necessary data: create a dump of part of the data, part of the tables, etc. (rolling back the entire database is rarely required).

In our story, we followed a similar pattern. This time, however, the plan failed: the engineer restored the backup, ran Docker, had a cup of coffee, connected to the database, and … could not find the necessary data in the backup! What was going on? The data should’ve been in place, because the client had precisely specified the date of the data that was lost.

But this is not yet another story about validating backups and repeating the GitHub challenge. The backup was fine (at least at the time it was extracted). So what was going on?

The answer is simple: a DB replica was used for the backup, so there was a recovery.conf* file in the backup. Once started, the Postgres container running in Docker decided to catch up with the master**. This was not part of the plan: we wanted to get the state of the target table at the time of the backup, not its current state.

* Starting with PostgreSQL 12, connection parameters are stored in the file, while the standby.signal file is responsible for switching to recovery mode (these two files replaced recovery.conf). A similar situation may occur if a standby.signal file is present in the data directory during the PostgreSQL startup.

** This scenario is possible when the backup server has access to the master database: e.g., they are in the same L2 network, and access to pg_hba.conf is granted to the whole subnet.

The correct solution was to delete the recovery.conf file (standby.signal or recovery.signal) before starting the Docker container. That’s what we did on the second attempt.
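That pre-flight cleanup can be sketched as a small helper (the data directory path matches the earlier extraction example):

```shell
# Sketch: remove replication triggers from the extracted data directory
# before starting the temporary container (covers pre- and post-PG12 layouts).
strip_recovery() {  # strip_recovery <pgdata-dir>
  rm -f "$1/recovery.conf" "$1/standby.signal" "$1/recovery.signal"
}

strip_recovery /backup/PG-DATA
```

Running it between `borg extract` and `docker run` guarantees the throwaway instance starts as a plain standalone server instead of a standby trying to catch up with the master.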

In the end, the data was restored, and no records were damaged.

We also amended the internal instructions for deploying temporary backups and informed our fellow engineers. Hopefully, this has helped someone significantly reduce their downtime.

The bottom line: you have to check the configuration even while running applications in Docker.

A note about MySQL

A similar problem can occur with MySQL when using Percona tools XtraBackup and innobackupex.

During a dump, a complete copy of the database is created (including replication settings). Restoring this dump may result in the DBMS connecting to the main cluster and initiating replication. However, for this to happen, two conditions must be met:

  1. The user under which replication runs is configured to accept connections from any IP (e.g., 'replication'@'%').
  2. The database is no older than the available binlogs.
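One defensive measure (not part of the original story, but a standard mysqld option that guards against exactly this) is to start the temporary restore instance with replication disabled:

```ini
# my.cnf fragment for a throwaway restore instance (sketch):
[mysqld]
skip-slave-start    # never start replicating automatically at boot
```

With this flag set, replication would only begin after an explicit START SLAVE, so a restored dump cannot silently connect to the main cluster.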

Story #5: Where is the Pod?

One day, one of the stage clusters we maintain lost the ability to create new Pods in its namespace:

kubectl scale deploy productpage-v2 --replicas=2
kubectl get po | grep productpage-v2
# only one old Pod is running

Typically, when a Pod is created, but no containers are running inside it, this means that the Pod is in one of the following states:

  • Pending (there is no suitable node);
  • CrashLoopBackOff (the Pod is waiting to run after several crashes have occurred);
  • ContainerCreating (containers are being created).

However, in our case, the Pod object was missing!

kubectl describe deploy productpage-v2
Normal  ScalingReplicaSet  17s   deployment-controller  Scaled up replica set productpage-v2-65fff6fcc9 to 2

kubectl describe rs productpage-v2-65fff6fcc9
Warning  FailedCreate  16s (x13 over 36s)  replicaset-controller  Error creating: Internal error occurred: failed calling webhook "": Post "https://istiod-v1x10x1.d8-istio.svc:443/inject?timeout=30s": no service port 443 found for service "istiod-v1x10x1"

It turned out that there was some webhook preventing Pods from starting. What in the world does that mean?

The Kubernetes documentation describes two types of admission webhooks: validating and mutating. Mutating webhooks can modify objects sent to the API server, while validating webhooks can reject requests in order to enforce custom policies.

Here’s how it works: when an object is created, the API server sends its description to the address specified in the hook and gets back a verdict on the fate of that object: allow, deny, or modify.
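That answer travels as an AdmissionReview object; a minimal “allow” response looks roughly like this (per the Kubernetes admission API):

```json
{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "<uid copied from the request>",
    "allowed": true
  }
}
```

A mutating webhook would additionally return a base64-encoded JSON patch in the response to modify the object.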

Here is a typical example of a hook that will be called every time any object is created or updated:

- clientConfig:
    service:
      name: cert-manager-webhook       # <- cert-manager is listening here
      namespace: d8-cert-manager
      path: /mutate
      port: 443
  failurePolicy: Fail
  rules:
  - apiGroups:
    - '*'
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - '*/*'
    scope: '*'

In our case, the webhook was the culprit. Istio uses it to add sidecar Envoy containers to all running Pods.

Here’s what the process usually looks like:

  • Istio creates MutatingWebhookConfiguration.
  • When creating a Pod, the API server requests the istiod-v1x10x1 service:
          name: istiod-v1x10x1
          namespace: d8-istio
          path: /inject
          port: 443
  • Istio responds with changes that need to be made to the Pod; the Pod is then created.

In our case, Istio was deleted, but only partially: the istiod-v1x10x1 service no longer worked, while the webhook was still there. Meanwhile, the webhook’s failurePolicy was set to Fail. That is what prevented Pods from being created.

You can create Pods by setting failurePolicy: Ignore, but we believe it is better to discover the problem and fix it rather than devise a workaround.
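To hunt down such leftovers, one hedged sketch is to list the Services each mutating webhook points at and cross-check them against the Services that actually exist (this assumes kubectl access; the istiod name is the one from this story):

```shell
# Sketch: print "<namespace>/<service>" for every mutating webhook that
# routes to an in-cluster Service, so stale ones stand out.
webhook_services() {
  kubectl get mutatingwebhookconfigurations -o \
    jsonpath='{range .items[*].webhooks[*]}{.clientConfig.service.namespace}/{.clientConfig.service.name}{"\n"}{end}'
}

# Any entry whose Service is gone (e.g., d8-istio/istiod-v1x10x1) marks a
# stale configuration, which can then be removed with:
# kubectl delete mutatingwebhookconfiguration <name>
```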

To be continued…

Here go our previous troubleshooting stories:

  • Part 3: Linux server migration; getting to know the ClickHouse Kubernetes operator; accelerating the data recovery in a broken PostgreSQL replica; a CockroachDB upgrade that went wrong;
  • Part 2: Kafka and Docker variables in Kubernetes; 100 bytes that led ClickHouse to a disaster; one overheated K8s; ode to PostgreSQL’s pg_repack;
  • Part 1: Golang and HTTP/2 issue;  no Sentry for old Symfony; RabbitMQ and third-party proxy; the power of a GIN index in PgSQL; Caching S3 with NGINX; analyzing Google User Content during DDoS.

