Once upon a time, there was a PostgreSQL cluster running on a virtual machine managed by Patroni. Its primary node did not need all the available resources (CPU and memory), so we decided to reduce them. Making this change implied a role switchover between primary and secondary servers. However, we ran into an issue: only the primary remained operational, while all the replicas stopped working…
Fortunately, we were able to fix it and also discover a few interesting things along the way. This article shares how you can ensure a smooth switchover by leveraging synchronous replication in PostgreSQL/Patroni.
A bit of backstory
Our PostgreSQL cluster consisted of a primary node with 32 CPU cores and 64 GB of RAM supplemented by two inferior replicas. The replicas were configured for asynchronous replication, as the cluster was geographically distributed and waiting for data to synchronize with replicas located elsewhere was impractical for our workflow.
At some point, we realized that having such a powerful primary node was overkill. In order to optimize costs, we decided to downsize it. Our action plan was as follows:
- Wait until a period of low cluster load ensues.
- Perform the switchover procedure by reassigning the primary role to the first replica.
- Reconfigure the virtual machine on which the primary node was running, thereby reducing the allocated resources.
- Perform the reverse switchover procedure, thus returning the primary role to the original, now reconfigured node.
In doing so, we hoped to optimize the cluster’s resources while having only a minimal impact on the cluster’s operations.
Performing a manual switchover for PostgreSQL
Once a period of low cluster load set in, I proceeded to perform the switchover. Since our replicas were configured for asynchronous replication, there was a risk of losing data that had not yet made it to the replicas. To keep this risk to a minimum in PostgreSQL, I typically do the following:
- Run the CHECKPOINT command. It flushes all modified buffers to disk and creates a checkpoint marking a consistent state of the cluster.
- Run SELECT pg_switch_wal(). It closes the current WAL (Write-Ahead Log) segment and switches to a new one so that the closed segment can be shipped to the replicas.
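Put together, the pre-switchover routine on the primary is just these two statements (a minimal sketch of the commands described above):
-- Run on the current primary shortly before the switchover
CHECKPOINT;               -- flush all modified buffers to disk
SELECT pg_switch_wal();   -- close the current WAL segment so it can be streamed to the replicas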
I assumed that running CHECKPOINT on the replicas would cause them to commit all received changes to disk, thereby preparing them for the role switch. However, my assumption turned out to be wrong.
To get a sense of the current situation, I used a special query that shows the difference in states between the primary node and replicas at different stages of WAL processing. It helps you find out how far the replicas are lagging behind the primary during the data replication process:
SELECT
    pg_wal_lsn_diff(pg_current_wal_insert_lsn(), pg_current_wal_flush_lsn()) AS non_flushed,
    s.application_name,
    pg_wal_lsn_diff(pg_current_wal_insert_lsn(), s.sent_lsn)   AS sent_lag,
    pg_wal_lsn_diff(pg_current_wal_insert_lsn(), s.write_lsn)  AS write_lag,
    pg_wal_lsn_diff(pg_current_wal_insert_lsn(), s.flush_lsn)  AS flush_lag,
    pg_wal_lsn_diff(pg_current_wal_insert_lsn(), s.replay_lsn) AS replay_lag
FROM pg_stat_replication s;
non_flushed | application_name | sent_lag | write_lag | flush_lag | replay_lag
-------------+----------------------+----------+-----------+-----------+------------
0 | postgres-1 | 0 | 0 | 0 | 0
0 | postgres-2 | 0 | 0 | 0 | 0
It looked like the primary database server and the replicas were identical.
However, when I performed the Patroni switchover, I was caught off guard by something:
+ Cluster: main ----+--------------+---------+---------+----+-----------+------------------+
| Member | Host | Role | State | TL | Lag in MB | Tags |
+-------------------+--------------+---------+---------+----+-----------+------------------+
| postgres-0 | 10.1.4.20 | Replica | running | 8 | unknown | clonefrom: true |
+-------------------+--------------+---------+---------+----+-----------+------------------+
| postgres-1 | 10.2.4.3 | Leader | running | 9 | | clonefrom: true |
+-------------------+--------------+---------+---------+----+-----------+------------------+
| postgres-2 | 10.3.4.7 | Replica | running | 8 | unknown | clonefrom: true |
+-------------------+--------------+---------+---------+----+-----------+------------------+
Once the switchover procedure was complete, I discovered that none of the replicas in the cluster were active anymore. So what happened?
It turned out that Patroni stopped the current primary node and turned over its role to the postgres-1 replica. However, in between these actions, some changes were made to the primary node as well. These changes were applied to the postgres-2 replica but failed to reach postgres-1. As a result, the primary role was assigned to the replica with the least current data state.
PostgreSQL has a rule that the primary server should always be the one with the most up-to-date data. If a replica has more transactions than the primary, it cannot join the cluster. In that case, you can either roll back the replica’s state to that of the primary using the pg_rewind tool or completely resynchronize the data from the primary to the replica.
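For reference, a manual pg_rewind invocation looks roughly like the sketch below; the paths and the source host here are assumptions based on this cluster, and the target instance must be shut down cleanly before running it (Patroni can also invoke pg_rewind automatically when use_pg_rewind is enabled):
# Rewind the former primary's data directory to match the new primary (hypothetical paths/host)
/usr/lib/postgresql/15/bin/pg_rewind \
  --target-pgdata=/var/lib/postgresql/15/main \
  --source-server='host=10.2.4.3 port=5432 user=postgres dbname=postgres' \
  --progress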
Solving the [unexpected] replication issue
My previous experience with this cluster showed that, in most cases, running pg_rewind took longer than fully resynchronizing a replica. However, this is not a catch-all solution. For example, if the database is huge, a complete resynchronization may take an extremely long time. In cases like these, rolling back the state with pg_rewind is the more sensible alternative.
Before proceeding to synchronize the databases, I had to check what data failed to be transferred when the roles were switched. If there was some critical information, it needed to be saved and later restored.
So I went through the cluster logs, and here’s what I found:
postgres-0$ /usr/lib/postgresql/15/bin/pg_waldump /var/lib/postgresql/15/main/pg_wal/0000000800002265000000CE
postgres-2$ /usr/lib/postgresql/15/bin/pg_waldump /var/lib/postgresql/15/main/pg_wal/0000000800002265000000CE
Both WAL dumps contained a CHECKPOINT_SHUTDOWN record.
While I was going through the logs, I found that information about shutting down the primary node failed to reach postgres-1. This clarified the issue I was experiencing and gave me an idea of what to do next.
Note: This experience involved using Patroni v3.0.1, which is quite old today. According to the project’s changelog, the issue I encountered that day was resolved in v3.2.1. Another lesson learned is to consider newer versions of the software you rely on. It’s not necessarily about updating them, but at least going through the changelog.
I started the reinitialization process for postgres-2 with the patronictl reinit main postgres-2 command, without running any additional checks. It took about 30 minutes before the replica reported that reinitialization was complete. To make sure everything was okay, I checked the cluster status and saw the following:
+ Cluster: main ----+--------------+---------+---------+----+-----------+------------------+
| Member | Host | Role | State | TL | Lag in MB | Tags |
+-------------------+--------------+---------+---------+----+-----------+------------------+
| postgres-0 | 10.1.4.20 | Replica | running | 8 | unknown | clonefrom: true |
+-------------------+--------------+---------+---------+----+-----------+------------------+
| postgres-1 | 10.2.4.3 | Leader | running | 9 | | clonefrom: true |
+-------------------+--------------+---------+---------+----+-----------+------------------+
| postgres-2 | 10.3.4.7 | Replica | running | 8 | unknown | clonefrom: true |
+-------------------+--------------+---------+---------+----+-----------+------------------+
After half an hour, the cluster was still in a semi-functional state. Upon examining the Patroni logs, I noticed that the synchronization was performed with postgres-0, the former primary.
I re-ran the re-initialization process and got the same log message: the replica was trying to retrieve data from the previous timeline. This unusual behavior warranted further investigation.
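One quick way to compare timelines during an investigation like this is to check the control data on each node. This is a generic diagnostic sketch rather than the exact check I ran:
-- On each node: the timeline recorded in the control file
SELECT timeline_id FROM pg_control_checkpoint();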
The culprit is found! (And new adventures ahead)
It’s always a good idea to refer to the documentation if you don’t know what to do. That’s exactly what I did. It turned out that a new member is only cloned from the primary if there are no other running replicas carrying the clonefrom tag. This makes sense: that way, the primary server is not burdened by backup procedures. However, Patroni does not take into account how up to date a replica’s data is when picking it as the clone source.
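The clonefrom behavior is driven by a member-level tag in the Patroni configuration; a fragment on a replica might look like this (values are illustrative):
# patroni.yml fragment on a replica (hypothetical):
# members carrying this tag are preferred as the source
# when another member needs to be (re)initialized
tags:
  clonefrom: true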
Since I was planning to reconfigure the former primary anyway, I deleted postgres-2 and ran the following command:
patronictl reinit main postgres-0
As expected, it initiated data synchronization with postgres-1. To speed things up, I immediately started initializing the second replica:
patronictl reinit main postgres-2
Out of the blue, postgres-0, which had already downloaded several gigabytes of data, started the transfer process all over again. I stopped the process and restarted it with patronictl reinit main postgres-0. However, this caused the second replica’s synchronization to fail. Log analysis revealed that Patroni had created snapshots with the same label for the dumps, which caused confusion and led to process restarts.
I had no choice but to proceed sequentially, so I started the replication process again.
By this point, the workday was in full swing, but the replicas were still missing. So all queries were routed to the primary node, causing it to slow down considerably. This led to a build-up of queries that could not be completed in time. To make matters worse, pg_basebackup, the tool Patroni uses for replication, was connecting just like a regular client and was therefore left waiting for its turn to get a connection.
Since the applications were connecting via HAProxy, I reduced the number of permitted connections in its settings. This allowed pg_basebackup to run and commence the replication process but caused applications to slow down. So I ended up having to reset the number of connections in HAProxy back to the original value.
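The tweak itself was just a connection cap on the HAProxy side; a rough fragment with hypothetical ports and limits might look like this:
# haproxy.cfg fragment (hypothetical values): temporarily cap client
# connections to the primary so that pg_basebackup can get a free slot
listen postgres_primary
    bind *:5000
    mode tcp
    server postgres-1 10.2.4.3:5432 maxconn 100 check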
It felt like it was time to take a short break, but there was a serious concern: the entire system was running extremely slow. Simply put, the only server running was supposed to handle three times as many queries as it usually does. However, the actual performance, for some reason, turned out to be not just three times worse, but an order of magnitude worse. It was time to conduct a rigorous investigation into the issue!
In analyzing the primary database log, I discovered a large number of messages about COMMIT requests that took a long time to complete. What exactly was causing the delays: the queries themselves under high load, or slow transaction commits?
After looking into it, it turned out that queries were running fast enough, but it was transaction committing that was sluggish. I wondered, “Did pg_basebackup replication affect performance that much?”
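To separate slow queries from slow commits, it helps to look at what the active backends are actually waiting on. A generic diagnostic sketch (not necessarily the exact query I used):
-- What are the active backends currently waiting on?
SELECT state, wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state <> 'idle'
GROUP BY 1, 2, 3
ORDER BY count(*) DESC;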
While doing the analysis, I noticed a message that the transaction had been canceled due to a long commit on a replica. This was particularly surprising given that our configuration used asynchronous replication. So I decided to check the primary server’s settings and came up with the following:
synchronous_commit=on
synchronous_standby_names='*'
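A quick way to confirm settings like these on the primary is to query pg_settings directly (just a verification sketch for the two parameters above):
-- Verify the commit/replication settings currently in effect
SELECT name, setting, source
FROM pg_settings
WHERE name IN ('synchronous_commit', 'synchronous_standby_names');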
Nice! To my surprise, synchronous replication had been activated on the new primary node for some reason. It remains a mystery where these settings came from. Fortunately, there was a silver lining as well: this incident prompted us to introduce new metrics and alerts into our monitoring system to track whatever changes were made to the PostgreSQL configuration.
I purged the list of synchronous replicas, which resulted in a significant increase in application speeds. However, while the overall condition had improved, the issue still persisted.
I was about to perform a reverse primary switch, and I was extremely reluctant to repeat my previous mistakes. I wondered if synchronous replication, which had caused so many problems for applications before, could help me carry out the switchover procedure.
After carefully analyzing the situation, I devised the following action plan:
- Wait until the data replication process to the postgres-0 instance is complete.
- Replicate data to the postgres-2 instance to ensure that an active replica is available to handle application queries.
- On the primary node, set synchronous_standby_names='postgres-0', thereby enabling synchronous replication with the selected replica.
- Perform a switchover.
My assumption was that in this scenario, the primary would complete the active transactions and only write its shutdown checkpoint after receiving confirmation from the switchover candidate (i.e., postgres-0).
Wrapping things up
The postgres-0 replica successfully started and synchronized with the primary, catching up with its state.
The next step was to enable synchronous replication:
patronictl edit-config
synchronous_standby_names='postgres-0'
However, when I checked the synchronous replication status in the database, I found no changes.
A Patroni log review revealed an error in the command. You might be aware that PostgreSQL does not allow hyphens (-) in unquoted object names: column, table, database, and server names containing hyphens must be enclosed in double quotation marks (e.g., "postgres-0"). Unfortunately, Patroni did not apply this quoting for me.
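For the record, when setting the parameter in PostgreSQL directly (outside of Patroni), the hyphenated name has to be double-quoted inside the value, roughly like this:
-- Hyphenated standby names must be double-quoted inside the value
ALTER SYSTEM SET synchronous_standby_names = '"postgres-0"';
SELECT pg_reload_conf();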
Okay. I set all replicas as synchronous (synchronous_standby_names='*') and initiated the switchover. To my pure joy, the whole process went off without a single glitch, completely in line with the original plan.
Summary
If communications between nodes in your Patroni cluster are not fast enough, you may encounter similar challenges when running switchover or failover operations. It is difficult to do anything about failover, but you can make the switchover process safer by following these guidelines:
- Switch the replicas to synchronous mode and disable automatic pg_rewind (see the configuration sketch after this list).
- Check that the changes have been applied (journalctl -u patroni).
- Ensure that all replicas are fully synchronized with the primary server and do not lag behind it.
- Perform a switchover.
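In Patroni terms, the first item roughly maps to the dynamic configuration below; this is a hedged sketch edited via patronictl edit-config, and the exact keys depend on your setup:
# patronictl edit-config -- illustrative fragment
postgresql:
  use_pg_rewind: false             # disable automatic pg_rewind
  parameters:
    synchronous_commit: 'on'
    synchronous_standby_names: '*' # or rely on Patroni's built-in synchronous_mode instead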
Using newer versions of Patroni will also be beneficial. In particular, the project has introduced various switchover-related fixes and improvements over the last year and a half. Good luck!