How Bitlox Engineers Avoid the 3 Deadliest PostgreSQL Replication Mistakes

Replication in PostgreSQL can feel like a solved problem — until your standby falls behind by gigabytes, a WAL segment gets deleted prematurely, or a failover leaves you with split-brain data. Bitlox engineers have debugged these exact scenarios across dozens of deployments, and the same three mistakes keep surfacing. This guide names them, explains why they're deadly, and shows you how to build replication that doesn't betray you at 3 AM.

Who Needs This and What Goes Wrong Without It

If you run PostgreSQL in production with more than one server, you already have some form of replication — maybe streaming replication for a hot standby, or logical replication for selective data distribution. The promise is clear: failover capability, read scaling, and disaster recovery. But the reality often includes silent data corruption, replication lag that grows unbounded, and failovers that fail because the standby isn't actually ready.

Without careful configuration, the three deadliest mistakes are:

  • Mistake 1: Inadequate WAL retention. If a standby falls too far behind and the primary has already recycled the required WAL segments, the standby must be rebuilt from scratch. This can take hours for large databases and leaves you without a replica during that window.
  • Mistake 2: Misconfigured synchronous replication. Setting synchronous_standby_names incorrectly can cause commits on the primary to hang indefinitely if the standby goes down. PostgreSQL itself never falls back to asynchronous mode on its own; the quieter danger is that a tool or an operator clears the setting to restore writes and nobody is alerted, so you run without the zero-data-loss guarantee you think you have.
  • Mistake 3: Ignoring network timeouts and keepalives. A transient network blip can cause the replication connection to drop, and without proper wal_sender_timeout and tcp_keepalives_* settings, the primary may keep a stale connection open while the standby waits on a dead socket instead of reconnecting.

Each mistake can lead to data loss, extended downtime, or both. In one composite scenario, a team lost six hours rebuilding a 2 TB standby because they had set wal_keep_segments too low and didn't monitor replication slots. In another, a synchronous replication misconfiguration caused the primary to freeze during a routine network maintenance window, taking down the entire application.

This article is for anyone managing PostgreSQL replication — whether you're setting up your first standby or tuning an existing cluster. We'll cover the core mechanisms, then walk through each mistake with actionable fixes, monitoring strategies, and testing approaches. By the end, you'll have a hardened replication setup that can survive real-world failures.

Prerequisites and Context: What You Need Before You Start

Before diving into the fixes, let's establish a baseline. This guide assumes you have a working PostgreSQL installation (version 12 or later) and basic familiarity with pg_hba.conf, postgresql.conf, and the pg_ctl or pg_rewind utilities. If you're using a cluster management tool like Patroni or repmgr, many of these settings are managed automatically, but understanding the underlying parameters is still critical for debugging.

Key Configuration Parameters

These parameters directly affect replication behavior:

  • wal_level: Must be set to replica or logical for any replication to work.
  • max_wal_senders: Controls how many concurrent replication connections are allowed. Each streaming standby, each logical replication subscriber, and each running pg_basebackup uses one.
  • wal_keep_segments (v12) / wal_keep_size (v13+): Sets the minimum amount of past WAL retained on the primary, as a segment count in v12 and as a size in v13+ (wal_keep_segments was removed in v13). This is a safety net for replicas that fall behind.
  • max_replication_slots: Determines how many replication slots can be created. Slots prevent WAL from being recycled until the slot's consumer has received it.
  • synchronous_standby_names: Lists standbys that must confirm receipt of WAL before a transaction commit returns to the client.
  • wal_sender_timeout: The maximum time the primary waits for a response from a standby before closing the connection. The default is 60 seconds; 0 disables the timeout.

Monitoring Tools

You'll need access to pg_stat_replication and pg_replication_slots views. These are your primary windows into replication health. Additionally, pg_stat_wal_receiver on the standby shows the receiver's state. Familiarize yourself with these queries:

SELECT * FROM pg_stat_replication;
SELECT * FROM pg_replication_slots;
SELECT * FROM pg_stat_wal_receiver;

If you're using logical replication, pg_stat_subscription and pg_stat_subscription_stats (v15+) are also essential.

Network Considerations

Replication traffic is sensitive to latency and packet loss. Ensure your network between primary and standby has low latency (ideally <10 ms round-trip) and minimal jitter. Firewalls must allow the PostgreSQL port (default 5432) and any custom ports. Also, consider using a dedicated replication network or VLAN to isolate traffic.

With these basics in place, let's tackle the first deadly mistake.

Mistake 1: Inadequate WAL Retention — The Silent Standby Killer

The most common replication failure Bitlox engineers encounter is a standby that falls too far behind and can never catch up because the primary has already recycled the required WAL segments. The symptom is a FATAL: requested WAL segment ... has already been removed error in the standby's log. At that point, the only recovery option is to rebuild the standby from scratch — a process that can take hours or days for large databases.

Why It Happens

PostgreSQL recycles WAL segments once they are no longer needed for crash recovery and have been archived (if archive_mode is on). By default, the number of segments retained is controlled by wal_keep_segments (or wal_keep_size in v13+), which defaults to 0 — meaning no extra segments are kept beyond what's needed for the primary's own crash recovery. If a standby disconnects or falls behind by more than that number of segments, it's toast.

Replication slots are designed to prevent this: when a slot is created, PostgreSQL will not recycle any WAL that the slot hasn't acknowledged. However, slots have their own risks — they can cause the primary's pg_wal directory to grow unbounded if a standby goes offline permanently, leading to disk-full scenarios.

How Bitlox Engineers Fix It

The solution is a combination of replication slots and a sensible wal_keep_segments setting (wal_keep_size on v13+), plus monitoring to detect lag early.

  • Always use replication slots for physical streaming replication. Create a slot on the primary for each standby: SELECT * FROM pg_create_physical_replication_slot('standby1'); Then set the standby's primary_slot_name to that slot. This guarantees that WAL is retained until the standby consumes it.
  • Set a reasonable wal_keep_segments as a safety net even when using slots. A good starting point is 1024 (16 GB with 16 MB segments; wal_keep_size = 16GB on v13+) for moderate write workloads.
  • Monitor replication slot lag using the pg_replication_slots view. The restart_lsn column shows the oldest LSN that the slot requires; its distance from the current WAL position is the slot's lag. Alert when that lag exceeds a threshold (e.g., 100 GB or 1 hour of write activity).
  • Set up archiving as a second safety net. If you archive WAL to a remote location, a standby configured with a restore_command can fetch segments from the archive even after the primary has recycled them.
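
The slot-lag check described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Bitlox's tooling: the helper names (lsn_to_int, slot_lag_bytes, slots_over_threshold) and the 50 GiB default threshold are made up for the example, and the rows are assumed to come from a query like SLOT_SQL, run on the primary with whatever driver you already use.

```python
# Sketch: flag replication slots whose retained WAL exceeds a threshold.
# Helper names and the threshold are illustrative assumptions.

# Run on the primary; pg_current_wal_lsn() is the current WAL write position.
SLOT_SQL = """
SELECT slot_name, active, restart_lsn, pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots;
"""

GIB = 1024 ** 3


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_lag_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL the primary must keep on behalf of a slot."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


def slots_over_threshold(rows, threshold_bytes=50 * GIB):
    """Return (slot_name, lag_bytes) for every slot above the threshold.

    rows: iterable of dicts with slot_name, restart_lsn, current_lsn.
    Slots that have never been used have no restart_lsn and are skipped.
    """
    flagged = []
    for row in rows:
        if row["restart_lsn"] is None:
            continue
        lag = slot_lag_bytes(row["current_lsn"], row["restart_lsn"])
        if lag > threshold_bytes:
            flagged.append((row["slot_name"], lag))
    return flagged
```

The same subtraction can be done server-side with pg_wal_lsn_diff(); doing it in the client, as here, just makes the threshold logic easy to unit test.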

Composite Scenario

A Bitlox client ran a busy e-commerce platform with a 500 GB database. They had a single hot standby with a replication slot configured, but they didn't monitor slot lag. Over a weekend, the standby's disk filled up due to a runaway query on the standby that consumed all available space for temporary files. The standby stopped applying WAL, and the slot on the primary began accumulating WAL. By Monday morning, the primary's pg_wal directory had grown to 200 GB, and the primary itself started failing with disk-full errors. The team had to drop the slot, rebuild the standby from a base backup, and restore the primary from archive — a 4-hour outage during peak hours.

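With proper monitoring (alerting on slot lag exceeding 50 GB), this would have been caught early, and the standby could have been fixed without a full rebuild. Note that a larger wal_keep_segments would not have helped here, because the problem was too much retained WAL rather than too little. On v13+, max_slot_wal_keep_size can cap how much WAL a slot may hold, at the cost of invalidating the slot once the cap is reached.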

Mistake 2: Misconfigured Synchronous Replication — The Primary Freeze

Synchronous replication promises zero data loss: a transaction commit does not return to the client until at least one standby has confirmed receipt of the WAL. But misconfiguration can turn this promise into a poison pill. The most common error is setting synchronous_standby_names to a list that includes a standby that is not actually connected or not properly configured, causing the primary to block all writes indefinitely.

Why It Happens

The synchronous_standby_names parameter can take several forms:

  • FIRST N (name1, name2, ...) — The first N standbys in the list that are connected and streaming become synchronous. If fewer than N are available, the primary waits for more to connect.
  • ANY N (name1, name2, ...) — Any N of the listed standbys must confirm. If fewer than N are available, the primary waits.
  • A simple list (e.g., 'standby1, standby2'): Equivalent to FIRST 1 with those names in order.

The danger is when you specify a standby that is down or unreachable. For example, if synchronous_standby_names = 'standby1' and standby1 is offline, the primary will wait forever for it to confirm every commit. No writes will complete. This is often discovered during maintenance when you take a standby down for patching — and the primary stops accepting writes.
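
To make the blocking behavior concrete, here is a small Python model of how the FIRST and ANY forms choose standbys. It is a simplified sketch of the documented semantics, not PostgreSQL code; the names SyncOutcome and evaluate are invented for this example.

```python
# Simplified model of synchronous_standby_names selection.
# SyncOutcome and evaluate are illustrative names, not a PostgreSQL API.
from dataclasses import dataclass


@dataclass
class SyncOutcome:
    sync: list           # standbys that commits currently wait for
    potential: list      # connected and listed, but only next in line
    commits_block: bool  # True when fewer than N listed standbys are connected


def evaluate(method: str, n: int, listed: list, connected: set) -> SyncOutcome:
    """Decide which listed standbys are synchronous right now."""
    # Only standbys that are both listed and currently connected can qualify.
    candidates = [name for name in listed if name in connected]
    if method == "FIRST":
        # Priority order: the first N candidates are sync, the rest wait in line.
        sync, potential = candidates[:n], candidates[n:]
    elif method == "ANY":
        # Quorum: any N confirmations from the candidates are enough.
        sync, potential = candidates, []
    else:
        raise ValueError("method must be FIRST or ANY")
    return SyncOutcome(sync, potential, commits_block=len(candidates) < n)
```

For example, evaluate("FIRST", 1, ["standby1"], set()) reports commits_block=True, which is the single-standby freeze described above, while adding a connected standby2 to the list lets commits proceed.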

How Bitlox Engineers Fix It

The key is to understand the trade-off between data safety and availability, and to configure accordingly.

  • Use FIRST with a priority list that includes only standbys you expect to be always available. For example, FIRST 1 (standby1, standby2) means the primary will wait for the first connected standby in the list. If standby1 is down, it will use standby2. If both are down, it waits, which is still a risk.
  • Set a synchronous_commit level that matches your tolerance. Options are remote_write (standby has written WAL to the OS but not flushed it), on (standby has flushed to disk), and remote_apply (standby has applied the WAL). remote_write is faster but slightly less safe; remote_apply ensures the change is visible on the standby but adds latency.
  • Implement a fallback mechanism using synchronous_commit per session or per transaction. For critical transactions, you can run SET synchronous_commit = 'on'; and for non-critical ones, use remote_write or even local (which waits only for the primary's own flush, so the standby is effectively asynchronous for that transaction). This way, only a subset of writes is blocked if the standby goes down.
  • Monitor pg_stat_replication for the sync_state column. It shows sync, potential, quorum, or async. A standby that is disconnected has no row in the view at all, so alert both when an expected standby is missing and when no connected standby has been in sync (or quorum) state for more than a few seconds, because in either case your primary may be blocking.

Composite Scenario

A team configured synchronous replication with a single standby: synchronous_standby_names = 'standby1'. During a routine network switch upgrade, the standby lost connectivity for 10 seconds. The primary immediately stopped processing all write transactions. The application's connection pool timed out, and users saw errors. The network came back up after 10 seconds, but the application had already started returning 500 errors, and the connection pool had to be restarted. The outage lasted 15 minutes total.

Bitlox engineers would have recommended using FIRST 1 (standby1, standby2) with a second standby in a different data center. Relaxing synchronous_commit to remote_write would have lowered commit latency, but it would not have helped here, because commits still wait whenever no synchronous standby is connected. Additionally, they would have added a monitoring alert that fires if sync_state changes from sync to anything else for more than 2 seconds.

Mistake 3: Ignoring Network Timeouts and Keepalives — The Zombie Connection

Replication relies on a persistent TCP connection between primary and standby. If that connection drops silently — for example, due to a firewall timeout, a network blip, or a standby crash — the primary may not detect the failure for a long time, leaving a stale replication slot that accumulates WAL and potentially blocking synchronous commits. Meanwhile, the standby may not attempt to reconnect, or may reconnect to a different primary in a failover scenario, causing confusion.

Why It Happens

TCP connections can remain half-open indefinitely if no data is being sent and keepalives are not configured. PostgreSQL's wal_sender_timeout parameter controls how long the primary waits for a response from the standby before closing the connection. It defaults to 60 seconds, which is often slower than you want, and if it has been set to 0 the timeout is disabled: the primary will keep the connection open forever, even if the standby is dead. This leads to replication slots that look active while WAL piles up behind them, eventually filling the disk; synchronous replication blocking if the dead standby was the only synchronous standby; and delayed detection of failures, prolonging downtime.

How Bitlox Engineers Fix It

The fix has four parts:

  • Set wal_sender_timeout to a reasonable value, typically between 5 and 30 seconds. A setting of 10 seconds (wal_sender_timeout = 10s) is a good balance: it detects failures quickly without being too sensitive to transient spikes. On the standby side, wal_receiver_timeout should match or be slightly higher (e.g., 15 seconds).
  • Configure TCP keepalives at the operating system level, or use PostgreSQL's tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count parameters. These ensure that even if application-level timeouts are not triggered, the OS will detect broken connections. Recommended settings: tcp_keepalives_idle = 60, tcp_keepalives_interval = 10, tcp_keepalives_count = 6 (60 seconds of idle time plus 6 probes at 10-second intervals, so roughly 120 seconds before an idle connection is declared dead).
  • Implement connection retry logic on the standby. The standby's primary_conninfo should include connect_timeout=10, and the standby should retry connections automatically (which it does by default with wal_retrieve_retry_interval = 5s).
  • Monitor connection state using pg_stat_replication on the primary. The state column should be streaming for healthy connections. If it shows catchup or backup for extended periods, investigate. Also monitor the status column of pg_stat_wal_receiver on the standby.
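
The keepalive arithmetic is easy to get wrong, so here is a short Python sketch that computes the worst-case detection time and assembles a primary_conninfo string with the matching libpq keepalive options. The function names, the host address, and the replicator role are hypothetical; the connection keywords (connect_timeout, keepalives, keepalives_idle, keepalives_interval, keepalives_count) are standard libpq parameters. The sketch assumes values contain no spaces, so no quoting is applied.

```python
# Sketch: keepalive timing and a matching primary_conninfo string.
# Function names, host, and user are illustrative assumptions.


def keepalive_detection_seconds(idle: int, interval: int, count: int) -> int:
    """Worst-case seconds before keepalives declare an idle connection dead.

    The first probe is sent after `idle` seconds of silence; the connection
    is dropped after `count` unanswered probes sent `interval` seconds apart.
    """
    return idle + interval * count


def build_primary_conninfo(host, user, port=5432, connect_timeout=10,
                           idle=60, interval=10, count=6) -> str:
    """Assemble a libpq connection string for the standby's primary_conninfo."""
    parts = {
        "host": host,
        "port": port,
        "user": user,
        "connect_timeout": connect_timeout,
        "keepalives": 1,  # enable client-side TCP keepalives
        "keepalives_idle": idle,
        "keepalives_interval": interval,
        "keepalives_count": count,
    }
    return " ".join(f"{key}={value}" for key, value in parts.items())
```

With idle=60, interval=10, and count=6 the worst case is 120 seconds, which is why wal_sender_timeout, not keepalives, should be your primary failure detector.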

Composite Scenario

A team had wal_sender_timeout set to 0 (which disables the timeout; the default is 60 seconds) and no TCP keepalives. During a network maintenance window, a firewall rule was changed that dropped all traffic on the replication port for 5 minutes. The primary and standby both kept their TCP connections open (half-open state) because no data was being sent while the standby was idle. After the firewall rule was restored, the connections remained dead because neither side attempted to re-establish them. The standby's replication slot on the primary continued to hold WAL, and after 4 hours, the primary's pg_wal directory filled up, causing a crash. The team had to rebuild both servers from backup, losing 4 hours of data that hadn't been archived yet.

With wal_sender_timeout = 10s and TCP keepalives, the primary would have detected the broken connection within 10 seconds and closed it, and the standby would have reconnected within seconds after the firewall was restored. Once streaming resumed, the slot's restart_lsn would have kept advancing, and no significant WAL would have accumulated.

Tools and Setup: What Bitlox Engineers Use in Practice

Beyond the core configuration changes, Bitlox engineers rely on a set of tools and practices to manage replication at scale. Here are the key ones.

Cluster Management Tools

  • Patroni automates failover, manages replication slots, and handles synchronous replication configuration. It uses a distributed consensus store (etcd, Consul, ZooKeeper) to coordinate. Patroni can dynamically adjust synchronous_standby_names based on the current topology, reducing the risk of misconfiguration.
  • repmgr is a simpler alternative that focuses on replication management and failover. It provides command-line tools for cloning standbys, monitoring lag, and performing switchovers. repmgr creates slots for you only when use_replication_slots is enabled in repmgr.conf; otherwise you must configure them manually.
  • pg_auto_failover is an extension that automates failover with a monitor node. It's easier to set up than Patroni but less flexible.

Monitoring and Alerting

  • pg_stat_replication is the primary source of truth. Query it every 10 seconds and alert on any state that is not streaming for more than 30 seconds.
  • pg_replication_slots: monitor restart_lsn and wal_status (v13+). If wal_status is lost, the slot is broken and needs to be recreated.
  • On the standby, check pg_stat_wal_receiver for status and last_msg_send_time. If the receiver hasn't received a message for more than wal_receiver_timeout, something is wrong.
  • Export these metrics to a monitoring system using Prometheus + postgres_exporter and set up alerts in Alertmanager or similar.
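
As a rough sketch of the alert logic on pg_stat_replication, the Python function below turns rows from that view into alert messages. The function name, the message wording, and the idea of passing in a list of expected standby names are assumptions for illustration; the column names (application_name, state, sync_state) are the real ones from the view.

```python
# Sketch: turn pg_stat_replication rows into alert messages.
# replication_alerts and its arguments are illustrative assumptions.

# Run on the primary.
REPLICATION_SQL = """
SELECT application_name, state, sync_state
FROM pg_stat_replication;
"""


def replication_alerts(rows, expected, required_sync=0):
    """Return a list of human-readable problems.

    rows: dicts with application_name, state, sync_state.
    expected: standby names that should always be connected.
    required_sync: how many standbys must be in sync or quorum state.
    """
    by_name = {row["application_name"]: row for row in rows}
    alerts = []
    for name in expected:
        row = by_name.get(name)
        if row is None:
            # A disconnected standby has no row at all in the view.
            alerts.append(f"{name}: not connected (no row in pg_stat_replication)")
        elif row["state"] != "streaming":
            alerts.append(f"{name}: state is {row['state']}, expected streaming")
    sync_count = sum(1 for row in rows if row["sync_state"] in ("sync", "quorum"))
    if sync_count < required_sync:
        alerts.append(
            f"only {sync_count} synchronous standby(s) connected, need {required_sync}"
        )
    return alerts
```

A real check would also debounce: fire only if the same alert persists across several polls (for example, three polls at 10-second intervals to match the 30-second rule above).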

Testing Replication Failures

Bitlox engineers never trust a configuration without testing. They simulate failures regularly:

  • Kill the standby's postmaster and verify that the primary detects the loss within wal_sender_timeout seconds.
  • Block network traffic between primary and standby using iptables or a network simulator, and verify that connections are re-established after the block is removed.
  • Simulate a standby falling behind by stopping the standby or cutting its network link, and monitor how the primary handles WAL retention (slots should prevent recycling). Pausing WAL apply on the standby (SELECT pg_wal_replay_pause();) is useful for testing replay-lag alerts, but the standby keeps receiving WAL while replay is paused, so it does not exercise slot retention on the primary.
  • Perform a controlled failover using pg_ctl promote or the cluster management tool, and verify that the new primary accepts writes and the old primary can be reattached as a standby.

These tests should be run in a staging environment that mirrors production as closely as possible. Document the expected outcomes and run them after every configuration change.
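
When timing these tests, a small polling helper keeps the measurements repeatable. The Python sketch below is one possible shape, not a Bitlox tool: wait_for_state and its parameters are hypothetical, and fetch_state is assumed to be a callable you supply that returns the standby's current state from pg_stat_replication (or None if it has no row).

```python
# Sketch: measure how long a standby takes to reach a given state.
# wait_for_state and fetch_state are illustrative assumptions.
import time


def wait_for_state(fetch_state, wanted="streaming", timeout=60.0, interval=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_state() until it returns `wanted`.

    Returns the elapsed seconds on success, or None if `timeout` seconds
    pass first. `sleep` and `clock` are injectable so the helper can be
    tested without real waiting.
    """
    start = clock()
    while True:
        if fetch_state() == wanted:
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

For the network-block test, remove the block, call wait_for_state with wanted="streaming", and record the returned time as the reconnect latency; to time failure detection instead, pass wanted=None and a fetch_state that returns None once the standby's row disappears.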

Variations for Different Constraints

Not every team has the luxury of a dedicated network, low latency, or multiple standbys. Here's how the advice changes for common constraints.

Single Standby in the Same Data Center

This is the simplest setup. Use a replication slot, set wal_keep_segments to 1024 as a buffer, and configure synchronous replication with FIRST 1 (standby1) if you need zero data loss. The risk is that a single failure takes down both primary and standby (e.g., a power outage). Consider adding an archive to a remote location for disaster recovery.

Multiple Standbys Across Regions

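Latency between data centers can be 50 to 200 ms, which makes synchronous replication impractical for write-heavy workloads. In this case, use asynchronous replication for the remote standby and keep synchronous replication for a local standby. With synchronous_standby_names = 'FIRST 1 (local_standby, remote_standby)', the primary prefers the local standby, and the remote one stays a potential synchronous standby that takes over only if the local standby disconnects (at which point every commit pays the cross-region latency). If the remote standby must always stay asynchronous, leave it out of the list. Monitor lag on the remote standby and have a plan to rebuild it if it falls too far behind.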

Logical Replication for Selective Data

Logical replication has its own pitfalls: it doesn't replicate DDL changes automatically, and it can be slower than physical replication. The same three mistakes apply, but with different parameters. WAL retention is handled by logical replication slots, which must be monitored carefully because they can retain a large amount of WAL if the subscriber is slow. Logical replication is asynchronous by default, so data loss is possible; it becomes synchronous only if the subscription's name (which is used as its application_name by default) is listed in synchronous_standby_names. Network timeouts are configured via wal_sender_timeout and wal_receiver_timeout as with physical replication.

Cloud-Managed PostgreSQL (RDS, Cloud SQL, etc.)

Managed services handle replication slots and WAL retention automatically, but you still need to monitor lag and failover behavior. The main risk is that you cannot control wal_sender_timeout or TCP keepalives directly — you rely on the provider's defaults. Test failover by initiating a manual failover (most providers support this) and measure the time to recovery. Also, understand the provider's replication model: some use synchronous replication by default, others asynchronous.

FAQ: Common Questions About PostgreSQL Replication

What is the difference between a replication slot and wal_keep_segments?

A replication slot is a persistent record on the primary that tracks which WAL segments a standby has consumed. The primary will not recycle any WAL that is still needed by any slot, even if wal_keep_segments would normally allow it. wal_keep_segments is a simpler mechanism that retains a fixed number of segments regardless of whether a standby needs them. Slots are more precise but require monitoring to prevent unbounded growth if a standby is offline.

Can I use both replication slots and wal_keep_segments?

Yes, and Bitlox engineers recommend it. Slots provide the primary safety net, while wal_keep_segments acts as a secondary buffer in case a slot is accidentally dropped or a new standby connects without a slot. Set wal_keep_segments to a value that covers a few minutes of write activity (e.g., 1024 for moderate workloads).
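
How many segments "a few minutes of write activity" means depends on your WAL generation rate. The Python sketch below does the arithmetic, assuming the default 16 MB segment size; the function names and the 50 MB/s figure in the usage note are illustrative, not measurements.

```python
# Sketch: size a WAL retention buffer from a measured write rate.
# Function names are illustrative; 16 MB is the default segment size.
import math


def segments_for(write_rate_mb_per_s: float, minutes: float,
                 segment_mb: int = 16) -> int:
    """Segments needed to cover `minutes` of WAL at the given rate."""
    return math.ceil(write_rate_mb_per_s * minutes * 60 / segment_mb)


def buffer_minutes(segments: int, write_rate_mb_per_s: float,
                   segment_mb: int = 16) -> float:
    """Minutes of WAL that `segments` segments hold at the given rate."""
    return segments * segment_mb / write_rate_mb_per_s / 60


def keep_size_mb(segments: int, segment_mb: int = 16) -> int:
    """Equivalent wal_keep_size value in megabytes (v13+)."""
    return segments * segment_mb
```

At a sustained 50 MB/s of WAL, 1024 segments (16384 MB) last about five and a half minutes, so busier systems need a proportionally larger buffer.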

How do I recover if a replication slot causes the primary to run out of disk?

If a slot has accumulated too much WAL and the disk is full, you must drop the slot to free up WAL. Use SELECT pg_drop_replication_slot('slot_name'); on the primary. This allows PostgreSQL to recycle WAL up to the point of the oldest remaining slot at the next checkpoint (run CHECKPOINT; to trigger one immediately). Then rebuild the standby from a fresh base backup. To prevent this, set up monitoring that alerts when slot lag exceeds a threshold (e.g., 50 GB or 1 hour of write activity).

What is the best synchronous_commit setting for most applications?

For applications that can tolerate a small window of data loss (less than one second), remote_write is a good balance — it ensures the standby has received the WAL but not necessarily flushed it to disk. For zero data loss, use on. For read-after-write consistency on the standby, use remote_apply, but be aware that it adds latency because the standby must apply the WAL before the commit returns. Bitlox engineers typically default to on for critical systems and remote_write for less critical ones.

How do I test that my replication configuration is correct?

Run a controlled failover test: promote the standby to primary, verify that the old primary can be reattached as a standby, and check that no data is lost. Also, simulate network failures by blocking the replication port and verifying that connections are re-established after the block is removed. Use pg_stat_replication to confirm that the standby returns to streaming state. Document the expected behavior and run these tests after every configuration change.

What to Do Next: Hardening Your Replication Setup

You've read about the three deadliest mistakes and how to avoid them. Now it's time to apply these lessons to your own environment. Here are the specific next steps Bitlox engineers recommend:

  1. Audit your current configuration. Check postgresql.conf on the primary for wal_level, max_wal_senders, wal_keep_segments (or wal_keep_size), max_replication_slots, synchronous_standby_names, and wal_sender_timeout. On each standby, check primary_conninfo and primary_slot_name. Document any deviations from the recommendations above.
  2. Enable replication slots if you haven't already. Create a slot for each physical standby and set primary_slot_name in the standby's configuration. For logical replication, slots are created automatically when you create a subscription, but verify they exist in pg_replication_slots.
  3. Set wal_sender_timeout and TCP keepalives to values that match your network environment. Start with wal_sender_timeout = 10s and OS keepalives as described above. Test by killing the standby's postmaster and timing how long it takes for the primary to detect the loss.
  4. Configure monitoring alerts for replication lag, slot lag, and synchronous standby state changes. Use the queries from this guide and integrate them with your alerting system. Set thresholds that give you time to react before the situation becomes critical (e.g., slot lag > 50 GB, replication lag > 10 minutes).
  5. Test failover in a staging environment. Simulate a primary failure and verify that the standby takes over correctly. Measure the time to failover and the amount of data loss (if any). Repeat this test quarterly or after any significant configuration change.
  6. Document your replication topology including IP addresses, port numbers, slot names, and the expected behavior during failures. Share this documentation with your team and include it in your runbook. Bitlox engineers have found that clear documentation reduces recovery time by half during incidents.
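
The configuration audit in step 1 can be partly scripted. The Python sketch below parses postgresql.conf text and flags three of the risks covered in this guide. It is a deliberately simple illustration: the function names are invented, the parser handles only name = value lines (no include directives, and no # characters inside quoted values), and the three checks are a starting point rather than a complete audit.

```python
# Sketch: flag risky replication settings in postgresql.conf text.
# parse_conf, sync_names, and audit are illustrative names.
import re


def parse_conf(text: str) -> dict:
    """Parse simple `name = value` lines, ignoring comments and blanks."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip().lower()] = value.strip().strip("'")
    return settings


def sync_names(value: str) -> list:
    """Extract standby names from a synchronous_standby_names value."""
    inner = re.search(r"\((.*)\)", value)
    names = inner.group(1) if inner else value
    return [n.strip().strip('"') for n in names.split(",") if n.strip()]


def audit(settings: dict) -> list:
    """Return findings; missing settings fall back to PostgreSQL defaults."""
    findings = []
    if settings.get("wal_level", "replica") not in ("replica", "logical"):
        findings.append("wal_level must be replica or logical for replication")
    if settings.get("wal_sender_timeout", "60s") in ("0", "0s", "0ms"):
        findings.append("wal_sender_timeout is disabled; dead standbys never time out")
    names = sync_names(settings.get("synchronous_standby_names", ""))
    if len(names) == 1 and names[0] != "*":
        findings.append(
            f"only one synchronous standby listed ({names[0]}); "
            "commits block whenever it is down"
        )
    return findings
```

On a live server, querying pg_settings gives the effective values and is more reliable than parsing the file; the file-based version is mainly useful for reviewing configuration in version control.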

Replication is not a set-and-forget feature. It requires ongoing attention and periodic testing. By avoiding the three deadliest mistakes — inadequate WAL retention, misconfigured synchronous replication, and ignored network timeouts — you can build a replication setup that survives real-world failures and keeps your data safe.
