Primary-Backup Replication

What is Database Replication?

Replication is the process of copying data from a source database server to one or more destination servers. It's a fundamental technique for achieving:

High Availability (HA): If the primary server fails, a replica can take over, minimizing downtime.
Read Scalability: Read queries can be distributed across replicas, reducing load on the primary server.
Disaster Recovery & Backups: Replicas can be placed in a geographically distant location, or used to take backups without adding load to the primary.

Primary-Backup (Master-Slave) Model

This is a very common replication topology. One server is designated as the Primary (or Master), and one or more servers act as Backups (or Slaves/Replicas).

The Primary node handles all write operations (INSERT, UPDATE, DELETE).
The Primary logs these changes and sends them to the Backup nodes.
Backup nodes receive the changes from the Primary and apply them to their own copy of the data.
Backup nodes can often serve read-only queries.
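
The four roles above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular database implements replication: the `Primary` and `Backup` classes, the `("set" | "delete", key, value)` record shape, and the key names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Hypothetical backup node: keeps a copy of the data, applies log records."""
    data: dict = field(default_factory=dict)
    applied_index: int = 0  # number of log records applied so far

    def apply(self, record: tuple) -> None:
        op, key, value = record
        if op == "set":
            self.data[key] = value
        elif op == "delete":
            self.data.pop(key, None)
        self.applied_index += 1

    def read(self, key):
        # Backups serve read-only queries; clients never write to them.
        return self.data.get(key)


@dataclass
class Primary:
    """Hypothetical primary node: the only node that accepts writes."""
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)
    backups: list = field(default_factory=list)

    def write(self, op: str, key: str, value=None) -> None:
        record = (op, key, value)
        self.log.append(record)        # log the change
        if op == "set":                # apply it to the primary's own data
            self.data[key] = value
        elif op == "delete":
            self.data.pop(key, None)
        for backup in self.backups:    # send the record to every backup
            backup.apply(record)


backup_a, backup_b = Backup(), Backup()
primary = Primary(backups=[backup_a, backup_b])
primary.write("set", "user:1", "alice")
primary.write("set", "user:2", "bob")
primary.write("delete", "user:2")
print(backup_a.read("user:1"), backup_b.data == primary.data)
```

Because every change goes through `Primary.write`, each backup sees the same records in the same order and ends up with the same data as the primary.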

Write Flow

Client sends a write request to the Primary.
Primary processes the write and applies it locally (e.g., by appending to its transaction log and updating its data files).
Primary sends the update (or log record) to all connected Backup nodes.
Backup nodes receive the update and apply it locally.
Primary confirms the write to the client. When it does so depends on the sync/async mode (see Synchronous vs. Asynchronous Replication below).
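
As a rough sketch of the write flow, the example below ships numbered log records and lets a backup that connects late replay what it missed. The log sequence number (`lsn`), the `LogRecord`, `PrimaryNode` and `BackupNode` names, and the `catch_up` helper are assumptions introduced for illustration; real systems differ in how they number and transfer log records.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogRecord:
    lsn: int    # log sequence number, assigned by the primary
    key: str
    value: str


@dataclass
class BackupNode:
    data: dict = field(default_factory=dict)
    last_lsn: int = 0  # LSN of the last record applied

    def receive(self, record: LogRecord) -> bool:
        # Apply records strictly in order: duplicates and gaps are rejected.
        if record.lsn != self.last_lsn + 1:
            return False
        self.data[record.key] = record.value
        self.last_lsn = record.lsn
        return True


@dataclass
class PrimaryNode:
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)
    connected: list = field(default_factory=list)

    def handle_write(self, key: str, value: str) -> int:
        record = LogRecord(lsn=len(self.log) + 1, key=key, value=value)
        self.log.append(record)         # append to the local log...
        self.data[key] = value          # ...and update the local data
        for backup in self.connected:   # send to all connected backups,
            backup.receive(record)      # which apply the record locally
        return record.lsn

    def catch_up(self, backup: BackupNode) -> None:
        # Replay every record the backup has not applied yet.
        for record in self.log[backup.last_lsn:]:
            backup.receive(record)


b1, b2 = BackupNode(), BackupNode()
primary = PrimaryNode(connected=[b1])
primary.handle_write("a", "1")
primary.handle_write("b", "2")
primary.catch_up(b2)              # b2 connects late and replays the log
primary.connected.append(b2)
primary.handle_write("a", "3")
print(b1.data == primary.data, b2.data == primary.data, b2.last_lsn)
```

Checking the sequence number before applying makes delivery safe to retry: a record that arrives twice is ignored rather than applied again.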

Read Flow

Clients can typically read from the Primary for the most up-to-date data.
Clients can *optionally* read from Backup nodes. This scales read capacity but might return slightly stale data (depending on replication lag, especially in async mode).
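
The read trade-off can be illustrated with a small router that rotates reads across backups unless the caller asks for fresh data. The `Node` and `ReadRouter` classes and the `require_fresh` flag are hypothetical names for this sketch, and the lag is simulated by hand-setting an older value on one backup.

```python
import itertools


class Node:
    """Hypothetical node holding a key-value copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class ReadRouter:
    def __init__(self, primary, backups):
        self.primary = primary
        # Rotate through backups in round-robin order, if there are any.
        self._backups = itertools.cycle(backups) if backups else None

    def read(self, key, require_fresh=False):
        # Reads that need the latest value go to the primary;
        # everything else is spread across the backups.
        if require_fresh or self._backups is None:
            node = self.primary
        else:
            node = next(self._backups)
        return node.name, node.data.get(key)


primary = Node("primary")
caught_up = Node("backup-1")
lagging = Node("backup-2")

primary.data["balance"] = 120
caught_up.data["balance"] = 120
lagging.data["balance"] = 100      # has not received the latest update yet

router = ReadRouter(primary, [caught_up, lagging])
first = router.read("balance")
second = router.read("balance")
fresh = router.read("balance", require_fresh=True)
print(first, second, fresh)
```

The second read lands on the lagging backup and returns the old value, which is the staleness described above. Routing only staleness-tolerant reads to backups is one common way to get the extra capacity without surprising clients.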

Synchronous vs. Asynchronous Replication

Synchronous (Sync): The Primary waits for at least one (or sometimes all) Backup nodes to acknowledge that they have received and applied (or at least durably stored) the update before confirming the write success back to the client.
- (+) Stronger consistency, less chance of data loss on failover.
- (-) Higher write latency; writes on the Primary can stall if a Backup is slow or unavailable.
Asynchronous (Async): The Primary sends updates to Backups but confirms write success to the client immediately without waiting for acknowledgment from Backups.
- (+) Lower write latency, Primary performance not directly impacted by slow Backups.
- (-) Potential for data loss if the Primary fails before updates reach the Backups. Reads from Backups may return data that is significantly out of date when replication lag is high.
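
The difference between the two modes comes down to when the Primary answers the client. The sketch below models that with a single `synchronous` flag; the `Replica` and `ReplicatedPrimary` classes, the `min_acks` parameter, and the `ship_pending` method (a stand-in for a background sender) are assumptions for illustration, and network delays and timeouts are left out.

```python
class Replica:
    """Hypothetical backup that durably stores records and acknowledges them."""

    def __init__(self):
        self.log = []

    def store(self, record):
        self.log.append(record)
        return True  # acknowledgment sent back to the primary


class ReplicatedPrimary:
    def __init__(self, replicas, synchronous, min_acks=1):
        self.replicas = replicas
        self.synchronous = synchronous
        self.min_acks = min_acks
        self.log = []
        self.outbox = []  # records not yet shipped (async mode only)

    def write(self, record):
        self.log.append(record)
        if self.synchronous:
            # Sync: ship now and count acknowledgments before replying.
            acks = sum(1 for r in self.replicas if r.store(record))
            if acks < self.min_acks:
                raise RuntimeError("not enough backups acknowledged the write")
        else:
            # Async: queue the record and reply to the client right away.
            self.outbox.append(record)
        return "ok"

    def ship_pending(self):
        # Stand-in for the background sender used in async mode.
        for record in self.outbox:
            for r in self.replicas:
                r.store(record)
        self.outbox.clear()


sync_backup = Replica()
sync_primary = ReplicatedPrimary([sync_backup], synchronous=True)
sync_primary.write("x=1")
print("sync backup has:", sync_backup.log)

async_backup = Replica()
async_primary = ReplicatedPrimary([async_backup], synchronous=False)
async_primary.write("x=1")
log_before_shipping = list(async_backup.log)   # what survives a crash right now
async_primary.ship_pending()
print("async backup before/after shipping:", log_before_shipping, async_backup.log)
```

In the async case the client has already been told "ok" while the backup's log is still empty. If the Primary crashed at that moment, the acknowledged write would be lost on failover.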

Failover

If the Primary node fails:

The failure needs to be detected (often via health checks or lack of heartbeats - simplified here).
A Backup node needs to be chosen as the new Primary. This might involve manual intervention or an automated process (e.g., running a consensus protocol among the nodes, or picking the replica that has applied the most updates, which minimizes the number of lost writes).
The chosen Backup is promoted to become the new Primary.
Other remaining Backups need to be reconfigured to replicate from the *new* Primary.
Clients need to be redirected to connect to the new Primary for writes.
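
The failover steps above can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: failure detection is a plain heartbeat timeout, the most up-to-date backup is the one with the highest `last_lsn`, and the `ClusterNode`, `is_failed` and `fail_over` names are hypothetical. A timeout alone cannot tell a crashed Primary from an unreachable one, so real deployments usually add fencing or a consensus step before promoting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterNode:
    name: str
    last_lsn: int            # how much of the log this node has applied
    last_heartbeat: float    # time the last heartbeat was seen
    role: str = "backup"
    replicates_from: Optional[str] = None


def is_failed(node, now, timeout=5.0):
    # Detection: no heartbeat within the timeout window.
    return now - node.last_heartbeat > timeout


def fail_over(primary, backups, now, timeout=5.0):
    """Return the new primary, or the current one if it is still healthy."""
    if not is_failed(primary, now, timeout):
        return primary
    candidates = [b for b in backups if not is_failed(b, now, timeout)]
    if not candidates:
        raise RuntimeError("no healthy backup available for promotion")
    # Choose the healthy backup that has applied the most log records.
    new_primary = max(candidates, key=lambda b: b.last_lsn)
    new_primary.role = "primary"          # promote
    new_primary.replicates_from = None
    primary.role = "failed"
    for b in candidates:                  # repoint the remaining backups
        if b is not new_primary:
            b.replicates_from = new_primary.name
    return new_primary


old = ClusterNode("db-1", last_lsn=100, last_heartbeat=0.0, role="primary")
b2 = ClusterNode("db-2", last_lsn=98, last_heartbeat=9.0, replicates_from="db-1")
b3 = ClusterNode("db-3", last_lsn=100, last_heartbeat=9.5, replicates_from="db-1")

new = fail_over(old, [b2, b3], now=10.0)
write_endpoint = new.name   # clients send writes here from now on
print(write_endpoint, b2.replicates_from, old.role)
```

Here `db-3` is promoted because it has applied more of the log than `db-2`, and `db-2` is repointed to replicate from it. Redirecting clients is reduced to updating `write_endpoint`; in practice this is often done through a proxy, a virtual IP, or service discovery.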