Write Ahead Log (WAL)
Write-Ahead Logging is a standard method for ensuring data
integrity, and it is enabled by default.
The WALs are the REDO logs in PostgreSQL. But, what are the REDO
logs?
REDO logs record all changes made to the database, and they are used for
replication, recovery, online backup and point-in-time recovery (PITR). Any changes that have
not yet been applied to the data pages can be redone from the REDO logs.
Using WAL results in a significantly reduced number of disk
writes, because only the log file needs to be flushed to disk to guarantee that
a transaction is committed, rather than every data file changed by the
transaction.
A WAL record specifies, bit by bit, the changes made to the
data. Each WAL record is appended to a WAL file. The insert position is
a Log Sequence Number (LSN), a byte offset into the logs that increases
with each new record.
The WALs are stored in the pg_xlog (or pg_wal in PostgreSQL 10)
directory, under the data directory. These files have a default size of 16MB
(the size can be changed by altering the --with-wal-segsize configure option
when building the server). They have a unique, incrementing name composed of 24
hexadecimal characters, in the following format: "000000010000000000000001".
The number of WAL files contained in pg_xlog (or pg_wal) will
depend on the value assigned to the parameter checkpoint_segments (or
min_wal_size and max_wal_size, depending on the version) in the postgresql.conf
configuration file.
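As a quick sanity check of the current sizing, and of how many segment files are sitting in the directory right now, we can run something like the sketch below on PostgreSQL 10 (pg_ls_waldir() requires superuser privileges; on older versions we would check checkpoint_segments and list pg_xlog at the operating system level):

-- WAL sizing parameters (PostgreSQL 9.5 and later)
SHOW min_wal_size;
SHOW max_wal_size;

-- Number of files currently in pg_wal (PostgreSQL 10)
SELECT count(*) AS wal_files FROM pg_ls_waldir();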
One parameter that we need to set up when configuring all
our PostgreSQL installations is wal_level. It determines how
much information is written to the WAL. The default value is minimal, which
writes only the information needed to recover from a crash or immediate
shutdown. archive adds the logging required for WAL archiving; hot_standby further
adds the information required to run read-only queries on a standby server; and,
finally, logical adds the information necessary to support logical decoding (since
PostgreSQL 9.6, archive and hot_standby have been merged into the single replica level,
which is the default from version 10). This parameter requires a restart to change, so it
can be hard to change on running production databases if we have forgotten it.
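A minimal sketch of checking and changing it, assuming a PostgreSQL 10 server where we want at least the replica level:

-- Current value
SHOW wal_level;

-- Change it cluster-wide (written to postgresql.auto.conf)
ALTER SYSTEM SET wal_level = 'replica';

-- After a reload, pg_settings flags the parameter as waiting for a restart
SELECT pg_reload_conf();
SELECT name, setting, pending_restart FROM pg_settings WHERE name = 'wal_level';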
For further information, you can check the official
documentation. Now that we've covered the WAL, let's review the
replication history…
History of replication in PostgreSQL
The first replication method (warm standby) that PostgreSQL
implemented (version 8.2, back in 2006) was based on the log shipping method.
This means that the WAL records are directly moved from
one database server to another to be applied. We can say that this is a
continuous PITR.
PostgreSQL implements file-based log shipping by transferring
WAL records one file (WAL segment) at a time.
This replication implementation has the downside that if there
is a major failure on the primary server, transactions not yet shipped will be
lost. So there is a window for data loss (you can tune this by using the
archive_timeout parameter, which can be set to as low as a few seconds, but
such a low setting will substantially increase the bandwidth required for file
shipping).
We can represent this method with the picture below:
PostgreSQL file-based log shipping
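For reference, a minimal archiving setup on the primary could look like the sketch below; the /mnt/wal_archive path is only an example, any location that the standby can read from works:

# postgresql.conf on the primary (sketch)
wal_level = replica        # archive or hot_standby on versions before 9.6
archive_mode = on
# %p is the path of the WAL file to archive, %f is its file name
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
archive_timeout = 60       # force a segment switch at least every 60 seconds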
So, on version 9.0 (back in 2010), streaming replication was
introduced.
This feature allowed us to stay more up-to-date than is possible
with file-based log shipping, by transferring WAL records (a WAL file is
composed of WAL records) on the fly (record based log shipping), between a
master server and one or several slave servers, without waiting for the WAL
file to be filled.
In practice, a process called WAL receiver, running on the slave
server, connects to the master server using a TCP/IP connection. On the
master server another process, named WAL sender, is in charge of
sending the WAL records to the slave server as they happen.
Streaming replication can be represented as following:
PostgreSQL Streaming replication
Looking at the above diagram, we might wonder: what happens when
the communication between the WAL sender and the WAL receiver fails?
When configuring streaming replication, we have the option to
enable WAL archiving.
This step is actually not mandatory, but it is extremely important
for a robust replication setup, as it prevents the main server from
recycling old WAL files that have not yet been applied to the slave. If this
occurs, we will need to recreate the replica from scratch.
When configuring replication with continuous archiving,
we start from a backup and, to reach an in-sync
state with the master, we need to apply all the changes recorded in the WAL
that happened after the backup. During this process, the standby will first
restore all the WAL available in the archive location (done by calling
restore_command). The restore_command will fail when we reach the last archived
WAL record, so after that the standby will look in the pg_wal (pg_xlog)
directory to see if the change exists there (this is actually done to avoid
data loss when the master server crashes and some changes that have already
been moved to the replica and applied there have not yet been archived).
If that fails, and the requested record does not exist there,
then it will start communicating with the master through streaming replication.
Whenever streaming replication fails, it will go back to step 1
and restore the records from archive again. This loop of retries from the
archive, pg_wal, and via streaming replication goes on until the server is
stopped or failover is triggered by a trigger file.
This is a diagram of such a configuration:
PostgreSQL streaming replication with continuous archiving
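On the standby, a recovery.conf that combines the archive and streaming paths could look roughly like this (the host, user, archive path and trigger file are just examples):

# recovery.conf on the standby (sketch)
standby_mode = 'on'
primary_conninfo = 'host=192.168.100.41 port=5432 user=repl'
restore_command = 'cp /mnt/wal_archive/%f "%p"'
trigger_file = '/tmp/failover.trigger'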
Streaming replication is asynchronous by default, so at some
given moment we can have some transactions that can be committed in the master
and not yet replicated into the standby server. This implies some potential
data loss.
However, this delay between the commit and the effect of the changes
on the replica is supposed to be really small (a few milliseconds), assuming of
course that the replica server is powerful enough to keep up with the load.
For the cases when even the risk of a small data loss is not
tolerable, version 9.1 introduced the synchronous replication feature.
In synchronous replication each commit of a write transaction
will wait until confirmation is received that the commit has been written to
the write-ahead log on disk of both the primary and standby server.
This method minimizes the possibility of data loss, as for that
to happen both the master and the standby would need to fail at the same
time.
The obvious downside of this configuration is that the response
time for each write transaction increases, as we need to wait until all parties
have responded. So the time for a commit is, at minimum, the round trip between
the master and the replica. Read-only transactions will not be affected by this.
To set up synchronous replication, we need each of the
standby servers to specify an application_name in the primary_conninfo of the
recovery.conf file: primary_conninfo = '...application_name=slaveX'.
We also need to specify the list of standby servers that
are going to take part in the synchronous replication:
synchronous_standby_names = 'slaveX,slaveY'.
We can set up one or several synchronous servers, and this
parameter also specifies which method (FIRST or ANY) to use when choosing synchronous
standbys from the listed ones. For more information on how to set up this
replication mode, please refer to the official documentation. It is also possible to set up
synchronous replication when deploying via ClusterControl, from version 1.6.1 (which
was just released at the time of writing).
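Putting it together, a sketch of a two-standby synchronous setup could look like this (the host, user and the FIRST 1 quorum are just examples):

# recovery.conf on each standby (sketch)
primary_conninfo = 'host=192.168.100.41 port=5432 user=repl application_name=slaveX'

# postgresql.conf on the master (sketch)
synchronous_commit = on
synchronous_standby_names = 'FIRST 1 (slaveX, slaveY)'

We can then confirm which standby is currently acting as the synchronous one by checking the sync_state column of pg_stat_replication on the master.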
After we have configured our replication, and it is up and
running, we will need to monitor it.
Monitoring PostgreSQL Replication
The pg_stat_replication view on the master server has a lot of
relevant information:
postgres=# SELECT * FROM pg_stat_replication;
 pid | usesysid | usename | application_name |  client_addr   | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-----+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------
 994 |    16467 | repl    | walreceiver      | 192.168.100.42 |                 |       37646 | 2018-05-24 21:27:57.256242-03 |              | streaming | 0/50002C8 | 0/50002C8 | 0/50002C8 | 0/50002C8  |           |           |            |             0 | async
(1 row)
Let's see this in detail:
pid: Process ID of the WAL sender process.
usesysid: OID of the user used for streaming replication.
usename: Name of the user used for streaming replication.
application_name: Application name connected to the master.
client_addr: Address of the standby / streaming replication connection.
client_hostname: Hostname of the standby.
client_port: TCP port number on which the standby is communicating with the WAL sender.
backend_start: Start time when the standby connected to the master.
state: Current WAL sender state, i.e. streaming.
sent_lsn: Last transaction location sent to the standby.
write_lsn: Last transaction written on disk at the standby.
flush_lsn: Last transaction flushed on disk at the standby.
replay_lsn: Last transaction replayed at the standby.
sync_priority: Priority of the standby server for being chosen as the synchronous standby.
sync_state: Sync state of the standby (async or sync).
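A practical use of this view is to compute, on the master, how far behind each standby is in bytes; a sketch using pg_wal_lsn_diff() (pg_xlog_location_diff() on versions before 10):

SELECT application_name,
       client_addr,
       state,
       sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS sent_lag_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;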
We can also see the WAL sender/receiver processes running on the
servers.
Sender (Primary Node):
[root@postgres1 ~]# ps aux | grep postgres
postgres   833  0.0  1.6 392032 16532 ?  Ss  21:25  0:00 /usr/pgsql-10/bin/postmaster -D /var/lib/pgsql/10/data/
postgres   847  0.0  0.1 244844  1900 ?  Ss  21:25  0:00 postgres: logger process
postgres   850  0.0  0.3 392032  3696 ?  Ss  21:25  0:00 postgres: checkpointer process
postgres   851  0.0  0.3 392032  3180 ?  Ss  21:25  0:00 postgres: writer process
postgres   852  0.0  0.6 392032  6340 ?  Ss  21:25  0:00 postgres: wal writer process
postgres   853  0.0  0.3 392440  3052 ?  Ss  21:25  0:00 postgres: autovacuum launcher process
postgres   854  0.0  0.2 247096  2172 ?  Ss  21:25  0:00 postgres: stats collector process
postgres   855  0.0  0.2 392324  2504 ?  Ss  21:25  0:00 postgres: bgworker: logical replication launcher
postgres   994  0.0  0.3 392440  3528 ?  Ss  21:27  0:00 postgres: wal sender process repl 192.168.100.42(37646) streaming 0/50002C8
Receiver (Standby Node):
[root@postgres2 ~]# ps aux | grep postgres
postgres   833  0.0  1.6 392032 16436 ?  Ss  21:27  0:00 /usr/pgsql-10/bin/postmaster -D /var/lib/pgsql/10/data/
postgres   848  0.0  0.1 244844  1908 ?  Ss  21:27  0:00 postgres: logger process
postgres   849  0.0  0.2 392128  2580 ?  Ss  21:27  0:00 postgres: startup process   recovering 000000010000000000000005
postgres   851  0.0  0.3 392032  3472 ?  Ss  21:27  0:00 postgres: checkpointer process
postgres   852  0.0  0.3 392032  3216 ?  Ss  21:27  0:00 postgres: writer process
postgres   853  0.0  0.1 246964  1812 ?  Ss  21:27  0:00 postgres: stats collector process
postgres   854  0.0  0.3 398860  3840 ?  Ss  21:27  0:05 postgres: wal receiver process   streaming 0/50002C8
One way of checking how up to date our replication is, is to
check the amount of WAL generated on the primary but not yet
applied on the standby.
Master:
postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/50002C8
(1 row)
Note: This function is for PostgreSQL 10. For previous versions,
you need to use: SELECT pg_current_xlog_location();
Slave:
postgres=# SELECT pg_last_wal_receive_lsn();
 pg_last_wal_receive_lsn
-------------------------
 0/50002C8
(1 row)
postgres=# SELECT pg_last_wal_replay_lsn();
 pg_last_wal_replay_lsn
------------------------
 0/50002C8
(1 row)
Note: These functions are for PostgreSQL 10. For previous
versions, you need to use: SELECT pg_last_xlog_receive_location(); and SELECT
pg_last_xlog_replay_location();
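We can also combine these two functions on the standby to express the replay lag in bytes, for example:

-- Bytes received from the master but not yet replayed (PostgreSQL 10;
-- use pg_xlog_location_diff() and the *_location() functions on older versions)
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_lag_bytes;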
We can use the following query to get the lag in seconds.
PostgreSQL 10:
SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
            THEN 0
            ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
       END AS log_delay;
Previous Versions:
SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
            THEN 0
            ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
       END AS log_delay;
Output:
postgres=# SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
postgres-# THEN 0
postgres-# ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
postgres-# END AS log_delay;
 log_delay
-----------
         0
(1 row)
Deploying PostgreSQL replication setups
ClusterControl can be used to deploy this kind of replication setup.
It also allows us to monitor the replication lag, as well as
other key metrics.
As streaming replication is based on shipping WAL records,
which are then applied to the standby server, it basically tells us which bytes
to add or change in which file. As a result, the standby server is actually a
bit-by-bit copy of the master.
We have here some well known limitations:
- We cannot replicate into a different version or architecture.
- We cannot change anything on the standby server.
- We do not have much granularity on what we can replicate.
So, to overcome these limitations, PostgreSQL 10 added
support for logical replication.
Logical Replication
Logical replication will also use the information in the WAL
file, but it will decode it into logical changes. Instead of knowing which byte
has changed, we will know exactly what data has been inserted in which table.
It is based on a publish and subscribe model with one or more
subscribers subscribing to one or more publications on a publisher node, which
looks like this:
PostgreSQL Logical Replication
With this replication option, many use cases become
possible, like replicating only some of the tables or consolidating multiple
databases into a single one.
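As a simple sketch of how this looks in practice (the table names, database and connection details here are hypothetical), we create a publication on the publisher for the tables we want to replicate, and a subscription on the subscriber pointing to it:

-- On the publisher (requires wal_level = logical)
CREATE PUBLICATION my_pub FOR TABLE customers, orders;

-- On the subscriber (the tables must already exist there with a matching schema)
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=192.168.100.41 dbname=mydb user=repl'
    PUBLICATION my_pub;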
What new features will come? We will need to stay tuned and
check, but we hope that master-master built-in replication is not far away.