Setup MariaDB Replication With MariaBackup

环境介绍

Server OS: CentOS 8.2 x64

MariaDB: 10.5.x （主、从均运行在 Docker / Podman 容器里）

操作目标：配置一台全新的从库

假设主库IP是 dbserver1

从库IP是 dbserver2

使用 MariaBackup 做全量或增量同步，更加方便。

安装MariaBackup

首先我们需要在 主 / 从 服务器上分别安装 MariaBackup.

这里比较简单，参考官方文档就好了。

# CentOS 8 / Stream
sudo dnf install MariaDB-backup

# Debian / Ubuntu
sudo apt-get install mariadb-backup

准备全量同步备份文件

这里假设，master 机已经开启 binlog 和 master 模式。我们需要给整个库做一个备份。

由于我们是针对整个 master 服务器备份，因此这里没有指定 --databases=数据名 参数。

mariabackup --verbose --rsync --backup \
   --target-dir=/home/user001/backup/mysql/ \
   --host=dbserver1 --port=3307 --protocol=tcp \
   --user=root --password=root密码在这

这里一定会出错。因为我们的 MariaDB 跑在容器里，host 机并不存在 /var/lib/mysql 目录。

解决办法是：

Create a symlink on your host so that Mariabackup thinks/var/lib/mysql exists. Then delete the symlink once you're done.

ln -s /home/user007/data/mysql /var/lib/mysql

然后再运行命令，就OK了。如果备份成功，最后一部分的日志输出大概如下：

[00] 2021-03-04 05:54:58 Writing xtrabackup_binlog_info
[00] 2021-03-04 05:54:58         ...done
[00] 2021-03-04 05:54:58 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
[00] 2021-03-04 05:54:58 mariabackup: The latest check point (for incremental): '2611384153422'
mariabackup: Stopping log copying thread.[00] 2021-03-04 05:54:58 >> log scanned up to (2611437102980)

[00] 2021-03-04 05:54:58 >> log scanned up to (2611437103220)
[00] 2021-03-04 05:54:58 Executing BACKUP STAGE END
[00] 2021-03-04 05:54:58 All tables unlocked
[00] 2021-03-04 05:54:58 Backup created in directory '/home/user007/backup/mysql/'
[00] 2021-03-04 05:54:58 MySQL binlog position: filename 'master16-bin.003097', position '15736398', GTID of the last change '0-101-9088409195'
[00] 2021-03-04 05:54:58 Writing backup-my.cnf
[00] 2021-03-04 05:54:58         ...done
[00] 2021-03-04 05:54:58 Writing xtrabackup_info
[00] 2021-03-04 05:54:58         ...done
[00] 2021-03-04 05:54:58 Redo log (from LSN 2611376635865 to 2611437103220) was copied.
[00] 2021-03-04 05:54:59 completed OK!

这样还没完，因为现在copy出来的备份文件，不能直接用于 MariaDB 恢复备份。

原因在这里有解释：

Prepares an existing backup to restore to the MariaDB Server.
Files that Mariabackup generates during --backup operations in the target directory are not ready for use on the Server. Before you can restore the data to MariaDB, you first need to prepare the backup.
In the case of full backups, the files are not point in time consistent, since they were taken at different times. If you try to restore the database without first preparing the data, InnoDB rejects the new data as corrupt. Running Mariabackup with the --prepare command readies the data so you can restore it to MariaDB Server. When working with incremental backups, you need to use the --prepare command and the --incremental-dir option to update the base backup with the deltas from an incremental backup.

因此，对于增量备份的 prepare 操作，其实还需要加 --incremental-dir 选项。

mariabackup --prepare --verbose --rsync --target-dir=/home/user007/backup/mysql/
......
2021-03-04  6:02:53 0 [Note] InnoDB: apply 2611435179079: [page id: space=129, page number=2125]
2021-03-04  6:02:53 0 [Note] InnoDB: apply 2611436322402: [page id: space=129, page number=2177]
2021-03-04  6:02:53 0 [Note] InnoDB: apply 2611436322402: [page id: space=129, page number=2177]
2021-03-04  6:02:53 0 [Note] InnoDB: apply 2611436391525: [page id: space=129, page number=2177]
2021-03-04  6:02:53 0 [Note] InnoDB: apply 2611436391525: [page id: space=129, page number=2177]
2021-03-04  6:02:53 0 [Note] InnoDB: apply 2611435179079: [page id: space=129, page number=2125]
2021-03-04  6:02:53 0 [Note] InnoDB: apply 2611435368189: [page id: space=129, page number=2125]
2021-03-04  6:02:53 0 [Note] InnoDB: apply 2611435368189: [page id: space=129, page number=2125]
[00] 2021-03-04 06:02:53 Last binlog file , position 0
[00] 2021-03-04 06:02:53 completed OK!

复制备份文件到从库所在主机

# 首先 ssh 到从库机器, 准备好存放备份的目录
mkdir -p /home/user007/backup/mysql

# 其实我这里的机器并不是22端口
# 没错，当然是走内网地址
rsync -e 'ssh -p 2222' -avP /home/user007/backup/mysql root@dbserver2:/home/user007/backup/

准备好从库同步需要的账号

CREATE USER 'replication_user'@'dbserver2' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.*  TO 'replication_user'@'dbserver2';

在新的从库上恢复备份

首先 systemctl stop ctr-site007-mariadb 停止掉 MariaDB 服务器, 准备恢复备份。

为什么不是 docker stop 或 podman stop ? 因为这个容器是用 podman generate 生成了 systemd unit 启动服务的，如果从 podman 停止， systemd 又会自动重启容器。所以，下面看到 systemctl 操作 ctr 开头的 service 时，请自行替换成你自己的命令。

恢复其实就是 copy 准备好的备份文件到 MariaDB datadir (也就是/var/lib/mysql）。

由于从为这边我们也是使用容器跑的 MariaDB, 因此同样要给做软链：

ln -s /home/user007/data/mysql /var/lib/mysql

开始恢复备份文件：

mariabackup --copy-back --verbose --rsync \
   --target-dir=/home/user007/backup/mysql/
   
   
# 相关日志输出
[01] 2021-03-04 06:27:14 Copying ./xtrabackup_info to /var/lib/mysql/xtrabackup_info
[01] 2021-03-04 06:27:14         ...done
[00] 2021-03-04 06:27:14 completed OK!

# 检查下恢复的文件
[root@replica mysql]# ls /var/lib/mysql/
aria_log.00000001  aria_log_control  ib_buffer_pool  ibdata1  mysql  performance_schema  site007_prod  xtrabackup_info

注意，执行这个操作的前提是 /var/lib/mysql 目录是空的。也就是如果是我们新跑的服务器，里面没数据的，我们也要清空这个目录，不然 MariaBackup 不会执行恢复备份的操作（会有错误： Original data directory /var/lib/mysql is not empty!）。

根据情况的不同，可能还要恢复一下文件权限，比如 chown -R mysql:mysql /var/lib/mysql/

查看一下 xtrabackup_binlog_info 这个文件的内容，这里第三个值 0-101-9090487761 就是 GTID， copy出来先找个地方暂存一下，等会要用：

[root@replica mysql]# cat xtrabackup_binlog_info
master16-bin.003097     775477086       0-101-9090487761

开启从库同步

启动从服务器： systemctl start ctr-site007-mariadb

登录从库mysql -u root -h dbserver2 -P 3307 -p：

STOP SLAVE;
RESET SLAVE;

SET GLOBAL gtid_slave_pos = "0-101-9090487761";


CHANGE MASTER TO
  MASTER_HOST='dbserver1',
  MASTER_USER='replication_user',
  MASTER_PASSWORD='主库同步密码',
  MASTER_PORT=3307,
  MASTER_CONNECT_RETRY=10,
  MASTER_USE_GTID=slave_pos;
  
START SLAVE;

查看一下同步状态是否正常：

比较需要关注的几个选项：

Slave_IO_State: Queueing master event to the relay log

Master_Log_File: master16-bin.003097 Read_Master_Log_Pos: 842451034 Relay_Log_File: master94-relay-bin.000002 Relay_Log_Pos: 649307

Slave_IO_Running: Yes Slave_SQL_Running: Yes

Using_Gtid: Slave_Pos Gtid_IO_Pos: 0-101-9090669366

Seconds_Behind_Master: 301

Seconds_Behind_Master 是最重要的一个参数。300多秒的延时对于一般的同步来说算是问题比较大了，但是这里由于我们是刚刚启动复制，才开始第一次同步，因此这个情况是正常的。过一会这个值就会正常了(正常来说应该是接近0)。

MariaDB [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                Slave_IO_State: Queueing master event to the relay log
                   Master_Host: dbserver1
                   Master_User: replication_user
                   Master_Port: 3307
                 Connect_Retry: 10
               Master_Log_File: master16-bin.003097
           Read_Master_Log_Pos: 842451034
                Relay_Log_File: master94-relay-bin.000002
                 Relay_Log_Pos: 649307
         Relay_Master_Log_File: master16-bin.003097
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB: site007_prod
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 776125657
               Relay_Log_Space: 66974795
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 301
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 101
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-101-9090669366
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: optimistic
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: init for update
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 1765
1 row in set (0.000 sec)

Misc

ERROR 1198

ERROR 1198 (HY000): This operation cannot be performed as you have a running slave ''; run STOP SLAVE '' first

这个错误一般出现在 CHANGE MASTER TO ... 配置slave时。原因是前面的 slave 配置还在，STOP SLAVE 然后 RESET SLAVE 再执行 CHANGE MASTER TO ... 就好了。

ERROR 1932

#1932 - Table 'site007_prod.users' doesn't exist in engine
Copy the ib_logfileXX and ibdata file from old mysql/data folder to the new mysql data folder and it will fix the issue

这个主要是没有使用 MariaBackup 恢复备份，而是采用 rsync 手动 copy 文件到 MairaDB datadir 下的。

并且，少复制了 ib_logfileXX 和 ibdata 等文件。所以，还是用 MariaBackup 恢复的好，除非你非常了解 MariaBackup 的工作原理。

如何安全地清理 binlog

这里主要是讲主库的binlog (从库一般不再作为别人的主，我们不开）

命令格式：

PURGE { BINARY | MASTER } LOGS { TO 'log_name' | BEFORE datetime_expr }

The PURGE BINARY LOGS statement deletes all the binary log files listed in the log index file prior to the specified log file name or date. BINARY and MASTER are synonyms.
Deleted log files also are removed from the list recorded in the index file, so that the given log file becomes the first in the list. The datetime expression is in the format 'YYYY-MM-DD hh:mm:ss'.

注意这里是 prior to (在…之前), 首先我们要检查从库最慢的那个节点（假设有多个从库的话）的同步情况，比如Slave status中的 Master_Log_File = master16-bin.003103

Relay_Log_File master94-relay-bin.000015

Relay_Master_Log_File master16-bin.003103

表示当前 slave 正在 replicate master 那边的 master16-bin.003103，因此， master 那边可以清空在 master16-bin.003103 之前的所有 binlog 文件：

PURGE BINARY LOGS TO 'master16-bin.003103'

如何使用 podman 生成 systemd unit 文件

cd /etc/systemd/system/

podman generate systemd --new -t 10 \
--container-prefix ctr -f -n site007-mariadb

# 启用并启动服务
systemctl enable --now ctr-site007-mariadb