MySQL · 2019-03-08 0

MySQL GTID 日常维护案例两则

前言

在测试环境开启GTID运行一年多之后,我们准备近期上线生产。为了保证GTID顺利的上线,在测试环境模拟各种故障场景,观察GTID 的表现

案例采用sandbox测试,GTID的一些关键测试参数如下

master-info-repository=table
relay-log-info-repository=table
gtid_mode=ON
log-slave-updates
enforce-gtid-consistency
relay_log_recovery=on

案例一

其中Master的UUID是00021800-1111-1111-1111-111111111111,Slave的UUID是 00021802-3333-3333-3333-333333333333,我们简称 Master的UUID为1111,Slave的UUID为3333。

时间线 Master Slave 备注
t1 CREATE TABLE t1 生成的GTID为1111:1856
t2 STOP SLAVE SQL_THREAD 停止从库的SQL THREAD 线程
t3 RESET MASTER; 清空从库的gtid_executed和binlog信息
t4 SET GLOBAL GTID_PURGED=’1111:1-1857′ 手动让从库认为1111:1857已经执行
t5 DROP TABLE t1 主库删除t1,GTID为1111:1857
t6 CREATE TABLE t1 主库创建t1,GTID为1111:1858
t7 START SLAVE SQL_THREAD 启动从库SQL线程
t8 SHOW SLAVE STATUS\G Error ‘Table ‘t1′ already exists’
t9 STOP SLAVE SQL_THREAD;
t10 RESET MASTER
t11 SET GLOBAL GTID_PURGED=’1111:1-1856′ 让从库重新去执行1111:1857的事务
t12 START SLAVE SQL_THREAD; Error ‘Table ‘t1′ already exists’ on query

当我们不小心在从库让 SQL THREAD 跳过一个GTID事务的时候,想让从库重新执行这个跳过的GTID,却始终报错,我们把从库的线程的关键信息提供一下

              Master_Log_File: mysql-bin.000009
          Read_Master_Log_Pos: 2768
               Relay_Log_File: mysql-relay.000006
                Relay_Log_Pos: 594
        Relay_Master_Log_File: mysql-bin.000009
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 2606       
               Last_SQL_Errno: 1050
               Last_SQL_Error: Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'create table t1(id int)'
  Replicate_Ignore_Server_Ids:
           Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1857-1858
            Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1856
                Auto_Position: 1      

刚刚我们执行的操作只是关闭了SQL THREAD,那么主库执行的DROP TABLE操作(被从库第一次手动跳过的)以及CREATE 操作都应该在relay log里面,我们解析一下,结果删去了一些不必要的输出。

$ mysqlbinlog mysql-relay.000006
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#190308 10:56:36 server id 102  end_log_pos 123 CRC32 0x73fa3393   Start: binlog v 4, server v 5.7.25-28-log created 190308 10:56:36
# This Format_description_event appears in a relay log and was generated by the slave thread.
# at 123
#190308 10:56:36 server id 102  end_log_pos 154 CRC32 0x4c964c64   Previous-GTIDs
# [empty]
# at 154
#700101  8:00:00 server id 1  end_log_pos 0 CRC32 0x6677862d   Rotate to mysql-bin.000009  pos: 4
# at 201
#190308  9:30:10 server id 1  end_log_pos 123 CRC32 0x536ddf69     Start: binlog v 4, server v 5.7.25-28-log created 190308  9:30:10

# at 320
#190308 10:56:36 server id 0  end_log_pos 367 CRC32 0x72b7d6b4     Rotate to mysql-bin.000009  pos: 194
# at 367
#190308 10:56:36 server id 0  end_log_pos 414 CRC32 0x73628a00     Rotate to mysql-bin.000009  pos: 2426
# at 414
#190308 10:57:44 server id 1  end_log_pos 2491 CRC32 0x18c5896c    GTID    last_committed=13   sequence_number=14  rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1857'/*!*/;
# at 479
#190308 10:57:44 server id 1  end_log_pos 2606 CRC32 0x7c575de0    Query   thread_id=4 exec_time=0 error_code=0
use `test`/*!*/;
DROP TABLE `t1` /* generated by server */  删除表的操作
/*!*/;
# at 594
#190308 10:57:48 server id 1  end_log_pos 2671 CRC32 0x7138d3a8    GTID    last_committed=14   sequence_number=15  rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1858'/*!*/;
# at 659
#190308 10:57:48 server id 1  end_log_pos 2768 CRC32 0x3a8eeb36    Query   thread_id=4 exec_time=0 error_code=0
SET TIMESTAMP=1552013868/*!*/;
create table t1(id int)                创建表的操作
/*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

我们可以看到 GTID 1111:1857 和 GTID 1111:1858 都在relay log里面,但是SQL thread 始终没有执行,我们看看为什么

Relay_Log_Pos = 594 表示的是SQL thread 执行的位点在relay log 的位置

我们看看594 在relay log 是什么event


# at 594 #190308 10:57:48 server id 1 end_log_pos 2671 CRC32 0x7138d3a8 GTID last_committed=14 sequence_number=15 rbr_only=no SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1858'/*!*/; # at 659 #190308 10:57:48 server id 1 end_log_pos 2768 CRC32 0x3a8eeb36 Query thread_id=4 exec_time=0 error_code=0 SET TIMESTAMP=1552013868/*!*/; create table t1(id int)

但是DROP TABLE 的位点是在 414 的位点,而SQL thread 位点却在594,所以始终没办法执行到 414 位点的DROP TABLE 操作,无法修复,那是为啥呢?

因为位点为414 的DROP TABLE 操作的GTID 是 1111:1857 ,被我们人为的认为从库已经执行。当从库线程发现GTID 为1111:1857已经执行过,则跳过该事务,继续往下执行回放。当执行到位点594时,即GTID为1111:1858。这时候,当我们在t11时刻,设置gtid_purged=1111:1-1856’,让从库重新执行GTID 为1111:1157 的事务时,SQL thread 还是从relay log 的594位点开始执行,无法恢复。那怎么修复呢?

解决方案

1.让SQL thread 从414 位点开始执行

STOP SLAVE;
CHANGE MASTER TO RELAY_LOG_FILE='mysql-relay.000006' ,RELAY_LOG_POS=414 ,MASTER_AUTO_POSITION=0;
CHANGE MASTER TO MASTER_AUTO_POSITION=1;
START SLAVE;
SHOW SLAVE STATUS\G

我们再观察下slave的状态

              Master_Log_File: mysql-bin.000009
          Read_Master_Log_Pos: 2768
               Relay_Log_File: mysql-relay.000002
                Relay_Log_Pos: 756
        Relay_Master_Log_File: mysql-bin.000009
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

          Exec_Master_Log_Pos: 2768

           Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1857-1858
            Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1858

我们看到1111:1157 已经被加入Executed_Gtid_Set。

2.重新启动MySQL Slave

因为relay_log_recovery为on的状态,老的relay log 会被删除,重新根据gtid_executed 去主库拉取binlog,这时候就会顺利往下执行了。

案例二

时间线 Master Slave 备注
t1 CREATE TABLE t1 生成的GTID为1111:1858
t2 STOP SLAVE 停止从库的线程
t3 RESET MASTER; 清空从库的gtid_executed和binlog信息
t4 SET GLOBAL GTID_PURGED=’1111:1-1859′ 手动让从库认为1111:1859已经执行
t5 DROP TABLE t1 主库删除t1,GTID为1111:1859
t6 CREATE TABLE t1 主库创建t1,GTID为1111:1860
t7 START SLAVE 启动从库线程
t8 SHOW SLAVE STATUS\G Error ‘Table ‘t1′ already exists’
t9 STOP SLAVE;
t10 RESET MASTER
t11 SET GLOBAL GTID_PURGED=’1111:1-1858′ 让从库重新去执行1111:1859的事务
t12 START SLAVE; Error ‘Table ‘t1′ already exists’ on query

这次和案例一不一样的地方是本次停止的是整个从库线程,包括IO线程和SQL线程。在启动前IO线程是没有拿到DROP的GTID为1111:1159的事务的,启动SLAVE线程后MySQL会怎么处理呢?

我们查一下官方文档,发现了以下这句话

In the initial handshake, the slave sends a GTID set containing the transactions that it has already received, committed, or both. This GTID set is equal to the union of the set of GTIDs in the gtid_executed system variable (@@GLOBAL.gtid_executed), and the set of GTIDs recorded in the Performance Schema replication_connection_status table as received transactions (the result of the statement SELECT RECEIVED_TRANSACTION_SET FROM PERFORMANCE_SCHEMA.replication_connection_status).

https://dev.mysql.com/doc/refman/5.7/en/replication-gtids-auto-positioning.html

也就是说,当从库连上主库的时候,从库会发送一个GTID集合,可能是已经执行的GTID集合,也可能是已经收到的GTID集合,或者是两者。这个GTID是已经执行的GTID集合和已经收到的GTID集合的并集。

当t7 时刻启动slave的时候,已经执行的GTID集合为1111:1-1859,已经收到的GTID集合为1111:1-1858,发送给主库的GTID集合为1111:1-1859。主库需要发送给从库的GTID为1111:1860,也就是CREATE TABLE语句,我们去relay log里面验证下

当前slave 的状态

              Master_Log_File: mysql-bin.000009
          Read_Master_Log_Pos: 3110
               Relay_Log_File: mysql-relay.000005
                Relay_Log_Pos: 414
        Relay_Master_Log_File: mysql-bin.000009
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 2948
               Last_SQL_Errno: 1050
               Last_SQL_Error: Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'create table t1(id int)'
           Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1859-1860
            Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1858

我们看下 mysql-relay.000005的内容

$mysqlbinlog mysql-relay.000005
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#190308 12:15:33 server id 102  end_log_pos 123 CRC32 0x1f2e3e22   Start: binlog v 4, server v 5.7.25-28-log created 190308 12:15:33
# This Format_description_event appears in a relay log and was generated by the slave thread.
# at 123
#190308 12:15:33 server id 102  end_log_pos 154 CRC32 0x5422dcce   Previous-GTIDs
# [empty]
# at 154
#700101  8:00:00 server id 1  end_log_pos 0 CRC32 0x6677862d   Rotate to mysql-bin.000009  pos: 4
# at 201
#190308  9:30:10 server id 1  end_log_pos 123 CRC32 0x536ddf69     Start: binlog v 4, server v 5.7.25-28-log created 190308  9:30:10
# at 320
#190308 12:15:33 server id 0  end_log_pos 367 CRC32 0x182ec41f     Rotate to mysql-bin.000009  pos: 194
# at 367
#190308 12:15:33 server id 0  end_log_pos 414 CRC32 0x7c85ab34     Rotate to mysql-bin.000009  pos: 2948
# at 414
#190308 12:15:27 server id 1  end_log_pos 3013 CRC32 0x4ab790eb    GTID    last_committed=16   sequence_number=17  rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1860'/*!*/;
# at 479
#190308 12:15:27 server id 1  end_log_pos 3110 CRC32 0xe74d818a    Query   thread_id=4 exec_time=0 error_code=0
use `test`/*!*/;
create table t1(id int)        建表的操作
/*!*/;
# at 576
#190308 12:26:37 server id 102  end_log_pos 625 CRC32 0x532b5fe7   Rotate to mysql-relay.000006  pos: 4
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;


我们发现 Relay_Log_Pos: 414 是CREATE TABLE的语句,GTID为1111:1860,在relay log mysql-relay.000005未发现GTID 为1111:1859 的DROP语句,我们发现后面还有一个relay log为mysql-relay.000006,我们看看里面的内容

$mysqlbinlog mysql-relay.000006
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#190308 12:26:37 server id 102  end_log_pos 123 CRC32 0x4357741b   Start: binlog v 4, server v 5.7.25-28-log created 190308 12:26:37
# This Format_description_event appears in a relay log and was generated by the slave thread.
# at 123
#190308 12:26:37 server id 102  end_log_pos 194 CRC32 0xa087d1c5   Previous-GTIDs
# 00021800-1111-1111-1111-111111111111:1860
# at 194
#700101  8:00:00 server id 1  end_log_pos 0 CRC32 0x6677862d   Rotate to mysql-bin.000009  pos: 4
# at 241
#190308  9:30:10 server id 1  end_log_pos 123 CRC32 0x536ddf69     Start: binlog v 4, server v 5.7.25-28-log created 190308  9:30:10

# at 360
#190308 12:26:37 server id 0  end_log_pos 407 CRC32 0xaea58e04     Rotate to mysql-bin.000009  pos: 194
# at 407
#190308 12:26:37 server id 0  end_log_pos 454 CRC32 0xb19784ba     Rotate to mysql-bin.000009  pos: 2768
# at 454
#190308 12:15:23 server id 1  end_log_pos 2833 CRC32 0x2279702d    GTID    last_committed=15   sequence_number=16  rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1859'/*!*/;
# at 519
#190308 12:15:23 server id 1  end_log_pos 2948 CRC32 0xe93f8d9d    Query   thread_id=4 exec_time=0 error_code=0
use `test`/*!*/;

DROP TABLE `t1` /* generated by server */   删表的操作
/*!*/;
# at 634
#190308 12:26:37 server id 0  end_log_pos 681 CRC32 0xd7fd7810     Rotate to mysql-bin.000009  pos: 3110
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

我们发现GTID为1111:1859的事务在1111:1860的后面,这也是符合预期的,因为在t7时刻启动从库线程的时候,主库需要发给从库的GTID只有1111:1860,因此mysql-relay.000005里面只有1111:1860,当我们在t11时刻执行SET GLOBAL GTID_PURGED=’1111:1-1858’的时候,从库发送给主库的GTID集合为 1111:1-1858和 1111:1-1858:1860合集,也就是1111:1-1158:1860,这时候主库需要发送给从库的GTID 为1111:1859。

但是由于1111:1859 在 1111:1860 的后面,因此始终无法执行1111:1859的DROP事务,导致一直报错。

解决方案

重启MySQL Slave

在t12操作之后,直接重启从库,此时,mysql-relay.000005 和 mysql-relay.000006 都被删除,这时候由于gtid_executed 变量为1111:1-1858,那么从主库重新拉取的GTID为1111:1859-1860,我们从relay log mysql-relay.000007 和 mysql-relay.000008 里面验证一下

由于mysql-relay.000007 里面没有包含实际的事务信息,我们直接看mysql-relay.000008


$mysqlbinlog mysql-relay.000008 /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/; /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/; DELIMITER /*!*/; # at 4 #190308 13:22:11 server id 102 end_log_pos 123 CRC32 0xd3d8ddd6 Start: binlog v 4, server v 5.7.25-28-log created 190308 13:22:11 # This Format_description_event appears in a relay log and was generated by the slave thread. # at 123 #190308 13:22:11 server id 102 end_log_pos 154 CRC32 0x5c6327ce Previous-GTIDs # [empty] # at 154 #700101 8:00:00 server id 1 end_log_pos 0 CRC32 0x6677862d Rotate to mysql-bin.000009 pos: 4 # at 201 #190308 9:30:10 server id 1 end_log_pos 123 CRC32 0x536ddf69 Start: binlog v 4, server v 5.7.25-28-log created 190308 9:30:10 BINLOG ' osWBXA8BAAAAdwAAAHsAAAAAAAQANS43LjI1LTI4LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXwAEGggAAAAICAgCAAAACgoKKioAEjQA AWnfbVM= '/*!*/; # at 320 #190308 13:22:11 server id 0 end_log_pos 367 CRC32 0x08b48d1c Rotate to mysql-bin.000009 pos: 194 # at 367 #190308 13:22:11 server id 0 end_log_pos 414 CRC32 0xc7dd1334 Rotate to mysql-bin.000009 pos: 2768 # at 414 #190308 12:15:23 server id 1 end_log_pos 2833 CRC32 0x2279702d GTID last_committed=15 sequence_number=16 rbr_only=no SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1859'/*!*/; # at 479 #190308 12:15:23 server id 1 end_log_pos 2948 CRC32 0xe93f8d9d Query thread_id=4 exec_time=0 error_code=0 use `test`/*!*/; DROP TABLE `t1` /* generated by server */ /*!*/; # at 594 #190308 12:15:27 server id 1 end_log_pos 3013 CRC32 0x4ab790eb GTID last_committed=16 sequence_number=17 rbr_only=no SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1860'/*!*/; # at 659 #190308 12:15:27 server id 1 end_log_pos 3110 CRC32 0xe74d818a Query thread_id=4 exec_time=0 error_code=0 SET TIMESTAMP=1552018527/*!*/; create table t1(id int) /*!*/; SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/; DELIMITER ; # End of log file /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/; /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

我们看到mysql-relay.000008 包含了1111:1859-1860,从库SQL线程也恢复正常状态

              Master_Log_File: mysql-bin.000009
          Read_Master_Log_Pos: 3110
               Relay_Log_File: mysql-relay.000008
                Relay_Log_Pos: 756
        Relay_Master_Log_File: mysql-bin.000009
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
           Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1859-1860
            Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1860

结语

GTID在上线生产的时候,需要多做测试,明白其底层的原理,并不是所有未执行的GTID事务一定会被执行,也需要注意GTID的事务的执行顺序。重启大法还是很有用的哈哈哈