前言
在测试环境开启GTID运行一年多之后,我们准备近期上线生产。为了保证GTID顺利的上线,在测试环境模拟各种故障场景,观察GTID 的表现
案例采用sandbox测试,GTID的一些关键测试参数如下
master-info-repository=table
relay-log-info-repository=table
gtid_mode=ON
log-slave-updates
enforce-gtid-consistency
relay_log_recovery=on
案例一
其中Master的UUID是00021800-1111-1111-1111-111111111111,Slave的UUID是 00021802-3333-3333-3333-333333333333,我们简称 Master的UUID为1111,Slave的UUID为3333。
时间线 | Master | Slave | 备注 |
---|---|---|---|
t1 | CREATE TABLE t1 | 生成的GTID为1111:1856 | |
t2 | STOP SLAVE SQL_THREAD | 停止从库的SQL THREAD 线程 | |
t3 | RESET MASTER; | 清空从库的gtid_executed和binlog信息 | |
t4 | SET GLOBAL GTID_PURGED='1111:1-1857' | 手动让从库认为1111:1857已经执行 | |
t5 | DROP TABLE t1 | 主库删除t1,GTID为1111:1857 | |
t6 | CREATE TABLE t1 | 主库创建t1,GTID为1111:1858 | |
t7 | START SLAVE SQL_THREAD | 启动从库SQL线程 | |
t8 | SHOW SLAVE STATUS\G | Error 'Table 't1' already exists' | |
t9 | STOP SLAVE SQL_THREAD; | ||
t10 | RESET MASTER | ||
t11 | SET GLOBAL GTID_PURGED='1111:1-1856' | 让从库重新去执行1111:1857的事务 | |
t12 | START SLAVE SQL_THREAD; | Error 'Table 't1' already exists' on query |
当我们不小心在从库让 SQL THREAD 跳过一个GTID事务的时候,想让从库重新执行这个跳过的GTID,却始终报错,我们把从库的线程的关键信息提供一下
Master_Log_File: mysql-bin.000009
Read_Master_Log_Pos: 2768
Relay_Log_File: mysql-relay.000006
Relay_Log_Pos: 594
Relay_Master_Log_File: mysql-bin.000009
Slave_IO_Running: Yes
Slave_SQL_Running: No
Exec_Master_Log_Pos: 2606
Last_SQL_Errno: 1050
Last_SQL_Error: Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'create table t1(id int)'
Replicate_Ignore_Server_Ids:
Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1857-1858
Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1856
Auto_Position: 1
刚刚我们执行的操作只是关闭了SQL THREAD,那么主库执行的DROP TABLE操作(被从库第一次手动跳过的)以及CREATE 操作都应该在relay log里面,我们解析一下,结果删去了一些不必要的输出。
$ mysqlbinlog mysql-relay.000006
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#190308 10:56:36 server id 102 end_log_pos 123 CRC32 0x73fa3393 Start: binlog v 4, server v 5.7.25-28-log created 190308 10:56:36
# This Format_description_event appears in a relay log and was generated by the slave thread.
# at 123
#190308 10:56:36 server id 102 end_log_pos 154 CRC32 0x4c964c64 Previous-GTIDs
# [empty]
# at 154
#700101 8:00:00 server id 1 end_log_pos 0 CRC32 0x6677862d Rotate to mysql-bin.000009 pos: 4
# at 201
#190308 9:30:10 server id 1 end_log_pos 123 CRC32 0x536ddf69 Start: binlog v 4, server v 5.7.25-28-log created 190308 9:30:10
# at 320
#190308 10:56:36 server id 0 end_log_pos 367 CRC32 0x72b7d6b4 Rotate to mysql-bin.000009 pos: 194
# at 367
#190308 10:56:36 server id 0 end_log_pos 414 CRC32 0x73628a00 Rotate to mysql-bin.000009 pos: 2426
# at 414
#190308 10:57:44 server id 1 end_log_pos 2491 CRC32 0x18c5896c GTID last_committed=13 sequence_number=14 rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1857'/*!*/;
# at 479
#190308 10:57:44 server id 1 end_log_pos 2606 CRC32 0x7c575de0 Query thread_id=4 exec_time=0 error_code=0
use `test`/*!*/;
DROP TABLE `t1` /* generated by server */ 删除表的操作
/*!*/;
# at 594
#190308 10:57:48 server id 1 end_log_pos 2671 CRC32 0x7138d3a8 GTID last_committed=14 sequence_number=15 rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1858'/*!*/;
# at 659
#190308 10:57:48 server id 1 end_log_pos 2768 CRC32 0x3a8eeb36 Query thread_id=4 exec_time=0 error_code=0
SET TIMESTAMP=1552013868/*!*/;
create table t1(id int) 创建表的操作
/*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
我们可以看到 GTID 1111:1857 和 GTID 1111:1858 都在relay log里面,但是SQL thread 始终没有执行,我们看看为什么
Relay_Log_Pos = 594 表示的是SQL thread 执行的位点在relay log 的位置
我们看看594 在relay log 是什么event
# at 594
#190308 10:57:48 server id 1 end_log_pos 2671 CRC32 0x7138d3a8 GTID last_committed=14 sequence_number=15 rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1858'/*!*/;
# at 659
#190308 10:57:48 server id 1 end_log_pos 2768 CRC32 0x3a8eeb36 Query thread_id=4 exec_time=0 error_code=0
SET TIMESTAMP=1552013868/*!*/;
create table t1(id int)
但是DROP TABLE 的位点是在 414 的位点,而SQL thread 位点却在594,所以始终没办法执行到 414 位点的DROP TABLE 操作,无法修复,那是为啥呢?
因为位点为414 的DROP TABLE 操作的GTID 是 1111:1857 ,被我们人为的认为从库已经执行。当从库线程发现GTID 为1111:1857已经执行过,则跳过该事务,继续往下执行回放。当执行到位点594时,即GTID为1111:1858。这时候,当我们在t11时刻,设置gtid_purged=1111:1-1856',让从库重新执行GTID 为1111:1157 的事务时,SQL thread 还是从relay log 的594位点开始执行,无法恢复。那怎么修复呢?
解决方案
1.让SQL thread 从414 位点开始执行
STOP SLAVE;
CHANGE MASTER TO RELAY_LOG_FILE='mysql-relay.000006' ,RELAY_LOG_POS=414 ,MASTER_AUTO_POSITION=0;
CHANGE MASTER TO MASTER_AUTO_POSITION=1;
START SLAVE;
SHOW SLAVE STATUS\G
我们再观察下slave的状态
Master_Log_File: mysql-bin.000009
Read_Master_Log_Pos: 2768
Relay_Log_File: mysql-relay.000002
Relay_Log_Pos: 756
Relay_Master_Log_File: mysql-bin.000009
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 2768
Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1857-1858
Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1858
我们看到1111:1157 已经被加入Executed_Gtid_Set。
2.重新启动MySQL Slave
因为relay_log_recovery为on的状态,老的relay log 会被删除,重新根据gtid_executed 去主库拉取binlog,这时候就会顺利往下执行了。
案例二
时间线 | Master | Slave | 备注 |
---|---|---|---|
t1 | CREATE TABLE t1 | 生成的GTID为1111:1858 | |
t2 | STOP SLAVE | 停止从库的线程 | |
t3 | RESET MASTER; | 清空从库的gtid_executed和binlog信息 | |
t4 | SET GLOBAL GTID_PURGED='1111:1-1859' | 手动让从库认为1111:1859已经执行 | |
t5 | DROP TABLE t1 | 主库删除t1,GTID为1111:1859 | |
t6 | CREATE TABLE t1 | 主库创建t1,GTID为1111:1860 | |
t7 | START SLAVE | 启动从库线程 | |
t8 | SHOW SLAVE STATUS\G | Error 'Table 't1' already exists' | |
t9 | STOP SLAVE; | ||
t10 | RESET MASTER | ||
t11 | SET GLOBAL GTID_PURGED='1111:1-1858' | 让从库重新去执行1111:1859的事务 | |
t12 | START SLAVE; | Error 'Table 't1' already exists' on query |
这次和案例一不一样的地方是本次停止的是整个从库线程,包括IO线程和SQL线程。在启动前IO线程是没有拿到DROP的GTID为1111:1159的事务的,启动SLAVE线程后MySQL会怎么处理呢?
我们查一下官方文档,发现了以下这句话
In the initial handshake, the slave sends a GTID set containing the transactions that it has already received, committed, or both. This GTID set is equal to the union of the set of GTIDs in the gtid_executed system variable (@@GLOBAL.gtid_executed), and the set of GTIDs recorded in the Performance Schema replication_connection_status table as received transactions (the result of the statement SELECT RECEIVED_TRANSACTION_SET FROM PERFORMANCE_SCHEMA.replication_connection_status).
https://dev.mysql.com/doc/refman/5.7/en/replication-gtids-auto-positioning.html
也就是说,当从库连上主库的时候,从库会发送一个GTID集合,可能是已经执行的GTID集合,也可能是已经收到的GTID集合,或者是两者。这个GTID是已经执行的GTID集合和已经收到的GTID集合的并集。
当t7 时刻启动slave的时候,已经执行的GTID集合为1111:1-1859,已经收到的GTID集合为1111:1-1858,发送给主库的GTID集合为1111:1-1859。主库需要发送给从库的GTID为1111:1860,也就是CREATE TABLE语句,我们去relay log里面验证下
当前slave 的状态
Master_Log_File: mysql-bin.000009
Read_Master_Log_Pos: 3110
Relay_Log_File: mysql-relay.000005
Relay_Log_Pos: 414
Relay_Master_Log_File: mysql-bin.000009
Slave_IO_Running: Yes
Slave_SQL_Running: No
Exec_Master_Log_Pos: 2948
Last_SQL_Errno: 1050
Last_SQL_Error: Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'create table t1(id int)'
Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1859-1860
Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1858
我们看下 mysql-relay.000005的内容
$mysqlbinlog mysql-relay.000005
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#190308 12:15:33 server id 102 end_log_pos 123 CRC32 0x1f2e3e22 Start: binlog v 4, server v 5.7.25-28-log created 190308 12:15:33
# This Format_description_event appears in a relay log and was generated by the slave thread.
# at 123
#190308 12:15:33 server id 102 end_log_pos 154 CRC32 0x5422dcce Previous-GTIDs
# [empty]
# at 154
#700101 8:00:00 server id 1 end_log_pos 0 CRC32 0x6677862d Rotate to mysql-bin.000009 pos: 4
# at 201
#190308 9:30:10 server id 1 end_log_pos 123 CRC32 0x536ddf69 Start: binlog v 4, server v 5.7.25-28-log created 190308 9:30:10
# at 320
#190308 12:15:33 server id 0 end_log_pos 367 CRC32 0x182ec41f Rotate to mysql-bin.000009 pos: 194
# at 367
#190308 12:15:33 server id 0 end_log_pos 414 CRC32 0x7c85ab34 Rotate to mysql-bin.000009 pos: 2948
# at 414
#190308 12:15:27 server id 1 end_log_pos 3013 CRC32 0x4ab790eb GTID last_committed=16 sequence_number=17 rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1860'/*!*/;
# at 479
#190308 12:15:27 server id 1 end_log_pos 3110 CRC32 0xe74d818a Query thread_id=4 exec_time=0 error_code=0
use `test`/*!*/;
create table t1(id int) 建表的操作
/*!*/;
# at 576
#190308 12:26:37 server id 102 end_log_pos 625 CRC32 0x532b5fe7 Rotate to mysql-relay.000006 pos: 4
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
我们发现 Relay_Log_Pos: 414 是CREATE TABLE的语句,GTID为1111:1860,在relay log mysql-relay.000005未发现GTID 为1111:1859 的DROP语句,我们发现后面还有一个relay log为mysql-relay.000006,我们看看里面的内容
$mysqlbinlog mysql-relay.000006
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#190308 12:26:37 server id 102 end_log_pos 123 CRC32 0x4357741b Start: binlog v 4, server v 5.7.25-28-log created 190308 12:26:37
# This Format_description_event appears in a relay log and was generated by the slave thread.
# at 123
#190308 12:26:37 server id 102 end_log_pos 194 CRC32 0xa087d1c5 Previous-GTIDs
# 00021800-1111-1111-1111-111111111111:1860
# at 194
#700101 8:00:00 server id 1 end_log_pos 0 CRC32 0x6677862d Rotate to mysql-bin.000009 pos: 4
# at 241
#190308 9:30:10 server id 1 end_log_pos 123 CRC32 0x536ddf69 Start: binlog v 4, server v 5.7.25-28-log created 190308 9:30:10
# at 360
#190308 12:26:37 server id 0 end_log_pos 407 CRC32 0xaea58e04 Rotate to mysql-bin.000009 pos: 194
# at 407
#190308 12:26:37 server id 0 end_log_pos 454 CRC32 0xb19784ba Rotate to mysql-bin.000009 pos: 2768
# at 454
#190308 12:15:23 server id 1 end_log_pos 2833 CRC32 0x2279702d GTID last_committed=15 sequence_number=16 rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1859'/*!*/;
# at 519
#190308 12:15:23 server id 1 end_log_pos 2948 CRC32 0xe93f8d9d Query thread_id=4 exec_time=0 error_code=0
use `test`/*!*/;
DROP TABLE `t1` /* generated by server */ 删表的操作
/*!*/;
# at 634
#190308 12:26:37 server id 0 end_log_pos 681 CRC32 0xd7fd7810 Rotate to mysql-bin.000009 pos: 3110
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
我们发现GTID为1111:1859的事务在1111:1860的后面,这也是符合预期的,因为在t7时刻启动从库线程的时候,主库需要发给从库的GTID只有1111:1860,因此mysql-relay.000005里面只有1111:1860,当我们在t11时刻执行SET GLOBAL GTID_PURGED='1111:1-1858'的时候,从库发送给主库的GTID集合为 1111:1-1858和 1111:1-1858:1860合集,也就是1111:1-1158:1860,这时候主库需要发送给从库的GTID 为1111:1859。
但是由于1111:1859 在 1111:1860 的后面,因此始终无法执行1111:1859的DROP事务,导致一直报错。
解决方案
重启MySQL Slave
在t12操作之后,直接重启从库,此时,mysql-relay.000005 和 mysql-relay.000006 都被删除,这时候由于gtid_executed 变量为1111:1-1858,那么从主库重新拉取的GTID为1111:1859-1860,我们从relay log mysql-relay.000007 和 mysql-relay.000008 里面验证一下
由于mysql-relay.000007 里面没有包含实际的事务信息,我们直接看mysql-relay.000008
$mysqlbinlog mysql-relay.000008
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#190308 13:22:11 server id 102 end_log_pos 123 CRC32 0xd3d8ddd6 Start: binlog v 4, server v 5.7.25-28-log created 190308 13:22:11
# This Format_description_event appears in a relay log and was generated by the slave thread.
# at 123
#190308 13:22:11 server id 102 end_log_pos 154 CRC32 0x5c6327ce Previous-GTIDs
# [empty]
# at 154
#700101 8:00:00 server id 1 end_log_pos 0 CRC32 0x6677862d Rotate to mysql-bin.000009 pos: 4
# at 201
#190308 9:30:10 server id 1 end_log_pos 123 CRC32 0x536ddf69 Start: binlog v 4, server v 5.7.25-28-log created 190308 9:30:10
BINLOG '
osWBXA8BAAAAdwAAAHsAAAAAAAQANS43LjI1LTI4LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXwAEGggAAAAICAgCAAAACgoKKioAEjQA
AWnfbVM=
'/*!*/;
# at 320
#190308 13:22:11 server id 0 end_log_pos 367 CRC32 0x08b48d1c Rotate to mysql-bin.000009 pos: 194
# at 367
#190308 13:22:11 server id 0 end_log_pos 414 CRC32 0xc7dd1334 Rotate to mysql-bin.000009 pos: 2768
# at 414
#190308 12:15:23 server id 1 end_log_pos 2833 CRC32 0x2279702d GTID last_committed=15 sequence_number=16 rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1859'/*!*/;
# at 479
#190308 12:15:23 server id 1 end_log_pos 2948 CRC32 0xe93f8d9d Query thread_id=4 exec_time=0 error_code=0
use `test`/*!*/;
DROP TABLE `t1` /* generated by server */
/*!*/;
# at 594
#190308 12:15:27 server id 1 end_log_pos 3013 CRC32 0x4ab790eb GTID last_committed=16 sequence_number=17 rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1860'/*!*/;
# at 659
#190308 12:15:27 server id 1 end_log_pos 3110 CRC32 0xe74d818a Query thread_id=4 exec_time=0 error_code=0
SET TIMESTAMP=1552018527/*!*/;
create table t1(id int)
/*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
我们看到mysql-relay.000008 包含了1111:1859-1860,从库SQL线程也恢复正常状态
Master_Log_File: mysql-bin.000009
Read_Master_Log_Pos: 3110
Relay_Log_File: mysql-relay.000008
Relay_Log_Pos: 756
Relay_Master_Log_File: mysql-bin.000009
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1859-1860
Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1860
结语
GTID在上线生产的时候,需要多做测试,明白其底层的原理,并不是所有未执行的GTID事务一定会被执行,也需要注意GTID的事务的执行顺序。重启大法还是很有用的哈哈哈