The purpose of this page is to show how database replication problems were solved. Please contribute when you fix a MySQL replication problem (add a header <date>/<time>: <Problem>). If you fix something similar to an existing entry, just add an entry like 'Fixed problem as in <entry reference>'. Once we have accumulated many entries, I (SC) will try to reorganize them by category.
- 2019
- 2013
- 2012
- 2012-12-09
- 2012-12-07
- 2012-11-14
- 2012-11-13: nebulous backup
- 2012-11-13: ipp001 crash
- 2012-11-08: ippc17 crash
- 2012-11-05
- 2012-11-02
- 2012-11-01
- 2012-10-23
- 2012-10-21/22 gpc1 Reingestion
- 2012-10-15 ippdb01 -> ippdb03
- 2012-10-11: ippdb01 -> ippdb03/ipp001
- 2012-10-10: ippdb01 -> ippdb03/ipp001
- 2012-10-09: ippdb01 -> ippdb03/ipp001
- 2012-10-08: ippdb01 -> ippdb03/ipp001
- 2012-08-25 / ippdb01 to ippdb03-ipp001
- 2012-07-06 / ipp001/gpc1
- 2012-06-12 -> 2012-06-14
- 20120529-20120529 ippdb00/ippdb02
- 20120427/11:05 ipp001
- 20120313/12:05 ipp001
- 20120312/13:31 ippdb03/ipp001
- 20120223/08:40: ippdb03
- 20120222/09:21: ippc19
- 20120203/08:51: ippc19
- 20120131/08:57: ipp001
- 2011
- 20111121/08:47: ippdb03/ipp001
- 20111117/16:23: ipp001 crashed
- 20111115/09:23: ippc19 errors
- 20111109/12:11: ippdb03/ipp001 errors
- 20110921/13:56: ippdb03 error
- 20110913/1720: ippdb03 "error"
- 20110913/0930: ipp001 error
- 20110801/1505: ippdb03 error
- 20110727/1410: ippdb03 Replication Error
- 20110726/1810: ippdb03 Replication Error
- 20110429/0850: gpc1 Replication Error
- 20110413/1450: ippadmin Replication Error
- 20110307/0830: Nebulous Replication Error
- 2010
2019
2019-03-29 09:30
Issue
Replication crashed on ippdb01 replicating ippdb09.
Last_Errno: 1032
Last_Error: Could not execute Update_rows event on table nebulous.storage_object; Can't find record in 'storage_object', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysqld-bin.002087, end_log_pos 1040641174
Fix
Short story
The issue was a statement trying to update row so_id = 6604319651, which didn't (and still doesn't) exist on ippdb01. I couldn't identify why 6604319651 had been deleted, but it's easy to break replication this way.
Gory details
Full statements are in /export/ippdb01.0/mysql/archives/dump-20190329.txt, which contains selected lines of dump.txt generated on the master (db09) by:
mysqlbinlog mysqld-bin.002087 --base64-output=DECODE-ROWS --verbose > dump.txt
When attempting to execute the statement:
# at 1040640772 #190328 17:56:35 server id 2017081601 end_log_pos 1040641174 [...] ### UPDATE `nebulous`.`storage_object` WHERE so_id=6604319651 [...] # at 1040641174
Replication stopped because the row with so_id=6604319651 didn't exist.
On db09, a few hours after the crash:
mysql> SELECT * FROM storage_object WHERE so_id = 6604319651; Empty set (0.00 sec)
I assumed that so_id = 6604319651 would be deleted at some later point in the replication stream. It is:
# at 1040669677 #190328 17:56:35 server id 2017081601 end_log_pos 1040669911 [...] ### DELETE FROM `nebulous`.`storage_object` WHERE so_id=6604319651 [...] # at 1040669911
So...
- I stopped replication: STOP SLAVE
- I skipped one stmt: SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1
- I manually replayed the INSERT statement following # at 1040641174. Even though I thought the replica would restart with it, I was not exactly sure it would start there (better safe than sorry!). I checked that, on db01, the last so_id was 7876537519 and that 7876537520 was not present.
- I restarted replication (START SLAVE) which immediately crashed since it tried to insert 7876537520. It crashed at the expected position: 1040641548
- I skipped the (now offending) INSERT: STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE
- Then I waited for the crash at the DELETE statement for so_id = 6604319651 (since there was nothing to delete), which happened a few hundredths of a second later.
- I skipped the offending DELETE statement: STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE
- And the replication seems fine so far
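The binlog spelunking above (decoding with mysqlbinlog, then picking out the events that touch one row) can be sketched as a small helper. This is an illustrative sketch, not part of the IPP tooling; it only assumes the standard '# at <pos>' event headers that mysqlbinlog --verbose emits.

```python
# Hypothetical helper (not an existing IPP tool): scan a decoded
# `mysqlbinlog --verbose` dump and report the "# at <pos>" headers of
# events whose body mentions a given row id, e.g. so_id=6604319651.
def events_for_row(dump_lines, row_id):
    hits = []
    current_pos = None
    matched = False
    for line in dump_lines:
        if line.startswith("# at "):
            # A new event starts; record the previous one if it matched.
            if matched and current_pos is not None:
                hits.append(current_pos)
            current_pos = line.strip()
            matched = False
        elif str(row_id) in line:
            matched = True
    if matched and current_pos is not None:
        hits.append(current_pos)
    return hits
```

Run over dump.txt for 6604319651, this would surface both the UPDATE and the later DELETE event positions in one pass instead of eyeballing the file.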
2013
2013-01-04 09:10
Last_error value when executing SHOW SLAVE STATUS on that host is: Error 'Duplicate entry ...
Fixed with the usual: stop slave; set global sql_slave_skip_counter=1; start slave;
2012
2012-12-09
Last_error value when executing SHOW SLAVE STATUS on that host is: Error 'Duplicate entry ...
Fixed with the usual: stop slave; set global sql_slave_skip_counter=1; start slave;
2012-12-07
Last_error value when executing SHOW SLAVE STATUS on that host is: Error 'Duplicate entry ...
Fixed with the usual: stop slave; set global sql_slave_skip_counter=1; start slave;
2012-11-14
Last_error value when executing SHOW SLAVE STATUS on that host is: Error 'Duplicate entry '21338906' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile(diff_id, skycell_id, warp1, stack1, warp2, stack2, tess_id) SELECT diff_id, skycell_id, warp1, stack1, warp2, stack2, tess_id from skycellsToDiff'
Fixed with the usual: stop slave; set global sql_slave_skip_counter=1; start slave;
2012-11-13: nebulous backup
Stopped the mysql server on ippc63; rsyncing to ippc62. 2012-11-14: restarted.
2012-11-13: ipp001 crash
No apparent reason. Restarted mysql server. Replication easily caught up.
2012-11-08: ippc17 crash
Error 'Duplicate entry '824346-fpa' for key 1' on query. Default database: 'isp'. Query: 'INSERT INTO [...]
The usual:
stop slave; set global sql_slave_skip_counter=1; start slave;
fixed it.
2012-11-05
Stopped replication from ippdb00 to ippc63. Started rsync from ippc63 to ippc61. (On 2012-11-06, restarted replication)
2012-11-02
Backup complete. Replication to ippc63 restarted.
2012-11-01
Stopped replication and mysql server on ippc63 to make a backup to ippc62
2012-10-23
Again a few STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;
2012-10-21/22 gpc1 Reingestion
Reingestion on ippdb03 and ipp001
2012-10-15 ippdb01 -> ippdb03
Replication problem again
Error 'Duplicate entry '19665422' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'
Fixed ippdb03 with the recipe:
stop slave; set global sql_slave_skip_counter=1; start slave;
2012-10-11: ippdb01 -> ippdb03/ipp001
Replication problem happened twice last night.
Error 'Duplicate entry '19442342' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'
Error 'Duplicate entry '19446512' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'
2012-10-10: ippdb01 -> ippdb03/ipp001
Again
2012-10-09: ippdb01 -> ippdb03/ipp001
Same problem as yesterday.
Error 'Duplicate entry '19309559' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'
Same fix:
STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;
2012-10-08: ippdb01 -> ippdb03/ipp001
Error 'Duplicate entry '19172463' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'
Fix:
STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;
2012-08-25 / ippdb01 to ippdb03-ipp001
difftool.c allows replaying an insert statement twice, which breaks replication.
2012-07-06 / ipp001/gpc1
Error 'Cannot add or update a child row: a foreign key constraint fails (gpc1/distComponent, CONSTRAINT distComponent_ibfk_1 FOREIGN KEY (dist_id) REFERENCES distRun (dist_id))' on query. Default database: 'gpc1'. Query: 'INSERT INTO distComponent (dist_id, component, bytes, md5sum, state, outdir, name, fault) VALUES (2066813, 'skycell.7.17', 108129, 'e5f49b59c7f693649f40a33e7a3980b9', 'full', 'neb://stsci08.0/gpc1/dist/STS.nightlyscience/2012/06/13/o6091g0322o/warp.2066813', 'o6091g0322o.498476.wrp.432885.skycell.7.17.tgz', 0)'
Tried to skip the offending statement without success.
2012-06-12 -> 2012-06-14
After the ippdb01 crash, see TODO
20120529-20120529 ippdb00/ippdb02
Full nebulous ingestion. Details here
20120427/11:05 ipp001
ipp001 replication broke because of 'Duplicate entry '32' for key 1' on query. Default database: 'ippadmin'. Query: 'insert into users (username, password) values ('chenx', 'chenx')'. I ran the usual STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;. Then 'Duplicate entry '30' for key 1' on query. Default database: 'ippadmin'. Query: 'INSERT into cookies (username, cookie) values ('chenx', '32709')'. Another STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; fixed it.
20120313/12:05 ipp001
ipp001 replication is broken again (broken foreign keys?! e.g. Cannot add or update a child row: a foreign key constraint fails (`gpc1/chipProcessedImfile`, CONSTRAINT `chipProcessedImfile_ibfk_1` FOREIGN KEY (`chip_id`, `exp_id`) REFERENCES `chipRun`)). Made a manual backup of ippdb03 to /export/ippdb03.0/mysql-dumps/backup_20120313T1208.sql.bz2
Master_Log_File: mysqld-bin.020936
Read_Master_Log_Pos: 73345738
20120312/13:31 ippdb03/ipp001
Error 'Table 'skycells.skycell' doesn't exist' on query. Default database: 'gpc1'. Query: 'update staticskyRun join staticskyInput using(sky_id) join stackRun using(stack_id) join skycells.skycell using(skycell_id) set staticskyRun.state = 'new.wait' where staticskyRun.label ='lap.threepi.20110809' and staticskyRun.state ='new' and ra/15 > 12'
From Bill: "I did use the table in an update last night. That was a one time command. The reason that the table is not in gpc1 is that I'm not finished with it yet and I didn't want to cause problems with replication. Apparently using it for a table update caused a problem."
I ran the usual STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; on ippdb03.
It looks, though, like there are other problems (multiple insert attempts for the same pk) on ipp001.
20120223/08:40: ippdb03
Duplicate entry 'XXXXX-X' for key 1' on query. Default database: 'ipptopsps'. I ran ~schastel/dev/many_failures/fix_replication.sh about 200 times without success. I talked to Roy who will replicate ipptopsps somewhere else. I'm dropping it from replication for now.
I stopped the mysql server on ippdb03, modified /etc/mysql/my.cnf (commented out ipptopsps replication), and restarted the server.
20120222/09:21: ippc19
Replication on ippc19 broke. The 'SHOW SLAVE STATUS\G' command showed the following message: "Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0". /var/log/mysql/mysqld.err showed that replication was broken at log 'mysqld-bin.000639' position 1041257298 (see message 120222 8:46:23). I stopped the slave, changed the master coordinates on the slave (with the input found in /var/log/mysql/mysqld.err) and restarted the slave, that is:
STOP SLAVE; CHANGE MASTER TO MASTER_LOG_FILE='mysqld-bin.000639', MASTER_LOG_POS=1041257298; START SLAVE;
Then "SHOW SLAVE STATUS\G" commands showed that replication worked again (Seconds_Behind_Master is decreasing).
20120203/08:51: ippc19
ippc17 experienced crashes yesterday and tables were corrupted (see http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/PS1_IPP_CzarLog_20120130#Friday:2012.02.03). Repairing the tables crashed the replication: the error messages were about "Duplicate entr[ies]".
I ran the usual: STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; about 16 times to fix it.
20120131/08:57: ipp001
Ken Smith told us about ipp001 unavailability. The mysql server on that host crashed for some unknown reason (it was 62000 seconds behind so I assume it happened around 15:00 HST on Jan 30? Maybe it's related to the global loss of connection to the development cluster that happened around that time). I restarted the mysql server and had to fix it with STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;
2011
20111121/08:47: ippdb03/ipp001
Two instances of czarpoll were started at the same time, running two "OPTIMIZE TABLE burntool". Replication broke on both clients. The message on both client was: "Query partially completed on the master (error on master: 1053) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'OPTIMIZE TABLE burntool'"
Fixed with STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;
20111117/16:23: ipp001 crashed
mysqld crashed. Nothing interesting in mysqld.err. Once restarted, slave resumed normally.
20111115/09:23: ippc19 errors
Errors like 'Duplicate entry...' after ippc17 crash last night. The usual "STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;" fixed the problem.
20111109/12:11: ippdb03/ipp001 errors
Alterations to gpc1:
mysql> ALTER TABLE diffInputSkyfile DROP PRIMARY KEY, ADD PRIMARY KEY(diff_skyfile_id);
Query OK, 11661867 rows affected (12 min 37.50 sec)
Records: 11661867  Duplicates: 0  Warnings: 0
mysql> ALTER TABLE diffInputSkyfile CHANGE COLUMN skycell_id skycell_id VARCHAR(64) DEFAULT NULL;
Query OK, 11661867 rows affected (11 min 28.22 sec)
Records: 11661867  Duplicates: 0  Warnings: 0
mysql> UPDATE dbversion set schema_version = '1.1.71', updated= CURRENT_TIMESTAMP();
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
The first statement broke replication on both ippdb03 and ipp001 (because of duplicate diff_skyfile_id = 10264768 rows in diffInputSkyfile; the error message was: "ERROR 1062 (23000): Duplicate entry '10264768' for key 1"). On both replicants I ran:
mysql> STOP SLAVE;
mysql> delete from diffInputSkyfile where diff_id = 162937 and skycell_id = 'skycell.1485.049';
mysql> update diffInputSkyfile set diff_skyfile_id = 10264767 where diff_id = 162937 and skycell_id = 'skycell.1485.059';
and ran the ALTER command again:
mysql> ALTER TABLE diffInputSkyfile DROP PRIMARY KEY, ADD PRIMARY KEY(diff_skyfile_id);
and then:
mysql> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
20110921/13:56: ippdb03 error
From Anonymous: "screwed-up the replication to ippdb03 while adding the new ippMonitor user (i didn't think i could update a slave, so believed i was safe). i've tried to fix it"
The usual STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE; fixed it
20110913/1720: ippdb03 "error"
Not really an error actually.
isp was being updated with the statement (see SHOW FULL PROCESSLIST or look at mysqld.err):
INSERT INTO summitExp SELECT NULL, incoming.*, NULL, 0, NULL FROM incoming LEFT JOIN summitExp USING(exp_name, camera, telescope) WHERE summitExp.exp_name is NULL AND summitExp.camera is NULL AND summitExp.telescope is NULL
I killed the process in the mysql server and skipped the statement: STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
Note: I had to do it twice.
20110913/0930: ipp001 error
Error 'The table 'incoming' is full' on query. Default database: 'isp'. Query:
'INSERT INTO incoming (exp_name, camera, telescope, dateobs, exp_type, uri) VALUES
('o5253i0240o01', 'isp', 'ps1', '2010-02-26 09:29:39', 'OBJECT',
'http://otis3.ifa.hawaii.edu/ds/skyprobe/o5253i0240o01/index.txt')'
I don't understand why this _temporary_ table can't be created. I dropped isp from replication.
20110801/1505: ippdb03 error
ippdb03 crashed often during the last weekend. SHOW SLAVE STATUS\G showed this error message:
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: ippdb01.ifa.hawaii.edu
Master_User: repl_gpc1
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysqld-bin.020270
Read_Master_Log_Pos: 962746384
Relay_Log_File: mysqld-relay-bin.000005
Relay_Log_Pos: 24123883
Relay_Master_Log_File: mysqld-bin.020269
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB: gpc1,czardb,ippadmin,isp
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary
log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is
corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the
master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be
able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
Skip_Counter: 1
Exec_Master_Log_Pos: 24123745
Relay_Log_Space: 2036505138
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
The relay log was corrupted. I ran 'mysqlbinlog mysqld-relay-bin.000005 > /dev/null' and saw an error message looking like:
ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 0, event_type: 0 Could not read entry at offset <some value>:Error in log format or read error
I googled a bit and found what fixed the problem: http://www.mysqlperformanceblog.com/2008/08/02/troubleshooting-relay-log-corruption-in-mysql/
Precisely, I did: CHANGE MASTER TO master_log_file='mysqld-bin.020269', master_log_pos=24123745;
The problem seems to have been fixed.
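The recovery above re-points the slave at the master coordinates it has already executed (Relay_Master_Log_File / Exec_Master_Log_Pos), discarding the corrupt relay logs. A minimal sketch of deriving that statement from SHOW SLAVE STATUS\G output; the helper name is made up, and the parsing only assumes the usual 'Field: value' layout:

```python
# Hypothetical helper: given the text of SHOW SLAVE STATUS\G, build the
# CHANGE MASTER statement for relay-log-corruption recovery. The correct
# coordinates are the *executed* master position (Relay_Master_Log_File /
# Exec_Master_Log_Pos), not the read position (Master_Log_File /
# Read_Master_Log_Pos).
def recovery_statement(status_text):
    fields = {}
    for line in status_text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return ("CHANGE MASTER TO MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;"
            % (fields["Relay_Master_Log_File"], fields["Exec_Master_Log_Pos"]))
```

Fed the status dump above, this produces exactly the statement that was run: CHANGE MASTER TO MASTER_LOG_FILE='mysqld-bin.020269', MASTER_LOG_POS=24123745;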
20110727/1410: ippdb03 Replication Error
ippdb03 crashed. The SHOW SLAVE STATUS messages were like: 'Duplicate entry ...'
I applied STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE; about twenty times and everything looks fine now.
20110726/1810: ippdb03 Replication Error
ippdb03 crashed. The SHOW SLAVE STATUS messages were like: 'Duplicate entry ...'
I applied STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE; four times and everything looks fine now.
20110429/0850: gpc1 Replication Error
From the mysql slave on ippc02 (SHOW SLAVE STATUS): Query partially completed on the master (error on master: 1317) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'drop table lapExp'.
I checked with Chris. I manually dropped the table in gpc1 and did a 'STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;'. Hopefully we will not have to ingest a new dump once again...
20110413/1450: ippadmin Replication Error
- Logged onto ipp001
- Started mysql client as root
- 'SHOW SLAVE STATUS\G' shows the following error:
Last_Error: Error 'Duplicate entry '26' for key 1' on query. Default database: 'ippadmin'. Query: 'insert into users (username, password) values ('denver', '********')'
- I did the usual 'STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;'
- The slave is 172833 seconds (i.e. 2 days) behind its master. It clearly shows that no one looks at the replication server statuses.
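Since lag like this goes unnoticed, a trivial check on Seconds_Behind_Master could be run periodically. A minimal sketch; the function name and the one-hour threshold are assumptions for illustration:

```python
# Minimal replication-lag check (illustrative; the 1-hour threshold is an
# assumption). Seconds_Behind_Master is NULL when the SQL thread is stopped,
# which is always worth an alert.
def lag_warning(seconds_behind, threshold=3600):
    if seconds_behind is None:
        return "replication stopped"
    if seconds_behind > threshold:
        return "slave is %.1f days behind" % (seconds_behind / 86400.0)
    return None

lag_warning(172833)  # the case above: 'slave is 2.0 days behind'
```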
20110307/0830: Nebulous Replication Error
- Logged onto ippdb02
- Started mysql client as root
- 'SHOW SLAVE STATUS\G' shows the following error:
Error 'Table 'nebulous.mountedvol' doesn't exist' on query. Default database: 'nebulous'. Query: 'REPLACE INTO mountedvol SELECT vol_id,name,host,path,allocate,available,xattr,mountpoint, '10651598144', '8973230080',note FROM volume WHERE mountpoint = '/data/ipp004.0''
- But if I check the contents of the database, I can see that nebulous.mountedvol exists! It seems that the replicated nebulous didn't survive last week's optimization.
- I'm setting up a new replicated server.
2010
20101116/1659: gpc1 Replication Error
- Logged onto ipp001
- Started mysql client as root
- 'SHOW SLAVE STATUS\G' shows error similar to 20101104/1429: gpc1 Replication Error
- Same solution was applied (Statement skipped)
20101104/1429: gpc1 Replication Error
Problem detected through czartool
- Logged onto ipp001
- Started mysql client as root
- 'SHOW SLAVE STATUS\G' showed (shortened)
Slave_SQL_Running: No
Last_Errno: 1062
Last_Error: Error 'Duplicate entry '23' for key 1' on query. Default database: 'ippadmin'. Query: 'INSERT into cookies (username, cookie) values ('psobs', '99497')'
- Local non-replicated database involved... Statement skipped (in the MySQL client):
STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
- Then 'SHOW SLAVE STATUS\G' showed (shortened)
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
- Czartool didn't show any problem after the fix
