The purpose of this page is to show how database replication problems were solved. Please contribute when you fix a MySQL replication problem (add a header <date>/<time>: <Problem>). If you fix something similar to an existing entry, just add an entry like 'Fixed problem as in <entry reference>'. Once we have accumulated many entries, I (SC) will try to reorganize them by category.
- 2019
- 2013
- 2012
- 2012-12-09
- 2012-12-07
- 2012-11-14
- 2012-11-13: nebulous backup
- 2012-11-13: ipp001 crash
- 2012-11-08: ippc17 crash
- 2012-11-05
- 2012-11-02
- 2012-11-01
- 2012-10-23
- 2012-10-21/22 gpc1 Reingestion
- 2012-10-15 ippdb01 -> ippdb03
- 2012-10-11: ippdb01 -> ippdb03/ipp001
- 2012-10-10: ippdb01 -> ippdb03/ipp001
- 2012-10-09: ippdb01 -> ippdb03/ipp001
- 2012-10-08: ippdb01 -> ippdb03/ipp001
- 2012-08-25 / ippdb01 to ippdb03-ipp001
- 2012-07-06 / ipp001/gpc1
- 2012-06-12 -> 2012-06-14
- 20120529-20120529 ippdb00/ippdb02
- 20120427/11:05 ipp001
- 20120313/12:05 ipp001
- 20120312/13:31 ippdb03/ipp001
- 20120223/08:40: ippdb03
- 20120222/09:21: ippc19
- 20120203/08:51: ippc19
- 20120131/08:57: ipp001
- 2011
- 20111121/08:47: ippdb03/ipp001
- 20111117/16:23: ipp001 crashed
- 20111115/09:23: ippc19 errors
- 20111109/12:11: ippdb03/ipp001 errors
- 20110921/13:56: ippdb03 error
- 20110913/1720: ippdb03 "error"
- 20110913/0930: ipp001 error
- 20110801/1505: ippdb03 error
- 20110727/1410: ippdb03 Replication Error
- 20110726/1810: ippdb03 Replication Error
- 20110429/0850: gpc1 Replication Error
- 20110413/1450: ippadmin Replication Error
- 20110307/0830: Nebulous Replication Error
- 2010
2019
2019-03-29 09:30
Issue
Replication crashed on ippdb01 replicating ippdb09.
Last_Errno: 1032
Last_Error: Could not execute Update_rows event on table nebulous.storage_object; Can't find record in 'storage_object', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysqld-bin.002087, end_log_pos 1040641174
Fix
Short story
The issue was a statement trying to update row so_id = 6604319651, which didn't (and still doesn't) exist on ippdb01. I couldn't identify why 6604319651 had been deleted, but it's easy to break replication this way.
Gory details
Full statements are in /export/ippdb01.0/mysql/archives/dump-20190329.txt, which contains selected lines of dump.txt generated on the master (db09) by:
mysqlbinlog mysqld-bin.002087 --base64-output=DECODE-ROWS --verbose > dump.txt
When attempting to execute the statement:
# at 1040640772 #190328 17:56:35 server id 2017081601 end_log_pos 1040641174 [...] ### UPDATE `nebulous`.`storage_object` WHERE so_id=6604319651 [...] # at 1040641174
Replication stopped because the row with so_id=6604319651 didn't exist.
On db09, a few hours after the crash:
mysql> SELECT * FROM storage_object WHERE so_id = 6604319651; Empty set (0.00 sec)
I assumed that so_id = 6604319651 would be deleted at some later point in the replication stream. It is:
# at 1040669677 #190328 17:56:35 server id 2017081601 end_log_pos 1040669911 [...] ### DELETE FROM `nebulous`.`storage_object` WHERE so_id=6604319651 [...] # at 1040669911
So...
- I stopped replication: STOP SLAVE
- I skipped one stmt: SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1
- I manually replayed the INSERT statement following # at 1040641174. Even though I thought the replica would restart with it, I was not exactly sure it would start there (better safe than sorry!). I checked that, on db01, the last so_id was 7876537519 and that 7876537520 was not present.
- I restarted replication (START SLAVE) which immediately crashed since it tried to insert 7876537520. It crashed at the expected position: 1040641548
- I skipped the (now offending) INSERT: STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE
- Then I waited for the crash at the DELETE statement for so_id = 6604319651 (since there was nothing to delete), which happened a few hundredths of a second later.
- I skipped the offending DELETE statement: STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE
- And the replication seems fine so far
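The binlog spelunking above (decoding with mysqlbinlog, then picking out the events that touch one row) can be sketched as a small helper. This is an illustrative sketch, not part of the IPP tooling; it only assumes the standard '# at <pos>' event headers that mysqlbinlog --verbose emits.

```python
# Hypothetical helper (not an existing IPP tool): scan a decoded
# `mysqlbinlog --verbose` dump and report the "# at <pos>" headers of
# events whose body mentions a given row id, e.g. so_id=6604319651.
def events_for_row(dump_lines, row_id):
    hits = []
    current_pos = None
    matched = False
    for line in dump_lines:
        if line.startswith("# at "):
            # A new event starts; record the previous one if it matched.
            if matched and current_pos is not None:
                hits.append(current_pos)
            current_pos = line.strip()
            matched = False
        elif str(row_id) in line:
            matched = True
    if matched and current_pos is not None:
        hits.append(current_pos)
    return hits
```

Run over dump.txt for 6604319651, this would surface both the UPDATE and the later DELETE event positions in one pass instead of eyeballing the file.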
2013
2013-01-04 09:10
Last_error value when executing SHOW SLAVE STATUS on that host is: Error 'Duplicate entry ...
Fixed with the usual: stop slave; set global sql_slave_skip_counter=1; start slave;
2012
2012-12-09
Last_error value when executing SHOW SLAVE STATUS on that host is: Error 'Duplicate entry ...
Fixed with the usual: stop slave; set global sql_slave_skip_counter=1; start slave;
2012-12-07
Last_error value when executing SHOW SLAVE STATUS on that host is: Error 'Duplicate entry ...
Fixed with the usual: stop slave; set global sql_slave_skip_counter=1; start slave;
2012-11-14
Last_error value when executing SHOW SLAVE STATUS on that host is: Error 'Duplicate entry '21338906' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile(diff_id, skycell_id, warp1, stack1, warp2, stack2, tess_id) SELECT diff_id, skycell_id, warp1, stack1, warp2, stack2, tess_id from skycellsToDiff'
Fixed with the usual: stop slave; set global sql_slave_skip_counter=1; start slave;
2012-11-13: nebulous backup
Stopped the mysql server on ippc63; rsyncing to ippc62. 2012-11-14: restarted.
2012-11-13: ipp001 crash
No apparent reason. Restarted mysql server. Replication easily caught up.
2012-11-08: ippc17 crash
Error 'Duplicate entry '824346-fpa' for key 1' on query. Default database: 'isp'. Query: 'INSERT INTO [...]
The usual:
stop slave; set global sql_slave_skip_counter=1; start slave;
fixed it.
2012-11-05
Stopped replication from ippdb00 to ippc63. Started rsync from ippc63 to ippc61. (On 2012-11-06, restarted replication)
2012-11-02
Backup complete. Replication to ippc63 restarted.
2012-11-01
Stopped replication and mysql server on ippc63 to make a backup to ippc62
2012-10-23
Again a few STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;
2012-10-21/22 gpc1 Reingestion
Reingestion on ippdb03 and ipp001
2012-10-15 ippdb01 -> ippdb03
Replication problem again
Error 'Duplicate entry '19665422' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'
Fixed ippdb03 with the recipe:
stop slave; set global sql_slave_skip_counter=1; start slave;
2012-10-11: ippdb01 -> ippdb03/ipp001
Replication problem happened twice last night.
Error 'Duplicate entry '19442342' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'
Error 'Duplicate entry '19446512' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'
2012-10-10: ippdb01 -> ippdb03/ipp001
Again
2012-10-09: ippdb01 -> ippdb03/ipp001
Same problem as yesterday.
Error 'Duplicate entry '19309559' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'
Same fix:
STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;
2012-10-08: ippdb01 -> ippdb03/ipp001
Error 'Duplicate entry '19172463' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'
Fix:
STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;
2012-08-25 / ippdb01 to ippdb03-ipp001
difftool.c allows replaying an insert statement twice, which breaks replication.
2012-07-06 / ipp001/gpc1
Error 'Cannot add or update a child row: a foreign key constraint fails (gpc1/distComponent, CONSTRAINT distComponent_ibfk_1 FOREIGN KEY (dist_id) REFERENCES distRun (dist_id))' on query. Default database: 'gpc1'. Query: 'INSERT INTO distComponent (dist_id, component, bytes, md5sum, state, outdir, name, fault) VALUES (2066813, 'skycell.7.17', 108129, 'e5f49b59c7f693649f40a33e7a3980b9', 'full', 'neb://stsci08.0/gpc1/dist/STS.nightlyscience/2012/06/13/o6091g0322o/warp.2066813', 'o6091g0322o.498476.wrp.432885.skycell.7.17.tgz', 0)'
Tried to skip the offending statement without success.
2012-06-12 -> 2012-06-14
After the ippdb01 crash, see TODO
20120529-20120529 ippdb00/ippdb02
Full nebulous ingestion. Details here
20120427/11:05 ipp001
ipp001 replication broke because of 'Duplicate entry '32' for key 1' on query. Default database: 'ippadmin'. Query: 'insert into users (username, password) values ('chenx', 'chenx')'. I ran the usual STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;. Then 'Duplicate entry '30' for key 1' on query. Default database: 'ippadmin'. Query: 'INSERT into cookies (username, cookie) values ('chenx', '32709')'. Another STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; fixed it.
20120313/12:05 ipp001
ipp001 replication is broken again (broken foreign keys?! e.g. Cannot add or update a child row: a foreign key constraint fails (`gpc1/chipProcessedImfile`, CONSTRAINT `chipProcessedImfile_ibfk_1` FOREIGN KEY (`chip_id`, `exp_id`) REFERENCES `chipRun`)). Made a manual backup of ippdb03 to /export/ippdb03.0/mysql-dumps/backup_20120313T1208.sql.bz2
Master_Log_File: mysqld-bin.020936
Read_Master_Log_Pos: 73345738
20120312/13:31 ippdb03/ipp001
Error 'Table 'skycells.skycell' doesn't exist' on query. Default database: 'gpc1'. Query: 'update staticskyRun join staticskyInput using(sky_id) join stackRun using(stack_id) join skycells.skycell using(skycell_id) set staticskyRun.state = 'new.wait' where staticskyRun.label ='lap.threepi.20110809' and staticskyRun.state ='new' and ra/15 > 12'
From Bill: "I did use the table in an update last night. That was a one time command. The reason that the table is not in gpc1 is that I'm not finished with it yet and I didn't want to cause problems with replication. Apparently using it for a table update caused a problem."
I ran the usual STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; on ippdb03.
It looks, though, like there are other problems (multiple insert attempts for the same pk) on ipp001.
20120223/08:40: ippdb03
Duplicate entry 'XXXXX-X' for key 1' on query. Default database: 'ipptopsps'. I ran ~schastel/dev/many_failures/fix_replication.sh about 200 times without success. I talked to Roy who will replicate ipptopsps somewhere else. I'm dropping it from replication for now.
I stopped the mysql server on ippdb03, modified /etc/mysql/my.cnf (commented out ipptopsps replication), and restarted the server.
20120222/09:21: ippc19
Replication on ippc19 broke. The 'SHOW SLAVE STATUS\G' command showed the following message: "Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0". /var/log/mysql/mysqld.err showed that replication was broken at log 'mysqld-bin.000639' position 1041257298 (see message 120222 8:46:23). I stopped the slave, changed the master coordinates on the slave (with the input found in /var/log/mysql/mysqld.err) and restarted the slave, that is:
STOP SLAVE; CHANGE MASTER TO MASTER_LOG_FILE='mysqld-bin.000639', MASTER_LOG_POS=1041257298; START SLAVE;
Then "SHOW SLAVE STATUS\G" commands showed that replication worked again (Seconds_Behind_Master is decreasing).
20120203/08:51: ippc19
ippc17 experienced crashes yesterday and tables were corrupted (see http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/PS1_IPP_CzarLog_20120130#Friday:2012.02.03). Repairing the tables crashed the replication: the error messages were about "Duplicate entr[ies]".
I ran the usual: STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; about 16 times to fix it.
20120131/08:57: ipp001
Ken Smith told us about ipp001 unavailability. The mysql server on that host crashed for some unknown reason (it was 62000 seconds behind so I assume it happened around 15:00 HST on Jan 30? Maybe it's related to the global loss of connection to the development cluster that happened around that time). I restarted the mysql server and had to fix it with STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;
2011
20111121/08:47: ippdb03/ipp001
Two instances of czarpoll were started at the same time, running two "OPTIMIZE TABLE burntool". Replication broke on both clients. The message on both client was: "Query partially completed on the master (error on master: 1053) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'OPTIMIZE TABLE burntool'"
Fixed with STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;
20111117/16:23: ipp001 crashed
mysqld crashed. Nothing interesting in mysqld.err. Once restarted, slave resumed normally.
20111115/09:23: ippc19 errors
Errors like 'Duplicate entry...' after ippc17 crash last night. The usual "STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;" fixed the problem.
20111109/12:11: ippdb03/ipp001 errors
Alterations to gpc1:
mysql> ALTER TABLE diffInputSkyfile DROP PRIMARY KEY, ADD PRIMARY KEY(diff_skyfile_id);
Query OK, 11661867 rows affected (12 min 37.50 sec)
Records: 11661867  Duplicates: 0  Warnings: 0
mysql> ALTER TABLE diffInputSkyfile CHANGE COLUMN skycell_id skycell_id VARCHAR(64) DEFAULT NULL;
Query OK, 11661867 rows affected (11 min 28.22 sec)
Records: 11661867  Duplicates: 0  Warnings: 0
mysql> UPDATE dbversion set schema_version = '1.1.71', updated= CURRENT_TIMESTAMP();
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
The first statement broke replication on both ippdb03 and ipp001 (because of duplicate diff_skyfile_id = 10264768 rows in diffInputSkyfile; the error message was: "ERROR 1062 (23000): Duplicate entry '10264768' for key 1"). On both replicants I ran:
mysql> STOP SLAVE;
mysql> delete from diffInputSkyfile where diff_id = 162937 and skycell_id = 'skycell.1485.049';
mysql> update diffInputSkyfile set diff_skyfile_id = 10264767 where diff_id = 162937 and skycell_id = 'skycell.1485.059';
and ran the ALTER command again:
mysql> ALTER TABLE diffInputSkyfile DROP PRIMARY KEY, ADD PRIMARY KEY(diff_skyfile_id);
and then:
mysql> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
20110921/13:56: ippdb03 error
From Anonymous: "screwed-up the replication to ippdb03 while adding the new ippMonitor user (i didn't think i could update a slave, so believed i was safe). i've tried to fix it"
The usual STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE; fixed it
20110913/1720: ippdb03 "error"
Not really an error actually.
isp was being updated with the statement (see SHOW FULL PROCESSLIST or look at mysqld.err):
INSERT INTO summitExp SELECT NULL, incoming.*, NULL, 0, NULL FROM incoming LEFT JOIN summitExp USING(exp_name, camera, telescope) WHERE summitExp.exp_name is NULL AND summitExp.camera is NULL AND summitExp.telescope is NULL
I killed the process in the mysql server and skipped the statement: STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
Note: I had to do it twice.
20110913/0930: ipp001 error
Error 'The table 'incoming' is full' on query. Default database: 'isp'. Query:
'INSERT INTO incoming (exp_name, camera, telescope, dateobs, exp_type, uri) VALUES
('o5253i0240o01', 'isp', 'ps1', '2010-02-26 09:29:39', 'OBJECT',
'http://otis3.ifa.hawaii.edu/ds/skyprobe/o5253i0240o01/index.txt')'
I don't understand why this _temporary_ table can't be created. I dropped isp from replication.
20110801/1505: ippdb03 error
ippdb03 crashed often during the last weekend. SHOW SLAVE STATUS\G showed this error message:
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: ippdb01.ifa.hawaii.edu
Master_User: repl_gpc1
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysqld-bin.020270
Read_Master_Log_Pos: 962746384
Relay_Log_File: mysqld-relay-bin.000005
Relay_Log_Pos: 24123883
Relay_Master_Log_File: mysqld-bin.020269
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB: gpc1,czardb,ippadmin,isp
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary
log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is
corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the
master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be
able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
Skip_Counter: 1
Exec_Master_Log_Pos: 24123745
Relay_Log_Space: 2036505138
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
The relay log was corrupted. I ran 'mysqlbinlog mysqld-relay-bin.000005 > /dev/null' and saw an error message looking like:
ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 0, event_type: 0 Could not read entry at offset <some value>:Error in log format or read error
I googled a bit and found what fixed the problem: http://www.mysqlperformanceblog.com/2008/08/02/troubleshooting-relay-log-corruption-in-mysql/
Precisely, I did: CHANGE MASTER TO master_log_file='mysqld-bin.020269', master_log_pos=24123745;
The problem seems to have been fixed.
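The recovery above re-points the slave at the master coordinates it has already executed (Relay_Master_Log_File / Exec_Master_Log_Pos), discarding the corrupt relay logs. A minimal sketch of deriving that statement from SHOW SLAVE STATUS\G output; the helper name is made up, and the parsing only assumes the usual 'Field: value' layout:

```python
# Hypothetical helper: given the text of SHOW SLAVE STATUS\G, build the
# CHANGE MASTER statement for relay-log-corruption recovery. The correct
# coordinates are the *executed* master position (Relay_Master_Log_File /
# Exec_Master_Log_Pos), not the read position (Master_Log_File /
# Read_Master_Log_Pos).
def recovery_statement(status_text):
    fields = {}
    for line in status_text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return ("CHANGE MASTER TO MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;"
            % (fields["Relay_Master_Log_File"], fields["Exec_Master_Log_Pos"]))
```

Fed the status dump above, this produces exactly the statement that was run: CHANGE MASTER TO MASTER_LOG_FILE='mysqld-bin.020269', MASTER_LOG_POS=24123745;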
20110727/1410: ippdb03 Replication Error
ippdb03 crashed. The SHOW SLAVE STATUS messages were like: 'Duplicate entry ...'
I applied STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE; about twenty times and everything looks fine now.
20110726/1810: ippdb03 Replication Error
ippdb03 crashed. The SHOW SLAVE STATUS messages were like: 'Duplicate entry ...'
I applied STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE; four times and everything looks fine now.
20110429/0850: gpc1 Replication Error
From the mysql slave on ippc02 (SHOW SLAVE STATUS): Query partially completed on the master (error on master: 1317) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'drop table lapExp'.
I checked with Chris. I manually dropped the table in gpc1 and did a 'STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;'. Hopefully we will not have to ingest a new dump once again...
20110413/1450: ippadmin Replication Error
- Logged onto ipp001
- Started mysql client as root
- 'SHOW SLAVE STATUS\G' shows the following error:
Last_Error: Error 'Duplicate entry '26' for key 1' on query. Default database: 'ippadmin'. Query: 'insert into users (username, password) values ('denver', '********')'
- I did the usual 'STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;'
- The slave is 172833 seconds (i.e. 2 days) behind its master. It clearly shows that no one looks at the replication server statuses.
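Since lag like this goes unnoticed, a trivial check on Seconds_Behind_Master could be run periodically. A minimal sketch; the function name and the one-hour threshold are assumptions for illustration:

```python
# Minimal replication-lag check (illustrative; the 1-hour threshold is an
# assumption). Seconds_Behind_Master is NULL when the SQL thread is stopped,
# which is always worth an alert.
def lag_warning(seconds_behind, threshold=3600):
    if seconds_behind is None:
        return "replication stopped"
    if seconds_behind > threshold:
        return "slave is %.1f days behind" % (seconds_behind / 86400.0)
    return None

lag_warning(172833)  # the case above: 'slave is 2.0 days behind'
```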
20110307/0830: Nebulous Replication Error
- Logged onto ippdb02
- Started mysql client as root
- 'SHOW SLAVE STATUS\G' shows the following error:
Error 'Table 'nebulous.mountedvol' doesn't exist' on query. Default database: 'nebulous'. Query: 'REPLACE INTO mountedvol SELECT vol_id,name,host,path,allocate,available,xattr,mountpoint, '10651598144', '8973230080',note FROM volume WHERE mountpoint = '/data/ipp004.0''
- But if I check the contents of the database, I can see that nebulous.mountedvol exists! It seems that the replicated nebulous didn't survive last week's optimization.
- I'm setting up a new replicated server.
2010
20101116/1659: gpc1 Replication Error
- Logged onto ipp001
- Started mysql client as root
- 'SHOW SLAVE STATUS\G' shows error similar to 20101104/1429: gpc1 Replication Error
- Same solution was applied (Statement skipped)
20101104/1429: gpc1 Replication Error
Problem detected through czartool
- Logged onto ipp001
- Started mysql client as root
- 'SHOW SLAVE STATUS\G' showed (shortened)
Slave_SQL_Running: No
Last_Errno: 1062
Last_Error: Error 'Duplicate entry '23' for key 1' on query. Default database: 'ippadmin'. Query: 'INSERT into cookies (username, cookie) values ('psobs', '99497')'
- Local non-replicated database involved... Statement skipped (in the MySQL client):
STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
- Then 'SHOW SLAVE STATUS\G' showed (shortened)
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
- Czartool didn't show any problem after the fix
