PS1 IPP Czar Logs for the week 2011.01.03 - 2011.01.09
Monday : 2011.01.03
heather reverted (using regtool -revert...) burntool/registration.
bill and eugene have turned off processing because we are out of disk space.
Tuesday : 2011.01.04
bill is czar today
- It appears that all data from last night has had burntool applied.
- 12:30 Set stdscience to 'run' and added ThreePi.nightlyscience back in.
- 12:52 We seem to be getting a fairly high rate of faults due to NFS errors.
Wednesday : 2011.01.05
Bill is czar today
- (serge/07:40) cam revert on
- (serge/08:39) publishing restarted
- 10:00 warp stuck: lots of entries in the warpPendingSkyCell book in done state. ipp049 is not responding to ssh, and 4 warps are stuck running there. Stopped everything for a while to let jobs finish, then reset the books (warp.reset, chip.reset, etc.).
- 10:51 Gavin rebooted ipp049. publish was getting lots of faults. Stopped it and asked Serge to investigate.
- 11:35 Turned off some reverts in order to debug the fault rate. Also set poll limit to 32 to reduce the load in order to get an idea whether that is the problem or not.
- 12:20 Two chips failed repeatedly. It turned out that the log files had a storage object but no instances. Fixed with neb-mv.
- Serge found the origin of the publishFailures (some runs got queued for a client with a non-existent data store)
- turned warp off to allow the diffs to make better progress. poll limit is 64
- 12:41 Turned warp back on. Still getting failures even with only a few jobs running. There is at least one corrupt camRun (152676). See http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/PS1_Operations/broken_files
- 13:29 Found another corrupt file: warp_id 142806, skycell.1162.062. Increased poll limit to 128.
- 13:45 Found 2 publishRuns that were failing and reverting repeatedly; the result was 9 GB of log files.
- 14:06 ran nightly_science.pl --queue_stacks --date 2011-01-05
- 15:00 Gave up trying to debug the cause of the high fault rate. All reverts back on.
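Runs that fault and revert in a loop, like the two publishRuns found at 13:45, can be spotted by counting faults per run. The following is only a minimal sketch: the function name `repeat_offenders`, the threshold, and the sample events are hypothetical and are not part of the IPP tools or taken from the real logs.

```python
from collections import Counter


def repeat_offenders(fault_events, min_faults=3):
    """Return (run, fault_count) pairs for runs seen at least min_faults times.

    fault_events is an iterable of hashable run identifiers, one entry per
    observed fault. Here each identifier is a hypothetical (stage, run_id) tuple.
    """
    counts = Counter(fault_events)
    return sorted((run, n) for run, n in counts.items() if n >= min_faults)


# Hypothetical fault events, one tuple per fault; not real run ids.
events = [
    ("publish", 101), ("publish", 101), ("warp", 7), ("publish", 101),
    ("publish", 102), ("publish", 102), ("publish", 102), ("publish", 102),
]
offenders = repeat_offenders(events)
for run, n in offenders:
    print(run, n)
```

A run that shows up here is a candidate for having its reverts switched off while the underlying cause is investigated, so that it stops generating log output.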
Thursday : 2011.01.06
serge is czar
- (bills 06:57) figured out why ssdiffs aren't getting queued for MD03. warps and stacks were done with MD03.V2 but the survey task still had the MD03 template.
- registration is stuck. I am investigating.
- (bills 07:45) I reverted faults but issued the command incorrectly and reverted over 20000 old faults. Set newExp.state to 'wait' where state = 'run' and exp_id < 273800.
- (bills) 08:41 burntool is proceeding slowly. All but 5 or so chips are finished, and the query for pending files is slow compared to the time it takes to run burntool, so there are no jobs to run most of the time.
- (roy/heather/serge) 08:52 burntool/registration very slow. Saw no failed registration chips, so restarted registration server. Saw worrying message in registration log:
Can't find regtool at /home/panstarrs/ipp/psconfig//ipp-20101215.lin64/bin/ipp_apply_burntool_single.pl line 47.
Can't find required tools. at /home/panstarrs/ipp/psconfig//ipp-20101215.lin64/bin/ipp_apply_burntool_single.pl line 55.
config error for: ipp_apply_burntool_single.pl --exp_id 274277 --class_id XY30 --this_uri neb://ipp016.0/gpc1/20110106/o5567g0206o/o5567g0206o.ota30.fits --previous_uri neb://ipp016.0/gpc1/20110106/o5567g0205o/o5567g0205o.ota30.fits --dbname gpc1 --verbose
job exit status: 3
job host: ipp012
job dtime: 0.432504
job exit date: Thu Jan 6 08:09:55 2011
- (bills) 10:01 There are 816 magicRuns to process. I turned off magic reverts to look for repeating failures.
- (serge) 10:54 Stopped summitcopy to help registration finish for last night's data
- (serge) 11:08 removed ippc00 host from stdscience
- (from bill) To free up the cluster a bit I've turned off processing of ThreePi data for now. The command is labeltool -updatelabel -set_inactive -label ThreePi.nightlyscience
- (serge) 12:31 cleanup temporarily stopped. ipp009 looks mad (umount.nfs uses 100% of a cpu?!). Let's wait a bit before rebooting it.
- (serge) 13:28 Gene stopped stdscience and rebooted ipp009. Once it was back I restarted stdscience.
- (serge) 15:05 MSS data published to MOPS ds. I reactivated 3pi processing: 'labeltool -updatelabel -label ThreePi.nightlyscience -dbname gpc1 -set_active'. I restarted cleanup and summitcopy. All reverts are set to on in stdscience. All services (except addstar, detrendm and replication) are running.
- (serge) 16:16 Restarted distribution
- (from bill) 16:52 Since the stacks for last night's data hadn't been queued I ran the following by hand: nightly_science.pl --queue_stacks --date 2011-01-06
- (from Gene) 21:09 removed ThreePi.nightlyscience from stdscience label list
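The 07:45 recovery above (parking the accidentally reverted exposures by setting newExp.state to 'wait') amounts to a single scoped UPDATE. The sketch below reproduces that condition against an in-memory SQLite table so it can be tried safely. The two-column schema, the helper name `park_old_exposures`, and the sample rows are assumptions for illustration; the real gpc1 table will have more columns and lives on a different database server.

```python
import sqlite3


def park_old_exposures(conn, cutoff_exp_id):
    """Set state to 'wait' for rows in state 'run' with exp_id below the cutoff.

    Returns the number of rows changed.
    """
    cur = conn.execute(
        "UPDATE newExp SET state = 'wait' WHERE state = 'run' AND exp_id < ?",
        (cutoff_exp_id,),
    )
    conn.commit()
    return cur.rowcount


# Hypothetical stand-in for the real table: only the two columns the log mentions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE newExp (exp_id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO newExp VALUES (?, ?)",
    [(273500, "stop"), (273700, "run"), (273799, "run"),
     (273800, "run"), (274277, "run")],
)

changed = park_old_exposures(conn, 273800)
states = dict(conn.execute("SELECT exp_id, state FROM newExp"))
print(changed, states)
```

Running the same WHERE clause as a SELECT COUNT(*) first is a cheap way to check how many rows a command like this will touch before it is issued.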
Friday : 2011.01.07
serge is czar
- (bills) 05:00 1 file repeatedly failing summit copy and 1 repeatedly failed registration. The problem is that nebulous instances for the image file (copy) and registration log file have been created on ipp025 but that node has been taken out of nebulous. ganglia says that ipp025 has a high load. I can log into it. On ippdb00 I tried force.umount and it successfully unmounted the ipp025 but the remount step never finished. To unstick registration and summit copy I used neb-mv to move the inaccessible instances out of the way.
- (bills) 05:15 summit copy is finished but registration/burntool has 134 unfinished exposures. It seems to be slowly making progress burntooling files, but no files are finishing registration. The burntool jobs are for chips from exposures after o5568g0183o (the one that faulted), so perhaps something is wrong. One chip (XY35) from o5568g0183o has burntool_state == -1. That is probably blocking things from proceeding. I don't see a regtool mode to fix this, so I edited the database and set burntool_state back to zero. That didn't seem to help though.
- (bills) 05:55 We needed data_state changed from check_burntool to pending_burntool. Now we're moving along. Sounds like we may need a revertburntool mode.
- (bills) 06:05 Stopped registration to make rawImfile.burntool_state a key.
- (bills) 06:30 power cycled ipp025. There was a panic message on the console. See https://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/ipp025_log
- (bills) 06:39 reverted warp, diff, and dist faults (probably caused by ipp025)
- (bills) 06:40 added ThreePi.nightlyscience back into stdscience label list. (Gene removed it last night)
- (bills) 07:00 registration/burntool has finished
- (bills) 07:11 Someone submitted a postage stamp by coordinate request through the web interface for a point in M31. CFA has some requests that are blocked by that so I've lowered the priority for the postage stamp label WEB and gpc1 label ps_ud_WEB to let the cfa requests through.
- (serge) 08:34 chip.off, warp.off, stack.off to speed up MD
- (serge) 08:39 chip.on, warp.on, stack.on
- (serge) 09:58 I had to queue the last 3 MD02 exposures for publishing by hand (pubtool -definerun -client_id 1 -label MD02.nightlyscience -dbname gpc1). The weirdest part is that I had to enter the command three times. The first command only queued the first missing exposure (o5568g0132o MD02 z N5568 MD02 center). Once it was published, the same command added the second missing exposure. Successive calls to (pubtool -definerun -client_id 1 -label MD02.nightlyscience -dbname gpc1) didn't add the missing exposures...
Saturday : 2011.01.08
Sunday : 2011.01.09
