PS1 IPP Czar Logs for the week 2011.01.03 - 2011.01.09
Monday : 2011.01.03
heather reverted burntool/registration (using regtool -revert...).
bill and eugene have turned off processing because we are out of disk space.
Tuesday : 2011.01.04
bill is czar today
- It appears that all data from last night has had burntool applied.
- 12:30 Set stdscience to 'run'; added ThreePi.nightlyscience back in.
- 12:52 we seem to be getting a fairly steady rate of faults due to NFS errors
Wednesday : 2011.01.05
Bill is czar today
- (serge/07:40) cam revert on
- (serge/08:39) publishing restarted
- 10:00 warp stuck: lots of entries in the warpPendingSkyCell book in the done state. ipp049 not responding to ssh; 4 warps stuck running there. Stopped everything for a while to let jobs finish, then reset the books (warp.reset, chip.reset, etc.)
- 10:51 Gavin rebooted ipp049. publish was getting lots of faults; stopped it and asked Serge to investigate.
- 11:35 Turned off some reverts in order to debug the fault rate. Also set the poll limit to 32 to reduce the load and get an idea whether that is the problem or not.
- 12:20 two chips failed repeatedly. Turned out that the log files had a storage object but no instances. Fixed with neb-mv.
- Serge found the origin of the publishFailures (some runs got queued for a client with a non-existent data store)
- turned warp off to allow the diffs to make better progress. poll limit is 64
- 12:41 turned warp back on. Still getting failures even with only a few jobs running. There is at least one corrupt camRun 152676. See http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/PS1_Operations/broken_files
- 13:29 found another corrupt file: warp_id 142806 skycell.1162.062. Increased poll limit to 128.
- 13:45 found 2 publishRuns that were failing and reverting repeatedly; 9 GB of log files was the result.
- 14:06 ran nightly_science.pl --queue_stacks --date 2011-01-05
- 15:00 gave up trying to debug the cause of the high fault rate. All reverts back on.
Thursday : 2011.01.06
- (bills 06:57) figured out why ssdiffs aren't getting queued for MD03. warps and stacks were done with MD03.V2 but the survey task still had the MD03 template.
- registration is stuck. I am investigating.
- (bills 07:45) I reverted faults but issued the command wrong and reverted over 20000 old faults. Fixed by setting newExp.state to 'wait' where state = 'run' and exp_id < 273800.
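The cleanup described in that entry amounts to a single update against the gpc1 database. A hedged sketch, using the newExp table and column names exactly as written in the log (the surrounding schema is an assumption; verify before running):

```sql
-- Sketch of the cleanup step from the log entry above.
-- Assumes the gpc1 database has a newExp table with state and exp_id
-- columns, as implied by the log; check the actual schema first.
UPDATE newExp
   SET state = 'wait'
 WHERE state = 'run'
   AND exp_id < 273800;
```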
- (bills) 08:41 burntool is proceeding slowly. All but 5 or so chips are finished, and the query for pending files is slow compared to the time it takes to run burntool, so there are no jobs to run most of the time.
- (roy/heather/serge) 08:52 burntool/registration very slow. Saw no failed registration chips, so restarted registration server. Saw worrying message in registration log:
Can't find regtool at /home/panstarrs/ipp/psconfig//ipp-20101215.lin64/bin/ipp_apply_burntool_single.pl line 47.
Can't find required tools. at /home/panstarrs/ipp/psconfig//ipp-20101215.lin64/bin/ipp_apply_burntool_single.pl line 55.
config error for: ipp_apply_burntool_single.pl --exp_id 274277 --class_id XY30 --this_uri neb://ipp016.0/gpc1/20110106/o5567g0206o/o5567g0206o.ota30.fits --previous_uri neb://ipp016.0/gpc1/20110106/o5567g0205o/o5567g0205o.ota30.fits --dbname gpc1 --verbose
job exit status: 3
job host: ipp012
job dtime: 0.432504
job exit date: Thu Jan 6 08:09:55 2011
Friday : 2011.01.07
Saturday : 2011.01.08
Sunday : 2011.01.09
