
Changes between Version 13 and Version 14 of PS1_IPP_CzarLog_20110103


Timestamp:
Jan 5, 2011, 2:09:34 PM (15 years ago)
Author:
bills
Comment:

--

  • PS1_IPP_CzarLog_20110103

    v13 v14
    19 19
    20 20 === Wednesday : 2011.01.05 ===
     21 Bill is czar today
    21 22
    22  - (serge/07:40) cam revert on
    23  - (serge/08:39) publishing restarted
    24  - (bills/10:00) warp stuck: lots of entries in warpPendingSkyCell book in done state. ipp049 not responding to ssh; 4 warps stuck running there. Stopped everything for a while to let jobs finish. Then reset the books (warp.reset, chip.reset, etc.)
    25  - (bills/10:51) Gavin rebooted ipp049. publish was getting lots of faults. Stopped it and asked Serge to investigate
    26  - (bills/11:35) Turned off some reverts in order to debug the fault rate. Also set poll limit to 32 to reduce the load in order to get an idea whether that is the problem or not.
    27  - (bills/12:20) two chips failed repeatedly. Turned out that the log files had a storage object but no instances. Fixed with neb-mv
     23 * (serge/07:40) cam revert on
     24 * (serge/08:39) publishing restarted
     25 * 10:00 warp stuck: lots of entries in warpPendingSkyCell book in done state. ipp049 not responding to ssh; 4 warps stuck running there. Stopped everything for a while to let jobs finish. Then reset the books (warp.reset, chip.reset, etc.)
     26 * 10:51 Gavin rebooted ipp049. publish was getting lots of faults. Stopped it and asked Serge to investigate
     27 * 11:35 Turned off some reverts in order to debug the fault rate. Also set poll limit to 32 to reduce the load in order to get an idea whether that is the problem or not.
     28 * 12:20 two chips failed repeatedly. Turned out that the log files had a storage object but no instances. Fixed with neb-mv
    28 29   * Serge found the origin of the publishFailures (some runs got queued for a client with a non-existent data store)
    29 30   * turned warp off to allow the diffs to make better progress. poll limit is 64
    30  - (bills/12:41) turned warp back on. Still getting failures even with only a few jobs running. There is at least one corrupt camRun 152676. See http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/PS1_Operations/broken_files
    31  - (bills/13:29) found another corrupt file warp_id 142806 skycell.1162.062 . Increased poll limit to 128
     31 * 12:41 turned warp back on. Still getting failures even with only a few jobs running. There is at least one corrupt camRun 152676. See http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/PS1_Operations/broken_files
     32 * 13:29 found another corrupt file warp_id 142806 skycell.1162.062 . Increased poll limit to 128
     33 * 13:45 found 2 publishRuns that were failing and reverting repeatedly; the result was 9GB of log files
     34 * 14:06 ran nightly_science.pl --queue_stacks --date 2011-01-05
    32 35
    33 36 === Thursday : 2011.01.06 ===