- Introduction
- Nightly Products (as of 2015 onward, ie PSNSC)
- About Czar Pages: What to do if the czar pages don't update
- Getting started and checking processing status
- Using pantasks
- Morning duties: checking summitcopy and burntool
- Stopping and starting nebulous
- Stopping and starting the pantasks servers
- Monitoring LAP processing
- Queuing/Dequeuing data
- Running the microtest scripts
- Common problems and their solutions
- Repairing or dropping a bad instance
- Diff failures
- burntool doesn't start…
- Czartool reports negative burntool stats
- Finding log files
- Reverting
- If component fails again after reverting
- Re-adding a nightly-science date to pantasks
- Removing a host
- Raid BBU and Nodes with Targeted Data
- Corrupted data before the current stage
- Pausing a task(?)
- Restarting the apache servers
- removing an apache server from nebulous list
- Changing the code
- Node Trouble
- Who to contact
- Czar Logs
- MOPS
- Pantasks Log Archive
Introduction
This page outlines the procedures and responsibilities for the person currently acting as 'IPP Processing Czar'. In a nutshell, these include:
- monitoring the various pantasks servers running on the production cluster using Czartool and pantasks_client
- keeping a close eye on the 'stdscience' pantasks server in particular, using Czartool
- keeping an eye on production cluster load using Ganglia; if the system is down, see #NodeTrouble
- keeping an eye on available disk space using Czartool (or the neb-df command on any production machine) -- this also includes ippc19:homedir (~ipp log archive, user overuse), ipp001:mysql-backup, and log clearing on the apache/nebulous hosts ippc20-c26
- alerting the IPP group to any notable errors or failures (see here for details), and investigating and resolving those that can be
NB You will need to have ipp user access on the production cluster. For convenience, have someone who already has access (anyone on the IPP team) add your ssh public key to ~ipp/.ssh/authorized_keys.
Node use summary -- IPP_NodeUse
Nightly Products (as of 2015 onward, ie PSNSC)
These are the nightly products the czar is responsible for; details are given below.
MOPS -- PSNSC_MOPS -- *SS quads processed through publishing, processed throughout the night and all finished by early morning
- summary for czaring --
- OSS, ESS, MSS and EU, XSS, Bright.nightlyscience labels: chip->warp, WWdiff, publishing -- time critical, problems and requested desperate WWdiffs must be addressed ASAP (must be fully finished early morning) -- first chunk check and email required, any issues resolved before midnight -- must switch soon to full desperate WWdiff check after each chunk finished
- FSS.* targeted chip->warp, WSdiff, publishing -- time critical, faults must be cleared ASAP (must be finished early morning) -- MEH taking care of until further notice
-- below are still being updated
QUB -- PSST_QUB (should have been labeled PSNSC_PSST_QUB) -- MEH taking care of until further notice (201904xx)
- summary for czaring --
- xSS WSdiff catalog products for PSST -- moderately time critical, faults must be cleared by end of day (WS queuing done by MEH)
- targeted QUB.* chip->warp -- time critical, faults must be cleared ASAP
- QUB.* stack, WSdiff, SSdiff manually dealt with by Mark Huber
SNIaF -- PSST_SNIaF -- (should have been labeled PSNSC_SNIaF) -- MEH taking care of until further notice (201904xx)
- summary for czaring -- processing chip->warp, WSdiff, but only warp images distributed -- time critical, faults must be cleared before noon (ended 201808xx)
CFA -- CFA_MD07 -- (should have been labeled PSNSC_CFA_MD07) -- MEH taking care of until further notice (201904xx)
- summary for czaring -- processing chip->warp, nightly stack, WS+SSdiff but not distributed yet -- not time critical, but faults should be cleared before cleanup..
NCU -- PSNSC_NCU -- NCU -- none planned currently, MEH taking care of until further notice (201904xx)
- summary for czaring (Mark Huber responsible for currently 20161026) -- NCU.nightlyscience chip->warp, WSdiffs, publishing for MOPS -- time critical, faults must be cleared ASAP (early morning) for MOPS
- nightly stack, SSdiffs -- TBD
K2 -- similar to SNIaF and 3PI (3PI needs to be documented still) -- none planned currently -- MEH taking care of until further notice (201904xx)
- summary for czaring -- processing chip->warp, WSdiff with only diff cmf distributed to QUB -- fairly time critical like SNIaF, faults must be cleared before noon (WS queuing done by MEH)
Bright -- observations done in twilight and bright full moon time (fully replaces ThreePi observations now except for occasional filler fields) -- setup ongoing progress by MEH
- summary for czaring -- processing chip->warp, WWdiff for MOPS
- WSdiff with only diff cmf distributed to QUB -- time critical, faults must be cleared ASAP (early morning) QUB
- BrightTwi and Bright3Pi only WSdiffs to QUB, split to avoid messing up WWdiff sets during full moon
- details on https://ps1wiki.ifa.hawaii.edu/trac/wiki/EuclidObservations
ThreePi -- looks like it is used off and on instead of BRIGHT -- old processing setup in use
About Czar Pages: What to do if the czar pages don't update
IPP Czar pages are updated every five minutes or so by the czarpoll.pl script. This script runs on ipp113 as ippitc user within a screen session. If, for some reason, czarpoll crashes (it likely means that gpc1 mysql server has been restarted),
1) ssh as ippitc on ipp113 (use ipp113 so that we don't overload ippc18)
2) identify the screen session which is running and reattach it
3) restart czarpoll and then
4) detach from the screen session
The sequence of commands should therefore be something like:
1) ssh as ippitc on ipp113
yourhost:~$ ssh ippitc@ipp113
2) Identify the screen session which is running and reattach it
ippitc@ipp113:/home/panstarrs/ippitc>screen -list
There is a screen on:
18965.CzarPoll (Attached)
1 Socket in /var/run/screen/S-ippitc.
ippitc@ipp113:/home/panstarrs/ippitc>screen -r 18965.CzarPoll
You will then see the last lines of the display, e.g.:
Total time : 0:55.57s
CPU utilisation (percentage) : 57.7%
If these sessions are not running, restart them with
screen -S CzarPoll
screen -S RoboCzar
3) Restart czarpoll from /data/ippc64.1/ippitc/src/ippMonitor/czartool
cd /data/ippc64.1/ippitc/src/ippMonitor/czartool
./czarpoll.pl
Note that you have to 'cd', there is some relative path nonsense in there. You will see something like:
* Checking nightly science status
* Checking Nebulous
* Checking all pantasks servers
* Updating dates
[...]
4) Detach from the screen session by typing CTRL-a d
CTRL-A d
The terminal window is cleared and you should see something like:
[detached] ippitc@ipp113:/home/panstarrs/ippitc>
You can safely logout or do other work
5) do the same for roboczar in a screen session (from /data/ippc64.1/ippitc/src/ippMonitor/czartool)
ippitc@ipp113:/data/ippc64.1/ippitc/src/ippMonitor/czartool>./roboczar.pl
6) Also check on nebdiskd. this should be running on ipp117 (location as of March 2017, after the cluster move to ITC). If nebdiskd is not running, restart it by logging into ipp117 as user ippitc and issuing the command 'nebdiskd'. The program puts itself in the background and sends output to the log file /tmp/nebdiskd.log:
ippitc@ipp117:nebdiskd
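When several czar support processes (czarpoll.pl, roboczar.pl, nebdiskd) need checking at once, the comparison can be done mechanically. The sketch below is not an existing IPP tool: `missing_procs` is a made-up helper name, and in real use its second argument would come from something like `pgrep -l -f czarpoll` on ipp113/ipp117.

```shell
# Hypothetical helper (not part of IPP): list expected processes that are
# absent from a pgrep-style "PID name" listing.
missing_procs() {
    expected=$1   # space-separated process names, e.g. "czarpoll.pl nebdiskd"
    running=$2    # newline-separated "PID name" lines, e.g. from pgrep -l
    for p in $expected; do
        if ! printf '%s\n' "$running" | grep -F -q "$p"; then
            echo "MISSING: $p"
        fi
    done
}

# Example with canned input; prints "MISSING: roboczar.pl"
missing_procs "czarpoll.pl roboczar.pl nebdiskd" "12345 czarpoll.pl
23456 nebdiskd"
```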
Remotely accessing czarpages
Access to the ippmonitor pages is restricted but can be accessed from external systems using a SOCKS proxy to an authorized machine. Various extensions for firefox/chrome exist to manage this easier than globally setting in the browser, one regularly used is foxyproxy on firefox.
- setup ssh tunnel to allowed machine on a particular port (here 9999)
ssh -D 9999 -f -C -q -N user@hostname
- depending on browser extension, configuration will vary but set to use SOCKSv5, host localhost, port 9999 (or which ever port chosen)
Getting started and checking processing status
Czartool makes it relatively easy to check the overall status of the processing pipeline. You can check the status of the various pantasks_servers, how much data was taken at the summit and has been copied to the cluster, and the status of various processes within stdscience, chip, camera, warp, diff etc.
Using pantasks
There are numerous pantasks servers. Their status can be checked with Czartool, but it is often necessary to use a client directly. To do this, first move to the directory corresponding to the server of interest; these directories are all under ~ipp on any cluster machine. For example, go to ~ipp/stdscience/, then run
pantasks_client
To check the current labels being processed:
pantasks: show.labels
Within pantasks, to check processing status, do
pantasks: status
Note: more information on the tasks with status -v
This will return something like
Task Status AV Name Nrun Njobs Ngood Nfail Ntime Command
+- extra.labels.on 0 3 3 0 0 echo
+- extra.labels.off 0 3 3 0 0 echo
+- ns.initday.load 0 3 3 0 0 echo
++ ns.registration.load 0 1331 1331 0 0 automate_stacks.pl
++ ns.chips.load 0 66 66 0 0 automate_stacks.pl
++ ns.chips.run 0 4 4 0 0 automate_stacks.pl
++ ns.stacks.load 0 5825 5825 0 0 automate_stacks.pl
++ ns.stacks.run 0 6 6 0 0 automate_stacks.pl
++ ns.burntool.load 0 8 8 0 0 automate_stacks.pl
++ ns.burntool.run 0 360 360 0 0 ipp_apply_burntool.pl
++ chip.imfile.load 1 48039 48038 0 0 chiptool
++ chip.imfile.run 0 23524 17755 5769 0 chip_imfile.pl
++ chip.advanceexp 0 7514 7514 0 0 chiptool
etc...
The 'AV' column (the leading '+'/'-' markers) translates to Active and Valid, i.e. whether a process is running and whether it is valid at this point in time. For example, above, '+- ns.initday.load' is active ('+') but is not valid at present ('-'), since it is scheduled to run only once per day (to initialize the nightlyscience automation).
The key thing to monitor here is the Nfail column. Depending on the process, different numbers of Nfail as a proportion of Njobs are deemed acceptable.
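As a rough aid, the Nfail/Njobs fraction can be computed mechanically from the status output. The awk sketch below is illustrative only: the 10% default threshold is arbitrary, not an official limit, the field positions assume the column layout shown above, and `flag_failures` is a made-up name.

```shell
# Sketch: read 'status' lines on stdin and flag tasks whose Nfail/Njobs
# fraction exceeds a threshold (default 10%). Field positions assume:
# AV-marker name Nrun Njobs Ngood Nfail Ntime command.
flag_failures() {
    awk -v max="${1:-0.10}" '
        NF >= 8 && $4 ~ /^[0-9]+$/ && $4 > 0 && ($6 / $4) > max {
            printf "%s Nfail=%s/%s\n", $2, $6, $4
        }'
}

# Example: flags chip.imfile.run (5769/23524 ~ 25% failed), not chip.advanceexp
printf '%s\n' \
    '++ chip.imfile.run 0 23524 17755 5769 0 chip_imfile.pl' \
    '++ chip.advanceexp 0 7514 7514 0 0 chiptool' | flag_failures
```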
Morning duties: checking summitcopy and burntool
There is nothing to be processed if data has not been copied from the telescope. This is the job of summitcopy, which runs slowly through the night, then speeds up once observations are complete every day. You can check that it has successfully copied files using Czartool.
After summitcopy comes burntool. If burntool is running then czartool will report it in the nightly science status ('BURNING'). To check this manually, run the following in the stdscience pantasks_client
ns.show.dates
You should see a 'book' entry with today's date, like
<today's date e.g. 2010-08-05> BURNING
If not, something is wrong.
The different steps values are shown in this Wiki page.
If exposures are not being successfully registered at MHPCC then use 'regpeek', eg
trunk/tools/regpeek.pl
Stopping and starting nebulous
Stopping nebulous
- Stop all processing: Make sure that nothing from ipp, ippdvo, ippdor, and any other random user is running (checking http://ippmonitor.ipp.ifa.hawaii.edu/ippMonitor/clusterMonitor2/ might be a good idea)
- Stop apache servers on ippc01-c10 with
ssh <node> sudo /etc/init.d/apache2 stop
- Stop the mysql server on ippdb00 (this may take some time like 15 minutes)
ssh ippdb00 mysqladmin -uroot -pxxx shutdown now
Starting nebulous
- Start the mysql server on ippdb00
ssh ippdb00 /etc/init.d/mysql zap /etc/init.d/mysql start
- Start the apache servers on ippc01-ippc10
ssh <node> sudo /etc/init.d/apache2 start
- Start IPP pantasks
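The per-node apache steps above can be generated mechanically. In the same convention used elsewhere on this page (print the commands for a human to review rather than executing them), a dry-run sketch; `apache_cmds` is a made-up helper name.

```shell
# Dry-run sketch (hypothetical helper): print, rather than execute, the
# apache2 stop/start command for each nebulous front end ippc01..ippc10.
apache_cmds() {
    action=$1   # "stop" or "start"
    for i in $(seq -w 1 10); do   # GNU seq: -w zero-pads to equal width
        echo "ssh ippc$i sudo /etc/init.d/apache2 $action"
    done
}

# Example: review the list, then paste/run the lines by hand
apache_cmds stop
```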
Stopping and starting the pantasks servers
It is occasionally necessary to stop and restart the pantasks_server instances. For example, when it is necessary to update and rebuild the code, or if pantasks itself becomes unresponsive or shows negative values in some columns of the status display (above).
When stopping the pantasks, check with the others to find out if they have rogue pantasks running that should also be stopped (for example, heather with addstars, or chris/mark with rogue stack pantasks). Please check the stopping addstar section below if necessary.
Stopping
To stop a single pantasks server (scheduler) instance
As any user, on any machine, run pantasks_client then
pantasks: stop
Wait until all jobs are finished (all Nrun = 0) then
pantasks: shutdown now
To shut down all pantasks_server instances
check_system.sh stop
Wait 'n' minutes for all Nrun values to be zero, then
check_system.sh shutdown
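The "wait until all Nrun values are zero" condition can be checked mechanically. The filter below is a sketch: it assumes the status column layout shown earlier (Nrun in the third field after the AV marker and task name), and in practice its input would be the status output captured from pantasks_client.

```shell
# Sketch: exit 0 if every Nrun value (field 3) in the status lines on
# stdin is zero, non-zero otherwise -- i.e. "safe to shutdown".
all_nrun_zero() {
    awk 'NF >= 8 && $3 ~ /^[0-9]+$/ && $3 > 0 { busy = 1 } END { exit busy }'
}
```

Usage would be something like `... | all_nrun_zero && echo "safe to shutdown"` in a polling loop; the status text itself is obtained interactively at the pantasks prompt.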
Starting
Each pantasks_server uses a local input and ptolemy.rc file (this file details the machine where the server is to run).
NB for the special case of the addstar server, see this page.
To start all* pantasks servers use
check_system.sh run
NB Don't be tempted to use start
- some
Starting a single server
To start a single server you need to ssh to the relevant machine (found in the ptolemy.rc file for that server) then do the following:
Note: you need to login ippc30 for stdscience server (2019.03.29 updated CCL)
ssh ipp@ippXXX
cd <serverName>
pantasks_server &
pantasks_client
pantasks: server input input
pantasks: setup
pantasks: run
So, for example for stdscience
ssh ippc16
cd stdscience
pantasks_server &
pantasks_client
pantasks: server input input
pantasks: setup
pantasks: run
Note: For replication target.on has to be run for shuffling to happen.
Starting an already running server
For already-running servers, pantasks should be started with the following commands only:
pantasks_client: server input input
pantasks_client: setup
pantasks_client: run
This loads the hosts and labels needed and starts the processing running. See ~ipp/stdscience/input if this is not clear.
Starting all servers
If everything has been shut down, you can start all pantasks with the following in ~ipp:
check_system.sh start.server
check_system.sh setup
check_system.sh run
The first command launches the pantasks_servers on the correct host, the second calls the three commands listed above ( server input input; setup; run ).
Stopping/Starting addstars
There are a number of addstar pantasks currently in use for ippdvo (note, not all may be active):
- ipp004: addstar
- ipp005: addstarlap
- ipp007: addstar007
- ipp008: addstar008
- ipp009: addstar009
- ipp010: addstar010
to stop the addstars
go into each directory and issue these commands:
cd ~ippdvo/addstar
pantasks_client
pantasks: stop
pantasks: status
It is extremely important to wait until all the addstar tasks are done before shutting down addstar. If you do not, you risk:
- causing an exposure to be addstarred twice
- destroying a minidvodb
- destroying the big mergedvodb
to start the addstars
The best is to ask heather for advice on this, because different addstars will be in different states of activity based on the addstar needs. In general it can be restarted as above (for a single addstar), and this will not hurt anything as by default when addstar is started it has nothing to do. This is to keep heather's sanity.
Monitoring LAP processing
LAP should run largely without any intervention other than regular reverting and repairing corrupt files. A detailed look at how to check the status of LAP processing is given here.
CZW PV2 issues
I have written a few scripts that I've been using to quietly fix PV2 processing issues. The details are given below. Note that none of these scripts call any commands that update/reprocess/copy data. After identifying a suggested course of action, the commands that should be issued are printed to the screen. This prevents accidents that could arise from undefined values in the scripts, requiring a person to copy and execute these commands.
Stacking faults
There are remaining unresolved errors in ppStack that cause some stacks to fail with fault 4 or 5. LAP does not block on these faults, as they are considered unrepairable. These can probably be resolved after the ppStack errors are fully diagnosed and repaired. However, until that happens, I've been simply setting these stackRuns to 'drop':
stacktool -updaterun -set_state drop -state new -label LAP.ThreePi.20130717 -fault 4
stacktool -updaterun -set_state drop -state new -label LAP.ThreePi.20130717 -fault 5
Another error that is less clear is the occurrence of fault 2 issues with stacks. These are rarely due to actual NFS errors, but seem to be caused by errors in the warp update process. The symptom is that the warpSkyfile is set to data_state = 'full', but without the pswarp command actually completing successfully. These cause the stacks to continually revert and fail due to the missing warpSkyfile. I've written a script /home/panstarrs/watersc1/PV2.LAP.20130717/fixes/unstick_fault2.pl to help fix these faults. The script scans the logfiles for the faulted stacks, uses the missing file mentioned in that file to determine what needs to be done to fix the missing file. It then prints the runwarpskycell.pl that needs to be run to fix this missing file. In the case that the chip images have been cleaned for this exposure (based on the message printed in the logfile), the appropriate chiptool command is also returned to update it. As the stack log doesn't change, if a chip update is required, it will continue to print that message, even after the chipRun has been updated.
Missing OTA/BT TABLE files
Due to the failure of ipp047, I've been using another script /home/panstarrs/watersc1/bin/chip_auto_repair_helper.pl to generate commands that can be used to identify hidden copies of missing single instance raw files. After identifying which chipImfile has faulted, it checks the existence of the fits image, and if that is found and looks reasonable (exists, not zero-byte), it checks the associated burntool table. If one of these files is missing a reasonable copy, the deneb-locate.py script is called to look for a hidden copy. If one is found, the appropriate copy command is printed. A large amount of diagnostic/md5sum information is also returned for verification.
Queuing/Dequeuing data
Adding data to the queue
Before pantasks can used to manage processing of a particular label, chiptool must first be run to queue data and create that label. The custom here is to write a small script that runs chiptool with the necessary arguments. This script is then left in the stdscience sub-directory with the same name as the survey in question (M31, MD04 etc). This is so that there is a record of what has been queued. An example script would be
#!/bin/csh -f
set label = "M31.Run5.20100408"
set options = ""
set options = "$options -dbname gpc1"
set options = "$options -definebyquery"
set options = "$options -set_end_stage warp"
set options = "$options -set_tess_id M31.V0"
set options = "$options -set_data_group M31.Run5b"
set options = "$options -set_dist_group M31"
set options = "$options -comment M31%"
set options = "$options -dateobs_begin 2009-12-09T00:00:00"
set options = "$options -dateobs_end 2010-03-01T00:00:00"
# set options = "$options -simple -pretend"
chiptool $options -set_label $label -set_workdir neb://@HOST@.0/gpc1/$label
Now the label must be added within pantasks
pantasks: add.label M31.Run5.20100408
Note: the add.label command does not propagate along the IPP chain. After adding it to stdscience, it might be required to add it to distribution server.
According to how things were set up, the system may be told to look for today's date. The command to add all data of a particular day (e.g. 2010-08-06) to the queue is:
ns.add.date 2010-08-06
Note that it may also be necessary to add previous days if the processing has not been finished for them, e.g., if the processing is not complete for the two days before:
ns.add.date 2010-08-05
ns.add.date 2010-08-04
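Typing several ns.add.date lines for a run of days is error-prone; they can be generated instead. This is a sketch (GNU date's `-d` arithmetic assumed, `ns_add_dates` is a made-up name), and the output is meant to be pasted into pantasks_client, not executed by the shell.

```shell
# Sketch: print "ns.add.date" lines for a base date and the N days
# before it (GNU date's -d relative-date arithmetic assumed).
ns_add_dates() {
    base=$1; ndays=$2
    for i in $(seq 0 "$ndays"); do
        date -d "$base - $i day" "+ns.add.date %F"
    done
}

# Example: prints ns.add.date lines for 2010-08-06, -05, -04
ns_add_dates 2010-08-06 2
```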
Removing data from the queue
If a mistake has been made and a label needs to be removed from processing, then
pantasks: del.label M31.nightlyscience
chiptool must then be used to drop the label for data with a state of 'new'.
chiptool -updaterun -set_state drop -label bad_data_label -state new -dbname gpc1
If some of the data has already been processed (i.e. state!=new), then cleanup must be employed. TODO more here
Running the microtest scripts
The microtest data should be correctly automated, but still requires a script to be manually run. The basic pantasks tasks to reduce the microtest data are included in the stdscience/input file, in the add.microtest macro:
macro add.microtest
    add.label microtestMD07.nightlyscience
    add.label microtestMD07.noPattern.nightlyscience
    survey.add.WSdiff microtestMD07.nightlyscience MD07.refstack.20100330 microtestMD07 neb://@HOST@.0/gpc1
    survey.add.WSdiff microtestMD07.noPattern.nightlyscience MD07.refstack.20100330 microtestMD07.noPattern neb://@HOST@.0/gpc1
    survey.add.magic microtestMD07.nightlyscience /data/ipp050.0/gpc1_destreak
    survey.add.magic microtestMD07.noPattern.nightlyscience /data/ipp050.0/gpc1_destreak
end
Once the two labels have made it through magic, the microtest.pl script can be run. You'll need to have ppCoord built and in your path. This isn't built by psbuild; you just need to go into the ppViz directory and do psautogen --enable-optimize && make && make install. This script relies on VerifyStreaks having been run on the data as part of Magic (and being in the proper place). Note that if the VerifyStreaks binary could not be found in the course of the Magic processing, this step will have been skipped. The script is then run as:
microtest.pl --dbhost ippdb01 --dbuser ipp --dbpass XXX --dbname gpc1 --label microtestMD07.nightlyscience --data_group microtestMD07.20XXYYZZ --verbose
Common problems and their solutions
Repairing or dropping a bad instance
Print the status
repair_bad_instance -c xy26 -e 260332
Repair the bad instances
repair_bad_instance -c xy26 -e 260332 -r
Copies a good instance on top of any bad ones
Drop the instance
repair_bad_instance -c xy26 -e 260332 -l
Drop the file if there are no good instances
Diff failures
A detailed guide to failures at the diff stage can be found here:
http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/diff_fixits
A crontab at ipp@ippc22:~/fault_cron_c22/ipp_faultfix.crontab has been turned on to clear fault 5 at the warp+diff stage with quality 42 until a new tag or code fix can be made; fault 4 will retry reverting through the night until ~0500 and then also get quality 42 (20161019 band-aid until possibly a new tag helps).
burntool doesn't start…
burntool requires that all images from the summit for a given night are registered at MHPCC before it can begin processing. Occasionally, an image gets 'stuck', preventing processing from beginning. This is sometimes due to a corrupt file, or just a failure to copy it to MHPCC. So, first check with the camera group that the image is ok.
If image is ok
Assuming that the image is actually good then
- stop summit copy pantasks
- revert the fault
- run summit_copy.pl leaving off the --md5 argument
- set summit copy pantasks to run
This will call dsget without the md5 sum check and update the database.
If image is not ok
We need to tell the system to forget about this image. TODO (This is the summary of what was tried on Wed. 2010-09-08 for o5446g0443o)
update summitExp set exp_type = 'broken', imfiles=0, fault =0 where exp_name = 'o5447g0519o';
-> if it has no effect
- ns.del.date, ns.add.date
-> if it has no effect, check (and possibly change) obs_mode from 3PI to ENGINEERING:
UPDATE rawExp SET obs_mode = 'ENGINEERING' WHERE exp_id = 221762;
If a single OTA is bad, not available, or zero size then summitcopy will automatically set summitExp state drop to not block processing. If the OTA is permanently gone but exposure is okay (check with camera group), it can be manually dropped -- example for o7177g0443o40.fits zero size/gone ota
summitcopy pantasks to stop
pztool -dbname gpc1 -revertcopied -exp_name o7177g0443o -inst gpc1 -telescope ps1 -fault 110
  --> don't think this is needed since -revertcopied should set to run?
pztool -updatepzexp -exp_name o7177g0443o -inst gpc1 -telescope ps1 -set_state run -summit_id 918643 -dbname gpc1
delete from summitImfile where summit_id=918643 and file_id="o7177g0443o40.fits";
update summitExp set imfiles=59 where exp_name="o7177g0443o";
summitcopy pantasks to run
Czartool reports negative burntool stats
The burntool stats printed are equivalent to N_exposures_queued - N_exposures_burntooled. A negative value means that that target has been doubly queued.
Finding log files
On the main Czartool display, if there are any faults, they are shown in parenthesis. These in fact form links that will take you to the relevant ippMonitor page for the processing stage and label in question. Here a table will list details of the offending exposures, one column of which is 'state' (which should be 'new'). Linking from these will display the log for that particular exposure (or chip) from which the error may be diagnosed.
Reverting
When exposures fail at a certain stage (chip, cam, warp etc) they are given a 'fault' code:
| Code | Description |
| 1 | Error of unknown nature |
| 2 | Error with a system call (often an NFS error) |
| 3 | Error with configuration |
| 4 | Error in programming (look also for aborts) |
| 5 | Error with data |
| 6 | Error due to timeout |
| >6 | Reserved for magic |
It is sometimes possible to 'revert' certain failed exposures. Reverting simply means attempting to process an exposure a second time in case the cause of the fault was temporary, for example an NFS error. Faults like these are usually given fault code '2'. Turning reverts on via the czartool page will attempt to revert all those exposures that failed with code '2'. Behind the scenes, czartool is using pantasks_client to perform the reverts, as described in the next section.
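For quick reference at the terminal, the fault-code table translates directly into a small lookup. This is only a sketch; the descriptions are copied from the table above and `fault_desc` is a made-up name, not an IPP tool.

```shell
# Sketch: map an IPP fault code to its description, per the table above.
fault_desc() {
    case $1 in
        1) echo "Error of unknown nature" ;;
        2) echo "Error with a system call (often an NFS error)" ;;
        3) echo "Error with configuration" ;;
        4) echo "Error in programming (look also for aborts)" ;;
        5) echo "Error with data" ;;
        6) echo "Error due to timeout" ;;
        *) if [ "$1" -gt 6 ] 2>/dev/null; then
               echo "Reserved for magic"
           else
               echo "Unknown fault code: $1"
           fi ;;
    esac
}

# Example: prints the description for the common NFS-style fault
fault_desc 2
```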
Reverting from pantasks_client
To manually revert failures with fault code 2, do something like the following in pantasks_client
pantasks: warp.revert.on
And off again with
pantasks: warp.revert.off
The process is similar for chip, camera etc. A special case, however, is destreaks which need to be reverted as follows.
From the distribution pantasks_client
destreak.off destreak.revert.on
Then, once there is nothing left to do
destreak.revert.off destreak.on
Reverting faults with codes other than 2
By running the stage tool program directly it may be possible to revert failures with codes other than 2. For example, for the chip stage:
chiptool -revertprocessedimfile -label M31.nightlyscience -fault 4 -dbname gpc1
Similar arguments can be used with warptool, camtool etc.
TODO: The page linked by processing failures in the czartool page should show the command as the ipp user that should be run to fix the problem.
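In the spirit of printing the exact command for a human to run, a revert invocation can be assembled mechanically. This is a sketch: only chiptool's -revertprocessedimfile flag appears in the text above, so the tool and flag names are parameters here rather than assumptions about warptool/camtool's exact options; `revert_cmd` is a made-up name.

```shell
# Sketch: assemble (but do not run) a stage-tool revert command. The tool
# and revert flag are passed in because the exact flag name differs per
# stage tool; chiptool's -revertprocessedimfile is the documented example.
revert_cmd() {
    tool=$1; flag=$2; label=$3; fault=$4
    echo "$tool -$flag -label $label -fault $fault -dbname gpc1"
}

# Example: reproduces the chiptool command shown above
revert_cmd chiptool revertprocessedimfile M31.nightlyscience 4
```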
If component fails again after reverting
If a component fails repeatedly then something is likely wrong with one of its inputs, or perhaps there is a bug in the code. NEITHER of these situations should be ignored.
The log file can provide clues as to the cause of the problem. This page gives an example of how to fix certain failures.
Re-adding a nightly-science date to pantasks
Sometimes, if the stdscience pantasks server has been restarted before all nightlyscience processing has been completed, it may be necessary to re-add the date once the server is back up-and-running. For example, for the date below, stacks were not created or distributed because stdscience had been stopped before all the warps were completed. So, to re-add the date from the warp stage:
pantasks: ns.add.date 2010-09-11
pantasks: ns.set.date 2010-09-11 TOWARP
Removing a host
Troublesome Hosts
Sometimes a particular machine will act unpredictably and should be taken out of processing. To do this, go to each pantasks server in turn and remove the host, ipp016 in the example below
pantasks: controller host off ipp016
We also need to set the same host to a state of 'repair' in nebulous:
neb-host ipp016 repair --note 'problem desc'
be sure to always leave a note
This leaves the machine accessible, but no new data can be allocated to it. See table below for a guide to the other nebulous states
| state | allocate? | available? |
| up | yes | yes |
| down | no | no |
| repair | no | yes |
Running neb-host with no arguments gives you a summary of the above for all hosts.
Non-Troublesome Hosts
The same commands can be used for non-troublesome hosts.
The controller machines command shows the list of hosts and 3 values: the first value is the number of connections from the pantasks server to the host. The controller host off <hostname> command has to be repeated as many times as the number of connections.
It may also happen that a working host has to be removed (if it was temporarily added to better share the load because of some machine failure for instance). The controller status command details the activity for each connection. Hosts should be removed only if they have the RESP(onding?) or IDLE status (so wait for the running tasks to complete).
Note: The controller host check <hostname> command only shows ONE connection status (TODO SC: I can't tell which one).
(SC TODO) controller status shows something looking like an addresse (e.g., 0.0.0.7d) which is different for each connection. It seems it's not possible to remove a particular connection. Am I right?
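Since controller host off must be repeated once per connection, the repeated lines can be generated ahead of time and pasted into pantasks_client. A sketch (`host_off_cmds` is a made-up name; the connection count comes from controller machines):

```shell
# Sketch: emit "controller host off <host>" once per connection, since
# the command must be repeated as many times as the connection count
# reported by "controller machines".
host_off_cmds() {
    host=$1; nconn=$2
    i=0
    while [ "$i" -lt "$nconn" ]; do
        echo "controller host off $host"
        i=$((i + 1))
    done
}

# Example: three connections to ipp016 -> three lines to paste
host_off_cmds ipp016 3
```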
Raid BBU and Nodes with Targeted Data
Many times a node becomes problematic when the raid BBU goes offline and WriteThrough (rather than WriteBuffer) mode is activated -- definitely happens for a data targeted node, but can also happen for an untargeted node if there are very few targeted nodes with free space. The node should be checked and if BBU needs attention then notify the ipp-dev list for the appropriate person to do so. To check the BBU status, there will likely be an entry in the /var/log/messages from MR_MONITOR for WB changed to WT
Jan 14 06:23:49 ipp087 MR_MONITOR[16845]: <MRMON195> Controller ID: 0 BBU disabled; changing WB logical drives to WT, Forced WB VDs are not affected
Jan 14 06:23:49 ipp087 Event ID:195
Jan 14 06:23:49 ipp087 MR_MONITOR[16845]: <MRMON054> Controller ID: 0 Policy change on VD: 0
Jan 14 06:23:49 ipp087 Current = Current Write Policy: Write Back;
and/or a ps-ipp-ops email like
Controller ID: 0 Battery has failed and cannot support data retention. Please replace the battery Generated on:Mon Feb 22 08:56:06 2016
There are two possible commands depending on the node; if you try the command as given on a machine that doesn't support it, it should fail harmlessly (these require sudo, and be careful with these commands not to mistype or truncate) --
- ipp083 (add more to list here as go)
--> MegaCli64 -AdpBbuCmd -aAll
Battery State: Optimal
... lots of other details ...
or, if the BBU is really bad, there is no Battery State and you will see that it needs to be replaced:
Battery Replacement required : Yes
Remaining Capacity Low : Yes
--> MegaCli64 -LDInfo -Lall -aALL
Default Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
- want WriteBack for Current Cache Policy not WriteThrough
- ipp096 (add more to list here as go)
--> tw_cli /c0 show
Unit  UnitType  Status  %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    OK      -       -       256K    81956.2   RiW    ON
... other raid info ...
Name  OnlineState  BBUReady  Status  Volt  Temp  Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK      OK    OK    85     18-Jul-2015
- want to see Cache RiW
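To spot the WriteThrough fallback without reading the full output, the 'Current Cache Policy' line can be checked mechanically. A sketch (feed it the MegaCli64 -LDInfo output; `check_cache_policy` is a made-up name, and a node with no such line also reports OK, so this is only a quick screen):

```shell
# Sketch: read MegaCli64 -LDInfo output on stdin and report whether the
# current cache policy has fallen back from WriteBack to WriteThrough.
# (Input with no "Current Cache Policy" line also reports OK.)
check_cache_policy() {
    if grep '^Current Cache Policy:' | grep -q 'WriteThrough'; then
        echo "WARNING: WriteThrough active (BBU problem?)"
    else
        echo "OK: WriteBack"
    fi
}
```

Usage would be something like `sudo MegaCli64 -LDInfo -Lall -aALL | check_cache_policy` on a MegaCli node.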
IF the problem node is a data targeted node and enough data nodes (and most of target data nodes) are available, the problem data node may still be able to take random data, but needs to be removed from the targeted set and pantasks restarted (typically the following day if BBU or full node issues not addressed) -- for the ipp-20141024 tag
~ipp/psconfig/ipp-20141024.lin64/share/pantasks/modules/ipphosts.mhpcc.config
Corrupted data before the current stage
The ~ipp/src/ipp-<version>/tools directory contains a bunch of tools that can be used to fix weird problems. For instance, I had a repeating entry saying that warp had failed because of a corrupted cam-generated file. In my case (there are others that would also do the job):
perl runcameraexp.pl --help
and finally:
perl runcameraexp.pl --cam_id 140104
Pausing a task(?)
I'm not sure that "task" is the right word.
chip.off / warp.off (and to restart chip.on / warp.on)
Restarting the apache servers
You need root access on ippc01, ... ippc10 to perform this.
0) Make sure all activity involving nebulous is stopped (i.e. all pantasks + Roy stuff + condor if any).
1) Log into ippc01, ... ippc10 then:
1.1) Execute
sudo /etc/init.d/apache2 stop; sleep 1; sudo /etc/init.d/apache2 start
Note: if you see a message like:
Starting apache2 ...
(98)Address already in use: make_sock: could not bind to address [::]:80
(98)Address already in use: make_sock: could not bind to address 0.0.0.0:80
no listening sockets available, shutting down
Unable to open logs
 [ ok ]
then run the stop command, wait longer, and run the start command again.
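That stop/wait/retry pattern can be captured in a small helper function. This is just a sketch with a stand-in command (`true`); on the real hosts the command would be the sudo apache2 stop/start pair shown above.

```shell
# Sketch: retry a (re)start command a few times with a pause between
# attempts, as when apache's port is still held by the old process.
# The command passed in below (`true`) is a stand-in for the real start command.
restart_with_retry() {
    cmd="$1"
    max=3
    for i in $(seq 1 "$max"); do
        if $cmd; then
            echo "started on attempt $i"
            return 0
        fi
        sleep 2   # wait longer before the next start attempt
    done
    echo "failed after $max attempts" >&2
    return 1
}

restart_with_retry true
```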
1.2) Check that the apache server is running, i.e. run
ps waux | grep apache
to see lines like:
root 10598 4.8 0.0 303004 26040 ? Ss 09:54 0:00 /usr/sbin/apache2 [...] start [...]
removing an apache server from nebulous list
If an apache server machine is down (and unfixable), you can remove it from the list of nebulous servers:
In ~ipp/.tcshrc comment out the broken server(s), for example:
#set nebservers = ($nebservers http://ippc06/nebulous);
Changing the code
This might mean rebuilding the current 'tag' (reflected in the directory name) or actually installing a new tag.
Creating/Installing a new tag
- Copy trunk to new tag name
svn copy https://svn.pan-starrs.ifa.hawaii.edu/repo/ipp/trunk https://svn.pan-starrs.ifa.hawaii.edu/repo/ipp/tags/ipp-test-20130502
- Check it out in the source directory
svn co https://svn.pan-starrs.ifa.hawaii.edu/repo/ipp/tags/ipp-test-20130502 ipp-test-20130502
- Update psconfig script if needed:
psbuild -bootstrap INSTALL_DIRECTORY
- Build
psbuild -extbuild
psbuild -ops
Rebuilding the current tag
We will use the example of tag 20100701, which is stored under
~ipp/src/ipp-20100701
To update the code and rebuild, shutdown all pantasks (as shown above) then do the following.
cd ~ipp/src/ipp-20100701
svn update
psbuild -ops
(Use psbuild -ops rather than the old psbuild -dev -optimize: -ops includes -dev -optimize -magic, and magic still needs to be built even though it is not actively used.)
Now restart all pantasks (as above).
Installing a new tag
- shutdown all pantasks (as shown above)
- change ~ipp/.tcshrc to point at the new tag (it is good to confirm by logging out and in again)
- fix the files which are still installation specific:
  - edit ~ipp/.ptolemyrc and change CONFDIR to point at the new location
  - copy nebulous.site.pro to the working location (for now, just use the last installation version), eg
    cp psconfig/ipp-20100623.lin64/share/pantasks/modules/nebulous.site.pro psconfig/ipp-20100701.lin64/share/pantasks/modules/nebulous.site.pro
- restart all pantasks (as above)
- For ippMonitor: update the PATH value in /home/panstarrs/ipp/ippconfig/ippmonitor.config and install the new ippMonitor (cd ipp/src/ippMonitor; touch scripts/generate; make)
- For the postage stamp server web interface update the PATH value in /data/ippc17.0/pstamp/work/ipprc
Changes to gpc1 database schema
(From trunk/dbconfig/notes.txt)
When changing the database schema:
- increment the pkg_version number in dbconfig/config.md
- increment the ippdb version number in ippTools/configure.ac (to match)
- increment the ippTools version number in ippTools/configure.ac
- build ippdb ('make src' in dbconfig)
- check in dbconfig, ippdb, and ippTools
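Since several version numbers must be bumped in step, a quick consistency check can catch a mismatch before check-in. The snippet below is only a sketch run on throwaway files; the real file names and line formats in dbconfig and ippTools may differ.

```shell
# Sketch: verify two version strings agree before checking in.
# The file contents and line formats here are assumptions, not the real ones.
tmpdir=$(mktemp -d)
echo 'pkg_version = 1.3' > "$tmpdir/config.md"
echo 'IPPDB_VERSION=1.3' > "$tmpdir/configure.ac"
v1=$(sed -n 's/^pkg_version = //p' "$tmpdir/config.md")
v2=$(sed -n 's/^IPPDB_VERSION=//p' "$tmpdir/configure.ac")
if [ "$v1" = "$v2" ]; then
    echo "versions match: $v1"
else
    echo "MISMATCH: config.md=$v1 configure.ac=$v2" >&2
fi
```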
Node Trouble
Sometimes nodes will crash or become unresponsive -- be sure to check the console messages, and whether the node is possibly under heavy job load, before power cycling.
WARNING for the -- stsci nodes --: the consoles and managed PDUs are connected to the compute node, which can be power cycled like other nodes if it becomes unresponsive. The JBOD enclosure, however, requires onsite intervention to power on; after a power outage/facility shutdown, the compute node should not be powered up without confirmation that the JBODs are up and running first. If the JBOD is unavailable, the RAID controller will not detect the storage volumes and will request user intervention in order to continue; onsite examination of the JBOD will then be required.
(add console login information)
console --user <username> -v <hostname>
^p -- PDU menu
4  -- off for a minute
3  -- on
1  -- exit and monitor boot process (any boot or RAID errors etc)
- be sure to log in via console and check that home and a data node are mounted, the date is correct, etc
- the .consolerc file should be kept up-to-date in the ~ipp home directory and must be updated whenever new data nodes start taking nightly processing data -- a version also exists in the trunk svn under hardware/dotConsolerc. Information on which node is in which cab can be found on https://ps1wiki.ifa.hawaii.edu/trac/wiki/IppPduGuide
WARNING: if a node does not reboot, return the PDU to off to prevent a possible burnout/fire situation (i.e. do not flip power and walk away). Some nodes take 3-5 minutes to reach the boot screen.
There have been cases where ippops2 reboots itself and ganglia does not restart -- in that case:
# restart ganglia monitor
/etc/init.d/gmond restart
# restart gmetad web frontend
/etc/init.d/gmetad restart
The IPP svn/wiki is on ipp002 and has also been known to crash, once rebooted the svn and wiki should be available again.
Who to contact
Any problems or concerns should be reported to the ipp development mailing list:
ps-ipp-dev@ifa.hawaii.edu
Different members of the IPP team are responsible for different parts of the code, and the relevant person will hopefully address the issue.
Czar Logs
The following links show pages of czar activities.
- 2010-09-21: ipp020 failed
- Replication Log: Replication Issues wiki page
- Czar Logs wiki pages
MOPS
- how to manually re-run for MOPS: http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/HowToManuallyRerunForMops
Pantasks Log Archive
(from Serge's notes somewhere, but is a czar responsibility)
When restarting the nightly pantasks with start_server.sh, the pantasks logs get archived under a logs directory. These logs build up over time, particularly with regularly recurring unfixed errors/faults (e.g. in cleanup), and can reach 10-20GB in size over a few months. Without the start_server.sh script they grow to large sizes under their default names...
The logs need to be regularly compressed and archived
- compress the logs under the pantasks directory on another node (typically a compute node) -- it may be slower, but running on ippc19 would just harass that node computationally. For example, for stdscience:
cd /data/ippc19.0/home/ipp/stdscience/logs
bzip2 -vf */* ; bzip2 -vf */*/* ; bzip2 -vf */*/*/*
- since ippc19 is now the active home node (late 2015/2016), the Archives are placed on ippc19.1 and a symlink Archives_c19.1 in the logs directory points to that location (the previous setup on ippc18 used the symlink Archives, which is broken because the .1 mounts are not exported). Rsync from ippc19, and be sure to use bandwidth limits so as not to overly harass ippc19...
rsync -avP --bwlimit=1000 201511 Archives_c19.1/
- then the logs in the pantasks directory can be cleaned up (bzip2 will already have freed some space) -- typically leave the past/active month available so recent problems can still be checked quickly in the logs.
- an alternative is to just leave the compressed files in logs; the past year only amounted to a few GB for the nightly pantasks -- that way they are mirrored to ippc18 in case the ippc19 raid has problems
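The depth-by-depth wildcard bzip2 calls above can also be done in one pass with find, which handles any nesting depth. The sketch below runs on a scratch directory and assumes log files carry a .log suffix (adjust the -name pattern to match the actual file names; on the cluster the path would be the stdscience logs directory shown above).

```shell
# Sketch: compress all .log files under a logs tree in one pass,
# regardless of directory depth. Demonstrated on a scratch directory.
logs=$(mktemp -d)
mkdir -p "$logs/201511/chip"
echo "fault 2 on exp 12345" > "$logs/201511/chip/chip.log"
find "$logs" -name '*.log' -type f -exec bzip2 -f {} \;
ls "$logs/201511/chip"
```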
