Introduction
This page outlines the procedures and responsibilities for the person currently acting as 'IPP Processing Czar'. In a nutshell, these include:
- monitoring the various pantasks servers running on the production cluster using Czartool and pantasks_client
- keeping a close eye on the 'stdscience' pantasks server in particular, using Czartool
- keeping an eye on production cluster load using Ganglia
- adding and removing labels based on the current set of processing priorities, outlined here
- keeping an eye on available disk space using the neb-df command (on any production machine)
- alerting the IPP group to any notable errors or failures
NB You will need to have ipp user access on the production cluster. For convenience, have someone who already has access (anyone on the IPP team) add your ssh public key to ~ipp/.ssh/authorized_keys.
Getting started and checking processing status
Czartool makes it relatively easy to check the overall status of the processing pipeline. You can check the status of the various pantasks_servers, how much data was taken at the summit and has been copied to the cluster, and the status of various processes within stdscience, chip, camera, warp, diff etc.
Using pantasks
There are numerous pantasks servers. Their status can be checked with Czartool, but it is often necessary to use a client directly. To do this, first move to the directory corresponding to the server of interest; these directories all live under ~ipp on any cluster machine. For example, go to ~ipp/stdscience/, then run
pantasks_client
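Putting that together, a typical session to inspect the stdscience server might look like this (the host name is just an example; any production machine with access to the ipp account will do):

ssh ipp@ippc16
cd ~ipp/stdscience/
pantasks_client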
To check the current labels being processed:
pantasks: show.labels
Within pantasks, to check processing status, do
pantasks: status
This will return something like
Task Status
  AV  Name                     Nrun  Njobs  Ngood  Nfail  Ntime  Command
  +-  extra.labels.on             0      3      3      0      0  echo
  +-  extra.labels.off            0      3      3      0      0  echo
  +-  ns.initday.load             0      3      3      0      0  echo
  ++  ns.registration.load        0   1331   1331      0      0  automate_stacks.pl
  ++  ns.chips.load               0     66     66      0      0  automate_stacks.pl
  ++  ns.chips.run                0      4      4      0      0  automate_stacks.pl
  ++  ns.stacks.load              0   5825   5825      0      0  automate_stacks.pl
  ++  ns.stacks.run               0      6      6      0      0  automate_stacks.pl
  ++  ns.burntool.load            0      8      8      0      0  automate_stacks.pl
  ++  ns.burntool.run             0    360    360      0      0  ipp_apply_burntool.pl
  ++  chip.imfile.load            1  48039  48038      0      0  chiptool
  ++  chip.imfile.run             0  23524  17755   5769      0  chip_imfile.pl
  ++  chip.advanceexp             0   7514   7514      0      0  chiptool
  etc...
The first column, 'AV', translates to Active and Valid, i.e. whether a task is currently running and whether it is valid at this point in time ('+' for yes, '-' for no). For example, above, +- ns.initday.load is active, but is not valid at present since it is scheduled to run only once per day (to initialize the nightlyscience automation).
The key thing to monitor here is the Nfail column. Depending on the process, different numbers of Nfail as a proportion of Njobs are deemed acceptable. In the output above, for instance, chip.imfile.run shows Nfail = 5769 against Njobs = 23524 (roughly a quarter), which would certainly be worth investigating.
Morning duties: checking summitcopy and burntool
There is nothing to process if data has not been copied from the telescope. This is the job of summitcopy, which runs slowly through the night, then speeds up once observations are complete each day. You can check that it has successfully copied files using Czartool.
After summitcopy comes burntool. The easiest way to check this is to run the following in the stdscience pantasks_client:
ns.show.dates
You should see a 'book' entry with today's date, like
<today's date e.g. 2010-08-05> BURNING
If not, something is wrong.
Stopping and starting the pantasks servers
It is occasionally necessary to stop and restart the pantasks_server instances. For example, when it is necessary to update and rebuild the code, or if pantasks itself becomes unresponsive or shows negative values in some columns of the status display (above).
Stopping
To stop a single pantasks server do
pantasks: stop
To shut down all pantasks_server instances, use
check_system.sh stop
check_system.sh shutdown
Starting
Each pantasks_server uses a local input and ptolemy.rc file (this file details the machine where the server is to run).
Starting an already running server
If the pantasks_server process is already running, processing should be started with the following client commands only:
pantasks_client: server input input
pantasks_client: setup
pantasks_client: run
This loads the hosts and labels needed and starts the processing running. See ~ipp/stdscience/input if this is not clear.
Starting all servers
If everything has been shut down, you can start all pantasks with the following in ~ipp:
check_system start.server
check_system run
The first command launches the pantasks_servers on the correct hosts; the second calls the three commands listed above (server input input; setup; run).
Starting a single server
To start a single server you need to ssh to the relevant machine (found in the ptolemy.rc file for that server) then do the following:
ssh ipp@ippXXX
cd <serverName>
pantasks_server &
pantasks_client
pantasks: server input input
pantasks: setup
So, for example for stdscience
ssh ippc16
cd ~ipp/stdscience
pantasks_server &
pantasks_client
pantasks: server input input
pantasks: setup
Queuing data
Before pantasks can be used to manage processing of a particular label, chiptool must first be run to queue data and create that label. The custom here is to write a small script that runs chiptool with the necessary arguments. This script is then left in the stdscience sub-directory with the same name as the survey in question (M31, MD04 etc), so that there is a record of what has been queued. An example script would be
#!/bin/csh -f
set label = "M31.Run5.20100408"
set options = ""
set options = "$options -dbname gpc1"
set options = "$options -definebyquery"
set options = "$options -set_end_stage warp"
set options = "$options -set_tess_id M31.V0"
set options = "$options -set_data_group M31.Run5b"
set options = "$options -set_dist_group M31"
set options = "$options -comment M31%"
set options = "$options -dateobs_begin 2009-12-09T00:00:00"
set options = "$options -dateobs_end 2010-03-01T00:00:00"
# set options = "$options -simple -pretend"
chiptool $options -set_label $label -set_workdir neb://@HOST@.0/gpc1/$label
Now the label must be added within pantasks:
pantasks: add.label M31.Run5.20100408
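To confirm the new label is now in the processing list, list the labels again from within pantasks:

pantasks: show.labels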
Removing data from the queue
If a mistake has been made and a label needs to be removed from processing, then
pantasks: del.label M31.nightlyscience
chiptool must then be used to drop the label for data with a state of 'new'.
chiptool -updaterun -set_state drop -label bad_data_label -state new -dbname gpc1
If some of the data has already been processed (i.e. state!=new), then cleanup must be employed. TODO more here
Running the microtest scripts
The microtest data should be correctly automated, but still requires a script to be manually run. The basic pantasks tasks to reduce the microtest data are included in the stdscience/input file, in the add.microtest macro:
macro add.microtest
    add.label microtestMD07.nightlyscience
    add.label microtestMD07.noPattern.nightlyscience
    survey.add.WSdiff microtestMD07.nightlyscience MD07.refstack.20100330 microtestMD07 neb://@HOST@.0/gpc1
    survey.add.WSdiff microtestMD07.noPattern.nightlyscience MD07.refstack.20100330 microtestMD07.noPattern neb://@HOST@.0/gpc1
    survey.add.magic microtestMD07.nightlyscience /data/ipp050.0/gpc1_destreak
    survey.add.magic microtestMD07.noPattern.nightlyscience /data/ipp050.0/gpc1_destreak
end
Once the two labels have made it through magic, the microtest.pl script can be run. You'll need to have ppCoord built and in your path; this isn't built by psbuild. You just need to go into the ppViz directory and do psautogen --enable-optimize && make && make install. The script relies on VerifyStreaks having been run on the data as part of Magic (and being in the proper place). Note that if the VerifyStreaks binary could not be found in the course of the Magic processing, this step will have been skipped. The script is then run as:
microtest.pl --dbhost ippdb01 --dbuser ipp --dbpass XXX --dbname gpc1 --label microtestMD07.nightlyscience --data_group microtestMD07.20XXYYZZ --verbose
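For reference, the ppCoord build described above amounts to something like the following (the ppViz path is an assumption based on the current tag's source tree; adjust to your checkout):

cd ~ipp/src/ipp-20100701/ppViz   # assumed location of the ppViz sources
psautogen --enable-optimize && make && make install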
Finding and dealing with errors
Finding log files
On the main Czartool display, if there are any faults, they are shown in parentheses. These are in fact links that will take you to the relevant ippMonitor page for the processing stage and label in question. There, a table lists details of the offending exposures, one column of which is 'state' (which should be 'new'). Following the links from this table displays the log for that particular exposure (or chip), from which the error may be diagnosed.
Reverting
When exposures fail at a certain stage (chip, cam, warp etc) they are given a 'fault' code:
| Code | Description |
| 1 | Error of unknown nature |
| 2 | Error with a system call (often an NFS error) |
| 3 | Error with configuration |
| 4 | Error in programming (look also for aborts) |
| 5 | Error with data |
| 6 | Error due to timeout |
| >6 | Reserved for magic |
It is sometimes possible to 'revert' certain failed exposures. Reverting simply means attempting to process an exposure a second time in case the cause of the fault was temporary, for example an NFS error. Faults like these are usually given fault code '2'. Turning reverts on via the Czartool page will attempt to revert all exposures that failed with code '2'. Behind the scenes, Czartool uses pantasks_client to perform the reverts, as described in the next section.
Reverting from pantasks_client
To manually revert failures with fault code 2, do something like the following in pantasks_client
pantasks: warp.revert.on
And off again with
pantasks: warp.revert.off
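The process is similar for chip, camera and the other stages; for the chip stage, for example, the analogous commands would presumably be:

pantasks: chip.revert.on
pantasks: chip.revert.off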
A special case, however, is destreak, which needs to be reverted as follows.
From the distribution pantasks_client:
destreak.off
destreak.revert.on
Then, once there is nothing left to do
destreak.revert.off
destreak.on
Reverting faults with codes other than 2
By running the stage tool program directly it may be possible to revert failures with codes other than 2. For example, for the chip stage:
chiptool -revertprocessedimfile -label M31.nightlyscience -fault 4 -dbname gpc1
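The same command with a different fault code retries other failure types; for timeouts (code 6), for example:

chiptool -revertprocessedimfile -label M31.nightlyscience -fault 6 -dbname gpc1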
Similar arguments can be used with warptool, camtool etc.
Removing a troublesome host
Sometimes a particular machine will act unpredictably and should be taken out of processing. To do this, go to each pantasks server in turn and remove the host, ipp016 in the example below
pantasks: controller host off ipp016
We also need to set the same host to a state of 'repair' in nebulous:
neb-host ipp016 repair
This leaves the machine accessible, but no new data can be allocated to it. See the table below for a guide to the other nebulous states:
| state | allocate? | available? |
| up | yes | yes |
| down | no | no |
| repair | no | yes |
Running neb-host with no arguments gives you a summary of the above for all hosts.
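Once the machine has been fixed, it can presumably be returned to service by reversing both steps (the 'on' and 'up' arguments here are assumed counterparts of the commands above; check the pantasks and neb-host usage if unsure):

pantasks: controller host on ipp016
neb-host ipp016 up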
Changing the code
This might mean rebuilding the current 'tag' (reflected in the directory name) or actually installing a new tag.
Rebuilding the current tag
We will use the example of tag 20100701, which is stored under
~ipp/src/ipp-20100701
To update the code and rebuild, shutdown all pantasks (as shown above) then do the following.
cd ~ipp/src/ipp-20100701
svn update
psbuild -dev -optimize
Now restart all pantasks (as above).
Installing a new tag
- shut down all pantasks (as shown above)
- change ~ipp/.tcshrc to point at the new tag (it is good to confirm by logging out and in again)
- fix the files which are still installation specific:
  - edit ~ipp/.ptolemyrc and change CONFDIR to point at the new location
  - copy nebulous.site.pro to the working location (for now, just use the last installation's version), e.g.
    cp psconfig/ipp-20100623.lin64/share/pantasks/modules/nebulous.site.pro psconfig/ipp-20100701.lin64/share/pantasks/modules/nebulous.site.pro
- restart all pantasks (as above)
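Put together, and using the 20100701 tag as the example, the switch amounts to something like the following sketch (tag names are examples; use whatever editor you prefer):

cd ~ipp
check_system.sh stop
check_system.sh shutdown        # shut down all pantasks
vi ~ipp/.tcshrc                 # point at the new tag, then log out and back in to confirm
vi ~ipp/.ptolemyrc              # change CONFDIR to point at the new location
cp psconfig/ipp-20100623.lin64/share/pantasks/modules/nebulous.site.pro psconfig/ipp-20100701.lin64/share/pantasks/modules/nebulous.site.pro
check_system start.server       # restart all pantasks
check_system run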
Who to contact
Any problems or concerns should be reported to the ipp development mailing list:
[ps-ipp-dev@…]
Different members of the IPP team are responsible for different parts of the code, and the relevant person will hopefully address the issue.
