Introduction
This page outlines the procedures and responsibilities for the person currently acting as 'IPP Processing Czar'. In a nutshell, these include:
- monitoring the various pantasks servers running on the production cluster using pantasks_client
- alerting the IPP group to any notable errors or failures
- keeping an eye on production cluster load using Ganglia
- adding and removing labels based on the current set of processing priorities, outlined here
- keeping an eye on available disk space using the neb-ls command (on any production machine)
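The exact neb-ls output format is not reproduced on this page, so as a generic sketch of the disk-space check, the following uses df -P-style input instead (the file path, function name, and volume names are hypothetical):

```shell
# Flag any filesystem above a usage threshold (default 90%).
# Column layout assumed: Filesystem Blocks Used Available Capacity Mounted
check_disk_usage() {
    threshold="${2:-90}"
    awk -v max="$threshold" 'NR > 1 {
        sub(/%/, "", $5)                       # strip the % from the Capacity column
        if ($5 + 0 >= max) print $6 " is at " $5 "%"
    }' "$1"
}

# Hypothetical df -P style snapshot of two data volumes:
cat > /tmp/df_sample.txt <<'EOF'
Filesystem 1024-blocks Used Available Capacity Mounted
/dev/sda1 1000000 950000 50000 95% /export/ipp004.0
/dev/sdb1 1000000 400000 600000 40% /export/ipp004.1
EOF

check_disk_usage /tmp/df_sample.txt 90
```

The same idea applies to whatever usage listing neb-ls produces: scan the capacity column and report volumes approaching full.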
Setup and resources
You will need to have ipp user access on the production cluster. For convenience, have someone who already has access (anyone on the IPP team) to add your ssh public key to ~ipp/.ssh/authorized_keys.
Mostly, you will be logged into a production cluster machine as ipp, using the pantasks_client program to monitor operations; however, there are other useful resources:
- Ganglia - for monitoring load on production cluster machines
- ippmonitor - a window onto the gpc1 database, particularly the NightSummary page
- processing priorities - the current list of priorities, for use when setting up labels in stdscience
Getting started and checking processing status
Log in as ipp user on any production cluster machine and run
./check_system.sh
This lists the various pantasks servers currently running on the cluster, eg
pantasks server addstar is running (host: ipp004)
pantasks server cleanup is running (host: ippc07)
pantasks server detrend is NOT running (host: ippc06)
pantasks server distribution is running (host: ippc15)
pantasks server pstamp is running (host: ippdb02)
pantasks server publishing is running (host: ippc08)
pantasks server registration is running (host: ippc02)
pantasks server replication is running (host: ippdb00)
pantasks server stdscience is running (host: ippc16)
pantasks server summitcopy is running (host: ippc01)
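Given output in that format, the servers that are down can be picked out mechanically. A minimal sketch (the file path and function name are hypothetical, and the sample output is abbreviated):

```shell
# Sample check_system.sh output, abbreviated:
cat > /tmp/check_system.out <<'EOF'
pantasks server addstar is running (host: ipp004)
pantasks server detrend is NOT running (host: ippc06)
pantasks server stdscience is running (host: ippc16)
EOF

# Print the name of each server reported as NOT running.
down_servers() {
    awk '/is NOT running/ { print $3 }' "$1"
}

down_servers /tmp/check_system.out
```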
Assuming some or all of the servers are running, move to the directory corresponding to the server of interest, eg ~ipp/stdscience/, then run
pantasks_client
To check the current labels being processed:
pantasks: show.labels
Within pantasks, to check processing status, do
pantasks: status
This will return something like
Task Status
AV Name Nrun Njobs Ngood Nfail Ntime Command
+- extra.labels.on 0 3 3 0 0 echo
+- extra.labels.off 0 3 3 0 0 echo
+- ns.initday.load 0 3 3 0 0 echo
++ ns.registration.load 0 1331 1331 0 0 automate_stacks.pl
++ ns.chips.load 0 66 66 0 0 automate_stacks.pl
++ ns.chips.run 0 4 4 0 0 automate_stacks.pl
++ ns.stacks.load 0 5825 5825 0 0 automate_stacks.pl
++ ns.stacks.run 0 6 6 0 0 automate_stacks.pl
++ ns.burntool.load 0 8 8 0 0 automate_stacks.pl
++ ns.burntool.run 0 360 360 0 0 ipp_apply_burntool.pl
++ chip.imfile.load 1 48039 48038 0 0 chiptool
++ chip.imfile.run 0 23524 17755 5769 0 chip_imfile.pl
++ chip.advanceexp 0 7514 7514 0 0 chiptool
etc...
The key thing to monitor here is the Nfail column. Depending on the process, different numbers of Nfail as a proportion of Njobs are deemed acceptable.
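Checking the Nfail proportion by eye works, but it can also be scripted. A sketch that flags any task whose Nfail exceeds a given fraction of Njobs, assuming the column layout shown above (the file path, function name, and 10% threshold are hypothetical; acceptable thresholds vary by task):

```shell
# Print task names whose Nfail/Njobs ratio exceeds a threshold fraction.
# Assumed columns: flags(1) Name(2) Nrun(3) Njobs(4) Ngood(5) Nfail(6) Ntime(7) Command(8)
flag_failures() {
    frac="${2:-0.1}"
    awk -v frac="$frac" '$4 > 0 && ($6 / $4) > frac { print $2 }' "$1"
}

# Abbreviated sample of the status output above:
cat > /tmp/status_sample.txt <<'EOF'
++ chip.imfile.load 1 48039 48038 0 0 chiptool
++ chip.imfile.run 0 23524 17755 5769 0 chip_imfile.pl
++ chip.advanceexp 0 7514 7514 0 0 chiptool
EOF

flag_failures /tmp/status_sample.txt 0.1
```

Here chip.imfile.run is flagged, since 5769 failures out of 23524 jobs is roughly 25%.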
Stopping and starting the servers
It is occasionally necessary to stop and restart the pantasks_server instances: for example, to update and rebuild the code, or if pantasks itself becomes unresponsive or shows negative values in some columns of the status display (above).
Stopping
To shut down all pantasks_server instances, use
check_system.sh stop
check_system.sh shutdown
Starting
Each pantasks_server uses the input file located in the directory where it is instantiated. It also uses the local ptolemy.rc file, which specifies the machine on which the server is to run.
To restart all the pantasks_server instances, you need to ssh to each relevant machine, which are found using check_system.sh. For each server do the following:
ssh ipp@ippXXX
cd <serverName>
pantasks_server &
pantasks_client
pantasks: server input input
pantasks: setup
So, for example for stdscience
ssh ippc16
cd ~ipp/stdscience
pantasks_server &
pantasks_client
pantasks: server input input
pantasks: setup
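When restarting several servers in a row, it can help to print the full plan first. A dry-run sketch that just emits the steps for each server/host pair rather than executing them (the function name is hypothetical, and the actual hosts should be taken from check_system.sh output):

```shell
# Print the restart steps for each server given as name:host.
restart_plan() {
    for entry in "$@"; do
        name="${entry%%:*}"
        host="${entry##*:}"
        echo "ssh ipp@${host}: cd ~ipp/${name}; pantasks_server; pantasks_client"
    done
}

restart_plan stdscience:ippc16 summitcopy:ippc01 registration:ippc02
```

Each printed line is then carried out by hand, followed by the `server input input` and `setup` commands inside pantasks_client.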
Each server then needs to be handled differently for setup.
stdscience
Add surveys
pantasks: add.surveys
This adds the surveys defined in the 'input' file. Now show labels with
pantasks: show.labels
Working from this list, add and remove labels with del.label and add.label, eg
pantasks: del.label M31.nightlyscience
pantasks: add.label ThreePi.DM.20100401
Now add some hosts. Since stdscience is the most intensive server, it requires more hosts than the others. The configuration as shown below is a good guide.
pantasks: hosts add wave1
pantasks: hosts add wave2
pantasks: hosts add wave2
pantasks: hosts add wave3
pantasks: hosts add wave3
pantasks: hosts add compute
pantasks: hosts add compute
However, 1 x wave1, 3 x wave2, 4 x wave3, 4 x compute is probably needed for full-scale operations.
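Since the full-scale configuration repeats the same command per host group, the list of `hosts add` lines can be generated rather than typed. A sketch (the function name is hypothetical; the counts are the full-scale numbers above):

```shell
# Emit one "hosts add <group>" line per requested slot, given group:count specs.
emit_hosts() {
    for spec in "$@"; do
        group="${spec%%:*}"
        count="${spec##*:}"
        i=0
        while [ "$i" -lt "$count" ]; do
            echo "hosts add $group"
            i=$((i + 1))
        done
    done
}

# Full-scale configuration: 1 x wave1, 3 x wave2, 4 x wave3, 4 x compute
emit_hosts wave1:1 wave2:3 wave3:4 compute:4
```

The emitted lines are then entered at the pantasks prompt.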
Now we are ready to run the server
pantasks: run
summitcopy, registration, replication
These are the easy ones, just
pantasks: run
publishing
This server is specifically for publishing data to MOPS.
add labels? TODO
pstamp
The postage stamp server.
pantasks: add.hosts
pantasks: run
distribution
distribution runs the destreaking (magic) then bundles up available data for the datastore. In terms of labels, distribution roughly mirrors stdscience.
pantasks: add.labels
same labels as stdscience? TODO
Add hosts
pantasks: hosts add wave1
pantasks: hosts add wave2
pantasks: hosts add wave3
pantasks: hosts add compute
Check processing is running smoothly in stdscience using pantasks: status. If all is okay, then
pantasks: run
cleanup
pantasks: add.labels
pantasks: hosts add wave2
pantasks: hosts add wave3
pantasks: hosts add compute
pantasks: run
detrend, addstar
TODO
Adding and removing labels
Adding a label
Before pantasks can be used to manage processing of a particular label, chiptool must first be run to queue data and create that label. The custom here is to write a small script that runs chiptool with the necessary arguments. This script is then left in the stdscience sub-directory with the same name as the survey in question (M31, MD04 etc), so that there is a record of what has been queued. An example script would be
#!/bin/csh -f
set label = "M31.Run5.20100408"
set options = ""
set options = "$options -dbname gpc1"
set options = "$options -definebyquery"
set options = "$options -set_end_stage warp"
set options = "$options -set_tess_id M31.V0"
set options = "$options -set_data_group M31.Run5b"
set options = "$options -set_dist_group M31"
set options = "$options -comment M31%"
set options = "$options -dateobs_begin 2009-12-09T00:00:00"
set options = "$options -dateobs_end 2010-03-01T00:00:00"
# set options = "$options -simple -pretend"
chiptool $options -set_label $label -set_workdir neb://@HOST@.0/gpc1/$label
Removing a label
If a mistake has been made and a label needs to be removed from processing, then
pantasks: del.label M31.nightlyscience
chiptool must then be used to drop the label for data with a state of 'new'.
chiptool -updaterun -set_state drop -label bad_data_label -state new -dbname gpc1
If some of the data has already been processed (i.e. state!=new), then cleanup must be employed. TODO more here
Common issues
This section attempts to outline common issues encountered during processing and how to work through them.
stdscience
Chip failures, for example
AV Name Nrun Njobs Ngood Nfail Ntime Command
++ chip.imfile.run 0 23536 17755 5781 0 chip_imfile.pl
To investigate the failures, go to
ippmonitor->Science steps->Chip Failed Imfiles
where you can view the logs by clicking within the 'State' column.
Warp failures
To investigate the failures, go to
ippmonitor->Science steps->Warp Failed Skyfiles
Filter the results by entering 'new' in the state column. Check that the values in the 'Fault' column are 2, which denotes an NFS error; in that case we can 'revert' the faults using
pantasks: warp.revert.on
Remember to switch off again afterwards with
pantasks: warp.revert.off
Rebuilding the IPP code
The IPP in use presently is located at
~ipp/ipp-20100211
If the code needs an update and rebuild, then:
- stop pantasks (as above)
- cd ~ipp/ipp-20100211
- svn update
- psbuild -dev -optimize
- restart pantasks (as above)
Who to contact
Any problems or concerns should be reported to the ipp development mailing list:
ps-ipp-dev@ifa.hawaii.edu
Different members of the IPP team are responsible for different parts of the code, and the relevant person will hopefully address the issue.
