wiki:Processing

Version 49 (modified by rhenders, 16 years ago)


Introduction

This page outlines the procedures and responsibilities for the person currently acting as 'IPP Processing Czar'. In a nutshell, these include:

  • monitoring the various pantasks servers running on the production cluster using pantasks_client
  • alerting the IPP group to any notable errors or failures
  • keeping an eye on production cluster load using Ganglia
  • adding and removing labels based on the current set of processing priorities, outlined here
  • keeping an eye on available disk space using the neb-ls command (on any production machine)

Setup and available resources

You will need ipp user access on the production cluster. For convenience, have someone who already has access (anyone on the IPP team) add your ssh public key to ~ipp/.ssh/authorized_keys. Mostly you will be logged into a production cluster machine as ipp, using the pantasks_client program to monitor operations; however, there are other useful resources.

Getting started and checking processing status

Log in as ipp user on any production cluster machine and run

./check_system.sh

This lists the various pantasks servers currently running on the cluster, e.g.

pantasks server addstar is running (host: ipp004)
pantasks server cleanup is running (host: ippc07)
pantasks server detrend is NOT running (host: ippc06)
pantasks server distribution is running (host: ippc15)
pantasks server pstamp is running (host: ippdb02)
pantasks server publishing is running (host: ippc08)
pantasks server registration is running (host: ippc02)
pantasks server replication is running (host: ippdb00)
pantasks server stdscience is running (host: ippc16)
pantasks server summitcopy is running (host: ippc01)

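If any servers are down, it can help to list just their names. A minimal sketch, assuming the exact "pantasks server NAME is NOT running (host: HOST)" line format shown above; the down_servers helper name is ours, not part of the IPP tools:

```shell
# down_servers: read check_system.sh output on stdin and print the names
# of servers reported as NOT running. Relies on the line format above,
# where the server name is the third whitespace-separated field.
down_servers() {
    awk '/is NOT running/ { print $3 }'
}

# On the cluster you would pipe into it:
#   ./check_system.sh | down_servers
```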
Assuming some or all of the servers are running, move to the directory corresponding to the server of interest, e.g. ~ipp/stdscience/, then run

pantasks_client

To check the current labels being processed:

pantasks: show.labels

Within pantasks, to check processing status, do

pantasks: status

This will return something like

 Task Status
  AV Name                     Nrun   Njobs   Ngood Nfail Ntime Command               
  +- extra.labels.on             0       3       3     0     0 echo                  
  +- extra.labels.off            0       3       3     0     0 echo                  
  +- ns.initday.load             0       3       3     0     0 echo                  
  ++ ns.registration.load        0    1331    1331     0     0 automate_stacks.pl    
  ++ ns.chips.load               0      66      66     0     0 automate_stacks.pl    
  ++ ns.chips.run                0       4       4     0     0 automate_stacks.pl    
  ++ ns.stacks.load              0    5825    5825     0     0 automate_stacks.pl    
  ++ ns.stacks.run               0       6       6     0     0 automate_stacks.pl    
  ++ ns.burntool.load            0       8       8     0     0 automate_stacks.pl    
  ++ ns.burntool.run             0     360     360     0     0 ipp_apply_burntool.pl 
  ++ chip.imfile.load            1   48039   48038     0     0 chiptool              
  ++ chip.imfile.run             0   23524   17755  5769     0 chip_imfile.pl        
  ++ chip.advanceexp             0    7514    7514     0     0 chiptool    
  etc...       

The first column, 'AV', stands for Active and Valid, i.e. whether a process is running and whether it is valid at this point in time. For example, above, +- ns.initday.load is active but not currently valid, since it is scheduled to run only once per day (to initialize the nightlyscience automation).

The key thing to monitor here is the Nfail column. What proportion of Njobs may acceptably fail depends on the process.
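As a rough aid, the failure fraction can be computed directly from the status columns. A sketch, assuming the column layout shown in the example above; the fail_rate name is ours:

```shell
# fail_rate: read "pantasks: status" task lines on stdin and print each
# task's Nfail as a percentage of Njobs. Assumes the column order above:
# AV Name Nrun Njobs Ngood Nfail Ntime Command.
fail_rate() {
    awk '$4 ~ /^[0-9]+$/ && $4 > 0 { printf "%s %.1f%%\n", $2, 100 * $6 / $4 }'
}

# Example: paste status lines in, get per-task failure percentages out.
```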

Stopping and starting the pantasks servers

It is occasionally necessary to stop and restart the pantasks_server instances. For example, when it is necessary to update and rebuild the code, or if pantasks itself becomes unresponsive or shows negative values in some columns of the status display (above).

Stopping

To stop the current pantasks server do

pantasks: stop

To shut down all pantasks_server instances, use

check_system.sh stop
check_system.sh shutdown

Starting

To start a single pantasks server do

pantasks: run

Each pantasks_server uses the input file located in the directory where it is instantiated. It also uses the local ptolemy.rc file, which specifies the machine where the server is to run.

To restart all the pantasks_server instances, you need to ssh to each relevant machine (the machines can be found using check_system.sh). For each server, do the following:

ssh ipp@ippXXX
cd <serverName>
pantasks_server &
pantasks_client
pantasks: server input input
pantasks: setup

So, for example for stdscience

ssh ipp@ippc16
cd ~ipp/stdscience
pantasks_server &
pantasks_client
pantasks: server input input
pantasks: setup
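The per-server sequence can be sketched as a dry run that only prints the commands to issue for each server/host pair. The restart_plan helper and the pairs fed to it are illustrative; get the real assignments from check_system.sh:

```shell
# restart_plan: read "server host" pairs on stdin and print, without
# executing, the restart commands for each one. A dry-run sketch only;
# the pantasks steps (server input input, setup) still follow manually.
restart_plan() {
    while read -r server host; do
        printf 'ssh ipp@%s\ncd ~ipp/%s\npantasks_server &\n' "$host" "$server"
    done
}
```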

Each server then needs to be handled differently for setup.

stdscience

Add surveys

pantasks: add.surveys

This adds the surveys defined in the 'input' file. Now show labels with

pantasks: show.labels

Working from this list, add and remove labels with add.label and del.label, e.g.

pantasks: del.label M31.nightlyscience
pantasks: add.label ThreePi.DM.20100401

Now add some hosts. Since stdscience is the most intensive server, it requires more hosts than the others. The configuration below is a good guide.

pantasks: hosts add wave1
pantasks: hosts add wave2
pantasks: hosts add wave2
pantasks: hosts add wave3
pantasks: hosts add wave3
pantasks: hosts add compute
pantasks: hosts add compute

However, 1 x wave1, 3 x wave2, 4 x wave3, 4 x compute is probably needed for full-scale operations.

Now we are ready to run the server

pantasks: run

summitcopy, registration, replication

These are the easy ones, just

pantasks: run

publishing

This server is specifically for publishing data to MOPS.

add labels? TODO

pstamp

The postage stamp server.

pantasks: add.hosts
pantasks: run

distribution

distribution runs the destreaking (magic) then bundles up available data for the datastore. In terms of labels, distribution roughly mirrors stdscience.

pantasks: add.labels

same labels as stdscience? TODO

Add hosts

pantasks: hosts add wave1
pantasks: hosts add wave2
pantasks: hosts add wave3
pantasks: hosts add compute

Check processing is running smoothly in stdscience using pantasks: status. If all is okay, then

pantasks: run

cleanup

pantasks: add.labels
pantasks: hosts add wave2
pantasks: hosts add wave3
pantasks: hosts add compute
pantasks: run

detrend, addstar

TODO

Queuing data

Before pantasks can be used to manage processing of a particular label, chiptool must first be run to queue the data and create that label. The custom here is to write a small script that runs chiptool with the necessary arguments. This script is then left in the stdscience sub-directory with the same name as the survey in question (M31, MD04, etc.), so that there is a record of what has been queued. An example script would be

#!/bin/csh -f

set label = "M31.Run5.20100408"

set options = ""
set options = "$options -dbname gpc1"
set options = "$options -definebyquery"
set options = "$options -set_end_stage warp"
set options = "$options -set_tess_id M31.V0"
set options = "$options -set_data_group M31.Run5b"
set options = "$options -set_dist_group M31"
set options = "$options -comment M31%"
set options = "$options -dateobs_begin 2009-12-09T00:00:00"
set options = "$options -dateobs_end 2010-03-01T00:00:00"
# set options = "$options -simple -pretend"

chiptool $options -set_label $label -set_workdir neb://@HOST@.0/gpc1/$label
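Before running such a script, it can be worth sanity-checking the label string. A hypothetical helper, assuming the dotted naming convention ending in an 8-digit date used in the examples on this page; valid_label is not part of the IPP tools:

```shell
# valid_label: return success if the label matches the dotted convention
# ending in an 8-digit date, e.g. M31.Run5.20100408 or SweetSpot.20100409.
valid_label() {
    echo "$1" | grep -Eq '^[A-Za-z0-9]+(\.[A-Za-z0-9]+)*\.[0-9]{8}$'
}
```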

Now the label must be added within pantasks

pantasks: add.label M31.Run5.20100408

Removing data from the queue

If a mistake has been made and a label needs to be removed from processing, then

pantasks: del.label M31.nightlyscience

chiptool must then be used to drop the label for data with a state of 'new'.

chiptool -updaterun -set_state drop -label bad_data_label -state new -dbname gpc1

If some of the data has already been processed (i.e. state!=new), then cleanup must be employed. TODO more here

Finding and dealing with errors

As shown above, the "pantasks: status" command displays failures for a particular processing stage. A handy script also exists for monitoring all stages for a particular label (essentially a shortcut for the 'Night summary' page of ippmonitor). It is found in /tools under the build directory. Example usage is shown below.

./processing_quick_check.pl --label SweetSpot.20100409

This provides the following output.

chip stage 
SweetSpot.20100409 full 149 

cam stage
SweetSpot.20100409 full 149 

fake stage
SweetSpot.20100409 full 149 

warp stage
SweetSpot.20100409 full 149 

stack stage

diff stage
SweetSpot.20100409 full 66 
SweetSpot.20100409 new 1 

magic stage

magicDS stage

dist stage

-------------------
faults and log files

diff new 4 neb://ipp034.0/gpc1/SweetSpot.20100409/2010/04/12/RINGS.V0/skycell.1500.091/RINGS.V0.skycell.1500.091.dif.50316.log  

The output highlights a problem at the diff stage, and in the "faults and log files" section, the relevant log file is listed, which can be viewed using neb-tail as below.

neb-tail neb://ipp034.0/gpc1/SweetSpot.20100409/2010/04/12/RINGS.V0/skycell.1500.091/RINGS.V0.skycell.1500.091.dif.50316.log

Another script in the /tools directory can be used to probe errors more thoroughly for a particular stage and label. The chip, camera, warp, stack, diff, magic and destreak stages are currently supported, as well as the use of multiple labels (wildcards are allowed). You can also search by a specific fault code (see table below), or limit the query. The output is the appropriate id and component, the machine it was run on, and the particular problem (e.g., a file that failed to be found, otherwise the resolved name of the processing log), all grouped into categories. With such a list, it's easy to identify patterns, e.g., a few warps are failing because of a single corrupt camera mask file, or machine 'X' can't read files on machine 'Y'. For example, using the same label as above:

./errors.pl --dbhost ippdb01 --dbuser ipp --dbpass ipp --dbname gpc1 --label 'SweetSpot.20100409' --stage diff

This will produce

Total: 1

Assertion failures: 1
50316.skycell.1500.091(ippc08): /data/ipp034.0/nebulous/61/ef/244153882.gpc1:SweetSpot.20100409:2010:04:12:RINGS.V0:skycell.1500.091:RINGS.V0.skycell.1500.091.dif.50316.log

The 'fault codes' mentioned above are as follows.

Code Description
1 Error of unknown nature
2 Error with a system call (often an NFS error)
3 Error with configuration
4 Error in programming (look also for aborts)
5 Error with data
6 Error due to timeout
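When scripting around errors.pl output, the table above can be turned into a small lookup. A sketch; fault_desc is our name, not an IPP tool:

```shell
# fault_desc: print the description for an IPP fault code, per the table
# above. Codes outside 1-6 fall through to the default branch.
fault_desc() {
    case "$1" in
        1) echo "Error of unknown nature" ;;
        2) echo "Error with a system call (often an NFS error)" ;;
        3) echo "Error with configuration" ;;
        4) echo "Error in programming (look also for aborts)" ;;
        5) echo "Error with data" ;;
        6) echo "Error due to timeout" ;;
        *) echo "Unknown fault code: $1" ;;
    esac
}
```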

Common issues

This section attempts to outline common issues encountered during processing and how to work through them.

stdscience

Chip failures, for example

  AV Name                     Nrun   Njobs   Ngood Nfail Ntime Command               
  ++ chip.imfile.run             0   23536   17755  5781     0 chip_imfile.pl        

To investigate the failures, go to

ippmonitor->Science steps->Chip Failed Imfiles

where you can view the logs by clicking within the 'State' column.

Warp failures

To investigate the failures, go to

ippmonitor->Science steps->Warp Failed Skyfiles

Filter the results by entering 'new' in the state column. Check that the values in the 'Fault' column are 2, which denotes an NFS error; in that case we can 'revert' using

pantasks: warp.revert.on

Remember to switch off again afterwards with

pantasks: warp.revert.off

Rebuilding the IPP code

The IPP in use presently is located at

~ipp/ipp-20100211

If the code needs an update and rebuild, then:

  • stop pantasks (as above)
  • cd ~ipp/ipp-20100211
  • svn update
  • psbuild -dev -optimize
  • restart pantasks (as above)

Who to contact

Any problems or concerns should be reported to the ipp development mailing list:

ps-ipp-dev@ifa.hawaii.edu

Different members of the IPP team are responsible for different parts of the code, and the relevant person will hopefully address the issue.

Note: the magic stage can report fault states greater than 5, beyond those listed in the fault code table above.
