
Introduction

This page outlines the procedures and responsibilities for the person currently acting as 'IPP Processing Czar'. In a nutshell, these include:

  • monitoring the various pantasks servers running on the production cluster using pantasks_client
  • alerting the IPP group to any notable errors or failures
  • keeping an eye on production cluster load using Ganglia
  • adding and removing labels based on the current set of processing priorities, outlined here
  • keeping an eye on available disk space using the neb-ls command (on any production machine)

Setup and resources

You will need ipp user access on the production cluster. For convenience, have someone who already has access (anyone on the IPP team) add your ssh public key to ~ipp/.ssh/authorized_keys.
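
For example, the person adding you might run something like the following from their own account (yourkey.pub is a hypothetical filename for your public key, and ippXXX stands for any production host):

cat yourkey.pub | ssh ipp@ippXXX 'cat >> ~/.ssh/authorized_keys'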

Mostly you will be logged into a production cluster machine as the ipp user, using the pantasks_client program to monitor operations; however, there are other useful resources.

Getting started and checking processing status

Log in as ipp user on any production cluster machine and run

./check_system.sh

This lists the various pantasks servers currently running on the cluster, e.g.

pantasks server addstar is running (host: ipp004)
pantasks server cleanup is running (host: ippc07)
pantasks server detrend is NOT running (host: ippc06)
pantasks server distribution is running (host: ippc15)
pantasks server pstamp is running (host: ippdb02)
pantasks server publishing is running (host: ippc08)
pantasks server registration is running (host: ippc02)
pantasks server replication is running (host: ippdb00)
pantasks server stdscience is running (host: ippc16)
pantasks server summitcopy is running (host: ippc01)

Assuming some or all of the servers are running, move to the directory corresponding to the server of interest, e.g. ~ipp/stdscience/, then run

pantasks_client

To check the current labels being processed:

pantasks: show.labels

Within pantasks, to check processing status, do

pantasks: status

This will return something like

 Task Status
  AV Name                     Nrun   Njobs   Ngood Nfail Ntime Command               
  +- extra.labels.on             0       3       3     0     0 echo                  
  +- extra.labels.off            0       3       3     0     0 echo                  
  +- ns.initday.load             0       3       3     0     0 echo                  
  ++ ns.registration.load        0    1331    1331     0     0 automate_stacks.pl    
  ++ ns.chips.load               0      66      66     0     0 automate_stacks.pl    
  ++ ns.chips.run                0       4       4     0     0 automate_stacks.pl    
  ++ ns.stacks.load              0    5825    5825     0     0 automate_stacks.pl    
  ++ ns.stacks.run               0       6       6     0     0 automate_stacks.pl    
  ++ ns.burntool.load            0       8       8     0     0 automate_stacks.pl    
  ++ ns.burntool.run             0     360     360     0     0 ipp_apply_burntool.pl 
  ++ chip.imfile.load            1   48039   48038     0     0 chiptool              
  ++ chip.imfile.run             0   23524   17755  5769     0 chip_imfile.pl        
  ++ chip.advanceexp             0    7514    7514     0     0 chiptool    
  etc...       

The key thing to monitor here is the Nfail column. What counts as an acceptable Nfail, as a proportion of Njobs, varies from process to process.
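
As a rough sketch, if you paste the status output into a file (status.txt is a hypothetical name), an awk one-liner can report the failure percentage for every task that has failures; the column positions are assumed to match the layout shown above:

awk '$4+0 > 0 && $6+0 > 0 { printf "%-25s %6.2f%% failed\n", $2, 100*$6/$4 }' status.txt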

Stopping and starting the pantasks servers

It is occasionally necessary to stop and restart the pantasks_server instances. For example, when it is necessary to update and rebuild the code, or if pantasks itself becomes unresponsive or shows negative values in some columns of the status display (above).

Stopping

To shut down all pantasks_server instances, use

check_system.sh stop
check_system.sh shutdown
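
To confirm that everything has stopped, run check_system.sh again with no arguments; each server should now be reported as NOT running.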

Starting

Each pantasks_server reads the input file located in the directory where it is instantiated. It also uses the local ptolemy.rc file, which specifies the machine on which the server is to run.

To restart all the pantasks_server instances, you need to ssh to each relevant machine; the machines are listed by check_system.sh. For each server, do the following:

ssh ipp@ippXXX
cd ~ipp/<serverName>
pantasks_server &
pantasks_client
pantasks: server input input
pantasks: setup

So, for example, for stdscience:

ssh ippc16
cd ~ipp/stdscience
pantasks_server &
pantasks_client
pantasks: server input input
pantasks: setup

Each server then needs to be handled differently for setup.

stdscience

Add surveys

pantasks: add.surveys

This adds the surveys defined in the 'input' file. Now show labels with

pantasks: show.labels

Working from this list, add and remove labels with add.label and del.label, e.g.

pantasks: del.label M31.nightlyscience
pantasks: add.label ThreePi.DM.20100401

Now add some hosts. Since stdscience is the most intensive server, it requires more hosts than the others. The configuration as shown below is a good guide.

pantasks: hosts add wave1
pantasks: hosts add wave2
pantasks: hosts add wave2
pantasks: hosts add wave3
pantasks: hosts add wave3
pantasks: hosts add compute
pantasks: hosts add compute

However, 1 x wave1, 3 x wave2, 4 x wave3, 4 x compute is probably needed for full-scale operations (the full set of commands is sketched below).
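
That host configuration corresponds to repeating the add commands as follows:

pantasks: hosts add wave1
pantasks: hosts add wave2
pantasks: hosts add wave2
pantasks: hosts add wave2
pantasks: hosts add wave3
pantasks: hosts add wave3
pantasks: hosts add wave3
pantasks: hosts add wave3
pantasks: hosts add compute
pantasks: hosts add compute
pantasks: hosts add compute
pantasks: hosts add compute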

Now we are ready to run the server

pantasks: run

summitcopy, registration, replication

These are the easy ones; just

pantasks: run

publishing

This server is specifically for publishing data to MOPS.

add labels? TODO

pstamp

The postage stamp server.

pantasks: add.hosts
pantasks: run

distribution

distribution runs the destreaking (magic), then bundles up the available data for the datastore. In terms of labels, distribution roughly mirrors stdscience.

pantasks: add.labels

same labels as stdscience? TODO

Add hosts

pantasks: hosts add wave1
pantasks: hosts add wave2
pantasks: hosts add wave3
pantasks: hosts add compute

Check that processing is running smoothly in stdscience using pantasks: status. If all is okay, then

pantasks: run

cleanup

pantasks: add.labels
pantasks: hosts add wave2
pantasks: hosts add wave3
pantasks: hosts add compute
pantasks: run

detrend, addstar

TODO

Queuing data

Before pantasks can be used to manage processing of a particular label, chiptool must first be run to queue data and create that label. The custom is to write a small script that runs chiptool with the necessary arguments. This script is then left in the stdscience sub-directory, named after the survey in question (M31, MD04, etc.), so that there is a record of what has been queued. An example script would be

#!/bin/csh -f
# Queue M31 exposures for processing (chip through warp) under a new label.

set label = "M31.Run5.20100408"

# Build up the chiptool argument list one option at a time.
set options = ""
set options = "$options -dbname gpc1"
set options = "$options -definebyquery"
set options = "$options -set_end_stage warp"
set options = "$options -set_tess_id M31.V0"
set options = "$options -set_data_group M31.Run5b"
set options = "$options -set_dist_group M31"
set options = "$options -comment M31%"
set options = "$options -dateobs_begin 2009-12-09T00:00:00"
set options = "$options -dateobs_end 2010-03-01T00:00:00"
# Uncomment to preview what would be queued without actually queuing it:
# set options = "$options -simple -pretend"

chiptool $options -set_label $label -set_workdir neb://@HOST@.0/gpc1/$label
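
To queue the data, make the script executable and run it (here assuming it was saved as M31 in ~ipp/stdscience, after the survey name):

chmod +x M31
./M31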

Now the label must be added within pantasks

pantasks: add.label M31.Run5.20100408

Removing data from the queue

If a mistake has been made and a label needs to be removed from processing, then

pantasks: del.label M31.nightlyscience

chiptool must then be used to drop the label for data with a state of 'new'.

chiptool -updaterun -set_state drop -label bad_data_label -state new -dbname gpc1
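
If you want to double-check first, the -pretend flag used in the queuing script above may also serve as a dry run here (an assumption; verify against chiptool's usage output):

chiptool -updaterun -set_state drop -label bad_data_label -state new -dbname gpc1 -pretend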

If some of the data has already been processed (i.e. state!=new), then cleanup must be employed. TODO more here

Common issues

This section attempts to outline common issues encountered during processing and how to work through them.

stdscience

Chip failures show up in the status output, for example

  AV Name                     Nrun   Njobs   Ngood Nfail Ntime Command               
  ++ chip.imfile.run             0   23536   17755  5781     0 chip_imfile.pl        

To investigate the failures, go to

ippmonitor->Science steps->Chip Failed Imfiles

where you can view the logs by clicking within the 'State' column.

Warp failures

To investigate the failures, go to

ippmonitor->Science steps->Warp Failed Skyfiles

Filter the results by entering 'new' in the state column. Check that the values in the 'Fault' column are 2, which denotes an NFS error; in that case we can 'revert' using

pantasks: warp.revert.on

Remember to switch this off again afterwards with

pantasks: warp.revert.off

Rebuilding the IPP code

The IPP build currently in use is located at

~ipp/ipp-20100211

If the code needs an update and rebuild, then:

  • stop pantasks (as above)
  • cd ~ipp/ipp-20100211
  • svn update
  • psbuild -dev -optimize
  • restart pantasks (as above)
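
Put together, and assuming the checkout still lives at ~ipp/ipp-20100211, an update session looks like:

check_system.sh stop
check_system.sh shutdown
cd ~ipp/ipp-20100211
svn update
psbuild -dev -optimize
# then restart each pantasks_server as described under 'Starting' above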

Who to contact

Any problems or concerns should be reported to the ipp development mailing list:

ps-ipp-dev@ifa.hawaii.edu

Different members of the IPP team are responsible for different parts of the code, and the relevant person will hopefully address the issue.

Fault states

1 Error of unknown nature
2 Error with a system call (often an NFS error)
3 Error with configuration
4 Error in programming (look also for aborts)
5 Error with data
6 Error due to timeout

magic has fault states greater than 5.
