
Changes between Initial Version and Version 1 of Pantasks_FAQ


Timestamp:
Feb 24, 2009, 4:23:57 PM (17 years ago)
Author:
trac
Comment:

--

  • Pantasks_FAQ

    v1 v1  
     1==== What is pantasks? ====
     2pantasks is the IPP parallel process manager for distributed computing across multiple nodes. See also the [wiki:IppTools_FAQ ippTools FAQ] for information on some of the commands that pantasks launches.
     3
     4==== How do I start pantasks? ====
     5Start pantasks from a terminal window:
     6 > pantasks
     7 
     8 Welcome to pantasks - parallel task scheduler
     9 
     10Load some pantasks commands:
     11 pantasks: module pantasks.pro
     12
     13Or, if you have a modified pantasks.pro file:
     14 pantasks: input /home/username/pantasks/pantasks.pro
     15
     16After loading the pantasks.pro file, you can add a database easily:
     17 pantasks: add.database mydatabase
     18
     19If you don't add a database, pantasks will use the one declared in your .ipprc file.
     20
     21==== How do I configure pantasks? ====
     22
     23.pantasksrc
     24.ptolemyrc
     25
     26==== What are the primary pantasks commands? ====
     27
     28Connect to a controller host:
     29 pantasks: controller host add myhost
     30
     31Check the controller host status (NOTE: This will sometimes return no info, even when there are active hosts):
     32 pantasks: controller status
     33
     34Check the processing status:
     35 pantasks: status
     36
     37For additional timing details:
     38 pantasks: status -taskstats
     39 
     40Start processing:
     41 pantasks: run
     42
     43Stop adding new processes, but finish out the queue:
     44 pantasks: stop
     45
     46Stop processes immediately:
     47 pantasks: halt
     48
     49Exit pantasks:
     50 pantasks: exit
     51
     52
     53==== How do I get more verbose output from pantasks? ====
     54
     55 pantasks: $VERBOSE = 1
     56
     57Set the value higher than 1 for progressively more verbose output.
     58
     59==== Why does my process fail in pantasks but succeed on the command line? ====
     60
     61"I copied the command directly from the pantasks error stream (or from the verbose command output) and pasted it into a separate terminal.  It succeeds on the terminal, but it failed in pantasks."
     62
     63You may have a config error in your home directory. When pantasks executes a command, it does so from the user's home directory on whichever remote host has been assigned to that process. If that home directory contains out-of-date config directories, they may be loaded '''before''' the system-level config directories. Here is an example.
     64In your .ipprc file, you may have defined the path to your configuration directory with something like this:
     65
     66 PATH            STR     /path/to/my/system/level/ippconfig
     67
     68Then, in your system.config file, you can define the location of your GPC1 camera.config file with these lines:
     69
     70 CAMERAS                METADATA
     71        GPC1                    STR     gpc1/camera.config
     72 END
     73 
     74When you run any script that needs to reference the GPC1 camera, it will look:
     75 * first in the current directory for ./gpc1/camera.config
     76 * next in the directory defined by your .ipprc PATH variable (in this case /path/to/my/system/level/ippconfig/gpc1/camera.config)
     77
     78Thus, if an old gpc1/camera.config is lying around in your home directory, pantasks will find it first and hit a config error. The same command run from the command line in any other directory picks up the system-level file instead, so the error does not appear there.
     79
     80'''Solution:''' move any old config files out of your home directory, or update them to remove the config issue.
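
One way to look for such leftovers is sketched below. It only checks whether given relative config paths exist under a directory. The function name find_shadow_configs is hypothetical and not part of the IPP, and gpc1/camera.config is simply the path from the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the IPP): report config files under a
# directory that would be found before the system-level copies.
# Usage: find_shadow_configs DIR REL_PATH...
find_shadow_configs() {
    dir=$1
    shift
    for rel in "$@"; do
        # A relative config path is resolved against the current directory
        # first, so a copy under DIR wins when a command is run from DIR.
        if [ -f "$dir/$rel" ]; then
            printf '%s\n' "$dir/$rel"
        fi
    done
}

# Example: find_shadow_configs "$HOME" gpc1/camera.config
```

Any path it prints is a candidate to move out of the way or bring up to date.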
     81
     82==== Why are my nodes "down" or "resp"? ====
     83
     84 * First, check that you can ssh from the machine on which you are running pantasks to the node without being prompted for a password and without errors reading your shell startup file (.bashrc, .cshrc, .profile, .login and the like):
     85  * Try "ssh myhost"
     86  * If you're prompted for a password, then you need to set up ssh keys, and/or check your ssh configuration.
     87 * Second, check that you can start up 'pclient' over an ssh connection. For that to work, you need to run 'psconfig ipp-2.6.1' (or whatever the IPP version is called on your system) in your startup file.
     88  * Try "ssh myhost pclient". If it works, you can exit pclient with "exit" or "quit".
     89  * If it doesn't work, you need to check your shell configuration (.cshrc or .bashrc) to ensure the IPP environment is being loaded via psconfig. In .cshrc, just make sure that 'psconfig ipp-2.6.1' is executed. In .bashrc, there's a catch: bash does not reliably expand an alias in the same file that defines it (aliases are not expanded in non-interactive shells by default), so you need to write out the 'source' command by hand in .bashrc, as follows (this fix will be included in the INSTALL instructions in releases after 2.6.1):
     90 if [ -f /IPP/psconfig.bash ]; then
     91    alias psconfig='source /IPP/psconfig.bash'
     92    source /IPP/psconfig.bash ipp-2.6.1
     93 else
     94    alias psconfig='echo psconfig not available'
     95 fi
     96 * If there is a long delay between executing the "ssh" command and the shell appearing, there may be a timeout problem.
     97  * Make sure your shell configuration is not too complex.
     98  * One bash user had the IPP environment set up in both .bashrc and .cshrc, so psconfig ran once when bash started and then again when psconfig itself (which is a csh script) started. This produced a long delay, causing the command to time out in pantasks.
     99 * Finally, this can also be triggered by the readline bug, whose fix is described below under [wiki:#detrend_resid_imfile.pl_triggers_the_error_message_'Unknown_option:_--erbose' detrend_resid_imfile.pl triggers the error message 'Unknown option: --erbose']
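
The first two checks above can be scripted roughly as follows. This is a sketch under assumptions, not an IPP tool: the function name check_node and the SSH override variable are hypothetical, and it assumes `which` is available on the remote host. BatchMode=yes makes ssh fail rather than prompt for a password, and ConnectTimeout limits how long the connection attempt may take.

```shell
#!/bin/sh
# Hypothetical helper (not part of the IPP).
# Set SSH to substitute another command for ssh (for example, when testing).
# Usage: check_node HOST
check_node() {
    host=$1
    ssh_cmd=${SSH:-ssh}
    # Step 1: login must succeed without a password prompt.
    if ! "$ssh_cmd" -o BatchMode=yes -o ConnectTimeout=10 "$host" true; then
        echo "$host: passwordless ssh failed"
        return 1
    fi
    # Step 2: pclient must be on the PATH set up by the shell startup file.
    if ! "$ssh_cmd" -o BatchMode=yes -o ConnectTimeout=10 "$host" 'which pclient' >/dev/null; then
        echo "$host: pclient not found"
        return 2
    fi
    echo "$host: ok"
}

# Example: check_node myhost
```

A host that passes both checks can still show up as "down" or "resp" for the timeout or readline reasons listed above.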
     100
     101==== On startup I get "can't find config file. some functions will be unavailable."  Which config file is missing? ====
     102
     103You're missing the .ptolemyrc file. Copy dvo.site from your site-level config directory into your home directory and rename it to .ptolemyrc.
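
A minimal sketch of that copy step is shown below. The function name install_ptolemyrc is hypothetical, and the location of the site-level config directory is site-specific, so it is passed in as an argument. The sketch refuses to overwrite an existing .ptolemyrc.

```shell
#!/bin/sh
# Hypothetical helper (not part of the IPP).
# Usage: install_ptolemyrc SITE_CONFIG_DIR [HOME_DIR]
install_ptolemyrc() {
    src=$1/dvo.site
    dest=${2:-$HOME}/.ptolemyrc
    if [ ! -f "$src" ]; then
        echo "no dvo.site in $1" >&2
        return 1
    fi
    # Leave an existing file alone rather than clobbering local changes.
    if [ -e "$dest" ]; then
        echo "$dest already exists; leaving it untouched" >&2
        return 0
    fi
    cp "$src" "$dest"
}

# Example: install_ptolemyrc /path/to/my/system/level/ippconfig
```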
     104
     105==== How do I re-run files that have failed at some stage after fixing the bug that caused the failure? ====
     106
     107Most of the <code>ippTools</code> binaries (which provide the database interaction) have some version of a <code>-revert</code> command.  For example, if a camera stage failed, try <code>camtool -revertprocessedexp -cam_id 12345</code>.  You can also revert based on an error code using the <code>-code</code> argument.
     108
     109==== Process X succeeded without fault, but it is not moving on to process Y.  How do I force it to proceed? ====
     110
     111First verify that none of the sub-stages failed. For example, if a chipRun has state 'new' and you think it should be 'full' and moving on to camRun, check that all of the contributing chipProcessedImfile rows in your database are completed with fault=0. Then check that the next process in line has not already been initiated. For example, do you have a camRun with the chip_id that you expect?
     112
     113If the process really just stopped without raising a fault and without initiating the next stage, you can try manually setting its state to 'full' and using the appropriate ippTool to initiate the next process in the sequence. For a stalled chipRun with chip_id 2591, the commands would be:
     114
     115<pre>
     116chiptool -dbname myDatabase -updaterun -label 'myLabel' -chip_id 2591 -state full
     117
     118camtool -dbname myDatabase -definebyquery -chip_id 2591 -set_label 'myLabel'
     119</pre>
     120
     121
     122==== detrend_resid_imfile.pl triggers the error message 'Unknown option: --erbose' ====
     123
     124The dropped character in a long line is a classic symptom of a bad 'readline' library.
     125To fix it, run:
     126<pre>
     127% pschecklibs -build -force libreadline
     128</pre>
     129There should be no need to rebuild the IPP, since we're using dynamic libraries.
     130
     131
     132==== I've got a job sitting in the queue, and a bunch of idle hosts... how do I get the job onto a host? ====