==== What is pantasks? ====
pantasks is the IPP parallel process manager for distributed computing across multiple nodes. See also the [wiki:IppTools_FAQ ippTools FAQ] for information on some of the commands that pantasks launches.

==== How do I start pantasks? ====
Start pantasks from a terminal window:
  > pantasks

  Welcome to pantasks - parallel task scheduler

Load some pantasks commands:
  pantasks: module pantasks.pro

Or, if you have a modified pantasks.pro file, load it by path:
  pantasks: input /home/username/pantasks/pantasks.pro

After loading the pantasks.pro file, you can add a database:
  pantasks: add.database mydatabase

If you don't add a database, pantasks uses the one declared in your .ipprc file.

==== How do I configure pantasks? ====

 * .pantasksrc
 * .ptolemyrc

==== What are the primary pantasks commands? ====

Connect to a controller host:
  pantasks: controller host add myhost

Check the controller host status (NOTE: this sometimes returns no information, even when there are active hosts):
  pantasks: controller status

Check the processing status:
  pantasks: status

Show additional timing details:
  pantasks: status -taskstats

Start processing:
  pantasks: run

Stop adding new processes, but finish out the queue:
  pantasks: stop

Stop processes immediately:
  pantasks: halt

Exit pantasks:
  pantasks: exit

==== How do I get more verbose output from pantasks? ====

  pantasks: $VERBOSE = 1

Raise the number above 1 for progressively more verbose output.

==== Why does my process fail in pantasks but succeed on the command line? ====

"I copied the command directly from the pantasks error stream (or from the verbose command output) and pasted it into a separate terminal. It succeeds in the terminal, but it failed in pantasks."

You may have a config error in your home directory. When pantasks executes a command, it does so from the user's home directory on whichever remote host was assigned to that process. If that home directory contains out-of-date config directories, they may be loaded '''before''' the system-level config directories. Here is an example.

In your .ipprc file, you may have defined the path to your configuration directory with something like this:

  PATH STR /path/to/my/system/level/ippconfig

In your system.config file, you can then define the location of your GPC1 camera.config file with these lines:

  CAMERAS METADATA
  GPC1 STR gpc1/camera.config
  END

When you run any script that needs to reference the GPC1 camera, it will look:
 * first in the current directory, for ./gpc1/camera.config
 * next in the directory defined by your .ipprc PATH variable (in this case, /path/to/my/system/level/ippconfig/gpc1/camera.config)

Thus, if an old gpc1/camera.config is lying around in your home directory, pantasks will find it first and hit a config error that does not appear when you run the command from the command line outside your home directory.

'''Solution:''' move any old config files out of your home directory, or update them to remove the config issue.
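
The lookup order described above suggests a quick check for stale files. The following is a minimal sketch, not an IPP tool: the function name find_shadowing_configs is hypothetical, and it assumes the system-level config directory is the one named by PATH in your .ipprc. It lists every file under the system-level config directory that also exists at the same relative path under the directory the command runs from, and would therefore be found first.

```shell
# Hypothetical helper (not part of IPP): list files in <run_dir> (normally
# $HOME) that shadow a file with the same relative path in the
# system-level config directory.
# Usage: find_shadowing_configs <run_dir> <system_config_dir>
find_shadowing_configs() {
    run_dir=$1
    config_dir=$2
    # Walk the system-level tree, strip the leading "./", and test whether
    # each relative path also exists under run_dir.
    (cd "$config_dir" && find . -type f) | sed 's|^\./||' | sort |
    while IFS= read -r rel; do
        if [ -e "$run_dir/$rel" ]; then
            printf '%s\n' "$run_dir/$rel"
        fi
    done
}

# Example (the config path is the placeholder used in the text above):
# find_shadowing_configs "$HOME" /path/to/my/system/level/ippconfig
```

Any path it prints is a candidate for moving out of the home directory or updating.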

==== Why are my nodes "down" or "resp"? ====

 * First, check that you can ssh from the machine on which you are running pantasks to the node without being prompted for a password and without errors from your shell startup files (.bashrc, .cshrc, .profile, .login and the like):
   * Try "ssh myhost".
   * If you're prompted for a password, you need to set up ssh keys and/or check your ssh configuration.
 * Second, check that you can start 'pclient' over an ssh connection. For that to work, your startup file must run 'psconfig ipp-2.6.1' (or whatever the IPP version is called on your system).
   * Try "ssh myhost pclient". If it works, you can exit pclient with "exit" or "quit".
   * If it doesn't work, check your shell configuration (.cshrc or .bashrc) to ensure that the IPP environment is being loaded via psconfig. In .cshrc, just make sure that 'psconfig ipp-2.6.1' is executed. In .bashrc, there's a catch: bash does not expand aliases in non-interactive shells by default, so the psconfig alias cannot be relied on later in the same file. You need to expand the alias by hand and call 'source' directly in .bashrc, thus (this fix will be included in the INSTALL instructions in releases after 2.6.1):
     if [ -f /IPP/psconfig.bash ]; then
       alias psconfig='source /IPP/psconfig.bash'
       source /IPP/psconfig.bash ipp-2.6.1
     else
       alias psconfig='echo psconfig not available'
     fi
 * If there is a long delay between executing the "ssh" command and the shell appearing, there may be a timeout problem.
   * Make sure your shell configuration is not too complex.
   * One bash user had the IPP environment set up in both .bashrc and .cshrc, so psconfig was called once when bash started and then again when psconfig itself (a csh script) started. This produced a long delay, causing the command to time out in pantasks.
 * Finally, this can also be triggered by the readline bug, whose fix is described below under [wiki:#detrend_resid_imfile.pl_triggers_the_error_message_'Unknown_option:_--erbose' detrend_resid_imfile.pl triggers the error message 'Unknown option: --erbose']
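
The first two checks above can be scripted. This is a minimal sketch under stated assumptions, not part of IPP: check_node is a hypothetical name and myhost is a placeholder. Looking up pclient with "which" over ssh is used here as a stand-in for starting pclient itself, on the assumption that any non-interactive ssh command sees the same startup-file environment. BatchMode=yes makes ssh fail instead of prompting for a password, and ConnectTimeout bounds the wait for the connection.

```shell
# Hypothetical helper (not part of IPP): run the ssh checks for one node.
# Prints one status line and returns non-zero on the first failed check.
check_node() {
    host=$1
    # BatchMode=yes: fail rather than prompt for a password.
    # ConnectTimeout=10: give up if connecting takes more than 10 seconds.
    # stderr is left visible so startup-file errors show up on the terminal.
    if ! ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true </dev/null >/dev/null; then
        echo "$host: ssh failed (check ssh keys and ssh configuration)"
        return 1
    fi
    # Assumption: if pclient is on PATH for a non-interactive ssh command,
    # the startup files are loading the IPP environment.
    if ! ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" which pclient </dev/null >/dev/null 2>&1; then
        echo "$host: pclient not found (check psconfig in your startup file)"
        return 1
    fi
    echo "$host: ok"
}

# Example:
# check_node myhost
```

A node that passes both checks can still be slow to respond, so the manual "ssh myhost pclient" test remains the final word.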

==== On startup I get "can't find config file. some functions will be unavailable." Which config file is missing? ====

You're missing the .ptolemyrc file. Copy dvo.site from your site-level config directory into your home directory and rename it to .ptolemyrc.
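
The copy step can be sketched as a small shell function. This is an illustration only: install_ptolemyrc is a hypothetical name, and the site-level config directory must be supplied by you. It refuses to overwrite an existing .ptolemyrc.

```shell
# Hypothetical helper (not part of IPP): copy dvo.site to ~/.ptolemyrc
# unless a .ptolemyrc already exists.
# Usage: install_ptolemyrc <site_config_dir> [home_dir]
install_ptolemyrc() {
    site_dir=$1
    home_dir=${2:-$HOME}
    if [ -e "$home_dir/.ptolemyrc" ]; then
        echo "$home_dir/.ptolemyrc already exists; leaving it untouched"
        return 0
    fi
    if [ ! -f "$site_dir/dvo.site" ]; then
        echo "no dvo.site found in $site_dir" >&2
        return 1
    fi
    cp "$site_dir/dvo.site" "$home_dir/.ptolemyrc"
}

# Example (replace the placeholder with your site-level config directory):
# install_ptolemyrc /path/to/my/site/config
```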

==== How do I re-run files that have failed at some stage after fixing the bug that caused the failure? ====

Most of the <code>ippTools</code> binaries (which provide the database interaction) have some version of a <code>-revert</code> command. For example, if a camera stage failed, try <code>camtool -revertprocessedexp -cam_id 12345</code>. You can also revert based on an error code using the <code>-code</code> argument.
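
When several camera-stage runs need reverting, the example command above can be generated in a loop. A minimal sketch, assuming only the <code>camtool -revertprocessedexp -cam_id</code> form quoted above; print_cam_reverts is a hypothetical helper, and the second cam_id in the example is made up for illustration. It prints the commands instead of running them, so the list can be reviewed first.

```shell
# Hypothetical helper (not part of ippTools): print one revert command per
# cam_id so the list can be reviewed before anything touches the database.
print_cam_reverts() {
    for cam_id in "$@"; do
        case $cam_id in
            # Reject empty or non-numeric ids rather than passing them on.
            ''|*[!0-9]*) echo "skipping non-numeric cam_id: $cam_id" >&2 ;;
            *) echo "camtool -revertprocessedexp -cam_id $cam_id" ;;
        esac
    done
}

# Example: review the commands, then pipe them to a shell to run them.
# print_cam_reverts 12345 12346
# print_cam_reverts 12345 12346 | sh
```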

==== Process X succeeded without fault, but it is not moving on to process Y. How do I force it to proceed? ====

First verify that none of the sub-stages failed. For example, if a chipRun has state 'new' and you think it should be 'full' and moving on to camRun, check that all of the contributing chipProcessedImfile rows in your database are completed with fault=0. Then check that the next process in line was not already initiated: for example, do you have a camRun with the chip_id that you expect?

If the process really did stop without raising a fault and without initiating the next stage, you can try manually setting its state to 'full' and using the appropriate ippTool to initiate the next process in the sequence. For a stalled chipRun with chip_id 2591, this would be done like this:

<pre>
chiptool -dbname myDatabase -updaterun -label 'myLabel' -chip_id 2591 -state full

camtool -dbname myDatabase -definebyquery -chip_id 2591 -set_label 'myLabel'
</pre>

==== detrend_resid_imfile.pl triggers the error message 'Unknown option: --erbose' ====

A dropped character in a long line is a classic symptom of a bad 'readline' library.
To fix it, run:
<pre>
% pschecklibs -build -force libreadline
</pre>
There should be no need to rebuild the IPP, since it uses dynamic libraries.
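
Because the fix replaces a shared library, one way to confirm which readline a binary picks up is to ask the dynamic linker. A minimal sketch, assuming a Linux host where ldd is available and assuming the binary in question (pclient is used here only as an example) is dynamically linked against readline; linked_readline is a hypothetical helper name.

```shell
# Hypothetical helper (not part of IPP): show which readline shared library
# the dynamic linker resolves for a program found on PATH.
# Prints nothing (and returns non-zero) if no readline library is linked.
linked_readline() {
    bin=$(command -v "$1") || { echo "$1: not found on PATH" >&2; return 1; }
    ldd "$bin" | grep -i readline
}

# Example:
# linked_readline pclient
```

Running it before and after the pschecklibs rebuild shows whether the binary now resolves to the rebuilt library.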

==== I've got a job sitting in the queue, and a bunch of idle hosts... how do I get the job onto a host? ====