This article describes the concept, design, and operation of psched, the Pan-STARRS IPP task scheduler.

Basic Concept

The purpose of psched is to manage the automatic construction and execution of inter-related (often repetitive) operations. psched uses a set of rules to define UNIX commands, and their corresponding command-line arguments, to be performed on some regular, repeated basis. The utility of psched is that it makes it easy to define an analysis system which is completely state-based, as opposed to event-driven.

Consider, for example, a telescope which obtains a collection of images over the course of a night. Every minute or two, it takes an image and writes it to disk. In an event-driven analysis system, the telescope would initiate a process at the end of each exposure. This process would perform an analysis, write some output, then trigger another process. This type of operation works very well for a simple setup with reliable hardware. Such a system becomes more difficult to maintain when hardware failures occur or when multiple systems need to interact with each other. When failures occur, the triggering information (the events) is easily lost, so some mechanism is needed to detect these failures and either re-send the trigger or send an alternative failure-mode trigger. If two systems need to interact, one must block while waiting for results from the other. Stopping and restarting such an analysis system is very delicate, since the appropriate triggers must somehow be set up again, e.g. by noticing which images have not succeeded and restarting them at the appropriate stage. All of these methods of handling complexity and failures are essentially state-based rules. psched allows the easy definition of a totally state-based analysis system.

In a state-based system, some mechanism examines the state of the system and decides which actions to perform based on the current state. In the illustration above, the mechanism could examine the images available (either by examining the disk or by examining the state of a data table) and decide to perform an operation based on what images are available. This makes it very easy to handle complexity and errors. If an analysis fails, either the state is not successfully updated or the error state is recorded; both situations are easy to detect and easy to handle. Restarting the system simply involves starting the state-monitoring mechanism. Combining results from multiple input sources simply involves watching for all of the inputs to become available. psched provides a mechanism to define state monitors, and to define the actions which are performed when those states occur. psched actions consist of initiating UNIX commands, where the arguments of those commands may depend on the results of the state tests.
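The state-based idea can be illustrated outside of psched with a few lines of Python. This is only a sketch under stated assumptions: the data table is modelled as a dictionary mapping image names to a state, and the state names ("new", "done", "failed"), the analyse function, and its failure rule are all hypothetical. It is not how psched itself is implemented.

```python
def analyse(name):
    """Stand-in analysis step (hypothetical): fails for names starting with 'bad'."""
    if name.startswith("bad"):
        raise ValueError("cannot analyse " + name)
    return name.upper()


def run_once(table):
    """One pass of the state monitor: act on every image still in the 'new' state."""
    for name, state in sorted(table.items()):
        if state != "new":
            continue  # already handled, successfully or not
        try:
            analyse(name)
        except ValueError:
            table[name] = "failed"  # the error state is recorded in the table
        else:
            table[name] = "done"
    return table


# Hypothetical data table: image name -> state.
table = {"img1": "new", "bad2": "new", "img3": "done"}
run_once(table)
```

Because every decision is taken from the table, stopping and restarting amounts to calling run_once again: images already marked "done" or "failed" are skipped, and no trigger needs to be re-sent.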

Tasks vs Jobs

The primary function of psched is to repeatedly perform tasks, and to execute jobs on the basis of those tasks. A task consists of a set of rules which describe system state tests to perform on a regular time scale. Based on the results of those state tests, the task then chooses whether or not to construct a job. The task also defines actions to perform upon the completion of a job, based upon the output and exit status of the job. A task thus defines the repeat period. It may optionally define valid or invalid time ranges (e.g., Mon-Fri or 10:00-17:00). The task may also specify that the job be run locally (i.e., in the background on the same computer as psched) or remotely by the parallel process controller (pcontrol). A job may even be restricted to a specific computer managed by pcontrol. An example of a simple task is given below.

  task datalist
    command ls /data/foo
    periods -exec 5.0
    periods -timeout 50.0
    periods -poll 1.0

    task.exit 0
      queueprint stdout
      queuedelete stdout
    end
 
    task.exit 1
      queuepush failure "task failed"
    end
  end

This task does not perform any system state tests; it simply constructs a new job every 5.0 seconds. The job in this case is always the same: ls /data/foo. When the job finishes, if its exit status is 0 (the normal UNIX success status), the resulting output is printed to the screen. If the job returns an exit status of 1 (a failure), the failure queue receives a single entry. Although they are not defined in this case, it is also possible to specify the action to be taken if the job crashes (does not exit normally) or if it times out (runs beyond the specified timeout period). A slightly more complex task, which performs a state test and constructs a command based on that test, is shown below.

  task datalist
    periods -exec 5.0
    periods -timeout 50.0
    periods -poll 1.0

    task.exec 
      $file = `next.file`
      if ($file == "none")
        break
      end
      command cp /data/foo/$file /data/bar
    end

    task.exit 0
      queueprint stdout
      queuedelete stdout
      queuepush copied $file
    end
 
    task.exit 1
      queuepush failure $file
    end
  end

The task.exec macro is executed by psched every 5.0 seconds. This macro executes a (hypothetical, user-defined) UNIX command (next.file) which examines the system state and returns either a filename or the word "none". If the result of this test is "none", the task does nothing: no job is constructed. Otherwise, a job is constructed using the name of the file returned by the state test. Successful jobs have the filename added to the 'copied' queue, while failed jobs add the filename to the 'failure' queue.
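For readers unfamiliar with the task syntax, one cycle of this task can be approximated in Python. The sketch below is an illustration, not psched code, and rests on several assumptions: next_file stands in for the user-defined next.file command and simply returns the first file in the source directory that is not yet in the destination; queues are modelled as a dictionary of lists; and, as a simplification, every non-zero exit status and a timeout are both sent to the 'failure' queue, whereas the task above only defines handlers for exit statuses 0 and 1.

```python
import os
import subprocess


def next_file(src_dir, dst_dir):
    """State test: first file in src_dir not yet present in dst_dir, or "none"."""
    present = set(os.listdir(dst_dir))
    for name in sorted(os.listdir(src_dir)):
        if name not in present:
            return name
    return "none"


def run_task_once(src_dir, dst_dir, queues, timeout=50.0):
    """One execution of the task: test the state, maybe run a job, record the result."""
    name = next_file(src_dir, dst_dir)
    if name == "none":
        return None  # no job is constructed
    # The job: the UNIX cp command, with arguments built from the state test.
    argv = ["cp", os.path.join(src_dir, name), dst_dir]
    try:
        status = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        ).returncode
    except subprocess.TimeoutExpired:
        status = "timeout"
    # Simplification: anything other than exit status 0 counts as a failure.
    queue = "copied" if status == 0 else "failure"
    queues.setdefault(queue, []).append(name)
    return status
```

Calling run_task_once repeatedly, for example every 5.0 seconds, copies one file per call until the state test returns "none". A file whose copy fails is still missing from the destination, so the next state test selects it again without any explicit retry trigger.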

Parallel vs Local Job Processing

Task Restrictions

Inter-Task and Inter-Job Communications

psched Design

The Opihi Shell

Task List

Job List

pcontrol Interface