Version 11 (modified by welling, 17 years ago)

An Alternate Data Distribution Client

update_mirror.py is a simple client for the IPP distribution system. It requires only Python 2.6 to run; the IPP software need not be installed. The script understands the IPP distribution system communication protocol and is intended to be run from cron to implement daily updates of a mirror site.

The advantages of the script are simplicity, brevity, ease of installation and ease of operation. Potential disadvantages are:

The script does not maintain a local MySQL copy of the distribution source database. The database can be imported directly, however, and the corresponding information is also available in the .mdc files of the distribution.
The script is not intended to serve as a distribution server in its own right, so it cannot form part of a distribution chain using the IPP distribution mechanism. The mirrored files can be redistributed using rsync, however.
The distribution protocol is fairly stable, but it may change in the future. Any such change will break the script until a corresponding update is made.

Installation Instructions

1. Create an empty directory on the filesystem where you want to mirror the data distribution. On Odyssey the corresponding directory is /n/panstarrs/data/MIRROR . The source files are attached below.
2. Copy update_mirror.py and parse_config.py into that directory.
3. Customize update_mirror.py for your site by editing the variables at the top of the file and making the following changes:
   1. Set mirrorRootDir to the root directory of your mirror, presumably the directory created in step 1 above.
   2. Set desiredProductList to contain a list of the data products you want to mirror.
   3. Possibly update topURI, if the distribution server at IfA has changed from its current host.
   4. Possibly modify hiddenProductList to include the names of new data products not visible from the top distribution layer. This bit is a little cryptic: the original spec for the distribution mechanism said that the data products would be visible from the top-level directory of the distribution server, but this has been changed for security reasons. New products are now announced on the IPP mailing lists and must be edited into hiddenProductList manually.
4. Possibly modify your crontab to call the update script as often as desired. This is currently unnecessary because regular distributions of Pan-STARRS data have not yet begun. Be sure to pick an update interval long enough that previous updates complete before a new update begins! The script does not currently contain a mechanism to detect simultaneous instances of itself.

As an example, the customized variables used on Odyssey are currently:

topURI= "http://alala2.ifa.hawaii.edu/ipp049/ds"
mirrorRootDir= "/panstarrs/data/MIRROR"
desiredProductList= ['md-all-rel-200907']
hiddenProductList= ['md-all-rel-200907',
                    'durham-200907',
                    'qub-200907',
                    'edb-200907',
                    'jhu-200907',
                    'threepi-all-200908']
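
Because the script cannot detect simultaneous instances of itself, one option is to have cron call a small wrapper that takes a lock before starting it. The following is a minimal sketch of that idea, not part of update_mirror.py: the function name run_exclusively and the lock file name are hypothetical, it relies on fcntl.flock (Unix only), and it is written for current Python 3 rather than the Python 2.6 the script itself targets.

```python
import fcntl
import subprocess


def run_exclusively(command, lock_path):
    """Run command only if nobody else currently holds lock_path.

    Returns the command's exit status, or None if another holder
    already has the lock and the command was skipped.
    (Hypothetical helper; not part of update_mirror.py.)
    """
    lock_file = open(lock_path, "w")
    try:
        try:
            # Exclusive, non-blocking lock: fails at once if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return None
        return subprocess.call(command)
    finally:
        # Closing the file releases the lock if we held it.
        lock_file.close()
```

A cron job could then call, for example, run_exclusively(["/panstarrs/data/MIRROR/update_mirror.py"], "/panstarrs/data/MIRROR/update_mirror.lock"); a run that starts while the previous one is still busy would simply be skipped.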

Mode of Operation

The distribution system provides a list of products; new data is appended to those products over time. The client must remember the most recent update time for a product and only download new material posted since that time. It must also keep track of previously failed downloads so that they can be re-tried later.

update_mirror.py stores its status information in two comma-separated-value files, the default names for which are product_status.csv and download_failures.csv. It maintains a log called update_mirror.log by default. If it is interrupted it will also save the state of its internal queues in a file called update_mirror_leftover_work.pkl so that it can pick up where it left off when it is restarted. When it starts up, it:

1. Re-queues leftover work from a previous crash, if any.
2. Re-queues any entries from the download failures file.
3. Checks for updates to desired products which are more recent than the times recorded in the product_status.csv file, and adds those filesets to the download queue.
4. Works through the download queues using multiple instances of wget and tar, as specified by the IPP distribution protocol.

When it finishes working through the queue, it updates its status files and exits. Note: if you try to kill it with SIGTERM (the normal kill signal), it will finish processing the filesets currently in progress before quitting. This can take a while, but it's worth it in terms of cleanup. If you need to stop it immediately, kill it with SIGKILL (kill -9). You are then responsible for cleanup.
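
The start-up sequence described above can be sketched roughly as follows. This is an illustration only, not the code in update_mirror.py: the function names, the two-column CSV layouts (product, fileset ID and product, time), the pickled list of (product, fileset ID) pairs, and the use of sortable ISO-style time strings are all assumptions, and the `available` argument stands in for whatever the distribution server reports. It is written for current Python 3 rather than Python 2.6.

```python
import csv
import os
import pickle


def load_rows(csv_path):
    """Return the non-empty rows of a CSV file, or [] if it is missing."""
    if not os.path.exists(csv_path):
        return []
    with open(csv_path, newline="") as f:
        return [row for row in csv.reader(f) if row]


def build_queue(leftover_path, failure_path, status_path, available):
    """Assemble a download queue of (product, fileset_id) pairs.

    `available` maps product name -> list of (time, fileset_id) pairs.
    All file layouts here are assumed for illustration.
    """
    queue = []
    # 1. Leftover work saved when a previous run was interrupted.
    if os.path.exists(leftover_path):
        with open(leftover_path, "rb") as f:
            queue.extend(pickle.load(f))
    # 2. Previously failed downloads (assumed columns: product, fileset_id).
    for product, fileset_id in load_rows(failure_path):
        queue.append((product, fileset_id))
    # 3. Filesets newer than the last recorded time per product
    #    (assumed columns: product, time).
    last_seen = dict(load_rows(status_path))
    for product, filesets in available.items():
        for when, fileset_id in sorted(filesets):
            if when > last_seen.get(product, ""):
                queue.append((product, fileset_id))
    # Drop duplicates while keeping the original order.
    seen = set()
    unique = []
    for item in queue:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique
```

A product with no entry in the status file is treated here as never downloaded, so all of its filesets are queued.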

Command Line Options

This is the result of doing './update_mirror.py --help':

    usage: update_mirror.py [-v][-d][--mirrorroot RootDir]
                [--statusfile Filename][--failurefile Filename]
                [--pipecrash Filename][--logfile Filename]
                [--last LastFilesetId]

          where:
          -v requests verbose output
          -d requests debugging output
          --mirrorroot specifies the directory to store the incoming
            distribution (default /panstarrs/data/MIRROR)
          --statusfile specifies the name of the .csv file storing
            information about the last downloaded fileset in each
            distribution (default product_status.csv)
          --failurefile specifies the name of the .csv file storing
            information about filesets for which downloading has failed
            (default download_failures.csv)
          --pipecrash specifies the name of the Python .pkl file in which
            the contents of the working download queue are to be stored
            in case of an interrupt (default update_mirror_leftover_work.pkl)
          --logfile specifies the name of the log (default update_mirror.log)
          --last specifies a filesetID at which to stop downloading
            (default None)

    NOTE: if you 'kill' this script (with SIGTERM, the default) it will
    attempt to clean up and take good notes before it dies.  This can
    take a while as the worker threads finish downloading.  If you must
    kill it immediately, use 'kill -9'.
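
The SIGTERM behaviour described in the note above follows a common pattern: the signal handler only records that a stop was requested, and the main loop checks that flag between work items. A minimal single-threaded sketch of the pattern is shown below. It is an assumption-laden illustration rather than the script's actual code (which uses worker threads); the names request_stop, install_handler and drain are hypothetical, and it is written for current Python 3.

```python
import pickle
import signal
import threading

# Set by the signal handler, polled by the work loop.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: record the request and return at once."""
    stop_requested.set()


def install_handler():
    """Route SIGTERM to request_stop (call from the main thread)."""
    signal.signal(signal.SIGTERM, request_stop)


def drain(queue, process, leftover_path):
    """Process queue items until done or until a stop is requested.

    The item in progress always finishes. Anything not yet started is
    pickled to leftover_path so a later run can pick it up.
    Returns the list of unprocessed items.
    """
    while queue and not stop_requested.is_set():
        process(queue.pop(0))
    if queue:
        with open(leftover_path, "wb") as f:
            pickle.dump(queue, f)
    return queue
```

SIGKILL cannot be caught, which is why 'kill -9' stops the process immediately and leaves no chance to save leftover work.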

Special Situations

Sometimes a fileset included in a distribution will be found to be invalid, for example because it was published with the wrong checksum. The distribution protocol doesn't provide a way to tell the client to give up on a fileset, so these problems are currently just announced by email. If this happens you will have to manually edit download_failures.csv to remove the entry for the offending fileset; otherwise update_mirror.py will try to download it each time it runs.
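
If hand-editing download_failures.csv becomes tedious, the removal can be scripted. The sketch below is a hypothetical helper, not part of the distribution: since the column layout of the failures file is not described here, it simply drops every row in which any field equals the given fileset ID. It is written for current Python 3 and should only be run while update_mirror.py is not running.

```python
import csv
import os
import tempfile


def drop_fileset(failure_path, fileset_id):
    """Rewrite failure_path without rows that mention fileset_id.

    A row is dropped if any of its fields equals fileset_id exactly
    (the real column layout is not assumed). Returns the number of
    rows removed.
    """
    with open(failure_path, newline="") as f:
        rows = list(csv.reader(f))
    kept = [row for row in rows if fileset_id not in row]
    # Write to a temporary file in the same directory, then rename it
    # into place, so an interruption cannot leave a half-written file.
    directory = os.path.dirname(os.path.abspath(failure_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        csv.writer(f).writerows(kept)
    os.replace(tmp_path, failure_path)
    return len(rows) - len(kept)
```

Keeping a backup copy of the failures file before running such a helper is a sensible precaution.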

If you only want part of a distribution, for example only the 'camera' stage, you can modify update_mirror.py to ignore the rest. The easiest place to do this is the clause in the routine fetchUpdates where culledProductList is constructed. Each entry in productList is a dictionary containing all the columns of the table describing the filesets; simply exclude the filesets you don't want from culledProductList. Note that there is no way to keep track of the excluded filesets between runs, though. If you want them later, you will have to manually adjust the script and product_status.csv to go back and re-download filesets from that time interval.
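
As an illustration of that kind of filter, the sketch below keeps only the dictionary entries whose value in one column is in a wanted set. The function name cull_filesets is hypothetical, and so is any column name you pass to it: check the actual keys of the productList entries in fetchUpdates before adapting it.

```python
def cull_filesets(product_list, column, wanted_values):
    """Keep only the entries whose `column` value is in wanted_values.

    product_list is a list of dictionaries, one per fileset. Entries
    that lack the column entirely are dropped as well.
    """
    wanted = set(wanted_values)
    return [entry for entry in product_list if entry.get(column) in wanted]
```

With a hypothetical column named "type", a call such as cull_filesets(productList, "type", ["camera"]) would produce a list containing only the camera-stage filesets.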

Attachments (2)

parse_config.py (14.8 KB) - added by welling 17 years ago.
update_mirror.py (29.4 KB) - added by welling 16 years ago.
