Version 9 (modified by welling, 17 years ago)
--

An Alternate Data Distribution Client

update_mirror.py is a simple client for the IPP distribution system. It requires only Python 2.6 to run; the IPP software need not be installed. The script understands the IPP distribution system communication protocol and is intended to be run from cron to implement daily updates of a mirror site.

The advantages of the script are simplicity, brevity, ease of installation and ease of operation. Potential disadvantages are:

 * The script does not maintain a local MySQL copy of the distribution source database. That database can be imported directly, however, and the corresponding information is available in the .mdc files of the distribution.
 * The script is not intended to serve as a distribution server in its own right, so it cannot form part of a distribution chain using the IPP distribution mechanism. The mirrored files can be redistributed using rsync, however.
 * The distribution protocol is fairly stable, but it may change in the future. Any such change will break the script until it is updated to match.

Installation Instructions

 1. Create an empty directory on the filesystem where you want to mirror the data distribution. On Odyssey the corresponding directory is /n/panstarrs/data/MIRROR . The source files are attached below.
 2. Copy update_mirror.py and parse_config.py into that directory.
 3. Customize update_mirror.py for your site by editing the variables at the top of the file:
    1. Set mirrorRootDir to the root directory of your mirror, presumably the directory created in step 1 above.
    2. Set desiredProductList to the list of data products you want to mirror.
    3. Update topURI if the distribution server at IfA has moved from its current host.
    4. If necessary, add to hiddenProductList the names of new data products that are not visible from the top distribution layer. The original specification for the distribution mechanism said that data products would be visible from the top-level directory of the distribution server, but this was changed for security reasons. New products are now announced on the IPP mailing lists and must be added to hiddenProductList manually.
 4. Optionally modify your crontab to call the update script as often as desired. This is currently unnecessary because regular distributions of Pan-STARRS data have not yet begun. Be sure to pick an update interval long enough that one update completes before the next begins! The script does not currently contain a mechanism to detect simultaneous instances of itself.
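
Since the script does not detect concurrent runs itself, a cron job could be protected by a small wrapper that takes an exclusive lock before starting. The following is a minimal sketch and is not part of update_mirror.py: the function name acquire_single_instance_lock and the lock file location are hypothetical, and it relies on fcntl.flock, which is available only on Unix-like systems.

```python
import fcntl


def acquire_single_instance_lock(lock_path):
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file object on success (keep it open for as long
    as the lock is needed) or None if another holder already has it.
    The name and behaviour are illustrative assumptions, not taken
    from update_mirror.py.
    """
    handle = open(lock_path, "a")  # "a" avoids truncating an existing file
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:  # alias of OSError on Python 3
        handle.close()
        return None
    return handle
```

A wrapper would call this before launching the update and exit at once if it returns None. The lock is released when the file is closed or the process ends, so a crashed run does not leave a stale lock behind.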

As an example, the customized variables used on Odyssey are currently:

topURI= "http://alala2.ifa.hawaii.edu/ipp049/ds"
mirrorRootDir= "/panstarrs/data/MIRROR"
desiredProductList= ['md-all-rel-200907']
hiddenProductList= ['md-all-rel-200907',
                    'durham-200907',
                    'qub-200907',
                    'edb-200907',
                    'jhu-200907',
                    'threepi-all-200908']

Mode of Operation

The distribution system provides a list of products; new data is appended to those products over time. The client must remember the most recent update time for a product and only download new material posted since that time. It must also keep track of previously failed downloads so that they can be re-tried later.
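
The "only download what is new" rule can be illustrated with a short sketch. It assumes a two-column CSV status file (product name, timestamp) and an ISO-like timestamp format. Both are assumptions made for illustration, as are the function names, and the real file layout used by update_mirror.py may differ.

```python
import csv
from datetime import datetime

# Assumed timestamp layout; the real status file may use another format.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def load_product_status(csv_file):
    """Map product name -> datetime of the most recent mirrored update."""
    status = {}
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        product, stamp = row[0], row[1]
        status[product] = datetime.strptime(stamp, TIME_FORMAT)
    return status


def select_new_filesets(product, filesets, status):
    """Return the (name, posted_time) pairs newer than the recorded time.

    A product with no recorded time is treated as never mirrored, so
    every fileset is selected. Results are ordered oldest first.
    """
    last_seen = status.get(product)
    newer = [(name, when) for name, when in filesets
             if last_seen is None or when > last_seen]
    return sorted(newer, key=lambda pair: pair[1])
```

Sorting oldest first means that, if a run is cut short, the recorded update time can be advanced safely to the last fileset that completed.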

update_mirror.py stores its status information in two comma-separated-value files, the default names for which are product_status.csv and download_failures.csv. It maintains a log called update_mirror.log by default. If it is interrupted it will also save the state of its internal queues in a file called update_mirror_leftover_work.pkl so that it can pick up where it left off when it is restarted. When it starts up, it:

 1. Re-queues leftover work from a previous crash, if any.
 2. Re-queues any entries from the download failures file.
 3. Checks for updates to desired products that are more recent than the times recorded in the product_status.csv file, and adds those filesets to the download queue.
 4. Works through the download queues using multiple instances of wget and tar, as specified by the IPP distribution protocol.
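
The ordering of the startup steps can be sketched as a single queue-building function. This is an illustration only: build_download_queue is a hypothetical name, the items are treated as opaque hashable identifiers, and the duplicate removal is an assumption rather than documented behaviour of update_mirror.py.

```python
def build_download_queue(leftover, failures, new_filesets):
    """Combine the three work sources in startup order.

    Leftover work from an interrupted run comes first, then earlier
    failures, then newly posted filesets. An item that appears in more
    than one source is queued only once (an assumption of this sketch).
    """
    queue = []
    seen = set()
    for source in (leftover, failures, new_filesets):
        for item in source:
            if item not in seen:
                seen.add(item)
                queue.append(item)
    return queue
```

For example, a fileset that was both left over from a crash and listed as a failure would be attempted once, at its earlier position in the queue.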

When it finishes working through the queue, it updates its status files and exits. Note: if you send it SIGTERM (the default signal sent by kill), it will finish processing the filesets currently in progress before quitting. This can take a while, but it leaves the mirror in a clean state. If you need to stop it immediately, send SIGKILL (kill -9). You are then responsible for cleanup.
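
A common way to get this "finish the current item, then stop" behaviour is a signal handler that only sets a flag, which the work loop checks between items. The sketch below shows that pattern under stated assumptions: GracefulStop, drain_queue and the process callback are hypothetical names, and the actual structure of update_mirror.py may differ.

```python
import pickle
import signal


class GracefulStop(object):
    """Signal handler that records a stop request instead of exiting."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        self.requested = True


def install_sigterm_handler():
    """Register a GracefulStop for SIGTERM and return it."""
    stop = GracefulStop()
    signal.signal(signal.SIGTERM, stop)
    return stop


def drain_queue(queue, process, stop, leftover_path):
    """Process items until the queue is empty or a stop is requested.

    The flag is checked only between items, so an item that is in
    progress always runs to completion. Unprocessed items are pickled
    to leftover_path so a later run can pick them up.
    """
    done = []
    while queue and not stop.requested:
        item = queue.pop(0)
        process(item)
        done.append(item)
    if queue:
        with open(leftover_path, "wb") as handle:
            pickle.dump(queue, handle)
    return done
```

SIGKILL cannot be caught by any handler, which is why stopping the script that way skips this bookkeeping and leaves the cleanup to the operator.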

Attachments (2)

parse_config.py (14.8 KB) - added by welling 17 years ago.
update_mirror.py (29.4 KB) - added by welling 16 years ago.
