
Version 69 (modified by rhenders)


IPP to PSPS interface: ippToPsps

ippToPsps is the interface between IPP and PSPS. In short, ippToPsps creates FITS files from IPP data, then publishes them to a datastore in the form of batches. On the PSPS side, the DXLayer polls the datastore, collects batches when they become available, then converts the contents to csv files before sending them on to SQL Server loader software, which merges them into the PSPS database. Ultimately there will be feedback from PSPS regarding errors in the received data, to which ippToPsps will need to respond.
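The flow above can be sketched in miniature. This is only an illustration of the batch lifecycle (publish, collect, convert, merge); the real interface is a C program plus Perl scripts, and all names here are hypothetical.

```python
# Illustrative sketch of the ippToPsps -> datastore -> DXLayer -> loader flow.
# All identifiers are hypothetical stand-ins, not the real interface.

def publish_batch(datastore, batch_id, fits_files):
    """ippToPsps side: post a batch of FITS files to the datastore."""
    datastore[batch_id] = {"files": list(fits_files), "state": "published"}

def poll_and_load(datastore, database):
    """DXLayer side: collect new batches, convert to csv, hand to the loader."""
    for batch_id, batch in datastore.items():
        if batch["state"] == "published":
            # stand-in for FITS-to-csv conversion and the SQL Server merge
            csv_files = [f.replace(".fits", ".csv") for f in batch["files"]]
            database.extend(csv_files)
            batch["state"] = "merged"

datastore, database = {}, []
publish_batch(datastore, "o1234.p2", ["XY01.fits", "XY02.fits"])
poll_and_load(datastore, database)
```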

The binary tables in the FITS files generated by ippToPsps are intended to match the PSPS database schemas exactly, so that any alteration to the PSPS schema affects only the ippToPsps code, not the DXLayer. A certain amount of data validation is performed by ippToPsps before publication, with further validation occurring at the loading and merge stages on the PSPS side.

The outputs of ippToPsps are referred to as 'batches', and are detailed below.

Batch name     | PSPS name | Description                                                | IPP source
Initialisation | IN        | metadata for the other batches, e.g. filter ID, survey ID  | generated from XML config
Detection      | P2        | single-exposure detections                                 | one smf file per exposure plus associated DVO database
Difference     | ?         | difference-image detections                                | one cmf file per skycell per exposure
Stack          | ST        | stack-image detections                                     | generated from...


Configuration

Due to the potential for changes in both the input and output of ippToPsps, the code is heavily configurable. Configuration files are in XML, as this affords the most flexibility (human- and machine-readable, extensible, self-describing, etc.). ippToPsps is pointed at a config directory, under which subdirectories for each batch type hold the various XML config files.
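A hypothetical sketch of what one batch-type config might look like; the actual element names in the real ippToPsps configs may differ:

```xml
<!-- Hypothetical sketch of a batch-type config file.
     Element and attribute names are illustrative only. -->
<batchConfig type="detection">
  <pspsName>P2</pspsName>
  <source>smf</source>
  <tableShapes file="detectionTables.xml"/>
  <mappings file="detectionMappings.xml"/>
</batchConfig>
```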

Table shapes

All FITS tables mirror PSPS database tables. Since the PSPS schema will probably remain in a state of flux for some time, ippToPsps reads table shapes from an XML config rather than hard-coding table descriptions. This config can be regenerated from the master PSPS schema using a Perl script (pspsSchema2xml.pl) in the scripts directory. The same script also generates C header files for each batch type; these headers contain enums for each PSPS table and are used by the code at runtime. This helps minimise code changes.
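The real generator is the Perl script pspsSchema2xml.pl; this Python sketch only illustrates the idea of deriving a C-style enum from a table-shape config so that column indices never need hard-coding. The XML element names here are hypothetical.

```python
# Illustrative only: turn a (hypothetical) table-shape XML fragment into
# a C enum, mimicking what pspsSchema2xml.pl does for each PSPS table.
import xml.etree.ElementTree as ET

SHAPE_XML = """
<table name="FrameMeta">
  <column name="frameID" type="LONG"/>
  <column name="frameName" type="STRING"/>
</table>
"""

def c_enum_from_shape(xml_text):
    table = ET.fromstring(xml_text)
    prefix = table.get("name").upper()
    entries = ",\n    ".join(f"{prefix}_{c.get('name').upper()}"
                             for c in table.findall("column"))
    return f"enum {table.get('name')}Cols {{\n    {entries}\n}};"

print(c_enum_from_shape(SHAPE_XML))
```

Regenerating both the XML and the headers from the master schema means a PSPS schema change is absorbed by rerunning one script rather than editing code by hand.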

Initialisation data

The table shapes of the initialisation batch are handled as above. The actual initialisation data (lists of filters etc), which is liable to change, is held in a config and used by ippToPsps to populate the tables in the FITS file. This data is also used when generating other batch types, detections for example, as look-up tables for setting survey ID etc.
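How initialisation data can double as a look-up table when other batch types are built can be sketched as follows; the filter names and IDs are hypothetical examples, not values from the real config.

```python
# Sketch: initialisation data reused as a look-up table when populating
# detection batches. Filter names and IDs here are hypothetical.
INIT_FILTERS = {"g.00000": 1, "r.00000": 2, "i.00000": 3}

def filter_id_for(smf_filter_name):
    """Map an IPP filter name from an smf header to the PSPS filterID."""
    return INIT_FILTERS[smf_filter_name]
```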

IPP to PSPS mappings

Most data to be loaded into the FITS tables comes from IPP smf or cmf files. For many columns, there is a direct mapping between these files and the PSPS database column. These mappings are detailed in a config.
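Applying such a mapping config amounts to copying each smf/cmf column into its PSPS counterpart. The column names below are hypothetical stand-ins for entries in the real mapping config.

```python
# Sketch: fill a PSPS table row from one smf table row using a
# PSPS-column -> IPP-column mapping (hypothetical column names).
MAPPINGS = {"raMean": "RA_PSF", "decMean": "DEC_PSF", "instFlux": "PSF_INST_FLUX"}

def map_row(smf_row):
    """Build one PSPS row from one smf row via the direct mappings."""
    return {psps_col: smf_row[ipp_col] for psps_col, ipp_col in MAPPINGS.items()}
```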

Architecture

ippToPsps

ippToPsps is a C program within the IPP build. Given the correct arguments, it generates a single FITS file for the specified product (above). The program is run from a Perl script, which itself generates a list of exposure IDs based on arguments provided by the user (label etc.). One instance of ippToPsps is run per exposure ID. Upon completion, the calling script bundles the resultant FITS files into a batch, then publishes it to the datastore, ready for collection by the DXLayer.
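The driver's control flow, one ippToPsps invocation per exposure followed by bundling, can be sketched as below. The real driver is a Perl script invoking the C binary; the labels and exposure IDs here are hypothetical.

```python
# Illustrative control flow of the calling script (really Perl + a C binary).

def run_ipptopsps(exp_id):
    """Stand-in for one invocation of the ippToPsps binary: one FITS per exposure."""
    return f"{exp_id}.fits"

def make_batch(label, exp_ids):
    """Run ippToPsps per exposure, then bundle the outputs as one batch."""
    fits_files = [run_ipptopsps(e) for e in exp_ids]
    return {"label": label, "files": fits_files}  # ready to publish

batch = make_batch("ThreePi.nt", ["o5432g0123", "o5432g0124"])
```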

DXLayer

The DXLayer polls the datastore waiting for new batches. Upon receipt of a new batch, the FITS files are converted to a csv format suitable for ingest by the ODM.
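The conversion step is essentially table rows out, csv in. This sketch uses stand-in row data; a real implementation would read the binary-table rows with a FITS library first.

```python
# Sketch of the DXLayer's FITS-to-csv step, using stand-in row data.
import csv
import io

def table_to_csv(colnames, rows):
    """Serialise one table (header + rows) as csv text for the ODM loader."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(colnames)
    writer.writerows(rows)
    return buf.getvalue()
```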

ODM Loader

Performs validation on incoming data based on metadata previously loaded as an initialisation batch (see above). If validation succeeds, new batches are merged into the PSPS database. One basic requirement of the ODM is that all detections in a detection batch have unique object IDs. Object IDs are assigned by the IPP DVO to each detection on a chip; the number is formulated from the RA and Dec of the detection.
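The exact DVO formula is not documented here; the sketch below only illustrates how an integer ID could be packed from RA and Dec so that it is unique per sky position (milliarcsecond resolution, entirely hypothetical).

```python
# Hypothetical illustration only: pack RA/Dec into one integer ID.
# The real DVO object-ID formula is not reproduced here.

def object_id(ra_deg, dec_deg):
    ra_mas = round(ra_deg * 3600e3)              # RA as integer milliarcsec
    dec_mas = round((dec_deg + 90.0) * 3600e3)   # Dec shifted non-negative
    return ra_mas * 10**10 + dec_mas             # concatenate into one integer
```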

Notes about the different batch types

Detections

The input for the detection batch is one IPP camera-stage smf file for a given exposure, plus an associated DVO database from which to retrieve object and other IDs. One FITS file is generated for each exposure. The extensions are:

1 Primary extension
1 FrameMeta extension
1 ImageMeta extension per chip
1 Detection extension per chip
1 SkinnyObject extension per chip
1 ObjectCalColor extension per chip

So, for the 60-chip GPC1 camera, that is 242 extensions in all (1 primary + 1 FrameMeta + 4 × 60 per-chip extensions), including the obligatory primary header. The object ID features in the last three per-chip tables and must remain unique across the exposure (it is generated within DVO). In the merged PSPS database, the primary key on the detections table is the combination of object ID and detection ID, so the same object can appear in multiple, overlapping exposures, as those detections will have different detection IDs.

Diffs

The input for difference batches is a set of cmf files, one for each skycell covered by a particular exposure. A FITS output file is generated with the following extensions:

Unresolved fields

Below are tables detailing which fields in the PSPS FITS files are still not populated by ippToPsps.

Unresolved fields for camera stage detections

PSPS field | PSPS type | PSPS description | Comments

FrameMeta
frameName        | STRING | frame name provided by camera software              |
cameraID         | SHORT  | camera identifier                                   | 1?
cameraConfigID   | SHORT  | camera configuration identifier                     |
analysisVer      | STRING | IPP software analysis release                       | need added to smf?
p1Recip          | STRING | IPP phase 1 MD5 checksum                            | need added to smf?
p2Recip          | STRING | IPP phase 2 MD5 checksum                            | need added to smf?
p3Recip          | STRING | IPP phase 3 MD5 checksum                            | need added to smf?
numPhotoRef      | LONG   | number of photometric reference sources             |
calibModNum      | SHORT  | calibration modification number                     | for future
dataRelease      | BYTE   | data release                                        | for future

ImageMeta
photoCalID       | LONG   | photometry reduction code identifier                | will use IPP dvo.photcodes in PSPS init batch
bias             | FLOAT  | detector bias level (unit = ADU)                    | need added to smf?
biasScat         | FLOAT  | scatter in bias level (unit = ADU)                  | need added to smf?
numPhotoRef      | LONG   | number of photometric reference sources             |
psfModelID       | LONG   | PSF model identifier                                | need from smf?
momentTheta      | FLOAT  | model PSF parameters at chip center (unit = deg)    | have major/minor, but angle?
detectorID       | SHORT  | identifier for actual CCD chip                      |
qaFlags          | LONG   | Q/A flags for this OTA                              | need from DVO?
calibModNum      | SHORT  | calibration modification number                     | for future
dataRelease      | BYTE   | data release                                        | for future

Detection
psfLikelihood    | FLOAT  | PSF likelihood                                      | need in smf
momentWidMajor   | FLOAT  | PSF width in major axis from moments (unit = arcsec) | only have MOMENT_XX/XY/YY in psf table
momentWidMinor   | FLOAT  | PSF width in minor axis from moments (unit = arcsec) | only have MOMENT_XX/XY/YY in psf table
momentTheta      | FLOAT  | PSF orientation angle from moments (unit = deg)     | same as 'ANGLE' used for psf?
crLikelihood     | FLOAT  | likelihood the source is a cosmic ray               | need added to smf?
infoFlag         | LONG   | flag indicating provenance information              |
historyModeNum   | SHORT  | modification number in the O-D association history  | for future
dataRelease      | BYTE   | data release when this detection was originally taken; recalibrations do not affect this value | for future

SkinnyObject
projectionCellID | LONG   | projection cell identifier at discovery time        | ???

ObjectCalColor
calColor         | FLOAT  | color adopted for magnitude calculation (unit = mag) | for future
calColorErr      | FLOAT  | error in calibrating color (unit = mag)             | for future

Recovery system design

Currently, the IPP to PSPS interface is a 'one-way' system. Batches are created by ippToPsps and posted on an IPP instance of the datastore, and collected from there by the DXLayer on the PSPS side. As the basis for a future recovery system, the IPP urgently requires feedback from PSPS telling it which batches have succeeded, which have failed, and why. With this information, data can be either deleted or regenerated accordingly. This matters because, at these data volumes, we cannot afford the high level of redundancy currently in place. At present, for a given batch, the following copies exist within the pipeline:

  • a copy exists on the IPP cluster after generation by ippToPsps program
  • a copy exists on the IPP datastore after publication by ippToPsps
  • the DXLayer retains a copy after it has sent the csv version to the ODM
  • the DXLayer also keeps a copy of these (larger) csv files

We therefore need to implement the basic framework of a feedback loop as soon as possible, so that the IPP can quickly learn whether a given batch has been successfully merged into the PSPS database. It can then safely delete the data files and remove the copy from the datastore.

Previous design

Previously, Conrad and I had discussed a design whereby a second datastore instance would be utilized, this time on the PSPS cluster. The DXLayer would act as the middle-man, polling the ODM for updates on loading progress, then posting the results on the PSPS datastore for the IPP. By polling this, ippToPsps could acquire a list of batches that are safe to discard. Simultaneously, the DXLayer could delete its copies of the same redundant data.

The update placed on the PSPS datastore could take the form of an XML file. At first this would simply detail the files that are safe to delete, but it could evolve into a more complex recovery report, i.e. which batches failed, and what the IPP is required to do in response.

New design

Instead of creating a new datastore instance within PSPS and using the DXLayer as the communication layer between the ODM and the IPP, we propose that the DXLayer form no part of the feedback system. It should be simplified so that it only facilitates loading, i.e. polling the IPP datastore for new data, converting it to csv files and sending these on to the ODM. To complete the circle, the ippToPsps code will instead poll the ODM directly, bypassing the DXLayer altogether. The IPP then knows which batches have merged successfully and can delete them accordingly. This also forms the basis of a full recovery system: at a later date, ippToPsps can be coded to respond intelligently to the myriad errors that may occur within the ODM. The DXLayer need know nothing of how or why a certain batch is being submitted by the IPP; it should just grab it, convert it and pass it along to the ODM.
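The proposed feedback loop amounts to: ask the ODM which batches merged cleanly, then delete those copies from the IPP datastore. The ODM interface sketched below is hypothetical; no such polling API exists yet.

```python
# Sketch of the proposed direct ippToPsps -> ODM feedback loop.
# The ODM status interface here is a hypothetical stand-in.

def reconcile(odm_status, datastore):
    """Delete from the datastore every batch the ODM reports as merged.

    odm_status: batch ID -> 'merged' or 'failed' (hypothetical report format).
    Returns the list of batch IDs that were removed.
    """
    deletable = [b for b, state in odm_status.items() if state == "merged"]
    for batch_id in deletable:
        datastore.pop(batch_id, None)
    return deletable
```

Failed batches stay in the datastore, which is what a later, fuller recovery system would act on.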

Since ippToPsps will (soon) keep a record of all the jobs and corresponding exposure IDs in the IPP database, it is unnecessary for the DXLayer to duplicate this information in its own local database, as it currently does.

Rather than waste the code already written for the DXLayer, it can be reused within ippToPsps; the ODM polling scripts, for example.

The question remains: what should be done with the copies of the data currently retained by the DXLayer? The options are to delete them automatically after a defined amount of time, to have the IPP send a list of batches that are safe to delete through the datastore, or to have the DXLayer retain no files at all. Since it can quickly and easily re-acquire data from the IPP datastore anyway, it is probably unnecessary for it to hold any copies.

Advantages over previous design

  • no need for second datastore (not a big overhead, but additional systems administration in an already complicated system).
  • no need to define new XML standard that incorporates the whole array of recovery options
  • no need for the DXLayer to poll the ODM
  • no need for the DXLayer to have a database to log the batches (already done on the IPP side)
  • no need for the DXLayer to keep data at all?

Links

  • Datastore test area for PSPS on Maui
  • Datastore test area for PSPS at JHU
