
Changeset 6040


Timestamp:
Jan 18, 2006, 2:04:39 PM
Author:
eugene
Message:

lots of writing done

File:
1 edited

  • trunk/doc/design/ippSSDD.tex

    r6034 r6040  
    1 %%% $Id: ippSSDD.tex,v 1.1 2006-01-18 13:34:20 eugene Exp $
     1%%% $Id: ippSSDD.tex,v 1.2 2006-01-19 00:04:39 eugene Exp $
    22\documentclass[panstarrs]{panstarrs}
    33
     
    307307\begin{itemize}
    308308
    309 \item {\bf Image Server:} This component is a large data store for all
    310   images used by the IPP, including the raw images from the telescope,
    311   the master calibration images, the reference static-sky images, and
    312   any temporary image data products produced by the IPP.  The Image
    313   Server accepts the incoming data and stores it until it is no longer
    314   needed by other portions of the IPP.  The Image Server is not
    315   restricted to imaging data: it is capable of storing any large data
    316   files which are not well-suited for inclusion in a more structured
    317   relational database, and for which access needs to be widely
    318   available beyond the individual process which created the file.  The
    319   IPP has developed the software system called 'Nebulous' to perform
    320   this function.
     309\item {\bf Image/File Server:} This component is a large data store
      310  for all images and large files used by the IPP, including the raw images
     311  from the telescope, the master calibration images, the reference
     312  static-sky images, and any temporary image data products produced by
     313  the IPP.  The Image/File Server accepts data products and stores
     314  them until they are no longer needed by other portions of the IPP.
     315  It allows other IPP subsystems to refer to the data files with an
     316  abstract identification without needing to worry about the details
     317  of the physical location.  Conversely, it allows other entities to
     318  determine or specify the physical locations if needed.  The
     319  Image/File Server is capable of storing any large data files which
     320  are not well-suited for inclusion in a more structured relational
     321  database, and for which access needs to be widely available beyond
     322  the individual process which created the file.  The IPP has
     323  developed the software system called 'Nebulous' to perform this
     324  function.
    321325
    322326\item {\bf Metadata Database:} This component stores the data which is
     
    415419\label{sec:ArchComponents}
    416420
    417 \subsection{IPP Image Server}
     421\subsection{Nebulous : the IPP Image/File Server}
    418422
    419423\subsubsection{Corresponding Requirements}
    420424
    421 The Image Server must meet the requirements specified in Section 3.4.1
    422 of the Pan-STARRS PS-1 IPP SRS (PSDC-430-005).  The specified design
    423 is chosen to meet requirements 3.4.1.3, and 3.4.1.5.  The other three
    424 requirements (3.4.1.1, 3.4.1.2, and 3.4.1.4) depend on the volume and
    425 capabilities of the hardware, and are addressed in
    426 Section~\ref{sec:Hardware}.
    427 
    428 \subsubsection{Image Server Overview}
    429 
    430 The IPP Image Server is a repository for all images and other large
    431 data files required by the IPP.  Along with the storage hardware, it
    432 provides tools for managing the distribution of these large data files
    433 and for accessing the files.  Data files stored by the IPP Image
    434 Server include the raw images, the calibration images, intermediate
    435 processing stage images as needed, final processed images, difference
    436 images, image subsections, and any large non-imaging data files needed
    437 by the IPP.  The IPP Image Server must retain the files for as long as
    438 they are needed by the IPP.
    439 
    440 The IPP Image Server is a parallel storage system.  It stores data
    441 across a collection of computer nodes, each with their own data
    442 storage resources.  Any single file is stored on only a single
    443 computer and storage device.  In order to achieve the data throughput
    444 requirements, the IPP Image Server may distribute the images across
    445 the processor nodes in an organized fashion, i.e., associating
    446 specific machines with specific detectors.  It is not the
    447 responsibility of the IPP Image Server to determine which computer
    448 should be associated with a specific data concept (Chip / region of
    449 sky), but it must enable the association of a particular file with a
    450 particular machine.
    451 
    452 There are three data concepts relevant to the IPP Image Server:
     425The IPP Image/File Server must meet the requirements specified in
     426Section 3.4.1 of the Pan-STARRS PS-1 IPP SRS (PSDC-430-005).  The
     427specified design is chosen to meet requirements 3.4.1.3, and 3.4.1.5.
     428The other three requirements (3.4.1.1, 3.4.1.2, and 3.4.1.4) depend on
     429the volume and capabilities of the hardware, and are addressed in
     430Section~\ref{sec:Hardware}. 
     431
     432\subsubsection{Image/File Server Overview}
     433
     434The IPP Image/File Server is a repository for all images and other
     435large data files required by the IPP.  Along with the storage
     436hardware, it provides tools for managing the distribution of these
     437large data files and for accessing the files.  Data files stored by
     438the IPP Image/File Server include the raw images, the calibration
     439images, intermediate processing stage images as needed, final
     440processed images, difference images, image subsections, and any large
     441non-imaging data files needed by the IPP.  The IPP Image/File Server
     442must retain the files for as long as they are needed by the IPP.
     443
     444The IPP team has developed the software system 'Nebulous' to meet the
      445requirements of the Image/File Server.  Nebulous uses a MySQL database
     446engine to track the identities and locations of the files under its
     447management.  Nebulous currently requires that all files it manages be
     448available from a locally mounted file system; however, it does not
     449place specific requirements upon the choice of this file system.
      450Currently, the IPP is being implemented using NFS as the distributed
     451file system, though other systems such as GFS or GSFS could be used in
     452an equivalent fashion. 
     453
     454The decision to make use of a locally mounted file system is driven by
     455two IPP-wide design decisions.  First, as a practical and technical
     456consideration, in several locations throughout the IPP analysis, the
     457efficiency can be increased significantly if processes have the
     458ability to seek randomly in a file.  For example, when combining
      459multiple partially-overlapping images, rather than being forced to read
      460through an entire image file, the process can seek through a
     461multi-component file to the pixels of interest.  The other driving
     462motivation is more philosophical in nature: the IPP is designed to use
     463components which operate as simple user commands in the UNIX
     464environment.  By requiring Nebulous to work with files in the locally
     465mounted file system, all UNIX programs will be able to interact with
     466those files if needed.  This allows the IPP to include essentially any
     467normal analysis program as part of the IPP system without difficulty.
     468This philosophical choice is also present in the design of the IPP
      469Scheduler / Controller system and in the design and implementation of the
     470data analysis programs.
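The random-seek efficiency argument can be illustrated with a small sketch (everything here is invented for illustration: the block size, layout, and paths are not the IPP's actual multi-component file format):

```python
import os
import tempfile

BLOCK = 1024  # hypothetical fixed size of one image component, in bytes

# Build a small multi-component file: four equal-sized "extensions".
path = os.path.join(tempfile.mkdtemp(), "multi.dat")
with open(path, "wb") as f:
    for i in range(4):
        f.write(bytes([i]) * BLOCK)

# Random access: jump straight to component 2 instead of streaming
# through components 0 and 1.  A locally mounted file system permits
# this seek; a stream-only interface would not.
with open(path, "rb") as f:
    f.seek(2 * BLOCK)
    component = f.read(BLOCK)

assert component == bytes([2]) * BLOCK
```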
     471
     472Nebulous is a parallel storage system.  It stores data across a
     473collection of computer nodes, each with their own data storage
     474resources.  Any single file is stored on only a single computer and
     475storage device.  In order to achieve the data throughput requirements,
     476Nebulous will be used to distribute the images across the processor
     477nodes in an organized fashion, i.e., associating specific machines
     478with specific detectors.  It is not the responsibility of Nebulous to
     479determine which computer should be associated with a specific data
     480concept (Chip / region of sky), but it instead provides the hooks to
     481enable the association of a particular file with a particular machine.
     482
     483There are three data concepts relevant to Nebulous:
    453484\begin{itemize}
    454485\item {\bf Storage object:} This represents a single, unique data
    455   entity in the Image Server.
    456 
    457 \item {\bf Instance:} A single copy of the storage object in the Image
    458   Server.  In general, a given storage object may have several instances
    459   in the Image Server, normally on different computer nodes.
    460 
    461 \item {\bf File ID:} This is the identifier of a particular storage
    462   object in the Image Server.  The file ID is simply a unique string,
    463   equivalent to the filename in a UNIX file system.
     486  entity managed by Nebulous.
     487
     488\item {\bf Storage ID:} This is the identifier of a particular storage
     489  object in Nebulous.  The Storage ID is the key used by any Nebulous
      490  users to retrieve a specific file of interest.  The Storage ID is
     491  simply a unique string, equivalent to the filename in a UNIX file
     492  system.
     493
     494\item {\bf Instance (or Storage Object Instance):} A single copy of
     495  the storage object in Nebulous.  In general, a given storage object
     496  may have several instances in Nebulous, normally on different
     497  computer nodes.
     498
    464499\end{itemize}
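The relationship between these three concepts can be sketched as a toy data model (the class and field names here are illustrative, not the actual Nebulous database schema):

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    # One physical copy of a storage object on a specific node.
    node: str
    path: str

@dataclass
class StorageObject:
    # Identified by its Storage ID: a unique string, analogous to a
    # UNIX filename.  May have several instances on different nodes.
    storage_id: str
    instances: list = field(default_factory=list)

obj = StorageObject("chip07/exp0042.fits")  # hypothetical Storage ID
obj.instances.append(Instance("node03", "/data/node03/exp0042.fits"))
obj.instances.append(Instance("node07", "/data/node07/exp0042.fits"))
assert len(obj.instances) == 2
```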
    465 
    466 The Image Server provides file pointers (in C), handles (in Perl or
    467 Python), or file names corresponding to the instances of the storage
    468 objects.  The Image Server provides the data organization but does not
    469 define a file system; it assumes the existence of an appropriate file
    470 system which makes the files visible as local files.  This
    471 may be done over many machines with a network file system such as NFS
    472 or GFS.
    473 
    474 The IPP Image Server provides the storage and access mechanisms, but
    475 it does not include any logic or information about the data.  The
    476 Image Server does not, e.g., monitor the age of images and delete them
    477 on some schedule.
    478 
    479 As shown in Figure~\ref{fig:ImageServer}, the IPP Image Server
    480 consists of the following components:
    481 
     500Upon request of a specific Storage ID, Nebulous provides file pointers
     501(in C), handles (in Perl), or file names corresponding to the
     502instances of the storage objects. 
     503
     504Nebulous provides the storage and access mechanisms, but it does not
     505include any logic or information about the data.  It does not, e.g.,
     506monitor the age of images and delete them on some schedule.  This
     507functionality currently resides in the IPP Scheduler
     508(Section~\ref{sec:scheduler}).
     509
     510As shown in Figure~\ref{fig:Nebulous}, Nebulous consists of the
     511following principal components:
    482512\begin{itemize}
    483 \item Image Server storage hardware
    484 \item Image Server database
    485 \item Image Server daemon
    486 \item Image Server client APIs
    487 \item Image Server maintenance tools (not shown)
     513\item Nebulous client(s)
     514\item Nebulous server
     515\item Nebulous database
     516\item Storage hardware
    488517\end{itemize}
    489518
     
    496525\end{figure}
    497526
    498 \subsubsection{IPP Image Server Client APIs}
     527\subsubsection{Nebulous Client APIs}
    499528
    500529Clients interact with the IPP Image Server via a small number of C
    501 APIs.  Bindings are also provided for Perl and Python and UNIX shell
    502 commands in some cases.  The client commands are:
      530APIs.  Bindings are also provided for Perl \tbd{and Python}, and as UNIX
      531shell commands in some cases.  This document gives only an overview of
     532the commands; for details on usage, please see the Nebulous user's
     533guide.  The client commands are:
    503534
    504535\begin{itemize}
    505 \item {\tt new object}: create a new storage object in the Image
    506   Server.  This function takes as input the file ID and returns a
    507   C-style file pointer or a Perl file handle to the instance of the
    508   storage object.  The arguments to the function include an optional
    509   node name on which the new storage object must be located.  If this
    510   target is not given, the Image Server places the new storage object
    511   on an appropriate machine from the pool, though the details need to
    512   be specified.
    513 
    514 \item {\tt open object}: open an instance of an existing storage
    515   object, as identified by the file ID.  This function may also
    516   specify the node on which the object should be opened (if an
    517   instance of the object is not stored on that node, the function
    518   returns an error).  On success, the function returns a file pointer.
    519 
    520 \item {\tt find object}: return a list of filenames in the UNIX name
    521   space associated with the storage object identified by the given
    522   file ID.  Since there are in general multiple instances for a given
    523   storage object, this function returns the collection of all
    524   available instances.  These may be freely opened by the client
    525   server using the standard \code{fopen} functions.
    526 
    527 \item {\tt stat object}: returns status information about the
    528   specified storage object, including the number of instances of the object.
    529 
    530 \item {\tt replicate object}:a new instance of the given storage
    531   object.  The target node may be optionally specified, otherwise an
      536\item {\tt create} : create a new storage object in Nebulous.
     537  This function takes as input the requested Storage ID and returns a
     538  C-style file pointer, Perl file handle, or file name corresponding
     539  to the new instance of the storage object.  The arguments to the
     540  function include an optional node name on which the new storage
      541  object must be located.  If this target is not given, Nebulous
      542  places the new storage object on an appropriate machine from
     543  the pool.
     544
     545\item {\tt replicate} : a new instance of the given storage object is
     546  created.  The target node may be optionally specified, otherwise an
    532547  appropriate node is selected.
    533548
    534 \item {\tt cull object}: removes one of the instances of the storage
    535   object.  The input parameters may optionally specify the target
    536   machine to delete.
    537 
    538 \item {\tt delete object}: deletes all instances of the storage object
    539   and sets the storage object status to {\tt deleted}. 
     549\item {\tt cull} : removes one of the instances of the storage object.
     550  The input parameters may optionally specify the target machine from
     551  which to delete the object.
     552
     553\item {\tt delete} : deletes all instances of the storage object and
     554  sets the storage object status to {\tt deleted}.
     555
     556\item {\tt open} : open an instance of an existing storage object, as
     557  identified by the storage ID.  This function may also specify the
     558  node on which the object should be opened (if an instance of the
     559  object is not stored on that node, the function returns an error).
     560  On success, the function returns a file pointer.  If the object is
     561  opened for 'write' access, all but one instance is deleted to ensure
     562  consistency of the data. 
     563
     564\item {\tt find} : return a list of filenames in the UNIX name space
      565  associated with the storage object identified by the given storage ID.
     566  Since there are in general multiple instances for a given storage
     567  object, this function returns the collection of all available
      568  instances.  These may be freely opened by the client using
     569  the standard \code{fopen} functions.
     570
     571\item {\tt lock} : attempt to acquire a Nebulous lock on the storage object.
     572
     573\item {\tt unlock} : release a Nebulous lock from the storage object.
     574
     575\item {\tt stat} : returns status information about the specified
     576  storage object, including the number of instances of the object.
     577
     578\item {\tt copy} : create a new storage object with one instance of
     579  the corresponding object.
     580
      581\item {\tt move} : rename a storage object.
     582
     583\item {\tt import} : copy an existing file object into Nebulous. 
     584
    540585\end{itemize}
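As a rough picture of how a client might drive these commands, the following in-memory mock (a sketch inferred from the descriptions above, not the real Nebulous API or its signatures) shows create, replicate, stat, and the collapse to a single instance on a 'write' open:

```python
class MockNebulous:
    """In-memory stand-in for a Nebulous-like client API (illustrative only)."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.objects = {}  # storage ID -> list of nodes holding an instance

    def create(self, storage_id, node=None):
        # Place the first instance on the requested node, or pick one.
        self.objects[storage_id] = [node or self.nodes[0]]

    def replicate(self, storage_id, node=None):
        used = self.objects[storage_id]
        target = node or next(n for n in self.nodes if n not in used)
        used.append(target)

    def open(self, storage_id, mode="read", node=None):
        instances = self.objects[storage_id]
        if node is not None and node not in instances:
            raise IOError("no instance on %s" % node)
        if mode == "write":
            # Writing collapses the object to one instance so the
            # copies cannot diverge.
            del instances[1:]
        return instances[0]

    def stat(self, storage_id):
        return {"instances": len(self.objects[storage_id])}

neb = MockNebulous(["node01", "node02", "node03"])
neb.create("chip07/exp0042.fits")
neb.replicate("chip07/exp0042.fits")
assert neb.stat("chip07/exp0042.fits")["instances"] == 2
neb.open("chip07/exp0042.fits", mode="write")
assert neb.stat("chip07/exp0042.fits")["instances"] == 1
```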
    541586
    542 \subsubsection{IPP Image Server Daemon}
    543 
    544 The Image Server client requests are mediated via the Image Server
    545 daemon.  Communication between the clients and the server is via SOAP
     587\subsubsection{Nebulous Server}
     588
     589The Nebulous client requests are mediated via the Nebulous server.
     590Communication between the clients and the server is via SOAP
    546591implementing the commands above.  The identity of the machine on which
    547 Image Server daemon runs is part of the Image Server configuration
     592the Nebulous server runs is part of the Nebulous configuration
    548593information.
    549594
    550 \subsubsection{IPP Image Server Database}
    551 
    552 The IPP Image Server daemon uses a database to store the information
    553 about the data storage objects, their instances, and the available
    554 hardware resources.  A {\tt mysql} database engine is used to manage
    555 the database table.  The database tables defined for the Image Server
    556 are listed in Table~\ref{tab:ImageServerTables}, and their contents are
      595The server is responsible for keeping track of storage objects and all
      596instances of each object, and for enforcing locking semantics.  Extensive
      597logging and tracing support is provided for debugging and to allow for
      598statistics generation and possible {\em hotspot} optimization.
     599
      600Nebulous uses a centralized server model.  This model was chosen
     601because it allows efficient {\em pattern matching} of storage object
     602names.  The current 'best' technique for a distributed metadata store
     603is with distributed hash tables.  Unfortunately, no widely available
     604DHT implementation allows efficient {\em pattern matching} of key
     605names.
     606
      607\subsubsubsection{Housekeeping}
     608
      609\paragraph{Lock sweeping} In the event that a Storage Object operation fails to complete successfully,
      610stale locks will have to be identified and removed from the Nebulous
      611database.  This should be done periodically by comparing
      612the entries in the Lock table to the list of active nodes maintained
      613by the IPP Controller.  It should also happen as soon as possible
      614after a node goes offline (triggered by the IPP Controller marking a
      615node as offline?).  A sweep must be {\em completed} before an offline node
      616can be marked on-line.
     617
      618Once a node is determined to be offline, all entries in the Lock table
      619set by that node should be identified.  The locks on the Storage
      620Object Instances pointed to by those entries should then be rolled
      621back, and the Lock Record entries themselves must be removed from the
      622lock table.
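The lock-sweep procedure above can be sketched in a few lines (a pure illustration: the table shapes, field names, and `sweep_locks` function are hypothetical, not the real MySQL schema):

```python
def sweep_locks(lock_table, active_nodes):
    """Remove lock entries held by nodes no longer listed as active.

    lock_table   : dict mapping storage ID -> node holding the lock
    active_nodes : set of node names the IPP Controller reports as up
    (names and shapes are illustrative, not the actual Nebulous schema)
    """
    stale = [sid for sid, node in lock_table.items()
             if node not in active_nodes]
    for sid in stale:
        del lock_table[sid]  # roll back the stale lock record
    return stale

locks = {"objA": "node01", "objB": "node02", "objC": "node01"}
removed = sweep_locks(locks, active_nodes={"node02", "node03"})
assert sorted(removed) == ["objA", "objC"]
assert locks == {"objB": "node02"}
```

A consistency sweep would wrap this same pass together with checks on the Storage Object Instances themselves.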
     623
      624\paragraph{Consistency sweeping} Periodically, the Nebulous metadata and Storage Objects will need
      625to be checked for sanity.  This would be similar to running fsck on a
      626modern filesystem.  Consistency sweeping should include Lock sweeping
      627and should be considered a superset of it.
     628
     629\subsubsection{Nebulous Database}
     630
     631The Nebulous Server uses a database to store the information about the
     632data storage objects, their instances, and the available hardware
      633resources.  A {\tt mysql} database engine is used to manage the
      634database tables.  The database tables defined for Nebulous are
     635listed in Table~\ref{tab:ImageServerTables}, and their contents are
    557636listed in Appendix~\ref{sec:ImageServerTableContents}.  This database
    558 engine need not be the same one used for other IPP subsystems.
     637engine is not in general the same one used for other IPP subsystems;
     638the full IPP hardware configuration will include independent machines
     639for each of the major databasing systems (Nebulous, Metadata, DVO).
      640In earlier incarnations, the same hardware and database engine may
     641be used.
    559642%
    560643\begin{table}[ht]
    561644\begin{center}
    562 \caption{Image Server Database Tables\label{tab:ImageServerTables}}
      645\caption{Nebulous Database Tables\label{tab:ImageServerTables}}
    563646\begin{tabular}{ll}
    564647\hline
     
    574657\end{table}
    575658
    576 \subsubsection{IPP Image Server Storage Hardware}
    577 
    578 The IPP Image Server manages data across a collection of computers and
    579 possibly on multiple storage devices on those computer nodes.  The
    580 Image Server maintains a table of the available data volumes.  The
    581 Image Server tracks information about each volume such as the total
    582 capacity, the current capacity, the association between computer and
    583 data volume.
    584 
    585 \subsubsection{IPP Image Server Maintenance Tools}
    586 
    587 The IPP Image Server provides a collection of administration tools
    588 which allow for maintenance.  These are operations which may be
    589 automatically scheduled by the IPP or which may be initiated by a
    590 human via a command-shell interface.  The maintenance functions
    591 include migrating data between nodes to re-balance the available space
    592 (this would only occur for instances which have not been placed on a
    593 specific node by the client API).  Other functions include checking
    594 for file corruption, which involves sweeping all files on a data
    595 volume and comparing the calculated file checksum to the currently
    596 recorded value.
     659\subsubsection{Nebulous Storage Hardware}
     660
     661Nebulous manages data across a collection of computers and as needed
     662on multiple storage devices on those computer nodes.  Nebulous
     663maintains a table of the available data volumes.  It tracks
      664information about each volume, such as the total capacity, the current
      665capacity, and the association between computer and data volume.  \tbd{Is
      666Nebulous responsible for detecting unavailable hardware?  Is it
      667responsible for changing allocations?  Or is this a pantasks
     668responsibility?}
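One plausible allocation policy over such a volume table might look like the following sketch (the policy and field names are assumptions for illustration; the document does not specify Nebulous's actual allocation rule):

```python
def pick_volume(volumes):
    """Choose a target volume for a new instance: here, simply the one
    with the most free space.  This is an illustrative policy only; the
    real rule is not specified in this document."""
    return max(volumes, key=lambda v: v["capacity"] - v["used"])

# Hypothetical volume table entries, mirroring the tracked quantities
# (total capacity, current usage, computer/volume association).
volumes = [
    {"node": "node01", "volume": "/data/disk0", "capacity": 500, "used": 480},
    {"node": "node02", "volume": "/data/disk0", "capacity": 500, "used": 120},
]
assert pick_volume(volumes)["node"] == "node02"
```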
     669
     670\subsubsection{Requirements Demonstrations}
     671
     672\tbd{summary of throughput tests : create / copy / delete objects per
     673  second, etc}
    597674
    598675%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     
    610687\subsubsection{Overview}
    611688
    612 \tbd{include a more complete discussion of glueforge}
    613 
    614 
    615 The IPP Metadata Database acts as a repository for non-pixel data
     689The IPP Metadata Database acts as a repository for most non-pixel data
    616690needed by the IPP subsystems.  This includes the image metadata, the
    617691environmental data, system configuration data and system reference
    618692data.  The Metadata Database is required to save the non-ephemeral
    619693data for the lifetime of the project for future reference and
    620 additional analysis.  The Metadata Database may be used in close
    621 coupling with the analysis pipelines to store temporary data either
    622 within or between stages of the analysis.  In this scenario, the
    623 analysis pipeline will interact directly with the database.  However,
    624 database latency may make this scenario impractical, in which case the
    625 database may be used for long-term storage only.  In this scenario,
    626 the data produced by analysis pipelines which is destined for the
    627 Metadata Database may be collected and inserted by a separate,
    628 dedicated process.  Metadata which is large in volume or poorly
    629 structured may also be stored in an appropriate container file (FITS
    630 Table, FITS Header, XML File) in the Image Server with the Metadata DB
    631 providing pointers to these files.
     694additional analysis.  The Metadata Database is also used in close
     695coupling with the analysis pipelines to track the state of elements as
     696they move through the processing system.  Metadata which is large in
     697volume or poorly structured is stored in an appropriate container file
     698(FITS Table, FITS Header, XML File) in Nebulous, with the Metadata DB
     699maintaining the corresponding Nebulous storage IDs of these files.
    632700
    633701The IPP Metadata Database is a simple database system, consisting of a
    634 number of simple tables without extensive inter-table links.  The
    635 \code{mysql} database engine will be used to drive the database.
     702number of simple tables without extensive inter-table links.  The IPP
      703uses the MySQL database engine for the database.  To simplify the
     704coding and management of the database, the IPP uses autocoded APIs
     705constructed with the system called 'glueforge' to define and
     706manipulate the Metadata Database tables.
    636707
    637708\begin{table}[hb]
     
    675746is identified.
    676747
    677 \subsubsection{Metadata Queries}
    678 
    679 The IPP provides simple queries to the Metadata Database tables using
    680 auto-coded APIs.  These queries return a single row or a collection of
    681 rows based on the primary key.  The format of the API is identical for
    682 all Metadata tables.  New tables and APIs can be added to the IPP
    683 system by adding to the auto-code table description files.  The
    684 auto-code API includes read and write access permissions to be set
    685 for each table independently. See Appendix~\ref{sec:AutocodeIO} for
    686 further information.
     748\subsubsection{Autocoded Metadata Queries}
     749
     750The IPP provides standardized interfaces to the Metadata tables using
     751the 'glueforge' system.  Glueforge uses a standard table description
     752file to construct a collection of standard interface functions with
     753easily predicted names.  Given the description of a table (say Foo),
      754Glueforge provides a C-struct which represents the elements of a single
      755row of the table (\code{FooRow}).  It also provides APIs to create a
      756new Foo table (\code{FooCreateTable()}), and to insert a row either
      757by supplying the elements of Foo (to \code{FooInsert()}) or by supplying
      758a pointer to data of type Foo (to \code{FooInsertObject()}).  Simple
     759queries may be constructed to select rows from the table
     760(\code{FooSelectRow()}).  The same mechanism generates data I/O
     761functions for writing FITS tables from a collection of the data
     762elements. 
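The "easily predicted names" scheme can be modeled with a tiny sketch (glueforge itself autocodes C; this Python fragment only illustrates the naming convention, using the five interfaces named above):

```python
def glueforge_names(table):
    # Given one table description name, the autocoder produces interface
    # names by simple suffixing; these five are the ones named in the text.
    return {
        "row struct":    table + "Row",
        "create table":  table + "CreateTable",
        "insert values": table + "Insert",
        "insert object": table + "InsertObject",
        "select row":    table + "SelectRow",
    }

api = glueforge_names("Foo")
assert api["row struct"] == "FooRow"
assert api["select row"] == "FooSelectRow"
```

Because every table follows the same pattern, client code can be written against the pattern rather than against any particular table.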
     763
     764This autocoding system for interacting with the Metadata database
     765makes the software very flexible for changes to the structures of the
     766Metadata database tables.  The programmer does not need to know
     767anything about the details of a given table to interact with it,
     768except when a specific element is needed.  Thus, new columns can be
     769added to the tables, and only require a re-compilation for most
     770portions of the IPP code.  Even migration to a new table schema for
     771existing data becomes fairly trivial.  Such a migration only would
     772require the definition of a conversion function from the old structure
     773to the new structure.  The more general features of glueforge are
     774discussed in the glueforge manual pages. 
    687775
    688776%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     
    13691457
    13701458\subsection{Scheduler}
     1459\label{sec:scheduler}
    13711460
    13721461\subsubsection{Corresponding Requirements}
     
    15211610machine and monitors the success or failure of the processing stage.
    15221611
    1523 The analysis stages are written as UNIX commands, which may be
    1524 executed by the IPP Controller, or may be executed individually by
    1525 hand.  This option makes testing of the complete analysis system much
    1526 easier because the individual analysis stages may be tested
    1527 independently of each other and the IPP infrastructure.
    1528 
    1529 As part of this design model, the analysis stages have several methods
    1530 for accepting and returning the input and output data and for defining
    1531 optional choices in the analysis.  All of the analysis stages load an
    1532 analysis recipe, which defines the details of that analysis.  The
    1533 recipe includes the location of the data sources (from the metadata,
    1534 from the image headers, from other external files, or supplied
    1535 directly), which steps to employ, and how to assign optional
    1536 parameters.  For example, in the discussion of the Phase 2 analysis
    1537 below, the recipe file may specify {\em if} a bias subtraction should
    1538 be applied, {\em where} to find the overscan region and {\em which}
    1539 bias image, {\em if any}, to apply. 
    1540 
    1541 The recipe is loaded as part of the runtime configuration information
    1542 loaded when the analysis script starts.  Four levels of runtime
    1543 configuration information are defined.  The {\tt site} configuration
    1544 defines values specific to the particular installation of the
    1545 software.  For example, the name of the machine which hosts the
    1546 Metadata Database or a default path for data files could be part of
    1547 the {\tt site} configuration.  Multiple installations or versions of
    1548 the IPP software would need to have separate {\tt site} configuration
    1549 entries.  For example, a version of the IPP installed at the IfA would
    1550 use a different computer for the Image Server from the live IPP
    1551 installation running on the Pan-STARRS cluster.  The {\tt base}
    1552 configuration defines general data sources which may be needed by any
    1553 portion of the IPP.  The list of known telescopes or filters might be
    1554 an example.  The {\tt camera} configuration consists of information
    1555 which defines the parameters relevant to the cameras known by the IPP.
    1556 For example, the default layout of the detectors or the names of
    1557 specific header keyword values would be defined for each camera in a
    1558 camera-specific configuration collection.  Finally, each analysis
    1559 script loads its own recipe.  The location of this configuration
    1560 information may be a collection of configuration files available on
    1561 disk or some subset of the information may be stored in the Metadata
    1562 Database.  The source of these configuration entries can be overridden
    1563 when the script is executed, and individual configuration values may
    1564 also be specified on the command line.  Examples of the recipe and
    1565 other runtime configuration options are given in
    1566 Appendix~\ref{sec:RuntimeConfig}.
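The precedence among these four levels, with command-line values overriding everything else, can be sketched as a layered key-value lookup.  This is a minimal illustration only: the structure, function names, and the example hostname are all assumptions, not the IPP's actual configuration API.

```c
#include <string.h>

/* Illustrative layered configuration: later (more specific) levels
   override earlier ones.  Level names follow the text; CFG_CMDLINE
   represents values given on the command line. */

enum CfgLevel { CFG_SITE, CFG_BASE, CFG_CAMERA, CFG_RECIPE,
                CFG_CMDLINE, CFG_NLEVELS };

#define CFG_MAX 64

typedef struct { const char *key, *value; } CfgEntry;

typedef struct {
    CfgEntry entries[CFG_NLEVELS][CFG_MAX];
    int      count[CFG_NLEVELS];
} CfgStack;

/* record a key/value pair at one configuration level */
static void cfg_set(CfgStack *c, int level, const char *key,
                    const char *value) {
    if (c->count[level] < CFG_MAX) {
        c->entries[level][c->count[level]].key = key;
        c->entries[level][c->count[level]].value = value;
        c->count[level]++;
    }
}

/* search from the most specific level (command line) down to site */
static const char *cfg_lookup(const CfgStack *c, const char *key) {
    for (int level = CFG_NLEVELS - 1; level >= 0; level--)
        for (int i = 0; i < c->count[level]; i++)
            if (strcmp(c->entries[level][i].key, key) == 0)
                return c->entries[level][i].value;
    return NULL;
}
```

For example, a recipe may set a default overscan statistic while a command-line value overrides it for a single run.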
     1612The analysis stages are written using programs executed as UNIX
     1613commands.  These commands may be executed by the IPP Controller, or
     1614may be executed individually by hand.  This option makes testing of
     1615the complete analysis system much easier because the individual
     1616analysis stages may be tested independently of each other and the rest
     1617of the IPP infrastructure.  The analysis stages discussed in this
     1618section use two somewhat different types of programs.  One set of
     1619programs performs the heavy lifting of the data analysis: they examine
     1620the pixels in images and perform some statistical analysis of the
     1621pixel values or manipulate the data products in one way or another.
     1622Another set of programs are used to tie the analysis stages together.
      1623This latter set of programs examines the state of the Metadata
      1624database and selects images for processing.  They are used by PanTasks,
      1625the IPP Scheduler, to make decisions about which of the analysis programs to
     1626run on which data.  In this section, we discuss the details of the
     1627analysis stages in terms of the science analysis.  Below, in
     1628Section~\ref{AnalysisPrograms}, we discuss the major analysis programs
      1629and top-level analysis routines used by those programs.  An important
     1630distinction to be noted here is that the same analysis programs with
     1631somewhat different options and configuration information can be
     1632employed by different analysis stages.  We exclude detailed discussion
     1633of the other connecting tools used to define the analysis stages.
     1634These tools and the detailed construction of the pipelines which make
     1635up the analysis stages are touched upon in Section~\ref{sec:PanTasks},
     1636which discusses the IPP Scheduler program, PanTasks.  They are
     1637discussed in more detail in the document 'ippTools'.
    15671638
    15681639%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     
    23382409
    23392410In order to facilitate testing and development, and to encourage
    2340 flexibility, the IPP will be built in a layered fashion.  The lowest
    2341 level functions will be written in C and collected together into a
     2411flexibility, the IPP is built in a layered fashion.  The lowest level
     2412functions will be written in C and collected together into a
    23422413Pan-STARRS library.  These library functions will be used to write
    23432414more complex modules.  The modules will be written in C but will make
     
    23482419functions in the operational system, the IPP will make use of Perl as
    23492420the scripting language to provide the required flow-control to tie the
    2350 modules together.
     2421modules together. \tbd{note that we use C only, not perl for
     2422scripting}.
    23512423
    23522424This approach satisfies the requirement that complicated low-level
     
    23912463detailed in the IPP PS-1 SRS (PSDC-430-005), Section 3.3.
    23922464
    2393 \subsection{IPP Stages}
     2465\subsection{IPP Analysis Programs}
     2466
     2467\tbd{clean this up}
    23942468
    23952469The major IPP processing tasks are organized into stages, which
     
    24022476images from multiple telescopes and search for transients).
    24032477
    2404 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     2478\section{Top-Level IPP Analysis Routines}
     2479
     2480The IPP uses a handful of high-level analysis routines which perform
     2481the bulk of the actual analysis effort.  These routines are used both
     2482as stand-alone programs and in some cases as library functions called
     2483by other stand-alone programs.  The 6 primary top-level analysis
     2484routines are:
     2485\begin{itemize}
     2486\item {\bf ppImage} : the complete single-image analysis program.
     2487\item {\bf ppMerge} : the basic image combinations program.
     2488\item {\bf psphot} : the photometry analysis routine.
     2489\item {\bf psastro} :  the astrometry analysis routine.
     2490\item {\bf stac} : the science image combination program.
     2491\item {\bf poisub} : the image difference program.
     2492\end{itemize}
     2493
      2494These programs are not mutually exclusive: psphot and psastro are both
      2495used by ppImage, while psastro is also used as a stand-alone tool
      2496within the IPP, and psphot may be used in the same way.
     2497
     2498\subsection{Software Configuration and  Camera Definition Information}
     2499
     2500Every camera which may be analysed by the IPP has differences in how
     2501the data is represented.  The IPP is built with the flexibility to
     2502handle data from many different cameras, not just the Pan-STARRS
     2503Gigapix cameras.  This is partly to allow testing of the analysis
     2504system on data from other telescopes, such as MegaPrime on CFHT and
     2505Suprime on Subaru, but also to allow us to adapt to changes in the
     2506design of the Gigapix cameras themselves.  It also means the IPP
     2507software may be used by astronomers for other analysis projects beyond
     2508the IPP. 
     2509
     2510Most cameras provide extensive descriptive information in the FITS
     2511image headers when the images are read out.  Typically, the location
     2512and orientations of the individual detectors are defined by keywords
     2513such as DATASEC and DETSEC.  Other variations on these words are used
     2514for cameras which place the pixels from multiple amplifiers in the
     2515same FITS data segment.  Other parameters, such as astrometric
     2516information or exposure times, are stored in headers as well.  It is
     2517possible to use these header keywords to guide the analysis software,
     2518but there are two difficulties. 
     2519
     2520First, it is very common for different keywords to be used by
     2521different cameras, sometimes even the same camera may use different
     2522keywords for the same information at different times (major readout
     2523software upgrades, for example, can be accompanied by keyword
     2524revisions).  In addition, within Pan-STARRS and the IPP, it is
     2525necessary to have the capability to refer to the Metadata database as
      2526the authoritative source of some of these entries rather than the
      2527image headers.  Given this circumstance, it is at minimum necessary
      2528to define, for data from a specific camera, the appropriate source
      2529of each such data concept.
     2530
     2531The second problem arises when actually performing an analysis.  In
     2532many circumstances, the software needs to know what data to expect
     2533even when an appropriate camera image is not available.  This is
     2534particularly true for a camera which is composed of multiple chips and
      2535multiple amplifiers.  It is a frequent circumstance that some subset
     2536of the chips or amplifiers will either be unavailable or are invalid
     2537for one reason or another.  It is important for the software to have a
     2538guide for what data should be available from a perfect readout of the
     2539given camera so decisions can be made how to handle data which is not
     2540complete.  This is also important to validate that a particular
     2541dataset, which appears to be from a known camera, actually corresponds
     2542to that camera and has all of the necessary information where
     2543expected.
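A minimal sketch of this kind of completeness check, written here with a hypothetical layout table and function name (the actual IPP draws the expected chip list from the camera configuration information):

```c
#include <string.h>

/* Hypothetical completeness check: compare the chip identifiers actually
   present in a dataset against the full list expected for the camera.
   Returns the number of expected chips that are missing.  The table and
   function name are illustrative, not the IPP's actual API. */
static int count_missing_chips(const char *expected[], int n_expected,
                               const char *present[], int n_present) {
    int missing = 0;
    for (int i = 0; i < n_expected; i++) {
        int found = 0;
        for (int j = 0; j < n_present; j++)
            if (strcmp(expected[i], present[j]) == 0) { found = 1; break; }
        if (!found)
            missing++;   /* decide here: substitute, mask, or abort */
    }
    return missing;
}
```

The return value lets the caller decide how to proceed when a readout is incomplete, rather than failing outright.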
     2544
     2545As part of the flexible design model, all of these analysis programs
     2546use a common set of configuration files which define the details of
     2547how their analysis should be performed.  The configuration files a
     2548further divided into three main sets of configuration information:
     2549\begin{itemize}
      2550\item {\bf site configuration:} This information defines the locations
     2551  of data resources and the other available configuration information.
     2552  For example, the site configuration would include the location of
     2553  Nebulous, the metadata database access information, the list of
     2554  available cameras, and so forth.
      2555\item {\bf camera configuration:} This information describes
     2556  characteristics needed to interpret data from a specific camera.
     2557  This includes such details as where to extract particular metadata,
     2558  such as the filter name.  It also defines the camera layout and the
     2559  expected organization of the data files. 
      2560\item {\bf recipe:} A particular analysis program may use one or
     2561  multiple recipes.  The recipe defines the value of optional
     2562  configuration information for that analysis program.  For example,
     2563  by using different recipes, ppImage can be made to perform a
     2564  complete Phase 2 image analysis, including detailed object detection
     2565  and astrometric calibration, or with a different recipe, ppImage may
     2566  be used to perform only bias subtraction (for example, as part of
     2567  the detrend image analysis).    Note that the details of a recipe in
     2568  general depend on the camera / telescope of interest.  As a result,
      2569  the identities of the recipe files which define different recipes are
     2570  included as part of the camera configuration information.
     2571\end{itemize}
     2572For all of the analysis programs, the source of the configuration
     2573files can be overridden when the program is executed, and individual
     2574configuration values may also be specified on the command line.  The
     2575details of the configuration file formats and the configuration
     2576variable names used by different programs and functions are discussed
     2577in the Modules SDRS document.
     2578
     2579\subsection{ppImage}
     2580
     2581This program is not only one of the work-horse programs of the IPP, it
      2582is also exemplary of the design of the top-level analysis programs.
     2583ppImage is used to perform the complete Phase 2 analysis on a single
     2584image data file.  This includes the complete detrending discussed
     2585above (bias, dark, flat, fringe, etc), as well as the object detection
     2586and classification, the astrometric calibration, and potentially the
     2587photometric calibration as well.  The object analysis and astrometry
     2588are in fact performed by ppImage using two of the other top-level
     2589routines discussed above, psphot and psastro; these components will be
     2590discussed independently and will mostly not be part of the ppImage
     2591discussion.
     2592
     2593The top-level design of ppImage is virtually identical to that of
     2594several other stand-alone analysis programs (eg, psphot, psastro,
     2595ppMerge).  The input to ppImage may be one of several options:
     2596\begin{itemize}
     2597\item a single FITS file by name (eg: /full/path/to/file.fits)
     2598\item a collection of FITS files by glob (eg: /full/path/to/file.*.fits)
     2599\item a single Nebulous storage ID (neb:nebID)
     2600\item a collection of Nebulous storage IDs (neb:nebID.regex)
     2601\end{itemize}
      2602The Nebulous versions can be viewed simply as utility versions to save
      2603the user from asking Nebulous for the corresponding file names.  In
      2604addition to the input file and any optional configuration information,
      2605ppImage also takes as part of the command-line arguments the root of
      2606the output file names; ppImage defines naming conventions for
      2607constructing the complete output file names based on the recipe.
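The Nebulous input forms listed above can be separated from ordinary FITS paths by a simple prefix test.  A sketch, with an illustrative function name (the real ppImage argument parser is not shown in this document):

```c
#include <string.h>

/* Sketch of classifying a ppImage input argument: a Nebulous storage ID
   (the "neb:" prefix) versus an ordinary FITS file name or glob.  On
   return, *id points at the storage ID or the unchanged path. */

enum InputKind { INPUT_FITS_PATH = 0, INPUT_NEBULOUS = 1 };

static int classify_input(const char *arg, const char **id) {
    if (strncmp(arg, "neb:", 4) == 0) {
        *id = arg + 4;            /* Nebulous storage ID (or ID.regex) */
        return INPUT_NEBULOUS;
    }
    *id = arg;                    /* ordinary path or glob */
    return INPUT_FITS_PATH;
}
```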
     2608
     2609After parsing the command-line options, ppImage performs the following
     2610tasks:
     2611\begin{itemize}
     2612\item identify the camera associated with the file or files (all files
     2613  must be of the same camera). This analysis is performed by examining
     2614  the primary headers of the files and applying camera-identification
     2615  rules specified as part of the camera configuration information.  It
     2616  is also possible for the user to assert the camera which supplied
     2617  the image(s).
     2618
     2619\item construct a complete FPA structure representing the camera
     2620  (without pixel data) and associate the input filenames with the
     2621  different components of the FPA structure.  In this step, the camera
     2622  configuration information is used to identify the cell, chip, etc
     2623  which corresponds to each file, and the filenames are associated
     2624  with elements at the appropriate levels.  Since each input file has
     2625  already had its primary header read, this is also attached to the
     2626  appropriate FPA structure level.
     2627
     2628\item the complete FPA is then looped over, with nested loops for each
     2629  of the levels: FPA, Chip, Cell, and Readout.  At the appropriate
     2630  depth (depending on user options and the recipe), the pixels are
     2631  loaded from the corresponding file.  The recipe may, for example,
     2632  require the entire FPA image be read, or it may specify that each
     2633  cell be loaded one at a time. 
     2634
      2635\item the detrending analysis is performed on each readout.  Note that
     2636  the detrend images needed for the analysis are defined by the recipe
     2637  and by querying the metadata database for the detrend image which
     2638  matches the science image in hand.  The details of this match may be
     2639  modified by the recipe as well, allowing for example, one recipe to
     2640  apply only the best available flat-field images, while another for
     2641  comparison only applies an earlier flat.  These types of options are
     2642  needed for testing, and also to perform, e.g., the flat-field
     2643  correction analysis.  Also note that the detrend images may be
     2644  loaded at a different level from the science images; it is likely
     2645  that a detrend image is only defined for each Cell, while the
     2646  science images may be loaded by Readout. 
     2647
     2648\item after each chip is processed, ppImage will optionally
     2649  reconfigure the resulting pixels into a single contiguous array.
     2650  This is the normal data source for the psphot function call used by
      2651  ppImage.  The reconfigured chip image at this stage may also be used
     2652  to generate thumbnail and rebinned sample images on the chip level.
     2653
     2654\item ppImage may also be used to reconfigure the pixels from a
     2655  complete FPA into a single pixel array.  For the GPC, this step
     2656  would not normally be performed on the full data array, but would be
     2657  used to generate rebinned sample images of the full mosaic (in
     2658  either FITS or JPEG formats).  This step is used by Phase 3 to
     2659  construct images for examining the state of the data processing or
     2660  by the detrend creation analysis to test the quality of the
     2661  residuals. 
     2662 
     2663\item the final stage is to output, at an appropriate depth, the
     2664  chosen output data files and to send summary metadata to the
      2665  metadata database.  These output files may include FITS images, FITS
     2666  thumbnails (very small rebinned images), FITS samples, JPEG images,
     2667  object tables, and/or astrometry calibration files.
     2668\end{itemize}
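The nested traversal in the steps above can be sketched schematically.  The function below only counts pixel-load operations at the depth a recipe might select; all names are illustrative, and the real ppImage of course operates on full FPA structures with file-backed pixel data:

```c
/* Schematic of the nested FPA loop: iterate over chips and cells,
   loading pixels at the depth selected by the recipe.  Returns the
   number of pixel-load operations the traversal would perform. */

enum LoadDepth { LOAD_BY_CHIP, LOAD_BY_CELL };

static int traverse_fpa(int n_chips, int n_cells_per_chip, int depth) {
    int loads = 0;
    for (int chip = 0; chip < n_chips; chip++) {
        if (depth == LOAD_BY_CHIP)
            loads++;                  /* one read for the whole chip */
        for (int cell = 0; cell < n_cells_per_chip; cell++) {
            if (depth == LOAD_BY_CELL)
                loads++;              /* one read per cell */
            /* detrending / analysis on the loaded pixels happens here */
        }
    }
    return loads;
}
```

The point of the recipe-selected depth is an I/O trade-off: whole-chip reads minimize file operations, while per-cell reads bound the memory footprint.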
     2669
      2670Except for the details of the analysis (detrending, etc), the details
      2671of the processing steps are identical for psphot, psastro, and
     2672ppMerge, as discussed below.
     2673
     2674\subsection{ppMerge}
     2675
     2676This program performs the job of stacking multiple images in which the
     2677pixel grid is kept the same for all of the input images.  As part of
     2678the combination process, the input images may be scaled and shifted.
     2679This operation is the basis for the creation of all of the primary
     2680master detrend frames: bias, dark, flat, fringe.  For all of these
     2681cases, the output pixel values are determined by applying some global
     2682statistic to the collection of input pixels after the input pixels
     2683have been rescaled.  The combination statistics may be any of several
     2684standard options, including mean, median, sigma-clipping, etc.  The
     2685input files may be supplied to ppMerge using any of a command-line
     2686glob, a text-based list, or an XML list.  The glob method forces the
     2687zero and scaling to have values of 0 and 1, respectively.  The text
     2688file listing allows for zero and scaling values, but only a single
     2689value for each input file.  The XML-format list is needed if
     2690subdivisions within a file (eg, cells) require independent zero and
     2691scaling factors.  It is also necessary to specify more than one image
     2692file if, eg, a full FPA composed of separate Chip files is to be
     2693processed at once.  This analysis is performed using a very similar
     2694structure to ppImage: the input list of images is parsed and
      2695associated with the appropriate elements of the FPA structure, though
     2696in this case, an array of FPA structures is used to carry the
     2697information.  The elements of the FPA are looped over, as with
     2698ppImage, and the analysis is performed on the lowest level, after the
     2699input data pixels are read and re-scaled.  The output is written out
      2700after the images have been looped over.
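The per-pixel combination step described above can be sketched as follows.  The mean statistic is shown; the text lists median and sigma-clipping as further options.  The rescaling convention, (value - zero) * scale, and the function name are assumptions for illustration, not necessarily the convention ppMerge uses:

```c
/* Sketch of one output pixel in a ppMerge-style stack: rescale each
   input pixel with its per-input zero and scale factors, then apply a
   combination statistic (mean here) across the inputs. */
static double combine_mean(const double values[], const double zeros[],
                           const double scales[], int n_inputs) {
    double sum = 0.0;
    for (int i = 0; i < n_inputs; i++)
        sum += (values[i] - zeros[i]) * scales[i];
    return n_inputs > 0 ? sum / n_inputs : 0.0;
}
```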
     2701
     2702\subsection{psphot}
     2703
     2704The details of psphot are discussed in the psphot design document.
     2705Here we will just address some of the basic concepts.  PSPhot may be
     2706used as a stand-alone program, or it may be called as a function
     2707within another program (such as ppImage).  In the li
    24052708
    24062709\section{Interfaces}
     
    34023705%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    34033706
    3404 \section{Software Runtime Configuration Issues}
    3405 \label{sec:RuntimeConfig}
    3406 
    3407 The IPP Software requires extensive runtime configuration information.
    3408 This includes default parameters for analysis to be performed,
    3409 descriptions of how a particular analysis is performed, locations of
    3410 data sources, and so forth.  The IPP may store this information in the
    3411 Metadata Database or in configuration files available to the user.
    3412 Both methods are implemented in the current design.  In either method,
    3413 the necessary parameters are identical.  This section discusses the
    3414 contents of specific portions of the runtime configuration.
    3415 
    3416 \subsection{Camera Definition Information}
    3417 
    3418 Every camera which may be analysed by the IPP has differences in how
    3419 the data is represented.  The IPP is built with the flexibility to
    3420 handle data from many different cameras, not just the Pan-STARRS
    3421 Gigapix cameras.  This is partly to allow testing of the analysis
    3422 system on data from other telescopes, such as MegaPrime on CFHT and
    3423 Suprime on Subaru, but also to allow us to adapt to changes in the
    3424 design of the Gigapix cameras themselves.  It also means the IPP
    3425 software may be used by astronomers for other analysis projects beyond
    3426 the IPP. 
    3427 
    3428 Most cameras provide extensive descriptive information in the FITS
    3429 image headers when the images are read out.  Typically, the location
    3430 and orientations of the individual detectors are defined by keywords
    3431 such as DATASEC and DETSEC.  Other variations on these words are used
    3432 for cameras which place the pixels from multiple amplifiers in the
    3433 same FITS data segment.  Other parameters, such as astrometric
    3434 information or exposure times, are stored in headers as well.  It is
    3435 possible to use these header keywords to guide the analysis software,
    3436 but there are two difficulties. 
    3437 
    3438 First, it is very common for different keywords to be used by
    3439 different cameras, sometimes even the same camera may use different
    3440 keywords for the same information at different times (major readout
    3441 software upgrades, for example, can be accompanied by keyword
    3442 revisions).  In addition, within Pan-STARRS and the IPP, it is
    3443 necessary to have the capability to refer to the Metadata database as
     3444 the authoritative source of some of these entries rather than the
     3445 image headers.  Given this circumstance, it is at minimum necessary
     3446 to define, for data from a specific camera, the appropriate source
     3447 of each such data concept.
    3448 
    3449 The second problem arises when actually performing an analysis.  In
    3450 many circumstances, the software needs to know what data to expect
    3451 even when an appropriate camera image is not available.  This is
    3452 particularly true for a camera which is composed of multiple chips and
     3453 multiple amplifiers.  It is a frequent circumstance that some subset
    3454 of the chips or amplifiers will either be unavailable or are invalid
    3455 for one reason or another.  It is important for the software to have a
    3456 guide for what data should be available from a perfect readout of the
    3457 given camera so decisions can be made how to handle data which is not
    3458 complete.  This is also important to validate that a particular
    3459 dataset, which appears to be from a known camera, actually corresponds
    3460 to that camera and has all of the necessary information where
    3461 expected.
    3462 
    3463 In order to facilitate the operation of the IPP with a variety of
    3464 cameras, and to allow the software the flexibility to change the
     3465 camera definition dynamically, the IPP includes a collection of
    3466 software runtime configuration information which defines a given
    3467 camera.  This information is represented below in the form of the
    3468 PSLib Metadata Config file, but may be stored in the Metadata Database
    3469 or in an alternate format as appropriate.
    3470 
     3471 A single camera is represented as a Focal Plane Array (FPA),
    3472 divided into Chips, divided into Cells.  For a single FPA, all imaging
    3473 data is stored in a FITS file or a collection of FITS files.  Software
    3474 needs to know where in a given file or set of files to find a
    3475 particular Cell, what Cells to expect, what chips to expect, and the
    3476 relationships between those entities, etc.
    3477 
    3478 A single camera configuration file (or dataset) represents the
    3479 description of a complete FPA.  In the configuration file, any
    3480 parameters which are specific to the complete FPA are placed on their
    3481 own lines.  These include the definition of the keywords or database
    3482 locations.  An incomplete example is given below.
    3483 
    3484 \begin{verbatim}
    3485 NCELL       S32    NN
    3486 NCHIP       S32    NN
    3487 EXPTIME-SRC STR    HD:EXPTIME # need to specify PHU vs EXTNAME
    3488 EXPTIME-KEY STR    EXPTIME 
    3489 DATE-KEY    STR    DATE-OBS
    3490 DATE-FMT    STR    YYYY/MM/DD
    3491 
    3492 TYPE        CELL   FILENAME           EXTNAME  CHIP      DATASEC       BIASSEC     
    3493 CELL.nn     CELL   @ROOT@CELL         AMP00    CHIP.00   CF:[0,0:0,0]  HD:BIASSEC
    3494 CELL.01     CELL   @ID/@ID@CELL.fits  AMP01    CHIP.00   DB:???
    3495 \end{verbatim}
    3496 
    3497 \subsection{Analysis Recipe Information}
    3498 
    3499 In order to maintain flexibility in the analysis details, the IPP uses
    3500 recipes to define how a particular analysis is implemented.  Each
    3501 major analysis script (eg, Phase 2) has its own recipe configuration
    3502 information, which may be stored in the Metadata Database or in the
    3503 form of the PSLib Metadata Config file.  This configuration
    3504 information includes all of the user configurable parameters.  Many of
    3505 these may specify a specific value, or they may specify lookup methods
     3506 (database locations, or header locations).  The specifics of each
     3507 depend on the context.  Below is an example recipe file for the bias
    3508 subtraction portion of Phase 2, giving several alternative options for
    3509 certain entries.  Note that, for example, the overscan subtraction may
    3510 be specified as using a particular region given in the recipe file, or
    3511 on the basis of a particular header keyword.
    3512 
    3513 \begin{verbatim}
    3514 # BIAS:
    3515 BIAS.IMAGE                 STR    NONE
    3516 BIAS.IMAGE                 STR    FILE:bias.fits
    3517 BIAS.IMAGE                 STR    DB:BEST
    3518 BIAS.IMAGE                 STR    DB:CLOSE
    3519 
    3520 BIAS.OVERSCAN              STR    HD:BIASSEC
    3521 BIAS.OVERSCAN              STR    CF:[0,16:0,2048]
    3522 BIAS.OVERSCAN              STR    NONE
    3523 
    3524 BIAS.OVERSCAN.STATS        STR    MEDIAN
    3525 BIAS.OVERSCAN.STATS        STR    MEAN
    3526 
    3527 BIAS.OVERSCAN.FIT          STR    SPLINE
    3528 BIAS.OVERSCAN.FIT.NPTS     S32    5
    3529 
    3530 BIAS.OVERSCAN.FIT          STR    POLYNOMIAL
    3531 BIAS.OVERSCAN.FIT.ORDER    S32    3
    3532 BIAS.OVERSCAN.FIT.NBIN     S32    5
    3533 \end{verbatim}
    3534 
    3535 \section{I/O Code Autogeneration}
    3536 \label{sec:AutocodeIO}
    3537 
    3538 The IPP includes a number of data collections which have multiple
    3539 representations.  A software tool will be used to automatically
    3540 generate code to provide I/O APIs to read and write these data and to
    3541 define the data structures used to carry them within a program.
    3542 Within the IPP, examples of these different data entities include
    3543 database tables (ie, in the Metadata Database), FITS Tables (to
    3544 exchange bulk data), and XML (to exchange more complete datasets).
    3545 
    3546 I/O API Autocode template (example.def):
    3547 \begin{verbatim}
    3548 Name    Example
    3549 Table   EXAMPLE
    3550 EXTNAME EXAMPLE
    3551 
    3552 KEY     XVALUE
    3553 
    3554 # name  format   unit      comment
    3555 XVALUE  F32      pixels    "x coordinate"
    3556 BINNING S32      fraction  "binning factor"
    3557 NAME    STR[32]  string    "description of entry"
    3558 \end{verbatim}
    3559 
    3560 Running autocode on such a file would generate an output header and C
    3561 files \code{example.h, example.c} with the following structure and APIs:
    3562 
    3563 \begin{verbatim}
    3564 typedef struct {
    3565   psF32 XVALUE;    // x coordinate
    3566   psS32 BINNING;   // binning factor
    3567   char  NAME[32];  // description of entry
    3568 } Example;
    3569 
    3570 psMetadata *psFITSTableInitExample ();
    3571 psExample *psFITSTableLoadExample (char *filename, int *Nrows);
    3572 bool psFITSTableSaveExample (char *filename);
    3573 
    3574 psMetadata *psDatabaseTableInitExample ();
    3575 psExample *psDatabaseTableLoadExample (char *filename, int *Nrows);
    3576 bool psDatabaseTableSaveExample (char *filename);
    3577 psExample *psDatabaseTableLoadExampleRow (char *filename, psF32 XVALUE);
    3578 \end{verbatim}
    3579 
    3580 %\bibliographystyle{plain}
    3581 %\bibliography{panstarrs}
    3582 
    35833707\input{glossary.tex}
    35843708
    35853709\end{document}
     3710
     3711------
     3712
     3713* top-level routines
     3714* re-org the Phase 4 stuff to discuss Magic
     3715* astrometry calibration data formats
     3716* analysis stages, versions and iterations
     3717* output data products