Changeset 6040
- Timestamp: Jan 18, 2006, 2:04:39 PM
- File: trunk/doc/design/ippSSDD.tex (modified) (14 diffs)
%%% $Id: ippSSDD.tex,v 1.2 2006-01-19 00:04:39 eugene Exp $
\documentclass[panstarrs]{panstarrs}

…

\begin{itemize}

\item {\bf Image/File Server:} This component is a large data store
for all images and large files used by the IPP, including the raw
images from the telescope, the master calibration images, the
reference static-sky images, and any temporary image data products
produced by the IPP. The Image/File Server accepts data products and
stores them until they are no longer needed by other portions of the
IPP. It allows other IPP subsystems to refer to data files by an
abstract identifier, without needing to worry about the details of
their physical location. Conversely, it allows other entities to
determine or specify the physical locations if needed. The
Image/File Server is capable of storing any large data files which
are not well suited for inclusion in a more structured relational
database, and for which access needs to be widely available beyond
the individual process which created the file.
The IPP has developed the software system called 'Nebulous' to
perform this function.

\item {\bf Metadata Database:} This component stores the data which is
…

\label{sec:ArchComponents}

\subsection{Nebulous: the IPP Image/File Server}

\subsubsection{Corresponding Requirements}
The IPP Image/File Server must meet the requirements specified in
Section 3.4.1 of the Pan-STARRS PS-1 IPP SRS (PSDC-430-005). The
specified design is chosen to meet requirements 3.4.1.3 and 3.4.1.5.
The other three requirements (3.4.1.1, 3.4.1.2, and 3.4.1.4) depend
on the volume and capabilities of the hardware, and are addressed in
Section~\ref{sec:Hardware}.

\subsubsection{Image/File Server Overview}

The IPP Image/File Server is a repository for all images and other
large data files required by the IPP. Along with the storage
hardware, it provides tools for managing the distribution of these
large data files and for accessing them. Data files stored by the
IPP Image/File Server include the raw images, the calibration
images, intermediate processing-stage images as needed, final
processed images, difference images, image subsections, and any
large non-imaging data files needed by the IPP. The IPP Image/File
Server must retain the files for as long as they are needed by the
IPP.

The IPP team has developed the software system 'Nebulous' to meet
the requirements of the Image/File Server. Nebulous uses a MySQL
database engine to track the identities and locations of the files
under its management. Nebulous currently requires that all files it
manages be available from a locally mounted file system; however, it
does not place specific requirements upon the choice of this file
system.
Currently, the IPP is being implemented using NFS as the distributed
file system, though other systems such as GFS or GSFS could be used
in an equivalent fashion.

The decision to make use of a locally mounted file system is driven
by two IPP-wide design decisions. First, as a practical and
technical consideration, in several places in the IPP analysis the
efficiency can be increased significantly if processes have the
ability to seek randomly within a file. For example, when combining
multiple partially-overlapping images, rather than being forced to
read through an entire image, the process can seek through a
multi-component file to the pixels of interest. The other driving
motivation is more philosophical in nature: the IPP is designed to
use components which operate as simple user commands in the UNIX
environment. By requiring Nebulous to work with files in the
locally mounted file system, all UNIX programs are able to interact
with those files if needed. This allows the IPP to include
essentially any normal analysis program as part of the IPP system
without difficulty. This philosophical choice is also present in
the design of the IPP Scheduler / Controller system and in the
design and implementation of the data analysis programs.

Nebulous is a parallel storage system. It stores data across a
collection of computer nodes, each with its own data storage
resources. Any single file is stored on only a single computer and
storage device. In order to achieve the data throughput
requirements, Nebulous will be used to distribute the images across
the processor nodes in an organized fashion, i.e., associating
specific machines with specific detectors.
It is not the responsibility of Nebulous to determine which computer
should be associated with a specific data concept (chip / region of
sky); instead, it provides the hooks to enable the association of a
particular file with a particular machine.

There are three data concepts relevant to Nebulous:
\begin{itemize}
\item {\bf Storage object:} This represents a single, unique data
entity managed by Nebulous.

\item {\bf Storage ID:} This is the identifier of a particular
storage object in Nebulous. The Storage ID is the key used by any
Nebulous user to retrieve a specific file of interest. It is simply
a unique string, equivalent to the filename in a UNIX file system.

\item {\bf Instance (or Storage Object Instance):} A single copy of
the storage object in Nebulous. In general, a given storage object
may have several instances in Nebulous, normally on different
computer nodes.

\end{itemize}
Upon request of a specific Storage ID, Nebulous provides file
pointers (in C), handles (in Perl), or file names corresponding to
the instances of the storage objects.

Nebulous provides the storage and access mechanisms, but it does not
include any logic or information about the data. It does not, e.g.,
monitor the age of images and delete them on some schedule. This
functionality currently resides in the IPP Scheduler
(Section~\ref{sec:scheduler}).

As shown in Figure~\ref{fig:Nebulous}, Nebulous consists of the
following principal components:
\begin{itemize}
\item Nebulous client(s)
\item Nebulous server
\item Nebulous database
\item Storage hardware
\end{itemize}

…
\end{figure}

\subsubsection{Nebulous Client APIs}

Clients interact with Nebulous via a small number of C APIs.
Bindings are also provided for Perl \tbd{and Python}, and in some
cases UNIX shell commands. This document gives only an overview of
the commands; for details on usage, please see the Nebulous user's
guide.
The client commands are:

\begin{itemize}
\item {\tt create} : create a new storage object in Nebulous. This
function takes as input the requested Storage ID and returns a
C-style file pointer, Perl file handle, or file name corresponding
to the new instance of the storage object. The arguments to the
function include an optional node name on which the new storage
object must be located.
If this target is not given, Nebulous places the new storage object
on an appropriate machine from the pool.

\item {\tt replicate} : create a new instance of the given storage
object. The target node may be optionally specified; otherwise an
appropriate node is selected.

\item {\tt cull} : remove one of the instances of the storage
object. The input parameters may optionally specify the target
machine from which to delete the object.

\item {\tt delete} : delete all instances of the storage object and
set the storage object status to {\tt deleted}.

\item {\tt open} : open an instance of an existing storage object,
as identified by the Storage ID. This function may also specify the
node on which the object should be opened (if an instance of the
object is not stored on that node, the function returns an error).
On success, the function returns a file pointer. If the object is
opened for 'write' access, all but one instance are deleted to
ensure consistency of the data.

\item {\tt find} : return a list of filenames in the UNIX name space
associated with the storage object identified by the given Storage
ID. Since there are in general multiple instances for a given
storage object, this function returns the collection of all
available instances. These may be freely opened by the client using
the standard \code{fopen} functions.

\item {\tt lock} : attempt to acquire a Nebulous lock on the storage
object.

\item {\tt unlock} : release a Nebulous lock from the storage object.
\item {\tt stat} : return status information about the specified
storage object, including the number of instances of the object.

\item {\tt copy} : create a new storage object with one instance of
the corresponding object.

\item {\tt move} : rename a storage object.

\item {\tt import} : copy an existing file into Nebulous.

\end{itemize}

\subsubsection{Nebulous Server}

The Nebulous client requests are mediated via the Nebulous server.
Communication between the clients and the server is via SOAP,
implementing the commands above. The identity of the machine on
which the Nebulous server runs is part of the Nebulous configuration
information.

The server is responsible for keeping track of storage objects and
all of their instances, and for enforcing locking semantics.
Extensive logging and tracing support is provided for debugging and
to allow for statistics generation and possible {\em hotspot}
optimization.

Nebulous uses a centralized server model. This model was chosen
because it allows efficient {\em pattern matching} of storage object
names. The current 'best' technique for a distributed metadata
store is distributed hash tables.
Unfortunately, no widely available DHT implementation allows
efficient {\em pattern matching} of key names.

\subsubsubsection{Housekeeping}

\paragraph{Lock sweeping} In the event that a storage object
operation fails to complete successfully, stale locks will have to
be identified and removed from the Nebulous database. This should
be done periodically by comparing the entries in the Lock table to
the list of active nodes maintained by the IPP Controller. It
should also happen as soon as possible after a node goes offline
(triggered by the IPP Controller marking a node as offline?). A
sweep must be {\em completed} before an offline node can be marked
online again.

Once a node is determined to be offline, all entries in the Lock
table set by that node should be identified. The locks on the
storage object instances pointed to by those entries should then be
rolled back, and the Lock Record entries themselves must be removed
from the Lock table.

\paragraph{Consistency sweeping} Periodically, the Nebulous metadata
and storage objects will need to be checked for sanity. This would
be similar to running fsck on a modern filesystem. Consistency
sweeping should include lock sweeping and should be considered a
super-set of it.

\subsubsection{Nebulous Database}

The Nebulous server uses a database to store the information about
the data storage objects, their instances, and the available
hardware resources. A {\tt mysql} database engine is used to manage
the database tables. The database tables defined for Nebulous are
listed in Table~\ref{tab:ImageServerTables}, and their contents are
listed in Appendix~\ref{sec:ImageServerTableContents}.
This database engine is not in general the same one used for other
IPP subsystems; the full IPP hardware configuration will include
independent machines for each of the major database systems
(Nebulous, Metadata, DVO). In earlier incarnations, the same
hardware and database engine may be used.
%
\begin{table}[ht]
\begin{center}
\caption{Nebulous Database Tables\label{tab:ImageServerTables}}
\begin{tabular}{ll}
\hline
…
\end{table}

\subsubsection{Nebulous Storage Hardware}

Nebulous manages data across a collection of computers and, as
needed, on multiple storage devices on those computer nodes.
Nebulous maintains a table of the available data volumes.
It tracks information about each volume, such as the total capacity,
the current capacity, and the association between computer and data
volume. \tbd{Is Nebulous responsible for detecting unavailable
hardware? Is it responsible for changing allocations? Or is this a
pantasks responsibility?}

\subsubsection{Requirements Demonstrations}

\tbd{summary of throughput tests : create / copy / delete objects
per second, etc}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

…

\subsubsection{Overview}

The IPP Metadata Database acts as a repository for most non-pixel
data needed by the IPP subsystems. This includes the image metadata,
the environmental data, system configuration data, and system
reference data. The Metadata Database is required to save the
non-ephemeral data for the lifetime of the project for future
reference and additional analysis.
The Metadata Database is also used in close coupling with the
analysis pipelines to track the state of elements as they move
through the processing system. Metadata which is large in volume or
poorly structured is stored in an appropriate container file (FITS
Table, FITS Header, XML File) in Nebulous, with the Metadata DB
maintaining the corresponding Nebulous Storage IDs of these files.

The IPP Metadata Database is a simple database system, consisting of
a number of simple tables without extensive inter-table links. The
IPP uses the MySQL database engine for the database. To simplify
the coding and management of the database, the IPP uses autocoded
APIs, constructed with the system called 'glueforge', to define and
manipulate the Metadata Database tables.

\begin{table}[hb]
…
is identified.

\subsubsection{Autocoded Metadata Queries}

The IPP provides standardized interfaces to the Metadata tables
using the 'glueforge' system. Glueforge uses a standard table
description file to construct a collection of standard interface
functions with easily predicted names.
Given the description of a table (say, Foo), Glueforge provides a C
struct which represents the elements of a single row of the table
(\code{FooRow}). It also provides APIs to create a new Foo table
(\code{FooCreateTable()}) and to insert a row, either by supplying
the elements of Foo (to \code{FooInsert()}) or by supplying a
pointer to data of type Foo (to \code{FooInsertObject()}). Simple
queries may be constructed to select rows from the table
(\code{FooSelectRow()}). The same mechanism generates data I/O
functions for writing FITS tables from a collection of the data
elements.

This autocoding system for interacting with the Metadata database
makes the software very flexible with respect to changes in the
structure of the Metadata database tables. The programmer does not
need to know anything about the details of a given table to interact
with it, except when a specific element is needed. Thus, new
columns can be added to the tables while requiring only a
re-compilation for most portions of the IPP code. Even migration of
existing data to a new table schema becomes fairly trivial: such a
migration would only require the definition of a conversion function
from the old structure to the new structure. The more general
features of glueforge are discussed in the glueforge manual pages.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

…

\subsection{Scheduler}
\label{sec:scheduler}

\subsubsection{Corresponding Requirements}

…

machine and monitors the success or failure of the processing stage.
The analysis stages are written using programs executed as UNIX
commands. These commands may be executed by the IPP Controller, or
may be executed individually by hand. This option makes testing of
the complete analysis system much easier, because the individual
analysis stages may be tested independently of each other and of the
rest of the IPP infrastructure. The analysis stages discussed in
this section use two somewhat different types of programs. One set
of programs performs the heavy lifting of the data analysis: they
examine the pixels in images and perform some statistical analysis
of the pixel values, or manipulate the data products in one way or
another. Another set of programs is used to tie the analysis stages
together. This latter set of programs examines the state of the
Metadata database and selects images for processing. They are used
by PanTasks, the IPP Scheduler, to make decisions about which of the
analysis programs to run on which data.
In this section, we discuss the details of the analysis stages in
terms of the science analysis. Below, in
Section~\ref{AnalysisPrograms}, we discuss the major analysis
programs and the top-level analysis routines used by those programs.
An important distinction to be noted here is that the same analysis
programs, with somewhat different options and configuration
information, can be employed by different analysis stages. We
exclude detailed discussion of the other connecting tools used to
define the analysis stages. These tools, and the detailed
construction of the pipelines which make up the analysis stages, are
touched upon in Section~\ref{sec:PanTasks}, which discusses the IPP
Scheduler program, PanTasks. They are discussed in more detail in
the document 'ippTools'.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

…

In order to facilitate testing and development, and to encourage
flexibility, the IPP is built in a layered fashion. The lowest-level
functions will be written in C and collected together into a
Pan-STARRS library. These library functions will be used to write
more complex modules. The modules will be written in C but will make
…
functions in the operational system, the IPP will make use of Perl
as the scripting language to provide the required flow-control to
tie the modules together. \tbd{note that we use C only, not perl
for scripting}.

This approach satisfies the requirement that complicated low-level
…
detailed in the IPP PS-1 SRS (PSDC-430-005), Section 3.3.
\subsection{IPP Analysis Programs}

\tbd{clean this up}

The major IPP processing tasks are organized into stages, which … images from multiple telescopes and search for transients).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Top-Level IPP Analysis Routines}

The IPP uses a handful of high-level analysis routines which perform the bulk of the actual analysis effort. These routines are used both as stand-alone programs and, in some cases, as library functions called by other stand-alone programs. The six primary top-level analysis routines are:
\begin{itemize}
\item {\bf ppImage} : the complete single-image analysis program.
\item {\bf ppMerge} : the basic image combination program.
\item {\bf psphot} : the photometry analysis routine.
\item {\bf psastro} : the astrometry analysis routine.
\item {\bf stac} : the science image combination program.
\item {\bf poisub} : the image difference program.
\end{itemize}

These programs are not mutually exclusive: psphot and psastro are both used by ppImage, while psastro is used as a stand-alone tool within the IPP, and psphot can be used as such.

\subsection{Software Configuration and Camera Definition Information}

Every camera which may be analysed by the IPP has differences in how the data is represented. The IPP is built with the flexibility to handle data from many different cameras, not just the Pan-STARRS Gigapix cameras. This is partly to allow testing of the analysis system on data from other telescopes, such as MegaPrime on CFHT and Suprime on Subaru, but also to allow us to adapt to changes in the design of the Gigapix cameras themselves.
It also means the IPP software may be used by astronomers for other analysis projects beyond the IPP.

Most cameras provide extensive descriptive information in the FITS image headers when the images are read out. Typically, the locations and orientations of the individual detectors are defined by keywords such as DATASEC and DETSEC. Other variations on these keywords are used for cameras which place the pixels from multiple amplifiers in the same FITS data segment. Other parameters, such as astrometric information or exposure times, are stored in the headers as well. It is possible to use these header keywords to guide the analysis software, but there are two difficulties.

First, it is very common for different keywords to be used by different cameras; sometimes even the same camera may use different keywords for the same information at different times (major readout software upgrades, for example, can be accompanied by keyword revisions). In addition, within Pan-STARRS and the IPP, it is necessary to have the capability to refer to the Metadata Database as the authoritative source of some of these entries rather than the image headers. Given this circumstance, it is at least necessary to define the appropriate source for a given data concept for data from a specific camera.

The second problem arises when actually performing an analysis. In many circumstances, the software needs to know what data to expect even when an appropriate camera image is not available. This is particularly true for a camera which is composed of multiple chips and multiple amplifiers.
It is a frequent circumstance that some subset of the chips or amplifiers will either be unavailable or invalid for one reason or another. It is important for the software to have a guide for what data should be available from a perfect readout of the given camera, so decisions can be made about how to handle data which is not complete. This is also important to validate that a particular dataset, which appears to be from a known camera, actually corresponds to that camera and has all of the necessary information where expected.

As part of the flexible design model, all of these analysis programs use a common set of configuration files which define the details of how their analysis should be performed. The configuration files are further divided into three main sets of configuration information:
\begin{itemize}
\item {\bf site configuration}: this information defines the locations of data resources and the other available configuration information. For example, the site configuration would include the location of Nebulous, the metadata database access information, the list of available cameras, and so forth.
\item {\bf camera configuration}: this information describes the characteristics needed to interpret data from a specific camera. This includes such details as where to extract particular metadata, such as the filter name. It also defines the camera layout and the expected organization of the data files.
\item {\bf recipe}: a particular analysis program may use one or multiple recipes. The recipe defines the values of optional configuration information for that analysis program. For example, by using different recipes, ppImage can be made to perform a complete Phase 2 image analysis, including detailed object detection and astrometric calibration, or, with a different recipe, ppImage may be used to perform only bias subtraction (for example, as part of the detrend image analysis).
Note that the details of a recipe in general depend on the camera / telescope of interest. As a result, the identity of the recipe files which define the different recipes is included as part of the camera configuration information.
\end{itemize}
For all of the analysis programs, the source of the configuration files can be overridden when the program is executed, and individual configuration values may also be specified on the command line. The details of the configuration file formats and the configuration variable names used by different programs and functions are discussed in the Modules SDRS document.

\subsection{ppImage}

This program is not only one of the work-horse programs of the IPP; it is also exemplary of the design of a top-level analysis program. ppImage is used to perform the complete Phase 2 analysis on a single image data file. This includes the complete detrending discussed above (bias, dark, flat, fringe, etc), as well as the object detection and classification, the astrometric calibration, and potentially the photometric calibration as well. The object analysis and astrometry are in fact performed by ppImage using two of the other top-level routines discussed above, psphot and psastro; these components will be discussed independently and will mostly not be part of the ppImage discussion.

The top-level design of ppImage is virtually identical to that of several other stand-alone analysis programs (eg, psphot, psastro, ppMerge).
The input to ppImage may be one of several options:
\begin{itemize}
\item a single FITS file by name (eg: /full/path/to/file.fits)
\item a collection of FITS files by glob (eg: /full/path/to/file.*.fits)
\item a single Nebulous storage ID (neb:nebID)
\item a collection of Nebulous storage IDs (neb:nebID.regex)
\end{itemize}
The Nebulous versions can be viewed as simply utility versions to save the user from asking Nebulous for the corresponding file names. In addition to the input file and any optional configuration information, ppImage also takes, as part of the command-line arguments, the root of the output file names; ppImage defines naming conventions for constructing the complete output file names based on the recipe.

After parsing the command-line options, ppImage performs the following tasks:
\begin{itemize}
\item identify the camera associated with the file or files (all files must be from the same camera). This analysis is performed by examining the primary headers of the files and applying camera-identification rules specified as part of the camera configuration information. It is also possible for the user to assert the camera which supplied the image(s).

\item construct a complete FPA structure representing the camera (without pixel data) and associate the input filenames with the different components of the FPA structure. In this step, the camera configuration information is used to identify the cell, chip, etc which corresponds to each file, and the filenames are associated with elements at the appropriate levels. Since each input file has already had its primary header read, this is also attached to the appropriate FPA structure level.

\item the complete FPA is then looped over, with nested loops for each of the levels: FPA, Chip, Cell, and Readout.
At the appropriate depth (depending on user options and the recipe), the pixels are loaded from the corresponding file. The recipe may, for example, require the entire FPA image to be read, or it may specify that each cell be loaded one at a time.

\item the detrending analysis is performed on each readout. Note that the detrend images needed for the analysis are defined by the recipe and by querying the metadata database for the detrend image which matches the science image in hand. The details of this match may be modified by the recipe as well, allowing, for example, one recipe to apply only the best available flat-field images while another, for comparison, applies only an earlier flat. These types of options are needed for testing, and also to perform, e.g., the flat-field correction analysis. Also note that the detrend images may be loaded at a different level from the science images; it is likely that a detrend image is only defined for each Cell, while the science images may be loaded by Readout.

\item after each chip is processed, ppImage will optionally reconfigure the resulting pixels into a single contiguous array. This is the normal data source for the psphot function call used by ppImage. The reconfigured chip image at this stage may also be used to generate thumbnail and rebinned sample images at the chip level.

\item ppImage may also be used to reconfigure the pixels from a complete FPA into a single pixel array. For the GPC, this step would not normally be performed on the full data array, but would be used to generate rebinned sample images of the full mosaic (in either FITS or JPEG format). This step is used by Phase 3 to construct images for examining the state of the data processing, or by the detrend creation analysis to test the quality of the residuals.
\item the final stage is to output, at an appropriate depth, the chosen output data files and to send summary metadata to the metadata database. These output files may include FITS images, FITS thumbnails (very small rebinned images), FITS samples, JPEG images, object tables, and/or astrometry calibration files.
\end{itemize}

Except for the details of the analysis (detrending, etc), the details of the processing steps are identical for psphot, psastro, and ppMerge, as discussed below.

\subsection{ppMerge}

This program performs the job of stacking multiple images in which the pixel grid is kept the same for all of the input images. As part of the combination process, the input images may be scaled and shifted. This operation is the basis for the creation of all of the primary master detrend frames: bias, dark, flat, and fringe. For all of these cases, the output pixel values are determined by applying some global statistic to the collection of input pixels after the input pixels have been rescaled. The combination statistic may be any of several standard options, including mean, median, sigma-clipping, etc. The input files may be supplied to ppMerge using any of a command-line glob, a text-based list, or an XML list. The glob method forces the zero and scaling factors to have values of 0 and 1, respectively. The text file listing allows for zero and scaling values, but only a single value for each input file. The XML-format list is needed if subdivisions within a file (eg, cells) require independent zero and scaling factors. It is also necessary to specify more than one image file if, eg, a full FPA composed of separate Chip files is to be processed at once.
This analysis is performed using a structure very similar to that of ppImage: the input list of images is parsed and associated with the appropriate elements of the FPA structure, though in this case an array of FPA structures is used to carry the information. The elements of the FPA are looped over, as with ppImage, and the analysis is performed at the lowest level, after the input data pixels are read and rescaled. The output is written out after the images have been looped over.

\subsection{psphot}

The details of psphot are discussed in the psphot design document. Here we will just address some of the basic concepts. PSPhot may be used as a stand-alone program, or it may be called as a function within another program (such as ppImage). In the li \ldots

\section{Interfaces}

\ldots

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Software Runtime Configuration Issues}
\label{sec:RuntimeConfig}

The IPP Software requires extensive runtime configuration information. This includes default parameters for the analysis to be performed, descriptions of how a particular analysis is performed, locations of data sources, and so forth. The IPP may store this information in the Metadata Database or in configuration files available to the user. Both methods are implemented in the current design. In either method, the necessary parameters are identical. This section discusses the contents of specific portions of the runtime configuration.

\subsection{Camera Definition Information}

Every camera which may be analysed by the IPP has differences in how the data is represented. The IPP is built with the flexibility to handle data from many different cameras, not just the Pan-STARRS Gigapix cameras.
In order to facilitate the operation of the IPP with a variety of cameras, and to allow the software the flexibility to change the camera definition dynamically, the IPP includes a collection of software runtime configuration information which defines a given camera. This information is represented below in the form of the PSLib Metadata Config file, but may be stored in the Metadata Database or in an alternate format as appropriate.

A single camera is represented as a Focal Plane Array (FPA), divided into Chips, which are in turn divided into Cells. For a single FPA, all imaging data is stored in a FITS file or a collection of FITS files. Software needs to know where in a given file or set of files to find a particular Cell, what Cells to expect, what Chips to expect, the relationships between those entities, etc.

A single camera configuration file (or dataset) represents the description of a complete FPA. In the configuration file, any parameters which are specific to the complete FPA are placed on their own lines. These include the definitions of the keywords or database locations.
An incomplete example is given below.

\begin{verbatim}
NCELL        S32  NN
NCHIP        S32  NN
EXPTIME-SRC  STR  HD:EXPTIME   # need to specify PHU vs EXTNAME
EXPTIME-KEY  STR  EXPTIME
DATE-KEY     STR  DATE-OBS
DATE-FMT     STR  YYYY/MM/DD

TYPE     CELL  FILENAME           EXTNAME  CHIP     DATASEC       BIASSEC
CELL.nn  CELL  @ROOT@CELL         AMP00    CHIP.00  CF:[0,0:0,0]  HD:BIASSEC
CELL.01  CELL  @ID/@ID@CELL.fits  AMP01    CHIP.00  DB:???
\end{verbatim}

\subsection{Analysis Recipe Information}

In order to maintain flexibility in the analysis details, the IPP uses recipes to define how a particular analysis is implemented. Each major analysis script (eg, Phase 2) has its own recipe configuration information, which may be stored in the Metadata Database or in the form of a PSLib Metadata Config file. This configuration information includes all of the user-configurable parameters. Many of these may specify a specific value, or they may specify lookup methods (database locations or header locations). The specifics of each depend on the context. Below is an example recipe file for the bias subtraction portion of Phase 2, giving several alternative options for certain entries.
Note that, for example, the overscan subtraction may be specified as using a particular region given in the recipe file, or on the basis of a particular header keyword.

\begin{verbatim}
# BIAS:
BIAS.IMAGE STR NONE
BIAS.IMAGE STR FILE:bias.fits
BIAS.IMAGE STR DB:BEST
BIAS.IMAGE STR DB:CLOSE

BIAS.OVERSCAN STR HD:BIASSEC
BIAS.OVERSCAN STR CF:[0,16:0,2048]
BIAS.OVERSCAN STR NONE

BIAS.OVERSCAN.STATS STR MEDIAN
BIAS.OVERSCAN.STATS STR MEAN

BIAS.OVERSCAN.FIT STR SPLINE
BIAS.OVERSCAN.FIT.NPTS S32 5

BIAS.OVERSCAN.FIT STR POLYNOMIAL
BIAS.OVERSCAN.FIT.ORDER S32 3
BIAS.OVERSCAN.FIT.NBIN S32 5
\end{verbatim}

\section{I/O Code Autogeneration}
\label{sec:AutocodeIO}

The IPP includes a number of data collections which have multiple representations. A software tool will be used to automatically generate code to provide I/O APIs to read and write these data, and to define the data structures used to carry them within a program. Within the IPP, examples of these different data entities include database tables (ie, in the Metadata Database), FITS tables (to exchange bulk data), and XML (to exchange more complete datasets).

I/O API Autocode template (example.def):
\begin{verbatim}
Name Example
Table EXAMPLE
EXTNAME EXAMPLE

KEY XVALUE

# name     format   unit      comment
XVALUE     F32      pixels    "x coordinate"
BINNING    S32      fraction  "binning factor"
NAME       STR[32]  string    "description of entry"
\end{verbatim}

Running autocode on such a file would generate output header and C files \code{example.h, example.c} with the following structure and APIs:

\begin{verbatim}
typedef struct {
    psF32 XVALUE;    // x coordinate
    psS32 BINNING;   // binning factor
    char  NAME[32];  // description of entry
} Example;

psMetadata *psFITSTableInitExample ();
psExample  *psFITSTableLoadExample (char *filename, int *Nrows);
bool        psFITSTableSaveExample (char *filename);

psMetadata *psDatabaseTableInitExample ();
psExample  *psDatabaseTableLoadExample (char *filename, int *Nrows);
bool        psDatabaseTableSaveExample (char *filename);
psExample  *psDatabaseTableLoadExampleRow (char *filename, psF32 XVALUE);
\end{verbatim}

%\bibliographystyle{plain}
%\bibliography{panstarrs}

\input{glossary.tex}

\end{document}

------

* top-level routines
* re-org the Phase 4 stuff to discuss Magic
* astrometry calibration data formats
* analysis stages, versions and iterations
* output data products
