Changeset 6040
- Timestamp: Jan 18, 2006, 2:04:39 PM
- File: trunk/doc/design/ippSSDD.tex (modified) (14 diffs)
%%% $Id: ippSSDD.tex,v 1.2 2006-01-19 00:04:39 eugene Exp $
\documentclass[panstarrs]{panstarrs}

…

\begin{itemize}

\item {\bf Image/File Server:} This component is a large data store
for all images and large files used by the IPP, including the raw
images from the telescope, the master calibration images, the
reference static-sky images, and any temporary image data products
produced by the IPP. The Image/File Server accepts data products and
stores them until they are no longer needed by other portions of the
IPP. It allows other IPP subsystems to refer to data files by an
abstract identifier, without needing to worry about the details of
their physical location. Conversely, it allows other entities to
determine or specify the physical locations if needed. The
Image/File Server is capable of storing any large data files which
are not well suited for inclusion in a more structured relational
database, and for which access needs to be widely available beyond
the individual process which created the file.
The IPP has developed the software system called 'Nebulous' to
perform this function.

\item {\bf Metadata Database:} This component stores the data which is
…

\label{sec:ArchComponents}

\subsection{Nebulous: the IPP Image/File Server}

\subsubsection{Corresponding Requirements}
The IPP Image/File Server must meet the requirements specified in
Section 3.4.1 of the Pan-STARRS PS-1 IPP SRS (PSDC-430-005). The
specified design is chosen to meet requirements 3.4.1.3 and 3.4.1.5.
The other three requirements (3.4.1.1, 3.4.1.2, and 3.4.1.4) depend
on the volume and capabilities of the hardware, and are addressed in
Section~\ref{sec:Hardware}.

\subsubsection{Image/File Server Overview}

The IPP Image/File Server is a repository for all images and other
large data files required by the IPP. Along with the storage
hardware, it provides tools for managing the distribution of these
large data files and for accessing them. Data files stored by the
IPP Image/File Server include the raw images, the calibration
images, intermediate processing-stage images as needed, final
processed images, difference images, image subsections, and any
large non-imaging data files needed by the IPP. The IPP Image/File
Server must retain the files for as long as they are needed by the
IPP.

The IPP team has developed the software system 'Nebulous' to meet
the requirements of the Image/File Server. Nebulous uses a MySQL
database engine to track the identities and locations of the files
under its management. Nebulous currently requires that all files it
manages be available from a locally mounted file system; however, it
does not place specific requirements upon the choice of this file
system.
Currently, the IPP is being implemented using NFS as the distributed
file system, though other systems such as GFS or GSFS could be used
in an equivalent fashion.

The decision to make use of a locally mounted file system is driven
by two IPP-wide design decisions. First, as a practical and
technical consideration, in several places in the IPP analysis the
efficiency can be increased significantly if processes have the
ability to seek randomly within a file. For example, when combining
multiple partially-overlapping images, rather than being forced to
read through an entire image, the process can seek through a
multi-component file to the pixels of interest. The other driving
motivation is more philosophical in nature: the IPP is designed to
use components which operate as simple user commands in the UNIX
environment. By requiring Nebulous to work with files in the
locally mounted file system, all UNIX programs are able to interact
with those files if needed. This allows the IPP to include
essentially any normal analysis program as part of the IPP system
without difficulty. This philosophical choice is also present in
the design of the IPP Scheduler / Controller system and in the
design and implementation of the data analysis programs.

Nebulous is a parallel storage system. It stores data across a
collection of computer nodes, each with its own data storage
resources. Any single file is stored on only a single computer and
storage device. In order to achieve the data throughput
requirements, Nebulous will be used to distribute the images across
the processor nodes in an organized fashion, i.e., associating
specific machines with specific detectors.
It is not the responsibility of Nebulous to determine which computer
should be associated with a specific data concept (chip / region of
sky); instead, it provides the hooks to enable the association of a
particular file with a particular machine.

There are three data concepts relevant to Nebulous:
\begin{itemize}
\item {\bf Storage object:} This represents a single, unique data
entity managed by Nebulous.

\item {\bf Storage ID:} This is the identifier of a particular
storage object in Nebulous. The Storage ID is the key used by any
Nebulous user to retrieve a specific file of interest. It is simply
a unique string, equivalent to the filename in a UNIX file system.

\item {\bf Instance (or Storage Object Instance):} A single copy of
the storage object in Nebulous. In general, a given storage object
may have several instances in Nebulous, normally on different
computer nodes.

\end{itemize}
Upon request of a specific Storage ID, Nebulous provides file
pointers (in C), handles (in Perl), or file names corresponding to
the instances of the storage objects.

Nebulous provides the storage and access mechanisms, but it does not
include any logic or information about the data. It does not, e.g.,
monitor the age of images and delete them on some schedule. This
functionality currently resides in the IPP Scheduler
(Section~\ref{sec:scheduler}).

As shown in Figure~\ref{fig:Nebulous}, Nebulous consists of the
following principal components:
\begin{itemize}
\item Nebulous client(s)
\item Nebulous server
\item Nebulous database
\item Storage hardware
\end{itemize}

…
\end{figure}

\subsubsection{Nebulous Client APIs}

Clients interact with Nebulous via a small number of C APIs.
Bindings are also provided for Perl \tbd{and Python}, and in some
cases UNIX shell commands. This document gives only an overview of
the commands; for details on usage, please see the Nebulous user's
guide.
The client commands are:

\begin{itemize}
\item {\tt create} : create a new storage object in Nebulous. This
function takes as input the requested Storage ID and returns a
C-style file pointer, Perl file handle, or file name corresponding
to the new instance of the storage object. The arguments to the
function include an optional node name on which the new storage
object must be located.
If this target is not given, Nebulous places the new storage object
on an appropriate machine from the pool.

\item {\tt replicate} : create a new instance of the given storage
object. The target node may be optionally specified; otherwise an
appropriate node is selected.

\item {\tt cull} : remove one of the instances of the storage
object. The input parameters may optionally specify the target
machine from which to delete the object.

\item {\tt delete} : delete all instances of the storage object and
set the storage object status to {\tt deleted}.

\item {\tt open} : open an instance of an existing storage object,
as identified by the Storage ID. This function may also specify the
node on which the object should be opened (if an instance of the
object is not stored on that node, the function returns an error).
On success, the function returns a file pointer. If the object is
opened for 'write' access, all but one instance are deleted to
ensure consistency of the data.

\item {\tt find} : return a list of filenames in the UNIX name space
associated with the storage object identified by the given Storage
ID. Since there are in general multiple instances for a given
storage object, this function returns the collection of all
available instances. These may be freely opened by the client using
the standard \code{fopen} functions.

\item {\tt lock} : attempt to acquire a Nebulous lock on the storage
object.

\item {\tt unlock} : release a Nebulous lock from the storage object.
\item {\tt stat} : return status information about the specified
storage object, including the number of instances of the object.

\item {\tt copy} : create a new storage object with one instance of
the corresponding object.

\item {\tt move} : rename a storage object.

\item {\tt import} : copy an existing file into Nebulous.

\end{itemize}

\subsubsection{Nebulous Server}

The Nebulous client requests are mediated via the Nebulous server.
Communication between the clients and the server is via SOAP,
implementing the commands above. The identity of the machine on
which the Nebulous server runs is part of the Nebulous configuration
information.

The server is responsible for keeping track of storage objects and
all of their instances, and for enforcing locking semantics.
Extensive logging and tracing support is provided for debugging and
to allow for statistics generation and possible {\em hotspot}
optimization.

Nebulous uses a centralized server model. This model was chosen
because it allows efficient {\em pattern matching} of storage object
names. The current 'best' technique for a distributed metadata
store is distributed hash tables.
Unfortunately, no widely available DHT implementation allows
efficient {\em pattern matching} of key names.

\subsubsubsection{Housekeeping}

\paragraph{Lock sweeping} In the event that a storage object
operation fails to complete successfully, stale locks will have to
be identified and removed from the Nebulous database. This should
be done periodically by comparing the entries in the Lock table to
the list of active nodes maintained by the IPP Controller. It
should also happen as soon as possible after a node goes offline
(triggered by the IPP Controller marking a node as offline?). A
sweep must be {\em completed} before an offline node can be marked
online again.

Once a node is determined to be offline, all entries in the Lock
table set by that node should be identified. The locks on the
storage object instances pointed to by those entries should then be
rolled back, and the Lock Record entries themselves must be removed
from the Lock table.

\paragraph{Consistency sweeping} Periodically, the Nebulous metadata
and storage objects will need to be checked for sanity. This would
be similar to running fsck on a modern filesystem. Consistency
sweeping should include lock sweeping and should be considered a
super-set of it.

\subsubsection{Nebulous Database}

The Nebulous server uses a database to store the information about
the data storage objects, their instances, and the available
hardware resources. A {\tt mysql} database engine is used to manage
the database tables. The database tables defined for Nebulous are
listed in Table~\ref{tab:ImageServerTables}, and their contents are
listed in Appendix~\ref{sec:ImageServerTableContents}.
This database engine is not in general the same one used for other
IPP subsystems; the full IPP hardware configuration will include
independent machines for each of the major database systems
(Nebulous, Metadata, DVO). In earlier incarnations, the same
hardware and database engine may be used.
%
\begin{table}[ht]
\begin{center}
\caption{Nebulous Database Tables\label{tab:ImageServerTables}}
\begin{tabular}{ll}
\hline
…
\end{table}

\subsubsection{Nebulous Storage Hardware}

Nebulous manages data across a collection of computers and, as
needed, on multiple storage devices on those computer nodes.
Nebulous maintains a table of the available data volumes.
It tracks information about each volume, such as the total capacity,
the current capacity, and the association between computer and data
volume. \tbd{Is Nebulous responsible for detecting unavailable
hardware? Is it responsible for changing allocations? Or is this a
pantasks responsibility?}

\subsubsection{Requirements Demonstrations}

\tbd{summary of throughput tests : create / copy / delete objects
per second, etc}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

…

\subsubsection{Overview}

The IPP Metadata Database acts as a repository for most non-pixel
data needed by the IPP subsystems. This includes the image metadata,
the environmental data, system configuration data, and system
reference data. The Metadata Database is required to save the
non-ephemeral data for the lifetime of the project for future
reference and additional analysis.
The Metadata Database is also used in close coupling with the
analysis pipelines to track the state of elements as they move
through the processing system. Metadata which is large in volume or
poorly structured is stored in an appropriate container file (FITS
Table, FITS Header, XML File) in Nebulous, with the Metadata DB
maintaining the corresponding Nebulous Storage IDs of these files.

The IPP Metadata Database is a simple database system, consisting of
a number of simple tables without extensive inter-table links. The
IPP uses the MySQL database engine for the database. To simplify
the coding and management of the database, the IPP uses autocoded
APIs, constructed with the system called 'glueforge', to define and
manipulate the Metadata Database tables.

\begin{table}[hb]
…
is identified.

\subsubsection{Autocoded Metadata Queries}

The IPP provides standardized interfaces to the Metadata tables
using the 'glueforge' system. Glueforge uses a standard table
description file to construct a collection of standard interface
functions with easily predicted names.
Given the description of a table (say, Foo), Glueforge provides a C
struct which represents the elements of a single row of the table
(\code{FooRow}). It also provides APIs to create a new Foo table
(\code{FooCreateTable()}) and to insert a row, either by supplying
the elements of Foo (to \code{FooInsert()}) or by supplying a
pointer to data of type Foo (to \code{FooInsertObject()}). Simple
queries may be constructed to select rows from the table
(\code{FooSelectRow()}). The same mechanism generates data I/O
functions for writing FITS tables from a collection of the data
elements.

This autocoding system for interacting with the Metadata database
makes the software very flexible with respect to changes in the
structure of the Metadata database tables. The programmer does not
need to know anything about the details of a given table to interact
with it, except when a specific element is needed. Thus, new
columns can be added to the tables while requiring only a
re-compilation for most portions of the IPP code. Even migration of
existing data to a new table schema becomes fairly trivial: such a
migration would only require the definition of a conversion function
from the old structure to the new structure. The more general
features of glueforge are discussed in the glueforge manual pages.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

…

\subsection{Scheduler}
\label{sec:scheduler}

\subsubsection{Corresponding Requirements}

…

machine and monitors the success or failure of the processing stage.
The analysis stages are written using programs executed as UNIX
commands. These commands may be executed by the IPP Controller, or
may be executed individually by hand. This option makes testing of
the complete analysis system much easier, because the individual
analysis stages may be tested independently of each other and of the
rest of the IPP infrastructure. The analysis stages discussed in
this section use two somewhat different types of programs. One set
of programs performs the heavy lifting of the data analysis: they
examine the pixels in images and perform some statistical analysis
of the pixel values, or manipulate the data products in one way or
another. Another set of programs is used to tie the analysis stages
together. This latter set of programs examines the state of the
Metadata database and selects images for processing. They are used
by PanTasks, the IPP Scheduler, to make decisions about which of the
analysis programs to run on which data.
In this section, we discuss the details of the analysis stages in
terms of the science analysis. Below, in
Section~\ref{AnalysisPrograms}, we discuss the major analysis
programs and the top-level analysis routines used by those programs.
An important distinction to be noted here is that the same analysis
programs, with somewhat different options and configuration
information, can be employed by different analysis stages. We
exclude detailed discussion of the other connecting tools used to
define the analysis stages. These tools, and the detailed
construction of the pipelines which make up the analysis stages, are
touched upon in Section~\ref{sec:PanTasks}, which discusses the IPP
Scheduler program, PanTasks. They are discussed in more detail in
the document 'ippTools'.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

…

In order to facilitate testing and development, and to encourage
flexibility, the IPP is built in a layered fashion. The lowest-level
functions will be written in C and collected together into a
Pan-STARRS library. These library functions will be used to write
more complex modules. The modules will be written in C but will make
…
functions in the operational system, the IPP will make use of Perl
as the scripting language to provide the required flow-control to
tie the modules together. \tbd{note that we use C only, not perl
for scripting}.

This approach satisfies the requirement that complicated low-level
…
detailed in the IPP PS-1 SRS (PSDC-430-005), Section 3.3.
\subsection{IPP Analysis Programs}

\tbd{clean this up}

The major IPP processing tasks are organized into stages, which … images from multiple telescopes and search for transients).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Top-Level IPP Analysis Routines}

The IPP uses a handful of high-level analysis routines which perform the bulk of the actual analysis effort. These routines are used both as stand-alone programs and, in some cases, as library functions called by other stand-alone programs. The six primary top-level analysis routines are:
\begin{itemize}
\item {\bf ppImage} : the complete single-image analysis program.
\item {\bf ppMerge} : the basic image combination program.
\item {\bf psphot} : the photometry analysis routine.
\item {\bf psastro} : the astrometry analysis routine.
\item {\bf stac} : the science image combination program.
\item {\bf poisub} : the image difference program.
\end{itemize}

These programs are not mutually exclusive: psphot and psastro are both used by ppImage, while psastro is used as a stand-alone tool within the IPP, and psphot can be used as such.

\subsection{Software Configuration and Camera Definition Information}

Every camera which may be analysed by the IPP has differences in how the data is represented. The IPP is built with the flexibility to handle data from many different cameras, not just the Pan-STARRS Gigapix cameras. This is partly to allow testing of the analysis system on data from other telescopes, such as MegaPrime on CFHT and Suprime on Subaru, but also to allow us to adapt to changes in the design of the Gigapix cameras themselves.
It also means the IPP software may be used by astronomers for other analysis projects beyond the IPP.

Most cameras provide extensive descriptive information in the FITS image headers when the images are read out. Typically, the locations and orientations of the individual detectors are defined by keywords such as DATASEC and DETSEC. Other variations on these keywords are used for cameras which place the pixels from multiple amplifiers in the same FITS data segment. Other parameters, such as astrometric information or exposure times, are stored in the headers as well. It is possible to use these header keywords to guide the analysis software, but there are two difficulties.

First, it is very common for different keywords to be used by different cameras; sometimes even the same camera may use different keywords for the same information at different times (major readout software upgrades, for example, can be accompanied by keyword revisions). In addition, within Pan-STARRS and the IPP, it is necessary to have the capability to refer to the Metadata Database as the authoritative source of some of these entries rather than the image headers. Given this circumstance, it is at least necessary to define the appropriate source for a given data concept for data from a specific camera.

The second problem arises when actually performing an analysis. In many circumstances, the software needs to know what data to expect even when an appropriate camera image is not available. This is particularly true for a camera which is composed of multiple chips and multiple amplifiers.
It is a frequent circumstance that some subset of the chips or amplifiers will either be unavailable or invalid for one reason or another. It is important for the software to have a guide for what data should be available from a perfect readout of the given camera, so decisions can be made about how to handle data which is not complete. This is also important to validate that a particular dataset, which appears to be from a known camera, actually corresponds to that camera and has all of the necessary information where expected.

As part of the flexible design model, all of these analysis programs use a common set of configuration files which define the details of how their analysis should be performed. The configuration files are further divided into three main sets of configuration information:
\begin{itemize}
\item {\bf site configuration}: this information defines the locations of data resources and the other available configuration information. For example, the site configuration would include the location of Nebulous, the metadata database access information, the list of available cameras, and so forth.
\item {\bf camera configuration}: this information describes the characteristics needed to interpret data from a specific camera. This includes such details as where to extract particular metadata, such as the filter name. It also defines the camera layout and the expected organization of the data files.
\item {\bf recipe}: a particular analysis program may use one or multiple recipes. The recipe defines the values of optional configuration information for that analysis program. For example, by using different recipes, ppImage can be made to perform a complete Phase 2 image analysis, including detailed object detection and astrometric calibration, or, with a different recipe, ppImage may be used to perform only bias subtraction (for example, as part of the detrend image analysis).
Note that the details of a recipe in general depend on the camera / telescope of interest. As a result, the identity of the recipe files which define the different recipes is included as part of the camera configuration information.
\end{itemize}
For all of the analysis programs, the source of the configuration files can be overridden when the program is executed, and individual configuration values may also be specified on the command line. The details of the configuration file formats and the configuration variable names used by different programs and functions are discussed in the Modules SDRS document.

\subsection{ppImage}

This program is not only one of the work-horse programs of the IPP; it is also exemplary of the design of a top-level analysis program. ppImage is used to perform the complete Phase 2 analysis on a single image data file. This includes the complete detrending discussed above (bias, dark, flat, fringe, etc), as well as the object detection and classification, the astrometric calibration, and potentially the photometric calibration as well. The object analysis and astrometry are in fact performed by ppImage using two of the other top-level routines discussed above, psphot and psastro; these components will be discussed independently and will mostly not be part of the ppImage discussion.

The top-level design of ppImage is virtually identical to that of several other stand-alone analysis programs (eg, psphot, psastro, ppMerge).
The input to ppImage may be one of several options:
\begin{itemize}
\item a single FITS file by name (eg: /full/path/to/file.fits)
\item a collection of FITS files by glob (eg: /full/path/to/file.*.fits)
\item a single Nebulous storage ID (neb:nebID)
\item a collection of Nebulous storage IDs (neb:nebID.regex)
\end{itemize}
The Nebulous versions can be viewed as simply utility versions to save the user from asking Nebulous for the corresponding file names. In addition to the input file and any optional configuration information, ppImage also takes, as part of the command-line arguments, the root of the output file names; ppImage defines naming conventions for constructing the complete output file names based on the recipe.

After parsing the command-line options, ppImage performs the following tasks:
\begin{itemize}
\item identify the camera associated with the file or files (all files must be from the same camera). This analysis is performed by examining the primary headers of the files and applying camera-identification rules specified as part of the camera configuration information. It is also possible for the user to assert the camera which supplied the image(s).

\item construct a complete FPA structure representing the camera (without pixel data) and associate the input filenames with the different components of the FPA structure. In this step, the camera configuration information is used to identify the cell, chip, etc which corresponds to each file, and the filenames are associated with elements at the appropriate levels. Since each input file has already had its primary header read, this is also attached to the appropriate FPA structure level.

\item the complete FPA is then looped over, with nested loops for each of the levels: FPA, Chip, Cell, and Readout.
At the appropriate depth (depending on user options and the recipe), the pixels are loaded from the corresponding file. The recipe may, for example, require the entire FPA image to be read, or it may specify that each cell be loaded one at a time.

\item the detrending analysis is performed on each readout. Note that the detrend images needed for the analysis are defined by the recipe and by querying the metadata database for the detrend image which matches the science image in hand. The details of this match may be modified by the recipe as well, allowing, for example, one recipe to apply only the best available flat-field images while another, for comparison, applies only an earlier flat. These types of options are needed for testing, and also to perform, e.g., the flat-field correction analysis. Also note that the detrend images may be loaded at a different level from the science images; it is likely that a detrend image is only defined for each Cell, while the science images may be loaded by Readout.

\item after each chip is processed, ppImage will optionally reconfigure the resulting pixels into a single contiguous array. This is the normal data source for the psphot function call used by ppImage. The reconfigured chip image at this stage may also be used to generate thumbnail and rebinned sample images at the chip level.

\item ppImage may also be used to reconfigure the pixels from a complete FPA into a single pixel array. For the GPC, this step would not normally be performed on the full data array, but would be used to generate rebinned sample images of the full mosaic (in either FITS or JPEG format). This step is used by Phase 3 to construct images for examining the state of the data processing, or by the detrend creation analysis to test the quality of the residuals.
\item the final stage is to output, at an appropriate depth, the chosen output data files and to send summary metadata to the metadata database. These output files may include FITS images, FITS thumbnails (very small rebinned images), FITS samples, JPEG images, object tables, and/or astrometry calibration files.
\end{itemize}

Except for the details of the analysis (detrending, etc), the details of the processing steps are identical for psphot, psastro, and ppMerge, as discussed below.

\subsection{ppMerge}

This program performs the job of stacking multiple images in which the pixel grid is kept the same for all of the input images. As part of the combination process, the input images may be scaled and shifted. This operation is the basis for the creation of all of the primary master detrend frames: bias, dark, flat, and fringe. For all of these cases, the output pixel values are determined by applying some global statistic to the collection of input pixels after the input pixels have been rescaled. The combination statistic may be any of several standard options, including mean, median, sigma-clipping, etc. The input files may be supplied to ppMerge using any of a command-line glob, a text-based list, or an XML list. The glob method forces the zero and scaling factors to have values of 0 and 1, respectively. The text file listing allows for zero and scaling values, but only a single value for each input file. The XML-format list is needed if subdivisions within a file (eg, cells) require independent zero and scaling factors. It is also necessary to specify more than one image file if, eg, a full FPA composed of separate Chip files is to be processed at once.
This analysis is performed using a structure very similar to that of ppImage: the input list of images is parsed and associated with the appropriate elements of the FPA structure, though in this case an array of FPA structures is used to carry the information. The elements of the FPA are looped over, as with ppImage, and the analysis is performed at the lowest level, after the input data pixels are read and rescaled. The output is written out after the images have been looped over.

\subsection{psphot}

The details of psphot are discussed in the psphot design document. Here we will just address some of the basic concepts. PSPhot may be used as a stand-alone program, or it may be called as a function within another program (such as ppImage). In the li \ldots

\section{Interfaces}

\ldots

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Software Runtime Configuration Issues}
\label{sec:RuntimeConfig}

The IPP Software requires extensive runtime configuration information. This includes default parameters for the analysis to be performed, descriptions of how a particular analysis is performed, locations of data sources, and so forth. The IPP may store this information in the Metadata Database or in configuration files available to the user. Both methods are implemented in the current design. In either method, the necessary parameters are identical. This section discusses the contents of specific portions of the runtime configuration.

\subsection{Camera Definition Information}

Every camera which may be analysed by the IPP has differences in how the data is represented. The IPP is built with the flexibility to handle data from many different cameras, not just the Pan-STARRS Gigapix cameras.
In order to facilitate the operation of the IPP with a variety of cameras, and to allow the software the flexibility to change the camera definition dynamically, the IPP includes a collection of software runtime configuration information which defines a given camera. This information is represented below in the form of the PSLib Metadata Config file, but may be stored in the Metadata Database or in an alternate format as appropriate.

A single camera is represented as a Focal Plane Array (FPA), divided into Chips, which are in turn divided into Cells. For a single FPA, all imaging data is stored in a FITS file or a collection of FITS files. Software needs to know where in a given file or set of files to find a particular Cell, what Cells to expect, what Chips to expect, the relationships between those entities, etc.

A single camera configuration file (or dataset) represents the description of a complete FPA. In the configuration file, any parameters which are specific to the complete FPA are placed on their own lines. These include the definitions of the keywords or database locations.
An incomplete example is given below.

\begin{verbatim}
NCELL        S32  NN
NCHIP        S32  NN
EXPTIME-SRC  STR  HD:EXPTIME   # need to specify PHU vs EXTNAME
EXPTIME-KEY  STR  EXPTIME
DATE-KEY     STR  DATE-OBS
DATE-FMT     STR  YYYY/MM/DD

TYPE     CELL  FILENAME           EXTNAME  CHIP     DATASEC       BIASSEC
CELL.nn  CELL  @ROOT@CELL         AMP00    CHIP.00  CF:[0,0:0,0]  HD:BIASSEC
CELL.01  CELL  @ID/@ID@CELL.fits  AMP01    CHIP.00  DB:???
\end{verbatim}

\subsection{Analysis Recipe Information}

In order to maintain flexibility in the analysis details, the IPP uses recipes to define how a particular analysis is implemented. Each major analysis script (eg, Phase 2) has its own recipe configuration information, which may be stored in the Metadata Database or in the form of a PSLib Metadata Config file. This configuration information includes all of the user-configurable parameters. Many of these may specify a specific value, or they may specify lookup methods (database locations or header locations). The specifics of each depend on the context. Below is an example recipe file for the bias subtraction portion of Phase 2, giving several alternative options for certain entries.
Note that, for example, the overscan subtraction may be specified as using a particular region given in the recipe file, or on the basis of a particular header keyword.

\begin{verbatim}
# BIAS:
BIAS.IMAGE STR NONE
BIAS.IMAGE STR FILE:bias.fits
BIAS.IMAGE STR DB:BEST
BIAS.IMAGE STR DB:CLOSE

BIAS.OVERSCAN STR HD:BIASSEC
BIAS.OVERSCAN STR CF:[0,16:0,2048]
BIAS.OVERSCAN STR NONE

BIAS.OVERSCAN.STATS STR MEDIAN
BIAS.OVERSCAN.STATS STR MEAN

BIAS.OVERSCAN.FIT STR SPLINE
BIAS.OVERSCAN.FIT.NPTS S32 5

BIAS.OVERSCAN.FIT STR POLYNOMIAL
BIAS.OVERSCAN.FIT.ORDER S32 3
BIAS.OVERSCAN.FIT.NBIN S32 5
\end{verbatim}

\section{I/O Code Autogeneration}
\label{sec:AutocodeIO}

The IPP includes a number of data collections which have multiple representations. A software tool will be used to automatically generate code to provide I/O APIs to read and write these data, and to define the data structures used to carry them within a program. Within the IPP, examples of these different data entities include database tables (ie, in the Metadata Database), FITS tables (to exchange bulk data), and XML (to exchange more complete datasets).

I/O API Autocode template (example.def):
\begin{verbatim}
Name Example
Table EXAMPLE
EXTNAME EXAMPLE

KEY XVALUE

# name     format   unit      comment
XVALUE     F32      pixels    "x coordinate"
BINNING    S32      fraction  "binning factor"
NAME       STR[32]  string    "description of entry"
\end{verbatim}

Running autocode on such a file would generate output header and C files \code{example.h, example.c} with the following structure and APIs:

\begin{verbatim}
typedef struct {
    psF32 XVALUE;    // x coordinate
    psS32 BINNING;   // binning factor
    char  NAME[32];  // description of entry
} Example;

psMetadata *psFITSTableInitExample ();
psExample  *psFITSTableLoadExample (char *filename, int *Nrows);
bool        psFITSTableSaveExample (char *filename);

psMetadata *psDatabaseTableInitExample ();
psExample  *psDatabaseTableLoadExample (char *filename, int *Nrows);
bool        psDatabaseTableSaveExample (char *filename);
psExample  *psDatabaseTableLoadExampleRow (char *filename, psF32 XVALUE);
\end{verbatim}

%\bibliographystyle{plain}
%\bibliography{panstarrs}

\input{glossary.tex}

\end{document}

------

* top-level routines
* re-org the Phase 4 stuff to discuss Magic
* astrometry calibration data formats
* analysis stages, versions and iterations
* output data products
