pstampRequest 5977 request was submitted through the web upload interaface on May 27. The request file has 2624 rows all of which are bycoord requests for chip stage data. bycoord requests are the slowest to parse. It takes on the order of 2 seconds per request row to process these.
The request started parsing at 13:38 HST May 27. We can see this because that is the time stamp on the request file. At 15:55 parsing is about half done. 4994 jobs have been queued so far and all but 200 are requests for cleaned up data. So the parsing will probably complete around 18:00
The following is a diary of me watching this request process. If some faults occur this may provide a debugging and repair tutorial.
# The program pstampdump can be used to list the contents of a fits table
(ippdb02:work/20100527/5977) bills% pstampdump -simple web_1695.fits > dumpreq.txt
(ippdb02:work/20100527/5977) bills% wc dumpreq.txt
2624 49856 256353 dumpreq.txt
# Each row has a ROWNUM which must be unique in the file.
(ippdb02:work/20100527/5977) bills% grep Collected psparse.5977.log | tail -n 1
Collected 1 rows beginning with row 100595. bycoord chip null null
# it's working on the 1486th row now
(ippdb02:work/20100527/5977) bills% grep -n 100595 dumpreq.txt
1486:100595 205.550874 28.351276 100 100 2 stamp 1 gpc1 bycoord chip null null null null y 0 0 search
# We're up to 5123 jobs now.
mysql> select count(*) from pstampJob where req_id >= 5977 and fault = 0 ;
+----------+
| count(*) |
+----------+
| 5123 |
+----------+
1 row in set (0.02 sec)
# All but 200 of these are for data that has been cleaned.
mysql> select count(*) from pstampJob where req_id >= 5977 and fault = 0 and dep_id = 0;
+----------+
| count(*) |
+----------+
| 200 |
+----------+
1 row in set (0.02 sec)
# where are we now?
(ippdb02:work/20100527/5977) bills% date ; grep -n `grep Collected psparse.5977.log | tail -n 1 | awk '{print $7}'` dumpreq.txt
Thu May 27 16:06:56 HST 2010
1665:100774 205.543583 28.359453 100 100 2 stamp 1 gpc1 bycoord chip null null null null y 0 0 search
1665 of 2564 64%
# There are lots of dependents pending. The poll limit in pstamp pantasks is 256 so 1/4 of these have run and have queued
their updates.
(ippdb02:work/20100527/5977) bills% pst -pendingdependent -simple | wc
1027 11297 94484
# No faults in the pstamp or update pantasks as yet. The update pantasks queue is full with chip processing and destreaking.
# looks like my 18:00 finish time estimate was pessimistic. It's 88% done at 16:42
(ippdb02:work/20100527/5977) bills% date ; grep -n `grep Collected psparse.5977.log | tail -n 1 | awk '{print $7}'` dumpreq.txt | awk '{print $1}'
Thu May 27 16:41:54 HST 2010
2264:140277
6892 jobs, 6778 in run state, 114 faulted
mysql> select fault, count(job_id) from pstampJob where req_id = 5977 and fault > 0 group by fault;
+-------+---------------+
| fault | count(job_id) |
+-------+---------------+
| 23 | 93 | PSTAMP_NO_IMAGE_MATCH center coordinate not contained in any "magicked" images (or perhaps wrong error code)
| 26 | 23 | PSTAMP_GONE these are probably from purged or cleaned up data
+-------+---------------+
2 rows in set (0.04 sec)
# 97% done at 17:00
# 100% done at 17:06
# Final tally
7691 jobs
7570 to run
4793 dependents (chip images to regenerate (update))
121 faulted
+-------+---------------+
| fault | count(job_id) |
+-------+---------------+
| 23 | 98 |
| 26 | 23 |
+-------+---------------+
11797 seconds to parse 3.27 hours
It's 18:14 now. 453 jobs are left to run. 381 pending dependents. As one would expect the rate of progress is limited by the updates.
If we added some nodes to the update pantasks it would run faster. If we ran the updates in the stdscience pantasks, jobs like this
would starve processing, so it looks like we may want to keep it separte
18:38: the last job finished. No faults either in update or pstamp job processing. No tutorial for error handling here.
The results file is getting built now
pantasks: status
Scheduler is running
Controller is running
Task Status
AV Name Nrun Njobs Ngood Nfail Ntime Command
++ pstamp.request.find 0 1707 1707 0 0 pstamp_queue_requests.pl
++ pstamp.request.load 0 3370 3370 0 0 pstamptool
++ pstamp.request.run 0 2 2 0 0 pstamp_parser_run.pl
++ request.finish.load 1 3387 3386 0 0 pstamptool
++ request.finish.run 1 2 1 0 0 request_finish.pl
++ pstamp.job.load 0 3362 3362 0 0 pstamptool
++ pstamp.job.run 0 7570 7570 0 0 pstamp_job_run.pl
++ pstamp.dependent.load 0 1707 1707 0 0 pstamptool
++ pstamp.dependent.run 0 170583 170583 0 0 pstamp_checkdependent.pl
++ pstamp.cleanup.load 0 3403 3403 0 0 pstamptool
++ pstamp.cleanup.run 0 0 0 0 0 (dynamic)
++ pstamp.job.revert 0 54 54 0 0 pstamptool
#### TASKSTATS for PSTAMP
pantasks: status -taskstats
Scheduler is running
Controller is running
Task Statistics
| alljobs | success | failure |
AV Name | Njobs Tmin Tave Tmax | Njobs Tmin Tave Tmax | Njobs Tmin Tave Tmax |
++ pstamp.request.find | 1713 1.00 2.10 27.57 | 1712 1.00 2.10 27.57 | 0 NONE 0.00 0.00 |
++ pstamp.request.load | 3380 1.00 1.90 27.56 | 3380 1.00 1.90 27.56 | 0 NONE 0.00 0.00 |
++ pstamp.request.run | 2 7.62 5900.59 11797.36 | 2 7.62 5902.49 11797.36 | 0 NONE 0.00 0.00 |
++ request.finish.load | 3398 1.00 1.90 27.56 | 3398 1.00 1.90 27.56 | 0 NONE 0.00 0.00 |
++ request.finish.run | 2 6.87 6.87 6.87 | 1 6.87 6.87 6.87 | 0 NONE 0.00 0.00 |
++ pstamp.job.load | 3373 1.00 2.05 27.55 | 3373 1.00 2.05 27.55 | 0 NONE 0.00 0.00 |
++ pstamp.job.run | 7570 2.13 5.31 65.36 | 7570 2.13 5.35 65.36 | 0 NONE 0.00 0.00 |
++ pstamp.dependent.load | 1712 1.00 3.49 44.44 | 1712 1.00 3.49 44.44 | 0 NONE 0.00 0.00 |
++ pstamp.dependent.run | 170583 0.27 0.65 5.62 | 170583 0.27 0.65 5.62 | 0 NONE 0.00 0.00 |
++ pstamp.cleanup.load | 3413 1.00 1.91 27.48 | 3412 1.00 1.91 27.48 | 0 NONE 0.00 0.00 |
++ pstamp.cleanup.run | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 |
++ pstamp.job.revert | 54 1.00 1.97 4.77 | 54 1.00 1.97 4.77 | 0 NONE 0.00 0.00 |
#### taskstats for UPDATE
pantasks: status -taskstats
Scheduler is running
Controller is running
Task Statistics
| alljobs | success | failure |
AV Name | Njobs Tmin Tave Tmax | Njobs Tmin Tave Tmax | Njobs Tmin Tave Tmax |
++ warp.exp.load | 62154 1.00 1.00 7.00 | 62154 1.00 1.00 7.00 | 0 NONE 0.00 0.00 |
++ warp.exp.run | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 |
++ warp.skycell.load | 62135 1.00 1.00 7.00 | 62135 1.00 1.00 7.00 | 0 NONE 0.00 0.00 |
++ warp.skycell.run | 285 18.54 43.84 127.16 | 276 18.54 44.54 127.16 | 9 26.41 28.60 33.43 |
++ warp.advancerun | 9467 1.00 1.12 3.00 | 9467 1.00 1.12 3.00 | 0 NONE 0.00 0.00 |
++ diff.skycell.load | 62182 1.00 1.00 4.96 | 62181 1.00 1.00 4.96 | 0 NONE 0.00 0.00 |
++ diff.skycell.run | 221 2.11 11.62 198.35 | 14 79.59 140.68 198.35 | 207 2.11 3.00 6.31 |
++ diff.advance | 9464 1.00 1.00 1.35 | 9464 1.00 1.00 1.35 | 0 NONE 0.00 0.00 |
-+ destreak.revert.load | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 |
-+ destreak.revert.run | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 |
-+ destreak.completed.revert | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 |
-+ destreak.cleanup.load | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 |
-+ destreak.cleanup.run | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 | 0 NONE 0.00 0.00 |
-+ warp.revert.warped | 84 60.00 60.00 60.00 | 84 60.00 60.00 60.00 | 0 NONE 0.00 0.00 |
-+ warp.revert.overlap | 86 60.00 60.00 60.01 | 86 60.00 60.00 60.01 | 0 NONE 0.00 0.00 |
-+ diff.revert | 1 60.00 60.00 60.00 | 1 60.00 60.00 60.00 | 0 NONE 0.00 0.00 |
++ destreak.load | 29455 1.00 1.01 5.10 | 29455 1.00 1.01 5.10 | 0 NONE 0.00 0.00 |
++ destreak.run | 6051 7.11 18.86 84.80 | 6051 7.11 18.92 84.80 | 0 NONE 0.00 0.00 |
++ destreak.advance | 9497 1.00 1.00 3.01 | 9497 1.00 1.00 3.01 | 0 NONE 0.00 0.00 |
++ chip.imfile.load | 62155 1.00 1.05 7.40 | 62155 1.00 1.05 7.40 | 0 NONE 0.00 0.00 |
++ chip.imfile.run | 5953 0.53 73.84 271.32 | 5891 34.75 74.51 271.32 | 62 0.53 44.20 97.26 |
++ chip.advanceexp | 9477 1.49 2.04 5.00 | 9477 1.49 2.04 5.00 | 0 NONE 0.00 0.00 |
-+ chip.revert | 87 60.00 60.00 60.01 | 87 60.00 60.00 60.01 | 0 NONE 0.00 0.00 |
18:48 All done. The finish task took 813 seconds. The optimization work I did on that a few weeks ago worked pretty well
The end
