Opened 18 years ago
Last modified 18 years ago
#1009 assigned defect
pcontrol crashes when node crashes
| Reported by: | Paul Price | Owned by: | eugene |
|---|---|---|---|
| Priority: | high | Milestone: | |
| Component: | PanTasks | Version: | unspecified |
| Severity: | major | Keywords: | |
| Cc: | | | |
Description
ipp003 went down (presumably the usual ethernet problem), which triggered a pcontrol crash (see below). Node crashes have been correlated with pcontrol crashes in my mind for some time, but I wanted to get this down for reference and tracking.
Note that when pcontrol goes down, it appears to come back up, but the state (in particular, the connections to the pclients) is not preserved, so that processing does not continue.
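For reference, signal 13 on Linux is SIGPIPE, which a process receives when it writes to a socket or pipe whose peer has closed; that is plausibly what happened here when pcontrol wrote to a pclient socket on the dead ipp003, and it matches the "got signal : 13" line in the transcript below. A minimal sketch of the usual defensive pattern follows (hypothetical code, not pcontrol's actual source): ignore SIGPIPE at startup so a dead peer surfaces as an EPIPE write error that can be handled per-connection instead of killing the controller.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical send routine for a controller talking to remote
 * clients.  With SIGPIPE ignored, a write() to a dead peer fails
 * with EPIPE instead of killing the whole process, so the caller
 * can mark the host DOWN and requeue its jobs. */
static int send_to_client(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;                 /* interrupted: retry */
            fprintf(stderr, "write failed: %s\n", strerror(errno));
            return -1;                    /* EPIPE etc.: peer is gone */
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

int main(void)
{
    /* Ignore SIGPIPE once at startup; a closed peer then surfaces as
     * an EPIPE return code rather than a fatal signal 13. */
    signal(SIGPIPE, SIG_IGN);

    /* Demo: fd 1 (stdout) is a live peer here, so this succeeds. */
    return send_to_client(1, "hello\n", 6) == 0 ? 0 : 1;
}
```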
<found out that ipp003 had gone down; the below happened soon after>
pantasks: controller status
job stack PENDING: 14 objects
0 anyhost 11 warp_skycell.pl --warp_id 21 --skycell_id skycell.0199092 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2ea
[...]
job stack BUSY: 6 objects
0 anyhost 11 warp_skycell.pl --warp_id 21 --skycell_id skycell.0199087 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2e9
1 anyhost 11 warp_skycell.pl --warp_id 21 --skycell_id skycell.0199085 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2e7
2 anyhost 11 warp_skycell.pl --warp_id 20 --skycell_id skycell.0198673 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2c9
3 anyhost 11 warp_skycell.pl --warp_id 21 --skycell_id skycell.0199086 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2e8
4 anyhost 11 warp_skycell.pl --warp_id 21 --skycell_id skycell.0199083 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2e5
5 anyhost 11 warp_skycell.pl --warp_id 20 --skycell_id skycell.0198671 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2c7
job stack DONE: 0 objects
job stack KILL: 0 objects
job stack EXIT: 0 objects
job stack CRASH: 0 objects
host stack OFF: 0 objects
host stack DOWN: 0 objects
host stack IDLE: 0 objects
host stack BUSY: 6 objects
0 bucket00 0.0.0.4
1 ipp002 0.0.0.0
2 ipp003 0.0.0.1
3 bucket00 0.0.0.5
4 bucket00 0.0.0.3
5 ipp003 0.0.0.2
host stack DONE: 0 objects
pantasks: got signal : 13
controller is down
starting controller connection
Connected
pantasks: status
Scheduler is running
Controller is running
[...]
Jobs in Pantasks Queue
23: warp.skycell.run 720 warp_skycell.pl (790990)
23: warp.skycell.run 740 warp_skycell.pl (7ed900)
23: warp.skycell.run 28 warp_skycell.pl (80bed0)
[...]
jobs currently running remotely:
Njobs: 0
pantasks: controller status
job stack PENDING: 1 objects
0 anyhost 11 warp_skycell.pl --warp_id 22 --skycell_id skycell.0199092 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.0
job stack BUSY: 0 objects
job stack DONE: 0 objects
job stack KILL: 0 objects
job stack EXIT: 0 objects
job stack CRASH: 0 objects
host stack OFF: 0 objects
host stack DOWN: 0 objects
host stack IDLE: 0 objects
host stack BUSY: 0 objects
host stack DONE: 0 objects
Change History (2)
comment:1, 18 years ago
comment:2, 18 years ago
| Cc: | added |
|---|---|
| Status: | new → assigned |
As of 2008-08-04 (Monday morning meeting), it sounds like this may still be an issue.

Moreover, adding the nodes to the controller again does not re-initiate processing.
pantasks: controller host add ipp002
HostID: 0
pantasks: controller host add ipp003
HostID: 1
pantasks: controller host add bucket00
HostID: 2
pantasks: controller host add bucket00
HostID: 3
pantasks: controller host add bucket00
HostID: 4
pantasks: status
[...]
[...]
Njobs: 0
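The transcript above is consistent with a scheduler that only dispatches work by pairing a PENDING job with an IDLE host: re-adding hosts refills the host stack, but the jobs that were in flight when the old controller died never made it back to PENDING, so there is nothing to hand out. A hedged sketch of that matching loop (illustrative data structures only, not pcontrol's actual internals):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: work starts solely by pairing a PENDING job
 * with an IDLE host.  All names here are hypothetical. */
typedef struct node {
    char         name[64];
    struct node *next;
} node;

static node *pending = NULL;  /* job stack: lost if the controller restarts */
static node *idle    = NULL;  /* host stack: refilled by "controller host add" */

static void push(node **stack, const char *name)
{
    node *n = calloc(1, sizeof *n);
    strncpy(n->name, name, sizeof n->name - 1);
    n->next = *stack;
    *stack  = n;
}

static void dispatch(void)
{
    int started = 0;
    /* Jobs start only while BOTH stacks are non-empty. */
    while (pending && idle) {
        node *j = pending; pending = j->next;
        node *h = idle;    idle    = h->next;
        printf("start %s on %s\n", j->name, h->name);
        free(j); free(h);
        started++;
    }
    if (!started)
        printf("Njobs: 0  (no PENDING jobs: adding hosts cannot help)\n");
}

int main(void)
{
    /* After the crash: hosts were re-added, but the pending jobs died
     * with the old controller, so nothing is dispatched. */
    push(&idle, "ipp002");
    push(&idle, "bucket00");
    dispatch();
    return 0;
}
```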
Trying "halt" followed by "run" does not fix this.
There seems to be a single job running:
pantasks: controller status
job stack PENDING: 0 objects
job stack BUSY: 1 objects
0 anyhost 11 warp_skycell.pl --warp_id 22 --skycell_id skycell.0199093 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.1
job stack DONE: 0 objects
job stack KILL: 0 objects
job stack EXIT: 0 objects
job stack CRASH: 0 objects
host stack OFF: 0 objects
host stack DOWN: 0 objects
host stack IDLE: 4 objects
0 bucket00 0.0.0.2
1 bucket00 0.0.0.3
2 bucket00 0.0.0.4
3 ipp002 0.0.0.0
host stack BUSY: 1 objects
0 ipp003 0.0.0.1
host stack DONE: 0 objects
Doing a "warp.reset" does seem to fix the problem.