Opened 18 years ago
Last modified 18 years ago
#1009 assigned defect
pcontrol crashes when node crashes
| Reported by: | Paul Price | Owned by: | eugene |
|---|---|---|---|
| Priority: | high | Milestone: | |
| Component: | PanTasks | Version: | unspecified |
| Severity: | major | Keywords: | |
| Cc: | | | |
Description
ipp003 went down (presumably the usual ethernet problem), which triggered a pcontrol crash (see below). Node crashes have been correlated with pcontrol crashes in my mind for some time, but I wanted to get this down for reference and tracking.
Note that when pcontrol goes down, it appears to come back up, but the state (in particular, the connections to the pclients) is not preserved, so that processing does not continue.
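For reference, signal 13 on Linux is SIGPIPE, which a process receives when it writes to a socket or pipe whose peer has closed; that is plausibly what happened here when pcontrol wrote to a pclient socket on the dead ipp003, and it matches the "got signal : 13" line in the transcript below. A minimal sketch of the usual defensive pattern follows (hypothetical code, not pcontrol's actual source): ignore SIGPIPE at startup so a dead peer surfaces as an EPIPE write error that can be handled per-connection instead of killing the controller.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical send routine for a controller talking to remote
 * clients.  With SIGPIPE ignored, a write() to a dead peer fails
 * with EPIPE instead of killing the whole process, so the caller
 * can mark the host DOWN and requeue its jobs. */
static int send_to_client(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;                 /* interrupted: retry */
            fprintf(stderr, "write failed: %s\n", strerror(errno));
            return -1;                    /* EPIPE etc.: peer is gone */
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

int main(void)
{
    /* Ignore SIGPIPE once at startup; a closed peer then surfaces as
     * an EPIPE return code rather than a fatal signal 13. */
    signal(SIGPIPE, SIG_IGN);

    /* Demo: fd 1 (stdout) is a live peer here, so this succeeds. */
    return send_to_client(1, "hello\n", 6) == 0 ? 0 : 1;
}
```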
<found out that ipp003 had gone down; the below happened soon after>
pantasks: controller status
job stack PENDING: 14 objects
0 anyhost 11 warp_skycell.pl --warp_id 21 --skycell_id skycell.0199092 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2ea
[...]
job stack BUSY: 6 objects
0 anyhost 11 warp_skycell.pl --warp_id 21 --skycell_id skycell.0199087 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2e9
1 anyhost 11 warp_skycell.pl --warp_id 21 --skycell_id skycell.0199085 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2e7
2 anyhost 11 warp_skycell.pl --warp_id 20 --skycell_id skycell.0198673 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2c9
3 anyhost 11 warp_skycell.pl --warp_id 21 --skycell_id skycell.0199086 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2e8
4 anyhost 11 warp_skycell.pl --warp_id 21 --skycell_id skycell.0199083 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2e5
5 anyhost 11 warp_skycell.pl --warp_id 20 --skycell_id skycell.0198671 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.2c7
job stack DONE: 0 objects
job stack KILL: 0 objects
job stack EXIT: 0 objects
job stack CRASH: 0 objects
host stack OFF: 0 objects
host stack DOWN: 0 objects
host stack IDLE: 0 objects
host stack BUSY: 6 objects
0 bucket00 0.0.0.4
1 ipp002 0.0.0.0
2 ipp003 0.0.0.1
3 bucket00 0.0.0.5
4 bucket00 0.0.0.3
5 ipp003 0.0.0.2
host stack DONE: 0 objects
pantasks: got signal : 13
controller is down
starting controller connection
Connected
pantasks: status
Scheduler is running
Controller is running
[...]
Jobs in Pantasks Queue
23: warp.skycell.run 720 warp_skycell.pl (790990)
23: warp.skycell.run 740 warp_skycell.pl (7ed900)
23: warp.skycell.run 28 warp_skycell.pl (80bed0)
[...]
jobs currently running remotely:
Njobs: 0
pantasks: controller status
job stack PENDING: 1 objects
0 anyhost 11 warp_skycell.pl --warp_id 22 --skycell_id skycell.0199092 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.0
job stack BUSY: 0 objects
job stack DONE: 0 objects
job stack KILL: 0 objects
job stack EXIT: 0 objects
job stack CRASH: 0 objects
host stack OFF: 0 objects
host stack DOWN: 0 objects
host stack IDLE: 0 objects
host stack BUSY: 0 objects
host stack DONE: 0 objects
Change History (2)
comment:1, 18 years ago
comment:2, 18 years ago
| Cc: | added |
|---|---|
| Status: | new → assigned |
As of 2008-08-04 (Monday morning meeting), it sounds like this may still be an issue.

Moreover, adding the nodes to the controller again does not re-initiate processing.
pantasks: controller host add ipp002
HostID: 0
pantasks: controller host add ipp003
HostID: 1
pantasks: controller host add bucket00
HostID: 2
pantasks: controller host add bucket00
HostID: 3
pantasks: controller host add bucket00
HostID: 4
pantasks: status
[...]
[...]
Njobs: 0
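The transcript above is consistent with a scheduler that only dispatches work by pairing a PENDING job with an IDLE host: re-adding hosts refills the host stack, but the jobs that were in flight when the old controller died never made it back to PENDING, so there is nothing to hand out. A hedged sketch of that matching loop (illustrative data structures only, not pcontrol's actual internals):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: work starts solely by pairing a PENDING job
 * with an IDLE host.  All names here are hypothetical. */
typedef struct node {
    char         name[64];
    struct node *next;
} node;

static node *pending = NULL;  /* job stack: lost if the controller restarts */
static node *idle    = NULL;  /* host stack: refilled by "controller host add" */

static void push(node **stack, const char *name)
{
    node *n = calloc(1, sizeof *n);
    strncpy(n->name, name, sizeof n->name - 1);
    n->next = *stack;
    *stack  = n;
}

static void dispatch(void)
{
    int started = 0;
    /* Jobs start only while BOTH stacks are non-empty. */
    while (pending && idle) {
        node *j = pending; pending = j->next;
        node *h = idle;    idle    = h->next;
        printf("start %s on %s\n", j->name, h->name);
        free(j); free(h);
        started++;
    }
    if (!started)
        printf("Njobs: 0  (no PENDING jobs: adding hosts cannot help)\n");
}

int main(void)
{
    /* After the crash: hosts were re-added, but the pending jobs died
     * with the old controller, so nothing is dispatched. */
    push(&idle, "ipp002");
    push(&idle, "bucket00");
    dispatch();
    return 0;
}
```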
Trying "halt" followed by "run" does not fix this.
There seems to be a single job running:
pantasks: controller status
job stack PENDING: 0 objects
job stack BUSY: 1 objects
0 anyhost 11 warp_skycell.pl --warp_id 22 --skycell_id skycell.0199093 --tess_id MOPS --camera MEGACAM --workdir path://MOPS/run1/ 0.0.0.1
job stack DONE: 0 objects
job stack KILL: 0 objects
job stack EXIT: 0 objects
job stack CRASH: 0 objects
host stack OFF: 0 objects
host stack DOWN: 0 objects
host stack IDLE: 4 objects
0 bucket00 0.0.0.2
1 bucket00 0.0.0.3
2 bucket00 0.0.0.4
3 ipp002 0.0.0.0
host stack BUSY: 1 objects
0 ipp003 0.0.0.1
host stack DONE: 0 objects
Doing a "warp.reset" does seem to fix the problem.