
Opened 18 years ago

Closed 16 years ago

#1084 closed enhancement (fixed)

pantasks dies with waitpid error

Reported by: jhoblitt
Owned by: eugene
Priority: high
Milestone:
Component: PanTasks
Version: unspecified
Severity: minor
Keywords:
Cc:

Description (last modified by eugene)

I'm unsure of what exactly triggered this. It happened some time during the night.

failure for: pzgetexp -uri http://conductor/ds/gpc1/index.txt -inst gpc1 -telescope ps1
job exit status: 1
job host: localhost
job dtime: 1.000053
waitpid error: mis-matched PID (8361 vs 0). programming error
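
One possible reading of the "(8361 vs 0)" message, offered as an assumption and not as a diagnosis of the pantasks source: a non-blocking waitpid() call (WNOHANG) returns 0 when the child exists but has not changed state yet, so comparing that return value directly against the expected PID reports a mismatch even though nothing is wrong. The Python sketch below illustrates that behaviour with the standard os.waitpid call on a Unix host. The names poll_child, wait_for_exit, and spawn_sleeper are hypothetical helpers for illustration and do not come from pantasks.

```python
import os
import sys
import time


def poll_child(pid):
    """Check a child without blocking.

    Returns None while the child is still running, otherwise its exit
    code (or the negated signal number if it was killed by a signal).
    """
    got_pid, status = os.waitpid(pid, os.WNOHANG)
    if got_pid == 0:
        # The child exists but has not changed state yet. This is a normal
        # "not finished" answer, not a PID mismatch.
        return None
    if got_pid != pid:
        raise RuntimeError(f"mis-matched PID ({pid} vs {got_pid})")
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


def wait_for_exit(pid, interval=0.05):
    """Poll until the child has exited, then return its exit code."""
    while True:
        code = poll_child(pid)
        if code is not None:
            return code
        time.sleep(interval)


def spawn_sleeper(seconds, exit_code):
    """Start a hypothetical stand-in job that sleeps and then exits."""
    script = f"import time, sys; time.sleep({seconds}); sys.exit({exit_code})"
    return os.spawnv(os.P_NOWAIT, sys.executable, [sys.executable, "-c", script])
```

Calling poll_child right after spawn_sleeper(1, 1) returns None while the job is still sleeping; wait_for_exit then returns the job's exit status once it finishes. Treating the 0 return as "still running" is the design point; whether pantasks hit this exact path is not established by the ticket.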

Attachments (1)

graph.php.png (14.0 KB ) - added by jhoblitt 18 years ago.
ganglia plot of network traffic: shows pantasks freezing up when a node dies


Change History (9)

comment:1 by jhoblitt, 18 years ago

I just had a number of pclients die on me for no reason and the controller was still showing them as busy. I suspect that this is the cause of this error message and I just didn't wait long enough before restarting pantasks.

comment:2 by jhoblitt, 18 years ago

ipp005 just went down on me... I started getting tons of controller errors:

controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (4 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (4 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (4 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (4 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (4 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (4 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (4 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)

And it still thought that ipp005 was working:

job stack DONE: 0 objects
job stack KILL: 0 objects
job stack EXIT: 0 objects
job stack CRASH: 0 objects
host stack OFF: 0 objects
host stack DOWN: 0 objects
host stack IDLE: 0 objects
host stack BUSY: 15 objects
0 ipp007 0.0.0.2
1 ipp009 0.0.0.3
2 ipp005 0.0.0.0
3 ipp012 0.0.0.6
4 ipp021 0.0.0.e
5 ipp019 0.0.0.c
6 ipp013 0.0.0.7
7 ipp016 0.0.0.9
8 ipp017 0.0.0.a
9 ipp020 0.0.0.d
10 ipp011 0.0.0.5
11 ipp010 0.0.0.4
12 ipp018 0.0.0.b
13 ipp015 0.0.0.8
14 ipp006 0.0.0.1
host stack DONE: 0 objects

I quickly removed the host from the controller and after a few minutes it hasn't crashed yet.

by jhoblitt, 18 years ago

Attachment: graph.php.png added

ganglia plot of network traffic: shows pantasks freezing up when a node dies

comment:3 by jhoblitt, 18 years ago

Hmm... the controller errors are continuing but at least it's continuing to launch new jobs after removing the dead host.

comment:4 by jhoblitt, 18 years ago

It died on me again last night:

controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (4 tries)
controller is not responding (5 tries)
controller is not responding (6 tries)
controller is not responding (7 tries)
controller is not responding (8 tries)
controller is not responding (9 tries)
controller still not responding, giving up
missing PID in pcontrol message : programming error
ControllerCommand returns: 0
ControllerCommand response:

comment:5 by jhoblitt, 18 years ago

List of core files that have piled up over the last couple of days:

-rw------- 1 jhoblitt users 26914816 Apr 16 17:49 core.pantasks.ipp004.1208404142.16884
-rw------- 1 jhoblitt users 26906624 Apr 16 17:55 core.pantasks.ipp004.1208404518.17139
-rw------- 1 jhoblitt users 416473088 Apr 14 11:28 core.pclient.ipp005.1208208468
-rw------- 1 jhoblitt users 1298432 Apr 16 17:50 core.pclient.ipp005.1208404205
-rw------- 1 jhoblitt users 1298432 Apr 16 17:55 core.pclient.ipp005.1208404543
-rw------- 1 jhoblitt users 1298432 Apr 16 17:58 core.pclient.ipp005.1208404692
-rw------- 1 jhoblitt users 1298432 Apr 16 17:59 core.pclient.ipp005.1208404764
-rw------- 1 jhoblitt users 35946496 Apr 17 07:33 core.pclient.ipp005.1208453630
-rw------- 1 jhoblitt users 1839104 Apr 17 10:56 core.pclient.ipp005.1208465760
-rw------- 1 jhoblitt users 3387392 Apr 17 11:29 core.pclient.ipp005.1208467749
-rw------- 1 jhoblitt users 7987200 Apr 17 16:10 core.pclient.ipp005.1208484628
-rw------- 1 jhoblitt users 9867264 Apr 17 20:09 core.pclient.ipp005.1208498961
-rw------- 1 jhoblitt users 416595968 Apr 14 11:28 core.pclient.ipp006.1208208503
-rw------- 1 jhoblitt users 1298432 Apr 16 17:50 core.pclient.ipp006.1208404204
-rw------- 1 jhoblitt users 1298432 Apr 16 17:55 core.pclient.ipp006.1208404543
-rw------- 1 jhoblitt users 1298432 Apr 16 17:58 core.pclient.ipp006.1208404691
-rw------- 1 jhoblitt users 1298432 Apr 16 17:59 core.pclient.ipp006.1208404764
-rw------- 1 jhoblitt users 35647488 Apr 17 07:33 core.pclient.ipp006.1208453630
-rw------- 1 jhoblitt users 1839104 Apr 17 10:56 core.pclient.ipp006.1208465760
-rw------- 1 jhoblitt users 12718080 Apr 17 20:09 core.pclient.ipp006.1208498967
-rw------- 1 jhoblitt users 416575488 Apr 14 11:28 core.pclient.ipp007.1208208492
-rw------- 1 jhoblitt users 1298432 Apr 16 17:50 core.pclient.ipp007.1208404204
-rw------- 1 jhoblitt users 1298432 Apr 16 17:55 core.pclient.ipp007.1208404543
-rw------- 1 jhoblitt users 1298432 Apr 16 17:58 core.pclient.ipp007.1208404691
-rw------- 1 jhoblitt users 1298432 Apr 16 17:59 core.pclient.ipp007.1208404764
-rw------- 1 jhoblitt users 36253696 Apr 17 07:33 core.pclient.ipp007.1208453629
-rw------- 1 jhoblitt users 1839104 Apr 17 10:56 core.pclient.ipp007.1208465760
-rw------- 1 jhoblitt users 12746752 Apr 17 20:09 core.pclient.ipp007.1208498967
-rw------- 1 jhoblitt users 317894656 Apr 14 11:28 core.pclient.ipp008.1208208484
-rw------- 1 jhoblitt users 317972480 Apr 14 11:28 core.pclient.ipp009.1208208471
-rw------- 1 jhoblitt users 1298432 Apr 16 17:50 core.pclient.ipp009.1208404204
-rw------- 1 jhoblitt users 1298432 Apr 16 17:55 core.pclient.ipp009.1208404543
-rw------- 1 jhoblitt users 1298432 Apr 16 17:58 core.pclient.ipp009.1208404691
-rw------- 1 jhoblitt users 1298432 Apr 16 17:59 core.pclient.ipp009.1208404764
-rw------- 1 jhoblitt users 35758080 Apr 17 07:33 core.pclient.ipp009.1208453628
-rw------- 1 jhoblitt users 1839104 Apr 17 10:56 core.pclient.ipp009.1208465760
-rw------- 1 jhoblitt users 8024064 Apr 17 16:10 core.pclient.ipp009.1208484628
-rw------- 1 jhoblitt users 12857344 Apr 17 20:09 core.pclient.ipp009.1208498966
-rw------- 1 jhoblitt users 416567296 Apr 14 11:27 core.pclient.ipp010.1208208459
-rw------- 1 jhoblitt users 1298432 Apr 16 17:50 core.pclient.ipp010.1208404204
-rw------- 1 jhoblitt users 1298432 Apr 16 17:55 core.pclient.ipp010.1208404543
-rw------- 1 jhoblitt users 1298432 Apr 16 17:58 core.pclient.ipp010.1208404691
-rw------- 1 jhoblitt users 1298432 Apr 16 17:59 core.pclient.ipp010.1208404764
-rw------- 1 jhoblitt users 35713024 Apr 17 07:33 core.pclient.ipp010.1208453627
-rw------- 1 jhoblitt users 1839104 Apr 17 10:56 core.pclient.ipp010.1208465760
-rw------- 1 jhoblitt users 9064448 Apr 17 16:10 core.pclient.ipp010.1208484627
-rw------- 1 jhoblitt users 12787712 Apr 17 20:09 core.pclient.ipp010.1208498965
-rw------- 1 jhoblitt users 413589504 Apr 14 11:27 core.pclient.ipp011.1208208449
-rw------- 1 jhoblitt users 1409024 Apr 16 17:50 core.pclient.ipp011.1208404204
-rw------- 1 jhoblitt users 1298432 Apr 16 17:55 core.pclient.ipp011.1208404543
-rw------- 1 jhoblitt users 1298432 Apr 16 17:58 core.pclient.ipp011.1208404691
-rw------- 1 jhoblitt users 1298432 Apr 16 17:59 core.pclient.ipp011.1208404764
-rw------- 1 jhoblitt users 35942400 Apr 17 07:33 core.pclient.ipp011.1208453626
-rw------- 1 jhoblitt users 1839104 Apr 17 10:56 core.pclient.ipp011.1208465760
-rw------- 1 jhoblitt users 12873728 Apr 17 20:09 core.pclient.ipp011.1208498965
-rw------- 1 jhoblitt users 36208640 Apr 17 07:33 core.pclient.ipp012.1208453626
-rw------- 1 jhoblitt users 1839104 Apr 17 10:56 core.pclient.ipp012.1208465760
-rw------- 1 jhoblitt users 7954432 Apr 17 16:10 core.pclient.ipp012.1208484627
-rw------- 1 jhoblitt users 12632064 Apr 17 20:09 core.pclient.ipp012.1208498964
-rw------- 1 jhoblitt users 36233216 Apr 17 07:33 core.pclient.ipp013.1208453625
-rw------- 1 jhoblitt users 1839104 Apr 17 10:56 core.pclient.ipp013.1208465760
-rw------- 1 jhoblitt users 8028160 Apr 17 16:10 core.pclient.ipp013.1208484627
-rw------- 1 jhoblitt users 12582912 Apr 17 20:09 core.pclient.ipp013.1208498963
-rw------- 1 jhoblitt users 36286464 Apr 17 07:33 core.pclient.ipp015.1208453624
-rw------- 1 jhoblitt users 1843200 Apr 17 10:56 core.pclient.ipp015.1208465760
-rw------- 1 jhoblitt users 12488704 Apr 17 20:09 core.pclient.ipp015.1208498963
-rw------- 1 jhoblitt users 413519872 Apr 14 11:27 core.pclient.ipp016.1208208439
-rw------- 1 jhoblitt users 1409024 Apr 16 17:50 core.pclient.ipp016.1208404204
-rw------- 1 jhoblitt users 1298432 Apr 16 17:55 core.pclient.ipp016.1208404543
-rw------- 1 jhoblitt users 1298432 Apr 16 17:58 core.pclient.ipp016.1208404691
-rw------- 1 jhoblitt users 1298432 Apr 16 17:59 core.pclient.ipp016.1208404764
-rw------- 1 jhoblitt users 36118528 Apr 17 07:33 core.pclient.ipp016.1208453623
-rw------- 1 jhoblitt users 1839104 Apr 17 10:56 core.pclient.ipp016.1208465760
-rw------- 1 jhoblitt users 7913472 Apr 17 16:10 core.pclient.ipp016.1208484627
-rw------- 1 jhoblitt users 12840960 Apr 17 20:09 core.pclient.ipp016.1208498963
-rw------- 1 jhoblitt users 36233216 Apr 17 07:33 core.pclient.ipp017.1208453622
-rw------- 1 jhoblitt users 1839104 Apr 17 10:56 core.pclient.ipp017.1208465760
-rw------- 1 jhoblitt users 3334144 Apr 17 11:29 core.pclient.ipp017.1208467749
-rw------- 1 jhoblitt users 12619776 Apr 17 20:09 core.pclient.ipp017.1208498963
-rw------- 1 jhoblitt users 36356096 Apr 17 07:33 core.pclient.ipp018.1208453621
-rw------- 1 jhoblitt users 1839104 Apr 17 10:56 core.pclient.ipp018.1208465760
-rw------- 1 jhoblitt users 12632064 Apr 17 20:09 core.pclient.ipp018.1208498962
-rw------- 1 jhoblitt users 36163584 Apr 17 07:33 core.pclient.ipp019.1208453621
-rw------- 1 jhoblitt users 1839104 Apr 17 10:56 core.pclient.ipp019.1208465760
-rw------- 1 jhoblitt users 12611584 Apr 17 20:09 core.pclient.ipp019.1208498962
-rw------- 1 jhoblitt users 413626368 Apr 14 11:27 core.pclient.ipp020.1208208426
-rw------- 1 jhoblitt users 1409024 Apr 16 17:50 core.pclient.ipp020.1208404204
-rw------- 1 jhoblitt users 1298432 Apr 16 17:55 core.pclient.ipp020.1208404543
-rw------- 1 jhoblitt users 1298432 Apr 16 17:58 core.pclient.ipp020.1208404691
-rw------- 1 jhoblitt users 1298432 Apr 16 17:59 core.pclient.ipp020.1208404764
-rw------- 1 jhoblitt users 36085760 Apr 17 07:33 core.pclient.ipp020.1208453620
-rw------- 1 jhoblitt users 1843200 Apr 17 10:56 core.pclient.ipp020.1208465760
-rw------- 1 jhoblitt users 12988416 Apr 17 20:09 core.pclient.ipp020.1208498962
-rw------- 1 jhoblitt users 1409024 Apr 16 17:50 core.pclient.ipp021.1208404204
-rw------- 1 jhoblitt users 1298432 Apr 16 17:55 core.pclient.ipp021.1208404543
-rw------- 1 jhoblitt users 1298432 Apr 16 17:58 core.pclient.ipp021.1208404691
-rw------- 1 jhoblitt users 1298432 Apr 16 17:59 core.pclient.ipp021.1208404764
-rw------- 1 jhoblitt users 36085760 Apr 17 07:33 core.pclient.ipp021.1208453619
-rw------- 1 jhoblitt users 1839104 Apr 17 10:56 core.pclient.ipp021.1208465760
-rw------- 1 jhoblitt users 7979008 Apr 17 16:10 core.pclient.ipp021.1208484626
-rw------- 1 jhoblitt users 12754944 Apr 17 20:09 core.pclient.ipp021.1208498961
-rw------- 1 jhoblitt users 149090304 Apr 14 11:27 core.pcontrol.ipp004.1208208425.12994
-rw------- 1 jhoblitt users 10416128 Apr 16 17:50 core.pcontrol.ipp004.1208404203.16892
-rw------- 1 jhoblitt users 10547200 Apr 16 17:55 core.pcontrol.ipp004.1208404543.17145
-rw------- 1 jhoblitt users 10420224 Apr 16 17:58 core.pcontrol.ipp004.1208404691.17374
-rw------- 1 jhoblitt users 9879552 Apr 16 17:59 core.pcontrol.ipp004.1208404764.17536
-rw------- 1 jhoblitt users 115744768 Apr 17 07:33 core.pcontrol.ipp004.1208453618.17645
-rw------- 1 jhoblitt users 11431936 Apr 17 11:29 core.pcontrol.ipp004.1208467749.13858
-rw------- 1 jhoblitt users 10551296 Apr 17 11:33 core.pcontrol.ipp004.1208467988.17995
-rw------- 1 jhoblitt users 10416128 Apr 17 14:44 core.pcontrol.ipp004.1208479474.5027
-rw------- 1 jhoblitt users 13430784 Apr 17 16:10 core.pcontrol.ipp004.1208484626.5897
-rw------- 1 jhoblitt users 38129664 Apr 17 20:09 core.pcontrol.ipp004.1208498960.18651
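
The core file names in the listing above look like core.PROGRAM.HOST.EPOCH with an optional trailing field, and the EPOCH values line up with the listing timestamps if the local clock is UTC-10 (for example, 1208404142 is 2008-04-17 03:49:02 UTC against a listed time of Apr 16 17:49). Reading the trailing field as a PID is an assumption. A small Python sketch, with hypothetical helper names, that groups core files by minute to show which processes went down together:

```python
from collections import defaultdict
from datetime import datetime, timezone


def parse_core_name(name):
    """Split 'core.PROGRAM.HOST.EPOCH[.PID]' into its fields.

    The meaning of the fields is inferred from the listing, not from
    any documentation: EPOCH is treated as Unix time, PID as optional.
    """
    parts = name.split(".")
    if len(parts) not in (4, 5) or parts[0] != "core":
        raise ValueError(f"unexpected core file name: {name!r}")
    program, host, epoch = parts[1], parts[2], int(parts[3])
    pid = int(parts[4]) if len(parts) == 5 else None
    return program, host, epoch, pid


def group_by_minute(names):
    """Map 'YYYY-MM-DD HH:MM' (UTC) to the (program, host) pairs dumped then."""
    groups = defaultdict(list)
    for name in names:
        program, host, epoch, _pid = parse_core_name(name)
        stamp = datetime.fromtimestamp(epoch, tz=timezone.utc)
        groups[stamp.strftime("%Y-%m-%d %H:%M")].append((program, host))
    return dict(groups)
```

Feeding it the file names from the listing should make the clusters easy to see, such as the pcontrol and pclient dumps that share the same minute.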

comment:6 by jhoblitt, 18 years ago

This time it didn't quite die, it just stopped doing anything:

controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (4 tries)
controller is not responding (5 tries)
controller is not responding (6 tries)
controller is not responding (7 tries)
controller is not responding (8 tries)
controller is down
starting controller connection
Connected

comment:7 by jhoblitt, 18 years ago

I've gotten an error similar to this about a dozen times so far this weekend.

controller is not responding (7 tries)
controller is not responding (8 tries)
controller is not responding (9 tries)
controller still not responding, giving up
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (4 tries)
controller is not responding (5 tries)
controller is not responding (6 tries)
controller is not responding (7 tries)
controller is not responding (8 tries)
controller is not responding (9 tries)
controller still not responding, giving up
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
controller is not responding (4 tries)
controller is not responding (5 tries)
controller is not responding (6 tries)
controller is not responding (7 tries)
controller is not responding (8 tries)
controller is not responding (9 tries)
controller still not responding, giving up
controller is not responding (0 tries)
controller is not responding (1 tries)
controller is not responding (2 tries)
controller is not responding (3 tries)
missing PID in pcontrol message : programming error
ControllerCommand returns: 1
ControllerCommand response: STATUS EXIT
EXITST 13
STDOUT 105
STDERR 695
DTIME 12.797896
HOSTNAME ipp010

comment:8 by eugene, 16 years ago

Description: modified (diff)
Resolution: fixed
Status: new → closed

I believe this issue has been addressed with various mods to pantasks.
