https://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/Production_Cluster_Status
2017 Production Node Work
February 22 (Haydn)
- ipp121 - removed the old /dev/sdc 480GB SSD which was flaky. Replaced it with /dev/sdd and modified it so that it would boot. Installed a 1TB HGST laptop drive as the mirror/backup and initialized it.
January 27 (Haydn)
- ippcore - installed a replacement (used) second power supply. Seems to be working OK.
- ipp121 - changed its BIOS settings so that it will boot without having to press F11.
January 23 (Haydn)
- ipp032 - replaced the motherboard using one from ipp053.
January 22 (Haydn)
- We lost power to everything connected to MRTC-B's PDU-2-B, which included ippcore and cabs 1, 2, 3, 4, 9, 12, 13, and 14.
- ippcore had one of its power supplies fail. I removed the failed one and will obtain a replacement.
- Three of our PDU's tripped a circuit breaker.
- ipp032 looks like it needs its motherboard replaced.
- ipp121 isn't recognizing its boot device.
January 5 (Haydn)
- ippx001-ippx049 - upgraded them to the 3.7.6 kernel.
- ippx029-ippx032 - replaced a failed 1400W power supply.
2016 Production Node Work
November 30 (Haydn)
- ipp083 - replaced RAID BBU.
- ipp122 - removed 10G card and installed it in vxlan-mhpcc2.
- picked up second N5K switch to send to Oahu.
November 10 (Haydn, Brad & Gavin)
- Got 10G connectivity with MRTC-A working again.
October 19 (Haydn & Donna)
- delivered ipp118-ipp122 to MRTC-B, and began setting them up.
August 26 (Haydn)
- vxlan-mhpcc - set it up and moved it underneath ipp068.
- removed two PDU's to send to the ITC.
- moved ippx073-ippx088 to the new switch.
August 24 (Haydn)
- installed two Cisco M-1600 units in the N5K, which should give us an additional 12 10G ports.
August 9 (Haydn & Laura)
- took inventory.
- replaced three failing disk drives in ipp008, ipp009, and ipp026.
May 13 (Haydn)
- ipp078 - rebooted it, because it had become unresponsive again. Ran the mirror script's commands by hand to verify that the mirror drive was in fact up to date. Swapped the SSD's. It seems to be working fine.
- cab22b-4948 - Gavin helped me debug the transceiver and fiber going to it. It is connected now.
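The by-hand mirror verification above boils down to "do the two copies differ?". A minimal sketch of that check, with placeholder paths -- the real /usr/local/mirror/mirror script's method isn't documented here, so `diff -rq` stands in for it:

```shell
#!/bin/sh
# Hypothetical mirror-freshness check; paths are placeholders, not the
# actual mount points on ipp078.
mirror_ok() {
    if diff -rq "$1" "$2" >/dev/null 2>&1; then
        echo "mirror up to date"
    else
        echo "mirror stale"
    fi
}
# demo with two throwaway directories
d1=$(mktemp -d); d2=$(mktemp -d)
echo data > "$d1/f"; cp "$d1/f" "$d2/f"
mirror_ok "$d1" "$d2"
rm -rf "$d1" "$d2"
```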
May 12 (Haydn)
- ippc01-ippc10, ippc12-ippc16 - decommissioned, unplugged from power. Moved ippc01 and ippc02 to ATRC to be disassembled and used for spare parts.
- ipp080 - fsck'ed both partitions on both SSD's. No errors. Swapped the SSD's, however that caused problems. Made a note to check the mirror script. Swapped the SSD's back.
- ipp078 - fsck'ed both partitions on both SSD's. No errors.
- ipp092 - noticed a power supply has failed. Waiting for more spares to arrive.
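The per-partition fsck pass above can be sketched as below. The device names (sda/sdb) and the two-partition layout are assumptions, and fsck should only be run on unmounted partitions:

```shell
#!/bin/sh
# Print the fsck commands for both partitions of both SSD's.
# Device names are assumed, not taken from the actual machines.
fsck_cmds() {
    for dev in sda sdb; do        # the two SSD's
        for part in 1 2; do       # both partitions on each
            echo "fsck -f /dev/${dev}${part}"
        done
    done
}
fsck_cmds    # prints the four commands; pipe to sh to run them
```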
May 4 (Haydn)
- ipp083 - replaced a broken 1200w power supply.
- ipp078 - relabeled its SSD's and connected them to the right cables, to make /usr/local/mirror/mirror work.
- ipp096 - relabeled its SSD's to make /usr/local/mirror/mirror work.
May 2 (Haydn)
- ippdb03 - replaced its charred motherboard with one scavenged from retired ippdb01. Also replaced the exit fan on its power supply, because the existing one's bearings were making horrible screeching sounds.
- ipp017 - removed its drives and transferred them to ipp033. Changed all of the documentation and settings, so that now the old, broken ipp017 is old-ipp017, and the old ipp033 is ipp017.
April 28 (Haydn)
- ippdb01 - installed the new machine.
April 13 (Haydn)
- ipp048 - found that JP7 was in the wrong position and it had BIOS POST error 0x36. Moved it from the 1.5v RAM (default) position to the 1.8v RAM position, and then it worked.
- ipp050 - found that it wouldn't work with CPU #1 in place. Removed it and applied fresh Arctic Silver to CPU #2.
- ippc05 - replaced /dev/sdb with another 500GB drive.
- ipp085 - replaced BBU with another, recently tested one. It is a little bit weak, but I hope it will work.
April 6 (Haydn)
- ipp085 - replaced BBU.
- ippdb02 - replaced 8 4GB sticks of RAM with 16 2GB sticks (to have the same 32GB RAM).
- ippdb01 - replaced 8 2GB sticks of RAM with 8 4GB sticks (to increase the RAM from 48GB to 64GB).
March 31 (Haydn)
- ipp008, ipp037, ipp012, and ipp013 - Fixed the tripped breaker which took them offline last night.
- ipp037 - tried to fix a broken drive sled, but we don't have any spares. We'll have a bunch when we retire the machines with RAID's with 1TB drives.
- ipp103 - replaced its BBU. Noticed it had a failed power supply and replaced it.
- ipp085 - rebooted it. Had Gavin update its LSI software. Next time I'll need to replace its BBU.
- ipp086 - rebooted it. Got it to do a relearn on its BBU, which may fix it.
- ipp055, ipp056, ipp063, and ipp078 - replaced 4 failing hard disks.
- ipp039 - noticed it had two failed power supplies. It turns out that APC power socket doesn't work. Moved those two power supplies to a different socket and changed the APC's settings to reflect it.
- ippx045-ippx048 - noticed it had a failed power supply. Replaced it.
- ipp097 - replaced BBU.
February 25 (Haydn)
- ipp087 - replaced its BBU cable. Now it is working.
- ipp066, ipp063, ipp005 - replaced failing drives.
- ipp005 - noticed that its drives are not correctly labeled and stuck a sticker on the machine to warn us in the future!
February 18 (Haydn)
- ipp017 - replaced its single CPU motherboard with an eBay one which works with 2 CPU's.
- ipp087 - replaced its BBU, but accidentally broke a connector on the cable.
February 10 (Haydn)
- ippdb03 - corrected a misconfiguration in the ACS16, which caused me to accidentally remove power from it last night. Gavin fixed the fsck problems today, mounted the 2TB MySQL RAID, and started it rebuilding. It should work now.
- ippc15 - unplugged it, so that it will stay off forever.
February 9 (Haydn)
- ippdb03 - moved its disks to ippc15 (and decommissioned ippc15). Now it should work reliably.
February 8 (Haydn)
- ippdb03 - replaced the motherboard, SATA cables and 500GB drives, rebuilt Linux, but it still doesn't work reliably. Probably the drive backplane is causing the problem.
February 5 (Haydn)
- ippdb03 - tried a number of tricks to resuscitate /dev/sdb and boot from it, but none worked. Finally replaced /dev/sda and /dev/sdb with tested 500GB drives, and reinstalled Linux from scratch. However, it is taking too long to rebuild the new RAID, so I don't have much confidence it is going to work.
February 4 (Haydn)
- ippdb03 - replaced failed 500GB /dev/sda.
February 3 (Haydn)
- ipp101, ipp103, ipp104 - installed 10G cables. Gavin is working on making them work.
- ippdb03 - replaced failing 500GB /dev/sdb.
February 2 (Haydn)
- ipp102 - Gavin enabled 10G for it.
- ipp103 - installed Linux on it, but it needs a 10G cable.
- ipp104 - installed Linux on it, but it needs a 10G cable.
February 1 (Haydn)
- ippx052 - turned it back on, though it again turned itself off overnight. We need to replace this board.
- ipp102 - installed Linux on it, but it needs 10G.
- ipp100 - Gavin enabled 10G for it.
January 29 (Haydn)
- ipp100 - installed Linux on it, but it needs 10G.
- ipp101 - installed Linux on it, but it needs a 10G cable.
January 28 (Haydn)
- ippx069-ippx072 -- reassembled the machine and powered it up.
- ippx071 - installed Linux on it (for the first time).
January 27 (Haydn)
- ippx052 - replaced the motherboard -- the old one stopped working and wouldn't stay on long enough to boot.
- ippx070 - replaced the heat sink compound on both of the CPU's, because it was reported to run slowly.
- ippx071 - replaced the motherboard -- the old one never worked.
January 26 (Haydn)
- ipp017 - froze last night. I tried running it today, and it froze within 5 minutes. Moved the CPU from slot 2 to slot 1 and it seems to be working ok.
- ipp039 - has two power supplies which don't show green idiot lights and their fans don't appear to be moving. Tried replacing them with known good power supplies, but the same problem remains. This machine is likely to have its power supply chassis fail soon!
- ippx052 - froze. Its power button is off. Turned it on twice, but each time it turned itself off before it could display anything on the screen.
January 25 (Haydn & Sifan)
- ipp098-ipp102 - installed all of their 6TB drives.
January 22 (Haydn & Sifan)
- ipp098-ipp102 - delivered them to MRTC-B and mounted them on rails in the racks.
January 16 (Haydn)
- ipp028 - froze yesterday. Removing CPU#1 fixed the problem.
2015 Production Node Work
December 23 (Haydn)
- ipp046 - it froze last night, Eugene turned it off. Today I restarted it and it worked fine. We don't have any known good replacement S5397 motherboards left, so I removed CPU #2 and started it back up, but it wouldn't work. Replaced CPU #2 and removed CPU #1 and it booted back up. We'll have to see if this is a good fix or not. (All of the replacement motherboards are either questionable, or used to work with only 1 CPU.)
- ipp008 - proactively replaced drive #7.
- ipp035 - proactively replaced drive #16.
- ipp004 - proactively replaced drive #24.
December 7 (Haydn)
- ipp046 - today it worked as soon as I turned it on. I'll keep an eye on it for the future.
- ippc02 - power supply had failed. Replaced it with our last spare one.
November 30 (Haydn)
- ippc02 - replaced failed sdb.
- ipp017 - replaced motherboard with a recycled one which only works with 1 CPU, because the old one may not have been reliable. Couldn't get eth0 to work on the new motherboard.
October 27 (Haydn & Sifan)
- traced many of the 10G fiber cables to document them for Ken.
- ps1dsa2 - replaced the RAID card and brought it back online.
- ipp014 - proactively replaced drive #11.
October 14 (Haydn)
- ipp039 - replaced the RAID controller by borrowing one from ps1dsa2.
- ipp017 - found a piece of plastic had fallen inside, and may have been blocking airflow. Removed CPU1 and replaced CPU2.
September 25 (Haydn)
- a test of the fire alarm system this morning turned off all power at MRTC-B. When they restored power, for some reason the breaker for our circuits got tripped, and our machines were left off. I discovered this as soon as I got on site. They fixed it and the entire cluster got turned on all at once. All of the hardware seems to be ok.
- ippdb01 - proactively replaced drive #21.
- ippdb02 - proactively replaced drive #19.
- ipp044 - drive #4 failed for the fourth time. Last time I used a 1TB drive which I'd tested 100%. This time I used a 1.5TB drive which I'd also tested 100%. We'll see how long this one lasts.
August 3 (Haydn)
- delivered a bunch of stsci boxes with Matt's help.
- replaced the BBU and BBU cable in ipp082 to make its RAID BBU work.
July 23 (Haydn)
- ippb06 - replaced RAID controller C and that fixed it!
- stsci - took inventory of all of stsci's spares -- everything is present and accounted for.
July 20 (Haydn)
- ippb06 - the third LSI 9280-24i4e RAID controller has failed and is invisible to the OS. Patrick will send us a replacement RAID controller for free ASAP. In the meantime, only the first two volumes are accessible.
- ippdb02 - the first CPU socket has stopped working. Brought it back online with only the second CPU socket filled.
July 16 (Haydn)
- ipp078 - SSD #1 went offline, which crashed it. The PDU was incorrectly configured for power cycling this machine. Fixed the SSD by relocating it to take the strain off of the cables.
- ipp096 - got it setup. Had to physically swap the SATA cables for the SSD's in order to get it to work!
July 6 (Haydn)
- ipp026 - replaced the motherboard. The old one would only work with 1 CPU and its Ethernet wouldn't work. The new one works with 1 CPU and working Ethernet.
- ipp038 - replaced the motherboard. The old one was dead. The new one works with 1 CPU.
- bought two new replacement Supermicro 1200W power supplies from eBay.
- brought five old Tyan S5397 motherboards, all of which when last used would only work with 1 CPU, so we've got three more spares.
July 2 (Haydn)
- ipp082 - Gavin identified that the RAID card wasn't working. Replacing it fixed it.
- ippc52 - had a problem with fsck'ing the disk. Gavin fixed it.
- ipp026 - wouldn't work with two CPU's, so removed one. Unfortunately the Ethernet stopped working. I'll have to replace the motherboard on Monday.
- ipp034 - removed one CPU and now it is working.
- ipp038 - took turns removing both CPU's, but it wouldn't boot. I'll have to replace the motherboard on Monday.
- ippdb02 - still dead.
- ippdb06 - an SSD has an ECC-ERROR. Set it to use one of the spare SSD's. It is rebuilding right now.
July 1 (Haydn)
- Potential power outage at MRTC-B at 9am this morning, so we shut down the cluster to avoid it.
- Power restored just before noon.
- ippdb02, ipp026, ipp034, ipp038 all wouldn't come back up. I'm guessing that their motherboards have failed. Tomorrow I'll come back and see if I can get them to run with one processor.
- ippc16 - tried to replace the board in the front which beeps incessantly, but the replacement I found wasn't a match.
- ipp097 - tried to replace the clunky rails it has with some new, better ones, but that case will only work with the clunky rails.
June 25 (Haydn)
- ipp096 - tried to set it up, but had several difficulties which prevented it from working.
- ippc43 - replaced a failing 3TB drive.
- ippdb01 - replaced a failing 300GB drive.
- delivered two spare used PDU's.
- met with Thomas (the electrician) about the power outage on 7/1.
June 11 (Haydn)
- all of the machines in cab2 have been unreliable for the past two days. The cab2 PDU smelled of burned electronics. Replacing it and configuring the new one fixed the problem.
May 12 (Haydn)
- ipp029 - replaced the motherboard with the original one, but removed the second CPU. We are out of NOS Tyan S5397 motherboards.
May 11 (Haydn)
- ipp029 - replaced the motherboard -- the old one would only work with a single CPU in the first socket. The second motherboard was an old one which had come from ippc06, but only worked for a few minutes. The third motherboard was a new-old-stock one from eBay. I may have accidentally damaged a power connector, but we'll see how long this one lasts. The old motherboard was an NOS eBay one from October 2013, so we got 18 months out of it.
May 8 (Haydn)
- ippc18 - figured out how to get it to put the RAID volumes in the right order, and now it boots. It coredumps when it tries to automount the NFS shares, though.
May 7 (Haydn)
- ippc18 - made a backup of the 1TB RAID, so that I can experiment with less fear tomorrow.
May 5 (Haydn)
- ippc13 - wouldn't boot because there is some sort of CMOS BIOS error and it was waiting for F1 to be pressed on the keyboard. I replaced the CMOS battery (the old one still had 3.2v), cleared the CMOS BIOS settings and re-entered them, but the problem remains. We cannot afford to replace this motherboard at the moment, so if this machine gets power cycled (and possibly merely rebooted) it will need to have a keyboard plugged into it and F1 pressed for it to come up.
- ippc18 - Gavin did most of the work, trying to get it to boot from the RAID, but no luck so far.
- ipptempmon - updated Linux, made a backup copy of it.
May 4 (Haydn)
- ippc18 - replaced the two failed drives, deleted the u0 RAID volume, recreated the RAID volume, and restored the contents from the better of the two failed drives. Couldn't get it to boot, though.
May 2 (Haydn)
- ippc18 - had a double drive RAID1 failure. Was not able to rebuild it.
- ippc19 - replaced a failed RAID1 drive and rebuilt it.
April 14 (Haydn)
- ippx030 - replaced /dev/sda, which had failed on Mon.
April 10 (Haydn)
- cab23pdu0 - failed last night. Replaced it with a spare and reconfigured it.
- ipp056 - checked on drive #18, which finished rebuilding this morning.
- ipp069 - proactively replaced drive #0, which reported ~60 media errors, up from ~50 last week.
- stsci12 - replaced A:7, which failed last week.
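The media-error trend watching above (e.g. ~60 errors, up from ~50) comes from SMART attributes. A sketch of pulling the reallocated-sector count out of `smartctl -A` output -- the device path is hypothetical, and drives behind the 3ware RAIDs would need the `-d 3ware,N` option:

```shell
#!/bin/sh
# Extract the raw Reallocated_Sector_Ct value from `smartctl -A` output
# supplied on stdin. The sample line below is a typical smartctl
# attribute row, not captured from these machines.
realloc_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}
# In practice the input comes from e.g.: smartctl -A /dev/sda
printf '%s\n' \
  '  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       434' \
  | realloc_count
```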
April 6 (Haydn)
- stsci07 - replaced B:12, which had failed.
- ipp017 - replaced the motherboard with a new one from eBay. The CPU power conversion area had caused a small fire.
- ipp088 - checked on the power supplies, but they're both green. There were some warning messages from them over the weekend.
April 1 (Haydn & Sifan)
- ipp096 - temporarily moved it back to ATRC in order to backup and rebuild one of the RAID volumes in ippb01.
March 25 (Haydn)
- ippdb08 - replaced the Intel 10G card with a known good, used one from the ipp036 experiment. Rsync is working fine now.
- ippdb06 - moved its 10G connection from ippcore to the N5K's port #33.
- ipp094 - replaced its RAID battery with an iBBU08 from eBay.
March 24 (Haydn)
- ippdb08 - replaced the RAID controller and battery, moved the Intel 10G card, and reinstalled Linux. Same problem with rsync continued.
March 20 (Haydn)
- ippdb08 - installed and set it up.
March 17 (Haydn)
- ipp094 - replaced the RAID controller and battery. Also found out that an iBBU08 is what LSI recommends for this board, so that will be my next fix. Currently the RAID battery is charging.
- ipp077 - opened it up to replace the SATA/power cable for the second SSD, however the cables I got won't fit. I rerouted the existing cables and restarted it, and the second SSD was online for a change, so I re-enabled mirroring to it.
- ippc19 - replaced the failed RAID battery. The new one is charging.
- stare - Paul Sydney dropped off a NAS box, however it is currently setup for DHCP. I'll have to take it back to ATRC to reconfigure it.
March 2 (Haydn)
- ipptempmon - reinstalled it and fixed the problem preventing the temperature sensors from working.
- ippc06 - pulled it and confirmed that the motherboard has failed.
February 25 (Haydn)
- ippc06 - a drive failed, and I can't get it to see either of the disk drives anymore.
- ipp094 - replaced a bad iBBU07 battery from eBay with another one from eBay. Hopefully this will work.
- ipp086 - replaced a bad iBBU05 battery with another one from eBay.
- ipp077 - opened it up to photograph the cables for the SSD's. They're unusual, and the ones I brought are not a drop in replacement.
- ipptempmon - am taking it back to ATRC to work on it. For some reason the temperature sensors are no longer visible to the OS.
- ps1dsa2 - began setting it up.
February 18 (Haydn)
- ipp044 - transplanted the RAID controller from ipp051 into it. LSI said there should be no problem. Reinstalled half of its missing RAM, its missing CPU, a new CPU cooler, and half of the RAM from ipp051, so it now has 32GB again. Couldn't get it to boot. Gavin will take a look.
- cab15con -- tried to set it up, but need Gavin's expertise.
February 13 (Haydn)
- ipp044 - successfully ran MemTest86+ for 23 hours. Gavin fixed the software problems with the RAID, so now it is back online. If it freezes soon, we should next replace the RAID card with the one from ipp051.
- ipp051 - took it offline. Once ipp044 is stable (or we have a spare RAID card in hand), then it will become ps1dsa2.
- ipp094 - replaced the RAID battery with one from eBay.
February 12 (Haydn)
- ipp044 - replaced its CPU, removed half the RAM leaving 8GB, and replaced all four power supply modules. Tried to replace the power supply chassis, but the 4 replacements we have all have different wiring. After replacing the CPU, etc. it wouldn't boot the RAID and quickly froze up. Replaced the RAM with the other half (8GB) and left it running MemTest86+. If it can run MemTest86+ overnight, then we know that the current CPU, RAM, motherboard, and power supplies are all ok, and that leaves the RAID card to be suspicious of.
February 11 (Haydn)
- ippx044 - the previous new WD20EURX green drive failed before it finished, so I replaced it today with a non-green drive.
- ipp044 - worked with Gavin and fixed its Ethernet problems. It stayed up for an hour and 20 minutes and froze again. I'll work on it again tomorrow.
February 9 (Haydn)
- ipp044 - couldn't get any of its Ethernet interfaces to work, and the machine freezes after a few minutes of being on. Tried a new motherboard, removing the second CPU, and removing the possibly bad RAM. Both problems remain. I'll try replacing the power supply chassis next -- some of the wires may not be making good contact, and I'll also try using the other CPU.
- ippx044 - replaced its other 2TB drive and rebuilt its RAID.
- ipp096 - wired and cabled it, however Sifan needs us to save its contents until ps1scn is back online and we have a copy of the DSA.
- ipp097 - wired and cabled it and installed Gentoo Linux. Waiting for 10G before releasing it.
February 6 (Haydn)
- ipp044 - did a bunch of debugging to determine that the spare power supply chassis was good, however one of the spare power supply modules wasn't. Installed the second CPU with an Intel cooler, cleaned the goo off of one of the memory modules and got it working again!
February 4 (Haydn)
- ipp077 - inspected the wiring for SSD #2, because it had gone offline. I may have fixed a loose connection in the SATA connection at the motherboard, but we'll see. For now it is working again.
- ipp083 - replaced the RAID battery with one from eBay.
- ippx044 - replaced /dev/sda with a "green" drive. Next I have to replace the drive labeled /dev/sdb.
February 3 (Haydn)
- ipp094 - installed redundant power supply.
- delivered a bunch of spare parts and ipp095 & ipp096, but didn't have time to set them up.
January 30 (Haydn)
- ipp027 - stuck with POST code "FF". Replaced motherboard. The motherboard which failed wasn't an original one, but one which had been replaced in Oct 2013. I'm not sure where it came from.
- ipp092 - replaced RAID battery with one from eBay.
- ipp094 - replaced RAID battery with one from another RAID card from Unix Surplus. Uses iBBU07, not iBBU05.
- ipp084 - replaced RAID battery with one from eBay, but running MegaCli64 always coredumps. Maybe we should try running a more recent version of the software?
- ipp083 - rebooted it and initiated a RAID battery relearn.
- ipp077 - replaced /dev/sdc 300GB SSD with a new one and rebuilt it.
- ippx070 - replaced /dev/sda with a 500GB HD and rebuilt it.
- ippx085 - replaced /dev/sda with a 500GB HD and rebuilt it.
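The RAID battery status checks and manual relearns logged here go through MegaCli64 on the LSI cards. A dry-run sketch that just prints the two commands involved, since running them needs the actual controller; adapter 0 is an assumption:

```shell
#!/bin/sh
# Print (don't run) the MegaCli64 BBU commands; pipe to sh on a machine
# with an LSI controller to execute them.
bbu_cmds() {
    echo "MegaCli64 -AdpBbuCmd -GetBbuStatus -a0"   # charge state, health
    echo "MegaCli64 -AdpBbuCmd -BbuLearn -a0"       # start a manual relearn
}
bbu_cmds
```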
January 28 (Haydn)
- ipp044 - removed the charred motherboard. Cut the old power supply chassis out. Cleaned everything as well as I could. Unfortunately, the "new" power supply chassis does not appear to work -- as soon as power is applied it gives a shrill beep and turns off the voltages.
January 27 (Haydn)
- ipp044 - opened it up -- it was the cause of the acrid smoke observed here last night, even though the CPU's stopped working about 24 hours earlier. I'll need to get a respirator and I'll need to work on this machine outside (so that it doesn't accidentally trip the fire alarms in here). I'll return tomorrow. I examined one spare power supply I'd purchased on eBay, and I think I can adapt it to work on this chassis.
January 22 (Haydn)
- prepared a second fiber cable for bonding the connection between ippcore and ipp6509-e.
- ippb06 - moved it from ipp6509-e slot 2 port 3 to slot 1 port 4.
- worked with Gavin to connect ippcore slot 12 ports 3 & 4 and ipp6509-e slot 2 ports 3 & 4.
- ipp092 - replaced its RAID battery. Will need to do a manual relearn when it finishes charging.
- ipp064 - got stuck. Rebooted it and it booted from the CD. Rebooted it again and it booted from the RAID. Currently it is 71% done rebuilding the RAID.
  MegaCli64 -PDRbld -ShowProg -PhysDrv [245:10] -a0 -NoLog
  When it finishes, we need to replace drive #0.
- ipp094 - replaced the RAID backplane and now it works. I'll work on setting it up.
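The percent-complete can be pulled out of that rebuild-progress output for scripted polling. A sketch, with the sample line based on typical MegaCli output (the exact wording varies by MegaCli version):

```shell
#!/bin/sh
# Extract "NN" from a "... Completed NN% ..." rebuild-progress line.
rebuild_pct() {
    sed -n 's/.*Completed \([0-9][0-9]*\)%.*/\1/p'
}
# In practice the input comes from:
#   MegaCli64 -PDRbld -ShowProg -PhysDrv [245:10] -a0 -NoLog
echo "Rebuild Progress on Device at Enclosure 245, Slot 10 Completed 71% in 200 Minutes." | rebuild_pct
```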
January 21 (Haydn)
- ippdb01 - replaced drive #10. The new one is now a hot spare:
  tw_cli /c5 add type=spare disk=10
- ipp064 - replaced a failing 2TB drive.
- removed old fiber cables for ipp071, etc.
- moved quad 10G card from ipp6509-e to ippcore.
- prepared new fiber cables for tomorrow.
January 15 (Haydn)
- ipp092 - manual RAID battery relearn.
- ipp084 - manual RAID battery relearn.
- ippdb01 - removed drive in p23 and discovered the drives were temporarily mislabeled! Replaced drives 20 and 23 with new spares. Power cycled it to get it to recognize the new drives. Replaced drive 10, which had 434 reallocated sectors. Now waiting for RAID to finish rebuilding, before power cycling it to get it to recognize drive 10. Command to mark a spare is:
  tw_cli /c5 add type=spare disk=N
- ipp083 - power cycled it to get the RAID battery to correctly report its status, then it asked me to do a battery relearn, which I did.
- ipp086 - power cycled it to get the RAID battery to correctly report its status, then it started doing a battery relearn.
- ippc14 - replaced fried motherboard.
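A tiny dry-run wrapper for the tw_cli spare command noted above; controller /c5 comes from the log, the port number is a parameter, and the echo keeps it from executing (remove it on a machine with the 3ware CLI installed):

```shell
#!/bin/sh
# Print the tw_cli command to mark the drive on port $1 as a hot spare.
mark_spare() {
    echo "tw_cli /c5 add type=spare disk=$1"
}
mark_spare 10
```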
January 14 (Haydn)
- ipp091 - installed new RAID battery.
- ipp092 - installed new RAID battery.
- cab23 4948-10GE - added extra 10GE transceiver for ippx001-ippx036.
- cab22 4948-10GE - added extra 10GE transceiver for ippx045-ippx088.
2014 Production Node Work
November 14 (Haydn, Rita)
- ipp053 - replaced motherboard.
- ippx009-ippx015 - installed.
- Rita - labeling and cabling.
November 13 (Haydn, Rita, Donna)
- Haydn & Donna racked 14 2U compute nodes, and 7 4U storage nodes.
- Rita - labeling.
- Tried to fix stsci03. No luck. It seems like the power distribution box is not working. The power supplies are ok, but nothing comes out of the power distribution box. Contacted Supermicro, Source Support, and Aspen Systems for spare parts. Should arrive on Tues.
- ipp052 - replaced motherboard, because it wouldn't boot.
- ippx008 - installed.
- Donna - installing drives.
November 12 (Haydn, Rita)
- ippx001, and ippx003-ippx007 - installed.
- cooling outage at 5pm. Brought everything down, except for ippc18 and ippc19.
- stsci03 wouldn't come back on, though the power supplies are fine.
- ipp052 wouldn't boot. It stops after printing "Initializing BMC".
November 10 (Haydn)
- ippx002 is now up.
November 7 (Haydn)
- Trying to get ippx001 and ippx002 installed.
November 6 (Haydn)
- Trying to get ippx001 and ippx002 installed.
November 5 (Haydn)
- Installing ACS48 and Cisco 4948-10GE for cab23 and running cables for them.
November 4 (Rita)
- installed the rest of the drives in 11 of the 12 new machines.
November 3 (Haydn, Donna, Rita)
- Haydn & Donna emptied out two old racks.
- Haydn & Donna racked 12 2U machines and a Cisco 4948-10GE in the furthest rack (#23?)
- Rita installed some of the disk drives in two machines.
- Rita correctly re-labeled some of the drives in ipp072-ipp081.
October 31 (Haydn)
- ippdb05 and ippdb06 - re-setup ippdb06 from scratch and setup ippdb05.
October 29 (Haydn)
- ippdb06 - got it partially setup.
October 28 (Haydn)
- ippdb05 and ippdb06 - unpacked and racked them.
October 6 (Haydn)
- stsci18 - replaced motherboard to fix the problem where it would lose its BIOS settings when turned off.
- ipp040 - examined p2, which isn't part of the RAID. Replaced the drive in it with a known good one. Still didn't work. That port is probably broken.
- ipp072-082 - Gavin switched them over to use 10GE and rebooted them.
October 3 (Haydn)
- took inventory of all of the cables, fiber, serial adapters, PDU's, Ethernet switches, serial consoles, etc. for the coming builds.
- ipp053 - replaced a broken drive.
October 1 (Haydn)
- ipp013 - lost its BIOS settings again. Replaced the motherboard -- it was a Tyan S7010.
- ipp6509-e - replaced transceiver in s2p1, which was broken. Now the 4948-10G in c15u40 has dual links working again.
- ipp6509-e - removed the transceivers in s1p3 and s2p4, which used to go to 4948-10G's which serviced Jaws.
- ipp6509-e - s1p1 goes to the 4948-10G in c17u41 which will be used for the machines in cab16-18.
- cab1pdu0 - when I plugged ipp013 in, the PDU rebooted all of the other machines in the lower half of the rack (on that half of the PDU). Also, when I used the web page to examine the PDU's power settings, the PDU thought that all of the outlets in its top half were off! I used the web GUI to turn them back on.
- N5K - connected ports 39 and 40 to ippcore. Connected ipp072-082 as well.
September 29 (Haydn)
- ipp080 - fixed the SSD problem, but one fan isn't working due to a bad connector. Machine is partially setup, but won't boot from the SSD's. I may have to re-install from scratch.
- ipp073 - noticed a fan wasn't spinning.
- enabled all of the users on cab17con, and removed the obsolete users on the other serial consoles.
- cab17con - configured the power control to work.
- tested and discarded a used PDU -- none of the outlets would turn on.
- c16pdu0 - outgoing serial link doesn't work, so c16pdu1 is connected to its own port on the serial console server.
September 25 (Haydn)
- cab16pdu1 - swapped another PDU in for the one which wasn't working.
September 23 (Haydn)
- Found the solution to the serial console problem.
- ipp078 - setup.
- ipp076, ipp078, ipp079, ipp081 - released.
September 22 (Haydn)
- ipp013 - replaced drive #10, because it had two separate errors on Fri.
- ipp013 - a few minutes later it rebooted itself and gave a CMOS checksum error. Replaced the CMOS battery.
- cab17con - removed the ACS16 and replaced it with the ACS48. Configured the ACS48.
- installed the new Cisco 40 port 10GE switch. Not sure how to turn it on.
September 19 (Haydn & Rita)
- Rita finished installing the 4TB drives in the last two machines.
- Rita removed all of the 1TB 2.5" Jaws drives from their carriers.
- ipp013 - Haydn found a short in the 12v fan power cord, where Tyler had accidentally stretched it tight across the sharp edge of a CPU heatsink. Also had to clear and reconfigure the CMOS BIOS settings.
September 18 (Haydn & Rita)
- Rita finished installing the 4TB drives in all but two machines.
- Rita took inventory.
- Gavin got ipp081 working.
- Haydn got ipp079 and ipp076 done.
September 17 (Haydn & Rita)
- installed drives in ipp081 and tried to set it up, but no success yet. Connected it to the new ACS16 port 3, cab16 PDU0, and ippcore slot 10 port 21.
- prepared 11 USB sticks for the install and put one in each machine.
- connected ipp080 to ACS port 4, ipp079 to ACS port 5, ipp078 to ACS port 6.
- reinserted ipp047's power supply module "B" to test it.
- inserted a spare power supply module in ipp040, so it has 4 again.
September 15 (Haydn & Donna)
- unracked all of the Jaws nodes in cab16, cab17, and cab18.
- racked the new Supermicro machines in cab16 and cab17.
September 12 (Haydn)
- cab1 - replaced the broken PDU.
- j001bXX - moved all of the Jaws nodes from the two enclosures they were in, in cabs 16 and 18, to one enclosure in cab15.
- unplugged all of the rest of the Jaws nodes and removed their 2.5" HD's.
- tried to clean the optical transceiver for slot 2 port 1, but it needs to be replaced.
September 8 (Haydn)
- ipp047 - examined the power supplies. It has been working fine for 3 days. 47-A is probably bad. 47-B may be ok.
- ipps13 - added a stick of RAM which Kingston was kind enough to send us for free. Note: Kingston's warranty is only good for the original purchaser!
- inspected the power whips. We don't have any under cab21-cab23.
- inspected the newly arrived Hitachi 4TB drives.
- stsci18 - replaced a broken drive.
September 5 (Sifan)
- ipp047 - rotated PSs
- ipp047 - config now: 47-D, 47-C, 40-D
September 4 (Sifan)
- ipp047 - failed again: power supplies, replaced, rotated PSs
- ipp047 - configuration now: 47-A, 47-C, 40-D
September 2 (Sifan)
- ipp047 - failed: power supply issues. Trying 3 PSs, one from ipp040.
- Labeled the PSs from left: A, B, C, D.
- Removed 040-D from ipp040.
- Removed 047-C and 047-D; placed them under the laptop cart.
- ipp047 - now has 47-A, 47-B, 40-D.
August 15 (Haydn)
- Added another scavenged Dell 8 outlet PDU to power the rest of machines in cab1.
- ipptempmon - moved it from cab1 to cab2 (there were no more outlets available in cab1).
- ipp051 - it went offline this morning. Saw there was a kernel panic on the console. Reset it. Now it is working again.
August 14 (Haydn)
- ipp047 - found it with power connected, but no lights on, and it wouldn't respond to toggling the power switch. Replaced the rightmost power supply module (with our last spare). Seems to be working fine now, but we should keep an eye on it.
- ipps13 - isolated a bad memory module, which will need to be replaced.
- PDU for cab1 stopped working. Circuit breakers look ok. No lights on display. Replaced it with a scavenged Dell 8 outlet one to power the bottom 4 machines in the rack.
July 28 (Haydn)
- ipp033-036 - switched them from using 10Gb fiber to 1Gb Ethernet.
- ipp036 - disconnected it from 10Gb fiber.
- ipp071 - attached its top port to ipp036's fiber, which goes to the Cisco 6509-e, slot 4, port 4. Did my best to clean that port.
July 16 (Haydn)
- ippdb01 - moved it from the top of cab5 to the bottom of cab3, to keep it cooler. Moved it to a different serial port. Patched the cooling airflow holes this created.
- ippb06 - removed the old 64GB RAM and put in a new set of 64GB RAM.
- ipp069-70 - replaced the serial cables with new ones.
July 3 (Haydn)
- ipp033-036 - connected them to fiber, to the ports in slot 4 of ipp-6509-3.
- ipp069, ipp070 - used compressed gas and lint-free cleaners to try to clean the switch end of their fiber. Straightened their cables slightly, and that made a difference.
July 2 (Haydn)
- ipp033, ipp034, ipp035, ipp036 - installed Intel 10G Ethernet cards.
- ipp033 - motherboard fried when I reapplied power. The IDE cable for the CD ROM has a metal sheath, covered by plastic netting. The plastic netting allowed the metal sheath to short out the motherboard. No idea why this happened at this particular time. Tested power supplies and replaced motherboard.
June 23 (Haydn & Sifan)
- ippb06, ipp067, ipp068, ipp069, ipp070 - installed, but couldn't get them working. Hardware and software problems.
June 16 (Haydn)
- ippc35 - replaced dead power supply.
May 28 (Haydn)
- Shut everything down while the ATS was replaced and tested. Turned everything back on.
May 19 (Haydn)
- ipp042 - wouldn't respond to reset or power cycling. Motherboard POST code stuck at FF. Replaced the motherboard and now it works.
- ippc05 - had a failing 500G HD. Replaced it with a tested, known good one.
- ipp036 - wouldn't respond to reset or power cycling. Motherboard POST code stuck at FF. Replaced the motherboard and now it works. When opening it, I found an unusually thick and messy layer of two kinds of heat conduction paste on the CPU's, and two white plastic screws stuck to the bottom of the motherboard, where they were squeezed between the motherboard and the case. The improperly installed CPU's, or the out-of-place plastic screws flexing the motherboard, could have shortened its life.
May 5 (Haydn)
- ipp051 - wouldn't respond to power cycling. Motherboard POST code stuck at FF. In all of the other cases where we've seen this, replacing the motherboard was the only solution which worked, so I did that without a lot of other testing.
- ippc16 - was beeping and had the red warning light on the front panel lit. Opened it up, plugged it into power, and it was still beeping. Checked that all of the fans were spinning (they were). Checked that all of the fan connectors and connectors going to the front panel with the beeper were tight. Some were a little loose. Somewhere in the process the beeping stopped. Re-installed it in the rack and it continued to be quiet, however it started beeping again as I was leaving the building. There is a button on the front panel to silence the beeping and the MRTC-B operators now know to press it to fix this problem.
April 30 (Haydn)
- ippc13 - wouldn't respond to power cycling. Power supply was ok. Replaced motherboard. One unusual thing about this particular S5397 motherboard is that after passing the memory test it says:
System Configuration Data Read Error Press <F1> to resume, <F2> to Setup
At this point it waits until you press <F1>, and then it boots normally. There is nothing I can do about this except replace the motherboard. Let me know if you want it replaced. I already cleared the BIOS to the default settings, and then carefully applied our changes, but that didn't make a difference.
March 13 (Haydn & Rita)
- ipp035 - wouldn't respond to power cycling. Power supply was ok. Replaced motherboard.
March 6 (Haydn)
- ipp035 - turned the power on and it started working again. Found a few incorrect BIOS settings and corrected them, but the power wasn't one of them.
February 27 (Haydn)
- ippd00 - removed it for use at the summit.
February 19 (Haydn)
- ipp026 - replaced the motherboard. Old one didn't have any obvious signs of failure, however CPU#1 had only 40% coverage for the heat sink compound, and CPU#2 had heat sink compound on about 10 of the pins on the bottom!
- ipp064 - replaced failed drive #18.
February 14 (Haydn)
- ipp026 - replaced the motherboard. Old one was charred. Tried the new motherboard which didn't work in ipp028, and then tried another new one. Will have to return one (hopefully for free replacement).
January 31 (Haydn)
- ipp028 - replaced the motherboard twice. Gavin had to fiddle something to make the Ethernet work.
January 29 (Haydn)
- ippc02 - replaced a broken, noisy fan. Didn't have the right spare. Will replace it soon.
January 28 (Haydn)
- ipp023 - noticed that one power supply module wasn't working. Reseated it and two others stopped working! Removed it and restarted the machine and it worked fine. Replaced the broken one with a spare.
January 21 (Haydn)
- ipp046 - replaced the motherboard. The old one was badly charred, but only on the bottom. Didn't have the right heat sink compound. Will need to redo this later.
January 13 (Haydn)
- stsci02 - RAID was only working in read-only mode. Replaced the RAID card, but that was a small mistake. Noticed that one power supply in the JBOD had failed. Reinstalled the original RAID card and got it working (with Gavin's help).
- inspected all of the power supplies at MHPCC and found that stsci04 and stsci17's JBOD's power supplies each have had one failure.
- stsci04 - replaced a failed JBOD power supply.
- stsci17 - replaced a failed JBOD power supply.
- ipp030 - replaced a failed 2TB drive #10.
January 8 (Haydn)
- ippb01 - took it apart and discovered that the N+1 power supply enclosure is not producing the right stable voltage on the +12v line. Contacted Patrick Lam and he will provide a replacement for free (we just pay the shipping and return the failed enclosure). The 4 power supplies themselves seem to work fine when placed in other machines.
2013 Production Node Work
December 24 (Haydn)
- Searched for the cause of the burnt smell, but couldn't locate it.
- ipp037 - noticed one power supply's light was off, tried reseating it, but that didn't fix it. Replaced it with a spare.
- ippc02 - MHPCC personnel noticed a red light for a failed fan was on. Temperature and air flow for the machine are the same as the others, so I didn't examine it further. On a future visit we can power it off so I can investigate it.
November 1 (Haydn)
- stsci14 - Rebooted it twice -- it wouldn't come all the way back up the first time.
- ipp038 - Wouldn't work unless the second CPU was removed. Replaced the motherboard with a new one and now it is working fine again.
October 24 (Haydn)
- ipp047 - Replaced the RAID card's backup battery.
- ipp046 - Replaced the RAID card's backup battery. It then wouldn't boot, reporting a CMOS checksum error, so I replaced the CMOS battery with a used one I had on hand, reset all of our CMOS settings, and rebooted it; it worked on the second try. It seemed to do an fsck, rebooted, and then it worked fine.
October 7 (Haydn)
- ipp034 - replaced broken motherboard. Upgraded it from 16GB RAM to 24GB. It wouldn't work with 32GB. (2GB sticks)
- ipp026 - tried to upgrade it from 16GB to 24GB RAM, but it would only show 20GB and the sticks were quite hot. Decided not to try to increase its RAM. Had to reset it to get it to boot the last time.
- ipp035 - upgraded the RAM from 16GB to 24GB.
- ipp036 - upgraded the RAM from 16GB to 32GB.
- ipp009 - reseated the RAM, but it still only sees 20GB of 24GB installed. It uses a different type of RAM than I have spare sticks for, so this won't be easy to fix until I can get some. (Tyan S7010 motherboard.)
- ipp031 - upgraded the RAM from 16GB to 20GB -- it wouldn't recognize the full 24GB, but that set of sticks has not worked any better in many other machines, so I'll settle for this.
October 4 (Haydn)
- ipp034 - noticed a burning smell. On investigation discovered the voltage regulators had charred.
- ipp030 - upgraded its RAM from 16GB to 20GB. It contains 24GB, but won't recognize the last 4GB.
October 3 (Haydn)
- ipp041 - replaced the motherboard so both CPU's would work.
October 2 (Haydn)
- ipp029 - replaced the motherboard so both CPU's would work, and upgraded it to 32GB RAM.
- ipp027 - replaced the motherboard so both CPU's would work, and upgraded it to 32GB RAM.
October 2 (Rita)
- STSCI10 - STSCI19 - reconfigured the BIOS to power on after power outage
- NOTE: enter Delete key to access BIOS menu
September 9 (Haydn)
- ippc12 - reinstalled it, this time with a new Tyan S5397 motherboard, and it seems to work fine now.
- ipp047 - replaced the module which holds the 4 power supplies. Tested it with an old S5397 motherboard (which only would work with a single CPU), and then installed a new Tyan S5397 motherboard.
- ippc63 - checked it over to try to determine why its temperature sensors are always reading unusually high. The power supply fan was working fine. Nothing seemed unduly hot. The CPU heatsinks were reasonable. Both CPU's had a good coating of heat sink compound. All of the fans which blow air across the heatsinks were working fine. I'm guessing the temperature sensor for CPU2 is not functioning correctly. I'll modify my software to only use the sensor for CPU1.
- ippd00 - rebooted it to verify that the DRAC has the correct settings. It does.
- found several failed 2TB drives which had been placed in the wrong boxes. I'll send them back to WD to be swapped for good ones.
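The ippc63 workaround (trusting only CPU1's sensor) can be sketched as a small shell filter. This is a hypothetical sketch: the `sensors`-style sample text and the field positions are assumptions, not output captured from ippc63.

```shell
#!/bin/sh
# Hypothetical sketch: use only the CPU1 reading from lm-sensors style
# output, ignoring the suspect CPU2 sensor. The sample text below stands
# in for real `sensors` output on a live node.
sample='CPU1 Temp:   +46.0 C  (high = +75.0 C)
CPU2 Temp:  +127.0 C  (high = +75.0 C)'
cpu1=$(printf '%s\n' "$sample" | awk -F'[+ ]+' '/^CPU1 Temp/ {print $3}')
echo "CPU1=${cpu1}"
```

On a live node the `sample` variable would be replaced with a call to `sensors`; the awk pattern simply drops the CPU2 line before it reaches the monitoring side.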
September 4 (Haydn)
- ipp047 - tried to install the new power supply module. Unfortunately it didn't have exactly the same form factor. Succeeded in transplanting the electronic guts into the old sheet metal, but then the power cables didn't quite fit/match. Took it back to ATRC to solder new ones in place.
August 28 (Rita)
- ipp045, ipp034 - installed the replacement power supplies without problem.
August 26 (Haydn)
- ippd00 - partially installed. At this point I can ssh into it and finish installation from ATRC. Unfortunately, the rails are too long to fit our cabinet (we have things near the ends of the rails which limit their length).
- ipp047 - installed another motherboard, which it promptly fried (in less than 1 sec). Verified that all three of the power supplies do not cause problems in any other machines. Perhaps it is the power supply cage or wiring harness. Ordered a replacement on eBay, with US Priority Mail shipping.
- ipp045, ipp043 - reseated the power supplies which weren't working. All of them continued to not work, so I removed them. The used replacements arrived today. Will install them on the next visit.
August 22 (Haydn)
- Installed ipptempmon and confirmed that it is taking data.
- Investigated ipp047. Its motherboard caught fire between 12:30am and 1am last night. Motherboard destroyed, one power supply destroyed, chassis damaged. Will return tomorrow with a vacuum and cleaning supplies.
- Inspected all of the machines' power supplies. Found three others not working: ipp048 (replaced with the only spare), ipp045, and ipp043.
August 21 (Rita & Haydn)
- Replaced drives in stsci04 and stsci17.
- Tried to set up ippd00, however its Ethernet ports don't work with the tg3 or bcmx drivers on the Gentoo installation disk.
- Tried to install ipptempmon, however I had the Ethernet misconfigured.
- Chris Young gave us a tour of the Mana cluster.
August 14 (Haydn)
- ipp052 - replaced the Tyan S5397 motherboard with an S7010 motherboard, new CPU's, and 24GB of new RAM. Used one active (fan) heatsink and one passive one. Have ordered an active heatsink on eBay; will replace it after it arrives in about a week.
August 13 (Haydn)
- ipp052 - tried a number of tricks to get it to work, but no signs of life. Probably bad motherboard.
August 9 (Rita)
- ipp032 - swapped the 1TB disks for 2TB ones, configured the RAID, and returned to ATRC to install the OS while the RAID configuration continued. The OS configuration completed around 3PM with no problems.
August 5 (Rita)
- ipp031 - swapped the 1TB disks for 2TB ones, configured the RAID, and returned to ATRC to install the OS while the RAID configuration continued. Had planned on upgrading ipp032 too, but found we were short of 2TB disks. The plan is to order more ASAP and perform the upgrade.
July 11 (Rita)
- ippc17 - replaced the BBU on the RAID controller. System restarted fine. Still too early to tell if the battery temp too high errors will continue.
- took pictures of the stsci10 - stsci19 servers in the MHPCC server room. Received permission from MHPCC security to take these pictures, which will be sent to Prem at STScI.
- finished labeling all the drives in JBODs for stsci10 - 19.
July 9 (Rita)
- ippc17 - was reporting - Controller 5 ERROR - Battery temperature is too high
- opened the box and found that all fans were working correctly. Reseated the RAID card and found that the controller backup battery unit felt fine, was not unusually warm, etc. Restarted the server and we did not receive the temp error again for the rest of the day. However, Gavin reported that the next morning the error occurred again. Will see if more fans can be installed in the unit, and if the BBU is possibly faulty.
June 6 (Rita and Chad Richards)
- stsci10 - stsci19 installation and tests completed today. Rita verified she could ssh into all 10 of the servers and pwd displayed her working directory. Will return next week to label the new servers, cabinets and RAID disks.
June 5 (Rita and Chad Richards)
- stsci10 - stsci19 were powered up and tested. Chad discovered a bad RAID card on stsci15 and replaced it with an LSI MegaRAID 9280-8e RAID Controller which we had in our spares kit. The RAID array is rebuilding and should be complete later this evening. Will be able to complete tests on this server tomorrow.
- stsci15 had one failed RAID disk, which was swapped with one of the spares. RAID array rebuilding and should be complete later this evening. Will be able to complete tests on this server tomorrow.
- ran all functional tests except for the io-test nfs. All servers passed successfully. Aspen engineer will run the io-test nfs remotely from VPN connection tonight or run from MHPCC tomorrow. Tomorrow plan to complete testing of stsci10 and stsci15. Will also take one load of the server boxes from the MHPCC server room to store at Waikoa.
June 4 (Rita, Haydn and Aspen engineer, Chad Richards)
- worked through the afternoon unboxing and racking stsci10 - stsci19
May 20 (Rita)
- power was configured to cab13 and cab14, and I was able to connect the cab13 serial console for Gavin to configure. Gavin confirmed he could connect to PDU0 and PDU1 in cab13, but had problems seeing PDU1 on cab14. It seems the Out/Aux port on PDU0 doesn't work, so we connected the cab13 serial console connection to PDU1 (instead of PDU0, as is done on the other cabinets) and used the Out/Aux port on PDU1 connected to the In/Console on PDU0.
NOTE: We need to switch the PDUs on cab14 so they can be configured as all the others - otherwise it is too confusing.
May 16 (Rita and Haydn)
- setup cabinets 13 and 14 in prep for the 10 additional Aspen servers. The cab13 serial console was secured in place but could not be configured because power to these cabinets wasn't installed yet. Ethernet cables were connected to the CISCO switch and strung to cab13 and cab14.
May 1 (Haydn)
- ippc13 - reinstalled it, and it seems to be working fine.
- ippc12 - reinstalled it, but it stopped working within just a few minutes. It had survived a one hour memory test at ATRC, so I was surprised it died so quickly. After May 15th I'll order an S7010 motherboard to replace it with.
- Used a digital anemometer I bought on eBay to check the fans of the ippc00 - ippc20 machines, and several seem to have problems with their power supply fans: ippdb03, ippc15, ippc16, ippc02, ippc03, ippc05, ippc06 and ippc19. Next time I visit, I'll bring some newer, better power supplies and possibly some replacement fans (both of which are still in transit), as well as a tube to make it easier to measure without interference -- I suspect that sometimes EMI from the power cord next to the fan may have caused problems with measurement.
- (Rita)
- In preparation for the 10 new servers arriving from Aspen Systems (for STScI) I opened ports on the IPPCORE router that weren't being used.
- The eth1 cables (which weren't active) were removed from these ports and wrapped away above the cabinet (Gi9/2 - Gi9/32, even ports only).
Apr 19 (Haydn)
- stare04 - tried to upgrade it to 96GB RAM, but it didn't like the new RAM. Used the leftover 4GB sticks to upgrade it to 48GB.
- ippc29 - removed its old RAM and upgraded it to 88GB. When I tried to insert the last 8GB stick it showed only 80GB, so I figured 88GB was better.
- ippc26 - transplanted the leftover 4GB sticks to upgrade it to 48GB.
Apr 18 (Haydn)
- ippc17 - reinstalled it, now it has a 500w power supply.
- ippc13 - completely stopped working. Removed the extra 32GB of RAM which had been added. It remained inoperative. Took it back to ATRC to repair it.
- stare00-stare03 - removed their six 4GB RAM sticks (24GB) and replaced them with twelve 8GB RAM sticks to upgrade them to 96GB. Not all of the memory seems to be recognized all of the time, so sometimes these machines boot up with 80GB, 88GB, or 96GB.
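Since these machines sometimes boot seeing less RAM than is installed, a boot-time check could flag the short boots automatically. A hypothetical sketch, where the 96GB target and the sample MemTotal line are assumptions (a live node would read /proc/meminfo directly):

```shell
#!/bin/sh
# Hypothetical boot-time check: warn when a node sees less RAM than
# expected (e.g. coming up with 80GB instead of 96GB). The sample line
# stands in for /proc/meminfo; 96GB is an assumed target.
expected_kb=$((96 * 1024 * 1024))                 # 96GB in kB
sample='MemTotal:       83886080 kB'              # 80GB recognized
seen_kb=$(printf '%s\n' "$sample" | awk '/^MemTotal:/ {print $2}')
if [ "$seen_kb" -lt "$expected_kb" ]; then
    echo "WARNING: only ${seen_kb} kB of RAM recognized (expected ${expected_kb} kB)"
fi
```

Run from an init script or cron, this would turn a silent 80GB boot into a logged warning instead of something noticed by hand.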
Apr 17 (Rita)
- ippc07 - wasn't working at 1GB as the other servers on the new 48 port switch. It was working at 100MB. Discovered that the cable C7N2S0 was bad. Labeled the cable as bad and swapped with the C7N2S1 cable, which is now in ethernet port 0.
- ippc13 - checked on the memory check that Haydn had started 4/16/13. It looked as if the memory check ran without errors, but I couldn't reboot from the monitor. Tried to power cycle the server many times but it wouldn't start up. Strangely, it then seemed to power up on its own! Gavin was then able to see it on the serial console and configured it to boot from the RAID disk. Haydn to do more checking on this server to see if the newly added memory could be faulty.
Apr 16 (Rita & Haydn)
- ipp064 - Replaced the LSI RAID BBU. Now write-thru caching is working again.
- ipp057 - Replaced failed drive 19.
- ippc17 - Discovered that the power supply fan had stopped working. The power supply is no longer supplying +12v, and the -12v supply is only providing -10.5v. Also, heat sink compound inadvertently got onto the bottom of one of the CPU's and the CPU socket's pins. We don't have the materials here to fix the power supply or clean the CPU or CPU socket, so we'll take it back to ATRC, and I'll try to fix it there.
- ippc13 - installed the leftover 32GB RAM from the old ippc12 into ippc13, to upgrade it to 64GB.
- completed the rest of the 10G data transfer hardware swap
- moved 2 x10G uplinks from the brocade to ipp2960s-te
- moved ippc01 - ippc16 connections from the brocade to ipp2960-s
- removed the 10GBase blade WS-X6704-10GE (on loan to us from Indiana Univ.)
- inserted PanSTARRS 10GBase blade WS-X6704-10GE (recently purchased by PS1)
- inserted the long range transceiver (XENPAK-10GB-LR+) in Te13/1 and reconnected the fiber
- inserted the two short range transceivers in Te13/3 and Te13/4 and reconnected the fiber
- Gavin confirmed the new connection to ippc01-c16 was operational, and was able to restore traffic to Te13/1
- removed the 10Gbase brocade switch from cab0 (on loan from Indiana Univ.), which was replaced with ipp2960s-te
- Rita to box up all equipment for IU and mail back to IU from ATRC, ASAP
Apr 10 (Rita)
- Worked on ippdb00 and swapped out the 300GB disks for new 600GB disks. Reconfigured the RAID array, and handed over the OS rebuild to Gavin and Haydn to perform remotely. All came up fine and IPP team was notified that afternoon that they could proceed.
- While there, Gene and team remotely updated stsci06 to kernel version 3.6.7 and rebooted it. From what I could see manually, all seemed fine when the system came back online.
Apr 9 (Rita)
- Worked on the 10GB data transfer hardware swap
- The Indiana Univ. hardware that has been on loan needs to be swapped with the new hardware PS1 has recently purchased, so the loaned hardware can be returned.
- Haydn removed the DELL 1U server on 4/8 that will be returned.
- Rita mounted the new Cisco 2960S-48TB-L in Cab 0 and inserted the two SFP10G-SR transceiver modules.
- While there at MHPCC the IPPCORE was having problems. See http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/ippcore_log for details. Will continue with the 10GB data hardware swap once the IPPCORE is stable.
Apr 8 (Haydn)
- ippc19 - reinstalled it, but this time with a Tyan S7010 motherboard, and replaced the fan in the power supply. I had to modify the power supply to add a second 8-pin power connector for the S7010.
- xfer22 - Gavin shut it down, and I disconnected it, packed it in a box, and brought it back to ATRC to be shipped to IU soon.
- cab7 - reseated the ethernet cable connecting pdu0 and pdu1, because we couldn't reach pdu1. Gavin confirmed that it works now.
Apr 2 (Rita)
- A power outage for parts of Kihei from approx. 4/1/13 @ 10:30PM to 4/2/13 @ 1AM affected the MHPCC PanSTARRS IPP cluster. The IPP PanSTARRS system unfortunately went down along with the UPSes about 45 minutes after the outage. I went to MHPCC 4/2 at 9AM to check it out:
All but 4 servers had restarted when power was restored.
- ipp046 - needed to have the BIOS reset. Once that was done the reboot went fine.
- ipp048 - had rebooted automatically and was running fsck, which took some time, but eventually recovered without problems.
- ippc17 - had to be power cycled a few times, and then it did reboot without problems.
- ippc19 - got no response after a few power cycles. Opened the box and reset the CMOS. When trying to re-power the server, saw an FF on LED and then Eddie and I smelled something burning and quickly unplugged power.
Brought the server back to the ATRC to investigate and it appears that the voltage regulator to the CPU had burned - MOBO must be replaced.
Mar 27 (Haydn and Rita)
- ippdb02: swapped out 300GB drives with new 600GB drives. Configured RAID array and installed OS. All seems to have gone fine. IPP team to check out server.
- ipp064: Haydn was going to replace the RAID battery, but we didn't have the correct model. Haydn to place an order.
- Per request of Ken and Gene we checked the RAM in the IPP compute nodes in order to determine which ones to upgrade to 64GB and 48GB.
Mar 19 (Haydn)
- ippdb02: Powered machine down, checked how the RAID card is setup, and decided not to try another RAID card in it. Powered machine back up.
- ippcore: Together with Brad, cleaned fiber port 13/1. Gavin said it is working now.
Mar 8 (Rita)
- ippdb02: 1TB drives were swapped with the original 300GB drives. All rebooted fine.
MAR 7 (Rita)
- ippdb02: 2TB drives installed on 2/28 were not compatible. Meant to swap in the original 300GB drives but it was discovered that I accidentally installed the 1TB drives. Will return tomorrow and swap again with 300GB this time.
FEB 28 (Rita)
- ippdb02: Swapped out the 300GB drives and replaced with new 2TB drives. IPP team will test to see if they are compatible with the RAID controller.
FEB 26 (Rita)
- ippdb02: Swapped out the 1TB disks that were installed FEB-21-13, due to compatibility issues, and replaced the original 300GB 15K RPM drives. Was able to replace the drives in the correct slot and didn't have to rebuild the RAID array.
FEB 21 (Haydn/Rita)
- ippdb02: Upgraded the RAID from using 300GB 15k RPM Cheetah drives to using ordinary 1TB drives, to increase the RAID storage capacity.
JAN 16 (Haydn/Rita)
- ipp027-030: All of the RAID controller disks were swapped for 2TB disks. We are waiting for the RAID silvering to finish before installing the OS.
- ipp027: Had stopped working on January 6, 2013. It looks like another case of an incorrectly installed CPU heat sink (very little heat sink compound was used). Again, it seems to have made the CPU socket stop working. I tried 4 known good CPU's in the second socket, and none would work. All of those same CPU's worked fine in the first socket. Now the machine is running with just one CPU.
2012 Production Node Work
DEC 12 (Haydn/Rita)
- ipp029: Machine stopped working. Discovered that the heat sink hadn't been screwed down -- it was just loose. The heat sink compound was dark (almost black) and grainy, the top of the CPU was slightly pitted, and the bottom of the heat sink was also slightly pitted. We tried using a known good CPU and many unknown ones in socket 1, but nothing worked and it kept the machine from booting. We had to remove the CPU from socket 1 so that the machine would work.
- ipp032: Replaced drive #31.
NOV 9 (Haydn)
- ipp026: New Areca RAID controller installed and tested for compatibility. All was fine and data still accessible.
- ipp023, ipp024, ipp025, ipp026: All the Areca RAID controller disks were swapped for 2TB disks. OS was reinstalled and all backed up data restored.
OCT 11 (Haydn)
- IPP013: Spoke with Mr. Chau (chaut@…) about the Tyan S7010 motherboard we sent back to them to fix on 9/12. He said the motherboard is broken and cannot be repaired. Something broke within the Northbridge chip. We still have to pay their $50 processing fee. It is NOT covered under warranty. Their warranty period is 3 years from the date of manufacture, and even though we purchased them less than 3 years ago, the board was manufactured ~4 years ago. NewEgg.com no longer sells these boards. There are two for sale on eBay for ~$290+shipping. I'll check with Gavin to see what we want to do.
OCT 03 (Haydn/Rita)
- IPP046: removed and replaced the BIOS settings battery. The old battery was completely dead. Also verified that the CMOS jumper was in the normal position. Re-entered the usual BIOS settings for console redirection, power, and boot.
- IPP013: removed the CPU heatsink which was missing a fan, added the repaired fan, reinstalled it with fresh Arctic Silver thermal paste. When power was reapplied it almost instantly showed the Tyan logo screen, however it stayed there for a long time. I pressed the reset button and it did the same thing. Tim pressed the escape key, and then it instantly went through the rest of the boot process. I'm wondering if it has the correct BIOS settings? When it finished booting, I logged in via SSH, but it hadn't mounted our home directories. I let Heather (today's Czar) know about this problem.
SEPT 07 (Cindy/Rita/Haydn/Tyler)
- IPP013: replace mobo w/ cold spare Tyan S7010 which Tyler was testing on ippc12. Discovered CPU1 fan wire was sliced.
- IPP037: RAID issues surfaced starting on the afternoon of 9/6/2012. RAID volume appeared to become unresponsive after boot. Re-seated Areca controller. System remained responsive after re-seating. IPP team to backup nebulous data to available storage space on stsci0x nodes. /export/ipp037.0 mounted in read-only mode during rsync process.
SEPT 06 (Cindy/Rita)
- Servers affected IPP013, IPP046, IPPC04 not booting up after clean (emergency) shutdown due to MHPCC power outage (on 9/5/2012).
- IPP013: VGA and serial console not responding when server is powered on. All 4 PSU are powered on, memory reseated, PCI-e RAID controller reseated, CMOS cleared, chassis circuit board examined. Server powers up, HDD LED visible, VGA and serial console not displaying anything.
- IPP046: CMOS error, Cindy fixed by resetting BIOS (console redirection + power on after failure)
- IPPC04: Console Redirection not enabled in BIOS, all other BIOS settings were correct.
JUNE 19
- re-labeled the A disks on the 10 stsci servers.
- contacted Lonnie at ASPEN to work with the MegaCLi (MegaRAID command line utility), but we could not get the syntax correct for the PdLocate option. -- I plan to experiment further with the MegaCLi and document the correct syntax for others.
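For reference, the PdLocate invocation follows the pattern below. This is a dry-run sketch that only echoes the commands rather than executing them; the enclosure, slot, and adapter numbers are placeholders, not IDs from our servers.

```shell
#!/bin/sh
# Dry-run sketch of MegaCLI's drive-locate syntax (blinks a drive's LED
# so it can be found in the JBOD). Enclosure 8, slot 3, adapter 0 are
# placeholders -- substitute the values that MegaCli -PDList reports.
enc=8; slot=3; adp=0
start_cmd="MegaCli64 -PdLocate -start -physdrv[${enc}:${slot}] -a${adp}"
stop_cmd="MegaCli64 -PdLocate -stop -physdrv[${enc}:${slot}] -a${adp}"
echo "$start_cmd"
echo "$stop_cmd"
```

To actually blink an LED, the echoed strings would be run directly on a server with the MegaRAID controller installed.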
JUNE 18
- ippc04 fixed, so added it back into the cluster at MHPCC. Started up fine and informed Gavin so he could continue with the configure with the cluster.
- verified that the RAID disk locate utility from the MSM GUI was correct on all 10 servers, and that the labels did need to be switched for the A and B disks.
- re-labeled all the B disks on the stsci servers, but too late to start on the A disks.
JUNE 13
ippc04's MOBO was swapped into ipp015 (see April 4 & 5). Tyler found JP1 in the CMOS-clear position - the machine hangs at a checksum error. With the jumper in the normal position the machine won't power on. With the jumper out of the normal position the machine will not be able to save the BIOS settings - which is what we saw. With the jumper removed it powers on, and it ran overnight.
JUNE 12
ippdb03 crashing - no hint as to why. Swapped the MOBO, memory, and CPU's out of ippc12 into ippdb03, and added 8x2GB of RAM for a total of 48GB. Brought ippdb03 back to ATRC with the original memory and CPU's for diagnosis.
JUNE 5-6
Installed STSCI servers - all running.
MAY
Installed the BROCADE switch and CISCO 10G card. swapped cables around the big CISCO switch to make room for the STSCI machines.
APR 05
- ippc04 -- removed the motherboard from it and installed it in ipp015. Brought it back to ATRC, in case we want to install a new motherboard/CPU/RAM combination in it.
- ipp015 -- works now, but the motherboard I pulled out of it had the RAM in the wrong slots, and JP7 on the wrong pins. We could use better quality assurance when we are retrofitting motherboards.
APR 04
- ipp015 -- it stopped working yesterday, and I went to MHPCC to see if I could coax it into functioning again. It was off when I arrived. I tried turning it on, clearing the CMOS RAM, removing the battery, removing the RAM, removing the RAID card, moving the RAID card to a new slot, removing CPU #2, and swapping CPU #2 into CPU#1's place. Nothing made any difference. Tomorrow morning at 9am I will return and transplant ippc04's motherboard into ipp015.
MAR 28
- ipp041 -- took it down to swap in a motherboard we obtained on eBay. Unfortunately, the new motherboard wasn't actually new, and the power sockets on it wouldn't accept our power cables, so we had to remove it and revert back to the old motherboard, which only works with one CPU. The eBay seller apologized and is sending us a full refund.
MAR 21
- ipp014 -- took it down to rotate one of the CPU coolers by 180 degrees. It had accidentally been installed backwards.
- ipp041 -- took it down twice to identify what motherboard it uses. It is a Tyan S5397AG2NRF. It turns out there is a second variant, the S5397WAGNRF. Normally these cost about $500-$700, but I found one on eBay. The auction closes tomorrow morning, but currently there are 21 bidders and the price is $92, plus $12 for priority mail shipping. The board is "New In Box" and the pictures of it show it in the original box, with the original stickers over the slots and the cables and other components sealed in their bags. The difference between the two boards is that the one on eBay also has connectors for SAS drives, while the one we have doesn't have that part of the board populated. I think we should snap this one up -- it should go for less than $200, which is about a third of the best price Google/products showed.
MAR 14
- ipp015 -- was turned off when we arrived. We plugged a screen into it and turned it on and it worked, except for networking. I'd chosen to boot the top kernel in the list instead of the default one, so we tried again with the default kernel, and then it worked perfectly.
- ipp058 -- it was reported to us that one of its PDU's was in an odd state. We confirmed that the electrical plugs were connected to the right sockets on the PDU, and that their little green lights were both glowing (and thus indicating power on). We couldn't access the URL for the PDU from MHPCC. When Serge went to confirm the existence of the odd status, it was no longer there, so at that point there was nothing more to be fixed.
FEB 22
- ipp041 -- fixed the problem with the CPU not being able to make good contact with the heatsink. Had to remove a standoff which was in the way. Unfortunately, we had only 1 hour to work on this machine, so we ran out of time to debug it; the second CPU was still keeping the machine from working, so we had to remove it again and leave. We wanted to begin working on this machine first, but couldn't make contact with anyone from the IPP-dev team, including the czar, so most of our time was wasted. Perhaps next week or in a few weeks we can visit it again. I bought some Q-tips and Goo-Gone to clean the second CPU with next time I'm there. This was frustrating.
- ippc11 -- reinstalled it with the new motherboard. Confirmed with Gavin that all of the cables for it were correct.
- ippc12 -- accidentally unplugged the power.
FEB 17
- ipp041 -- discovered that one of the CPU coolers wasn't making good contact with the CPU, so had to remove that CPU to get the machine to boot up. Next Wed or so we should be able to return and take a stab at fixing the problem.
FEB 9
- ipp027 -- began displaying the BIOS message today -- we have no idea why, but we were very thankful! Used the CD to update its BIOS. This cleared the old settings, so I changed the ones which we always change. Now it works fine again.
- ipp028 -- still catatonic. Displayed "E4" POST code. Cindy recommended removing everything from the machine, so we removed the RAID card. This caused the two beeps it was making to change to a different pattern (undocumented in the manual). The POST code changed to "E1". Next we removed all of the RAM. The POST code changed to "60" after going through a bunch of different values. We started reinstalling the RAM, first 1 stick, then 4, then 8. With one stick the BIOS screen said that the BIOS settings were invalid, so I reset them to our usual values and saved them. Apparently this machine's BIOS settings spontaneously got corrupted. If this happens again, we should replace the CR2032 battery with a fresh one, and/or check that the CMOS reset jumper is in the default position. Reinstalled the RAID card. Oddly, this machine recognizes that it has an ATA DVD/RW drive; however, it cannot read the CD in the drive.
FEB 8
- ipp030 - one small fan in back wasn't working, but we didn't bother to replace it; it has minuscule impact.
- ipp028 - no fan failures found.
- ipp031 - no fan failures found.
- ipp023 - no fan failures found.
- ipp014 - no fan failures found, however one of the CPU fans is installed backwards -- it is pushing air toward all of the other fans. Fixing it requires removing the heat sink and rotating it 180 degrees, which we didn't want to do at that time, but we should consider doing in the future.
- ipp016 - missing one fan (on the left), had 2 broken fans. One fan was actually broken, and the other wouldn't work because of a short/open circuit on the little board which the fans plug into. Replaced the broken fan, and connected the other fan directly to the headers behind the disk drives.
- ippc08 - the large power supply fan was actually ok.
- ipp025 - moved the power cables on the PDU to where they should have gone.
- ipp027 - the power cables weren't looped through the power supply handles, so one by one I moved them. Tried to update the BIOS with the CD which Gavin sent me the ISO for. Unfortunately, the machine turned catatonic: we couldn't see anything on the video, and the keyboard's caps lock light wouldn't come on.
- ipp028 - wouldn't come back up. Gavin recommended using the motherboard jumper to clear the CMOS. Behaved just like ipp027 -- catatonic. Displayed "E4" POST code.
FEB 2
- Checked 4 servers for possible fan failures - Haydn and Rita
- ippc08 - larger fan on the power supply not working -- seized. We did not have the replacement fan with us and will replace it next week.
- ipp015 - we found two fans in the front (the ones that pull cold air in) that weren't working. We discovered that their power connectors were not connected. After connecting them, all fans are now working.
- ipp025 - no fan failures found.
- ipp017 - no fan failures found.
- Rita - While I was there working, Gavin called for support to check out ipp027, which seemed to have crashed. After power cycling a few times, waiting 1 hr., and power cycling again, the system rebooted and came back online.
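Where lm-sensors is available, these manual fan checks could be pre-screened from the OS before opening a case. A minimal sketch (the fan names and RPM values below are hypothetical sample data standing in for a live `sensors` run, not readings from these nodes):

```shell
# Flag fans reporting 0 RPM in `sensors`-style output.
# The sample below is hypothetical; on a node you would pipe in the
# real `sensors` output instead.
sample='fan1:        3900 RPM
fan2:           0 RPM
fan3:        4100 RPM'

# Strip the trailing colon from the fan name, keep rows where RPM is 0.
failed=$(printf '%s\n' "$sample" | awk '{gsub(":", "", $1)} $2 == 0 {print $1}')
echo "possible failed fans: $failed"
```

This only catches fans the motherboard can see; fans hanging off a dead fan power board (as on ipp016) still need an eyeball check.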
JAN 31
- Configured the BIOS on all 34 Wave 5 machines and began labeling the nodes.
JAN 30
- ipp027 was crashing and its CPU overheated. Opened ipp027 to find the CPU1 fan had come unplugged - reconnected it - running OK since then.
- remounted ippc12 with the new MOBO installed by Haydn
- Unboxed and racked 34 Wave 5 1U's in cabs 10 and 11.
2011 Production Node Work
DEC 19
- Replaced the bad RAM in ipp063.
- Rebuilt the RAID in ipp064, however I didn't have the documentation, couldn't get help from Cindy or Gavin, and one parameter wasn't correct, so it will have to be redone. Maybe Wednesday...
- Found an old Tyan LGA771 motherboard and brought it back. I'll try populating it to see if it works in ippc11.
- Found two small fans. I'll try them in the power supply for ippc12.
- ipp064 sometimes shows 40 GB RAM when it boots, and other times 48 GB. It contains some inconsistent RAM. I'll try to find it and swap it out next time I'm there.
DEC 12
- Temporarily swapped two known good motherboards into the chassis for ippc11 and ippc12. They both worked fine, so this indicates that the motherboards in ippc11 and ippc12 are both broken. Perhaps the CPU's and RAM are okay, but we'll need a known good LGA771 motherboard to test them with.
DEC 8
- Swapped the RAM between ipp065, ipp066, and ipp063. ipp065 and ipp066 either sometimes or always indicated only 40GB, instead of 48GB. When we moved the RAM between the different machines, the good RAM from ipp063 always worked, the bad RAM always failed, and the inconsistent RAM continued to behave inconsistently, so the problem is that we have at least two bad sticks of RAM -- one which always fails and one which fails intermittently.
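For machines that intermittently under-report RAM like this, logging what the BIOS hands the OS on each boot would show which boots lose a DIMM. A sketch assuming `dmidecode -t 17` style output (the sample text below is illustrative, not captured from ipp065/ipp066; on a real node the heredoc would be replaced by the live command, which needs root):

```shell
# Sum the per-DIMM sizes the BIOS reports (dmidecode -t 17 format).
# Hypothetical sample stands in for `dmidecode -t 17` output.
sample='	Size: 4096 MB
	Size: 4096 MB
	Size: No Module Installed
	Size: 4096 MB'

# Add up the populated slots; empty slots say "No Module Installed".
total_mb=$(printf '%s\n' "$sample" | awk '/Size: [0-9]+ MB/ {sum += $2} END {print sum + 0}')
echo "detected: ${total_mb} MB"
```

Run from a boot script and appended to a log, a drop in the total on a given boot points at the flaky stick without having to catch it at the console.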
DEC 7
- Tested the power supplies on ippc11 and ippc12. ippc11's power supply's -12v line wasn't working. ippc12's power supply produces the right voltages, but the exit fan doesn't spin, which decreases the airflow to about 25% of what it should be. Found a spare power supply for ippc11. Replaced the motherboard in ippc11, but the "new" motherboard doesn't work.
NOV 21
- Cindy, Bill, and Rita swapped the motherboard and CPU's from ippc12 into ipp029
NOV 15
- added auto-monthly consistency checks for ippb machines
- switched the names for the ippb00/03 machines - updated the auto-mount configuration (ippb00.0 -> ippb00.2) to reflect the physical array swaps that were done.
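Monthly consistency checks like these are typically driven from cron. A hypothetical fragment assuming Linux md(4) software RAID - the array name, schedule, and file path are illustrative assumptions, not the values actually deployed on the ippb machines:

```shell
# Hypothetical /etc/cron.d/raid-check fragment (not the deployed file).
# At 03:00 on the 1st of each month, start an md(4) consistency check
# by writing "check" to the array's sync_action node; "md0" is an
# assumed array name.
#
#   0 3 1 * * root sh -c 'echo check > /sys/block/md0/md/sync_action'
```

Progress and any mismatch count can then be read back from /proc/mdstat and /sys/block/md0/md/mismatch_cnt after the check completes.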
Nov 10
- Haydn visited MHPCC.
- Installed a new BBU in ippdb02.
- Checked ipp029, because it was reported to crash quickly under load. The CPU heat sinks had not been installed correctly: heat sink compound covered only ~50% of the CPU's. Removed the old heat sink compound and used Arctic Silver 5. We should probably do this for all of the machines in this series -- just because they aren't crashing doesn't mean they were installed correctly. They are probably simply throttling themselves to avoid thermal failure.
- Checked ippc11, because it was reported to crash under load every few days. Burned myself slightly when my bare forearm touched the bottom of the case -- this is a 1U machine, and the power supply is incredibly hot (~55 deg C). Removed the CPU heat sinks, but they had originally been installed correctly, so I re-racked the machine. Other machines with the same design have much cooler power supplies, so perhaps this power supply is failing, or one of the fans in it has failed. Because this is a 1U machine, the CPU heat sinks depend on the cover being on to get good air flow across them, but I can try running it with the top off for a few seconds to check whether the power supply fans are working. There is also a spare power supply on site which might work in this machine next time I'm there.
- Reseated the memory in ipp065, but it still reports only 40GB instead of 48GB.
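The throttling hypothesis for ipp029 could be checked from the kernel log on the next visit. A sketch (the log lines below are hypothetical samples, not output captured from these nodes; on a live machine you would feed in `dmesg` instead):

```shell
# Count CPU clock-throttle events in kernel-log output.
# Hypothetical sample lines stand in for real `dmesg` output.
sample_log='CPU0: Temperature above threshold, cpu clock throttled
CPU1: Core temperature above threshold, cpu clock throttled
eth0: link up'

throttle_count=$(printf '%s\n' "$sample_log" | grep -c 'cpu clock throttled')
echo "throttle events: $throttle_count"
```

A nonzero count on a machine that "isn't crashing" would confirm it is quietly losing performance to bad heat sink contact.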
Nov 7
- team racking and installation of 13 5U's
- put cpu back into ipp036
- put ippc10 back in rack
Nov 4
- ippb00 - Haydn replaced the backplane for ippb00 a0 - drive 3 now detected. All drives put back in the unit - parts sent back to YMI.
Nov 1
- Installed ASUS MOBO into ippc10
Oct 27
- Installed spare ASUS MOBO into ippc10 - CPU heat sinks do not install properly - brought ippc10 back to ATRC - ordered proper CPU carriage frame from YMI
Oct 25
- ippb00 - replaced cable, RAID card, and sled - still not detecting drive A0 3 - ordered a backplane
Oct 24
- ipp036 has a dead CPU0 fan and is unable to boot.
- Swapped the MOBO out of ippc10 into ipp036 (still in rack - unable to remove it with only me working). Left ipp036 with 1 processor - not sure if the CPU is damaged - didn't want to risk hurting the MOBO of a running system.
- Installed 6 of the 8 3TB drives in the STARE nodes. 2 drives were DOA and sent for RMA.
Oct 19
- replaced ippb00 A0 drive 3, but it is still not being detected. Worked with LSI to debug - needs a hardware fix. Ordered parts from YMI.
Oct 10
- Swapped CPU0 memory with CPU1 memory
Sept 30
- installed 3TB disks in slots 2&3
Sept 20
- Bill and Haydn installed the new MOBO on ipp020
Sept 13
- Bill and Haydn installed the new MOBO on ipp021
Aug 31
- ipp010, ipp017, ipp020 (disk upgrades) and ipp021 (mobo upgrade).
Aug 30
- Replaced ipp014's RAID controller.
Aug 23 2011
- Haydn, Bill, and I were able to upgrade the MB and disks on ipp004, ipp013, and ipp018 without incident.
- We corrected the existing error in the RAID card cabling in ipp004, so now the drives are seated in the same order as on the other machines. (Funnily, we discovered and remembered this problem because we had missed swapping a drive and put a 500GB back in, which showed up in the wrong slot.)
- ipp037 has a new MB.
- Gavin installed the OS and built the RAID - he was instrumental once again in supporting us while at MHPCC.
- Haydn did all the drive replacements
Aug 16
- Installed the replacement MOBO for ippc10
Aug 15
- ipp011
- Upgraded to new MB and 2 TB drives. Had to move CPU0 memory stick from outer blue slot to middle black slot.
- Supposed to fill the blue slots first, but the memory was not detected in the outer blue slot. Same weirdness seen in ipp009 (an ASA chassis, according to Gavin).
- ipp019
- Upgraded to new MB and 2 TB drives. Memory all in blue slots and detected
- The 1st new MB installed did not detect any memory in the cpu0 no matter where we put the sticks. Tried a second new MB using all the same memory and CPU's - with good results. Will RMA the 1st new MB.
- ipp015
- Upgraded to 2TB drives. On reboot saw only 12GB of memory. For this machine the memory is supposed to be in all black slots; however, we noted that it was in a mixture of black and blue slots. Moved it to all blacks. All memory detected.
Aug 1
- Swapped all memory out of ippdb03 - formerly ippc00 - using the memory from ippc10 - waiting for a MB
- Added 2 2TB disks to ippc10 to become the newest rendition of ippdb03
July 29
- Added new memory to ippb00 and ippb01 - they now have 32GB each.
July 27
- Spent AM diagnosing the non-boot problem in ipp030 - failed MB
- After lunch
- Took failed MB out of ipp030 and replaced it with MB from ippc10.
- ippc10 remains out of the rack.
July 15
- Swapped the 1st 3 mem sticks out of ippdb00
May 16
- replace a raid card in ipp005
Feb 9
- Bill and I replaced the motherboards of ipp005, ipp006, ipp007, ipp025
Notables: ipp006 & ipp025 memory counts were missing 4GB. We identified which slot was at fault and put the stick in a black slot next to a working blue slot. All 24GB are now showing on both machines. The malfunctioning slots on the 2 machines were not the same. We tested the power cables, and while we found some odd readings, they were not consistent between the machines or correlated with the CPU's with the malfunctioning slots.
The boards installed on the back of the CD drive are not working. We still have external CD drives hanging off the machines. We need to come up with a solution.
At some point, this fix should work for ipp009 so we have all 24GB available.
ipp006 - voltage readings for the CPU0 8-pin power connector:
+5V: 5.2, +12V: 12.5, +3.3V: 3.4, -12V: 11.9, +12V: 12.2*, 5VSB: 5.4*, PG: 410ms (* blinking)
Voltage readings for the CPU1 8-pin power connector: same as CPU0, except PG was 590ms and blinking too.
The CPU1 slot closest to the CPU was not working.
ipp025 - voltage readings for the CPU0 24-pin and 8-pin connectors: same display as for ipp006. CPU1 8-pin: same, PG 430ms.
Feb 4
We installed the new RAID card in ipp012 without problems.
However, we did not have success with the IDE->SATA converter board for the CD/DVD. The drive was not being detected on boot. We tried all SATA ports, a different card, a different CD/DVD, and a different cable, to no avail. We checked the BIOS and there was nothing we could see that would prevent it from working. The external drive has been reattached. We brought a card and a CD/DVD player back to test and get working here. It may need a jumper set to make sure it is not in slave mode; however, we did not have a manual or a jumper - the boards did not come with one.
We also had the usual rack problems - ipp012 is particularly difficult to put back in - it is at the bottom of the rack and ipp013 is weighing down on it a bit.
Jan 31
disk upgrades on ipp008, ipp016, and ipp021, and motherboard upgrades on ipp008 and ipp016.
Jan 24
We put the power supplies from ipp009 in and it booted normally. However, there are still fan-related alarms sounding, though all the fans are working. We swapped out the led/alarm board with the same results.
ipp009 is up and ready for use but still short 4GB of memory. After many tests we can conclude that the problem lies in something other than the motherboard or memory: we had the same problem with a known good motherboard and memory. We narrowed it down to a set of black and blue memory slots associated with CPU0 - not the slots themselves, but something that serves the slots. We know there is a problem with 2 slots because 12 x 4GB memory sticks only showed 40GB of memory. Also: the power supplies from ipp014 were put into ipp009 and there didn't seem to be any issues, so we have no idea why ipp014 started booting normally after a power supply swap. Note: the ipp009 memory problem occurred with the good power supplies as well.
Jan 21
replaced motherboard hardware and disks for ipp009, ipp012, ipp014.
ipp012: new disk 11 had to be swapped out. ipp009: 6 4GB sticks were installed but only 5 are showing. All the memory has been swapped out for new memory with the same result. ipp014: we hear continuous beeps on power-on, changing to 2 short beeps and a long one after something appears on the screen. However, the 2 beeps and a long one are not constant - they become random. We don't see any memory report on the first screen like on the others. We changed the motherboard and memory with identical results. There was a dead fan issue, but we rewired it and all the fans are working and the fan light is off - no effect on the beep pattern. The initial quick short beeps sound like memory beeps, but it's weird that we get it to boot at all. Left out of the rack.
The new motherboards do not have an IDE connection for the existing optical drives. We (Bill) drove to the only 2 computer stores on the island and purchased the only 2 notebook-size optical drives with SATA connectors available. However, once we installed these, we realized the connectors were something we had not seen before - and we did not have the cables for them. At first glance they look like SATA connections, but there is a power component as well. Oops. Fortunately Brad had external USB optical drives we could use - hundreds of them, in fact. Gavin had to MacGyver them in place because the cables were too short to set them anywhere.
2010 Production Node Work
(Up to Production Cluster Status)
Sept 15
- Removed cable obstructing power supply fan in ippc11
Sept 13
- Replaced CPUs (with new ones) and failed CPU fan in ipp014
Aug 23
- Swapped Memory out of IPP14
Aug 6
- Swapped Memory out of IPP12
- Upgraded IPP006 and IPP007 with 2TB drives
Jun 22
- Swapped Memory out of IPP14
April 27
- ipp018
- swapped memory -> continuous beeps
- Brad noticed one row of the hard drive array did not light up with red lights
- Reseated SATA cables to backplane and RAID card
- Swapped in set of new memory
April 26
- Swapped in larger RAID card memory module in ipp037
- ipp018
- reseated memory -> still 2G memory
- swapped CPUs -> still 2G memory showing
- swapped in new set of memory (all 8 sticks) -> no boot rapid beeps
April 16
- Swapped two new CPUs into ipp037 and booted it into .31 kernel
April 8
- Swapped positions of ipp005 and ipp037 in rack
- Attached ipp037 SATA cables to the RAID card so they're mapped properly
- Noted fan warnings on nodes ipp014, 016, 008, 007, 004, 025, 024, and 045 (appear to be due to fan power board failures)
April 1
- ipp008
- Put back into rack
- Still missing single HD array fan
- ipp018
- New motherboard; 2 new CPU fans (Sunon)
- Put back in rack
- ipp037
- New motherboard; 2 new CPU fans (Sunon)
- Put back in rack
- Array cables not mapped to RAID card correctly
- ipp014
- Unsuccessfully attempted to get one of the HD array duo fans working
- Appears to be an issue with the connector
March 22
- ipp018
- Boots but goes to A:\>
- Pulled CMOS battery; reset MB w/ jumper; no help
- ipp008
- No boot
- Rapid beeps
- One pwr supply dead
- Swapped in new MB
- Replaced CPU fans with those taken from ipp037
- Single HD array fan not working (appears to be module connector)
February 22
- Confirmed light on ippc17 drive
- Installed additional fan in ippc18 (small fan on right side)
February 16
- ipp008
- No boot
- rapid beeps -> then 1 sec beep
- ipp018
- No boot
- Both CPU fans not working - one has broken wire
- No codes on MB
- No lights on power supplies
- ipp037
- No RAID card
- CPUs installed w/o paste
February 11
- Installed an additional fan in ippc19 (small fan on right side)
- Swapped CPUs
- ipp018 -> ipp005
- ipp037 -> ipp008
- Notes: ipp018 had excessive paste, bent pins on CPU2 (on MB), and an apparently scorched heatsink
NODE SWAPPING MAP
*Wave1 IPP17 died. Its disks were swapped into Wave2 IPP37
- So IPP17 is a Wave2 Machine.
*The broken Wave1 Machine (previously IPP17) was sent back. When it came back
- It had a different motherboard and was incomplete - missing fans.
- The disks from a newly broken IPP14 were swapped in.
*So IPP14 became IPP37.
- IPP14 (now IPP37) was out of warranty, so we decided to swap the motherboard and CPU's
- Summary:
- IPP14 is a Wave 1 ASA Machine with a different motherboard.
- IPP17 Is a Wave 2 YMI Machine
- IPP37 is a Wave 1 Machine
