The few days before Christmas 2003 were slightly more exciting than
usual for the Hermes admins. On Sunday 14 December, a disk failed in
the old Hermes machine called green. The machine was
tweaked so that it no longer used the disk, but otherwise it remained
in service.
This turned out to be a bad idea, because a week later the
failed disk started crashing the rest of the machine, which had to be
hurriedly taken out of service. David fixed green by
removing the disk, but didn't restore it to service because it was the
Monday before Christmas and it could wait until after the holidays.
This turned out to be a good idea, because on the evening of Monday
22 December the old Hermes machine called orange died
suddenly. David renamed green to take the place of orange
to keep Hermes working properly, until the problem
could be investigated the next morning.
On Tuesday 23rd we went into the machine room to discover the
ominous smell of burnt electronics. Something had gone seriously wrong
with orange: it was not just a failed power supply or disk, and there
was an unusual panic message:
Dec 22 21:20:01 orange.csi.cam.ac.uk unix: WARNING: [AFT0] Stickpanic: ptl1 trap reason 0x2
TL=0x1 TT=0x68 TICK=0x80071d0045836cb8 TPC=0x1002a4c8 TnPC=0x10134ce0 TSTATE=0x80001e03
TL=0x2 TT=0x68 TICK=0x80071d0045836c4e TPC=0x10006aa4 TnPC=0x10006aa8 TSTATE=0x4480001504
Softerror encountered on Memory Module 1703
panic[cpu0]/thread=40033e40: Kernel panic at trap level 2
10406180 unix:sys_tl1_panic+8 (200000, 33e80, 200000, 40032040, 1e, 14)
  %l0-7: 00000003 00001c00 80001e03 10006c34 00000001 00000000 0000000f 104061e0
10406270 SUNW,UltraSPARC-II:cpu_ce_scrub_mem_err+4c (0, 6, 360fa640, 40032040, d813bf00, 0)
  %l0-7: 00000000 7ae321b0 80001e01 101329f0 00000001 00000000 0000000e 40032710
40031fe0 SUNW,UltraSPARC-II:cpu_ce_error+1a4 (0, 0, 0, 0, 0, 0)
RED State Exception
After the holidays, when we called out Sun's on-site support, we discovered how badly the machine had broken, and how lucky we had been that it didn't set off the fire alarm (which would have taken the rest of the machine room down with it). The Sun engineer said it was the worst "thermal event" he had seen.
In mid-April, I received a phone call from a Sun UK manager saying that Sun were upset by this web page and would like it to be taken down. Although it was down for a while, I have put it back since there is no reason to be embarrassed about a machine failing after five years of heavy use. And my boss likes this page better than he likes Sun.
Pictures of the failed machine:
<fanf2@cam.ac.uk>
$Cambridge: hermes/doc/misc/orange-fire/index.html,v 1.9 2005/11/02 15:42:00 fanf2 Exp $