Unstable Linux

One thing I will never understand about this Linux hype is how miserably the operating system fails in the wake of hardware problems. Coming home from dinner, I find this all over my consoles:

Message from syslogd@piper at Sat Jan  7 22:53:35 2006 ...
piper kernel: Oops: 0002 [18] 

Message from syslogd@piper at Sat Jan  7 22:53:35 2006 ...
piper kernel: CR2: 0000004000000004

Since then, I cannot start new processes anymore (though apache2 and old processes work just fine), which means I cannot SSH into the box (which is far away from where I am right now), and thus it's become useless.

The problem could be anything: corrupt memory, a broken CPU, a hard drive with bad blocks in the swap area, etc. Since Linux is obviously able to wall in response to a problem, I only wish it wouldn't pout afterwards but would handle the event more gracefully.

If only I had the time and energy to finally wave goodbye and choose NetBSD...

Update: I managed to get a dmesg output:

<1>Unable to handle kernel paging request at 0000004000000004 RIP: 
PGD 72cee067 PUD 0 
Oops: 0002 [18] 
CPU 0 
Modules linked in: rfcomm l2cap ipv6 af_packet ipt_REJECT ipt_state
iptable_filter iptable_nat ip_conntrack ip_tables deflate zlib_deflate
twofish serpent aes blowfish des sha256 sha1 md5 crypto_null af_key usbhid
hci_usb bluetooth raid5 xor dm_mod sbp2 ide_generic ide_cd eth1394
snd_seq_dummy snd_seq_oss snd_seq_midi snd_seq_midi_event snd_seq
snd_via82xx gameport snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd
i2c_viapro soundcore i2c_core ehci_hcd via82cxxx ohci1394 shpchp pci_hotplug
sk98lin uhci_hcd ide_core ieee1394 rtc parport_pc parport floppy psmouse
pcspkr serio_raw evdev xfs exportfs sr_mod cdrom sd_mod sata_via
sata_promise libata sg scsi_mod raid1 md unix fbcon tileblit font bitblit
vesafb cfbcopyarea cfbimgblt cfbfillrect softcursor
Pid: 136, comm: kswapd0 Not tainted 2.6.12-1-amd64-k8
RIP: 0010:[<ffffffff80152c54>] <ffffffff80152c54>{find_get_pages+36}
RSP: 0018:ffff81007f90dcc8  EFLAGS: 00010002
RAX: 0000004000000000 RBX: ffff81007f90dd08 RCX: ffff81007f90dd10
RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff81000289f678
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff81007113f670
R10: 0000000000000040 R11: 0000000000000000 R12: ffff8100705e54e0
R13: 000000000000007a R14: ffffffffffffffff R15: ffff8100705e55f8
FS:  00002aaaab730d70(0000) GS:ffffffff8040f940(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000004000000004 CR3: 0000000072925000 CR4: 00000000000006e0
Process kswapd0 (pid: 136, threadinfo ffff81007f90c000, task ffff81007f906760)
Stack: ffff81007f90dcf8 ffffffff8015bca7 ffff8100705e54f0 ffffffff8015c905 
       ffff810001172200 0000000000000000 0000000000000000 0000000000000000 
       ffff81000289f678 0000004000000000 
Call Trace:<ffffffff8015bca7>{pagevec_lookup+23}
           <ffffffff8010f0f7>{child_rip+8} <ffffffff8015de10>{kswapd+0}

Code: ff 40 04 ff c2 48 83 c1 08 39 d6 75 f0 fb 5b 89 f0 c3 66 66 
RIP <ffffffff80152c54>{find_get_pages+36} RSP <ffff81007f90dcc8>
CR2: 0000004000000004
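
For readers who want to squeeze a little more out of a dump like this one: on x86, the number after "Oops:" is the hardware page-fault error code, and its low bits say what kind of access failed. The following is a minimal sketch of decoding it; the function name decode_page_fault_code is made up for illustration, and only the five architecturally defined low bits are handled.

```python
def decode_page_fault_code(code):
    """Decode the low bits of an x86 page-fault error code.

    Hypothetical helper for illustration; covers only bits 0-4.
    """
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x01 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x02 else "read",
        # bit 2: 0 = fault happened in kernel mode, 1 = in user mode
        "mode": "user" if code & 0x04 else "kernel",
        # bit 3: a reserved bit was set in a page-table entry
        "reserved_bit_set": bool(code & 0x08),
        # bit 4: the fault was caused by an instruction fetch
        "instruction_fetch": bool(code & 0x10),
    }


# The value is printed in hexadecimal, so parse "0002" as base 16.
info = decode_page_fault_code(int("0002", 16))
for key, value in info.items():
    print(f"{key}: {value}")
```

For the oops above this reports a write, in kernel mode, to a page that is not mapped, which matches the "Unable to handle kernel paging request" line: the kernel tried to write to the address shown in CR2.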

Now, how does a mere mortal diagnose this? This was the third in a series of oopses; the others occurred in a find process. The machine runs a software RAID5, so it really could be anything, right?
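
One small thing a mere mortal can do is look at the numbers themselves. The faulting address in CR2 is exactly RAX plus 4, and RAX holds a value with a single bit set. The sketch below checks both observations. Treating the intended value as zero is purely an assumption made for illustration, and the helper name flipped_bits is made up. A pointer that is one bit away from a plausible value is often read as a hint at a flipped bit in memory, but it is only a hint and proves nothing about the cause.

```python
def flipped_bits(observed, expected=0):
    """Return the bit positions in which two values differ.

    Hypothetical helper; expected=0 is an assumption, not something
    the dump tells us.
    """
    diff = observed ^ expected
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# Register values copied from the dmesg output above.
rax = 0x0000004000000000
cr2 = 0x0000004000000004

print(flipped_bits(rax))  # [38]
print(cr2 - rax)          # 4
```

If running a memory tester on the box is possible, that would be the obvious next step to confirm or rule out bad RAM.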