RadeonHD and kompmgr, Part II

With EXA support enabled in the xorg.conf for radeonhd, I started to get X.org lock-ups when resuming from s2disk on the 2.6.28.8 kernel at random times; sometimes after 2 days of uptime, sometimes after just 15 minutes. Since I'm using a laptop with that funny Fn key layout, pressing alt-sysrq-k (SAK), alt-sysrq-u (remount read-only) and alt-sysrq-i (kill all tasks but init IIRC) makes the kernel believe I have pressed alt-sysrq-<insert numpad key here>, which doesn't do anything but set the kernel loglevel. Not being able to see anything on the pitch-black screen, all I have left then is alt-sysrq-s (sync system call) and alt-sysreq-b (reboot) and let fsck replay the ext3 journals while restarting.

Hearing that the drm kernel modules at the freedesktop.org repositories are not updated with the mainline kernel changes, I decided to try out the kernel 2.6.30-rc2. Fear. Despair. It locked up forever while suspending to disk.

Either it apparently did something more than mereley locking up, or the freedesktop.org drm and radeon kernel modules are more broken than I expected, but some hours after rebooting to 2.6.28.8 and doing my homework on the laptop, most of the running GUI programs crashed with random SIGSEGVs and SIGABRTs. Yes, what you read - even the KDE window manager got busted. Interestingly, I could bring it up again with some effort without restarting X, but then I noticed that one of the apps I was working with, Eclipse (3.4.x), didn't want to start again. When launched from the console, it would exit around 2 seconds after being invoked, with no explanation. After fiddling around with my ~/.eclipse dir, restarting X various times just to see knotify and artsd crashing like mad on KDE startup, I started session as another (dummy) user in X and tried to launch Eclipse 3.2 instead - its explanation was a bit more satisfactory: missing libraries.

"Of course", I thought, "the ldconfig cache may have become corrupted".

I kind of hit the nail:

bluecore:~# ldconfig
ldconfig: Cannot mmap file /usr/lib/libartsmidi.so.0.0.0.
ldconfig: Cannot lstat /usr/lib/libartsgui.so.0.0.0: Input/output error
ldconfig: Cannot lstat /usr/lib/libpangox-1.0.so.0.2002.3: Input/output error
ldconfig: Cannot mmap file /usr/lib/libwine.so.1.
ldconfig: Cannot mmap file /usr/lib/libpangox-1.0.so.
ldconfig: Cannot lstat /usr/lib/libpangoft2-1.0.so.0.2002.3: Input/output error
ldconfig: Cannot lstat /usr/lib/libakregatorprivate.so: Input/output error
ldconfig: Cannot mmap file /usr/lib/libartsmidi.so.0.
ldconfig: Cannot mmap file /usr/lib/libpangoxft-1.0.so.0.2002.3.
ldconfig: Cannot lstat /usr/lib/libartsgui_idl.so.0.0.0: Stale NFS file handle
ldconfig: Cannot lstat /usr/lib/libartsbuilder.so.0.0.0: Input/output error
ldconfig: Cannot mmap file /usr/lib/libpangoft2-1.0.so.0.
ldconfig: /usr/lib/libwine.so.1 is not a symbolic link

The 2.6.28.8 kernel was not amused by that attempt:

Apr 21 00:21:56 bluecore kernel: init_special_inode: bogus i_mode (3000)
Apr 21 00:22:44 bluecore kernel: init_special_inode: bogus i_mode (473)
Apr 21 00:26:40 bluecore kernel: init_special_inode: bogus i_mode (55000)
Apr 21 00:26:40 bluecore kernel: init_special_inode: bogus i_mode (0)
Apr 21 00:26:40 bluecore kernel: init_special_inode: bogus i_mode (71165)

My thoughts at that moment: WHAT THE F* HELL HAS HAPPENED!!!?!?! Considering that I hadn't run Wine for months, and it wasn't running at that moment either... I ran some ls's to see what had become of those libraries... it wasn't pretty. They had become character, block devices and FIFOs. 😕 I rebooted, and fsck still ignored the "clean" /usr (/dev/sda5) filesystem until I forced it to check it all filesystems again by setting their mount counts to some funny values using tune2fs.

Indeed, /usr was corrupted.

After init dropped me to a emergency console to run fsck manually on sda5, running e2fsck threw some interesting stuff, which I forgot to save to a file, but it was something like this:

inode ####### has 'compressed' flag set on an unsupported filesystem
inode ####### has invalid size
inode ####### (fifo) has non-zero size
inode ######## has 'compressed' flag set on an unsupported filesystem
((ad infinitum, ad nauseam))

After it finished, I noticed that the /usr/lost+found directory got populated with files and directories resembling parts of the Python 2.4 and 2.5 foundations installed. Of course, missing entire directories is a really bad sign. But I could recover the affect packages I noticed, with help of apt-get install --reinstall. Those where anything aRts, Python, wine, Pango and Akregator (which I actually remembered to reinstall right now, while writing this), although I reinstalled the Samba server I had installed just some hours before the crash, just in case. Apparently I didn't lose anything else. None of the other filesystems were affected by whatever smashed parts of /usr.

The obvious course of action then was blacklisting drm and radeon, and removing the Option "AccelMethod" "exa" line in xorg.conf, and nuking the 2.6.30-rc2 kernel image just in case. 😛 Eclipse and artsd worked again, and nothing else has shown any symptoms of being affected by fsck's decisions since then. It's been just 5 days though, and I have many apps lying around that I only use when asked/forced to, so I don't even remember their names.

I didn't say goodbye to kompmgr, however; it has a pretty decent performance and marginal CPU usage even while using a shadow framebuffer, with just one CPU core enabled at half of its maximum frequency, and even if the system is under load, so I figured that I really didn't need DRI or EXA all the time (radeonhd still doesn't have a opengl implementation, so having DRI enabled made no difference for clients) Still, it was nice to be able to scroll a huge konsole or Iceweasel/Firefox window without experiencing flickering.