I’ve experienced random file-system crashes on a Dell PowerEdge 2650 server with a Perc 3/Di SCSI controller, running a Debian testing system with the standard 2.6.8 Debian kernel (i686+smp), mainly during disk-intensive operations (for instance, I suspect such a crash happened when amanda backup tasks were launched on the machine).
There have been numerous discussions on the linux-poweredge mailing-list and many proposals for fixing this issue (see details on google).
The symptoms look like this:
Jun 9 20:52:58 myhost kernel: aacraid: Host adapter reset request. SCSI hang ?
Jun 9 20:52:58 myhost kernel: aacraid: Host adapter reset request. SCSI hang ?
Jun 9 20:52:58 myhost kernel: aacraid: SCSI bus appears hung
Jun 9 20:52:58 myhost kernel: aacraid: SCSI bus appears hung
Jun 9 20:52:58 myhost syslogd: /var/log/messages: Read-only file system
Jun 9 20:52:58 myhost kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
Jun 9 20:52:58 myhost kernel: SCSI error : <0 0 0 0> return code = 0x6000000
Jun 9 20:52:58 myhost kernel: end_request: I/O error, dev sda, sector 401836233
Jun 9 20:52:58 myhost kernel: scsi0 (0:0): rejecting I/O to offline device
Jun 9 20:52:58 myhost kernel: scsi0 (0:0): rejecting I/O to offline device
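To check how often this has happened, the two aacraid messages from the excerpt above can be counted in a saved log. This is only a minimal sketch: count_aacraid_hangs is a hypothetical helper name, and the log path in the example is an assumption (once the file system has gone read-only, syslogd can no longer write to /var/log/messages, so the messages may only survive on a remote syslog host or the console).

```shell
#!/bin/sh
# Count aacraid reset/hang messages in a syslog-style file.
# The two patterns are copied from the log excerpt in the post;
# the function name is hypothetical.
count_aacraid_hangs() {
    logfile="$1"
    # -c prints the number of matching lines, -E enables the (a|b) alternation
    grep -c -E 'aacraid: (Host adapter reset request|SCSI bus appears hung)' "$logfile"
}

# Example (path is an assumption):
#   count_aacraid_hangs /var/log/messages
```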
I think I have come closer than ever to a solution by applying the following steps:
- upgrading the firmware of the Perc 3/Di controller: look at the Dell site for the right version…
- disabling the cache with afacli:
# afacli
open AFA0
AFA0 container set cache /read_cache_enable=FALSE /write_cache_enable=FALSE 0
AFA0 container show cache 0
Executing: container show cache 0
Global Container Read Cache Size : 0
Global Container Write Cache Size : 118259712
Read Cache Setting : DISABLE
Write Cache Setting : DISABLE
Write Cache Status : Inactive, cache disabled
- patching the 2.6.8 aacraid driver’s code with the following patch: aac-remove-handle-aif.patch, to avoid taking the controller offline in some circumstances (see the explanation in this post: http://marc.theaimsgroup.com/?l=linux-scsi&m=110252243627410&w=2):
  - get the kernel-source-2.6.8 package from stable
  - unpack it and apply the patch
  - get the running (uname -r) kernel’s .config from /boot and copy it to /usr/src/kernel-source-2.6.8/
  - make-kpkg clean
  - make oldconfig
  - make-kpkg --append_to_version=patchaacremovehandleaif --initrd kernel_image
  - install the resulting kernel, and reboot
- pray 😉
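The rebuild steps above can be sketched as a small script. Several things here are assumptions and not from my notes: the source tree is taken to be already unpacked in SRC_DIR, the patch is taken to be saved at PATCH_FILE, the -p1 strip level may need adjusting for the patch, and the run and rebuild_patched_kernel helpers are hypothetical names. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of the kernel rebuild steps. Paths below are assumptions.
SRC_DIR=${SRC_DIR:-/usr/src/kernel-source-2.6.8}                # assumed unpack location
PATCH_FILE=${PATCH_FILE:-/usr/src/aac-remove-handle-aif.patch}  # assumed download location
DRY_RUN=${DRY_RUN:-1}                                           # 1 = print only, 0 = execute

# Print the command in dry-run mode, otherwise run it in the current shell.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

rebuild_patched_kernel() {
    run cd "$SRC_DIR" || return 1
    # -p1 is an assumption about the path prefix used inside the patch
    run patch -p1 -i "$PATCH_FILE" || return 1
    # Debian stores the running kernel's configuration as /boot/config-<version>
    run cp "/boot/config-$(uname -r)" .config || return 1
    run make-kpkg clean || return 1
    run make oldconfig || return 1
    run make-kpkg --append_to_version=patchaacremovehandleaif --initrd kernel_image || return 1
}

# Example: review the plan first, then run it for real.
#   rebuild_patched_kernel
#   DRY_RUN=0; rebuild_patched_kernel
```

The resulting kernel-image .deb ends up one directory above the source tree and still has to be installed with dpkg -i before rebooting.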
The machine had been working almost OK on Debian’s 2.6.8 kernel with the cache disabled and the firmware upgraded, but it finally crashed again…
I hope that the patch against aacraid driver will solve the issue.
I have been dealing with the same issue, and I applied the patch. It seems to be crashing a lot less, but it is still crashing. Have you had any luck with anything else?
Thanks,
–jeremy
Well… it seems better, but I think I also managed to improve things by avoiding situations of very high disk usage, such as running the amanda backup sessions at standard priority… I added a nice wrapper around the tar commands in amanda, and it seems to be much better like that.
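A wrapper of that kind could look roughly like the sketch below. It is not the exact wrapper from the machine: nice_tar, REAL_TAR and NICE_LEVEL are hypothetical names, /bin/tar is an assumed path, and how amanda is pointed at the wrapper instead of the real tar is left out here.

```shell
#!/bin/sh
# Hypothetical wrapper: run the real tar at a reduced scheduling priority.
REAL_TAR=${REAL_TAR:-/bin/tar}   # assumed location of the real tar binary
NICE_LEVEL=${NICE_LEVEL:-19}     # 19 is the lowest priority nice accepts

nice_tar() {
    # Pass every argument through unchanged to the real tar
    nice -n "$NICE_LEVEL" "$REAL_TAR" "$@"
}

# As a standalone wrapper script, the body would reduce to:
#   exec nice -n 19 /bin/tar "$@"
```

Note that nice lowers the CPU scheduling priority only; it does not throttle disk I/O directly, so any improvement probably comes from the backup competing less aggressively overall.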
A very basic question — how did you manage to install Debian on a Dell PowerEdge 2650 at all? I got stuck with the Debian installer on the boot disk not recognizing the RAID 5 (modprobe -v aic7xxx).
Experiments at the boot prompt such as linux aic7xxx=no_probe aacraid didn’t help.
Dunno how we actually did the install… probably upgraded from initial install quite a long time ago…
Google should be your friend on that one :/
My drive has been unmounting itself as described. Very annoying. Tonight, I applied the kernel driver patch. The system booted up fine and seems OK so far.
I checked out the driver patch and the kernel source code. It looks like the patch got officially applied in version 2.6.11. The source code lines deleted by the patch are there in 2.6.10, but are gone in 2.6.11.
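Going by that reading of the source, a quick check of whether a given kernel version is at or past 2.6.11 might look like the sketch below. kernel_has_aif_fix is a hypothetical name, the 2.6.11 threshold is only my observation from the source trees, distribution kernels may carry backports that this cannot see, and sort -V assumes GNU coreutils.

```shell
#!/bin/sh
# Succeeds if the given (or running) kernel version is >= 2.6.11.
# The threshold comes from comparing the 2.6.10 and 2.6.11 sources,
# not from an official changelog.
kernel_has_aif_fix() {
    ver=${1:-$(uname -r)}
    base=${ver%%-*}    # drop a suffix such as -2-686-smp
    # Version-sort the two strings; if 2.6.11 sorts first (or ties), base >= 2.6.11
    lowest=$(printf '%s\n%s\n' "2.6.11" "$base" | sort -V | head -n 1)
    [ "$lowest" = "2.6.11" ]
}

# Example:
#   if kernel_has_aif_fix; then echo "patch should already be in"; fi
```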
I’ve installed Debian 3.1 a couple of times on the 2650. The first time, I think I started with the 2.4 kernel, then upgraded to 2.6. The upgrade hung on boot and I had to add the kernel option pci=noacpi to /boot/grub/menu.lst:
kernel /boot/vmlinuz-2.6.8-2-686-smp root=/dev/sda1 ro pci=noacpi
This time around, I only have 2.6 kernels on the hard drive, so I guess I didn’t start with 2.4. Also, I have Lilo (no Grub) and there is no pci=noacpi option in lilo.conf. It’s likely that this time I used a custom install CD that I’ve been using to get nVidia SATA support at install time.