
The Case:
Recently I was working on a customer support case about a "Blue Screen Of Death" that was supposedly caused by our software. User Mode programs normally can't trigger a BSOD directly; BSODs are typically the result of bugs in Kernel Mode drivers. The BSODs happened on all the computers that the customer was using with this particular application. The machines also had a custom PCI Express card and driver installed, supplied by another system vendor whose system our software was interfacing with. I suspected that the actual BSOD was caused by the driver of this card.
Memory.dmp:
First I looked at the kernel dump called "Memory.dmp". The settings for this file can be found under the "Startup and Recovery" system options.
I loaded the Memory.dmp file into WinDBG and ran the "!analyze -v" command. The resulting report gave me no helpful information other than the advice to analyze the crash in a live Kernel debugging session to get the complete call stack.
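The same dump analysis can also be started from the command line instead of the WinDBG menus. Here is a minimal Python sketch that only assembles such a command line. It relies on WinDBG's documented -z (open dump file), -c (run a command at startup) and -logo (write a log file) options. The function name and the file names are my own placeholders, and windbg is assumed to be on the PATH.

```python
from typing import List, Optional


def dump_analysis_args(dump_path: str, log_path: Optional[str] = None) -> List[str]:
    """Build a WinDBG command line that opens a crash dump and runs !analyze -v."""
    # -z opens the dump file, -c runs the given debugger command at startup.
    args = ["windbg", "-z", dump_path, "-c", "!analyze -v"]
    if log_path is not None:
        # -logo writes the debugger output to a log file (overwriting an old one).
        args += ["-logo", log_path]
    return args


# Placeholder file names; adjust them to where your dump actually lives.
print(dump_analysis_args("Memory.dmp", log_path="analysis.log"))
```

The returned list can be handed to subprocess.run() on the machine where WinDBG is installed. The log file is handy if you want to email the report to somebody.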
Live Kernel Debugging Session:
There is a great article on CodeProject.com that got me started with the setup of a Live Kernel Debugging session. First I connected the computer running WinDBG to the crashing computer with a null-modem serial (DB9) cable. The crashing machine's boot.ini then needs to be modified so that Windows starts in debugging mode and accepts commands from the serial port. Here is an example of a modified boot.ini of a Windows XP SP2 system; the last entry is a copy of the normal one with the debug switches added:
[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS.0="Microsoft Windows XP Professional" /fastdetect
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS.0="Microsoft Windows XP Professional" /fastdetect /debug /debugport=COM1 /baudrate=115200
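The debug entry is just the normal entry with three switches appended, so it is easy to generate. Here is a small Python sketch of that step. The helper name make_debug_entry is hypothetical, and the default port and baud rate are simply the values from the example above. It only builds the text of the line; editing the real boot.ini is still up to you.

```python
def make_debug_entry(entry: str, port: str = "COM1", baud: int = 115200) -> str:
    """Return a boot.ini [operating systems] entry with serial debug switches appended."""
    entry = entry.rstrip()
    # Leave the entry alone if it already carries the /debug switch.
    if "/debug" in entry.lower().split():
        return entry
    return f"{entry} /debug /debugport={port} /baudrate={baud}"


normal = r'multi(0)disk(0)rdisk(0)partition(1)\WINDOWS.0="Microsoft Windows XP Professional" /fastdetect'
print(make_debug_entry(normal))
```

Keeping the original entry next to the debug entry means the machine can still be booted normally when no debugger is attached.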
Now it's time to shut down the Crash computer and start up the WinDBG debugger on the remote computer. Choose File->Kernel Debug from the menu (or press Ctrl+K). This opens a dialog that allows the configuration of the serial communication port.
Push the OK button and the Kernel Debugger goes into the "Waiting for Reconnect..." state. Now it is time to boot the Crash computer. Make sure that the [debugger enabled] boot option is selected.
The debugger will now automatically attach to the kernel on the Crash computer.
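The serial settings from the Kernel Debug dialog can also be passed on the command line with WinDBG's -k option. Here is a minimal Python sketch that assembles that command line. The function name is my own invention, and the defaults assume that the debugging computer also uses COM1 at 115200 baud. Note that this is the port on the computer running WinDBG, which does not have to be the same port number as the /debugport on the Crash computer.

```python
from typing import List


def kernel_debug_args(port: str = "COM1", baud: int = 115200) -> List[str]:
    """Build a WinDBG command line for a kernel debugging session over a serial cable."""
    # -k selects kernel debugging; the com: string names the local port and speed.
    return ["windbg", "-k", f"com:port={port},baud={baud}"]


print(kernel_debug_args())
```

The baud rate has to match the /baudrate value in the boot.ini of the Crash computer, otherwise the debugger will not be able to talk to the kernel.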
At this point the debugger is set up to break into the kernel as a result of a system error. I was able to find simple reproduction steps that triggered the system crash reliably. Once the debugger had stopped on the crash, I ran the !analyze -v command. I exported the kernel debug analysis report and emailed it to the hardware vendor whose driver I suspected to be the cause of the blue screen.
They were finally convinced that they needed to have a closer look at this issue. They found a bug in their driver that was related to multi-core CPUs.
Summary:
Anybody can do live kernel mode debugging. There are just a few steps and they work most of the time. It also sounds very impressive if you call the Technical Support of the misbehaving device and tell them: "I have a BSOD, I did a kernel debug, where can I send the debugging report?"