Stop 0x124 Troubleshooting - 'New Method'
Before Reading: This tutorial is outdated. Please check my blog for the updated version, which explains all the errors and gathers all the references in one place. BSODTutorials (Check November 2013)
**Please read corrections (See link)**
Vir Gnarus - Post #4 - Corrections
I understand that most of the current BSOD analysts on the forum already use and understand the more efficient way of analyzing Stop 0x124 crashes pointed out by Vir Gnarus. However, I would like to create a page which enables new BSOD analysts to understand and use this method in their analysis.
Thanks to Vir Gnarus for explaining this method; he has already created a brilliant tutorial on Sysnative about how to debug Stop 0x124 PCI errors (see External Links).
The WHEA_UNCORRECTABLE_ERROR bug check has a value of 0x00000124. This bug check indicates that a fatal hardware error has occurred. This bug check uses the error data that is provided by the Windows Hardware Error Architecture (WHEA).

Here's the start of a 0x124 bugcheck (without any extensions used):
Code:
BugCheck 124, {4, fffffa800aaeb8d8, 0, 0}

Probably caused by : GenuineIntel

I've highlighted the areas which interest us, so let's break the very beginning of the crash down into simple parts.
The text in red (BugCheck 124) is the type of bugcheck; you can use this to find further information in the BSOD Index.
The text in blue (4) is the first parameter of the bugcheck. It describes the cause of the error, which in this case is 0x4: an Uncorrectable PCI Express Error.
0x4 can mean a PCI error as well as a PCI Express error.
The text in green (fffffa800aaeb8d8) is the second parameter, the address of the WHEA_ERROR_RECORD structure; we will use this address to extract some additional information.
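The breakdown above can be sketched in a few lines of Python. This is only an illustration, not part of any debugger: the function name `parse_bugcheck` and the `ERROR_SOURCES` table are made up for this example, and the table holds only the two Parameter 1 values discussed in this post.

```python
import re

# Only the two Parameter 1 values discussed in this post; other values exist.
ERROR_SOURCES = {
    0x0: "Machine Check Exception",
    0x4: "Uncorrectable PCI Express Error",
}

def parse_bugcheck(line):
    """Split a 'BugCheck CODE, {p1, p2, p3, p4}' line into integers."""
    m = re.search(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", line)
    if m is None:
        raise ValueError("not a BugCheck line")
    code = int(m.group(1), 16)
    # All values in the summary line are hexadecimal.
    params = [int(p.strip(), 16) for p in m.group(2).split(",")]
    return code, params

code, params = parse_bugcheck("BugCheck 124, {4, fffffa800aaeb8d8, 0, 0}")
source = ERROR_SOURCES.get(params[0], "not covered here")

# Parameter 2 is the address to hand to !errrec.
errrec_command = f"!errrec {params[1]:x}"
print(hex(code), "|", source, "|", errrec_command)
```

The last line builds the same debugger command that is used on the WHEA_ERROR_RECORD address in this tutorial.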
Code:
7: kd> !errrec fffffa800aaeb8d8
===============================================================================
Common Platform Error Record @ fffffa800aaeb8d8
-------------------------------------------------------------------------------
Record Id     : 01cdc0c21e2fc73d
Severity      : Fatal (1)
Length        : 672
Creator       : Microsoft
Notify Type   : PCI Express Error
Timestamp     : 11/12/2012 10:45:50 (UTC)
Flags         : 0x00000000

===============================================================================
Section 0     : PCI Express
-------------------------------------------------------------------------------
Descriptor    @ fffffa800aaeb958
Section       @ fffffa800aaeb9e8
Offset        : 272
Length        : 208
Flags         : 0x00000001 Primary
Severity      : Fatal

Port Type     : Root Port
Version       : 1.1
Command/Status: 0x0010/0x0000
Device Id     :
  VenId:DevId : 8086:3405
  Class code  : 030000
  Function No : 0x00
  Device No   : 0x00
  Segment     : 0x0000
  Primary Bus : 0x00
  Second. Bus : 0x00
  Slot        : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ fffffa800aaeba1c
  Device Caps : 00008020 Role-Based Error Reporting: 1
  Device Ctl  : 0000 ur fe nf ce
  Dev Status  : 0000 ur fe nf ce
  Root Ctl    : 0000 fs nfs cs

AER Information @ fffffa800aaeba58
  Uncorrectable Error Status    : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
  Uncorrectable Error Mask      : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
  Uncorrectable Error Severity  : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
  Correctable Error Status      : 00000000 adv rtto rnro dllp tlp re
  Correctable Error Mask        : 00000000 adv rtto rnro dllp tlp re
  Caps & Control                : 00000000 ecrcchken ecrcchkcap ecrcgenen ecrcgencap fep
  Header Log                    : 00000000 00000000 00000000 00000000
  Root Error Command            : 00000000 fen nfen cen
  Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
  Correctable Error Source ID   : 00,00,00
  Correctable Error Source ID   : 00,00,00

===============================================================================
Section 1     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa800aaeb9a0
Section       @ fffffa800aaebab8
Offset        : 480
Length        : 192
Flags         : 0x00000000
Severity      : Informational

Proc. Type    : x86/x64
Instr. Set    : x64
CPU Version   : 0x00000000000106a5
Processor ID  : 0x0000000000000007

As you can see, we used the !errrec extension with the address from the second parameter.
Code:
!errrec fffffa800aaeb8d8

The Port Type shows where the error occurred: it seems to have happened within the PCI/PCI Express port, with the error severity being Fatal. We can also check the AER (PCI Express Advanced Error Reporting) section to see exactly what happened.
I believe the capitalized parts are supposed to be the most interesting ones and show where the errors occurred; I think MTLP means Malformed TLP and ROF means Receiver Overflow.
In general, the parts indicate:
- UR = Unsupported Request Error
- MTLP = Malformed TLP
- SD = Surprise Down
- ROF = Receiver Overflow
- UC = Unexpected Completion
- CTO = Completion Timeout
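To see how the capitalized abbreviations relate to the hex value printed next to them, here is a minimal sketch that decodes an AER uncorrectable register value into flag names. The bit positions are my reading of the PCI Express AER uncorrectable error register layout and should be checked against the specification; the names `AER_UNCORRECTABLE_BITS` and `decode_aer` are invented for this example.

```python
# Assumed bit positions of the PCIe AER uncorrectable error registers,
# labelled with the abbreviations that !errrec prints.
AER_UNCORRECTABLE_BITS = {
    0: "UND", 4: "DLP", 5: "SD", 12: "PTLP", 13: "FCP", 14: "CTO",
    15: "CA", 16: "UC", 17: "ROF", 18: "MTLP", 19: "ECRC", 20: "UR",
}

def decode_aer(value):
    """Return the names of the bits set in value, lowest bit first."""
    return [
        name
        for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
        if (value >> bit) & 1
    ]

# Values copied from the AER Information block in the !errrec output.
severity_flags = decode_aer(0x00062010)  # Uncorrectable Error Severity row
status_flags = decode_aer(0x00000000)    # Uncorrectable Error Status row
print(severity_flags, status_flags)
```

With the values from this dump, the names decoded from 00062010 are the same four that !errrec prints in capitals on the Uncorrectable Error Severity row, while the all-zero Status row decodes to nothing.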
What device is causing the crash? We can search for the VenId or DevId in a PCI database, and then compare the vendor and device information it returns with the OP's msinfo32 file or the System Specifications they may have uploaded.
However, the device may not always be the actual cause; it could be the port it is using or the motherboard. We can use various stress tests and hardware swaps to confirm this.
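When working through many dumps, pulling the IDs out by hand gets tedious. The sketch below extracts the VenId:DevId pair from !errrec text and builds a `VEN_xxxx&DEV_xxxx` search string, which is the form Windows uses in PCI hardware IDs. The helper names `extract_ids` and `hardware_id_fragment` are hypothetical, and the sample string is just the relevant line from the output in this post.

```python
import re

def extract_ids(errrec_text):
    """Return (vendor_id, device_id) as lowercase hex strings, or None."""
    m = re.search(
        r"VenId:DevId\s*:\s*([0-9A-Fa-f]{4}):([0-9A-Fa-f]{4})", errrec_text
    )
    if m is None:
        return None
    return m.group(1).lower(), m.group(2).lower()

def hardware_id_fragment(vendor_id, device_id):
    """Build the fragment to search for in a hardware ID listing."""
    return f"VEN_{vendor_id.upper()}&DEV_{device_id.upper()}"

sample = "Device Id : VenId:DevId : 8086:3405 Class code : 030000"
ids = extract_ids(sample)
fragment = hardware_id_fragment(*ids)
print(ids, fragment)
```

The resulting fragment can be searched for in the OP's uploaded system information, and the two IDs can be entered into a PCI database as described above.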
For processors, we can use the same extension (!errrec); however, less information will be displayed, and I tend to just check the MCA (Processor Machine Check Architecture) section. Here is an example:
Code:
===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ 86ca6a0c
Section       @ 86ca6b94
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : BUSLG_SRC_ERR_*_NOTIMEOUT_ERR (Proc 1 Bank 0)
  Status      : 0xb20000001040080f

EDIT:
I would like to thank Arc for pointing this out to me.
Code:
WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 00000000, Machine Check Exception
Arg2: 86ca68fc, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000, High order 32-bits of the MCi_STATUS value.
Arg4: 00000000, Low order 32-bits of the MCi_STATUS value.

The first parameter (argument) has the value 0x0, which is a Machine Check Exception. This means the CPU has detected a hardware problem. The address points to a processor error because the CPU is what found the problem, so a different piece of hardware could be causing the issue.
In such a situation, it is best to use these steps:
If all the hardware seems to be running stably and the tests report no errors, this could point to a bad motherboard.
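Going back to the MCA section shown earlier, the Status value (an MCi_STATUS register value) can be split into its parts by hand. The sketch below assumes the flag positions in the upper bits and the two 16-bit code fields as I understand them from Intel's machine-check documentation, so treat the layout as an assumption to verify; `STATUS_FLAGS` and `decode_mci_status` are names made up for this example.

```python
# Assumed flag bits in the upper part of an MCi_STATUS value.
STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC holds extra information
    58: "ADDRV",  # MCi_ADDR holds an address
    57: "PCC",    # processor context corrupt
}

def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 0-15
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 16-31
    }

# Status value copied from the x86/x64 MCA section in this post.
info = decode_mci_status(0xB20000001040080F)
print(info["flags"], hex(info["mca_error_code"]), hex(info["model_specific_code"]))
```

The low 16 bits are the part !errrec turns into the readable error name, so this is mainly useful for checking the flags that the extension does not spell out.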
External Links:
Last edited by x BlueRobot; 17 Nov 2013 at 16:02.