**Please read the corrections (see link):** Vir Gnarus - Post #4 - Corrections
I understand that most of the current BSOD analysts on the forum already use and understand the more efficient way of analyzing Stop 0x124 crashes pointed out by Vir Gnarus. However, I would like to create a page which enables new BSOD analysts to understand and use this method in their own analysis.
Thanks to Vir Gnarus for explaining this method; he has already written a brilliant tutorial on Sysnative about how to debug Stop 0x124 PCI errors (see External Links).
Quote:
The WHEA_UNCORRECTABLE_ERROR bug check has a value of 0x00000124. This bug check indicates that a fatal hardware error has occurred. This bug check uses the error data that is provided by the Windows Hardware Error Architecture (WHEA).
Here's the start of a 0x124 bugcheck (without any extensions used):
Code:
BugCheck 124, {4, fffffa800aaeb8d8, 0, 0}
Probably caused by : GenuineIntel
I've highlighted the areas which will interest us, so let's break down the very beginning of the crash into simple parts.
The text in red (124) is the type of bugcheck; you can use it to find further information in the BSOD Index.
The text in blue (4) is the first parameter of the bugcheck. It describes the source of the error, which in this case is 0x4, an Uncorrectable PCI Express Error. 0x4 can mean a PCI error as well as a PCI Express error.
The text in green (fffffa800aaeb8d8) is the second parameter, the address of the WHEA_ERROR_RECORD structure; we will use this address to extract some additional information.
Code:
7: kd> !errrec fffffa800aaeb8d8
===============================================================================
Common Platform Error Record @ fffffa800aaeb8d8
-------------------------------------------------------------------------------
Record Id : 01cdc0c21e2fc73d
Severity : Fatal (1)
Length : 672
Creator : Microsoft
Notify Type : PCI Express Error
Timestamp : 11/12/2012 10:45:50 (UTC)
Flags : 0x00000000
===============================================================================
Section 0 : PCI Express
-------------------------------------------------------------------------------
Descriptor @ fffffa800aaeb958
Section @ fffffa800aaeb9e8
Offset : 272
Length : 208
Flags : 0x00000001 Primary
Severity : Fatal
Port Type : Root Port
Version : 1.1
Command/Status: 0x0010/0x0000
Device Id :
VenId:DevId : 8086:3405
Class code : 030000
Function No : 0x00
Device No : 0x00
Segment : 0x0000
Primary Bus : 0x00
Second. Bus : 0x00
Slot : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ fffffa800aaeba1c
Device Caps : 00008020 Role-Based Error Reporting: 1
Device Ctl : 0000 ur fe nf ce
Dev Status : 0000 ur fe nf ce
Root Ctl : 0000 fs nfs cs
AER Information @ fffffa800aaeba58
Uncorrectable Error Status : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
Uncorrectable Error Mask : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
Uncorrectable Error Severity : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
Correctable Error Status : 00000000 adv rtto rnro dllp tlp re
Correctable Error Mask : 00000000 adv rtto rnro dllp tlp re
Caps & Control : 00000000 ecrcchken ecrcchkcap ecrcgenen ecrcgencap fep
Header Log : 00000000 00000000 00000000 00000000
Root Error Command : 00000000 fen nfen cen
Root Error Status : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
Correctable Error Source ID : 00,00,00
Correctable Error Source ID : 00,00,00
===============================================================================
Section 1 : Processor Generic
-------------------------------------------------------------------------------
Descriptor @ fffffa800aaeb9a0
Section @ fffffa800aaebab8
Offset : 480
Length : 192
Flags : 0x00000000
Severity : Informational
Proc. Type : x86/x64
Instr. Set : x64
CPU Version : 0x00000000000106a5
Processor ID : 0x0000000000000007
As you can see, we used the following extension with the address from the second parameter:
Code:
!errrec fffffa800aaeb8d8
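To make the relationship between the bugcheck line and the `!errrec` command concrete, here is a minimal Python sketch that pulls the parameters out of the line shown earlier and builds the command from the second parameter. The function names and the `ERROR_SOURCES` table are my own for illustration (they are not part of WinDbg), and the table only lists the two Parameter 1 values discussed in this post.

```python
import re

# Only the two Parameter 1 values discussed in this post are listed here;
# other values exist and are deliberately left out of this sketch.
ERROR_SOURCES = {
    0x0: "Machine Check Exception",
    0x4: "Uncorrectable PCI Express Error",
}


def parse_bugcheck(line):
    """Split a 'BugCheck 124, {a, b, c, d}' line into code and parameters."""
    match = re.search(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", line)
    if match is None:
        raise ValueError("not a BugCheck line: %r" % line)
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def errrec_command(line):
    """Return the !errrec command for a Stop 0x124 bugcheck line."""
    code, params = parse_bugcheck(line)
    if code != 0x124:
        raise ValueError("not a Stop 0x124 bugcheck")
    # Parameter 2 is the address of the WHEA_ERROR_RECORD structure.
    return "!errrec %x" % params[1]


if __name__ == "__main__":
    line = "BugCheck 124, {4, fffffa800aaeb8d8, 0, 0}"
    code, params = parse_bugcheck(line)
    print(hex(code), ERROR_SOURCES.get(params[0], "not listed in this sketch"))
    print(errrec_command(line))
```

The script only formats text; the command it prints still has to be typed into the debugger by hand.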
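The Port Type shows where the error occurred; it seems to have happened at a PCI Express Root Port, with the error severity being Fatal. We can also check the AER (PCI Express Advanced Error Reporting) section to see what exactly happened.
I believe the capitalized flags are supposed to be the most interesting, since capitals mark the bits that are set; here MTLP means Malformed TLP and ROF means Receiver Overflow. Note that in this record they appear on the Uncorrectable Error Severity line, which only selects which error types are treated as fatal, while the Uncorrectable Error Status line is all zeros, so they may not show which error actually occurred.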
In general, the parts indicate:
- UR = Unsupported Request Error
- MTLP = Malformed TLP
- SD = Surprise Down
- ROF = Receiver Overflow
- UC = Unexpected Completion
- CTO = Completion Timeout
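As a rough illustration of reading these flags, the Python sketch below picks the capitalized abbreviations out of a single AER line copied from the `!errrec` output. The helper name `set_flags` is hypothetical, and the table entries beyond the six listed above (ECRC, CA, CTO, FCP, PTLP, DLP, UND) follow the PCI Express AER register naming as I understand it, so treat them as an assumption.

```python
# Abbreviation -> meaning. UR, MTLP, SD, ROF, UC and the completion timeout
# entry come from the list in this post; the other names follow the PCI Express
# AER register naming and are added here as an assumption.
AER_UNCORRECTABLE = {
    "UR": "Unsupported Request Error",
    "ECRC": "ECRC Error",
    "MTLP": "Malformed TLP",
    "ROF": "Receiver Overflow",
    "UC": "Unexpected Completion",
    "CA": "Completer Abort",
    "CTO": "Completion Timeout",
    "FCP": "Flow Control Protocol Error",
    "PTLP": "Poisoned TLP",
    "SD": "Surprise Down",
    "DLP": "Data Link Protocol Error",
    "UND": "Undefined",
}


def set_flags(line):
    """Return the capitalized (set) flag names from one AER line."""
    _, _, rest = line.partition(":")
    tokens = rest.split()
    # tokens[0] is the raw register value; the remaining tokens are flag names.
    return [t for t in tokens[1:] if t.isalpha() and t.isupper()]


if __name__ == "__main__":
    severity = ("Uncorrectable Error Severity : 00062010 "
                "ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und")
    for flag in set_flags(severity):
        print(flag, "=", AER_UNCORRECTABLE.get(flag, "unknown"))
```

Running the same helper on the Uncorrectable Error Status line from the record above returns an empty list, because every flag on that line is lowercase.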
What device is causing the crash? We can look up the VenId and DevId in a PCI Database, and then compare the vendor and device information it returns with the OP's msinfo32 file or the System Specifications they may have uploaded.
However, the device may not always be the actual cause; it could be the port it is using or the motherboard. We can use various stress tests and hardware swaps to confirm.
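For the lookup step, a small Python sketch can split the VenId:DevId line into the two hexadecimal IDs before searching a PCI database. The `KNOWN_VENDORS` table is a tiny hand-made sample for illustration only, not a real database, and `parse_ven_dev` is a hypothetical helper name.

```python
import re

# A few well-known PCI vendor IDs, included only for illustration; a real
# lookup would use a full PCI database.
KNOWN_VENDORS = {
    0x8086: "Intel",
    0x10DE: "NVIDIA",
    0x1002: "AMD/ATI",
}


def parse_ven_dev(line):
    """Extract (vendor_id, device_id) from a 'VenId:DevId : 8086:3405' line."""
    match = re.search(r"([0-9A-Fa-f]{4}):([0-9A-Fa-f]{4})\s*$", line)
    if match is None:
        raise ValueError("no vendor:device pair found in %r" % line)
    return int(match.group(1), 16), int(match.group(2), 16)


if __name__ == "__main__":
    vendor, device = parse_ven_dev("VenId:DevId : 8086:3405")
    name = KNOWN_VENDORS.get(vendor, "unknown vendor")
    print("vendor %04x (%s), device %04x" % (vendor, name, device))
```

The device ID still needs to be searched in the database together with the vendor ID, since device IDs are only unique per vendor.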
For processors, we can use the same extension (!errrec); however, less information will be displayed, and I tend to just check the MCA (Machine Check Architecture) section. Here is an example:
Code:
===============================================================================
Section 2 : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor @ 86ca6a0c
Section @ 86ca6b94
Offset : 664
Length : 264
Flags : 0x00000000
Severity : Fatal
Error : BUSLG_SRC_ERR_*_NOTIMEOUT_ERR (Proc 1 Bank 0)
Status : 0xb20000001040080f
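The Status value above can be broken into fields. The Python sketch below assumes the usual x86/x64 MCi_STATUS layout (flag bits 63 down to 57, the MCA error code in the low 16 bits, and a model-specific code in bits 16 to 31); the function and table names are my own, so treat it as a rough aid and not as a replacement for the processor manuals.

```python
# Flag bits in the upper part of MCi_STATUS (assumed standard x86/x64 layout).
STATUS_FLAGS = [
    (63, "VAL", "register contents are valid"),
    (62, "OVER", "an earlier error was overwritten"),
    (61, "UC", "error was not corrected"),
    (60, "EN", "error reporting was enabled"),
    (59, "MISCV", "MCi_MISC holds extra data"),
    (58, "ADDRV", "MCi_ADDR holds an address"),
    (57, "PCC", "processor context may be corrupt"),
]


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    flags = [name for bit, name, _ in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    decoded = decode_mci_status(0xB20000001040080F)
    print("flags:", ", ".join(decoded["flags"]))
    print("MCA error code: 0x%04x" % decoded["mca_error_code"])
    print("model-specific code: 0x%04x" % decoded["model_specific_code"])
```

The debugger has already translated the MCA error code into the text on the Error line, so this is mainly useful for checking flags such as UC and PCC.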
EDIT: I would like to thank Arc for pointing this out to me.
Code:
WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 00000000, Machine Check Exception
Arg2: 86ca68fc, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000, High order 32-bits of the MCi_STATUS value.
Arg4: 00000000, Low order 32-bits of the MCi_STATUS value.
The first parameter (argument) has the value 0x0, which is a Machine Check Exception. This means the CPU has detected a hardware problem. The error record points to a processor error only because the CPU is the component that found and reported it, so a different piece of hardware could be causing the issue.
In such a situation, it is best to use these steps:
If all the hardware seems to be running stably and the tests report no errors, the motherboard could be bad.
External Links: