Stop 0x124 Troubleshooting - 'New Method'

x BlueRobot · Jan 18, 2013

Before Reading: This tutorial is outdated. Please check my blog for the updated version, which explains all the errors and brings all the reference information together. BSODTutorials (Check November 2013)

**Please read corrections (See link)**

Vir Gnarus - Post #4 - Corrections

I understand that most of the current BSOD analysts on the forum already use and understand the more efficient way of analyzing Stop 0x124 crashes pointed out by Vir Gnarus. However, I would like to create a page that enables new BSOD analysts to understand and use this method in their analysis.

Thanks to Vir Gnarus for explaining this method. He has already created a brilliant tutorial on Sysnative about how to debug Stop 0x124 PCI errors (see External Links).

The WHEA_UNCORRECTABLE_ERROR bug check has a value of 0x00000124. This bug check indicates that a fatal hardware error has occurred. This bug check uses the error data that is provided by the Windows Hardware Error Architecture (WHEA).

Here's the start of a 0x124 bugcheck (without any extensions used):

Code:

[COLOR=Red]BugCheck 124[/COLOR], {[COLOR=Blue]4[/COLOR], [COLOR=SeaGreen]fffffa800aaeb8d8[/COLOR], 0, 0}

Probably caused by : GenuineIntel

I've highlighted the areas which will interest us, so let's break down the very beginning of the crash into simple parts.

The text in red is the type of bugcheck; you can use this to find further information in the BSOD Index.

The text in blue is the first parameter of the bugcheck. It identifies the type of error source that reported the error, which in this case is 0x4, an Uncorrectable PCI Express Error.

Note: 0x4 can indicate a PCI error as well as a PCI Express error.

The text in green is the second parameter, which holds the address of the WHEA_ERROR_RECORD structure; we will use this address to extract some additional information.

Code:

7: kd> [COLOR=seagreen]!errrec[/COLOR] [COLOR=seagreen]fffffa800aaeb8d8[/COLOR]
===============================================================================
Common Platform Error Record @ fffffa800aaeb8d8
-------------------------------------------------------------------------------
Record Id     : 01cdc0c21e2fc73d
Severity      : Fatal (1)
Length        : 672
Creator       : Microsoft
Notify Type   : PCI Express Error
Timestamp     : 11/12/2012 10:45:50 (UTC)
Flags         : 0x00000000

===============================================================================
Section 0     : PCI Express
-------------------------------------------------------------------------------
Descriptor    @ fffffa800aaeb958
Section       @ fffffa800aaeb9e8
Offset        : 272
Length        : 208
Flags         : 0x00000001 Primary
Severity      : [COLOR=red]Fatal[/COLOR]

Port Type     : [COLOR=Red]Root Port[/COLOR]
Version       : 1.1
Command/Status: 0x0010/0x0000
Device Id     :
  VenId:DevId : [COLOR=Blue]8086[/COLOR]:[COLOR=blue]3405[/COLOR]
  Class code  : 030000
  Function No : 0x00
  Device No   : 0x00
  Segment     : 0x0000
  Primary Bus : 0x00
  Second. Bus : 0x00
  Slot        : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ fffffa800aaeba1c
  Device Caps : 00008020 Role-Based Error Reporting: 1
  Device Ctl  : 0000 ur fe nf ce
  Dev Status  : 0000 ur fe nf ce
   Root Ctl   : 0000 fs nfs cs

[COLOR=red]AER Information @ fffffa800aaeba58[/COLOR]
  Uncorrectable Error Status    : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
  Uncorrectable Error Mask      : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
  Uncorrectable Error Severity  : 00062010 ur ecrc [COLOR=red]MTLP[/COLOR] [COLOR=red]ROF[/COLOR] uc ca cto [COLOR=red]FCP[/COLOR] ptlp sd [COLOR=red]DLP[/COLOR] und
  Correctable Error Status      : 00000000 adv rtto rnro dllp tlp re
  Correctable Error Mask        : 00000000 adv rtto rnro dllp tlp re
  Caps & Control                : 00000000 ecrcchken ecrcchkcap ecrcgenen ecrcgencap fep
  Header Log                    : 00000000 00000000 00000000 00000000
  Root Error Command            : 00000000 fen nfen cen
  Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
  Correctable Error Source ID   : 00,00,00
  Correctable Error Source ID   : 00,00,00

===============================================================================
Section 1     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa800aaeb9a0
Section       @ fffffa800aaebab8
Offset        : 480
Length        : 192
Flags         : 0x00000000
Severity      : Informational

Proc. Type    : x86/x64
Instr. Set    : x64
CPU Version   : 0x00000000000106a5
Processor ID  : 0x0000000000000007

As you can see we used the following extension with the address from the second parameter:

Code:

!errrec fffffa800aaeb8d8
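
As a rough illustration of the steps so far, the sketch below pulls the four parameters out of a bugcheck line and builds the !errrec command from the second one. The helper name parse_bugcheck and the small lookup table are hypothetical and not part of WinDbg; the table only lists the two parameter 1 values discussed in this thread.

```python
import re

# Only the two error source types mentioned in this thread; other values exist.
ERROR_SOURCES = {
    0x0: "Machine Check Exception",
    0x4: "Uncorrectable PCI Express Error",
}

def parse_bugcheck(line):
    """Return (bugcheck code, [arg1..arg4]) from a 'BugCheck' summary line."""
    # Strip forum colour tags such as [COLOR=Red] ... [/COLOR] if present.
    line = re.sub(r"\[/?COLOR[^\]]*\]", "", line)
    match = re.search(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", line)
    if match is None:
        raise ValueError("not a BugCheck line")
    code = int(match.group(1), 16)
    args = [int(part.strip(), 16) for part in match.group(2).split(",")]
    return code, args

code, args = parse_bugcheck("BugCheck 124, {4, fffffa800aaeb8d8, 0, 0}")
source = ERROR_SOURCES.get(args[0], "not listed here")
command = f"!errrec {args[1]:x}"  # second parameter is the record address
print(hex(code), "|", source, "|", command)
```

The command string it prints is the same one typed into the debugger above.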

The Port Type shows which kind of PCI Express port reported the error, in this case the Root Port, and the error severity is Fatal. We can also check the AER (PCI Express Advanced Error Reporting) section to see what exactly happened.

I believe the capitalized parts are supposed to be the most interesting, since they mark where the errors occurred. I think MTLP means Malformed TLP and ROF means Receiver Overflow.

In general, the parts indicate:

UR = Unsupported Request Error
MTLP = Malformed TLP
SD = Surprise Down
ROF = Receiver Overflow
UC = Unexpected Completion
CTO = Completion Timeout
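
To show how the hex value next to each AER line relates to the capitalized abbreviations, here is a minimal sketch that decodes an Uncorrectable Error register value into names. The bit positions are the ones I understand the PCI Express AER register layout to use, so treat the table as an assumption to check against the spec; the function name decode_aer is made up for this example.

```python
# Assumed bit positions of the AER Uncorrectable Error registers, keyed by
# the abbreviations that !errrec prints.
AER_UNCORRECTABLE_BITS = {
    "DLP": 4,    # Data Link Protocol Error
    "SD": 5,     # Surprise Down
    "PTLP": 12,  # Poisoned TLP
    "FCP": 13,   # Flow Control Protocol Error
    "CTO": 14,   # Completion Timeout
    "CA": 15,    # Completer Abort
    "UC": 16,    # Unexpected Completion
    "ROF": 17,   # Receiver Overflow
    "MTLP": 18,  # Malformed TLP
    "ECRC": 19,  # ECRC Error
    "UR": 20,    # Unsupported Request Error
}

def decode_aer(value):
    """Return the abbreviations whose bit is set in a register value."""
    return [name for name, bit in AER_UNCORRECTABLE_BITS.items()
            if (value >> bit) & 1]

# Uncorrectable Error Severity value from the dump above.
print(decode_aer(0x00062010))  # ['DLP', 'FCP', 'ROF', 'MTLP']
```

The four names returned for 0x00062010 are the same four that the debugger capitalized on the Severity line.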

What device is causing the crash? We can search for the VenID or DevID in a PCI database, and then compare the Vendor ID or Device ID information provided by the database against the OP's msinfo32 file or any System Specifications they may have uploaded.

However, the device may not always be the actual cause; it could be the port it is using or the motherboard. We can always use various stress tests and hardware swaps to confirm.

For processors, we can use the same extension (!errrec). Less information will be displayed, however, and I tend to just check the MCA (Processor Machine Check Architecture) section. Here is an example:
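
For the comparison step, a small sketch like the following can turn the VenId:DevId line into the PCI\VEN_xxxx&DEV_xxxx form, which is the prefix I would expect to see in the PNP Device ID column of an msinfo32 export (an assumption about the report layout). The function name hardware_id is hypothetical.

```python
import re

def hardware_id(errrec_line):
    """Build a PCI hardware ID prefix from an !errrec 'VenId:DevId' line."""
    # Strip forum colour tags such as [COLOR=Blue] ... [/COLOR] if present.
    line = re.sub(r"\[/?COLOR[^\]]*\]", "", errrec_line)
    match = re.search(
        r"VenId:DevId\s*:\s*([0-9A-Fa-f]{4}):([0-9A-Fa-f]{4})", line)
    if match is None:
        raise ValueError("no VenId:DevId pair found")
    vendor, device = (part.upper() for part in match.groups())
    return f"PCI\\VEN_{vendor}&DEV_{device}"

print(hardware_id("  VenId:DevId : 8086:3405"))  # PCI\VEN_8086&DEV_3405
```

Searching the msinfo32 text for that string is a quick way to see which entry the IDs belong to.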

Code:

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ 86ca6a0c
Section       @ 86ca6b94
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : [COLOR=red]Fatal[/COLOR]

Error         : [COLOR=Red]BUSLG_SRC_ERR_*_NOTIMEOUT_ERR (Proc 1 Bank 0)[/COLOR]
  Status      : 0xb20000001040080f
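
The Status value can also be taken apart by hand. The sketch below is my own reading of the MCi_STATUS layout in the Intel developer manuals (flag bits at the top, model-specific code in bits 31:16, MCA error code in bits 15:0) and only handles the bus/interconnect error format; the function names and the mnemonic spellings, including "*" for the generic transaction type, are assumptions chosen to match the debugger output above.

```python
# Flag bits in the upper part of MCi_STATUS (assumed from the Intel manuals).
FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}

PP = ["SRC", "RES", "OBS", "GEN"]        # participation
T = ["NOTIMEOUT", "TIMEOUT"]             # timeout
RRRR = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
II = ["M", "RESERVED", "IO", "*"]        # memory or I/O
LL = ["L0", "L1", "L2", "LG"]            # cache level

def decode_flags(status):
    """Return the names of the flag bits that are set."""
    return [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]

def decode_bus_error(status):
    """Decode a bus/interconnect MCA error code (0000 1PPT RRRR IILL)."""
    code = status & 0xFFFF
    if (code & 0xF800) != 0x0800:
        return None  # some other error class, not handled in this sketch
    request = (code >> 4) & 0xF
    rrrr = RRRR[request] if request < len(RRRR) else "UNKNOWN"
    return "BUS{}_{}_{}_{}_{}_ERR".format(
        LL[code & 0x3], PP[(code >> 9) & 0x3], rrrr,
        II[(code >> 2) & 0x3], T[(code >> 8) & 0x1])

status = 0xB20000001040080F
print(decode_flags(status))      # ['VAL', 'UC', 'EN', 'PCC']
print(decode_bus_error(status))  # BUSLG_SRC_ERR_*_NOTIMEOUT_ERR
```

UC and PCC being set fits with the section being marked Fatal: the error was not corrected and the processor context may be corrupt.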

EDIT:

I would like to thank Arc for pointing this out to me.

Code:

[COLOR=Red]WHEA_UNCORRECTABLE_ERROR (124)[/COLOR]
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
[COLOR=Blue]Arg1: 00000000, Machine Check Exception[/COLOR]
Arg2: 86ca68fc, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000, High order 32-bits of the MCi_STATUS value.
Arg4: 00000000, Low order 32-bits of the MCi_STATUS value.

The first parameter (or argument) has the value 0x0, which is a Machine Check Exception. This means the CPU has detected a hardware problem. The address points to a processor error record only because the CPU is the component that found the problem, so a different piece of hardware could be causing the issue.

In such a situation, it is best to use these steps:

http://www.sevenforums.com/crash-lockup-debug-how/35349-stop-0x124-what-means-what-try.html

If all the hardware seems to be running stably and the tests report no errors, that could mean a bad motherboard.

External Links:

koolkat77 · Jan 18, 2013

Thanks for the information Harry, & very nicely written thread.
I personally find the 124's very hard to troubleshoot.

x BlueRobot · Jan 18, 2013

Thanks Yussi, and 124's do take quite some time, especially when the stress tests do not indicate any errors.

Vir Gnarus · Jan 24, 2013

Hi Bluerobot, thanks a lot for transferring this knowledge over to this forum for me. It definitely is good to get this around for others to know about as much as is available, and it certainly saves me having to personally direct people to where they can learn more about it every time I come across someone who isn't familiar.

I would like to benefit the article by adding a few clarifications, if I may:

The 0x4 subcode of the 0x124 bugcheck refers to the PCI-E bus, which is always true, as you pointed out. However, the PCI-E bus is a very commonly used bus for a whole assortment of things on the motherboard, and not just for PCI either. USB is also often shared on it, especially on OEM motherboards that like to keep things cheap by pushing a bunch of different items on the motherboard onto as few buses as possible. Don't rule out the possibility of it being something other than PCI-related! Check the motherboard brand and see if it's an OEM, and if it is, try correlating that crashdump with any other problematic behavior, like checking and seeing if the person is experiencing USB/Bluetooth issues and the like. Finding patterns in symptoms to correlate with this data goes a long way in establishing a more accurate conclusion!

As for PCI-E WHEA error records, yes, they are generated by the device that reported the error, but the one that reported the error isn't necessarily the one that caused it! Much like any crashdump, the one who found the problem and filed the report isn't the one who committed the crime. The same goes for WHEA errors, especially PCI-E WHEA errors. Often you'll find that the device that reported the error is the root port, when all the root port is saying is, "I've found a problem and I'm telling you about it, but I really don't know where it originated from." The device id listed below it also belongs to the device telling the tale, so all it's telling you is the device id of the root port in this case, and it won't be of much assistance.

This particular crashdump you're using is very unusual. You see, the PCI-E WHEA error record will present 5 different items that are of interest, which are the 3 Uncorrectable Error values and the 2 Correctable ones. It is important to understand what 'Status', 'Mask' and 'Severity' mean to interpret these properly.

Status = Shows what issues have actually been discovered and are presented in the error report.

Mask = Shows what error types are masked and therefore to be disregarded, or to say, "Don't bother with this error if the system reports it, it's fine."

Severity = Shows which uncorrectable error types are considered a fatal error which cannot be recovered (therefore causing the BSOD).

Obviously, Correctable Errors do not harbor the Severity value because there shouldn't be any correctable errors that can be deemed fatal (why should they if they can be corrected?). So really the uncorrectable should be looked at, and it should be the Status of them in particular, when filtered through both Severity and Mask, since it's the Status that actually tells you what actually happened that caused the error. However, in this example, none of the values are highlighted (capitalized) so strangely no errors were collected on this WHEA error report! Talk about an oddity! Perhaps the error got lost during data collection. Very strange. I personally wouldn't know what to do with this, but having the PCI-E WHEA BSOD just means putting more attention on it and anything passing through it (be it PCI-E, PCI, USB, etc.).
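
The filtering described above can be written as a couple of bitwise operations. This is only a sketch of that reading of Status, Mask and Severity; the function name is made up, and the second call uses a hypothetical Status value purely to show what a populated report would look like.

```python
def classify_uncorrectable(status, mask, severity):
    """Split reported uncorrectable errors into (fatal, non_fatal) bit sets."""
    reported = status & ~mask         # drop error types that are masked
    fatal = reported & severity       # Severity bit set: treated as fatal
    non_fatal = reported & ~severity  # Severity bit clear: non-fatal
    return fatal, non_fatal

# Values from the dump in the tutorial: Status is all zeroes, so nothing
# survives the filter even though Severity has bits set.
print(classify_uncorrectable(0x00000000, 0x00000000, 0x00062010))

# Hypothetical Status with bit 18 set, which would come out as fatal.
fatal, non_fatal = classify_uncorrectable(0x00040000, 0x00000000, 0x00062010)
print(hex(fatal), hex(non_fatal))
```

With the real values both results are zero, which is the oddity pointed out above: the record says Fatal but no individual error type was captured in Status.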

Btw, the definitions for each of the error types are mentioned here. This only applies if the Root Port was the device that issued the error, which in this case it was. If it was something else, then locate the appropriate item at the bottom of the linked page. Also, despite me notifying them long ago, they haven't corrected the links for the Correctable Error Status and Mask pages under the Members section, but they are listed correctly at the bottom of the page. These 'definitions' are really only the names of each one. While names like 'Surprise Down' are somewhat self-explanatory, a Malformed TLP probably isn't so much. That information, however, is documented in the PCI Express Base Spec, which is not freely accessible (you may find an old copy lurking around on the internet, though).

Lastly, a description on how to understand generic WHEA errors (most common) besides the PCI-E specific ones I've mentioned here. Just to assure you, the developer manuals it refers to are free downloads, so don't worry about that.

x BlueRobot · Jan 25, 2013

Thanks for the corrections Vir, I'll create a link for your post in the tutorial, or do you think it's best to quote it within the tutorial?

Vir Gnarus · Jan 25, 2013

It's really up to you, skipper. It's your tutorial!

x BlueRobot · Jan 25, 2013

I've added a little link at the top of the page, thought the quote may be a little too large to add.

x BlueRobot · Nov 15, 2013

The CPU Machine Check Exception error has been updated here in my blog post - BSODTutorials: Debugging Stop 0x124 - CPU Mnemonics

The PCIe side of Stop 0x124 is going to be updated soon in my blog; I'm planning on bringing all the bits and pieces of information together.

x BlueRobot · Nov 17, 2013

Part 1 and Part 2 have been written up. I'm going to work on Part 3 now, which will explain the errors which I didn't explain in Part 2.

x BlueRobot · Nov 17, 2013

Part 3 has just been published, please check my blog (see link in my signature) for details. It will be provided within the November 2013 section.
