Stop 0x124 Troubleshooting - 'New Method'

  1.    #1


    Before Reading: This tutorial is outdated. Please check my blog for the updated version, which explains all the errors and gathers all the reference information in one place: BSODTutorials (check November 2013).

    **Please read corrections (See link)**


    Vir Gnarus - Post #4 - Corrections

    I understand that most of the current BSOD analysts on the forum already use and understand the more efficient way of analyzing Stop 0x124 crashes pointed out by Vir Gnarus. However, I would like to create a page that enables new BSOD analysts to understand and use this method in their analysis.

    Thanks to Vir Gnarus for explaining this method. He has already created a brilliant tutorial on Sysnative about how to debug Stop 0x124 PCI errors (see External Links).

    The WHEA_UNCORRECTABLE_ERROR bug check has a value of 0x00000124. This bug check indicates that a fatal hardware error has occurred. This bug check uses the error data that is provided by the Windows Hardware Error Architecture (WHEA).
    Here's the start of a 0x124 bugcheck (without any extensions used):

    Code:
    BugCheck 124, {4, fffffa800aaeb8d8, 0, 0}
    
    Probably caused by : GenuineIntel
    I've highlighted the areas which will interest us, so let's break down the very beginning of the crash into simple parts.

    The text in red (124) is the type of bugcheck; you can use this to find further information in the BSOD Index.

    The text in blue (4) is the first parameter of the bugcheck. It identifies the type of error source that reported the error, which in this case is 0x4, an Uncorrectable PCI Express Error.

    0x4 can also mean a PCI error, as well as a PCI Express error.

    The text in green (fffffa800aaeb8d8) is the second parameter. It is the address of the WHEA_ERROR_RECORD structure; we will use this address to extract some additional information.

    Code:
    7: kd> !errrec fffffa800aaeb8d8
    ===============================================================================
    Common Platform Error Record @ fffffa800aaeb8d8
    -------------------------------------------------------------------------------
    Record Id     : 01cdc0c21e2fc73d
    Severity      : Fatal (1)
    Length        : 672
    Creator       : Microsoft
    Notify Type   : PCI Express Error
    Timestamp     : 11/12/2012 10:45:50 (UTC)
    Flags         : 0x00000000
    
    ===============================================================================
    Section 0     : PCI Express
    -------------------------------------------------------------------------------
    Descriptor    @ fffffa800aaeb958
    Section       @ fffffa800aaeb9e8
    Offset        : 272
    Length        : 208
    Flags         : 0x00000001 Primary
    Severity      : Fatal
    
    Port Type     : Root Port
    Version       : 1.1
    Command/Status: 0x0010/0x0000
    Device Id     :
      VenId:DevId : 8086:3405
      Class code  : 030000
      Function No : 0x00
      Device No   : 0x00
      Segment     : 0x0000
      Primary Bus : 0x00
      Second. Bus : 0x00
      Slot        : 0x0000
    Dev. Serial # : 0000000000000000
    Express Capability Information @ fffffa800aaeba1c
      Device Caps : 00008020 Role-Based Error Reporting: 1
      Device Ctl  : 0000 ur fe nf ce
      Dev Status  : 0000 ur fe nf ce
       Root Ctl   : 0000 fs nfs cs
    
    AER Information @ fffffa800aaeba58
      Uncorrectable Error Status    : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
      Uncorrectable Error Mask      : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
      Uncorrectable Error Severity  : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
      Correctable Error Status      : 00000000 adv rtto rnro dllp tlp re
      Correctable Error Mask        : 00000000 adv rtto rnro dllp tlp re
      Caps & Control                : 00000000 ecrcchken ecrcchkcap ecrcgenen ecrcgencap fep
      Header Log                    : 00000000 00000000 00000000 00000000
      Root Error Command            : 00000000 fen nfen cen
      Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
      Correctable Error Source ID   : 00,00,00
      Correctable Error Source ID   : 00,00,00
    
    ===============================================================================
    Section 1     : Processor Generic
    -------------------------------------------------------------------------------
    Descriptor    @ fffffa800aaeb9a0
    Section       @ fffffa800aaebab8
    Offset        : 480
    Length        : 192
    Flags         : 0x00000000
    Severity      : Informational
    
    Proc. Type    : x86/x64
    Instr. Set    : x64
    CPU Version   : 0x00000000000106a5
    Processor ID  : 0x0000000000000007
    As you can see we used the following extension with the address from the second parameter:

    Code:
    !errrec fffffa800aaeb8d8
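
    For anyone who wants to script this step, here is a rough Python sketch of pulling the parameters out of the BugCheck line and building the !errrec command from the second one. It is only an illustration: the function names are made up for this example, and the lookup table deliberately contains just the two Parameter 1 values discussed in this thread.

```python
import re

# Only the two Parameter 1 values discussed in this thread are listed;
# anything else is reported as "unknown" rather than guessed at.
ERROR_SOURCES = {
    0x0: "Machine Check Exception",
    0x4: "Uncorrectable PCI Express Error",
}

def parse_bugcheck(line):
    """Split a 'BugCheck 124, {a, b, c, d}' line into its code and parameters."""
    match = re.match(r"\s*BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", line)
    if match is None:
        raise ValueError("not a BugCheck line: %r" % line)
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params

def errrec_command(line):
    """Build the !errrec command from the second parameter of a 0x124 bugcheck."""
    code, params = parse_bugcheck(line)
    if code != 0x124:
        raise ValueError("not a Stop 0x124 bugcheck")
    return "!errrec %x" % params[1]

line = "BugCheck 124, {4, fffffa800aaeb8d8, 0, 0}"
code, params = parse_bugcheck(line)
print(hex(code), ERROR_SOURCES.get(params[0], "unknown"))
print(errrec_command(line))
```

    The command it prints can be pasted straight into the debugger.
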
    The Port Type shows where the error was reported; here it is a PCI Express Root Port, with the error severity being Fatal. We can also check the AER (PCI Express Advanced Error Reporting) section to see what exactly happened.

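    I believe the capitalized parts are supposed to be the most interesting, since they mark the flags that are set. I think MTLP means Malformed TLP and ROF means Receiver Overflow. (Note that in this example the capitalized flags are on the Severity line rather than the Status line; see the corrections in post #4.)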

    In general, the parts indicate:
    1. UR = Unsupported Request Error
    2. MTLP = Malformed TLP
    3. SD = Surprise Down
    4. ROF = Receiver Overflow
    5. UC = Unexpected Completion
    6. CTO = Completion Timeout
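
    As a small illustration of reading these lines, the Python sketch below picks the capitalized mnemonics out of one AER line of !errrec output. The function name and the lookup table are just for this example; the table only holds the names listed above (with the timeout flag spelled CTO, as it is printed in the output), and anything else is shown by its mnemonic only.

```python
# Names from the list above; mnemonics not in the list are shown as-is.
NAMES = {
    "UR": "Unsupported Request Error",
    "MTLP": "Malformed TLP",
    "SD": "Surprise Down",
    "ROF": "Receiver Overflow",
    "UC": "Unexpected Completion",
    "CTO": "Completion Timeout",
}

def set_flags(line):
    """Return the capitalized mnemonics from one AER line of !errrec output."""
    _label, _, rest = line.partition(":")
    tokens = rest.split()
    # tokens[0] is the raw register value; the remaining tokens are mnemonics.
    # Requiring purely alphabetic tokens skips things like 'MSG#' and '00'.
    return [t for t in tokens[1:] if t.isalpha() and t.isupper()]

severity = "Uncorrectable Error Severity  : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und"
for flag in set_flags(severity):
    print(flag, "=", NAMES.get(flag, "(not in the list above)"))
```
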

    What device is causing the crash? We can look up the VenId and DevId in a PCI database, and then compare the vendor and device information it returns with the OP's msinfo32 file or the System Specifications they may have uploaded.
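
    A minimal Python sketch of that lookup step, assuming the !errrec output has been saved as text: it extracts the VenId:DevId pair and turns it into a VEN_xxxx&DEV_xxxx fragment, which is the form PCI hardware IDs normally take in Windows, so it can be searched for in an msinfo32 export. The function names are hypothetical and no PCI database is queried here.

```python
import re

def device_ids(errrec_text):
    """Return (vendor_id, device_id) as hex strings from !errrec output, or None."""
    match = re.search(
        r"VenId:DevId\s*:\s*([0-9A-Fa-f]{4}):([0-9A-Fa-f]{4})", errrec_text
    )
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()

def hardware_id_fragment(vendor_id, device_id):
    """Format the IDs the way they appear inside a PCI hardware ID string."""
    return "VEN_%s&DEV_%s" % (vendor_id.upper(), device_id.upper())

sample = """
Device Id     :
  VenId:DevId : 8086:3405
  Class code  : 030000
"""
ids = device_ids(sample)
print(ids)
print(hardware_id_fragment(*ids))
```
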

    However, the device may not always be the actual cause; it could be the port it is using or the motherboard. We can use various stress tests and hardware swaps to confirm.

    For processors, we can use the same extension (!errrec). However, less information will be displayed, and I tend to just check the MCA (Machine Check Architecture) section. Here is an example:

    Code:
    ===============================================================================
    Section 2     : x86/x64 MCA
    -------------------------------------------------------------------------------
    Descriptor    @ 86ca6a0c
    Section       @ 86ca6b94
    Offset        : 664
    Length        : 264
    Flags         : 0x00000000
    Severity      : Fatal
    
    Error         : BUSLG_SRC_ERR_*_NOTIMEOUT_ERR (Proc 1 Bank 0)
      Status      : 0xb20000001040080f
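
    The Status value can also be split into its fields by hand. The Python sketch below assumes the IA32_MCi_STATUS layout from Intel's developer manuals (flag bits at the top, model-specific code in bits 31:16, MCA error code in bits 15:0); other vendors may lay the register out differently, and the function name is made up for this example.

```python
# Flag bits in the upper part of IA32_MCi_STATUS (Intel layout, assumed here)
STATUS_FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # MCi_MISC register is valid
    (58, "ADDRV"),  # MCi_ADDR register is valid
    (57, "PCC"),    # processor context corrupt
]

def decode_mci_status(status):
    """Split a 64-bit machine check status value into flags and error codes."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }

info = decode_mci_status(0xb20000001040080f)
print(info["flags"])
print(hex(info["model_specific_code"]), hex(info["mca_error_code"]))
```

    The low 16 bits (the MCA error code) are what the debugger turns into the mnemonic shown on the Error line.
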
    EDIT:

    I would like to thank Arc for pointing this out to me.

    Code:
    WHEA_UNCORRECTABLE_ERROR (124)
    A fatal hardware error has occurred. Parameter 1 identifies the type of error
    source that reported the error. Parameter 2 holds the address of the
    WHEA_ERROR_RECORD structure that describes the error conditon.
    Arguments:
    Arg1: 00000000, Machine Check Exception
    Arg2: 86ca68fc, Address of the WHEA_ERROR_RECORD structure.
    Arg3: 00000000, High order 32-bits of the MCi_STATUS value.
    Arg4: 00000000, Low order 32-bits of the MCi_STATUS value.
    The first parameter (or argument) has the value 0x0, which is a Machine Check Exception. This means the CPU has detected a hardware problem. The address points to a processor error record because the CPU is what found the error, so a different piece of hardware could actually be causing the issue.

    In such a situation, it is best to use these steps:

    If all the hardware seems to be running stable and the tests report no errors, that could mean a bad motherboard.

    External Links:
    Last edited by x BlueRobot; 17 Nov 2013 at 16:02.


  2. Posts : 15,026
    Windows 10 Home 64Bit
       #2

    Thanks for the information Harry, & very nicely written thread.
    I personally find the 124's very hard to troubleshoot.


  3. Posts : 7,235
    Thread Starter
       #3

    koolkat77 said:
    Thanks for the information Harry, & very nicely written thread.
    I personally find the 124's very hard to troubleshoot.
    Thanks Yussi, and 124's do take quite some time, especially when the stress tests do not indicate any errors.


  4. Posts : 1,314
    Windows 7 64-bit
       #4

    Hi Bluerobot, thanks a lot for transferring this knowledge over to this forum for me. It definitely is good to get this around for others to know about as much as possible, and it certainly saves me from having to personally direct people to where they can learn more about it every time I come across someone who isn't familiar.

    I would like to benefit the article by adding a few clarifications, if I may:


    The 0x4 subcode of the 0x124 bugcheck refers to the PCI-E bus, which is always true, as you pointed out. However, the PCI-E bus is a very commonly used bus for a whole assortment of things on the motherboard, and not just for PCI either. USB is also often shared on it, especially on OEM motherboards that like to keep things cheap by pushing a bunch of different items on the motherboard onto as few buses as possible. Don't rule out the possibility of it being something other than PCI-related! Check the motherboard brand and see if it's an OEM, and if it is, try correlating that crashdump with any other problematic behavior, like checking and seeing if the person is experiencing USB/Bluetooth issues and the like. Finding patterns in symptoms to correlate with this data goes a long way in establishing a more accurate conclusion!


    As for PCI-E WHEA error records, yes, they are generated by the device that reported the error, but the one that reported the error isn't necessarily the one that caused it! Much like any crashdump, the one who found the problem and is filing the report isn't the one who committed the crime. This is the same with WHEA errors, especially PCI-E WHEA errors. Often you'll find that the device that reported the error is the root port, when all the root port is saying is, "I've found a problem and am telling you about it, but I really don't know where it originated from." The device id listed below it also belongs to the device telling the tale, so all it's telling you is the device id of the root port in this case, and it won't be of much assistance.


    This particular crashdump you're using is very unusual. You see, the PCI-E WHEA error record will present 5 different items that are of interest, which are the 3 Uncorrectable Error values and the 2 Correctable ones. It is important to understand what 'Status', 'Mask' and 'Severity' mean to interpret these properly.

    Status = Shows what issues have actually been discovered and are presented in the error report.

    Mask = Shows what error types are masked and therefore to be disregarded, or to say, "Don't bother with this error if the system reports it, it's fine."

    Severity = Shows which uncorrectable error types are considered a fatal error which cannot be recovered (therefore causing the BSOD).

    Obviously, Correctable Errors do not harbor the Severity value because there shouldn't be any correctable errors that can be deemed fatal (why should they if they can be corrected?). So really the uncorrectable should be looked at, and it should be the Status of them in particular, when filtered through both Severity and Mask, since it's the Status that actually tells you what actually happened that caused the error. However, in this example, none of the values are highlighted (capitalized) so strangely no errors were collected on this WHEA error report! Talk about an oddity! Perhaps the error got lost during data collection. Very strange. I personally wouldn't know what to do with this, but having the PCI-E WHEA BSOD just means putting more attention on it and anything passing through it (be it PCI-E, PCI, USB, etc.).
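
    To make the filtering concrete, here is a rough Python sketch that decodes the three Uncorrectable Error register values and keeps only the errors that are set in Status, not set in Mask, and marked fatal in Severity. The bit positions are those of the AER uncorrectable error registers as I understand them from the PCI Express Base Spec, keyed by the mnemonics !errrec prints; the function names are made up for this example.

```python
# Bit positions in the AER Uncorrectable Error Status/Mask/Severity registers,
# keyed by the mnemonics that !errrec prints.
UNCORRECTABLE_BITS = {
    "und": 0, "dlp": 4, "sd": 5, "ptlp": 12, "fcp": 13, "cto": 14,
    "ca": 15, "uc": 16, "rof": 17, "mtlp": 18, "ecrc": 19, "ur": 20,
}

def decode(value):
    """List the mnemonics whose bits are set in a register value."""
    return sorted(n for n, bit in UNCORRECTABLE_BITS.items() if (value >> bit) & 1)

def fatal_errors(status, mask, severity):
    """Errors that were reported, are not masked, and are classed as fatal."""
    return decode(status & ~mask & severity)

# Values from the record in this thread: only Severity has bits set.
print(decode(0x00062010))
print(fatal_errors(0x00000000, 0x00000000, 0x00062010))

# Hypothetical Status value with the Malformed TLP bit set, for comparison.
print(fatal_errors(0x00040000, 0x00000000, 0x00062010))
```

    With the values from this record the second line prints an empty list, which matches the oddity described above: the Severity register only says which errors would be fatal, and the Status register recorded none.
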

    Btw, the definitions for each of the error types are mentioned here. This only applies if the Root Port was the device that issued the error, which in this case it was. If it was something else, then locate the appropriate item at the bottom of the linked page. Also, despite me notifying them long ago, they haven't corrected the links for the Correctable Error Status and Mask pages under the Members section, but they are listed correctly at the bottom of the page. These 'definitions' are really only the names of each one. While names like 'Surprise Down' are somewhat self-explanatory, a Malformed TLP probably isn't so much. That information, however, is documented in the PCI Express Base Spec, which is not freely accessible (you may find an old copy lurking around on the internet, though).


    Lastly, a description of how to understand generic WHEA errors (the most common kind) besides the PCI-E specific ones I've mentioned here. Just to assure you, the developer manuals it refers to are free downloads, so don't worry about that.


  5. Posts : 7,235
    Thread Starter
       #5

    Thanks for the corrections Vir, I'll create a link to your post in the tutorial, or do you think it's best to quote it within the tutorial? :)


  6. Posts : 1,314
    Windows 7 64-bit
       #6

    It's really up to you, skipper. It's your tutorial!


  7. Posts : 7,235
    Thread Starter
       #7

    I've added a little link at the top of the page; I thought the quote might be a little too large to add.


  8. Posts : 7,235
    Thread Starter
       #8

    The CPU Machine Check Exception error has been updated here in my blog post - BSODTutorials: Debugging Stop 0x124 - CPU Mnemonics

    The Stop 0x124 material on PCIe errors is going to be updated soon in my blog. I'm planning on bringing all the bits and pieces of information together.


  9. Posts : 7,235
    Thread Starter
       #9

    Part 1 and Part 2 have been written up. I'm going to work on Part 3 now, which will explain the errors which I didn't explain in Part 2.


  10. Posts : 7,235
    Thread Starter
       #10

    Part 3 has just been published, please check my blog (see link in my signature) for details. It will be provided within the November 2013 section.


 
