Seeing the last dump, and understanding the situation, I can tell you that the problem is your RAM modules.
Check the temperature of the system using Speccy, repeatedly over time. The RAM is probably getting hot over time.
Hello, Arc.
Anything useful in the last dump?
Could you help me understand why you suspect a RAM problem?
I'm going to RMA them, but I need something to argue with, as they don't show any problems in Memtest86+.
Also, I had another BSOD in the morning, with the computer idle.
I'm going to try the RMA at the local store today.
Thanks!
Nothing specific, even in a deeper search.
But a stop 0x19 should be a crash caused by a device driver.

Code:
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 19, {20, fffff8a00feb0390, fffff8a00feb03c0, 5030115}

GetPointerFromAddress: unable to read from fffff80002cc1100
GetUlongFromAddress: unable to read from fffff80002cc11c0
Probably caused by : ntkrnlmp.exe ( nt!CmpGetSymbolicLink+2b3 )

Followup: MachineOwner
---------
Hold the plan of RMAing the RAM for now. Give DV one more go, even if, like last time, it might not work out.
Use the computer normally with DV enabled. Let it replicate the crashes... they may show something that I am missing here.
Hello, Arc.
I went to the local store today with both RAM sticks.
As predicted, they'll have to run several tests before a possible RMA.
Those tests may take several days to complete.
I explained all the details about the no-error results in Memtest86+, just in case.
I came back with the sticks; I'll leave them for testing next Thursday, the 14th.
So we can proceed with our tests for now.
When I get home, I'll enable Driver Verifier once more, as you've suggested.
Just to clarify, should I leave DV running even after the first crash?
Thanks!
Usually, if there is a reason to crash, DV does not take very long, and a DV-enabled crash dump should point to a driver.
The last DV-enabled crash dump I checked resulted in a memory corruption and no driver, so I said it should be a memory-related issue. But the latest one is a stop 0x19, which should be driver-related, so I am trying to give DV another go.
With DV enabled, use the computer normally. Let it crash. If you can make it crash twice, that will be nice :)
Just got home, a bit late.
DV is enabled now.
How soon should that crash happen?
I'll post results ASAP.
Thanks, Arc!
So, I'm back with three new BSOD dump files in 2 hours.
Disabled DV for now.
Hope this works.
Thanks!
After those three above:
Got seven more BSODs from 04:53 to 08:15.
Driver Verifier was off.
Attaching the data.
Thanks!
Those are again pointing to the memory.

Code:
BugCheck 4E, {99, 1cb1ee, 3, 1ca739}
Probably caused by : memory_corruption ( nt!MiBadShareCount+4c )

Followup: MachineOwner
---------

BugCheck A, {fffff70001086000, 0, 0, fffff80002a78aa0}
Probably caused by : memory_corruption ( nt!MiAgeWorkingSet+2d6 )

Followup: MachineOwner
---------
We are getting mixed results, and one part of those mixed results is hardware, so before blaming the hardware we should make sure. But Memtest, as you told us, is not showing any errors.
In such a situation it is better to do a clean install of Windows. If a vanilla installation crashes too, we can say with certainty that it is the hardware that is failing.
I'm going to have to agree with Arc here: if it ain't the memory, then it's some other hardware. In that original 0x19 crash on the bad pool header, the pool table says 0x30 bytes were allocated for a particular driver's pool allocation, when in reality it should be 0x70 bytes. Compare those two numbers in binary and you'll see 0x30 is just 0x70 missing a single bit. A bitflip like this is a fairly rare occurrence in driver bugs and a very common occurrence in hardware failure, especially RAM. The fact that Driver Verifier isn't catching these crashes is also often a sign of hardware coming in and messing things up.
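That single-bit comparison can be checked mechanically. Below is a small illustrative Python sketch (my own helper, not part of any debugger tooling): XORing the two sizes keeps only the differing bits, and a power-of-two result means exactly one bit flipped.

```python
def is_single_bit_flip(a: int, b: int) -> bool:
    """True if a and b differ in exactly one bit position."""
    diff = a ^ b                                   # XOR keeps only the differing bits
    return diff != 0 and (diff & (diff - 1)) == 0  # power of two => one bit set

# The sizes from the 0x19 dump: expected 0x70, observed 0x30.
print(f"{0x70:08b} vs {0x30:08b}")     # 01110000 vs 00110000
print(is_single_bit_flip(0x70, 0x30))  # True: only the 0x40 bit differs
```

The same check works on any pair of values you suspect of a bitflip, not just pool sizes.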
Again, if it ain't RAM, then it's gotta be Mobo/CPU/PSU that's causing it. Can you provide an HWInfo sensor log (check Sensors only at startup) for about 30 minutes or until (if possible) a crash?
Oh, and in your original post, the reason the Razer driver isn't letting Windows start when DV is on is because DV is seeing that the driver is performing something that'll corrupt the system. Razer drivers are notorious for their shoddy programming and bug-addled natures, but right now all you can do is try to find an update for them as well as any potential firmware update. Otherwise, I will have to recommend testing the system's stability with the driver uninstalled and resorting to the default Windows mouse driver.
Btw, I also noticed you said you purchased this system only about a month ago. Did you build it yourself, or is it a custom-built PC from a local PC shop? If the latter, why bother trying to RMA a part when you could send it to the shop and ask them to look at it? Don't they have some sort of warranty? Try figuring out your options to make sure you aren't unnecessarily working on this when you don't need to. The fact that it's already dying a month after purchase is pretty indicative of hardware malfunction, no doubt, but if you can determine that troubleshooting it is the shop's responsibility right now rather than yours, you should have them deal with it, since they gave you a faulty system.
Analysts:
Strangely, and with considerable luck, the minidump for the 0x19 crash did indeed retain the pool information we need to look into this further. Start with !analyze -v:
It's stating that the header of a pool allocation has been corrupted. Each allocation of pool memory (driver/kernel reserved memory) has a header holding the pool tag, which describes the purpose of the allocation and is given by the driver allocating it, plus some other details like the size of the allocation. The subcode tells us that the size defined in this pool header is incorrect, i.e. it has been corrupted. Often this is caused by a driver mishandling its pool allocation and overflowing it, spilling its contents into the neighboring pool allocation. We'll want to look at the pool table to see what we've got, using the pool address mentioned:

Code:
6: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

BAD_POOL_HEADER (19)
The pool is already corrupt at the time of the current request.
This may or may not be due to the caller.
The internal pool links must be walked to figure out a possible cause of
the problem, and then special pool applied to the suspect tags or the driver
verifier to a suspect driver.
Arguments:
Arg1: 0000000000000020, a pool block header size is corrupt.
Arg2: fffff8a00feb0390, The pool entry we were looking for within the page.
Arg3: fffff8a00feb03c0, The next pool entry.
Arg4: 0000000005030115, (reserved)

Debugging Details:
------------------

TRIAGER: Could not open triage file : C:\Program Files (x86)\Windows Kits\8.0\Debuggers\x64\triage\modclass.ini, error 2

BUGCHECK_STR:  0x19_20

POOL_ADDRESS:  fffff8a00feb0390 Paged pool

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT

PROCESS_NAME:  svchost.exe

CURRENT_IRQL:  0

LOCK_ADDRESS:  fffff80002c8db80 -- (!locks fffff80002c8db80)

Resource @ nt!PiEngineLock (0xfffff80002c8db80)    Available
WARNING: SystemResourcesList->Flink chain invalid. Resource may be corrupted, or already deleted.
WARNING: SystemResourcesList->Blink chain invalid. Resource may be corrupted, or already deleted.
1 total locks

PNP_TRIAGE:
	Lock address  : 0xfffff80002c8db80
	Thread Count  : 0
	Thread address: 0x0000000000000000
	Thread wait   : 0x0

LAST_CONTROL_TRANSFER:  from fffff80002bbccae to fffff80002a91fc0

STACK_TEXT:
fffff880`09521d28 fffff800`02bbccae : 00000000`00000019 00000000`00000020 fffff8a0`0feb0390 fffff8a0`0feb03c0 : nt!KeBugCheckEx
fffff880`09521d30 fffff800`02d67663 : fffff8a0`019364d8 fffff8a0`019364a8 fffff8a0`624e4d43 fffff880`09522440 : nt!ExDeferredFreePool+0x12da
fffff880`09521de0 fffff800`02d583a8 : fffff8a0`00024010 fffff8a0`00024010 fffff880`09521f01 fffff880`09522440 : nt!CmpGetSymbolicLink+0x2b3
fffff880`09521ee0 fffff800`02d883d8 : fffffa80`07883cc8 fffff800`02a13000 fffffa80`07883b10 fffff800`00000000 : nt!CmpParseKey+0x948
fffff880`095221e0 fffff800`02d895f6 : 00000000`00000000 fffffa80`07883b10 fffff8a0`0ff09060 fffffa80`067047a0 : nt!ObpLookupObjectName+0x588
fffff880`095222d0 fffff800`02d3f088 : fffff880`030fb180 00000000`00000000 00000000`00000000 fffff880`095224a8 : nt!ObOpenObjectByName+0x306
fffff880`095223a0 fffff800`02d3f926 : fffff880`095227b8 fffff880`000f003f fffff880`095227d0 fffff800`02cc4600 : nt!CmCreateKey+0x2e1
fffff880`09522510 fffff800`02a91253 : ffffffff`ffffffff 00000000`00000000 fffff6fc`50053698 fffff800`02ab4038 : nt!NtCreateKey+0x2e
fffff880`09522560 fffff800`02a8d810 : fffff800`02cfca66 00000000`00000001 20207050`02a4b5d8 00000000`000000b5 : nt!KiSystemServiceCopyEnd+0x13
fffff880`09522768 fffff800`02cfca66 : 00000000`00000001 20207050`02a4b5d8 00000000`000000b5 fffff880`095229e0 : nt!KiServiceLinkage
fffff880`09522770 fffff800`02cf96b2 : 00000000`00000001 00000000`00000002 00000000`00000002 00000000`000f003f : nt!IopCreateRegistryKeyEx+0x7e
fffff880`09522840 fffff800`02cfa081 : 00000000`00000002 00000000`00000000 fffff880`09522a40 00000000`00000001 : nt!IopGetDeviceInterfaces+0x146
fffff880`095229b0 fffff800`02cf9351 : 00000000`00000002 00000000`00000000 00000000`00000000 fffff8a0`09334bd0 : nt!PiGetInterfaceDeviceList+0x41
fffff880`09522a10 fffff800`02d521a0 : fffff8a0`09334bd0 fffff800`00008000 fffff880`09522b01 fffff800`02f59da0 : nt!PiControlGetInterfaceDeviceList+0x111
fffff880`09522a90 fffff800`02a91253 : fffffa80`096fc060 00000000`013ee880 fffff880`09522b60 00000000`013ee908 : nt!NtPlugPlayControl+0x100
fffff880`09522ae0 00000000`77c7236a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`013ee848 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x77c7236a

STACK_COMMAND:  kb

FOLLOWUP_IP:
nt!ExDeferredFreePool+12da
fffff800`02bbccae cc              int     3

SYMBOL_STACK_INDEX:  1

SYMBOL_NAME:  nt!ExDeferredFreePool+12da

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: nt

IMAGE_NAME:  ntkrnlmp.exe

DEBUG_FLR_IMAGE_TIMESTAMP:  503f82be

FAILURE_BUCKET_ID:  X64_0x19_20_nt!ExDeferredFreePool+12da

BUCKET_ID:  X64_0x19_20_nt!ExDeferredFreePool+12da

Followup: MachineOwner
---------
The highlighted pool region is the allocation that has the fault. It tells you its size (which is corrupted) and the size of the allocation made before it. This is rather misleading, so make sure you're aware that 'previous size' doesn't mean the size of the region we're looking at in its 'previous state'; it refers only to the size of the allocation before it, which is tagged NtFs and is an NTFS allocation. For this one, the tag is CMNb, which it explains is the Configuration Manager Name Tag (we wouldn't get this info if it weren't for pooltag.txt in the WinDbg directory). It also gives you the driver that allocated it, which is nt!cm, as in the Configuration Manager in the NT module. Keep these elements in mind when looking at future pool regions, but for now the only element we're interested in is the size, which it says is 0x30.

Code:
6: kd> !pool fffff8a00feb0390
Pool page fffff8a00feb0390 region is Paged pool
 fffff8a00feb0000 size:  110 previous size:    0  (Allocated)  IoNm
 fffff8a00feb0110 size:   40 previous size:  110  (Allocated)  MmSm
 fffff8a00feb0150 size:   40 previous size:   40  (Allocated)  MmSm
 fffff8a00feb0190 size:   70 previous size:   40  (Free )  CMNb (Protected)
 fffff8a00feb0200 size:   40 previous size:   70  (Allocated)  MmSm
 fffff8a00feb0240 size:  150 previous size:   40  (Allocated)  NtFs
*fffff8a00feb0390 size:   30 previous size:  150  (Free ) *CMNb (Protected)
		Pooltag CMNb : Configuration Manager Name Tag, Binary : nt!cm
fffff8a00feb03c0 doesn't look like a valid small pool allocation, checking to see
if the entire page is actually part of a large page allocation...
GetUlongFromAddress: unable to read from fffff80002c2fa38
fffff8a00feb03c0 is not a valid small pool allocation, checking large pool...
unable to get pool big page table - either wrong symbols or pool tagging is disabled
fffff8a00feb03c0 is freed (or corrupt) pool
Bad previous allocation size @fffff8a00feb03c0, last size was 3

***
*** An error (or corruption) in the pool was detected;
*** Attempting to diagnose the problem.
***
*** Use !poolval fffff8a00feb0000 for more details.

Pool page [ fffff8a00feb0000 ] is __inVALID.

Analyzing linked list...
[ fffff8a00feb0390 --> fffff8a00feb0400 (size = 0x70 bytes)]: Corrupt region

Scanning for single bit errors...

None found
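To make the size / previous-size bookkeeping concrete, here is a hypothetical Python miniature of the consistency walk !pool performs over a pool page: each entry's PreviousSize must match the BlockSize of the entry before it (sizes in 0x10-byte units). The entry list is transcribed from the table above, including the corrupt 0x3; the function and variable names are mine, not WinDbg's.

```python
# Each tuple: (previous_size, block_size, tag), sizes in 0x10-byte pool units.
# Values transcribed from the !pool listing, including the corrupt BlockSize of
# 0x3 (should be 0x7) for the second CMNb entry.
PAGE = [
    (0x00, 0x11, "IoNm"),
    (0x11, 0x04, "MmSm"),
    (0x04, 0x04, "MmSm"),
    (0x04, 0x07, "CMNb"),
    (0x07, 0x04, "MmSm"),
    (0x04, 0x15, "NtFs"),
    (0x15, 0x03, "CMNb"),  # corrupted: BlockSize reads 0x3
    (0x04, 0x04, "MmSm"),  # next entry records PreviousSize 0x4
]

def broken_links(entries):
    """Return tags whose PreviousSize disagrees with the prior entry's BlockSize."""
    bad, prior_size = [], 0
    for prev_size, size, tag in entries:
        if prev_size != prior_size:
            bad.append(tag)
        prior_size = size
    return bad

print(broken_links(PAGE))  # ['MmSm']: the link right after the corrupt CMNb entry
```

Feeding it the transcribed page flags the link into the following MmSm allocation, the same inconsistency the debugger reported as a bad previous allocation size.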
Hmm, 0x30 seems rather odd, given that the previous region tagged CMNb in the list was valued at 0x70 and was also recently freed, probably along with this one. Curious. We'll compare the two regions by dumping their contents based on their specified sizes. I usually use dc instead of dps first on pool allocations, since strings are often the best clues to gain from them. Here they are:
Hmm, without a doubt the corrupt region looks to be cut off midway. Let's see what it looks like if we give it the 'normal' size of 0x70:

Code:
6: kd> dc fffff8a00feb0190 fffff8a00feb0190+70
fffff8a0`0feb0190  05070104 e24e4d43 09d68b30 fffffa80  ....CMN.0.......
fffff8a0`0feb01a0  00000000 00000000 0feb01b0 fffff8a0  ................
fffff8a0`0feb01b0  0052005c 00470045 00530049 00520054  \.R.E.G.I.S.T.R.
fffff8a0`0feb01c0  005c0059 0041004d 00480043 004e0049  Y.\.M.A.C.H.I.N.
fffff8a0`0feb01d0  005c0045 00590053 00540053 004d0045  E.\.S.Y.S.T.E.M.
fffff8a0`0feb01e0  0043005c 006e006f 00720074 006c006f  \.C.o.n.t.r.o.l.
fffff8a0`0feb01f0  00650053 00300074 00310030 00000000  S.e.t.0.0.1.....
fffff8a0`0feb0200  03040107                             ....
6: kd> dc fffff8a00feb0390 fffff8a00feb0390+30
fffff8a0`0feb0390  05030115 e24e4d43 00000000 00000000  ....CMN.........
fffff8a0`0feb03a0  004c004c 00000000 0feb03b0 fffff8a0  L.L.............
fffff8a0`0feb03b0  0052005c 00470045 00530049 00520054  \.R.E.G.I.S.T.R.
fffff8a0`0feb03c0  005c0059                             Y.\.
Looking pretty. Now let's look at each header by dumping the headers as data structures. I assumed the name was going to be _POOL_HEADER and, sure enough, I was correct:

Code:
6: kd> dc fffff8a00feb0390 fffff8a00feb0390+70
fffff8a0`0feb0390  05030115 e24e4d43 00000000 00000000  ....CMN.........
fffff8a0`0feb03a0  004c004c 00000000 0feb03b0 fffff8a0  L.L.............
fffff8a0`0feb03b0  0052005c 00470045 00530049 00520054  \.R.E.G.I.S.T.R.
fffff8a0`0feb03c0  005c0059 0041004d 00480043 004e0049  Y.\.M.A.C.H.I.N.
fffff8a0`0feb03d0  005c0045 00590053 00540053 004d0045  E.\.S.Y.S.T.E.M.
fffff8a0`0feb03e0  0043005c 006e006f 00720074 006c006f  \.C.o.n.t.r.o.l.
fffff8a0`0feb03f0  00650053 00300074 00310030 fffff880  S.e.t.0.0.1.....
fffff8a0`0feb0400  03040104                             ....
Aside from the manipulated BlockSize value, which is the corrupted one, I was initially suspicious about the lack of values in the other fields like ProcessBilled, AllocatorBackTraceIndex & PoolTagHash. However, research hasn't really turned up anything solid on the last two, and ProcessBilled appears to be needed only when the driver requesting the pool allocation didn't provide a pool tag. So, to my current knowledge, I'm not confident we're dealing with any issues in those fields. At the moment, unless I'm completely missing something, we don't appear to see any other problem in the header besides the pool size getting messed up.

Code:
6: kd> dt !_POOL_HEADER
nt!_POOL_HEADER
   +0x000 PreviousSize     : Pos 0, 8 Bits
   +0x000 PoolIndex        : Pos 8, 8 Bits
   +0x000 BlockSize        : Pos 16, 8 Bits
   +0x000 PoolType         : Pos 24, 8 Bits
   +0x000 Ulong1           : Uint4B
   +0x004 PoolTag          : Uint4B
   +0x008 ProcessBilled    : Ptr64 _EPROCESS
   +0x008 AllocatorBackTraceIndex : Uint2B
   +0x00a PoolTagHash      : Uint2B
6: kd> dt !_POOL_HEADER fffff8a00feb0190
nt!_POOL_HEADER
   +0x000 PreviousSize     : 0y00000100 (0x4)
   +0x000 PoolIndex        : 0y00000001 (0x1)
   +0x000 BlockSize        : 0y00000111 (0x7)
   +0x000 PoolType         : 0y00000101 (0x5)
   +0x000 Ulong1           : 0x5070104
   +0x004 PoolTag          : 0xe24e4d43
   +0x008 ProcessBilled    : 0xfffffa80`09d68b30 _EPROCESS
   +0x008 AllocatorBackTraceIndex : 0x8b30
   +0x00a PoolTagHash      : 0x9d6
6: kd> dt !_POOL_HEADER fffff8a00feb0390
nt!_POOL_HEADER
   +0x000 PreviousSize     : 0y00010101 (0x15)
   +0x000 PoolIndex        : 0y00000001 (0x1)
   +0x000 BlockSize        : 0y00000011 (0x3)
   +0x000 PoolType         : 0y00000101 (0x5)
   +0x000 Ulong1           : 0x5030115
   +0x004 PoolTag          : 0xe24e4d43
   +0x008 ProcessBilled    : (null)
   +0x008 AllocatorBackTraceIndex : 0
   +0x00a PoolTagHash      : 0
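For illustration, the four 8-bit bitfields that dt shows packed into Ulong1 can be split out with plain shifts and masks. This is my own Python sketch mirroring the layout dt printed; note that 0x5030115 is also the value the bugcheck reported as Arg4.

```python
def decode_pool_ulong1(ulong1: int) -> dict:
    """Split _POOL_HEADER.Ulong1 into its four 8-bit bitfields."""
    return {
        "PreviousSize": (ulong1 >> 0) & 0xFF,
        "PoolIndex":    (ulong1 >> 8) & 0xFF,
        "BlockSize":    (ulong1 >> 16) & 0xFF,
        "PoolType":     (ulong1 >> 24) & 0xFF,
    }

# Ulong1 of the corrupt header (and Arg4 of the bugcheck):
fields = decode_pool_ulong1(0x5030115)
for name, val in fields.items():
    print(f"{name}: {val:#x}")  # PreviousSize 0x15, PoolIndex 0x1, BlockSize 0x3, PoolType 0x5
```

A quick decoder like this is handy when you only have the raw dword from a dump and no symbols loaded.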
So far we're looking at just the 0x30 value being erroneous. Now, when you look at it in binary and compare it to 0x70, you'll see that there's only a solitary bit difference between the two:
So my suspicion weighs rather heavily on there being a bit flip here. Everything else just looks too clean for it to be otherwise. Now, just to make sure, we're going to dump a little further than the end of the corrupt pool allocation to see if anything matches up. I would suspect the pool allocation after this one should hold the valid PreviousSize, which should be 0x70 instead of 0x30 (actually 0x7 instead of 0x3, since block sizes are counted in units of 0x10). Let's have a look-see:

Code:
6: kd> .formats 30
Evaluate expression:
  Hex:     00000000`00000030
  Decimal: 48
  Octal:   0000000000000000000060
  Binary:  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00110000
  Chars:   .......0
  Time:    Wed Dec 31 19:00:48 1969
  Float:   low 6.72623e-044 high 0
  Double:  2.37152e-322
6: kd> .formats 70
Evaluate expression:
  Hex:     00000000`00000070
  Decimal: 112
  Octal:   0000000000000000000160
  Binary:  00000000 00000000 00000000 00000000 00000000 00000000 00000000 01110000
  Chars:   .......p
  Time:    Wed Dec 31 19:01:52 1969
  Float:   low 1.56945e-043 high 0
  Double:  5.53354e-322
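The comparison that .formats lets you eyeball can also be done programmatically. A tiny illustrative helper (the name is mine) lists the bit positions where the two sizes disagree:

```python
def differing_bits(a: int, b: int):
    """Bit positions (LSB = 0) where a and b disagree."""
    return [i for i in range(max(a, b).bit_length()) if ((a ^ b) >> i) & 1]

print(differing_bits(0x30, 0x70))  # [6]: a single flip, the 0x40 bit
```

A one-element result is exactly the single-bit-error signature that points away from driver bugs and toward failing hardware.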
What?! This is where the monkey wrench is thrown in. It doesn't say 0x3, and it doesn't say 0x7; instead it says the previous size is 0x4, or 0x40. This isn't right! I went ahead and checked the next allocation after this one, and that one correctly gave a PreviousSize of 0x4, which is the size of the allocation we're looking at now. So it isn't that the whole pool table is messed up all the way down. Something's odd here.

Code:
6: kd> dc fffff8a00feb0390 fffff8a00feb0390+100
fffff8a0`0feb0390  05030115 e24e4d43 00000000 00000000  ....CMN.........
fffff8a0`0feb03a0  004c004c 00000000 0feb03b0 fffff8a0  L.L.............
fffff8a0`0feb03b0  0052005c 00470045 00530049 00520054  \.R.E.G.I.S.T.R.
fffff8a0`0feb03c0  005c0059 0041004d 00480043 004e0049  Y.\.M.A.C.H.I.N.
fffff8a0`0feb03d0  005c0045 00590053 00540053 004d0045  E.\.S.Y.S.T.E.M.
fffff8a0`0feb03e0  0043005c 006e006f 00720074 006c006f  \.C.o.n.t.r.o.l.
fffff8a0`0feb03f0  00650053 00300074 00310030 fffff880  S.e.t.0.0.1.....
fffff8a0`0feb0400  03040104 6d536d4d 00000000 00000000  ....MmSm........  // Notice the 'MmSm' pooltag? This is where this allocation starts.
fffff8a0`0feb0410  068882d0 fffffa80 00000040 000c0000  ........@.......
fffff8a0`0feb0420  00000000 00000000 00040000 00000000  ................
fffff8a0`0feb0430  00000000 00000000 00000000 00000000  ................
fffff8a0`0feb0440  03140104 7346744e 00000000 00000000  ....NtFs........
fffff8a0`0feb0450  000009df 00010000 00000000 00000000  ................
fffff8a0`0feb0460  00000000 00000000 00000000 00000000  ................
fffff8a0`0feb0470  00000000 00000000 00000000 00000000  ................
fffff8a0`0feb0480  00000000 00000000 00000000 00000000  ................
fffff8a0`0feb0490  0050013b                             ;.P.
6: kd> dt !_POOL_HEADER fffff8a0`0feb0400
nt!_POOL_HEADER
   +0x000 PreviousSize     : 0y00000100 (0x4)
   +0x000 PoolIndex        : 0y00000001 (0x1)
   +0x000 BlockSize        : 0y00000100 (0x4)
   +0x000 PoolType         : 0y00000011 (0x3)
   +0x000 Ulong1           : 0x3040104
   +0x004 PoolTag          : 0x6d536d4d
   +0x008 ProcessBilled    : (null)
   +0x008 AllocatorBackTraceIndex : 0
   +0x00a PoolTagHash      : 0
Continued in next post...
Last edited by Vir Gnarus; 08 Feb 2013 at 11:37.