|09 Sep 2013||#1|
| || |
BSOD 0x124 every 8 hours, sitting or composing text; OK for 6 years!
I built a power workstation for use in unfunded engineering work in December 2007, using dual Xeon E5335s, and the machine has been a brick for all this time. A couple of weeks ago I had a RAM fan short the back of the video card and cause a disk crash. By the time I found it and recovered my data, I had a new SATA 3 card (Vantec UGT-ST644R; RAID but use as JBOD). My new fast 2 TB HD is on it, and a couple of external RAID boxes are on the external SATA cables, which limit them to SATA 2. Everything was great for a few days, but I hadn't gotten around to moving plugs around behind the computer to put all the RAID boxes on the same UPS as the computer; at least one is on a surge protector but not the UPS.
A day or two ago the lights flickered when we went to bed. When I got up, the computer had a BSOD x124, which is a generic hardware BSOD that you can't do much with as a user. I came here and read what you had from two fellows in 2011 and reconfigured my error correcting to get a minidump, and, sure enough, I got another BSOD with a minidump a few hours later.
I tried to make a boot floppy to re-flash my BIOS but there wasn't enough room. I farkled things trying out boot CD-ROMs and had to unplug everything except my C drive and run the Windows 7 installation disk boot repair to clean things up. Difficulties continued with the RAID drives until I finally found and fixed a loose SATA connection on the back of one of them - a 3 TB which was not on the UPS, and which does not now pass all the Vantec health tests (one HD may be ill), but which has vital data on it, but it works great for now and that's another day. I followed your other instructions and have the SF_diagnostic_tool ZIP file ready to go. The other RAID is a 2 TB Icy Dock SATA 2 that is my "cloud" backup, also on external SATA; the Vantec runs it SATA 2 as with the other RAID.
I've had Windows complain that it couldn't do a chkdsk on the 3 TB on boot because "a recent software installation" prevented this. I did a chkdsk from the properties window on the icon from the Computer window, and it found unallocated space from the BSOD but nothing else. I see no conflicts in the interrupt levels or memory allocated I/O on the Device Manager Resources view. But a little earlier tonight, while typing a long post on my favorite forum, I got a BSOD x124, the third time today.
This computer has been a brick since I built it in December 2007. Something in the new Vantec or its driver may have been damaged in the power flicker last night; the RAID itself has been on the system for a couple of months now without problems. The Icy Dock has been there for a year or so without problems. They were on an external SATA 3 Kyocera card (the cable limited them to SATA 2, as now). Now, the internal SATA 2 does nothing, everything is on the Kyocera.
I also have a new SATA 3 card and have a bunch of stuff that was on a USB 2 hub that is now on a USB 3 hub, but I don't *think* that this is a problem.
My SF file is attached. HELP. I need this computer every day; I use it for health care, hosting videocons on Google+, occasional consulting, and urgent continuous personal uses such as health care records and analysis.
|My System Specs|
|10 Sep 2013||#2|
| || |
Well, the computer was showing its usual cheerful logon screen this morning, and there are still only two minidump files from yesterday's three BSODs, so it made it through the night without a BSOD.
After the third BSOD I discovered that at least one of the SATA cables to the external RAID enclosures, the one to the 3 TB that I was having trouble mounting, could use a better connection, and I did have some success re-seating it in the enclosure case. The cables I have been using came in the box with the RAID drives. I've got some locking six-foot SATA cables coming to make sure that never happens again.
The 3 TB enclosure also has USB 3; I may put it there if that is a more robust connection. The 2 TB "cloud" backup has a USB 2 connection that should be OK for occasional overnight automatic compressed backup. I may move it over to USB, particularly if you say that SATA has *anything* to do with my BSOD, freeing my SATA 3 card for real-time system stuff only, like the boot disk and a data drive that I need all the time, which will happen if the 3 TB RAID gets at all slow.
In the process of getting the Windows 7 install DVD to examine the boot disk, I removed an old 1 TB Hitachi. It was there because, after its disk crash a couple of weeks ago, I cleared it off and restored it 100%, bootable, with Acronis from backup files on the "cloud" RAID. It was slowing down the Microsoft repair software with its never-to-boot-again system installation, so it has transitioned to a cold backup. If I put it back, it will be temporary, it will be on the motherboard's SATA 2 connections, and I will put its BIOS boot order "over the fence" in the "do-not-even-THINK-about-booting-from-these" list with the Ethernet connections and RAID drives.
After settling in place and a chkdsk, both RAID drives are showing OK on the Diskeeper 12 health indicators, although I see no data on them from the Marvell Storage Utility that came with the Vantec SATA 3 card. The MSU did show the old 1 TB healthy, too, in spite of its age and the fact that I crashed it a couple of weeks ago. The Icy Dock has an LED indication of problems with either HD and it is OK; I may need software from the 3 TB RAID enclosure that I haven't installed to look at its internal health details.
BOTTOM LINE: If all the software looks OK, like driver compatibility and I/O sharing and such, please look at the SATA to the non-system drives and see if that is causing the BSOD. I would *love* to declare victory when the better SATA cables come in.
|11 Sep 2013||#3|
| || |
Looks like CPU is trying to get data from the Level0 cache, and it timed out in both processors. Could very well be a Thermal issue.
Error : BUSL0_SRC_ERR_M_NOTIMEOUT_ERR (Proc 4 Bank 0) Status : 0xb200000410000800
Error : BUSL0_SRC_ERR_M_NOTIMEOUT_ERR (Proc 6 Bank 0) Status : 0xb200000410000800
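For anyone who wants to follow the flags by hand, here is a minimal sketch of how a status word like the one above breaks down. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 57-63, and the low 16 bits as a compound error code of the form 0000 1PPT RRRR IILL for bus/interconnect errors); the function name and the dictionary it returns are made up for illustration, and the model-specific bits in between are ignored. Note that under this layout the T bit in the status above is 0, which is what the NOTIMEOUT part of the mnemonic encodes.

```python
# Sketch: decode the architectural fields of a machine-check status word.
# Assumes the IA32_MCi_STATUS layout documented in the Intel SDM.
# decode_status() is a hypothetical helper, not part of any debugger.

FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}
PP = ["SRC", "RES", "OBS", "GEN"]            # who originated the request
RRRR = ["ERR", "RD", "WR", "DRD", "DWR",
        "IRD", "PREFETCH", "EVICT", "SNOOP"]  # request type
II = ["M", "RESERVED", "IO", "OTHER"]        # memory or I/O
LL = ["L0", "L1", "L2", "LG"]                # hierarchy level


def decode_status(status):
    """Return the set flag names and, for bus errors, the mnemonic."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "mca_code": code, "mnemonic": None}
    # Bus/interconnect errors: bits 15-13 are zero and bit 11 is set.
    if (code & 0xE800) == 0x0800:
        ll = LL[code & 0x3]
        ii = II[(code >> 2) & 0x3]
        r = (code >> 4) & 0xF
        rrrr = RRRR[r] if r < len(RRRR) else "R%d" % r
        t = "TIMEOUT" if (code >> 8) & 1 else "NOTIMEOUT"
        pp = PP[(code >> 9) & 0x3]
        result["mnemonic"] = "BUS%s_%s_%s_%s_%s_ERR" % (ll, pp, rrrr, ii, t)
    return result


if __name__ == "__main__":
    print(decode_status(0xb200000410000800))
```

The flag list tells you whether the error was uncorrected (UC) and whether the processor context may be corrupt (PCC), which is usually the first thing worth checking on a 0x124.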
|11 Sep 2013||#4|
| || |
Wow, it's an Intel 5000X MCH (Northbridge, in later chipsets) error. I *never* would have gotten that following flags around manually. Your capability here is amazing. The big thing for me here is that my new SATA 3 card and USB 3 card are not the problem and neither are their drivers. I was expecting to change card slots, drivers, or vendors. Your diagnosis tells me that my current configuration is working OK and to leave it alone.
I'm using the original Intel fans. The BIOS supports Intel SmartFan, and the temperature at which the fans spin up is set at 66 C. They were idling at the time of the BSOD. In fact, unless I stress the machine, the fans always idle except for one burst on boot before the chipset gets a handle on the CPU temperature and regulates the fans. I call that the "leaf blower" phase of boot: it begins after most of the POST, before the video card turns on the display, and continues for a few seconds while the splash screen (or, if you press <TAB>, the POST screen) is showing; then the CPU fans drop to idle.
In the past I have run core-intensive single-thread apps on each of the eight cores and the fans have run at half speed 24/7 for months, most recently about six months ago. This kind of thing is the real reason that I'm using an eight-core machine instead of a dual-core Athlon or some such; it's part of my personal unfunded research. No problems there.
Recently I've been trying to get the board to boot with Xeon E5450's with no joy (two pairs of them!). I've been laying the computer down and working on it under the desk. In the process, I've broken the SVI to VGA converter and two SATA cables. Although I'm careful and I've done this kind of thing scores of times, there are some relevant things here:
Since I replaced both SATA cables I haven't had a BSOD. I have better SATA cables with locking connectors coming in this afternoon and they will go on when I have them. But that doesn't seem to be relevant to what you have found.
What *may* be relevant to what you have found is that all three BSODs happened in a day in which I had just put back the E5335s after trying a pair of E5450s. It's dusty under there. I notice that both bus errors are on one chip; I don't know whether processors 4 and 6 refer to a problem that can be traced to the same pin or not. The BSODs stopped when I pressed in a SATA connector that was pulled out and slightly diagonal, but it still could have been a speck of dust on one of the processor pins. Or not.
I'll try HwMonitor; by the name it gives more information than the others. The BIOS screen gives information on the CPU temperature and it is fine - while idling along executing the BIOS setup program. I'll watch the CPU temperature. The thing to look for is "sticky" numbers indicating that updates are not always forthcoming in a timely fashion, erratic temperature variation at any time, or any other anomaly.
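If the monitoring tool can export its readings, the two things to watch for above ("sticky" numbers and erratic jumps) can be checked mechanically. This is only a rough sketch: the function name and both thresholds are my own assumptions, and it works on a plain list of readings rather than on any particular tool's log format. Idle temperatures reported in whole degrees repeat a lot, so the "stuck" threshold needs to be generous for the sampling interval in use.

```python
# Sketch: flag "sticky" and erratic readings in a series of temperature
# samples. find_anomalies() and its default thresholds are hypothetical.

def find_anomalies(samples, max_repeat=10, max_jump_c=10.0):
    """Return a list of (kind, index) tuples.

    ("stuck", i): a run of identical readings starting at index i has
                  grown longer than max_repeat samples.
    ("jump", i):  sample i differs from sample i-1 by more than max_jump_c.
    """
    anomalies = []
    run_start = 0
    for i in range(1, len(samples)):
        if abs(samples[i] - samples[i - 1]) > max_jump_c:
            anomalies.append(("jump", i))
        if samples[i] != samples[i - 1]:
            run_start = i                      # a new run begins here
        elif i - run_start + 1 == max_repeat + 1:
            anomalies.append(("stuck", run_start))  # report each run once
    return anomalies


if __name__ == "__main__":
    readings = [40, 41, 41, 41, 41, 55, 42]
    print(find_anomalies(readings, max_repeat=3, max_jump_c=10.0))
```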
|11 Sep 2013||#6|
| || |
OK, I clicked the wrong link on the CPUID web site and had to uninstall over ten junk programs, then to search for a way to get HwMonitor Pro. I set up data logging but I don't see any logs, or a way to get the logs or plots directly from the application window.
I DO SEE A BIG RED FLAG. THE +12 VOLT RAILS ARE RUNNING AT +7.60 VOLTS AND THE -12 VOLT RAIL IS RUNNING AT -7.26 VOLTS.
Core temperatures are running 39 C to 43 C on processor 1, 49 C to 55 C on processor 2. The two Intel heat sinks are different; one is aluminum and the other is copper. They were hard to get in December 2007 and NewEgg had a limit of 1, so I made two orders, and they came in a few days apart - with different heat sinks and different stepping levels. I have been treating the heat sink/fans as interchangeable and, although I always put the copper heat sink on socket 2, I don't keep track of which processor is in which slot.
The power supply is an Antec 750 Watt model ATX12V V2.3, with four +12 Volt rails and one -12 Volt rail. All four +12 Volt rails are rated at 25 Amps and the -12 Volt rail is rated at 0.8 Amps. They recommend that the total of the 12 Volt rail powers be less than 744 Watts, or an average of 15.5 Amps. I don't think that I am exceeding that, or exceeding 25 Amps on any one rail (I'm using two on this motherboard, one for the CPU fans and one for the ATAPI power bus).
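The arithmetic behind those numbers, as a quick sketch: 744 Watts at 12 Volts is 62 Amps combined, which is 15.5 Amps averaged over four rails. The rail ratings below are the ones quoted above; the helper name and the per-rail loads in the example are made up, since I don't have measured currents.

```python
# Sketch: check hypothetical per-rail loads against the PSU limits quoted
# in the post (4 x +12 V rails at 25 A each, 744 W combined).
# check_rails() and the sample loads are illustrative assumptions.

RAIL_VOLTAGE = 12.0
PER_RAIL_LIMIT_A = 25.0
COMBINED_LIMIT_W = 744.0
NUM_RAILS = 4

combined_limit_a = COMBINED_LIMIT_W / RAIL_VOLTAGE   # total amps allowed
average_per_rail_a = combined_limit_a / NUM_RAILS    # average amps per rail


def check_rails(loads_a):
    """Return True if no rail and no combined limit is exceeded."""
    per_rail_ok = all(load <= PER_RAIL_LIMIT_A for load in loads_a)
    combined_ok = sum(loads_a) * RAIL_VOLTAGE <= COMBINED_LIMIT_W
    return per_rail_ok and combined_ok


if __name__ == "__main__":
    print(combined_limit_a, average_per_rail_a)
    # Made-up loads: two rails in use, two idle.
    print(check_rails([18.0, 12.0, 0.0, 0.0]))
```

Note that a single rail can be inside its 25 Amp rating while the combined limit is still exceeded, which is why both checks are needed.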
This computer goes through a power supply every year and a half or so. A cheap power supply I got in a pinch once lasted a week. Something tells me that I need to get a big gamer power supply ASAP. Then, the E5450's might boot.
What do you think???? Does the 12 Volt rail have anything to do with the Intel 5000X MCH or the processor voltages? Do we have our smoking gun?
|11 Sep 2013||#7|
| || |
There were two minidump files; you can ignore the earlier one and just get the third BSOD. All the logs should cover the time of that last BSOD. The first time I had the BSOD it was not set for a dump. I came here and read the instructions and it was configured for the next one. And the next one.
CPUID Hardware Monitor Pro is reporting -7.26 Volts for the -12 Volt supply and +7.60 Volts for the +12 Volt supply in the first block of data, which are labeled as the CPU voltages. I pulled out my DVM and checked a couple of 12 Volt power supply pins and places on the motherboard and they are all 12.12 Volts, plus or minus a few millivolts. I don't know what to make of the low voltages reported by CPUID Hardware Monitor. I'll look into the motherboard information and see if there is a fan plug mismatch or some such that is causing a voltage drop for the 12 Volt lines into the Intel 5000X.
Oh, and thank you for enabling my signature photo early! I recognize that this is a singular courtesy, and I very much appreciate it. I took the photo Monday while I was trying some E5450's on for size; no joy, but I love this picture. It's a strip across the LGA 771 socket.
|12 Sep 2013||#8|
| || |
I booted to BIOS setup and viewed the 12V lines and they read 12V by the BIOS. Since my voltmeter also reads 12V, I think it's clear that HwMonitor Pro uses a bad scale factor for that particular voltage on my particular motherboard, which has been discontinued for years now.
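To put a number on the suspected scale-factor error: the monitor reported +7.60 Volts where the voltmeter read 12.12 Volts, so the reading is off by a factor of about 1.59. The sketch below just does that division; the function name is made up, and treating the error as a pure multiplier (no offset) is an assumption on my part.

```python
# Sketch: estimate the multiplier a monitoring tool would need so that its
# reading matches a voltmeter measurement. Assumes a purely multiplicative
# error with no offset; implied_scale() is a hypothetical helper.

def implied_scale(reported_v, measured_v):
    """Return measured / reported, the correction factor for the reading."""
    if reported_v == 0:
        raise ValueError("reported voltage must be non-zero")
    return measured_v / reported_v


if __name__ == "__main__":
    scale = implied_scale(reported_v=7.60, measured_v=12.12)
    print(round(scale, 3))
    # Applying the same factor to the -12 V reading, as a cross-check.
    print(round(-7.26 * scale, 2))
```

If the same factor brings the -12 Volt reading close to nominal as well, that supports a wrong divider constant in the tool rather than two separately failing rails.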
I downloaded and am running SpeedFan because I wanted to see what it has to say about the 12V rails while the computer is running, and it doesn't show the 12V rails. That tells me that the 12V rails aren't used in the CPU interfaces. I always thought that the +12V rails were used for fans and HD motors and such, not for powering components that deal with signals. The -12V rail was used in RS-232 signals and some types of logic like ECL that we don't see anymore, and to supply some comparators along with a +12V rail, but that technology hasn't been used on motherboards in decades so far as I know.
I notice that SpeedFan doesn't list my motherboard among its selection of ASUS motherboards, and I suspect that HwMonitor uses constants for another ASUS motherboard or some such. I've put in a question over on the HwMonitor forum, and if I don't get an answer I'll put in an error report to CPUID. At this point, though, I'm convinced that there is nothing wrong with my power supply, motherboard, or voltages, and that HwMonitor is using the wrong scale factor for this particular voltage on my motherboard. If there remains a problem that may be causing my BSODs, it is not clear what it is. What I do know is that I haven't had another one for a couple of days now.
|16 Sep 2013||#9|
| || |
I was logged on a secure web site checking my settings when, abruptly, all the lights went out. The monitor was off, the disk activity light was off, but the power light was on. The POST sequence began and the computer booted. I got a Windows prompt screen "Windows did not shut down normally..." so this was no software-driven reboot, but a crash. I did NOT get a BSOD or a screen dump.
I ran SF_Diagnostic_Tool and attached the file to this post as soon as I could find the file, which was a few minutes. Note that there is no new minidump, but the tool probably picked up the two that are already there from last week. There was no BSOD this time, but there *was* a hard crash, which is worse than a BSOD, I would think, since the OS failed to trap the error and have a recovery option, e.g. the BSOD with information for the user and the minidump.
I would wager that the logs in the attached file are related to the BSODs of last week.
I re-flashed the BIOS Saturday, clearing the checksum as an option on the flashing operation, and as expected got a checksum error in the DMI log that was announced in a POST sequence line every time I booted until I went into BIOS setup, viewed it, and cleared it. I have *never* gotten a BIOS checksum error, so I think that the previous BIOS was OK, so I think that we can eliminate BIOS corruption as a cause. In any case, I flashed the BIOS from an ASUS utility that runs under Windows while the machine was cheerfully multitasking away, so I think that the BIOS is only important at boot time, not while running, and is unlikely to be involved with a BSOD, even the dreaded x124 BSOD.
I have noticed two mini-BIOS screens for Intel Boot Agent GE v 1.2.6 since I bought the machine. I thought that these were mini-BIOS EPROMs for the two CPUs but it turns out that these are for the two gigabit Ethernet "cards" embedded in the motherboard; they also appear on the boot device list (I always put them in the "do-not-boot-from-these" list). This motherboard will boot from LAN if you put one of the Ethernet cards in the boot list, or if you hit one of the function keys (I forget which) during POST. In the process of researching this I found updates to the LAN hardware on the Intel web site and ran them. I checked the mini-BIOS to see if they were updated with the driver update and, yes, they went from v 1.2.6 to v 1.2.36. But, again, I have never had any problem whatsoever with the LAN hardware/firmware and don't expect any involvement with the BSODs.
I have had some problems with USB hubs failing since my upgrade. In particular, a new Manhattan MondoHub USB 3/2 hub with four USB 3 and 24 USB 2 ports will fail if you put too many devices on it. My Canon P-150 mini-scanner (powered for 0.5 A, additional USB draw 0.5 A) would drop some other USB connections off the hub and fail to install its driver until I moved it over to a motherboard USB 2 port, for example. The hub is rated at 4 Amps but somehow it almost acts like an unpowered USB hub. I have a really old IOGear card reader/USB hub that is having the same problem. I have two USB 3/2 powered hubs coming Thursday.
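A back-of-the-envelope power budget shows why a 4 Amp supply can behave this way on a 28-port hub. The sketch assumes the usual per-port maximums of 0.5 A for USB 2 and 0.9 A for USB 3, and it ignores whatever the hub electronics draw themselves; the variable names are just for illustration.

```python
# Sketch: worst-case power budget for a 28-port hub on a 4 A supply.
# Per-port maximums are the standard USB 2.0 (0.5 A) and USB 3.0 (0.9 A)
# limits; the hub's own consumption is ignored (an assumption).

USB2_PORTS = 24
USB3_PORTS = 4
USB2_MAX_A = 0.5
USB3_MAX_A = 0.9
SUPPLY_A = 4.0

# Current needed if every port drew its maximum at the same time.
worst_case_a = USB2_PORTS * USB2_MAX_A + USB3_PORTS * USB3_MAX_A

# Average current available per port if all ports are populated.
average_per_port_a = SUPPLY_A / (USB2_PORTS + USB3_PORTS)

# How many full-power USB 2 devices the supply can carry at once.
full_power_usb2_devices = int(SUPPLY_A // USB2_MAX_A)

if __name__ == "__main__":
    print(worst_case_a, round(average_per_port_a, 3), full_power_usb2_devices)
```

So the supply covers roughly a quarter of the worst case, and a handful of bus-powered devices at full draw is enough to use it up.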
One of the RAIDs I had on external SATA kept being dropped off the HD list so I moved it over to USB - on the Manhattan MondoHub. It's just my backup drive and speed isn't important (Acronis doesn't write fast). It's stable now. The three-foot SATA 3 cables that I had ordered arrived, and they were three *inches* long, not three feet long; I have 1-meter SATA 3 cables coming.
All USB hubs are powered and their wall warts are on the same UPS as the main computer so that a power glitch won't disrupt the configuration.
|01 Oct 2013||#10|
| || |
I think we're about done with this one
Since my last post, here is what I have done with the system:
The only problem identified by you from the logs is a timeout in accessing L0 cache (on module!) for processors 4 and 6, which is one core on each E5335. Since this disturbing event can be caused by thermal issues, and one of my E5335's came with an aluminum heat sink (and fan), I ordered a new copper heat sink with fan from eBay. Here are temperature comparisons. Note that the aluminum heat sink was on processor 1, and that processor 2 has always had a copper heat sink.
|CPU||Al-Cu, C||Cu-Cu, C|
Note that CPUID apparently used the cores from socket 2 for cores 0-3 and from socket 1 for cores 4-7, and that the room was apparently a little cooler when I checked the temperatures for two copper heat sinks. Also note that the copper heat sink apparently makes a difference of about 1 C to 2 C, but nothing that seems to make it worthwhile to go out and get a copper heat sink if you have an aluminum one. Both are from heat sink and fan combinations that came in the box with a Xeon E5335 (TDP 80 Watts) and seem to be the same units shipped with other Xeon chips with TDP up to 150 Watts, and I am speaking of genuine Intel heat sinks shipped with, or for, Xeon chips of the type in use.
The motherboard's lowest default chip temperature at which to begin revving the CPU fans is 66 C, which was never approached during the BSOD events or the crash. I am going to go out on a limb here and say that the BSODs were caused either by lint under the sockets or bad drivers in my new installation with some really old hardware still on USB buses. I have had the CPUs out six times, trying to get three pairs of Xeon E5450's to boot, with no joy, hence the possibility of lint in the socket(s) after the first two tries, which is when the BSODs occurred, or four tries, when the crash occurred (11:28:59 AM, according to a message I got after reboot). I did examine the system logs after the crash, and there are none within about ten minutes before or about a half hour after the crash. Either the cause of the crash wasn't logged or the logs were in the disk cache and were lost, so SF_diagnostic_tool won't find any logs relevant to the crash because they aren't there.
Unless one of you guys gets back to me in the next day or two with data that indicates that I still have something to look at, I'm going to mark this thread as Solved.