Single-bit corrected errors
29 answers - 904 bytes -

I've been moving disk drives (DVD and HDD) around and upgrading OS
(SuSE 9.0 to 10.0) over the past couple of days. In the process
I've been playing with the BIOS to get the drives ordered the way I
wanted (and to figure out who listens to BIOS drive ordering ;).
While trying to figure out some strangeness I cleaned out the BIOS
messages area. Last night I kept getting beeps (about one an
hour). It took a while to isolate it, but I finally looked at the
hardware error report in the setup screens. Seems I've been
getting a corrected single-bit error in the same location, on
average about once an hour. I guess ECC actually does something.
;-)
Now to figure out which stick is faulty (it's all Crucial, with
a lifetime warranty ;-). If everything is right in the world the
error report _should_ point me to it. All I need is the Rosetta
stone.
No.1 | | 1386 bytes |
| 
Keith,
If you only have 1 bank, how hard this is to diagnose will depend
on your controller. If you have 2 banks (non-interleaved),
the address should tell you which bank it's in. Swapping one
module from the bad bank into a good one should tell you which
module it is, by seeing the address move higher or lower.
Not knowing the architecture specifics, this is the best
advice I can give.
Hope this helps,
Mark Whitlock
No.2 | | 2332 bytes |
| 
In article <b%%Kf.1$Aj3.132@news.abs.net>, markw@abs.net says
>Keith,
>If you only have 1 bank, how hard this is to diagnose will depend
>on your controller. If you have 2 banks (non-interleaved),
>the address should tell you which bank it's in. Swapping one
>module from the bad bank into a good one should tell you which
>module it is, by seeing the address move higher or lower.
>Not knowing the architecture specifics, this is the best
>advice I can give.
Thanks, but I was more or less just telling the regular .chipsters
that there is some evidence that ECC actually does something. ;-)
I've always had that question (even when qualifying boards in a
former life): how do you know ECC is really doing something
useful?
My board is a Tyan S2875S (Socket 940) with 2x2 banks,
unfortunately with mixed-size DIMMs (2x256MB and 2x512MB), so I
can't just swap one DIMM high to low. Any crossing I do is
symmetrical, though I suppose swapping one set of DIMMs will at
least change the error signature, or not. That would at least
isolate the problem to the pair.
I've sent an email off to Tyan support, which hit an auto-responder.
I have to jump through their FAQ hoops (like "Why does
my system no longer automatically launch into the OS once I have
updated my BIOS from v3.01 to v3.02?") to get to a real person.
No.3 | | 502 bytes |
| 
Yes, ECC really does something. I've been going thru servers and
upgrading the memory. In the process, I find servers that have memory
ECC fails that have been on line and running for two years. But they
have been performing perfectly.
I don't leave the failing memory installed but replace it as a matter
of practice. But it's nice to know your insurance package really works.
(Hard drives would never work were it not for ECC).
Tom S.
No.4 | | 1085 bytes |
| 
In article <1140759300.161807.202300@i40g2000cwc.googlegroups.com>,
gtstephenson@gmail.com says
>Yes, ECC really does something. I've been going thru servers and
>upgrading the memory. In the process, I find servers that have memory
>ECC fails that have been on line and running for two years. But they
>have been performing perfectly.
I know it "does something", if it's actually implemented. ;-) My
problem was finding out *if* it's implemented and if, how well.
I'd never had an example of failing memory to test it with before.
Now that I can test it, I can't find it (Tyan has *tried* to be
helpful, but so far).
>I don't leave the failing memory installed but replace it as a matter
>of practice. But it's nice to know your insurance package really works.
Indeed.
>(Hard drives would never work were it not for ECC).
I know the theory behind ECC and certainly know it works. I also
know motherboard makers have a way of skirting around the truth
(remember the fake caches?).
No.5 | | 513 bytes |
| 
Keith <krw@att.bizzzz> wrote in part:
>I know it "does something", if it's actually implemented. ;-)
>My problem was finding out *if* it's implemented and if,
>how well. I'd never had an example of failing memory to
>test it with before. Now that I can test it, I can't find it
>(Tyan has *tried* to be helpful, but so far).
See if memtest-86 will provoke errors faster. I suspect one
cell is "leaking" and shows up on slower refresh cycles.
-- Robert
No.6 | | 905 bytes |
| 
Sat, 25 Feb 2006 00:56:53 +0000, Robert Redelmeier wrote:
>See if memtest-86 will provoke errors faster. I suspect one
>cell is "leaking" and shows up on slower refresh cycles.
I think that's what Tyan was trying to get me to run, but they couldn't answer
the obvious questions: where, and Linux? Sheesh, the BIOS setup report
gives me a location and a data value (always the same). You'd think
*they* could decode this down to the DIMM! I guess some things are too
hard.
No.7 | | 934 bytes |
| 
On Fri, 24 Feb 2006 21:07:25 -0500, Keith <krw@att.bizzzz> wrote:
>I think that's what Tyan was trying to get me to run, but couldn't answer
>the obvious questions; where and Linux?
http://www.memtest86.com/#download0
No.8 | | 1060 bytes |
| 
Sat, 25 Feb 2006 00:56:53 GMT, Robert Redelmeier <redelm@ev1.net.invalid>
wrote:
>See if memtest-86 will provoke errors faster. I suspect one
>cell is "leaking" and show up on slower refresh cycles.
>
Robert
Control over refresh rates was dropped from memtest-86 some time ago, if that
is what you're referring to.
In any case, any conventional memory test technique will fail miserably to
detect cells with nascent leakage issues, as simply reading the row that the
cell resides in will refresh that cell, and you know how memory tests like to
hammer on memory
No.9 | | 574 bytes |
| 
daytripper <day_trippr@removeyahoo.com> wrote in part:
>In any case, any conventional memory test technique will fail
>miserably to detect cells with nascent leakage issues, as simply
>reading the row that the cell resides in will refresh that cell,
>and you know how memory tests like to hammer on memory
Sure but I think the timescales are different. Memory
loses refresh fairly quickly, 500 ms IIRC. memtest-86 is
slow and takes multiple seconds for each pass. I think
one of the tests is write all, then read all.
-- Robert
No.10 | | 1187 bytes |
| 
I used to write firmware for a bus-powered USB device with 2MB of SDRAM.
We had a problem: we could not reliably tell a firmware reload over a
live copy from an unplug/replug (power cycle). The data was
retained in SDRAM for a few seconds. It took about 10-15 sec to guarantee
that the data was gone. I suspect it's because the chip was not powered.
In a powered module, the data would be gone sooner without refresh.
"Robert Redelmeier" <redelm@ev1.net.invalidwrote in message
news:IsQLf.16326$@newssvr12.news.prodigy.com
daytripper <day_trippr@removeyahoo.comwrote in part:
>In any case, any conventional memory test technique will fail
>miserably to detect cells with nascent leakage issues, as simply
>reading the row that the cell resides in will refresh that cell,
>and you know how memory tests like to hammer on memory
>
Sure but I think the timescales are different. Memory
loses refresh fairly quickly, 500 ms IIRC. memtest-86 is
slow and takes multiple seconds for each pass. I think
one of the tests is write all, then read all.
-- Robert
No.11 | | 1430 bytes |
| 
Sat, 25 Feb 2006 03:48:56 GMT, Robert Redelmeier <redelm@ev1.net.invalid>
wrote:
>Sure but I think the timescales are different. Memory
>loses refresh fairly quickly, 500 ms IIRC. memtest-86 is
>slow and takes multiple seconds for each pass. I think
>one of the tests is write all, then read all.
>
Robert
A specification will never tell you how long an SDRAM might retain data sans
refresh (in all its forms). What it will tell you is the *minimum* amount of
time the SDRAM is guaranteed to retain data under the listed test conditions,
which usually include high temperature, which aggravates cell leakage.
So unless your system is soaking in temps that drive the die temps to 85C,
that value won't be anywhere near "real life". It is a value that guarantees
the vendor won't get late-night phone calls from irate customers, so it (like
virtually every spec you'll ever see in a commodity market) is spec'd
conservatively to begin with.
No.12 | | 1445 bytes |
| 
On Fri, 24 Feb 2006 21:07:25 -0500, Keith <krw@att.bizzzz> put finger
to keyboard and composed:
>I think that's what Tyan was trying to get me to run, but couldn't answer
>the obvious questions; where and Linux? Sheesh the BIOS setup report
>gives me a location and a data (always the same).
What is the BIOS telling you? Does it identify the failing data bit or
check bit?
>You'd think
>*they* could decode this down to the DIMM! I guess some things are too
>hard
If you could find some way to non-destructively create a double bit
error, then that should locate your faulty module. For example, you
could place tape over a "good" data bit on the edge connector.
- Franc Zabkar
No.13 | | 2080 bytes |
| 
Sat, 25 Feb 2006 19:30:41 +1100, Franc Zabkar wrote:
>What is the BIOS telling you? Does it identify the failing data bit or
>check bit?
I don't have the exact message (hard to cut-n-paste from BIOS
;) but it's an ECC correction; A7DA0-00D1
>>You'd think
>>*they* could decode this down to the DIMM! I guess some things are too
>>hard
>
>If you could find some way to non-destructively create a double bit
>error, then that should locate your faulty module. For example, you
>could place tape over a "good" data bit on the edge connector.
I guess I could do something like that. Ugly. It's happening often enough
now that maybe I'll run with a single bank/DIMM for a while. Hmm, I
wonder if it'll run with three DIMMs. I could pull out the high DIMMs to
isolate it to high/low, then one of them to get to the DIMM. I really
hate pulling DIMMs with the board in the case though.
No.14 | | 739 bytes |
| 
Sat, 25 Feb 2006 00:56:53 +0000, Robert Redelmeier wrote:
>See if memtest-86 will provoke errors faster. I suspect one
>cell is "leaking" and shows up on slower refresh cycles.
Tyan tech support suggests memtest86 (same thing?). I'm not sure what
that's going to do though. The errors are corrected.
No.15 | | 379 bytes |
| 
Keith <krw@att.bizzzz> wrote in part:
>Tyan tech support suggests memtest86 (same thing?).
I think there's a memtest-86+ that is the successor project.
>I'm not sure what that's going to do though.
>The errors are corrected.
Sure, but they'll show in the logs! Run for one hour,
reboot and check the BIOS.
-- Robert
No.16 | | 753 bytes |
| 
Keith <krw@att.bizzzz> wrote in part:
>I don't have the exact message (hard to cut-n-paste from BIOS ;)
>but it's an ECC correction; A7DA0-00D1
That might be address 7DA000D1 -- just shy of 2 Gig.
More likely address 00D17DA0 -- a bit over 13 megs.
>I really hate pulling DIMMs with the board in the case though.
What's the grief with pulling 'em? Pushing those eject levers
isn't that tough, and they're usually located close to the
mobo mounts. I can't imagine you not putting them in!
Inserting DIMMs takes a bit more force and the mobo might
flex more than I'd like, so I support it with one hand from
the edge, which is often nearby.
-- Robert
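As a quick sanity check on those two readings, here is a minimal Python sketch (not from the thread). It converts each candidate to a size and compares it with the 1.5GB that Keith reports as installed elsewhere in the thread. The names `candidates` and `INSTALLED` are made up for illustration, and the comparison assumes a flat address map with no memory hole.

```python
GIB = 1 << 30
MIB = 1 << 20

# Assumption: 1.5 GB installed, as reported elsewhere in the thread.
INSTALLED = 3 * GIB // 2

# Two ways of reading the squashed string "A7DA0-00D1" as one 32-bit address.
candidates = {
    "7DA000D1": 0x7DA000D1,
    "00D17DA0": 0x00D17DA0,
}

for text, addr in candidates.items():
    verdict = "fits" if addr < INSTALLED else "beyond installed memory"
    print(f"{text}: {addr / MIB:8.1f} MiB ({verdict})")
```

Under those assumptions only the second reading lands inside installed memory, though neither reading need be the right way to split the message.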
No.17 | | 2028 bytes |
| 
On Sat, 25 Feb 2006 17:08:16 +0000, Robert Redelmeier wrote:
>Keith <krw@att.bizzzz> wrote in part:
>>I don't have the exact message (hard to cut-n-paste from BIOS ;)
>>but it's an ECC correction; A7DA0-00D1
>
>That might be address 7DA000D1 -- just shy of 2 Gig.
Nah, only 1.5GB installed. ;-)
>More likely address 00D17DA0 -- a bit over 13 megs.
Here is the complete message:
02/25/06 10:19:54
Single Bit ECC Memory Error
Address and Syndrome
000A7DA0 - 00D1
If I assume this is a 32-bit address (why would it be 32b?) I could infer
that it's a little over 1/2 MB. If I add three zeros on the right (8
bytes per DIMM) I get about 5MB. It could be a cache line (add another
five bits).
>>I really hate pulling DIMMs with the board in the case though.
>
>What's the grief with pulling 'em? Pushing those eject levers isn't that
>tough, and they're usually located close to the mobo mounts. I can't
>imagine you not putting them in!
It's not pulling them that's the problem. I'm not looking forward to too
many insertion cycles though. ;-)
>Inserting DIMMs takes a bit more force and the mobo might flex more than
>I'd like, so I support it with one hand from the edge, which is often
>nearby.
There is a screw sorta in the middle of the DIMMs, so maybe I'll give it
a try.
BTW, there is something else going on too. It's beeping more often than
there are memory errors. The last one was at 10:00 this morning and
it's now 1:30PM. It beeps about every ten minutes. I just did a cruise
through the BIOS settings and didn't see anything strange.
Another strange thing since I've gone to SuSE 10.0: the computer no
longer shuts down by itself. It halts, but doesn't power off. And
dual head no longer works. I'm beginning to hate "upgrades".
No.18 | | 772 bytes |
| 
Sat, 25 Feb 2006 16:55:57 +0000, Robert Redelmeier wrote:
>Keith <krw@att.bizzzz> wrote in part:
>>Tyan tech support suggests memtest86 (same thing?).
>
>I think there's a memtest-86+ that is the successor project.
>
>>I'm not sure what that's going to do though.
>>The errors are corrected.
>
>Sure, but they'll show in the logs! Run for one hour,
>reboot and check the BIOS.
OK, but I *already* have memory errors in the logs. I don't have to
stress the system to get them. I'm just trying to find the culprit so
I can get it replaced. I can see the utility in a stress test after, but
don't see what information I can gain now.
No.19 | | 1604 bytes |
| 
On Sat, 25 Feb 2006 13:57:02 -0500, Keith <krw@att.bizzzz> wrote:
>Here is the complete message:
>
02/25/06 10:19:54
Single Bit ECC Memory Error
Address and Syndrome
000A7DA0 - 00D1
>
>If I assume this is a 32bit address (why would it be 32b) I could infer
>that it's a little over 1/2 MB. If I add three zeros on the right (8
>bytes per DIMM) I get about 5MB. It could be a cache line (add another
>five bits)
>
[snipped]
If you look at that "syndrome", half of it is zero; the other half has an 8-bit
value which probably would tell you which bit within a 72-bit codeword is the
offender, if you had the magic decoder ring for the memory layout. I bet the
syndrome actually covers ECC codewords from the two DIMMs within a pair.
It would be interesting (if not particularly revealing, I suppose) to see
whether, after swapping the two DIMMs within the pair (iirc, you have two
unequal-size pairs of DIMMs), the zeroes and "D1" swap position.
/daytripper
No.20 | | 2055 bytes |
| 
Sat, 25 Feb 2006 15:10:54 -0500, daytripper wrote:
>If you look at that "syndrome", half of it is zero, the other has an 8-bit
>value which probably would tell you which bit within a 72-bit codeword is the
>offender if you had the magic decoder ring for the memory layout. I bet the
>syndrome actually covers ecc codewords from the two dimms within a pair.
OK, but why four bits set on one bank?
>It would be interesting (if not particularly revealing, I suppose) to see if
>after swapping the two dimms within the pair (iirc, you have two, unequal size
>pairs of dimms) if the zeroes and "D1" swap position
Yes, I have four DIMMs, 2x256MB in the first two sockets and 2x512MB in
the second pair. If I have some more time I'll pull the 512MB pair and
see what happens. I wonder if the BIOS is smart enough to run with three
DIMMs?
No.21 | | 1679 bytes |
| 
On Sat, 25 Feb 2006 13:57:02 -0500, Keith <krw@att.bizzzz> put finger
to keyboard and composed:
>Here is the complete message:
>
02/25/06 10:19:54
Single Bit ECC Memory Error
Address and Syndrome
000A7DA0 - 00D1
>
>If I assume this is a 32bit address (why would it be 32b) I could infer
>that it's a little over 1/2 MB. If I add three zeros on the right (8
>bytes per DIMM) I get about 5MB. It could be a cache line (add another
>five bits)
My understanding is that each ECC DIMM is 72 bits wide, i.e. 64 data
bits + 8 check bits. Your "syndrome" consists of 16 bits, which could
represent a pair of modules. I suspect that if you switch the modules
the syndrome pattern would become D100. Also, my understanding is
that memory is accessed as 128 bits, so the memory address might
be 000A7DA00 bytes (i.e. reported address x 16). This works out at
approximately 10.5MB, which is still within the first two modules. If
I'm right, then switching the modules should not change the address of
the error. If you're right, then it will.
- Franc Zabkar
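The competing interpretations of the address field come down to simple arithmetic, which can be sketched in Python as follows (not from the thread). The value `reported` is the address from the BIOS message; the labels and the `scalings` table are illustrative assumptions, since the thread never establishes which unit the BIOS actually uses.

```python
MIB = 1 << 20
reported = 0x000A7DA0  # address field from the BIOS message

# Hypothetical units for the reported value; which one (if any) the BIOS
# uses is not documented in the thread.
scalings = {
    "bytes": 1,
    "64-bit words (x8)": 8,
    "128-bit words (x16)": 16,
}

sizes = {name: reported * factor / MIB for name, factor in scalings.items()}
for name, mib in sizes.items():
    print(f"{name:>20}: {mib:6.2f} MiB")
```

All three candidates fall well below the size of either DIMM pair, so the scaling alone does not change which pair is implicated.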
No.22 | | 1713 bytes |
| 
On Sat, 25 Feb 2006 15:18:34 -0500, Keith <krw@att.bizzzz> put finger
to keyboard and composed:
>If you look at that "syndrome", half of it is zero, the other has an 8-bit
>value which probably would tell you which bit within a 72-bit codeword is the
>offender if you had the magic decoder ring for the memory layout. I bet the
>syndrome actually covers ecc codewords from the two dimms within a pair.
>
>OK, but why four bits set on one bank?
Because they are syndrome bits, not data bits. A syndrome pattern of
D1 codes for one particular data bit.
- Franc Zabkar
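To illustrate why a single-bit error can light up several syndrome bits, here is a toy Hamming(7,4) sketch in Python (not from the thread). It is only an analogy: the code the memory controller actually uses over 64 data bits and 8 check bits is a different, larger code whose syndrome table is not given here, and the function names are made up for the example.

```python
from functools import reduce

DATA_POSITIONS = (3, 5, 6, 7)   # positions 1, 2 and 4 hold check bits

def encode(data_bits):
    """Build a 7-bit Hamming codeword; index 0 is unused padding."""
    word = [0] * 8
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        word[pos] = bit
    for p in (1, 2, 4):
        # Check bit p makes parity even over all positions whose index has bit p set.
        word[p] = sum(word[i] for i in range(1, 8) if i & p and i != p) % 2
    return word

def syndrome(word):
    """XOR of the positions holding a 1; zero for a valid codeword."""
    return reduce(lambda acc, i: acc ^ i, (i for i in range(1, 8) if word[i]), 0)

word = encode([1, 0, 1, 1])
assert syndrome(word) == 0

word[7] ^= 1                      # flip one data bit
print(f"{syndrome(word):03b}")    # all three syndrome bits set
```

The syndrome is a code naming the failing position, not a bitmap of failing bits, and it works the same way whether the flipped bit is a data bit or a check bit.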
No.23 | | 1820 bytes |
| 
On Sun, 26 Feb 2006 07:43:51 +1100, Franc Zabkar
<fzabkar@iinternode.on.net> put finger to keyboard and composed:
>Because they are syndrome bits, not data bits. A syndrome pattern of
>D1 codes for one particular data bit.
or check bit.
- Franc Zabkar
No.24 | | 2195 bytes |
| 
Sun, 26 Feb 2006 07:40:10 +1100, Franc Zabkar wrote:
>My understanding is that each ECC DIMM is 72 bits wide, i.e. 64 data
>bits + 8 check bits. Your "syndrome" consists of 16 bits, which could
>represent a pair of modules. I suspect that if you switch the modules
>the syndrome pattern would become D100. Also, my understanding is
>that memory is accessed as 128 bits, so the memory address might
>be 000A7DA00 bytes (i.e. reported address x 16). This works out at
>approximately 10.5MB, which is still within the first two modules. If
>I'm right, then switching the modules should not change the address of
>the error. If you're right, then it will.
I thought of that possibility as well. There are a lot of ways to encode
such messages. I'm not too surprised that the error codes aren't
published (PC documentation sucks), but am quite surprised that Tyan
doesn't have a clue.
I guess I'll swap the bottom pair (256MB) and see what happens.
No.25 | | 1973 bytes |
| 
Sun, 26 Feb 2006 07:46:37 +1100, Franc Zabkar wrote:
>>Because they are syndrome bits, not data bits. A syndrome pattern of
>>D1 codes for one particular data bit.
>
>or check bit.
Makes sense. I was thinking about a "syndrome" bit per byte (perhaps the
check bits themselves).
No.26 | | 1174 bytes |
| 
On Fri, 24 Feb 2006 21:39:54 -0500, daytripper <day_trippr@REMOVEyahoo.com>
wrote:
>http://www.memtest86.com/#download0
I think Chris Brady has pretty much passed the baton to the guys at
www.memtest.org now - more recent versions.
No.27 | | 3967 bytes |
| 
On Sat, 25 Feb 2006 16:33:42 -0500, Keith <krw@att.bizzzz> wrote:
>My understanding is that each ECC DIMM is 72 bits wide, ie 64 data
>bits + 8 check bits. Your "syndrome" consists of 16 bits which could
>represent a pair of modules. I suspect that if you switch the modules
>the syndrome pattern would become D100. Also, my understanding is
>that accesses memory as 128 bits, so the memory address might
>be 000A7DA00 bytes (ie reported address x 16). This works out at
>approximately 10.5MB, which is still within the first two modules. If
>I'm right, then switching the modules should not change the address of
>the error. If you're right, then it will.
>
>I thought of that possibility as well. There are a lot of ways to encode
>such messages. I'm not too surprised that the error codes aren't
>published (PC documentation sucks), but am quite surprised that Tyan
>doesn't have a clue.
I dunno - with the complications in DIMM addressing of AMD64 CPUs, and some
Intel chipsets these days, you'd probably have to talk to the guy who wrote
the BIOS to find the final answer, *if* he could even remember.
>I guess I'll swap the bottom pair (256MB) and see what happens.
If the AMD docs are correct (for your rev. CPU) and I'm reading them
right :-P, since you have different sized chip selects (or a non-power-of-two
number of chip selects), you should be running in non-interleaved mode
(that's non-interleaved ranks, which AMD calls chip-select banks). Also,
on pg.85 of the BIOS & Kernel Developer's Guide we have:
"Non-interleaving mode can always be used. The BIOS must assign the largest
DIMM chip-select range to the lowest address. As addresses increase, the
chip select size must remain constant or decrease. This is necessary to
keep DIMM chip select banks on aligned address boundaries as chipselect
banks with different depths are added. The masking does not work
otherwise."
So the order of the DIMMs in the sockets is not always relevant to the
address assignment, and I'd guess that one of your 512MB DIMMs is bad,
though I'm not clear on what happens if you have an odd number of equal
sized chip selects, i.e. if your 256MB DIMMs are single sided and the
512MBs are double sided.
Have you checked if the BIOS can run the memory in 64-bit "mode", i.e.
instead of 128-bit mode? It might help pin things down closer. Also note
that memtest86+ (www.memtest.org) claims to support ECC polling and ECC
status - I'd give it a go, to see what it makes of things.
Final thought: in light of your observation of "there is something else
going on too" in another post, have you checked the caps for bulging or leaking?
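The largest-range-first rule quoted above can be sketched as a small Python model (not from the thread). It is a deliberate simplification: each DIMM pair is treated as one address range, whereas real chip-select banks are per rank, and the names `pairs`, `assign_ranges` and `locate` are hypothetical. The x16 scaling of the reported address is also just one of the guesses made in the thread.

```python
MIB = 1 << 20

# Hypothetical, simplified model: each DIMM pair is one address range
# (real chip-select banks are per rank, which this ignores).
pairs = [("2x256MB pair", 512 * MIB), ("2x512MB pair", 1024 * MIB)]

def assign_ranges(pairs):
    """Largest range gets the lowest base address, per the quoted rule."""
    ranges, base = [], 0
    for name, size in sorted(pairs, key=lambda p: p[1], reverse=True):
        ranges.append((name, base, base + size))
        base += size
    return ranges

def locate(addr, ranges):
    """Return the name of the range containing addr, or None."""
    for name, lo, hi in ranges:
        if lo <= addr < hi:
            return name
    return None

ranges = assign_ranges(pairs)
print(locate(0x000A7DA0 * 16, ranges))
```

In this model a low address maps to the larger pair regardless of which sockets it occupies, which is the reasoning behind the guess that a 512MB DIMM is at fault.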
No.29 | | 2090 bytes |
| 
On Sat, 25 Feb 2006 16:35:40 -0500, Keith <krw@att.bizzzz> wrote:
>>or check bit.
>
>Makes sense. I was thinking about a "syndrome" bit per byte (perhaps the
>check bits themselves).
Google "modified Hamming code".