Hardware

  • Single-bit corrected errors



    I've been moving disk drives (DVD and HDD) around and upgrading the OS
    (SuSE 9.0 to 10.0) over the past couple of days. In the process
    I've been playing with the BIOS to get the drives ordered the way I
    wanted (and figure out who listens to BIOS drive ordering ;).
    While trying to figure out some strangeness I cleaned out the BIOS
    messages area. Last night I kept getting beeps (about one an
    hour). It took a while to isolate it, but I finally looked at the
    hardware error report in the setup screens. Seems I've been
    getting a corrected single-bit error in the same location on
    average about once an hour. I guess ECC actually does something.
    ;-)
    Now to figure out which stick is faulty (it's all Crucial, with
    a lifetime warranty ;-). If everything is right in the world the
    error report _should_ point me to it. All I need is the Rosetta
    stone.
  • No.1 | | 1386 bytes | |

    Keith wrote:
    I've been moving disk drives (DVD and HDD) around and upgrading the OS
    (SuSE 9.0 to 10.0) over the past couple of days. In the process
    I've been playing with the BIOS to get the drives ordered the way I
    wanted (and figure out who listens to BIOS drive ordering ;).
    While trying to figure out some strangeness I cleaned out the BIOS
    messages area. Last night I kept getting beeps (about one an
    hour). It took a while to isolate it, but finally looked at the
    hardware error report in the setup screens. Seems I've been
    getting a corrected single-bit error in the same location on
    average about once an hour. I guess ECC actually does something.
    ;-)

    now to figure out which stick is faulty (it's all Crucial, with
    a lifetime warranty ;-). If everything is right in the world the
    error report _should_ point me to it. All I need is the Rosetta
    stone.

    Keith,
    If you only have 1 bank, how hard this is to diagnose will depend
    on your controller. If you have 2 banks (non-interleaved),
    the address should tell you which bank it's in. Swapping one
    module from the bad bank into a good one should tell you which
    module it is by seeing the address move higher or lower.
    Not knowing the architecture specifics, this is the best
    advice I can give.
    Hope this helps,
    Mark Whitlock
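
    The bank lookup described above can be sketched in Python. This is
    only an illustration: it assumes the banks are mapped back to back
    with no interleaving, and the bank sizes and the bank_for_address
    helper are hypothetical, not taken from any real memory controller.

```python
MIB = 1024 * 1024

def bank_for_address(addr, bank_sizes):
    """Return the index of the bank holding byte address `addr`, or None.

    Assumes banks are laid out contiguously, lowest bank first
    (non-interleaved). Sizes are in bytes.
    """
    base = 0
    for index, size in enumerate(bank_sizes):
        if base <= addr < base + size:
            return index
        base += size
    return None  # address is beyond installed memory

# Illustrative layout: two banks of 512 MiB each.
banks = [512 * MIB, 512 * MIB]
print(bank_for_address(100 * MIB, banks))   # falls in the first bank
print(bank_for_address(600 * MIB, banks))   # falls in the second bank
```

    If a module is moved between banks and the failing address follows
    it into the other bank's range, that module is the likely culprit.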
  • No.2 | | 2332 bytes | |

    In article <b%%Kf.1$Aj3.132@news.abs.net>, markw@abs.net says
    Keith wrote:
    I've been moving disk drives (DVD and HDD) around and upgrading the OS
    (SuSE 9.0 to 10.0) over the past couple of days. In the process
    I've been playing with the BIOS to get the drives ordered the way I
    wanted (and figure out who listens to BIOS drive ordering ;).
    While trying to figure out some strangeness I cleaned out the BIOS
    messages area. Last night I kept getting beeps (about one an
    hour). It took a while to isolate it, but finally looked at the
    hardware error report in the setup screens. Seems I've been
    getting a corrected single-bit error in the same location on
    average about once an hour. I guess ECC actually does something.
    ;-)

    now to figure out which stick is faulty (it's all Crucial, with
    a lifetime warranty ;-). If everything is right in the world the
    error report _should_ point me to it. All I need is the Rosetta
    stone.

    Keith,
    If you only have 1 bank, how hard this is to diagnose will depend
    on your controller. If you have 2 banks (non-interleaved),
    the address should tell you which bank it's in. Swapping one
    module from the bad bank into a good one should tell you which
    module it is by seeing the address move higher or lower.
    Not knowing the architecture specifics, this is the best
    advice I can give.

    Thanks, but I was more or less just telling the regular .chipsters
    that there is some evidence that ECC actually does something. ;-)
    I've always had that question (even when qualifying boards in a
    former life); how do you know ECC is really doing something
    useful?

    My board is a Tyan S2875S ( Socket 940) with 2x2 banks,
    unfortunately with mixed size DIMMs (2x256MB and 2x512MB) so I
    can't just swap one DIMM high to low. Any crossing I do is
    symmetrical, though I suppose swapping one set of DIMMs will at
    least change the error signature, or not. That would at least
    isolate the problem to the pair.

    I've sent an email off to Tyan support, which hit an auto-
    responder. I have to jump through their FAQ hoops (like: Why does
    my system no longer automatically launch into the OS once I have
    updated my BIOS from v3.01 to v3.02?) to get to a real person.
  • No.3 | | 502 bytes | |

    Yes, ECC really does something. I've been going thru servers and
    upgrading the memory. In the process, I find servers that have memory
    ECC fails that have been on line and running for two years. But they
    have been performing perfectly.

    I don't leave the failing memory installed but replace it as a matter
    of practice. But it's nice to know your insurance package really works.

    (Hard drives would never work were it not for ECC).

    Tom S.

  • No.4 | | 1085 bytes | |

    In article <1140759300.161807.202300@i40g2000cwc.googlegroups. com>,
    gtstephenson@gmail.com says
    Yes, ECC really does something. I've been going thru servers and
    upgrading the memory. In the process, I find servers that have memory
    ECC fails that have been on line and running for two years. But they
    have been performing perfectly.

    I know it "does something", if it's actually implemented. ;-) My
    problem was finding out *if* it's implemented and if, how well.
    I'd never had an example of failing memory to test it with before.
    Now that I can test it, I can't find it (Tyan has *tried* to be
    helpful, but so far).

    I don't leave the failing memory installed but replace it as a matter
    of practice. But it's nice to know your insurance package really works.

    Indeed.

    (Hard drives would never work were it not for ECC).

    I know the theory behind ECC and certainly know it works. I also
    know motherboard makers have a way of skirting around the truth
    (remember the fake caches?).
  • No.5 | | 513 bytes | |

    Keith <krw@att.bizzzz> wrote in part:
    I know it "does something", if it's actually implemented. ;-)
    My problem was finding out *if* it's implemented and if,
    how well. I'd never had an example of failing memory to
    test it with before. Now that I can test it, I can't find it
    (Tyan has *tried* to be helpful, but so far).

    See if memtest-86 will provoke errors faster. I suspect one
    cell is "leaking" and shows up on slower refresh cycles.
    -- Robert

  • No.6 | | 905 bytes | |

    Sat, 25 Feb 2006 00:56:53 +0000, Robert Redelmeier wrote:

    Keith <krw@att.bizzzzwrote in part:
    >I know it "does something", if it's actually implemented. ;-)
    >My problem was finding out *if* it's implemented and if,
    >how well. I'd never had an example of failing memory to
    >test it with before. Now that I can test it, I can't find it
    >(Tyan has *tried* to be helpful, but so far).
    >

    See if memtest-86 will provoke errors faster. I suspect one
    cell is "leaking" and show up on slower refresh cycles.

    I think that's what Tyan was trying to get me to run, but couldn't answer
    the obvious questions; where and Linux? Sheesh, the BIOS setup report
    gives me a location and a data value (always the same). You would think
    *they* could decode this down to the DIMM! I guess some things are too
    hard.
  • No.7 | | 934 bytes | |

    Fri, 24 Feb 2006 21:07:25 -0500, Keith <krw@att.bizzzzwrote:

    Sat, 25 Feb 2006 00:56:53 +0000, Robert Redelmeier wrote:
    >
    >Keith <krw@att.bizzzzwrote in part:

    I know it "does something", if it's actually implemented. ;-)
    My problem was finding out *if* it's implemented and if,
    how well. I'd never had an example of failing memory to
    test it with before. Now that I can test it, I can't find it
    (Tyan has *tried* to be helpful, but so far).
    >>

    >See if memtest-86 will provoke errors faster. I suspect one
    >cell is "leaking" and show up on slower refresh cycles.
    >
    >I think that's what Tyan was trying to get me to run, but couldn't answer
    >the obvious questions; where and Linux?


    http://www.memtest86.com/#download0

  • No.8 | | 1060 bytes | |

    Sat, 25 Feb 2006 00:56:53 GMT, Robert Redelmeier <redelm@ev1.net.invalid>
    wrote:

    >Keith <krw@att.bizzzzwrote in part:
    >I know it "does something", if it's actually implemented. ;-)
    >My problem was finding out *if* it's implemented and if,
    >how well. I'd never had an example of failing memory to
    >test it with before. Now that I can test it, I can't find it
    >(Tyan has *tried* to be helpful, but so far).
    >
    >See if memtest-86 will provoke errors faster. I suspect one
    >cell is "leaking" and show up on slower refresh cycles.
    >

    Robert

    Control over refresh rates was dropped from memtest-86 some time ago, if that
    is what you're referring to.

    In any case, any conventional memory test technique will fail miserably to
    detect cells with nascent leakage issues, as simply reading the row that the
    cell resides in will refresh that cell, and you know how memory tests like to
    hammer on memory
  • No.9 | | 574 bytes | |

    daytripper <day_trippr@removeyahoo.com> wrote in part:
    In any case, any conventional memory test technique will fail
    miserably to detect cells with nascent leakage issues, as simply
    reading the row that the cell resides in will refresh that cell,
    and you know how memory tests like to hammer on memory

    Sure but I think the timescales are different. Memory
    loses refresh fairly quickly, 500 ms IIRC. memtest-86 is
    slow and takes multiple seconds for each pass. I think
    one of the tests is write all, then read all.
    -- Robert

  • No.10 | | 1187 bytes | |

    I used to write firmware for a bus-powered USB device with 2MB of SDRAM.
    We had a problem: we could not reliably distinguish a firmware
    reload over a live copy from an unplug/replug (power cycle). The data was
    retained in SDRAM for a few seconds; it took about 10-15 sec to guarantee
    that the data was gone. I suspect that's because the chip was not powered.
    In a powered module, the data would be gone sooner without refresh.

    "Robert Redelmeier" <redelm@ev1.net.invalidwrote in message
    news:IsQLf.16326$@newssvr12.news.prodigy.com
    daytripper <day_trippr@removeyahoo.comwrote in part:
    >In any case, any conventional memory test technique will fail
    >miserably to detect cells with nascent leakage issues, as simply
    >reading the row that the cell resides in will refresh that cell,
    >and you know how memory tests like to hammer on memory
    >

    Sure but I think the timescales are different. Memory
    loses refresh fairly quickly, 500 ms IIRC. memtest-86 is
    slow and takes multiple seconds for each pass. I think
    one of the tests is write all, then read all.

    -- Robert

  • No.11 | | 1430 bytes | |

    Sat, 25 Feb 2006 03:48:56 GMT, Robert Redelmeier <redelm@ev1.net.invalid>
    wrote:

    >daytripper <day_trippr@removeyahoo.comwrote in part:
    >In any case, any conventional memory test technique will fail
    >miserably to detect cells with nascent leakage issues, as simply
    >reading the row that the cell resides in will refresh that cell,
    >and you know how memory tests like to hammer on memory
    >
    >Sure but I think the timescales are different. Memory
    >loses refresh fairly quickly, 500 ms IIRC. memtest-86 is
    >slow and takes multiple seconds for each pass. I think
    >one of the tests is write all, then read all.
    >

    Robert

    A specification will never tell you how long an sdram might retain data sans
    refresh (in all its forms). What it will tell you is the *minimum* amount of
    time the sdram is guaranteed to retain data, under the listed test conditions
    - which usually includes high temperature which aggravates cell leakage.

    So unless your system is soaking in temps that drive the die temps to 85C,
    that value won't be anywhere near "real life". It is a value that guarantees
    the vendor won't get late night phone calls from irate customers, so it (like
    virtually every spec you'll ever see in a commodity market) is spec'd
    conservatively to begin with
  • No.12 | | 1445 bytes | |

    Fri, 24 Feb 2006 21:07:25 -0500, Keith <krw@att.bizzzzput finger
    to keyboard and composed:

    Sat, 25 Feb 2006 00:56:53 +0000, Robert Redelmeier wrote:
    >
    >Keith <krw@att.bizzzzwrote in part:

    I know it "does something", if it's actually implemented. ;-)
    My problem was finding out *if* it's implemented and if,
    how well. I'd never had an example of failing memory to
    test it with before. Now that I can test it, I can't find it
    (Tyan has *tried* to be helpful, but so far).
    >>

    >See if memtest-86 will provoke errors faster. I suspect one
    >cell is "leaking" and show up on slower refresh cycles.
    >
    >I think that's what Tyan was trying to get me to run, but couldn't answer
    >the obvious questions; where and Linux? Sheesh the BIS setup report
    >gives me a location and a data (always the same).


    What is the BIOS telling you? Does it identify the failing data bit or
    check bit?

    You would think
    >*they* could decode this down to the DIMM! I guess some things are too
    >hard


    If you could find some way to non-destructively create a double bit
    error, then that should locate your faulty module. For example, you
    could place tape over a "good" data bit on the edge connector.
    - Franc Zabkar
  • No.13 | | 2080 bytes | |

    Sat, 25 Feb 2006 19:30:41 +1100, Franc Zabkar wrote:

    Fri, 24 Feb 2006 21:07:25 -0500, Keith <krw@att.bizzzzput finger
    to keyboard and composed:
    >
    >Sat, 25 Feb 2006 00:56:53 +0000, Robert Redelmeier wrote:
    >>

    Keith <krw@att.bizzzzwrote in part:
    I know it "does something", if it's actually implemented. ;-)
    My problem was finding out *if* it's implemented and if,
    how well. I'd never had an example of failing memory to
    test it with before. Now that I can test it, I can't find it
    (Tyan has *tried* to be helpful, but so far).

    See if memtest-86 will provoke errors faster. I suspect one
    cell is "leaking" and show up on slower refresh cycles.
    >>
    >>I think that's what Tyan was trying to get me to run, but couldn't answer
    >>the obvious questions; where and Linux? Sheesh the BIS setup report
    >>gives me a location and a data (always the same).

    >

    What is the BIS telling you? Does it identify the failing data bit or
    check bit?

    I don't have the exact message (hard to cut-n-paste from the BIOS
    ;) but it's an ECC correction; A7DA0-00D1
    >
    >ould think

    they* could decode this down to the DIMM! I guess some things are too
    >>hard

    >

    If you could find some way to non-destructively create a double bit
    error, then that should locate your faulty module. For example, you
    could place tape over a "good" data bit on the edge connector.

    I guess I could do something like that. Ugly. It's happening often enough
    now that maybe I'll run with a single bank/DIMM for a while. Hmm, I
    wonder if it'll run with three DIMMs. I could pull out the high DIMMs to
    isolate it to high/low, then one of them to get to the DIMM. I really
    hate pulling DIMMs with the board in the case though.
  • No.14 | | 739 bytes | |

    Sat, 25 Feb 2006 00:56:53 +0000, Robert Redelmeier wrote:

    Keith <krw@att.bizzzzwrote in part:
    >I know it "does something", if it's actually implemented. ;-)
    >My problem was finding out *if* it's implemented and if,
    >how well. I'd never had an example of failing memory to
    >test it with before. Now that I can test it, I can't find it
    >(Tyan has *tried* to be helpful, but so far).
    >

    See if memtest-86 will provoke errors faster. I suspect one
    cell is "leaking" and show up on slower refresh cycles.

    Tyan tech support suggests memtest86 (same thing?). I'm not sure what
    that's going to do though. The errors are corrected.
  • No.15 | | 379 bytes | |

    Keith <krw@att.bizzzz> wrote in part:
    Tyan tech support suggests memtest86 (same thing?).

    I think there's a memtest-86+ that is the successor project.

    I'm not sure what that's going to do though.
    The errors are corrected.

    Sure, but they'll show in the logs! Run for one hour,
    reboot and check the BIOS.
    -- Robert

  • No.16 | | 753 bytes | |

    Keith <krw@att.bizzzz> wrote in part:
    I don't have the exact message (hard to cut-n-paste from the BIOS ;)
    but it's an ECC correction; A7DA0-00D1

    That might be Address 7DA000D1 -- just shy of 2 Gig.
    more likely address 00D17DA0 -- a bit over 13 megs
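
    The two readings above are easy to check. The snippet below is just
    arithmetic on the hex strings; the describe helper is a made-up name
    and nothing here says which reading the BIOS actually intends.

```python
def describe(hex_text):
    """Convert a hex string to (bytes, MiB, GiB)."""
    value = int(hex_text, 16)
    return value, value / 2**20, value / 2**30

# The two candidate ways of gluing "A7DA0-00D1" into one address.
for candidate in ("7DA000D1", "00D17DA0"):
    value, mib, gib = describe(candidate)
    print(f"{candidate}: {value} bytes = {mib:.2f} MiB = {gib:.3f} GiB")
```

    The first reading lands a little under 2 GiB and the second a little
    over 13 MiB, matching the estimates above.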

    I really hate pulling DIMMs with the board in the case though.

    What's the grief with pulling'em? Pushing those eject levers
    isn't that tough, and they're usually located close to the
    mobo mounts. I can't imagine you not putting them in!

    Inserting DIMMS takes a bit more force and the mobo might
    flex more than I'd like, so I support it with one hand from
    the edge which is often nearby.
    -- Robert

  • No.17 | | 2028 bytes | |

    Sat, 25 Feb 2006 17:08:16 +0000, Robert Redelmeier wrote:

    Keith <krw@att.bizzzzwrote in part:
    >I don't have the exact message (hard to cut-n-paste from BIS ;)
    >but it's an ECC correction; A7DA0-00D1
    >

    That might be Address 7DA000D1 -- just shy of 2 Gig.

    Nah, only 1.5GB installed. ;-)

    more likely address 00D17DA0 -- a bit over 13 megs

    Here is the complete message:

    02/25/06 10:19:54
    Single Bit ECC Memory Error
    Address and Syndrome
    000A7DA0 - 00D1

    If I assume this is a 32bit address (why would it be 32b?) I could infer
    that it's a little over 1/2 MB. If I add three zeros on the right (8
    bytes per DIMM) I get about 5MB. It could be a cache line (add another
    five bits)
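
    A quick sketch of that arithmetic, assuming the reported field counts
    units of 2**shift bytes. The shift values tried here (0 for a plain
    byte address, 3 for 8-byte units) are guesses from the paragraph
    above, and scaled_mib is a hypothetical helper; a cache-line reading
    would just be a larger shift.

```python
REPORTED = 0x000A7DA0  # address field from the BIOS message

def scaled_mib(reported, shift):
    """Treat `reported` as a count of 2**shift-byte units; return MiB."""
    return (reported << shift) / 2**20

for shift in (0, 3):
    print(f"shift {shift}: {scaled_mib(REPORTED, shift):.2f} MiB")
```

    That gives roughly 0.66 MiB as a plain byte address and roughly
    5.25 MiB with three zero bits appended.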

    >I really hate pulling DIMMs with the board in the case though.
    >

    What's the grief with pulling'em? Pushing those eject levers isn't that
    tough, and they're usually locate close to the mobo mounts. I can't
    imagine you not putting them in!

    It's not pulling them that's the problem. I'm not looking forward to too
    many insertion cycles though. ;-)

    Inserting DIMMS takes a bit more force and the mobo might flex more than
    I'd like, so I support it with one hand from the edge which is often
    nearby.

    There is a screw sorta in the middle of the DIMMs, so maybe I'll give it
    a try.

    BTW, there is something else going on too. It's beeping more often than
    there are memory errors. The last one was at 10:00 this morning and
    it's now 1:30PM. It beeps about every ten minutes. I just did a cruise
    through the BIOS settings and didn't see anything strange.

    Another strange thing since I've gone to SuSE 10.0: the computer no
    longer shuts down by itself. It halts, but doesn't power off. And
    dual head no longer works. I'm beginning to hate "upgrades".
  • No.18 | | 772 bytes | |

    Sat, 25 Feb 2006 16:55:57 +0000, Robert Redelmeier wrote:

    Keith <krw@att.bizzzzwrote in part:
    >Tyan tech support suggests memtest8 (same thing?).
    >

    I think there's a memtest-86+ that is the successor project.
    >
    >I'm not sure what that's going to do though.
    >The errors are corrected.
    >

    Sure, but they'll show in the logs! Run for one hour,
    reboot and check BIS.

    But I *already* have memory errors in the logs. I don't have to
    stress the system to get them. I'm just trying to find the culprit so
    I can get it replaced. I can see the utility in a stress test after, but
    don't see what information I can gain now.
  • No.19 | | 1604 bytes | |

    Sat, 25 Feb 2006 13:57:02 -0500, Keith <krw@att.bizzzzwrote:

    Sat, 25 Feb 2006 17:08:16 +0000, Robert Redelmeier wrote:
    >
    >Keith <krw@att.bizzzzwrote in part:

    I don't have the exact message (hard to cut-n-paste from BIS ;)
    but it's an ECC correction; A7DA0-00D1
    >>

    >That might be Address 7DA000D1 -- just shy of 2 Gig.
    >
    >Nah, only 1.5GB installed. ;-)
    >
    >more likely address 00D17DA0 -- a bit over 13 megs
    >
    >Here is the complete message:
    >

    02/25/06 10:19:54
    Single Bit ECC Memory Error
    Address and Syndrome
    000A7DA0 - 00D1
    >
    >If I assume this is a 32bit address (why would it be 32b) I could infer
    >that it's a little over 1/2 MB. If I add three zeros on the right (8
    >bytes per DIMM) I get about 5MB. It could be a cache line (add another
    >five bits)
    >


    [snipped]

    If you look at that "syndrome", half of it is zero, the other has an 8-bit
    value which probably would tell you which bit within a 72-bit codeword is the
    offender if you had the magic decoder ring for the memory layout. I bet the
    syndrome actually covers ecc codewords from the two dimms within a pair.
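
    To make the "half of it is zero" observation concrete, here is a
    small sketch that splits the 16-bit field into two bytes and counts
    the set bits in each. Treating the field as two per-DIMM bytes is
    the guess made above, not a documented layout, and the helper names
    are invented for illustration.

```python
def split_syndrome(word):
    """Split a 16-bit syndrome field into (high byte, low byte)."""
    return (word >> 8) & 0xFF, word & 0xFF

def bits_set(byte):
    """Count the 1 bits in a byte."""
    return bin(byte).count("1")

high, low = split_syndrome(0x00D1)
print(f"high={high:02X} ({bits_set(high)} bits set), "
      f"low={low:02X} ({bits_set(low)} bits set)")
print(f"D1 in binary: {low:08b}")
```

    If swapping the DIMMs in the pair turned the field into D100, the
    same split would show the non-zero byte moving to the high half.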

    It would be interesting (if not particularly revealing, I suppose) to see if
    after swapping the two dimms within the pair (iirc, you have two, unequal size
    pairs of dimms) if the zeroes and "D1" swap position

    /daytripper
  • No.20 | | 2055 bytes | |

    Sat, 25 Feb 2006 15:10:54 -0500, daytripper wrote:

    Sat, 25 Feb 2006 13:57:02 -0500, Keith <krw@att.bizzzzwrote:
    >
    >Sat, 25 Feb 2006 17:08:16 +0000, Robert Redelmeier wrote:
    >>

    Keith <krw@att.bizzzzwrote in part:
    I don't have the exact message (hard to cut-n-paste from BIS ;)
    but it's an ECC correction; A7DA0-00D1

    That might be Address 7DA000D1 -- just shy of 2 Gig.
    >>
    >>Nah, only 1.5GB installed. ;-)
    >>

    more likely address 00D17DA0 -- a bit over 13 megs
    >>
    >>Here is the complete message:
    >>

    >02/25/06 10:19:54
    >Single Bit ECC Memory Error
    >Address and Syndrome
    >000A7DA0 - 00D1
    >>
    >>If I assume this is a 32bit address (why would it be 32b) I could infer
    >>that it's a little over 1/2 MB. If I add three zeros on the right (8
    >>bytes per DIMM) I get about 5MB. It could be a cache line (add another
    >>five bits)
    >>

    >

    [snipped]

    If you look at that "syndrome", half of it is zero, the other has an 8-bit
    value which probably would tell you which bit within a 72-bit codeword is the
    offender if you had the magic decoder ring for the memory layout. I bet the
    syndrome actually covers ecc codewords from the two dimms within a pair.

    But why four bits set on one bank?

    It would be interesting (if not particularly revealing, I suppose) to see if
    after swapping the two dimms within the pair (iirc, you have two, unequal size
    pairs of dimms) if the zeroes and "D1" swap position

    Yes, I have four DIMMs, 2x256MB in the first two sockets and 2x512MB in
    the second pair. If I have some more time I'll pull the 512MB pair and
    see what happens. I wonder if the BIOS is smart enough to run with three
    DIMMs?
  • No.21 | | 1679 bytes | |

    Sat, 25 Feb 2006 13:57:02 -0500, Keith <krw@att.bizzzzput finger
    to keyboard and composed:

    Sat, 25 Feb 2006 17:08:16 +0000, Robert Redelmeier wrote:
    >
    >Keith <krw@att.bizzzzwrote in part:

    I don't have the exact message (hard to cut-n-paste from BIS ;)
    but it's an ECC correction; A7DA0-00D1
    >>

    >That might be Address 7DA000D1 -- just shy of 2 Gig.
    >
    >Nah, only 1.5GB installed. ;-)
    >
    >more likely address 00D17DA0 -- a bit over 13 megs
    >
    >Here is the complete message:
    >

    02/25/06 10:19:54
    Single Bit ECC Memory Error
    Address and Syndrome
    000A7DA0 - 00D1
    >
    >If I assume this is a 32bit address (why would it be 32b) I could infer
    >that it's a little over 1/2 MB. If I add three zeros on the right (8
    >bytes per DIMM) I get about 5MB. It could be a cache line (add another
    >five bits)


    My understanding is that each ECC DIMM is 72 bits wide, ie 64 data
    bits + 8 check bits. Your "syndrome" consists of 16 bits which could
    represent a pair of modules. I suspect that if you switch the modules
    the syndrome pattern would become D100. Also, my understanding is
    that the memory controller accesses memory as 128 bits, so the memory address might
    be 000A7DA00 bytes (ie reported address x 16). This works out at
    approximately 10.5MB, which is still within the first two modules. If
    I'm right, then switching the modules should not change the address of
    the error. If you're right, then it will.
    - Franc Zabkar
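
    The x16 reading can be checked the same way. This sketch assumes the
    reported field counts 16-byte (128-bit) units and that the 2x256MB
    pair occupies the lowest addresses; both are assumptions from this
    thread, not documented behaviour, and byte_address is a made-up name.

```python
MIB = 2**20
REPORTED = 0x000A7DA0               # address field from the BIOS message
FIRST_PAIR_BYTES = 2 * 256 * MIB    # 2x256MB pair, assumed mapped lowest

def byte_address(reported, bytes_per_unit=16):
    """Scale the reported unit count to a byte address."""
    return reported * bytes_per_unit

addr = byte_address(REPORTED)
print(hex(addr), f"{addr / MIB:.2f} MiB")
print("inside first pair:", addr < FIRST_PAIR_BYTES)
```

    Under those assumptions the address comes out near 10.5 MiB, well
    inside the first pair.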
  • No.22 | | 1713 bytes | |

    Sat, 25 Feb 2006 15:18:34 -0500, Keith <krw@att.bizzzzput finger
    to keyboard and composed:

    Sat, 25 Feb 2006 15:10:54 -0500, daytripper wrote:
    >
    >Sat, 25 Feb 2006 13:57:02 -0500, Keith <krw@att.bizzzzwrote:
    >>

    Sat, 25 Feb 2006 17:08:16 +0000, Robert Redelmeier wrote:

    Keith <krw@att.bizzzzwrote in part:
    I don't have the exact message (hard to cut-n-paste from BIS ;)
    but it's an ECC correction; A7DA0-00D1

    That might be Address 7DA000D1 -- just shy of 2 Gig.

    Nah, only 1.5GB installed. ;-)

    more likely address 00D17DA0 -- a bit over 13 megs

    Here is the complete message:

    02/25/06 10:19:54
    Single Bit ECC Memory Error
    Address and Syndrome
    000A7DA0 - 00D1

    If I assume this is a 32bit address (why would it be 32b) I could infer
    that it's a little over 1/2 MB. If I add three zeros on the right (8
    bytes per DIMM) I get about 5MB. It could be a cache line (add another
    five bits)

    >>

    >[snipped]
    >>

    >If you look at that "syndrome", half of it is zero, the other has an 8-bit
    >value which probably would tell you which bit within a 72-bit codeword is the
    >offender if you had the magic decoder ring for the memory layout. I bet the
    >syndrome actually covers ecc codewords from the two dimms within a pair.
    >
    >, but why four bits set on one bank?


    Because they are syndrome bits, not data bits. A syndrome pattern of
    D1 codes for one particular data bit.
    - Franc Zabkar
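
    A toy example of a syndrome naming a single bit. This uses a textbook
    Hamming(7,4) code, where the syndrome is the XOR of the positions of
    all set bits; it is NOT the code the memory controller on this board
    uses, and it says nothing about what D1 maps to there. The function
    names are invented for illustration.

```python
from functools import reduce

def syndrome(bits):
    """XOR together the 1-based positions of every set bit."""
    positions = (i for i, b in enumerate(bits, start=1) if b)
    return reduce(lambda acc, pos: acc ^ pos, positions, 0)

def encode(data_bits, length):
    """Put data in non-power-of-two positions, then set the parity
    bits (positions 1, 2, 4, ...) so the syndrome becomes zero."""
    word = [0] * length
    data = iter(data_bits)
    for pos in range(1, length + 1):
        if pos & (pos - 1):              # not a power of two: data slot
            word[pos - 1] = next(data)
    s = syndrome(word)
    for k in range(length.bit_length()):
        if s & (1 << k):
            word[(1 << k) - 1] = 1       # parity bit at position 2**k
    return word

word = encode([1, 0, 1, 1], 7)
print(syndrome(word))    # 0: valid codeword

word[4] ^= 1             # flip position 5 (a data bit)
print(syndrome(word))    # 5: syndrome names the flipped position
word[4] ^= 1             # undo

word[1] ^= 1             # flip position 2 (a check bit)
print(syndrome(word))    # 2: check-bit errors are located too
```

    A syndrome with several bits set still identifies one failing bit,
    which may be either a data bit or a check bit.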
  • No.23 | | 1820 bytes | |

    Sun, 26 Feb 2006 07:43:51 +1100, Franc Zabkar
    <fzabkar@iinternode.on.netput finger to keyboard and composed:

    Sat, 25 Feb 2006 15:18:34 -0500, Keith <krw@att.bizzzzput finger
    >to keyboard and composed:
    >
    >Sat, 25 Feb 2006 15:10:54 -0500, daytripper wrote:
    >>

    Sat, 25 Feb 2006 13:57:02 -0500, Keith <krw@att.bizzzzwrote:

    Sat, 25 Feb 2006 17:08:16 +0000, Robert Redelmeier wrote:

    Keith <krw@att.bizzzzwrote in part:
    I don't have the exact message (hard to cut-n-paste from BIS ;)
    but it's an ECC correction; A7DA0-00D1

    That might be Address 7DA000D1 -- just shy of 2 Gig.

    Nah, only 1.5GB installed. ;-)

    more likely address 00D17DA0 -- a bit over 13 megs

    Here is the complete message:

    02/25/06 10:19:54
    Single Bit ECC Memory Error
    Address and Syndrome
    000A7DA0 - 00D1

    If I assume this is a 32bit address (why would it be 32b) I could infer
    that it's a little over 1/2 MB. If I add three zeros on the right (8
    bytes per DIMM) I get about 5MB. It could be a cache line (add another
    five bits)

    [snipped]

    If you look at that "syndrome", half of it is zero, the other has an 8-bit
    value which probably would tell you which bit within a 72-bit codeword is the
    offender if you had the magic decoder ring for the memory layout. I bet the
    syndrome actually covers ecc codewords from the two dimms within a pair.
    >>
    >>, but why four bits set on one bank?

    >
    >Because they are syndrome bits, not data bits. A syndrome pattern of
    >D1 codes for one particular data bit.


    or check bit.
    - Franc Zabkar
  • No.24 | | 2195 bytes | |

    Sun, 26 Feb 2006 07:40:10 +1100, Franc Zabkar wrote:

    Sat, 25 Feb 2006 13:57:02 -0500, Keith <krw@att.bizzzzput finger
    to keyboard and composed:
    >
    >Sat, 25 Feb 2006 17:08:16 +0000, Robert Redelmeier wrote:
    >>

    Keith <krw@att.bizzzzwrote in part:
    I don't have the exact message (hard to cut-n-paste from BIS ;)
    but it's an ECC correction; A7DA0-00D1

    That might be Address 7DA000D1 -- just shy of 2 Gig.
    >>
    >>Nah, only 1.5GB installed. ;-)
    >>

    more likely address 00D17DA0 -- a bit over 13 megs
    >>
    >>Here is the complete message:
    >>

    >02/25/06 10:19:54
    >Single Bit ECC Memory Error
    >Address and Syndrome
    >000A7DA0 - 00D1
    >>
    >>If I assume this is a 32bit address (why would it be 32b) I could infer
    >>that it's a little over 1/2 MB. If I add three zeros on the right (8
    >>bytes per DIMM) I get about 5MB. It could be a cache line (add another
    >>five bits)

    >

    My understanding is that each ECC DIMM is 72 bits wide, ie 64 data
    bits + 8 check bits. Your "syndrome" consists of 16 bits which could
    represent a pair of modules. I suspect that if you switch the modules
    the syndrome pattern would become D100. Also, my understanding is
    that the memory controller accesses memory as 128 bits, so the memory address might
    be 000A7DA00 bytes (ie reported address x 16). This works out at
    approximately 10.5MB, which is still within the first two modules. If
    I'm right, then switching the modules should not change the address of
    the error. If you're right, then it will.

    I thought of that possibility as well. There are a lot of ways to encode
    such messages. I'm not too surprised that the error codes aren't
    published (PC documentation sucks), but am quite surprised that Tyan
    doesn't have a clue.

    I guess I'll swap the bottom pair (256MB) and see what happens.
  • No.25 | | 1973 bytes | |

    Sun, 26 Feb 2006 07:46:37 +1100, Franc Zabkar wrote:

    Sun, 26 Feb 2006 07:43:51 +1100, Franc Zabkar
    <fzabkar@iinternode.on.netput finger to keyboard and composed:
    >
    >Sat, 25 Feb 2006 15:18:34 -0500, Keith <krw@att.bizzzzput finger
    >>to keyboard and composed:
    >>

    Sat, 25 Feb 2006 15:10:54 -0500, daytripper wrote:

    Sat, 25 Feb 2006 13:57:02 -0500, Keith <krw@att.bizzzzwrote:

    Sat, 25 Feb 2006 17:08:16 +0000, Robert Redelmeier wrote:

    Keith <krw@att.bizzzzwrote in part:
    I don't have the exact message (hard to cut-n-paste from BIS ;)
    but it's an ECC correction; A7DA0-00D1

    That might be Address 7DA000D1 -- just shy of 2 Gig.

    Nah, only 1.5GB installed. ;-)

    more likely address 00D17DA0 -- a bit over 13 megs

    Here is the complete message:

    02/25/06 10:19:54
    Single Bit ECC Memory Error
    Address and Syndrome
    000A7DA0 - 00D1

    If I assume this is a 32bit address (why would it be 32b) I could infer
    that it's a little over 1/2 MB. If I add three zeros on the right (8
    bytes per DIMM) I get about 5MB. It could be a cache line (add another
    five bits)

    [snipped]

    If you look at that "syndrome", half of it is zero, the other has an 8-bit
    value which probably would tell you which bit within a 72-bit codeword is the
    offender if you had the magic decoder ring for the memory layout. I bet the
    syndrome actually covers ecc codewords from the two dimms within a pair.

    , but why four bits set on one bank?
    >>
    >>Because they are syndrome bits, not data bits. A syndrome pattern of
    >>D1 codes for one particular data bit.

    >

    or check bit.

    Makes sense. I was thinking about a "syndrome" bit per byte (perhaps the
    check bits themselves).
  • No.26 | | 1174 bytes | |

    Fri, 24 Feb 2006 21:39:54 -0500, daytripper <day_trippr@REMVEyahoo.com>
    wrote:

    Fri, 24 Feb 2006 21:07:25 -0500, Keith <krw@att.bizzzz> wrote:
    >
    >Sat, 25 Feb 2006 00:56:53 +0000, Robert Redelmeier wrote:
    >>

    Keith <krw@att.bizzzz> wrote in part:
    I know it "does something", if it's actually implemented. ;-)
    My problem was finding out *if* it's implemented and, if so,
    how well. I'd never had an example of failing memory to
    test it with before. Now that I can test it, I can't find it
    (Tyan has *tried* to be helpful, but so far).

    See if memtest-86 will provoke errors faster. I suspect one
    cell is "leaking" and shows up on slower refresh cycles.
    >>
    >>I think that's what Tyan was trying to get me to run, but couldn't answer
    >>the obvious questions; where and Linux?

    >
    >http://www.memtest86.com/#download0


    I think Chris Brady has pretty much passed the baton to the guys at
    www.memtest.org now - more recent versions.
  • No.27 | | 3967 bytes | |

    Sat, 25 Feb 2006 16:33:42 -0500, Keith <krw@att.bizzzz> wrote:

    Sun, 26 Feb 2006 07:40:10 +1100, Franc Zabkar wrote:
    >
    >Sat, 25 Feb 2006 13:57:02 -0500, Keith <krw@att.bizzzz> put finger
    >to keyboard and composed:
    >>

    Sat, 25 Feb 2006 17:08:16 +0000, Robert Redelmeier wrote:

    Keith <krw@att.bizzzz> wrote in part:
    I don't have the exact message (hard to cut-n-paste from BIOS ;)
    but it's an ECC correction; A7DA0-00D1

    That might be Address 7DA000D1 -- just shy of 2 Gig.

    Nah, only 1.5GB installed. ;-)

    more likely address 00D17DA0 -- a bit over 13 megs

    Here is the complete message:

    02/25/06 10:19:54
    Single Bit ECC Memory Error
    Address and Syndrome
    000A7DA0 - 00D1

    If I assume this is a 32bit address (why would it be 32b) I could infer
    that it's a little over 1/2 MB. If I add three zeros on the right (8
    bytes per DIMM) I get about 5MB. It could be a cache line (add another
    five bits)
    >>

    >My understanding is that each ECC DIMM is 72 bits wide, ie 64 data
    >bits + 8 check bits. Your "syndrome" consists of 16 bits which could
    >represent a pair of modules. I suspect that if you switch the modules
    >the syndrome pattern would become D100. Also, my understanding is
    >that memory is accessed as 128 bits, so the memory address might
    >be 000A7DA00 bytes (ie reported address x 16). This works out at
    >approximately 10.5MB, which is still within the first two modules. If
    >I'm right, then switching the modules should not change the address of
    >the error. If you're right, then it will.
    >
    >I thought of that possibility as well. There are a lot of ways to encode
    >such messages. I'm not too surprised that the error codes aren't
    >published (PC documentation sucks), but am quite surprised that Tyan
    >doesn't have a clue.
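
    The competing readings of the reported address can be compared with a
    few lines of Python. This is only a sketch: the x8 and x16 scale factors
    and the "one syndrome byte per DIMM of a pair" idea are the guesses made
    in this thread, not a documented Tyan or AMD encoding, and the function
    names are made up for illustration.

```python
# Sketch only: the scale factors below are guesses from this thread,
# not a documented Tyan or AMD encoding.
REPORTED_ADDRESS = 0x000A7DA0
REPORTED_SYNDROME = 0x00D1
INSTALLED_MIB = 1536  # 1.5GB installed, per the thread
MIB = 1024 * 1024


def to_mib(byte_address):
    """Convert a byte address to MiB."""
    return byte_address / MIB


def candidate_addresses(reported):
    """Return the byte address implied by each reading discussed above."""
    return {
        "as-is (byte address)": reported,
        "x8 (64-bit words)": reported * 8,
        "x16 (128-bit words)": reported * 16,
    }


def swap_syndrome_halves(syndrome):
    """Swap the two bytes of a 16-bit syndrome (assumed: one byte per DIMM)."""
    return ((syndrome & 0xFF) << 8) | (syndrome >> 8)


for label, address in candidate_addresses(REPORTED_ADDRESS).items():
    inside = to_mib(address) < INSTALLED_MIB
    print(f"{label:22s} 0x{address:08X} {to_mib(address):8.2f} MiB  installed: {inside}")

print(f"syndrome with halves swapped: 0x{swap_syndrome_halves(REPORTED_SYNDROME):04X}")
```

    All three readings fall far below the installed 1.5GB, so the address on
    its own cannot settle which reading is right; swapping the modules and
    watching whether the address or the syndrome moves is the real test.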


    I dunno - with the complications in DIMM addressing of AMD64 CPUs, and some
    Intel chipsets these days, you'd probably have to talk to the guy who wrote
    the BIOS to find the final answer *if* he could even remember.

    >I guess I'll swap the bottom pair (256MB) and see what happens.


    If the AMD docs are correct (for your rev. CPU) and I'm reading them
    right:-P, since you have different sized chip selects (or non-power of two
    number of chip selects), you should be running in non-interleaved mode
    (that's non-interleaved ranks, which AMD calls chip-select banks). Also,
    on pg.85 of the BIOS and Kernel Developer's Guide we have:

    "Non-interleaving mode can always be used. The BIOS must assign the largest
    DIMM chip-select range to the lowest address. As addresses increase, the
    chip select size must remain constant or decrease. This is necessary to
    keep DIMM chip select banks on aligned address boundaries as chipselect
    banks with different depths are added. The masking does not work
    otherwise."

    So the order of the DIMMs in the sockets is not always relevant to the
    address assignment and I'd guess that one of your 512MB DIMMs is bad
    though I'm not clear on what happens if you have an odd number of equal
    sized chip selects, i.e. if your 256MB DIMMs are single sided and the
    512MBs are double sided?
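
    The quoted rule can be sketched as follows. This is an illustration, not
    the BIOS's actual algorithm: it treats each DIMM as a single chip select,
    the socket order [256, 256, 512, 512] is a hypothetical layout that adds
    up to 1.5GB, and the helper names are invented here.

```python
MIB = 1024 * 1024


def assign_ranges(chip_select_sizes_mib):
    """Lay out chip selects largest first, as the quoted rule requires.

    Returns a list of (base, limit, size_mib) tuples with limit exclusive.
    """
    ranges = []
    base = 0
    for size in sorted(chip_select_sizes_mib, reverse=True):
        limit = base + size * MIB
        ranges.append((base, limit, size))
        base = limit
    return ranges


def find_range(ranges, address):
    """Return (index, size_mib) of the range holding address, or None."""
    for index, (base, limit, size) in enumerate(ranges):
        if base <= address < limit:
            return index, size
    return None


# Hypothetical socket order: the two 256MB DIMMs sit in the first sockets.
ranges = assign_ranges([256, 256, 512, 512])
for base, limit, size in ranges:
    print(f"0x{base:08X}-0x{limit - 1:08X}  {size}MB")

# 0xA7DA00 is the "reported address x 16" reading, about 10.5MB.
print(find_range(ranges, 0xA7DA00))
```

    Under these assumptions every reading of the address that stays in the
    low megabytes lands in the first 512MB range, whatever socket that DIMM
    occupies.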

    Have you checked if the BIOS can run the memory in 64-bit "mode", i.e.
    instead of 128-bit mode - might help pin things down closer. Also note
    that memtest86+ (www.memtest.org) claims to support ECC polling and ECC
    status - I'd give it a go, to see what it makes of things.

    Final thought: in light of your observation of "there is something else
    going on too" in another post, have you checked caps for bulging or leaking?
  • No.28 | | 973 bytes | |


    "Keith" <krw@att.bizzzz> wrote:
    Sat, 25 Feb 2006 16:55:57 +0000, Robert Redelmeier wrote:
    >
    >Keith <krw@att.bizzzz> wrote in part:

    Tyan tech support suggests memtest86 (same thing?).
    >>

    >I think there's a memtest-86+ that is the successor project.
    >>

    I'm not sure what that's going to do though.
    The errors are corrected.
    >>

    >Sure, but they'll show in the logs! Run for one hour,
    >reboot and check BIOS.
    >

    OK, but I *already* have memory errors in the logs. I don't have to
    stress the system to get them. I'm just trying to find the culprit so
    I can get it replaced. I can see the utility in a stress test after,
    but don't see what information I can gain now.
  • No.29 | | 2090 bytes | |

    Sat, 25 Feb 2006 16:35:40 -0500, Keith <krw@att.bizzzz> wrote:

    Sun, 26 Feb 2006 07:46:37 +1100, Franc Zabkar wrote:
    >
    >Sun, 26 Feb 2006 07:43:51 +1100, Franc Zabkar
    ><fzabkar@iinternode.on.net> put finger to keyboard and composed:
    >>

    Sat, 25 Feb 2006 15:18:34 -0500, Keith <krw@att.bizzzz> put finger
    to keyboard and composed:

    Sat, 25 Feb 2006 15:10:54 -0500, daytripper wrote:

    Sat, 25 Feb 2006 13:57:02 -0500, Keith <krw@att.bizzzz> wrote:

    Sat, 25 Feb 2006 17:08:16 +0000, Robert Redelmeier wrote:

    Keith <krw@att.bizzzz> wrote in part:
    I don't have the exact message (hard to cut-n-paste from BIOS ;)
    but it's an ECC correction; A7DA0-00D1

    That might be Address 7DA000D1 -- just shy of 2 Gig.

    Nah, only 1.5GB installed. ;-)

    more likely address 00D17DA0 -- a bit over 13 megs

    Here is the complete message:

    02/25/06 10:19:54
    Single Bit ECC Memory Error
    Address and Syndrome
    000A7DA0 - 00D1

    If I assume this is a 32bit address (why would it be 32b) I could infer
    that it's a little over 1/2 MB. If I add three zeros on the right (8
    bytes per DIMM) I get about 5MB. It could be a cache line (add another
    five bits)

    [snipped]

    If you look at that "syndrome", half of it is zero, the other has an 8-bit
    value which probably would tell you which bit within a 72-bit codeword is the
    offender if you had the magic decoder ring for the memory layout. I bet the
    syndrome actually covers ecc codewords from the two dimms within a pair.

    OK, but why four bits set on one bank?

    Because they are syndrome bits, not data bits. A syndrome pattern of
    D1 codes for one particular data bit.
    >>

    >or check bit.
    >
    >Makes sense. I was thinking about a "syndrome" bit per byte (perhaps the
    >check bits themselves).


    Google "modified Hamming code"
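
    For anyone who wants to see the idea without the search, here is a toy
    Hamming(7,4) code in Python. It is far smaller than the 72-bit codewords
    discussed above and its bit layout is the textbook one, not the chipset's
    (which nobody in this thread has documented), but it shows why a syndrome
    with several bits set still points at a single failing bit.

```python
def encode(data_bits):
    """Encode 4 data bits into positions 1..7 (index 0 is unused padding)."""
    c = [0] * 8
    c[3], c[5], c[6], c[7] = data_bits
    c[1] = c[3] ^ c[5] ^ c[7]  # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]  # parity over positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]  # parity over positions with bit 2 set
    return c


def syndrome(c):
    """Return 0 for a clean codeword, else the position of the flipped bit."""
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    return s1 | (s2 << 1) | (s4 << 2)


def correct(c):
    """Fix a single-bit error in place and return the syndrome that was seen."""
    s = syndrome(c)
    if s:
        c[s] ^= 1
    return s


word = encode([1, 0, 1, 1])
word[7] ^= 1  # flip one bit, either data or check
seen = correct(word)
print(f"syndrome {seen:03b} -> bit position {seen}")
```

    Flipping position 7 sets all three syndrome bits even though only one bit
    is wrong: the syndrome is an encoded position, not a per-bit error mask.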
