Networking

  • Bug in long delay networks

    Vincent,
    I am sorry for the long delay in responding to this e-mail. Somehow I
    missed the response on the list. I appended the original message at the
    bottom.
    -----Original Message-----
    From: Vincent Jardin [mailto:vincent.jardin (AT) 6wind (DOT) com]
    Sent: Tuesday, April 11, 2006 3:17 PM
    To: Spagnolo, Phillip A
    Cc: quagga-dev (AT) lists (DOT) quagga.net; Kushi, David M; Henderson, Thomas R
    Subject: Re: [quagga-dev 4082] Bug in long delay networks
    Hi,

    >The solution we found is to simply increase 2 in OSPF_TIMER_ON
    >(ospf->t_maxage, ospf_maxage_lsa_remover, 2) to a reasonable value.
    >Maybe 60 or 600?

    There is no recommendation in the RFC for using 2, 60, or some other
    value. My concern is that the higher the timer, the more entries
    will need to be kept until the remover runs.
    Since I don't think it is possible to guess the best value
    automatically, maybe this value should be configurable from the CLI,
    with the default remaining 2?
    As for the best default value, I don't know the answer. However, the
    same problem does not occur with a Cisco router because it keeps the
    LSAs around for at least a couple hundred seconds.
    A CLI addition would be fine. We could also just put a comment in the
    code and let people change it if needed.
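
The configurable-delay idea discussed above could look roughly like the following. This is only a sketch: the structure, function names, and the upper bound are hypothetical and are not taken from the Quagga sources. Only the default of 2 seconds comes from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Default from the thread: the remover currently runs 2 s after scheduling. */
#define MAXAGE_REMOVE_DELAY_DEFAULT 2
/* Hypothetical upper bound, chosen only for this sketch. */
#define MAXAGE_REMOVE_DELAY_MAX 3600

/* Hypothetical per-instance setting; not a real Quagga structure. */
struct maxage_cfg {
    unsigned int remove_delay; /* seconds before the remover runs */
};

void maxage_cfg_init(struct maxage_cfg *cfg)
{
    assert(cfg != NULL);
    cfg->remove_delay = MAXAGE_REMOVE_DELAY_DEFAULT;
}

/* What a CLI handler might call. Returns false and leaves the
 * setting untouched when the value is out of range. */
bool maxage_cfg_set_delay(struct maxage_cfg *cfg, unsigned int seconds)
{
    assert(cfg != NULL);
    if (seconds > MAXAGE_REMOVE_DELAY_MAX)
        return false;
    cfg->remove_delay = seconds;
    return true;
}
```

The stored value would then be passed in place of the literal 2 when the remover is scheduled, so existing deployments keep today's behavior unless an operator changes it.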

    >Attached is a patch with this fix and a couple of minor related changes
    >with explanations within the code.

    Can you please elaborate on this comment:
    "+ /* This does not seem to be necessary. This LSA was already flooded
    + when it entered the maxage list. This flood is redundant // */"?
    For instance, can you describe a case where it occurs?
    This is the sequence of function calls:
    - ospf_lsa_flush_area()
        - MAXAGE LSA: set LSA to maxage
        - ospf_flood_through_area(): LSA is flooded throughout the area
        - ospf_lsa_maxage()
            OSPF_TIMER_ON (ospf->t_maxage, ospf_maxage_lsa_remover, 2);
            add to maxage list and schedule remover
    - ospf_maxage_lsa_remover()
        - check if the LSA can be removed?
        - ospf_flood_through()
        - ospf_lsa_flush_area(): already flooded above, so there is no
          need to do it again
    Does this make sense? I don't see why an LSA that has already been
    maxaged and flooded needs to be reflooded after it has been checked for
    neighbor state and retransmission count.
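
The call sequence above can be modeled with a toy counter to show where the second flood comes from. This is a simplified sketch, not ospfd code: the structure and function names are hypothetical, and the neighbor-state and retransmission checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAXAGE 3600 /* MaxAge in seconds, per RFC 2328 */

/* Hypothetical stand-in for an LSA; tracks only what the sequence touches. */
struct flush_model {
    unsigned int age;
    unsigned int flood_count; /* how many times the LSA was flooded */
    bool on_maxage_list;
};

/* First half of the sequence: age out, flood once, queue for removal. */
void flush_model_flush(struct flush_model *lsa)
{
    assert(lsa != NULL);
    lsa->age = MODEL_MAXAGE;
    lsa->flood_count++;         /* flood through the area */
    lsa->on_maxage_list = true; /* the remover would be scheduled here */
}

/* Second half: the remover. With reflood set it floods a second time. */
void flush_model_remove(struct flush_model *lsa, bool reflood)
{
    assert(lsa != NULL);
    if (!lsa->on_maxage_list)
        return;
    if (reflood)
        lsa->flood_count++;
    lsa->on_maxage_list = false;
}
```

Calling flush_model_flush() and then flush_model_remove() with reflood set leaves flood_count at 2; with reflood cleared it stays at 1, which is the behavior the patch comment argues for.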
    Sincerely,
    Phil
    Regards,
    Vincent
    Original message:
    All,
    We have found a bug in ospfd for quagga 0.98.5 when it is used in high
    delay networks. I think the problem exists in 0.99.3 because the same
    code is found there.
    The bug exists in ospf_lsa.c. It is found in ospf_lsa_maxage() when
    OSPF_TIMER_ON (ospf->t_maxage, ospf_maxage_lsa_remover, 2) is called to
    schedule removal of the LSA from the database.
    Here is an example:
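    [Topology diagram garbled in the archive. Recoverable structure: node 1
    connects toward nodes 2 and 4; nodes 2, 3, 4 share one segment and
    nodes 5, 6, 7 share another.]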
    Nodes 2,3,4 are connected by a broadcast network.
    Nodes 5,6,7 are connected by a PTMP network.
    Let the delay of the PTMP network be 8 secs.
    If the broadcast network of 2,3,4 is brought down, then nodes 2,3,4
    will generate a Network LSA and then maxage it as all neighbors are
    removed from the link. This is correct (RFC 2328 12.4.2 para 4). The
    problem is that these LSAs reach all nodes in the network and are
    purged from the databases while copies are still in transit in the
    PTMP network (5,6,7). When those copies come out of the PTMP network,
    they are reinstalled and flooded again because they have already been
    purged from the databases. The flooding then repeats this process.
    In short, flooding is maintained for 3600 secs.
    The solution we found is to simply increase 2 in OSPF_TIMER_ON
    (ospf->t_maxage, ospf_maxage_lsa_remover, 2) to a reasonable value.
    Maybe 60 or 600?
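
The timing race described above reduces to comparing the remover delay with the transit delay. The sketch below assumes a router drops the MaxAge entry as soon as the remover fires, which ignores the neighbor-state and retransmission checks real ospfd performs first. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when a copy still in transit arrives after the entry was purged,
 * so the receiver no longer recognizes it and installs it again. */
bool copy_arrives_after_purge(unsigned int remover_delay_s,
                              unsigned int transit_delay_s)
{
    return transit_delay_s > remover_delay_s;
}

/* Returns the first candidate delay that avoids the race,
 * or 0 if none of the candidates does. */
unsigned int first_safe_delay(const unsigned int *candidates, size_t n,
                              unsigned int transit_delay_s)
{
    assert(candidates != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!copy_arrives_after_purge(candidates[i], transit_delay_s))
            return candidates[i];
    }
    return 0;
}
```

With the 8-second PTMP delay from the example, a 2-second remover delay loses the race, while either 60 or 600 seconds would avoid it under this model.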
    Attached is a patch with this fix and a couple of minor related changes
    with explanations within the code.
    Is this the correct fix? Is there any reason not to increase this value?
    Thanks,
    Phil
    Phil Spagnolo
    Network Technology Engineer
    The Boeing Company
    Phone: (425) 865-6723
    Quagga-dev mailing list
    Quagga-dev (AT) lists (DOT) quagga.net
  • No.1 | | 1534 bytes | |

    On Thu, 11 May 2006, Spagnolo, Phillip A wrote:

    > As for the best default value, I don't know the answer. However,
    > the same problem does not occur with a Cisco router because it
    > keeps the LSAs around for at least a couple hundred seconds.

    > This is the sequence of function calls:
    > - ospf_lsa_flush_area()
    >     - MAXAGE LSA: set LSA to maxage
    >     - ospf_flood_through_area(): LSA is flooded throughout the area
    >     - ospf_lsa_maxage()
    >         OSPF_TIMER_ON (ospf->t_maxage, ospf_maxage_lsa_remover, 2);
    >         add to maxage list and schedule remover
    > - ospf_maxage_lsa_remover()
    >     - check if the LSA can be removed?
    >     - ospf_flood_through()
    >     - ospf_lsa_flush_area(): already flooded above, so there is no
    >       need to do it again

    > Does this make sense?

    Not completely ;)

    > I don't see why an LSA that has already been maxaged and flooded
    > needs to be reflooded after it has been checked for neighbor state
    > and retransmission count.

    It's a bit confused, all right. However, there is at least one place
    that expects ospf_lsa_maxage() to result in flooding: premature aging
    of sequence-number-wrapped LSAs. That said, this is something that
    doesn't generally work at the moment anyway. [...] seems to have the
    same expectation.

    So the problem is differing expectations of what ospf_lsa_maxage()
    ought to be doing, really.

    Something like the attached and completely-untested patch (expanding
    slightly on your proposal) might do the trick - what do you think?

    regards,
  • No.2 | | 315 bytes | |

    On Tue, 30 May 2006, Paul Jakma wrote:

    > Something like the attached and completely-untested patch
    > (expanding slightly on your proposal) might do the trick - what do
    > you think?

    Err, add the call to ospf_lsa_maxage() to the bottom of the
    added ospf_lsa_flush() function, obviously.

    regards,
  • No.3 | | 241 bytes | |

    This:
    ;a=commitdiff;;
    Seems to work, so far.
    Though, the ospfd running this patch does not originate a
    network-LSA; all the types it does originate are refreshed properly
    (external, opaque, summary, router).
    regards,
  • No.4 | | 395 bytes | |

    Agree, I do prefer this option.

    Paul Jakma wrote:

    > This:
    > ;a=commitdiff;;
    > Seems to work, so far.
    >
    > Though, the ospfd running this patch does not originate a network-LSA;
    > all the types it does originate are refreshed properly (external,
    > opaque, summary, router).

    regards,

  • No.5 | | 2073 bytes | |

    On Wed, 31 May 2006, Vincent Jardin wrote:

    > Agree, I do prefer this option.

    Well, it's the logical conclusion of Phillip's proposal to remove the
    flood call in the maxage_remover. ;)

    The only problem is, this morning it had crashed. And it crashes
    whenever a non-self-originated LSA changes (it seems). There appears
    to be an ospf_lsa_lock() missing somewhere:

    2006/05/31 06:31:11 OSPF: RXmtL(1)--, NBR(212.17.55.54), LSA[Type1,id(212.17.55.50),ar(212.17.55.50)]
    2006/05/31 06:31:11 OSPF: LSA: freed 0x5c39d0
    2006/05/31 06:31:11 OSPF: LSA[Type1:212.17.55.50]: data freed 0x5c5680
    2006/05/31 06:31:11 OSPF: SPF: calculation timer scheduled
    2006/05/31 06:31:11 OSPF: SPF: calculation timer delay = 200
    2006/05/31 06:31:11 OSPF: LSA[Type1,id(212.17.55.50),ar(212.17.55.50)]: Install router-LSA to Area 0.0.0.0
    2006/05/31 06:31:11 OSPF: LSA[Type1:212.17.55.50]: Install LSA 0x0x5bf370, MaxAge
    OSPF: Received signal 11 at 1149053471 (si_addr 0x0); aborting
    Backtrace for 14 stack frames:
    /usr/lib64/quagga/libzebra.so.0(zlog_backtrace_sigsafe+0x2b)[0x318b12a849]
    /usr/lib64/quagga/libzebra.so.0(zlog_signal+0x23a)[0x318b12ad63]
    /usr/lib64/quagga/libzebra.so.0[0x318b134723]
    /lib64/libc.so.6[0x343fb2f300]
    /usr/lib64/quagga/libospf.so.0(ospf_flood_through_area+0x71)[0x318b339928]
    /usr/lib64/quagga/libospf.so.0(ospf_lsa_flush_area+0x21)[0x318b339fdb]
    /usr/lib64/quagga/libospf.so.0(ospf_lsa_install+0x4c9)[0x318b331684]
    /usr/lib64/quagga/libospf.so.0(ospf_flood+0x1b9)[0x318b339d92]
    /usr/lib64/quagga/libospf.so.0[0x318b32bf18]
    /usr/lib64/quagga/libospf.so.0(ospf_read+0xcb8)[0x318b32d3a0]
    /usr/lib64/quagga/libzebra.so.0(thread_call+0x86)[0x318b120bd3]

    The trace is very consistent.

    And interestingly the reporter on Bug #269 has just added more
    information that strongly suggests his crash must be the same
    problem (So it's an existing and apparently mostly latent bug, i.e.
    *not* specific to this patch).

    Where is the missing lsa_lock() though?
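
For readers unfamiliar with the locking scheme in question, here is a toy reference-count model of why a missing lock can lead to this kind of crash. It is a sketch with hypothetical names, not the ospfd implementation, and the thread does not establish that this is the actual cause.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted LSA. */
struct ref_lsa {
    unsigned int lock; /* number of outstanding references */
};

struct ref_lsa *ref_lsa_new(void)
{
    struct ref_lsa *lsa = calloc(1, sizeof(*lsa));
    if (lsa != NULL)
        lsa->lock = 1; /* the creator holds the first reference */
    return lsa;
}

/* Every list or timer that keeps the pointer must take its own reference. */
struct ref_lsa *ref_lsa_lock(struct ref_lsa *lsa)
{
    assert(lsa != NULL);
    lsa->lock++;
    return lsa;
}

/* Drops one reference; frees the LSA and returns true if it was the last. */
bool ref_lsa_unlock(struct ref_lsa *lsa)
{
    assert(lsa != NULL && lsa->lock > 0);
    if (--lsa->lock > 0)
        return false;
    free(lsa);
    return true;
}
```

If one holder stores the pointer without calling ref_lsa_lock(), another holder's ref_lsa_unlock() can free the object while the first still uses it, which would fit a log that shows an LSA being freed shortly before the fault.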

    regards,
