Input is absolutely welcome, late or not. Thanks for offering your thoughts.
I don't have administrative control over the "naughty" routers, but
their behavior does seem to correlate with what little I know about
proxy ARP. They don't specifically support a proxy ARP feature,
however; they are consumer-grade router/firewall devices. I do have
one in a lab now and am able to reproduce the behavior, so
I should be able to answer questions regarding how they function.
Static mapping of MAC addresses is a possibility and would certainly
mitigate unicast flooding, but the task of implementing this approach
in all appropriate cases would be so arduous as to be prohibitive.
I've experimented extensively with the adjustment of CAM table
timeouts to correlate with ARP expiration but I have only been
partially successful thus far.
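
To make the timer relationship concrete, here is a minimal sketch in
Python. It assumes a host's last frame refreshes the router's ARP entry
and the switch's CAM entry at the same instant; the function name and
the timer values are illustrative and not taken from any device in this
thread.

```python
def flooding_window(arp_timeout_s, cam_aging_s):
    """Seconds during which the router still holds an ARP entry for a
    silent host whose MAC the switch has already aged out.

    Assumption: both timers start from the same last frame sent by the host.
    Returns 0.0 when the ARP entry expires first (no flooding window).
    """
    return max(0.0, float(arp_timeout_s) - float(cam_aging_s))


# Illustrative values only (hypothetical, not measured on this network).
print(flooding_window(arp_timeout_s=1200, cam_aging_s=300))  # 900.0
print(flooding_window(arp_timeout_s=240, cam_aging_s=300))   # 0.0
```

Under these assumptions the window closes only when the ARP timeout is
at or below the CAM aging time, which is the direction the timer
adjustments are aiming for.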
-FC
On 5/19/06, Harry Reynolds <harry (AT) juniper (DOT) net> wrote:
Butting in late and too lazy to completely digest this thread now.
After a quick glance:
I wonder if you have proxy ARP enabled on the "naughty" routers, and if
so, whether turning it off might help mitigate? You mention they send
ICMP to the target node, and I think proxy ARP would generate ARP, so
perhaps not.
Also, any chance of putting a manual/static MAC entry in the switches
that are flooding?
Regards
-----Original Message-----
From: juniper-nsp-bounces (AT) puck (DOT) nether.net
[mailto:juniper-nsp-bounces (AT) puck (DOT) nether.net] On Behalf Of
Frances Albemuth
Sent: Friday, May 19, 2006 9:23 AM
To: Hannes Gredler
Cc: juniper-nsp (AT) puck (DOT) nether.net
Subject: Re: [j-nsp] Strange behavior on directly connected
interfaces?
Hi Hannes,
Some of these questions are easier to answer than others.
On 5/17/06, Hannes Gredler <hannes (AT) juniper (DOT) net> wrote:
frances,
question 1: what is the MAC address of the device that
generates the 10MBit/s worth of traffic?
--
I'm not sure there's a single device responsible for this;
it sort of looks as if there are at least two culprits. The
two being examined at the moment are both consumer grade
routing appliances. I've since determined more about what is
occurring with these culprit devices:
* These routers transmit bogus traffic to destinations
located in the same subnet with spoofed source IPs but real
source MACs.
* Unicast is flooded because the ARP timeout exceeds the
CAM table timeout. The CAM table never learns the MAC of the
"target" device because that device is discarding all of this
traffic and not generating any traffic of its own (at the
time this occurs -- the behavior is not constant).
* Some of the destination IPs generate no traffic during
certain periods of the day.
* The traffic the culprit devices transmit to other devices
in the broadcast domain will never meet the requirements of a
typical iptables or equivalent implementation so the traffic
is quietly dropped.
Net result? Bogus traffic is broadcast all over the place
because the switching infrastructure never has a cause to
learn the MAC(s) the culprit routers are trying to reach.
The culprit routers don't ARP for it, they just remember the
destination MAC, and the switches dutifully flood the unicast
frames in hopes of identifying the legitimate destination MAC
from a hypothetical return stream of traffic. This never
happens, so these bursts of illegitimate traffic occur until
someone generates traffic from behind a target device.
Then the switching infrastructure learns the MAC and voila,
the unicast traffic stops getting flooded all over the place.
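
The flooding mechanism described above can be sketched as a toy
MAC-learning switch in Python. This is a simplified model under stated
assumptions, not the behavior of any specific switch: the class name,
port numbers, MAC labels and the 300-second aging value are all
hypothetical, and the switch learns only from source addresses.

```python
class LearningSwitch:
    """Toy model of a MAC-learning switch with an aging CAM table."""

    def __init__(self, ports, cam_aging_s):
        self.ports = set(ports)
        self.cam_aging_s = cam_aging_s
        self.cam = {}  # MAC -> (port, time last seen as a source)

    def forward(self, now, in_port, src_mac, dst_mac):
        """Learn the source MAC, then return the set of egress ports."""
        self.cam[src_mac] = (in_port, now)
        entry = self.cam.get(dst_mac)
        if entry is not None and now - entry[1] <= self.cam_aging_s:
            return {entry[0]}
        # Unknown or aged-out destination: flood to every other port.
        self.cam.pop(dst_mac, None)
        return self.ports - {in_port}


switch = LearningSwitch(ports=[1, 2, 3, 4], cam_aging_s=300)
ROUTER, TARGET = "router-mac", "target-mac"  # hypothetical labels

# The target speaks once at t=0, so the switch learns it on port 3.
switch.forward(0, in_port=3, src_mac=TARGET, dst_mac=ROUTER)

# While the entry is fresh, the router's frames go to port 3 only.
fresh = switch.forward(100, in_port=1, src_mac=ROUTER, dst_mac=TARGET)

# The target stays silent; once the entry ages out, frames are flooded.
stale = switch.forward(500, in_port=1, src_mac=ROUTER, dst_mac=TARGET)

# As soon as the target transmits again, flooding stops.
switch.forward(510, in_port=3, src_mac=TARGET, dst_mac=ROUTER)
relearned = switch.forward(520, in_port=1, src_mac=ROUTER, dst_mac=TARGET)

print(sorted(fresh), sorted(stale), sorted(relearned))  # [3] [2, 3, 4] [3]
```

The sender never needs to ARP again in this model, because it already
holds the destination MAC; only a frame sourced by the target restores
the CAM entry.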
question 2: is your juniper router the only exit for your traffic
--
Indeed it is.
question 3: could it be that there are hidden backdoor(s)
--
As in loop-ish cross-connections "behind" our infrastructure?
Possible, but unlikely.
question 4: what traffic is being looped / unicast / broadcast
--
What's known about that traffic is largely articulated in
the answer to question 1, though if you've got more questions
about that traffic specifically I can probably find more answers.
question 5: what is the destination MAC address of the looped traffic
(broadcast address / unicast address of the router)
--
Also covered largely in the answer to question 1, but to
expand on this a bit, there are two distinct behaviors. I'll
call one "weirdness" and the other "high weirdness". In the
case of high weirdness, here's what happens to the best of my
ability to tell:
- Legitimate ICMP is transmitted from outside source and
arrives at router.
- Router figures packet should egress to directly connected
network via specific logical interface (confirms the filter
criteria are met, etc.).
- Router finds the destination address in the ARP table and
fires off a frame into the "Ethernet cloud" with the
destination MAC culled from the ARP table.
- The switches haven't heard a frame from the device
corresponding with the destination MAC for a while and have
forgotten the destination MAC, so they flood the frames.
- Naughty routers (two of them) hear the frames and get in
on the action. They spoof the source IP of the router (!!)
and transmit massive amounts of ICMP to the node which the
router is also trying to transmit to.
- None of this traffic warrants a response from the target
node or the equipment behind it -- it's a firewall silently
discarding unwanted traffic. So we still don't know how to
get to this MAC without flooding.
- Since these naughty routers are spoofing the IP of the
real gateway but never ARP'ing for it, lots of routers are
receiving flooded unicast frames which they believe they
shouldn't be receiving and which they believe came from the
real gateway. They send the gateway ICMP redirect host
messages (redirecting it to itself).
- For each ICMP echo that goes in, dozens of ICMP messages
with different purposes come out.
- Some of these packets are getting their TTL decremented
(the only thing that slows the situation down) but others are
not. Give it a good thirty seconds and you have a storm.
If you stop the introduction of ICMP to the network,
the TTL will decrement on enough of these packets to calm the
situation.
In the case of weirdness, we have a much less severe
version of the situation outlined above, wherein lots of
routers are getting frames that don't belong to them because
of the ARP/CAM synchronization issue, but it doesn't get out
of control because the two very naughty nodes don't get
involved and the TTL decrements as it should.
The other issue with the TTL exceeded messages coming back
on a different logical interface is a little bit of a red
herring - still interesting, but the situation above seems to
be the elephant in the living room here.
Let me know if you have thoughts, and thank you for your
time and consideration.
-FC
/hannes
Frances Albemuth wrote:
The issue can thus far be mitigated (believe it or not) by
filtering ICMP to and from the "mystery node", or by filtering ICMP
to and from every network on interface "A". I'm in possession of
the MAC of the "mystery node" and I know exactly where it lives on
the network, but it doesn't seem to correspond oddly with anything
and I haven't identified anything quirky about the network
configuration. What else should I be keeping an eye out for?
-FC
On 5/17/06, Hannes Gredler <hannes (AT) juniper (DOT) net> wrote:
>
>frances,
>>
>to mitigate the problem while diagnosing you could configure a
>firewall that discards traffic from non-local-subnet sources.
>>
>but let's focus on the loop:
>what is the MAC address of the mystery node?
>>
>/hannes
>>
>Frances Albemuth wrote:
>Hi Hannes,
>>
>Thanks for your response. When I'm sniffing on the segment I
>see a massive stream of ICMP TTL exceeded messages being returned
>by the "mystery node". The topology is definitely loop-free and
>the "loop-ish" behavior that we see only seems to occur when data
>is transmitted to unreachable destinations.
>>
>I assume by forwarding loop you mean an Ethernet loop? I would
>agree that it behaves this way in some respects. Of course, if I
>had a genuine loop the problems would be more serious and would
>occur regardless of routed traffic (the Ethernet topology with a
>handful of hosts would cripple itself).
>>
>Also interesting: the node returning the TTL exceeded "storm"
>lives behind a link with a maximum synchronous capacity of 10M.
>The "storm" itself results in 10M of traffic pushing consistently
>over all ports where the VLAN lives. It thus only cripples other
>devices with a 10M maximum synchronous bandwidth.
>>
>Thanks!
>>
>-FC
>>
>On 5/16/06, Hannes Gredler <hannes (AT) juniper (DOT) net> wrote:
>>
>>frances,
>>>
>>looks like you have a forwarding loop in your setup;
>>>
>>for further troubleshooting attach a packet-sniffer to the
>>subnet in question and look for the source MAC address that is
>>bouncing back your traffic.
>>>
>>/hannes
>>>
>>>
>>Frances Albemuth wrote:
>>Hi,
>>>
>>This is my first post to the list and I would like to preface this by
>>stating that I doubt this problem is actually related specifically to
>>Juniper equipment (perhaps a configuration error involving
>>Juniper equipment, however). I'm hoping the issue I'm working
>>on right now might ring bells in the heads of others, and in
>>any case I figure this is as good a place as any to find yourself
>>beaten by the clue stick.
>>>
>>I have a directly connected interface facing a large, flat Ethernet
>>infrastructure. There are dozens of IPs mapped to the
>>interface in question (this is a legacy aspect of the design,
>>but migration to a more hierarchical infrastructure is a long
>>process). Periodically, when packets are transmitted with an
>>unreachable destination IP residing on the directly connected
>>interface, a massive series of ICMP TTL exceeded packets is
>>returned by a different host residing on a different logical
>>interface. Traceroutes to the unreachable IP similarly show a
>>one-node loop (the same IP responds until the TTL=0).
>>The node is always the same, but if unmitigated ICMP traffic
>>is permitted to and from addresses on the logical interface,
>>sniffing the wire shows this behavior occurring to and from a
>>number of nodes. I haven't managed to duplicate the multi-node
>>behavior in a semi-controlled environment.
>>>
>>When sniffing the segment in question, the ICMP is clearly visible,
>>so for whatever reason it is universally broadcast, even
>>though both nodes involved in the ICMP communication are
>>legitimate unicast destinations. If a ping is left running,
>>these TTL exceeded messages will continue and accelerate ad
>>nauseam until a de facto pseudo broadcast storm occurs, crippling
>>access on every switching node where the VLAN in question is
>>mapped. Usually (but not always) the anomalies halt when the ping
>>is killed. The issue is largely mitigated by denying all ICMP to
>>and from addresses mapped to the logical interface.
>>>
>>That's all I'm comfortable asserting about the issue at this time.
>>What I'm really digging for here is an explanation as to why,
>>when the Juniper tries to transmit to an unreachable node, it
>>doesn't discover the node is unreachable due to a lack of response
>>from an ARP request and return ICMP unreachables on its own. I may
>>have missed something obvious here (I'm sort of hoping so) and
>>would appreciate any suggestions or experience from others. If
>>I've sent this message to a woefully inappropriate list I would
>>greatly appreciate a suggestion as to a better place to bring my
>>question(s).
>>>
>>Thanks,
>>>
>>-FC
>>>
>>
>>juniper-nsp mailing list juniper-nsp (AT) puck (DOT) nether.net
>>
>>>
>>
>
>