BSD

NAVIGATION
CATEGORIES
REFERRENCE
LINKS
  • PF State Expiration Model for Huge Amount of States

    7 answers - 3073 bytes - related search similar search Add To My Delicious Add To My Stumble Upon Add To My Google Mark Add To My Facebook Add To My Digg Add To My Reddit

    I've been using redundant BSD/PF (pfsync,CARP) configuration with
    large traffic telecommunication system. It has performed rather well, but
    the way PF handles state expiration has lately become an issue.
    PF expires states in periodic sweeps in the state structure. This causes
    jitter in traffic, and with our typical loads (>5000 new states/s) also
    congestion especially in the synchronization network interface. This
    causes backup gateway to lose some state updates and as a result
    accumulate long living TCP ESTABLISHED:ESTABLISHED states which are
    already purged in the active PF machine.
    I propose the following solution:
    PF state entries should be modified to include one more RB-tree entry,
    used for state expiration. This tree would be primarily keyed with
    pf_state_expire() return values (modified to return 0 instead of
    time_second for instant expire). The tree would be secondarily keyed by
    state ids to prevent duplicate keys. All places of PF code modifying
    timeout or expire values should be changed to call update function for
    this new tree.
    So the comparison function for the new tree would be:
    static __inline int
    pf_state_compare_expire(struct pf_state *a, struct pf_state *b)
    {
    if (a->expire_key b->expire_key)
    return (1);
    if (a->expire_key < b->expire_key)
    return (-1);
    return pf_state_compare_id(a, b);
    }
    And the expire/timeout update function:
    void
    pf_update_state_expire(struct pf_state *cur,
    u_int32_t expire,
    u_int8_t timeout) {
    u_int32_t new_expire_key;
    (cur);
    cur->timeout=timeout;
    cur->expire=expire;
    if(new_expire_key != cur->expire_key) {
    RB_REMVE(pf_state_tree_expires, &tree_expires, cur);
    cur->expire_key = new_expire_key;
    RB_INSERT(pf_state_tree_expires, &tree_expires, cur);
    }
    }
    After this modification, the state purging function can be changed so that
    it takes RB_MIN from this expiration tree and expire only that state if it
    is old enough. Since RB_MIN is always the next to expire, no sweep is
    needed. Purging should purge only one state to prevent long lock-ups.
    However, the purge check is now so cheap it can be called also from both
    pf_test and pf_insert_state. Now the load of purging would be divided
    evenly.
    course not all operations currently done in the purging function can be
    done this way, because pool_* functions shouldn't be called without a
    process context (as far as I understand the issue). The purge can leave a
    list of states to be freed and the actual freeing of memory could be done
    in the periodic cleanup quickly.
    Arguably this method causes some additional CPU load. However, at least in
    my measurements I have found it next to impossible to get much more than
    50% CPU utilization in a firewall with BSD 3.8 without significant
    packet or state lossage due to state expiration thread.
    Comments or better solution ideas?
  • No.1 | | 2456 bytes | |

    Fri, 19 May 2006, joerg (AT) britannica (DOT) bec.de wrote:

    >PF state entries should be modified to include one more RB-tree entry,
    >used for state expiration.


    The problem with this approach is that any longer living connection has
    far more updates than actual expiring states. It should also be kept in
    mind that this adds a lot of cache trashing due to the constant tree
    updates.

    This is a valid concern. However at least I find it practically impossible
    to get PF firewall machine under any practical CPU/memory load before
    packet loss starts to happen because of state expiration sweeps.
    So yes, this might be expensive, but it still improves practical
    performance in my opinnion.

    What exactly is your concern?

    The sawchain effect causes jitter and lost forwarded packets when
    connection initiation rate through a PF firewall is high. This causes
    serious problems with voice applications where lost packets can be heard
    easily. The only real solution is to smooth the expiration load.

    However, a bigger problem is that at least the BSD 3.8 code can't
    keep the backup firewall state table synchronized with the master firewall
    because of this effect, even though CPU is 80% idle. The biggest problems
    are the TCP connections for which the closing sync updates are lost. They
    end up living long time in the backup firewall, and should failover
    happen, they cause a lot of state-mismatch blocks for new connections
    which are reusing the same TCP ports.

    Replace this with a number of linked list and one of them
    every second / half second whatever. This adds very low additional
    overhead (one or two pointers per state, a few more instructions for
    creation and removal of a state), but the runtime costs are not changed
    at all.

    I though of keeping one LRU TAILQ for each timeout value. The
    heads of these few lists could be checked for expiration on every
    processed packet. This structure would be rather cheap and cache
    friendly to keep. However I forgot that every rule can have separate
    timeout values which makes the number of lists too huge to check at every
    frame.

    The 3.9 code seems to split the sweep into smaller pieces much like you
    propose, but even that does not really solve the actual problem, just
    moves it to a little bit higher state creation frequency.
  • No.2 | | 1612 bytes | |

    Fri, May 19, 2006 at 10:30:28AM +0300, Teemu Takanen wrote:
    PF expires states in periodic sweeps in the state structure.

    I propose the following solution:

    PF state entries should be modified to include one more RB-tree entry,
    used for state expiration. This tree would be primarily keyed with
    pf_state_expire() return values (modified to return 0 instead of
    time_second for instant expire). The tree would be secondarily keyed by
    state ids to prevent duplicate keys. All places of PF code modifying
    timeout or expire values should be changed to call update function for
    this new tree.

    The problem with this approach is that any longer living connection has
    far more updates than actual expiring states. It should also be kept in
    mind that this adds a lot of cache trashing due to the constant tree
    updates.

    What exactly is your concern? As long as memory usage is not the
    limiting factor and you want to reduce only the sawchain affect, the
    following might be a better approach.

    Currently after a fixed timeout the whole table of all states, fragments
    and src nodes is processed. For this the corresponding full table is
    processed. Replace this with a number of linked list and one of them
    every second / half second whatever. This adds very low additional
    overhead (one or two pointers per state, a few more instructions for
    creation and removal of a state), but the runtime costs are not changed
    at all.

    This reduces the saws, but doesn't eliminate them. It might be enough
    though.

    Joerg
  • No.3 | | 1186 bytes | |

    * Teemu Takanen <Teemu.Takanen (AT) tecnomen (DOT) com[2006-05-19 11:55]:
    Fri, 19 May 2006, joerg (AT) britannica (DOT) bec.de wrote:
    >>PF state entries should be modified to include one more RB-tree entry,
    >>used for state expiration.

    >The problem with this approach is that any longer living connection has
    >far more updates than actual expiring states. It should also be kept in
    >mind that this adds a lot of cache trashing due to the constant tree
    >updates.

    This is a valid concern. However at least I find it practically impossible
    to get PF firewall machine under any practical CPU/memory load before
    packet loss starts to happen because of state expiration sweeps.

    I have had pf firewalls at this point more than once.
    we must be careful to not waste CPU power, you need as much headroom as
    possible for DoS style attacks.

    So yes, this might be expensive, but it still improves practical
    performance in my opinnion.

    not acceptable. we'll need a better solution.
    and yes, it is not easy to solve at all. we talked about that before.
  • No.4 | | 939 bytes | |

    * C?dric Berger <cedric (AT) berger (DOT) to[2006-05-19 12:18]:
    >However, a bigger problem is that at least the BSD 3.8 code can't
    >keep the backup firewall state table synchronized with the master
    >firewall because of this effect, even though CPU is 80% idle. The
    >biggest problems are the TCP connections for which the closing sync
    >updates are lost. They end up living long time in the backup firewall,
    >and should failover happen, they cause a lot of state-mismatch blocks
    >for new connections which are reusing the same TCP ports.


    I'm mostly thinking out loud here, but wouldn't it make sense to at
    least have an option to make new connections replace old ones in case
    of mismatches like that?

    and then you've created the perfect DoS. just send a forged packet that
    gets IPs & ports right, and, kaboom, legit state gone.
  • No.5 | | 3080 bytes | |

    Fri, May 19, 2006 at 12:49:19PM +0300, Teemu Takanen wrote:
    This is a valid concern. However at least I find it practically impossible
    to get PF firewall machine under any practical CPU/memory load before
    packet loss starts to happen because of state expiration sweeps.
    So yes, this might be expensive, but it still improves practical
    performance in my opinnion.

    For any setup which has a high throughput with an average number of states
    this would add a noticable overhead. For bigger state tables, but still
    long running connections, it can even trash the caches a lot. That makes
    it IM less useful for the default code.

    The sawchain effect causes jitter and lost forwarded packets when
    connection initiation rate through a PF firewall is high. This causes
    serious problems with voice applications where lost packets can be heard
    easily. The only real solution is to smooth the expiration load.

    Exactly. Two important parts here: the expiration itself is not the
    problem, but the large time the state table is locked. When the locking
    is made more fine-grained, the peak latency is reduced and with that the
    likeliness of package drops.

    However, a bigger problem is that at least the BSD 3.8 code can't
    keep the backup firewall state table synchronized with the master firewall
    because of this effect, even though CPU is 80% idle.

    This is not really surprising, since the current behaviour results in
    large flushes of updates, which can in themselve content the connection
    to the backup firewall.

    >Replace this with a number of linked list and one of them
    >every second / half second whatever. This adds very low additional
    >overhead (one or two pointers per state, a few more instructions for
    >creation and removal of a state), but the runtime costs are not changed
    >at all.


    I though of keeping one LRU TAILQ for each timeout value. The
    heads of these few lists could be checked for expiration on every
    processed packet. This structure would be rather cheap and cache
    friendly to keep. However I forgot that every rule can have separate
    timeout values which makes the number of lists too huge to check at every
    frame.

    Don't try to reinvent the wheel. The code doesn't use per-state timeouts
    for a reason. The assumption is that states normally don't timeout, e.g.
    at least one packet modifies the state before the timeout. This means
    that scanning all states is cheaper than individual timers.

    The 3.9 code seems to split the sweep into smaller pieces much like you
    propose, but even that does not really solve the actual problem, just
    moves it to a little bit higher state creation frequency.

    The code I have doesn't indicate that. It has a limit to the number of
    state entries processed in a single run, but it doesn't split the table
    into individual pieces to distribute the load over multiple timer calls.

    Joerg
  • No.6 | | 1591 bytes | |

    Fri, 19 May 2006, joerg (AT) britannica (DOT) bec.de wrote:

    For any setup which has a high throughput with an average number of states
    this would add a noticable overhead. For bigger state tables, but still
    long running connections, it can even trash the caches a lot. That makes
    it IM less useful for the default code.

    K, you are probably right about this. Now when I took a second look into
    the 3.9 code I came up with the following idea which keeps the performance
    of the avare case practically identical to current one, but still removes
    the big sweeps from high connection rate setups (This compeletely ignores
    locks and process context vs. interrup handler caller issues, but just to
    illustrate the basic idea):

    The 3.9 (CVS 1.512) version of pf.c pf_purge_expired_states()
    already takes the maxiumum number of operations to perform as an
    argument. Currently it is called with
    (pf_status.states /pf_default_rule.timeout[PFTM_INTERVAL]).

    Now if we add call to this function with a small, probably configurable,
    argument (typically 0-20) to pf_insert_state we keep up purging old state
    entries while generating new ones. In cases with stable states never
    expiring (and practically never coming up with new ones either), nothing
    changes from the current way of doing things. At sites where the
    connection creation rate is huge compared to the packet rate, the
    PFTM_INTERVAL can be set in pf.conf to a large number and still the
    purging done at insert_state can prevent the kernel running out of
    memory.
  • No.7 | | 671 bytes | |

    Don't try to reinvent the wheel. The code doesn't use per-state timeouts
    for a reason. The assumption is that states normally don't timeout, e.g.
    at least one packet modifies the state before the timeout. This means
    that scanning all states is cheaper than individual timers.

    Actually it doesn't use per-state timeouts because the kernel timeout
    system was expensive at the time. About 3 years ago, when the timeout
    wheels got added, we were gonna try switching to per-state timeouts.
    Adaptive timeouts threw a bit of a wrinkle into that idea (and I started
    my plans for world domination by slacking)

    mike

Re: PF State Expiration Model for Huge Amount of States


max 4000 letters.
Your nickname that display:
In order to stop the spam: 3 + 2 =
QUESTION ON "BSD"

EMSDN.COM