Development

NAVIGATION
CATEGORIES
REFERRENCE
LINKS
  • How to get the "longest possible" match with Python's RE module?

    29 answers - 594 bytes - related search similar search Add To My Delicious Add To My Stumble Upon Add To My Google Mark Add To My Facebook Add To My Digg Add To My Reddit

    Basically, the problem is this:
    p = re.compile("do|dolittle")
    p.match("dolittle").group()
    'do'
    Python's NFA regexp engine trys only the first option, and happily
    rests on that. There's another example:
    p = re.compile("one(self)?(selfsufficient)?")
    p.match("oneselfsufficient").group()
    'oneself'
    The Python regular expression engine doesn't exaust all the
    possibilities, but in my application I hope to get the longest possible
    match, starting from a given point.
    Is there a way to do this in Python?
  • No.1 | | 1210 bytes | |

    Licheng Fang wrote:
    Basically, the problem is this:

    p = re.compile("do|dolittle")
    p.match("dolittle").group()
    'do'

    >From what I understand, this isn't python specific, it is the expected

    behavior of that pattern in any implementation. You are using
    alternation, which means "either, or", and you have the shorter
    subexpression first, so the condition is satisfied by just 'do' and the
    matching terminates.

    There's another example:

    p = re.compile("one(self)?(selfsufficient)?")
    p.match("oneselfsufficient").group()
    'oneself'

    Again, I don't think this has anything to do with python. You pattern
    basically means "match 'one' whether it is followed by 'self' or not,
    and whether it is followed by 'selfsufficient' or not". For this
    particular example, you'd want something like
    "one(self)?(sufficient)?".

    I think you could construct a pattern that would do what you want in
    python without any problem. If you post a (short) example of your data,
    I'm sure someone could help you with it.

    Regards,
    Jordan
  • No.2 | | 975 bytes | |

    Licheng Fang wrote:
    Basically, the problem is this:

    p = re.compile("do|dolittle")
    p.match("dolittle").group()
    'do'

    Python's NFA regexp engine trys only the first option, and happily
    rests on that. There's another example:

    p = re.compile("one(self)?(selfsufficient)?")
    p.match("oneselfsufficient").group()
    'oneself'

    The Python regular expression engine doesn't exaust all the
    possibilities, but in my application I hope to get the longest possible
    match, starting from a given point.

    Is there a way to do this in Python?

    This is the way the regexp works python doesn't has anything to do with
    it. It starts parsing the data with the pattern given. It returns the
    matched string acording the pattern and doesn't go back to find the
    other combinations.

    So to get all the combinations you would probably require to give
    different patterns each time.
  • No.3 | | 2048 bytes | |

    MonkeeSage wrote:
    Licheng Fang wrote:
    Basically, the problem is this:

    p = re.compile("do|dolittle")
    p.match("dolittle").group()
    'do'
    >
    >From what I understand, this isn't python specific, it is the expected

    behavior of that pattern in any implementation. You are using
    alternation, which means "either, or", and you have the shorter
    subexpression first, so the condition is satisfied by just 'do' and the
    matching terminates.

    There's another example:

    p = re.compile("one(self)?(selfsufficient)?")
    p.match("oneselfsufficient").group()
    'oneself'

    Again, I don't think this has anything to do with python. You pattern
    basically means "match 'one' whether it is followed by 'self' or not,
    and whether it is followed by 'selfsufficient' or not". For this
    particular example, you'd want something like
    "one(self)?(sufficient)?".

    I think you could construct a pattern that would do what you want in
    python without any problem. If you post a (short) example of your data,
    I'm sure someone could help you with it.

    Regards,
    Jordan

    Hi, according to these regexp engine discussions, it's NT a behavior
    true to any implementation.

    Python's NFA engine reads along the input string, matching it to the
    pattern, and backtracking when needed. By contrast a DFA engine, to my
    understanding, constructs a DFA and uses it to munch as many characters
    as possible. Maybe it's like this:

    Pattern: one(self)?(selfsufficient)?

    PYTHN'S NFA ENGINE:

    one self, none selfsufficient, none
    (start)>((1))>((2))>((3))

    DFA ENGINE:

    one self
    (start)>((123))>((23))
    |
    |
    | selfsufficient
    >((3))


    I want to know if there is some way to make Python RE behave like grep
    does, or do I have to change to another engine?
  • No.4 | | 287 bytes | |

    >
    I want to know if there is some way to make Python RE behave like grep
    does, or do I have to change to another engine?
    Maybe if you posted a (tested) grep example and data, that does as you
    want, the group could better understand what you are asking for?
    - Paddy.
  • No.5 | | 623 bytes | |

    [Licheng Fang]

    I want to know if there is some way to make Python RE behave like grep
    does,

    Not in general, no. The matching strategies couldn't be more
    different, and that's both deep and intentional. See Friedl's book
    for details:

    http://regex.info/

    or do I have to change to another engine?

    Yes, if PSIX regexp semantics are what you require. Several years
    ago there was an extension module for Python supplying PSIX
    semantics, but I couldn't find anything current just now in a minute
    of searching. You may be more motivated to search longer ;-)
  • No.6 | | 933 bytes | |

    Licheng Fang wrote:
    Hi, according to these regexp engine discussions, it's NT a behavior
    true to any implementation.
    [snip]

    Well, I just double-checked in ruby (oniguruma regexp engine):

    r = Regexp.new("do|dolittle")
    puts r.match("dolittle")[0]
    # do

    r = Regexp.new("one(self)?(sufficient)?")
    puts r.match("oneselfsufficient")[0]
    # oneself

    And perl:

    if ("doolittle" =~
    /(do|dolittle)/) {
    print "$1\n";
    # do
    }

    if ("oneselfsufficient" =~
    /(one(self)?(selfsufficient)?)/) {
    print "$1\n";
    # oneself
    }

    And Javascript (whatever regexp engine Spidermonkey uses):

    var r = new RegExp(/do|dolittle/);
    alert("dolittle".match(r)[0]);

    var r = new RegExp(/one(self)?(selfsufficient)?/);
    alert("oneselfsufficient".match(r)[0]);

    So, it seems they are all broken, or python is correct as well.

    Regards,
    Jordan
  • No.7 | | 467 bytes | |

    MonkeeSage wrote:
    So, it seems they are all broken, or python is correct as well.

    Aha, sorry about that Licheng (re: Tim's post). I guess "broken"
    depends on if you are expecting perl-compatible behavior or otherwise.
    I have my own scripts I use to do (f)grep and sed-like operations, so I
    almost never use those programs and forgot about the different pattern
    semantics (part of the reason I made my own scripts).

    Regards,
    Jordan
  • No.8 | | 1144 bytes | |

    , please do have a look at the second link I've posted. There's a
    table comparing the regexp engines. The engines you've tested probably
    all use an NFA implementation.

    MonkeeSage wrote:
    Licheng Fang wrote:
    Hi, according to these regexp engine discussions, it's NT a behavior
    true to any implementation.
    [snip]

    Well, I just double-checked in ruby (oniguruma regexp engine):

    r = Regexp.new("do|dolittle")
    puts r.match("dolittle")[0]
    # do

    r = Regexp.new("one(self)?(sufficient)?")
    puts r.match("oneselfsufficient")[0]
    # oneself

    And perl:

    if ("doolittle" =~
    /(do|dolittle)/) {
    print "$1\n";
    # do
    }

    if ("oneselfsufficient" =~
    /(one(self)?(selfsufficient)?)/) {
    print "$1\n";
    # oneself
    }

    And Javascript (whatever regexp engine Spidermonkey uses):

    var r = new RegExp(/do|dolittle/);
    alert("dolittle".match(r)[0]);

    var r = new RegExp(/one(self)?(selfsufficient)?/);
    alert("oneselfsufficient".match(r)[0]);

    So, it seems they are all broken, or python is correct as well.

    Regards,
    Jordan
  • No.9 | | 951 bytes | |

    Licheng Fang wrote:
    , please do have a look at the second link I've posted. There's a
    table comparing the regexp engines. The engines you've tested probably
    all use an NFA implementation.

    Sorry! *blush* I admit I skipped over your links. I'll have a look now.

    BTW, just an idea that may or may not work. What about finding all
    matches that meet the absolute baseline pattern, then taking the
    longest of themsomething like this mabye:

    def matcher(strings, pattern):
    out = ''
    reg = re.compile(pattern)
    for string in strings:
    match = reg.match(string).group()
    if (len(match) >= len(out)): # should this use or >= ?
    out = match
    return out # empty is no matches, else longest match

    p = ['dodad', 'dolittle', 'dodaday']
    print matcher(p, r'do.*')
    # dolittle

    Just a thought

    Regards,
    Jordan
  • No.10 | | 879 bytes | |

    Thank you very much, Tim and Monkee.

    In fact, what I'm doing is handle a lot of regular expressions. I
    wanted to build VERY LNG regexps part by part and put them all into a
    file for easy modification and maintenance. The idea is like this:

    (*INT) = \d+
    (*DECIMAL) = (*INT)\.(*INT)
    (*FACTIN) = (*DECIMAL)/(*DECIMAL)
    (*NUMERALS) = (*FACTIN)|(*DECIMAL)|(*INT)

    What's inside the sytactically wrong (* and ) is something to be
    replaced, and then I wrote a little script to do the string
    substitution, to get very long regexps to be compiled. I thought that
    might be a way to handle long and complex regexps, and in this course I
    encountered the problem with the semantics of '|'.

    My approach may sound desperate and funny, but please, do you have any
    good idea as to how to handle long and complex regexps?
  • No.11 | | 481 bytes | |

    mabye something like this is better:

    def matcher(string, pattern):
    out = ''
    for match in re.findall(r'\S*%s\S*' % pattern, string):
    if (len(match) >= len(out)):
    out = match
    return out

    p1 = 'dodad donkeykong dolittle dodaday'
    p2 = 'oneself self-serving selfsufficient oneselfsufficient'
    print matcher(p1, 'do')
    # donkeykong
    print matcher(p2, 'self')
    # oneselfsufficient
  • No.12 | | 1865 bytes | |

    [Licheng Fang[

    In fact, what I'm doing is handle a lot of regular expressions. I
    wanted to build VERY LNG regexps part by part and put them all into a
    file for easy modification and maintenance. The idea is like this:

    (*INT) = \d+
    (*DECIMAL) = (*INT)\.(*INT)
    (*FACTIN) = (*DECIMAL)/(*DECIMAL)
    (*NUMERALS) = (*FACTIN)|(*DECIMAL)|(*INT)

    What's inside the sytactically wrong (* and ) is something to be
    replaced, and then I wrote a little script to do the string
    substitution, to get very long regexps to be compiled. I thought that
    might be a way to handle long and complex regexps, and in this course I
    encountered the problem with the semantics of '|'.

    My approach may sound desperate and funny, but please, do you have any
    good idea as to how to handle long and complex regexps?

    My best advice is to find a different approach entirely. For example,
    build a parser using parser technology, and use regexps in that /only/
    to do gross tokenization ("string of digits", "decimal point", ):
    build the rest with a grammar.

    Regexps are a brittle tool, best tolerated in small doses. For an
    "NFA" implementation like Python's, you're likely to experience poor
    speed when combining many complex regexps, and /especially/ when
    alternatives are ambiguous wrt prefixes (and yours are, else you
    wouldn't have a problem with "longest match" versus "some shorter
    match" to begin with. TH, under a "DFA" implementation such as
    PSIX grep's, you're likely to experience exponential memory
    requirements (DFA implementations can need to build enormous state
    machines, tracing out in advance all possible paths through all the
    regexps applied to all possible input strings).

    Just sounds to me like the wrong tool for the job.
  • No.13 | | 798 bytes | |

    Licheng Fang wrote:
    , please do have a look at the second link I've posted. There's a
    table comparing the regexp engines. The engines you've tested probably
    all use an NFA implementation.

    Unfortunately, the stuff about NFA's is wrong. Friedl's awful
    book was the first time I saw this confusion about what NFA is;
    I don't know if he originated the mess or just propagated it.

    "Nondeterministic finite automata" is well defined in theory
    of computation. The set of languages accepted by NFA's is
    exactly the same as the set accepted by DFA's.

    What Python uses is search-and-backtrack. Unfortunately such
    engines don't have much theory behind them, and it's hard to
    reason generally about what they do.
  • No.14 | | 1859 bytes | |

    [Licheng Fang]
    >, please do have a look at the second link I've posted. There's a
    >table comparing the regexp engines. The engines you've tested probably
    >all use an NFA implementation.


    [Bryan ]
    Unfortunately, the stuff about NFA's is wrong. Friedl's awful
    book

    Strongly disagree: it's an excellent book about the /pragmatics/ of
    using "regular expressions" as most widely implemented. It's not at
    all about "regular expressions" in the CompSci sense of the term,
    which appears to be your complaint.

    was the first time I saw this confusion about what NFA is;
    I don't know if he originated the mess or just propagated it.

    As far as I could tell at the time, he originated it. I'm not happy
    about that either.

    "Nondeterministic finite automata" is well defined in theory
    of computation. The set of languages accepted by NFA's is
    exactly the same as the set accepted by DFA's.

    And which has very little to do with "regular expressions" as most
    widely implemented -- gimmicks like backreferences are wholly outside
    the DFA formalism.

    What Python uses is search-and-backtrack. Unfortunately such
    engines don't have much theory behind them, and it's hard to
    reason generally about what they do.

    Yup X 3, and the last is precisely why Friedl's book is valuable for
    people using "NFA" implementations: Friedl does a good job of
    explaining when and why you're likely to trigger atrocious runtime
    performance, and gives practical general techniques for avoiding those
    problems. None of that has anything to do with CompSci regexps
    either, but is vital knowledge for people hoping to make happy
    non-trivial use of Python/Perl/etc regexps.
  • No.15 | | 939 bytes | |

    kondal wrote:

    This is the way the regexp works python doesn't has anything to do with
    it. It starts parsing the data with the pattern given. It returns the
    matched string acording the pattern and doesn't go back to find the
    other combinations.

    I've recently had the same problem in Java, using automatically
    generated regular expressions to find the longest match; I failed on
    cases like matching the whole of "Abcdefg", but also the whole of
    "AbCdefg" or "ABcdefg", with ([A-Z][a-z])?([A-Z][A-Za-z]{1,10})? .
    No systematic way to deal with these corner cases was available, and
    unsystematic ways (with greedy and reluctant quantifiers) were too
    complex.
    I ended up eliminating regular expressions completely and building a
    dynamic programming parser that returns the set of all match lengths;
    it wasn't hard and it should be even easier in Python.

    Lorenzo Gatti
  • No.16 | | 820 bytes | |

    Bryan wrote:
    Licheng Fang wrote:
    , please do have a look at the second link I've posted. There's a
    table comparing the regexp engines. The engines you've tested probably
    all use an NFA implementation.

    Unfortunately, the stuff about NFA's is wrong. Friedl's awful
    book was the first time I saw this confusion about what NFA is;
    I don't know if he originated the mess or just propagated it.

    "Nondeterministic finite automata" is well defined in theory
    of computation. The set of languages accepted by NFA's is
    exactly the same as the set accepted by DFA's.

    What Python uses is search-and-backtrack. Unfortunately such
    engines don't have much theory behind them, and it's hard to
    reason generally about what they do.
    --
  • No.17 | | 1480 bytes | |

    gatti (AT) dsdata (DOT) it wrote:
    kondal wrote:

    This is the way the regexp works python doesn't has anything to do with
    it. It starts parsing the data with the pattern given. It returns the
    matched string acording the pattern and doesn't go back to find the
    other combinations.

    I've recently had the same problem in Java, using automatically
    generated regular expressions to find the longest match; I failed on
    cases like matching the whole of "Abcdefg", but also the whole of
    "AbCdefg" or "ABcdefg", with ([A-Z][a-z])?([A-Z][A-Za-z]{1,10})? .
    No systematic way to deal with these corner cases was available, and
    unsystematic ways (with greedy and reluctant quantifiers) were too
    complex.
    I ended up eliminating regular expressions completely and building a
    dynamic programming parser that returns the set of all match lengths;
    it wasn't hard and it should be even easier in Python.

    Lorenzo Gatti

    Thanks. I think make use of the expresiveness of CFG may be better
    idea.

    Another question: my task is to find in a given string the substrings
    that satisfies a particular pattern. That's why the first tool that
    came to my mind is regular expression. Parsers, however, only give a
    yes/no answer to a given string. To find all substrings with a
    particular pattern I may have to try every substring, which may be an
    impossible task.

    How can I solve this problem?
  • No.18 | | 746 bytes | |

    "Licheng Fang" <fanglicheng (AT) gmail (DOT) comwrites:
    I think if the backtrack is carried out in an exaustive way, we may say
    the engine trys every possibility on the NFA, though it's not an NFA
    itself.

    The backtracking engine really can recognize languages that are not
    describable by classical regexps, by using backreferences, negation,
    etc. But exactly what it can do is nowhere near as well-understood as
    what classical regexps can do.

    I seem to remember that the fully general problem of recognizing
    regexps with negation is very hard, so the backtracking matcher either
    has to reject some strings it should really match, or else it has to
    bog down horribly with certain kinds of patterns.
  • No.19 | | 870 bytes | |

    Licheng Fang wrote:

    Another question: my task is to find in a given string the substrings
    that satisfies a particular pattern. That's why the first tool that
    came to my mind is regular expression. Parsers, however, only give a
    yes/no answer to a given string. To find all substrings with a
    particular pattern I may have to try every substring, which may be an
    impossible task.

    You can collect all successful parser results beginning from each index
    in the string; this gives you all matches with that first index.
    You could extend to multiple results general bottom-up context-free
    language parsing like Earley or Tomita's algorithms; for reasonable
    languages most locations can be excluded for most rules at the
    beginning, with great performance improvements over trying over and
    over again.

    Lorenzo Gatti
  • No.20 | | 464 bytes | |

    Licheng Fang wrote:

    The idea is like this:

    (*INT) = \d+
    (*DECIMAL) = (*INT)\.(*INT)
    (*FACTIN) = (*DECIMAL)/(*DECIMAL)
    (*NUMERALS) = (*FACTIN)|(*DECIMAL)|(*INT)

    What's inside the sytactically wrong (* and ) is something to be
    replaced, and then I wrote a little script to do the string
    substitution

    what's hiding under those ellipses? things like "TERM" and "FACTR" and
    "EXPRESSIN", perhaps?

    </F>
  • No.21 | | 2622 bytes | |

    Licheng Fang wrote:
    Basically, the problem is this:

    p = re.compile("do|dolittle")
    p.match("dolittle").group()

    'do'

    Python's NFA regexp engine trys only the first option, and happily
    rests on that. There's another example:

    p = re.compile("one(self)?(selfsufficient)?")
    p.match("oneselfsufficient").group()

    'oneself'

    The Python regular expression engine doesn't exaust all the
    possibilities, but in my application I hope to get the longest possible
    match, starting from a given point.

    Is there a way to do this in Python?

    Licheng,

    If you need regexes, why not just reverse-sort your expressions? This
    seems a lot easier and faster than writing another regex compiler.
    Reverse-sorting places the longer ones ahead of the shorter ones.

    targets = ['be', 'bee', 'been', 'being']
    targets.sort ()
    targets.reverse ()
    regex = '|'.join (targets)
    re.findall (regex, 'Having been a bee in a former life, I don\'t
    mind being what I am and wouldn\'t want to be a bee ever again.')
    ['been', 'bee', 'being', 'be', 'bee']

    You might also take a look at a stream editor I recently came out with:
    %20beta

    It has been well received, especially by newbies, I believe because it
    is so simple to use and allows very compact coding.

    import SE
    Bee_Editor = SE.SE ('be=BE bee=BEE been=BEEN being=BEING')
    Bee_Editor ('Having been a bee in a former life, I don\'t mind
    being what I am and wouldn\'t want to be a bee ever again.')

    "Having BEEN a BEE in a former life, I don't mind BEING what I am and wouldn't want to BE a BEE ever again."

    Because SE works by precedence on length, the targets can be defined in any order and modular theme sets can be spliced freely to form supersets.

    SE.SE ('<EATbe==, bee==, been==, being==,')(above_sting)
    'been,bee,being,be,bee,'

    You can do extraction filters, deletion filters, substitutitons in any combination. It does multiple passes and can takes files as input, instead of strings and can output files.

    Key_Word_Translator = SE.SE ('''
    "*INT=int substitute"
    "*DECIMAL=decimal substitute"
    "*FACTIN=faction substitute"
    "*NUMERALS=numerals substitute"
    # etc.
    ''')

    I don't know if that could serve.

    Regards

    Frederic
  • No.22 | | 1243 bytes | |

    Tim Peters wrote:

    [Bryan ]
    >Unfortunately, the stuff about NFA's is wrong. Friedl's awful
    >book


    Strongly disagree: []

    I know I'm disagreeing with a lot of smart people in panning
    the book.

    >What Python uses is search-and-backtrack. Unfortunately such
    >engines don't have much theory behind them, and it's hard to
    >reason generally about what they do.


    Yup X 3, and the last is precisely why Friedl's book is valuable for
    people using "NFA" implementations: Friedl does a good job of
    explaining when and why you're likely to trigger atrocious runtime
    performance, and gives practical general techniques for avoiding those
    problems. None of that has anything to do with CompSci regexps
    either, but is vital knowledge for people hoping to make happy
    non-trivial use of Python/Perl/etc regexps.

    I read the first edition, minus some engine-specific stuff.
    General techniques are what I want, and I didn't see them in the
    book. He had things to do and not do, but it didn't add up to
    building (ir-)regular expressions with reliably tractable
    behavior.
  • No.23 | | 887 bytes | |

    Licheng Fang wrote:
    Basically, the problem is this:

    p = re.compile("do|dolittle")
    p.match("dolittle").group()
    'do'

    Python's NFA regexp engine trys only the first option, and happily
    rests on that. There's another example:

    p = re.compile("one(self)?(selfsufficient)?")
    p.match("oneselfsufficient").group()
    'oneself'

    The Python regular expression engine doesn't exaust all the
    possibilities, but in my application I hope to get the longest possible
    match, starting from a given point.

    Is there a way to do this in Python?

    Yes. Here's a way, but it sucks real bad:

    def longest_match(re_string, text):
    regexp = re.compile('(?:' + re_string + ')$')
    while text:
    m = regexp.match(text)
    if m:
    return m
    text = text[:-1]
    return None
  • No.24 | | 1012 bytes | |

    [Licheng Fang]
    >Basically, the problem is this:
    >>

    >p = re.compile("do|dolittle")
    >p.match("dolittle").group()
    >'do'



    >The Python regular expression engine doesn't exaust all the
    >possibilities, but in my application I hope to get the longest possible
    >match, starting from a given point.
    >>

    >Is there a way to do this in Python?


    [Bryan ]
    Yes. Here's a way, but it sucks real bad:
    --
    def longest_match(re_string, text):
    regexp = re.compile('(?:' + re_string + ')$')
    while text:
    m = regexp.match(text)
    if m:
    return m
    text = text[:-1]
    return None

    If you want to try something like that, note that the match() method
    accepts optional slice arguments, so the "text = text[:-1]" business
    can be replaced with much quicker little-integer arithmetic.
  • No.25 | | 4075 bytes | |

    [Bryan ]
    Unfortunately, the stuff about NFA's is wrong. Friedl's awful
    book

    [Tim Peters]
    >Strongly disagree: []


    [Bryan]
    I know I'm disagreeing with a lot of smart people in panning
    the book.

    That's allowed :-)

    What Python uses is search-and-backtrack. Unfortunately such
    engines don't have much theory behind them, and it's hard to
    reason generally about what they do.

    >Yup X 3, and the last is precisely why Friedl's book is valuable for
    >people using "NFA" implementations: Friedl does a good job of
    >explaining when and why you're likely to trigger atrocious runtime
    >performance, and gives practical general techniques for avoiding those
    >problems. None of that has anything to do with CompSci regexps
    >either, but is vital knowledge for people hoping to make happy
    >non-trivial use of Python/Perl/etc regexps.


    I read the first edition, minus some engine-specific stuff.
    General techniques are what I want, and I didn't see them in the
    book. He had things to do and not do, but it didn't add up to
    building (ir-)regular expressions with reliably tractable
    behavior.

    My guess then is that the book moved at too slow a pace for you, and
    you got bored. The most valuable general technique he (eventually ;-)
    explained he called "unrolling", and consists of writing a regexp in
    the form:

    normal* (?: special normal* )*

    where the sets of characters with which `normal` and `special` can
    start are disjoint. Under most "NFA" implementation, such a regexp is
    immune to most bad behaviors, whether finding very long matches, or in
    searching a very long string that happens not to contain any matches.

    Without knowing this, under many implementations it's very easy to end
    up writing a regexp that matches quickly when a short match exists,
    but "blows up" (stack fault, or some checked internal limit hit) if
    only a very long match exists; and/or takes even exponential time to
    fail to match when a match doesn't exist.

    For example, a direct application is writing a robust regular
    expression to match C string literals (also one form of Python string
    literal: double-quote character at each end, and backslash escapes,
    including backslash-newline to span lines):

    " [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* "

    The "normal case" is that it's just a regular old character: not a
    double quote, backslash, or newline. The "special case" is a
    backslash followed by any character whatsoever. The "normal" and
    "special" cases obviously don't intersect, so the interior of this
    regexp (i.e., ignoring the double quotes at each end) fits the
    "unrolling" pattern to a tee.

    As a result, you can use it with reasonably high confidence. For
    /why/ that's so, read the rest of his book ;-) In effect, it gives
    the search engine only one possible course of action at each character
    in the target string. So long as that's "normal", it has to stay in
    the "normal" part of the regexp, and as soon as it's "special" it has
    to leave the "normal" part. The intent is to eliminate any
    possibility for backtracking, thus ensuring predictable runtime and
    ensuring that no form of internal backtracking stack needs to be used
    (let alone hit its limit).

    The unrolling idea was new to me, and as soon as I learned it I
    replaced lots of regexps in the Python library with rewritten versions
    that demonstrably performed very much better -- even allowed closing
    some bug reports concerning extreme cases (extremely long matches, and
    unbearable runtime on long strings without any matches).

    A different lesson to take from this is that nobody is likely to
    stumble into effective "NFA" techniques based on luck or "common
    sense". It's a set of obscure & specialized learned skills.
  • No.26 | | 1355 bytes | |

    Tim Peters wrote:
    [] The most valuable general technique he (eventually ;-)
    explained he called "unrolling", and consists of writing a regexp in
    the form:

    normal* (?: special normal* )*

    where the sets of characters with which `normal` and `special` can
    start are disjoint. Under most "NFA" implementation, such a regexp is
    immune to most bad behaviors, whether finding very long matches, or in
    searching a very long string that happens not to contain any matches.

    []
    As a result, you can use it with reasonably high confidence. For
    /why/ that's so, read the rest of his book ;-) In effect, it gives
    the search engine only one possible course of action at each character
    in the target string. So long as that's "normal", it has to stay in
    the "normal" part of the regexp, and as soon as it's "special" it has
    to leave the "normal" part. The intent is to eliminate any
    possibility for backtracking, thus ensuring predictable runtime and
    ensuring that no form of internal backtracking stack needs to be used
    (let alone hit its limit).

    So how general is that? The books just says "certain common
    expressions". I think the real answer is that one can unroll
    exactly the true regular expressions. The unrolling is
    essentially resolving the states of a DFA by hand.
  • No.27 | | 1100 bytes | |

    Licheng Fang wrote:
    Basically, the problem is this:

    p = re.compile("do|dolittle")
    p.match("dolittle").group()
    'do'

    Python's NFA regexp engine trys only the first option, and happily
    rests on that. There's another example:

    p = re.compile("one(self)?(selfsufficient)?")
    p.match("oneselfsufficient").group()
    'oneself'

    The Python regular expression engine doesn't exaust all the
    possibilities, but in my application I hope to get the longest possible
    match, starting from a given point.

    Is there a way to do this in Python?

    A reply from Tim Peters reminded me that I once programmed a
    simple special case of this problem. In particular, given a
    lot of strings, my function returns a regexp that efficiently
    matches the longest of prefix equal to any of the strings.
    If the application is like the examples, where the target RE
    is just the or of strings, the method should work well.

    Now I'm thinking of a more general version, that handles all
    true regular expressions.
  • No.28 | | 431 bytes | |

    Frederic Rentsch wrote:

    If you need regexes, why not just reverse-sort your expressions? This
    seems a lot easier and faster than writing another regex compiler.
    Reverse-sorting places the longer ones ahead of the shorter ones.

    Unfortunately, not all regular expressions have a fixed match length.
    Which is the longest of, for example, /(abc)?def/ and /(def)?ghi/
    depends on the input.

    Lorenzo Gatti
  • No.29 | | 1017 bytes | |

    gatti (AT) dsdata (DOT) it wrote:
    Frederic Rentsch wrote:
    --

    >If you need regexes, why not just reverse-sort your expressions? This
    >seems a lot easier and faster than writing another regex compiler.
    >Reverse-sorting places the longer ones ahead of the shorter ones.
    >
    >

    Unfortunately, not all regular expressions have a fixed match length.
    Which is the longest of, for example, /(abc)?def/ and /(def)?ghi/
    depends on the input.

    Lorenzo Gatti

    Very true! Funny you should remind me, considering that I spent quite
    some time upgrading SE to allow regular expressions. Version 1 didn't
    and could resolve precedence at compile time. Version 2 resolves
    precedence at runtime by length of the matches and should function
    correctly in this respect, although it might not function fast enough
    for speed-critical applications. But then there is in general a
    trade-off between convenience and speed.

    Frederic

Re: How to get the "longest possible" match with Python's RE module?


max 4000 letters.
Your nickname that display:
In order to stop the spam: 6 + 5 =
QUESTION ON "Development"

EMSDN.COM