  • Match 2 words in a line of file

    Hi,
    I am pretty new to Python, hence this question.
    I have a file with the output of a process. I need to search this file
    one line at a time, and I am looking for the lines that contain both the
    word 'event' and the word 'new'.
    Note that I need only the lines that have both words, not just either
    one of them.
    How do I write a regexp for this? Or better yet, should I even be using
    a regexp, or is there a better way to do this?
    Thanks
  • No.1 | | 1139 bytes | |

    elrondrules (AT) gmail (DOT) com wrote:
    >Hi
    >
    >Am pretty new to python and hence this question
    >
    >I have file with an output of a process. I need to search this file one
    >line at a time and my pattern is that I am looking for the lines that
    >has the word 'event' and the word 'new'.
    >
    >Note that i need lines that has both the words only and not either one
    >of them
    >
    >how do i write a regexp for this or better yet shd i even be using
    >regexp or is there a better way to do this
    >
    >thanks

    Maybe something like this would do:

    import re

    def lines_with_words(file, word1, word2):
        """Print all lines in file that have both words in it."""
        for line in file:
            if re.search(r"\b" + word1 + r"\b", line) and \
               re.search(r"\b" + word2 + r"\b", line):
                print line

    Just call the function with a file object and two strings that
    represent the words that you want to find in each line.

    To match a word in regex you write "\bWRD\b".

    I don't know if there is a better way of doing this, but I believe that
    this should at least work.
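
    A minimal Python 3 sketch of the same word-boundary idea, for readers
    on a current interpreter. It is not the poster's code: it returns the
    matching lines instead of printing them, and it adds re.escape (an
    assumption about what is wanted) so that a search word containing
    characters such as "." or "+" is treated literally. The sample lines
    are made up for illustration.

```python
import re

def lines_with_words(lines, word1, word2):
    """Return the lines that contain both words as whole words."""
    # re.escape keeps characters such as "." or "+" in a search word
    # from being interpreted as regex syntax.
    pat1 = re.compile(r"\b" + re.escape(word1) + r"\b")
    pat2 = re.compile(r"\b" + re.escape(word2) + r"\b")
    return [line for line in lines if pat1.search(line) and pat2.search(line)]

# Hypothetical sample data, not taken from the thread.
sample = [
    "new event received",
    "event (new) logged",
    "newest event ignored",
    "nothing to see",
]
matches = lines_with_words(sample, "event", "new")
```

    Any iterable of strings works as the first argument, including an open
    file object, since iterating over a file yields its lines.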
  • No.2 | | 1452 bytes | |

    Without using re, this may work (untested ;-):

    def lines_with_words(file, word1, word2):
        """Print all lines in file that have both words in it."""
        for line in file:
            words = line.split()
            if word1 in words and word2 in words:
                print line

    /Jean Brouwers

    Rickard Lindberg wrote:
    >elrondrules (AT) gmail (DOT) com wrote:
    >>Hi
    >>
    >>Am pretty new to python and hence this question
    >>
    >>I have file with an output of a process. I need to search this file one
    >>line at a time and my pattern is that I am looking for the lines that
    >>has the word 'event' and the word 'new'.
    >>
    >>Note that i need lines that has both the words only and not either one
    >>of them
    >>
    >>how do i write a regexp for this or better yet shd i even be using
    >>regexp or is there a better way to do this
    >>
    >>thanks
    >
    >Maybe something like this would do:
    >
    >import re
    >
    >def lines_with_words(file, word1, word2):
    >    """Print all lines in file that have both words in it."""
    >    for line in file:
    >        if re.search(r"\b" + word1 + r"\b", line) and \
    >           re.search(r"\b" + word2 + r"\b", line):
    >            print line
    >
    >Just call the function with a file object and two strings that
    >represent the words that you want to find in each line.
    >
    >To match a word in regex you write "\bWRD\b".
    >
    >I don't know if there is a better way of doing this, but I believe that
    >this should at least work.
  • No.3 | | 632 bytes | |

    MrJean1 wrote:
    >def lines_with_words(file, word1, word2):
    >    """Print all lines in file that have both words in it."""
    >    for line in file:
    >        words = line.split()
    >        if word1 in words and word2 in words:
    >            print line

    This sounds better; it's probably faster than the RE version. Python
    2.5 has a really fast str.__contains__ method, done by effbot:

    def lines_with_words(file, word1, word2):
        """Print all lines in file that have both words in it.
        (word1 may be the less frequent word of the two)."""
        for line in file:
            if word1 in line and word2 in line:
                print line

    Bye,
    bearophile
  • No.4 | | 491 bytes | |

    I see two potential problems with the non regex solutions.

    1) Consider a line: "foo (bar)". When you split it you will only get
    two strings, as split by default only splits the string on white space
    characters. Thus "'bar' in words" will return false, even though bar is
    a word in that line.

    2) If you have a line something like this: "foobar hello" then "'foo'
    in line" will return true, even though foo is not a word (it is part of
    a word).
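
    Both problems are easy to reproduce in an interpreter. The snippet
    below is only an illustration using the two example lines from this
    post, with a word-boundary regex shown alongside for comparison; the
    variable names are made up.

```python
import re

# Problem 1: whitespace splitting leaves punctuation attached to words.
words = "foo (bar)".split()          # ['foo', '(bar)']
bar_found_by_split = "bar" in words  # False, although "bar" is a word

# Problem 2: a substring test also matches inside a longer word.
foo_found_by_substring = "foo" in "foobar hello"  # True, although "foo" is not a word

# For comparison, \b word boundaries handle both lines as intended.
bar_found_by_regex = re.search(r"\bbar\b", "foo (bar)") is not None
foo_found_by_regex = re.search(r"\bfoo\b", "foobar hello") is not None
```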
  • No.5 | | 1284 bytes | |

    Rickard Lindberg wrote:

    >I see two potential problems with the non regex solutions.
    >
    >1) Consider a line: "foo (bar)". When you split it you will only get
    >two strings, as split by default only splits the string on white space
    >characters. Thus "'bar' in words" will return false, even though bar is
    >a word in that line.
    >
    >2) If you have a line something like this: "foobar hello" then "'foo'
    >in line" will return true, even though foo is not a word (it is part of
    >a word).

    Here's a solution using re.split:

    import re
    import StringIO

    # Split on runs of non-word characters, so "(Word" yields "Word".
    wordsplit = re.compile(r'\W+').split

    def matchlines(fh, w1, w2):
        w1 = w1.lower()
        w2 = w2.lower()
        for line in fh:
            words = [x.lower() for x in wordsplit(line)]
            if w1 in words and w2 in words:
                print line.rstrip()

    test = """1st line of text (not matched)
    2nd line of words (not matched)
    3rd line (Word test) should match (case insensitivity)
    4th line simple test of word's (matches)
    5th line simple test of words not found (plural words)
    6th line tests produce strange words (no match - plural)
    7th line "word test" should find this
    """
    matchlines(StringIO.StringIO(test), 'test', 'word')
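
    For readers on Python 3, where the StringIO module became io.StringIO
    and print is a function, a rough port of the same re.split approach
    might look like the sketch below. Turning the function into a
    generator and using a set for the word lookup are changes made here
    for illustration, not part of the post.

```python
import io
import re

# Split on runs of non-word characters, so "(Word" yields "Word".
wordsplit = re.compile(r"\W+").split

def matchlines(fh, w1, w2):
    """Yield lines (trailing whitespace stripped) containing both words,
    ignoring case."""
    w1, w2 = w1.lower(), w2.lower()
    for line in fh:
        words = {x.lower() for x in wordsplit(line)}
        if w1 in words and w2 in words:
            yield line.rstrip()

test = """1st line of text (not matched)
2nd line of words (not matched)
3rd line (Word test) should match (case insensitivity)
4th line simple test of word's (matches)
5th line simple test of words not found (plural words)
6th line tests produce strange words (no match - plural)
7th line "word test" should find this
"""
matched = list(matchlines(io.StringIO(test), "test", "word"))
```

    With the test text from the post, the 3rd, 4th and 7th lines are
    selected; "word's" counts because splitting on \W+ separates it into
    "word" and "s".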
  • No.7 | | 1036 bytes | |

    Rickard Lindberg, yesterday I was sleepy and my solution was wrong.

    >2) If you have a line something like this: "foobar hello" then "'foo'
    >in line" will return true, even though foo is not a word (it is part of
    >a word).

    Right. Now I think the best solution is to use __contains__ (in) to
    quickly find the lines that surely contain both substrings; then, in
    such possibly rare cases, you can use a correctly written RE. If the
    words are uncommon enough, such a solution may be fast and reliable.
    Using rough tests followed by slow and reliable ones on the rare
    positive results of the first test is a technique commonly used in
    Computer Science, and it is often both fast and reliable. (It breaks
    down when the first test passes too often, or when it has some false
    negatives).
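
    A minimal sketch of this two-stage idea, assuming Python 3 and
    case-sensitive matching. The function name prefiltered_matches and the
    sample lines are hypothetical. The substring test cannot produce false
    negatives here, because any line that contains the whole word also
    contains it as a substring.

```python
import re

def prefiltered_matches(lines, word1, word2):
    """Yield lines containing both words as whole words."""
    pat1 = re.compile(r"\b" + re.escape(word1) + r"\b")
    pat2 = re.compile(r"\b" + re.escape(word2) + r"\b")
    for line in lines:
        # Stage 1: cheap substring test that rejects most lines.
        if word1 in line and word2 in line:
            # Stage 2: slower regex check, run only on the survivors,
            # to weed out matches inside longer words.
            if pat1.search(line) and pat2.search(line):
                yield line

# Hypothetical sample data.
sample = ["a new event", "renewed events", "newer event", "event: new!"]
result = list(prefiltered_matches(sample, "new", "event"))
```

    Whether this is actually faster than running the regexes on every
    line depends on how often the first stage passes, as noted above.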

    There are probably even faster solutions that scan the whole text at
    once instead of line by line, but the code becomes too hairy and it's
    probably not worth it.

    Bye,
    bearophile
  • No.8 | | 926 bytes | |

    On 18 Jan 2007 18:54:59 -0800, "Rickard Lindberg"
    <ricli576 (AT) student (DOT) liu.se> wrote:

    >I see two potential problems with the non regex solutions.
    >
    >1) Consider a line: "foo (bar)". When you split it you will only get
    >two strings, as split by default only splits the string on white space
    >characters. Thus "'bar' in words" will return false, even though bar is
    >a word in that line.
    >
    >2) If you have a line something like this: "foobar hello" then "'foo'
    >in line" will return true, even though foo is not a word (it is part of
    >a word).


    1) Depends how you define a 'word'.

    2) This can be resolved with

    templine = ' ' + line + ' '
    if ' ' + word1 + ' ' in templine and ' ' + word2 + ' ' in templine:

    Dan
  • No.9 | | 324 bytes | |

    Daniel Klein wrote:

    >2) This can be resolved with
    >
    >templine = ' ' + line + ' '
    >if ' ' + word1 + ' ' in templine and ' ' + word2 + ' ' in templine:

    But then you will still have a problem to match the word "foo" in a
    string like "bar (foo)".
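
    To make the limitation concrete, here is the padding test wrapped in a
    small function (the name has_both_padded is made up for this sketch).
    Besides the parenthesis case, note that lines read from a file usually
    end in a newline, so the last word on a line is followed by "\n"
    rather than a space unless the line is stripped first.

```python
def has_both_padded(line, word1, word2):
    """Space-padding test from the post above, wrapped in a function."""
    templine = " " + line + " "
    return (" " + word1 + " ") in templine and (" " + word2 + " ") in templine

# "foo" inside "foobar" is correctly rejected.
substring_case = has_both_padded("foobar hello", "foo", "hello")

# "foo" in parentheses is missed, as pointed out above.
parens_case = has_both_padded("bar (foo)", "foo", "bar")

# A trailing newline hides the last word unless the line is stripped.
newline_case = has_both_padded("new event\n", "new", "event")
stripped_case = has_both_padded("new event\n".strip(), "new", "event")
```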
  • No.10 | | 1731 bytes | |

    On Fri, 19 Jan 2007 22:57:37 -0800, Rickard Lindberg wrote:

    >Daniel Klein wrote:
    >
    >>2) This can be resolved with
    >>
    >>templine = ' ' + line + ' '
    >>if ' ' + word1 + ' ' in templine and ' ' + word2 + ' ' in templine:
    >
    >But then you will still have a problem to match the word "foo" in a
    >string like "bar (foo)".

    That's a good point for a general word-finder application, but in the
    case of the Original Poster's problem, it depends on the data he is
    dealing with and the consequences of errors.

    If the consequences are serious, then he may need to take extra
    precautions. But if the consequences are insignificant, then the fastest,
    most reliable solution is probably a simple generator:

    def find_new_events(text):
        for line in text.splitlines():
            line = line.lower() # remove this for case-sensitive matches
            if "event" in line and "new" in line:
                yield line

    To get all the matching lines at once, use list(find_new_events(test)).

    This is probably going to be significantly faster than a regex.

    So that's three possible solutions:

    (1) Use a quick non-regex matcher, and deal with false positives later;

    (2) Use a slow potentially complicated regex; or

    (3) Use a quick non-regex matcher to eliminate obvious non-matches, then
    pass the results to a slow regex to eliminate any remaining false
    positives.

    Which is best will depend on the OP's expected data. As always, resist
    the temptation to guess which is faster, and instead use the timeit
    module to measure it.
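
    As a rough illustration of how such a measurement could be set up with
    timeit (Python 3 assumed), the sketch below times the three options on
    made-up data. The sample lines, the repeat count and the function
    names are all assumptions; real timings depend entirely on the actual
    input, so no figures are given here.

```python
import re
import timeit

# Hypothetical data: 3000 short lines, one third of them true matches.
lines = ["a new event arrived", "nothing here", "renewed events only"] * 1000

pat_new = re.compile(r"\bnew\b")
pat_event = re.compile(r"\bevent\b")

def substring_only():
    # Option (1): quick, but lets "renewed events" through.
    return [l for l in lines if "event" in l and "new" in l]

def regex_only():
    # Option (2): whole-word regexes on every line.
    return [l for l in lines if pat_event.search(l) and pat_new.search(l)]

def prefilter_then_regex():
    # Option (3): substring test first, regexes only on the survivors.
    return [l for l in lines
            if "event" in l and "new" in l
            and pat_event.search(l) and pat_new.search(l)]

for fn in (substring_only, regex_only, prefilter_then_regex):
    seconds = timeit.timeit(fn, number=20)
    print(f"{fn.__name__}: {seconds:.4f}s for 20 runs")
```

    Note that the three functions do not return the same result on this
    data: the substring-only version also returns the "renewed events
    only" lines, which is the false-positive trade-off described in (1).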
  • No.11 | | 502 bytes | |

    >how do i write a regexp for this or better yet shd i even be using
    >regexp or is there a better way to do this

    "A team of engineers were faced with a problem; they decided to handle
    it with regular expressions. Now they had two problems"

    Regular expressions are not always the best solution. Others have
    pointed out how it can work with "if word in line"; this approach will
    do if you're dealing with simple cases.

    Take a look at this discussion, it is very informative:
