Databases

  • fts, compound words?

    15 answers

    Hi,
    I use the tsearch full text search with pg 8.0.3. It works great, but I
    wonder if it's possible to search for compound words?
    I.e., if I search for "New York" I want to get a match on
    New York has traffic problems.
    but not on
    New axe murderer incident in British York.
    Is this possible?
    I don't use any wrapper, just
    select
    from
    where
    idxfti @@ to_tsquery('default', 'searchstring')
    Thanks,
    Marcus
    (end of broadcast)
    TIP 1: if posting/reading through Usenet, please send an appropriate
    subscribe-nomail command to majordomo (AT) postgresql (DOT) org so that your
    message can get through to the mailing list cleanly
  • No.1 | | 1176 bytes | |

    On Mon, 5 Dec 2005, Marcus Engene wrote:

    > I use the tsearch full text search with pg 8.0.3. It works great, but I
    > wonder if it's possible to search for compound words?

    [snip]

    A ranking function is what you need. Read the documentation.

    Regards,

    Bartunov, sci.researcher, hostmaster of AstroNet,
    Sternberg Astronomical Institute, Moscow University (Russia)
    Internet: oleg (AT) sai (DOT) msu.su, http://www.sai.msu.su/~megera/
    phone: +007(095)939-16-83, +007(095)939-23-83

  • No.2 | | 1290 bytes | |

    Bartunov wrote:
    > On Mon, 5 Dec 2005, Marcus Engene wrote:
    >
    > [snip]
    >
    > A ranking function is what you need. Read the documentation.

    Hi,

    I realized from the documentation that I'm not looking for
    compound words after all, I meant "exact phrase".

    I can't see how to make rank tell me which results have an
    exact phrase? Like "there must be an occurrence of 'new' before
    'york'" (stemmed, so not really an exact phrase)?

    Is there something new in rank for pg 8.1?

    Thanks!
    Marcus


  • No.3 | | 1707 bytes | |

    On 12/5/05, Marcus Engene <mengpg (AT) engene (DOT) se> wrote:

    [snip]

    > I realized from the documentation that I'm not looking for
    > compound words after all, I meant "exact phrase".
    >
    > I can't see how to make rank tell me which results have an
    > exact phrase? Like "there must be an occurrence of 'new' before
    > 'york'" (stemmed, so not really an exact phrase)?

    What you'll want to do is check the original text for the exact phrase
    after the tsearch2 index has given you some targets.

    Given table foo:

    CREATE TABLE foo (
        id  serial PRIMARY KEY,
        txt text,
        ts2 tsvector
    );

    use this query:

    SELECT id FROM foo
    WHERE ts2 @@ to_tsquery('new&york')
      AND txt ILIKE '%new york%';

    You can get rid of the '%'s if you want the entire txt column to match
    the search phrase.
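
The two-stage idea above (let the index find rows containing all the words, then check the original text for the phrase) can be sketched outside the database. This is only an illustration in Python: `candidate_match`, `phrase_match` and `search` are hypothetical helper names, there is no stemming, and the phrase check uses word boundaries, which is slightly stricter than the ILIKE pattern in the query.

```python
import re

def candidate_match(text, words):
    """Stage 1: stand-in for the index test, true when every word occurs somewhere."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return all(w in tokens for w in words)

def phrase_match(text, words):
    """Stage 2: the words must appear consecutively, separated only by whitespace."""
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def search(docs, phrase):
    """Keep only documents that pass both stages."""
    words = phrase.lower().split()
    return [d for d in docs if candidate_match(d, words) and phrase_match(d, words)]

docs = [
    "New York has traffic problems.",
    "New axe murderer incident in British York.",
]
hits = search(docs, "New York")
```

Both sample sentences pass stage 1, since each contains "new" and "york"; only the first survives stage 2.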
  • No.4 | | 1386 bytes | |

    On Mon, 5 Dec 2005, Marcus Engene wrote:

    > I realized from the documentation that I'm not looking for
    > compound words after all, I meant "exact phrase".
    >
    > I can't see how to make rank tell me which results have an
    > exact phrase? Like "there must be an occurrence of 'new' before
    > 'york'" (stemmed, so not really an exact phrase)?

    http://www.sai.msu.su/~

    Phrase search
    This tip is by Mike Rylander

    To do phrase searching just add an additional WHERE clause to your query:

    SELECT id FROM tab WHERE ts_idx_col @@ to_tsquery('history&lesson')
    AND text_col ~* '.*history\\s+lesson.*';

    The full-text index will still be used, and the regex will be used to
    prune the results afterwards.

    > Is there something new in rank for pg 8.1?

    It has some improvements, but not for your case.

    Regards,

    Bartunov, sci.researcher, hostmaster of AstroNet,
    Sternberg Astronomical Institute, Moscow University (Russia)
    Internet: oleg (AT) sai (DOT) msu.su, http://www.sai.msu.su/~megera/
    phone: +007(095)939-16-83, +007(095)939-23-83

  • No.5 | | 1666 bytes | |

    Bartunov wrote:

    [snip]

    > To do phrase searching just add an additional WHERE clause to your query:
    >
    > SELECT id FROM tab WHERE ts_idx_col @@ to_tsquery('history&lesson')
    > AND text_col ~* '.*history\\s+lesson.*';
    >
    > The full-text index will still be used, and the regex will be used to
    > prune the results afterwards.

    Hi,

    Thanks for the answer, and thanks to Mike for the tip.

    This, I guess, will be problematic in a query like
    A & (B | C)
    or a more complex expression.

    Say C is "New York" and that tsearch receives

    A & (B | (new & york))

    I cannot just add the regexp afterwards. What if B is true?
    What would be nice to have, given of course that the index isn't stripped,
    is something like

    A & (B | (New TheNextWordMustFollow York))

    Would something like that be doable? Right now, intuitively, it would be
    two trees in the where clause:
    tsearch(A & B) OR
    (tsearch(A & C) AND regexpmatch(C))
    and a nightmare in complex queries.
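
A small sketch of why the regexp cannot simply be appended to the whole query. It uses Python as a stand-in for the database; the terms pizza (A) and chicago (B) are placeholders chosen for illustration, and `naive` and `two_trees` are hypothetical names for the two query shapes described above.

```python
import re

def has_word(text, word):
    """True when the word occurs as a whole token (no stemming)."""
    return re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE) is not None

def has_phrase(text, phrase):
    """True when the words of the phrase occur consecutively."""
    pattern = r"\b" + r"\s+".join(map(re.escape, phrase.split())) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def naive(text):
    # A & (B | (new & york)), then the regexp applied to every candidate row
    index_ok = has_word(text, "pizza") and (
        has_word(text, "chicago")
        or (has_word(text, "new") and has_word(text, "york"))
    )
    return index_ok and has_phrase(text, "new york")

def two_trees(text):
    # (A & B) OR ((A & C) AND regexpmatch(C))
    a = has_word(text, "pizza")
    c = has_word(text, "new") and has_word(text, "york")
    return (a and has_word(text, "chicago")) or (a and c and has_phrase(text, "new york"))

doc = "Deep dish pizza in Chicago"
```

For `doc`, the naive form rejects a row that should match through B, while the two-tree form accepts it. The cost is that the boolean structure has to be duplicated by hand for every phrase term.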

    Best regards,
    Marcus


  • No.6 | | 510 bytes | |

    On 12/6/05, Marcus Engene <mengpg (AT) engene (DOT) se> wrote:

    [snip]

    > A & (B | (New TheNextWordMustFollow York))

    Actually, I love that idea. Would it be possible to create a
    tsquery operator that understands proximity? Or how about allowing a
    predicate on the current '&' op, as in '&[dist<=1]' meaning "next
    token follows with a max distance of 1"? I imagine that it would
    only be useful on unstripped tsvectors, but if the lexeme position is
    already stored
  • No.7 | | 883 bytes | |

    > A & (B | (New TheNextWordMustFollow York))

    I had thought about this before myself. Alas, I have never had the time to
    properly investigate implementing such a feature.

    :(

    A & (B | (New + York))

    Something like that?

    > Actually, I love that idea. Would it be possible to create a
    > tsquery operator that understands proximity? Or how about allowing a
    > predicate on the current '&' op, as in '&[dist<=1]' meaning "next
    > token follows with a max distance of 1"? I imagine that it would
    > only be useful on unstripped tsvectors, but if the lexeme position is
    > already stored

    Would the proximity go in both directions? Or just forward? What about tokens
    that come before? Just a thought.

    Andy

  • No.8 | | 1434 bytes | |

    This has been discussed for a long time. We can't formulate non-conflicting
    rules. For example:
    1) a &[dist<=2] ( b &[dist<=3] c )
    2) a &[dist<=2] ( b |[dist<=3] c )
    3) a &[dist<=2] !c
    4) a &[dist<=2] ( b |[dist<=3] !c )
    5) a &[dist<=2] ( b & c )
    What exactly do they mean? Which tsvectors should be matched by those
    queries?

    The simple solution is: under a 'phrase search' operation (call it '+'
    below) there may only be other 'phrase search' operations. I.e.:
    a | b ( c + ( d + e ) ) - good
    a | ( c + ( d & g ) ) - bad.

    For example, we have the word 'foonish', and after lexizing it we get two
    lexemes: 'foo1' and 'foo2'. So a good query 'a + foonish' becomes
    'a + ( foo1 | foo2 )', which puts an '|' under the '+'.
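
The "only phrase operations under a phrase operation" rule can be written as a short recursive check. This is a sketch under assumptions: queries are modelled as nested tuples `(op, left, right)` with plain strings as leaves, which is an illustrative representation and not how tsquery stores them, and `phrase_only` and `is_valid` are hypothetical names.

```python
def phrase_only(node):
    """True when node is a leaf, or a '+' node whose whole subtree uses only '+'."""
    if isinstance(node, str):
        return True
    op, left, right = node
    return op == "+" and phrase_only(left) and phrase_only(right)

def is_valid(node):
    """Apply the rule: below any '+' there may only be leaves and other '+' nodes."""
    if isinstance(node, str):
        return True
    op, left, right = node
    if op == "+":
        return phrase_only(left) and phrase_only(right)
    return is_valid(left) and is_valid(right)

good = ("|", "a", ("+", "c", ("+", "d", "e")))    # a | ( c + ( d + e ) )
bad = ("|", "a", ("+", "c", ("&", "d", "g")))     # a | ( c + ( d & g ) )
lexized = ("+", "a", ("|", "foo1", "foo2"))       # a + ( foo1 | foo2 )
```

Under this check `good` passes and `bad` fails, and so does `lexized`: a query that was valid as typed breaks the rule once the dictionary returns two lexemes for one word.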

  • No.9 | | 2513 bytes | |

    On 12/7/05, Teodor Sigaev <teodor (AT) sigaev (DOT) ru> wrote:
    > This has been discussed for a long time. We can't formulate
    > non-conflicting rules. For example:
    > 1) a &[dist<=2] ( b &[dist<=3] c )
    > 2) a &[dist<=2] ( b |[dist<=3] c )
    > 3) a &[dist<=2] !c
    > 4) a &[dist<=2] ( b |[dist<=3] !c )
    > 5) a &[dist<=2] ( b & c )
    > What exactly do they mean? Which tsvectors should be matched by those
    > queries?

    1, 2, 4, and 5 are obviously ambiguous, but 3 seems straightforward to
    me, if not more difficult to implement. Would it not be acceptable to
    say that proximity modifiers are only valid between two simple lexemes
    and cannot be placed next to any compound expression?

    > The simple solution is: under a 'phrase search' operation (call it '+'
    > below) there may only be other 'phrase search' operations. I.e.:
    > a | b ( c + ( d + e ) ) - good
    > a | ( c + ( d & g ) ) - bad.

    Same as above. And, while '+' would be a very good shortcut for
    "&[follows;dist=1]" (or some such), I think the user should be able to
    specify the proximity more explicitly as well.

    > For example, we have the word 'foonish', and after lexizing it we get
    > two lexemes: 'foo1' and 'foo2'. So a good query 'a + foonish' becomes
    > 'a + ( foo1 | foo2 )'

    Hrm, that is a problem. Though, I think that's a case of how the
    compiled expression is built from user input. Unless I'm mistaken

    a + ( foo1 | foo2 )

    is exactly equal to

    (a + foo1) | (a + foo2)

    Ahhh but then there is the more complex example of

    a + foonish + bar

    becoming

    a + (foo1 | foo2) + bar

    but I guess that could be

    (a + foo1 + bar) | (a + foo2 + bar)
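
The distribution of '+' over '|' described above is a Cartesian product over the alternatives in each slot of the phrase. A minimal sketch, assuming a phrase is given as a list whose items are either a single lexeme or a list of alternative lexemes; `expand_phrase` and `render` are hypothetical helper names.

```python
from itertools import product

def expand_phrase(terms):
    """Return one tuple of lexemes per alternative reading of the phrase."""
    slots = [t if isinstance(t, list) else [t] for t in terms]
    return [tuple(combo) for combo in product(*slots)]

def render(alternatives):
    """Format the alternatives in the '+' and '|' notation used in this thread."""
    return " | ".join("(" + " + ".join(alt) + ")" for alt in alternatives)

# 'a + foonish + bar', where 'foonish' lexizes to foo1 or foo2
expanded = expand_phrase(["a", ["foo1", "foo2"], "bar"])
print(render(expanded))  # (a + foo1 + bar) | (a + foo2 + bar)
```

The number of alternatives is the product of the slot sizes, so a phrase containing several multi-lexeme words grows quickly.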


  • No.10 | | 1642 bytes | |

    As Teodor already pointed out, there is no unambiguous solution, or
    at least we don't know of one.

    On Wed, 7 Dec 2005, Andrew J. Kopciuch wrote:

    [snip]

    > Would the proximity go in both directions? Or just forward? What about
    > tokens that come before? Just a thought.

    Regards,

    Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
    Sternberg Astronomical Institute, Moscow University, Russia
    Internet: oleg (AT) sai (DOT) msu.su, http://www.sai.msu.su/~megera/
    phone: +007(495)939-16-83, +007(495)939-23-83

  • No.11 | | 1068 bytes | |

    Mike Rylander wrote:
    > On 12/6/05, Marcus Engene <mengpg (AT) engene (DOT) se> wrote:
    >
    >> A & (B | (New TheNextWordMustFollow York))
    >
    > Actually, I love that idea. Would it be possible to create a
    > tsquery operator that understands proximity?

    [snip]

    This might not be a solution in the longer term, but what I do for that
    type of thing is

    idxfti @@ '(a&b)' and message ~* 'a b'

    Postgres is smart enough to use the results of the GiST index first and
    then scan only those rows against the message text.

    Jeff


  • No.12 | | 1249 bytes | |

    [snip]

    > Ahhh, but then there is the more complex example of
    >
    > a + foonish + bar
    >
    > becoming
    >
    > a + (foo1 | foo2) + bar
    >
    > but I guess that could be
    >
    > (a + foo1 + bar) | (a + foo2 + bar)

    That is a simple case. What about languages such as Norwegian or German?
    They have compound words, and an ispell dictionary can split them into
    lexemes. But usually there is more than one variant of separation:

    forbruksvaremerkelov
    forbruk vare merke lov
    forbruk vare merkelov
    forbruk varemerke lov
    forbruk varemerkelov
    forbruksvare merke lov
    forbruksvare merkelov
    (Note: I don't know the translation, this is just an example. When we were
    working on compound word support we found a word which has 24 variants of
    separation!!)

    So the query 'a + forbruksvaremerkelov' will be awful:

    a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) | ... )

    Of course, that is an example just off the top of my head, but a solution
    for phrase search should work reasonably with such corner cases.
  • No.13 | | 4617 bytes | |

    On 12/8/05, Teodor Sigaev <teodor (AT) sigaev (DOT) ru> wrote:

    [snip]

    > So the query 'a + forbruksvaremerkelov' will be awful:
    >
    > a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) | ... )
    >
    > Of course, that is an example just off the top of my head, but a solution
    > for phrase search should work reasonably with such corner cases.

    WARNING: What follows is wild, hand waving speculation as I don't
    fully understand the implications of compound words! ;-)

    My naive impression is that it would be both possible and a good idea
    to stem any compound words to their versions containing the most
    individual lexemes. As an analogy, this would be similar to
    transforming composed (Normalization Form C) UTF-8 characters into
    their decomposed (Normalization Form D) versions.

    From your example above, the stemmed version of 'forbruksvaremerkelov'
    would always decompose to 'forbruk vare merke lov', both for indexing
    and in to_tsquery(). For the purposes of phrase searching, or more
    generally proximity searching, the compiled query

    a + forbruksvaremerkelov

    might look something like

    a + forbruk + vare + merke + lov

    and that's it: all parts of the compound word are required, and
    required to be in that order, for the "phrase" search to be valid. A
    compiled query like

    a + (forbruk & vare & merke & lov)

    wouldn't be valid anyway, because the user wants the entire compound
    word to be adjacent to 'a', and the bare '&' op would allow any of the
    parts to exist anywhere in the document. Or am I missing something?
    (I probably am.)

    The point is, once you go into an order-and-distance mode for two
    user-supplied words (pre-stemming), you have to apply that mode to the
    entire set of stemmed lexemes that are involved in the "phrase". If
    that assumption holds, that "user requested order and distance" uses a
    different set of operators than free-form full text searching, then I
    think it's doable. Each sub-statement that comprises a phrase search
    is an atomic unit, and can be applied anywhere within the global
    compiled query.

    [Thinking ...]

    Starting from that assumption, take the example of

    a + foonish & bar

    The implication of the above assumption is that the '+' (or
    '&[follows;dist=1]') operator has higher precedence than a bare '&'
    operator. So, the next version of the query, before compilation is
    complete, might look like:

    (a + foonish) & bar

    Then we go through these steps:

    (a + (foo1 | foo2)) & bar
      # decompose compound and multi-stem words
    ( (a + foo1) | (a + foo2) ) & bar
      # create multiple atoms for multi-stem words

    The end result is both non-ambiguous and reflects the most likely user
    intended query. Let's try it with a compound word /and/ a multi-stem
    word, remembering that "phrase operators" are only allowed between
    simple query terms, not compound terms (grouped terms):

    1) a & (foonish + forbruksvaremerkelov) & ! bar
       # user supplied query

    2) a & ( (foo1 | foo2) + forbruksvaremerkelov) & ! bar
       # decompose multi-stem words

    3) a & ( (foo1 + forbruksvaremerkelov) | (foo2 + forbruksvaremerkelov) ) & ! bar
       # make multiple atoms from multi-stemmed words involved in phrases
       # (this creates 1 atom per stem per multi-stem word, and yes, that
       # could get very big, but, IMHO, slow but working corner cases are OK)

    4) a & ( (foo1 + forbruk + vare + merke + lov) | (foo2 + forbruk + vare + merke + lov) ) & ! bar
       # explode the compound words to their "decomposed" form, because
       # that's what ought to be in the indexed data

    That meets the same criteria as the simpler example above, and I've
    not said anything about compound and multi-stem words outside the
    "phrase mode" portion of the query because the current behaviour is
    what we want in those cases.

    --
  • No.14 | | 1973 bytes | |

    > That is a simple case. What about languages such as Norwegian or German?
    > They have compound words, and an ispell dictionary can split them into
    > lexemes. But usually there is more than one variant of separation:

    [snip]

    (Sorry for replying in the wrong place in the thread, I was away on a
    trip and unsubscribed in the meantime.)

    I'm a Swede, and Swedish is similar to Norwegian and German. Take this
    example:

    kvinna
    kvinna

    Words are put together to make a new word with a different meaning. The
    first example means "tall hairy woman" and the second is "woman with
    long hair". If I were on, e.g., a dating site, I'd want the distinction.
    ;-) If not, I should enter both strings
    (" " | ) & kvinna
    which is perfectly acceptable.

    IMHO I don't see any point in splitting these words.

    Let's go back to the subject, what about a syntax like this:

    idxfti @@ to_tsquery('default', 'pizza & (Chicago | [New York])')

    I.e., the exact match string is always atomic. Wouldn't that be doable
    without any logical implications?

    Best regards,
    Marcus

  • No.15 | | 4017 bytes | |

    On 12/12/05, Marcus Engene <mengpg (AT) engene (DOT) se> wrote:

    [snip]

    > Words are put together to make a new word with a different meaning. The
    > first example means "tall hairy woman" and the second is "woman with
    > long hair". If I were on, e.g., a dating site, I'd want the distinction.
    > ;-)

    [snip]

    Well, that certainly kills my initial naive implementation plan! :-)
    Thank you for the explanation.

    [thinking]

    Well, if compound words should always be treated as the user has
    inserted them then it seems that the current implementation may be
    doing the wrong thing with regard to stemming compound words. If the
    compound words are being decomposed to constituent stems then you'd be
    getting semantically, or at least contextually, incorrect results,
    right? (Again, not an expert here. :-) )

    [thinking more]

    So, assuming that compound words should not be fully stemmed, due to
    the way they are used to create new words with different meanings, if
    step (4) were removed from my earlier plan then everything would
    continue to work as proposed.

    > IMHO I don't see any point in splitting these words.
    >
    > Let's go back to the subject, what about a syntax like this:
    >
    > idxfti @@ to_tsquery('default', 'pizza & (Chicago | [New York])')
    >
    > I.e., the exact match string is always atomic. Wouldn't that be doable
    > without any logical implications?

    I think there are several ways that phrase matching can be done in a
    logically consistent way. That is certainly one of them, and takes
    the focus off a single infix operator. TS2 already recognises
    grouping operations via parens, and restricting brackets ([,]) to
    surrounding only simple expressions (no '&', '|', '!' or '()')
    shouldn't be too hard. However, I'd still prefer that proximity
    searches could be specified more explicitly by the user. Using the
    above example:

    pizza & (Chicago | [New York])

    becomes

    pizza & (Chicago | New + York)

    which is implicitly

    pizza & (Chicago | New +{follows;dist=1} York)

    and that is read as: "Pizza, and chicago, or new followed by york at
    a distance of 1."

    where the modifier to the '+' operator could be specified by the user
    initially if desired.

    While I understand and agree that "phrase searching" would be the most
    common use for proximity+direction operator modifiers, I see things
    like the '+' operator and '[]' groupings as special cases of the more
    generalized restriction operation (or set thereof) based on the
    positional information recorded in (unstripped) indexes.
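
To make the proposed "follows; dist" semantics concrete, here is a sketch of a proximity test over stored word positions. It assumes a plain dictionary of token positions as a stand-in for an unstripped tsvector, without stemming; `positions` and `near` are hypothetical names, and the `dist` and `follows` parameters mirror the modifier notation proposed above rather than any existing tsearch2 feature.

```python
def positions(text):
    """Map each lowercased token to its 1-based positions in the text."""
    index = {}
    for pos, token in enumerate(text.lower().split(), start=1):
        index.setdefault(token.strip(".,!?"), []).append(pos)
    return index

def near(index, left, right, dist=1, follows=True):
    """True when `right` is within `dist` positions of `left`.

    With follows=True, `right` must come after `left`; otherwise either order counts.
    """
    for lp in index.get(left, []):
        for rp in index.get(right, []):
            gap = rp - lp
            if follows and 1 <= gap <= dist:
                return True
            if not follows and 1 <= abs(gap) <= dist:
                return True
    return False

idx1 = positions("New York has traffic problems.")
idx2 = positions("New axe murderer incident in British York.")
```

With the defaults, `near(idx1, "new", "york")` behaves like the '+' phrase operator, and widening `dist` or turning off `follows` gives the more general proximity search. In `idx2` the two words are six positions apart, so they match only once `dist` is raised to 6.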

    Thoughts?

    Best regards,
    Marcus

