Java

  • Question for Wildcard Search:

    7 answers

    It is possible to search with the "*" and "?" wildcards at the end and in
    the middle of a search string, but not at the beginning. Is there a way to
    do this?
  • No.1

    Markus Atteneder writes:
    >It is possible to search with the "*" and "?" wildcards at the end and in
    >the middle of a search string, but not at the beginning. Is there a way to
    >do this?

    Sure. Simply index reversed words.

    QP (the QueryParser) prohibits wildcards at the beginning for performance
    reasons. If the pattern starts with a literal prefix, only the terms that
    share this prefix need to be checked against the wildcard. Without a
    prefix, every term has to be examined.
    IIRC you can use wildcards at the beginning if you create the query using
    the API, but it will be slow.

    So the performant solution is to have an additional field containing the
    tokens in reversed character order.
    Won't help for *foo* though.
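
    A minimal sketch of the reversed-token idea above, in plain Java and without
    any Lucene classes. A sorted set stands in for the term dictionary of the
    additional "reversed" field; the class and method names (ReversedTermIndex,
    endingWith) are hypothetical and chosen only for illustration. The point is
    that a leading-wildcard query such as "*arch" turns into an ordinary prefix
    scan over the reversed tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Illustration only: a sorted set plays the role of the reversed field's term list. */
final class ReversedTermIndex {
    private final TreeSet<String> reversedTerms = new TreeSet<>();

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /** Index time: store each token in reversed character order. */
    void add(String token) {
        reversedTerms.add(reverse(token));
    }

    /** Query time: "*suffix" becomes a prefix scan for reverse(suffix). */
    List<String> endingWith(String suffix) {
        String prefix = reverse(suffix);
        List<String> hits = new ArrayList<>();
        // tailSet starts at the first term >= prefix; stop at the first non-match.
        for (String term : reversedTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(reverse(term)); // turn the stored form back into the original token
        }
        return hits;
    }

    public static void main(String[] args) {
        ReversedTermIndex index = new ReversedTermIndex();
        for (String word : new String[] {"search", "research", "searching", "arch"}) {
            index.add(word);
        }
        System.out.println(index.endingWith("arch"));
    }
}
```

    As noted above, this only covers a wildcard at the very beginning; a pattern
    like *foo* still has no literal prefix in either direction.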

    HTH
    Morus

    To unsubscribe, e-mail: java-user-unsubscribe (AT) lucene (DOT) apache.org
    For additional commands, e-mail: java-user-help (AT) lucene (DOT) apache.org
  • No.2


    >Sure. Simply index reversed words.

    Since I do not have much experience with Lucene, can you explain this in
    more detail? THX!
  • No.3

    Jun 22, 2005, at 4:01 AM, Morus Walter wrote:

    >Markus Atteneder writes:
    >>It is possible to search with the "*" and "?" wildcards at the end and in
    >>the middle of a search string, but not at the beginning. Is there a way to
    >>do this?
    >
    >Sure. Simply index reversed words.
    >
    >QP (the QueryParser) prohibits wildcards at the beginning for performance
    >reasons. If the pattern starts with a literal prefix, only the terms that
    >share this prefix need to be checked against the wildcard.
    >IIRC you can use wildcards at the beginning if you create the query using
    >the API, but it will be slow.
    >
    >So the performant solution is to have an additional field containing the
    >tokens in reversed character order.
    >Won't help for *foo* though.

    There is a technique from the book Managing Gigabytes that I've
    mentioned here before (in February). Here's a snippet from it:

    technique I found in the book Managing Gigabytes, making
    "*string*" queries drastically more efficient for searching (though
    also impacting index size). Take the term "cat". It would be
    indexed with all rotated variations with an end of word marker added:

    cat$
    at$c
    t$ca
    $cat

    The query for "*at*" would be preprocessed and rotated such that the
    wildcards are collapsed at the end to search for "at*" as a
    PrefixQuery. A wildcard in the middle of a string like "c*t" would
    become a prefix query for "t$c*".
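
    A rough sketch of the rotation scheme described above, in plain Java rather
    than with Lucene classes. A sorted map stands in for the term dictionary,
    and the names (RotatedTermIndex, toPrefix, search) are hypothetical. It
    assumes "$" never occurs inside a term and handles only patterns with a
    single "*" or of the form "*X*"; "?" is not covered.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustration only: every rotation of "term$" is stored and points back to the term. */
final class RotatedTermIndex {
    private static final char END = '$'; // end-of-word marker, assumed absent from terms
    private final TreeMap<String, String> rotations = new TreeMap<>();

    /** Index time: "cat" is stored as cat$, at$c, t$ca and $cat. */
    void add(String term) {
        String marked = term + END;
        for (int i = 0; i < marked.length(); i++) {
            rotations.put(marked.substring(i) + marked.substring(0, i), term);
        }
    }

    /** Rotates the pattern so that its wildcard ends up last and returns the literal prefix. */
    static String toPrefix(String pattern) {
        int first = pattern.indexOf('*');
        if (first < 0) {
            throw new IllegalArgumentException("pattern needs a '*': " + pattern);
        }
        int last = pattern.lastIndexOf('*');
        if (first == last) {
            // "X*Y" (X or Y may be empty): "X*Y$" rotated becomes "Y$X*".
            return pattern.substring(first + 1) + END + pattern.substring(0, first);
        }
        if (first == 0 && last == pattern.length() - 1 && pattern.indexOf('*', 1) == last) {
            // "*X*": both wildcards collapse into one trailing '*'.
            return pattern.substring(1, last);
        }
        throw new IllegalArgumentException("unsupported pattern: " + pattern);
    }

    /** Query time: a prefix scan over the sorted rotations. */
    Set<String> search(String pattern) {
        String prefix = toPrefix(pattern);
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, String> e : rotations.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            hits.add(e.getValue());
        }
        return hits;
    }

    public static void main(String[] args) {
        RotatedTermIndex index = new RotatedTermIndex();
        for (String term : new String[] {"cat", "cart", "data", "scatter"}) {
            index.add(term);
        }
        System.out.println(toPrefix("c*t") + " -> " + index.search("c*t"));
        System.out.println(toPrefix("*at*") + " -> " + index.search("*at*"));
    }
}
```

    The index-size cost mentioned above is visible here: a term of n characters
    produces n + 1 entries.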

    Anyone tried this technique with Lucene?

    Erik

  • No.4

    >Markus Atteneder writes:
    >>It is possible to search with the "*" and "?" wildcards at the end and in
    >>the middle of a search string, but not at the beginning. Is there a way to
    >>do this?
    >
    >Sure. Simply index reversed words.
    >
    >QP (the QueryParser) prohibits wildcards at the beginning for performance
    >reasons. If the pattern starts with a literal prefix, only the terms that
    >share this prefix need to be checked against the wildcard.
    >IIRC you can use wildcards at the beginning if you create the query using
    >the API, but it will be slow.
    >
    >So the performant solution is to have an additional field containing the
    >tokens in reversed character order.
    >Won't help for *foo* though.

    You can also index n-grams, say 3-grams. Every word gets tokenized and
    indexed as a sequence of three-letter substrings. E.g. "tokenized"
    would be indexed as "tok" "oke" "ken" "eni" "niz" "ize" "zed".

    That would help you find *foo*, but not *ha*, because "ha" is shorter than
    a 3-gram.
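
    A small sketch of the 3-gram idea above, in plain Java with no Lucene
    classes. Words play the role of documents, and the names (TrigramIndex,
    grams, containing) are hypothetical. For infixes longer than three
    characters the sketch intersects the postings of all their 3-grams and then
    verifies each candidate, since the grams could also occur in a word without
    being adjacent; that verification step is an assumption of this sketch, not
    something stated in the post.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustration only: maps each 3-gram to the words that contain it. */
final class TrigramIndex {
    private static final int N = 3;
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** "tokenized" -> tok, oke, ken, eni, niz, ize, zed */
    static List<String> grams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + N <= text.length(); i++) {
            out.add(text.substring(i, i + N));
        }
        return out;
    }

    /** Index time: words shorter than N produce no grams and are not findable here. */
    void add(String word) {
        for (String gram : grams(word)) {
            postings.computeIfAbsent(gram, g -> new TreeSet<>()).add(word);
        }
    }

    /** Query time for "*infix*": intersect the gram postings, then verify. */
    Set<String> containing(String infix) {
        if (infix.length() < N) {
            throw new IllegalArgumentException("infix needs at least " + N + " characters");
        }
        Set<String> candidates = null;
        for (String gram : grams(infix)) {
            Set<String> words = postings.getOrDefault(gram, Collections.emptySet());
            if (candidates == null) {
                candidates = new TreeSet<>(words);
            } else {
                candidates.retainAll(words);
            }
        }
        // Sharing all grams does not guarantee they are adjacent, so check the real string.
        candidates.removeIf(word -> !word.contains(infix));
        return candidates;
    }

    public static void main(String[] args) {
        TrigramIndex index = new TrigramIndex();
        for (String word : new String[] {"tokenized", "token", "broken", "kennel"}) {
            index.add(word);
        }
        System.out.println(grams("tokenized"));
        System.out.println(index.containing("oken"));
    }
}
```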
    -- Ken
  • No.5

    Quoting Erik Hatcher <erik (AT) ehatchersolutions (DOT) com>:

    >Anyone tried this technique with Lucene?

    Actually, the problem is that the wildcard code has to search over a large
    subset of terms because the list of terms is, well, a linear structure.

    If, for example, all terms in the index were arranged as a suffix tree, the
    sort of wildcard search that is currently CPU-intensive would no longer be
    CPU-intensive.

  • No.6

    Quoting Dave Kor <s0454888 (AT) sms (DOT) ed.ac.uk>:

    >Quoting Erik Hatcher <erik (AT) ehatchersolutions (DOT) com>:
    >
    >>Anyone tried this technique with Lucene?
    >
    >Actually, the problem is that the wildcard code has to search over a large
    >subset of terms because the list of terms is, well, a linear structure.
    >
    >If, for example, all terms in the index were arranged as a suffix tree, the
    >sort of wildcard search that is currently CPU-intensive would no longer be
    >CPU-intensive.

    Hmm, I realized I should add a qualifier to the above statement. Searching for
    matching terms would no longer be CPU-intensive, especially for wildcards like
    *foo* or *foo. The other wildcard search problem, having too many matching
    terms to look up in the index, still remains unsolved.

  • No.7

    Hello,
    About 3 months ago I posted an idea about wildcard searching.

    The main idea was to index every character of the input as a separate term
    and then search using PhraseQuery.
    For example, the word "12345" would be indexed as "1" "2" "3" "4" "5". To
    find "*23*" you can use a PhraseQuery with these two terms ("2" "3"). But
    this approach is limited to queries with wildcards at the beginning or end.

    Later I did some research and wrote an extension to PhraseQuery that allows
    setting a term's relative position to a range of values (to insert gaps for
    "*" and "?"). This approach is good because it does not rewrite queries and
    never runs into memory or TooManyClauses exceptions.
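
    A toy version of the character-per-term idea, in plain Java with no Lucene
    classes. It covers only the basic case from the first paragraph ("*23*" as a
    phrase of adjacent characters); the position-range extension for gaps is not
    reproduced here. Words play the role of documents, and the names
    (CharPositionIndex, containing) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustration only: a positional index whose terms are single characters. */
final class CharPositionIndex {
    // character -> word -> positions at which the character occurs in that word
    private final Map<Character, Map<String, List<Integer>>> postings = new HashMap<>();

    /** Index time: "12345" yields the terms "1".."5" at positions 0..4. */
    void add(String word) {
        for (int pos = 0; pos < word.length(); pos++) {
            postings.computeIfAbsent(word.charAt(pos), c -> new HashMap<>())
                    .computeIfAbsent(word, w -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Query time for "*infix*": the characters must occur at consecutive positions. */
    Set<String> containing(String infix) {
        Set<String> hits = new TreeSet<>();
        if (infix.isEmpty()) {
            return hits;
        }
        Map<String, List<Integer>> first =
                postings.getOrDefault(infix.charAt(0), Collections.emptyMap());
        for (Map.Entry<String, List<Integer>> e : first.entrySet()) {
            for (int start : e.getValue()) {
                if (matchesAt(e.getKey(), start, infix)) {
                    hits.add(e.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    /** Checks that character k of the infix occurs at position start + k of the word. */
    private boolean matchesAt(String word, int start, String infix) {
        for (int k = 1; k < infix.length(); k++) {
            Map<String, List<Integer>> byWord =
                    postings.getOrDefault(infix.charAt(k), Collections.emptyMap());
            List<Integer> positions = byWord.get(word);
            if (positions == null || !positions.contains(start + k)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CharPositionIndex index = new CharPositionIndex();
        for (String word : new String[] {"12345", "54321", "2233"}) {
            index.add(word);
        }
        System.out.println(index.containing("23"));
    }
}
```

    Because the query only consults the postings of the characters it mentions,
    nothing is expanded into a list of matching terms, which is the property the
    post points to when it mentions TooManyClauses.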

    regards,
    Volodymyr Bychkoviak

    14.03.2005 13:54

    Dave Kor wrote:

    >Quoting Dave Kor <s0454888 (AT) sms (DOT) ed.ac.uk>:
    >
    >>Quoting Erik Hatcher <erik (AT) ehatchersolutions (DOT) com>:
    >>
    >>>Anyone tried this technique with Lucene?
    >>
    >>Actually, the problem is that the wildcard code has to search over a large
    >>subset of terms because the list of terms is, well, a linear structure.
    >>
    >>If, for example, all terms in the index were arranged as a suffix tree, the
    >>sort of wildcard search that is currently CPU-intensive would no longer be
    >>CPU-intensive.
    >
    >Hmm, I realized I should add a qualifier to the above statement. Searching for
    >matching terms would no longer be CPU-intensive, especially for wildcards like
    >*foo* or *foo. The other wildcard search problem, having too many matching
    >terms to look up in the index, still remains unsolved.
