Apache

NAVIGATION
CATEGORIES
REFERRENCE
LINKS
  • Lucene and database

    7 answers - 955 bytes - related search similar search Add To My Delicious Add To My Stumble Upon Add To My Google Mark Add To My Facebook Add To My Digg Add To My Reddit

    Hi folks,
    I'm looking for a solution/best practices concerning Lucene and SQL database
    integration.
    The database (MySQL) is already developed and contains data. I've tried
    MySQL full-text
    search, but it's quite slow and doesn't have the possibility to intergate
    custom analyzers.
    Phrase search is perfomed only in boolean mode and doesn't return relevance
    factor :(
    The idea is to manage full-text indexes (titles, keywords, summaries) and
    perform search
    using Lucene. The resultset will include ID's that will be appended to SQL
    query in order
    to apply additional filters based on foreign keys (categories mappings,
    etc).
    But, the database is going to be big enough, and the list of IDs returned by
    Lucene too. This
    may cause high memory usage and slow sql query speed (for instance 1000 IDs
    in "IN (id1, id2 )"
    sql filter)
    Any ideas, suggestions ?
  • No.1 | | 1421 bytes | |

    Hi Alexander,

    if the set of foreign keys you are using to filter the lucene results
    with some SQL query, I suggest you to include the foreign keys values
    as additional lucene fields, and let lucene return you the exact set
    of document IDs.

    I had excellent results with this technique.

    Hope this helps.

    Best regards,

    Giulio Cesare Solaroli

    7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:
    Hi folks,

    I'm looking for a solution/best practices concerning Lucene and SQL database
    integration.
    The database (MySQL) is already developed and contains data. I've tried
    MySQL full-text
    search, but it's quite slow and doesn't have the possibility to intergate
    custom analyzers.
    Phrase search is perfomed only in boolean mode and doesn't return relevance
    factor :(

    The idea is to manage full-text indexes (titles, keywords, summaries) and
    perform search
    using Lucene. The resultset will include ID's that will be appended to SQL
    query in order
    to apply additional filters based on foreign keys (categories mappings,
    etc).

    But, the database is going to be big enough, and the list of IDs returned by
    Lucene too. This
    may cause high memory usage and slow sql query speed (for instance 1000 IDs
    in "IN (id1, id2 )"
    sql filter)
    --
    Any ideas, suggestions ?
  • No.2 | | 1546 bytes | |

    If I'm understanding you correctly, you're using Lucene to store IDs and
    index the columns that would normally be full-text indices in MySQL, then
    use the IDs to retrieve the information from the database. This is more of a
    MySQL issue than a Lucene issue, but it suggests a flaw in your approach:
    why not simply store all the data you want to retrieve in Lucene? If the
    data in your database changes, you would have to rebuild your Lucene index
    anyway. My $0.02.

    7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:

    Hi folks,

    I'm looking for a solution/best practices concerning Lucene and SQL
    database
    integration.
    The database (MySQL) is already developed and contains data. I've tried
    MySQL full-text
    search, but it's quite slow and doesn't have the possibility to intergate
    custom analyzers.
    Phrase search is perfomed only in boolean mode and doesn't return
    relevance
    factor :(

    The idea is to manage full-text indexes (titles, keywords, summaries) and
    perform search
    using Lucene. The resultset will include ID's that will be appended to SQL
    query in order
    to apply additional filters based on foreign keys (categories mappings,
    etc).

    But, the database is going to be big enough, and the list of IDs returned
    by
    Lucene too. This
    may cause high memory usage and slow sql query speed (for instance 1000
    IDs
    in "IN (id1, id2 )"
    sql filter)
    --
    Any ideas, suggestions ?
  • No.3 | | 2329 bytes | |

    Thank you *Giulio,*

    I read some information about this approach. Is it possible to store in
    Lucene field
    several FKs ? For example, a document has serveral mappings to category
    table (N:M).
    The other question is about dates - is it possible to query Lucene for date
    interval (from YYYY-MM-DD to YYYY-MM-DD) ?
    And the last (but no least ;) - in database, there are mappings (N:M) which
    contains additional attributes, that have to be searchable - for
    example one document may reference other documents (M:N) with attribute
    (isP, IsV). Is it possible
    to include together with N:M map ?

    Sorry, may be questions are stupid but I've just started to investigate
    Lucene ;)

    Thank you,
    /Alexander

    7/4/06, Giulio Cesare Solaroli <giulio.cesare (AT) gmail (DOT) comwrote:

    Hi Alexander,

    if the set of foreign keys you are using to filter the lucene results
    with some SQL query, I suggest you to include the foreign keys values
    as additional lucene fields, and let lucene return you the exact set
    of document IDs.

    I had excellent results with this technique.

    Hope this helps.

    Best regards,

    Giulio Cesare Solaroli
    >
    >
    >

    7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:
    Hi folks,

    I'm looking for a solution/best practices concerning Lucene and SQL
    database
    integration.
    The database (MySQL) is already developed and contains data. I've tried
    MySQL full-text
    search, but it's quite slow and doesn't have the possibility to
    intergate
    custom analyzers.
    Phrase search is perfomed only in boolean mode and doesn't return
    relevance
    factor :(

    The idea is to manage full-text indexes (titles, keywords, summaries)
    and
    perform search
    using Lucene. The resultset will include ID's that will be appended to
    SQL
    query in order
    to apply additional filters based on foreign keys (categories mappings,
    etc).

    But, the database is going to be big enough, and the list of IDs
    returned by
    Lucene too. This
    may cause high memory usage and slow sql query speed (for instance 1000
    IDs
    in "IN (id1, id2 )"
    sql filter)
    --
    Any ideas, suggestions ?
  • No.4 | | 798 bytes | |

    Alexander Mashtakov wrote:

    But, the database is going to be big enough, and the list of IDs
    returned by
    Lucene too. This
    may cause high memory usage and slow sql query speed (for instance 1000 IDs
    in "IN (id1, id2 )"
    sql filter)

    For this part, I recommend using a working table to hold the IDs that result
    from your Lucene search, then use a join instead of the IN operator in your
    SQL query. /Sybase/SQLServer all have array capabilities in their APIs
    that make an insertion of a 1000 integers pretty darn fast, so I'm guessing
    that MySQL has such a thing too.

    Good luck!

    To unsubscribe, e-mail: java-user-unsubscribe (AT) lucene (DOT) apache.org
    For additional commands, e-mail: java-user-help (AT) lucene (DOT) apache.org
  • No.5 | | 2879 bytes | |

    For your date interval search requirement, Lucene can do range search.
    You need to use ConstantScoreRangeQuery for better performance.

    You'd better select all your additional attributes into the lucene
    document. This will lead to some data duplications, especially look up
    values. But it's a tradeoff for faster response and not really a
    problem.

    Chris Lu

    Lucene Search on Any Databases/Applications
    http://www.dbsight.net

    7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:
    Thank you *Giulio,*

    I read some information about this approach. Is it possible to store in
    Lucene field
    several FKs ? For example, a document has serveral mappings to category
    table (N:M).
    The other question is about dates - is it possible to query Lucene for date
    interval (from YYYY-MM-DD to YYYY-MM-DD) ?
    And the last (but no least ;) - in database, there are mappings (N:M) which
    contains additional attributes, that have to be searchable - for
    example one document may reference other documents (M:N) with attribute
    (isP, IsV). Is it possible
    to include together with N:M map ?

    Sorry, may be questions are stupid but I've just started to investigate
    Lucene ;)

    Thank you,
    /Alexander
    --
    7/4/06, Giulio Cesare Solaroli <giulio.cesare (AT) gmail (DOT) comwrote:

    Hi Alexander,

    if the set of foreign keys you are using to filter the lucene results
    with some SQL query, I suggest you to include the foreign keys values
    as additional lucene fields, and let lucene return you the exact set
    of document IDs.

    I had excellent results with this technique.

    Hope this helps.

    Best regards,

    Giulio Cesare Solaroli
    >
    >
    >

    7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:
    Hi folks,

    I'm looking for a solution/best practices concerning Lucene and SQL
    database
    integration.
    The database (MySQL) is already developed and contains data. I've tried
    MySQL full-text
    search, but it's quite slow and doesn't have the possibility to
    intergate
    custom analyzers.
    Phrase search is perfomed only in boolean mode and doesn't return
    relevance
    factor :(

    The idea is to manage full-text indexes (titles, keywords, summaries)
    and
    perform search
    using Lucene. The resultset will include ID's that will be appended to
    SQL
    query in order
    to apply additional filters based on foreign keys (categories mappings,
    etc).

    But, the database is going to be big enough, and the list of IDs
    returned by
    Lucene too. This
    may cause high memory usage and slow sql query speed (for instance 1000
    IDs
    in "IN (id1, id2 )"
    sql filter)
    --
    Any ideas, suggestions ?
  • No.6 | | 4269 bytes | |

    Alexander,

    to solve your problem, you have to carefully understand how lucene
    handles fields.

    Lucene can handle the value of the document fields in different ways;
    the different aspects to care of are:
    - indexing: whether the value is indexed or not. It may look strange
    to store data on a Lucene index without indexing it, but your records
    IDs probably feet this scenario, as usually you never search record by
    ID.
    - stored: whether you want Lucene to store the full value of the
    field. In your scenario, you probably need to store only the record ID
    on the Lucene index, as all the data are probably better kept on the
    MySQL database.
    - tokenized: in order to provide useful text search, you need to
    tokenize the "text" field(s), but you will better avoid to tokenize
    date fields. For "foreign keys" fields, if you need to store only a
    single value, you probably better avoid tokenize it. If instead you
    need to store a "list" of foreign key values, your probably set the
    value of the field with a (spare separated) values and let Lucene
    tokenize it.

    Lucene manages also other way to work with the field values, but you
    should not immediately need to understand them for your current need.

    Now, what you really need to do is realize that Lucene basically
    handles only strings, but you have quite a few "knobs" to tune. After
    understanding the option that Lucene offers you, go back to your
    problem and try to find out how to let Lucene encode your information
    in order to achieve your expected result.

    If the content of the DB is updated, and you can not manage to reindex
    the whole DB every time you need to synchronize its content, you will
    need also to define an update strategy for the Lucene index.

    Regards,

    Giulio Cesare Solaroli

    7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:
    Thank you *Giulio,*

    I read some information about this approach. Is it possible to store in
    Lucene field
    several FKs ? For example, a document has serveral mappings to category
    table (N:M).
    The other question is about dates - is it possible to query Lucene for date
    interval (from YYYY-MM-DD to YYYY-MM-DD) ?
    And the last (but no least ;) - in database, there are mappings (N:M) which
    contains additional attributes, that have to be searchable - for
    example one document may reference other documents (M:N) with attribute
    (isP, IsV). Is it possible
    to include together with N:M map ?

    Sorry, may be questions are stupid but I've just started to investigate
    Lucene ;)

    Thank you,
    /Alexander
    --
    7/4/06, Giulio Cesare Solaroli <giulio.cesare (AT) gmail (DOT) comwrote:

    Hi Alexander,

    if the set of foreign keys you are using to filter the lucene results
    with some SQL query, I suggest you to include the foreign keys values
    as additional lucene fields, and let lucene return you the exact set
    of document IDs.

    I had excellent results with this technique.

    Hope this helps.

    Best regards,

    Giulio Cesare Solaroli
    >
    >
    >

    7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:
    Hi folks,

    I'm looking for a solution/best practices concerning Lucene and SQL
    database
    integration.
    The database (MySQL) is already developed and contains data. I've tried
    MySQL full-text
    search, but it's quite slow and doesn't have the possibility to
    intergate
    custom analyzers.
    Phrase search is perfomed only in boolean mode and doesn't return
    relevance
    factor :(

    The idea is to manage full-text indexes (titles, keywords, summaries)
    and
    perform search
    using Lucene. The resultset will include ID's that will be appended to
    SQL
    query in order
    to apply additional filters based on foreign keys (categories mappings,
    etc).

    But, the database is going to be big enough, and the list of IDs
    returned by
    Lucene too. This
    may cause high memory usage and slow sql query speed (for instance 1000
    IDs
    in "IN (id1, id2 )"
    sql filter)
    --
    Any ideas, suggestions ?
  • No.7 | | 250 bytes | |

    Alexander:
    You might also look through the mail archive for "database". There have been
    several dicsussions on that topic with input from people far more
    knowledgable than me that you might find relevant <G>
    Erick

Re: Lucene and database


max 4000 letters.
Your nickname that display:
In order to stop the spam: 2 + 1 =
QUESTION ON "Apache"

EMSDN.COM