Lucene and database
7 answers - 955 bytes -

Hi folks,
I'm looking for a solution/best practices concerning Lucene and SQL database
integration.
The database (MySQL) is already developed and contains data. I've tried
MySQL full-text
search, but it's quite slow and doesn't have the possibility to intergate
custom analyzers.
Phrase search is perfomed only in boolean mode and doesn't return relevance
factor :(
The idea is to manage full-text indexes (titles, keywords, summaries) and
perform search
using Lucene. The resultset will include ID's that will be appended to SQL
query in order
to apply additional filters based on foreign keys (categories mappings,
etc).
But, the database is going to be big enough, and the list of IDs returned by
Lucene too. This
may cause high memory usage and slow sql query speed (for instance 1000 IDs
in "IN (id1, id2 )"
sql filter)
Any ideas, suggestions ?
No.1 | | 1421 bytes |
| 
Hi Alexander,
if the set of foreign keys you are using to filter the lucene results
with some SQL query, I suggest you to include the foreign keys values
as additional lucene fields, and let lucene return you the exact set
of document IDs.
I had excellent results with this technique.
Hope this helps.
Best regards,
Giulio Cesare Solaroli
7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:
Hi folks,
I'm looking for a solution/best practices concerning Lucene and SQL database
integration.
The database (MySQL) is already developed and contains data. I've tried
MySQL full-text
search, but it's quite slow and doesn't have the possibility to intergate
custom analyzers.
Phrase search is perfomed only in boolean mode and doesn't return relevance
factor :(
The idea is to manage full-text indexes (titles, keywords, summaries) and
perform search
using Lucene. The resultset will include ID's that will be appended to SQL
query in order
to apply additional filters based on foreign keys (categories mappings,
etc).
But, the database is going to be big enough, and the list of IDs returned by
Lucene too. This
may cause high memory usage and slow sql query speed (for instance 1000 IDs
in "IN (id1, id2 )"
sql filter)
--
Any ideas, suggestions ?
No.2 | | 1546 bytes |
| 
If I'm understanding you correctly, you're using Lucene to store IDs and
index the columns that would normally be full-text indices in MySQL, then
use the IDs to retrieve the information from the database. This is more of a
MySQL issue than a Lucene issue, but it suggests a flaw in your approach:
why not simply store all the data you want to retrieve in Lucene? If the
data in your database changes, you would have to rebuild your Lucene index
anyway. My $0.02.
7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:
Hi folks,
I'm looking for a solution/best practices concerning Lucene and SQL
database
integration.
The database (MySQL) is already developed and contains data. I've tried
MySQL full-text
search, but it's quite slow and doesn't have the possibility to intergate
custom analyzers.
Phrase search is perfomed only in boolean mode and doesn't return
relevance
factor :(
The idea is to manage full-text indexes (titles, keywords, summaries) and
perform search
using Lucene. The resultset will include ID's that will be appended to SQL
query in order
to apply additional filters based on foreign keys (categories mappings,
etc).
But, the database is going to be big enough, and the list of IDs returned
by
Lucene too. This
may cause high memory usage and slow sql query speed (for instance 1000
IDs
in "IN (id1, id2 )"
sql filter)
--
Any ideas, suggestions ?
No.3 | | 2329 bytes |
| 
Thank you *Giulio,*
I read some information about this approach. Is it possible to store in
Lucene field
several FKs ? For example, a document has serveral mappings to category
table (N:M).
The other question is about dates - is it possible to query Lucene for date
interval (from YYYY-MM-DD to YYYY-MM-DD) ?
And the last (but no least ;) - in database, there are mappings (N:M) which
contains additional attributes, that have to be searchable - for
example one document may reference other documents (M:N) with attribute
(isP, IsV). Is it possible
to include together with N:M map ?
Sorry, may be questions are stupid but I've just started to investigate
Lucene ;)
Thank you,
/Alexander
7/4/06, Giulio Cesare Solaroli <giulio.cesare (AT) gmail (DOT) comwrote:
Hi Alexander,
if the set of foreign keys you are using to filter the lucene results
with some SQL query, I suggest you to include the foreign keys values
as additional lucene fields, and let lucene return you the exact set
of document IDs.
I had excellent results with this technique.
Hope this helps.
Best regards,
Giulio Cesare Solaroli
>
>
>
7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:
Hi folks,
I'm looking for a solution/best practices concerning Lucene and SQL
database
integration.
The database (MySQL) is already developed and contains data. I've tried
MySQL full-text
search, but it's quite slow and doesn't have the possibility to
intergate
custom analyzers.
Phrase search is perfomed only in boolean mode and doesn't return
relevance
factor :(
The idea is to manage full-text indexes (titles, keywords, summaries)
and
perform search
using Lucene. The resultset will include ID's that will be appended to
SQL
query in order
to apply additional filters based on foreign keys (categories mappings,
etc).
But, the database is going to be big enough, and the list of IDs
returned by
Lucene too. This
may cause high memory usage and slow sql query speed (for instance 1000
IDs
in "IN (id1, id2 )"
sql filter)
--
Any ideas, suggestions ?
No.4 | | 798 bytes |
| 
Alexander Mashtakov wrote:
But, the database is going to be big enough, and the list of IDs
returned by
Lucene too. This
may cause high memory usage and slow sql query speed (for instance 1000 IDs
in "IN (id1, id2 )"
sql filter)
For this part, I recommend using a working table to hold the IDs that result
from your Lucene search, then use a join instead of the IN operator in your
SQL query. /Sybase/SQLServer all have array capabilities in their APIs
that make an insertion of a 1000 integers pretty darn fast, so I'm guessing
that MySQL has such a thing too.
Good luck!
To unsubscribe, e-mail: java-user-unsubscribe (AT) lucene (DOT) apache.org
For additional commands, e-mail: java-user-help (AT) lucene (DOT) apache.org
No.5 | | 2879 bytes |
| 
For your date interval search requirement, Lucene can do range search.
You need to use ConstantScoreRangeQuery for better performance.
You'd better select all your additional attributes into the lucene
document. This will lead to some data duplications, especially look up
values. But it's a tradeoff for faster response and not really a
problem.
Chris Lu
Lucene Search on Any Databases/Applications
http://www.dbsight.net
7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:
Thank you *Giulio,*
I read some information about this approach. Is it possible to store in
Lucene field
several FKs ? For example, a document has serveral mappings to category
table (N:M).
The other question is about dates - is it possible to query Lucene for date
interval (from YYYY-MM-DD to YYYY-MM-DD) ?
And the last (but no least ;) - in database, there are mappings (N:M) which
contains additional attributes, that have to be searchable - for
example one document may reference other documents (M:N) with attribute
(isP, IsV). Is it possible
to include together with N:M map ?
Sorry, may be questions are stupid but I've just started to investigate
Lucene ;)
Thank you,
/Alexander
--
7/4/06, Giulio Cesare Solaroli <giulio.cesare (AT) gmail (DOT) comwrote:
Hi Alexander,
if the set of foreign keys you are using to filter the lucene results
with some SQL query, I suggest you to include the foreign keys values
as additional lucene fields, and let lucene return you the exact set
of document IDs.
I had excellent results with this technique.
Hope this helps.
Best regards,
Giulio Cesare Solaroli
>
>
>
7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:
Hi folks,
I'm looking for a solution/best practices concerning Lucene and SQL
database
integration.
The database (MySQL) is already developed and contains data. I've tried
MySQL full-text
search, but it's quite slow and doesn't have the possibility to
intergate
custom analyzers.
Phrase search is perfomed only in boolean mode and doesn't return
relevance
factor :(
The idea is to manage full-text indexes (titles, keywords, summaries)
and
perform search
using Lucene. The resultset will include ID's that will be appended to
SQL
query in order
to apply additional filters based on foreign keys (categories mappings,
etc).
But, the database is going to be big enough, and the list of IDs
returned by
Lucene too. This
may cause high memory usage and slow sql query speed (for instance 1000
IDs
in "IN (id1, id2 )"
sql filter)
--
Any ideas, suggestions ?
No.6 | | 4269 bytes |
| 
Alexander,
to solve your problem, you have to carefully understand how lucene
handles fields.
Lucene can handle the value of the document fields in different ways;
the different aspects to care of are:
- indexing: whether the value is indexed or not. It may look strange
to store data on a Lucene index without indexing it, but your records
IDs probably feet this scenario, as usually you never search record by
ID.
- stored: whether you want Lucene to store the full value of the
field. In your scenario, you probably need to store only the record ID
on the Lucene index, as all the data are probably better kept on the
MySQL database.
- tokenized: in order to provide useful text search, you need to
tokenize the "text" field(s), but you will better avoid to tokenize
date fields. For "foreign keys" fields, if you need to store only a
single value, you probably better avoid tokenize it. If instead you
need to store a "list" of foreign key values, your probably set the
value of the field with a (spare separated) values and let Lucene
tokenize it.
Lucene manages also other way to work with the field values, but you
should not immediately need to understand them for your current need.
Now, what you really need to do is realize that Lucene basically
handles only strings, but you have quite a few "knobs" to tune. After
understanding the option that Lucene offers you, go back to your
problem and try to find out how to let Lucene encode your information
in order to achieve your expected result.
If the content of the DB is updated, and you can not manage to reindex
the whole DB every time you need to synchronize its content, you will
need also to define an update strategy for the Lucene index.
Regards,
Giulio Cesare Solaroli
7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:
Thank you *Giulio,*
I read some information about this approach. Is it possible to store in
Lucene field
several FKs ? For example, a document has serveral mappings to category
table (N:M).
The other question is about dates - is it possible to query Lucene for date
interval (from YYYY-MM-DD to YYYY-MM-DD) ?
And the last (but no least ;) - in database, there are mappings (N:M) which
contains additional attributes, that have to be searchable - for
example one document may reference other documents (M:N) with attribute
(isP, IsV). Is it possible
to include together with N:M map ?
Sorry, may be questions are stupid but I've just started to investigate
Lucene ;)
Thank you,
/Alexander
--
7/4/06, Giulio Cesare Solaroli <giulio.cesare (AT) gmail (DOT) comwrote:
Hi Alexander,
if the set of foreign keys you are using to filter the lucene results
with some SQL query, I suggest you to include the foreign keys values
as additional lucene fields, and let lucene return you the exact set
of document IDs.
I had excellent results with this technique.
Hope this helps.
Best regards,
Giulio Cesare Solaroli
>
>
>
7/4/06, Alexander Mashtakov <amashtakov (AT) gmail (DOT) comwrote:
Hi folks,
I'm looking for a solution/best practices concerning Lucene and SQL
database
integration.
The database (MySQL) is already developed and contains data. I've tried
MySQL full-text
search, but it's quite slow and doesn't have the possibility to
intergate
custom analyzers.
Phrase search is perfomed only in boolean mode and doesn't return
relevance
factor :(
The idea is to manage full-text indexes (titles, keywords, summaries)
and
perform search
using Lucene. The resultset will include ID's that will be appended to
SQL
query in order
to apply additional filters based on foreign keys (categories mappings,
etc).
But, the database is going to be big enough, and the list of IDs
returned by
Lucene too. This
may cause high memory usage and slow sql query speed (for instance 1000
IDs
in "IN (id1, id2 )"
sql filter)
--
Any ideas, suggestions ?
No.7 | | 250 bytes |
| 
Alexander:
You might also look through the mail archive for "database". There have been
several dicsussions on that topic with input from people far more
knowledgable than me that you might find relevant <G>
Erick