On 8/12/2011 3:32 PM, Eric Myers wrote:
I recently started looking into Solr to solve a problem created before my
time. We have a dataset of 390,000,000+ records that had a search written
for it using a simple query. The problem is that the dataset needs
additional database indexes to keep operating, and the DBA says no go; the
dataset is too large.
I quickly came to the conclusion that they needed a search engine,
preferably one that can also return some data.
My problem lies in the initial index creation. Using the
DataImportHandler with JDBC to import 390 million records will, I am
guessing, take far longer than I would like and use up quite a few resources.
Is there any way to chunk this data with the DataImportHandler? If not,
I will just write some code to handle the initial import.
Eric,
You can pass variables into the DIH via the request URL and then use them
in your DIH SQL. For example, "minDid=7000" on the URL can be accessed as
${dataimporter.request.minDid} in the dih-config.xml file (or whatever you
named your DIH config). I know this works as far back as Solr 1.4.0, but
I've never used anything older than that.
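As a rough sketch of what that looks like (the data source settings, table
name, and column names here are made up, so adjust them to your own schema),
the entity in dih-config.xml might be:

    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://dbhost/mydb"
                  user="solr" password="secret"/>
      <document>
        <!-- minDid and maxDid come straight from the request URL -->
        <entity name="record"
                query="SELECT did, title, body FROM records
                       WHERE did BETWEEN ${dataimporter.request.minDid}
                                     AND ${dataimporter.request.maxDid}"/>
      </document>
    </dataConfig>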
Once you have variables passing in via the DIH URL, use whatever SQL
constraints you need on each DIH call to do the import in chunks. You can
either issue delta-import commands or full-import commands with
clean=false, which tells Solr not to delete the existing index before
starting.
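For example (hypothetical host, handler path, and ID ranges), the first
call can be a plain full-import, and every later call adds clean=false so
the documents already indexed are kept:

    http://localhost:8983/solr/dataimport?command=full-import&minDid=1&maxDid=50000000
    http://localhost:8983/solr/dataimport?command=full-import&clean=false&minDid=50000001&maxDid=100000000
    ...and so on, until every chunk of the 390 million rows is covered.

You can hit the same handler with command=status between chunks to make
sure the previous import has finished before you kick off the next one.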
To further speed up both indexing and searching, you should limit the
amount of data *stored* (contrast with indexed) in Solr to the smallest
subset that is required to display a search results grid without
consulting the original datastore. If someone opens an individual item,
the original datastore is likely to be fast enough to retrieve full details.
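As a sketch, with made-up field names and placeholder field types, that
might look like this in schema.xml, where only the fields the grid
displays are stored:

    <!-- shown in the results grid: indexed and stored -->
    <field name="did"   type="long" indexed="true" stored="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
    <!-- searchable only: indexed but not stored, full text stays in the database -->
    <field name="body"  type="text" indexed="true" stored="false"/>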
Thanks,
Shawn