On 8/12/2011 3:32 PM, Eric Myers wrote:
Recently started looking into Solr to solve a problem created before my
time.  We have a dataset consisting of 390,000,000+ records that had a
search written for it using a simple query.  The problem is that the
dataset needs additional indices to keep operating, but the DBA says no
go; the dataset is too large.

I quickly came to the conclusion that they needed a search engine,
preferably one that can return some data.

My problem lies in the initial index creation.  Using the
DataImportHandler with JDBC to import 390 million records will, I am
guessing, take far longer than I would like and use up quite a few resources.

Is there any way to chunk this data with the DataImportHandler?  If not,
I will just write some code to handle the initial import.

Eric,

You can pass variables into the DIH via the request URL, which you can then use in your DIH SQL. For example, "minDid=7000" on the URL can be accessed as ${dataimporter.request.minDid} in the dih-config.xml file (or whatever you called your dih config). I know this works as far back as 1.4.0, but I've never used anything older than that.
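As a rough sketch (untested; the data source settings, table, and column names below are just placeholders for whatever you actually have, and I've added a maxDid parameter the same way to bound the upper end of each chunk), the entity query in dih-config.xml might look something like this:

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://dbhost/mydb"
                user="solr" password="secret"/>
    <document>
      <entity name="records"
              query="SELECT id, title, created FROM records
                     WHERE id &gt;= ${dataimporter.request.minDid}
                       AND id &lt; ${dataimporter.request.maxDid}"/>
    </document>
  </dataConfig>

Here minDid and maxDid both come straight from the request URL, so each import call only pulls one slice of the table.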

Once you have variables passing in via the DIH URL, use whatever SQL constraints you need on each DIH call to do it in chunks. You can either issue delta-import commands or full-import commands with clean=false, which tells DIH not to delete the existing index before starting.
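Assuming the minDid/maxDid parameters from the sketch above, a single-core Solr on the default port, and the handler registered at /dataimport, the chunked import would just be a series of requests like:

  http://localhost:8983/solr/dataimport?command=full-import&clean=false&minDid=0&maxDid=10000000
  http://localhost:8983/solr/dataimport?command=full-import&clean=false&minDid=10000000&maxDid=20000000

...and so on until the whole ID range is covered. DIH only runs one import at a time, so wait for each chunk to finish (command=status will tell you) before kicking off the next one.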

To further speed up both indexing and searching, you should limit the amount of data *stored* (contrast with indexed) in Solr to the smallest subset that is required to display a search results grid without consulting the original datastore. If someone opens an individual item, the original datastore is likely to be fast enough to retrieve full details.
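Which fields are stored is controlled by the stored attribute on each field in schema.xml. A minimal sketch (the field names and types here are made up, not taken from your schema):

  <!-- needed to display the results grid and look up full details later -->
  <field name="id"    type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="text"   indexed="true" stored="true"/>
  <!-- searched, but never displayed, so don't store it -->
  <field name="body"  type="text"   indexed="true" stored="false"/>

Everything you search against stays indexed; only the handful of fields the results grid actually displays needs stored="true".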

Thanks,
Shawn
