On 8/12/2011 3:32 PM, Eric Myers wrote:
I recently started looking into Solr to solve a problem created before my
time. We have a dataset of 390,000,000+ records that had a search written
for it using a simple query. The problem is that the dataset needs
additional database indexes to keep operating, and the DBA says no go; the
dataset is too large.
I quickly came to the conclusion that they needed a search engine,
preferably one that can also return some data.
My problem lies in the initial index creation. Using the
DataImportHandler with JDBC to import 390 million records will, I am
guessing, take far longer than I would like and use up quite a few resources.
Is there any way to chunk this data with the DataImportHandler? If not,
I will just write some code to handle the initial import.
Eric,
You can pass variables into the DIH via the request URL and then use them
in your DIH SQL. For example, "minDid=7000" on the URL can be accessed as
${dataimporter.request.minDid} in the dih-config.xml file (or whatever you
named your DIH config). I know this works as far back as Solr 1.4.0, but
I've never used anything older than that.
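As a rough sketch of what that looks like (the data source settings, table
name, and column names here are made up, so adjust them to your own schema),
the entity in dih-config.xml might be:

    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://dbhost/mydb"
                  user="solr" password="secret"/>
      <document>
        <!-- minDid and maxDid come straight from the request URL -->
        <entity name="record"
                query="SELECT did, title, body FROM records
                       WHERE did BETWEEN ${dataimporter.request.minDid}
                                     AND ${dataimporter.request.maxDid}"/>
      </document>
    </dataConfig>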
Once you have variables passing in via the DIH URL, use whatever SQL
constraints you need on each DIH call to do the import in chunks. You can
either issue delta-import commands or full-import commands with
clean=false, which tells Solr not to delete the existing index before
starting.
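For example (hypothetical host, handler path, and ID ranges), the first
call can be a plain full-import, and every later call adds clean=false so
the documents already indexed are kept:

    http://localhost:8983/solr/dataimport?command=full-import&minDid=1&maxDid=50000000
    http://localhost:8983/solr/dataimport?command=full-import&clean=false&minDid=50000001&maxDid=100000000
    ...and so on, until every chunk of the 390 million rows is covered.

You can hit the same handler with command=status between chunks to make
sure the previous import has finished before you kick off the next one.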
To further speed up both indexing and searching, you should limit the
amount of data *stored* (contrast with indexed) in Solr to the smallest
subset that is required to display a search results grid without
consulting the original datastore. If someone opens an individual item,
the original datastore is likely to be fast enough to retrieve full details.
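As a sketch, with made-up field names and placeholder field types, that
might look like this in schema.xml, where only the fields the grid
displays are stored:

    <!-- shown in the results grid: indexed and stored -->
    <field name="did"   type="long" indexed="true" stored="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
    <!-- searchable only: indexed but not stored, full text stays in the database -->
    <field name="body"  type="text" indexed="true" stored="false"/>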
Thanks,
Shawn