We have a 200,000,000 record index with 14 fields, and we can re-index
the entire data set in about five hours. One thing to note is that the
DataImportHandler uses one thread per entity by default. If you have a
multi-core box, you can drastically speed up indexing by specifying a
thread count of n+1, where n is the number of cores at your disposal.
See the DataImportHandler wiki page for more information.
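
For reference, here's a rough data-config.xml sketch showing where the
thread count goes (the driver, URL, table, and column names are just
placeholders for your own setup, and the exact attribute is documented
on the wiki page mentioned above):

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr" password="secret"/>
  <document>
    <!-- threads="9" assumes an 8-core box (n + 1) -->
    <entity name="record" threads="9"
            query="SELECT id, field1, field2 FROM records">
      <field column="id" name="id"/>
      <field column="field1" name="field1"/>
    </entity>
  </document>
</dataConfig>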

If that's still too slow, you may wish to consider setting up multiple
Solr instances on different machines. If you go this route, then each
Solr instance can house a portion of the index rather than the whole
thing, and the partial indexes can be built concurrently. This is
called sharding and
has both benefits and drawbacks. The wiki page on Distributed Search
has a more thorough explanation.
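
Once the shards are built, a single distributed query just lists them
in the shards parameter, along the lines of (hostnames here are
placeholders):

http://host1:8983/solr/select?q=*:*&shards=host1:8983/solr,host2:8983/solr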

You can use whatever scheme you like to partition the data, but one of
the simplest approaches with the DataImportHandler is to mod the
record ID by the number of shards you intend to create.

For example:

SELECT <your columns>
FROM <your table>
WHERE <primary key> % <numShards> = <shardNumber>
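
So with, say, four shards, each Solr instance's DIH query takes a
different remainder:

shard 0:  WHERE <primary key> % 4 = 0
shard 1:  WHERE <primary key> % 4 = 1
shard 2:  WHERE <primary key> % 4 = 2
shard 3:  WHERE <primary key> % 4 = 3

(Depending on your database, the operator may be spelled
MOD(<primary key>, 4) rather than %.)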

On Fri, Aug 12, 2011 at 4:32 PM, Eric Myers <emy...@nabancard.com> wrote:
> Recently started looking into solr to solve a problem created before my
> time.  We have a dataset consisting of 390,000,000+ records that had a
> search written for it using a simple query.  The problem is that the
> dataset needs additional indices to keep operating.  The DBA says no go,
> too large a dataset.
>
> I came to the very quick conclusion that they needed a search engine,
> preferably one that can return some data.
>
> My problem lies in the initial index creation.  Using the
> DataImportHandler with JDBC to import 390m records will, I am guessing
> take far longer than I would like, and use up quite a few resources.
>
> Is there any way to chunk this data, with the DataImportHandler?  If not
> I will just write some code to handle the initial import.
>
> Thanks
>
> --
> Eric Myers
>
>
>
