Re: Fastest way to import big amount of documents in SolrCloud

Costi Muraru Thu, 01 May 2014 15:21:23 -0700

Thanks for the reply, Anshum. Please see my answers to your questions below.


* Why do you want to do a full index everyday?
    Not sure I understand what you mean by full index. Every day we want to
import new documents in addition to the existing ones. We also want to
remove older ones, so the total number remains roughly the same.
* How much of data are we talking about?
    The number of new documents is around 500k each day.
* What's your SolrCloud setup like?
    We're currently using Solr 3.6 with 16 shards and planning to switch to
SolrCloud, hence the inquiry.
* Do you already have some benchmarks which you're not happy with?
    Not yet. Planning to do some tests quite soon. I was looking for some
guidance before jumping in.

"Also, it helps to set the commit intervals reasonable."
What do you mean by *reasonable*? Also, do you recommend using autoCommit?
We are currently running an optimize after each import (in Solr 3) in order
to speed up subsequent queries. This is proving to take very long, though
(several hours). Doing a commit instead of an optimize usually brings the
master and slave nodes down, so we reverted to calling optimize on every
ingest.
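
To make the "parallelize the inserts" idea concrete, here is a rough sketch of
what I have in mind: split the daily documents into batches, send the batches
from several worker threads, and commit once at the end. It is written in plain
Python only to show the shape of the approach. The names batched,
parallel_import, send_batch and commit are made up for illustration; in a real
setup send_batch and commit would wrap whatever client is used (SolrJ, for
example). The batch size of 1000 and the 4 workers are placeholders, not
recommendations.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def parallel_import(
    docs: Iterable[dict],
    send_batch: Callable[[List[dict]], int],  # hypothetical: sends one batch, returns count sent
    commit: Callable[[], None],               # hypothetical: issues a single commit
    batch_size: int = 1000,                   # placeholder value, tune by measurement
    workers: int = 4,                         # placeholder value, tune by measurement
) -> int:
    """Send batches concurrently, then commit once after all batches are in."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map submits every batch up front, so all batches are held
        # in memory; summing the results re-raises any exception from a worker.
        sent = sum(pool.map(send_batch, batched(docs, batch_size)))
    commit()
    return sent


if __name__ == "__main__":
    # Stand-in sender and commit so the sketch runs without a Solr instance.
    docs = ({"id": str(i)} for i in range(2500))
    total = parallel_import(docs, send_batch=len, commit=lambda: None)
    print("documents sent:", total)
```

The single commit at the end is just one possible choice here; whether that, an
autoCommit setting, or an optimize is the right call is exactly what I am
asking about.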



On Thu, May 1, 2014 at 11:57 PM, Anshum Gupta <ans...@anshumgupta.net> wrote:

> Hi Costi,
>
> I'd recommend SolrJ, parallelize the inserts. Also, it helps to set the
> commit intervals reasonable.
>
> Just to get a better perspective
> * Why do you want to do a full index everyday?
> * How much of data are we talking about?
> * What's your SolrCloud setup like?
> * Do you already have some benchmarks which you're not happy with?
>
>
>
> On Thu, May 1, 2014 at 1:47 PM, Costi Muraru <costimur...@gmail.com>
> wrote:
>
> > Hi guys,
> >
> > What would you say is the fastest way to import data into SolrCloud?
> > Our use case: a single import of a large number of documents each day.
> >
> > Should we use SolrJ/DataImportHandler/other? Or is there perhaps a bulk
> > import feature in Solr? I came upon this promising link:
> > http://wiki.apache.org/solr/UpdateCSV
> > Any idea how UpdateCSV compares performance-wise with
> > SolrJ/DataImportHandler?
> >
> > If SolrJ, should we split the data into chunks and start multiple clients
> > at once? Would that let us take advantage of the large number of servers
> > in the SolrCloud configuration?
> >
> > Either way, after the import is finished, should we do an optimize or a
> > commit or none (
> > http://wiki.solarium-project.org/index.php/V1:Optimize_command)?
> >
> > Any tips and tricks for doing this the right way are greatly
> > appreciated.
> >
> > Thanks,
> > Costi
> >
>
>
>
> --
>
> Anshum Gupta
> http://www.anshumgupta.net
>
