This.  And so much this.  As much this as you can muster.

On Apr 7, 2014, at 1:49 PM, Michael Della Bitta 
<michael.della.bi...@appinions.com> wrote:

> The speed of ingest via HTTP improves greatly once you do two things:
> 
> 1. Batch multiple documents into a single request.
> 2. Index with multiple threads at once.
> 
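For illustration, here is a minimal SolrJ sketch of point 1 (batching), assuming the Solr 4.x API that was current when this thread was written (HttpSolrServer) and a hypothetical core at http://localhost:8983/solr/collection1 with hypothetical "id" and "title" fields. Point 2 is then a matter of running several such loops concurrently, or of using ConcurrentUpdateSolrServer, which queues and threads requests internally.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.List;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            // One client, reused for the whole run; the URL is a placeholder.
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("title", "document " + i);   // hypothetical field
                batch.add(doc);
                if (batch.size() == 1000) {   // point 1: many docs per request
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();
            solr.shutdown();
        }
    }

A batch size of a few hundred to a few thousand documents is a common starting point; the right number depends on document size and available memory.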
> 
> On Mon, Apr 7, 2014 at 12:40 PM, Daniel Collins <danwcoll...@gmail.com> wrote:
> 
>> I have to agree with Shawn.  We have a SolrCloud setup with 256 shards,
>> ~400M documents in total, with 4-way replication (so it's quite a big
>> setup!).  I had thought that HTTP would slow things down, so we recently
>> trialed a JNI approach (clients are C++) so we could call SolrJ and get the
>> benefits of JavaBin encoding for our indexing....
>> 
>> Once we had done benchmarks with both solutions, I think we saved about 1ms
>> per document (on average) with JNI, so it wasn't as big a gain as we were
>> expecting.  There are other benefits of SolrJ (ZooKeeper integration,
>> better routing, etc.), and we were doing local HTTP (so it was literally just
>> a TCP connection to localhost, no actual network traffic), but that just goes
>> to prove what other posters have said here.  Check whether HTTP really *is* the
>> bottleneck before you try to replace it!
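
One way to act on that advice is to measure the HTTP cost directly before rewriting anything. A rough sketch, again assuming SolrJ 4.x and a placeholder core URL, that times a single batched add so the per-document HTTP cost can be compared with the time spent preparing documents on the client side:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.List;

    public class HttpCostCheck {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "bench-" + i);
                batch.add(doc);
            }
            // Time one round trip and divide by the batch size for a rough per-doc cost.
            long start = System.nanoTime();
            solr.add(batch);
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            System.out.printf("%d docs in %d ms (%.3f ms/doc)%n",
                    batch.size(), elapsedMs, (double) elapsedMs / batch.size());
            solr.shutdown();
        }
    }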
>> 
>> 
>> On 7 April 2014 17:05, Shawn Heisey <s...@elyograg.org> wrote:
>> 
>>> On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
>>> 
>>>> Do you mean to tell me that the people on this list that are indexing
>>>> 100s of millions of documents are doing this over HTTP?  I have been using
>>>> custom Lucene code to index files, as I thought this would be faster for
>>>> many documents and I wanted some non-standard OCR and index fields.  Is
>>>> there a better way?
>>>> 
>>>> To the OP: You can also use Lucene to locally index files for Solr.
>>>> 
>>> 
>>> My sharded index has 94 million docs in it.  All normal indexing and
>>> maintenance is done with SolrJ, over HTTP.  Currently full rebuilds are done
>>> with the DataImportHandler (DIH) loading from MySQL, but that is legacy.  This
>>> is NOT a SolrCloud installation.  It is also not a replicated setup -- my
>>> indexing program keeps both copies up to date independently, similar to
>>> what happens behind the scenes with SolrCloud.
>>> 
>>> The single-thread DIH is very well optimized, and is faster than what I
>>> have written myself -- also single-threaded.
>>> 
>>> The real reason that we still use DIH for rebuilds is that I can run the
>>> DIH simultaneously on all shards.  A full rebuild that way takes about 5
>>> hours.  A SolrJ process feeding all shards with a single thread would take
>>> a lot longer.  Once I have time to work on it, I can make the SolrJ rebuild
>>> multi-threaded, and I expect it will be similar to DIH in rebuild speed.
>>> Hopefully I can make it faster.
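
A multi-threaded rebuild along those lines might look like the following sketch: one thread per shard, each with its own client, for a manually sharded (non-SolrCloud) setup like the one described. The shard URLs and the fetchDocsForShard() helper are hypothetical placeholders for whatever source actually feeds each shard (e.g. a MySQL query partitioned by shard).

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelRebuild {
        public static void main(String[] args) throws Exception {
            // Placeholder shard URLs for a manually sharded index.
            List<String> shardUrls = Arrays.asList(
                    "http://host1:8983/solr/shard1",
                    "http://host2:8983/solr/shard2");
            ExecutorService pool = Executors.newFixedThreadPool(shardUrls.size());
            for (final String url : shardUrls) {
                pool.submit(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            HttpSolrServer solr = new HttpSolrServer(url);
                            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                            for (SolrInputDocument doc : fetchDocsForShard(url)) {
                                batch.add(doc);
                                if (batch.size() == 1000) {
                                    solr.add(batch);
                                    batch.clear();
                                }
                            }
                            if (!batch.isEmpty()) {
                                solr.add(batch);
                            }
                            solr.commit();
                            solr.shutdown();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
        }

        // Hypothetical placeholder for the per-shard document source.
        static List<SolrInputDocument> fetchDocsForShard(String shardUrl) {
            return new ArrayList<SolrInputDocument>();
        }
    }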
>>> 
>>> There is always overhead with HTTP.  On a gigabit LAN, I don't think it's
>>> high enough to matter.
>>> 
>>> Using Lucene to index files for Solr is an option -- but that requires
>>> writing a custom Lucene application, and knowledge about how to turn the
>>> Solr schema into Lucene code.  A lot of users on this list (me included) do
>>> not have the skills required.  I know SolrJ reasonably well, but Lucene is
>>> a nut that I haven't cracked.
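
For anyone who does want to go the direct-Lucene route mentioned above, the core of it is an IndexWriter whose field definitions mirror the core's schema.xml and whose index directory is the core's data/index. A minimal sketch against the Lucene 4.x API of the era; the path and field names are hypothetical and would have to match the actual schema and analyzers exactly, or Solr will not search the resulting index correctly.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    import java.io.File;

    public class LocalLuceneIndexer {
        public static void main(String[] args) throws Exception {
            // Must point at the Solr core's index directory (placeholder path).
            FSDirectory dir = FSDirectory.open(new File("/var/solr/mycore/data/index"));
            IndexWriterConfig cfg = new IndexWriterConfig(
                    Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
            IndexWriter writer = new IndexWriter(dir, cfg);
            try {
                Document doc = new Document();
                // Field types must mirror schema.xml: "id" as a string field,
                // "body" as an analyzed text field (names are hypothetical).
                doc.add(new StringField("id", "doc-1", Field.Store.YES));
                doc.add(new TextField("body", "OCR text goes here", Field.Store.NO));
                writer.addDocument(doc);
                writer.commit();
            } finally {
                writer.close();
            }
        }
    }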
>>> 
>>> Thanks,
>>> Shawn
>>> 
>>> 
>> 
