This. And so much this. As much this as you can muster.

On Apr 7, 2014, at 1:49 PM, Michael Della Bitta
<michael.della.bi...@appinions.com> wrote:
> The speed of ingest via HTTP improves greatly once you do two things:
>
> 1. Batch multiple documents into a single request.
> 2. Index with multiple threads at once.
>
> Michael Della Bitta
> Applications Developer
> appinions inc.
>
> On Mon, Apr 7, 2014 at 12:40 PM, Daniel Collins <danwcoll...@gmail.com> wrote:
>
>> I have to agree with Shawn. We have a SolrCloud setup with 256 shards,
>> ~400M documents in total, with 4-way replication (so it's quite a big
>> setup!). I had thought that HTTP would slow things down, so we recently
>> trialed a JNI approach (our clients are C++) so we could call SolrJ and
>> get the benefits of JavaBin encoding for our indexing.
>>
>> Once we had done benchmarks with both solutions, I think we saved about
>> 1 ms per document (on average) with JNI, so it wasn't as big a gain as we
>> were expecting. There are other benefits to SolrJ (ZooKeeper integration,
>> better routing, etc.), and we were doing local HTTP (so it was literally
>> just a TCP connection to localhost, no actual network traffic), but that
>> just goes to prove what other posters have said here: check whether HTTP
>> really *is* the bottleneck before you try to replace it!
>>
>> On 7 April 2014 17:05, Shawn Heisey <s...@elyograg.org> wrote:
>>
>>> On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
>>>
>>>> Do you mean to tell me that the people on this list who are indexing
>>>> hundreds of millions of documents are doing it over HTTP? I have been
>>>> using custom Lucene code to index files, as I thought this would be
>>>> faster for many documents and I wanted some non-standard OCR and index
>>>> fields. Is there a better way?
>>>>
>>>> To the OP: You can also use Lucene to locally index files for Solr.
>>>
>>> My sharded index has 94 million docs in it. All normal indexing and
>>> maintenance is done with SolrJ, over HTTP. Currently full rebuilds are
>>> done with the dataimport handler loading from MySQL, but that is legacy.
>>> This is NOT a SolrCloud installation. It is also not a replicated setup
>>> -- my indexing program keeps both copies up to date independently,
>>> similar to what happens behind the scenes with SolrCloud.
>>>
>>> The single-threaded DIH is very well optimized, and is faster than what
>>> I have written myself -- also single-threaded.
>>>
>>> The real reason that we still use DIH for rebuilds is that I can run the
>>> DIH simultaneously on all shards. A full rebuild that way takes about 5
>>> hours. A SolrJ process feeding all shards with a single thread would
>>> take a lot longer. Once I have time to work on it, I can make the SolrJ
>>> rebuild multi-threaded, and I expect it will be similar to DIH in
>>> rebuild speed. Hopefully I can make it faster.
>>>
>>> There is always overhead with HTTP. On a gigabit LAN, I don't think it's
>>> high enough to matter.
>>>
>>> Using Lucene to index files for Solr is an option -- but that requires
>>> writing a custom Lucene application, and knowledge of how to turn the
>>> Solr schema into Lucene code. A lot of users on this list (me included)
>>> do not have the skills required. I know SolrJ reasonably well, but
>>> Lucene is a nut that I haven't cracked.
>>>
>>> Thanks,
>>> Shawn
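For anyone wondering what Michael's two suggestions look like in practice, here is a minimal SolrJ sketch using ConcurrentUpdateSolrServer, which gives you both batching and multi-threaded delivery in one class. The URL, core name, queue size, thread count, and field names are all placeholder assumptions, and the API shown is the 4.x-era SolrJ that matches this thread (later releases renamed these classes to *Client):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
    public static void main(String[] args) throws Exception {
        // Buffers up to 1000 documents and streams them to Solr from 4
        // background threads, covering both suggestions at once.
        ConcurrentUpdateSolrServer solr = new ConcurrentUpdateSolrServer(
                "http://localhost:8983/solr/collection1", 1000, 4);

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i)); // placeholder fields:
            doc.addField("title", "document " + i);  // match your own schema
            batch.add(doc);

            if (batch.size() == 500) { // one request per 500 docs, not per doc
                solr.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }

        solr.blockUntilFinished(); // drain the internal queue
        solr.commit();             // make the new documents searchable
        solr.shutdown();
    }
}
```

A plain HttpSolrServer driven by your own thread pool works just as well, and gives you explicit control over error handling; ConcurrentUpdateSolrServer is known for logging and swallowing indexing errors rather than surfacing them to the caller.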
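Daniel's point about SolrJ's ZooKeeper integration and better routing refers to CloudSolrServer (again, 4.x naming), which reads the live cluster state from ZooKeeper and, in the 4.5+ releases current at the time of this thread, hashes each document's id so updates head toward the owning shard rather than being forwarded server-side. A sketch, with a placeholder ZooKeeper ensemble and collection name:

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexer {
    public static void main(String[] args) throws Exception {
        // Connects to ZooKeeper (placeholder ensemble) instead of a fixed
        // Solr URL, so it always sees the current cluster layout.
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("collection1"); // placeholder collection

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "42");
        solr.add(doc); // routed by the hash of "id" toward the owning shard
        solr.commit();
        solr.shutdown();
    }
}
```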
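Shawn's rebuild trick, kicking off DIH on every shard at once, can also be scripted from SolrJ. A sketch assuming hypothetical shard core URLs and the handler mounted at the conventional /dataimport path; the full-import command returns immediately and runs in the background, which is exactly what makes the parallel kickoff possible:

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ParallelRebuild {
    public static void main(String[] args) throws Exception {
        // Placeholder shard URLs; substitute your own cores.
        String[] shards = {
            "http://idx1:8983/solr/shard1",
            "http://idx1:8983/solr/shard2",
            "http://idx2:8983/solr/shard3"
        };

        for (String url : shards) {
            HttpSolrServer solr = new HttpSolrServer(url);
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "full-import"); // DIH starts, returns at once
            QueryRequest req = new QueryRequest(params);
            req.setPath("/dataimport"); // handler path from solrconfig.xml
            req.process(solr);
            solr.shutdown();
        }
        // Poll each handler with command=status to watch the rebuilds finish.
    }
}
```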
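And for a sense of what "turning the Solr schema into Lucene code" involves, here is the smallest possible direct-Lucene sketch (Lucene 4.x API, hypothetical index path and field names). The difficulty Shawn alludes to is not this code itself; it is guaranteeing that every field's name, analyzer, and stored/indexed settings exactly mirror the Solr schema, because Solr will open an index whose contents don't match its schema and then quietly misbehave at query time:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class DirectLuceneIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder path: the data/index directory of the target Solr core.
        Directory dir = FSDirectory.open(new File("/path/to/core/data/index"));
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_47,
                new StandardAnalyzer(Version.LUCENE_47));
        IndexWriter writer = new IndexWriter(dir, cfg);

        Document doc = new Document();
        // Hypothetical fields: each must match the Solr schema's name,
        // analysis chain, and stored/indexed flags for Solr to use them.
        doc.add(new StringField("id", "doc-1", Field.Store.YES));
        doc.add(new TextField("text", "OCR output goes here", Field.Store.NO));
        writer.addDocument(doc);

        writer.close();
    }
}
```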