Earlier I indexed with the HttpPost mechanism only, keeping each post between 2 MB and 20 MB, and that was going fine. But we suspected that, instead of indexing through network calls (which of course add latency from network delays and the HTTP protocol), it would be much better to index offline by just writing the index and dumping it to the shards.
That said, I am currently committing in batches of 25K docs, which I will try to replace with commitWithin (it seems to work faster), or I will probably have a look at this binary protocol.

Thanks!

On Fri, Jun 6, 2014 at 5:55 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> On Fri, 2014-06-06 at 14:05 +0200, Vineet Mishra wrote:
>
> > Could you state what indexing mechanism are you using, as I started
> > with EmbeddedSolrServer but it was pretty slow after a few GB(~30+) of
> > indexing.
>
> I suspect that is due to too-frequent commits, too small heap or
> something third, unrelated to EmbeddedSolrServer itself. Underneath the
> surface it is just the same as a standalone Solr.
>
> We're building our ~1TB indexes individually, using standalone workers
> for the heavy part of the analysis (Tika). The delivery from the workers
> to the Solr server is over the network, using the Solr binary protocol.
> My colleague Thomas Egense just created a small write-up at
> https://github.com/netarchivesuite/netsearch
>
> > I started indexing 1 week back and still its 37GB, although I assume
> > HttpPost mechanism will perform lethargic slow due to network latency
> > and for the response await.
>
> Maybe if you send the documents one at a time, but if you bundle them in
> larger updates, the post-method should be fine.
>
> - Toke Eskildsen, State and University Library, Denmark
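
P.S. For anyone following the thread, here is a rough sketch of what bundling documents into larger posts with commitWithin could look like, using only the Python standard library. It is not our actual indexer. The core URL, the field names, the 60 second commitWithin value and the helper names `batches` and `build_update_request` are all assumptions made up for illustration. It posts JSON to `/update`; depending on the Solr version, the JSON handler may instead live at `/update/json`. The request is only built, not sent.

```python
import json
import urllib.parse
import urllib.request
from itertools import islice


def batches(docs, size):
    """Yield successive lists of at most `size` documents (hypothetical helper)."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_update_request(base_url, docs, commit_within_ms):
    """Build (but do not send) one JSON update request for a batch.

    commitWithin asks Solr to commit within the given number of
    milliseconds, so the client does not issue an explicit commit per batch.
    """
    query = urllib.parse.urlencode({"commitWithin": commit_within_ms})
    return urllib.request.Request(
        url=f"{base_url}/update?{query}",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed core URL and made-up documents, for illustration only.
    solr = "http://localhost:8983/solr/collection1"
    docs = ({"id": str(i), "title_s": f"doc {i}"} for i in range(60_000))
    for batch in batches(docs, 25_000):
        req = build_update_request(solr, batch, commit_within_ms=60_000)
        print(req.full_url, len(batch))
        # urllib.request.urlopen(req)  # uncomment to actually send the batch
```

The batch size of 25K mirrors the commit batch mentioned above; whether that size or the 2 MB to 20 MB post size is the better limit would need measuring against the real data.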