On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
Do you mean to tell me that the people on this list that are indexing 100s of 
millions of documents are doing this over http?  I have been using custom 
Lucene code to index files, as I thought this would be faster for many 
documents and I wanted some non-standard OCR and index fields.  Is there a 
better way?

To the OP: You can also use Lucene to locally index files for Solr.

My sharded index has 94 million docs in it. All normal indexing and maintenance is done with SolrJ, over HTTP. Currently full rebuilds are done with the dataimport handler loading from MySQL, but that is legacy. This is NOT a SolrCloud installation. It is also not a replicated setup -- my indexing program keeps both copies of the index up to date independently, similar to what happens behind the scenes with SolrCloud.
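
For reference, the normal SolrJ indexing path is only a few lines. This is a minimal sketch against the 4.x API; the URL, core name, and field names are placeholders for whatever your install and schema actually use:

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import java.io.IOException;

    public class IndexOneDoc {
        public static void main(String[] args)
                throws SolrServerException, IOException {
            // Core URL is a placeholder; use your own host and core.
            HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");            // field names must exist
            doc.addField("title", "Example title"); // in your Solr schema
            server.add(doc);
            server.commit();   // or use commitWithin / autoCommit instead
            server.shutdown();
        }
    }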

The single-threaded DIH is very well optimized, and is faster than the single-threaded indexing code I have written myself.

The real reason that we still use DIH for rebuilds is that I can run the DIH simultaneously on all shards. A full rebuild that way takes about 5 hours. A SolrJ process feeding all shards from a single thread would take a lot longer. Once I have time to work on it, I can make the SolrJ rebuild multi-threaded, and I expect it will be similar to DIH in rebuild speed, hopefully faster. A rough sketch of the threading approach is below.
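
Roughly, it would look like this -- one feeder thread per shard, each using a ConcurrentUpdateSolrServer so document batches stream in the background. The shard URLs and the database-fetch method are placeholders, not my actual code:

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelRebuild {
        // Shard URLs are made up for the example.
        static final List<String> SHARDS = Arrays.asList(
            "http://idxhost1:8983/solr/s0",
            "http://idxhost2:8983/solr/s1");

        public static void main(String[] args) {
            ExecutorService pool = Executors.newFixedThreadPool(SHARDS.size());
            for (final String url : SHARDS) {
                pool.submit(new Runnable() {
                    public void run() { rebuildShard(url); }
                });
            }
            pool.shutdown();
        }

        static void rebuildShard(String url) {
            // Queue 1000 docs, send with 4 background threads per shard.
            ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer(url, 1000, 4);
            try {
                for (SolrInputDocument doc : docsForShard(url)) {
                    server.add(doc);
                }
                server.blockUntilFinished();
                server.commit();
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                server.shutdown();
            }
        }

        // Placeholder: pull this shard's rows from the database and
        // convert them to SolrInputDocument objects.
        static Iterable<SolrInputDocument> docsForShard(String url) {
            throw new UnsupportedOperationException("not written yet");
        }
    }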

There is always overhead with HTTP. On a gigabit LAN, I don't think it's high enough to matter.

Using Lucene to index files for Solr is an option -- but that requires writing a custom Lucene application, and knowledge about how to turn the Solr schema into Lucene code. A lot of users on this list (me included) do not have the skills required. I know SolrJ reasonably well, but Lucene is a nut that I haven't cracked.
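
For anyone who does go that route, the skeleton of such an application is short; the hard part is everything it glosses over. A sketch assuming Lucene 4.7, a hypothetical index path, and StandardAnalyzer standing in for whatever analysis chains the schema really defines:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;
    import java.io.File;

    public class LocalLuceneIndexer {
        public static void main(String[] args) throws Exception {
            // Path is a placeholder -- point it at the core's data/index
            // dir, and only write while Solr is not using that core.
            FSDirectory dir =
                FSDirectory.open(new File("/index/solr/core1/data/index"));

            // StandardAnalyzer stands in here; the real analyzer chain
            // must match the fieldTypes in schema.xml, or searches done
            // through Solr will not behave correctly.
            IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
            IndexWriter writer = new IndexWriter(dir, cfg);

            Document doc = new Document();
            doc.add(new StringField("id", "doc-1", Field.Store.YES));
            doc.add(new TextField("title", "Example title", Field.Store.YES));
            writer.addDocument(doc);

            writer.close();
            dir.close();
        }
    }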

Thanks,
Shawn
