On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
> Do you mean to tell me that the people on this list that are indexing
> 100s of millions of documents are doing this over http? I have been
> using custom Lucene code to index files, as I thought this would be
> faster for many documents and I wanted some non-standard OCR and index
> fields. Is there a better way?
>
> To the OP: You can also use Lucene to locally index files for Solr.
My sharded index has 94 million docs in it. All normal indexing and
maintenance is done with SolrJ, over HTTP. Currently, full rebuilds are
done with the dataimport handler loading from MySQL, but that is
legacy. This is NOT a SolrCloud installation. It is also not a
replicated setup -- my indexing program keeps both copies up to date
independently, similar to what happens behind the scenes with SolrCloud.
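
For anyone who hasn't used SolrJ, the basic indexing loop over HTTP is
quite short. Here's a bare-bones sketch, not my actual code -- the URL
and the field names are placeholders:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexOverHttp {
    public static void main(String[] args) throws Exception {
        // Point at one core/shard; in my setup each shard gets
        // its own server object.
        SolrServer server = new HttpSolrServer("http://solrhost:8983/solr/core1");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");          // placeholder uniqueKey
        doc.addField("title", "An example");  // placeholder field
        server.add(doc);
        server.commit();
        server.shutdown();
    }
}
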
The single-threaded DIH is very well optimized, and is faster than the
single-threaded indexing code I have written myself.
The real reason that we still use DIH for rebuilds is that I can run the
DIH simultaneously on all shards. A full rebuild that way takes about 5
hours. A SolrJ process feeding all shards with a single thread would
take a lot longer. Once I have time to work on it, I can make the SolrJ
rebuild multi-threaded, and I expect it will be similar to DIH in
rebuild speed. Hopefully I can make it faster.
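
To illustrate the idea, a multi-threaded rebuild might look roughly
like this -- one indexing thread per shard. The shard URLs and the
fetchDocsForShard() helper are made up for illustration; I haven't
actually written this yet:

import java.util.Collections;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelRebuild {
    public static void main(String[] args) throws Exception {
        String[] shardUrls = {
            "http://solrhost:8983/solr/shard0",
            "http://solrhost:8983/solr/shard1"
        };
        // One worker per shard, like running DIH on all shards at once.
        ExecutorService pool = Executors.newFixedThreadPool(shardUrls.length);
        for (final String url : shardUrls) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        HttpSolrServer server = new HttpSolrServer(url);
                        for (SolrInputDocument doc : fetchDocsForShard(url)) {
                            server.add(doc);
                        }
                        server.commit();
                        server.shutdown();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }

    // Hypothetical helper: would pull one shard's documents
    // from the source database.
    static Iterable<SolrInputDocument> fetchDocsForShard(String url) {
        return Collections.emptyList();
    }
}
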
There is always overhead with HTTP. On a gigabit LAN, I don't think
it's high enough to matter.
Using Lucene to index files for Solr is an option -- but that requires
writing a custom Lucene application and knowing how to translate the
Solr schema into equivalent Lucene code. A lot of users on this list (me included)
do not have the skills required. I know SolrJ reasonably well, but
Lucene is a nut that I haven't cracked.
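
As far as I understand it, the bare-bones Lucene equivalent looks
something like the sketch below -- but since I haven't cracked that
nut, treat it as a rough guess rather than working code. The index
path and the fields are placeholders:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneDirect {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(
            Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/index")), config);
        Document doc = new Document();
        // Every Solr schema field has to be translated by hand into
        // an equivalent Lucene field type -- that's the hard part.
        doc.add(new StringField("id", "doc-1", Field.Store.YES));
        doc.add(new TextField("title", "An example", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();
    }
}
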
Thanks,
Shawn