On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
> Do you mean to tell me that the people on this list that are indexing
> 100s of millions of documents are doing this over http? I have been
> using custom Lucene code to index files, as I thought this would be
> faster for many documents and I wanted some non-standard OCR and index
> fields. Is there a better way?
>
> To the OP: You can also use Lucene to locally index files for Solr.
My sharded index has 94 million docs in it. All normal indexing and
maintenance is done with SolrJ, over HTTP. Currently, full rebuilds are
done with the dataimport handler loading from MySQL, but that is
legacy. This is NOT a SolrCloud installation. It is also not a
replicated setup -- my indexing program keeps both copies up to date
independently, similar to what happens behind the scenes with SolrCloud.
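
For anyone who hasn't used SolrJ, the basic indexing loop over HTTP is
quite short. Here's a bare-bones sketch, not my actual code -- the URL
and the field names are placeholders:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexOverHttp {
    public static void main(String[] args) throws Exception {
        // Point at one core/shard; in my setup each shard gets
        // its own server object.
        SolrServer server = new HttpSolrServer("http://solrhost:8983/solr/core1");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");          // placeholder uniqueKey
        doc.addField("title", "An example");  // placeholder field
        server.add(doc);
        server.commit();
        server.shutdown();
    }
}
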
The single-threaded DIH is very well optimized, and is faster than the
single-threaded indexing code I have written myself.
The real reason that we still use DIH for rebuilds is that I can run the
DIH simultaneously on all shards. A full rebuild that way takes about 5
hours. A SolrJ process feeding all shards with a single thread would
take a lot longer. Once I have time to work on it, I can make the SolrJ
rebuild multi-threaded, and I expect it will be similar to DIH in
rebuild speed. Hopefully I can make it faster.
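
To illustrate the idea, a multi-threaded rebuild might look roughly
like this -- one indexing thread per shard. The shard URLs and the
fetchDocsForShard() helper are made up for illustration; I haven't
actually written this yet:

import java.util.Collections;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelRebuild {
    public static void main(String[] args) throws Exception {
        String[] shardUrls = {
            "http://solrhost:8983/solr/shard0",
            "http://solrhost:8983/solr/shard1"
        };
        // One worker per shard, like running DIH on all shards at once.
        ExecutorService pool = Executors.newFixedThreadPool(shardUrls.length);
        for (final String url : shardUrls) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        HttpSolrServer server = new HttpSolrServer(url);
                        for (SolrInputDocument doc : fetchDocsForShard(url)) {
                            server.add(doc);
                        }
                        server.commit();
                        server.shutdown();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }

    // Hypothetical helper: would pull one shard's documents
    // from the source database.
    static Iterable<SolrInputDocument> fetchDocsForShard(String url) {
        return Collections.emptyList();
    }
}
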
There is always overhead with HTTP. On a gigabit LAN, I don't think
it's high enough to matter.
Using Lucene to index files for Solr is an option -- but that requires
writing a custom Lucene application and knowing how to translate the
Solr schema into equivalent Lucene code. A lot of users on this list (me included)
do not have the skills required. I know SolrJ reasonably well, but
Lucene is a nut that I haven't cracked.
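
As far as I understand it, the bare-bones Lucene equivalent looks
something like the sketch below -- but since I haven't cracked that
nut, treat it as a rough guess rather than working code. The index
path and the fields are placeholders:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneDirect {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(
            Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/index")), config);
        Document doc = new Document();
        // Every Solr schema field has to be translated by hand into
        // an equivalent Lucene field type -- that's the hard part.
        doc.add(new StringField("id", "doc-1", Field.Store.YES));
        doc.add(new TextField("title", "An example", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();
    }
}
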
Thanks,
Shawn