Hi

I am trying to index 12MM docs faster than is currently happening in Solr
(using SolrJ). We have identified Solr's add method as the bottleneck (not
commit, which is tuned reasonably well through mergeFactor, ramBufferSizeMB,
and the JVM heap size).
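
For what it's worth, the relevant bits of our solrconfig.xml look something
like this (the exact values here are illustrative, not our production
settings):

    <indexDefaults>
      <ramBufferSizeMB>256</ramBufferSizeMB>
      <mergeFactor>25</mergeFactor>
    </indexDefaults>
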
Adding 1000 docs takes approximately 25 seconds. We are making sure we add
and commit in batches, and we've tried both CommonsHttpSolrServer and
EmbeddedSolrServer (assuming that removing the HTTP overhead by embedding
would speed things up), but the difference is marginal.
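
Our add loop is essentially the following, much simplified (MyDoc stands in
for our own document class, and in reality we also commit periodically rather
than only once at the end):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {

        private static final int BATCH_SIZE = 1000;

        // Stand-in for our own document class.
        public interface MyDoc {
            String getId();
            String getContent();
        }

        public static void index(Iterable<MyDoc> docs) throws Exception {
            SolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            List<SolrInputDocument> batch =
                new ArrayList<SolrInputDocument>(BATCH_SIZE);
            for (MyDoc d : docs) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", d.getId());
                doc.addField("content", d.getContent());
                // ~18 other fields added the same way;
                // shingledContent is filled server-side via copyField
                batch.add(doc);
                if (batch.size() == BATCH_SIZE) {
                    server.add(batch);   // <-- this is where the time goes
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.commit();
        }
    }
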

The docs being indexed contain on average 20 fields, mostly indexed but none
stored. The major size contributors are two fields (declared as shown below):

        - content, and
        - shingledContent (populated using copyField of content).
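
In schema.xml the two fields are declared roughly as follows (the type names
here are placeholders):

    <field name="content" type="text" indexed="true" stored="false"/>
    <field name="shingledContent" type="shingledText" indexed="true"
           stored="false"/>
    <copyField source="content" dest="shingledContent"/>
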

The length of the content field is (likely) Gaussian distributed (a few large
docs of 50-80K tokens, but the majority around 2K tokens). We use
shingledContent to support phrase queries and content for unigram queries
(following the advice of Solr 1.4 Enterprise Search Server, p. 305, section
"The Solution: Shingling").

Clearly the size of the docs is a contributor to the slow adds (confirmed by
removing these 2 fields, which halved the indexing time). We've also tried
compressed=true, but that did not work.

Any guidance on how to support our application logic (without having to
change the schema too much) and speed up the indexing (at the current rate of
25 seconds per 1000 docs, the full 12MM takes roughly 3 1/2 days) would be
much appreciated.

thank you

Peyman 
