Hi,

In my old Solr setup I used the deduplication feature in the update chain
with a couple of fields:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">uuid,type,url,content_hash</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
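
(For context: a non-default chain like this has to be selected explicitly, either
per request via the update.chain parameter or in the handler defaults in
solrconfig.xml. A minimal sketch; the handler name and class follow the stock
example config of this Solr version, my actual handler may differ:

<requestHandler name="/update/csv" class="solr.CSVRequestHandler">
  <lst name="defaults">
    <!-- route all CSV updates through the dedupe chain above -->
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>
)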

This worked fine. But when I use the same chain in my two-shard SolrCloud
setup and insert 150,000 documents, I always get this error:

INFO: end_commit_flush
Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create new native thread
        at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284)

I am inserting the documents via CSV import with curl, split into 50k chunks.

Without the dedupe chain, the import finishes in about 40 seconds.

The curl command writes to one of my shards.
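The import command looks roughly like this (host, collection name, and file
name are placeholders, not my exact values):

curl 'http://host:8983/solr/collection1/update/csv?commit=true' \
     -H 'Content-type: text/csv; charset=utf-8' \
     --data-binary @chunk-01.csv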


Do you have an idea why this happens? Should I reduce the signature fields to
just one? I have read that not using the id field among the dedupe fields
could be an issue; is that right?
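
If reducing to one field is the way to go, I assume I would only change the
fields parameter in the chain above, e.g. (content_hash picked just as an
example from my field list):

<str name="fields">content_hash</str>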


I have also searched for deduplication with SolrCloud and am wondering whether
it works correctly there yet; see e.g.
http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html

Thanks & regards

Daniel
