Sorry for sending the wrong message; this was meant to go to my own mailbox :(

2010/1/30 Wangsheng Mei <hairr...@gmail.com>
> Document Duplication Detection
>
> Solr1.4 <http://solr/Solr1.4>
>
> Contents
>
>    1. Document Duplication Detection
>    2. Overview
>       1. Goals
>       2. Design
>       3. Notes
>       4. Configuration
>          1. solrconfig.xml
>             1. Note
>          2. Settings
>
> Overview
>
> Preventing duplicate or near-duplicate documents from entering an index, or
> tagging documents with a signature/fingerprint for duplicate field
> collapsing, can be efficiently achieved with a low-collision or fuzzy hash
> algorithm. Solr should natively support deduplication techniques of this
> type and allow for the easy addition of new hash/signature implementations.
>
> Goals
>
>    - Efficient, hash-based exact/near document duplication detection and
>    blocking.
>    - Allow for both duplicate collapsing in search results as well as
>    deduplication on adding a document.
>
> Design
>
> Signature
>
> A class capable of generating a signature String from the concatenation of
> a group of specified document fields.
>
> public abstract class Signature {
>   public void init(SolrParams nl) {
>   }
>
>   public abstract String calculate(String content);
> }
>
> Implementations:
>
> MD5Signature
>
> 128 bit hash used for exact duplicate detection.
>
> Lookup3Signature <http://solr/Lookup3Signature>
>
> 64 bit hash used for exact duplicate detection, much faster than MD5 and
> smaller to index.
>
> TextProfileSignature <http://solr/TextProfileSignature>
>
> Fuzzy hashing implementation from Nutch for near-duplicate detection. It's
> tunable but works best on longer text.
>
> There are other more sophisticated algorithms for fuzzy/near hashing that
> could be added later.
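
To illustrate the Signature design quoted above, here is a minimal Java sketch of an exact-duplicate signature computed over a chosen set of fields. It is not Solr's code: SketchSignature, Md5SketchSignature and DedupeSketch.signatureFor are hypothetical names, init(SolrParams) is left out so the sketch compiles without Solr, and the simple in-order concatenation of field values is an assumption about how the fields might be combined.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the quoted Signature class (no Solr dependency).
abstract class SketchSignature {
    abstract String calculate(String content);
}

// Exact-duplicate signature: 128-bit MD5 rendered as 32 lowercase hex characters.
class Md5SketchSignature extends SketchSignature {
    @Override
    String calculate(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            // BigInteger(1, ...) treats the digest as unsigned; %032x keeps leading zeros.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}

class DedupeSketch {
    // Assumption: field values are concatenated in the configured order and
    // fields missing from the document are skipped.
    static String signatureFor(Map<String, String> doc, List<String> fields,
                               SketchSignature signature) {
        StringBuilder content = new StringBuilder();
        for (String field : fields) {
            String value = doc.get(field);
            if (value != null) {
                content.append(value);
            }
        }
        return signature.calculate(content.toString());
    }
}
```

With this sketch, two documents that differ only in a field outside the configured list (say, price when the list is name,cat) get the same signature, which is the property the dedupe step relies on.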
>
> Notes
>
> Adding in the dedupe process will change the allowDups setting so that it
> applies to an update Term (with field signatureField in this case) rather
> than the unique field Term (of course the signatureField could be the unique
> field, but generally you want the unique field to be unique).
>
> When a document is added, a signature will automatically be generated and
> attached to the document in the specified signatureField.
>
> Configuration
>
> solrconfig.xml
>
> The SignatureUpdateProcessorFactory
> <http://solr/SignatureUpdateProcessorFactory> has to be registered in
> solrconfig.xml as part of the UpdateRequest <http://solr/UpdateRequest>
> Chain:
>
> Accepting all defaults:
>
> <updateRequestProcessorChain name="dedupe">
>   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>   </processor>
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> Example settings:
>
> <!-- An example dedup update processor that creates the "id" field on the fly
>      based on the hash code of some other fields. This example has overwriteDupes
>      set to false since we are using the id field as the signatureField and Solr
>      will maintain uniqueness based on that anyway. -->
> <updateRequestProcessorChain name="dedupe">
>   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <bool name="overwriteDupes">false</bool>
>     <str name="signatureField">id</str>
>     <str name="fields">name,features,cat</str>
>     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> Note
>
> Also be sure to change your update handlers to use the defined chain, i.e.
>
> <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
>   <lst name="defaults">
>     <str name="update.processor">dedupe</str>
>   </lst>
> </requestHandler>
>
> The update processor can also be specified per request with a parameter of
> update.processor=dedupe.
>
> Settings
>
> | Setting        | Default                                                                           | Description |
> | signatureClass | org.apache.solr.update.processor.Lookup3Signature <http://solr/Lookup3Signature> | A Signature implementation for generating a signature hash. |
> | fields         | all fields                                                                        | The fields to use to generate the signature hash in a comma separated list. By default, all fields on the document will be used. |
> | signatureField | signatureField                                                                    | The name of the field used to hold the fingerprint/signature. Be sure the field is defined in schema.xml. |
> | enabled        | true                                                                              | Enable/disable dedupe factory processing. |
>
> --
> 梅旺生

--
梅旺生