Sorry for sending the wrong message; this should have gone to my own mailbox :(

2010/1/30 Wangsheng Mei <hairr...@gmail.com>

> Document Duplication Detection
>
> Solr1.4 <http://solr/Solr1.4>
>
> Contents
>
>    1. Document Duplication Detection <#1267b655a97b48f5_Document_Duplication_Detection>
>    2. Overview <#1267b655a97b48f5_Overview>
>       1. Goals <#1267b655a97b48f5_Goals>
>       2. Design <#1267b655a97b48f5_Design>
>    3. Notes <#1267b655a97b48f5_Notes>
>    4. Configuration <#1267b655a97b48f5_Configuration>
>       1. solrconfig.xml <#1267b655a97b48f5_solrconfig.xml>
>          1. Note <#1267b655a97b48f5_Note>
>       2. Settings <#1267b655a97b48f5_Settings>
>
>  Overview
>
> Preventing duplicate or near-duplicate documents from entering an index, or
> tagging documents with a signature/fingerprint for duplicate field
> collapsing, can be achieved efficiently with a low-collision or fuzzy hash
> algorithm. Solr should natively support deduplication techniques of this
> type and allow for the easy addition of new hash/signature implementations.
>
> Goals
>
>    - Efficient, hash-based detection and blocking of exact and
>    near-duplicate documents.
>    - Allow for both duplicate collapsing in search results and
>    deduplication when a document is added.
>
>  Design
>
> Signature
>
> A class capable of generating a signature String from the concatenation of
> a group of specified document fields.
>
> public abstract class Signature {
>   public void init(SolrParams nl) {
>   }
>
>   public abstract String calculate(String content);
> }
>
> Implementations:
>
> MD5Signature
>
> 128-bit hash used for exact duplicate detection.
>
> Lookup3Signature <http://solr/Lookup3Signature>
>
> 64-bit hash used for exact duplicate detection; much faster than MD5 and
> smaller to index.
>
> TextProfileSignature <http://solr/TextProfileSignature>
>
> Fuzzy hashing implementation from Nutch for near-duplicate detection. It is
> tunable but works best on longer text.
>
> There are other more sophisticated algorithms for fuzzy/near hashing that
> could be added later.
>
> Notes
>
> Adding the dedupe process changes the allowDups setting so that it applies
> to an update Term (with the field signatureField in this case) rather than
> the unique field Term. The signatureField could of course be the unique
> field, but generally you want the unique field to be unique.
>
> When a document is added, a signature will automatically be generated and
> attached to the document in the specified signatureField.
>
> Configuration
>
> solrconfig.xml
>
> The SignatureUpdateProcessorFactory <http://solr/SignatureUpdateProcessorFactory>
> has to be registered in solrconfig.xml as part of the
> UpdateRequest <http://solr/UpdateRequest> Chain:
>
> Accepting all defaults:
>
>   <updateRequestProcessorChain name="dedupe">
>     <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>     </processor>
>     <processor class="solr.RunUpdateProcessorFactory" />
>   </updateRequestProcessorChain>
>
> Example settings:
>
>   <!-- An example dedup update processor that creates the "id" field on the fly
>        based on the hash code of some other fields.  This example has
>        overwriteDupes set to false since we are using the id field as the
>        signatureField and Solr will maintain uniqueness based on that anyway. -->
>   <updateRequestProcessorChain name="dedupe">
>     <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>       <bool name="enabled">true</bool>
>       <bool name="overwriteDupes">false</bool>
>       <str name="signatureField">id</str>
>       <str name="fields">name,features,cat</str>
>       <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
>     </processor>
>     <processor class="solr.LogUpdateProcessorFactory" />
>     <processor class="solr.RunUpdateProcessorFactory" />
>   </updateRequestProcessorChain>
>
>  Note
>
> Also be sure to change your update handlers to use the defined chain, i.e.
>
>   <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
>     <lst name="defaults">
>       <str name="update.processor">dedupe</str>
>     </lst>
>   </requestHandler>
>
> The update processor chain can also be specified per request with the
> parameter update.processor=dedupe.
>
> Settings
>
> *Setting*: signatureClass
> *Default*: org.apache.solr.update.processor.Lookup3Signature <http://solr/Lookup3Signature>
> *Description*: A Signature implementation for generating a signature hash.
>
> *Setting*: fields
> *Default*: all fields
> *Description*: The fields to use to generate the signature hash, as a
> comma-separated list. By default, all fields on the document will be used.
>
> *Setting*: signatureField
> *Default*: signatureField
> *Description*: The name of the field used to hold the fingerprint/signature.
> Be sure the field is defined in schema.xml.
>
> *Setting*: enabled
> *Default*: true
> *Description*: Enable/disable dedupe factory processing.
>
>
> --
> 梅旺生
>
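
For reference, here is a minimal Java sketch of the Signature contract quoted above, with an MD5-based implementation for exact duplicate detection. It is only an illustration under stated assumptions: SignatureSketch, Md5SignatureSketch, and the concatFields helper are hypothetical names, not Solr's actual classes, and the init(SolrParams) hook is left out so that the example compiles with nothing but the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Signature contract quoted above.
// The quoted class also has init(SolrParams); it is omitted here
// to avoid a dependency on Solr.
abstract class SignatureSketch {
    public abstract String calculate(String content);
}

// Exact-duplicate signature: 128-bit MD5 digest rendered as 32 hex characters.
class Md5SignatureSketch extends SignatureSketch {
    @Override
    public String calculate(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: concatenate the configured fields in order,
    // skipping fields that are missing from the document.
    static String concatFields(Map<String, String> doc, List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            String value = doc.get(field);
            if (value != null) {
                sb.append(value);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("name", "Solr in a box");
        doc.put("cat", "software");
        List<String> fields = Arrays.asList("name", "features", "cat");
        String signature = new Md5SignatureSketch().calculate(concatFields(doc, fields));
        System.out.println("signature = " + signature);
    }
}
```

Two documents whose configured fields concatenate to the same string get the same signature, which is what lets an exact-match hash stand in for a full content comparison. A fuzzy signature such as TextProfileSignature would replace only the calculate step.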



-- 
梅旺生
