Useful tip Erik, this will save a lot of hassle.

Thank you much.

Regards,
Veselin K


On Tue, Apr 07, 2009 at 11:29:38AM -0400, Erik Hatcher wrote:
> Note that Solr (trunk, soon to be 1.4) has a duplicate detection feature 
> that may work for your need. See 
> http://wiki.apache.org/solr/Deduplication (looks like docs need updating 
> to say 1.4 here) and http://issues.apache.org/jira/browse/SOLR-799
>
>       Erik
>
>
> On Apr 7, 2009, at 11:25 AM, Veselin K wrote:
>
>> Thank you much Fergus,
>>
>> I was considering implementing a database which would hold a path name
>> and an MD5 sum of each file.
>>
>> Then as a part of Solr indexing, one could check against the DB if a
>> file path exists; if yes, then compare the MD5 and only index if
>> different.
>>
>>
>> Regards,
>> Veselin K
>>
>> On Tue, Apr 07, 2009 at 09:01:31AM +0100, Fergus McMenemie wrote:
>>> Veselin,
>>>
>>> Well, as far as Solr is concerned, there are two issues here:
>>>
>>> 1) To stop the same document ending up in the indexes twice, use the
>>>   document pathname as the unique ID. Then if you do index it twice,
>>>   the previous index information will be discarded. Not very
>>>   efficient, but it may be tolerable. IMHO using pathname as the
>>>   unique ID is often best practice.
>>>
>>> 2) To stop a document even being submitted to Solr, you need to
>>>   implement some middleware that either performs a search/lookup
>>>   using a document's pathname to see if it is already indexed or,
>>>   after examining timestamps, only submits documents which have
>>>   changed since the last folder scan.
>>>
>>> Fergus.
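
For reference, the path-plus-MD5 check Veselin describes above could be
sketched roughly as follows. This is only a minimal illustration, not
anything from Solr itself: it assumes SQLite as the database, and the
table name (indexed_files) and the helper names (open_store,
needs_indexing, mark_indexed) are all made up for the example. The actual
submission to Solr is left out; the idea is to call mark_indexed only
after Solr has accepted the document.

```python
import hashlib
import sqlite3


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 of a file, read in chunks to keep memory low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_store(db_path=":memory:"):
    """Open (or create) the hypothetical path -> MD5 lookup table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indexed_files ("
        "path TEXT PRIMARY KEY, md5 TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def needs_indexing(conn, path):
    """True if the path is unknown or its content changed since last time."""
    row = conn.execute(
        "SELECT md5 FROM indexed_files WHERE path = ?", (path,)
    ).fetchone()
    return row is None or row[0] != file_md5(path)


def mark_indexed(conn, path):
    """Record the current checksum once the document has been indexed."""
    conn.execute(
        "INSERT OR REPLACE INTO indexed_files (path, md5) VALUES (?, ?)",
        (path, file_md5(path)),
    )
    conn.commit()
```

Keeping the check and the bookkeeping as separate calls means a failed
submission does not leave a stale checksum behind, so the file is simply
retried on the next scan.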
