Useful tip, Erik; this will save a lot of hassle. Thank you very much.

Regards,
Veselin K

On Tue, Apr 07, 2009 at 11:29:38AM -0400, Erik Hatcher wrote:
> Note that Solr (trunk, soon to be 1.4) has a duplicate detection feature
> that may work for your need. See
> http://wiki.apache.org/solr/Deduplication (looks like docs need updating
> to say 1.4 here) and http://issues.apache.org/jira/browse/SOLR-799
>
> Erik
>
>
> On Apr 7, 2009, at 11:25 AM, Veselin K wrote:
>
>> Thank you very much, Fergus,
>>
>> I was considering implementing a database which would hold a path name
>> and an MD5 sum of each file.
>>
>> Then, as part of Solr indexing, one could check against the DB whether
>> a file path exists; if yes, compare the MD5 sums and only index if
>> they are different.
>>
>>
>> Regards,
>> Veselin K
>>
>> On Tue, Apr 07, 2009 at 09:01:31AM +0100, Fergus McMenemie wrote:
>>> Veselin,
>>>
>>> Well, as far as Solr is concerned, there are two issues here:
>>>
>>> 1) To stop the same document ending up in the indexes twice, use the
>>> document pathname as the unique ID. Then if you do index it twice,
>>> the previous index information will be discarded. Not very
>>> efficient, but it may be tolerable. IMHO using pathname as the
>>> unique ID is often best practice.
>>>
>>> 2) To stop a document even being submitted to Solr, you need to
>>> implement some middleware that either performs a search/lookup
>>> using a document's pathname to see if it is already indexed, or,
>>> after examining timestamps, only submits documents which have
>>> changed since the last folder scan.
>>>
>>> Fergus.
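
For the archives, here is a rough sketch of the path + MD5 check discussed in the thread above, in case it helps anyone who cannot use the built-in deduplication. It is only an illustration under my own assumptions: the SQLite table `indexed_files` and the helpers `open_store`, `digest_if_changed` and `record_indexed` are hypothetical names, and the actual submission to Solr is left out entirely.

```python
import hashlib
import sqlite3


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_store(db_path=":memory:"):
    """Open (or create) the hypothetical path -> MD5 lookup table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indexed_files ("
        "path TEXT PRIMARY KEY, md5 TEXT NOT NULL)"
    )
    return conn


def digest_if_changed(conn, path):
    """Return the file's MD5 if it is new or changed, else None."""
    current = file_md5(path)
    row = conn.execute(
        "SELECT md5 FROM indexed_files WHERE path = ?", (path,)
    ).fetchone()
    if row is not None and row[0] == current:
        return None  # same path, same content: skip indexing
    return current


def record_indexed(conn, path, digest):
    """Remember the digest once the document has been indexed."""
    conn.execute(
        "INSERT OR REPLACE INTO indexed_files (path, md5) VALUES (?, ?)",
        (path, digest),
    )
    conn.commit()
```

The idea would be to call `digest_if_changed` for each file found during a folder scan, submit the file to Solr only when a digest comes back, and call `record_indexed` after the submission succeeds, so that a failed submission is retried on the next scan.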