Hi Lance,

sounds interesting. The idea was to use a message digest (e.g. an MD5 hash) of each file to be indexed as a unique identifier, to avoid duplicates. I wasn't aware of the de-duplication feature you mention. It seems to be exactly the solution to my problem. In the Solr wiki I found some samples of how to configure it and trigger it when calling an XmlUpdateRequestHandler. I guess I can use it in a similar way when calling the DataImportHandler, correct?
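For reference, the digest idea I had in mind looks roughly like the sketch below. It is only an illustration in Python, not anything Solr-specific: the function name file_md5 and the chunk size are my own assumptions, and any other stable hash could be swapped in.

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of the file's bytes (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical bytes get the same digest, so using that value as the document's unique key would make a second copy overwrite the first instead of being added as a duplicate.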
Many thanks for your suggestion.

Joe

> The SignatureUpdateProcessor implements a smaller, faster cryptohash. It
> is used by the de-duplication feature.
>
> What's the purpose? Do you need the MD5 algorithm, or is any competent
> cryptohash good enough?
>
> On Sat, Apr 21, 2012 at 5:55 AM, <kuchenbr...@mail.org> wrote:
> > Hi Otis,
> >
> > thank you very much for the quick response to my question. I'll have a
> > look at your suggested solution. Do you know if there's any
> > documentation about writing such an Update Request Handler or how to
> > trigger it using the Data Import/Tika combination?
> >
> > Thanks.
> > Joe