Re: Storing the md5 hash of pdf files as a field in the index

2012-04-22 Thread kuchenbrett
Hi Lance, sounds interesting. The idea was to use a message digest (e.g. an MD5 hash) of a file to be indexed as a unique identifier to avoid duplicates. I wasn't aware of the de-duplication feature you mention. This feature seems to be the exact solution for my problem. In the Solr wiki I fo

Re: Storing the md5 hash of pdf files as a field in the index

2012-04-21 Thread Lance Norskog
The SignatureUpdateProcessor implements a smaller, faster cryptohash. It is used by the de-duplication feature. What's the purpose? Do you need the MD5 algorithm, or is any competent cryptohash good enough?

Re: Storing the md5 hash of pdf files as a field in the index

2012-04-21 Thread kuchenbrett
Hi Otis, thank you very much for the quick response to my question. I'll have a look at your suggested solution. Do you know if there's any documentation about writing such an Update Request Handler or how to trigger it using the Data Import/Tika combination? Thanks. Joe

Re: Storing the md5 hash of pdf files as a field in the index

2012-04-20 Thread Otis Gospodnetic
ay, April 20, 2012 10:07 AM >Subject: Storing the md5 hash of pdf files as a field in the index > >Hi, > >I want to build an index of quite a number of pdf and msword files using the >Data Import Request Handler and the Tika Entity Processor. It works very well. >Now I would li

Storing the md5 hash of pdf files as a field in the index

2012-04-20 Thread kuchenbrett
Hi, I want to build an index of quite a number of pdf and msword files using the Data Import Request Handler and the Tika Entity Processor. It works very well. Now I would like to use the md5 digest of the binary (pdf/word) file as the unique key in the index. But I do not know how to implem
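
For readers following the thread, here is a minimal sketch of the idea in the original question: compute the MD5 digest of a file's raw bytes and use it as the document key, so that byte-identical files map to the same entry. It runs outside Solr and does not show how to wire the digest into the Data Import Request Handler or Tika. The function names and the `id` and `path` field names are hypothetical, chosen only for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's raw bytes, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDF/Word files are not loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_documents(paths):
    """Build one document per distinct digest; byte-identical files collapse.

    The 'id' and 'path' field names are assumptions, not taken from any
    particular Solr schema.
    """
    documents = {}
    for path in paths:
        key = md5_of_file(path)
        # Keep the first file seen for each digest and skip later duplicates.
        documents.setdefault(key, {"id": key, "path": str(path)})
    return documents
```

Calling `build_documents` on a list of file paths yields a dict keyed by digest, and its values could then be sent to the index by whatever client is in use. This only catches files that are identical byte for byte; two PDFs with the same text but different metadata would get different digests.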