Hi Lance,
Sounds interesting. The idea was to use a message digest (e.g. an MD5 hash) of
a file to be indexed as a unique identifier to avoid duplicates. I wasn't
aware of the de-duplication feature you mention. This feature seems to be the
exact solution for my problem. In the Solr wiki I fo
The SignatureUpdateProcessor implements a smaller, faster cryptohash.
It is used by the de-duplication feature.
What's the purpose? Do you need the MD5 algorithm, or is any competent
cryptohash good enough?
On Sat, Apr 21, 2012 at 5:55 AM, wrote:
Hi Otis,
Thank you very much for the quick response to my question. I'll have a look at
your suggested solution. Do you know if there's any documentation about writing
such an Update Request Handler or how to trigger it using the Data Import/Tika
combination?
Thanks.
Joe
Friday, April 20, 2012 10:07 AM
>Subject: Storing the md5 hash of pdf files as a field in the index
Hi,
I want to build an index of quite a number of PDF and MS Word files using the
Data Import Request Handler and the Tika Entity Processor. It works very well.
Now I would like to use the MD5 digest of the binary (PDF/Word) file as the
unique key in the index. But I do not know how to implem
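
For reference, the digest step on its own might look something like the sketch below. It only covers computing the hex digest of a file's raw bytes with the JDK's `MessageDigest`; it does not show how to wire the value into the Data Import Handler or an update processor. The class name `FileDigest` and the method `hexDigest` are made up for illustration and are not part of Solr or Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Solr or Tika.
public class FileDigest {

    // Streams the file through the named digest (e.g. "MD5") and
    // returns the result as a lowercase hex string.
    static String hexDigest(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            // Read in chunks so large PDF/Word files are not loaded into memory at once.
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java FileDigest <file>");
            return;
        }
        // The printed string is what one would send as the unique key field value.
        System.out.println(hexDigest(Paths.get(args[0]), "MD5"));
    }
}
```

Two byte-identical files give the same string, so using it as the unique key would make a re-indexed copy overwrite the earlier document. Passing another algorithm name that the JDK supports, such as "SHA-256", works the same way if MD5 itself is not a requirement.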