Re: Storing the md5 hash of pdf files as a field in the index

Otis Gospodnetic Fri, 20 Apr 2012 20:17:13 -0700

Hi Joe,

You could write a custom URP - Update Request Processor.  This URP would take 
the value from one SolrInputDocument field (say, the one that holds the full 
path to your PDF and is thus unique), compute the MD5 with the standard Java 
API (java.security.MessageDigest), and put the result in a field that you've 
defined as a string type to hold it.
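
The MD5 step itself could look roughly like the sketch below. It uses only 
the JDK (java.security.MessageDigest) and streams the file, so it yields the 
same digest md5sum would print. The class name Md5Util and its method names 
are made up for illustration; the Solr wiring is not shown. In a URP you 
would presumably call md5Hex(Path) from processAdd, with the path read from 
whichever field holds it, and write the result to your hash field.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Solr: computes MD5 digests as lowercase hex.
final class Md5Util {

    // Every Java platform is required to provide MD5, so a missing
    // algorithm is treated as a broken runtime rather than a checked error.
    static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Render digest bytes as a 32-character lowercase hex string.
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // MD5 of an in-memory byte array.
    static String md5Hex(byte[] data) {
        return toHex(newMd5().digest(data));
    }

    // MD5 of a file's binary content, read in chunks so that large
    // PDF/Word files are never loaded into memory at once.
    static String md5Hex(Path file) throws IOException {
        MessageDigest md = newMd5();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return toHex(md.digest());
    }

    // Command-line use: prints "<hash>  <path>" for each argument.
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(md5Hex(Paths.get(arg)) + "  " + arg);
        }
    }
}
```

Note that hashing the path string and hashing the file content give different 
values; for a key that identifies the binary file, the Path variant is the 
one to use.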


Otis
----
Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html



>________________________________
> From: "kuchenbr...@mail.org" <kuchenbr...@mail.org>
>To: solr-user@lucene.apache.org 
>Sent: Friday, April 20, 2012 10:07 AM
>Subject: Storing the md5 hash of pdf files as a field in the index
> 
>Hi,
>
>I want to build an index of quite a number of pdf and msword files using the 
>Data Import Request Handler and the Tika Entity Processor. It works very well. 
>Now I would like to use the md5 digest of the binary (pdf/word) file as the 
>unique key in the index. But I do not know how to implement this. In the 
>data-config.xml configuring the FileListEntityProcessor I have access to the 
>absolute file name of a pdf to be indexed. I'm sitting on a Linux box and so 
>there is an easy way to calculate the md5 hash using the operating system 
>command md5sum. But how can I trigger this calculation and store the result 
>as a field in my index?
>
>Any tips or ideas are really appreciated.
>
>Thanks.
>Joe
>
>
>
