I'd consider using a separate Java program that uses Tika directly, or
one of various services. Then you can assemble whatever you please
before sending the doc to Solr. There are multiple reasons to
recommend this; see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

There are other reasons why using extractingRequestHandler is
problematic in production, the biggest one being that it can blow up
your server, because the Tika parsing runs inside Solr itself. Tika has
to try to cope with every variant of every document format it
processes, and I personally guarantee that the implementation from
company X (which is no longer in business) that wrote a PDF file (from a
spec current 10 years ago) may "interpret" that spec...er...freely ;)
And Tika then has to try to cope. It does a brilliant job, but there's
always going to be case N+1.

The implication, of course, is that extractingRequestHandler is largely
a PoC tool IMO. It gets people going without having to write an external
program, but it's not something I'd recommend for production.
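
For the hashing part of the question, here is a rough sketch of what the
client side could look like. It only uses the JDK: it streams the raw
bytes through SHA-256 and collects the fields you would then send to Solr.
The class name FileHasher, the helper buildFields, and the field names
(id, sha256_s, content_txt) are made up for illustration, and the
extracted text is passed in as a plain string where a real program would
get it from Tika. Treat it as a starting point, not a finished indexer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; the name is not part of Solr, SolrJ, or Tika.
final class FileHasher {

    // Streams the raw bytes through SHA-256 and returns the digest as lowercase hex.
    static String sha256Hex(InputStream in) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assembles the fields client-side. The field names are assumptions and
    // must match whatever your schema defines. In a real program,
    // extractedText would come from Tika and the map would be copied into a
    // SolrJ document before sending it to Solr.
    static Map<String, Object> buildFields(String id, byte[] raw, String extractedText) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("id", id);
        fields.put("sha256_s", sha256Hex(new ByteArrayInputStream(raw)));
        fields.put("content_txt", extractedText);
        return fields;
    }

    public static void main(String[] args) {
        byte[] raw = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(buildFields("doc-1", raw, "abc"));
    }
}
```

Since the hash is computed from the same bytes you hand to Tika, it
reflects the file as uploaded and not the extracted text, which is what
the original question asks for.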

Best,
Erick

On Thu, May 24, 2018 at 10:06 PM, Thomas Lustig <tm.lus...@gmail.com> wrote:
> Dear community,
>
> I would like to automatically add a SHA-256 file hash to a document field
> after a binary file is posted to an ExtractingRequestHandler.
> At first I thought that the ExtractingRequestHandler had such a feature, but
> so far I have not found a configuration for it.
> It was mentioned that I should implement my own Update Request Processor
> to calculate the hash and add it to a field.
> The SignatureUpdateProcessor seemed to be an out-of-the-box option, but it
> only supports md5 and also does not access the raw binary stream.
>
> The important thing is that I need the binary stream of the uploaded
> file to calculate a correct hash value (e.g. md5, sha256, ...).
> Is it possible to arrange this with a ScriptUpdateProcessor and
> JavaScript as well?
>
> thanks in advance for any help
>
> Tom
