Re: Indexing part of Binary Documents and not the entire contents

Erick Erickson Tue, 26 Jun 2018 06:45:01 -0700

Well, if you were using ERH (ExtractingRequestHandler) you'd have the
same problem, since it uses Tika. At least if you run Tika on a client
somewhere and hit a document that blows out memory or has some other
problem, your client can crash without taking Solr down with it.

That's one of the reasons, in fact, that we don't recommend running
ERH in prod.

And I should point out that this is not a flaw in Tika. Rather, the
problem Tika has to cope with is immense.

And even a cursory look at Tika shows a streaming interface, see:
https://tika.apache.org/1.8/examples.html#Streaming_the_plain_text_in_chunks
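
As a rough sketch of the idea (not the code from that page): Tika's
parse() takes an org.xml.sax.ContentHandler, so extracted text arrives
through characters() callbacks a piece at a time, and a handler can keep
just the first part and abort the rest. The class name LimitedTextHandler
and the 5-character limit are made up for illustration, and main() calls
characters() directly as a stand-in for a real parse, so it runs with the
JDK alone. I'm assuming the parser lets the SAXException propagate to the
caller, so check that against your Tika version.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (name and limit are illustrative, not Tika API):
// keeps at most maxChars characters of extracted text, then aborts.
class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int maxChars;
    private boolean limitReached = false;

    LimitedTextHandler(int maxChars) {
        this.maxChars = maxChars;
    }

    // Called by the SAX event source once per run of character data.
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int remaining = Math.max(maxChars - text.length(), 0);
        int toCopy = Math.min(length, remaining);
        text.append(ch, start, toCopy);
        if (toCopy < length) {
            // Some text had to be dropped: record it and stop the parse.
            limitReached = true;
            throw new SAXException("text limit of " + maxChars + " chars reached");
        }
    }

    String getText() {
        return text.toString();
    }

    boolean isLimitReached() {
        return limitReached;
    }

    public static void main(String[] args) {
        LimitedTextHandler handler = new LimitedTextHandler(5);
        char[] chunk = "hello world".toCharArray();
        try {
            // Stand-in for: autoParser.parse(input, handler, metadata, context);
            handler.characters(chunk, 0, chunk.length);
        } catch (SAXException e) {
            // Expected once the cap is hit; the partial text is still usable.
        }
        System.out.println(handler.getText()); // prints "hello"
    }
}
```

With something like this the client holds at most maxChars characters of
extracted text, however large the document is. The parser itself may
still need a good deal of memory for some formats, so it's still worth
keeping it out of the Solr JVM.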

Best,
Erick

On Tue, Jun 26, 2018 at 6:28 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 6/26/2018 7:13 AM, neotorand wrote:
>>
>> Don't you think the method below is very expensive?
>>
>> autoParser.parse(input, textHandler, metadata, context);
>>
>> If the document is large, it will need enough memory to hold the
>> whole document (i.e. in the ContentHandler).
>> Is there any other alternative?
>
>
> I did find this:
>
> https://stackoverflow.com/questions/25043720/using-poi-or-tika-to-extract-text-stream-to-stream-without-loading-the-entire-f
>
> But I have no actual experience with Tika.  If you want to get a definitive
> answer, you will need to go to a Tika support resource.  Although Solr does
> incorporate Tika, we are not experts in its use.
>
> Thanks,
> Shawn
>
