Well, if you were using ERH you'd have the same problem, since it uses Tika too. At least if you run Tika in a client somewhere, a document that blows out memory or hits some other problem can crash that client without taking Solr with it.
That's one of the reasons, in fact, that we don't recommend running ERH in prod.

And I should point out that this is not a flaw in Tika. Rather, the problem Tika has to cope with is immense. And even a cursory look at Tika shows a streaming interface, see:
https://tika.apache.org/1.8/examples.html#Streaming_the_plain_text_in_chunks

Best,
Erick

On Tue, Jun 26, 2018 at 6:28 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 6/26/2018 7:13 AM, neotorand wrote:
>>
>> Don't you think the method below is very expensive?
>>
>> autoParser.parse(input, textHandler, metadata, context);
>>
>> If the document is big, it will need enough memory to hold the
>> whole document (i.e. the ContentHandler).
>> Any other alternative?
>
>
> I did find this:
>
> https://stackoverflow.com/questions/25043720/using-poi-or-tika-to-extract-text-stream-to-stream-without-loading-the-entire-f
>
> But I have no actual experience with Tika. If you want to get a definitive
> answer, you will need to go to a Tika support resource. Although Solr does
> incorporate Tika, we are not experts in its use.
>
> Thanks,
> Shawn
>
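
To make the streaming point a bit more concrete, here is a rough sketch of a SAX handler that forwards text to a Writer as it arrives instead of collecting it all in memory. Tika's parse call takes an org.xml.sax.ContentHandler, so a handler like this could in principle be passed in place of textHandler in the autoParser.parse(input, textHandler, metadata, context) call quoted above. The names StreamingTextDemo, WriterForwardingHandler and extract are made up for illustration, and the JDK's own SAX parser is used here only as a stand-in so the sketch runs without Tika on the classpath. It only avoids buffering the extracted text; the parser itself may still need a lot of memory for some documents.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingTextDemo {

    /** Hypothetical handler: writes each run of characters out immediately. */
    static class WriterForwardingHandler extends DefaultHandler {
        private final Writer out;
        private long charsSeen = 0;

        WriterForwardingHandler(Writer out) {
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                // Forward the chunk; nothing is accumulated in this handler.
                out.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException(e);
            }
            charsSeen += length;
        }

        long charsSeen() {
            return charsSeen;
        }
    }

    /**
     * Parses the stream and forwards its text to the writer.
     * Assumption: the JDK SAX parser stands in for Tika here; with Tika the
     * same handler would be handed to the parser's parse(...) call instead.
     */
    static long extract(InputStream in, Writer out) throws Exception {
        WriterForwardingHandler handler = new WriterForwardingHandler(out);
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        out.flush();
        return handler.charsSeen();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<doc><p>Hello, </p><p>Tika</p></doc>".getBytes(StandardCharsets.UTF_8);
        // A StringWriter keeps the demo small; a real client would more likely
        // write to a file or send chunks on to Solr.
        StringWriter sink = new StringWriter();
        long n = extract(new ByteArrayInputStream(xml), sink);
        System.out.println(n + " chars: " + sink);
    }
}
```

In a real client the Writer would point at a file or at whatever batches the text up for Solr, so the size of the extracted text no longer dictates how much heap the handler needs.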