Thanks Erick,
I have already gone through the Tika example link you shared.
Please look at the code in bold.
I believe the entire content is still pushed into memory via the handler
object: the chunks list keeps every chunk until the method returns.
Sorry for copying the lengthy code from the Tika site.

Regards
Neo

*Streaming the plain text in chunks*
Sometimes, you want to chunk the resulting text up, perhaps to output as you
go minimising memory use, perhaps to output to HDFS files, or any other
reason! With a small custom content handler, you can do that.

public List<String> parseToPlainTextChunks() throws IOException,
SAXException, TikaException {
    final List<String> chunks = new ArrayList<>();
    chunks.add("");
    ContentHandlerDecorator handler = new ContentHandlerDecorator() {
        @Override
        public void characters(char[] ch, int start, int length) {
            String lastChunk = chunks.get(chunks.size() - 1);
            String thisStr = new String(ch, start, length);
 
            if (lastChunk.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
                chunks.add(thisStr);
            } else {
                chunks.set(chunks.size() - 1, lastChunk + thisStr);
            }
        }
    };
 
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test2.doc")) {
        *parser.parse(stream, handler, metadata);*
        return chunks;
    }
}
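For contrast, here is a minimal sketch of a handler that actually bounds memory: instead of accumulating every chunk in a List, it hands each filled chunk to a consumer (e.g. an HDFS writer) and discards it. This is a hypothetical illustration using only plain JDK SAX types (no Tika dependency) and an assumed `Consumer<String>` sink; with Tika you would extend ContentHandlerDecorator the same way and pass the handler to parser.parse(...).

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical streaming handler: memory stays bounded by one chunk,
// because each filled chunk is pushed to the consumer and then dropped,
// rather than being appended to an ever-growing List<String>.
class ChunkFlushingHandler extends DefaultHandler {
    private final int maxChunkSize;
    private final Consumer<String> chunkConsumer; // e.g. writes to HDFS
    private final StringBuilder buffer = new StringBuilder();

    ChunkFlushingHandler(int maxChunkSize, Consumer<String> chunkConsumer) {
        this.maxChunkSize = maxChunkSize;
        this.chunkConsumer = chunkConsumer;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Flush every full chunk as soon as it is complete.
        while (buffer.length() >= maxChunkSize) {
            chunkConsumer.accept(buffer.substring(0, maxChunkSize));
            buffer.delete(0, maxChunkSize);
        }
    }

    // Call once parsing finishes to flush any trailing partial chunk.
    void finish() {
        if (buffer.length() > 0) {
            chunkConsumer.accept(buffer.toString());
            buffer.setLength(0);
        }
    }
}
```

With this shape, the caller decides what "output as you go" means (write to a file, send over the network, index into Solr), and the parse never holds more than one chunk's worth of text in the handler.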



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
