You might consider using a free tool like JesterJ (www.jesterj.org), which can possibly also automate the acquisition of the documents and their transmission to Solr, as well as provide a framework for massaging the contents of the documents in between (including Tika processing).
(Disclaimer: I'm the primary author of JesterJ, so I'm slightly biased ;) )

-Gus

On Wed, Jun 27, 2018 at 5:08 AM, neotorand <neotor...@gmail.com> wrote:
> Thanks Erick
> I have already gone through the link to the Tika example you shared.
> Please look at the code in bold.
> I believe the entire contents are still pushed into memory with the
> handler object.
> Sorry, I copied lengthy code from the Tika site.
>
> Regards
> Neo
>
> *Streaming the plain text in chunks*
> Sometimes you want to chunk the resulting text up, perhaps to output as
> you go, minimising memory use, perhaps to output to HDFS files, or for any
> other reason! With a small custom content handler, you can do that.
>
> public List<String> parseToPlainTextChunks() throws IOException,
>         SAXException, TikaException {
>     final List<String> chunks = new ArrayList<>();
>     chunks.add("");
>     ContentHandlerDecorator handler = new ContentHandlerDecorator() {
>         @Override
>         public void characters(char[] ch, int start, int length) {
>             String lastChunk = chunks.get(chunks.size() - 1);
>             String thisStr = new String(ch, start, length);
>
>             if (lastChunk.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
>                 chunks.add(thisStr);
>             } else {
>                 chunks.set(chunks.size() - 1, lastChunk + thisStr);
>             }
>         }
>     };
>
>     AutoDetectParser parser = new AutoDetectParser();
>     Metadata metadata = new Metadata();
>     try (InputStream stream =
>             ContentHandlerExample.class.getResourceAsStream("test2.doc")) {
>         *parser.parse(stream, handler, metadata);*
>         return chunks;
>     }
> }

--
http://www.the111shift.com
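For what it's worth, the Tika example above ends up holding the whole document in memory mainly because every chunk is accumulated in the List; the SAX handler itself receives the text incrementally (though some parsers may still buffer internally). Below is a rough sketch, not taken from the Tika docs: the Consumer<String>, the class name ChunkStreamingExample, and the MAXIMUM_TEXT_CHUNK_SIZE value are assumptions for illustration. It hands each chunk off as soon as it fills, so only the current chunk stays buffered.

import java.io.IOException;
import java.io.InputStream;
import java.util.function.Consumer;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.SAXException;

public class ChunkStreamingExample {

    // Assumed chunk size; pick whatever bound suits the downstream consumer.
    private static final int MAXIMUM_TEXT_CHUNK_SIZE = 100 * 1024;

    public void parseToChunkConsumer(InputStream stream, Consumer<String> chunkConsumer)
            throws IOException, SAXException, TikaException {

        // Buffer only the current chunk; hand it off once it reaches the size limit.
        final StringBuilder current = new StringBuilder();

        ContentHandlerDecorator handler = new ContentHandlerDecorator() {
            @Override
            public void characters(char[] ch, int start, int length) {
                if (current.length() > 0
                        && current.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
                    chunkConsumer.accept(current.toString());
                    current.setLength(0);
                }
                current.append(ch, start, length);
            }

            @Override
            public void endDocument() {
                // Flush whatever is left after the last SAX event.
                if (current.length() > 0) {
                    chunkConsumer.accept(current.toString());
                }
            }
        };

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        parser.parse(stream, handler, metadata);
    }
}

Each completed chunk could then be posted to Solr (or written to HDFS, etc.) from inside the consumer, so memory use is bounded by the chunk size rather than by the document size.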