You might consider using a free tool like JesterJ (www.jesterj.org), which can possibly also automate the acquisition of the documents and their transmission to Solr, as well as provide a framework for massaging the contents of the documents in between (including Tika processing).
(Disclaimer: I'm the primary author of JesterJ, so I'm slightly biased ;) )

-Gus

On Wed, Jun 27, 2018 at 5:08 AM, neotorand <neotor...@gmail.com> wrote:
> Thanks Erick
> I have already gone through the link to the Tika example you shared.
> Please look at the code in bold.
> I believe the entire contents are still pushed into memory with the
> handler object.
> Sorry, I copied lengthy code from the Tika site.
>
> Regards
> Neo
>
> *Streaming the plain text in chunks*
> Sometimes you want to chunk the resulting text up, perhaps to output as
> you go, minimising memory use, perhaps to output to HDFS files, or for any
> other reason! With a small custom content handler, you can do that.
>
> public List<String> parseToPlainTextChunks() throws IOException,
>         SAXException, TikaException {
>     final List<String> chunks = new ArrayList<>();
>     chunks.add("");
>     ContentHandlerDecorator handler = new ContentHandlerDecorator() {
>         @Override
>         public void characters(char[] ch, int start, int length) {
>             String lastChunk = chunks.get(chunks.size() - 1);
>             String thisStr = new String(ch, start, length);
>
>             if (lastChunk.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
>                 chunks.add(thisStr);
>             } else {
>                 chunks.set(chunks.size() - 1, lastChunk + thisStr);
>             }
>         }
>     };
>
>     AutoDetectParser parser = new AutoDetectParser();
>     Metadata metadata = new Metadata();
>     try (InputStream stream =
>             ContentHandlerExample.class.getResourceAsStream("test2.doc")) {
>         *parser.parse(stream, handler, metadata);*
>         return chunks;
>     }
> }

--
http://www.the111shift.com
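For what it's worth, the Tika example above ends up holding the whole document in memory mainly because every chunk is accumulated in the List; the SAX handler itself receives the text incrementally (though some parsers may still buffer internally). Below is a rough sketch, not taken from the Tika docs: the Consumer<String>, the class name ChunkStreamingExample, and the MAXIMUM_TEXT_CHUNK_SIZE value are assumptions for illustration. It hands each chunk off as soon as it fills, so only the current chunk stays buffered.

import java.io.IOException;
import java.io.InputStream;
import java.util.function.Consumer;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.SAXException;

public class ChunkStreamingExample {

    // Assumed chunk size; pick whatever bound suits the downstream consumer.
    private static final int MAXIMUM_TEXT_CHUNK_SIZE = 100 * 1024;

    public void parseToChunkConsumer(InputStream stream, Consumer<String> chunkConsumer)
            throws IOException, SAXException, TikaException {

        // Buffer only the current chunk; hand it off once it reaches the size limit.
        final StringBuilder current = new StringBuilder();

        ContentHandlerDecorator handler = new ContentHandlerDecorator() {
            @Override
            public void characters(char[] ch, int start, int length) {
                if (current.length() > 0
                        && current.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
                    chunkConsumer.accept(current.toString());
                    current.setLength(0);
                }
                current.append(ch, start, length);
            }

            @Override
            public void endDocument() {
                // Flush whatever is left after the last SAX event.
                if (current.length() > 0) {
                    chunkConsumer.accept(current.toString());
                }
            }
        };

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        parser.parse(stream, handler, metadata);
    }
}

Each completed chunk could then be posted to Solr (or written to HDFS, etc.) from inside the consumer, so memory use is bounded by the chunk size rather than by the document size.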