Re: Indexing part of Binary Documents and not the entire contents

2018-07-06 Thread neotorand
Gus, you are never biased. I explored JesterJ a bit and it looks quite promising. I will keep you posted on my experience soon. Regards, Neo

Re: Indexing part of Binary Documents and not the entire contents

2018-07-04 Thread Gus Heck
You might consider using a free tool like JesterJ (www.jesterj.org), which can possibly also automate the acquisition of the documents and their transmission to Solr, as well as provide a framework for massaging the contents of the document in between (including Tika processing). (Disclaimer: I'm the prim

Re: Indexing part of Binary Documents and not the entire contents

2018-06-27 Thread neotorand
Thanks Erick. I have already gone through the link from the Tika example you shared. Please look at the code in bold. I believe the entire contents are still pushed to memory with the handler object. Sorry, I copied lengthy code from the Tika site. Regards, Neo *Streaming the plain text in chunks* Sometimes, you

Re: Indexing part of Binary Documents and not the entire contents

2018-06-26 Thread Erick Erickson
Well, if you were using ERH you'd have the same problem as it uses Tika. At least if you run Tika on some client somewhere, if you do have a document that blows out memory or has some other problem, your client can crash without taking Solr with it. That's one of the reasons, in fact, that we don'

Re: Indexing part of Binary Documents and not the entire contents

2018-06-26 Thread Shawn Heisey
On 6/26/2018 7:13 AM, neotorand wrote: Don't you think the below method is very expensive? autoParser.parse(input, textHandler, metadata, context); If the document is big, it will need enough memory to hold the document (i.e. the ContentHandler). Any other alternative? I did find this: h
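
One possible way to address the memory concern quoted above is a handler that discards text as it streams past and keeps only the blocks that mention a configured keyword. The sketch below is not taken from the thread: `KeywordSnippetHandler` is a hypothetical class name, it uses only the JDK's SAX classes, and it assumes the parser reports XHTML-style events in which a closing `p` element ends a block of text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika or Solr): retains only the text
// blocks that mention a configured keyword, instead of the whole document.
class KeywordSnippetHandler extends DefaultHandler {
    private final List<String> keywords = new ArrayList<>();
    private final List<String> snippets = new ArrayList<>();
    private final StringBuilder block = new StringBuilder();

    public KeywordSnippetHandler(List<String> keywords) {
        // Lower-case the keywords once so matching is case-insensitive.
        for (String k : keywords) {
            this.keywords.add(k.toLowerCase(Locale.ROOT));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Buffer text for the current block only.
        block.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Assumption: a closing <p> marks the end of a block of text.
        if ("p".equals(localName) || "p".equals(qName)) {
            flush();
        }
    }

    @Override
    public void endDocument() {
        // Check whatever text is left after the last block boundary.
        flush();
    }

    // Keep the buffered block if it contains a keyword, then clear the buffer.
    private void flush() {
        String text = block.toString().trim();
        block.setLength(0);
        if (text.isEmpty()) {
            return;
        }
        String lower = text.toLowerCase(Locale.ROOT);
        for (String k : keywords) {
            if (lower.contains(k)) {
                snippets.add(text);
                return;
            }
        }
    }

    public List<String> getSnippets() {
        return snippets;
    }
}
```

Because the class implements the standard SAX `ContentHandler` interface through `DefaultHandler`, an instance could in principle be passed as the handler argument of the `autoParser.parse(input, textHandler, metadata, context)` call quoted above. Memory use is then bounded by the largest single block plus the retained snippets, so a document with no `p` boundaries would still be buffered in full.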

Re: Indexing part of Binary Documents and not the entire contents

2018-06-26 Thread neotorand
Thanks Shawn. Yes, I agree ERH is never suggested in production. I am writing my own custom one. Any pointers on this? What exactly I am looking for is a custom indexing program to compile precisely the information that you need and send that to Solr. On the other hand, I see the below method is very ex

Re: Indexing part of Binary Documents and not the entire contents

2018-06-26 Thread neotorand
Thanks Erick. Though I saw this article in several places, I never went through it seriously. Don't you think the below method is very expensive? autoParser.parse(input, textHandler, metadata, context); If the document is big, it will need enough memory to hold the document (i.e. Cont

Re: Indexing part of Binary Documents and not the entire contents

2018-06-21 Thread Erick Erickson
This may help you get started: https://lucidworks.com/2012/02/14/indexing-with-solrj/ Best, Erick On Thu, Jun 21, 2018 at 8:11 AM, Shawn Heisey wrote: > On 6/20/2018 9:05 AM, neotorand wrote: >> >> I have a specific requirement where I need to index the below things >> >> Metadata of any document

Re: Indexing part of Binary Documents and not the entire contents

2018-06-21 Thread Shawn Heisey
On 6/20/2018 9:05 AM, neotorand wrote: I have a specific requirement where I need to index the below things: (1) metadata of any document; (2) some parts of the document that match some keywords that I configure. The first part I am able to achieve through ERH or FileListEntityProcessor. I am struggling

Indexing part of Binary Documents and not the entire contents

2018-06-20 Thread neotorand
Hi List, I have a specific requirement where I need to index the below things: (1) metadata of any document; (2) some parts of the document that match some keywords that I configure. The first part I am able to achieve through ERH or FileListEntityProcessor. I am struggling on the second part. I am looking for