RE: Upgrading tika

Phil Scadden Wed, 20 Mar 2019 16:08:28 -0700

While the update/extract handler is fine for testing, Tika is heavyweight: a 
bad document can compromise the Solr instance, and even with ordinary docs 
Tika is a resource hog.


I wrote code with SolrJ to do the indexing and run it on a completely 
different machine from the Solr instance. It just sends SolrInputDocuments 
(created from analysis by Tika) to the server, as Erick says. This becomes 
even more important if you are going to incorporate inline OCR into the Tika 
processing (the default). The Solr docs give you the outline for the SolrJ 
process. I don’t do inline OCR.

My workflow is something like this:
1. Find a document to add.
2. If it is an image PDF, convert it to a searchable PDF via OCR, since a 
   searchable PDF is a more useful document to deliver as a search result.
3. Submit the document to the SolrJ-based Solr indexer.
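
A minimal sketch of the routing part of that workflow, using only the JDK. 
This is not the indexer described in this message: the names WorkflowSketch, 
Step and plan are hypothetical, and the image-PDF check is passed in as a 
predicate because how to detect an image-only PDF is not covered here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;
import java.util.stream.Stream;

// Hypothetical sketch of the workflow's routing step; not the author's code.
class WorkflowSketch {

    /** What to do with a found document. */
    enum Step { OCR_THEN_INDEX, INDEX }

    /**
     * Walks root and assigns a step to every regular file.
     * isImagePdf is an assumed, caller-supplied check for image-only PDFs.
     */
    static Map<Path, Step> plan(Path root, Predicate<Path> isImagePdf) throws IOException {
        Map<Path, Step> steps = new TreeMap<>();
        // Files.walk holds directory handles open, so close the stream when done.
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                 .forEach(p -> steps.put(p, isImagePdf.test(p) ? Step.OCR_THEN_INDEX : Step.INDEX));
        }
        return steps;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java WorkflowSketch <directory>");
            return;
        }
        // Placeholder predicate: treats nothing as an image PDF.
        plan(Paths.get(args[0]), p -> false)
            .forEach((path, step) -> System.out.println(step + "\t" + path));
    }
}
```

In a real run, the OCR_THEN_INDEX entries would first go through whatever OCR 
tool produces the searchable PDF, and every entry would then be handed to the 
SolrJ-based indexer.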

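The core of my indexer is:
             // Imports needed by this fragment; filename, password and solr
             // (a SolrClient) are defined elsewhere.
             import java.io.File;
             import java.io.FileInputStream;
             import java.io.InputStream;
             import org.apache.solr.client.solrj.request.UpdateRequest;
             import org.apache.solr.client.solrj.response.UpdateResponse;
             import org.apache.solr.common.SolrInputDocument;
             import org.apache.tika.metadata.Metadata;
             import org.apache.tika.parser.AutoDetectParser;
             import org.apache.tika.parser.ParseContext;
             import org.apache.tika.parser.Parser;
             import org.apache.tika.parser.pdf.PDFParserConfig;
             import org.apache.tika.sax.BodyContentHandler;
             import org.xml.sax.ContentHandler;

             File f = new File(filename);
             // Integer.MAX_VALUE raises BodyContentHandler's default write limit
             ContentHandler textHandler = new BodyContentHandler(Integer.MAX_VALUE);
             Metadata metadata = new Metadata();
             Parser parser = new AutoDetectParser();
             ParseContext context = new ParseContext();
             // This special setup of PDF processing is only required to
             // switch OCR off.
             if (filename.toLowerCase().contains("pdf")) {
               PDFParserConfig pdfConfig = new PDFParserConfig();
               pdfConfig.setExtractInlineImages(false);
               pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
               context.set(PDFParserConfig.class,pdfConfig);
               context.set(Parser.class,parser);
             }
             // try-with-resources closes the stream even if parsing fails
             try (InputStream input = new FileInputStream(f)) {
               parser.parse(input, textHandler, metadata, context);
             } catch (Exception e) {
               // exception handling
             }
             SolrInputDocument up = new SolrInputDocument();
             up.addField("id",f.getCanonicalPath());
             // other addField calls for items extracted from metadata etc.
             // the extracted body text is read back from the content handler
             up.addField("_text_",textHandler.toString());
             UpdateRequest req = new UpdateRequest();
             req.add(up);
             req.setBasicAuthCredentials("solrAdmin", password);
             UpdateResponse ur =  req.process(solr,"myindex");
             req.commit(solr, "myindex");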

-----Original Message-----
From: Geoffrey Willis <gwilli...@yahoo.com.INVALID>
Sent: Thursday, 21 March 2019 06:52
To: solr-user@lucene.apache.org
Subject: Re: Upgrading tika

Could you expand on that please? I’m currently building a webApp that submits 
documents to Solr/Tika via the update/extract handler and it’s working fine. 
What do you mean when you say “You do not want to have your Solr instance 
processing via Tika”? If that’s a bad design choice please elaborate.
Thanks,
Geoff


> On Mar 19, 2019, at 5:15 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>
> As per Erick's advice, I would strongly recommend that you do anything Tika 
> in a separate SolrJ programme. You do not want to have your Solr instance 
> processing via Tika.
>
> -----Original Message-----
> From: Tannen, Lev (USAEO) [Contractor] <lev.tan...@usdoj.gov.INVALID>
> Sent: Wednesday, 20 March 2019 08:17
> To: solr-user@lucene.apache.org
> Subject: RE: Upgrading tika
>
> Sorry Erick,
> Please disregard my previous message. Somehow I downloaded the version 
> without those two files. I am going to download the latest version, Solr 
> 8.0.0, and try it.
> Best
> Lev Tannen
>
> -----Original Message-----
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Tuesday, March 19, 2019 2:48 PM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Upgrading tika
>
> Yes, Solr is distributed with Tika. Look in:
> ./solr/contrib/extraction/lib
>
> Tika is upgraded when new versions come out, so the underlying files are 
> whatever are current at the time.
>
> The integration is a fairly loose coupling. If you're using some external 
> program (say a SolrJ program) to parse the files, there's no requirement to 
> use the jars distributed with Solr; use whatever suits your fancy. An 
> external program just constructs a SolrDocument to send to Solr. What you use 
> to create that document is irrelevant. See:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/ for some background.
>
> If you're using the ExtractingRequestHandler, where you just send the 
> semi-structured docs to Solr (PDFs, Word or whatever), then needing to know 
> anything about individual Tika-related jar files is kind of strange.
>
> If your predecessors wrote some custom code that runs as part of Solr, I 
> don't know what to say...
>
> Best,
> Erick
>
> On Tue, Mar 19, 2019 at 10:47 AM Tannen, Lev (USAEO) [Contractor] 
> <lev.tan...@usdoj.gov.invalid> wrote:
>>
>> Thank you Shawn.
>> I assumed that Tika had been integrated with Solr. In the project written 
>> before me, they used two Tika files taken from the Solr distribution. I am 
>> trying to do the same with Solr 7.7.1. However, this version contains a 
>> different set of Tika-related files, so I am confused. Does Solr no longer 
>> have Tika integrated, or can I just not recognize the files?
>>
>> -----Original Message-----
>> From: Shawn Heisey <apa...@elyograg.org>
>> Sent: Tuesday, March 19, 2019 11:11 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Upgrading tika
>>
>> On 3/19/2019 9:03 AM, levtannen wrote:
>>> Could anybody suggest which files I need in order to use the latest
>>> version of Tika, and where to find them?
>>
>> This mailing list is solr-user.  Tika is an entirely separate project from 
>> Solr within the Apache Foundation.  To get help with Tika, you'll need to 
>> ask that project.
>>
>> https://tika.apache.org/mail-lists.html
>>
>> Thanks,
>> Shawn
> Notice: This email and any attachments are confidential and may not be used, 
> published or redistributed without the prior written consent of the Institute 
> of Geological and Nuclear Sciences Limited (GNS Science). If received in 
> error please destroy and immediately notify GNS Science. Do not copy or 
> disclose the contents.

