While the update/extract handler is fine for testing, Tika is heavyweight: a bad document can compromise the Solr instance, and even with ordinary documents Tika is a resource hog.
I wrote a SolrJ program to do the indexing and run it on a completely different machine from the Solr instance. As Erick says, it just sends SolrInputDocuments (built from Tika's analysis) to the server. This becomes even more important if you are going to incorporate inline OCR into the Tika processing (the default). The Solr docs give you an outline of the SolrJ process. I don’t do inline OCR. My workflow is something like this:

1. Find a document to add.
2. If it is an image PDF, convert it to a searchable PDF via OCR, since a searchable PDF is a more useful document to deliver as a search result.
3. Submit the document to the SolrJ-based Solr indexer.

The core of my indexer is:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    // Inside the indexing method; solr (a SolrClient), filename and
    // password are set up elsewhere.
    File f = new File(filename);
    ContentHandler textHandler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    if (filename.toLowerCase().contains("pdf")) {
        // this special setup of pdf processing is only required to switch OCR off
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(false);
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(Parser.class, parser);
    }
    // try-with-resources closes the stream even if parsing fails
    try (InputStream input = new FileInputStream(f)) {
        parser.parse(input, textHandler, metadata, context);
    } catch (Exception e) {
        // exception handling
    }
    SolrInputDocument up = new SolrInputDocument();
    up.addField("id", f.getCanonicalPath());
    // other addField calls for items extracted from metadata etc.
    // the extracted body text is whatever the content handler collected
    up.addField("_text_", textHandler.toString());
    UpdateRequest req = new UpdateRequest();
    req.add(up);
    req.setBasicAuthCredentials("solrAdmin", password);
    UpdateResponse ur = req.process(solr, "myindex");
    req.commit(solr, "myindex");

-----Original Message-----
From: Geoffrey Willis <gwilli...@yahoo.com.INVALID>
Sent: Thursday, 21 March 2019 06:52
To: solr-user@lucene.apache.org
Subject: Re: Upgrading tika

Could you expand on that please?
I’m currently building a webApp that submits documents to Solr/Tika via the update/extract handler and it’s working fine. What do you mean when you say “You do not want to have your Solr instance processing via Tika”? If that’s a bad design choice please elaborate.

Thanks,
Geoff

> On Mar 19, 2019, at 5:15 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>
> As per Erick advice, I would strongly recommend that you do anything tika in
> a separate solrj programme. You do not want to have your solr instance
> processing via tika.
>
> -----Original Message-----
> From: Tannen, Lev (USAEO) [Contractor] <lev.tan...@usdoj.gov.INVALID>
> Sent: Wednesday, 20 March 2019 08:17
> To: solr-user@lucene.apache.org
> Subject: RE: Upgrading tika
>
> Sorry Erick,
> Please disregard my previous message. Somehow I downloaded the version
> without those two files. I am going to download the latest version solr 8.0.0
> and try it.
> Best
> Lev Tannen
>
> -----Original Message-----
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Tuesday, March 19, 2019 2:48 PM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Upgrading tika
>
> Yes, Solr is distributed with Tika. Look in:
> ./solr/contrib/extraction/lib
>
> Tika is upgraded when new versions come out, so the underlying files are
> whatever are current at the time.
>
> The integration is a fairly loose coupling, if you're using some external
> program (say a SolrJ program) to parse the files, there's no requirement to
> use the jars distributed with Solr, use whatever suits your fancy. An
> external program just constructs a SolrDocument to send to Solr. What you use
> to create that document is irrelevant. See:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/ for some background.
>
> If you're using the ExtractingRequestHandler, where you just send the
> semi-structured docs to Solr (PDFs, Word or whatever), then needing to know
> anything about individual Tika-related jar files is kind of strange.
>
> If your predecessors wrote some custom code that runs as part of Solr, I
> don't know what to say...
>
> Best,
> Erick
>
> On Tue, Mar 19, 2019 at 10:47 AM Tannen, Lev (USAEO) [Contractor]
> <lev.tan...@usdoj.gov.invalid> wrote:
>>
>> Thank you Shawn.
>> I assumed that tika has been integrated with solr. I the project written
>> before me they used two tika files taken from solr distribution. I am trying
>> to do the same with solr 7.7.1. However this version contains a different
>> set of tika related files. So I am confused. Does solr does not have
>> integrated tika anymore, or I just cannot recognize them?
>>
>> -----Original Message-----
>> From: Shawn Heisey <apa...@elyograg.org>
>> Sent: Tuesday, March 19, 2019 11:11 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Upgrading tika
>>
>> On 3/19/2019 9:03 AM, levtannen wrote:
>>> Could anybody suggest me what files do I need to use the latest
>>> version of Tika and where to find them?
>>
>> This mailing list is solr-user. Tika is an entirely separate project from
>> Solr within the Apache Foundation. To get help with Tika, you'll need to
>> ask that project.
>>
>> https://tika.apache.org/mail-lists.html
>>
>> Thanks,
>> Shawn
> Notice: This email and any attachments are confidential and may not be used,
> published or redistributed without the prior written consent of the Institute
> of Geological and Nuclear Sciences Limited (GNS Science). If received in
> error please destroy and immediately notify GNS Science. Do not copy or
> disclose the contents.
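
As a rough illustration of the workflow described at the top of this message (find a document, convert image PDFs to searchable PDFs via OCR first, then hand the file to the SolrJ-based indexer), here is a minimal sketch in plain Java. It is not the actual indexer: the class name IndexingWorkflow and the three hooks (isImagePdf, ocrToSearchablePdf, submitToIndexer) are hypothetical placeholders, and the real OCR tool and SolrJ calls would be plugged in where the lambdas are.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class IndexingWorkflow {

    /**
     * Walks root, routes image PDFs through an OCR step, and submits every
     * regular file to the indexer. Returns the paths that were submitted.
     * All three hooks are hypothetical; supply real implementations.
     */
    public static List<Path> run(Path root,
                                 Predicate<Path> isImagePdf,
                                 UnaryOperator<Path> ocrToSearchablePdf,
                                 Consumer<Path> submitToIndexer) throws IOException {
        // Step 1: find documents to add (sorted for a predictable order).
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        }

        List<Path> submitted = new ArrayList<>();
        for (Path file : files) {
            // Step 2: image PDFs are replaced by their searchable version.
            Path toIndex = isImagePdf.test(file)
                    ? ocrToSearchablePdf.apply(file)
                    : file;
            // Step 3: hand the document to the (external) indexer.
            submitToIndexer.accept(toIndex);
            submitted.add(toIndex);
        }
        return submitted;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        run(root,
            p -> false,  // placeholder: treat nothing as an image PDF
            p -> p,      // placeholder: no OCR performed
            p -> System.out.println("would index: " + p));
    }
}
```

Keeping the OCR and indexing steps behind small hooks means the directory walk can be tried with dummy lambdas on any machine before the Tika and SolrJ parts are wired in.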