Re: [DISCUSS] Future of SolrCell in Solr

Tim Allison Thu, 23 Mar 2023 12:34:13 -0700

Sounds good, Jan.  If you're heading in this direction, I'd recommend
the /tika endpoint with an Accept header set to "application/json".
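
A minimal sketch of that call, using only the Python standard library. The
host and port (localhost:9998, tika-server's usual default) and the helper
names build_tika_request and extract are assumptions for illustration, not
part of any Solr or Tika API.

```python
import json
import urllib.request

# Assumption: a tika-server instance listening on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(payload: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT to the /tika endpoint that asks for a JSON response."""
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def extract(payload: bytes, url: str = TIKA_URL, timeout: float = 60.0) -> dict:
    """Send raw document bytes to tika-server and decode the JSON reply."""
    request = build_tika_request(payload, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a server running, extract(open("report.pdf", "rb").read()) should come
back as one JSON object holding the metadata and the extracted content.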


Please let me know if I can help.

Best,

      Tim


On Thu, Mar 23, 2023 at 2:43 PM Jan Høydahl <jan....@cominvent.com> wrote:
>
> Documentation-wise we can rewrite the chapter we have on rich text indexing 
> to mention several options, including tika-server and tika-pipes with the 
> Solr emitter.
>
> As for a SolrCell successor, I still think a super-thin module forwarding to 
> Tika Server is the best option. Users would get the same features and API as 
> today, so those who rely on SolrCell have a simple migration path. It may 
> also be a benefit that they get better control over their Tika Server with 
> respect to version, scaling, which parsers are included, etc. I want to do a 
> quick POC on this to see how it flies.
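
To make the "same features and API" point concrete, here is a rough sketch of
the translation step such a thin module would need: turning a Tika Server JSON
reply into a Solr document while honoring a few SolrCell-style request
parameters (fmap.*, literal.*, uprefix). The function to_solr_doc is
hypothetical, the "X-TIKA:content" key is an assumption about the reply
format, and the field handling is deliberately simpler than what SolrCell
actually does.

```python
from typing import Any, Dict, Set

# Assumption: the Tika JSON reply carries the extracted body under this key.
CONTENT_KEY = "X-TIKA:content"


def to_solr_doc(tika_json: Dict[str, Any],
                params: Dict[str, str],
                known_fields: Set[str]) -> Dict[str, Any]:
    """Map Tika metadata/content onto Solr fields, SolrCell-style (simplified)."""
    doc: Dict[str, Any] = {}
    uprefix = params.get("uprefix")
    for name, value in tika_json.items():
        # SolrCell exposes the extracted body as "content".
        source = "content" if name == CONTENT_KEY else name
        # fmap.<source>=<target> renames a field.
        target = params.get("fmap." + source, source)
        if target not in known_fields:
            if uprefix is None:
                continue  # simplification: drop fields the schema does not know
            target = uprefix + target
        doc[target] = value
    # literal.<field>=<value> adds a fixed value to the document.
    for key, value in params.items():
        if key.startswith("literal."):
            doc[key[len("literal."):]] = value
    return doc
```

For example, with fmap.content=text, uprefix=attr_ and literal.id=doc1, the
body lands in "text", unknown metadata gets an "attr_" prefix, and the id is
set from the request.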
>
> Jan
>
> > On 23 Mar 2023, at 17:14, Tim Allison <talli...@apache.org> wrote:
> >
> > Apologies for being late to the show, and thank you Eric for pinging me on 
> > this.
> >
> > I'm 100% for factoring out Tika from the same jvm as Solr.  I see three 
> > options for removing Tika from Solr's jvm, making it easier for users and 
> > keeping Tika's jar hell all to itself.
> >
> > 1) As already proposed, use Tika server and somehow figure out how to 
> > integrate that seamlessly.
> >
> > 2) Use Tika pipes within Solr directly or within a package (as Eric 
> > suggests).  This forks a process for parsing, and all the heavy dependencies 
> > go into the forked process.  Solr would need tika-core, but could specify a 
> > directory with tika-app.jar in it.  The dependency nightmare in 
> > tika-app.jar would not get loaded into Solr's jvm.  We'd probably have to 
> > make some mods to tika-pipes for this to work roughly as Tika is being used 
> > now, but I think something like this is doable...
> >
> > 3) Direct users to tika-pipes directly.  We have a Solr emitter.  Users can 
> > aim tika-pipes at a directory of files, an S3 bucket, a gcs thing, etc, and 
> > Tika will safely parse the files in a forked process and forward the 
> > results to Solr.  This is not as easy as curling bytes to Solr and having 
> > those bytes parsed, but it is possible.
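
The isolation idea behind options 2 and 3 (parse in a forked process so that a
crash or hang cannot hurt the parent) can be sketched generically as follows.
This is not the tika-pipes API: parse_in_child is a hypothetical helper, and
the command it runs is left to the caller. STAND_IN_PARSER is only a toy child
used to exercise the code; a real setup would launch a separate JVM instead.

```python
import subprocess
import sys
from typing import List, Optional

# Stand-in for a real parser command (assumption): upper-cases stdin.
STAND_IN_PARSER = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def parse_in_child(command: List[str], payload: bytes,
                   timeout: float = 30.0) -> Optional[bytes]:
    """Run a parser in a child process; return stdout, or None on crash or hang."""
    try:
        done = subprocess.run(
            command,
            input=payload,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing lingers.
        return None
    if done.returncode != 0:
        return None  # the child crashed or reported failure; the parent lives on
    return done.stdout
```

The parent only ever sees bytes or None, so an out-of-memory error or an
endless loop in the parser costs one child process rather than the whole
server.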
> >
> > Please let me know how I can help.
> >
> > Best,
> >
> >    Tim
> >
> > On 2023/03/10 03:57:45 Gus Heck wrote:
> >> I totally think that for any heavy-duty use case, or any use case
> >> where the documents are not constrained to a known set with polite
> >> characteristics (i.e. known not to be password protected, reasonable
> >> length, etc.), Tika should not run inside Solr. That said, as I see it the
> >> key downside of not having solr-cell as part of Solr would be that we would
> >> likely remove the docs for it too, and the entire concept of how to get a
> >> "normal" document into Solr evaporates from our ref guide. So I like the
> >> sound of it being an official package as Eric suggests, and perhaps even
> >> the canonical example of how to install a package... Along with heavy
> >> documentation caveats of why Tika should run outside of solr for most
> >> production purposes of course.
> >>
> >> -Gus
> >>
> >>
> >> On Thu, Mar 9, 2023 at 8:09 AM Eric Pugh <ep...@opensourceconnections.com>
> >> wrote:
> >>
> >>> I did a series of blog posts about Tika, and while conventional wisdom is
> >>> that running Tika in Solr is bad, I’ve had GREAT luck with it over the
> >>> years.
> >>> https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/
> >>>
> >>> Having said that, my bigger beef with Tika in Solr is about all the
> >>> dependencies that it drags along.   I am constantly looking up a package
> >>> wondering how we use it in Solr just to find it’s a Tika package….  So….
> >>> For that reason I think we need to do something better.
> >>>
> >>> I like moving SolrCell to a package (
> >>> https://issues.apache.org/jira/browse/SOLR-15951).   We have this
> >>> powerful packaging feature, and yet we hardly dog food it ourselves….  I’d
> >>> love to see us separate out SolrCell and make it easy to do `bin/solr
> >>> package install solrcell` and have it work!  It would both validate the
> >>> whole Package concept, and minimize the dependencies in Solr’s tarball.
> >>>
> >>> Secondly, for folks who really do want to run a separate Tika server, I’d
> >>> love to make it easier to use.    Tika has introduced a new “pipes” concept
> >>> to reduce the amount of back and forth when working with Tika Server that
> >>> might tie nicely into the Solr update pipeline.  I don’t think any real
> >>> work has been done on this…. Hoping Tim Allison weighs in on this topic ;-)
> >>>
> >>> Eric
> >>>
> >>>
> >>>> On Mar 8, 2023, at 9:50 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> >>>>
> >>>> On 3/7/2023 3:48 PM, Jan Høydahl wrote:
> >>>>> * Move SolrCell to a package, outside of Solr's tarball: SOLR-15951 <https://issues.apache.org/jira/browse/SOLR-15951>
> >>>>> * Deprecate SolrCell: SOLR-13973 <https://issues.apache.org/jira/browse/SOLR-13973>
> >>>>> * Keep in Solr but use Tika-Server <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>: SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>
> >>>>> * Integrate Tika client-side: SOLR-1526 <https://issues.apache.org/jira/browse/SOLR-1526>
> >>>>
> >>>> As you likely know, the big problem is that Tika has a habit of crashing
> >>>> or misbehaving, particularly with PDFs, and if it's running inside Solr,
> >>>> then Solr itself is going to suffer whatever bad effects Tika causes.
> >>>>
> >>>>> My current thinking / proposal is to:
> >>>>> * Build a new, thin Solr module that exposes a compatible
> >>>>>   /update/extract handler, delegating to Tika-Server (user-hosted)
> >>>>> * Deprecate SolrCell in current form
> >>>>> * From 10.0, Solr will not ship with embedded Tika, only the new
> >>>>>   handler delegating to Tika-Server
> >>>>
> >>>> I was thinking something along these lines too.  A separate JVM running
> >>>> Tika Server that can crash without taking Solr down, and communication so
> >>>> ERH can send commands to it, receive extracted data, and hopefully know
> >>>> when the other JVM crashes.  If we design it well, then the framework could
> >>>> be used to integrate with other extraction mechanisms besides Tika.  I
> >>>> think that would be quite a bit of work.
> >>>>
> >>>> It might be a good idea to make that a separate project as was done for
> >>>> DIH, but I have no way of guessing whether there is enough interest in the
> >>>> community to keep it maintained.  If it's a separate project, then I think
> >>>> it would just incorporate SolrJ and Tika, rather than using a special
> >>>> handler.  I have never used ERH in a production setting, and barely have
> >>>> experience with it in non-production.
> >>>>
> >>>> Thanks,
> >>>> Shawn
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> >>>> For additional commands, e-mail: dev-h...@solr.apache.org
> >>>>
> >>>
> >>> _______________________
> >>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> >>> http://www.opensourceconnections.com <
> >>> http://www.opensourceconnections.com/> | My Free/Busy <
> >>> http://tinyurl.com/eric-cal>
> >>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
> >>> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> >>>
> >>>
> >>>
> >>
> >> --
> >> http://www.needhamsoftware.com (work)
> >> http://www.the111shift.com (play)
> >>
> >
> >
>
>
>

