Re: [DISCUSS] Future of SolrCell in Solr

2023-03-23 Thread Tim Allison
Apologies for being late to the show, and thank you Eric for pinging me on this.

I'm 100% for factoring Tika out of Solr's JVM.  I see three options for doing 
that, each of which would make things easier for users and keep Tika's jar 
hell all to itself.

1) As already proposed, use Tika server and somehow figure out how to integrate 
that seamlessly.

2) Use Tika pipes within Solr directly or within a package (as Eric suggests).  
This forks a process for parsing, and all the heavy dependencies go into the 
forked process.  Solr would need tika-core, but could specify a directory with 
tika-app.jar in it.  The dependency nightmare in tika-app.jar would not get 
loaded into Solr's JVM.  We'd probably have to make some mods to tika-pipes for 
this to work roughly as Tika is being used now, but I think something like this 
is doable...

3) Point users to tika-pipes directly.  We have a Solr emitter.  Users can aim 
tika-pipes at a directory of files, an S3 bucket, a GCS bucket, etc., and Tika 
will safely parse the files in a forked process and forward the results to 
Solr.  This is not as easy as curling bytes to Solr and having those bytes 
parsed, but it is possible.
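
To illustrate why the forked process in options 2 and 3 matters, here is a 
minimal Python sketch of the isolation idea. It is not how tika-pipes is 
actually implemented. The `parse_in_child` helper and its command-list argument 
are hypothetical; in practice the command would be whatever launches the forked 
Tika parser. A crash or hang in the child comes back as a return value instead 
of taking down the calling process.

```python
import subprocess
import sys


def parse_in_child(cmd, timeout_s=30.0):
    """Run a parser command in a child process.

    Returns (True, stdout) on success, or (False, reason) if the child
    exits non-zero or exceeds the timeout. `cmd` is a placeholder for
    however the forked parser is launched (an assumption, not a Tika API).
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a hung
        # parser cannot keep consuming resources.
        return False, "timeout"
    if proc.returncode != 0:
        return False, f"exit {proc.returncode}"
    return True, proc.stdout


if __name__ == "__main__":
    # Stand-in for a well-behaved parse.
    print(parse_in_child([sys.executable, "-c", "print('extracted text')"]))
    # Stand-in for a parser that crashes on a bad file.
    print(parse_in_child([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

The caller decides what to do with a failed parse (skip the file, retry, log 
it), which is the property that is hard to get when the parser shares a JVM 
with Solr.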

Please let me know how I can help.

Best,

Tim

On 2023/03/10 03:57:45 Gus Heck wrote:
> I totally think that for any heavy-duty use case, or any use case
> where the documents are not constrained to a known set with polite
> characteristics (i.e. known not to be password protected, of reasonable
> length, etc.), Tika should not run inside Solr. That said, as I see it the
> key downside of not having solr-cell as part of Solr would be that we would
> likely remove the docs for it too, and the entire concept of how to get a
> "normal" document into Solr evaporates from our ref guide. So I like the
> sound of it being an official package as Eric suggests, and perhaps even
> the canonical example of how to install a package... along with heavy
> documentation caveats about why Tika should run outside of Solr for most
> production purposes, of course.
> 
> -Gus
> 
> 
> On Thu, Mar 9, 2023 at 8:09 AM Eric Pugh wrote:
> 
> > I did a series of blog posts about Tika, and while conventional wisdom is
> > that running Tika in Solr is bad, I’ve had GREAT luck with it over the
> > years.
> > https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/
> >
> > Having said that, my bigger beef with Tika in Solr is about all the
> > dependencies that it drags along.   I am constantly looking up a package
> > wondering how we use it in Solr just to find it’s a Tika package….  So….
> > For that reason I think we need to do something better.
> >
> > I like moving SolrCell to a package
> > (https://issues.apache.org/jira/browse/SOLR-15951).   We have this
> > powerful packaging feature, and yet we hardly dogfood it ourselves….  I’d
> > love to see us separate out SolrCell and make it easy to do `bin/solr
> > package install solrcell` and have it work!  It would both validate the
> > whole Package concept, and minimize the dependencies in Solr’s tarball.
> >
> > Secondly, for folks who really do want to run a separate Tika server, I’d
> > love to make it easier to use.  Tika has introduced a new “pipes” concept
> > to reduce the amount of back and forth when working with Tika Server that
> > might tie nicely into the Solr update pipeline.  I don’t think any real
> > work has been done on this…. Hoping Tim Allison weighs in on this topic ;-)
> >
> > Eric
> >
> >
> > > On Mar 8, 2023, at 9:50 PM, Shawn Heisey wrote:
> > >
> > > On 3/7/2023 3:48 PM, Jan Høydahl wrote:
> > >> * Move SolrCell to a package, outside of Solr's tarball: SOLR-15951
> > >>   <https://issues.apache.org/jira/browse/SOLR-15951>
> > >> * Deprecate SolrCell: SOLR-13973
> > >>   <https://issues.apache.org/jira/browse/SOLR-13973>
> > >> * Keep in Solr but use Tika-Server
> > >>   <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>: SOLR-7632
> > >>   <https://issues.apache.org/jira/browse/SOLR-7632>
> > >> * Integrate Tika client-side: SOLR-1526
> > >>   <https://issues.apache.org/jira/browse/SOLR-1526>
> > >
> > > As you likely know, the big problem is that Tika has a habit of crashing
> > > or misbehaving, particularly with PDFs, and if it's running inside Solr,
> > > then Solr itself is going to suffer whatever bad effects

Re: [DISCUSS] Future of SolrCell in Solr

2023-03-23 Thread Tim Allison
Sounds good, Jan.  If you're heading in this direction, I'd recommend
the /tika endpoint with an Accept header set to "application/json".
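
As a rough sketch of such a call (Python, standard library only): the /tika 
path and the Accept header are as above, but the host, port 9998, the PUT 
method, the single-JSON-object response shape, and the `X-TIKA:content` key are 
assumptions about a typical tika-server setup and should be checked against the 
Tika Server docs. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: tika-server listening locally on its usual default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a request asking Tika Server for JSON output."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",  # assumption: /tika accepts the raw bytes via PUT
        headers={"Accept": "application/json"},
    )


def split_tika_json(body: str, content_key: str = "X-TIKA:content"):
    """Split a JSON response into (extracted_text, metadata_dict).

    Assumes the body is one JSON object whose extracted text lives under
    `content_key`; every other key is treated as metadata.
    """
    parsed = json.loads(body)
    text = parsed.pop(content_key, "")
    return text, parsed


if __name__ == "__main__":
    req = build_tika_request(b"raw bytes of some document")
    print(req.get_method(), req.full_url, req.get_header("Accept"))
    # Sending is left out so the sketch runs without a server:
    # with urllib.request.urlopen(req, timeout=60) as resp:
    #     text, meta = split_tika_json(resp.read().decode("utf-8"))
```

A thin Solr-side module would then map the text and whichever metadata keys it 
cares about onto Solr fields.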

Please let me know if I can help.

Best,

  Tim


On Thu, Mar 23, 2023 at 2:43 PM Jan Høydahl wrote:
>
> Documentation-wise, we can rewrite the chapter we have on rich text indexing 
> to mention several options, including tika-server and tika-pipes with the 
> Solr emitter.
>
> Wrt a SolrCell successor, I still think a super-thin module forwarding to 
> TikaServer is the best option. Users would get the same features and API as 
> today, so users who rely on SolrCell have a simple migration path. It may 
> also be a benefit that they get better control over their Tika Server wrt 
> version, scaling, which parsers are included, etc. I want to do a quick POC 
> on this to see how it flies.
>
> Jan
>
Re: Welcome David Smiley as Solr's new PMC chair

2023-03-31 Thread Tim Allison
What great news!  Congratulations, David!

On Fri, Mar 31, 2023 at 1:04 PM Houston Putman wrote:
>
> Hello,
>
> Solr has had quite a year, with a major release and many cool new
> initiatives!
> It's been an honor to serve as the PMC Chair over that time.
>
> Our PMC has traditionally rotated the position every year, and this year
> the PMC has chosen David Smiley to be the next Solr PMC Chair.
>
> Congrats David, and thanks in advance!
>
> - Houston

-
To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
For additional commands, e-mail: dev-h...@solr.apache.org



Re: Updating Dependencies - Apache Tika

2024-08-13 Thread Tim Allison
All,

Let me know how I can help. If there’s any way we can move people to
tika-pipes, that’d be best.

We have a Solr emitter already in Tika, but that might add too much
complexity for people just beginning.

I’m strongly in favor of extricating Tika’s dependencies from Solr’s for
all of the reasons mentioned.

Perhaps a meetup or telecon next week?

Best,
Tim


On Tue, Aug 13, 2024 at 11:02 AM David Smiley wrote:

> Alternatively, just like we did with the DataImportHandler (DIH)[1],
> we migrate the Tika stuff to an independent project/home on GitHub and
> people install it if they need it.  Like the DIH, Solr's Tika
> integration is quite popular/used so I expect it'll be maintained
> instead of abandoned.  At that point, whether it's migrated to
> TikaServer or whatever is a choice up to whoever the maintainer(s)
> are.  I suppose proceeding in this direction requires volunteers.
>
> [1] https://github.com/SearchScale/dataimporthandler
>
> On Mon, Aug 12, 2024 at 1:15 PM Christos Malliaridis wrote:
> >
> > I tried to find a Java client for Tika, but with no success so far.
> >
> > The version upgrade would reduce the vulnerabilities from about 21 CVEs
> > to 6, so it would definitely be an improvement and probably worth the
> > migration effort until a client is available.
> >
> > On Mon, 12 Aug 2024, 18:15 Jan Høydahl wrote:
> >
> > > Hi
> > >
> > > Wrt Tika, I had been hoping that we could replace the extracting handler
> > > with a processor that delegates to Tika Server but otherwise has feature
> > > parity. It would remove tons of dependencies and attack surface from Solr.
> > >
> > > I tried a POC once but could not find a suitable Java client for the Tika
> > > Server REST API. Perhaps that exists now?
> > >
> > > Jan Høydahl
> > >
> > > > On 12 Aug 2024, at 16:20, Christos Malliaridis
> > > > <c.malliari...@gmail.com> wrote:
> > > >
> > > > Hello everyone,
> > > >
> > > > I've been looking into the dependencies of the project and thought that
> > > > we could update a couple of them, together with their license files
> > > > (wherever necessary).
> > > >
> > > > I tried to start with Apache Tika and upgrade it from 1.28.5 to 2.9.2,
> > > > which is a huge step due to some restructuring of Apache Tika. The
> > > > affected modules are extraction and langid.
> > > >
> > > > There is a PR from solrbot that requires some manual work, which I have
> > > > already picked up for learning purposes. I'd like to create a ticket for
> > > > the upgrade, but I also saw that there is SOLR-13973, which is titled
> > > > "Deprecate Tika". From the age of and conversation on the ticket, it
> > > > sounds like Tika will not be deprecated and the ticket can be closed.
> > > > But I am not sure and would like to ask for your input on this.
> > > >
> > > > In the migration to 2.9.2 it seems that there are some conflicts with
> > > > the way the title is extracted from documents. Some metadata tags have
> > > > also been removed or replaced, which needs more attention. See Migrating
> > > > to Tika 2.0.0
> > > > <https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0>
> > > > for more details.
> > > >
> > > > I'd be happy to create a PR for the upgrade and look into the fixes with
> > > > someone who has already worked with Apache Tika 2.x or the affected
> > > > modules (extraction/langid).
> > > >
> > > > Best,
> > > > Christos
> > >
>


Re: What I want for Solr 10...

2024-12-20 Thread Tim Allison
Extracting Tika… let’s talk?

On Fri, Dec 20, 2024 at 12:48 PM Eric Pugh wrote:

> Sorry I missed the meetup yesterday (was really bummed I didn't have it on
> my calendar), so I wrote up what I would have talked about:
>
> All I want for Solr 10 is...
> dep4b.hashnode.dev
>
> ___
> Eric Pugh | Founder | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> 