Re: Issue with Solr Cell mixing metadata and content together

Erick Erickson Thu, 21 Dec 2017 16:45:02 -0800

bq: Is there any way to get reasonable behavior using the
ExtractingRequestHandler, or should I just dump that approach and plan
to run Tika outside of Solr, and then send Solr the exact content I
want?


Actually, running Tika outside of Solr is the recommended approach
for a bunch of reasons, so I'd just go there straightaway. Tika has
all sorts of "interesting" things to cope with, and since the
underlying file format specs are only more-or-less followed by this
vendor or that, there's always the possibility that a malformed
document will make Tika take your Solr down with it.

Here's a place to start:
https://lucidworks.com/2012/02/14/indexing-with-solrj/
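
To illustrate the "send Solr the exact content I want" half of that
approach, here is a minimal sketch using only the JDK (Java 11+). It
assumes the body text and any metadata you care about have already
been extracted by Tika in your own client process, and it posts them
as separate fields to Solr's JSON update handler. The class name
SolrDocSender, the field names (id, content, author), and the
collection URL are all hypothetical, so adjust them to your schema. The
linked article uses SolrJ for this step instead of raw HTTP.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;

// Hypothetical helper for illustration; not part of Solr, SolrJ, or Tika.
final class SolrDocSender {

    // Escape a string so it can sit inside a JSON string literal.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    // Remaining control characters become 4-digit hex escapes.
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build a one-document JSON update body from explicitly chosen fields,
    // so only what you put in "content" ends up in "content".
    static String toUpdateJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("[{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(jsonEscape(e.getKey())).append("\":\"")
              .append(jsonEscape(e.getValue())).append('"');
            first = false;
        }
        return sb.append("}]").toString();
    }

    // POST the document to the collection's update handler; returns the HTTP status.
    static int send(String updateUrl, Map<String, String> fields) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(updateUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(toUpdateJson(fields)))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

A call might look like SolrDocSender.send(
"http://localhost:8983/solr/mycollection/update?commit=true", fields),
where fields is an insertion-ordered map holding the id, the body text
under "content", and each metadata value under its own field. If Tika
chokes on a bad file, only the client process is affected.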

Best,
Erick

On Thu, Dec 21, 2017 at 4:31 PM, Phillip Rhodes
<motley.crue....@gmail.com> wrote:
> Hi all, I have been having an issue with Solr, using the
> ExtractingRequestHandler.  Basically, when indexing a PDF (for
> example) I get all the metadata mixed into the "content" field along
> with the content.  See:
> <https://stackoverflow.com/questions/47934257/importing-files-with-solr-cell-tika-is-mixing-metadata-fields-with-content>
> for the gory details.
>
> I'm guessing this is the same basic issue as
> <https://issues.apache.org/jira/browse/SOLR-9178> which is still
> unresolved.  But I thought I'd ping the list just to see if anyone had
> a workaround or any more information on this.
>
> Is there any way to get reasonable behavior using the
> ExtractingRequestHandler, or should I just dump that approach and plan
> to run Tika outside of Solr, and then send Solr the exact content I
> want?
>
>
> Thanks,
>
>
>
> This message optimized for indexing by NSA PRISM
