bq: Is there any way to get reasonable behavior using the ExtractingRequestHandler, or should I just dump that approach and plan to run Tika outside of Solr, and then send Solr the exact content I want?
Actually, running Tika outside of Solr is recommended for a bunch of reasons, so I'd just go there straightaway. Tika has all sorts of "interesting" things to cope with, and since the underlying file formats are only more-or-less followed by this vendor or that, there's always the possibility that Tika will kill your Solr.

Here's a place to start:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

Best,
Erick

On Thu, Dec 21, 2017 at 4:31 PM, Phillip Rhodes <motley.crue....@gmail.com> wrote:
> Hi all, I have been having an issue with Solr, using the
> ExtractingRequestHandler. Basically, when indexing a PDF (for
> example) I get all the metadata mixed into the "content" field along
> with the content. See:
> <https://stackoverflow.com/questions/47934257/importing-files-with-solr-cell-tika-is-mixing-metadata-fields-with-content>
> for the gory details.
>
> I'm guessing this is the same basic issue as
> <https://issues.apache.org/jira/browse/SOLR-9178> which is still
> unresolved. But I thought I'd ping the list just to see if anyone had
> a workaround or any more information on this.
>
> Is there any way to get reasonable behavior using the
> ExtractingRequestHandler, or should I just dump that approach and plan
> to run Tika outside of Solr, and then send Solr the exact content I
> want?
>
> Thanks,
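
For what it's worth, here is a rough sketch of the "send Solr the exact content I want" half of that approach. It assumes the text and metadata have already been extracted by Tika in a separate process, and it is not taken from the linked article (which uses SolrJ). The field names (`content`, the `meta_` prefix), the helper names, and the update URL are all hypothetical and would need to match your own schema and core.

```python
import json
import urllib.request

# Hypothetical field names; change these to match your schema.
CONTENT_FIELD = "content"
META_PREFIX = "meta_"


def build_solr_doc(doc_id, body_text, metadata):
    """Build one Solr document, keeping body text and metadata in separate fields."""
    doc = {"id": doc_id, CONTENT_FIELD: body_text.strip()}
    for key, value in metadata.items():
        # Turn keys such as "dc:title" or "Content-Type" into field-safe names.
        safe = "".join(c.lower() if c.isalnum() else "_" for c in key)
        doc[META_PREFIX + safe] = value
    return doc


def send_to_solr(docs, update_url):
    """POST a list of documents as JSON to a Solr update endpoint and commit."""
    payload = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        update_url + "?commit=true",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    # Example values standing in for whatever your Tika step produced.
    doc = build_solr_doc(
        "doc1",
        "Body text extracted from the PDF.",
        {"dc:title": "Example", "Content-Type": "application/pdf"},
    )
    print(json.dumps(doc, indent=2))
    # Assumed URL layout; uncomment once it points at a real core.
    # send_to_solr([doc], "http://localhost:8983/solr/mycore/update")
```

Since the client decides which extracted values go into which field, the metadata never lands in the content field unless you put it there, and a document that makes Tika fall over only takes down the client process rather than Solr.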