RE: SolrJ/Tika custom indexer not indexing CERTAIN .doc text? | SIDENOTE

Allison, Timothy B. Fri, 10 Jul 2015 05:42:55 -0700

>>Wow, that code looks familiar ;)...

Erick and Paden,
  The following is not the source of your problem, but I thought I'd mention it 
while you reference Erick's fantastic blog post on solrj 
(http://lucidworks.com/blog/indexing-with-solrj/).  I tried to comment on 
Erick's blog post, but something went wrong with the website, so I'll 
take this opportunity.


If you want Tika to parse embedded files (attachments within your .doc or any 
other embedded files), you need to pass the AutoDetectParser in the 
ParseContext:

ParseContext context = new ParseContext();
context.set(Parser.class, autoParser);

Shalin fixed this in DIH in SOLR-7189.

If you don't include the parser in the ParseContext, Tika will only extract 
text from the container/original file that you send in, and it will ignore all 
attachments.  For some applications, this is desired, but I think users would 
generally expect that Tika will extract everything.
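
For anyone who wants to try this outside of a full indexer, here is a minimal 
sketch of how the two lines above might fit into a standalone extraction call. 
It assumes Tika 1.x with tika-core and tika-parsers on the classpath. The class 
name EmbeddedExtract and the helpers recursiveContext and extract are made up 
for illustration; they are not from Erick's post.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name, used only for this illustration.
public class EmbeddedExtract {

    // Register the parser in the context so that Tika can hand
    // embedded documents (attachments) back to it for parsing.
    static ParseContext recursiveContext(Parser parser) {
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        return context;
    }

    // Extract the text of the container file and of its attachments.
    static String extract(InputStream in) throws Exception {
        AutoDetectParser autoParser = new AutoDetectParser();
        // -1 disables the handler's default write limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        autoParser.parse(in, handler, metadata, recursiveContext(autoParser));
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the file to extract, e.g. a .doc file.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extract(in));
        }
    }
}
```

If you pass a fresh ParseContext without the context.set(...) call, the same 
code should return only the text of the container file.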

Happy extraction!

Cheers,

        Tim

-----Original Message-----
From: Paden [mailto:rumsey...@gmail.com] 
Sent: Thursday, July 09, 2015 1:00 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrJ/Tika custom indexer not indexing CERTAIN .doc text?

Haha no need to reinvent wheels. Especially when you don't know java. Just a
prototype anyway.
