Re: SolrJ/Tika custom indexer not indexing CERTAIN .doc text? | SIDENOTE

Erick Erickson Fri, 10 Jul 2015 10:14:58 -0700

Tim:

Thanks! I've prompted the folks at LW to see what's up with blog
comments, and I'll add your suggestion to the blog (with attribution, of
course).


Best,
Erick

On Fri, Jul 10, 2015 at 5:41 AM, Allison, Timothy B. <talli...@mitre.org> wrote:
>>>Wow, that code looks familiar ;)...
>
> Erick and Paden,
>   The following is not the source of your problem, but I thought I'd mention 
> it while you reference Erick's fantastic blog post on SolrJ 
> (http://lucidworks.com/blog/indexing-with-solrj/).  I tried to comment on 
> Erick's blog post, but something went wrong with the website, so I'll 
> take this opportunity.
>
> If you want Tika to parse embedded files (attachments within your .doc or any 
> other embedded files), you need to pass the AutoDetectParser in via the 
> ParseContext:
>
> ParseContext context = new ParseContext();
> context.set(Parser.class, autoParser);
>
> Shalin fixed this in DIH in SOLR-7189.
>
> If you don't include the parser in the ParseContext, Tika will only extract 
> text from the container/original file that you send in, and it will ignore 
> all attachments.  For some applications, this is desired, but I think users 
> would generally expect that Tika will extract everything.
>
> Happy extraction!
>
> Cheers,
>
>         Tim
>
> -----Original Message-----
> From: Paden [mailto:rumsey...@gmail.com]
> Sent: Thursday, July 09, 2015 1:00 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SolrJ/Tika custom indexer not indexing CERTAIN .doc text?
>
> Haha, no need to reinvent wheels. Especially when you don't know Java. Just a
> prototype anyway.
>
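
For readers wondering why the ParseContext entry in Tim's message matters,
here is a minimal stdlib-only sketch of the lookup pattern he describes. It
is a toy model under stated assumptions, not Tika's actual code: MiniParser,
MiniContext, Doc, EMPTY and CONTAINER are all hypothetical names invented for
illustration. The idea is that a container parser asks the context for a
parser to run on each attachment and falls back to one that extracts nothing
when none was registered.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EmbeddedParseSketch {

    /** Hypothetical stand-in for a parser: turns a document into text. */
    interface MiniParser {
        String parse(Doc doc, MiniContext context);
    }

    /** Hypothetical document: its own body text plus any attachments. */
    static final class Doc {
        final String body;
        final List<Doc> attachments;

        Doc(String body, List<Doc> attachments) {
            this.body = body;
            this.attachments = attachments;
        }
    }

    /** Hypothetical type-keyed context, loosely modeled on the ParseContext idea. */
    static final class MiniContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T fallback) {
            Object value = entries.get(key);
            return value == null ? fallback : key.cast(value);
        }
    }

    /** Fallback used when no parser was registered: extracts nothing. */
    static final MiniParser EMPTY = (doc, context) -> "";

    /** Container parser: emits its own body, then delegates each attachment. */
    static final MiniParser CONTAINER = (doc, context) -> {
        StringBuilder text = new StringBuilder(doc.body);
        // The delegate comes from the context; without an entry, attachments are skipped.
        MiniParser embedded = context.get(MiniParser.class, EMPTY);
        for (Doc attachment : doc.attachments) {
            String inner = embedded.parse(attachment, context);
            if (!inner.isEmpty()) {
                text.append(' ').append(inner);
            }
        }
        return text.toString();
    };

    /** Parses doc, optionally registering the parser in the context first. */
    static String extract(Doc doc, boolean registerParser) {
        MiniContext context = new MiniContext();
        if (registerParser) {
            context.set(MiniParser.class, CONTAINER);
        }
        return CONTAINER.parse(doc, context);
    }

    public static void main(String[] args) {
        Doc attachment = new Doc("attached notes", Collections.<Doc>emptyList());
        Doc doc = new Doc("report", Arrays.asList(attachment));
        System.out.println(extract(doc, false)); // container text only
        System.out.println(extract(doc, true));  // container text plus attachment text
    }
}
```

In this model, leaving the context empty yields only the container's own
text, which mirrors the behavior Tim describes when the parser is not set in
the ParseContext.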
