Re: Extract footer/header text out of Word docs

Erick Erickson Fri, 31 Aug 2012 04:25:48 -0700

You can also move the Tika processing out of Solr and into the client, where
you may have more control over what gets extracted. I haven't tried this
particular thing, though, so I can't promise it will work.


see: http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
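
As a rough illustration of what client-side extraction could look like, here
is a minimal sketch. It does not use Tika at all: it assumes the files are
.docx (Office Open XML) packages, which are zip archives that keep headers in
word/header*.xml, footers in word/footer*.xml and the body in
word/document.xml. Legacy binary .doc files are not covered. The function
names and the field names header_txt, footer_txt and body_txt are
hypothetical; rename them to match your schema.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the XML parts of a .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def part_text(xml_bytes):
    """Return the text of one XML part, one line per non-empty paragraph."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter("{%s}p" % W_NS):
        # A paragraph's text is the concatenation of its w:t runs.
        runs = [t.text or "" for t in p.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return "\n".join(x for x in paragraphs if x)


def split_docx(source):
    """Map a .docx (path or file-like object) to header/footer/body text.

    The dictionary keys are hypothetical Solr field names (an assumption).
    """
    fields = {"header_txt": [], "footer_txt": [], "body_txt": []}
    with zipfile.ZipFile(source) as z:
        for name in sorted(z.namelist()):
            if re.fullmatch(r"word/header\d*\.xml", name):
                key = "header_txt"
            elif re.fullmatch(r"word/footer\d*\.xml", name):
                key = "footer_txt"
            elif name == "word/document.xml":
                key = "body_txt"
            else:
                continue  # styles, settings, media, and so on
            text = part_text(z.read(name))
            if text:
                fields[key].append(text)
    return {k: "\n".join(v) for k, v in fields.items()}
```

The resulting dictionary can then be sent to Solr as a single document with
whatever client you already use, so each part lands in its own field.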

Best
Erick

On Thu, Aug 30, 2012 at 9:35 AM, Markus Jelsma
<[email protected]> wrote:
> Tika can do this but Solr doesn't use the BoilerpipeContentHandler. Perhaps
> it should be made configurable which content handler Solr uses and, in the
> case of the BoilerpipeContentHandler, which extractor implementation to use.
>
> -----Original message-----
>> From:Otis Gospodnetic <[email protected]>
>> Sent: Thu 30-Aug-2012 15:30
>> To: [email protected]
>> Subject: Re: Extract footer/header text out of Word docs
>>
>> Hi Alex,
>>
>> I think you may get better help on the Tika mailing list - Solr uses Tika to 
>> parse rich text docs and extract text from them.  I don't know if Tika can 
>> figure out what's from a header and a footer...
>>
>> Otis
>> ----
>> Performance Monitoring for Solr / ElasticSearch / HBase - 
>> http://sematext.com/spm
>>
>>
>>
>> ----- Original Message -----
>> > From: Alex Cougarman <[email protected]>
>> > To: "[email protected]" <[email protected]>
>> > Cc:
>> > Sent: Thursday, August 30, 2012 9:25 AM
>> > Subject: Extract footer/header text out of Word docs
>> >
>> > Hi. Is it possible to specifically extract footer/header and body text
>> > out of a Word document using Solr? In other words, we'd like to
>> > index/store those items in different Solr fields.
>> >
>> > Also, is it possible to search on specific styles within a Word
>> > document? Can these attributes be indexed? Thanks.
>> >
>> > Sincerely,
>> > Alex
>> >
>>
