RE: Extract footer/header text out of Word docs

Markus Jelsma Thu, 30 Aug 2012 06:33:53 -0700

Tika can do this, but Solr doesn't use the BoilerpipeContentHandler. Perhaps it
should be made configurable which content handler Solr uses and, in the case of
the BoilerpipeContentHandler, which extractor implementation to use.
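
Until something like that is configurable, one workaround for the original
question is to split the document before it reaches Solr. The following is a
minimal sketch, not what Solr or Tika do internally. It assumes the files are
.docx (Office Open XML, which is a zip archive with header, footer and body in
separate XML parts) rather than legacy .doc, and the names split_docx and
_part_text, as well as the field names "header", "footer" and "body", are
hypothetical, chosen only for illustration.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the XML parts inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _part_text(xml_bytes):
    """Return the text of one XML part, one line per non-empty paragraph.

    Limitation of this sketch: paragraphs nested inside other paragraphs
    (for example text boxes) may have their text emitted more than once.
    """
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        text = "".join(t.text or "" for t in p.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def split_docx(path_or_file):
    """Split a .docx into header, footer and body text.

    The returned keys are hypothetical field names; map them to whatever
    fields your Solr schema actually defines.
    """
    fields = {"header": [], "footer": [], "body": []}
    with zipfile.ZipFile(path_or_file) as z:
        for name in sorted(z.namelist()):
            if name == "word/document.xml":
                key = "body"
            elif re.fullmatch(r"word/header\d*\.xml", name):
                key = "header"
            elif re.fullmatch(r"word/footer\d*\.xml", name):
                key = "footer"
            else:
                continue  # styles, settings, media and so on
            text = _part_text(z.read(name))
            if text:
                fields[key].append(text)
    return {k: "\n".join(v) for k, v in fields.items()}
```

The resulting dictionary could then be posted to Solr as an ordinary document
with three separate fields, bypassing the extracting handler for these files.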
 
-----Original message-----
> From:Otis Gospodnetic <[email protected]>
> Sent: Thu 30-Aug-2012 15:30
> To: [email protected]
> Subject: Re: Extract footer/header text out of Word docs
> 
> Hi Alex,
> 
> I think you may get better help on the Tika mailing list - Solr uses Tika to 
> parse rich text docs and extract text from them.  I don't know if Tika can 
> figure out what's from a header and a footer...
> 
> Otis 
> ----
> Performance Monitoring for Solr / ElasticSearch / HBase - 
> http://sematext.com/spm 
> 
> 
> 
> ----- Original Message -----
> > From: Alex Cougarman <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Cc: 
> > Sent: Thursday, August 30, 2012 9:25 AM
> > Subject: Extract footer/header text out of Word docs
> > 
> > Hi. Is it possible to specifically extract footer/header and body text out
> > of a Word document using Solr? In other words, we'd like to index/store
> > those items in different Solr fields.
> > 
> > Also, is it possible to search on specific styles within a Word document?
> > Can these attributes be indexed? Thanks.
> > 
> > Sincerely,
> > Alex
> > 
>
