Tika can do this, but Solr doesn't use the BoilerpipeContentHandler. Perhaps it
should be made configurable which content handler Solr uses and, in the case of
the BoilerpipeContentHandler, which extractor implementation to use.
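
In the meantime, here is a rough sketch of splitting the parts yourself before
sending them to Solr. It is not the BoilerpipeContentHandler route and does not
use Tika or Solr at all. It assumes the files are .docx (OOXML), where headers,
footers and the body live in separate XML parts inside the zip package; it will
not work for the older binary .doc format. The function names and the
header/footer/body keys are made up for illustration, and text inside text
boxes may be picked up twice.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the XML parts of a .docx package
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def part_text(xml_bytes):
    """Return the text of one XML part, one line per non-empty <w:p> paragraph."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> runs
        text = "".join(t.text or "" for t in p.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def split_docx(source):
    """Return header, footer and body text separately (hypothetical keys).

    `source` is a path or a binary file-like object holding a .docx file.
    """
    fields = {"header": [], "footer": [], "body": []}
    with zipfile.ZipFile(source) as z:
        for name in sorted(z.namelist()):
            if re.fullmatch(r"word/header\d*\.xml", name):
                fields["header"].append(part_text(z.read(name)))
            elif re.fullmatch(r"word/footer\d*\.xml", name):
                fields["footer"].append(part_text(z.read(name)))
            elif name == "word/document.xml":
                fields["body"].append(part_text(z.read(name)))
    return {key: "\n".join(parts) for key, parts in fields.items()}
```

The three values could then be posted to three separate fields in a normal
Solr update, assuming the schema defines such fields.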
-----Original message-----
> From:Otis Gospodnetic <otis_gospodne...@yahoo.com>
> Sent: Thu 30-Aug-2012 15:30
> To: solr-user@lucene.apache.org
> Subject: Re: Extract footer/header text out of Word docs
>
> Hi Alex,
>
> I think you may get better help on the Tika mailing list - Solr uses Tika to
> parse rich text docs and extract text from them. I don't know if Tika can
> figure out what's from a header and a footer...
>
> Otis
> ----
> Performance Monitoring for Solr / ElasticSearch / HBase -
> http://sematext.com/spm
>
>
>
> ----- Original Message -----
> > From: Alex Cougarman <acoug...@bwc.org>
> > To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> > Cc:
> > Sent: Thursday, August 30, 2012 9:25 AM
> > Subject: Extract footer/header text out of Word docs
> >
> > Hi. Is it possible to specifically extract footer/header and body text out
> > of a Word document using Solr? In other words, we'd like to index/store
> > those items in different Solr fields.
> >
> > Also, is it possible to search on specific styles within a Word document?
> > Can these attributes be indexed? Thanks.
> >
> > Sincerely,
> > Alex
> >
>