You can also move the Tika processing off Solr to the client and perhaps have more control there. I haven't tried this particular thing, so....
see: http://searchhub.org/dev/2012/02/14/indexing-with-solrj/

Best
Erick

On Thu, Aug 30, 2012 at 9:35 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Tika can do this but Solr doesn't use the BoilerpipeContentHandler. Perhaps
> it should be made configurable which content handler Solr uses and, in the
> case of the BoilerpipeContentHandler, which extractor implementation to use.
>
> -----Original message-----
>> From:Otis Gospodnetic <otis_gospodne...@yahoo.com>
>> Sent: Thu 30-Aug-2012 15:30
>> To: solr-user@lucene.apache.org
>> Subject: Re: Extract footer/header text out of Word docs
>>
>> Hi Alex,
>>
>> I think you may get better help on the Tika mailing list - Solr uses Tika to
>> parse rich text docs and extract text from them. I don't know if Tika can
>> figure out what's from a header and a footer...
>>
>> Otis
>> ----
>> Performance Monitoring for Solr / ElasticSearch / HBase -
>> http://sematext.com/spm
>>
>> ----- Original Message -----
>> > From: Alex Cougarman <acoug...@bwc.org>
>> > To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>> > Cc:
>> > Sent: Thursday, August 30, 2012 9:25 AM
>> > Subject: Extract footer/header text out of Word docs
>> >
>> > Hi. Is it possible to specifically extract footer/header and body text out
>> > of a Word document using Solr? In other words, we'd like to index/store
>> > those items in different Solr fields.
>> >
>> > Also, is it possible to search on specific styles within a Word document?
>> > Can these attributes be indexed? Thanks.
>> >
>> > Sincerely,
>> > Alex
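
P.S. To make the "do the extraction on the client" idea at the top of this reply concrete, here is a rough, untested-in-production sketch. Note that it is not Tika and not SolrJ: it swaps in plain Python and reads the .docx zip container directly, so it only applies to .docx files (not legacy .doc) and assumes the conventional part names word/document.xml, word/headerN.xml and word/footerN.xml. The function names and the header_txt / body_txt / footer_txt field names are made up for illustration; you would still have to send the resulting dict to Solr yourself.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the XML parts inside a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def part_text(xml_bytes):
    """Return the text of one .docx XML part, one line per paragraph.

    Text boxes nested inside a paragraph are not handled specially.
    """
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is spread over its w:t elements.
        text = "".join(t.text or "" for t in para.iter(W_NS + "t")).strip()
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def split_docx(source):
    """Split a .docx into header, body, and footer text.

    `source` is a path or a binary file-like object. The keys of the
    returned dict are hypothetical Solr field names (an assumption).
    """
    fields = {"header_txt": [], "body_txt": [], "footer_txt": []}
    with zipfile.ZipFile(source) as zf:
        for name in sorted(zf.namelist()):
            if name == "word/document.xml":
                key = "body_txt"
            elif re.fullmatch(r"word/header\d*\.xml", name):
                key = "header_txt"
            elif re.fullmatch(r"word/footer\d*\.xml", name):
                key = "footer_txt"
            else:
                continue  # styles, settings, media, etc. are ignored
            text = part_text(zf.read(name))
            if text:
                fields[key].append(text)
    # Several header/footer parts (first page, odd, even) are concatenated.
    return {key: "\n".join(parts) for key, parts in fields.items()}
```

Calling split_docx("report.docx") (file name made up) gives a dict with the three strings, which maps naturally onto one Solr document with three fields. The same split should be possible with Tika on the client, but I have not verified how it marks headers and footers.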