Tika generates a block-structured stream of events for the document.
It would be cool to have an alternate Tika processor in the
DataImportHandler (DIH) that generates this stream as XML. You could
then use the XPath tools to grab whatever you want.
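
To make the idea concrete, here is a rough sketch of the XPath step only.
It does not call Tika or Solr. The sample XHTML is a hand-written stand-in
for what such a processor might emit, and the "header" and "footer" class
names, the split_fields helper and the field names are all assumptions
made for illustration. Check what your Tika version actually produces
before relying on any of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XHTML a Tika-to-XML step might emit for a
# Word document. The class names "header" and "footer" are assumptions.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Quarterly report</title></head>
  <body>
    <div class="header"><p>ACME Corp, Confidential</p></div>
    <p>First body paragraph.</p>
    <p>Second body paragraph.</p>
    <div class="footer"><p>Page 1 of 3</p></div>
  </body>
</html>"""

# XHTML elements live in a namespace, so the paths need a prefix mapping.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def text_of(elements):
    """Join the text of the given elements, collapsing whitespace."""
    parts = (" ".join("".join(e.itertext()).split()) for e in elements)
    return " ".join(p for p in parts if p)


def split_fields(xhtml):
    """Map header, footer and body text to separate (hypothetical) fields."""
    body = ET.fromstring(xhtml).find("x:body", NS)
    return {
        "header": text_of(body.findall("x:div[@class='header']", NS)),
        "footer": text_of(body.findall("x:div[@class='footer']", NS)),
        # Direct <p> children of <body> are treated as the main text.
        "body": text_of(body.findall("x:p", NS)),
    }


fields = split_fields(SAMPLE_XHTML)
```

Each value in fields could then go into its own Solr field. ElementTree
only supports a subset of XPath, which is enough for simple paths like
these.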

On Fri, Aug 31, 2012 at 4:25 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> You can also move the Tika processing off Solr to the client and perhaps have
> more control there. I haven't tried this particular thing, so....
>
> see: http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
>
> Best
> Erick
>
> On Thu, Aug 30, 2012 at 9:35 AM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
>> Tika can do this but Solr doesn't use the BoilerpipeContentHandler. Perhaps
>> it should be made configurable which content handler Solr uses and, in the
>> case of the BoilerpipeContentHandler, which extractor implementation to use.
>>
>> -----Original message-----
>>> From:Otis Gospodnetic <otis_gospodne...@yahoo.com>
>>> Sent: Thu 30-Aug-2012 15:30
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Extract footer/header text out of Word docs
>>>
>>> Hi Alex,
>>>
>>> I think you may get better help on the Tika mailing list - Solr uses Tika 
>>> to parse rich text docs and extract text from them.  I don't know if Tika 
>>> can figure out what's from a header and a footer...
>>>
>>> Otis
>>> ----
>>> Performance Monitoring for Solr / ElasticSearch / HBase - 
>>> http://sematext.com/spm
>>>
>>>
>>>
>>> ----- Original Message -----
>>> > From: Alex Cougarman <acoug...@bwc.org>
>>> > To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>>> > Cc:
>>> > Sent: Thursday, August 30, 2012 9:25 AM
>>> > Subject: Extract footer/header text out of Word docs
>>> >
>>> > Hi. Is it possible to specifically extract footer/header and body text
>>> > out of a Word document using Solr? In other words, we'd like to
>>> > index/store those items in different Solr fields.
>>> >
>>> > Also, is it possible to search on specific styles within a Word document?
>>> > Can these attributes be indexed? Thanks.
>>> >
>>> > Sincerely,
>>> > Alex
>>> >
>>>



-- 
Lance Norskog
goks...@gmail.com
