Tod,

You didn't mention Tika, which makes me think you may not be aware of it.
You could implement a custom Transformer that uses Tika to perform rich-document
text extraction, the same way ExtractingRequestHandler does (see
http://wiki.apache.org/solr/ExtractingRequestHandler ).  You could perhaps even
just call ERH from your Transformer, though that wouldn't be the most efficient
approach.
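
Here is a rough sketch of the Transformer idea, written in plain Java so it stands on its own. The class name, the "url" and "content" column names, and the pluggable extractor function are all made up for illustration. In a real DIH setup the class would extend org.apache.solr.handler.dataimport.Transformer (whose transformRow also takes a Context argument), and the extractor would be a Tika call that fetches the URL and returns the document text.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a DIH Transformer: it reads the document URL
// from a row, extracts the text behind it, and adds that text to the row.
class RichDocRowTransformer {
    // Assumed column names; use whatever your data-config and schema define.
    static final String URL_COLUMN = "url";
    static final String TEXT_COLUMN = "content";

    // Placeholder for the real extraction step (for example a Tika call
    // that downloads the URL and returns plain text).
    private final Function<String, String> extractor;

    RichDocRowTransformer(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    // Same shape as DIH's transformRow, minus the Context argument.
    Map<String, Object> transformRow(Map<String, Object> row) {
        Object url = row.get(URL_COLUMN);
        if (url == null) {
            return row; // nothing to fetch, keep the metadata as is
        }
        try {
            String text = extractor.apply(url.toString());
            if (text != null) {
                row.put(TEXT_COLUMN, text);
            }
        } catch (RuntimeException e) {
            // One unreadable document should not abort the whole import;
            // the row is still indexed with its metadata only.
        }
        return row;
    }
}
```

Swallowing the extraction failure is a design choice in this sketch: the metadata is already valuable on its own, so a broken link or unsupported file only costs that one document's body text.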

 Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Tod <listac...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Fri, June 18, 2010 8:51:02 AM
> Subject: Data Import Handler Rich Format Documents
> 
> I have a database containing Metadata from a content management system.
> Part of that data includes a URL pointing to the actual published document
> which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc.
>
> I'm already indexing the Metadata and that provides a lot of value.  The
> customer however would like that the content pointed to by the URL also be
> indexed for more discrete searching.
>
> This article at Lucid:
>
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
>
> describes the process of coding a custom transformer.  A separate article
> I've read implies Nutch could be used to provide this functionality too.
>
> What would be the best and most efficient way to accomplish what I'm trying
> to do?  I have a feeling the Lucid article might be dated and there might be
> ways to do this now without any coding and maybe without even needing to use
> Nutch.  I'm using the current release version of Solr.
>
> Thanks in advance.
>
> - Tod
