Tod,

You didn't mention Tika, which makes me think you may not be aware of it. You could implement a custom Transformer that uses Tika to extract text from rich documents, the same way ExtractingRequestHandler does (see http://wiki.apache.org/solr/ExtractingRequestHandler ). You might even be able to call ERH directly from your Transformer, though that wouldn't be the most efficient approach.
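
To make that a bit more concrete, here is a rough sketch of what such a row-level transformer might look like. Treat it as an illustration under assumptions, not tested Solr code: the column names "url" and "content" are made up, the class name is hypothetical, and the extractor is passed in as a plain function so the sketch compiles without Solr or Tika on the classpath. In a real setup the extractor would wrap Tika (I believe the Tika facade offers parseToString(URL) for this), and the class would need to be public with a no-argument constructor so DIH can instantiate it; check the DIH docs for the exact Transformer contract in your Solr version.

```java
import java.util.Map;
import java.util.function.Function;

// Sketch of a DIH-style row transformer. Names are hypothetical:
// "url" is the column holding the document location and "content"
// is the field that receives the extracted text.
class RichDocTransformer {
    // Stand-in for Tika: takes a URL string, returns extracted plain text.
    // Assumption: a real version would delegate to Tika here.
    private final Function<String, String> extractor;

    RichDocTransformer(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    // Same shape as a DIH transformRow method: receives one row, returns it.
    public Object transformRow(Map<String, Object> row) {
        Object url = row.get("url");
        if (url == null) {
            return row; // nothing to fetch, index the metadata only
        }
        try {
            String text = extractor.apply(url.toString());
            if (text != null && !text.isEmpty()) {
                row.put("content", text);
            }
        } catch (RuntimeException e) {
            // Extraction failed (unreachable URL, unsupported format, ...).
            // Leave the row without "content" so the metadata is still indexed.
        }
        return row;
    }
}
```

Swallowing the extraction failure is a deliberate choice in this sketch: one broken link or odd file format shouldn't cost you the metadata you are already indexing successfully. You may prefer to log it or flag the row instead.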
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: Tod <listac...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Fri, June 18, 2010 8:51:02 AM
> Subject: Data Import Handler Rich Format Documents
>
> I have a database containing Metadata from a content management system.
> Part of that data includes a URL pointing to the actual published document,
> which can be an HTML file or a PDF, MS Word/Excel/PowerPoint, etc. I'm
> already indexing the Metadata, and that provides a lot of value. The
> customer, however, would like the content pointed to by the URL to be
> indexed as well, for more discrete searching.
>
> This article at Lucid:
>
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
>
> describes the process of coding a custom transformer. A separate article
> I've read implies Nutch could be used to provide this functionality too.
>
> What would be the best and most efficient way to accomplish what I'm
> trying to do? I have a feeling the Lucid article might be dated and there
> might be ways to do this now without any coding, and maybe without even
> needing to use Nutch. I'm using the current release version of Solr.
>
> Thanks in advance.
>
> - Tod