Tod,

I don't think DIH can do that, but who knows; let's see what others say. And yes, Nutch uses Tika, too.
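In case it helps, here is a rough sketch of the shape such a custom Transformer might take. It is only an illustration, not tested against any Solr release. The class name RichDocTransformer and the column names doc_url and doc_text are made up for the example. To keep it compilable with the JDK alone, the text extractor is passed in as a plain function; in a real setup the class would presumably extend org.apache.solr.handler.dataimport.Transformer, and the extractor would wrap a Tika call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Sketch of a DIH-style row transformer that adds extracted document text
 * to each row. Assumption: in a real Solr setup this would extend
 * org.apache.solr.handler.dataimport.Transformer and override
 * transformRow(Map<String, Object> row, Context context). The Context
 * argument is dropped here so the sketch needs no Solr jars.
 */
class RichDocTransformer {
    // Hypothetical column and field names, chosen only for this example.
    static final String URL_COLUMN = "doc_url";
    static final String TEXT_FIELD = "doc_text";

    // Stand-in for the real extractor. With Tika on the classpath this
    // could wrap something like new Tika().parseToString(new URL(url)).
    private final Function<String, String> extractor;

    RichDocTransformer(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    /** Adds the extracted text to the row, leaving existing columns alone. */
    Map<String, Object> transformRow(Map<String, Object> row) {
        Object url = row.get(URL_COLUMN);
        if (url == null) {
            return row; // nothing to fetch; index the metadata only
        }
        try {
            row.put(TEXT_FIELD, extractor.apply(url.toString()));
        } catch (RuntimeException e) {
            // A document that cannot be fetched or parsed should not
            // stop the rest of the row from being indexed.
        }
        return row;
    }

    public static void main(String[] args) {
        // Fake extractor so the example runs without network access.
        RichDocTransformer t = new RichDocTransformer(u -> "text of " + u);
        Map<String, Object> row = new HashMap<>();
        row.put("title", "Annual report");
        row.put(URL_COLUMN, "http://example.com/report.pdf");
        System.out.println(t.transformRow(row).get(TEXT_FIELD));
    }
}
```

The extraction step is kept separate from the row handling so you can swap in Tika, or a call to ERH, without touching the rest. A row whose document fails to parse still gets its metadata indexed.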
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: Tod <listac...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Fri, June 18, 2010 10:20:34 AM
> Subject: Re: Data Import Handler Rich Format Documents
>
> On 6/18/2010 9:12 AM, Otis Gospodnetic wrote:
>> Tod,
>>
>> You didn't mention Tika, which makes me think you are not aware of it...
>> You could implement a custom Transformer that uses Tika to perform rich
>> doc text extraction, just like ExtractingRequestHandler does it (see
>> http://wiki.apache.org/solr/ExtractingRequestHandler ). Maybe you could
>> even just call ERH from your Transformer, though that wouldn't be the
>> most efficient.
>
> You're right, sorry. I have looked at Tika, which I believe is used by
> Nutch too - no? Implementing a transformer is fine. I guess I'm being
> lazy and trying to see if a method of doing this has been incorporated
> into the latest Solr release so I can avoid coding for it.
>
>> ----- Original Message ----
>>> From: Tod <listac...@gmail.com>
>>> To: solr-user@lucene.apache.org
>>> Sent: Fri, June 18, 2010 8:51:02 AM
>>> Subject: Data Import Handler Rich Format Documents
>>>
>>> I have a database containing Metadata from a content management
>>> system. Part of that data includes a URL pointing to the actual
>>> published document which can be an HTML file or a PDF, MS
>>> Word/Excel/Powerpoint, etc.
>>>
>>> I'm already indexing the Metadata and that provides a lot of value.
>>> The customer however would like that the content pointed to by the
>>> URL also be indexed for more discrete searching.
>>>
>>> This article at Lucid:
>>>
>>> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
>>>
>>> describes the process of coding a custom transformer. A separate
>>> article I've read implies Nutch could be used to provide this
>>> functionality too.
>>>
>>> What would be the best and most efficient way to accomplish what I'm
>>> trying to do? I have a feeling the Lucid article might be dated and
>>> there might ways to do this now without any coding and maybe without
>>> even needing to use Nutch. I'm using the current release version of
>>> Solr.
>>>
>>> Thanks in advance.
>>>
>>> - Tod