On 6/18/2010 9:12 AM, Otis Gospodnetic wrote:
Tod,

You didn't mention Tika, which makes me think you are not aware of it...
You could implement a custom Transformer that uses Tika to perform rich doc 
text extraction, just like ExtractingRequestHandler does it (see 
http://wiki.apache.org/solr/ExtractingRequestHandler ).  Maybe you could even 
just call ERH from your Transformer, though that wouldn't be the most efficient.
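
For what it's worth, the Transformer idea could be sketched roughly as below. This is only a sketch under assumptions, not a tested Solr plugin: the column names doc_url and content are made up, and TextExtractor is a hypothetical stand-in for the Tika call so the example runs without Solr or Tika on the classpath. In a real Data Import Handler setup the same logic would sit in transformRow(Map<String, Object> row, Context context) of a class extending org.apache.solr.handler.dataimport.Transformer, with the extractor backed by Tika.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for the text extraction step. With Tika available,
 * this could be backed by the Tika facade, e.g.
 * u -> new Tika().parseToString(new URL(u)).
 */
interface TextExtractor {
    String extract(String url) throws Exception;
}

/**
 * Sketch of the row-level logic a custom DIH Transformer would carry out:
 * read the document URL from the row, extract its text, add it as a field.
 */
class RichDocTransformerSketch {
    // Assumed column/field names; adjust to the actual entity and schema.
    static final String URL_COLUMN = "doc_url";
    static final String CONTENT_FIELD = "content";

    private final TextExtractor extractor;

    RichDocTransformerSketch(TextExtractor extractor) {
        this.extractor = extractor;
    }

    Map<String, Object> transformRow(Map<String, Object> row) {
        Object url = row.get(URL_COLUMN);
        if (url == null) {
            return row; // no document URL: index the metadata only
        }
        try {
            row.put(CONTENT_FIELD, extractor.extract(url.toString()));
        } catch (Exception e) {
            // A document that cannot be fetched or parsed should not abort
            // the whole import; the row keeps its metadata fields.
            System.err.println("Extraction failed for " + url + ": " + e.getMessage());
        }
        return row;
    }

    public static void main(String[] args) {
        // Fake extractor so the sketch runs without Tika or network access.
        RichDocTransformerSketch t =
            new RichDocTransformerSketch(u -> "text of " + u);
        Map<String, Object> row = new HashMap<>();
        row.put("id", 1);
        row.put(URL_COLUMN, "http://example.com/report.pdf");
        System.out.println(t.transformRow(row).get(CONTENT_FIELD));
    }
}
```

Keeping the extraction behind a small interface is just a design choice for the sketch; it lets the row handling be exercised without fetching real documents.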


You're right, sorry. I have looked at Tika, which I believe is used by Nutch too - no?

Implementing a transformer is fine. I guess I'm being lazy and trying to see if a method of doing this has been incorporated into the latest Solr release so I can avoid coding for it.






----- Original Message ----
From: Tod <listac...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Fri, June 18, 2010 8:51:02 AM
Subject: Data Import Handler Rich Format Documents

I have a database containing metadata from a content management system. Part of that data is a URL pointing to the actual published document, which can be an HTML file, a PDF, an MS Word/Excel/PowerPoint file, etc.

I'm already indexing the metadata, and that provides a lot of value. The customer, however, would like the content pointed to by the URL to be indexed as well, for more discrete searching.

This article at Lucid:


http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS

describes the process of coding a custom transformer. A separate article I've read implies Nutch could be used to provide this functionality too.

What would be the best and most efficient way to accomplish what I'm trying to do? I have a feeling the Lucid article might be dated and there might be ways to do this now without any coding, and maybe without even needing to use Nutch. I'm using the current release version of Solr.

Thanks in advance.


- Tod

