On 6/18/2010 9:12 AM, Otis Gospodnetic wrote:
Tod,
You didn't mention Tika, which makes me think you are not aware of it...
You could implement a custom Transformer that uses Tika to perform rich doc
text extraction, just like ExtractingRequestHandler does it (see
http://wiki.apache.org/solr/ExtractingRequestHandler ). Maybe you could even
just call ERH from your Transformer, though that wouldn't be the most efficient.
You're right, sorry. I have looked at Tika, which I believe is used by
Nutch too - no?
Implementing a transformer is fine. I guess I'm being lazy and trying
to see if a method of doing this has been incorporated into the latest
Solr release so I can avoid coding for it.
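
For reference, the transformer approach discussed above might look roughly like the sketch below. This is only an illustration under stated assumptions, not code from the Lucid article or from Solr itself: the class name, the "url" and "content" column names, and the pluggable extractor function are all hypothetical. In a real DIH setup the class would extend org.apache.solr.handler.dataimport.Transformer and implement transformRow(Map<String, Object> row, Context context), and the extractor would wrap a Tika call. The sketch keeps those dependencies out so that it runs with the JDK alone.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Sketch of the row-level logic a custom DIH transformer would carry.
 * Hypothetical stand-in: a real one would extend Solr's DIH Transformer
 * class and receive a Context argument as well.
 */
final class RichDocRowTransformer {
    private final String urlColumn;      // assumed: DB column holding the document URL
    private final String contentColumn;  // assumed: Solr field receiving the extracted text
    // Stand-in for text extraction. With Tika on the classpath this could be
    // something like: url -> new Tika().parseToString(new URL(url)),
    // wrapped to handle the checked exceptions.
    private final Function<String, String> extractor;

    RichDocRowTransformer(String urlColumn, String contentColumn,
                          Function<String, String> extractor) {
        this.urlColumn = urlColumn;
        this.contentColumn = contentColumn;
        this.extractor = extractor;
    }

    /** Adds the extracted document text to the row; returns the same row. */
    Map<String, Object> transformRow(Map<String, Object> row) {
        Object url = row.get(urlColumn);
        if (url == null) {
            return row; // no URL in this row, so index the metadata only
        }
        try {
            row.put(contentColumn, extractor.apply(url.toString()));
        } catch (RuntimeException e) {
            // One unreadable document should not abort the whole import;
            // the row is still indexed with its metadata.
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("url", "http://example.com/docs/report.pdf");
        RichDocRowTransformer t = new RichDocRowTransformer(
                "url", "content", u -> "extracted text of " + u);
        System.out.println(t.transformRow(row).get("content"));
    }
}
```

Keeping the extraction step behind a function makes it easy to swap a direct Tika call for a request to ExtractingRequestHandler, or a stub while testing the import.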
----- Original Message ----
From: Tod <listac...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Fri, June 18, 2010 8:51:02 AM
Subject: Data Import Handler Rich Format Documents
I have a database containing metadata from a content management system.
Part of that data includes a URL pointing to the actual published document,
which can be an HTML file or a PDF, MS Word/Excel/PowerPoint, etc.
I'm already indexing the metadata and that provides a lot of value. The
customer, however, would like the content pointed to by the URL to be indexed
as well for more discrete searching.
This article at Lucid:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
describes the process of coding a custom transformer. A separate article I've
read implies Nutch could be used to provide this functionality too.
What would be the best and most efficient way to accomplish what I'm trying
to do? I have a feeling the Lucid article might be dated and there might be
ways to do this now without any coding, and maybe without even needing to use
Nutch. I'm using the current release version of Solr.
Thanks in advance.
- Tod