Re: Data Import Handler Rich Format Documents

Tod Fri, 18 Jun 2010 07:23:38 -0700

On 6/18/2010 9:12 AM, Otis Gospodnetic wrote:

Tod,


You didn't mention Tika, which makes me think you are not aware of it...
You could implement a custom Transformer that uses Tika to perform rich doc 
text extraction, just like ExtractingRequestHandler does it (see 
http://wiki.apache.org/solr/ExtractingRequestHandler ).  Maybe you could even 
just call ERH from your Transformer, though that wouldn't be the most efficient.

You're right, sorry. I have looked at Tika, which I believe is used by Nutch too - no?

Implementing a transformer is fine. I guess I'm being lazy and trying to see if a method of doing this has been incorporated into the latest Solr release so I can avoid coding for it.
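
For reference, the Transformer approach suggested above might look roughly like the sketch below. It is a minimal illustration, not the method from the Lucid article or from Solr itself. The class name RichDocTransformer and the column names doc_url and doc_text are hypothetical, and the text extractor is passed in as a plain function so the example has no Solr or Tika dependency. It relies on my understanding that DIH accepts any class exposing a public transformRow(Map<String, Object>) method.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical row transformer for the Data Import Handler.
// Assumption: DIH will call transformRow(Map<String, Object>) on a
// configured class even if it does not extend Solr's Transformer.
class RichDocTransformer {
    // Column names are assumptions; match them to your own DIH entity.
    static final String URL_COLUMN = "doc_url";
    static final String TEXT_COLUMN = "doc_text";

    // Turns a document URL into extracted plain text. In a real setup this
    // would wrap a rich-document parser such as Tika (not shown here).
    private final Function<String, String> extractor;

    RichDocTransformer(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    // Adds the extracted document text to the row, next to the metadata.
    public Object transformRow(Map<String, Object> row) {
        Object url = row.get(URL_COLUMN);
        if (url == null) {
            return row; // nothing to fetch; index the metadata only
        }
        try {
            row.put(TEXT_COLUMN, extractor.apply(url.toString()));
        } catch (RuntimeException e) {
            // Extraction failed; keep the row so the metadata is still indexed.
        }
        return row;
    }
}
```

To use something like this from DIH, the class would presumably need to be public with a no-argument constructor that builds the extractor itself, for example around the Tika facade's parseToString method, and the entity would reference it through its transformer attribute.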

----- Original Message ----
From: Tod <listac...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Fri, June 18, 2010 8:51:02 AM
Subject: Data Import Handler Rich Format Documents
I have a database containing Metadata from a content management system. Part of that data includes a URL pointing to the actual published document, which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc.
I'm already indexing the Metadata and that provides a lot of value. The customer, however, would like the content pointed to by the URL to be indexed as well for more discrete searching.
This article at Lucid:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
describes the process of coding a custom transformer. A separate article I've read implies Nutch could be used to provide this functionality too.
What would be the best and most efficient way to accomplish what I'm trying to do? I have a feeling the Lucid article might be dated and there might be ways to do this now without any coding and maybe without even needing to use Nutch. I'm using the current release version of Solr.
Thanks in advance.
- Tod
