On 6/18/2010 2:42 PM, Chris Hostetter wrote:
: > I don't think DIH can do that, but who knows, let's see what others say.
: Looks like the ExtractingRequestHandler uses Tika as well. I might just use
: this but I'm wondering if there will be a large performance difference between
: using it to batch content in over rolling my own Transformer?
I'm confused ... You're using DIH, and some of your fields are URLs to
documents that you want to parse with Tika?
Why would you need a custom Transformer?
I started this thread after reading a Lucid article suggesting that a
custom Transformer might be the way to go when using DIH. My initial
question was whether there was an alternative.
My database contains only metadata and a reference to the actual content
(HTML, Office documents, PDF...) as a URL, not as blobs, which is the
case the Lucid article focused on. What I would like to do is use DIH to
index the metadata and also the actual content pointed to by the URL
column.
I might be able to do this with Nutch instead (which also uses Tika),
though I haven't thoroughly researched that yet. Alternatively, I could
write a job that pulls all the URLs out of the database and uses cURL
and the Solr ExtractingRequestHandler to push everything into the index.
I just wanted to see what everybody else is doing and what my other
options might be.
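To make the last option concrete, here is a rough sketch of such a job in
Python rather than cURL. It is only an illustration under several
assumptions: the table name `documents`, its columns `id`, `title` and
`url`, the Solr field `title`, and the Solr address are all made up, and
sqlite3 merely stands in for the real database driver. It relies on the
ExtractingRequestHandler's `literal.*` parameters for the metadata and on
`stream.url` so that Solr fetches the document itself, which requires
remote streaming to be enabled in solrconfig.xml.

```python
import sqlite3
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed Solr location and handler path; adjust to the real install.
SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, title, content_url, solr_extract=SOLR_EXTRACT):
    """Build one ExtractingRequestHandler request for one database row."""
    params = {
        "literal.id": doc_id,       # metadata taken from the database row
        "literal.title": title,     # hypothetical metadata field
        "stream.url": content_url,  # Solr fetches and parses this URL itself
    }
    return solr_extract + "?" + urlencode(params)


def http_get(url):
    """Issue a GET request and discard the response body."""
    with urlopen(url) as resp:
        resp.read()


def index_all(conn, send=http_get):
    """Send one extract request per row, then a single commit."""
    count = 0
    rows = conn.execute("SELECT id, title, url FROM documents")
    for doc_id, title, content_url in rows:
        send(build_extract_url(doc_id, title, content_url))
        count += 1
    # Commit once at the end instead of once per document.
    send(SOLR_EXTRACT.rsplit("/", 1)[0] + "?commit=true")
    return count
```

Calling `index_all(conn)` with an open database connection would issue one
request per row. Passing a different `send` function (for example
`print`) shows the URLs without contacting Solr.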
Thanks - Tod
Ref:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS