On 6/18/2010 2:42 PM, Chris Hostetter wrote:
: > I don't think DIH can do that, but who knows, let's see what others say.

: Looks like the ExtractingRequestHandler uses Tika as well.  I might just use
: this but I'm wondering if there will be a large performance difference between
: using it to batch content in over rolling my own Transformer?

I'm confused ... You're using DIH, and some of your fields are URLs to documents that you want to parse with Tika?

Why would you need a custom Transformer?

I started this thread after reading a Lucid article suggesting a custom Transformer might be the way to go when using DIH. My initial question was whether there was an alternative.

My database contains only metadata and a reference to the actual content (HTML, Office documents, PDF...) as a URL, not blobs, which is what the Lucid article focused on. What I would like to do is use DIH somehow to index the metadata and also the actual content pointed to by the URL column.

I might be able to do this with Nutch instead (which uses Tika), though I haven't thoroughly researched that yet. Alternatively, I can write a job that pulls all the URLs out of the database and uses cURL and the Solr ExtractingRequestHandler to push everything into the index. I just wanted to see what everybody else is doing and what my other options might be.
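
To make that last option concrete, here is a rough sketch of what such a job might look like, written in Python instead of cURL. Everything specific in it is an assumption for illustration: the Solr location, the handler being mapped at /update/extract, and a hypothetical "docs" table with id, title, author and url columns. The literal.<field> request parameters are how the ExtractingRequestHandler accepts metadata alongside the extracted content.

```python
import urllib.parse
import urllib.request

# Assumed Solr location and handler path; adjust to the actual install.
SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"


def build_extract_url(base, doc_id, metadata):
    """Return the extract URL carrying the DB metadata as literal.* params."""
    params = [("literal.id", doc_id)]
    params += [("literal." + name, value) for name, value in sorted(metadata.items())]
    return base + "?" + urllib.parse.urlencode(params)


def index_row(doc_id, content_url, metadata, base=SOLR_EXTRACT):
    """Fetch the document behind content_url and POST it to Solr for extraction."""
    with urllib.request.urlopen(content_url) as resp:
        body = resp.read()
        ctype = resp.headers.get("Content-Type", "application/octet-stream")
    req = urllib.request.Request(
        build_extract_url(base, doc_id, metadata),
        data=body,  # supplying data makes this a POST
        headers={"Content-Type": ctype},
    )
    with urllib.request.urlopen(req) as solr_resp:
        return solr_resp.status


def index_all(conn, base=SOLR_EXTRACT):
    """Walk the hypothetical docs table and index every row."""
    cur = conn.cursor()
    cur.execute("SELECT id, title, author, url FROM docs")
    for doc_id, title, author, url in cur.fetchall():
        index_row(str(doc_id), url, {"title": title, "author": author}, base)
```

This posts one document per request and does not issue a commit, so a commit would still have to be sent afterwards before anything becomes searchable.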


Thanks - Tod


Ref:

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
