Re: Data Import Handler Rich Format Documents

Tod Fri, 18 Jun 2010 12:37:27 -0700

On 6/18/2010 2:42 PM, Chris Hostetter wrote:

: > I don't think DIH can do that, but who knows, let's see what others say.


: Looks like the ExtractingRequestHandler uses Tika as well.  I might just use
: this but I'm wondering if there will be a large performance difference between
: using it to batch content in over rolling my own Transformer?

I'm confused ... You're using DIH, and some of your fields are URLs to documents that you want to parse with Tika?


Why would you need a custom Transformer?

I started this thread after reading a Lucid article suggesting a custom Transformer might be the way to go when using DIH. My initial question was if there was an alternative.

My database contains only metadata and a reference to the actual content (HTML, Office documents, PDF...) as a URL, not blobs as the Lucid article focused on. What I would like to do is use DIH somehow to index the metadata but also the actual content pointed to by the URL column.

I might be able to do this with Nutch instead (which uses Tika), though I haven't thoroughly researched this yet. Or I can write a job to pull all the URLs out of the database and use cURL and the Solr ExtractingRequestHandler to push everything into the index. I just wanted to see what everybody else is doing and what my other options might be.
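
To make the second option concrete, here is a rough sketch of what such a job might look like in Python rather than cURL. It is only an illustration under assumptions: the column names ("id", "url", "title"), the Solr base URL, and the helper names build_extract_url and index_row are all made up for the example, and it assumes the handler is mapped at /update/extract. The metadata columns are passed as literal.* parameters so they are indexed alongside the text Tika extracts.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, row, url_column="url", id_column="id"):
    """Build an /update/extract URL carrying the row's metadata as literal.* params.

    `row` is a dict for one database record; the column names are hypothetical.
    """
    params = [("literal." + id_column, row[id_column])]
    for name, value in sorted(row.items()):
        # The id is already added, and the URL column is only used for fetching.
        if name in (id_column, url_column) or value is None:
            continue
        params.append(("literal." + name, value))
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def index_row(solr_base, row, url_column="url"):
    """Fetch the document the row points to and post its bytes to Solr."""
    with urlopen(row[url_column]) as resp:
        body = resp.read()
        content_type = resp.headers.get("Content-Type", "application/octet-stream")
    req = Request(
        build_extract_url(solr_base, row, url_column),
        data=body,
        headers={"Content-Type": content_type},
    )
    with urlopen(req) as solr_resp:
        return solr_resp.status
```

The job would loop over the database rows, call index_row for each one, and issue a commit at the end. Whether this is faster or slower than a custom Transformer inside DIH is exactly what I don't know yet.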



Thanks - Tod


Ref:

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
