I have a database containing metadata from a content management system. Part of that data is a URL pointing to the actual published document, which can be an HTML file, a PDF, an MS Word/Excel/PowerPoint file, etc.

I'm already indexing the metadata, and that provides a lot of value. The customer, however, would like the content pointed to by the URL to be indexed as well, for more fine-grained searching.

This article at Lucid:

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS

describes the process of coding a custom transformer. A separate article I've read implies Nutch could be used to provide this functionality too.

What would be the best and most efficient way to accomplish what I'm trying to do? I have a feeling the Lucid article might be dated and that there might now be ways to do this without any coding, and maybe without even needing Nutch. I'm using the current release version of Solr.

Thanks in advance.


- Tod
