I have a database containing metadata from a content management system.
Part of that data includes a URL pointing to the actual published
document, which can be an HTML file, a PDF, an MS Word/Excel/PowerPoint file, etc.
I'm already indexing the metadata, and that provides a lot of value. The
customer, however, would like the content pointed to by the URL to be
indexed as well, for more fine-grained searching.
This article at Lucid:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
describes the process of coding a custom transformer. A separate
article I've read implies Nutch could be used to provide this
functionality too.
What would be the best and most efficient way to accomplish what I'm
trying to do? I have a feeling the Lucid article might be dated and
there might be ways to do this now without any coding and maybe without
even needing to use Nutch. I'm using the current release version of Solr.
Thanks in advance.
- Tod