I think you can use the existing ExtractingRequestHandler (ERH) to do the job,
i.e. add a child entity under the metadata entity in your DIH config:

<dataSource type="JdbcDataSource" name="db" ... />
<dataSource type="URLDataSource" name="solr" />
<entity name="metadata" query="select id, title, url from metadata"
dataSource="db">
    <entity processor="PlainTextEntityProcessor" name="content"
url="http://localhost:8983/solr/update/extract?extractOnly=true&amp;wt=xml&amp;indent=on&amp;stream.url=${metadata.url}"
dataSource="solr">
        <field column="plainText" name="content"/>
    </entity>
</entity>

That's not a working example, just the basic idea. You still need to
URI-escape the ${metadata.url} reference, probably using some transformer
(regexp, JavaScript?), extract the file content from the ERH XML response
using XPath, and probably do some HTML stripping.
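
A rough sketch of those remaining steps, still untested. It assumes a
ScriptTransformer on the parent entity that writes the escaped URL into a
hypothetical "encodedUrl" column, an XPathEntityProcessor child in place of
PlainTextEntityProcessor, and that the ERH response carries the extracted
body in a top-level <str> element (check the real response before relying
on that xpath). The function name "encodeUrl" is made up for illustration,
and the two dataSource definitions are the same as above.

```xml
<dataConfig>
  <!-- the "db" and "solr" dataSource definitions from above go here -->
  <script><![CDATA[
    // Assumption: the "url" column holds an absolute URL to the document.
    // Percent-encode it so it can be passed safely as stream.url.
    function encodeUrl(row) {
      var url = row.get('url');
      if (url != null) {
        row.put('encodedUrl', java.net.URLEncoder.encode(url, 'UTF-8'));
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="metadata" dataSource="db"
            query="select id, title, url from metadata"
            transformer="script:encodeUrl">
      <!-- Fetch the ERH response and pull the extracted body out of it -->
      <entity name="content" dataSource="solr"
              processor="XPathEntityProcessor"
              url="http://localhost:8983/solr/update/extract?extractOnly=true&amp;wt=xml&amp;stream.url=${metadata.encodedUrl}"
              forEach="/response"
              transformer="HTMLStripTransformer">
        <!-- Assumed location of the extracted XHTML in the response -->
        <field column="content" xpath="/response/str" stripHTML="true"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The script runs once per metadata row, so the child entity only ever sees
the escaped value. The content field still has to exist in schema.xml.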

HTH,
Alex

On Fri, Jun 18, 2010 at 4:51 PM, Tod <listac...@gmail.com> wrote:
> I have a database containing Metadata from a content management system.
>  Part of that data includes a URL pointing to the actual published document
> which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc.
>
> I'm already indexing the Metadata and that provides a lot of value.  The
> customer, however, would like the content pointed to by the URL to be
> indexed as well, for more discrete searching.
>
> This article at Lucid:
>
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
>
> describes the process of coding a custom transformer.  A separate article
> I've read implies Nutch could be used to provide this functionality too.
>
> What would be the best and most efficient way to accomplish what I'm trying
> to do?  I have a feeling the Lucid article might be dated and there might
> be ways to do this now without any coding and maybe without even needing to
> use Nutch.  I'm using the current release version of Solr.
>
> Thanks in advance.
>
>
> - Tod
>
