I tend to approach these differently. DIH is a great tool for its purpose,
but I find SolrJ/Tika easier to understand, which may only reflect that
I've never spent enough time with DIH, but there it is....

So, why not use a simple SolrJ program with either Tika or your favorite
HTML parser to extract what you need? I find this an easier setup personally.

Here's a starter set:
https://lucidworks.com/blog/indexing-with-solrj/

It has a fork that indexes from a DB as well, but you can rip that out pretty
easily.
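
For what it's worth, here is a minimal sketch of the extraction half of
that approach. It is not taken from the blog post above. It uses the JDK's
built-in Swing HTML parser (ParserDelegator) as a stand-in for "your
favorite HTML parser" so it runs with no extra jars, and the class and
method names (ParagraphExtractor, extractParagraphs) are made up for
illustration. The SolrJ side is only indicated in a comment, since the
client setup depends on your SolrJ version.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of every <p> element in an HTML page.
public class ParagraphExtractor {

    public static List<String> extractParagraphs(String html) throws IOException {
        final List<String> paragraphs = new ArrayList<>();
        final StringBuilder current = new StringBuilder();
        final boolean[] inParagraph = {false};

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.P) {
                    // Start collecting text for a new paragraph.
                    inParagraph[0] = true;
                    current.setLength(0);
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Text inside inline tags (<b>, <a>, ...) nested in a <p> also lands here.
                if (inParagraph[0]) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.P && inParagraph[0]) {
                    inParagraph[0] = false;
                    String text = current.toString().trim();
                    if (!text.isEmpty()) {
                        paragraphs.add(text);
                    }
                }
            }
        };

        // 'true' tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return paragraphs;
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            String html = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
            for (String p : extractParagraphs(html)) {
                // In a real indexer, each string would instead be added to a
                // SolrInputDocument (e.g. doc.addField("p", p), field name assumed)
                // and sent to Solr through a SolrJ client.
                System.out.println(path + ": " + p);
            }
        }
    }
}
```

Run it with one or more HTML file paths as arguments. Once the printed
paragraphs look right, swap the println for the SolrJ calls.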

Best,
Erick

On Wed, Aug 12, 2015 at 3:53 PM, Scott Derrick <sc...@tnstaafl.net> wrote:
> I am trying to index a slew of web pages but want to restrict what gets
> indexed.
>
> I'm trying to use a DataImportHandler to do this.
>
> My initial config to test this approach isn't doing what I expect:
>
>
> <dataConfig>
>  <dataSource name="myfilereader" type="FileDataSource"/>
>  <document>
>     <entity name="jcurrent"
>        processor="FileListEntityProcessor"
>        fileName=".*html"
>        newerThan="${dataimporter.last_index_time}"
>        recursive="true"
>        rootEntity="false"
>        dataSource="null"
>        baseDir="/var/www/web/A10078">
>
>        <entity name="x"
>           dataSource="myfilereader"
>           processor="XPathEntityProcessor"
>           url="${jcurrent.fileAbsolutePath}"
>           stream="false"
>           forEach="/html/body"
>           dataField="text"
>           >
>           <field column="p" xpath="//p"   />
>           </entity>
>        </entity>
>     </document>
>  </dataConfig>
>
> The FileListEntityProcessor is feeding me the files as expected.
>
> But the XPathEntityProcessor is only processing one <p>, and it's coming
> up empty?
>
> "entity:jcurrent",
>     [
>       null,
>       "----------- row #1-------------",
>       "file",
>       "A10078.html",
>       "fileSize",
>       43635,
>       "fileLastModified",
>       "2015-08-12T22:44:19Z",
>       "fileDir",
>       "/var/www/web/A10078",
>       "fileAbsolutePath",
>       "/var/www/web/A10078/A10078.html",
>       null,
>       "---------------------------------------------",
>       "entity:x",
>       [
>         "document#1",
>         [
>           "query",
>           "/var/www/web/A10078/A10078.html",
>           "time-taken",
>           "0:0:0.0",
>           null,
>           "----------- row #1-------------",
>           "p",
>           "",
>           "$forEach",
>           "/html/body",
>           null,
>           "---------------------------------------------"
>         ],
>         "document#1",
>         []
>       ]
>     ]
>   ],
