I tend to approach these differently. DIH is a great tool for its purpose, but I find SolrJ/Tika more understandable. That may only reflect that I've never spent enough time with DIH, but there it is...

So, why not use a simple SolrJ program with either Tika or your favorite HTML parser to extract what you need? I find this an easier setup personally. Here's a starter set:

https://lucidworks.com/blog/indexing-with-solrj/

It has a fork that indexes from a DB as well, but you can rip that out pretty easily.

Best,
Erick

On Wed, Aug 12, 2015 at 3:53 PM, Scott Derrick <sc...@tnstaafl.net> wrote:
> I am trying to index a slew of web pages but want to restrict what gets
> indexed
>
> I'm trying to use a dataImportHandler to do this.
>
> my initial config to test this approach isn't doing what I expect
>
>
> <dataConfig>
>     <dataSource name="myfilereader" type="FileDataSource"/>
>     <document>
>         <entity name="jcurrent"
>                 processor="FileListEntityProcessor"
>                 fileName=".*html"
>                 newerThan="${dataimporter.last_index_time}"
>                 recursive="true"
>                 rootEntity="false"
>                 dataSource="null"
>                 baseDir="/var/www/web/A10078">
>
>             <entity name="x"
>                     dataSource="myfilereader"
>                     processor="XPathEntityProcessor"
>                     url="${jcurrent.fileAbsolutePath}"
>                     stream="false"
>                     forEach="/html/body"
>                     dataField="text"
>                     >
>                 <field column="p" xpath="//p" />
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
>
> The FileListEntityProcessor is feeding me the files as expected
>
> But the XPathEntityProcessor is only processing one <p> and it's coming
> up empty?
>
> "entity:jcurrent",
> [
>   null,
>   "----------- row #1-------------",
>   "file",
>   "A10078.html",
>   "fileSize",
>   43635,
>   "fileLastModified",
>   "2015-08-12T22:44:19Z",
>   "fileDir",
>   "/var/www/web/A10078",
>   "fileAbsolutePath",
>   "/var/www/web/A10078/A10078.html",
>   null,
>   "---------------------------------------------",
>   "entity:x",
>   [
>     "document#1",
>     [
>       "query",
>       "/var/www/web/A10078/A10078.html",
>       "time-taken",
>       "0:0:0.0",
>       null,
>       "----------- row #1-------------",
>       "p",
>       "",
>       "$forEach",
>       "/html/body",
>       null,
>       "---------------------------------------------"
>     ],
>     "document#1",
>     []
>   ]
> ]
> ],
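
For what it's worth, here is a rough sketch of the "simple program with an HTML parser" route suggested in the reply, limited to the extraction half. It is not taken from the linked blog post. It uses only the JDK's built-in XML and XPath classes, so it assumes the pages are well-formed XHTML; real-world HTML usually is not, and that is where Tika or a lenient HTML parser would come in. The class name `ParagraphExtractor` and both method names are made up for illustration, and the actual SolrJ indexing call is left as a comment.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: walks a directory of well-formed XHTML files
// and pulls out the text of every <p> element.
public class ParagraphExtractor {

    // Returns the whitespace-normalized text of each non-empty <p>, in document order.
    // Assumes the input parses as XML; malformed HTML will throw.
    public static List<String> extractParagraphs(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//p", doc, XPathConstants.NODESET);
        List<String> paragraphs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() also includes text from nested elements such as <b>.
            String text = nodes.item(i).getTextContent().trim().replaceAll("\\s+", " ");
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }

    // Recursively lists regular files whose names end in ".html".
    public static List<Path> listHtmlFiles(Path baseDir) throws IOException {
        try (Stream<Path> paths = Files.walk(baseDir)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".html"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws Exception {
        // Default directory taken from the baseDir in the quoted config.
        Path baseDir = Paths.get(args.length > 0 ? args[0] : "/var/www/web/A10078");
        for (Path file : listHtmlFiles(baseDir)) {
            String xhtml = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            List<String> paragraphs = extractParagraphs(xhtml);
            // A real indexer would build a SolrInputDocument from these
            // paragraphs here and send it to Solr with a SolrJ client.
            System.out.println(file + ": " + paragraphs.size() + " paragraph(s)");
        }
    }
}
```

Keeping the extraction in a plain method like this makes it easy to check against a single file before any Solr plumbing is involved, which is most of why I find this route easier to reason about.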