Hi Chris,
I've never used the DIH, but maybe the "fileName" pattern is wrong?
fileName=".*xml"
Should be:
fileName="*.xml"
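That said, I'm not certain about this. As far as I understand the DIH documentation, fileName is treated as a Java regular expression rather than a shell glob. If so, the pattern from your config (`.*xml`) is already valid, and a glob-style pattern such as `*.xml` would not even compile. Here is a quick sketch to check that; the class and method names are made up for illustration, and the unanchored `find()` is my assumption about how the processor applies the pattern:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of Solr: checks how candidate fileName
// patterns behave when interpreted as Java regular expressions.
class FileNamePatternCheck {

    // Returns true if the pattern compiles as a Java regex.
    public static boolean compiles(String pattern) {
        try {
            Pattern.compile(pattern);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Returns true if the regex is found anywhere in the file name
    // (assumption: the pattern is applied unanchored).
    public static boolean matches(String pattern, String fileName) {
        return Pattern.compile(pattern).matcher(fileName).find();
    }

    public static void main(String[] args) {
        System.out.println(compiles(".*xml"));                       // regex from the config
        System.out.println(compiles("*.xml"));                       // glob-style pattern
        System.out.println(matches(".*xml", "manuscript.xml"));
        System.out.println(matches(".*xml", "notes.xml.bak"));       // loose pattern
        System.out.println(matches(".*\\.xml$", "notes.xml.bak"));   // stricter pattern
    }
}
```

If the regex reading is right, a stricter pattern like `.*\.xml$` would only be needed to exclude names that merely contain "xml", so the pattern is probably not the cause of the problem.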
Regards,
Felipe.
On Mon, Dec 5, 2016 at 9:43 AM, Chris Rogers <[email protected]> wrote:
> Hi all,
>
> Just bumping my question again, as it doesn’t seem to have been picked up
> by anyone. Any help would be much appreciated.
>
> Chris
>
> On 02/12/2016, 16:36, "Chris Rogers" <[email protected]>
> wrote:
>
> Hi all,
>
> A question regarding using the DIH FileListEntityProcessor with
> SolrCloud (Solr 6.3.0, ZooKeeper 3.4.8).
>
> I get that the config in SolrCloud lives on the ZooKeeper node (a
> different server from the Solr nodes in my setup).
>
> With this in mind, what is the baseDir attribute in the
> FileListEntityProcessor config relative to? I can see the config in the
> Solr GUI, and I’ve tried setting baseDir to an absolute path on my ZooKeeper
> server, but this doesn’t seem to work… any ideas how this should be set up?
>
> My DIH config is below:
>
> <dataConfig>
>   <dataSource type="FileDataSource"/>
>   <document>
>     <!-- this outer processor generates a list of files satisfying the
>          conditions specified in the attributes -->
>     <entity name="f" processor="FileListEntityProcessor"
>             fileName=".*xml"
>             newerThan="'NOW-5YEARS'"
>             recursive="true"
>             rootEntity="false"
>             dataSource="null"
>             baseDir="/home/bodl-zoo-svc/files/">
>
>       <!-- this processor extracts content using XPath from each file found -->
>       <entity name="tei" processor="XPathEntityProcessor"
>               forEach="/TEI" url="${f.fileAbsolutePath}"
>               transformer="RegexTransformer">
>         <field column="manuscript_title" name="manuscript_title"
>                xpath="/TEI/teiHeader/fileDesc/titleStmt/title"/>
>         <field column="repository" name="repository"
>                xpath="/TEI/teiHeader/fileDesc/publicationStmt/publisher"/>
>         <field column="id" name="id"
>                xpath="/TEI/teiHeader/fileDesc/sourceDesc/msDesc/msIdentifier/altIdentifier/idno"/>
>       </entity>
>
>     </entity>
>
>   </document>
> </dataConfig>
>
>
> This same config worked as expected on a single Solr node (i.e. not in
> SolrCloud mode).
>
> Thanks,
> Chris
>
> --
> Chris Rogers
> Digital Projects Manager
> Bodleian Digital Library Systems and Services
> [email protected]