Can anybody help me understand the right way to define a data-config.xml file with nested entities for indexing the contents of an XML file?
I used this data-config.xml file to index a database containing sample patient records:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/bioscope" user="db_user" password=""/>
  <document name="bioscope">
    <entity name="docs" pk="doc_id" query="SELECT doc_id, type FROM bioscope.docs">
      <field column="doc_id" name="doc_id"/>
      <field column="type" name="doc_type"/>
      <entity name="codes" query="SELECT id, origin, type, code FROM bioscope.codes WHERE doc_id='${docs.doc_id}'">
        <field column="origin" name="code_origin"/>
        <field column="type" name="code_type"/>
        <field column="code" name="code_value"/>
      </entity>
      <entity name="notes" query="SELECT id, origin, type, text FROM bioscope.texts WHERE doc_id='${docs.doc_id}'">
        <field column="origin" name="note_origin"/>
        <field column="type" name="note_type"/>
        <field column="text" name="note_text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

I would like to do the same thing with an XML file containing the same data as is in the database. That XML file looks like this:

<docs>
  <doc id="97634811" type="RADIOLOGY_REPORT">
    <codes>
      <code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY3" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY1" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY2" type="ICD-9-CM">786.2</code>
    </codes>
    <texts>
      <text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY">Seventeen year old with cough.</text>
      <text origin="CCHMC_RADIOLOGY" type="IMPRESSION">Normal.</text>
    </texts>
  </doc>
  ....
</docs>

I tried using this data-config.xml file, in order to preserve the nested entity structure used with the database case:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <entity name="code" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc[@id='${doc.doc_id}']/codes/code" url="C:/data/bioscope.xml">
        <field column="code_origin" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@origin"/>
        <field column="code_type" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@type"/>
        <field column="code_value" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code"/>
      </entity>
      <entity name="note" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc[@id='${doc.doc_id}']/texts/text" url="C:/data/bioscope.xml">
        <field column="note_origin" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@origin"/>
        <field column="note_type" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@type"/>
        <field column="note_text" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

This is wrong, and it fails to index any of the <codes> and <texts> blocks in the XML file.

I'm sure that part of the problem must be that the xpath expressions such as "/docs/doc[@id='${doc.doc_id}']/texts/text/@origin" fail to match anything in the XML file, because when I try the same import without nested entities, using this data-config.xml file, the <codes> and <texts> blocks are also not indexed:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <field column="code_origin" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@origin"/>
      <field column="code_type" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@type"/>
      <field column="code_value" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code"/>
      <field column="note_origin" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@origin"/>
      <field column="note_type" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@type"/>
      <field column="note_text" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text"/>
    </entity>
  </document>
</dataConfig>

However, when I use this data-config.xml file, which doesn't use nested entities, all of the fields are included in the index:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <field column="code_origin" xpath="/docs/doc/codes/code/@origin"/>
      <field column="code_type" xpath="/docs/doc/codes/code/@type"/>
      <field column="code_value" xpath="/docs/doc/codes/code"/>
      <field column="note_origin" xpath="/docs/doc/texts/text/@origin"/>
      <field column="note_type" xpath="/docs/doc/texts/text/@type"/>
      <field column="note_text" xpath="/docs/doc/texts/text"/>
    </entity>
  </document>
</dataConfig>

but I don't think any correspondence is maintained between the code_origin, code_type and code_value field values and the note_origin, note_type and note_text field values that are grouped together in the input XML file.

It has taken me a while to get this far, and obviously I don't have it right yet. Can anybody help me define a data-config.xml file with nested entities for indexing an XML file?

Thanks,
Mike
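
P.S. To make the correspondence concern concrete, here is a small Python sketch. It is not Solr code: it only parses the sample <doc> shown above with the standard library and contrasts two shapes for the same data. The function names flatten and group are made up for this illustration, and the "flat" shape is only my assumption about what the last config produces (each xpath collected independently into its own multi-valued field).

```python
import xml.etree.ElementTree as ET

# The single sample <doc> from the post; the elided records are left out.
SAMPLE = """<docs>
  <doc id="97634811" type="RADIOLOGY_REPORT">
    <codes>
      <code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY3" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY1" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY2" type="ICD-9-CM">786.2</code>
    </codes>
    <texts>
      <text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY">Seventeen year old with cough.</text>
      <text origin="CCHMC_RADIOLOGY" type="IMPRESSION">Normal.</text>
    </texts>
  </doc>
</docs>"""


def flatten(doc):
    """Hypothetical 'flat' shape: one independent list per path (assumed behaviour)."""
    codes = doc.findall("codes/code")
    texts = doc.findall("texts/text")
    return {
        "doc_id": doc.get("id"),
        "doc_type": doc.get("type"),
        "code_origin": [c.get("origin") for c in codes],
        "code_type": [c.get("type") for c in codes],
        "code_value": [c.text for c in codes],
        "note_origin": [t.get("origin") for t in texts],
        "note_type": [t.get("type") for t in texts],
        "note_text": [t.text for t in texts],
    }


def group(doc):
    """Hypothetical 'grouped' shape: each code and note keeps its own attributes together."""
    return {
        "doc_id": doc.get("id"),
        "doc_type": doc.get("type"),
        "codes": [
            {"origin": c.get("origin"), "type": c.get("type"), "value": c.text}
            for c in doc.findall("codes/code")
        ],
        "notes": [
            {"origin": t.get("origin"), "type": t.get("type"), "text": t.text}
            for t in doc.findall("texts/text")
        ],
    }


doc = ET.fromstring(SAMPLE).find("doc")
flat = flatten(doc)
nested = group(doc)

print(flat["code_origin"])   # parallel lists, tied together only by position
print(nested["codes"][0])    # one code with its origin, type and value together
```

In the flat shape, the only link between a code's origin, type and value is its position in three separate lists. The grouped shape is the structure I am hoping nested entities could preserve.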