Can anybody help me understand the right way to define a data-config.xml file 
with nested entities for indexing the contents of an XML file?

I used this data-config.xml file to index a database containing sample patient 
records:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/bioscope" user="db_user" password=""/>
  <document name="bioscope">
    <entity name="docs" pk="doc_id"
            query="SELECT doc_id, type FROM bioscope.docs">
      <field column="doc_id" name="doc_id"/>
      <field column="type" name="doc_type"/>
      <entity name="codes"
              query="SELECT id, origin, type, code FROM bioscope.codes WHERE doc_id='${docs.doc_id}'">
        <field column="origin" name="code_origin"/>
        <field column="type" name="code_type"/>
        <field column="code" name="code_value"/>
      </entity>
      <entity name="notes"
              query="SELECT id, origin, type, text FROM bioscope.texts WHERE doc_id='${docs.doc_id}'">
        <field column="origin" name="note_origin"/>
        <field column="type" name="note_type"/>
        <field column="text" name="note_text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

I would like to do the same thing with an XML file containing the same data as 
is in the database. That XML file looks like this:

<docs>
  <doc id="97634811" type="RADIOLOGY_REPORT">
    <codes>
      <code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY3" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY1" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY2" type="ICD-9-CM">786.2</code>
    </codes>
    <texts>
      <text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY">Seventeen year old with cough.</text>
      <text origin="CCHMC_RADIOLOGY" type="IMPRESSION">Normal.</text>
    </texts>
  </doc>
  ....
</docs>

I tried using this data-config.xml file, in order to preserve the nested entity 
structure used with the database case:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true"
            forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <entity name="code" processor="XPathEntityProcessor" stream="true"
              forEach="/docs/doc[@id='${doc.doc_id}']/codes/code"
              url="C:/data/bioscope.xml">
        <field column="code_origin"
               xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@origin"/>
        <field column="code_type"
               xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@type"/>
        <field column="code_value"
               xpath="/docs/doc[@id='${doc.doc_id}']/codes/code"/>
      </entity>
      <entity name="note" processor="XPathEntityProcessor" stream="true"
              forEach="/docs/doc[@id='${doc.doc_id}']/texts/text"
              url="C:/data/bioscope.xml">
        <field column="note_origin"
               xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@origin"/>
        <field column="note_type"
               xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@type"/>
        <field column="note_text"
               xpath="/docs/doc[@id='${doc.doc_id}']/texts/text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

This does not work: none of the <codes> and <texts> blocks in the XML file are 
indexed. I'm sure that part of the problem must be that XPath expressions such 
as "/docs/doc[@id='${doc.doc_id}']/texts/text/@origin" fail to match anything 
in the XML file, because when I try the same import without nested entities, 
using this data-config.xml file, the <codes> and <texts> blocks are also not 
indexed:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true"
            forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <field column="code_origin"
             xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@origin"/>
      <field column="code_type"
             xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@type"/>
      <field column="code_value"
             xpath="/docs/doc[@id='${doc.doc_id}']/codes/code"/>
      <field column="note_origin"
             xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@origin"/>
      <field column="note_type"
             xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@type"/>
      <field column="note_text"
             xpath="/docs/doc[@id='${doc.doc_id}']/texts/text"/>
    </entity>
  </document>
</dataConfig>
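
One way to check that explanation outside Solr is to evaluate the same 
predicate path with a general-purpose XPath implementation. The sketch below 
uses Python's standard xml.etree.ElementTree on a trimmed copy of the sample 
document. It says nothing about which XPath subset XPathEntityProcessor 
accepts, and substituting an empty string for an unresolved ${doc.doc_id} is 
only an assumption made for illustration. The helper name match_codes is 
hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document from this post.
SAMPLE = """<docs>
  <doc id="97634811" type="RADIOLOGY_REPORT">
    <codes>
      <code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY3" type="ICD-9-CM">786.2</code>
    </codes>
    <texts>
      <text origin="CCHMC_RADIOLOGY" type="IMPRESSION">Normal.</text>
    </texts>
  </doc>
</docs>"""

root = ET.fromstring(SAMPLE)  # root is the <docs> element


def match_codes(doc_id):
    """Evaluate /docs/doc[@id='<doc_id>']/codes/code relative to <docs>."""
    return root.findall(f"doc[@id='{doc_id}']/codes/code")


# Placeholder replaced by a real id: the predicate selects the <code> elements.
resolved = match_codes("97634811")

# ASSUMPTION: an unresolved placeholder behaves like an empty string.
unresolved = match_codes("")

print(len(resolved), len(unresolved))
```

If the path matches here once a real id is filled in, the expression itself is 
sound, and the place to look is how the placeholder is resolved, or whether 
the processor accepts that predicate form at all.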

However, when I use this data-config.xml file, which doesn't use nested 
entities, all of the fields are included in the index:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true"
            forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <field column="code_origin" xpath="/docs/doc/codes/code/@origin"/>
      <field column="code_type" xpath="/docs/doc/codes/code/@type"/>
      <field column="code_value" xpath="/docs/doc/codes/code"/>
      <field column="note_origin" xpath="/docs/doc/texts/text/@origin"/>
      <field column="note_type" xpath="/docs/doc/texts/text/@type"/>
      <field column="note_text" xpath="/docs/doc/texts/text"/>
    </entity>
  </document>
</dataConfig>

but I don't think this preserves the grouping in the input XML file: nothing 
ties a given code_origin value to the code_type and code_value from the same 
<code> element, or a given note_origin value to the note_type and note_text 
from the same <text> element.
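
To make the concern concrete, here is a small sketch in Python (standard 
library only, not Solr) that builds both shapes from a trimmed copy of the 
sample document. The function names flatten and group are hypothetical. In the 
flat shape, values from the same element are related only by sharing a list 
position, which holds only as long as every element supplies every attribute. 
Whether the import preserves that order is not something this sketch shows.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document from this post.
SAMPLE = """<docs>
  <doc id="97634811" type="RADIOLOGY_REPORT">
    <codes>
      <code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY3" type="ICD-9-CM">786.2</code>
    </codes>
    <texts>
      <text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY">Seventeen year old with cough.</text>
      <text origin="CCHMC_RADIOLOGY" type="IMPRESSION">Normal.</text>
    </texts>
  </doc>
</docs>"""


def flatten(doc):
    """One independent list per field, like separate multi-valued fields."""
    codes = doc.findall("codes/code")
    notes = doc.findall("texts/text")
    return {
        "doc_id": doc.get("id"),
        "doc_type": doc.get("type"),
        "code_origin": [c.get("origin") for c in codes],
        "code_type": [c.get("type") for c in codes],
        "code_value": [c.text for c in codes],
        "note_origin": [n.get("origin") for n in notes],
        "note_type": [n.get("type") for n in notes],
        "note_text": [n.text for n in notes],
    }


def group(doc):
    """One record per <code> and <text> element, keeping its values together."""
    return {
        "doc_id": doc.get("id"),
        "doc_type": doc.get("type"),
        "codes": [
            {"origin": c.get("origin"), "type": c.get("type"), "value": c.text}
            for c in doc.findall("codes/code")
        ],
        "notes": [
            {"origin": n.get("origin"), "type": n.get("type"), "text": n.text}
            for n in doc.findall("texts/text")
        ],
    }


doc = ET.fromstring(SAMPLE).find("doc")
flat = flatten(doc)
nested = group(doc)

print(flat["note_type"])    # parallel list, related to note_text only by index
print(nested["notes"][1])   # a single record holding origin, type and text
```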

It has taken me a while to get this far, and obviously I don't have it right 
yet. Can anybody help me define a data-config.xml file with nested entities for 
indexing an XML file?
Thanks,
Mike
