I want Tika to index only the content inside <div id="content">...</div> for the field "text". Unfortunately, it is indexing the whole page. Can't XPath do this?
data-config.xml:

    <dataConfig>
      <dataSource type="BinFileDataSource" name="data"/>
      <dataSource type="BinURLDataSource" name="dataUrl"/>
      <dataSource type="URLDataSource" name="main"/>
      <document>
        <entity name="rec" processor="XPathEntityProcessor"
                url="http://127.0.0.1/tkb/internet/docImportUrl.xml"
                forEach="/docs/doc" dataSource="main">
          <!--transformer="script:GenerateId"-->
          <field column="title" xpath="//title" />
          <field column="id" xpath="//id" />
          <field column="file" xpath="//file" />
          <field column="path" xpath="//path" />
          <field column="url" xpath="//url" />
          <field column="Author" xpath="//author" />
          <entity name="tika" processor="TikaEntityProcessor"
                  url="${rec.path}${rec.file}" dataSource="dataUrl"
                  onError="skip" htmlMapper="identity" format="html">
            <field column="text" xpath="//div[@id='content']" />
          </entity>
        </entity>
      </document>
    </dataConfig>