Hi List,

My Solr instance is set up to index PST files with DIH, TikaEntityProcessor, and
OutlookPSTParser. After running an import, I can see that the index contains the
top-level information of the PST file (e.g. unique id of each message, header,
PST file size), but the messages themselves are missing. I suspect that I need
to instruct Solr to recurse to the next level during indexing in the DIH config
file, but I don’t know how. My DIH config file looks like this:

<dataSource name="bin" type="BinFileDataSource" />
<document>
        <entity name="files" dataSource="bin" rootEntity="false"
                processor="FileListEntityProcessor" baseDir="/PST_Path"
                fileName=".*" onError="abort" recursive="true">
                <entity pk="uri" name="file" dataSource="bin"
                        processor="TikaEntityProcessor"
                        url="${files.fileAbsolutePath}" format="xml"
                        rootEntity="true" onError="skip" recursive="true"
                        parser="org.apache.tika.parser.mbox.OutlookPSTParser">
                        <!-- I think I need to insert another entity here to
                             parse/index the actual messages, but I don't know
                             how to craft one -->
                </entity>
        </entity>
</document>

Any ideas?

Thank you,
Anton
