All,

I'm not sure whether this is obvious (it wasn't to me), but while trying to index some international characters from XML files using the DIH, I found that setting the encoding attribute on the dataSource element to "UTF-8" fixed my problem:
  <dataSource type="FileDataSource" encoding="UTF-8"/>

My question is why the default isn't UTF-8. Or, if there is a good reason for the current default, can the DIH wiki be made clearer that this encoding attribute can affect the indexing of international characters? If I can get access to edit the wiki page, I can add a section to that effect, perhaps under a troubleshooting section.

Thanks!
Amit
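
P.S. In case it helps anyone who hits the same thing, here is a minimal sketch of where the attribute sits in a full data-config.xml. Only the dataSource line is taken from my actual setup; the entity name, file path, forEach expression, and field names are hypothetical placeholders that you would replace with your own. My guess is that without the attribute the files are read with the JVM's platform default charset, but I have not confirmed that.

```xml
<dataConfig>
  <!-- encoding tells the DIH which charset to use when reading each file -->
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Hypothetical entity: path, forEach, and field names are placeholders -->
    <entity name="doc"
            processor="XPathEntityProcessor"
            url="/path/to/docs/sample.xml"
            forEach="/docs/doc">
      <field column="id" xpath="/docs/doc/id"/>
      <field column="title" xpath="/docs/doc/title"/>
    </entity>
  </document>
</dataConfig>
```

The encoding value should match the encoding the XML files were actually saved in; "UTF-8" worked for me because my files are UTF-8.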