Hi all,

I use TikaEntityProcessor to extract text content from binary or text files.
But when I try to extract Japanese characters from an HTML file whose character encoding is SJIS, the extracted content is garbled. With UTF-8 files, it works fine.

The Data Import Handler configuration is as follows:

--- from here ---
<dataConfig>
  <dataSource name="ds-db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/bbs"
              user="root" password="xxxx"/>
  <dataSource name="ds-file" type="BinFileDataSource"/>
  <document>
    <entity name="messages" dataSource="ds-db" pk="id"
            query="select id,title from messages">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <entity name="contents" dataSource="ds-db" pk="id"
              query="select id,path from contents where id=${messages.id}">
        <entity name="file" dataSource="ds-file"
                processor="TikaEntityProcessor"
                url="${contents.path}" format="text">
          <field column="text" name="content" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
--- to here ---

How can I solve this?