Please open an issue on Tika's JIRA and share the triggering file if possible. If we can get our hands on the file, we may be able to recommend alternate ways to configure Tika's encoding detectors. We just added configurability to the encoding detectors, which will be available in Tika 1.15 [1].
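
To make concrete why detector configuration matters: this message describes the current behavior as a fallback chain (html, universalchardet, icu4j) in which the first non-null answer wins. Below is a minimal sketch of that selection logic in plain Java. It is not Tika's actual API: the Detector interface, the stub detectors, and defaultChain() are hypothetical names introduced only for illustration, and the stubs do not reproduce how the real libraries decide.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FallbackDetectorSketch {

    /** Hypothetical stand-in for an encoding detector; returns null when it has no answer. */
    public interface Detector {
        Charset detect(byte[] input);
    }

    private static final Pattern META_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Stub "html" detector: looks for a charset declaration in the first 1024 bytes. */
    public static Charset htmlMetaStub(byte[] input) {
        String head = new String(input, 0, Math.min(input.length, 1024),
                StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return null; // no declaration found, so no answer
        }
        try {
            String name = m.group(1);
            return Charset.isSupported(name) ? Charset.forName(name) : null;
        } catch (IllegalCharsetNameException e) {
            return null; // declared name is not a legal charset name
        }
    }

    /** Tries each detector in order and returns the first non-null answer, if any. */
    public static Optional<Charset> detectFirstNonNull(List<Detector> detectors, byte[] input) {
        for (Detector detector : detectors) {
            Charset guess = detector.detect(input);
            if (guess != null) {
                return Optional.of(guess);
            }
        }
        return Optional.empty();
    }

    /** Hypothetical three-stage chain mirroring the order html, universalchardet, icu4j. */
    public static List<Detector> defaultChain() {
        List<Detector> chain = new ArrayList<>();
        chain.add(FallbackDetectorSketch::htmlMetaStub);   // stands in for the html detector
        chain.add(input -> null);                          // stub that never answers
        chain.add(input -> StandardCharsets.UTF_8);        // stub that always answers
        return chain;
    }

    public static void main(String[] args) {
        byte[] declared = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=Shift_JIS\">"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] undeclared = "<html><body>no declaration</body></html>"
                .getBytes(StandardCharsets.US_ASCII);

        System.out.println("declared:   " + detectFirstNonNull(defaultChain(), declared));
        System.out.println("undeclared: " + detectFirstNonNull(defaultChain(), undeclared));
    }
}
```

The point of the sketch is that a later detector is only consulted when every earlier one returns null, so a wrong but non-null answer early in the chain is final. That is why being able to reorder or replace detectors can change the outcome for a given file.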
We use a fallback chain of detectors: html, universalchardet, icu4j. We go with the first one that returns a non-null answer. This is perhaps not the best option, but it is what we have been doing for a while. We are in the process of reassessing our current methods [2], but that will take some time.

[1] https://issues.apache.org/jira/browse/TIKA-2273
[2] https://issues.apache.org/jira/browse/TIKA-2038

-----Original Message-----
From: Noriyuki TAKEI [mailto:nta...@sios.com]
Sent: Monday, April 10, 2017 1:46 PM
To: solr-user@lucene.apache.org
Subject: Japanese character is garbled when using TikaEntityProcessor

Hi all,

I use TikaEntityProcessor to extract the text content from binary or text files. But when I try to extract Japanese characters from an HTML file whose character encoding is SJIS, the content is garbled. With UTF-8, it works well.

The Data Import Handler configuration is as below.

--- from here ---
<dataConfig>
  <dataSource name="ds-db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/bbs"
              user="root" password="xxxx"/>
  <dataSource name="ds-file" type="BinFileDataSource"/>
  <document>
    <entity name="messages" dataSource="ds-db" pk="id"
            query="select id,title from messages">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <entity name="contents" dataSource="ds-db" pk="id"
              query="select id,path from contents where id=${messages.id}">
        <entity name="file" dataSource="ds-file"
                processor="TikaEntityProcessor"
                url="${contents.path}" format="text">
          <field column="text" name="content" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
--- to here ---

How do I solve this?