Please open an issue on Tika's JIRA and share the triggering file if possible.
If we can examine the file, we may be able to recommend alternate ways to
configure Tika's encoding detectors.  We just added configurability to the
encoding detectors, and that will be available in Tika 1.15 [1].

We use a fallback chain of detectors, tried in order: html, universalchardet,
icu4j.  We go with the first one that returns a non-null answer.  This is
perhaps not the best option, but it's what we've been doing for a while.  We
are in the process of reassessing our current methods [2], but that will take
some time.
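
To make the fallback behavior described above concrete, here is a minimal
sketch in plain Java.  It is not Tika's code: the Detector interface,
fromMetaTag, and firstNonNull are hypothetical names made up for
illustration, and the second and third detectors are stubs standing in for
statistical detectors.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FallbackCharsetDetection {

    // Hypothetical detector contract: return a charset, or null for "no answer".
    interface Detector {
        Charset detect(byte[] data);
    }

    // Toy pattern for a charset declaration such as <meta charset="UTF-8">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "charset=[\"']?([A-Za-z0-9][\\w.:+-]*)", Pattern.CASE_INSENSITIVE);

    // Toy "html" detector: scan the head of the file for a declared charset.
    static Charset fromMetaTag(byte[] data) {
        int len = Math.min(data.length, 8192);
        // ISO-8859-1 maps every byte to a char, so the ASCII markup survives.
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return null;
    }

    // Fallback chain: the first detector with a non-null answer wins.
    static Charset firstNonNull(List<Detector> detectors, byte[] data) {
        for (Detector detector : detectors) {
            Charset charset = detector.detect(data);
            if (charset != null) {
                return charset;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Detector> chain = Arrays.asList(
                FallbackCharsetDetection::fromMetaTag,
                data -> null,                         // stub: detector with no answer
                data -> StandardCharsets.ISO_8859_1); // stub: last-resort guess

        byte[] declared = "<meta charset=\"UTF-8\">".getBytes(StandardCharsets.US_ASCII);
        byte[] undeclared = "<p>hello</p>".getBytes(StandardCharsets.US_ASCII);

        System.out.println(firstNonNull(chain, declared));   // UTF-8
        System.out.println(firstNonNull(chain, undeclared)); // ISO-8859-1
    }
}
```

The point of the sketch is that later detectors are never consulted once an
earlier one answers, so a wrong answer early in the chain cannot be corrected
by a better detector further down.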

[1] https://issues.apache.org/jira/browse/TIKA-2273
[2] https://issues.apache.org/jira/browse/TIKA-2038

-----Original Message-----
From: Noriyuki TAKEI [mailto:nta...@sios.com] 
Sent: Monday, April 10, 2017 1:46 PM
To: solr-user@lucene.apache.org
Subject: Japanese character is garbled when using TikaEntityProcessor

Hi all,

I use TikaEntityProcessor to extract the text content from binary or text files.

But when I try to extract Japanese characters from an HTML file whose character
encoding is SJIS, the content is garbled. With UTF-8, it works well.

The Data Import Handler configuration is as follows.

--- from here ---
<dataConfig>
  <dataSource name="ds-db"
              type="JdbcDataSource" 
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/bbs" 
              user="root" 
              password="xxxx"/>
  <dataSource name="ds-file" type="BinFileDataSource"/>

  <document>
    <entity name="messages"
            dataSource="ds-db"
            pk="id"
            query="select id,title from messages">
      <field column="id" name="id"/>
      <field column="title" name="title"/>

      <entity name="contents"
              dataSource="ds-db"
              pk="id"
              query="select id,path from contents where id=${messages.id}">

        <entity name="file"
                dataSource="ds-file"
                processor="TikaEntityProcessor"
                url="${contents.path}"
                format="text">
          <field column="text" name="content" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
--- to here ---

How do I solve this?




