Hi all,

I use TikaEntityProcessor to extract the text content from binary or text
files.

But when I try to extract Japanese characters from an HTML file whose
character encoding is SJIS, the content is garbled. With UTF-8, it works
well.
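
To show the kind of garbling I mean, here is a minimal Python sketch of a
charset mismatch. It is only an illustration under my own assumption that the
SJIS bytes are being decoded with the wrong charset somewhere; it is not the
actual Tika or Data Import Handler code path, and the sample string and the
helper name decode_bytes are made up for the example.

```python
# Illustration only: NOT what Tika/DIH does internally.
# Assumption: the garbling comes from decoding SJIS bytes with another charset.

ORIGINAL = "日本語のテキスト"  # hypothetical sample text


def decode_bytes(raw: bytes, charset: str) -> str:
    """Decode raw bytes with the given charset, replacing undecodable bytes."""
    return raw.decode(charset, errors="replace")


# The same text as it would be stored in an SJIS file and in a UTF-8 file.
sjis_bytes = ORIGINAL.encode("shift_jis")
utf8_bytes = ORIGINAL.encode("utf-8")

# Decoding SJIS bytes as UTF-8 yields replacement characters (garbled text).
garbled = decode_bytes(sjis_bytes, "utf-8")

# Decoding with the charset that matches the file recovers the text.
correct = decode_bytes(sjis_bytes, "shift_jis")

if __name__ == "__main__":
    print("SJIS bytes read as UTF-8:    ", garbled)
    print("SJIS bytes read as Shift_JIS:", correct)
    print("UTF-8 bytes read as UTF-8:   ", decode_bytes(utf8_bytes, "utf-8"))
```

The UTF-8 file decodes cleanly in this sketch, which matches what I see: only
the SJIS file comes out garbled.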

The Data Import Handler configuration is as follows.

--- from here ---
<dataConfig>
  <dataSource name="ds-db"
              type="JdbcDataSource" 
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/bbs" 
              user="root" 
              password="xxxx"/>
  <dataSource name="ds-file" type="BinFileDataSource"/>

  <document>
    <entity name="messages"
            dataSource="ds-db"
            pk="id"
            query="select id,title from messages">
      <field column="id" name="id"/>
      <field column="title" name="title"/>

      <entity name="contents"
              dataSource="ds-db"
              pk="id"
              query="select id,path from contents where id=${messages.id}">

        <entity name="file"
                dataSource="ds-file"
                processor="TikaEntityProcessor"
                url="${contents.path}"
                format="text">
          <field column="text" name="content" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
--- to here ---

How do I solve this?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Japanese-character-is-garbled-when-using-TikaEntityProcessor-tp4329217.html
Sent from the Solr - User mailing list archive at Nabble.com.