Re: Indexing fails for docs with high Latin1 chars

Jack Krupansky Mon, 08 Jul 2013 16:23:51 -0700

Maybe you need to add "; charset=UTF-8" to your Content-type:

curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml;charset=UTF-8"
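
For anyone scripting this request rather than typing it, here is a minimal Python sketch that builds the same URL with the parameter values percent-encoded, so the semicolon in the content type and the Windows path stay unambiguous. The host, port, and file path are copied from the command above; the variable names are only illustrative, and the sketch prints the URL rather than sending it.

```python
from urllib.parse import urlencode

# Values taken from the curl command above.
base = "http://localhost:8080/solr/update/"
params = {
    "commit": "true",
    "stream.file": "c:/solr/tml/exampledocs/57917486.xml",
    "stream.contentType": "application/xml;charset=UTF-8",
}

# urlencode percent-encodes reserved characters such as ';', ':' and '/'
# inside each value, so they cannot be mistaken for URL delimiters.
url = base + "?" + urlencode(params)
print(url)
```

The server should decode the parameters back to the same values, so this is meant to be equivalent to the hand-written URL, not a different request.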


-- Jack Krupansky

-----Original Message-----
From: John Randall

Sent: Monday, July 08, 2013 6:43 PM
To: solr-user@lucene.apache.org
Subject: Indexing fails for docs with high Latin1 chars

I'm new to Solr, so I'm probably missing something. So far I've successfully indexed .xml docs with low ASCII chars. However, when I try to add a doc that has Latin1 chars with diacritics, it fails. I've tried using the Jetty exampledocs post.jar, as well as using curl and directly from a browser. All three of the following methods work fine when the docs contain ASCII 32-126:

From a browser:

http://localhost:8080/solr/update/?stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml


Using cURL:

curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml"


Using post.jar from the exampledocs directory:
java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486

java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486.xml

I've tried other things: e.g., I've added the following attribute to the <Connector .../> element in the Tomcat server.xml file:

URIEncoding="UTF-8"

I've also copied some characters out of the utf8-example.xml file that came with the Jetty app. It still fails. I also changed the offending characters to their Unicode equivalent: e.g., N with tilde to Ñ and Ñ, without success. For N with tilde and e with acute I get the following message:


HTTP Status 400 - Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)

________________________________

type Status report
message Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
description The request sent by the client was syntactically incorrect.

________________________________

Apache Tomcat/7.0.40
The file I am trying to add is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
  <field name="id">57917486</field>
  <field name="descrip_fw">NIÑO VOLANTE YOUNG FLYER</field>
 </doc>
</add>
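
One possible reading of the error message, offered as a sketch rather than a confirmed diagnosis: 0x4f is the ASCII code for "O", the letter that follows Ñ in NIÑO. That is what a UTF-8 decoder would report if the file on disk were saved as Latin-1 (or Windows-1252) while its XML declaration says UTF-8. The post does not say how the file was saved, so the Latin-1 encoding below is an assumption, and the helper name is hypothetical.

```python
# Assumption: the file was saved as Latin-1 although it declares UTF-8.
# The text is the descrip_fw value from the document above.
text = "NIÑO VOLANTE YOUNG FLYER"

latin1_bytes = text.encode("latin-1")  # Ñ becomes the single byte 0xD1
utf8_bytes = text.encode("utf-8")      # Ñ becomes the two bytes 0xC3 0x91


def utf8_decode_error(data):
    """Return the UnicodeDecodeError raised when decoding data as UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


error = utf8_decode_error(latin1_bytes)
if error is not None:
    # 0xD1 announces a two-byte UTF-8 sequence, so the decoder expects a
    # continuation byte next; it finds 0x4F ("O") instead.
    lead = latin1_bytes[error.start]
    following = latin1_bytes[error.start + 1]
    print(f"decode fails at offset {error.start}: 0x{lead:02x} followed by 0x{following:02x}")

# The same text encoded as real UTF-8 decodes cleanly, so this prints None.
print(utf8_decode_error(utf8_bytes))
```

If this is what is happening, re-saving the file as UTF-8 in the editor, or changing the declaration to match the actual encoding, would be the things to try.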



My schema.xml file contains the following field types:
   <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

  <!--For descrip_fw field (and trailing wildcard searches):-->

<fieldType name="search_fw" class="solr.TextField" positionIncrementGap="100">

   <analyzer type="index">

   <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>

   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front"/>

   </analyzer>
   <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

<!-- For leading wildcard searches, I've added the following field type, populated through a copy field: -->

<fieldType name="search_rev" class="solr.TextField" positionIncrementGap="100">

   <analyzer type="index">

   <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>

   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="back"/>

   </analyzer>
   <analyzer type="query">
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>
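
To make the intent of the two index-time analyzers concrete, here is a rough Python sketch of the terms they would be expected to produce for one token. It is not Solr code. The accent folding uses Unicode decomposition as a stand-in for mapping-ISOLatin1Accent.txt, which is an assumption, and the function names are hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks; a stand-in for the mapping char filter."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_size=1, max_size=20, side="front"):
    """Return prefixes (side='front') or suffixes (side='back') of token."""
    upper = min(max_size, len(token))
    if side == "front":
        return [token[:n] for n in range(min_size, upper + 1)]
    return [token[-n:] for n in range(min_size, upper + 1)]


def index_terms(text, side):
    """Fold accents, split on whitespace, lowercase, then emit edge n-grams."""
    tokens = fold_accents(text).split()
    return [gram for tok in tokens for gram in edge_ngrams(tok.lower(), side=side)]


print(index_terms("NIÑO", "front"))  # ['n', 'ni', 'nin', 'nino']
print(index_terms("NIÑO", "back"))   # ['o', 'no', 'ino', 'nino']
```

Front grams are what let a trailing-wildcard query such as nin* match as a plain term, and back grams do the same for a leading wildcard such as *ino.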



My schema.xml file contains the following pertinent fields:

  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="descrip_fw" type="search_fw" indexed="true" stored="false" required="false"/>

  <copyField source="descrip_fw" dest="descrip_rev"/>


Also, I am using Tomcat as the container on a Windows XP SP3 machine.

As I said, this all works as long as the docs contain no high Latin1 characters.

I'd appreciate any ideas you may have.
