Maybe you need to add "; charset=UTF-8" to your Content-type:

curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml; charset=UTF-8"
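
The space and semicolon in that parameter value may need percent-encoding for the URL to survive intact. A minimal sketch, assuming Python's standard urllib is available, of building the same URL with encoded parameters (nothing here is sent to Solr; the values are copied from the command above):

```python
from urllib.parse import urlencode

# Values copied from the curl command above.
base = "http://localhost:8080/solr/update/"
params = {
    "commit": "true",
    "stream.file": "c:/solr/tml/exampledocs/57917486.xml",
    "stream.contentType": "application/xml; charset=UTF-8",
}

# urlencode percent-encodes reserved characters such as ";" and "="
# and turns the space into "+", so the value arrives as one parameter.
url = base + "?" + urlencode(params)
print(url)
```

The printed URL can then be passed to curl in quotes or pasted into a browser.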

-- Jack Krupansky

-----Original Message----- From: John Randall
Sent: Monday, July 08, 2013 6:43 PM
To: solr-user@lucene.apache.org
Subject: Indexing fails for docs with high Latin1 chars

I'm new to Solr, so I'm probably missing something. So far I've successfully indexed .xml docs with low ASCII chars. However, when I try to add a doc that has Latin1 chars with diacritics, it fails. I've tried using the Jetty exampledocs post.jar, as well as using curl and directly from a browser. All three of the following methods work fine when the docs contain only ASCII characters 32-126:

From a browser:
http://localhost:8080/solr/update/?stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml


Using cURL:
curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml"

Using post.jar from exampledocs directory
java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486

java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486.xml


I've tried other things. For example, I added the following attribute to the <Connector .../> element in the Tomcat server.xml file:
URIEncoding="UTF-8"

I've also copied some characters out of the utf8-example.xml file that came with the Jetty app. It still fails. I also replaced the offending characters with character references, e.g., N with tilde as &#209; and as &Ntilde;, without success. For N with tilde and e with acute I get the following message:

HTTP Status 400 - Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)

________________________________

type Status report
message Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
description The request sent by the client was syntactically incorrect.

________________________________

Apache Tomcat/7.0.40
The file I am trying to add is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
  <field name="id">57917486</field>
  <field name="descrip_fw">NIÑO VOLANTE YOUNG FLYER</field>
 </doc>
</add>
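
The byte 0x4f in the error message is the ASCII code for "O", the letter that follows Ñ in NIÑO. One possible reading, an assumption rather than something confirmed in this thread, is that the file on disk is saved as Latin-1 (or Windows-1252) while its XML declaration says UTF-8. The sketch below shows how the two encodings differ for this field value; the helper is_valid_utf8 is hypothetical and only illustrates the check.

```python
# Field value taken from the document above.
text = "NIÑO VOLANTE YOUNG FLYER"

latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")

# In Latin-1, Ñ is the single byte 0xD1, followed directly by 'O' (0x4F).
print(latin1_bytes[2:4].hex())  # d14f
# In UTF-8, Ñ is the two-byte sequence 0xC3 0x91, then 'O' (0x4F).
print(utf8_bytes[2:5].hex())    # c3914f


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A UTF-8 parser reads 0xD1 as the start of a two-byte sequence and then
# rejects 0x4F because it is not a continuation byte.
print(is_valid_utf8(latin1_bytes))  # False
print(is_valid_utf8(utf8_bytes))    # True
```

Running the same check on the raw bytes of the file (opened in binary mode) would show which case applies. If the file is not valid UTF-8, re-saving it as UTF-8 from the editor would be one thing to try.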



My schema.xml file contains the following field types:
   <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

  <!--For descrip_fw field (and trailing wildcard searches):-->
<fieldType name="search_fw" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- For leading wildcard searches, I've added the following field type, which is populated through a copyField: -->
<fieldType name="search_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="back"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>



My schema.xml file contains the following pertinent fields:
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="descrip_fw" type="search_fw" indexed="true" stored="false" required="false"/>
<copyField source="descrip_fw" dest="descrip_rev"/>


Also, I am using Tomcat as the container on a Windows XP SP3 machine.
As I said, this all works as long as the docs contain no high Latin1 characters.

I'd appreciate any ideas you may have.
