Maybe you need to add "; charset=UTF-8" to your Content-type:
curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml; charset=UTF-8"
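
The semicolon, space, and equals sign inside the content type could be misread as URL syntax by a shell or server, so it may be safer to percent-encode the parameter values. A minimal sketch of building the same URL, assuming the host, port, and file path used in this thread (whether Solr then accepts the document is not something this snippet checks):

```python
from urllib.parse import urlencode

# Parameters copied from the curl command in this thread; the host, port,
# and file path are specific to that setup.
params = {
    "commit": "true",
    "stream.file": "c:/solr/tml/exampledocs/57917486.xml",
    "stream.contentType": "application/xml; charset=UTF-8",
}

# urlencode percent-encodes ';', '=', '/' and ':' in the values and turns
# the space into '+', so the content type travels as a single value.
url = "http://localhost:8080/solr/update/?" + urlencode(params)
print(url)
```

The resulting string can be passed to curl in double quotes in place of the hand-written URL.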
-- Jack Krupansky
-----Original Message-----
From: John Randall
Sent: Monday, July 08, 2013 6:43 PM
To: solr-user@lucene.apache.org
Subject: Indexing fails for docs with high Latin1 chars
I'm new to Solr, so I'm probably missing something. So far I've successfully
indexed .xml docs that contain only low ASCII chars. However, when I try to add
a doc that has Latin1 chars with diacritics, indexing fails. I've tried the
post.jar from the Jetty exampledocs, curl, and a direct request from a browser.
All three of the following methods work fine when the docs contain only ASCII
32-126:
From a browser:
http://localhost:8080/solr/update/?stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml
Using cURL:
curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml"
Using post.jar from the exampledocs directory:
java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486
java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486.xml
I've tried other things: e.g., I've added the following line to the Tomcat
server.xml file, <Connector .../> section.
URIEncoding="UTF-8"
I've also copied some characters out of the utf8-example.xml file that came
with the Jetty app. It still fails. I also changed the offending characters
to their Unicode equivalent: e.g., N with tilde to &#209; and &#xD1;
without success. For N with tilde and e with acute I get the following
message:
HTTP Status 400 - Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
________________________________
type Status report
message Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
description The request sent by the client was syntactically incorrect.
________________________________
Apache Tomcat/7.0.40
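
One possible reading of this message: 0x4f is the ASCII code for "O", the letter that follows Ñ in NIÑO. That would fit a file whose bytes are Latin-1 even though its XML declaration says UTF-8. This is an assumption, since the thread does not confirm how the file was saved. A small sketch that reproduces the same kind of failure:

```python
# "NIÑO" as an editor would store it when saving in Latin-1 (ISO-8859-1).
latin1_bytes = "NI\u00d1O".encode("latin-1")
print(" ".join(f"{b:02x}" for b in latin1_bytes))

# 0xd1 looks like the lead byte of a two-byte UTF-8 sequence, so a UTF-8
# decoder expects a continuation byte next and finds 0x4f ('O') instead.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    bad_offset = exc.start
    print("UTF-8 decoding failed at byte offset", bad_offset)

# The same text saved as real UTF-8 stores N-tilde as two bytes, c3 91.
utf8_bytes = "NI\u00d1O".encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))
```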
The file I am trying to add is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
<field name="id">57917486</field>
<field name="descrip_fw">NIÑO VOLANTE YOUNG FLYER</field>
</doc>
</add>
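
Since the file declares encoding="UTF-8", its bytes on disk need to be UTF-8 as well. A minimal sketch for checking that and converting when it is not, assuming the editor saved the file as Windows-1252 (a guess for a Windows XP machine); the helper name `ensure_utf8` is hypothetical:

```python
def ensure_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw unchanged if it is valid UTF-8, otherwise re-encode it.

    `fallback` is the encoding the bytes are assumed to be in when they
    are not valid UTF-8; cp1252 is only a guess.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


# Simulate the field line as a Latin-1 editor would have written it.
sample = '<field name="descrip_fw">NI\u00d1O VOLANTE</field>'.encode("latin-1")
fixed = ensure_utf8(sample)
print(fixed.decode("utf-8"))
```

To apply this to the real document, read the file in binary mode, pass its contents through the helper, and write the result back in binary mode before posting it again.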
My schema.xml file contains the following field types:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<!--For descrip_fw field (and trailing wildcard searches):-->
<fieldType name="search_fw" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
maxGramSize="20" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!-- For leading wildcard searches, I've added the following field type,
populated through a copy field:
-->
<fieldType name="search_rev" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
maxGramSize="20" side="back"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
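
To make the intent of the two index-time chains concrete, here is a rough Python approximation of what they would produce once a document is accepted. It is a sketch, not Solr's code: accent folding is done with Unicode decomposition instead of the mapping-ISOLatin1Accent.txt rules, and the function names are hypothetical.

```python
import unicodedata


def fold_accents(text):
    # Approximates the MappingCharFilter step: decompose each character
    # and drop the combining marks, so N-tilde becomes plain N.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def edge_ngrams(token, min_size=1, max_size=20, side="front"):
    # Prefixes of the token for side="front", suffixes for side="back".
    upper = min(max_size, len(token))
    if side == "front":
        return [token[:n] for n in range(min_size, upper + 1)]
    return [token[-n:] for n in range(min_size, upper + 1)]


def index_terms(text, side):
    # Char filter, whitespace tokenizer, lowercase, then edge n-grams.
    terms = []
    for token in fold_accents(text).split():
        terms.extend(edge_ngrams(token.lower(), side=side))
    return terms


print(index_terms("NI\u00d1O", "front"))
print(index_terms("NI\u00d1O", "back"))
```

The front grams are what let a trailing-wildcard query such as nin* match as a plain term, and the back grams do the same for a leading wildcard such as *ino.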
My schema.xml file contains the following pertinent fields:
<field name="id" type="string" indexed="true" stored="true"
required="true"/>
<field name="descrip_fw" type="search_fw" indexed="true" stored="false"
required="false"/>
<copyField source="descrip_fw" dest="descrip_rev"/>
Also, I am using Tomcat as container on a Windows XP SP3 machine.
As I said, this all works as long as the docs contain no high Latin1
characters.
I'd appreciate any ideas you may have.