Re: Indexing fails for docs with high Latin1 chars

John Randall Mon, 08 Jul 2013 17:24:58 -0700

I tried that. It didn't work. I forgot to mention in my first email that I'm 
using Solr 3.6. Would that make a difference?




________________________________
From: Jack Krupansky <j...@basetechnology.com>
To: solr-user@lucene.apache.org; John Randall <jmr...@yahoo.com> 
Sent: Monday, July 8, 2013 7:22 PM
Subject: Re: Indexing fails for docs with high Latin1 chars


Maybe you need to add "; charset=UTF-8" to your Content-type:

curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml;charset=UTF-8"

-- Jack Krupansky

-----Original Message----- 
From: John Randall
Sent: Monday, July 08, 2013 6:43 PM
To: solr-user@lucene.apache.org
Subject: Indexing fails for docs with high Latin1 chars

I'm new to Solr, so I'm probably missing something. So far I've successfully 
indexed .xml docs with low ASCII chars. However, when I try to add a doc that 
has Latin1 chars with diacritics, it fails. I've tried using the Jetty 
exampledocs post.jar, as well as using curl and directly from a browser. All 
three of the following methods work fine when the docs contain ASCII 32-126:

From a browser:
http://localhost:8080/solr/update/?stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml


Using cURL:
curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml"

Using post.jar from the exampledocs directory:
java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486

java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486.xml


I've tried other things: e.g., I've added the following line to the Tomcat 
server.xml file, <Connector .../> section.
URIEncoding="UTF-8"

I've also copied some characters out of the utf8-example.xml file that came 
with the Jetty app. It still fails. I also changed the offending characters 
to their unicode equivalent: e.g., N with tilde to &#209; and &Ntilde; 
without success. For N with tilde and e with acute I get the following 
message:

HTTP Status 400 - Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)

________________________________

type Status report
message Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
description The request sent by the client was syntactically incorrect.

________________________________

Apache Tomcat/7.0.40

The file I am trying to add is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
  <field name="id">57917486</field>
  <field name="descrip_fw">NIÑO VOLANTE YOUNG FLYER</field>
  </doc>
</add>
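
One possible reading of the error above, offered as an assumption rather than a confirmed diagnosis: the file may be saved as Latin-1 (or Windows-1252) even though its XML declaration says UTF-8. In Latin-1, Ñ is the single byte 0xD1, which a UTF-8 parser treats as the start of a two-byte sequence. The next byte in NIÑO is "O" (0x4F), which is not a valid continuation byte, and 0x4f is the byte named in the error message. The Python sketch below reproduces this on the string alone; `first_invalid_utf8_offset` is a hypothetical helper name, not part of Solr or Tomcat.

```python
# Assumption: the file on disk is Latin-1 encoded despite declaring UTF-8.
TEXT = "NI\u00d1O"  # "NIÑO", written with an escape to avoid source-encoding issues

latin1_bytes = TEXT.encode("latin-1")  # Ñ becomes the single byte 0xD1
utf8_bytes = TEXT.encode("utf-8")      # Ñ becomes the two bytes 0xC3 0x91


def first_invalid_utf8_offset(data):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


if __name__ == "__main__":
    print("Latin-1 bytes:", latin1_bytes.hex(" "))
    print("UTF-8 bytes:  ", utf8_bytes.hex(" "))
    print("Latin-1 bytes fail at offset:", first_invalid_utf8_offset(latin1_bytes))
    print("UTF-8 bytes fail at offset:  ", first_invalid_utf8_offset(utf8_bytes))
```

If the same check on the raw bytes of the real file reports an offset, the file is not valid UTF-8, and changing the request's charset parameter alone would not be expected to help.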



My schema.xml file contains the following field types:
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

  <!--For descrip_fw field (and trailing wildcard searches):-->
  <fieldType name="search_fw" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- For leading wildcard searches, I've added the following copy field type using a copy field: -->
  <fieldType name="search_rev" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="back"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
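
For readers less familiar with edge n-grams, here is a rough Python illustration of what the two index-time analyzers above would be expected to emit for one token. It is a simplified sketch, not Solr's implementation: `edge_ngrams` is a hypothetical function, and the input "nino" assumes the mapping char filter and the lowercase filter have already turned NIÑO into nino.

```python
from typing import List


def edge_ngrams(token: str, min_size: int = 1, max_size: int = 20,
                side: str = "front") -> List[str]:
    """Return the edge n-grams of token, anchored at the front or the back."""
    if side not in ("front", "back"):
        raise ValueError("side must be 'front' or 'back'")
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    longest = min(max_size, len(token))
    grams = []
    for n in range(min_size, longest + 1):
        # Front grams are prefixes of the token; back grams are suffixes.
        grams.append(token[:n] if side == "front" else token[-n:])
    return grams


if __name__ == "__main__":
    print(edge_ngrams("nino", side="front"))  # prefixes, as in search_fw
    print(edge_ngrams("nino", side="back"))   # suffixes, as in search_rev
```

The front grams are what let a trailing-wildcard style query match without an actual wildcard, and the back grams do the same for leading wildcards.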



My schema.xml file contains the following pertinent fields:
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="descrip_fw" type="search_fw" indexed="true" stored="false" required="false"/>
  <copyField source="descrip_fw" dest="descrip_rev"/>


Also, I am using Tomcat as the container on a Windows XP SP3 machine.
As I said, this all works as long as the docs contain no high Latin1 
characters.

I'd appreciate any ideas you may have.
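
If the document file turns out to be saved in a Windows code page rather than UTF-8 (an assumption, not something established in this thread), one option is to re-encode it before indexing. A minimal Python sketch follows. The helper names and the default source encoding `cp1252` are illustrative choices, so adjust the encoding to whatever the editor actually used.

```python
from pathlib import Path


def reencode_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from the assumed source encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src: str, dst: str, source_encoding: str = "cp1252") -> None:
    """Hypothetical helper: read src as raw bytes and write a UTF-8 copy to dst."""
    Path(dst).write_bytes(reencode_to_utf8(Path(src).read_bytes(), source_encoding))


if __name__ == "__main__":
    # b"NI\xd1O" is NIÑO as it would be stored in Latin-1 or Windows-1252.
    print(reencode_to_utf8(b"NI\xd1O"))
```

Writing to a separate destination file keeps the original untouched, which makes it easy to compare the two byte for byte before posting the converted copy.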
