I tried that. It didn't work. I forgot to mention in my first email that I'm using Solr 3.6. Would that make a difference?
________________________________
From: Jack Krupansky <j...@basetechnology.com>
To: solr-user@lucene.apache.org; John Randall <jmr...@yahoo.com>
Sent: Monday, July 8, 2013 7:22 PM
Subject: Re: Indexing fails for docs with high Latin1 chars

Maybe you need to add "; charset=UTF-8" to your Content-type:

    curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml; charset=UTF-8"

-- Jack Krupansky

-----Original Message-----
From: John Randall
Sent: Monday, July 08, 2013 6:43 PM
To: solr-user@lucene.apache.org
Subject: Indexing fails for docs with high Latin1 chars

I'm new to Solr, so I'm probably missing something. So far I've successfully indexed .xml docs that contain only low ASCII chars. However, when I try to add a doc that has Latin1 chars with diacritics, it fails. I've tried using the Jetty exampledocs post.jar, as well as using curl and posting directly from a browser.

All three of the following methods work fine when the docs contain only ASCII 32-126:

From a browser:

    http://localhost:8080/solr/update/?stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml

Using cURL:

    curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml"

Using post.jar from the exampledocs directory:

    java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486
    java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486.xml

I've tried other things. For example, I've added the following attribute to the <Connector .../> section of the Tomcat server.xml file:

    URIEncoding="UTF-8"

I've also copied some characters out of the utf8-example.xml file that came with the Jetty app. It still fails. I also changed the offending characters to their Unicode equivalent: e.g., N with tilde to Ñ and Ñ without success.

For N with tilde and e with acute I get the following message:

    HTTP Status 400 - Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
    ________________________________
    type Status report
    message Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
    description The request sent by the client was syntactically incorrect.
    ________________________________
    Apache Tomcat/7.0.40

The file I am trying to add is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <add>
      <doc>
        <field name="id">57917486</field>
        <field name="descrip_fw">NIÑO VOLANTE YOUNG FLYER</field>
      </doc>
    </add>

My schema.xml file contains the following fieldtypes:

    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

    <!--For descrip_fw field (and trailing wildcard searches):-->
    <fieldType name="search_fw" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- For leading wildcard searches, I've added the following copy field type using a copy field: -->
    <fieldType name="search_rev" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="back"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

My schema.xml file contains the following
pertinent fields:

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="descrip_fw" type="search_fw" indexed="true" stored="false" required="false"/>
    <copyField source="descrip_fw" dest="descrip_rev"/>

Also, I am using Tomcat as the container on a Windows XP SP3 machine. As I said, this all works as long as the docs contain no high Latin1 characters. I'd appreciate any ideas you may have.
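
One possible reading of the "Invalid UTF-8 middle byte 0x4f" message, offered as a sketch and not a confirmed diagnosis: 0x4F is the letter "O", which directly follows the Ñ in "NIÑO". If the file on disk were saved as Latin1 (or Windows-1252) while its XML declaration says UTF-8, the Ñ would be the single byte 0xD1. A UTF-8 parser treats 0xD1 as the start of a two-byte sequence and then rejects the "O" as a continuation byte. The Python snippet below reproduces that situation outside Solr. The helper names `is_valid_utf8` and `latin1_to_utf8` are made up for illustration, and the assumption that the file is Latin1-encoded has not been verified against the actual file.

```python
# Sketch only: assumes the document was saved as Latin1 despite declaring UTF-8.
text = "NI\u00d1O VOLANTE"  # "NIÑO VOLANTE", written with an escape for portability

latin1_bytes = text.encode("latin-1")  # Ñ is stored as the single byte 0xD1
utf8_bytes = text.encode("utf-8")      # Ñ is stored as the two bytes 0xC3 0x91


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin1 bytes as UTF-8 (hypothetical helper)."""
    return data.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    print(latin1_bytes)                                 # Latin1 form of the text
    print(is_valid_utf8(latin1_bytes))                  # Latin1 bytes read as UTF-8
    print(is_valid_utf8(latin1_to_utf8(latin1_bytes)))  # after re-encoding
```

If the same check fails on the real 57917486.xml, re-saving the file as UTF-8 (or changing the declaration to match the encoding actually used) would be the next thing to try.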