Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-06 Thread Federico Chiacchiaretta
2013/8/6 Raymond Wiker > Ok, let me rephrase that slightly: does your database extraction include > BLOBs or CLOBs that are actually complete documents, that might be UTF-8 > encoded text? > > It definitely does, each entry I have in PostgreSQL has a field of type "text" that includes UTF-8 encode

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Raymond Wiker
Ok, let me rephrase that slightly: does your database extraction include BLOBs or CLOBs that are actually complete documents, that might be UTF-8 encoded text? From the stack trace in your second post, it seems that the error occurs while parsing an XML file uploaded via the UpdateRequestHandler.

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Federico Chiacchiaretta
No, the content has no XML tags included (hope I understood what you were asking here). Federico 2013/8/5 Raymond Wiker > On Aug 5, 2013, at 20:12 , Federico Chiacchiaretta < > federico.c...@gmail.com> wrote: > > Hi Raymond, > > I agree with you, 0xfffe is a special character, that is why I wa

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Raymond Wiker
On Aug 5, 2013, at 20:12 , Federico Chiacchiaretta wrote: > Hi Raymond, > I agree with you, 0xfffe is a special character, that is why I was asking > how it's handled in solr. > In my document, 0xfffe does not appear at the beginning, it's in the > content. > > Just an update about testing I'm d

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Sundararaju, Shankar
The problem is that even though unicode point \u and \uFFFE are valid UTF-8 characters, they will not be parsed by standards conforming XML parsers. There is something called UTF-8 replacement character \uFFFD that can be used to replace such characters. While indexing docs, replace all such ch
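A minimal sketch of the replacement step suggested above, assuming the text is sanitized on the client side before it is sent to Solr as XML. The function name `sanitize_for_xml` is hypothetical and is not part of Solr or SolrJ. The pattern follows the XML 1.0 `Char` production, which excludes U+FFFE and U+FFFF along with most control characters.

```python
import re

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# U+FFFE and U+FFFF fall outside it, as do most C0 control characters.
_XML10_INVALID = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# Unicode replacement character, used in place of anything XML 1.0 rejects.
REPLACEMENT = "\uFFFD"


def sanitize_for_xml(text: str) -> str:
    """Return text with every character XML 1.0 rejects replaced by U+FFFD.

    Hypothetical helper for illustration; apply it to field values
    before building the update document.
    """
    return _XML10_INVALID.sub(REPLACEMENT, text)


if __name__ == "__main__":
    dirty = "content with \ufffe in the middle"
    # Show the code points so the substitution is visible.
    print([hex(ord(c)) for c in sanitize_for_xml(dirty) if ord(c) > 0x7F])
```

Replacing rather than deleting keeps a visible marker where the original data was damaged, which can help when tracing bad rows back to the database.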

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Robert Muir
On Mon, Aug 5, 2013 at 3:03 PM, Chris Hostetter wrote: > > : > 0xfffe is not a special character -- it is explicitly *not* a character in > : > Unicode at all, it is set aside as "not a character." specifically so > : > that the character 0xfeff can be used as a BOM, and if the BOM is read > : >

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Steve Rowe
Unicode noncharacters are perfectly valid for the purpose of interchange (though as Robert points out, XML has its own ideas about this, separately from the Unicode standard). From : Q: Are noncharacters invalid in Unicode strings and UTFs?

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Chris Hostetter
: > 0xfffe is not a special character -- it is explicitly *not* a character in : > Unicode at all, it is set aside as "not a character." specifically so : > that the character 0xfeff can be used as a BOM, and if the BOM is read : > incorrectly, it will cause an error. : : XML doesn't allow contro
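The relationship between 0xfeff and 0xfffe described in the quoted text can be illustrated with a short sketch. It only uses Python's built-in codecs; the function names are made up for this example. It shows that a BOM read with the wrong byte order comes out as U+FFFE, and that U+FFFE itself still has a well-formed UTF-8 encoding, so the rejection in this thread comes from the XML layer rather than from UTF-8.

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark


def misread_bom() -> str:
    """Encode the BOM as little-endian UTF-16, then decode it as big-endian."""
    le_bytes = BOM.encode("utf-16-le")   # bytes FF FE
    return le_bytes.decode("utf-16-be")  # same bytes read in the wrong order


def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 encoding of a single character."""
    return ch.encode("utf-8")


if __name__ == "__main__":
    print(hex(ord(misread_bom())))      # the swapped BOM is the noncharacter U+FFFE
    print(utf8_bytes("\ufffe").hex())   # U+FFFE still encodes to three UTF-8 bytes
    print(utf8_bytes(BOM).hex())        # the UTF-8 form of the BOM, for comparison
```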

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Shawn Heisey
On 8/5/2013 12:12 PM, Federico Chiacchiaretta wrote: Hi Raymond, I agree with you, 0xfffe is a special character, that is why I was asking how it's handled in solr. In my document, 0xfffe does not appear at the beginning, it's in the content. I believe that 0xfffe is not a valid UTF-8 character, a

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Robert Muir
On Mon, Aug 5, 2013 at 11:42 AM, Chris Hostetter wrote: > > : I agree with you, 0xfffe is a special character, that is why I was asking > : how it's handled in solr. > : In my document, 0xfffe does not appear at the beginning, it's in the > : content. > > Unless I'm misunderstanding something (an

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Chris Hostetter
: I agree with you, 0xfffe is a special character, that is why I was asking : how it's handled in solr. : In my document, 0xfffe does not appear at the beginning, it's in the : content. Unless I'm misunderstanding something (and it's very likely that I am)... 0xfffe is not a special character -

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Federico Chiacchiaretta
Hi Raymond, I agree with you, 0xfffe is a special character, that is why I was asking how it's handled in solr. In my document, 0xfffe does not appear at the beginning, it's in the content. Just an update about testing I'm doing: in a SolrCloud two shards environment, if I launch dataimport on one

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Raymond Wiker
I think #xfffe is special; it is used as a "byte order mark" to identify the encoding used. In that case, it should only appear at the beginning of the document. Sent from my iPhone On 5 Aug 2013, at 17:19, Federico Chiacchiaretta wrote: > Hi Shawn, > thanks for your answer. > From the docs

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Federico Chiacchiaretta
Hi Shawn, thanks for your answer. From the docs you linked I found: "This property is only relevent for server versions less than or equal to 7.2". I'm using version 9.1, I gave it a try but unfortunately I had no luck. Besides, I checked encoding settings on DB and it's UTF-8. Please note that

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Shawn Heisey
On 8/1/2013 7:20 AM, Federico Chiacchiaretta wrote: > on data import from a PostgreSQL db, I get the following error in solr.log: > > ERROR - 2013-08-01 09:51:00.217; org.apache.solr.common.SolrException; > shard update error RetryNode: > http://172.16.201.173:8983/solr/archive/:org.apache.solr.cl

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Federico Chiacchiaretta
Hi, I reproduced the bug on solr 4.4.0. The bug is specific to SolrCloud, so the bug occurs only when data has to be forwarded to another node (say I start dataimport on node1 and it forwards data to node2). Here is the log I found on target node: ERROR - 2013-08-05 11:57:48.739; org.apache.solr.c