I don't think this is a BOM - that would be 0xfeff. Anyway, the problem we usually see when processing XML with BOMs is in UTF-8 (which really doesn't need a BOM, since it's a byte stream anyway): if you transform the stream (bytes) into a reader (chars) before the XML parser can see it, the parser treats the BOM as whitespace. But in that case you typically get a more specific error about invalid characters in the XML prolog, not just a random invalid-character error.
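To illustrate the distinction (a rough sketch, plain StAX with a made-up file name, nothing Solr-specific): hand the parser the raw bytes so it can detect the encoding and any BOM itself, and if you really must build the Reader yourself, skip a UTF-8 BOM (EF BB BF) before wrapping the stream.

import java.io.*;
import javax.xml.stream.*;

public class BomSafeParse {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();

        // Preferred: give the parser the raw bytes; it works out the
        // encoding (and any BOM) from the stream itself.
        try (InputStream in = new FileInputStream("doc.xml")) {
            XMLStreamReader r = factory.createXMLStreamReader(in);
            while (r.hasNext()) r.next();
        }

        // If you need a Reader, strip a UTF-8 BOM first; otherwise the
        // parser sees U+FEFF ahead of the prolog and complains.
        try (PushbackInputStream in =
                 new PushbackInputStream(new FileInputStream("doc.xml"), 3)) {
            byte[] head = new byte[3];
            int n = in.read(head);
            boolean bom = n == 3 && (head[0] & 0xFF) == 0xEF
                                 && (head[1] & 0xFF) == 0xBB
                                 && (head[2] & 0xFF) == 0xBF;
            if (!bom && n > 0) in.unread(head, 0, n);  // not a BOM, push the bytes back
            XMLStreamReader r = factory.createXMLStreamReader(
                    new InputStreamReader(in, "UTF-8"));
            while (r.hasNext()) r.next();
        }
    }
}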

-Mike

On 06/27/2011 10:33 AM, lee carroll wrote:
Hi Markus

I've seen a similar issue before (but not with Solr) when processing files as XML.
In our case the problem was due to processing a UTF-16 file with a byte order mark.
This presents itself as 0xffff to the XML parser, which is not used by UTF-8
(the BOM would be represented as efbfbf in UTF-8). This caused the UTF-8-aware
parser to choke.

I don't want to get involved in any Unicode/UTF war as I'm confused enough as it
stands, but could you check for UTF-16 files before processing?
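If it helps, here's a rough sketch of that check (plain Java, file paths taken from the command line) - just look at the first two bytes for a UTF-16 byte order mark, FE FF (big-endian) or FF FE (little-endian):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Utf16BomCheck {

    // True if the file starts with a UTF-16 byte order mark.
    static boolean hasUtf16Bom(String path) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            int b0 = in.read();
            int b1 = in.read();
            return (b0 == 0xFE && b1 == 0xFF)   // UTF-16 big-endian BOM
                || (b0 == 0xFF && b1 == 0xFE);  // UTF-16 little-endian BOM
        }
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            System.out.println(path + " -> UTF-16 BOM: " + hasUtf16Bom(path));
        }
    }
}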

lee c

On 27 June 2011 14:26, Thomas Fischer <fischer...@aon.at> wrote:
Hello,

On 27.06.2011 at 12:40, Markus Jelsma wrote:

Hi,

I came across the indexing error below. It happened in a huge batch update
from Nutch with SolrJ 3.1. Since the crawl was huge, it is very hard to trace
the error back to a specific document. So I try my luck here: has anyone seen this
before with SolrJ 3.1? Anything else on the Nutch part I should have taken
care of?

Thanks!


Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423
Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff (at char #1142033, byte #1155068)
       at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
and loads of other rubbish, ending with

       ... 26 more

I see this as a problem of Solr error reporting. It is not only obnoxiously 
"loud" (white on grey with oversized fonts), but also less useful than it should be.
Instead of telling the user where the error occurred (i.e. while reading which 
file, at which line and column), it unravels the stack. This is useless if the 
program just choked on some unexpected input, like a typo in a schema or config 
file or an invalid character in a file to be indexed.
I don't know whether this is due to Tomcat or to Solr's own logging, 
but it is annoying.

And yes, I've seen something like this before and found the error not by 
inspecting Solr but by opening the suspected files with an appropriate browser 
(e.g. Firefox), which tells me exactly where things go wrong.
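A small command-line sketch can do much the same job as the browser trick (my own class, nothing from Solr): decode the file as UTF-8 and report every character that is not allowed by XML 1.0, 0xffff being one of them. The character count is only roughly comparable to the "char #" in the Woodstox message, and Java's decoder turns genuinely malformed byte sequences into U+FFFD, so this finds illegal characters rather than broken encodings:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

public class FindBadXmlChars {

    // True for code points outside the XML 1.0 "Char" production.
    static boolean invalidInXml(int cp) {
        return !(cp == 0x9 || cp == 0xA || cp == 0xD
              || (cp >= 0x20 && cp <= 0xD7FF)
              || (cp >= 0xE000 && cp <= 0xFFFD)
              || (cp >= 0x10000 && cp <= 0x10FFFF));
    }

    public static void main(String[] args) throws IOException {
        Reader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(args[0]), "UTF-8"));
        long charPos = 0;
        int c;
        while ((c = in.read()) != -1) {
            int cp = c;
            if (Character.isHighSurrogate((char) c)) {
                int low = in.read();            // join surrogate pairs into one code point
                if (low != -1) cp = Character.toCodePoint((char) c, (char) low);
            }
            if (invalidInXml(cp)) {
                System.out.printf("invalid character U+%04X at char #%d%n", cp, charPos);
            }
            charPos++;
        }
        in.close();
    }
}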

All the best
Thomas

