I don't think this is a BOM - that would be 0xfeff. Anyway, the problem
we usually see when processing XML with BOMs is with UTF-8 (which doesn't
really need a BOM since it's a byte stream anyway): if you transform the
stream (bytes) into a reader (chars) before the XML parser can see it, the
parser treats the BOM as white space. But in that case you typically get a
more specific error about invalid characters in the XML prolog, not just a
random invalid-character error.
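To make the distinction concrete, something along these lines (rough
sketch only, untested against this exact setup, and "doc.xml" is just a
placeholder name):

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class BomDemo {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();

        // Safer: hand the parser the raw bytes. It can sniff the BOM and
        // the encoding declaration itself and skip the BOM correctly.
        try (InputStream in = new FileInputStream("doc.xml")) {
            XMLStreamReader xml = factory.createXMLStreamReader(in);
            while (xml.hasNext()) {
                xml.next();
            }
        }

        // Risky: decode to chars yourself first. A UTF-8 BOM now arrives
        // as the character U+FEFF in front of the prolog, and the parser
        // may complain about invalid characters in the XML prolog.
        try (Reader reader = new InputStreamReader(
                new FileInputStream("doc.xml"), "UTF-8")) {
            XMLStreamReader xml = factory.createXMLStreamReader(reader);
            while (xml.hasNext()) {
                xml.next();
            }
        }
    }
}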
-Mike
On 06/27/2011 10:33 AM, lee carroll wrote:
Hi Markus
I've seen a similar issue before (but not with Solr) when processing files as XML.
In our case the problem was due to processing a UTF-16 file with a byte
order mark. This presents itself as 0xffff to the XML parser, which is not
something UTF-8 text should contain (the BOM code point U+FEFF itself would
be encoded as EF BB BF in UTF-8). This caused the UTF-8-aware parser to choke.
I don't want to get involved in any Unicode/UTF war as I'm confused
enough as it stands, but
could you check for UTF-16 files before processing?
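Something like this quick sniff would flag them (just a sketch - class and
method names are made up, and path handling is up to you):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BomSniffer {
    // Returns a description of the BOM found at the start of the file,
    // or null if there is no recognisable BOM.
    public static String sniffBom(String path) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            int b0 = in.read();
            int b1 = in.read();
            int b2 = in.read();
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE (FE FF)";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE (FF FE)";
            if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8 with BOM (EF BB BF)";
            return null;
        }
    }
}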
lee c
On 27 June 2011 14:26, Thomas Fischer <fischer...@aon.at> wrote:
Hello,
Am 27.06.2011 um 12:40 schrieb Markus Jelsma:
Hi,
I came across the indexing error below. It happened in a huge batch update
from Nutch with SolrJ 3.1. Since the crawl was huge it is very hard to trace
the error back to a specific document. So I try my luck here: has anyone seen
this before with SolrJ 3.1? Anything else on the Nutch part I should have taken
care of?
Thanks!
Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500
QTime=423
Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException]
Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
at
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
and loads of other rubbish and
... 26 more
I see this as a problem of Solr error reporting. This is not only obnoxiously
"loud" (white on grey with oversized fonts), but less useful than it should be.
Instead of telling the user where the error occurred (i.e. while reading which
file, which column at which line) it unravels the stack. This is useless if the
program just choked on some unexpected input, like a typo in a schema or config
file or an invalid character in a file to be indexed.
I don't know whether this is due to Tomcat or to the logging system of Solr
itself, but it is annoying.
And yes, I've seen something like this before and found the error not by
inspecting Solr but by opening the suspected files with an appropriate browser
(e.g. Firefox), which told me exactly where something went wrong.
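For what it's worth, a small standalone check along these lines also reports
the position of the first offending character without a browser (just a
sketch, nothing Solr-specific, the class name is made up; characters like
U+FFFF may decode fine as UTF-8 but are still forbidden by XML 1.0):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Paths;

public class XmlCharCheck {

    // Characters allowed by the XML 1.0 Char production.
    static boolean isXmlChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        try {
            // Strict UTF-8 decode: malformed byte sequences throw instead
            // of being silently replaced.
            CharBuffer chars = Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            for (int i = 0; i < chars.length(); ) {
                int c = Character.codePointAt(chars, i);
                if (!isXmlChar(c)) {
                    System.out.printf("char #%d is U+%04X, not allowed in XML%n", i, c);
                    return;
                }
                i += Character.charCount(c);
            }
            System.out.println("no forbidden characters found");
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}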
All the best
Thomas