Hi Markus,

the result of my investigation is that Lucene can currently only handle UTF-8 encoded code points within the BMP (Basic Multilingual Plane, plane 0), i.e. code points <= 0xFFFF.
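To make the BMP limit concrete, here is a minimal sketch in plain Java (it is not Lucene or Solr code) that scans a string and reports code points above the BMP as well as code points that XML 1.0 does not allow. The class name CodePointCheck and its methods are made up for illustration. The XML check follows the Char production of the XML 1.0 specification, which excludes U+FFFE and U+FFFF, so a stray \uffff in a response is enough to make it "not well-formed".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene or Solr.
final class CodePointCheck {

    // True if the code point lies in the Basic Multilingual Plane (<= 0xFFFF).
    static boolean isInBmp(int codePoint) {
        return Character.isBmpCodePoint(codePoint);
    }

    // True if the code point matches the Char production of XML 1.0:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Lists every code point that is above the BMP or illegal in XML 1.0,
    // together with its index (in UTF-16 code units) in the text.
    static List<String> findSuspects(String text) {
        List<String> suspects = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            String label = String.format("U+%04X", cp);
            if (!isInBmp(cp)) {
                suspects.add(label + " above BMP at index " + i);
            } else if (!isLegalXml10(cp)) {
                suspects.add(label + " illegal in XML 1.0 at index " + i);
            }
            // Advance by one or two UTF-16 code units, depending on the code point.
            i += Character.charCount(cp);
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Sample modelled on the broken URL from this thread, plus one
        // supplementary character (U+1D538) written as a surrogate pair.
        String sample = "S002461070602314\uFFFF and \uD835\uDD38";
        for (String suspect : findSuspects(sample)) {
            System.out.println(suspect);
        }
    }
}
```

Run on the sample, this flags the U+FFFF at the end of the DOI and the supplementary character after it. Running something like this over the stored field values might help to tell whether the bad code is already in the index or only appears in the response writer.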
Code points above the BMP may produce unpredictable results, which is bad.
If you get invalid UTF-8 from the index and use wt=xml, you get the error
page. This is due to the Content-Type text/xml with charset=utf-8 in the
header. If you use wt=json, the Content-Type is text/plain with
charset=utf-8. Because of text/plain you don't get an error page, but the
content is invalid nevertheless. I guess it replaces all invalid code with
the UTF-8 BOM.

So currently there is no solution, not even with JSON.
This should (hopefully) be fixed with Lucene 3.1.

Regards,
Bernd

On 11.02.2011 15:50, Markus Jelsma wrote:
> No, I haven't located the issue. It might be Solr, but it could also be Xerces
> having trouble with it. You can possibly work around the problem by using the
> JSONResponseWriter.
>
> On Friday 11 February 2011 15:45:23 Bernd Fehling wrote:
>> Hi Markus,
>>
>> yes, it looks like the same issue. There is also a \uffff utf8-code in your
>> dump. So far I have followed it into XMLResponseWriter.
>> Some steps earlier the result in a buffer looks good and the utf8-code is
>> correct. Really hard to debug this freaky problem.
>>
>> Have you looked deeper into this and located the bug?
>>
>> It is definitely a bug and has nothing to do with Firefox.
>>
>> Regards,
>> Bernd
>>
>> On 11.02.2011 13:48, Markus Jelsma wrote:
>>> It looks like you hit the same issue as I did a while ago:
>>> http://www.mail-archive.com/solr-user@lucene.apache.org/msg46510.html
>>>
>>> On Friday 11 February 2011 08:59:27 Bernd Fehling wrote:
>>>> Dear list,
>>>>
>>>> after loading some documents via DIH, which also include URLs,
>>>> I get this yellow XML error page as search result from the Solr admin GUI
>>>> after a search.
>>>> It says XML processing error "not well-formed".
>>>> The code it complains about is:
>>>>
>>>> <arr name="dcurls">
>>>> <str>http://eprints.soton.ac.uk/43350/</str>
>>>> <str>http://dx.doi.org/doi:10.1112/S0024610706023143</str>
>>>> <str>Martinez-Perez, Conchita and Nucinkis, Brita E.A. (2006)
>>>> Cohomological dimension of Mackey functors for infinite groups. Journal
>>>> of the London Mathematical Society, 74, (2), 379-396.
>>>> (doi:10.1112/S0024610706023143
>>>> <http://dx.doi.org/10.1112/S002461070602314\uffff>)</str></arr>
>>>>
>>>> See the \uffff utf8-code in the last line.
>>>>
>>>> 1. the loaded data is valid, well-formed and checked with xmllint. No
>>>> errors.
>>>> 2. there is no \uffff utf8-code in the source data.
>>>> 3. the data is loaded via DIH without any errors.
>>>> 4. if opening the source view of the result page with Firefox there is
>>>> also no \uffff utf8-code.
>>>>
>>>> The only idea I have is Solr itself or the result page generation.
>>>>
>>>> How to proceed, what else to check?
>>>>
>>>> Regards,
>>>> Bernd
>

-- 
*************************************************************
Bernd Fehling                Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)                        Universitätsstr. 25
Tel. +49 521 106-4060                   Fax. +49 521 106-4052
bernd.fehl...@uni-bielefeld.de                33615 Bielefeld

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************