Hi Markus,

My investigation shows that Lucene can currently only handle characters
within the BMP [Basic Multilingual Plane] (plane 0), i.e. code points <= 0xFFFF.
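
To check whether a field value contains anything above the BMP before it is indexed, something like the following could be used. This is only a minimal sketch in Python, not Lucene or Solr code; the function name non_bmp_code_points and the sample string are made up for illustration.

```python
def non_bmp_code_points(text):
    """Return (index, code point) pairs for characters above U+FFFF."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


# Hypothetical sample: U+1D11E (musical G clef) lies in plane 1, outside the BMP.
sample = "clef \U0001D11E"

for index, cp in non_bmp_code_points(sample):
    ch = sample[index]
    # Such a character needs 4 bytes in UTF-8 and a surrogate pair in UTF-16.
    print("index %d: U+%X, UTF-8 bytes: %d, UTF-16 bytes: %d" % (
        index, cp, len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))))
```

Running a check like this over the source data would at least show which documents are affected.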

Any code point above the BMP may produce unpredictable results, which is bad.
If you get invalid UTF-8 from the index and use wt=xml, you get the error
page. This is due to Content-Type text/xml with charset=utf-8 in the header.
If you use wt=json, the Content-Type is text/plain with charset=utf-8.
Because of text/plain you don't get an error page, but the content is
still invalid. I guess it replaces all invalid code with the UTF-8 BOM.
So there is currently no solution, not even with JSON.
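
For reference, U+FFFF is a Unicode noncharacter and is excluded from the "Char" production of XML 1.0, which is why an XML parser reports "not well-formed" while a text/plain JSON response is displayed without complaint. A client could detect or replace such characters on its side. The following is only a sketch under that assumption, written in Python; the helper names are hypothetical and this is not what Solr does internally.

```python
import re

# Everything NOT allowed by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (index, 'U+XXXX') pairs for characters XML 1.0 forbids."""
    return [(m.start(), "U+%04X" % ord(m.group()))
            for m in _INVALID_XML10.finditer(text)]


def replace_invalid_xml_chars(text, replacement="\uFFFD"):
    """Replace forbidden characters (default: U+FFFD REPLACEMENT CHARACTER)."""
    return _INVALID_XML10.sub(replacement, text)


# Value as it appears in the broken response, ending in U+FFFF.
value = "http://dx.doi.org/10.1112/S002461070602314\uffff"
print(find_invalid_xml_chars(value))
print(replace_invalid_xml_chars(value) == value[:-1] + "\uFFFD")
```

This only hides the symptom on the client side; the stored value in the index is still wrong.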

This should (hopefully) be fixed with Lucene 3.1.

Regards,
Bernd


On 11.02.2011 15:50, Markus Jelsma wrote:
> No, I haven't located the issue. It might be Solr, but it could also be Xerces
> having trouble with it. You can possibly work around the problem by using the
> JSONResponseWriter.
> 
> On Friday 11 February 2011 15:45:23 Bernd Fehling wrote:
>> Hi Markus,
>>
>> yes, it looks like the same issue. There is also a \uffff UTF-8 code in your
>> dump. So far I have traced it into XMLResponseWriter.
>> A few steps earlier, the result in a buffer still looks good and the UTF-8
>> code is correct. This freaky problem is really hard to debug.
>>
>> Have you looked deeper into this and located the bug?
>>
>> It is definitely a bug and has nothing to do with Firefox.
>>
>> Regards,
>> Bernd
>>
>> On 11.02.2011 13:48, Markus Jelsma wrote:
>>> It looks like you hit the same issue as I did a while ago:
>>> http://www.mail-archive.com/solr-user@lucene.apache.org/msg46510.html
>>>
>>> On Friday 11 February 2011 08:59:27 Bernd Fehling wrote:
>>>> Dear list,
>>>>
>>>> after loading some documents via DIH, which also include URLs,
>>>> I get this yellow XML error page as the search result from the Solr
>>>> admin GUI after a search.
>>>> It says XML processing error "not well-formed".
>>>> The code it complains about is:
>>>>
>>>> <arr name="dcurls">
>>>> <str>http://eprints.soton.ac.uk/43350/</str>
>>>> <str>http://dx.doi.org/doi:10.1112/S0024610706023143</str>
>>>> <str>Martinez-Perez, Conchita and Nucinkis, Brita E.A. (2006)
>>>> Cohomological dimension of Mackey functors for infinite groups. Journal
>>>> of the London Mathematical Society, 74, (2), 379-396.
>>>> (doi:10.1112/S0024610706023143
>>>> &lt;http://dx.doi.org/10.1112/S002461070602314\uffff&gt;)</str></arr>
>>>>
>>>> See the \uffff utf8-code in the last line.
>>>>
>>>> 1. The loaded data is valid and well-formed, checked with xmllint. No
>>>>    errors.
>>>> 2. There is no \uffff UTF-8 code in the source data.
>>>> 3. The data is loaded via DIH without any errors.
>>>> 4. When opening the source view of the result page in Firefox, there is
>>>>    also no \uffff UTF-8 code.
>>>>
>>>> My only idea is that it is caused by Solr itself or by the result page
>>>> generation.
>>>>
>>>> How should I proceed, and what else should I check?
>>>>
>>>> Regards,
>>>> Bernd
> 

-- 
*************************************************************
Bernd Fehling                Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)                        Universitätsstr. 25
Tel. +49 521 106-4060                   Fax. +49 521 106-4052
bernd.fehl...@uni-bielefeld.de                33615 Bielefeld

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
