Hi all,

I'm new to the list, but I've been struggling with this problem for some time. I'm getting "Illegal xml/html character" errors and I'm trying to track down the source. The offending characters seem to be in the 128-159 (decimal) range, which are control codes rather than printable characters and are not allowed in the HTML output. They show up where the original text has diacritics and other accented characters.

The original data is encoded in UTF-8. I have verified that the data doesn't contain any of these characters prior to indexing, and when I get the records in question back in a list of results, they display fine. The problem arises when the characters occur in a facet value and I try to pass it through the URL.

As an example, consider a facet value:
Brasseur de Bourbourg, abb%C3%A9, 1814-1874, former owner

The %C3%A9 is the percent-encoded UTF-8 byte sequence (0xC3 0xA9) for é, an e with an acute accent, so the word is "abbé".

The following is a snippet of a link to use a facet:
search-faceted.html?q=[* TO *]&facet=true&rows=25&fq=name_facet:"Brasseur de Bourbourg, abb%C3%A9, 1814-1874, former owner"

As far as I can tell, the characters are correctly percent-encoded in the link. When the response comes back, I get an illegal character error. Examining the response XML, I see an fq value of:
name_facet:"Brasseur de Bourbourg, abbé, 1814-1874, former owner"

I'm not sure how that will display in the email, but in short, it's not what I put in. Further, it's not legal HTML and things break.
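For what it's worth, the garbled value looks like what you get when the UTF-8 bytes in the URL are decoded as ISO-8859-1. That is only a guess on my part; I have not confirmed where in the chain it happens. Here is a minimal Python sketch that reproduces the symptom under that assumption. The helper name c1_code_points is just something I made up for illustration.

```python
from urllib.parse import unquote

encoded = "abb%C3%A9"

# Decoding the percent-escapes as UTF-8 gives the intended value, "abbé".
intended = unquote(encoded, encoding="utf-8")

# ASSUMPTION: something decodes the same bytes as ISO-8859-1 instead.
# Each byte then becomes its own character: 0xC3 -> U+00C3, 0xA9 -> U+00A9.
garbled = unquote(encoded, encoding="iso-8859-1")


def c1_code_points(text):
    """Hypothetical helper: code points in 128-159 that appear when the
    UTF-8 bytes of `text` are misread as ISO-8859-1."""
    misread = text.encode("utf-8").decode("iso-8859-1")
    return [ord(ch) for ch in misread if 128 <= ord(ch) <= 159]


print(repr(intended), repr(garbled))
print(c1_code_points("\u00c9"))  # É is 0xC3 0x89 in UTF-8
```

In this sketch é itself does not land in the 128-159 range, but other accented letters do (É, for example, has 0x89 = 137 as its second byte), which would fit the errors only showing up for some records.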

Does anyone have any thoughts about this? I apologize if this has been asked somewhere in the past, but I did some digging and couldn't come up with anything. I welcome any input.

Regards,

Peter

----
Peter Cline, Digital Library Applications Programmer
University of Pennsylvania Library
email: pcline at pobox dot upenn dot edu
