Hi Mark,

BOM detection is irrelevant here. The HTTP header states that it should be 
UTF8, but this is not being honoured. 

There is something further down the stack that isn’t recording the HTTP 
headers. 
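
To show the lookup I would expect to happen somewhere in that stack, here is a minimal Python sketch. It is not LibreOffice or Neon code, and the function name charset_from_content_type is made up for illustration. It reads the charset parameter from a Content-Type value and only falls back to ISO-8859-1 when no header was recorded.

```python
from email.message import Message

FALLBACK = "iso-8859-1"  # assumption: the fallback the HTML parser ends up using


def charset_from_content_type(header_value, fallback=FALLBACK):
    """Return the charset parameter of a Content-Type value, or the fallback."""
    if not header_value:
        # The header was never recorded, so there is nothing to go on.
        return fallback
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or fallback


# Header value taken from the capture quoted later in this thread.
header = 'text/html; name="text.html"; charset=UTF-8'
body = "فارسی".encode("utf-8")

# Header honoured: the Persian text round-trips.
good = body.decode(charset_from_content_type(header))

# Header lost: every UTF-8 byte is read as a separate Latin-1 character.
bad = body.decode(charset_from_content_type(None))
```

ISO-8859-1 and UTF-8 only agree byte for byte in the ASCII range, so the second decode never raises an error. It just produces wrong characters, which would be consistent with the corruption in the bug report.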

Chris

> On 4 Jan 2016, at 4:23 PM, Mark Hung <[email protected]> wrote:
> 
> Hi Chris,
> 
> I've recently been working on SvParser and HTMLParser.
> 
> There is BOM detection in SvParser::GetNextChar().
> 
> From a quick look at eehtml, EditHTMLParser::EditHTMLParser seems relevant.
> 
> Best regards.
> 
> 
> 2016-01-04 12:02 GMT+08:00 Chris Sherlock <[email protected]>:
> Hey guys, 
> 
> Probably nobody saw this because of the time of year (Happy New Year, 
> incidentally!!!).
> 
> Just a quick ping to the list to see if anyone can give me some pointers. 
> 
> Chris
> 
>> On 30 Dec 2015, at 12:15 PM, Chris Sherlock <[email protected]> wrote:
>> 
>> Hi guys,
>> 
>> In bug 95217 - https://bugs.documentfoundation.org/show_bug.cgi?id=95217 -
>> Persian text in a webpage encoded as UTF-8 is being corrupted.
>> 
>> If I take the webpage and save it to an HTML file encoded as UTF-8, then
>> there are no problems and the Persian text comes through fine. However, when
>> connecting to a webserver directly, the text is corrupted even though the
>> HTTP header correctly gives the charset as UTF-8.
>> 
>> I did a test using Charles Proxy with its SSL interception feature turned on 
>> and pointed Safari to 
>> https://bugs.documentfoundation.org/attachment.cgi?id=119818
>> 
>> The following headers are gathered:
>> 
>> HTTP/1.1 200 OK
>> Server: nginx/1.2.1
>> Date: Sat, 26 Dec 2015 01:41:30 GMT
>> Content-Type: text/html; name="text.html"; charset=UTF-8
>> Content-Length: 982
>> Connection: keep-alive
>> X-xss-protection: 1; mode=block
>> Content-disposition: inline; filename="text.html"
>> X-content-type-options: nosniff
>> 
>> Some warnings are spat out that editeng's eehtml can't detect the
>> encoding. I initially thought it was looking for a BOM, which makes no sense
>> for a webpage, but that's wrong. Instead, for some reason the headers don't
>> seem to be processed, and the HTML parser is falling back to ISO-8859-1
>> rather than UTF-8 as the character encoding.
>> 
>> We seem to use Neon to make the GET request to the webserver. A few 
>> observations:
>> 
>> 1. We detect a server OK response as an error
>> 2. (Probably more to the point) The function being used suggests a PROPFIND
>> verb, but I believe a normal GET is actually issued and the headers aren't
>> being stored. This means that when the parser looks for the headers to find
>> the encoding it's not finding anything, resulting in a fallback to
>> ISO-8859-1.
>> 
>> One easy thing (it doesn't solve the root issue): wouldn't it be a better
>> idea to fall back to UTF-8 rather than ISO-8859-1, given that every
>> character in ISO-8859-1 can also be represented in UTF-8?
>> 
>> Any pointers on how to get to the bottom of this would be appreciated; I'm
>> honestly not up on WebDAV or Neon.
>> 
>> Chris Sherlock
> 
> 
> _______________________________________________
> LibreOffice mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/libreoffice
> 
> 
> 
> 
> -- 
> Mark Hung
