Re: SVN Blame Returns Corrupt Data

Branko Čibej Fri, 11 Oct 2013 09:24:00 -0700

On 11.10.2013 18:12, Bob Archer wrote:
>> On 11.10.2013 17:19, Bob Archer wrote:
>>>> On 11.10.2013 16:55, Bob Archer wrote:
>>>>>> On 11.10.2013 15:58, Bob Archer wrote:
>>>>>>>> On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer <[email protected]> wrote:
>>>>>>>> I assume he was asking how to "fix" the blame. Cause, sure, he
>>>>>>>> could open the file, convert it back to UTF-8 with CRLF line
>>>>>>>> endings... and commit it... of course, now blame is going to show
>>>>>>>> him on every line, since he just changed every line.
>>>>>>>>
>>>>>>>> That's exactly what I meant.  You're correct with how the blame
>>>>>>>> is handled.  I committed the UTF-8 copy to a test branch, diff'd,
>>>>>>>> and it showed every line as being changed.  Unfortunately it
>>>>>>>> looks like this is our best option.
>>>>>>> Yep, we have done the same thing. As a matter of fact, I just over
>>>>>>> the past few days rescripted all our database scripts to be UTF-8
>>>>>>> since merging them just doesn't work correctly when they are UTF-16
>>>>>>> even if you remove the binary mime type.
>>>>>>>> On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser <[email protected]> wrote:
>>>>>>>> At present, blame is not UTF-16 aware.
>>>>>>> It's not just blame that isn't... the diff engine, or whatever
>>>>>>> detects file types, always considers UTF-16 files to be binary. If
>>>>>>> you "add" a UTF-16 file you see that svn adds the
>>>>>>> application/octet-stream mime type. There is an issue in the bug
>>>>>>> database about this from when I reported/complained about it...
>>>>>>> however it hasn't been addressed. I'm surprised still at this time
>>>>>>> that svn still can't support UTF-16 text files as text wrt adding,
>>>>>>> diffing, blaming, etc.
>>>>>> It's quite simple: no-one has written the necessary code. While I
>>>>>> can understand it's an interesting feature for Windows users, most
>>>>>> Subversion developers have other things to do. This being a
>>>>>> volunteer project, and most of us do not use Windows, you can
>>>>>> hardly expect anyone to spend several weeks on solving a problem
>>>>>> that has a perfectly simple workaround. Since UTF-8 and UTF-16 can
>>>>>> be interchanged without data loss, there are other, much more
>>>>>> important things to do in Subversion.
>>>>> I appreciate all that you said. I didn't expect that UTF-16 was so
>>>>> uncommon in non-Windows OSes. A large number of dev tools that I work
>>>>> with on Windows, especially the Microsoft tools, default to creating
>>>>> UTF-16 files.
>>>>>
>>>>> I disagree with your "can be converted without data loss". If you
>>>>> need UTF-16 then you need it. Also, if you are working in an
>>>>> international team and you have developers with other-language OSes
>>>>> which have different code pages, then what you see when you look at a
>>>>> UTF-8 file might be different than what I see.
>>>>
>>>> I don't follow. Both UTF-16 and UTF-8 are complete representations of
>>>> the Unicode character set. Exactly the same code sequences can be
>>>> represented in both encodings. You can convert from UTF-16 to UTF-8
>>>> and back and get exactly the same sequence of bytes.
>>>>
>>> Ok, I have to backpedal here a bit. You are correct, UTF-8 is a Unicode
>>> format and can store all characters. It's not a UTF-8 vs UTF-16 issue
>>> (Friday senior moment). What I recall being told by one of the
>>> Subversion developers was that Subversion only supported the ASCII
>>> character set and, while UTF-8 was compatible with ASCII, it didn't
>>> truly support Unicode files.
>>>
>>> However, this blog entry seems to dispute that:
>>>
>>> http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/
>>>
>>> Would adding that mime-type to this file fix the blame issues this user
>>> is seeing?
>>
>> I think the user is just very lucky. Subversion does not actually try to
>> interpret the svn:mime-type property, other than to determine whether to
>> treat a file as text or binary. (The comment is correct in that the
>> proper parameter is charset=, not encoding=, but that's not important
>> for this discussion.)
>>
>> Subversion's merge algorithm depends on being able to detect line endings
>> in the file, and always scans the file as a sequence of bytes.
>> There are several ways to represent line endings in a UTF-16 file (shown here
>> as hex byte sequences):
>>
>>   * 00 0A (Unix newline, UTF16-BE)
>>   * 00 0D 00 0A (Windows newline, UTF16-BE)
>>   * 0A 00 (Unix newline, UTF16-LE)
>>   * 0D 00 0A 00 (Windows newline, UTF16-LE)
>>   * 24 24 (Unicode newline, same in LE and BE)
>>
>> Subversion, however, expects one of the following newline sequences:
>>
>>   * 0A (Unix newline)
>>   * 0D 0A (Windows newline)
>>
>> My best guess as to what's happening is that the 0A bytes, a.k.a. the ASCII
>> newline character, are interpreted as the end-of-line markers, and the zero
>> bytes are treated as part of the text. In most cases, the result will be
>> close to correct, as long as there are no conflicts in the merge -- because
>> Subversion will not emit conflict markers in UTF-16.
>>
>> Of course, if someone used the U+2424 newline code point instead, then in
>> the worst case, the whole file would be interpreted as a single line.
>>
>> -- Brane
> Great information.. thanks for that.
>
> Bottom line is use UTF-8 for your text files and svn will be happy and work 
> correctly. How hard would it be to create a warning on an add that a file 
> looks like UTF-16 and should be converted to UTF-8 otherwise it will be 
> treated as a binary file?
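
To make the byte-wise scanning described in the quoted message above a bit
more concrete, here is a minimal Python sketch. It is not Subversion's code:
split_bytewise is a hypothetical helper, and splitting on the single byte 0A
is only my assumption about how a byte-oriented scanner behaves.

```python
def split_bytewise(data):
    """Split on the single byte 0A, as a byte-oriented scanner might (assumption)."""
    return data.split(b"\n")


# Windows newlines in UTF-16-LE are the byte sequence 0D 00 0A 00.
utf16_le = "one\r\ntwo\r\n".encode("utf-16-le")
pieces = split_bytewise(utf16_le)

# The 0A byte is found, but the 00 byte that completes the code unit is
# left at the start of the next piece, so later "lines" begin with a stray
# zero byte that gets treated as part of the text.
print(pieces)

# Text that uses U+2424 (bytes 24 24) as its line separator contains no 0A
# byte at all, so the whole content comes back as one piece.
symbol_newlines = "one\u2424two\u2424".encode("utf-16-le")
print(len(split_bytewise(symbol_newlines)))
```

The stray zero bytes are harmless as long as whole "lines" are copied around
unchanged, which fits the "close to correct" behaviour guessed at above.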


You'd have to extend Subversion's file type detection to detect UTF-16.
See svn_io_detect_mimetype2 at line 3333 in this file:

http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup

Subversion currently only looks at the first 1k bytes of a file. It may
be enough to check that this initial part of the file contains only
valid UTF-16 (BE or LE) code units. Note that the function takes a
dictionary of (file extension, MIME type) pairs, and if it finds a
matching type, it doesn't look at the file at all. That may not be quite
correct, given that there are no special file extensions that would flag
that the file contains UTF-16.
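
A rough sketch of such a check, in Python rather than C to keep it short.
Everything here is an assumption for illustration: detect_utf16, _decodes_as
and SAMPLE_SIZE are hypothetical names, not Subversion APIs. Since almost any
even-length byte string is "valid" UTF-16, the sketch also relies on a BOM
or, failing that, on which byte positions hold zero bytes. That extra
heuristic is mine and will misfire on some binary files.

```python
import codecs

SAMPLE_SIZE = 1024  # only the first 1k bytes are examined, as described above


def _decodes_as(sample, encoding):
    """True if sample holds no invalid sequences for the given encoding."""
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        # final=False tolerates a code unit or surrogate pair cut off at
        # the end of the sample.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def detect_utf16(data):
    """Return 'utf-16-le', 'utf-16-be', or None (hypothetical helper)."""
    sample = data[:SAMPLE_SIZE]

    # A byte order mark settles the byte order (UTF-32 is ignored here).
    if sample.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le" if _decodes_as(sample[2:], "utf-16-le") else None
    if sample.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be" if _decodes_as(sample[2:], "utf-16-be") else None

    # No BOM: ASCII-range characters put a zero in the high byte, which is
    # the odd offset for LE and the even offset for BE.
    even_zeros = sample[0::2].count(0)
    odd_zeros = sample[1::2].count(0)
    if even_zeros == odd_zeros:  # also covers "no zero bytes at all"
        return None
    encoding = "utf-16-be" if even_zeros > odd_zeros else "utf-16-le"
    return encoding if _decodes_as(sample, encoding) else None
```

A caller could print a warning on "svn add" whenever detect_utf16 returns
something other than None; plain ASCII or UTF-8 text contains no zero bytes
and so is never flagged by this sketch.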

-- Brane

-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. [email protected]
