On 11.10.2013 18:12, Bob Archer wrote:
>> On 11.10.2013 17:19, Bob Archer wrote:
>>>> On 11.10.2013 16:55, Bob Archer wrote:
>>>>>> On 11.10.2013 15:58, Bob Archer wrote:
>>>>>>>> On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer <bob.arc...@amsi.com> wrote:
>>>>>>>> I assume he was asking how to "fix" the blame. Cause, sure, he
>>>>>>>> could open the file, convert it back to UTF-8 with CRLF line
>>>>>>>> endings... and commit it... of course, now blame is going to show
>>>>>>>> him on every line, since he just changed every line.
>>>>>>>>
>>>>>>>> That's exactly what I meant. You're correct with how the blame
>>>>>>>> is handled. I committed the UTF-8 copy to a test branch, diff'd,
>>>>>>>> and it showed every line as being changed. Unfortunately it
>>>>>>>> looks like this is our best option.
>>>>>>> Yep, we have done the same thing. As a matter of fact, I just over
>>>>>>> the past few days rescripted all our database scripts to be UTF-8
>>>>>>> since merging them just doesn't work correctly when they are UTF-16
>>>>>>> even if you remove the binary mime type.
>>>>>>>> On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser <b...@reser.org> wrote:
>>>>>>>> At current blame is not UTF-16 aware.
>>>>>>> It's not just blame that isn't... the diff engine, or whatever
>>>>>>> detects file types always considers UTF-16 files to be binary. If
>>>>>>> you "add" a UTF-16 file you see that svn adds the
>>>>>>> application/octet-stream mime type. There is an issue in the bug
>>>>>>> database about this from when I reported/complained about it...
>>>>>>> however it hasn't been addressed. I'm surprised still at this time
>>>>>>> that svn still can't support UTF-16 text files as text wrt adding,
>>>>>>> diffing, blaming, etc.
>>>>>> It's quite simple: no-one has written the necessary code. While I
>>>>>> can understand it's an interesting feature for Windows users, most
>>>>>> Subversion developers have other things to do.
>>>>>> This being a
>>>>>> volunteer project, and most of us do not use Windows, you can
>>>>>> hardly expect anyone to spend several weeks on solving a problem
>>>>>> that has a perfectly simple workaround. Since
>>>>>> UTF-8 and UTF-16 can be interchanged without data loss, there are
>>>>>> other, much more important things to do in Subversion.
>>>>> I appreciate all that you said. I didn't expect that UTF-16 was so
>>>>> uncommon in non-Windows OSes. A large number of dev tools that I
>>>>> work with on Windows, especially the Microsoft tools, default to
>>>>> creating UTF-16 files.
>>>>> I disagree with your "can be converted without data loss". If you
>>>>> need UTF-16 then you need it. Also, if you are working in an
>>>>> international team and you have developers with other language OSes
>>>>> which have different code pages then what you see when you look at
>>>>> a UTF-8 file might be different than what I see.
>>>>
>>>> I don't follow. Both UTF-16 and UTF-8 are complete representations of
>>>> the Unicode character set. Exactly the same code sequences can be
>>>> represented in both encodings. You can convert from UTF-16 to UTF-8
>>>> and back and get exactly the same sequence of bytes.
>>>>
>>> Ok, I have to back pedal here a bit. You are correct, UTF-8 is a
>>> Unicode format and can store all characters. It's not a UTF-8 vs
>>> UTF-16 issue (Friday senior moment). What I recall being told by one
>>> of the subversion developers was that subversion only supported the
>>> ASCII character set and while UTF-8 was compatible with ASCII it
>>> didn't truly support Unicode files.
>>> However, this blog entry seems to dispute that:
>>>
>>> http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/
>>>
>>> Would adding that mime-type to this file fix the blame issues this
>>> user is seeing?
>>
>> I think the user is just very lucky. Subversion does not actually try
>> to interpret the svn:mime-type property, other than to determine
>> whether to treat a file as text or binary.
>> (The comment is correct in that the proper parameter is
>> charset=, not encoding=, but that's not important for this discussion).
>>
>> Subversion's merge algorithm depends on being able to detect line
>> endings in the file, and always scans the file as a sequence of bytes.
>> There are several ways to represent line endings in a UTF-16 file
>> (shown here as hex byte sequences):
>>
>>   * 00 0A (Unix newline, UTF16-BE)
>>   * 00 0D 00 0A (Windows newline, UTF16-BE)
>>   * 0A 00 (Unix newline, UTF16-LE)
>>   * 0D 00 0A 00 (Windows newline, UTF16-LE)
>>   * 24 24 (Unicode newline, same in LE and BE)
>>
>> Subversion, however, expects one of the following newline sequences:
>>
>>   * 0A (Unix newline)
>>   * 0D 0A (Windows newline)
>>
>> My best guess as to what's happening is that the 0A bytes, a.k.a. the
>> ASCII newline character, are interpreted as the end-of-line markers,
>> and the zero bytes are treated as part of the text. In most cases, the
>> result will be close to correct, as long as there are no conflicts in
>> the merge -- because Subversion will not emit conflict markers in
>> UTF-16.
>>
>> Of course, if someone used the U+2424 newline code point instead, then
>> in the worst case, the whole file would be interpreted as a single
>> line.
>>
>> -- Brane
>
> Great information.. thanks for that.
>
> Bottom line is use UTF-8 for your text files and svn will be happy and
> work correctly. How hard would it be to create a warning on an add that
> a file looks like UTF-16 and should be converted to UTF-8 otherwise it
> will be treated as a binary file?
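
[Illustration: the byte sequences listed in the quoted message can be
reproduced with a few lines of Python. This is only a sketch of what a
byte-oriented scanner would see when it splits on the 0A byte; it is not
Subversion's code, and split_on_lf and hex_bytes are hypothetical helper
names made up for the example.]

```python
def split_on_lf(raw: bytes) -> list:
    """Split raw bytes at every 0A byte, the way a byte-oriented scanner would."""
    return raw.split(b"\x0a")


def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated hex pairs, e.g. '0D 00 0A 00'."""
    return " ".join(f"{b:02X}" for b in data)


if __name__ == "__main__":
    sample = "a\r\nb\n"  # one Windows newline, one Unix newline
    for codec in ("utf-8", "utf-16-be", "utf-16-le"):
        pieces = split_on_lf(sample.encode(codec))
        # For UTF-16 the pieces carry stray 00 bytes at their edges.
        print(codec, [hex_bytes(p) for p in pieces])

    # U+2424 encodes to 24 24 in both byte orders and contains no 0A byte,
    # so a scanner looking for 0A never sees a line break.
    print(hex_bytes("\u2424".encode("utf-16-le")))
```

[For the UTF-16-LE case the pieces come out as "61 00 0D 00", "00 62 00"
and "00": the line breaks are found, but the zero bytes end up attached
to the neighbouring "lines", which matches the guess in the message.]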

You'd have to extend Subversion's file type detection to detect UTF-16.
See svn_io_detect_mimetype2 at line 3333 in this file:

http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup

Subversion currently only looks at the first 1 KB of a file. It may be
enough to check that this initial part of the file contains only valid
UTF-16 (BE or LE) codes.

Note that the function takes a dictionary of (file extension, MIME type)
pairs, and if it finds a matching type, it doesn't look at the file at
all. This may not be quite correct, given that there are no special file
extensions that would flag that the file contains UTF-16.

-- Brane

-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com
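
[Illustration: a rough Python sketch of the kind of check suggested
above, looking only at the first 1024 bytes. It is not Subversion code
and does not mirror svn_io_detect_mimetype2; sniff_utf16 and
_decodes_cleanly are hypothetical names. One caveat is an assumption of
this sketch rather than something stated in the message: almost any byte
string is "valid" UTF-16, so validity alone is weak evidence. The sketch
therefore also requires either a byte order mark or a lopsided pattern of
zero bytes, which is a heuristic that suits mostly-Latin text and can
misfire on genuinely binary data.]

```python
import codecs
from typing import Optional

SNIFF_BYTES = 1024  # per the message, only the start of the file is examined


def _decodes_cleanly(data: bytes, codec: str) -> bool:
    """True if data is valid in codec, tolerating a code unit cut off at the end."""
    decoder = codecs.getincrementaldecoder(codec)(errors="strict")
    try:
        # final=False: a trailing half unit or lone high surrogate is not an error,
        # because the sniff window may end in the middle of a character.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return False
    return True


def sniff_utf16(head: bytes) -> Optional[str]:
    """Guess 'utf-16-le' or 'utf-16-be' from the start of a file, else None."""
    head = head[:SNIFF_BYTES]
    if head.startswith(codecs.BOM_UTF16_LE):
        guess = "utf-16-le"
    elif head.startswith(codecs.BOM_UTF16_BE):
        guess = "utf-16-be"
    else:
        # Heuristic (assumption): ASCII-range characters put their zero byte
        # in odd positions for LE and in even positions for BE.
        even_nuls = head[0::2].count(0)
        odd_nuls = head[1::2].count(0)
        if even_nuls == odd_nuls:
            return None  # no zero bytes at all, or no clear pattern
        guess = "utf-16-le" if odd_nuls > even_nuls else "utf-16-be"
    return guess if _decodes_cleanly(head, guess) else None


if __name__ == "__main__":
    print(sniff_utf16("SELECT 1;\r\n".encode("utf-16-le")))
    print(sniff_utf16("SELECT 1;\r\n".encode("utf-8")))
```

[A warning on "svn add" could then be driven by a non-None result, but
where such a check would hook into the extension lookup described above
is left open here.]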