On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote:
> On 10/11/13 9:22 AM, Branko Čibej wrote:
> > You'd have to extend Subversion's file type detection to detect UTF-16.
> > See svn_io_detect_mimetype2 in line 3333 in this file:
> >
> > http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
> > Subversion currently only looks at the first 1k Bytes of a file. It may
> > be enough to check that this initial part of the file contains only
> > valid UTF-16 (BE or LE) codes.
>
> Even if all we looked for is the BOM it might be helpful enough. I suspect the
> development tools producing UTF-16 are including BOMs. Windows seems to be
> fond of including them, Notepad puts one even on UTF-8.

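For illustration, a BOM-only check along the lines Ben suggests could look roughly like this. It is a sketch in Python for brevity, not C, and not what svn_io_detect_mimetype2 does today; the function name detect_utf16_bom is made up. It assumes we only see the first bytes of the file, and it treats a UTF-32 BOM as "not UTF-16", since the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE one.

```python
import codecs


def detect_utf16_bom(head: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None, judging only by a leading BOM.

    Hypothetical helper: 'head' stands for the initial chunk of the file.
    """
    # FF FE 00 00 is the UTF-32 LE BOM and would otherwise be mistaken
    # for a UTF-16 LE BOM followed by U+0000, so rule out UTF-32 first.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return None
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # No BOM (or a UTF-8 BOM): leave the decision to the existing heuristics.
    return None
```

Files without a BOM would still need the "valid UTF-16 codes" check Branko describes; this only covers the easy case.
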
Couldn't Subversion automatically convert UTF-16 files to UTF-8 before processing them for diff/merge/blame, and convert any output written back to the original files to UTF-16?

That would require some work, because the existing streams, strings, and files passed around in the code would need to be wrapped so that translation between the internal and the external encoding is seamless. But I don't see why such an approach couldn't be made to work in principle. It might even result in some spring cleaning in the code base and pave the way for improved handling of file formats such as XML for diff and merge.

What do you think? Is it worth adding this to our project ideas page?
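
To make the round trip concrete, here is a minimal sketch, again in Python rather than C. The names to_internal and to_external are hypothetical, nothing like them exists in Subversion, and the sketch assumes the file carries a BOM and fits in memory. A real implementation would wrap streams instead of whole buffers.

```python
import codecs

# BOM -> codec name for the two UTF-16 byte orders.
_UTF16_BOMS = (
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def to_internal(raw: bytes):
    """Translate BOM-marked UTF-16 to UTF-8 for diff/merge/blame.

    Returns (utf8_bytes, codec). codec is None if the data was left alone.
    """
    for bom, codec in _UTF16_BOMS:
        if raw.startswith(bom):
            text = raw[len(bom):].decode(codec)  # raises on invalid UTF-16
            return text.encode("utf-8"), codec
    return raw, None


def to_external(utf8: bytes, codec):
    """Translate processed UTF-8 back to the file's original encoding."""
    if codec is None:
        return utf8
    bom = codecs.BOM_UTF16_LE if codec == "utf-16-le" else codecs.BOM_UTF16_BE
    return bom + utf8.decode("utf-8").encode(codec)
```

The codec returned by to_internal would have to travel with the data, so that merge results are written back with the byte order and BOM the file had originally.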