On 11.10.2013 19:25, Stefan Sperling wrote:
> On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote:
>> On 10/11/13 9:22 AM, Branko Čibej wrote:
>>> You'd have to extend Subversion's file type detection to detect UTF-16.
>>> See svn_io_detect_mimetype2 in line 3333 in this file:
>>>
>>> http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
>>> Subversion currently only looks at the first 1k Bytes of a file. It may
>>> be enough to check that this initial part of the file contains only
>>> valid UTF-16 (BE or LE) codes.
>> Even if all we looked for is the BOM it might be helpful enough. I suspect
>> the development tools producing UTF-16 are including BOMs. Windows seems
>> to be fond of including them, Notepad puts one even on UTF-8.
> Couldn't Subversion automatically convert UTF-16 files to UTF-8 before
> processing them for diff/merge/blame, and convert output written to
> the original files back to UTF-16?

That would be less work than supporting whitespace compression, etc. in
UTF-16, but we'd still have to detect U+2424 as an end-of-line marker in
UTF-8 text. Still, we'd actually have to correctly identify UTF-16
content first, and handle invalid byte sequences.

> That would require some work because existing streams, strings, and files
> passed around in the code would need to be wrapped so that translation
> to/from the internal from/to the external encoding is seamless.
>
> But I don't see why such an approach couldn't be made to work in principle.
> It might even result in some spring cleaning in the code base and pave the
> way for improved handling of file formats such as XML for diff and merge.

Can't see what XML has to do with it. The diff algorithm already uses a
tokenizer; and for XML, that should be good enough most of the time.

> What do you think? Is it worth adding this to our project ideas page?

It's already here:

http://subversion.tigris.org/issues/show_bug.cgi?id=2194

-- Brane

-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com
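
For illustration only, here is a rough Python sketch of the two steps
discussed in this thread: BOM-based UTF-16 detection on the first 1k bytes,
with a validity check on that window, and a UTF-16 <-> UTF-8 round trip
around diff/merge processing. It is not Subversion code and does not use the
svn_io_detect_mimetype2 API. The names detect_utf16, to_utf8, from_utf8 and
SNIFF_SIZE are hypothetical, and it assumes the file carries a BOM.

```python
import codecs

# Assumption: inspect only the first 1k bytes, as the thread says
# Subversion's detection currently does.
SNIFF_SIZE = 1024


def detect_utf16(head: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None for the start of a file."""
    head = head[:SNIFF_SIZE]
    # The UTF-32 LE BOM begins with the UTF-16 LE BOM, so rule it out first.
    if head.startswith(codecs.BOM_UTF32_LE):
        return None
    if head.startswith(codecs.BOM_UTF16_LE):
        codec = "utf-16-le"
    elif head.startswith(codecs.BOM_UTF16_BE):
        codec = "utf-16-be"
    else:
        return None  # no BOM: this sketch does not guess
    # Check that the rest of the window is valid UTF-16. The window may end
    # in the middle of a code unit or surrogate pair; final=False makes the
    # incremental decoder buffer such a tail instead of rejecting it.
    decoder = codecs.getincrementaldecoder(codec)()
    try:
        decoder.decode(head[2:], final=False)
    except UnicodeDecodeError:
        return None  # invalid sequence, e.g. an unpaired surrogate
    return codec


def to_utf8(raw: bytes, codec: str) -> bytes:
    """Convert BOM-prefixed UTF-16 content to UTF-8 for internal processing."""
    return raw[2:].decode(codec).encode("utf-8")


def from_utf8(data: bytes, codec: str) -> bytes:
    """Convert processed UTF-8 back to UTF-16, restoring the original BOM."""
    bom = codecs.BOM_UTF16_LE if codec == "utf-16-le" else codecs.BOM_UTF16_BE
    return bom + data.decode("utf-8").encode(codec)


if __name__ == "__main__":
    original = codecs.BOM_UTF16_LE + "first line\r\nsecond line\n".encode("utf-16-le")
    codec = detect_utf16(original)
    if codec is not None:
        internal = to_utf8(original, codec)  # diff/merge/blame would run on this
        print(from_utf8(internal, codec) == original)
```

A file without a BOM is treated as "not UTF-16" here, which matches Ben's
suggestion of looking only for the BOM. Validating BOM-less content, as the
original proposal describes, would need a heuristic that this sketch does
not attempt.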