> On Mon, 9 Jul 2012, jmg wrote:
>
> > Package: diff
> > Version: 1:3.0-1
>
> Please note that when reporting bugs, you should try the latest
> version if possible. In this case, wheezy has version 3.2.
I updated to 3.2 just now, in case it is useful for the future.

> > Severity: normal
> >
> > When diff is used to compare two files which are identical line by
> > line and a unique difference which is whether BOM is present or not
> > to indicate UTF-8 encoding, diff does not indicate the good
> > difference.
>
> Please explain what do you mean by "does not indicate the good
> difference".
>
> I have used this
>
> #include <stdio.h>
> int main() {
>     printf("%c%c%c", 0xef,0xbb,0xbf);
>     return 0;
> }
>
> to create a UTF8 BOM and then I've created two text files, one with
> the BOM at the beginning and another one without it. This is what I
> *see* when I make the diff:
>
> 1c1
> < Hello.
> ---
> > Hello.

Yes, this is the issue I am reporting in this bug report.

First, the example above shows that the difference is not visible in a
regular terminal.

Secondly, when I pipe the diff I received from you by mail into
$(cat | od -c), this gives:

0000000   >       1   c   1  \n   >       <       H   e   l   l   o   .
0000020  \n   >       -   -   -  \n   >       >       H   e   l   l   o
0000040   .  \n
0000042

The three BOM bytes no longer appear before "Hello.", which means that,
at the very least, diff output cannot be copied without loss.

For a regular user, this means the output of diff is simply not
understandable, in at least two ways (the first and second points
above). Only an advanced geek can understand this diff output. To me,
the cause is that diff output is not compatible with the Unicode
specification.

> but if I redirect the output to a file and use patch, the first file
> becomes identical to the second file, so the diff is correct.

This just means that diff is compatible with the patch tool. Such
compatibility is essential. But one can expect more from a diff
command: for instance, one can expect that its output can be copied
into a mail without loss; this is not the case, as explained above.

Third, you suggest storing the diff output in a file. I wonder how such
a file should be considered. Is it an octet stream (binary file), to be
handled with byte-oriented tools? It is not what it looks like.
Is it a character stream (text file) compliant with UTF-8, to be
handled with text tools? I do not think so, as explained above, because
not every byte sequence is valid UTF-8. This is ambiguous, and I hope a
future version including UTF-8 support will make it less ambiguous.

> Are you reporting that BOM is invisible like spaces?
> Why should we consider that as a bug in the "diff" program?
> If it's invisible, blame the terminal, not diff!

Fourth, if the BOM is invisible like spaces, diff -w should take care
of it. This is not the case.

Fifth, if we can have a message such as «No newline at end of file», we
should also have a message such as «No BOM at start of file»/«BOM at
start of file».

The terminal is not to blame, as the terminal is Unicode compliant,
like each application is expected to be.

I am just reporting both:
- the diff tool is not Unicode compliant
- the diff man page and diffutils-doc do not state that the tool is not
  Unicode compliant, and might be ambiguous on how the diff tool
  handles text, ASCII and UTF-8.

Fortunately, this has already been identified in $(info diff), in
section «18.1.1 Handling Multibyte and Varying-Width Characters». But I
consider that the diff man page and the diffutils-doc documentation
remain incomplete.

Finally, from my understanding, what makes other applications Unicode
compliant when diff is not is:
- «If the BOM character appears in the middle of a data stream, Unicode
  says it should be interpreted as a "zero-width non-breaking space"
  (essentially a null character). In Unicode 3.2, this usage is
  deprecated in favour of the "Word Joiner" character, U+2060.»
  (Wikipedia®).
- My personal interpretation: the BOM is not in the first line, but
  before the first line, as most text editors do not allow displaying
  or changing it while editing the first line.

In conclusion, I hope for:
- better documentation of current diff limitations
- a future Unicode-enabled diff...
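To illustrate the «BOM at start of file» / «No BOM at start of file» message suggested above, here is a minimal sketch in C, assuming the first bytes of each file are already in memory. The helper names has_utf8_bom and bom_note are hypothetical and are not part of diffutils; the leading backslash in the note only mimics the style of the existing "\ No newline at end of file" marker.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of diffutils: returns 1 if the buffer
   starts with the UTF-8 encoding of U+FEFF (bytes EF BB BF), else 0. */
int has_utf8_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    return len >= sizeof bom && memcmp(buf, bom, sizeof bom) == 0;
}

/* Hypothetical helper: the note a BOM-aware diff might print for one
   side of the comparison, by analogy with "\ No newline at end of file". */
const char *bom_note(const unsigned char *buf, size_t len)
{
    return has_utf8_bom(buf, len) ? "\\ BOM at start of file"
                                  : "\\ No BOM at start of file";
}
```

A caller would read the first three bytes of each input file and print bom_note() for a side only when has_utf8_bom() differs between the two files, so that files that agree on the BOM produce no extra output.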