Hi,

On Sat, Nov 17, 2012 at 12:59:15AM +0400, Заболотный Андрей wrote:
> Okay, so thanks to Nico Kadel-Garcia I've upgraded subversion to version
> 1.7.7 on my Centos server.
>
> Unfortunately, this did not help. The error is still the same:
>
> svn: E160013: File not found: transaction '18-16', path
> '/%D0%A1%D0%A3%20%D0%90%D0%9A%D0%91/doc/forth-asm/%D0%9A%D0%B0%D1%80%D1%82%D0%B0%20%D0%BF%D0%B0%D0%BC%D1%8F%D1%82%D0%B8.txt'
> svn: E160013: Your commit message was left in a temporary file:
> svn: E160013: 'svn-commit.2.tmp'
>
> I'm out of ideas.
Since that discussion has been going on for a while now and you're at a loss, I'll try to explain things a bit. I recently had to fight with a rather unreliable and unsupported interoperability product "of a large so-called PC software company" which did not yet handle international (read: non-ASCII) content, so I had to add support for that myself (I could send you the patch so you can see what I had to do to fix the problems in my case).

URI encoding was originally governed by RFC 2396, which strictly covers ASCII-range content only. It defines which characters within the payload (i.e., between delimiters!) must be encoded, so that a payload character cannot end up misinterpreted as a delimiter. In your case, however, we *are* talking about international characters. RFC 2396 says very little about the non-ASCII range; AFAICS implementations usually just hex-encode all non-ASCII characters as well, and arguably internationalization is simply outside the scope of RFC 2396 altogether.

RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt) says the following:
"
   Percent-encoded octets (Section 2.1) may be used within a URI to
   represent characters outside the range of the US-ASCII coded
   character set if this representation is allowed by the scheme or by
   the protocol element in which the URI is referenced.  Such a
   definition should specify the character encoding used to map those
   characters to octets prior to being percent-encoded for the URI.
"

In your case the problem could be that SVN correctly applies hex encoding to non-ASCII characters in user-defined content (directory names, file names, ...), but that the *other* party later sends back a request (possibly a rather unrelated one, some time later!) containing exactly this hex-encoded string (its actual meaning may be unknown to the sender, i.e. it's just an opaque "token"/"handle"), and SVN then fails to *decode* it symmetrically before doing the actual filesystem item lookup.

General note: encoding handling should always be perfectly symmetric (and thus reversible without any information loss!) and *layered*: URI en-/decoding should be handled by a different transcoding layer than, say, the predefined-entity escaping needed for transport in an XML protocol, and further transport paths may need yet another transcoding layer. In short: do the required and correct transcoding once per transport mechanism.

There might also be *duplicate* encoding involved, in which case the hex encoding gets (usually improperly) turned into hex-encoded hex encoding; the receiving party then of course decodes only *once* and ends up with one level of hex encoding still in place instead of the original data. Or the server and client might simply mismatch in which spec they conform to (one implements RFC 2396, the other RFC 3986, as one example). The small sketch in the P.S. below illustrates the round-trip and the double-encoding trap.

Please take my statements with a grain of salt since this is a quick-and-dirty reply without much research. Still, these pointers should be quite useful and to the point, especially since I had to go through pretty much the same thing recently.

HTH,

Andreas Mohr

-- 
GNU/Linux. It's not the software that's free, it's you.
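P.S.: To make the symmetry and double-encoding points concrete, here is a minimal Python 3 sketch. It is purely illustrative and not SVN's actual code; the path is just an example modelled on the decoded form of the path in your error message.

from urllib.parse import quote, unquote

# Example path containing non-ASCII (Cyrillic) characters.
path = '/doc/Карта памяти.txt'

# RFC 3986 style: map the characters to UTF-8 octets, then percent-encode them.
encoded = quote(path)        # '/doc/%D0%9A%D0%B0...%D1%82%D0%B8.txt'

# Symmetric handling: exactly one decode restores the original without loss.
assert unquote(encoded) == path

# The double-encoding trap: encoding the already-encoded string turns every
# '%' into '%25'.  A single decode then yields the still-encoded form rather
# than the original path, so a filesystem lookup with it fails.
double_encoded = quote(encoded)
assert unquote(double_encoded) == encoded          # one encoding level remains
assert unquote(unquote(double_encoded)) == path    # only decoding twice recovers it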