[issue13643] 'ascii' is a bad filesystem default encoding

2011-12-20 Thread Martin Pool

Martin Pool  added the comment:

> I'm not sure why having a locale set to C or something invalid should be 
> considered a Python bug. 

Programs like bzr that hit these problems can tell their users, either in the 
docs or an error message, "change your locale to a UTF-8 one".

There are two problems with this: one is just the practical one that it scales 
poorly to have to tell every user to do this and to take them through working 
out how to set this in a way that covers cron jobs, daemons, things run over 
ssh, etc.

The other problem is that the locale variables primarily describe the locale 
for input/output, and that can very reasonably be different from the filesystem 
encoding.  As a specific common example people may have UTF-8 filenames but 
want a C locale terminal.  If there was a separate LC_FILENAMES then Python 
could respect that and insist people set it, but there isn't.

--
nosy: +poolie

___
Python tracker 
<http://bugs.python.org/issue13643>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13643] 'ascii' is a bad filesystem default encoding

2011-12-20 Thread Martin Pool

Martin Pool  added the comment:

On 21 December 2011 11:01, STINNER Victor  wrote:
>
> Again: please read the discussion (in closed issues) explaining why we removed 
> it (and which problems it introduced).

There's a lot of history, so I'm not sure exactly which problems
you're referring to.  The main problem I see being discussed is that
changing the encoding after Python starts would be dangerous, which I
agree with, but we're not proposing to do that.

--

___
Python tracker 
<http://bugs.python.org/issue13643>
___



[issue13643] 'ascii' is a bad filesystem default encoding

2011-12-20 Thread Martin Pool

Martin Pool  added the comment:

On 21 December 2011 11:26, STINNER Victor  wrote:
> I never checked which locale is used by default for programs called by cron. 
> So I checked: on Fedora 16, programs start with a very few environment 
> variables, and LANG and LC_ALL are not set. You can add "LANG=fr_FR.UTF-8" 
> (for example) to /etc/environment to set the default language for the whole 
> system (for all programs). I checked, it works with cron. Or if you don't 
> want to affect all programs, it is maybe safer to only set the locale for one 
> specific program in your crontab by adding "LANG=fr_FR.UTF-8 " before your 
> command. Example:
>
> * *  *  *  * LANG=fr_FR.UTF-8 /home/haypo/test.sh

That is the correct kind of configuration.  When I say it scales
poorly I mean that every user running a Python program on a unicode
system needs to insert this configuration in every relevant place, and
they need to work this out from what is typically a fairly cryptic
message.  (bzr just added a workaround for this, but for other
programs it still exists.)

Also, my other point is that people may very well want their cron
scripts to send ascii output but cope with unicode filenames.
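
As a rough illustration of "ascii output but unicode filenames", here is a minimal sketch. The helper name `ascii_safe` is hypothetical, and `backslashreplace` is just one of several error handlers a program could choose for its output.

```python
def ascii_safe(name: str) -> str:
    """Return an ASCII-only rendering of a filename, e.g. for cron mail or logs.

    Non-ASCII characters are written as backslash escapes instead of
    raising UnicodeEncodeError.
    """
    return name.encode("ascii", "backslashreplace").decode("ascii")


# The filename keeps its full unicode value inside the program;
# only the text sent to the ASCII-only output channel is escaped.
print(ascii_safe("h\u00e9h\u00e9.txt"))
```

The point of the sketch is that the output encoding is the program's choice at the moment of printing, independent of how the filename was decoded.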

--

___
Python tracker 
<http://bugs.python.org/issue13643>
___



[issue13643] 'ascii' is a bad filesystem default encoding

2011-12-20 Thread Martin Pool

Martin Pool  added the comment:

Thanks for the example.

Like you say, realistically, all data exchanged with other programs
and with the system needs to be in the same encoding.  (User document
content may be in something else.)

On modern systems, this problem is solved by making the standard
encoding UTF-8.  So it is unfortunate that, when no locale is set,
Python3 defaults to ascii for the filesystem.

With no locale set, python3 makes getdefaultencoding() utf-8, so it
seems oddly pessimistic to make the fsencoding only ascii.
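
The two encodings being compared can be inspected directly. This is only a sketch for checking a given interpreter: the filesystem and locale values depend on the platform, the environment variables, and the Python version (later releases changed how the C locale is treated), so no particular output is assumed here.

```python
import locale
import sys

# Encoding used for implicit str <-> bytes conversions; "utf-8" in Python 3.
print("default encoding:   ", sys.getdefaultencoding())

# Encoding Python uses to decode filenames returned by the OS.
print("filesystem encoding:", sys.getfilesystemencoding())

# Encoding derived from the locale (LANG / LC_ALL / LC_CTYPE).
print("locale encoding:    ", locale.getpreferredencoding(False))
```

Running it once normally and once with `LANG=C` in the environment shows whether the interpreter at hand still ties the filesystem encoding to the locale.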

If someone really wants to run everything in iso-8859-1 this patch
would not stop them doing so.

--

___
Python tracker 
<http://bugs.python.org/issue13643>
___



[issue13643] 'ascii' is a bad filesystem default encoding

2011-12-20 Thread Martin Pool

Martin Pool  added the comment:

On 21 December 2011 12:16, Antoine Pitrou  wrote:
>
> Antoine Pitrou  added the comment:
>
> So, you're complaining about something which works, kind of:
>
> $ touch héhé
> $ LANG=C python3 -c "import os; print(os.listdir())"
> ['h\udcc3\udca9h\udcc3\udca9']

It's possible to work around this in some cases, such as listdir, by
coping with the result including some byte strings, and then manually
decoding them.  But there are, iirc, other cases where the call just
fails and there is no easy workaround.
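
For the listdir case, the manual decoding might look like the sketch below. It assumes the name came back the way the quoted listing shows, that is, decoded as ascii with undecodable bytes turned into lone surrogates (PEP 383 "surrogateescape"), and that the bytes on disk really are UTF-8. The helper name `recover_utf8_name` is hypothetical.

```python
def recover_utf8_name(name: str) -> str:
    """Undo an ascii+surrogateescape decode and redecode the bytes as UTF-8."""
    # Lone surrogates U+DC80..U+DCFF map back to the original bytes 0x80..0xFF.
    raw = name.encode("ascii", "surrogateescape")
    # Raises UnicodeDecodeError if the on-disk bytes are not valid UTF-8.
    return raw.decode("utf-8")


listed = "h\udcc3\udca9h\udcc3\udca9"  # value from the quoted os.listdir() output
# ascii() keeps the demonstration printable even on an ASCII-only terminal.
print(ascii(recover_utf8_name(listed)))
```

This only helps where the API hands the escaped name back to the application; it does nothing for calls that fail outright.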

It wasn't impossible to get unicode right in python2, but python3
still thinks it's worth changing things to make it work better.

>> This makes robustly working with non-ascii filenames on different
>> platforms needlessly annoying, given no modern nix should have problems
>> just using UTF-8 in these cases.
>
> So why don't these supposedly "modern" systems at least set the appropriate 
> environment variables for Python to infer the proper character encoding?
> (since these "modern" systems don't have a well-defined encoding...)

The standard encoding is UTF-8.  Python shouldn't need to have a
variable set to tell it this.  Python is making an assumption about
the default but it is a bad assumption.

> The culprit is not Python, it's the Unix crap

Programs need to work with the environments that are available to
them, even though those environments often have flaws.  Windows and
Mac have annoying bugs too, even bugs specifically about Unicode.

--

___
Python tracker 
<http://bugs.python.org/issue13643>
___



[issue13643] 'ascii' is a bad filesystem default encoding

2011-12-21 Thread Martin Pool

Martin Pool  added the comment:

On 21 December 2011 12:41, Antoine Pitrou  wrote:
>
> Antoine Pitrou  added the comment:
>
>> The standard encoding is UTF-8.
>
> How so? I don't know of any Linux or Unix spec which says so. If you get
> the Linux heads to standardize this then I'll certainly be very happy
> (and countless others will, too). But AFAIK this is not the case and I
> don't see why you are asking Python to make a choice that OS vendors
> refuse to make. You are certainly asking the wrong project to solve this
> problem.

It is a de facto, not de jure standard: UTF-8 is how things are
typically stored.  Other software (eg gnome file handling utilities)
makes this assumption.  See eg
<http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux>.

I would be happy to see an authoritative document saying this is how
things _should_ be stored, but I can't find one yet.  But in Unix
there are no ultimate authorities: even if someone announced filenames
are utf-8 there will obviously continue to be many machines where in
practice they are not.

I started asking about it over here, to see if at least Ubuntu can
have an opinion that this is how things should normally be:
https://lists.ubuntu.com/archives/ubuntu-devel/2011-December/034588.html

I'm not sure what you expect a technical solution at the OS level
would look like.  The api is 8-bit strings and that's not likely to
change.  It's possible to have a situation where no locale is
specified.  Applications unavoidably need to have some opinion about
what to do there.  Other applications assume the filenames are utf-8.
Python assumes that text in general will be UTF-8
(getdefaultencoding).

It is almost like your caricature of OS developers as being
anglocentric, but in fact here it's Python that assumes everything is
probably ascii - or more charitably, it is just assuming that failing
when things aren't ascii is the best tradeoff.  Maybe it is.

One OS-level fix is to try to reduce the number of situations where
people see no locale, or the C locale, and give them C.UTF-8 instead.
That is probably worth doing.  But having no locale can still happen,
and I think Python could handle that better, so the changes are
complementary.

--

___
Python tracker 
<http://bugs.python.org/issue13643>
___



[issue13643] 'ascii' is a bad filesystem default encoding

2011-12-21 Thread Martin Pool

Martin Pool  added the comment:

On 22 December 2011 11:21, STINNER Victor  wrote:
> This discussion is becoming very long, I didn't remember the original
> purpose.

The proposal is that in some cases where Python currently assumes
filenames are ascii on Linux, it ought to instead assume they are
utf-8.

> You want to use UTF-8 instead of ASCII, so what? What do you
> want to do with your nicely well decoded filenames? You cannot print it
> to your terminal nor pass it to a subprocess, because your terminal uses
> ASCII, as subprocess. I don't see how it would help you.

When the application has a unicode string, it can always encode itself
in whatever way it thinks most appropriate.  For instance if it is a
network service, the locale in which it was started may be entirely
irrelevant to the encoding it wants to talk to a particular peer.

However, there are or were some Python filesystem APIs where it is
very hard for the application to avoid being limited to the encoding
Python assumes at startup.  Also, for good reasons, the application
cannot change the filesystem encoding once it starts.  So the reason
for proposing a patch to Python is that there is no way for the
application to escape, once Python's assumed all names will be ascii.

It may be that all of those limitations have since been fixed
separately, either through pep383 or separate patches, so the
application at least has a chance to work around it.  It would be nice
to not burden the application or user with working around this when
the filenames really are valid in what should be the user's locale,
but perhaps this is the OS's fault for not having the right locale
configured.

--

___
Python tracker 
<http://bugs.python.org/issue13643>
___



[issue13643] 'ascii' is a bad filesystem default encoding

2011-12-21 Thread Martin Pool

Martin Pool  added the comment:

On 22 December 2011 12:32, STINNER Victor  wrote:
>
> STINNER Victor  added the comment:
>
> On 22/12/2011 02:16, Martin Pool wrote:
>> The proposal is that in some cases where Python currently assumes
>> filenames are ascii on Linux, it ought to instead assume they are
>> utf-8.
>
> Oh, I expected a use case describing the problem, not the proposed
> solution :-)

The problem as I see it is this:

On Linux, filenames are generally (but not always) in UTF-8; people
fairly commonly end up with no locale configured, which causes Python
to decode filenames as ascii.  It is easy for this to end up with them
hitting UnicodeErrors.

>>> You want to use UTF-8 instead of ASCII, so what? What do you
>>> want to do with your nicely well decoded filenames? You cannot print it
>>> to your terminal nor pass it to a subprocess, because your terminal uses
>>> ASCII, as subprocess. I don't see how it would help you.
>>
>> When the application has a unicode string,
>
> Where does this string come from? (It is an important question).

It comes, for example, from the name of a file, or a directory, or the
contents of a symlink.  Or the problem applies equally when the
program has got a unicode string (for example off the network in a
defined encoding) and it is trying to use it to access the filesystem.

> If your locale encoding is ASCII, you cannot write such non-ASCII
> filenames using the keyboard for example.

Sure you can.  The user could enter a backslash-escaped name, which
the program knows to decode to unicode.  The point is the program has
a choice of how it deals with user input, whereas it does not have as
much control in Python of how filenames are encoded.
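
A minimal sketch of the backslash-escaped input idea follows. The function name `parse_escaped_name` is hypothetical, and using the `unicode_escape` codec is an assumption about how a program might choose to interpret such input, not something any particular tool is known to do.

```python
def parse_escaped_name(text: str) -> str:
    """Turn ASCII input such as r'h\\xe9.txt' into the unicode filename it names."""
    # The input is plain ASCII typed at an ASCII-only terminal; the escapes
    # (\xNN, \uNNNN) are interpreted by the unicode_escape codec.
    return text.encode("ascii").decode("unicode_escape")


typed = r"h\xe9.txt"  # what the user typed: backslash, x, e, 9
name = parse_escaped_name(typed)
print(ascii(name))
```

Here the program decides how user input becomes a unicode string, which is exactly the choice it does not get for names coming from the filesystem.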

>  > with working around this when the filenames really are
>  > valid in what should be the user's locale,
>
> On your computer, UTF-8 is maybe a good candidate for "what should be
> the user's locale", but you cannot generalize for all computers.
>
> I also wanted to force UTF-8 everywhere, but you cannot do that or your
> program will just not work in some configurations.

Just to be clear, I'm not proposing to force UTF-8 everywhere.  I am
only proposing to 'break' the case where the user has non-ascii
filenames but, intentionally or not, a locale that specifies only
ascii is used.  With this change, Python will try to decode them as
utf-8, and fail if they're not utf-8.
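
The proposed behaviour can be sketched at the application level as a strict decode. This is an illustration of the rule only, not the patch itself, and `decode_filename` is a hypothetical name.

```python
def decode_filename(raw: bytes) -> str:
    """Decode a raw filename strictly as UTF-8, failing if it is anything else."""
    return raw.decode("utf-8", "strict")


# UTF-8 bytes for 'h', e-acute, '.txt' decode cleanly.
print(ascii(decode_filename(b"h\xc3\xa9.txt")))

# The same name in Latin-1 is not valid UTF-8, so decoding fails loudly.
try:
    decode_filename(b"h\xe9.txt")
except UnicodeDecodeError as exc:
    print("not UTF-8:", exc.reason)
```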

I am coming to think the best step here is just for the OS to do more
to make sure the application does get the appropriate locale.  (For
example, Ubuntu in recent releases uses a pam hook to set LANG for
cron jobs, to avoid the example described above.)

--

___
Python tracker 
<http://bugs.python.org/issue13643>
___



[issue13643] 'ascii' is a bad filesystem default encoding

2011-12-21 Thread Martin Pool

Martin Pool  added the comment:

On 22 December 2011 13:15, STINNER Victor  wrote:
> You cannot pass directly "h\xe9.txt", but if you know the "correct" file 
> system encoding, you can encode it explicitly using str.encode("utf-8").

My recollection was that there were some cases where you couldn't do
this, but perhaps I was wrong or perhaps they're all fixed in
python3.x, or at least perhaps they are better fixed as individual
bugs.  gz may know more.
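
For reference, the explicit-encoding approach Victor describes might look like the sketch below. It assumes a POSIX system whose filesystem stores names as the bytes given (some filesystems normalise names), and it only covers `open` and `os.listdir`; it says nothing about the cases where this reportedly did not work.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    dir_bytes = os.fsencode(tmp)
    # Encode the name ourselves instead of relying on the filesystem encoding.
    name_bytes = "h\u00e9.txt".encode("utf-8")
    path = os.path.join(dir_bytes, name_bytes)

    with open(path, "wb") as f:
        f.write(b"data")

    # A bytes argument makes listdir return bytes, which we decode explicitly.
    found = os.listdir(dir_bytes)
    decoded = [n.decode("utf-8") for n in found]

print([ascii(n) for n in decoded])
```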

> You are trying to do something complex (add hacks for filenames, for a 
> specific configuration) for a simple problem: configure correctly locales.

I think you may be right.

> If you know and you are sure that you are using UTF-8, why not
> simply setting your locale to a UTF-8 locale?

_My_ locale is set properly.  The problem is all the other people in
the world who do not have their locale set to match their files on
disk; telling them each to fix it is tedious.  But perhaps the OS is
the best place to address that, when the incorrect locale is just
accidental not unavoidable.

--

___
Python tracker 
<http://bugs.python.org/issue13643>
___



[issue13643] 'ascii' is a bad filesystem default encoding

2011-12-23 Thread Martin Pool

Martin Pool  added the comment:

Terry, that's fine.  Thanks to everyone who contributed to the discussion.

--

___
Python tracker 
<http://bugs.python.org/issue13643>
___



[issue10951] gcc 4.6 warnings

2011-08-21 Thread Martin Pool

Martin Pool  added the comment:

This fixes the pickle warnings, and cleans up some (I'm pretty sure) dead code 
in there.  The pickle tests all pass.

--
keywords: +patch
nosy: +poolie
Added file: http://bugs.python.org/file22980/20110822-1150-python-warnings.diff

___
Python tracker 
<http://bugs.python.org/issue10951>
___



[issue10951] gcc 4.6 warnings

2011-08-21 Thread Martin Pool

Martin Pool  added the comment:

This fixes every compiler warning so that Python builds with -Werror on Ubuntu 
Oneiric alpha (gcc 4.6.1-7ubuntu1).  

 * PyMem_Resize is a macro that mutates its first argument; the return value 
shouldn't be used.
 * Some variables in sre are (apparently harmlessly) not used when validating 
the opcodes.
 * gethostbyname_r return codes weren't used and should be; it is not 
guaranteed to set h_errno correctly (though it does seem to do so here)
 * a few variables in pthread and tkappintr are not used in some preprocessor 
productions.
 * setup.py should detect linux kernel 3.0 as linux, and therefore able to 
build ossaudiodev
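
The last item concerns platform strings that carry the kernel major version (for example `linux2` versus `linux3`). A version-independent check can be sketched as below. This is not the content of the attached diff, and `is_linux` is a hypothetical helper.

```python
import sys


def is_linux(platform_name: str = sys.platform) -> bool:
    """True for 'linux', 'linux2', 'linux3', ... regardless of kernel major version."""
    return platform_name.startswith("linux")


for candidate in ("linux2", "linux3", "linux", "darwin"):
    print(candidate, is_linux(candidate))
```

Comparing against one exact string such as `linux2` is what breaks when the kernel version changes, so the prefix test is the safer form.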

I don't get all the errors mentioned by haypo.  At least some seem already 
fixed.

--
Added file: http://bugs.python.org/file22985/20110822-1352-gcc-warnings.diff

___
Python tracker 
<http://bugs.python.org/issue10951>
___



[issue1215] Python hang when catching a segfault

2011-08-21 Thread Martin Pool

Martin Pool  added the comment:

The documentation for this can now point to the faulthandler module (in Python 
3).
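
A minimal sketch of what pointing users at faulthandler amounts to: the module is enabled once at startup and dumps Python tracebacks if the process receives a fatal signal such as SIGSEGV. The log file location is an arbitrary choice for this example.

```python
import faulthandler
import tempfile

# Keep the file object alive for the life of the process: faulthandler
# writes to its file descriptor from the signal handler.
log_file = tempfile.NamedTemporaryFile(
    mode="w", suffix=".log", prefix="faulthandler-", delete=False
)

# Install handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
faulthandler.enable(file=log_file)

print("faulthandler enabled:", faulthandler.is_enabled())
print("tracebacks will go to:", log_file.name)
```

This reports the crash rather than recovering from it, which matches the point that a segfault cannot be handled meaningfully from Python code.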

--
nosy: +poolie

___
Python tracker 
<http://bugs.python.org/issue1215>
___



[issue1215] documentation doesn't say that you can't handle C segfaults from python

2011-08-21 Thread Martin Pool

Changes by Martin Pool :


--
title: Python hang when catching a segfault -> documentation doesn't say that 
you can't handle C segfaults from python

___
Python tracker 
<http://bugs.python.org/issue1215>
___



[issue1215] documentation doesn't say that you can't handle C segfaults from python

2011-08-21 Thread Martin Pool

Martin Pool  added the comment:

This patch tries to improve the documentation a bit more to address the issue 
that confused tebeka and to advertise faulthandler.  Could someone review or 
apply it?

--
keywords: +patch
Added file: http://bugs.python.org/file22989/20110822-1525-signal-doc.diff

___
Python tracker 
<http://bugs.python.org/issue1215>
___



[issue10713] re module doesn't describe string boundaries for \b

2011-08-21 Thread Martin Pool

Martin Pool  added the comment:

> Note, 366 above confirms it's never true for an empty string.  The
documentation states that \B "is just the opposite of \b" yet
re.match(r'\b', '') returns None and so does \B so \B isn't the opposite
of \b in all cases.

This is also a bit strange if you follow the Perl line of reasoning of 
imagining there are non-word characters outside the string.  And, indeed, in 
Perl, 

  "" =~ /\B/

is true.

So this patch adds some tests for \b behaviour and some docs.  I think possibly 
\B should actually change, but that would be a bigger (perhaps impossible?) 
change.
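
The boundary behaviour under discussion can be checked with a few assertions. This is a sketch, not the tests from the attached patch. The result of `\B` on the empty string is printed rather than asserted, because it is the case in dispute and may differ between Python versions.

```python
import re

# \b matches between a word character and a non-word character ...
assert re.search(r"\bfoo\b", "a foo b") is not None
# ... and at the start or end of the string when a word character is there.
assert re.match(r"\bfoo", "foo") is not None
assert re.search(r"foo\b", "foo") is not None

# Inside a run of word characters there is no boundary.
assert re.search(r"\bfoo\b", "afoob") is None

# The empty string contains no word character, so \b never matches it.
assert re.match(r"\b", "") is None

# The disputed case: is \B "just the opposite of \b" here?
print("\\B on empty string:", re.match(r"\B", ""))
```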

--
keywords: +patch
nosy: +poolie
Added file: http://bugs.python.org/file22991/20110822-1604-re-docs.diff

___
Python tracker 
<http://bugs.python.org/issue10713>
___



[issue10951] gcc 4.6 warnings

2011-08-21 Thread Martin Pool

Martin Pool  added the comment:

My patch above fixes all the messages so that you get a clean build with the 
current makefile.

-Wuninitialized and 'offset outside constant string' would be worth fixing but 
I can't reproduce them in Python.

I'm personally not so keen on fixing all the signed/unsigned comparisons unless 
they're checked by the default build, because in my experience that has a 
pretty low payoff and some chance of introducing errors.

--

___
Python tracker 
<http://bugs.python.org/issue10951>
___



[issue7584] datetime.rfcformat() for Date and Time on the Internet

2011-08-21 Thread Martin Pool

Martin Pool  added the comment:

Z is well established as meaning "UTC time" 
<http://en.wikipedia.org/wiki/Coordinated_Universal_Time#Time_zones> so 
shouldn't be used for "zone not known."  RFC 3339 is clear that it's equivalent 
to +00:00.  

So the questions seem to be:
 * should there be an included battery to do this format at all?
 * should it represent utc as '+00:00' or as 'Z' by default - applications 
should have the choice.

It's probably reasonable to assume correct Python application code using 
datetime objects will know whether they have a local, utc, or unknown time.

The current patch does not seem to have any way to format an object with a 
declared UTC tzinfo as having a 'Z' suffix, which would be useful.
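
To make the choice concrete, here is a sketch of a formatter that lets the application pick between the two spellings of UTC. The function `format_utc` and its `use_z` parameter are hypothetical and are not the API of the patch under review. Converting other zones to UTC first is a design choice made for this example only.

```python
from datetime import datetime, timezone


def format_utc(dt: datetime, use_z: bool = True) -> str:
    """Format an aware datetime in UTC, spelling the offset as 'Z' or '+00:00'."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        # A naive datetime means "zone not known"; refuse rather than guess.
        raise ValueError("naive datetime: zone not known")
    text = dt.astimezone(timezone.utc).isoformat()
    if use_z and text.endswith("+00:00"):
        text = text[:-6] + "Z"
    return text


stamp = datetime(2011, 8, 21, 12, 0, tzinfo=timezone.utc)
print(format_utc(stamp))               # 'Z' form
print(format_utc(stamp, use_z=False))  # '+00:00' form
```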

--
nosy: +poolie

___
Python tracker 
<http://bugs.python.org/issue7584>
___



[issue1215] documentation doesn't say that you can't handle C segfaults from python

2011-08-30 Thread Martin Pool

Martin Pool  added the comment:

On 31 August 2011 07:56, STINNER Victor  wrote:
>
> STINNER Victor  added the comment:
>
>> def handler(signal, stackframe):
>>     print "OUCH"
>>     stdout.flush()
>>     _exit(1)
>
> What do you want to do on a SIGSEGV? On a real fault, you cannot rely on  
> Python internal state, you cannot use any Python object. To handle a real 
> SIGSEGV fault, you have to implement a signal handler using only *signal 
> safe* functions in C.

Well, strictly speaking, it is very hard or impossible to write C code
that's guaranteed to be safe after an unexpected segv too; who knows
what might have caused it.  The odds are probably better that it will work
in C than in Python.  At any rate I think it's agreed that the
original code is not supported and it's just the docs that need to
change.

So what do you think of
<http://bugs.python.org/file22989/20110822-1525-signal-doc.diff> ?

--

___
Python tracker 
<http://bugs.python.org/issue1215>
___



[issue10951] gcc 4.6 warnings

2012-09-30 Thread Martin Pool

Martin Pool added the comment:

Hi, Martin,

On 20 August 2012 05:25, Martin v. Löwis  wrote:
> Martin v. Löwis added the comment:
>
> (As usual), I'm quite skeptical about this "bulk bug report"; it violates the 
> "one bug at a time" principle, where "one bug" can roughly be defined as 
> "cannot be split into smaller independent issues".

I heartily agree with you in general: as well as being non-atomic,
it's also hard to have a clear test whether such bugs are fixed.  But,
I hope this patch has some value even if the bug is not a great
example.

> For the cases at hand, I think it would be best if somebody with gcc 4.6 
> available just fixed the "easy" ones, i.e. where the code clearly improves 
> when silencing the warning. In these cases, I wouldn't mind if they get 
> checked in without code review; I know some favor review for all changes, in 
> which case a separate issue should be opened for a patch fixing a bunch of 
> these.

I've fixed what I believe to be all the safe/easy warnings in my patch
above.  I would appreciate if someone would review it and if possible
commit it.

--

___
Python tracker 
<http://bugs.python.org/issue10951>
___