Re: [Python-Dev] Unicode byte order mark decoding
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>> There are situations where the byte stream might be temporarily
>> exhausted, e.g. an XML parser that tries to support the
>> IncrementalParser interface, or when you want to decode
>> encoded data piecewise, because you want to give a progress
>> report.
>
> Yes, but these are not file-like objects.

True, on the outside there are no file-like objects. But the IncrementalParser gets passed the XML bytes in chunks, so it has to use a stateful decoder for decoding. Unfortunately this means that it has to use a stream API. (See http://www.python.org/sf/1101097 for a patch that somewhat fixes that.)

(Another option would be to completely ignore the stateful API and handcraft stateful decoding (or only support stateless decoding), like most XML parsers for Python do now.)

> In the IncrementalParser,
> it is *not* the case that a read operation returns an empty
> string. Instead, the application repeatedly feeds data explicitly.

That's true, but the parser has to wrap this data into an object that can be passed to the StreamReader constructor. (See the Queue class in Lib/test/test_codecs.py for an example.)

> For a file-like object, returning "" indicates EOF.

Not necessarily. In the example above the IncrementalParser gets fed a chunk of data, it stuffs this data into the Queue, so that the StreamReader can decode it. Once the data from the Queue is exhausted, there won't be any further data until the user calls feed() on the IncrementalParser again.

Bye,
   Walter Dörwald
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
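The Queue-plus-StreamReader arrangement described above can be sketched roughly as follows. This is a minimal illustration written for a current Python 3, not the code from Lib/test/test_codecs.py; the `Queue` class here is a stand-in with assumed `write()`/`read()` methods, and the variable names are made up for the example.

```python
import codecs


class Queue:
    """Minimal file-like byte buffer (illustrative): write() appends, read() drains."""

    def __init__(self):
        self._buffer = b""

    def write(self, data):
        self._buffer += data

    def read(self, size=-1):
        if size < 0:
            data, self._buffer = self._buffer, b""
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


queue = Queue()
reader = codecs.getreader("utf-8")(queue)

# U+00E4 is the two bytes C3 A4 in UTF-8; feed them as separate chunks.
queue.write(b"\xc3")
first = reader.read()    # incomplete sequence: kept in the reader's byte buffer
queue.write(b"\xa4")
second = reader.read()   # sequence is now complete and gets decoded

print(repr(first), repr(second))
```

The first `read()` returns an empty string even though the stream has not ended, which is the situation the message describes: the empty result only means that the queue is drained until more data is fed.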
Re: [Python-Dev] Unicode byte order mark decoding
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:

    Martin> I can't put these two paragraphs together. If you think
    Martin> that explicit is better than implicit, why do you not want
    Martin> to make different calls for the first chunk of a stream,
    Martin> and the subsequent chunks?

Because the signature/BOM is not a chunk, it's a header. Handling the signature/BOM is part of stream initialization, not translation, to my mind.

The point is that explicitly using a stream shows that initialization (and finalization) matter. The default can be BOM or not, as a pragmatic matter. But then the stream data itself can be treated homogeneously, as implied by the notion of stream.

I think it probably also would solve Walter's conundrum about buffering the signature/BOM if responsibility for that were moved out of the codecs and into the objects where signatures make sense.

I don't know whether that's really feasible in the short run---I suspect there may be a lot of stream-like modules that would need to be updated---but it would be saner in the long run.

    >> Yes! Exactly (except in reverse, we want to _read_ from the
    >> slurped stream-as-string, not write to one)! ... and there's
    >> no need for a utf-8-sig codec for strings, since you can
    >> support the usage in exactly this way.

    Martin> However, if there is an utf-8-sig codec for streams, there
    Martin> is currently no way of *preventing* this codec to also be
    Martin> available for strings. The very same code is used for
    Martin> streams and for strings, and automatically so.

And of course it should be. But if it's not possible to move the -sig facility out of the codecs into the streams, that would be a shame. I think we should encourage people to use streams where initialization or finalization semantics are non-trivial, as they are with signatures.

But as long as both utf-8-we-dont-need-no-steenkin-sigs-in-strings and utf-8-sig are available, I can program as I want to (and refer those whose strings get cratered by stray BOMs to you).

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba, Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.
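For readers following along, the difference between the two codecs being argued about can be seen in a few lines. This is only an illustration and assumes a Python version that ships the `utf-8-sig` codec under that name; the variable names are invented for the example.

```python
data = b"\xef\xbb\xbfabc"             # UTF-8 signature followed by "abc"

plain = data.decode("utf-8")          # the signature survives as U+FEFF
stripped = data.decode("utf-8-sig")   # the signature is consumed by the codec
no_sig = b"abc".decode("utf-8-sig")   # a missing signature is not an error
encoded = "abc".encode("utf-8-sig")   # encoding prepends the signature

print(repr(plain), repr(stripped), repr(no_sig), encoded)
```

The stray U+FEFF left behind by the plain `utf-8` codec is the kind of "cratered" string mentioned above.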
Re: [Python-Dev] Unicode byte order mark decoding
Stephen J. Turnbull wrote:
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:
>
>     Martin> I can't put these two paragraphs together. If you think
>     Martin> that explicit is better than implicit, why do you not want
>     Martin> to make different calls for the first chunk of a stream,
>     Martin> and the subsequent chunks?
>
> Because the signature/BOM is not a chunk, it's a header. Handling the
> signature/BOM is part of stream initialization, not translation, to my
> mind.
>
> The point is that explicitly using a stream shows that initialization
> (and finalization) matter. The default can be BOM or not, as a
> pragmatic matter. But then the stream data itself can be treated
> homogeneously, as implied by the notion of stream.
>
> I think it probably also would solve Walter's conundrum about
> buffering the signature/BOM if responsibility for that were moved out
> of the codecs and into the objects where signatures make sense.

Not really. In every encoding where a sequence of more than one byte maps to one Unicode character, you will always need some kind of buffering. If we remove the handling of initial BOMs from the codecs (except for UTF-16 where it is required), this wouldn't change any buffering requirements.

> I don't know whether that's really feasible in the short run---I
> suspect there may be a lot of stream-like modules that would need to
> be updated---but it would be saner in the long run.

I'm not exactly sure what you're proposing here. That all codecs (even UTF-16) pass the BOM through and some other infrastructure is responsible for dropping it?

[...]

Bye,
   Walter Dörwald
Re: [Python-Dev] longobject.c & ob_size
Tim Peters <[EMAIL PROTECTED]> writes:
> [Michael Hudson]
>> Asking mostly for curiosity, how hard would it be to have longs store
>> their sign bit somewhere less aggravating?
>
> Depends on where that is.
>
>> It seems to me that the top bit of ob_digit[0] is always 0, for example,
>
> Yes, the top bit of ob_digit[i], for all relevant i, is 0 on all
> platforms now.
>
>> and while I'm sure this would result in no less convolution in
>> longobject.c, it'd be considerably more localized convolution.
>
> I'd much rather give struct _longobject a distinct sign member (say, 0
> == zero, -1 = non-zero negative, 1 == non-zero positive).

Well, that would indeed be simpler.

> That would simplify code. It would cost no extra bytes for some
> longs, and 8 extra bytes for others (since obmalloc rounds up to a
> multiple of 8); I don't care about that (e.g., I never use millions
> of longs simultaneously, but often use a few dozen very big longs
> simultaneously; the memory difference is in the noise then).
>
> Note that longintrepr.h isn't included by Python.h. Only longobject.h
> is, and longobject.h doesn't reveal the internal structure of longs.
> IOW, changing the internal layout of longs shouldn't even hurt binary
> compatibility.

Bonus.

> The ob_size member of PyObject_VAR_HEAD would also be redeclared as
> size_t in an ideal world.

As nature intended.

I might do a patch, at some point...

Cheers,
mwh

-- 
  Indeed, when I design my killer language, the identifiers "foo" and
  "bar" will be reserved words, never used, and not even mentioned in
  the reference manual. Any program using one will simply dump core
  without comment. Multitudes will rejoice. -- Tim Peters, 29 Apr 1998
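To make the proposal above concrete, here is a rough Python model of a long that keeps its sign in a separate member instead of folding it into the size field. It is only a sketch: `LongSketch`, `from_int` and `to_int` are invented names, and the 15-bit digit width (`SHIFT = 15`) is an assumption made for the illustration rather than something stated in the thread.

```python
from dataclasses import dataclass
from typing import List

SHIFT = 15                  # assumed digit width in bits (illustrative)
MASK = (1 << SHIFT) - 1


@dataclass
class LongSketch:
    sign: int               # 0 == zero, -1 == negative, 1 == positive
    digits: List[int]       # little-endian base-2**SHIFT digits of abs(value)


def from_int(value: int) -> LongSketch:
    """Split an int into an explicit sign and a list of magnitude digits."""
    sign = (value > 0) - (value < 0)
    magnitude = abs(value)
    digits = []
    while magnitude:
        digits.append(magnitude & MASK)
        magnitude >>= SHIFT
    return LongSketch(sign, digits)


def to_int(obj: LongSketch) -> int:
    """Rebuild the int; the sign is applied once, at the very end."""
    magnitude = 0
    for digit in reversed(obj.digits):
        magnitude = (magnitude << SHIFT) | digit
    return obj.sign * magnitude
```

With the sign held separately, `len(digits)` is always a plain non-negative count, which is the simplification being discussed; every digit also stays below `2**SHIFT`, matching the remark that the top bit of each stored digit is unused.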
[Python-Dev] Re: [Python-checkins] python/dist/src/Modules mathmodule.c, 2.74, 2.75
[EMAIL PROTECTED]
> Modified Files:
>       mathmodule.c
> Log Message:
> Add a comment explaining the import of longintrepr.h.
>
> Index: mathmodule.c
...
>  #include "Python.h"
> -#include "longintrepr.h"
> +#include "longintrepr.h" // just for SHIFT

The intent is fine, but please use a standard C (not C++) comment. That is, /*...*/, not //.
Re: [Python-Dev] longobject.c & ob_size
Michael Hudson <[EMAIL PROTECTED]> writes:
> Tim Peters <[EMAIL PROTECTED]> writes:
>
>> [Michael Hudson]
>>> Asking mostly for curiosity, how hard would it be to have longs store
>>> their sign bit somewhere less aggravating?
>>
>> Depends on where that is.

[...]

>> I'd much rather give struct _longobject a distinct sign member (say, 0
>> == zero, -1 = non-zero negative, 1 == non-zero positive).

I ended up doing -1 non-zero negative, 1 zero and positive, but I don't know if this is really clearer than what you suggest overall. I suspect it's a wash.

[...]

> I might do a patch, at some point...

http://python.org/sf/119

Assigned to you, but unassign if you don't have time (testing the patch is probably more worthwhile than reading it!).

Cheers,
mwh

-- 
  Linux: Horse. Like a wild horse, fun to ride. Also prone to
  throwing you and stamping you into the ground because it doesn't
  like your socks. -- Jim's pedigree of operating systems, asr
Re: [Python-Dev] inconsistency when swapping obj.__dict__ with a dict-like object...
On Apr 5, 2005 8:46 PM, Brett C. <[EMAIL PROTECTED]> wrote:
> Alex A. Naanou wrote:
> > Here there are two problems, the first is minor, and it is that
> > anything assigned to the __dict__ attribute is checked to be a
> > descendant of the dict class (mixing this in does not seem to work)...
> > and the second problem is a real annoyance, it is that the mapping
> > protocol supported by the Dict object in the example above is not used
> > by the attribute access mechanics (the same thing that once happened
> > in exec)...
>
> Actually, overriding __getattribute__() does work; __getattr__() and
> __getitem__() don't. This was brought up last month at some point without
> any resolution (I think Steve Bethard pointed it out).

Yeah, here's the link:

http://mail.python.org/pipermail/python-dev/2005-March/051837.html

I've pointed out three possible "solutions" there, but they all have some significant drawbacks. I took the complete silence on the topic as an indication that none of the options were acceptable.

STeVe
-- 
You can wordify anything if you just verb it.
        --- Bucky Katt, Get Fuzzy
Re: [Python-Dev] inconsistency when swapping obj.__dict__ with a dict-like object...
> P.S. (IMHO) the type check here is not that necessary (at least in its
> current state), as what we need to assert is not the relation to the
> dict class but the support of the mapping protocol

The type-check is basically correct - as you have discovered, type & object use the PyDict_* API internally (for speed reasons, as I understand it), so supporting the mapping API is not really sufficient for something assigned to __dict__.

Changing this for exec is one thing, as speed of access to the locals dict isn't likely to have a major impact on the overall performance of such code, but I would expect changing class dictionary access code in a similar way would have a major (detrimental) performance impact.

Depending on the use case, it is possible to work around the problem by defining __dict__, __getattribute__, __setattr__ and __delattr__ in the class. Defining __dict__ sidesteps the type error; defining the other three methods then lets you get around the fact that the standard C-level dict pointer is no longer being updated, as well as making sure the general mapping API is used, rather than the concrete PyDict_* API. This is kinda ugly, but it works as long as any C code using the class __dict__ goes via the attribute access machinery and doesn't try to get the dictionary automatically supplied by Python by digging directly into the type structure.

=
from UserDict import DictMixin

class Dict(DictMixin):
    def __init__(self, dct=None):
        if dct is None:
            dct = {}
        self._dict = dct
    def __getitem__(self, name):
        return self._dict[name]
    def __setitem__(self, name, value):
        self._dict[name] = value
    def __delitem__(self, name):
        del self._dict[name]
    def keys(self):
        return self._dict.keys()

class A(object):
    def __new__(cls, *p, **n):
        o = object.__new__(cls)
        super(A, o).__setattr__('__dict__', Dict())
        return o
    __dict__ = None
    def __getattr__(self, attr):
        try:
            return self.__dict__[attr]
        except KeyError:
            raise AttributeError("%s" % attr)
    def __setattr__(self, attr, value):
        if attr in self.__dict__ or not hasattr(self, attr):
            self.__dict__[attr] = value
        else:
            super(A, self).__setattr__(attr, value)
    def __delattr__(self, attr):
        if attr in self.__dict__:
            del self.__dict__[attr]
        else:
            super(A, self).__delattr__(attr)

Py> a = A()
Py> a.__dict__._dict
{}
Py> a.xxx = 123
Py> a.__dict__._dict
{'xxx': 123}
Py> a.__dict__._dict['yyy'] = 321
Py> a.yyy
321
Py> a.__dict__._dict
{'xxx': 123, 'yyy': 321}
Py> del a.xxx
Py> a.__dict__._dict
{'yyy': 321}
Py> del a.xxx
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<stdin>", line 21, in __delattr__
AttributeError: xxx
Py> a.__dict__ = {}
Py> a.yyy
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<stdin>", line 11, in __getattr__
AttributeError: yyy

Cheers,
Nick.

-- 
Nick Coghlan | [EMAIL PROTECTED] | Brisbane, Australia
---
http://boredomandlaziness.skystorm.net
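The two behaviours this sub-thread is about (the type check on `__dict__` assignment, and attribute lookup going through the concrete dict API rather than the mapping protocol) can be observed directly. The following is a small sketch for a current CPython 3, not part of the original exchange; `Plain` and `RecordingDict` are names invented for the illustration.

```python
from collections import UserDict


class Plain:
    pass


class RecordingDict(dict):
    """A dict subclass that records every __getitem__ call made on it."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.lookups = []

    def __getitem__(self, key):
        self.lookups.append(key)
        return super().__getitem__(key)


obj = Plain()

# 1. A mapping that is not a dict subclass is rejected by the type check.
try:
    obj.__dict__ = UserDict(x=1)
    rejected = False
except TypeError:
    rejected = True

# 2. A dict subclass is accepted, but attribute lookup does not call the
#    overridden __getitem__; the interpreter reads the dict storage directly.
obj.__dict__ = RecordingDict(x=1)
value = obj.x
lookups = list(obj.__dict__.lookups)

print(rejected, value, lookups)
```

The second part is why simply subclassing dict does not help: the override is never consulted, so a workaround has to intercept attribute access itself, as the class `A` in the message does.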
Re: [Python-Dev] Unicode byte order mark decoding
Stephen J. Turnbull wrote:
> Because the signature/BOM is not a chunk, it's a header. Handling the
> signature/BOM is part of stream initialization, not translation, to my
> mind.

I'm sorry, but I'm losing track as to what precisely you are trying to say. You seem to be using a mental model that is entirely different from mine.

> The point is that explicitly using a stream shows that initialization
> (and finalization) matter. The default can be BOM or not, as a
> pragmatic matter. But then the stream data itself can be treated
> homogeneously, as implied by the notion of stream.

But what follows from that point? So it shows some kind of matter... what does that mean for actual changes to Python API?

> I think it probably also would solve Walter's conundrum about
> buffering the signature/BOM if responsibility for that were moved out
> of the codecs and into the objects where signatures make sense.
>
> I don't know whether that's really feasible in the short run---I
> suspect there may be a lot of stream-like modules that would need to
> be updated---but it would be saner in the long run.

What is "that" which might be really feasible? To "solve Walter's conundrum"? That "signatures make sense"?

So I can't really respond to your message in a meaningful way; I just let it rest...

Regards,
Martin
[Python-Dev] Weekly Python Patch/Bug Summary
Patch / Bug Summary
___________________

Patches :  308 open (+11) /  2819 closed ( +7) /  3127 total (+18)
Bugs    :  882 open (+11) /  4913 closed (+13) /  5795 total (+24)
RFE     :  176 open ( +1) /   151 closed ( +1) /   327 total ( +2)

New / Reopened Patches
______________________

improvement of the script adaptation for the win32 platform  (2005-03-30)
       http://python.org/sf/1173134  opened by  Vivian De Smedt

unicodedata docstrings  (2005-03-30)
CLOSED http://python.org/sf/1173245  opened by  Jeremy Yallop

__slots__ for subclasses of variable length types  (2005-03-30)
       http://python.org/sf/1173475  opened by  Michael Hudson

Python crashes in pyexpat.c if malformed XML is parsed  (2005-03-31)
       http://python.org/sf/1173998  opened by  pdecat

hierarchical regular expression  (2005-04-01)
CLOSED http://python.org/sf/1174589  opened by  Chris Ottrey

site enhancements  (2005-04-01)
       http://python.org/sf/1174614  opened by  Bob Ippolito

Export more libreadline API functions  (2005-04-01)
       http://python.org/sf/1175004  opened by  Bruce Edge

Export more libreadline API functions  (2005-04-01)
CLOSED http://python.org/sf/1175048  opened by  Bruce Edge

Patch for whitespace enforcement  (2005-04-01)
CLOSED http://python.org/sf/1175070  opened by  Guido van Rossum

Allow weak referencing of classic classes  (2005-04-03)
       http://python.org/sf/1175850  opened by  Greg Chapman

threading.Condition.wait() return value indicates timeout  (2005-04-03)
       http://python.org/sf/1175933  opened by  Martin Blais

Make subprocess.Popen support file-like objects (win)  (2005-04-03)
       http://python.org/sf/1175984  opened by  Nicolas Fleury

Implemented new 'class foo():pass' syntax  (2005-04-03)
       http://python.org/sf/1176019  opened by  logistix

locale._build_localename treatment for utf8  (2005-04-05)
       http://python.org/sf/1176504  opened by  Hye-Shik Chang

Clarify unicode.(en|de)code.() docstrings  (2005-04-04)
CLOSED http://python.org/sf/1176578  opened by  Brett Cannon

UTF-8-Sig codec  (2005-04-05)
       http://python.org/sf/1177307  opened by  Walter Dörwald

Complex commented  (2005-04-06)
       http://python.org/sf/1177597  opened by  engelbert gruber

explicit sign variable for longs  (2005-04-06)
       http://python.org/sf/119  opened by  Michael Hudson

Patches Closed
______________

unicodedata docstrings  (2005-03-30)
       http://python.org/sf/1173245  closed by  perky

hierarchical regular expression  (2005-04-01)
       http://python.org/sf/1174589  closed by  loewis

Export more libreadline API functions  (2005-04-01)
       http://python.org/sf/1175048  closed by  loewis

Patch for whitespace enforcement  (2005-04-01)
       http://python.org/sf/1175070  closed by  gvanrossum

ast for decorators  (2005-03-21)
       http://python.org/sf/1167709  closed by  nascheme

[ast branch] unicode literal fixes  (2005-03-25)
       http://python.org/sf/1170272  closed by  nascheme

Clarify unicode.(en|de)code.() docstrings  (2005-04-04)
       http://python.org/sf/1176578  closed by  bcannon

New / Reopened Bugs
___________________

very minor doc bug in 'listsort.txt'  (2005-03-30)
CLOSED http://python.org/sf/1173407  opened by  gyrof

quit should quit  (2005-03-30)
CLOSED http://python.org/sf/1173637  opened by  Matt Chaput

multiple broken links in profiler docs  (2005-03-30)
       http://python.org/sf/1173773  opened by  Ilya Sandler

Reading /dev/zero causes SystemError  (2005-04-01)
       http://python.org/sf/1174606  opened by  Adam Olsen

subclassing ModuleType and another built-in type  (2005-04-01)
       http://python.org/sf/1174712  opened by  Armin Rigo

PYTHONPATH is not working  (2005-04-01)
CLOSED http://python.org/sf/1174795  opened by  Alexander Belchenko

property example code error  (2005-04-01)
       http://python.org/sf/1175022  opened by  John Ridley

import statement likely to crash if module launches threads  (2005-04-01)
       http://python.org/sf/1175194  opened by  Jeff Stearns

python hangs if import statement launches threads  (2005-04-01)
CLOSED http://python.org/sf/1175202  opened by  Jeff Stearns

codecs.readline sometimes removes newline chars  (2005-04-02)
CLOSED http://python.org/sf/1175396  opened by  Irmen de Jong

poorly named variable in urllib2.py  (2005-04-03)
       http://python.org/sf/1175848  opened by  Roy Smith

StringIO and cStringIO don't provide 'name' attribute  (2005-04-03)
       http://python.org/sf/1175967  opened by  logistix

compiler module didn't get updated for "class foo():pass"  (2005-04-03)
       http://python.org/sf/1176012  opened by  logistix

Python garbage collector isn't detecting deadlocks  (2005-04-04)
CLOSED http://python.org/sf/1176467  opened by  Nathan Marushak

Readline segfault  (2005-04-05)
       http://python.org/sf/1176893  opened by  Walter Dörwald

[PyPI] Password reset problem.  (2005-04-05)
CLOSED http://python.org/sf/1177077  opened by  Darek Suchojad
Re: [Python-Dev] Unicode byte order mark decoding
On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:
> Note that the UTF-16 codec is strict w/r to the presence
> of the BOM mark: you get a UnicodeError if a stream does
> not start with a BOM mark. For the UTF-8-SIG codec, this
> should probably be relaxed to not require the BOM.

I've actually been confused about this point for quite some time now, but never had a chance to bring it up. I do not understand why UnicodeError should be raised if there is no BOM. I know that PEP-100 says:

    'utf-16': 16-bit variable length encoding (little/big endian)

and:

    Note: 'utf-16' should be implemented by using and requiring byte order marks (BOM) for file input/output.

But this appears to be in error, at least in the current Unicode standard. 'utf-16', as defined by the Unicode standard, is big-endian in the absence of a BOM:

---
3.10.D42: UTF-16 encoding scheme:
...
* The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.
---

The current implementation of the utf-16 codecs makes for some irritating gymnastics to write the BOM into the file before reading it if it contains no BOM, which seems quite like a bug in the codec. I allow for the possibility that this was ambiguous in the standard when the PEP was written, but it is certainly not ambiguous now.

-- 
Nick
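The rule quoted from the standard (use the BOM if there is one, otherwise assume big-endian) is easy to express outside the codec. The helper below is a hypothetical sketch, not an existing codec or a proposed patch; `decode_utf16` is a made-up name, and only the `codecs.BOM_UTF16_*` constants and the `utf-16-be`/`utf-16-le` codecs are standard-library features.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16, falling back to big-endian when no BOM is present."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM and no higher-level protocol: treat the data as big-endian.
    return data.decode("utf-16-be")


print(decode_utf16(b"\x00A"), decode_utf16(b"\xff\xfeA\x00"))
```

A wrapper like this avoids the "write a BOM into the data first" gymnastics mentioned above, at the cost of doing the signature check by hand.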
Re: [Python-Dev] Unicode byte order mark decoding
> "Walter" == Walter Dörwald <[EMAIL PROTECTED]> writes:

    Walter> Not really. In every encoding where a sequence of more
    Walter> than one byte maps to one Unicode character, you will
    Walter> always need some kind of buffering. If we remove the
    Walter> handling of initial BOMs from the codecs (except for
    Walter> UTF-16 where it is required), this wouldn't change any
    Walter> buffering requirements.

Sure. My point is that codecs should be stateful only to the extent needed to assemble semantically meaningful units (ie, multioctet coded characters). In particular, they should not need to know about location at the beginning, middle, or end of some stream---because in the context of operating on a string they _can't_.

    >> I don't know whether that's really feasible in the short
    >> run---I suspect there may be a lot of stream-like modules that
    >> would need to be updated---but it would be saner in the long
    >> run.

    Walter> I'm not exactly sure, what you're proposing here. That all
    Walter> codecs (even UTF-16) pass the BOM through and some other
    Walter> infrastructure is responsible for dropping it?

Not exactly. I think that at the lowest level codecs should not implement complex mode-switching internally, but rather explicitly abdicate responsibility to a more appropriate codec.

For example, autodetecting UTF-16 on input would be implemented by a Python program that does something like

    data = stream.read()
    for detector in [ "utf-16-signature", "utf-16-statistical" ]:
        # for the UTF-16 detectors, OUT will always be u"" or None
        out, data, codec = data.decode(detector)
        if codec: break
    while codec:
        more_out, data, codec = data.decode(codec)
        out = out + more_out
    if data:
        # a real program would complain about it
        pass
    process(out)

where decode("utf-16-signature") would be implemented

    def utf_16_signature_internal(data):
        if data[0:2] == "\xfe\xff":
            return (u"", data[2:], "utf-16-be")
        elif data[0:2] == "\xff\xfe":
            return (u"", data[2:], "utf-16-le")
        else:
            # note: data is undisturbed if the detector fails
            return (None, data, None)

The main point is that the detector is just a codec that stops when it figures out what the next codec should be, touches only data that would be incorrect to pass to the next codec, and leaves the data alone if detection fails. utf-16-signature only handles the BOM (if present), and does not handle arbitrary "chunks" of data. Instead, it passes on the rest of the data (including the first chunk) to be handled by the appropriate utf-16-?e codec.

I think that the temptation to encapsulate this logic in a utf-16 codec that "simplifies" things by calling the appropriate utf-16-?e codec itself should be deprecated, but YMMV. What I would really like is for the above style to be easier to achieve than it currently is.

BTW, I appreciate your patience in exploring this; after Martin's remark about different mental models I have to suspect this approach is just somehow un-Pythonic, but fleshing it out this way I can see how it will be useful in the context of a different project.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba, Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.
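The detector-chain idea in this message can be tried out with ordinary functions on a current Python 3. What follows is a sketch under assumptions, not an existing codec interface: the three-element return value, the names `utf_16_signature` and `decode_with_detectors`, and the big-endian `fallback` default are all invented for the illustration, and the chain is simplified to a single hand-off to a standard codec.

```python
import codecs


def utf_16_signature(data):
    """Hypothetical detector: returns (output, remaining_data, next_codec)."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "", data[2:], "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "", data[2:], "utf-16-le"
    return None, data, None   # data is left untouched when detection fails


def decode_with_detectors(data, detectors, fallback="utf-16-be"):
    """Run detectors until one names a codec, then let that codec decode the rest."""
    for detector in detectors:
        out, data, codec = detector(data)
        if codec:
            break
    else:
        # No detector recognised the data; the fallback codec is an assumption.
        out, codec = "", fallback
    return out + data.decode(codec)


print(decode_with_detectors(b"\xff\xfeh\x00i\x00", [utf_16_signature]))
```

The detector consumes only the two signature bytes and never decodes payload itself, which keeps the endianness-specific codecs free of any notion of "start of stream".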