M.-A. Lemburg wrote:
> Walter Dörwald wrote:
>>>>> I'd suggest we keep codecs.lookup() the way it is and
>>>>> instead add new functions to the codecs module, e.g.
>>>>> codecs.getencoderobject() and codecs.getdecoderobject().
>>>>>
>>>>> Changing the codec registration is not much of a problem:
>>>>> we could simply allow 6-tuples to be passed into the
>>>>> registry.
>>>>
>>>> OK, so codecs.lookup() returns 4-tuples, but the registry stores
>>>> 6-tuples and the search functions must return 6-tuples.
>>>> And we add codecs.getencoderobject() and codecs.getdecoderobject()
>>>> as well as new classes codecs.StatefulEncoder and
>>>> codecs.StatefulDecoder. What about old search functions that
>>>> return 4-tuples?
>>>
>>> The registry should then simply set the missing entries to None and
>>> getencoderobject()/getdecoderobject() would then have to raise
>>> an error.
>>
>> Sounds simple enough and we don't lose backwards compatibility.
>>
>>> Perhaps we should also deprecate codecs.lookup() in Py 2.5 ?!
>>
>> +1, but I'd like to have a replacement for this, i.e. a function that
>> returns all info the registry has about an encoding:
>>
>> 1. Name
>> 2. Encoder function
>> 3. Decoder function
>> 4. Stateful encoder factory
>> 5. Stateful decoder factory
>> 6. Stream writer factory
>> 7. Stream reader factory
>>
>> and if this is an object with attributes, we won't have any problems
>> if we extend it in the future.
>
> Shouldn't be a problem: just expose the registry dictionary
> via the _codecs module.
>
> The rest can then be done in a Python function defined in
> codecs.py using a CodecInfo class.
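
To make the 4-tuple/6-tuple handling discussed above concrete, here is a minimal sketch. All function names are hypothetical, and the tuple layout (the two stateful factories appended after the existing four entries) and the choice of LookupError are assumptions; the thread only says "an error".

```python
# Sketch only: every name here is hypothetical, not part of the real codecs module.
# Assumed layout: (encoder, decoder, streamreader, streamwriter,
#                  stateful_encoder_factory, stateful_decoder_factory)


def normalize_entry(entry):
    """Pad a legacy 4-tuple search-function result to a 6-tuple."""
    entry = tuple(entry)
    if len(entry) == 4:
        # Old-style codec: no stateful encoder/decoder factories available.
        return entry + (None, None)
    if len(entry) == 6:
        return entry
    raise TypeError("codec search functions must return 4- or 6-tuples")


def get_encoder_object(entry, encoding):
    """Return the stateful encoder factory, or raise if the codec has none."""
    factory = normalize_entry(entry)[4]
    if factory is None:
        raise LookupError("no stateful encoder for encoding %r" % encoding)
    return factory


def downgrade(entry):
    """What a backwards-compatible lookup() would return: the first four items."""
    return normalize_entry(entry)[:4]
```

With this shape, old search functions keep working unchanged, and only callers that ask for a stateful object on an old-style codec see an error.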

This would require the Python code to call codecs.lookup() and then
look into the codecs dictionary (normalizing the encoding name again).
Maybe we should expose to Python a version of __PyCodec_Lookup() that
accepts both 4- and 6-tuples, and use that? The official
PyCodec_Lookup() would then have to downgrade the 6-tuples to 4-tuples.

>> BTW, if we change the API, can we fix the return value of the
>> stateless functions? As the stateless function always
>> encodes/decodes the complete string, returning the length of the
>> string doesn't make sense.
>> codecs.getencoder() and codecs.getdecoder() would have to continue
>> to return the old variant of the functions, but
>> codecs.getinfo("latin-1").encoder would be the new encoding function.
>
> No: you can still write stateless encoders or decoders that do
> not process the whole input string. Just because we don't have
> any of those in Python, doesn't mean that they can't be written
> and used. A stateless codec might want to leave the work
> of buffering bytes at the end of the input data which cannot
> be processed to the caller.

But what would the caller do with that info? It can't retry
encoding/decoding the rejected input, because the state of the codec
has been thrown away already.

> It is also possible to write
> stateful codecs on top of such stateless encoding and decoding
> functions.

That's what the codec helper functions from Python/_codecs.c are for.

Anyway, I've started implementing a patch that just adds
codecs.StatefulEncoder/codecs.StatefulDecoder. UTF-8, UTF-8-Sig,
UTF-16, UTF-16-LE and UTF-16-BE are already working.

Bye,
   Walter Dörwald
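
For illustration, here is a sketch of the caller-side buffering described in the quoted text, using the consumed-length return value of the stateless UTF-8 decoder. The helper name decode_chunks is hypothetical. It relies on codecs.utf_8_decode() accepting a third "final" argument, as in current CPython; whether the stateless functions of the time could be driven this way is not something the thread confirms.

```python
import codecs


def decode_chunks(chunks):
    """Decode UTF-8 chunks, with the caller buffering unprocessed trailing bytes."""
    pending = b""
    out = []
    for chunk in chunks:
        data = pending + chunk
        # final=False: stop before an incomplete trailing sequence instead of raising.
        text, consumed = codecs.utf_8_decode(data, "strict", False)
        out.append(text)
        # The consumed length tells the caller which bytes to carry over.
        pending = data[consumed:]
    # final=True: leftover bytes at the end of the input are now an error.
    text, _ = codecs.utf_8_decode(pending, "strict", True)
    out.append(text)
    return "".join(out)
```

This works only because the undecoded bytes are the whole of the state here. A codec that has to remember anything else between calls, such as the byte order taken from a UTF-16 BOM, cannot be resumed this way, which is the objection raised above.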