hi john,

On Tue, Jun 15, 2010 at 02:40:33PM -0600, John Wright wrote:
>  * Accept 'raw' as a Deb822 constructor encoding argument, or add a
>    raw_strings keyword argument, that turns off the unicode behavior
>    - Con: old code still breaks with mixed data - you have to change
>      your code to use the new constructor argument
>    - Pro: most consistent results (raw strings are only returned if you
>      explicitly ask for them)
>
>  * Wrap unicode stuff in try/except, and use the raw string if
>    something goes wrong
>    - Con: not as consistent results as above option
>    - Pro: old code works out-of-box with mixed data
>
> Which one do you think makes more sense?

the problem with the former is that, since the input is typically outside
the control of the programmer, most/many people would end up always having
to pass the new argument along, which kinda defeats the purpose and also
complicates the api. And I agree about the issue you raise with consistency
in the latter, so I don't think either of the two is that great.

fwiw, after having looked at the code i have found a workaround in the
meantime, which may point at another option for a real solution. since the
encoding seems to be stored on each instance the iterator returns,
explicitly setting it after catching the UnicodeDecodeError seems to get
around the problem:

    slist = deb822.Sources.iter_paragraphs(fh)
    for ent in slist:
        try:
            outf.write(ent.dump().encode('utf-8'))
        except UnicodeDecodeError:
            ent.encoding = 'latin-1'
            outf.write(ent.dump().encode('utf-8'))
        outf.write("\n")

however trying to do this:

    outf.write(ent.dump(encoding='latin-1').encode('utf-8'))

does not work, as it seems there's still something somewhere using the
instance's encoding attribute instead of the function parameter. if *that*
could be fixed, i don't think we'd have a bug here. i.e.:

    slist = deb822.Sources.iter_paragraphs(fh)
    for ent in slist:
        try:
            outf.write(ent.dump().encode('utf-8'))
        except UnicodeDecodeError:
            outf.write(ent.dump(encoding='latin-1').encode('utf-8'))
        outf.write("\n")

is pretty much what I would have expected to need in my code, knowing that
deb822 now uses unicode internally. it feels very python-like and doesn't
involve any extra API changes.

what do you think?

	sean
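
p.s. purely as an illustration, here is the same "try utf-8, fall back to
latin-1" idea on plain byte strings, with no deb822 involved. the helper
name decode_with_fallback is made up for this sketch and is not part of
python-debian, and it assumes python 3 bytes objects rather than the
python 2 strings used in the snippets above:

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Decode raw bytes with the first encoding in `encodings` that works.

    Hypothetical helper for illustration only; not a python-debian API.
    Returns a (text, encoding_used) tuple.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # This encoding cannot represent the input; try the next one.
            continue
    # Only reachable if no listed encoding accepts the bytes. latin-1
    # maps every byte value, so the default list never gets here.
    raise ValueError("none of the given encodings could decode the input")


# Example: a valid utf-8 field and a latin-1 field from a "mixed" file.
utf8_text, utf8_enc = decode_with_fallback(b"caf\xc3\xa9")
latin_text, latin_enc = decode_with_fallback(b"caf\xe9")
```

the order matters: latin-1 accepts any byte sequence, so it only makes
sense as the last resort, otherwise valid utf-8 would be silently
mis-decoded.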