Some questions about decode/encode
I use Chinese characters as an example here.
>>>s1='你好吗'
>>>repr(s1)
"'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>>>b1=s1.decode('GBK')
My first question is: what strategy does 'decode' use to separate the
words? I mean, since s1 is a multi-byte-character string, how does it
determine whether to split the string every 2 bytes or every 1 byte?
My second question is: has anyone tested decoding very long MBCS
strings? I tried to decode a long (20+ MB) XML file yesterday, which
turned out very strangely and caused SAX to fail to parse the decoded
string. However, when I used another text editor to convert the file to
utf-8, SAX parsed the content successfully.
I'm not sure whether some special byte sequence or the sheer length of
the text caused this problem. Or maybe it's a bug in Python 2.5?
--
http://mail.python.org/mailman/listinfo/python-list
Re: Some questions about decode/encode
On Jan 24, 1:41 pm, Ben Finney <[EMAIL PROTECTED]>
wrote:
> Ben Finney <[EMAIL PROTECTED]> writes:
> > glacier <[EMAIL PROTECTED]> writes:
>
> > > I use Chinese characters as an example here.
>
> > > >>>s1='你好吗'
> > > >>>repr(s1)
> > > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> > > >>>b1=s1.decode('GBK')
>
> > > My first question is: what strategy does 'decode' use to separate
> > > the words? I mean, since s1 is a multi-byte-character string, how
> > > does it determine whether to split the string every 2 bytes or
> > > every 1 byte?
>
> > The codec you specified ("GBK") is, like any character-encoding
> > codec, a precise mapping between characters and bytes. It's almost
> > certainly not aware of "words", only character-to-byte mappings.
>
> To be clear, I should point out that I didn't mean to imply static
> tabular mappings only. The mappings in a character encoding are often
> more complex and algorithmic.
>
> That doesn't make them any less precise, of course; and the core point
> is that a character-mapping codec is *only* about getting between
> characters and bytes, nothing else.
>
> --
> \ "He who laughs last, thinks slowest." -- Anonymous |
> `\ |
> _o__) |
> Ben Finney
Thanks for your response :)
When I mentioned 'word' in the previous post, I meant character.
Following on from your reply: what will happen if I try to decode a
long string in separate chunks?
I mean:
##
a = '你好吗' * 10
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
##
Could the code above produce any bogus characters in s1?
Thanks :)
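(Editorial note: the safe way to decode a long byte stream in chunks is an incremental decoder, which buffers any trailing partial character between calls. A minimal sketch in Python 3 syntax; 'gbk' stands in here for the Windows-only 'mbcs' codec.)

```python
import codecs

data = ('你好吗' * 10).encode('gbk')          # 60 GBK bytes
decoder = codecs.getincrementaldecoder('gbk')()

out = []
for pos in range(0, len(data), 5):            # 5 deliberately splits characters
    out.append(decoder.decode(data[pos:pos+5]))
out.append(decoder.decode(b'', final=True))   # flush; raises if bytes are left over

result = ''.join(out)
assert result == '你好吗' * 10                # no bogus characters, no errors
```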
Re: Some questions about decode/encode
On Jan 24, 1:49 pm, [EMAIL PROTECTED] wrote:
> On Jan 23, 8:49 pm, glacier <[EMAIL PROTECTED]> wrote:
>
> > I use Chinese characters as an example here.
>
> > >>>s1='你好吗'
> > >>>repr(s1)
>
> > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>
> > >>>b1=s1.decode('GBK')
>
> > My first question is: what strategy does 'decode' use to separate
> > the words?
>
> decode() uses the GBK strategy you specified to determine what
> constitutes a character in your string.
>
> > My second question is: has anyone tested decoding very long MBCS
> > strings? I tried to decode a long (20+ MB) XML file yesterday, which
> > turned out very strangely and caused SAX to fail to parse the
> > decoded string. However, when I used another text editor to convert
> > the file to utf-8, SAX parsed the content successfully.
>
> > I'm not sure whether some special byte sequence or the sheer length
> > of the text caused this problem. Or maybe it's a bug in Python 2.5?
>
> That's probably too vague a description to determine why SAX isn't
> doing what you expect it to.
Do you mean I should post a copy of the XML document?
Re: Some questions about decode/encode
On Jan 24, 5:51 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Jan 24, 2:49 pm, glacier <[EMAIL PROTECTED]> wrote:
>
> > I use Chinese characters as an example here.
>
> > >>>s1='你好吗'
> > >>>repr(s1)
>
> > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>
> > >>>b1=s1.decode('GBK')
>
> > My first question is: what strategy does 'decode' use to separate
> > the words? I mean, since s1 is a multi-byte-character string, how
> > does it determine whether to split the string every 2 bytes or 1 byte?
>
> The usual strategy for encodings like GBK is:
> 1. If the current byte is less than 0x80, it's a 1-byte character.
> 2. Current byte 0x81 to 0xFE inclusive: the current byte and the next
> byte make up a two-byte character.
> 3. Current byte 0x80: undefined (or used, e.g. in cp936, for the
> 1-byte euro character).
> 4. Current byte 0xFF: undefined.
>
> Cheers,
> John
Thanks John, I will try to write a function to test whether the
strategy above explains the problem I described in the first post :)
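(A sketch of such a checking function, following John's four rules, in Python 3 syntax. Note that real GBK also restricts which trail bytes are valid; this simple scan ignores that.)

```python
def find_gbk_anomaly(data):
    """Scan a byte string with the rules above: bytes < 0x80 are 1-byte
    characters, 0x81-0xFE starts a 2-byte character, 0x80 and 0xFF are
    undefined.  Returns the offset of the first anomaly, or -1 if none."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:             # rule 1: single-byte character
            i += 1
        elif 0x81 <= b <= 0xFE:  # rule 2: lead byte of a 2-byte character
            if i + 1 >= n:
                return i         # truncated final character
            i += 2
        else:                    # rules 3 and 4: 0x80 / 0xFF
            return i
    return -1

print(find_gbk_anomaly(b'\xc4\xe3\xba\xc3\xc2\xf0'))  # -1 (well-formed)
print(find_gbk_anomaly(b'\xc4\xe3\xff'))              # 2 (0xFF is undefined)
```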
Re: Some questions about decode/encode
On Jan 24, 3:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote:
> On Thu, 24 Jan 2008 04:52:22 -0200, glacier <[EMAIL PROTECTED]> wrote:
>
> > Following on from your reply: what will happen if I try to decode a
> > long string in separate chunks?
> > I mean:
> > ##
> > a = '你好吗' * 10
> > s1 = u''
> > cur = 0
> > while cur < len(a):
> >     d = min(len(a) - cur, 1023)
> >     s1 += a[cur:cur+d].decode('mbcs')
> >     cur += d
> > ##
>
> > May the code above produce any bogus characters in s1?
>
> Don't do that. You might be splitting the input string at a point that is
> not a character boundary. You won't get bogus output; decode will
> raise a UnicodeDecodeError instead.
> You can control how errors are handled, see
> http://docs.python.org/lib/string-methods.html#l2h-237
>
> --
> Gabriel Genellina
Thanks Gabriel,
I think I understand what will happen if I don't split the string at a
character boundary.
But I'm still not sure whether the decode method itself can mis-detect
a boundary. Can you tell me?
Thanks a lot.
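(Gabriel's warning is easy to reproduce; here is a minimal sketch in Python 3 syntax, using the GBK bytes from the top of the thread.)

```python
# '你好吗' encoded as GBK: three characters, two bytes each.
data = '\u4f60\u597d\u5417'.encode('gbk')   # b'\xc4\xe3\xba\xc3\xc2\xf0'

# Cutting after 3 bytes splits the second character in half:
try:
    data[:3].decode('gbk')
except UnicodeDecodeError as e:
    print('decode failed at byte offset %d' % e.start)   # byte offset 2

# errors='replace' substitutes U+FFFD instead of raising:
print(data[:3].decode('gbk', 'replace'))
```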
Re: Some questions about decode/encode
On Jan 24, 4:44 pm, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
> On Wed, 23 Jan 2008 19:49:01 -0800, glacier wrote:
> > My second question is: has anyone tested decoding very long MBCS
> > strings? I tried to decode a long (20+ MB) XML file yesterday, which
> > turned out very strangely and caused SAX to fail to parse the
> > decoded string.
>
> That's because SAX wants bytes, not a decoded string. Don't decode it
> yourself.
>
> > However, when I used another text editor to convert the file to
> > utf-8, SAX parsed the content successfully.
>
> Because now you feed SAX bytes instead of a unicode string.
>
> Ciao,
> Marc 'BlackJack' Rintsch
Yep. I fed SAX the unicode string because SAX doesn't support my
encoding (GBK).
Is there a better way to handle this? I mean, if I shouldn't convert
the GBK string to a unicode string, what should I do to make SAX work?
Thanks, Marc. :)
Re: Some questions about decode/encode
On Jan 27, 7:20 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Jan 27, 9:17 pm, glacier <[EMAIL PROTECTED]> wrote:
>
> > On Jan 24, 3:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote:
>
> > > On Thu, 24 Jan 2008 04:52:22 -0200, glacier <[EMAIL PROTECTED]> wrote:
>
> > > > Following on from your reply: what will happen if I try to
> > > > decode a long string in separate chunks?
> > > > I mean:
> > > > ##
> > > > a = '你好吗' * 10
> > > > s1 = u''
> > > > cur = 0
> > > > while cur < len(a):
> > > >     d = min(len(a) - cur, 1023)
> > > >     s1 += a[cur:cur+d].decode('mbcs')
> > > >     cur += d
> > > > ##
>
> > > > May the code above produce any bogus characters in s1?
>
> > > Don't do that. You might be splitting the input string at a point that is
> > > not a character boundary. You won't get bogus output; decode will
> > > raise a UnicodeDecodeError instead.
> > > You can control how errors are handled, see
> > > http://docs.python.org/lib/string-methods.html#l2h-237
>
> > > --
> > > Gabriel Genellina
>
> > Thanks Gabriel,
>
> > I think I understand what will happen if I don't split the string
> > at a character boundary.
> > But I'm still not sure whether the decode method itself can
> > mis-detect a boundary. Can you tell me?
>
> > Thanks a lot.
>
> *IF* the file is well-formed GBK, then the codec will not mess up when
> decoding it to Unicode. The usual cause of mess is a combination of a
> human and a text editor :-)
I guess the first thing I should check is whether the file I used to
test is well-formed GBK :)
Re: Some questions about decode/encode
On Jan 27, 7:04 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Jan 27, 9:18 pm, glacier <[EMAIL PROTECTED]> wrote:
>
> > On Jan 24, 4:44 pm, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
>
> > > On Wed, 23 Jan 2008 19:49:01 -0800, glacier wrote:
> > > > My second question is: has anyone tested decoding very long MBCS
> > > > strings? I tried to decode a long (20+ MB) XML file yesterday,
> > > > which turned out very strangely and caused SAX to fail to parse
> > > > the decoded string.
>
> > > That's because SAX wants bytes, not a decoded string. Don't decode it
> > > yourself.
>
> > > > However, when I used another text editor to convert the file to
> > > > utf-8, SAX parsed the content successfully.
>
> > > Because now you feed SAX with bytes instead of a unicode string.
>
> > > Ciao,
> > > Marc 'BlackJack' Rintsch
>
> > Yep. I fed SAX the unicode string because SAX doesn't support my
> > encoding (GBK).
>
> Let's go back to the beginning. What is "SAX"? Show us exactly what
> command or code you used.
>
SAX is the package 'xml.sax' distributed with Python 2.5 :)
1. I read the text from a GBK-encoded XML file, then skip the first
line, which declares the encoding.
2. I convert the string to unicode by calling decode('mbcs').
3. I use xml.sax.parseString to parse the string.
f = file('e:/temp/456.xml', 'rb')
s = f.read()
f.close()
n = 0
for i in xrange(len(s)):
    if s[i] == '\n':
        n += 1
    if n == 1:
        s = s[i+1:]
        break
s = ''+s+''
s = s.decode('mbcs')
xml.sax.parseString(s, handler, handler)
> How did you let this SAX know that the file was encoded in GBK? An
> argument to SAX? An encoding declaration in the first few lines of the
> file? Some other method? ... precise answer please. Or did you expect
> that this SAX would guess correctly what the encoding was without
> being told?
I didn't tell SAX that the file is encoded in GBK, since I used the
'parseString' method.
>
> What does "didn't support my encoding system" mean? Have you actually
> tried pushing raw undecoded GBK at SAX using a suitable documented
> method of telling SAX that the file is in fact encoded in GBK? If so,
> what was the error message that you got?
I mean SAX only supports a limited number of encodings, such as utf-8,
utf-16, etc., which don't include GBK.
>
> How do you know that it's GBK, anyway? Have you considered these
> possible scenarios:
> (1) It's GBK but you are telling SAX that it's GB2312
> (2) It's GB18030 but you are telling SAX it's GBK
>
Frankly speaking, I can't tell whether the file contains any GB18030
characters... ^__^
> HTH,
> John
Re: Some questions about decode/encode
On Jan 28, 5:50 am, John Machin <[EMAIL PROTECTED]> wrote:
> On Jan 28, 7:47 am, "Mark Tolonen" <[EMAIL PROTECTED]>
> wrote:
>
> > >"John Machin" <[EMAIL PROTECTED]> wrote in message
> > >news:[EMAIL PROTECTED]
> > >On Jan 27, 9:17 pm, glacier <[EMAIL PROTECTED]> wrote:
> > >> On Jan 24, 3:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]>
> > >> wrote:
>
> > >*IF* the file is well-formed GBK, then the codec will not mess up when
> > >decoding it to Unicode. The usual cause of mess is a combination of a
> > >human and a text editor :-)
>
> > SAX uses the expat parser. From the pyexpat module docs:
>
> > Expat doesn't support as many encodings as Python does, and its repertoire
> > of encodings can't be extended; it supports UTF-8, UTF-16, ISO-8859-1
> > (Latin1), and ASCII. If encoding is given it will override the implicit or
> > explicit encoding of the document.
>
> > --Mark
>
> Thank you for pointing out where that list of encodings had been
> cunningly concealed. However the relevance of dropping it in as an
> apparent response to my answer to the OP's question about decoding
> possibly butchered GBK strings is what?
>
> In any case, it seems to support other 8-bit encodings e.g. iso-8859-2
> and koi8-r ...
>
> C:\junk>type gbksax.py
> import xml.sax, xml.sax.saxutils
> import cStringIO
>
> unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
> print 'unistr=%r' % unistr
> gbkstr = unistr.encode('gbk')
> print 'gbkstr=%r' % gbkstr
> unistr2 = gbkstr.decode('gbk')
> assert unistr2 == unistr
>
> print "latin1 FF -> utf8 = %r" % '\xff'.decode('iso-8859-1').encode('utf8')
> print "latin2 FF -> utf8 = %r" % '\xff'.decode('iso-8859-2').encode('utf8')
> print "koi8r FF -> utf8 = %r" % '\xff'.decode('koi8-r').encode('utf8')
>
> xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
>
> asciidoc = xml_template % ('ascii', 'The quick brown fox etc')
> utf8doc = xml_template % ('utf-8', unistr.encode('utf8'))
> latin1doc = xml_template % ('iso-8859-1', 'nil illegitimati carborundum' + '\xff')
> latin2doc = xml_template % ('iso-8859-2', 'duo secundus' + '\xff')
> koi8rdoc = xml_template % ('koi8-r', 'Moskva' + '\xff')
> gbkdoc = xml_template % ('gbk', gbkstr)
>
> for doc in (asciidoc, utf8doc, latin1doc, latin2doc, koi8rdoc, gbkdoc):
>     f = cStringIO.StringIO()
>     handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
>     xml.sax.parseString(doc, handler)
>     result = f.getvalue()
>     f.close()
>     print repr(result[result.find('<data>'):])
>
> C:\junk>gbksax.py
> unistr=u'\u4e00W\u4e01X\u4e02Y\u4e03Z'
> gbkstr='[EMAIL PROTECTED]'
> latin1 FF -> utf8 = '\xc3\xbf'
> latin2 FF -> utf8 = '\xcb\x99'
> koi8r FF -> utf8 = '\xd0\xaa'
> '<data>The quick brown fox etc</data>'
> '<data>\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z</data>'
> '<data>nil illegitimati carborundum\xc3\xbf</data>'
> '<data>duo secundus\xcb\x99</data>'
> '<data>Moskva\xd0\xaa</data>'
> Traceback (most recent call last):
> File "C:\junk\gbksax.py", line 27, in
> xml.sax.parseString(doc, handler)
> File "C:\Python25\lib\xml\sax\__init__.py", line 49, in parseString
> parser.parse(inpsrc)
> File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
> xmlreader.IncrementalParser.parse(self, source)
> File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
> self.feed(buffer)
> File "C:\Python25\lib\xml\sax\expatreader.py", line 211, in feed
> self._err_handler.fatalError(exc)
> File "C:\Python25\lib\xml\sax\handler.py", line 38, in fatalError
> raise exception
> xml.sax._exceptions.SAXParseException: <unknown>:1:30: unknown encoding
>
> C:\junk>
Thanks, John.
There's no doubt that you've proved SAX doesn't support the GBK
encoding. But can you suggest how to make SAX parse a GBK string?
Re: Some questions about decode/encode
On Jan 28, 2:31 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Jan 28, 2:53 pm, glacier <[EMAIL PROTECTED]> wrote:
>
> > Thanks, John.
> > There's no doubt that you've proved SAX doesn't support the GBK
> > encoding. But can you suggest how to make SAX parse a GBK string?
>
> Yes, the same suggestion as was given to you by others very early in
> this thread, the same as I demonstrated in the middle of proving that
> SAX doesn't support a GBK-encoded input file.
>
> Suggestion: Recode your input from GBK to UTF-8. Ensure that the XML
> declaration doesn't have an unsupported encoding. Your handler will
> get data encoded as UTF-8. Recode that to GBK if needed.
>
> Here's a cut down version of the previous script, focussed on
> demonstrating that the recoding strategy works.
>
> C:\junk>type gbksax2.py
> import xml.sax, xml.sax.saxutils
> import cStringIO
> unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
> gbkstr = unistr.encode('gbk')
> print 'This is a GBK-encoded string: %r' % gbkstr
> utf8str = gbkstr.decode('gbk').encode('utf8')
> print 'Now recoded as UTF-8 to be fed to a SAX parser: %r' % utf8str
> xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
> utf8doc = xml_template % ('utf-8', unistr.encode('utf8'))
> f = cStringIO.StringIO()
> handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
> xml.sax.parseString(utf8doc, handler)
> result = f.getvalue()
> f.close()
> start = result.find('<data>') + 6
> end = result.find('</data>')
> mydata = result[start:end]
> print "SAX output (UTF-8): %r" % mydata
> print "SAX output recoded to GBK: %r" % mydata.decode('utf8').encode('gbk')
>
> C:\junk>gbksax2.py
> This is a GBK-encoded string: '[EMAIL PROTECTED]'
> Now recoded as UTF-8 to be fed to a SAX parser: '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
> SAX output (UTF-8): '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
> SAX output recoded to GBK: '[EMAIL PROTECTED]'
>
> HTH,
> John
Thanks a lot, John :)
I'll try it.
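(Editorial note: John's recode-then-parse suggestion still applies in today's Python 3, where expat still does not know GBK. The document content and handler names below are illustrative, not from the thread.)

```python
import xml.sax
import xml.sax.handler

# A hypothetical GBK-encoded document that declares its own encoding.
gbk_doc = ('<?xml version="1.0" encoding="gbk"?>'
           '<data>你好吗</data>').encode('gbk')

# Recode to UTF-8 and fix the declaration before feeding the parser.
utf8_doc = (gbk_doc.decode('gbk')
            .replace('encoding="gbk"', 'encoding="utf-8"')
            .encode('utf-8'))

class TextCollector(xml.sax.handler.ContentHandler):
    """Collects the character data the parser reports."""
    def __init__(self):
        self.chunks = []
    def characters(self, content):
        self.chunks.append(content)

handler = TextCollector()
xml.sax.parseString(utf8_doc, handler)
print(''.join(handler.chunks))   # 你好吗
```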
Re: Mailing-Lists (pointer)
On 01/10/23 11:33, dn wrote:
> On 10/01/2023 08.46, Stefan Ram wrote:
> > If anyone is interested: In "comp.misc", there's a discussion about
> > the use of mailing lists in software development.
> > Subject: An objective criteria for deprecating community platforms
> (I did not create this subject!) (and I don't read comp.misc)
> There is an increasingly relevant question though: how do we 'reach'
> as many people as possible, without diluting the (community) value of
> responses?
> At one time, if you wanted to talk to/hear certain folk, you felt
> compelled to join Twitter (see also AOL, MySpace, Facebook, ...).
> Recently many more people have realised that a single, centralised
> (and corporately-owned) 'service' has its down-sides.
> If there are too many channels for communication, it increases the
> difficulty for any one person to 'keep up', e.g. python-list and
> python-forum.
I remember there was once a hot thread on this python-list discussing
abandoning this mailing list and moving all discussion to the forum.
Does anyone know the current status of that decision?
I personally strongly prefer the mailing list. It is open-format,
open-archive, and makes it easy to download and retrieve information
using your preferred indexing tools and homemade scripts.
