Re: [Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position

2012-03-11 Thread Steven D'Aprano
On Sat, Mar 10, 2012 at 08:03:18PM -0500, Dave Angel wrote:

> There are just 256 possible characters in cp1252, and 256 in cp932.

CP932 is also known as MS-KANJI or SHIFT-JIS (actually, one of many
variants of SHIFT-JIS). It is a multi-byte encoding, which means it has
far more than 256 characters.

http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml
http://en.wikipedia.org/wiki/Shift_JIS
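
A minimal sketch of the single-byte versus multi-byte difference, in case it helps. The sample strings are my own choice for illustration and are not taken from the OP's file.

```python
# cp1252 always uses one byte per character; cp932 uses one or two.
ascii_bytes = "A".encode("cp932")
kanji_bytes = "日本".encode("cp932")
print(len(ascii_bytes), len(kanji_bytes))

# U+00E9 has a slot in cp1252 but no mapping at all in cp932.
latin_bytes = "\xe9".encode("cp1252")
try:
    "\xe9".encode("cp932")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(latin_bytes, encodable)
```

So the trouble is not the number of characters cp932 can hold, only that this particular character is not among them.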

The actual problem the OP has got is that the *multi-byte* sequence he 
is trying to print is illegal when interpreted as CP932. Personally I 
think that's a bug in the terminal, or possibly even print, since he's 
not printing bytes but characters, but I haven't given that a lot of 
thought so I might be way out of line.

The quick and dirty fix is to change the encoding of his terminal, so 
that it no longer tries to interpret the characters printed using CP932. 
That will also mean he'll no longer see valid Japanese characters.

But since he appears to be using Windows, I don't know if this is 
possible, or easy.


[...] 
> You can "solve" the problem by pretending the input file is also cp932 
> when you open it. That way you'll get the wrong characters, but no 
> errors.

Not so -- there are multi-byte sequences that can't be read in CP932.

>>> b"\xe9x".decode("cp932")  # this one works
'騙'
>>> b"\xe9!".decode("cp932")  # this one doesn't
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode bytes in position 0-1: illegal multibyte sequence

In any case, the error doesn't occur when he reads the data, but when he 
prints it. Once the data is read, it is already Unicode text, so he 
should be able to print any character. At worst, it will print as a 
missing character (a square box or space) rather than the expected 
glyph. He shouldn't get a UnicodeDecodeError when printing. I smell a 
bug since print shouldn't be decoding anything. (At worst, it needs to 
*encode*.)


-- 
Steven

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position

2012-03-11 Thread Peter Otten
Robert Sjoblom wrote:

> Okay, so here's a fun one. Since I'm on a japanese locale my native
> encoding is cp932. I was thinking of writing a parser for a bunch of
> text files, but I stumbled on even printing the contents due to ...
> something. I don't know what encoding the text file uses, which isn't
> helping my case either (I have asked, but I've yet to get an answer).
> 
> Okay, so:
> 
> address = "C:/Path/to/file/file.ext"
> with open(address, encoding="cp1252") as alpha:

Superfluous readlines() alert:

>     text = alpha.readlines()
>     for line in text:
>         print(line)

You can iterate over the file directly with

# python3
for line in alpha:
    print(line, end="")

or even

sys.stdout.writelines(alpha)

> It starts to print until it hits the wonderful character é or '\xe9',
> where it gives me this happy traceback:
> Traceback (most recent call last):
>   File "C:\Users\Azaz\Desktop\CK2 Map Painter\Parser\test parser.py", line 8, in <module>
>     print(line)
> UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in
> position 13: illegal multibyte sequence
> 
> I can open the document and view it in UltraEdit -- and it displays
> correct characters there -- but UE can't give me what encoding it
> uses. Any chance of solving this without having to switch from my
> japanese locale? Also, the cp1252 is just an educated guess, but it
> doesn't really matter because it always comes back to the cp932 error.

# python3
import codecs
import sys

filename = "C:/Path/to/file/file.ext"  # Robert's `address`
output_encoding = sys.stdout.encoding or "UTF-8"
error_handling = "replace"
Writer = codecs.getwriter(output_encoding)

outstream = Writer(sys.stdout.buffer, error_handling)
with open(filename, "r", encoding="cp1252") as instream:
    for line in instream:
        print(line, end="", file=outstream)


error_handling = "replace" prints "?" for characters that cannot be 
displayed in the target encoding.
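
For comparison, a small sketch of what a few of the standard error handlers do with the offending character. The word "café" is just an example string I made up; the handler names are the ones the codecs machinery accepts.

```python
text = "caf\xe9"

# "replace" substitutes a question mark for anything cp932 cannot represent.
replaced = text.encode("cp932", errors="replace")
# "backslashreplace" keeps the information as an escape sequence instead.
escaped = text.encode("cp932", errors="backslashreplace")
# "ignore" silently drops the character.
dropped = text.encode("cp932", errors="ignore")

print(replaced, escaped, dropped)
```

Which handler is appropriate depends on whether you need to see that something was lost; "backslashreplace" is handy while debugging.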




Re: [Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position

2012-03-11 Thread Peter Otten
Steven D'Aprano wrote:

> glyph. He shouldn't get a UnicodeDecodeError when printing. I smell a
> bug since print shouldn't be decoding anything. (At worst, it needs to
> *encode*.)

You have correctly derived the actual traceback ;)

[Robert]
> It starts to print until it hits the wonderful character é or '\xe9',
> where it gives me this happy traceback:
> Traceback (most recent call last):
>   File "C:\Users\Azaz\Desktop\CK2 Map Painter\Parser\test parser.py", line 8, in <module>
>     print(line)
> UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in
> position 13: illegal multibyte sequence
 
In nuce:

$ PYTHONIOENCODING=cp932 python3 -c 'print("\xe9")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position 0: illegal multibyte sequence

(I have to lie about the encoding; my terminal speaks UTF-8)
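
The same thing can probably be reproduced without touching the environment, by printing to a stand-in for stdout. This is only a sketch: fake_stdout is a name I made up, and it wraps an in-memory buffer rather than a real console.

```python
import io

# A text stream that encodes to cp932, like the OP's console does.
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding="cp932")

error = None
try:
    print("caf\xe9", file=fake_stdout)
except UnicodeEncodeError as exc:
    error = exc

# Show which character the encoder choked on.
print(type(error).__name__, repr(error.object[error.start:error.end]))
```

The failure happens while encoding on the way out, which matches the UnicodeEncodeError in Robert's traceback.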



[Tutor] Finding a specific line in a body of text

2012-03-11 Thread Robert Sjoblom
I'm sorry if the subject is vague, but I can't really explain it very
well. I've been away from programming for a while now (I got a
daughter and a year after that a son, so I've been busy with family
matters). As such, my skills are definitely rusty.

In the file I'm parsing, I'm looking for specific lines. I don't know
the content of these lines but I do know the content that appears two
lines before. As such I thought that maybe I'd flag for a found line
and then flag the next two lines as well, like so:

if keyword in line:
  flag = 1
  continue
if flag == 1 or flag == 2:
  if flag == 1:
    flag = 2
    continue
  if flag == 2:
    list.append(line)
This, however, turned out to be unacceptably slow; this file is 1.1M
lines, and it takes roughly a minute to go through. I have 450 of
these files; I don't have the luxury to let it run for 8 hours.

So I thought that maybe I could use enumerate() somehow, get the index
when I hit keyword and just append the line at index+2; but I realize
I don't know how to do that. File objects don't have an index
function.

For those curious, the data I'm looking for looks like this:
5 72 88 77 90 92
18 80 75 98 84 90
81
12 58 76 77 94 96

There are other parts of the file that contain similar strings of
digits, so I can't just grab any digits I come across either; the only
thing I have to go on is the keyword. It's obvious that my initial
idea was horribly bad (and I knew that as well, but I wanted to first
make sure that I could find what I was after properly). The structure
looks like this (I opted to use \t instead of relying on the tabs to
getting formatted properly in the email):

\t\tkeyword=
\t\t{
5 72 88 77 90 92 \t\t}
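
One way the enumerate() idea could look, as a rough sketch. The function name lines_two_after and the sample list are hypothetical, and the sample assumes the digits sit on their own line two lines below the keyword.

```python
def lines_two_after(lines, keyword):
    """Collect each line that sits two lines below a line containing keyword."""
    wanted = set()   # indexes of lines still to be collected
    found = []
    for index, line in enumerate(lines):
        if index in wanted:
            wanted.remove(index)
            found.append(line)
        if keyword in line:
            # Remember where the interesting line will turn up.
            wanted.add(index + 2)
    return found

sample = [
    "\t\tkeyword=\n",
    "\t\t{\n",
    "5 72 88 77 90 92\n",
    "\t\t}\n",
    "18 80 75 98 84 90\n",
]
print(lines_two_after(sample, "keyword="))
```

Since enumerate() works on any iterable, the same function should accept an open file object in place of the list.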

-- 
best regards,
Robert S.


[Tutor] question on self

2012-03-11 Thread Michael Lewis
Why do I have to use "self.example" when calling a method inside a class?

For example:

def Play(self):
    '''find scores, reports winners'''
    self.scores = []
    for player in range(self.players):
        print
        print 'Player', player + 1
        self.scores.append(self.TakeTurns())

I have another method called take turns (not shown for brevity purposes).
When I want to call it, why can't I just call it like a function and use
TakeTurns() instead of self.TakeTurns()?

-- 
Michael J. Lewis


Re: [Tutor] question on self

2012-03-11 Thread Steven D'Aprano
On Sun, Mar 11, 2012 at 07:02:11PM -0700, Michael Lewis wrote:
> Why do I have to use "self.example" when calling a method inside a class?
> 
> For example:
> 
> def Play(self):
>     '''find scores, reports winners'''
>     self.scores = []
>     for player in range(self.players):
>         print
>         print 'Player', player + 1
>         self.scores.append(self.TakeTurns())
> 
> I have another method called take turns (not shown for brevity purposes).
> When I want to call it, why can't I just call it like a function and use
> TakeTurns() instead of self.TakeTurns()?

When you call range() inside a method, as you do above, do you 
expect to get the global range() function, or the self.range() 
method (which likely doesn't exist)?

Same for len(), or any other built-in or global.

Similarly, how do you expect Python to distinguish between a persistent 
attribute, like self.scores, and a local variable, like player?

Since Python can't read your mind, one way or another you have to 
explicitly tell the compiler which of the two name resolution 
orders to use:

(1) The normal function scope rules:

- local variables have priority over:
- non-locals, which have priority over:
- globals, which have priority over:
- built-ins;

(2) or the attribute search rules, which are quite complicated but a
simplified version is:

- instance attributes or methods
- class attributes or methods
- superclass attributes or methods
- computed attributes or methods using __getattr__

Python refuses to guess which one you want, since any guess is likely to 
be wrong 50% of the time. Instead, Python's design is to always use 
function scope rules, and if you want attributes or methods, you have to 
explicitly ask for them. This makes MUCH more sense than having to 
explicitly flag local variables!
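
A small illustration of the two lookups side by side. The class and function names here are made up for the example (and it is written for Python 3), so treat it as a sketch rather than the OP's actual code.

```python
def take_turns():
    return "module-level function"

class Game:
    def take_turns(self):
        return "method on the instance"

    def play(self):
        # Bare name: function scope rules (local, non-local, global, built-in).
        plain = take_turns()
        # Dotted name: attribute search rules, starting from self.
        dotted = self.take_turns()
        return plain, dotted

result = Game().play()
print(result)
```

Both names can coexist without clashing precisely because the bare name never looks at the instance.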

Other languages made other choices. For instance, you might demand that 
the programmer declare all their variables up-front, and all their 
instance attributes. Then the compiler can tell at compile-time that 
range is a built-in, that player is a local variable, and that TakeTurns 
is an instance attribute. That's a legitimate choice, and some languages 
do it that way.

But having programmed in some of these other languages, give me Python's 
lack of declarations anytime!

Since all names (variables and attributes) in Python are generated at 
runtime, the compiler normally cannot tell what the scope of a name is 
until runtime (with a few exceptions).
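
To see why the compiler can't know in advance, consider this sketch, where an attribute only comes into existence while the program runs. The names are invented for illustration.

```python
class Game:
    pass

game = Game()
method_name = "take_" + "turns"          # name assembled at runtime
setattr(game, method_name, lambda: 6)    # attribute created at runtime

# Nothing in the class body mentions take_turns, yet the lookup succeeds.
score = game.take_turns()
print(score)
```

Nothing short of running the code reveals that game.take_turns exists, so the lookup has to be deferred until then.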


-- 
Steven


Re: [Tutor] Finding a specific line in a body of text

2012-03-11 Thread Steven D'Aprano
On Mon, Mar 12, 2012 at 02:56:36AM +0100, Robert Sjoblom wrote:

> In the file I'm parsing, I'm looking for specific lines. I don't know
> the content of these lines but I do know the content that appears two
> lines before. As such I thought that maybe I'd flag for a found line
> and then flag the next two lines as well, like so:
> 
> if keyword in line:
>   flag = 1
>   continue
> if flag == 1 or flag == 2:
>   if flag == 1:
>     flag = 2
>     continue
>   if flag == 2:
>     list.append(line)


You haven't shown us the critical part: how are you getting the lines in 
the first place?

(Also, you shouldn't shadow built-ins like list as you do above, unless 
you know what you are doing. If you have to ask "what's shadowing?", you 
don't :)


> This, however, turned out to be unacceptably slow; this file is 1.1M
> lines, and it takes roughly a minute to go through. I have 450 of
> these files; I don't have the luxury to let it run for 8 hours.

Really? And how many hours have you spent trying to speed this up? Two? 
Three? Seven? And if it takes people two or three hours to answer your 
question, and you another two or three hours to read it, it would have 
been faster to just run the code as given :)

I'm just saying.

Since you don't show the actual critical part of the code, I'm going to 
make some simple suggestions that you may or may not have already tried.

- don't read files off USB or CD or over the network, because it will 
likely be slow; if you can copy the files onto the local hard drive, 
performance may be better;

- but if you include the copying time, it might not make that much 
difference;

- can you use a dedicated tool for this, like Unix grep or even perl, 
which is optimised for high-speed file manipulations?

- if you need to stick with Python, try this:

# untested
results = []
fp = open('filename')
for line in fp:
    if key in line:
        # Found key, skip the next line and save the following.
        _ = next(fp, '')
        results.append(next(fp, ''))

By the way, the above assumes you are running Python 2.6 or better. In 
Python 2.5, you can define this function:

def next(iterator, default):
    try:
        return iterator.next()
    except StopIteration:
        return default

but it will likely be a little slower.
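
For anyone who wants to try the next() approach without a real file, here is a self-contained sketch for Python 3. The function name collect_after_key and the in-memory sample are assumptions of mine; the sample follows the structure Robert described.

```python
import io

def collect_after_key(stream, key):
    """Skip one line after each key line and keep the line after that."""
    results = []
    for line in stream:
        if key in line:
            next(stream, '')                  # skip the line after the key
            results.append(next(stream, ''))  # keep the one after that
    return results

# io.StringIO stands in for the open file.
fp = io.StringIO("\t\tkeyword=\n\t\t{\n5 72 88 77 90 92\n\t\t}\n")
found = collect_after_key(fp, "keyword=")
print(found)
```

The default argument to next() means a key near the end of the file yields an empty string instead of raising StopIteration.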


Another approach may be to read the whole file into memory in one big 
chunk. 1.1 million lines, by (say) 50 characters per line comes to about 
53 MB per file, which should be small enough to read into memory and 
process it in one chunk. Something like this:

# again untested
text = open('filename').read()
results = []
i = 0
while i < len(text):
    i = text.find(key, i)
    if i == -1: break
    i += len(key)  # skip the rest of the key
    # read ahead to the next newline, twice
    i = text.find('\n', i)
    i = text.find('\n', i + 1)
    # now find the following newline, and save everything up to that
    p = text.find('\n', i + 1)
    if p == -1:  p = len(text)
    results.append(text[i + 1:p])
    i = p  # skip ahead


This will likely break if the key is found without two more lines 
following it.
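
A more defensive variant might look like the sketch below, which simply stops when a key has fewer than two newlines after it. The function name lines_after_key is hypothetical, and I have only tried to reason it through on small strings.

```python
def lines_after_key(text, key):
    """Return the line two lines below each occurrence of key in text."""
    results = []
    pos = 0
    while True:
        pos = text.find(key, pos)
        if pos == -1:
            break
        pos += len(key)
        first = text.find("\n", pos)           # end of the key line
        if first == -1:
            break
        second = text.find("\n", first + 1)    # end of the skipped line
        if second == -1:
            break
        end = text.find("\n", second + 1)      # end of the wanted line
        if end == -1:
            end = len(text)
        results.append(text[second + 1:end])
        pos = end
    return results

sample = "\t\tkeyword=\n\t\t{\n5 72 88 77 90 92\n\t\t}\nkeyword=\n"
print(lines_after_key(sample, "keyword="))
```

The trailing key in the sample has no following lines, so it is skipped rather than causing trouble.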



-- 
Steven


Re: [Tutor] question on self

2012-03-11 Thread Steve Willoughby

On 11-Mar-12 20:03, Steven D'Aprano wrote:
> On Sun, Mar 11, 2012 at 07:02:11PM -0700, Michael Lewis wrote:
>> Why do I have to use "self.example" when calling a method inside a class?
>>
>> For example:
>>
>>     def Play(self):
>>         '''find scores, reports winners'''
>>         self.scores = []
>>         for player in range(self.players):
>>             print
>>             print 'Player', player + 1
>>             self.scores.append(self.TakeTurns())
>>
>> I have another method called take turns (not shown for brevity purposes).
>> When I want to call it, why can't I just call it like a function and use
>> TakeTurns() instead of self.TakeTurns()?


Steven's notes about scoping rules are one reason.  Another is the 
matter of object instance binding.  When you call a method, you're not 
just calling a regular function.  You're calling a function bound to a 
particular object, so by saying self.TakeTurns(), Python knows that the 
object "self" is invoking that method, not some other instance of the
same class.  That method then can access all of that specific object's
attributes as necessary.
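
A short sketch of what that binding amounts to, written for Python 3 with made-up names:

```python
class Game:
    def __init__(self, name):
        self.name = name

    def take_turns(self):
        return self.name + " takes a turn"

first = Game("first")
second = Game("second")

bound = first.take_turns             # bound method: remembers `first`
via_instance = bound()
via_class = Game.take_turns(first)   # same call with self passed explicitly

print(via_instance, "|", second.take_turns())
```

The bound method carries its instance along in `__self__`, which is how the function body ends up seeing the right object's attributes.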


--
Steve Willoughby / st...@alchemy.com
"A ship in harbor is safe, but that is not what ships are built for."
PGP Fingerprint 4615 3CCE 0F29 AE6C 8FF4 CA01 73FE 997A 765D 696C


Re: [Tutor] Finding a specific line in a body of text

2012-03-11 Thread Robert Sjoblom
> You haven't shown us the critical part: how are you getting the lines in
> the first place?

Ah, yes --
with open(address, "r", encoding="cp1252") as instream:
    for line in instream:

> (Also, you shouldn't shadow built-ins like list as you do above, unless
> you know what you are doing. If you have to ask "what's shadowing?", you
> don't :)
Maybe I should have said list_name.append() instead; sorry for that.

>> This, however, turned out to be unacceptably slow; this file is 1.1M
>> lines, and it takes roughly a minute to go through. I have 450 of
>> these files; I don't have the luxury to let it run for 8 hours.
>
> Really? And how many hours have you spent trying to speed this up? Two?
> Three? Seven? And if it takes people two or three hours to answer your
> question, and you another two or three hours to read it, it would have
> been faster to just run the code as given :)
Yes, for one set of files. Since I don't know how many sets of ~450
files I'll have to run this over, I think that asking for help was a
rather acceptable loss of time. I work on other parts while waiting
anyway, or try and find out on my own as well.

> - if you need to stick with Python, try this:
>
> # untested
> results = []
> fp = open('filename')
> for line in fp:
>    if key in line:
>        # Found key, skip the next line and save the following.
>        _ = next(fp, '')
>        results.append(next(fp, ''))

Well that's certainly faster, but not fast enough.
Oh well, I'll continue looking for a solution -- because even with the
speedup it's unacceptable. I'm hoping against hope that I only have to
run it against the last file of each batch of files, but if it turns
out that I don't, I'm in for some exciting days of finding stuff out.
Thanks for all the help though, it's much appreciated!

How do you approach something like this, when someone tells you "we
need you to parse these files. We can't tell you how they're
structured so you'll have to figure that out yourself."? It's just so
much text that it's hard to get a grasp on the structure, and
there's so much information contained in there as well; this is just
the first part of what I'm afraid will be many. I'll try not to bother
this list too much though.
-- 
best regards,
Robert S.


Re: [Tutor] Finding a specific line in a body of text

2012-03-11 Thread ian douglas
Erik Rose gave a good talk today at PyCon about a parsing library he's
working on called Parsimonious. You could maybe look into what he's doing
there, and see if that helps you any... Follow him on Twitter at @erikrose
to see when his session's video is up. His session was named "Parsing
Horrible Things in Python".


Re: [Tutor] Finding a specific line in a body of text

2012-03-11 Thread Steven D'Aprano
On Mon, Mar 12, 2012 at 05:46:39AM +0100, Robert Sjoblom wrote:
> > You haven't shown us the critical part: how are you getting the lines in
> > the first place?
> 
> Ah, yes --
> with open(address, "r", encoding="cp1252") as instream:
>     for line in instream:

Seems reasonable.


> > (Also, you shouldn't shadow built-ins like list as you do above, unless
> > you know what you are doing. If you have to ask "what's shadowing?", you
> > don't :)
> Maybe I should have said list_name.append() instead; sorry for that.

No problems :) Shadowing builtins is fine if you know what you're doing, 
but it's the people who do it without realising that end up causing 
themselves trouble.


> >> This, however, turned out to be unacceptably slow; this file is 1.1M
> >> lines, and it takes roughly a minute to go through. I have 450 of
> >> these files; I don't have the luxury to let it run for 8 hours.
> >
> > Really? And how many hours have you spent trying to speed this up? Two?
> > Three? Seven? And if it takes people two or three hours to answer your
> > question, and you another two or three hours to read it, it would have
> > been faster to just run the code as given :)
> Yes, for one set of files. Since I don't know how many sets of ~450
> files I'll have to run this over, I think that asking for help was a
> rather acceptable loss of time. I work on other parts while waiting
> anyway, or try and find out on my own as well.

All very reasonable. So long as you have considered the alternatives.


> > - if you need to stick with Python, try this:
> >
> > # untested
> > results = []
> > fp = open('filename')
> > for line in fp:
> >    if key in line:
> >        # Found key, skip the next line and save the following.
> >        _ = next(fp, '')
> >        results.append(next(fp, ''))
> 
> Well that's certainly faster, but not fast enough.

You may have to consider that your bottleneck is not the speed of your 
Python code, but the speed of getting data off the disk into memory. In 
which case, you may be stuck.

I suggest you time how long it takes to process a file using the above, 
then compare it to how long just reading the file takes:

from time import clock
t = clock()
for line in open('filename', encoding='cp1252'):
    pass
print(clock() - t)

Run both timings a couple of times and pick the smallest number, to 
minimise caching effects and other extraneous influences.
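
On newer Pythons the same measurement can be sketched with time.perf_counter(), since time.clock() was removed in Python 3.8. The helper name best_read_time and the throwaway temp file are my own additions so that the example runs anywhere; substitute the real file path.

```python
import os
import tempfile
import time

def best_read_time(path, encoding="cp1252", repeats=3):
    """Return the smallest wall-clock time needed to iterate over the file."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, encoding=encoding) as stream:
            for line in stream:
                pass
        timings.append(time.perf_counter() - start)
    return min(timings)

# Throwaway file standing in for one of the real data files.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="cp1252") as tmp:
    tmp.write("5 72 88 77 90 92\n" * 1000)

best = best_read_time(tmp.name)
os.remove(tmp.name)
print(best)
```

Taking the minimum of several runs follows the advice above about picking the smallest number.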

Then do the same using a system tool. You're using Windows, right? I 
can't tell you how to do it in Windows, but on Linux I'd say:

time cat 'filename' > /dev/null

which should give me a rough-and-ready estimate of the raw speed of 
reading data off the disk. If this speed is not *significantly* better 
than you are getting in Python, then there simply isn't any feasible way 
to speed the code up appreciably. (Except maybe get faster hard drives
or smaller files.)

[...]
> How do you approach something like this, when someone tells you "we
> need you to parse these files. We can't tell you how they're
> structured so you'll have to figure that out yourself."? 

Bitch and moan quietly to myself, and then smile when I realise I'm 
being paid by the hour.

Reverse-engineering a file structure without any documentation is rarely 
simple or fast.



-- 
Steven