Re: [Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position
On Sat, Mar 10, 2012 at 08:03:18PM -0500, Dave Angel wrote:

> There are just 256 possible characters in cp1252, and 256 in cp932.

CP932 is also known as MS-KANJI or SHIFT-JIS (actually, one of many variants of SHIFT-JIS). It is a multi-byte encoding, which means it has far more than 256 characters.

http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml
http://en.wikipedia.org/wiki/Shift_JIS

The actual problem the OP has got is that the *multi-byte* sequence he is trying to print is illegal when interpreted as CP932. Personally I think that's a bug in the terminal, or possibly even print, since he's not printing bytes but characters, but I haven't given that a lot of thought so I might be way out of line.

The quick and dirty fix is to change the encoding of his terminal, so that it no longer tries to interpret the characters printed using CP932. That will also mean he'll no longer see valid Japanese characters. But since he appears to be using Windows, I don't know if this is possible, or easy.

[...]

> You can "solve" the problem by pretending the input file is also cp932
> when you open it. That way you'll get the wrong characters, but no
> errors.

Not so -- there are multi-byte sequences that can't be read in CP932.

>>> b"\xe9x".decode("cp932")  # this one works
'騙'
>>> b"\xe9!".decode("cp932")  # this one doesn't
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode bytes in position 0-1: illegal multibyte sequence

In any case, the error doesn't occur when he reads the data, but when he prints it. Once the data is read, it is already Unicode text, so he should be able to print any character. At worst, it will print as a missing character (a square box or space) rather than the expected glyph. He shouldn't get a UnicodeDecodeError when printing. I smell a bug since print shouldn't be decoding anything. (At worst, it needs to *encode*.)
-- Steven ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
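For readers who want to reproduce both directions of the problem, here is a minimal sketch. It assumes only Python 3 and its bundled cp932 codec; the variable names are illustrative and not taken from the thread.

```python
# Sketch: cp932 is a multi-byte encoding, yet it has no code for 'é' (U+00E9).

# Decoding: 0xE9 is a lead byte, so validity depends on the byte that follows.
ok = b"\xe9x".decode("cp932")        # a valid two-byte sequence
try:
    b"\xe9!".decode("cp932")         # an invalid trail byte
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Encoding: this is the direction print() works in.
try:
    "\xe9".encode("cp932")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# An error handler trades the exception for a placeholder character.
replaced = "caf\xe9".encode("cp932", errors="replace")

print(len(ok), decode_failed, encode_failed, replaced)
```

The encoding failure is the one that matters for print(): text that has already been decoded must be encoded again for the terminal, and that step has no cp932 byte sequence for 'é'.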
Re: [Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position
Robert Sjoblom wrote:

> Okay, so here's a fun one. Since I'm on a japanese locale my native
> encoding is cp932. I was thinking of writing a parser for a bunch of
> text files, but I stumbled on even printing the contents due to ...
> something. I don't know what encoding the text file uses, which isn't
> helping my case either (I have asked, but I've yet to get an answer).
>
> Okay, so:
>
> address = "C:/Path/to/file/file.ext"
> with open(address, encoding="cp1252") as alpha:

Superfluous readlines() alert:

>     text = alpha.readlines()
>     for line in text:
>         print(line)

You can iterate over the file directly with

    # python3
    for line in alpha:
        print(line, end="")

or even

    sys.stdout.writelines(alpha)

> It starts to print until it hits the wonderful character é or '\xe9',
> where it gives me this happy traceback:
> Traceback (most recent call last):
>   File "C:\Users\Azaz\Desktop\CK2 Map Painter\Parser\test parser.py",
> line 8, in <module>
>     print(line)
> UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in
> position 13: illegal multibyte sequence
>
> I can open the document and view it in UltraEdit -- and it displays
> correct characters there -- but UE can't give me what encoding it
> uses. Any chance of solving this without having to switch from my
> japanese locale? Also, the cp1252 is just an educated guess, but it
> doesn't really matter because it always comes back to the cp932 error.

    # python3
    import codecs
    import sys

    output_encoding = sys.stdout.encoding or "UTF-8"
    error_handling = "replace"
    Writer = codecs.getwriter(output_encoding)
    outstream = Writer(sys.stdout.buffer, error_handling)

    with open(filename, "r", encoding="cp1252") as instream:
        for line in instream:
            print(line, end="", file=outstream)

error_handling = "replace" prints "?" for characters that cannot be displayed in the target encoding.
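The same idea can also be sketched with io.TextIOWrapper in place of codecs.getwriter. This is only an illustration: make_lossy_writer is a hypothetical helper name, the encoding is hard-coded to cp932 for the demonstration, and an in-memory BytesIO stands in for sys.stdout.buffer so that the bytes written can be inspected.

```python
import io

def make_lossy_writer(binary_stream, encoding, errors="replace"):
    """Wrap a binary stream so unencodable characters become '?'."""
    return io.TextIOWrapper(
        binary_stream,
        encoding=encoding,
        errors=errors,
        newline="\n",         # no newline translation, keeps the demo portable
        write_through=True,   # hand data to the binary stream immediately
    )

# Stand-in for sys.stdout.buffer (an assumption made for this demo only).
buffer = io.BytesIO()
out = make_lossy_writer(buffer, "cp932")

print("caf\xe9", file=out)    # 'é' has no cp932 representation
out.flush()
written = buffer.getvalue()
print(written)
```

When wrapping the real sys.stdout.buffer this way, keep a reference to the wrapper (or call its detach() method when done), because a TextIOWrapper closes the underlying stream when it is garbage-collected.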
Re: [Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position
Steven D'Aprano wrote:

> glyph. He shouldn't get a UnicodeDecodeError when printing. I smell a
> bug since print shouldn't be decoding anything. (At worst, it needs to
> *encode*.)

You have correctly derived the actual traceback ;)

[Robert]
> It starts to print until it hits the wonderful character é or '\xe9',
> where it gives me this happy traceback:
> Traceback (most recent call last):
>   File "C:\Users\Azaz\Desktop\CK2 Map Painter\Parser\test parser.py",
> line 8, in <module>
>     print(line)
> UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in
> position 13: illegal multibyte sequence

In nuce:

    $ PYTHONIOENCODING=cp932 python3 -c 'print("\xe9")'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position 0: illegal multibyte sequence

(I have to lie about the encoding; my terminal speaks UTF-8)
[Tutor] Finding a specific line in a body of text
I'm sorry if the subject is vague, but I can't really explain it very well. I've been away from programming for a while now (I got a daughter and a year after that a son, so I've been busy with family matters). As such, my skills are definitely rusty.

In the file I'm parsing, I'm looking for specific lines. I don't know the content of these lines but I do know the content that appears two lines before. As such I thought that maybe I'd flag for a found line and then flag the next two lines as well, like so:

    if keyword in line:
        flag = 1
        continue
    if flag == 1 or flag == 2:
        if flag == 1:
            flag = 2
            continue
        if flag == 2:
            list.append(line)

This, however, turned out to be unacceptably slow; this file is 1.1M lines, and it takes roughly a minute to go through. I have 450 of these files; I don't have the luxury to let it run for 8 hours.

So I thought that maybe I could use enumerate() somehow, get the index when I hit keyword and just append the line at index+2; but I realize I don't know how to do that. File objects don't have an index function.

For those curious, the data I'm looking for looks like this:

    5 72 88 77 90 92 18 80 75 98 84 90 81 12 58 76 77 94 96

There are other parts of the file that contain similar strings of digits, so I can't just grab any digits I come across either; the only thing I have to go on is the keyword. It's obvious that my initial idea was horribly bad (and I knew that as well, but I wanted to first make sure that I could find what I was after properly).

The structure looks like this (I opted to use \t instead of relying on the tabs getting formatted properly in the email):

    \t\tkeyword=
    \t\t{
    5 72 88 77 90 92
    \t\t}

--
best regards,
Robert S.
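The enumerate() idea asked about above can be sketched as follows. This is an illustration of the technique only, not a claim that it is faster than the flag version; lines_after_keyword and the sample list are hypothetical names and data modelled on the structure shown in the message.

```python
def lines_after_keyword(lines, keyword, offset=2):
    """Return every line found `offset` lines after a line containing keyword."""
    wanted = set()     # indexes of lines still to be collected
    results = []
    for index, line in enumerate(lines):
        if index in wanted:
            wanted.remove(index)
            results.append(line)
        if keyword in line:
            # Remember which upcoming index holds the data we are after.
            wanted.add(index + offset)
    return results

# Sample data modelled on the structure described in the message.
sample = [
    "\t\tkeyword=\n",
    "\t\t{\n",
    "5 72 88 77 90 92\n",
    "\t\t}\n",
]
print(lines_after_keyword(sample, "keyword"))
```

Because enumerate() works on any iterable, an open file object can be passed in place of the sample list, so no index function on the file is needed.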
[Tutor] question on self
Why do I have to use "self.example" when calling a method inside a class?

For example:

    def Play(self):
        '''find scores, reports winners'''
        self.scores = []
        for player in range(self.players):
            print
            print 'Player', player + 1
            self.scores.append(self.TakeTurns())

I have another method called take turns (not shown for brevity purposes). When I want to call it, why can't I just call it like a function and use TakeTurns() instead of self.TakeTurns()?

--
Michael J. Lewis
Re: [Tutor] question on self
On Sun, Mar 11, 2012 at 07:02:11PM -0700, Michael Lewis wrote:

> Why do I have to use "self.example" when calling a method inside a class?
>
> For example:
>
>     def Play(self):
>         '''find scores, reports winners'''
>         self.scores = []
>         for player in range(self.players):
>             print
>             print 'Player', player + 1
>             self.scores.append(self.TakeTurns())
>
> I have another method called take turns (not shown for brevity purposes).
> When I want to call it, why can't I just call it like a function and use
> TakeTurns() instead of self.TakeTurns()?

When you call range() inside a method, as you do above, do you expect to get the global range() function, or the self.range() method (which likely doesn't exist)? Same for len(), or any other built-in or global.

Similarly, how do you expect Python to distinguish between a persistent attribute, like self.scores, and a local variable, like player?

Since Python can't read your mind, one way or another you have to explicitly tell the compiler which of the two name resolution orders to use:

(1) The normal function scope rules:

    - local variables have priority over:
    - non-locals, which have priority over:
    - globals, which have priority over:
    - built-ins;

(2) or the attribute search rules, which are quite complicated, but a simplified version is:

    - instance attributes or methods
    - class attributes or methods
    - superclass attributes or methods
    - computed attributes or methods using __getattr__

Python refuses to guess which one you want, since any guess is likely to be wrong 50% of the time. Instead, Python's design is to always use function scope rules, and if you want attributes or methods, you have to explicitly ask for them. This makes MUCH more sense than having to explicitly flag local variables!

Other languages made other choices. For instance, you might demand that the programmer declare all their variables up-front, and all their instance attributes. Then the compiler can tell at compile-time that range is a built-in, that player is a local variable, and that TakeTurns is an instance attribute. That's a legitimate choice, and some languages do it that way. But having programmed in some of these other languages, give me Python's lack of declarations anytime!

Since all names (variables and attributes) in Python are generated at runtime, the compiler normally cannot tell what the scope of a name is until runtime (with a few exceptions).

--
Steven
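The two lookup orders described above can be seen side by side in a small sketch. The class and function names are made up for the illustration (written for Python 3) and are not taken from the original poster's program.

```python
def take_turns():
    """A module-level (global) function."""
    return "global function"

class Game:
    def take_turns(self):
        """A method with the same name as the global function."""
        return "method"

    def play(self):
        # A bare name uses function scope rules: local, non-local,
        # global, built-in. The method is never considered.
        bare = take_turns()
        # Attribute access uses the attribute search rules:
        # instance, class, superclasses.
        attribute = self.take_turns()
        return bare, attribute

result = Game().play()
print(result)
```

If the global take_turns() did not exist, the bare call inside play() would raise NameError rather than silently falling back to the method.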
Re: [Tutor] Finding a specific line in a body of text
On Mon, Mar 12, 2012 at 02:56:36AM +0100, Robert Sjoblom wrote:

> In the file I'm parsing, I'm looking for specific lines. I don't know
> the content of these lines but I do know the content that appears two
> lines before. As such I thought that maybe I'd flag for a found line
> and then flag the next two lines as well, like so:
>
> if keyword in line:
>     flag = 1
>     continue
> if flag == 1 or flag == 2:
>     if flag == 1:
>         flag = 2
>         continue
>     if flag == 2:
>         list.append(line)

You haven't shown us the critical part: how are you getting the lines in the first place?

(Also, you shouldn't shadow built-ins like list as you do above, unless you know what you are doing. If you have to ask "what's shadowing?", you don't :)

> This, however, turned out to be unacceptably slow; this file is 1.1M
> lines, and it takes roughly a minute to go through. I have 450 of
> these files; I don't have the luxury to let it run for 8 hours.

Really? And how many hours have you spent trying to speed this up? Two? Three? Seven? And if it takes people two or three hours to answer your question, and you another two or three hours to read it, it would have been faster to just run the code as given :)

I'm just saying.

Since you don't show the actual critical part of the code, I'm going to make some simple suggestions that you may or may not have already tried.

- don't read files off USB or CD or over the network, because it will likely be slow; if you can copy the files onto the local hard drive, performance may be better;

- but if you include the copying time, it might not make that much difference;

- can you use a dedicated tool for this, like Unix grep or even perl, which is optimised for high-speed file manipulations?

- if you need to stick with Python, try this:

    # untested
    results = []
    fp = open('filename')
    for line in fp:
        if key in line:
            # Found key, skip the next line and save the following.
            _ = next(fp, '')
            results.append(next(fp, ''))

By the way, the above assumes you are running Python 2.6 or better. In Python 2.5, you can define this function:

    def next(iterator, default):
        try:
            return iterator.next()
        except StopIteration:
            return default

but it will likely be a little slower.

Another approach may be to read the whole file into memory in one big chunk. 1.1 million lines, by (say) 50 characters per line comes to about 53 MB per file, which should be small enough to read into memory and process it in one chunk. Something like this:

    # again untested
    text = open('filename').read()
    results = []
    i = 0
    while i < len(text):
        i = text.find(key, i)
        if i == -1:
            break
        i += len(key)  # skip the rest of the key
        # read ahead to the next newline, twice
        i = text.find('\n', i)
        i = text.find('\n', i + 1)  # start past the first newline
        # now find the following newline, and save everything up to that
        p = text.find('\n', i + 1)
        if p == -1:
            p = len(text)
        results.append(text[i + 1:p])
        i = p  # skip ahead

This will likely break if the key is found without two more lines following it.

--
Steven
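A variation on the read-everything-into-memory approach is to let the re module do the scanning. This is a sketch under the assumption that the wanted data always sits exactly two lines below the line containing the key, as in the structure posted earlier in the thread; find_payloads and the sample text are hypothetical.

```python
import re

def find_payloads(text, key):
    """Return the line that sits two lines below each line containing key."""
    pattern = re.compile(
        re.escape(key)      # the key itself, taken literally
        + r"[^\n]*\n"       # rest of the key's line
        + r"[^\n]*\n"       # the line in between (for example the brace)
        + r"([^\n]*)"       # the line we want, captured
    )
    return [match.group(1) for match in pattern.finditer(text)]

sample_text = (
    "\t\tkeyword=\n\t\t{\n5 72 88 77 90 92\n\t\t}\n"
    "\t\tother=\n\t\t{\n1 2 3\n\t\t}\n"
    "\t\tkeyword=\n\t\t{\n18 80 75 98 84 90\n\t\t}\n"
)
payloads = find_payloads(sample_text, "keyword")
print(payloads)
```

If a key appears without two further lines after it, the pattern simply does not match there, so that occurrence is skipped instead of raising an error.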
Re: [Tutor] question on self
On 11-Mar-12 20:03, Steven D'Aprano wrote:

> On Sun, Mar 11, 2012 at 07:02:11PM -0700, Michael Lewis wrote:
>> Why do I have to use "self.example" when calling a method inside a class?
>>
>> For example:
>>
>>     def Play(self):
>>         '''find scores, reports winners'''
>>         self.scores = []
>>         for player in range(self.players):
>>             print
>>             print 'Player', player + 1
>>             self.scores.append(self.TakeTurns())
>>
>> I have another method called take turns (not shown for brevity purposes).
>> When I want to call it, why can't I just call it like a function and use
>> TakeTurns() instead of self.TakeTurns()?

Steven's notes about scoping rules are one reason. Another is the matter of object instance binding. When you call a method, you're not just calling a regular function. You're calling a function bound to a particular object, so by saying self.TakeTurns(), Python knows that the object "self" is invoking that method, not some other instance of the same class. That method then can access all of that specific object's attributes as necessary.

--
Steve Willoughby / st...@alchemy.com
"A ship in harbor is safe, but that is not what ships are built for."
PGP Fingerprint 4615 3CCE 0F29 AE6C 8FF4 CA01 73FE 997A 765D 696C
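Instance binding can be illustrated with a short sketch. The Game class below is invented for the example (Python 3) and only loosely mirrors the poster's code.

```python
class Game:
    def __init__(self, name):
        self.name = name
        self.scores = []

    def take_turns(self):
        # Uses the attributes of whichever instance the method is bound to.
        return len(self.name)

    def play(self):
        self.scores.append(self.take_turns())

a = Game("alice")
b = Game("bob")
a.play()

# Only the instance that invoked the method was changed.
print(a.scores, b.scores)

# A bound method call is the plain function with the instance filled in.
same = a.take_turns() == Game.take_turns(a)
bound_to_a = a.take_turns.__self__ is a
print(same, bound_to_a)
```

Writing self.take_turns() is therefore what tells Python which object's data the call should operate on.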
Re: [Tutor] Finding a specific line in a body of text
> You haven't shown us the critical part: how are you getting the lines in
> the first place?

Ah, yes --

    with open(address, "r", encoding="cp1252") as instream:
        for line in instream:

> (Also, you shouldn't shadow built-ins like list as you do above, unless
> you know what you are doing. If you have to ask "what's shadowing?", you
> don't :)

Maybe I should have said list_name.append() instead; sorry for that.

>> This, however, turned out to be unacceptably slow; this file is 1.1M
>> lines, and it takes roughly a minute to go through. I have 450 of
>> these files; I don't have the luxury to let it run for 8 hours.
>
> Really? And how many hours have you spent trying to speed this up? Two?
> Three? Seven? And if it takes people two or three hours to answer your
> question, and you another two or three hours to read it, it would have
> been faster to just run the code as given :)

Yes, for one set of files. Since I don't know how many sets of ~450 files I'll have to run this over, I think that asking for help was a rather acceptable loss of time. I work on other parts while waiting anyway, or try and find out on my own as well.

> - if you need to stick with Python, try this:
>
> # untested
> results = []
> fp = open('filename')
> for line in fp:
>     if key in line:
>         # Found key, skip the next line and save the following.
>         _ = next(fp, '')
>         results.append(next(fp, ''))

Well that's certainly faster, but not fast enough. Oh well, I'll continue looking for a solution -- because even with the speedup it's unacceptable. I'm hoping against hope that I only have to run it against the last file of each batch of files, but if it turns out that I don't, I'm in for some exciting days of finding stuff out. Thanks for all the help though, it's much appreciated!

How do you approach something like this, when someone tells you "we need you to parse these files. We can't tell you how they're structured so you'll have to figure that out yourself."? It's just so much text that it's hard to get a grasp on the structure, and there's so much information contained in there as well; this is just the first part of what I'm afraid will be many. I'll try not to bother this list too much though.

--
best regards,
Robert S.
Re: [Tutor] Finding a specific line in a body of text
Erik Rose gave a good talk today at PyCon about a parsing library he's working on called Parsimonious. You could maybe look into what he's doing there, and see if that helps you any... Follow him on Twitter at @erikrose to see when his session's video is up. His session was named "Parsing Horrible Things in Python".
Re: [Tutor] Finding a specific line in a body of text
On Mon, Mar 12, 2012 at 05:46:39AM +0100, Robert Sjoblom wrote:

> > You haven't shown us the critical part: how are you getting the lines in
> > the first place?
>
> Ah, yes --
> with open(address, "r", encoding="cp1252") as instream:
>     for line in instream:

Seems reasonable.

> > (Also, you shouldn't shadow built-ins like list as you do above, unless
> > you know what you are doing. If you have to ask "what's shadowing?", you
> > don't :)
> Maybe I should have said list_name.append() instead; sorry for that.

No problems :) Shadowing builtins is fine if you know what you're doing, but it's the people who do it without realising that end up causing themselves trouble.

> >> This, however, turned out to be unacceptably slow; this file is 1.1M
> >> lines, and it takes roughly a minute to go through. I have 450 of
> >> these files; I don't have the luxury to let it run for 8 hours.
> >
> > Really? And how many hours have you spent trying to speed this up? Two?
> > Three? Seven? And if it takes people two or three hours to answer your
> > question, and you another two or three hours to read it, it would have
> > been faster to just run the code as given :)
> Yes, for one set of files. Since I don't know how many sets of ~450
> files I'll have to run this over, I think that asking for help was a
> rather acceptable loss of time. I work on other parts while waiting
> anyway, or try and find out on my own as well.

All very reasonable. So long as you have considered the alternatives.

> > - if you need to stick with Python, try this:
> >
> > # untested
> > results = []
> > fp = open('filename')
> > for line in fp:
> >     if key in line:
> >         # Found key, skip the next line and save the following.
> >         _ = next(fp, '')
> >         results.append(next(fp, ''))
>
> Well that's certainly faster, but not fast enough.

You may have to consider that your bottleneck is not the speed of your Python code, but the speed of getting data off the disk into memory. In which case, you may be stuck.

I suggest you time how long it takes to process a file using the above, then compare it to how long just reading the file takes:

    from time import perf_counter
    t = perf_counter()
    for line in open('filename', encoding='cp1252'):
        pass
    print(perf_counter() - t)

Run both timings a couple of times and pick the smallest number, to minimise caching effects and other extraneous influences.

Then do the same using a system tool. You're using Windows, right? I can't tell you how to do it in Windows, but on Linux I'd say:

    time cat 'filename' > /dev/null

which should give me a rough-and-ready estimate of the raw speed of reading data off the disk. If this speed is not *significantly* better than you are getting in Python, then there simply isn't any feasible way to speed the code up appreciably. (Except maybe get faster hard drives or smaller files.)

[...]

> How do you approach something like this, when someone tells you "we
> need you to parse these files. We can't tell you how they're
> structured so you'll have to figure that out yourself."?

Bitch and moan quietly to myself, and then smile when I realise I'm being paid by the hour.

Reverse-engineering a file structure without any documentation is rarely simple or fast.

--
Steven
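The "run it a few times and keep the smallest number" advice can be wrapped in a small harness. This is a sketch only: best_time, read_only and read_and_search are hypothetical helper names, the throwaway file is generated just so the example runs on its own, and time.perf_counter is used because time.clock no longer exists as of Python 3.8.

```python
import os
import tempfile
from time import perf_counter

def best_time(func, repeats=3):
    """Run func several times and return the smallest elapsed time."""
    best = float("inf")
    for _ in range(repeats):
        start = perf_counter()
        func()
        best = min(best, perf_counter() - start)
    return best

def read_only(path):
    """Baseline: just pull every line off the disk."""
    with open(path, encoding="cp1252") as fp:
        for line in fp:
            pass

def read_and_search(path, key="keyword"):
    """The loop from the thread: save the line two below each key line."""
    results = []
    with open(path, encoding="cp1252") as fp:
        for line in fp:
            if key in line:
                next(fp, "")
                results.append(next(fp, ""))
    return results

# Build a small throwaway file shaped like the posted structure.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="cp1252"
) as tmp:
    for _ in range(1000):
        tmp.write("\t\tkeyword=\n\t\t{\n5 72 88 77 90 92\n\t\t}\n")
    path = tmp.name

baseline = best_time(lambda: read_only(path))
parsing = best_time(lambda: read_and_search(path))
found = read_and_search(path)
os.remove(path)

print(f"read only: {baseline:.4f}s  read + search: {parsing:.4f}s")
```

If the two numbers are close on the real files, the time is going into reading rather than into the Python logic, which is the situation described above.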