[Tutor] Beautiful Soup / Unicode problem?
Hi,

I'm having bang-my-head-against-a-wall moments trying to figure all of this out. A word of warning: this is the first time I've tried using Unicode or Beautiful Soup, so if I'm being stupid, please forgive me. I'm trying to scrape results from Google as a test case with Beautiful Soup. I've seen people recommend it here, so maybe somebody can recognize what I'm doing wrong:

>>> import re, urllib
>>> from BeautifulSoup import BeautifulSoup
>>> file = urllib.urlopen("http://www.google.com/search?q=beautifulsoup")
>>> file = file.read().decode("utf-8")
>>> soup = BeautifulSoup(file)
>>> results = soup('p','g')
>>> x = results[1].a.renderContents()
>>> type(x)
>>> print x
Matt Croydon::Postneo 2.0 » Blog Archive » Mobile Screen Scraping ...

So far so good. But what I really want is just the text, so I try something like:

>>> y = results[1].a.fetchText(re.compile('.+'))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "BeautifulSoup.py", line 466, in fetchText
    return self.fetch(recursive=recursive, text=text, limit=limit)
  File "BeautifulSoup.py", line 492, in fetch
    return self._fetch(name, attrs, text, limit, generator)
  File "BeautifulSoup.py", line 194, in _fetch
    if self._matches(i, text):
  File "BeautifulSoup.py", line 252, in _matches
    chunk = str(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 26: ordinal not in range(128)

Is this a bug? Come to think of it, I'm not even sure how printing x worked, since it printed non-ASCII characters.

If I convert to a string first:

>>> filestr = file.encode("utf-8")
>>> soup = BeautifulSoup(filestr)
>>> soup('p','g')[1].font.fetchText(re.compile('.+'))
['Mobile Screen Scraping with ', 'BeautifulSoup', ' and Python for Series 60. ', 'BeautifulSoup', ' 2', 'BeautifulSoup', ' 3. I haven\xe2€™t had enough time to work up a proper hack for ', '...', 'www.postneo.com/2005/03/28/', 'mobile-screen-scraping-with-', 'beautifulsoup', '-and-python-for-series-60 - 19k - Aug 24, 2005 - ', ' ', 'Cached', ' - ', 'Similar pages']

The regex works, but things like "I haven\xe2€™t" get a bit mangled :) In filestr, it was represented as haven\xe2\x80\x99t, which I guess is the ASCII representation for UTF-8.

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
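The two symptoms in this post (the 'ascii' codec error and the mangled apostrophe) can be reproduced in a few lines. This is only a minimal sketch in Python 3, where str is Unicode and bytes is a separate type; it is not the Python 2 session from the post, and the variable names are made up for illustration. It assumes the apostrophe is U+2019 and that the mangling comes from UTF-8 bytes being read as Windows-1252, which fits the bytes shown but is a guess about the poster's setup.

```python
# Illustrative only: reproduces the symptoms from the post in Python 3.
text = "haven\u2019t"                # "haven't" with a right single quotation mark

# UTF-8 stores U+2019 as three bytes: e2 80 99 (the form seen in filestr).
utf8_bytes = text.encode("utf-8")

# The ASCII codec cannot represent characters above 127, such as U+00BB.
try:
    "\u00bb".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Assumption: the mangled text is what you get when the UTF-8 bytes are
# decoded with the wrong codec (Windows-1252 here).
mangled = utf8_bytes.decode("cp1252")
```

If the assumption holds, the byte string is simply the UTF-8 encoding of the text rather than an "ASCII representation", and the mangling is a display-side decoding mismatch rather than damage to the data.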
Re: [Tutor] Beautiful Soup / Unicode problem?
Hi Danny,

> If you have a moment, do you mind doing this on your system?

Here you go:

>>> import types
>>> print types.StringTypes
(<type 'str'>, <type 'unicode'>)
>>> import sys
>>> print sys.version
2.3.4 (#2, May 29 2004, 03:31:27) [GCC 3.3.3 (Debian 20040417)]
>>> print type(u'hello') in types.StringTypes
True
>>> sys.getdefaultencoding()
'ascii'

I have a second machine running XP and ActiveState 2.4.1, and I get the same results with the exception of:

>>> sys.version
'2.4.1 (#65, Jun 20 2005, 17:01:55) [MSC v.1310 32 bit (Intel)]'

Today I tried changing my default encoding to utf8, and the error went away. I have no idea -why- it would go away, and it feels like a hacky solution. It is also confusing, because I wasn't trying to print anything to the screen, so why would Python care or complain?

Almost forgot: I tried the included Beautiful Soup tests on both machines and got an error on both:

# python BeautifulSoupTests.py
...E
==
ERROR: testBasicUnicode (__main__.UnicodeRed)
--
Traceback (most recent call last):
  File "BeautifulSoupTests.py", line 209, in testBasicUnicode
    self.assertEqual(type(str(self.soup)), sType)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc8' in position 13: ordinal not in range(128)

So there may be a bug, but I don't know if it's my problem. It's strange that the tests fail on two different computers running two versions of Python, however.

> and show us what comes up?
>
> Good luck to you!
[Tutor] Beautiful Soup / Unicode problem?
> This is the first question in the BeautifulSoup FAQ at
> http://www.crummy.com/software/BeautifulSoup/FAQ.html
> Unfortunately the author of BS considers this a problem with your Python installation! So it
> seems he doesn't have a good understanding of Python and Unicode. (OK, I can forgive him
> that, I think there are only a handful of people who really do understand it completely.)
> The first fix given doesn't work. The second fix works but it is not a good idea to change the
> default encoding for your Python install. There is a hack you can use to change the default
> encoding just for one program; in your program put
>   reload(sys); sys.setdefaultencoding('utf-8')
> This seems to fix the problem you are having.
> Kent

Hi Kent,

I did read the FAQ before posting, honest :) But it does seem to be addressing a different issue. He says to try:

>>> latin1word = 'Sacr\xe9 bleu!'
>>> unicodeword = unicode(latin1word, 'latin-1')
>>> print unicodeword
Sacré bleu!

Which worked fine for me. And then he gives a solution for fixing -display- problems on the terminal. For instance, his first solution was: "The easy way is to remap standard output to a converter that's not afraid to send ISO-Latin-1 or UTF-8 characters to the terminal." But I avoided displaying anything in my original example, because I didn't want to confuse the issue. It's also why I didn't mention the damning FAQ entry.

>>> y = results[1].a.fetchText(re.compile('.+'))

is all I am trying to do. I don't expect non-ASCII characters to display correctly; however, I was surprised when I tried "print x" in my original example, and it printed. I would have expected to have to do something like:

>>> print x.encode("utf8")
Matt Croydon::Postneo 2.0 » Blog Archive » Mobile Screen Scraping ...

I've just looked, and I have to do this explicit encoding under Python 2.3.4, but not under 2.4.1. So perhaps 2.4 is less afraid of, or smarter about, converting and displaying non-ASCII characters on the terminal.
Either way, I don't -think- that's my problem with Beautiful Soup. Changing my default encoding does indeed fix it, but that may be a reflection of the author making bad assumptions because his default was set to utf-8. I'm not really experienced enough to tell what is going on in his code, but I've been trying. It does seem to defeat the point of Unicode, however.
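For what it's worth, the traceback earlier in the thread points at a str(chunk) call just before the regex match, and a regex can be matched against Unicode text directly without that conversion. The following is a minimal sketch in Python 3, not Beautiful Soup's actual code: matching_chunks is a hypothetical helper and the sample strings are made up from the text quoted in the thread.

```python
import re

def matching_chunks(chunks, pattern):
    """Return the chunks that match `pattern`.

    Hypothetical helper: the text is matched as-is, with no conversion
    through a byte string, so non-ASCII characters cannot trigger an
    encoding error here.
    """
    return [chunk for chunk in chunks if pattern.search(chunk)]

# Made-up sample data containing a non-ASCII character (U+00BB).
chunks = ["Matt Croydon::Postneo 2.0 \u00bb Blog Archive", "", "Mobile Screen Scraping ..."]
found = matching_chunks(chunks, re.compile(".+"))
```

The empty string is dropped because '.+' needs at least one character; the chunk containing the » character matches like any other text.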
[Tutor] Any suggestions for optimizing this code?
Hello again!

I've been playing with generating 1-D cellular automata, and this is the fastest solution I've come up with so far.

In the makeArray() function below, I have a sliding window of three elements starting at row[-1,0,1] and going to row[-2,-1,0], so that it wraps around at either boundary. I use the result of 4*row[i-1] + 2*row[i] + row[i+1] to convert the three bits to an integer, and fetch a result from rule[index].

The inner loop is simple, and executes a million times, so shaving any time off makes a big difference. The biggest speedups from the things I tried came from making an empty matrix first, so that results go into the next row by index instead of creating new rows on the fly with append, and from binding the result[row] and result[row+1] references before the loop. Those two things sped it up by a consistent 40%. Thanks to Python in a Nutshell!

Does anybody see a faster approach, or a way to optimize the inner loop in makeArray() further? With the current approach, the inner loop does the real work, as it steps over each cell one at a time.

For anybody not familiar with cellular automata, there is a pretty Java animation of how cells get calculated here:
http://math.hws.edu/xJava/CA/CA.html

And this is what CAs look like when you plot them, with 1s as black and 0s as white, and some additional info:
http://mathworld.wolfram.com/Rule30.html

I've just hardcoded rule 30 into this for testing. I make a blank matrix with makeMatrix(). Then makeArray(matrix) sets the middle element of the first row in the matrix to 1 to seed the CA, and loops through it two rows at a time, calculating the results of the first row and putting them into the next row. Right now I get:

>>> timeArray(5)
Total time: 3.95693028

I am uneasy with the algorithm getting each element one at a time, throwing them away, and getting two that overlap in the next step of the window, but I couldn't come up with a more elegant solution.
Also, it seems like a kludgey way to convert the 3 bits into a binary number, but it was the fastest way I stumbled on to.

w = 1000
h = 1000
rule = [0,1,1,1,1,0,0,0]

def makeMatrix(w,h):
    result = [0]*h
    for i in range(h):
        result[i] = [0]*w
    return result

def makeArray(matrix):
    result = matrix
    result[0][w/2] = 1    # seed the middle cell of the first row
    for row in range(h-1):
        last = result[row]
        next = result[row+1]
        for i in range(w-1):
            next[i] = rule[4*last[i-1]+2*last[i]+last[i+1]]
        # last cell: its right-hand neighbour wraps around to last[0]
        next[i+1] = rule[4*last[i]+2*last[i+1]+last[0]]
    return result

def timeArray(num):
    import time
    matrix = makeMatrix(w,h)
    t1 = time.clock()
    for i in range(num):
        makeArray(matrix)
    t2 = time.clock()
    print "Total time:", t2 - t1

timeArray(5)
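For readers who want to check the rule lookup on a small grid, here is a minimal sketch of the same update written for Python 3. It is not the poster's code: step and run are made-up names, and it uses a modulo for the right-hand wraparound instead of a special case for the last cell, trading some speed for brevity.

```python
RULE_30 = [0, 1, 1, 1, 1, 0, 0, 0]   # same lookup table as `rule` in the post

def step(row, rule=RULE_30):
    """Compute the next row; both edges wrap around."""
    n = len(row)
    # row[i - 1] wraps on the left via negative indexing,
    # (i + 1) % n wraps on the right.
    return [rule[4 * row[i - 1] + 2 * row[i] + row[(i + 1) % n]]
            for i in range(n)]

def run(width, height, rule=RULE_30):
    """Return `height` rows, starting from a single 1 in the middle."""
    row = [0] * width
    row[width // 2] = 1
    rows = [row]
    for _ in range(height - 1):
        row = step(row, rule)
        rows.append(row)
    return rows

rows = run(7, 3)
for r in rows:
    print("".join(str(cell) for cell in r))
```

On a 7-wide grid the first three rows come out as 0001000, 0011100 and 0110010, which is the start of the familiar rule 30 triangle.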
Re: [Tutor] Any suggestions for optimizing this code?
Hi John,

Thanks for that suggestion.

I tried replacing my matrix with a numpy array, and it was about 20% slower, but I just substituted one for the other, so that really isn't surprising. I didn't try with the built-in array, but I'm guessing that swapping lists for another container could only speed it up by a constant amount, since setting and getting list elements is O(1). Having said that, I forgot arrays existed. I will take a look right now and go through their methods, to see if there is anything in there that helps above and beyond a new container.

I think, conceptually, I need to make the sliding window more efficient by not getting overlapping elements repeatedly, or maybe figure out a way to skip the conversion from bits to integers. I tried storing the cells as strings, doing a join to get e.g. "110" and using that as a key to look up 1 or 0, but it was miserably slow :)

The ~15x speedup from psyco makes me think that if there was a built-in function that could handle everything in one fell swoop with tight C code, that would probably speed it up a -lot-. But I haven't found anything like that for a sliding-window type of situation.

On 9/18/05, John Fouhy <[EMAIL PROTECTED]> wrote:
> On 19/09/05, grouchy <[EMAIL PROTECTED]> wrote:
> > I've been playing with generating 1-D cellular automata, and this is the
> > fastest solution I've come up with so far.
>
> I haven't dug deep into your code --- but, I wonder if the array
> module might help?
>
> --
> John.
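One way to avoid re-reading the overlapping elements is to carry the 3-bit index along: shift it left by one, mask it back to three bits, and OR in the one new cell. The following is only a sketch of that idea in Python 3, with a made-up function name, and it has not been timed against the version in the thread, so whether it is actually faster in pure Python is an open question.

```python
RULE_30 = [0, 1, 1, 1, 1, 0, 0, 0]   # same lookup table as `rule` in the thread

def step_rolling(row, rule=RULE_30):
    """Next row of the automaton, reading each cell of `row` once per step
    (apart from the two cells used to prime the window)."""
    n = len(row)
    out = [0] * n
    # Prime the window with the left neighbour and centre of cell 0.
    idx = 2 * row[-1] + row[0]
    for i in range(n):
        # Drop the oldest bit, keep two, and append the right-hand neighbour.
        idx = ((idx << 1) & 7) | row[(i + 1) % n]
        out[i] = rule[idx]
    return out

print(step_rolling([0, 0, 1, 1, 1, 0, 0]))
```

At each position idx equals 4*row[i-1] + 2*row[i] + row[i+1], so the results are the same as with the explicit three-element window.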
Re: [Tutor] ElementTree: finding a tag with specific attribute
Now I can replace my Kent-and-Danny patched version :)

On 9/19/05, Bernard Lebel <[EMAIL PROTECTED]> wrote:
> Thanks a lot everyone for this! Glad I could help debug BS!
>
> Bernard
>
> On 9/19/05, Kent Johnson <[EMAIL PROTECTED]> wrote:
> > Kent Johnson wrote:
> > > I looked at this again and there is a bug in BS that causes this
> > > behaviour. It's kind of an interesting bug that is a side-effect of
> > > the way BS uses introspection to access child tags.
> >
> > There is a new release of BS that fixes this problem and one Danny found recently (broken fetch).
> > http://www.crummy.com/software/BeautifulSoup/index.html
> >
> > Kent
Re: [Tutor] Any suggestions for optimizing this code?
Replying to myself, I got some speedups by replacing:

def makeArray1(matrix):
    result = matrix
    result[0][w/2] = 1
    for row in range(h-1):
        last = result[row]
        next = result[row+1]
        for i in range(w-1):
            next[i] = rule[4*last[i-1]+2*last[i]+last[i+1]]
        next[i+1] = rule[4*last[i]+2*last[i+1]+last[0]]
    return result

with this using Numerical Python:

def makeArray2(matrix):
    result = matrix
    result[0,w/2] = 1
    for n in range(h-1):
        r = result[n]
        r2 = result[n+1]
        r2[1:-1] = choose(4*r[:-2]+2*r[1:-1]+r[2:],rule)
        r2[0] = rule[4*r[-1]+2*r[0]+r[1]]
        r2[-1] = rule[4*r[-2]+2*r[-1]+r[0]]
    return result

It finally clicked that instead of a sliding window, I could just add 4*row + 2*row + row, each offset by one, and that would give the same results using Numpy as stepping through three elements at a time. It's not pretty looking, but it works.

This is about 6x faster overall. The bottleneck is in choose, where each element gets looked up and replaced. I'm not sure how to do it any faster though. Choose is much faster than a for loop, which is where the 6x speedup is really coming from. Numpy just adding and multiplying the row is more like 20x faster than stepping through an element at a time with a three-element window :)

As a side note that may be helpful to someone, makeArray1() is 22x faster using psyco, which makes it 4x faster than the Numpy solution. Psyco doesn't speed up the Numpy calculations at all, which isn't surprising, since Numpy is mostly written in C. If you only use x86, psyco might be OK. Numpy seems like a much more elegant solution, though. I'm positive I could bring those closer together if I could somehow avoid using a lookup table to convert binary numbers to integers and back to binary numbers.

Normally I wouldn't care a whit about optimisation, but with number crunching through a million items, Numpy suddenly seemed pretty cool. I know this is the first time I understood the power of being able to perform a calculation on an entire array, and of stacking slices.
Multiplying and adding three rows is a lot faster than stepping through a single row three elements at a time.
Re: [Tutor] installation programs
Hi Jeff,

Most people seem to use a combination of py2exe and either Inno Setup or NSIS. InstallShield is commercial, and, well, you have to pay for it.

py2exe gives you the Python interpreter and all the libraries your program needs in a tidy little package. Unless the computers you are installing to already have Python and any external libraries you are using, you will need to use that, or something similar. It lets people just install and run.

Inno Setup is easy if you don't need anything fancy, i.e. a simple install made with its wizard. Packaging things up with py2exe will likely be the tricky part, at least if you start trimming it down manually or run into any snags. Otherwise it's a breeze.

NSIS has a scripting language built in, which does indeed entail learning another language (sorta). I haven't used it, however, so there could be lots you can do without touching the scripting bit.

py2exe: http://starship.python.net/crew/theller/py2exe/
NSIS: http://nsis.sourceforge.net/

Good luck!

On 9/21/05, Jeff Peery <[EMAIL PROTECTED]> wrote:
> Hello, I want to create an installation program. Can anyone tell me what the best program would be to use... maybe inno setup or install shield? do these work with python programs? do they require programming in another language? thanks.
>
> Jeff
Re: [Tutor] Problem with BeautifulSoup
Hi Bernard,

Not much of an answer, but I printed out your snippet with prettify() to see how it was being parsed, and either the XML is funny, or Beautiful Soup is :)

>>> from BeautifulSoup import BeautifulStoneSoup as BSS
>>> soup = BSS(xml)
>>> print soup.prettify()
1 3 1.79769313486e+308 False - 1.79769313486e+308 7.64880829803 False 20 ?
>>>

On 9/23/05, Bernard Lebel <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I have this set of XML tags:
>
> fullname=" Model.Camera_anim.kine.local.posx" type="Parameter" sourceclassname="FCurve"> 1 3 1.79769313486e+308 False -1.79769313486e+308 7.64880829803 False 20
>
> This set of tags is nested a few levels deep. When I get the fcurve tag, all I get is an empty tag, i.e.:
>
> oXMLFcurve = oXMLParameter.fcurve
> print str(oXMLFcurve)
>
> Outputs:
>
> Normally it would print all the content of the tag, but it's not.
>
> Any suggestion or pointer would be welcome!
>
> Thanks
> Bernard
Re: [Tutor] Sum of List Elements
Also, there is the built-in function sum():

total = sum(list)

Which is probably the One Obvious Way since 2.3, if I had to guess.

On 9/24/05, R. Alan Monroe <[EMAIL PROTECTED]> wrote:
> > Hi All,
> > I have spent an embarrassingly large amount of time trying to solve what on
> > its face seems like a simple problem.
> > I have a list of integers, and I want to assign the sum of the integers in
> > the list to a variable. There are only integers in the list.
> > The best I have been able to do so far is to write a function that adds
> > list[0] and list[1], then list[1] and list[2], etc. Of course, this isn't
> > what I want.
> > I'd like to be able to sum a list of any size without having to type
> > list[0]+list[1]
>
> total=0
> for x in list:
>     total += x
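To put the two suggestions side by side, here is a small sketch in Python 3 with made-up sample data. The list is called numbers rather than list, since a variable named list would hide the built-in list type.

```python
numbers = [1, 2, 3, 4]    # made-up sample data

# Built-in sum(): adds up every element of the iterable.
total = sum(numbers)

# The explicit loop from the quoted reply gives the same result.
loop_total = 0
for x in numbers:
    loop_total += x

print(total, loop_total)
```

Both approaches work for a list of any length, including an empty one, where each gives 0.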
Re: [Tutor] Stupid newbie question
As long as you are using IDLE, why not let it handle indentation for you? This could very well be a dumb question, and if it is, well, excuse me :)

On 9/23/05, Valone, Toren W. <[EMAIL PROTECTED]> wrote:
> I am trying to noodle thru classes with python and I built the following
> class
>
> import time
>
> class startremail:
>     def __init__(self):
>         remailfile = open('U:\Bounce20.txt', 'r')   # future address/file from outlook
>         resendfile = open('resend.txt', 'w')        # currently these files are in Python24
>         EmailReport = open('erprt.txt', 'w')        # Report of bad emails etc
>         fromaddr='[EMAIL PROTECTED]'                # set fromaddr to a constant
>         null_recepient_count = 0
>         date_received = ""
>         date_email_generated = ""
>         Error_050 = ""
>         Error_501 = ""
>         Current_Date = time.ctime(time.time())
>         month = Current_Date[4:8]
>         day = Current_Date[8:10]
>         print month
>
>     def getday(self):
>         return self.day
>
>     def Read(self,line):
>         line = remailfile.readline()   # primer read
>         return line
>
> I fire up IDLE and then do this
>
> from startremail import *
> x = startremail()
> print x.getday()
>
> I get the following return
>
> NameError: name 'getday' is not defined
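Separately from the indentation question, getday() reads self.day, so the value has to be stored on the instance at some point. The following is a hedged sketch in Python 3 of one way the class could look, not a reconstruction of the poster's file: the class name StartRemail and the optional now argument are made up here, and the file handling is left out.

```python
import time

class StartRemail:
    """Trimmed-down sketch; only the date handling is kept."""

    def __init__(self, now=None):
        # `now` is a hypothetical parameter (seconds since the epoch) so
        # the date can be fixed for testing; default is the current time.
        if now is None:
            now = time.time()
        current_date = time.ctime(now)        # e.g. 'Fri Sep 23 10:15:00 2005'
        # Store the pieces on self so that other methods can reach them.
        self.month = current_date[4:7]
        self.day = current_date[8:10]

    def getday(self):
        # Defined at class level, one indent step in from `class`.
        return self.day

x = StartRemail()
print(x.getday())
```

Names assigned without self inside __init__ are local variables and disappear when __init__ returns, which is why the sketch writes self.day instead of day.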