!-----Original Message-----
!From: Tutor [mailto:tutor-bounces+crk=godblessthe...@python.org] On
!Behalf Of Peter Otten
!Sent: Tuesday, October 07, 2014 3:50 AM
!To: tutor@python.org
!Subject: Re: [Tutor] search/match file position q
!
!Clayton Kirkwood wrote:
!
!> I was trying to keep it generic.
!> Wrapped data file:
!> <tr data-row-symbol="SWKS"><td class="col-symbol txt"><span
!> class="wrapper " data-model="name:DatumModel;id:null;"
!> data-tmpl=""><a data-ylk="cat:portfolio;cpos:1"
!> href="http://finance.yahoo.com/q?s=SWKS"
!> data-rapid_p="18">SWKS</a></span></td><td
!> class="col-fiftytwo_week_low cell-raw:23.270000"><span
!> class="wrapper " data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
!> data-tmpl="change:yfin.datum">23.27</span></td><td
!> class="col-prev_close cell-raw:58.049999"><span
!> class="wrapper " data-model="name:DatumMo
!
!Doesn't Yahoo make the data available as CSV? That would be the way to
!go then.
Yes, Yahoo provides a few columns as CSV, but I have maybe 15 fields that aren't provided. Besides, what fun would that be? I try to find tasks that allow me to expand my knowledge "<)))

!
!Anyway, regular expressions are definitely the wrong tool here, and
!reading the file one line at a time only makes it worse.

Why does that only make it worse? I don't think going char by char would be helpful. The line happens to be very long, and I don't have a way of peeking around the corner to the next line, so to speak. If I broke it into shorter strings, it would be much more onerous to jump over the end of the current string to potentially many following strings.

!
!> import re, os
!> line_in = file.readline()    # read in humongous html line
!> stock = re.search('\s*<tr data-row-symbol="([A-Z]+)">', line_in)
!>     # scan to SWKS"> in data line, stock should be SWKS
!> low_52 = re.search('.+wk52:low.+([\d\.]+)<', line_in)
!>     # want to pick up from SWKS">, low_52 should be 23.27
!>
!> I am trying to figure out if each re.match starts scanning at the
!> beginning of the same line over and over or does each scan start at
!> the end of the last match. It appears to start over??
!>
!> This is stock:
!> <_sre.SRE_Match object; span=(0, 47), match=' <tr
!> data-row-symbol="SWKS">'>
!> This is low_52:
!> <_sre.SRE_Match object; span=(0, 502875), match=' <tr
!> data-row-symbol="SWKS"><t>
!> If necessary, how do I pick up and move forward to the point right
!> after the previous match? File.tell() and file.__sizeof__(), don't
!> seem to play a useful role.
!
!You should try BeautifulSoup. Let's play:
!
!>>> from bs4 import BeautifulSoup
!>>> soup = BeautifulSoup("""<tr data-row-symbol="SWKS"><td
!class="col-symbol txt"><span class="wrapper "
!data-model="name:DatumModel;id:null;" data-tmpl=""><a
!data-ylk="cat:portfolio;cpos:1"
!href="http://finance.yahoo.com/q?s=SWKS"
!data-rapid_p="18">SWKS</a></span></td><td
!class="col-fiftytwo_week_low cell-raw:23.270000"><span class="wrapper "
!data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
!data-tmpl="change:yfin.datum">23.27</span></td><td
!class="col-prev_close cell-raw:58.049999">""")
!>>> soup.find("tr")
!<tr data-row-symbol="SWKS"><td class="col-symbol txt"><span
!class="wrapper " data-model="name:DatumModel;id:null;" data-tmpl=""><a
!data-rapid_p="18" data-ylk="cat:portfolio;cpos:1"
!href="http://finance.yahoo.com/q?s=SWKS">SWKS</a></span></td><td
!class="col-fiftytwo_week_low cell-raw:23.270000"><span class="wrapper "
!data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
!data-tmpl="change:yfin.datum">23.27</span></td><td
!class="col-prev_close cell-raw:58.049999"></td></tr>
!>>> tr = soup.find("tr")
!>>> tr["data-row-symbol"]
!'SWKS'
!>>> tr.find_all("span")
![<span class="wrapper " data-model="name:DatumModel;id:null;"
!data-tmpl=""><a data-rapid_p="18" data-ylk="cat:portfolio;cpos:1"
!href="http://finance.yahoo.com/q?s=SWKS">SWKS</a></span>, <span
!class="wrapper " data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
!data-tmpl="change:yfin.datum">23.27</span>]
!>>> span = tr.find_all("span")[1]
!>>> span["data-model"]
!'name:DatumModel;id:SWKS:qsi:wk52:low;'
!>>> span.text
!'23.27'

So, what makes regex wrong for this job? The question still remains: does the search start at the beginning of the line each time, or does it step forward from the last search?
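For what it's worth, a small experiment along the following lines suggests the answer. The module-level re.search(pattern, string) takes no start position, so each call scans the string from index 0 again; nothing is remembered between calls. A compiled pattern's search() method, on the other hand, accepts an optional pos argument, so the end of the previous match can be passed in as the starting point. This is only a sketch: line_in is a shortened stand-in for the real line, stock_re and low_re are names made up for illustration, and the non-greedy low_re pattern is a substitute for the original '.+wk52:low.+([\d\.]+)<', whose greedy '.+' runs to the end of the line and backtracks from there.

```python
import re

# Shortened stand-in for the long HTML line (assumption: same structure
# as the real data, with everything else left out).
line_in = (
    '<tr data-row-symbol="SWKS"><td class="col-symbol txt">'
    '<span class="wrapper " '
    'data-model="name:DatumModel;id:SWKS:qsi:wk52:low;" '
    'data-tmpl="change:yfin.datum">23.27</span></td>'
    '<td class="col-prev_close cell-raw:58.049999"></td></tr>'
)

# Hypothetical names; compiled patterns are used because their search()
# method accepts a start position, unlike the module-level re.search().
stock_re = re.compile(r'\s*<tr data-row-symbol="([A-Z]+)">')
low_re = re.compile(r'wk52:low.*?>([\d.]+)<')  # non-greedy substitute

stock = stock_re.search(line_in)

# Searching again without a position starts over at index 0,
# so the same match comes back.
again = stock_re.search(line_in)

# Passing stock.end() as pos resumes right after the previous match.
low_52 = low_re.search(line_in, stock.end())

print("stock: ", stock.group(1), stock.span())
print("again: ", again.group(1), again.span())
print("low_52:", low_52.group(1), low_52.span())
```

The same pos idea can be chained for further fields by passing low_52.end() to the next search, so each field is looked for only after the one before it.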
I will check out beautiful soup as suggested in a subsequent mail; I'd still like to finish this process:<}}

Clayton