Storing class objects dynamically in an array

2013-01-21 Thread Brian D
Hi,

I'm trying to instantiate a class object repeated times, dynamically for as 
many times as are required, storing each class object in a container to later 
write out to a database. It kind of looks like what's needed is a 
two-dimensional class object, but I can't quite conceptualize how to do that. 

A simpler approach might be to just store class objects in a dictionary, using 
a reference value (or table row number/ID) as the key. 

In the real-world application, I'm parsing row, column values out of a table in 
a document which will have not more than about 20 rows, but I can't expect the 
document output to leave columns well-ordered. I want to be able to call the 
class objects by their respective row number.

A starter example follows, but it's clear that only the last instance of the 
class is stored. 

I'm not quite finding what I want from online searches, so what recommendations 
might Python users make for the best way to do this? 

Maybe I need to re-think the approach? 


Thanks,
Brian



class Car(object):

    def __init__(self, Brand, Color, Condition):
        self.Brand = Brand
        self.Color = Color
        self.Condition = Condition

brandList = ['Ford', 'Toyota', 'Fiat']
colorList = ['Red', 'Green', 'Yellow']
conditionList = ['Excellent', 'Good', 'Needs Help']

usedCarLot = {}

for c in range(0, len(brandList)):
    print c, brandList[c]
    usedCarLot[c] = Car
    usedCarLot[c].Brand = brandList[c]
    usedCarLot[c].Color = colorList[c]
    usedCarLot[c].Condition = conditionList[c]

for k, v in usedCarLot.items():
    print k, v.Brand, v.Color, v.Condition


>>> 
0 Ford
1 Toyota
2 Fiat
0 Fiat Yellow Needs Help
1 Fiat Yellow Needs Help
2 Fiat Yellow Needs Help
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Storing class objects dynamically in an array

2013-01-21 Thread Brian D
On Monday, January 21, 2013 8:29:50 PM UTC-6, MRAB wrote:
> On 2013-01-22 01:56, Brian D wrote:
> > [snip]
>
> You're repeatedly putting the class itself in the dict and setting its
> (the class's) attributes; you're not even using the __init__ method you
> defined.
>
> What you should be doing is creating instances of the class:
>
> for c in range(len(brandList)):
>      print c, brandList[c]
>      usedCarLot[c] = Car(brandList[c], colorList[c], conditionList[c])

Thanks for the quick reply, Dave & MRAB. I wasn't even sure it could be done,
so the missing instantiation completely slipped past me.

The simplest fix is as follows, but Dave, I'll try to tighten it up a little
when I turn to the real-world code, following your enumeration example. And
yes, thanks for the reminder (2.7.3). The output is fine -- I just need a
record number and the list of values stored in the class object.

This is the quick fix -- instantiate class Car: 

usedCarLot[c] = Car('','','')

It may not, however, be the best, most Pythonic way.

Here's the full implementation. I hope this helps someone else. 

Thanks very much for the help!

class Car(object):

    def __init__(self, Brand, Color, Condition):
        self.Brand = Brand
        self.Color = Color
        self.Condition = Condition

brandList = ['Ford', 'Toyota', 'Fiat']
colorList = ['Red', 'Green', 'Yellow']
conditionList = ['Excellent', 'Good', 'Needs Help']

#usedCarLot = {0:Car, 1:Car, 2:Car}
usedCarLot = {}

for c in range(0, len(brandList)):
    #print c, brandList[c]
    usedCarLot[c] = Car('','','')
    usedCarLot[c].Brand = brandList[c]
    usedCarLot[c].Color = colorList[c]
    usedCarLot[c].Condition = conditionList[c]

for k, v in usedCarLot.items():
    print k, v.Brand, v.Color, v.Condition
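
For completeness, a tighter variant -- a minimal sketch combining MRAB's
constructor call with the enumeration idea mentioned above (the zip/enumerate
pairing is my own guess at what that looks like):

class Car(object):

    def __init__(self, Brand, Color, Condition):
        self.Brand = Brand
        self.Color = Color
        self.Condition = Condition

brandList = ['Ford', 'Toyota', 'Fiat']
colorList = ['Red', 'Green', 'Yellow']
conditionList = ['Excellent', 'Good', 'Needs Help']

# zip() pairs the three lists row by row; enumerate() supplies the row key.
usedCarLot = {}
for c, (brand, color, condition) in enumerate(
        zip(brandList, colorList, conditionList)):
    usedCarLot[c] = Car(brand, color, condition)

for k, v in usedCarLot.items():
    print k, v.Brand, v.Color, v.Condition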
-- 
http://mail.python.org/mailman/listinfo/python-list


Mechanize/ClientForm - How to select IgnoreControl button and submit form

2009-12-23 Thread Brian D
All,

I'm hoping to implement a project that will be historically
transformational by mapping inequalities in property assessments.

I'm stuck at step one: Scrape data from http://www.opboa.org.

The site uses a bunch of hidden controls. I can't find a way to get
past the initial disclaimer page because the "Accept" button value
reads as None:

http://www.opboa.org/Search/Disclaimer2.aspx

I've successfully used Mechanize in two other projects, but I've never
seen this IgnoreControl problem before. I also haven't found any
ClientForm examples that handle this problem.

Would anyone like to help me get this off the ground?

Thanks!
-- 
http://mail.python.org/mailman/listinfo/python-list


ORM's are evil. Was: Mechanize/ClientForm - How to select IgnoreControl button and submit form

2009-12-24 Thread Brian D
Just kidding. That was a fascinating discussion.

Now I'd like to see if anyone would rather procrastinate than finish
last-minute shopping.

This problem remains untouched. Anyone want to give it a try? Please?

I'm hoping to implement a project that will be historically
transformational by mapping inequalities in property assessments.

I'm stuck at step one: Scrape data from http://www.opboa.org.

The site uses a bunch of hidden controls.

I can't find a way to get past the initial disclaimer page because the
"Accept" button value reads as None: )>
http://www.opboa.org/Search/Disclaimer2.aspx

I've successfully used Mechanize in two other projects, but I've never
seen this IgnoreControl problem before. I also haven't found any
ClientForm examples that handle this problem.

Would anyone like to help me get this off the ground?

Thanks!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ORM's are evil. Was: Mechanize/ClientForm - How to select IgnoreControl button and submit form

2009-12-24 Thread Brian D
On Dec 24, 8:20 am, Brian D  wrote:
> Just kidding. That was a fascinating discussion.
>
> Now I'd like to see if anyone would rather procrastinate than finish
> last-minute shopping.
>
> This problem remains untouched. Anyone want to give it a try? Please?
>
> I'm hoping to implement a project that will be historically
> transformational by mapping inequalities in property assessments.
>
> > I'm stuck at step one: Scrape data from http://www.opboa.org.
>
> The site uses a bunch of hidden controls.
>
> I can't find a way to get past the initial disclaimer page because the
> "Accept" button value reads as None: 
> )>http://www.opboa.org/Search/Disclaimer2.aspx
>
> I've successfully used Mechanize in two other projects, but I've never
> seen this IgnoreControl problem before. I also haven't found any
> ClientForm examples that handle this problem.
>
> Would anyone like to help me get this off the ground?
>
> Thanks!

Problem solved.

I used the Fiddler Request Builder to discover that the server wanted
a GET request containing the ASP.NET __EVENTTARGET, __EVENTARGUMENT,
and __VIEWSTATE parameters.
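
For anyone following along, here's a minimal sketch (Python 2) of that
replay -- the __EVENTTARGET control name and the __VIEWSTATE value below are
placeholders; the real values have to be scraped from the hidden <input>
fields on the disclaimer page:

import urllib, urllib2

params = urllib.urlencode({
    '__EVENTTARGET': 'btnAccept',   # hypothetical control name
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': 'dDwt...',       # value copied from the page's hidden field
})
# Replay the postback as a GET request, as Fiddler showed the server expects.
response = urllib2.urlopen(
    'http://www.opboa.org/Search/Disclaimer2.aspx?' + params)
print response.geturl()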
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Mechanize/ClientForm - How to select IgnoreControl button and submit form

2009-12-24 Thread Brian D
On Dec 23, 8:33 am, Brian D  wrote:
> All,
>
> I'm hoping to implement a project that will be historically
> transformational by mapping inequalities in property assessments.
>
> I'm stuck at step one: Scrape data from http://www.opboa.org.
>
> The site uses a bunch of hidden controls. I can't find a way to get
> past the initial disclaimer page because the "Accept" button value
> reads as None:
>
> http://www.opboa.org/Search/Disclaimer2.aspx
>
> I've successfully used Mechanize in two other projects, but I've never
> seen this IgnoreControl problem before. I also haven't found any
> ClientForm examples that handle this problem.
>
> Would anyone like to help me get this off the ground?
>
> Thanks!

Solution posted in another thread:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/2013c8dcbe6a8573/a39c33475b2d2923#a39c33475b2d2923
-- 
http://mail.python.org/mailman/listinfo/python-list


Mechanize - Click a table row to navigate to detail page

2009-12-24 Thread Brian D
A search form returns a list of records embedded in a table.

The user has to click on a table row to trigger a Javascript function
that opens up the detail page.

It's the detail page, of course, that really contains the useful
information.

How can I use Mechanize to click a row?

Any ideas?

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Mechanize - Click a table row to navigate to detail page

2009-12-26 Thread Brian D
On Dec 25, 4:36 am, "Diez B. Roggisch"  wrote:
> Brian D schrieb:
>
> > A search form returns a list of records embedded in a table.
>
> > The user has to click on a table row to call a Javascript call that
> > opens up the detail page.
>
> > It's the detail page, of course, that really contains the useful
> > information.
>
> > How can I use Mechanize to click a row?
>
> You can't, if there is javascript involved. You can try & use Firebug to
> see what the javascript eventually calls, with which parameters. Then
> construct that url based on the parameters in the table row's javascript
> (which of course you have to parse yourself out of the code)
>
> Diez

Thanks Diez.

You were correct. Fiddler provided the clue I needed. As you
described, the Javascript call constructed a URL containing the index
value of the table row. I was able to request that URL to obtain the
detail page.
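
For the archive, a minimal sketch (Python 2) of that pattern -- the onclick
function name and the detail-page URL are hypothetical stand-ins for
whatever Fiddler/Firebug reveals on the real site:

import re, urllib2

html = urllib2.urlopen('http://example.com/search_results.aspx').read()
# Each result row carries its index in its onclick handler, e.g.
# <tr onclick="openDetail(7)">; pull the index out and build the URL.
for row_id in re.findall(r'onclick="openDetail\((\d+)\)"', html):
    detail = urllib2.urlopen('http://example.com/detail.aspx?row=%s' % row_id)
    print row_id, len(detail.read())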

Thank you very much for the help.

Brian
-- 
http://mail.python.org/mailman/listinfo/python-list


How to test a URL request in a "while True" loop

2009-12-30 Thread Brian D
I'm actually using mechanize, but that's too complicated for testing
purposes. Instead, I've simulated in a urllib2 sample below an attempt
to test for a valid URL request.

I'm attempting to craft a loop that will trap failed attempts to
request a URL (in cases where the connection intermittently fails),
and repeat the URL request a few times, stopping after the Nth attempt
is tried.

Specifically, in the example below, a bad URL is requested for the
first and second iterations. On the third iteration, a valid URL will
be requested. The valid URL will be requested until the 5th iteration,
when a break statement is reached to stop the loop. The 5th iteration
also restores the values to their original state for ease of repeat
execution.

What I don't understand is how to test for a valid URL request, and
then jump out of the "while True" loop to proceed to another line of
code below the loop. There's probably faulty logic in this approach. I
imagine I should wrap the URL request in a function, and perhaps store
the response as a global variable.

This is really more of a basic Python logic question than it is a
urllib2 question.

Any suggestions?

Thanks,
Brian


import urllib2
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.16) ' \
             'Gecko/2009120208 Firefox/3.0.16 (.NET CLR 3.5.30729)'
headers = {'User-Agent': user_agent}
url = 'http://this is a bad url'
count = 0
while True:
    count += 1
    try:
        print 'attempt ' + str(count)
        request = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(request)
        if response:
            print 'True response.'
        if count == 5:
            count = 0
            url = 'http://this is a bad url'
            print 'How do I get out of this thing?'
            break
    except:
        print 'fail ' + str(count)
        if count == 3:
            url = 'http://www.google.com'
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to test a URL request in a "while True" loop

2009-12-30 Thread Brian D
On Dec 30, 11:06 am, samwyse  wrote:
> On Dec 30, 10:00 am, Brian D  wrote:
>
> > What I don't understand is how to test for a valid URL request, and
> > then jump out of the "while True" loop to proceed to another line of
> > code below the loop. There's probably faulty logic in this approach. I
> > imagine I should wrap the URL request in a function, and perhaps store
> > the response as a global variable.
>
> > This is really more of a basic Python logic question than it is a
> > urllib2 question.
>
> There, I've condensed your question to what you really meant to say.
> You have several approaches.  First, let's define some useful objects:
>
> >>> max_attempts = 5
> >>> def do_something(i):
>         assert 2 < i < 5
>
> Getting back to the original question, if you want to limit the number
> of attempts, don't use a while, use this:
>
> >>> for count in xrange(max_attempts):
>         print 'attempt', count+1
>         do_something(count+1)
>
> attempt 1
> Traceback (most recent call last):
>   File "", line 3, in
>     do_something(count+1)
>   File "", line 2, in do_something
>     assert 2 < i < 5
> AssertionError
>
> If you want to keep exceptions from ending the loop prematurely, you
> add this:
>
> >>> for count in xrange(max_attempts):
>         print 'attempt', count+1
>         try:
>                 do_something(count+1)
>         except StandardError:
>                 pass
>
> Note that bare except clauses are *evil* and should be avoided.  Most
> exceptions derive from StandardError, so trap that if you want to
> catch errors.  Finally, to stop iterating when the errors cease, do
> this:
>
> >>> try:
>         for count in xrange(max_attempts):
>                 print 'attempt', count+1
>                 try:
>                         do_something(count+1)
>                         raise StopIteration
>                 except StandardError:
>                         pass
> except StopIteration:
>         pass
>
> attempt 1
> attempt 2
> attempt 3
>
> Note that StopIteration doesn't derive from StandardError, because
> it's not an error, it's a notification.  So, throw it if and when you
> want to stop iterating.
>
> BTW, note that you don't have to wrap your code in a function.
> do_something could be replaced with its body and everything would
> still work.

I'm totally impressed. I love elegant code. Could you tell I was
trained as a VB programmer? I think I can still be reformed.

I appreciate the admonition not to use bare except clauses. I will
avoid that in the future.

I've never seen StopIteration used -- and certainly not used in
combination with a try/except pair. That was an exceptionally valuable
lesson.

I think I can take it from here, so I'll just say thank you, Sam, for
steering me straight -- very nice.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to test a URL request in a "while True" loop

2009-12-30 Thread Brian D
On Dec 30, 12:31 pm, Philip Semanchuk  wrote:
> On Dec 30, 2009, at 11:00 AM, Brian D wrote:
>
> > [snip]
>
> Hi Brian,
> While I don't fully understand what you're trying to accomplish by  
> changing the URL to google.com after 3 iterations, I suspect that some  
> of your trouble comes from using "while True". Your code would be  
> clearer if the while clause actually stated the exit condition. Here's  
> a suggestion (untested):
>
> MAX_ATTEMPTS = 5
>
> count = 0
> while count <= MAX_ATTEMPTS:
>     count += 1
>     try:
>        print 'attempt ' + str(count)
>        request = urllib2.Request(url, None, headers)
>        response = urllib2.urlopen(request)
>        if response:
>           print 'True response.'
>     except URLError:
>        print 'fail ' + str(count)
>
> You could also save the results  (untested):
>
> MAX_ATTEMPTS = 5
>
> count = 0
> results = [ ]
> while count <= MAX_ATTEMPTS:
>     count += 1
>     try:
>        print 'attempt ' + str(count)
>        request = urllib2.Request(url, None, headers)
>        f = urllib2.urlopen(request)
>        # Note that here I ignore the doc that says "None may be
>        # returned if no handler handles the request". Caveat emptor.
>        results.append(f.info())
>        f.close()
>     except URLError:
>        # Even better, append actual reasons for the failure.
>        results.append(False)
>
> for result in results:
>     print result
>
> I guess if you're going to do the same number of attempts each time, a  
> for loop would be more expressive, but you probably get the idea.
>
> Hope this helps
> Philip

Nice to have options, Philip. Thanks! I'll give your solution a try in
mechanize as well. I really can't thank you enough for contributing to
helping me solve this issue. I love Python.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to test a URL request in a "while True" loop

2009-12-30 Thread Brian D
Thanks MRAB as well. I've printed all of the replies to retain with my
pile of essential documentation.

To follow up with a complete response, I'm ripping out of my mechanize
module the essential components of the solution I got to work.

The main body of the code passes a URL to the scrape_records function.
The function attempts to open the URL five times.

If the URL is opened, a values dictionary is populated and returned to
the calling statement. If the URL cannot be opened, a fatal error is
printed and the module terminates. There's a little sleep call in the
function to leave time for any errant connection problem to resolve
itself.

Thanks to all for your replies. I hope this helps someone else:

import urllib2, time
from mechanize import Browser

def scrape_records(url):
    maxattempts = 5
    br = Browser()
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.16) ' \
                 'Gecko/2009120208 Firefox/3.0.16 (.NET CLR 3.5.30729)'
    br.addheaders = [('User-agent', user_agent)]
    for count in xrange(maxattempts):
        try:
            print url, count
            br.open(url)
            break
        except urllib2.URLError:
            print 'URL error', count
            # Pretend a failed connection was fixed
            if count == 2:
                url = 'http://www.google.com'
            time.sleep(1)
            pass
    else:
        print 'Fatal URL error. Process terminated.'
        return None
    # Scrape page and populate valuesDict
    valuesDict = {}
    return valuesDict

url = 'http://badurl'
valuesDict = scrape_records(url)
if valuesDict == None:
    print 'Failed to retrieve valuesDict'
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to test a URL request in a "while True" loop

2009-12-30 Thread Brian D
On Dec 30, 7:08 pm, MRAB  wrote:
> Brian D wrote:
> > Thanks MRAB as well. I've printed all of the replies to retain with my
> > pile of essential documentation.
>
> > To follow up with a complete response, I'm ripping out of my mechanize
> > module the essential components of the solution I got to work.
>
> > The main body of the code passes a URL to the scrape_records function.
> > The function attempts to open the URL five times.
>
> > If the URL is opened, a values dictionary is populated and returned to
> > the calling statement. If the URL cannot be opened, a fatal error is
> > printed and the module terminates. There's a little sleep call in the
> > function to leave time for any errant connection problem to resolve
> > itself.
>
> > Thanks to all for your replies. I hope this helps someone else:
>
> > import urllib2, time
> > from mechanize import Browser
>
> > def scrape_records(url):
> >     maxattempts = 5
> >     br = Browser()
> >     user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.16) ' \
> >                  'Gecko/2009120208 Firefox/3.0.16 (.NET CLR 3.5.30729)'
> >     br.addheaders = [('User-agent', user_agent)]
> >     for count in xrange(maxattempts):
> >         try:
> >             print url, count
> >             br.open(url)
> >             break
> >         except urllib2.URLError:
> >             print 'URL error', count
> >             # Pretend a failed connection was fixed
> >             if count == 2:
> >                 url = 'http://www.google.com'
> >             time.sleep(1)
> >             pass
>
> 'pass' isn't necessary.
>
> >     else:
> >         print 'Fatal URL error. Process terminated.'
> >         return None
> >     # Scrape page and populate valuesDict
> >     valuesDict = {}
> >     return valuesDict
>
> > url = 'http://badurl'
> > valuesDict = scrape_records(url)
> > if valuesDict == None:
>
> When checking whether or not something is a singleton, such as None, use
> "is" or "is not" instead of "==" or "!=".
>
> >     print 'Failed to retrieve valuesDict'
>
>

I'm definitely acquiring some well-deserved schooling -- and it's
really appreciated. I'd seen the "is/is not" preference before, but it
just didn't stick.

I see now that "pass" is redundant -- thanks for catching that.

Cheers.
-- 
http://mail.python.org/mailman/listinfo/python-list


fsync() doesn't work as advertised?

2010-01-04 Thread Brian D
If I'm running a process in a loop that runs for a long time, I
occasionally would like to look at a log to see how it's going.

I know about the logging module, and may yet decide to use that.

Still, I'm troubled by how fsync() doesn't seem to work as advertised:

http://docs.python.org/library/os.html

"If you’re starting with a Python file object f, first do f.flush(),
and then do os.fsync(f.fileno())"
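
That pattern looks like this in full (a minimal sketch; the file name is
just an example):

import os

f = open('progress.log', 'a')
f.write('iteration complete\n')
f.flush()               # push Python's buffer out to the OS
os.fsync(f.fileno())    # ask the OS to write its buffers to the disk
f.close()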

Has anyone else experienced and overcome this issue?

What I've seen is that flush() alone produces a complete log when the
loop finishes. When I used fsync(), I lost all of the write entries
except the first, along with an odd error trap and the last entry.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: fsync() doesn't work as advertised?

2010-01-04 Thread Brian D
On Jan 4, 10:29 am, Antoine Pitrou  wrote:
> On Mon, 04 Jan 2010 08:09:56 -0800, Brian D wrote:
>
>
>
> > What I've seen is that flush() alone produces a complete log when the
> > loop finishes. When I used fsync(), I lost all of the write entries
> > except the first, along with odd error trap and the last entry.
>
> Perhaps you are writing to the file from several threads or processes at
> once?
>
> By the way, you shouldn't need fsync() if you merely want to look at the
> log files, because your OS will have an up-to-date view of the file
> contents anyway. fsync() is useful if you want to be sure the data has
> been written to the hard disk drive, rather than just kept in the
> operating system's filesystem cache.

Sure -- I hadn't considered how threads might affect the write
process. That's a good lead to perhaps fixing the problem.

Thanks for your help, Antoine.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: fsync() doesn't work as advertised?

2010-01-06 Thread Brian D
On Jan 5, 1:08 pm, Nobody  wrote:
> On Mon, 04 Jan 2010 08:09:56 -0800, Brian D wrote:
> > If I'm running a process in a loop that runs for a long time, I
> > occasionally would like to look at a log to see how it's going.
>
> > I know about the logging module, and may yet decide to use that.
>
> > Still, I'm troubled by how fsync() doesn't seem to work as advertised:
> >
> > http://docs.python.org/library/os.html
>
> > "If you’re starting with a Python file object f, first do f.flush(),
> > and then do os.fsync(f.fileno())"
>
> The .flush() method (and the C fflush() function) causes the
> contents of application buffers to be sent to the OS, which basically
> copies the data into the OS-level buffers.
>
> fsync() causes the OS-level buffers to be written to the physical drive.
>
> File operations normally use the OS-level buffers; e.g. if one process
> write()s to a file and another process read()s it, the latter will see
> what the former has written regardless of whether the data has been
> written to the drive.
>
> The main reason for using fsync() is to prevent important data from being
> lost in the event of an unexpected reboot or power-cycle (an expected
> reboot via the "shutdown" or "halt" commands will flush all OS-level
> buffers to the drive first). Other than that, fsync() is almost invisible
> (I say "almost", as there are mechanisms to bypass the OS-level buffers,
> e.g. the O_DIRECT open() flag).

An excellent explanation of the process. I've seen this in other
programming environments, so I could visualize something to that
effect, but couldn't have verbalized it. I certainly have a better
idea what's happening. Thanks for the contribution.
-- 
http://mail.python.org/mailman/listinfo/python-list


Iterate over group names in a regex match?

2010-01-19 Thread Brian D
Here's a simple named group matching pattern:

>>> s = "1,2,3"
>>> p = re.compile(r"(?P\d),(?P\d),(?P\d)")
>>> m = re.match(p, s)
>>> m
<_sre.SRE_Match object at 0x011BE610>
>>> print m.groups()
('1', '2', '3')

Is it possible to call the group names, so that I can iterate over
them?

The result I'm looking for would be:

('one', 'two', 'three')



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Iterate over group names in a regex match?

2010-01-19 Thread Brian D
On Jan 19, 11:28 am, Peter Otten <[email protected]> wrote:
> Brian D wrote:
> > Here's a simple named group matching pattern:
>
> >>>> s = "1,2,3"
> >>>> p = re.compile(r"(?P\d),(?P\d),(?P\d)")
> >>>> m = re.match(p, s)
> >>>> m
> > <_sre.SRE_Match object at 0x011BE610>
> >>>> print m.groups()
> > ('1', '2', '3')
>
> > Is it possible to call the group names, so that I can iterate over
> > them?
>
> > The result I'm looking for would be:
>
> > ('one', 'two', 'three')
> >>> s = "1,2,3"
> >>> p = re.compile(r"(?P\d),(?P\d),(?P\d)")
> >>> m = re.match(p, s)
> >>> dir(m)
>
> ['__copy__', '__deepcopy__', 'end', 'expand', 'group', 'groupdict',
> 'groups', 'span', 'start']>>> m.groupdict().keys()
>
> ['one', 'three', 'two']>>> sorted(m.groupdict(), key=m.span)
>
> ['one', 'two', 'three']
>
> Peter

groupdict() does it. I've never seen it used before. Very cool!

Thank you all for taking time to answer the question.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Iterate over group names in a regex match?

2010-01-19 Thread Brian D
On Jan 19, 11:51 am, Brian D  wrote:
> On Jan 19, 11:28 am, Peter Otten <[email protected]> wrote:
> [snip]
>
> groupdict() does it. I've never seen it used before. Very cool!
>
> Thank you all for taking time to answer the question.


FYI, here's an example of the working result ...

>>> for k, v in m.groupdict().iteritems():
...     k, v
...
('one', '1')
('three', '3')
('two', '2')


The use for this is that I'm pulling data from a flat text file using
regex, and storing values in a dictionary that will be used to update
a database.

-- 
http://mail.python.org/mailman/listinfo/python-list


Stuck on a three word street name regex

2010-01-27 Thread Brian D
I've tackled this kind of problem before by looping through a patterns
dictionary, but there must be a smarter approach.

Two addresses. Note that the first has incorrectly transposed the
direction and street name. The second has an extra space in it before
the street type. Clearly done by someone who didn't know how to
concatenate properly -- or didn't care.

1000 RAMPART S ST

100 JOHN CHURCHILL CHASE  ST

I want to parse the elements into an array of values that can be
inserted into new database fields.

Anyone who loves solving these kinds of puzzles care to relieve my
frazzled brain?

The pattern I'm using doesn't keep the "CHASE" with the "JOHN
CHURCHILL":

>>> p = re.compile(r'(?P<num>\d+)\s(?P<name>[A-Z\s]*)\s(?P<dir>\w*)\s(?P<type>\w{2})$')
>>> s = '1405 RAMPART S ST'
>>> m = re.search(p, s)
>>> m
<_sre.SRE_Match object at 0x011A4440>
>>> print m.groups()
('1405', 'RAMPART', 'S', 'ST')
>>> s = '45 JOHN CHURCHILL CHASE ST'
>>> m = re.search(p, s)
>>> m
<_sre.SRE_Match object at 0x011A43E8>
>>> print m.groups()
('45', 'JOHN CHURCHILL', 'CHASE', 'ST')
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Stuck on a three word street name regex

2010-01-27 Thread Brian D
On Jan 27, 6:35 pm, Paul Rubin  wrote:
> Brian D  writes:
> > I've tackled this kind of problem before by looping through a patterns
> > dictionary, but there must be a smarter approach.
> > Two addresses. Note that the first has incorrectly transposed the
> > direction and street name. 
>
> If you're really serious about it (e.g. you are the post office trying
> to program automatic mail sorting machines) there is no simple regex
> trick anything like what you want.  A lot of addresses will be
> ambiguous.  You have use all the info you have about your entire address
> corpus (e.g. you need a complete street directory of the whole US) and
> do a bunch of Bayesian inference.  As a very simple example, for an
> address like "1000 RAMPART S ST" you'd use the zip code to identify the
> address's geographic neighborhood, and then use your street directory to
> find candidate correct addresses within that zip code.  The USPS does
> an amazing job of delivering mail to completely mangled addresses
> based on methods like that.

Paul,

That's a sound methodology. I actually have a routine that will
compare an address to a list of all streets in the city using a Short
Distance function. I have used that in circumstances when there are a
lot of problems with addresses. In this case, however, the streets are
actually structured very well -- except for the transposed street
directions. I was really hoping to see if there's a solution that
handles one, two, and three word strings, followed by an occasional
single character, and then a two character suffix. I'm still hoping
for that kind of a solution if it exists. The reason? It's actually a
very small number of addresses that aren't being captured with the
current regex. I don't see the need for overkill, and I'm always
stretching to learn something I haven't already succeeded at
accomplishing. I may just make a second pass at the data with a
different regex.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Stuck on a three word street name regex

2010-01-27 Thread Brian D
On Jan 27, 7:27 pm, MRAB  wrote:
> Brian D wrote:
> > I've tackled this kind of problem before by looping through a patterns
> > dictionary, but there must be a smarter approach.
>
> > Two addresses. Note that the first has incorrectly transposed the
> > direction and street name. The second has an extra space in it before
> > the street type. Clearly done by someone who didn't know how to
> > concatenate properly -- or didn't care.
>
> > 1000 RAMPART S ST
>
> > 100 JOHN CHURCHILL CHASE  ST
>
> > I want to parse the elements into an array of values that can be
> > inserted into new database fields.
>
> > Anyone who loves solving these kinds of puzzles care to relieve my
> > frazzled brain?
>
> > The pattern I'm using doesn't keep the "CHASE" with the "JOHN
> > CHURCHILL":
>
> [snip]
> Regex doesn't gain you much. I'd split the string and then fix the parts
> as necessary:
>
>  >>> def parse_address(address):
> ...     parts = address.split()
> ...     if parts[-2] == "S":
> ...         parts[1 : -1] = [parts[-2]] + parts[1 : -2]
> ...     parts[1 : -1] = [" ".join(parts[1 : -1])]
> ...     return parts
> ...
>  >>> print parse_address("1000 RAMPART S ST")
> ['1000', 'S RAMPART', 'ST']
>  >>> print parse_address("100 JOHN CHURCHILL CHASE  ST")
> ['100', 'JOHN CHURCHILL CHASE', 'ST']

This is a nice approach I wouldn't have thought to pursue. I've never
seen this referencing of list elements in reverse order with negative
values, so that certainly expands my knowledge of Python. Of course,
I'd want to check for other directionals -- probably with a list
check, e.g.,

if parts[-2] in ('E', 'W', 'N', 'S'):

Thanks for sharing your approach.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Stuck on a three word street name regex

2010-01-28 Thread Brian D
> > [snip]
> > Regex doesn't gain you much. I'd split the string and then fix the parts
> > as necessary:
>
> >  >>> def parse_address(address):
> > ...     parts = address.split()
> > ...     if parts[-2] == "S":
> > ...         parts[1 : -1] = [parts[-2]] + parts[1 : -2]
> > ...     parts[1 : -1] = [" ".join(parts[1 : -1])]
> > ...     return parts
> > ...
> >  >>> print parse_address("1000 RAMPART S ST")
> > ['1000', 'S RAMPART', 'ST']
> >  >>> print parse_address("100 JOHN CHURCHILL CHASE  ST")
> > ['100', 'JOHN CHURCHILL CHASE', 'ST']
>
> This is a nice approach I wouldn't have thought to pursue. I've never
> seen this referencing of list elements in reverse order with negative
> values, so that certainly expands my knowledge of Python. Of course,
> I'd want to check for other directionals -- probably with a list
> check, e.g.,
>
> if parts[-2] in ('E', 'W', 'N', 'S'):
>
> Thanks for sharing your approach.

After studying this again today, I realized the ingeniousness of
reverse slicing the list (or perhaps right slicing) -- that one
doesn't have to worry about the number of words in the string.

To translate for those who may follow, the expression "parts[1 : -1]"
means gather list items from position one in the list (index position
2) to one index position before the end of the list. The value in this
is that we already know the first list element after a split() will be
the street number. The last element will be the street type.
Everything in between, no matter how many words, will be the street
name -- excepting, of course, the instances where there's a street
direction added in, as captured in example above.

A very nice solution. Thanks again!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Stuck on a three word street name regex

2010-01-28 Thread Brian D
On Jan 28, 7:40 am, Brian D  wrote:
> > > [snip]
> > > Regex doesn't gain you much. I'd split the string and then fix the parts
> > > as necessary:
>
> > >  >>> def parse_address(address):
> > > ...     parts = address.split()
> > > ...     if parts[-2] == "S":
> > > ...         parts[1 : -1] = [parts[-2]] + parts[1 : -2]
> > > ...     parts[1 : -1] = [" ".join(parts[1 : -1])]
> > > ...     return parts
> > > ...
> > >  >>> print parse_address("1000 RAMPART S ST")
> > > ['1000', 'S RAMPART', 'ST']
> > >  >>> print parse_address("100 JOHN CHURCHILL CHASE  ST")
> > > ['100', 'JOHN CHURCHILL CHASE', 'ST']
>
> > This is a nice approach I wouldn't have thought to pursue. I've never
> > seen this referencing of list elements in reverse order with negative
> > values, so that certainly expands my knowledge of Python. Of course,
> > I'd want to check for other directionals -- probably with a list
> > check, e.g.,
>
> > if parts[-2] in ('E', 'W', 'N', 'S'):
>
> > Thanks for sharing your approach.
>
> After studying this again today, I realized the ingeniousness of
> reverse slicing the list (or perhaps right slicing) -- that one
> doesn't have to worry about the number of words in the string.
>
> To translate for those who may follow, the expression "parts[1 : -1]"
> means gather list items from position one in the list (index position
> 2) to one index position before the end of the list. The value in this
> is that we already know the first list element after a split() will be
> the street number. The last element will be the street type.
> Everything in between, no matter how many words, will be the street
> name -- excepting, of course, the instances where there's a street
> direction added in, as captured in example above.
>
> A very nice solution. Thanks again!

Correction:

[snip] the expression "parts[1 : -1]" means gather list items from the
second element in the list (index value 1) to one index position
before the end of the list. [snip]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Stuck on a three word street name regex

2010-01-28 Thread Brian D

> Correction:
>
> [snip] the expression "parts[1 : -1]" means gather list items from the
> second element in the list (index value 1) to one index position
> before the end of the list. [snip]

MRAB's solution was deserving of a more complete solution:

>>> def parse_address(address):
...     # Handles poorly-formatted addresses:
...     #   100 RAMPART S ST -- direction in wrong position
...     #   45 JOHN CHURCHILL CHASE  ST -- two spaces before type
...     # addresslist = ['num', 'dir', 'name', 'type']
...     addresslist = ['', '', '', '']
...     parts = address.split()
...     if parts[-2] in ('E', 'W', 'N', 'S'):
...         addresslist[1] = parts[-2]
...         addresslist[2] = ' '.join(parts[1 : -2])
...     else:
...         addresslist[2] = ' '.join(parts[1 : -1])
...     addresslist[0] = parts[0]
...     addresslist[3] = parts[-1]
...     return addresslist

>>> parse_address('45 John Churchill Chase N St')
['45', 'N', 'John Churchill Chase', 'St']
>>> parse_address('45 John Churchill Chase  St')
['45', '', 'John Churchill Chase', 'St']
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Stuck on a three word street name regex

2010-01-28 Thread Brian D
On Jan 28, 8:27 am, Lie Ryan  wrote:
> On 01/28/10 11:28, Brian D wrote:
>
>
>
> > I've tackled this kind of problem before by looping through a patterns
> > dictionary, but there must be a smarter approach.
>
> > Two addresses. Note that the first has incorrectly transposed the
> > direction and street name. The second has an extra space in it before
> > the street type. Clearly done by someone who didn't know how to
> > concatenate properly -- or didn't care.
>
> > 1000 RAMPART S ST
>
> > 100 JOHN CHURCHILL CHASE  ST
>
> > I want to parse the elements into an array of values that can be
> > inserted into new database fields.
>
> > Anyone who loves solving these kinds of puzzles care to relieve my
> > frazzled brain?
>
> > The pattern I'm using doesn't keep the "CHASE" with the "JOHN
> > CHURCHILL":
>
> How does the following perform?
>
> pat = re.compile(r'(?P<num>\d+)\s+(?P<name>[A-Z\s]+)\s+(?P<dir>N|S|W|E|)\s+(?P<type>ST|RD|AVE?|)$')
>
> or more legibly:
>
> pat = re.compile(
>     r'''
>       (?P<num>  \d+              )  #M series of digits
>       \s+
>       (?P<name> [A-Z\s]+         )  #M one-or-more word
>       \s+
>       (?P<dir>  S?E|SW?|N?W|NE?| )  #O direction or nothing
>       \s+
>       (?P<type> ST|RD|AVE?       )  #M street type
>       $                             #M END
>     ''', re.VERBOSE)

Is that all? That little empty space after the "|" OR metacharacter?
Wow.

As a test, to create a failure, if I remove that last "|"
metacharacter from the "N|S|W|E|" string (i.e., "N|S|W|E"), the match
fails on addresses that do not have that malformed direction after the
street name (e.g., '45 JOHN CHURCHILL CHASE  ST')

Very clever. I don't think I've ever seen documentation showing that
little trick.

Thanks for enlightening me!
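
For anyone else who hasn't seen it, a minimal demonstration of the trick:
the trailing "|" adds an empty alternative, so the group can match nothing
at all:

>>> import re
>>> pat = re.compile(r'^(N|S|W|E|)$')
>>> pat.match('N').group(1)
'N'
>>> pat.match('').group(1)
''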
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression to structure HTML

2009-10-02 Thread Brian D
Yes, John, that's correct. I'm trying to trap and discard the <tr> row
and <td> elements, re-formatting with pipes so that I can more readily
import the data into a database. The tags are, of course, initially
useful for pattern discovery. But there are other approaches -- I
could just replace the tags and capture the data as an array.

I'm well aware of the problems using regular expressions for html
parsing. This isn't merely a question of knowing when to use the right
tool. It's a question about how to become a better developer using
regular expressions.

I'm trying to figure out where the regular expression fails. The
structure of the page I'm scraping is uniform in the production of
tags -- it's an old ASP page that pulls data from a database.

What's different in the first <tr> row is the appearance of a comma, a
# pound symbol, and a number (", Inc #4"). I'm making the assumption
that's what's throwing off the remainder of the regular expression --
because (despite the snark by others above) the expression is working
for every other data row. But I could be wrong. Of course, if I could
identify the problem, I wouldn't be asking. That's why I posted the
question for other eyes to review.

I discovered that I may actually be properly parsing the data from the
tags when I tried this test in a Python interpreter:

>>> s = "New Horizon Technical Academy, Inc #4"
>>> p = re.compile(r'([\s\S\WA-Za-z0-9]*)()')
>>> m = p.match(s)
>>> m = p.match(s)
>>> m.group(0)
"New Horizon Technical Academy, Inc #4"
>>> m.group(1)
"New Horizon Technical Academy, Inc #4"
>>> m.group(2)
''

I found it curious that I was capturing the groups as sequences, but I
didn't understand how to use this knowledge in named groups -- or
maybe I am merely mis-identifying the source of the regular expression
problem.

It's a puzzle. I'm hoping someone will want to share the wisdom of
their experience, not criticize for the attempt to learn. Maybe one
shouldn't learn how to use a hammer on a screw, but I wouldn't say
that I have never hammered a screw into a piece of wood just because I
only had a hammer.

Thanks,
Brian


On Oct 2, 8:38 am, John  wrote:
> On Oct 2, 1:10 am, "[email protected]" <[email protected]> wrote:
>
>
>
> > I'm kind of new to regular expressions, and I've spent hours trying to
> > finesse a regular expression to build a substitution.
>
> > What I'd like to do is extract data elements from HTML and structure
> > them so that they can more readily be imported into a database.
>
> > No -- sorry -- I don't want to use BeautifulSoup (though I have for
> > other projects). Humor me, please -- I'd really like to see if this
> > can be done with just regular expressions.
>
> > Note that the output is referenced using named groups.
>
> > My challenge is successfully matching the HTML tags in between the
> > first table row, and the second table row.
>
> > I'd appreciate any suggestions to improve the approach.
>
> > rText = "8583 > href=lic_details.asp?lic_number=8583>New Horizon Technical Academy,
> > Inc #4Jefferson70114 > tr>9371 > lic_number=9371>Career Learning Center > valign=top>Jefferson70113"
>
> > rText = re.compile(r'()(?P\d+)()( > valign=top>)()(?P[A-
> > Za-z0-9#\s\S\W]+)().+$').sub(r'LICENSE:\g|NAME:
> > \g\n', rText)
>
> > print rText
>
> > LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4 > valign=top>Jefferson70114 > valign=top>9371 > lic_number=9371>Career Learning Center|PARISH:Jefferson|ZIP:70113
>
> Some suggestions to start off with:
>
>   * triple-quote your multiline strings
>   * consider using the re.X, re.M, and re.S options for re.compile()
>   * save your re object after you compile it
>   * note that re.sub() returns a new string
>
> Also, it sounds like you want to replace the first 2 <td> elements for
> each <tr> element with their content separated by a pipe (throwing
> away the <td> tags themselves), correct?
>
> ---John

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression to structure HTML

2009-10-02 Thread Brian D
The other thought I had was that I may not be properly trapping the
end of the first <tr> row, and the beginning of the next <tr> row.

On Oct 2, 8:38 am, John  wrote:
> [snip]

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to insert string in each match using RegEx iterator

2009-06-10 Thread Brian D
On Jun 10, 5:17 am, Paul McGuire  wrote:
> On Jun 9, 11:13 pm, "[email protected]" <[email protected]> wrote:
>
> > By what method would a string be inserted at each instance of a RegEx
> > match?
>
> Some might say that using a parsing library for this problem is
> overkill, but let me just put this out there as another data point for
> you.  Pyparsing (http://pyparsing.wikispaces.com) supports callbacks
> that allow you to embellish the matched tokens, and create a new
> string containing the modified text for each match of a pyparsing
> expression.  Hmm, maybe the code example is easier to follow than the
> explanation...
>
> from pyparsing import Word, nums, Regex
>
> # an integer is a 'word' composed of numeric characters
> integer = Word(nums)
>
> # or use this if you prefer
> integer = Regex(r'\d+')
>
> # attach a parse action to prefix 'INSERT ' before the matched token
> integer.setParseAction(lambda tokens: "INSERT " + tokens[0])
>
> # use transformString to search through the input, applying the
> # parse action to all matches of the given expression
> test = '123 abc 456 def 789 ghi'
> print integer.transformString(test)
>
> # prints
> # INSERT 123 abc INSERT 456 def INSERT 789 ghi
>
> I offer this because often the simple examples that get posted are
> just the barest tip of the iceberg of what the poster eventually plans
> to tackle.
>
> Good luck in your Pythonic adventure!
> -- Paul

Thanks for all of the instant feedback. I have enumerated three
responses below:

First response:

Peter,

I wonder if you (or anyone else) might attempt a different explanation
for the use of the special sequence '\1' in the RegEx syntax.

The Python documentation explains:

\number
Matches the contents of the group of the same number. Groups are
numbered starting from 1. For example, (.+) \1 matches 'the the' or
'55 55', but not 'the end' (note the space after the group). This
special sequence can only be used to match one of the first 99 groups.
If the first digit of number is 0, or number is 3 octal digits long,
it will not be interpreted as a group match, but as the character with
octal value number. Inside the '[' and ']' of a character class, all
numeric escapes are treated as characters.

In practice, this appears to be the key to the key device to your
clever solution:

>>> re.compile(r"(\d+)").sub(r"INSERT \1", string)
'abc INSERT 123 def INSERT 456 ghi INSERT 789'

>>> re.compile(r"(\d+)").sub(r"INSERT ", string)
'abc INSERT  def INSERT  ghi INSERT '

I don't, however, precisely understand what is meant by "the group of
the same number" -- or maybe I do, but it isn't explicit. Is this just
a shorthand reference to match.group(1) -- if that were valid --
implying that the group match result is printed in the compile
execution?
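
For illustration, each \1 in the replacement string is re-expanded per
match to whatever that match's group 1 captured -- effectively
match.group(1) evaluated match by match:

>>> re.compile(r"(\d+)").sub(r"[\1]", 'abc 123 def 456')
'abc [123] def [456]'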


Second response:

I've encountered a problem with my RegEx learning curve which I'll be
posting in a new thread -- how to escape hash characters # in strings
being matched, e.g.:

>>> string = re.escape('123#456')
>>> match = re.match('\d+', string)
>>> print match
<_sre.SRE_Match object at 0x00A6A800>
>>> print match.group()
123


Third response:

Paul,

Thanks for the referring me to the Pyparsing module. I'm thoroughly
enjoying Python, but I'm not prepared right now to say I've mastered
the Pyparsing module. As I continue my work, however, I'll be tackling
the problem of parsing addresses, exactly as the Pyparsing module
example illustrates. I'm sure I'll want to use it then.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to escape # hash character in regex match strings

2009-06-11 Thread Brian D
On Jun 11, 2:01 am, Lie Ryan  wrote:
> [email protected] wrote:
> > I've encountered a problem with my RegEx learning curve -- how to
> > escape hash characters # in strings being matched, e.g.:
>
> > >>> string = re.escape('123#abc456')
> > >>> match = re.match('\d+', string)
> > >>> print match
> > <_sre.SRE_Match object at 0x00A6A800>
> > >>> print match.group()
> > 123
>
> > The correct result should be:
>
> > 123456
>
> > I've tried to escape the hash symbol in the match string without
> > result.
>
> > Any ideas? Is the answer something I overlooked in my lurching Python
> > schooling?
>
> As you're not being clear on what you wanted, I'm just guessing this is
> what you wanted:
>
> >>> s = '123#abc456'
> >>> re.match('\d+', re.sub('#\D+', '', s)).group()
> '123456'
> >>> s = '123#this is a comment and is ignored456'
> >>> re.match('\d+', re.sub('#\D+', '', s)).group()
>
> '123456'

Sorry I wasn't more clear. I positively appreciate your reply. It
provides half of what I'm hoping to learn. The hash character is
actually a desirable hook to identify a data entity in a scraping
routine I'm developing, but not a character I want in the scrubbed
data.

In my application, the hash makes a string of alphanumeric characters
unique from other alphanumeric strings. The strings I'm looking for
are actually manually-entered identifiers, but a real machine-created
identifier shouldn't contain that hash character. The correct pattern
should be 'A1234509', but is instead often merely entered as '#12345'
when the first character, representing an alphabet sequence for the
month, and the last two characters, representing a two-digit year, can
be assumed. Identifying the hash character in a RegEx match is a way
of trapping the string and transforming it into its correct machine-
generated form.

I'm surprised it's been so difficult to find an example of the hash
character in a RegEx string -- for exactly this type of situation,
since it's so common in the real world that people want to put a pound
symbol in front of a number.

Thanks!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to escape # hash character in regex match strings

2009-06-11 Thread Brian D
On Jun 11, 9:22 am, Brian D  wrote:
> [snip]

By the way, other forms the strings can take in their manually created
forms:

A#12345
#1234509

Garbage in, garbage out -- I know. I wish I could tell the people
entering the data how challenging it is to work with what they
provide, but it is, after all, a screen-scraping routine.
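
For the record, a minimal sketch of the normalization described above --
the default month letter and year are assumptions standing in for values
that would really come from the record's context:

import re

def normalize(raw, default_month='A', default_year='09'):
    # Accepts '#12345', 'A#12345', '#1234509', or a correct 'A1234509';
    # strips the hash and fills in the missing month letter and/or year.
    m = re.match(r'^([A-Z])?#?(\d{5})(\d{2})?$', raw)
    if not m:
        return None
    month, serial, year = m.groups()
    return (month or default_month) + serial + (year or default_year)

for s in ('#12345', 'A#12345', '#1234509'):
    print s, '->', normalize(s)   # each prints A1234509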
-- 
http://mail.python.org/mailman/listinfo/python-list


What the \xc2\xa0 ?!!

2010-09-07 Thread Brian D
In an HTML page that I'm scraping using urllib2, a \xc2\xa0
bytestring appears.

The page's charset = utf-8, and the Chrome browser I'm using displays
the characters as a space.

The page requires authentication:
https://www.nolaready.info/myalertlog.php

When I try to concatenate strings containing the bytestring, Python
chokes because it refuses to coerce the bytestring into ascii.

wfile.write('|'.join(valueList))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
163: ordinal not in range(128)

In searching for help with this issue, I've learned that the
bytestring *might* represent a non-breaking space.

When I scrape the page using urllib2, however, the characters print
as     in a Windows command prompt (though I wouldn't be surprised if
this is some erroneous attempt by the antiquated command window to
handle something it doesn't understand).

If I use IDLE to attempt to decode the single byte referenced in the
error message, and convert it into UTF-8, another error message is
generated:

>>> weird = unicode('\xc2', 'utf-8')

Traceback (most recent call last):
  File "", line 1, in 
    weird = unicode('\xc2', 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0:
unexpected end of data

If I attempt to decode the full bytestring, I don't obtain a human-
readable string (expecting, perhaps, a non-breaking space):

>>> weird = unicode('\xc2\xa0', 'utf-8')
>>> par = ' - '.join(['This is', weird])
>>> par
u'This is - \xa0'

I suspect that the bytestring isn't UTF-8, but what is it? Latin1?

>>> weirder = unicode('\xc2\xa0', 'latin1')
>>> weirder
u'\xc2\xa0'
>>> 'This just gets ' + weirder
u'This just gets \xc2\xa0'

Or is it a Microsoft bytestring?

>>> weirder = unicode('\xc2\xa0', 'mbcs')
>>> 'This just gets ' + weirder
u'This just gets \xc2\xa0'

None of these codecs seem to work.

Back to the original purpose, as I'm scraping the page, I'm storing
the field/value pair in a dictionary with each iteration through table
elements on the page. This is all fine, until a value is found that
contains the offending bytestring. I have attempted to coerce all
value strings into an encoding, but Python doesn't seem to like that
when the string is already Unicode:

valuesDict[fieldString] = unicode(value, 'UTF-8')
TypeError: decoding Unicode is not supported

The solution I've arrived at is to specify the encoding for value
strings both when reading and writing value strings.

for k, v in valuesDict.iteritems():
    valuePair = ':'.join([k, v.encode('UTF-8')])
    [snip] ...
wfile.write('|'.join(valueList))

I'm not sure I have a question, but does this sound familiar to any
Unicode experts out there?

How should I handle these odd bytestring values? Am I doing it
correctly, or what could I improve?

Thanks!


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: python vs perl lines of code

2006-05-17 Thread brian d foy
In article <[EMAIL PROTECTED]>, Edward
Elliott <[EMAIL PROTECTED]> wrote:

> This is just anecdotal, but I still find it interesting.  Take it for what
> it's worth.  I'm interested in hearing others' perspectives, just please
> don't turn this into a pissing contest.
> 
> I'm in the process of converting some old perl programs to python.  These
> programs use some network code and do a lot of list/dict data processing. 
> The old ones work fine but are a pain to extend.  After two conversions,
> the python versions are noticeably shorter.

You've got some hidden assumptions in there somewhere, even if you
aren't admitting them to yourself. :)

You have to note that rewriting a program, even in the same language,
tends to make it shorter, too. These things are measures of programmer
skill, not the usefulness or merit of a particular language.

Shorter doesn't really mean anything though, and line count means even
less. The number of statements or the statement density might be
slightly more meaningful. Furthermore, you can't judge a script by just
the lines you see. Count the lines of all the libraries and support
files that come into play. Even then, that's next to meaningless unless
the two things do exactly the same thing and have exactly the same
features and capabilities.

I can write a one line (or very short) program (in any language) that
does the same thing your scripts do just by hiding the good stuff in a
library. One of my friends likes to talk about his program that
implements Tetris in one statement (because he hardwired everything
into a chip). That doesn't lead us to any greater understanding of
anything though.

*** Posted via a free Usenet account from http://www.teranews.com ***
-- 
http://mail.python.org/mailman/listinfo/python-list