Re: [Tutor] how to parse a multiple character words from plaintext

Kent Johnson Sun, 24 Feb 2008 06:27:00 -0800

---- John Gunderman <[EMAIL PROTECTED]> wrote: 
> I am parsing the output of mork.pl, which is a Mork (the Mozilla format) 
> parser. I don't know Perl, so I decided to write a Python script to do what I 
> wanted, which basically is to create a dictionary listing each site and its 
> corresponding values instead of outputting into plaintext. Unfortunately, the 
> output of mork.pl is 5000+ lines so reading the whole document wouldn't be 
> that efficient.


OK, I looked briefly at mork.pl. You should be able to process its output 
line by line with something like this:

for line in history_file:
  if not line.strip():
    continue # skip blank lines; may not be needed
  time, count, url = line.split()
  # do something with time, count, url
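
Iterating over the file object directly, as above, reads one line at a time, 
whereas readlines() builds a list of every line first. For 5000 lines either 
would work fine. To get from there to the dictionary you described, a minimal 
sketch might look like the following. It assumes each line holds a timestamp, 
a visit count and a URL separated by whitespace; the function name 
parse_history and the sample data are made up for illustration and are not 
taken from mork.pl.

```python
import io

def parse_history(lines):
    """Build {url: (timestamp, count)} from an iterable of text lines."""
    sites = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Split on any run of whitespace (tabs or spaces), at most twice,
        # so the third field keeps whatever is left of the line.
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # not a "time count url" line; ignore it
        timestamp, count, url = fields
        # int() converts the multi-digit strings in one step.
        sites[url.strip()] = (int(timestamp), int(count))
    return sites

# Made-up sample standing in for the real mork.pl output (assumed layout).
sample = io.StringIO(
    "1203800000\t3\thttp://www.python.org/\n"
    "\n"
    "1203800100\t1\thttp://example.com/page\n"
)
history = parse_history(sample)
```

With a real file you would pass the open file object instead of the sample, 
for example parse_history(open('history.txt')), where the filename is again 
just a placeholder. If the same URL appears more than once, this version 
keeps only the last line seen for it.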

Kent

> Currently it uses:
>         for line in history_file.readlines():
> but I don't know if this has to read all lines before it goes through it. If 
> it does, then would it be more efficient to use
>         while line != '\t':
>             line = history_file.readline()
> I was thinking of just appending each character to the string until it sees 
> '\t', and then using int() on the string, but is there an easier way?
> 
> John
> 
> ----- Original Message ----
> From: Kent Johnson <[EMAIL PROTECTED]>
> To: John Gunderman <[EMAIL PROTECTED]>
> Cc: tutor@python.org
> Sent: Saturday, February 23, 2008 3:43:44 AM
> Subject: Re: [Tutor] how to parse a multiple character words from plaintext
> 
> John Gunderman wrote:
> > I am looking to parse plaintext from a document. However, I am 
> > confused about the actual methodology of it. This is because some of the 
> > words will be multiple digits or characters. However, I don't know the 
> > length of the words before the parse. Is there a way to somehow have 
> > open() grab something until it sees a \t or ' '? I was thinking I could 
> > have it count ahead the number of spaces till the stopping point and 
> > then parse till that point using read(), but that seems sort of 
> > inefficient. Is there a better way to pull this off? Thanks in advance.
> 
> How big is the file? Can you just read the whole document and parse the 
> resulting string? Or read by lines?
> 
> Depending on how complex your parsing is, you might want to use 
> pyparsing or one of the other Python parser libraries.
> http://pyparsing.wikispaces.com/
> http://nedbatchelder.com/text/python-parsers.html
> 
> Kent

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
