> From: Alan <[EMAIL PROTECTED]>
> 
> I have about 150 lines of Python that extract text from a large file.
> The problem is that I need a few lines to clean the file first, to avoid
> the problem the script is facing.
> 
> Overview
> There is a large text file that I am trying to organize for the Python
> script to process. It is badly organized, and I attempted to put it into
> the format below, which the master script understands.

I think I would split this into three phases:
- collect the data into groups of HFR
- process each group by rearranging, renumbering, reporting errors
- output the processed groups

One potential problem is resynchronizing to the next group after a
sequence error. If there is always a blank line between groups, that is
easy. Otherwise, maybe just assume that an |H line starts a new group.

hth,
Kent

> 
> Keywords:
> ##### is a number from 1 through 99999
> |H paragraphs
> |F reFerence
> |R Rating
> 
> BEFORE I organized it with a global search and replace,
> each set of tokens looked like this
> 
> #####  paragraph
> F reference
> R rating
> 
> NOW (the format the master script understands)
> 
> |H##### paragraph
> |F reference
> |R rating
> 
> Notice that there is no ##### in |F and |R
> 
> PROBLEMS
> Phase 1
> PROBLEM 1
> the |H paragraph (which can span multiple lines) has some words between
> () such as (xyz blah words), which may also span multiple lines:
> ... ( blah blah
> blah blah) ...
> 
> We need to move the parenthesized text to the end of the |F reference,
> e.g. |F reference (xyz blah words)
> 
> 
> Example
> BEFORE
> 
> |H 00100 a friend in need is a friend indeed (author means both young \
> and old) so select the best friend as soon as you can blah
> |F Old London book
> |R Cool
> 
> AFTER your process
> |H 00100 "a friend in need is a friend indeed so select the best friend
> as soon as you can blah"
> |F Old London book (author means both young and old)
> |R Cool
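
One way to do this per group is sketched below, under a few assumptions:
a group is a list of lines with the |H paragraph first, parentheses are
not nested, and a trailing backslash is only a line-wrap marker. The name
move_parentheticals is made up. The sketch joins the paragraph into one
line and does not add the quotes shown in the AFTER example.

```python
import re

# Matches one non-nested '(...)' aside, plus any whitespace before it.
PAREN = re.compile(r"\s*\(([^()]*)\)")

def move_parentheticals(group):
    """Move '(...)' asides from the |H paragraph to the end of the |F line.

    Returns a new list of lines; the paragraph is joined into one line.
    """
    # Index of the first |F line; everything before it is the paragraph.
    f_index = next((i for i, line in enumerate(group)
                    if line.startswith("|F")), None)
    if f_index is None:
        return list(group)  # no |F line: leave the group untouched
    # Assumption: a trailing backslash only marks a wrapped line.
    parts = [line.rstrip().rstrip("\\").rstrip() for line in group[:f_index]]
    paragraph = " ".join(parts)
    asides = ["(%s)" % " ".join(text.split())
              for text in PAREN.findall(paragraph)]
    if not asides:
        return list(group)
    # Remove the asides and normalize the remaining whitespace.
    paragraph = " ".join(PAREN.sub("", paragraph).split())
    f_line = " ".join([group[f_index].rstrip()] + asides)
    return [paragraph, f_line] + list(group[f_index + 1:])
```

Groups without a |F line or without any parentheses come back unchanged,
so it should be safe to run this before the order check.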
> 
> PROBLEM 2
> I need to find out where the order is broken so that I can fix it by
> hand, i.e. any set that is not in the order |H##### |F |R should be
> written to ErrorOrderLogFile, for example
> 
> |H##### paragraph
> |H paragraph
> |R rating
> 
> or any other wrong order. For example, run the new cleaning script on
> this input:
> |H00299 paragraph
> |F Reference
> |R Rating
> 
> |H00300 paragraph
> |H paragraph
> |H rating
> 
> cat ErrorOrderLogFile:
> bad set orders
> |H00300 paragraph
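
A minimal sketch of the order check, assuming each group is a list of
lines and that lines not starting with | are paragraph continuations. The
function names are made up, and the log format (a header followed by the
first line of each bad set) is only a guess based on the example above.

```python
def tag_sequence(group):
    """Return the tag letters of a group in order, e.g. 'HFR'."""
    return "".join(line[1] for line in group
                   if line.startswith("|") and len(line) > 1)

def check_order(groups):
    """Split groups into (good, bad); good means exactly |H, |F, |R."""
    good, bad = [], []
    for group in groups:
        if tag_sequence(group) == "HFR":
            good.append(group)
        else:
            bad.append(group)
    return good, bad

def format_error_log(bad_groups):
    """Build the log text: a header, then the first line of each bad set."""
    lines = ["bad set orders"]
    lines.extend(group[0] for group in bad_groups if group)
    return "\n".join(lines) + "\n"
```

The returned text can then be written to ErrorOrderLogFile with an
ordinary open(..., "w") and write().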
> 
> 
> Phase 2
> PROBLEM 3
> Once I fix the order by hand, I need to renumber all sets, say from
> 00001 to 99999, in this format
> 
> |H00001 paragraph
> |F00001 reference
> |R00001 rating
> 
> |H99999 paragraph
> |F99999 reference
> |R99999 rating
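
Renumbering could look something like this. It is a sketch that assumes
the groups are already in the right order, that only the |H line carries
an old number to be dropped, and that |F and |R lines have no number yet.
The name renumber is made up.

```python
import re

# |H line: tag, optional spaces, optional old number, optional spaces.
H_TAG = re.compile(r"^\|H\s*\d*\s*")
# |F or |R line: tag and optional spaces (assumed to carry no number).
FR_TAG = re.compile(r"^\|([FR])\s*")

def renumber(groups, start=1):
    """Return new groups whose |H, |F and |R lines share a 5-digit number."""
    result = []
    for number, group in enumerate(groups, start):
        new_group = []
        for line in group:
            h_match = H_TAG.match(line)
            fr_match = FR_TAG.match(line)
            if h_match:
                line = "|H%05d %s" % (number, line[h_match.end():])
            elif fr_match:
                line = "|%s%05d %s" % (fr_match.group(1), number,
                                       line[fr_match.end():])
            # Continuation lines are copied unchanged.
            new_group.append(line)
        result.append(new_group)
    return result
```

Running it a second time on its own output would keep the old numbers on
the |F and |R lines, so it is meant to be run once on the cleaned file.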
> 
> ---
> Outgoing mail is certified Virus Free.
> Checked by AVG anti-virus system (http://www.grisoft.com).
> Version: 6.0.778 / Virus Database: 525 - Release Date: 10/15/2004
> 
> _______________________________________________
> Tutor maillist  -  [email protected]
> http://mail.python.org/mailman/listinfo/tutor

-- 
http://www.kentsjohnson.com
