> From: Alan <[EMAIL PROTECTED]>
>
> I have about 150 lines of python extracting text from large file, the
> problem I need a few lines to clean first to avoid the problem the
> script is facing
>
> Overview
> There is large text and I am trying to organize it for the python script
> to process, it is badly organized and I attempted to do it like this
> which the master script understand

I think I would split this into three phases:
- collect the data into groups of HFR
- process each group by rearranging, renumbering, reporting errors
- output the processed groups

One potential problem is to resynchronize to the next group when there
is a sequence error. If there is always a blank line between groups it
is easy. Otherwise maybe just assume an H is the start of a group.

hth,
Kent

>
> Keywords:
> ##### is number like 1 thru 99999
> |H paragraphs
> |F reFerence
> |R Rating
>
> BEFORE I organized by text global and replace
> Each set of tokens was like this
>
> ##### paragraph
> F reference
> R rating
>
> Now (where master script understand)
>
> |H###### paragraph
> |F reference
> |R rating
>
> Notice no ##### in |F |R
>
> PROBLEMS
> Phase 1
> PROBLEM 1
> the |H paragraph (multi lines) has some words between () such as (xyz
> blah words) also maybe in multi lines
> ...( blah blah
> blah blah) ...
>
> We need to move it to the end of |F reference (xyz blah words)
>
>
> Example
> BEFORE
>
> |H 00100 a friend in need is a friend indeed (author means both young \
> and old) so select the best friend as soon as you can blah
> |F Old London book
> |R Cool
>
> AFTER your process
> |H 00100 "a friend in need is a friend indeed so select the best friend
> as soon as you can blah"
> |F Old London book
> |R Cool
>
> PROBLEM 2
> I need to find out if the order is broken so I go and fix it by hand
> i.e.
> |H##### |F |R is any other order so it is outputted in
> ErrorOrderLogFile
>
> |H##### paragraph
> |H paragraph
> |R rating
>
> or any order like
>
> run new cleaning script and cat ErrorOrderLogFile
> |H00299 paragraph
> |F Reference
> |H Rating
>
> |H00300 paragraph
> |H paragraph
> |H rating
>
> cat ErrorOrderLogFile:
> bad set orders
> |H00300 paragraph
>
>
> Phase II
> PROBLEM 3
> Once I fix by the order hand I need to renumber all from say 00001 to
> 99999
> In this format
>
> |H00001 paragraph
> |F00001 reference
> |R00001 rating
>
> |H99999 paragraph
> |F99999 reference
> |R99999 rating
>
> ---
> Outgoing mail is certified Virus Free.
> Checked by AVG anti-virus system (http://www.grisoft.com).
> Version: 6.0.778 / Virus Database: 525 - Release Date: 10/15/2004
>
> _______________________________________________
> Tutor maillist - [email protected]
> http://mail.python.org/mailman/listinfo/tutor

--
http://www.kentsjohnson.com
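
For illustration, here is a minimal sketch of the three-phase approach suggested in the reply (collect groups, process each group, output). It is not code from the thread. The function names (collect_groups, process_group, clean_records) are hypothetical, and it assumes that every group starts at a |H line, that a line without a |H, |F or |R prefix continues the previous field, that any leading digits in the |H text are the old number, and that parenthesised notes go to the end of the |F line as PROBLEM 1 describes.

```python
import re

TAGS = ("|H", "|F", "|R")
PAREN = re.compile(r"\s*\(([^()]*)\)")  # one parenthesised note, no nesting


def _strip(text):
    # Trim whitespace and a trailing "\" line-continuation marker.
    return text.strip().rstrip("\\").strip()


def collect_groups(lines):
    """Phase 1: gather lines into groups; each |H line starts a new group."""
    groups, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # blank lines carry no data
        if line.startswith("|H") and current:
            groups.append(current)
            current = []
        if line.startswith(TAGS):
            current.append([line[:2], _strip(line[2:])])
        elif current:
            # Untagged line: continuation of the previous field.
            current[-1][1] += " " + _strip(line)
    if current:
        groups.append(current)
    return groups


def process_group(group, number):
    """Phase 2: return the cleaned lines, or None if the order is broken."""
    if [tag for tag, _ in group] != ["|H", "|F", "|R"]:
        return None
    h, f, r = (text for _, text in group)
    h = re.sub(r"^\d+\s*", "", h)      # drop the old number (assumption)
    notes = PAREN.findall(h)           # text found inside (...)
    h = PAREN.sub("", h).strip()
    if notes:
        f += " " + " ".join("(%s)" % note for note in notes)
    return ["|H%05d %s" % (number, h),
            "|F%05d %s" % (number, f),
            "|R%05d %s" % (number, r)]


def clean_records(lines):
    """Phase 3: return (output_lines, bad_groups) for the whole input."""
    output, errors, number = [], [], 0
    for group in collect_groups(lines):
        result = process_group(group, number + 1)
        if result is None:
            errors.append(group)       # candidates for ErrorOrderLogFile
        else:
            number += 1
            output.extend(result)
    return output, errors
```

Calling clean_records(open(path)) on the input file would give the renumbered lines plus the list of groups whose order is broken; the latter could be written to ErrorOrderLogFile. Since bad groups are skipped when numbering, the numbers are only final once the error list is empty, which matches the plan of fixing the order by hand and then renumbering.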
