Hello William!

On Wednesday 22 April 2009, William Witteman wrote:
> The file format I am looking at (it is a bibliographic reference
> file) looks like this:
>
> <1> # the references are enumerated
> AU - some text
> perhaps across lines
> AB - some other text
> AB - there may be multiples of some fields
> UN - any 2-letter combination may exist, other than by exhaustion,
> I cannot anticipate what will be found

How do you decide that a word is a keyword (AU, AB, UN) and not a part
of the text? There could be a file like this:

<567>
AU - Bibliographical Theory and Practice - Volume 1 - The AU - Tag
and its applications
AB - Texts in Library Science
<568>
AU - Bibliographical Theory and Practice - Volume 2 - The
AB - Tag and its applications
AB - Texts in Library Science
<569>
AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
AB - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
ZZ - Somewhat nonsensical case

To me it seems that a parsing library is unnecessary. Just look at the
first few characters of each line and decide whether it's the start of
a record, a tag, or normal text. You might need some additional
algorithm for corner cases.

Kind regards,
Eike.
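
The line-by-line approach suggested above could be sketched in Python
roughly as follows. This is only a sketch under assumptions that the
thread does not confirm: a record starts with a line like "<123>", a
tag is exactly two capital letters followed by " - " at the very start
of a line, and every other non-blank line continues the previous
field. The names parse_references, RECORD_RE and TAG_RE are made up
for illustration.

```python
import re

# Assumed line shapes (guessed from the sample, not from a format spec):
#   "<123>"       starts a new record
#   "XX - text"   starts a new field with a two-letter tag
#   anything else continues the previous field
RECORD_RE = re.compile(r"^<(\d+)>")
TAG_RE = re.compile(r"^([A-Z]{2})\s*-\s*(.*)$")


def parse_references(lines):
    """Return a list of (record_number, [(tag, text), ...]) tuples."""
    records = []
    fields = None  # field list of the record currently being read
    for raw in lines:
        line = raw.rstrip("\n")
        record = RECORD_RE.match(line)
        tag = TAG_RE.match(line)
        if record:
            # Start of a new record: open a fresh field list.
            fields = []
            records.append((int(record.group(1)), fields))
        elif fields is None:
            # Ignore anything that appears before the first record.
            continue
        elif tag:
            # Start of a new field.
            fields.append((tag.group(1), tag.group(2).strip()))
        elif fields and line.strip():
            # Normal text: append it to the most recent field.
            last_tag, text = fields[-1]
            fields[-1] = (last_tag, (text + " " + line.strip()).strip())
    return records
```

Note that this sketch does not solve the ambiguity shown in record
<568>: a wrapped line that happens to begin with "AB - " is read as a
new field. Handling that would need an extra rule, for example
treating only unindented lines as tags if continuation lines turn out
to be indented in the real files.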