On Wed, Apr 22, 2009 at 11:23:11PM +0200, Eike Welk wrote:
>How do you decide that a word is a keyword (AU, AB, UN) and not a part
>of the text? There could be a file like this:
>
><567>
>AU - Bibliographical Theory and Practice - Volume 1 - The AU - Tag
>and its applications
>AB - Texts in Library Science
><568>
>AU - Bibliographical Theory and Practice - Volume 2 - The
>AB - Tag and its applications
>AB - Texts in Library Science
><569>
>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
>AB - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
>ZZ - Somewhat nonsensical case
This is a good case, and luckily the files are validated on the other
end to prevent this kind of collision.

>To me it seems that a parsing library is unnecessary. Just look at the
>first few characters of each line and decide if its the start of a
>record, a tag or normal text. You might need some additional
>algorithm for corner cases.

If this were the only type of file I needed to parse, I'd agree with
you, but it is one of at least 4 formats I'll need to process, so a
robust methodology will serve me better than a regex-based one-off.

--
yours,

William
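For reference, the line-by-line approach Eike describes could be sketched
roughly as below. This is only an illustration, not the parser used for
these files. It assumes a record starts with a line like <567>, and that a
tag line is two capital letters followed by " - ", which is my reading of
the quoted sample rather than a documented format. The names classify_line
and parse_records are made up for the example.

```python
import re

# Assumed formats, inferred from the quoted sample (not from a spec):
#   record marker: "<digits>" alone on a line
#   tag line:      two capital letters, whitespace, "-", then the value
RECORD_RE = re.compile(r"^<(\d+)>\s*$")
TAG_RE = re.compile(r"^([A-Z]{2})\s+-\s?(.*)$")


def classify_line(line):
    """Return ("record", id), ("tag", (tag, value)) or ("text", text)."""
    m = RECORD_RE.match(line)
    if m:
        return ("record", m.group(1))
    m = TAG_RE.match(line)
    if m:
        return ("tag", (m.group(1), m.group(2)))
    return ("text", line.strip())


def parse_records(lines):
    """Group lines into records; plain text continues the previous tag."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        kind, payload = classify_line(line)
        if kind == "record":
            records.append({"id": payload, "fields": []})
        elif not records:
            continue  # ignore anything before the first record marker
        elif kind == "tag":
            records[-1]["fields"].append(list(payload))
        elif records[-1]["fields"]:
            # continuation line: append to the most recent tag's value
            field = records[-1]["fields"][-1]
            field[1] = (field[1] + " " + payload).strip()
    return records
```

Run against record <568> from the sample, this treats the wrapped line
"AB - Tag and its applications" as a new AB field instead of a
continuation of the AU field. That is exactly the collision in question,
so the sketch only works if the files are validated upstream.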