On Wed, Apr 22, 2009 at 09:23:30PM +0200, spir wrote:
>> I need to be able to decompose a formatted text file into
>> identifiable, possibly named pieces. To tokenize it, in other words.
>> There seems to be a vast array of modules to do this with
>> (simpleparse, pyparsing, etc.), but I cannot understand their
>> documentation.
>
> I would recommend pyparsing, but this is an opinion.

It looked like a good package to me as well, but I cannot see how to
define the grammar - it may be that the notation just doesn't make
sense to me.

> Regular expressions may be enough, depending on your actual needs.

Perhaps, but I am cautious, because every text and most websites
discourage regexes for parsing.

> The question is: what do you need from the data? What do you expect
> as a result? The best is to provide an example of the result that
> should match some sample data, e.g. "I wish as a result a dictionary
> looking like:
>
> {
>   'AU': 'some text\nperhaps across lines',
>   'AB': ['some other text', 'there may be multiples of some fields'],
>   'UN': 'any 2-letter combination may exist...',
>   ...
> }"

I think that a dictionary could work, but it would have to use lists
as the values, to prevent key collisions. That said, returning a list
of dictionaries (one dictionary per bibliographic reference) would
work very well in the larger context of my program.

> The choice of an appropriate tool, and hints on possible algorithms,
> depends on this. I hope this helps.

I spent quite some time with pyparsing, but I was never able to
express the rules of my grammar based on the examples on the website.
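For what it's worth, here is the sort of plain-Python fallback I have
been sketching while I fight with pyparsing. It is only a guess at the
input format - I am assuming MEDLINE-style records, i.e. lines with a
two-letter tag like "AU  - ...", indented continuation lines, and a
blank line between references - and the names (parse_references,
TAG_LINE, the sample text) are mine, not from any library:

import re
from collections import defaultdict

# Assumed input format (the real data may differ): each record is a
# block of lines separated by a blank line; each field starts with a
# two-letter tag such as "AU  - ..."; indented lines continue the
# previous field's value.
TAG_LINE = re.compile(r'^([A-Z]{2})\s*-\s*(.*)$')

def parse_references(text):
    """Return a list of dicts, one per record. Values are lists of
    strings, so repeated tags (e.g. several AU lines) cannot collide."""
    records = []
    for block in text.split('\n\n'):
        if not block.strip():
            continue
        record = defaultdict(list)
        tag = None
        for line in block.splitlines():
            m = TAG_LINE.match(line)
            if m:
                tag, value = m.groups()
                record[tag].append(value)
            elif tag and line.startswith(' '):
                # continuation of the previous field's last value
                record[tag][-1] += '\n' + line.strip()
        records.append(dict(record))
    return records

sample = """\
AU  - Smith J
AU  - Jones K
AB  - An abstract that runs
      across two lines.

AU  - Doe A
"""
print(parse_references(sample))
# [{'AU': ['Smith J', 'Jones K'],
#   'AB': ['An abstract that runs\nacross two lines.']},
#  {'AU': ['Doe A']}]

If the real file separates records some other way (say, a particular
tag marking the start of each reference), only the splitting step
should need to change.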
--
yours,
William

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor