On Wed, 22 Apr 2009 14:35:29 -0400, William Witteman <y...@nerd.cx> wrote:
> I need to be able to decompose a formatted text file into identifiable,
> possibly named pieces. To tokenize it, in other words. There seem to
> be a vast array of modules to do this with (simpleparse, pyparsing etc)
> but I cannot understand their documentation.

I would recommend pyparsing, but this is an opinion.

> The file format I am looking at (it is a bibliographic reference file)
> looks like this:
>
> <1>  # the references are enumerated
> AU  - some text
>       perhaps across lines
> AB  - some other text
> AB  - there may be multiples of some fields
> UN  - any 2-letter combination may exist; other than by exhaustion, I
> cannot anticipate what will be found

Regular expressions may be enough, depending on your actual needs.

> What I am looking for is some help to get started, either with
> explaining the implementation of one of the modules with respect to my
> format, or with an approach that I could use from the base library.

The question is: what do you need from the data? What do you expect as a
result? The best thing is to provide an example of the result you want for
the sample data, e.g. "I wish as a result a dictionary looking like":

{
    'AU': 'some text\nperhaps across lines',
    'AB': ['some other text', 'there may be multiples of some fields'],
    'UN': 'any 2-letter combination may exist...',
    ...
}

The choice of an appropriate tool, and hints on possible algorithms, follow
from this.

> Thanks.

Denis
------
la vita e estrany
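
P.S. In case it helps you get started: here is a minimal sketch of the
regular-expression approach, assuming (from your sample only) that a
reference starts with a line like "<1>", that a field line is two capital
letters followed by "- ", and that any other non-blank line continues the
previous field. The names (REF_RE, FIELD_RE, parse) are mine, not from any
module.

import re

# Assumptions, taken from the sample: "<n>" opens a reference,
# "XX  - text" opens a field, anything else continues the current field.
REF_RE = re.compile(r'^<(\d+)>')
FIELD_RE = re.compile(r'^([A-Z]{2})\s+-\s+(.*)$')

def parse(lines):
    """Yield one dict per reference; repeated tags collect into a list."""
    record, tag = None, None
    for line in lines:
        line = line.rstrip('\n')
        if REF_RE.match(line):            # start of a new reference
            if record is not None:
                yield record
            record, tag = {}, None
        elif record is None:
            continue                      # skip text before the first <n>
        else:
            m = FIELD_RE.match(line)
            if m:                         # a new "XX - text" field
                tag, text = m.groups()
                if tag in record:         # repeated tag -> list of values
                    if not isinstance(record[tag], list):
                        record[tag] = [record[tag]]
                    record[tag].append(text)
                else:
                    record[tag] = text
            elif tag is not None and line.strip():
                # continuation line: append to the last value of the tag
                if isinstance(record[tag], list):
                    record[tag][-1] += '\n' + line.strip()
                else:
                    record[tag] += '\n' + line.strip()
    if record is not None:
        yield record

Running it on your sample:

sample = """<1>
AU  - some text
      perhaps across lines
AB  - some other text
AB  - there may be multiples of some fields
"""
for reference in parse(sample.splitlines()):
    print(reference)
# {'AU': 'some text\nperhaps across lines',
#  'AB': ['some other text', 'there may be multiples of some fields']}

This produces the dictionary shape sketched above; whether plain re is
enough, or pyparsing worth learning, depends on how regular the real
files are.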