How to efficiently extract information from structured text file
Hi,
I am trying to read object information from a text file (approx.
30,000 lines) with the following format, where each line corresponds
to a line in the text file. Currently, I read the whole file into a
list of strings using readlines(), then use a for loop to search for
"= {" and "};" to determine the Object, SubObject, and SubSubObject.
My questions are:
1) Is there an efficient way to search the whole list of strings to
find the locations of the tokens (such as '= {' or '};')?
2) Is there an efficient way to extract the object information that
you would suggest?
Thanks,
- Jeremy
= Structured text file =
Object1 = {
    ...
    SubObject1 = {
        SubSubObject1 = {
            ...
        };
    };
    SubObject2 = {
        SubSubObject21 = {
            ...
        };
    };
    SubObjectN = {
        SubSubObjectN = {
            ...
        };
    };
};
--
http://mail.python.org/mailman/listinfo/python-list
Re: How to efficiently extract information from structured text file
On Feb 16, 7:14 pm, Gary Herron wrote:
> Imaginationworks wrote:
> > Hi,
>
> > I am trying to read object information from a text file (approx.
> > 30,000 lines) with the following format, where each line
> > corresponds to a line in the text file. Currently, I read the whole
> > file into a list of strings using readlines(), then use a for loop
> > to search for "= {" and "};" to determine the Object, SubObject,
> > and SubSubObject. My questions are:
>
> > 1) Is there an efficient way to search the whole list of strings to
> > find the locations of the tokens (such as '= {' or '};')?
>
> Yes. Read the *whole* file into a single string using the file.read()
> method, and then search through the string using string methods (for
> simple things) or re, the regular expression module (for more
> complex searches).
>
> Note: There is a point where a file becomes large enough that reading
> the whole file into memory at once (either as a single string or as a
> list of strings) is foolish. However, 30,000 lines doesn't push that
> boundary.
>
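A minimal sketch of the approach Gary describes: read the file into one string, then locate tokens with str.find for the simple case or with a single re.finditer pass for all of them. The sample text, the token_re pattern, and the variable names are assumptions made for illustration and are not part of the original posts.

```python
import re

# Hypothetical sample standing in for the result of open(filename).read().
data = """Object1 = {
...
SubObject1 = {
SubSubObject1 = {
...
};
};
};
"""

# Simple case: str.find returns the offset of the first match, or -1.
first_open = data.find("= {")

# General case: one regex pass reports every opener and closer in order.
# An opener captures the object name; a closer is the literal "};".
token_re = re.compile(r"(\w+)\s*=\s*\{|\};")

tokens = []
for m in token_re.finditer(data):
    kind = "open" if m.group(1) else "close"
    tokens.append((kind, m.group(1), m.start()))

for kind, name, offset in tokens:
    print(kind, name, offset)
```

Each tuple holds the token kind, the object name (None for a closer), and the character offset in the string, so the positions can be used to slice out the text of any block.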
> > 2) Is there an efficient way to extract the object information that
> > you would suggest?
>
> Again, the re module has nice ways to find a pattern and parse out
> pieces of it. Building a good regular expression takes time,
> experience, and a bit of black magic... To do so for this case, we
> might need more knowledge of your format. Also, regular expressions
> have their limits. For instance, if the sub-objects can nest to any
> level, then regular expressions alone can't solve the whole problem,
> and you'll need a more robust parser.
>
Gary and Rhodri, thank you for the suggestions.
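
For nesting of arbitrary depth, one option that avoids regular expressions is a small stack-based line scanner. This is only a sketch, assuming every "Name = {" opener and every "};" closer sits on its own line as in the sample; parse_blocks and the dict layout are hypothetical names chosen here.

```python
def parse_blocks(lines):
    """Build nested dicts from 'Name = {' ... '};' blocks, one token per line."""
    root = {"name": None, "body": [], "children": []}
    stack = [root]  # stack[-1] is the block currently being filled
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.endswith("= {"):
            # Opener: create a child of the current block and descend into it.
            node = {"name": line[:-3].strip(), "body": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif line == "};":
            # Closer: return to the enclosing block.
            if len(stack) == 1:
                raise ValueError("unmatched '};' on line %d" % lineno)
            stack.pop()
        else:
            # Any other line is kept as raw content of the current block.
            stack[-1]["body"].append(line)
    if len(stack) != 1:
        raise ValueError("unclosed block: %s" % stack[-1]["name"])
    return root["children"]


sample = """Object1 = {
...
SubObject1 = {
SubSubObject1 = {
...
};
};
SubObject2 = {
SubSubObject21 = {
...
};
};
};
"""

objects = parse_blocks(sample.splitlines())
print([child["name"] for child in objects[0]["children"]])
```

Because the stack depth tracks the nesting level, the same loop handles any depth, and the scan is a single pass over the lines already produced by readlines().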
Re: How to efficiently extract information from structured text file
On Feb 17, 1:40 pm, Paul McGuire wrote:
> On Feb 16, 5:48 pm, Imaginationworks wrote:
>
> > Hi,
>
> > I am trying to read object information from a text file (approx.
> > 30,000 lines) with the following format, where each line
> > corresponds to a line in the text file. Currently, I read the whole
> > file into a list of strings using readlines(), then use a for loop
> > to search for "= {" and "};" to determine the Object, SubObject,
> > and SubSubObject.
>
> If you open(filename).read() this file into a variable named data, the
> following pyparsing parser will pick out your nested brace
> expressions:
>
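> from pyparsing import *
>
> # Punctuation is matched but suppressed from the results.
> EQ,LBRACE,RBRACE,SEMI = map(Suppress,"={};")
> ident = Word(alphas, alphanums)
> # Forward declaration, since contents and defn refer to each other.
> contents = Forward()
> defn = Group(ident + EQ + Group(LBRACE + contents + RBRACE + SEMI))
>
> # A block holds nested definitions or any other non-brace word.
> contents <<= ZeroOrMore(defn | ~(LBRACE|RBRACE) + Word(printables))
>
> results = defn.parseString(data)
>
> print(results)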
>
> Prints:
>
> [
>   ['Object1',
>     ['...',
>       ['SubObject1',
>         ['',
>           ['SubSubObject1',
>             ['...']
>           ]
>         ]
>       ],
>       ['SubObject2',
>         ['',
>           ['SubSubObject21',
>             ['...']
>           ]
>         ]
>       ],
>       ['SubObjectN',
>         ['',
>           ['SubSubObjectN',
>             ['...']
>           ]
>         ]
>       ]
>     ]
>   ]
> ]
>
> -- Paul
Wow, that is great! Thanks
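
Once the parser returns nested lists, a short recursive walk can flatten them into dotted paths. The sketch below works on a hand-written list in the same [name, [items...]] shape rather than on live pyparsing output, so it runs without pyparsing installed. iter_paths and the sample list are illustrative assumptions; with real results you would presumably start from results.asList()[0].

```python
def iter_paths(node, prefix=()):
    """Yield (path, values) for a [name, [items...]] node and its descendants."""
    name, items = node
    path = prefix + (name,)
    # Plain strings are the block's own content; lists are nested definitions.
    values = [item for item in items if isinstance(item, str)]
    yield path, values
    for item in items:
        if isinstance(item, list):
            yield from iter_paths(item, path)


# Hypothetical stand-in for one parsed top-level definition.
nested = ['Object1',
          ['...',
           ['SubObject1', [['SubSubObject1', ['...']]]],
           ['SubObject2', [['SubSubObject21', ['...']]]]]]

paths = list(iter_paths(nested))
for path, values in paths:
    print(".".join(path), values)
```

The walk is depth-first, so each object is reported before its sub-objects, which makes it easy to load the pairs into a dict keyed by path.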
