Shashwat Anand dixit:
> @Alan, @Lie thanks
> The approach which I am taking right now is taking some test-cases, and
> creating rules for them. Later on, after expanding the cases, there arose
> some cases which didn't follow the earlier pattern, so I tweaked some rules so
> as to match all of them.
The other advantage of an error file is that if
your rules become very complex it's a lot faster
to only apply them to the entries that didn't
match the simpler rules.
In other words instead of applying a complex
set of rules to every one of 2 million entries
you might only have to apply them
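
A rough sketch of that two-tier idea in Python, for what it's worth. The two patterns and the sample records are made up for illustration (they are not the rules or data from this thread), and `parse_in_tiers` is just a hypothetical helper name. The cheap rule runs on every record; the costlier rule only sees what the cheap one could not handle.

```python
import re

# Hypothetical rules, invented for illustration only.
SIMPLE = re.compile(r"^(?P<name>[A-Za-z ]+), (?P<city>[A-Za-z ]+)$")
COMPLEX = re.compile(r"^(?P<name>[A-Za-z .&]+?)\s*[;|/]\s*(?P<city>[A-Za-z ]+)$")


def parse_in_tiers(records):
    """Try the cheap rule on everything, the costly rule only on the leftovers."""
    parsed, leftovers = [], []
    for rec in records:
        m = SIMPLE.match(rec)
        if m:
            parsed.append(m.groupdict())
        else:
            leftovers.append(rec)

    # Second pass: only the records the simple rule rejected.
    still_failing = []
    for rec in leftovers:
        m = COMPLEX.match(rec)
        if m:
            parsed.append(m.groupdict())
        else:
            still_failing.append(rec)
    return parsed, still_failing
```

Whatever ends up in `still_failing` is the material to inspect by hand when deciding which rule to add next.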
I am doing almost the same thing, i.e. giving the values left unparsed a certain
name, 'NIL', and currently I'm redirecting output to a text file. Searching
for 'NIL' tells me where my match failed, although writing it separately to
a different file didn't occur to me. And yes the job is to reduce as
"Shashwat Anand" wrote
as to match all of them. The task is time-consuming but with every new
test set the exceptions are becoming fewer and fewer. (There are .2 million
such pages)
One final thing to try is to identify records where you *failed* to find
a match and rewrite them into an error f
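
Something along these lines might do for collecting the failures, as a minimal sketch only. The dd/mm/yyyy date pattern is an assumption (the real format of the pages is not shown here), and `split_matches` and the error-file path are hypothetical names.

```python
import re

# Hypothetical date rule (dd/mm/yyyy), invented for illustration.
DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def split_matches(records, error_path):
    """Return parsed dates; write every record that did not match to error_path."""
    parsed = []
    with open(error_path, "w", encoding="utf-8") as errors:
        for rec in records:
            m = DATE.search(rec)
            if m:
                parsed.append(m.groups())
            else:
                # Keep the raw record so it can be re-read and re-parsed later.
                errors.write(rec.rstrip("\n") + "\n")
    return parsed
```

The error file can then be fed back through the same function after each rule change, so it shrinks as the rules improve.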
@Alan, @Lie thanks
The approach which I am taking right now is taking some test-cases, and
creating rules for them. Later on, after expanding the cases, there arose
some cases which didn't follow the earlier pattern, so I tweaked some rules so
as to match all of them. The task is time-consuming but wi
On 1/3/2010 4:58 PM, Shashwat Anand wrote:
I need to extract some meaningful data from garbage.
Here are four examples. I need to get the date, company name and address
from these.
For the date I used a regex, but I'm unable to find any definite pattern for
the address and company name.
the format is more or le
"Shashwat Anand" wrote
> here are the examples : http://codepad.org/wF8APZV3
>
>> I need to extract some meaningful data from garbage.
>> How should I parse info if I'm not certain of any definite rules. This is
>> my first time dealing with real-life data.
Unfortunately to parse it you will