Re: [Tutor] extract meaningful data from garbage

2010-01-03 Thread spir
Shashwat Anand wrote:
> @Alan, @Lie thanks
> The approach which I am taking right now is taking some test-cases, and
> creating rules for them. Later on, after expanding the cases, there arose
> some cases which didn't follow the earlier pattern, so I tweaked some rules so
> as to match all of them.

Re: [Tutor] extract meaningful data from garbage

2010-01-03 Thread ALAN GAULD
The other advantage of an error file is that if your rules become very complex, it's a lot faster to apply them only to the entries that didn't match the simpler rules. In other words, instead of applying a complex set of rules to every one of 2 million entries you might only have to apply them
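A minimal sketch of that two-pass idea, for illustration only. The two date patterns and the names `simple_rule`, `complex_rule` and `two_pass` are assumptions, not rules taken from the thread; the point is just that the expensive rule runs only on entries the cheap rule rejected.

```python
import re

# Hypothetical rules: neither pattern comes from the thread's data.
SIMPLE_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
COMPLEX_DATE = re.compile(
    r"\b(\d{1,2})(?:st|nd|rd|th)?\s+"
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?,?\s+"
    r"(\d{4})\b",
    re.IGNORECASE,
)
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def simple_rule(entry):
    """Cheap rule: an ISO-style date, returned as found."""
    match = SIMPLE_DATE.search(entry)
    return match.group(0) if match else None


def complex_rule(entry):
    """Costlier rule: a written-out date, normalised to YYYY-MM-DD."""
    match = COMPLEX_DATE.search(entry)
    if match is None:
        return None
    day, month, year = match.groups()
    return "%s-%02d-%02d" % (year, MONTHS[month.lower()], int(day))


def two_pass(entries):
    """Apply the simple rule to everything, the complex rule to failures only."""
    parsed, failed = {}, []
    for entry in entries:
        result = simple_rule(entry)
        if result is None:
            failed.append(entry)
        else:
            parsed[entry] = result

    still_failed = []
    for entry in failed:  # usually far fewer entries than the full set
        result = complex_rule(entry)
        if result is None:
            still_failed.append(entry)
        else:
            parsed[entry] = result
    return parsed, still_failed
```

Whatever is left in `still_failed` is the material to inspect by hand when writing the next rule.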

Re: [Tutor] extract meaningful data from garbage

2010-01-03 Thread Shashwat Anand
I am doing almost the same thing, i.e. giving the values left unparsed a certain name, 'NIL', and currently I'm redirecting output to a text file. Searching for 'NIL' tells me where my match failed, although writing them separately to a different file didn't occur to me. And yes the job is to reduce as
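A small sketch of the 'NIL' placeholder approach described here. The field names follow the date, company name and address mentioned in the thread; the pipe-separated output format and the function names are assumptions made for the example.

```python
NIL = "NIL"
FIELDS = ("date", "company", "address")


def format_record(record):
    """Render one record, substituting NIL for any missing or empty field."""
    return "|".join(record.get(field) or NIL for field in FIELDS)


def lines_with_nil(lines):
    """Return (line_number, line) pairs for lines holding at least one NIL field."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if NIL in line.split("|")
    ]
```

Comparing whole fields rather than searching for the substring avoids false hits on values that merely contain the letters "NIL".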

Re: [Tutor] extract meaningful data from garbage

2010-01-03 Thread Alan Gauld
"Shashwat Anand" wrote as to match all of them. The task is time-consuming but with every new test-sets exceptions are becoming less and less. (There are .2 million such pages) One final thing to try is to identify records where you *failed* to find a match and re write them into an error f

Re: [Tutor] extract meaningful data from garbage

2010-01-03 Thread Shashwat Anand
@Alan, @Lie thanks. The approach which I am taking right now is taking some test-cases, and creating rules for them. Later on, after expanding the cases, there arose some cases which didn't follow the earlier pattern, so I tweaked some rules so as to match all of them. The task is time-consuming but wi
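One way to keep tweaked rules from breaking earlier cases is to store the test cases alongside an ordered rule list and re-run them after every change. This is only a sketch: the two company-name patterns and the sample strings are invented for illustration and do not come from the actual data.

```python
import re

# Hypothetical rules, tried in order; the first match wins.
RULES = [
    ("after_label", re.compile(r"Company:\s*(?P<company>[^,;]+)")),
    ("legal_suffix", re.compile(
        r"(?P<company>(?:[A-Z][\w&]*\s)+(?:Ltd|Inc|LLC)\.?)")),
]

# Invented test cases: input text mapped to the expected company name.
TEST_CASES = {
    "Company: Acme Widgets, 12 High St": "Acme Widgets",
    "Filed by Globex Inc. on 3 Jan": "Globex Inc.",
}


def extract_company(text):
    """Return (rule_name, company), or (None, None) if no rule matches."""
    for name, pattern in RULES:
        match = pattern.search(text)
        if match:
            return name, match.group("company").strip()
    return None, None


def failing_cases(cases):
    """List the inputs whose extracted company differs from the expected one."""
    return [
        text for text, expected in cases.items()
        if extract_company(text)[1] != expected
    ]
```

After each rule tweak, an empty result from `failing_cases(TEST_CASES)` suggests the earlier patterns are still handled.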

Re: [Tutor] extract meaningful data from garbage

2010-01-03 Thread Lie Ryan
On 1/3/2010 4:58 PM, Shashwat Anand wrote:
> I need to extract some meaningful data from garbage. Here are four examples.
> I need to get date, company name and address from these. For the date I used
> regex but I'm unable to find any definite pattern for address and company name
> the format is more or le

Re: [Tutor] extract meaningful data from garbage

2010-01-03 Thread Alan Gauld
"Shashwat Anand" wrote > here are the examples : http://codepad.org/wF8APZV3 > >> I need to extract some meaningful data from grabages. >> How should I parse info if I'm not certain of any definite rules. This is >> my first time dealing with real-life data. Unfortunarely to parse it you will