William Martell <[EMAIL PROTECTED]> wrote:
============================================================
NOTE: This response is long. If you want to ask a question
be responsible and delete anything that is not
relevant to your question. Do NOT top post and repeat
the entire message. Do NOT use the Outlook Excuse(TM).
============================================================
: Hello Mr. Clarkson,
That's my father's name. Please call me Charles. :)
: When I run my code on the attached test data file, I get
: this.
:
: [RESULT]
: "
: price: $825","beds: 3","baths: 1","
: price: $825","phone: 214-821-2794","address: 2731 San ...
: View ","paper: The Dallas Morning News", ...
[snipped some]
: price: $975","beds: 4","baths: 2","
: price: $975","phone: 469-569-0099","location: ...
: Morning News","date: 04/04/2004",
:
: [/RESULT]
:
: My desired result is this, (I want to be able to import
: this data into excel)
:
: [DESIRED]
: price: beds: bath: phone: address: location: paper: date:
: data, data, data, data, data, data, data, data
: [/DESIRED]
There are a lot of ways to approach this. Two come to
mind immediately. We could print each line as you complete
an advertisement or we could collect all the information and
print the lines at once. We'll look at first of these
approaches below.
: The problem is that I am getting duplicate matches with my
: regular expressions
Yes, I noticed that. I suggest a different way to view
your data below.
: and I am not getting the same fields on each record.
We'll handle this with default fields.
: I would like to be able to import this data in excel from
: a csv text file.
For our other readers:
A csv text file is a file where each field is in double
quotes and separated by a comma.
Our program could be broken down this way:
1. extract fields from text file for one ad.
2. print that ad as a csv line.
3. rinse and repeat.
============================================================
NOTE: This response is long. If you want to ask a question
be responsible and delete anything that is not
relevant to your question. Do NOT top post and quote
the entire message.
============================================================
I'm going to approach this from the csv file end first.
Let's pretend that we already have extracted the fields from
an ad.
To make everything easier (on me), I'm going to use a
hash to hold an ad.
my %ad = (
location => 'Mesquite',
price => 725,
bedrooms => 3,
bathrooms => 1,
living_areas => '',
phone => '972-289-2098',
arrangement => '3/2/2',
address => '',
paper => 'The Dallas Morning News',
date => '04/08/2004'
);
Now we will create a subroutine that takes %ad and
converts it to a csv line. Feel free to ask questions
about anything you don't understand. (Be certain to
trim this response to save repeating the whole thing.)
print csv( \%ad ), "\n";
sub csv {
my $ad = shift;
# if you need to change field order
# do it here
my @field_order = qw| price bedrooms bathrooms phone
address arrangement
location paper date |;
# add double quotes (")
foreach my $value ( values %$ad ) {
$value = qq("$value");
}
# use a hash slice to order the fields
return join ',', @$ad{ @field_order };
}
One advantage to csv() is the ease of adding column
names to the csv file:
# column names
print csv( {
date => 'Date',
price => 'Price',
bedrooms => 'Bedrooms',
bathrooms => 'Bathrooms',
living_areas => 'Living_areas',
phone => 'Phone',
address => 'Address',
arrangement => 'Arrangement',
location => 'Location',
paper => 'Paper', } ), "\n";
============================================================
NOTE: This response is long. If you want to ask a question
be responsible and delete anything that is not
relevant to your question. Do NOT top post and repeat
the entire message.
============================================================
Now let's look at how to get the ads into a hash. Here's
two ads for those just joining us:
----------------------------------------------------------------------------
----
Ads basket. Check to add. Uncheck to remove.
Price: $725 BD: 3 BA: 1 LA:
Home 3bd-1ba-1,ch/a new carpet & paint fence stove $725+dep. 972-289-2098
Mesquite - The Dallas Morning News - 04/08/2004
----------------------------------------------------------------------------
----
Ads basket. Check to add. Uncheck to remove.
Price: $950 BD: 3 BA: 1.5 LA:
Home 3bd/1.5ba/2 1620 Summitt $950+dep Sec 8ok 469-525-0204; 214-957-8176
Mesquite - The Dallas Morning News - 04/08/2004
At first glance the 80 '-' dashes look like the start to
a new advertisement. But someone might enter that into the
description field. So let's use the Price line as the first
line of each ad.
For this solution I will place the data below the
__END__ tag in the script. This allows us to retrieve the
data with the special file handle DATA. I won't add the data
and the sub to each rewrite of the script to save space. In
fact, let's just rewrite the loop only.
We start with a simple loop. (It doesn't do anything
yet!):
# column names
print csv( {
date => 'Date',
price => 'Price',
bedrooms => 'Bedrooms',
bathrooms => 'Bathrooms',
living_areas => 'Living_areas',
phone => 'Phone',
address => 'Address',
arrangement => 'Arrangement',
location => 'Location',
paper => 'Paper', } ), "\n";
while ( <DATA> ) {
}
sub csv {
my $ad = shift;
# if you need to change field order
# do it here
my @field_order = qw| price bedrooms bathrooms
phone address location
paper date |;
# add double quotes (")
foreach my $value ( values %$ad ) {
$value = qq("$value");
}
# use a hash slice to order the fields
return join ',', @$ad{ @field_order };
}
__END__
----------------------------------------------------------------------------
----
Ads basket. Check to add. Uncheck to remove.
Price: $725 BD: 3 BA: 1 LA:
Home 3bd-1ba-1,ch/a new carpet & paint fence stove $725+dep. 972-289-2098
Mesquite - The Dallas Morning News - 04/08/2004
----------------------------------------------------------------------------
----
Ads basket. Check to add. Uncheck to remove.
Price: $950 BD: 3 BA: 1.5 LA:
Home 3bd/1.5ba/2 1620 Summitt $950+dep Sec 8ok 469-525-0204; 214-957-8176
Mesquite - The Dallas Morning News - 04/08/2004
============================================================
NOTE: This response is long. If you want to ask a question
be responsible and delete anything that is not
relevant to your question. Do NOT top post and repeat
the entire message.
============================================================
Now let's rewrite the loop to check for a new ad:
while ( <DATA> ) {
print "new ad\n" if /^Price/;
}
I get this (the first line may wrap):
"Price","Bedrooms","Bathrooms","Phone","Address","Location","Paper","Date"
new ad
new ad
That seems to be the correct number of ads. So now we
know when we are starting a new ad. Let's assume that each
in each ad we can't count on any field existing. So it is
possible to get an ad that looks like this:
{
date => '',
price => '',
bedrooms => '',
bathrooms => '',
living_areas => '',
phone => '',
address => '',
arrangement => '',
location => '',
paper => '',
}
Let's assume that we don't want to include such an ad in
our csv file. We'll create a function that tells us if %ad
has any values that are filled in. We call it is_valid_ad()
and it will return true if the ad has at least one field
filled in and false otherwise.
sub is_valid_ad {
my $ad = shift;
foreach $value ( values @$ad ) {
return 1 unless $value eq '';
}
return 0;
}
There are a lot of other ways to write this, but we'll
use this for now. This part of our program might look like
this now:
my %ad;
while ( <DATA> ) {
if ( /^Price/ ) {
# add the last ad if it exists
print csv( \%ad ), "\n" if is_valid_ad( \%ad );
# reset ad
%ad = (
date => '',
price => '',
bedrooms => '',
bathrooms => '',
living_areas => '',
phone => '',
address => '',
arrangement => '',
location => '',
paper => '',
);
}
}
sub is_valid_ad {
my $ad = shift;
foreach my $value ( values %$ad ) {
# return true if one field is valid
return 1 unless $value eq '';
}
# return false if we get here
return;
}
Notice that this neatly bypasses printing an ad on the
first pass.
There are now a whole bunch smarty pants readers with
there hand in the air shouting "Oooh, Oooh, Oooh!". The
problem with our loop is the last ad won't print. We need to
add a final after the while loop:
print csv( \%ad ), "\n" if is_valid_ad( \%ad );
[Please put your hands down!]
In the original data provided, the Home section can
sometimes be multiple lines.
We are going to skip extacting the address because it
would probably a depend ona routine bordering on Artificial
Intelligence to be 99% accurate.
The same goes for 'arrangement' actually. Being a real
estate investor as well as a perl programmer, I can tell you
that each person has a different idea about what 3/2/2/3
means. It also changes by locale.
For this response, let's skip those two fields
completely, it is up to the reader to find a reliable method
of extracting them. We will include those fields in the
script, we just won't fill them in.
Finally, we are assuming that we have permission from
the newspaper to do this or that we are doing it for
personal use only.
Let's look at the regex needed to get the price and
number of rooms first. Remember that any field could be
blank and that each field and name are separated by white
space. I am going to take a longer approach to this regex.
Readers might find it inefficient. Feel free to play with it
and post it to the list. (Remember not to quote this entire
message though.)
Here's a script I wrote to test the "price" regex:
#!/usr/bin/perl
use strict;
use warnings;
#use diagnostics;
use Data::Dumper 'Dumper';
my @test = (
'Price: $725 BD: 3 BA: 1 LA:',
'Price: BD: 3 BA: 1 LA: 2',
'Price: $725 BD: 3 BA: 1 LA: 2',
'Price: $725 BD: 3 BA: LA:',
);
foreach my $sample ( @test ) {
my @fields = split /Price:|BD:|BA:|LA:/, $sample;
print Dumper [EMAIL PROTECTED];
}
__END__
It came up with this output (I edited it to save space):
$VAR1 = [ '', ' $725 ', ' 3 ', ' 1 ' ];
$VAR1 = [ '', ' ', ' 3 ', ' 1 ', ' 2' ];
$VAR1 = [ '', ' $725 ', ' 3 ', ' 1 ', ' 2' ];
$VAR1 = [ '', ' $725 ', ' 3 ' ];
That's close but I found this better:
foreach my $sample ( @test ) {
$sample =~ s/ //g;
my @fields = split /Price:|BD:|BA:|LA:/, $sample;
shift @fields;
push @fields, ( '' ) x ( 4 - @fields );
print Dumper [EMAIL PROTECTED];
}
$VAR1 = [ '$725', '3', '1', '' ];
$VAR1 = [ '', '3', '1', '2' ];
$VAR1 = [ '$725', '3', '1', '2' ];
$VAR1 = [ '$725', '3', '', '' ];
Let's make this a subroutine and test it:
foreach my $sample ( @test ) {
print Dumper [ price_fields( $sample ) ];
}
sub price_fields {
my $line = shift;
# collapse all spaces
$line =~ s/\s//g;
my @fields = split /Price:|BD:|BA:|LA:/, $line;
# delete extra first field
shift @fields;
# pad missing fields
push @fields, ( '' ) x ( 4 - @fields );
return @fields;
}
__END__
I get the same result. Now let's add that to our script:
my %ad;
while ( <DATA> ) {
if ( /^Price/ ) {
# add the last ad if it exists
print csv( \%ad ), "\n" if is_valid_ad( \%ad );
# Set ad defaults
%ad = (
date => '',
price => '',
bedrooms => '',
bathrooms => '',
living_areas => '',
phone => '',
address => '',
arrangement => '',
location => '',
paper => '',
);
# update %ad for this line
@ad{ qw| price bedrooms bathrooms living_areas | }
= price_fields( $_ );
}
}
sub price_fields {
my $line = shift;
# collapse all spaces
$line =~ s/\s//g;
my @fields = split /Price:|BD:|BA:|LA:/, $line;
# delete extra first field
shift @fields;
# pad missing fields
push @fields, ( '' ) x ( 4 - @fields );
return @fields;
}
As I mentioned earlier we will be skipping the 'Home'
line so I can cheat. :) So let's add that:
my %ad;
while ( <DATA> ) {
if ( /^Price/ ) {
# add the last ad if it exists
print csv( \%ad ), "\n" if is_valid_ad( \%ad );
# Set ad defaults
%ad = (
date => '',
price => '',
bedrooms => '',
bathrooms => '',
living_areas => '',
phone => '',
address => '',
arrangement => '',
location => '',
paper => '',
);
# update %ad for this line
@ad{ qw| price bedrooms bathrooms living_areas | }
= price_fields( $_ );
}
if ( /^Home/ ) {
@ad{ qw| address arrangement | } = home_fields( $_ );
}
}
sub home_fields {
# This is a dummy function
return ( '', '' );
}
And now for the location line. I used this regex. The x
option allows us to add white space:
m/
(\S+) # location ($1)
\s+-\s+
([^-]+) # paper with trailing spaces ($2)
\s+
(\d\d\/\d\d\/\d{4}) # date ($3)
/x
So we add the following to our loop:
if (
m/
(\S+) # location ($1)
\s+-\s+
([^-]+) # paper and spaces ($2)
-\s+
(\d\d\/\d\d\/\d{4}) # date ($3)
/x ) {
# use hash slice to set fields
@ad{ qw| location paper date | } = ( $1, $2, $3 );
# trim trailing spaces in paper field
$ad{paper} =~ s/\s+$//;
}
Boy, that looks messy!
But it works! :)
Let's take a final look at the script and let's look at
some ways to improve future script maintenance.
#!/usr/bin/perl
use strict;
use warnings;
#use diagnostics;
use Data::Dumper 'Dumper';
# column names
print csv( {
date => 'Date',
price => 'Price',
bedrooms => 'Bedrooms',
bathrooms => 'Bathrooms',
living_areas => 'Living_areas',
phone => 'Phone',
address => 'Address',
location => 'Location',
arrangement => 'Arrangemant',
paper => 'Paper', } ), "\n";
my %ad;
while ( <DATA> ) {
if ( /^Price/ ) {
# add the last ad if it exists
print csv( \%ad ), "\n" if is_valid_ad( \%ad );
# Set field defaults
%ad = (
date => '',
price => '',
bedrooms => '',
bathrooms => '',
living_areas => '',
phone => '',
address => '',
arrangement => '',
location => '',
paper => '',
);
# update %ad for this line
@ad{ qw| price bedrooms bathrooms living_areas | }
= price_fields( $_ );
}
if ( /^Home/ ) {
@ad{ qw| address arrangement | } = home_fields( $_ );
}
if (
m/
(\S+) # location ($1)
\s+-\s+
([^-]+) # paper with spaces ($2)
-\s+
(\d\d\/\d\d\/\d{4}) # date ($3)
/x
) {
# use hash slice to set fields
@ad{ qw| location paper date | } = ( $1, $2, $3 );
# trim trailing spaces in paper field
$ad{paper} =~ s/\s+$//;
}
}
print csv( \%ad ), "\n" if is_valid_ad( \%ad );
sub home_fields {
# This is a dummy function
return ( '', '' );
}
sub price_fields {
my $line = shift;
# collapse all spaces
$line =~ s/\s//g;
my @fields = split /Price:|BD:|BA:|LA:/, $line;
# delete extra first field
shift @fields;
# pad missing fields
push @fields, ( '' ) x ( 4 - @fields );
return @fields;
}
sub is_valid_ad {
my $ad = shift;
foreach my $value ( values %$ad ) {
# return true if one field is valid
return 1 unless $value eq '';
}
# return false if we get here
return;
}
sub csv {
my $ad = shift;
# if you need to change field order
# do it here
my @field_order = qw| price bedrooms bathrooms phone
address arrangement
location paper date |;
# add double quotes (")
foreach my $value ( values %$ad ) {
$value = qq("$value");
}
# use a hash slice to order the fields
return join ',', @$ad{ @field_order };
}
__END__
----------------------------------------------------------------------------
----
Ads basket. Check to add. Uncheck to remove.
Price: $725 BD: 3 BA: 1 LA:
Home 3bd-1ba-1,ch/a new carpet & paint fence stove $725+dep. 972-289-2098
Mesquite - The Dallas Morning News - 04/08/2004
----------------------------------------------------------------------------
----
Ads basket. Check to add. Uncheck to remove.
Price: $950 BD: 3 BA: 1.5 LA:
Home 3bd/1.5ba/2 1620 Summitt $950+dep Sec 8ok 469-525-0204; 214-957-8176
Mesquite - The Dallas Morning News - 04/08/2004
Here's the output I got:
"Price","Bedrooms","Bathrooms","Phone","Address","Arrangemant","Location","P
aper","Date"
"$725","3","1","","","","Mesquite","The Dallas Morning News","04/08/2004"
"$950","3","1.5","","","","Mesquite","The Dallas Morning News","04/08/2004"
Right off we can see your intrepid instructor forgot the
'phone' field. I'll leave that up to you. The other thing I
notice is that we need to update too many areas to add or
change fields. That could really suck down the line, but
this reply is pushing 750 lines, so I'll give everyone a
break and shut up. :)
HTH,
Charles K. Clarkson
--
Mobile Homes Specialist
254 968-8328
============================================================
NOTE: This response is long. If you want to ask a question
be responsible and delete anything that is not
relevant to your question. Do NOT top post and repeat
the entire message. Do NOT use the Outlook Excuse(TM).
============================================================
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>