RE: Help with Variable Scope, Regular Expressions

Charles K. Clarkson Thu, 08 Apr 2004 20:54:31 -0700

William Martell <[EMAIL PROTECTED]> wrote:

============================================================
NOTE: This response is long. If you want to ask a question
      be responsible and delete anything that is not
      relevant to your question. Do NOT top post and repeat
      the entire message. Do NOT use the Outlook Excuse(TM).
============================================================


: Hello Mr. Clarkson,

    That's my father's name. Please call me Charles. :)


: When I run my code on the attached test data file, I get
: this.
:
: [RESULT]
: "
: price: $825","beds: 3","baths: 1","
: price: $825","phone: 214-821-2794","address: 2731 San ...
: View ","paper: The Dallas Morning News", ...
[snipped some]
: price: $975","beds: 4","baths: 2","
: price: $975","phone: 469-569-0099","location: ...
: Morning News","date: 04/04/2004",
:
: [/RESULT]
:
: My desired result is this,  (I want to be able to import
: this data into excel)
:
: [DESIRED]
: price: beds: bath: phone: address: location: paper: date:
: data,  data, data, data,  data,    data,     data,  data
: [/DESIRED]

    There are a lot of ways to approach this. Two come to
mind immediately. We could print each line as you complete
an advertisement or we could collect all the information and
print the lines at once. We'll look at first of these
approaches below.


: The problem is that I am getting duplicate matches with my
: regular expressions

    Yes, I noticed that. I suggest a different way to view
your data below.


: and I am not getting the same fields on each record.

    We'll handle this with default fields.


: I would like to be able to import this data in excel from
: a csv text file.

    For our other readers:
    A csv text file is a file where each field is in double
quotes and separated by a comma.

    Our program could be broken down this way:

    1. extract fields from text file for one ad.
    2. print that ad as a csv line.
    3. rinse and repeat.

============================================================
NOTE: This response is long. If you want to ask a question
      be responsible and delete anything that is not
      relevant to your question. Do NOT top post and quote
      the entire message.
============================================================

    I'm going to approach this from the csv file end first.
Let's pretend that we already have extracted the fields from
an ad.

    To make everything easier (on me), I'm going to use a
hash to hold an ad.

my %ad = (
    location        => 'Mesquite',
    price           => 725,
    bedrooms        => 3,
    bathrooms       => 1,
    living_areas    => '',
    phone           => '972-289-2098',
    arrangement     => '3/2/2',
    address         => '',
    paper           => 'The Dallas Morning News',
    date            => '04/08/2004'
);

    Now we will create a subroutine that takes %ad and
converts it to a csv line. Feel free to ask questions
about anything you don't understand. (Be certain to
trim this response to save repeating the whole thing.)

print csv( \%ad ), "\n";

sub csv {
    my $ad = shift;

    # if you need to change field order
    # do it here
    my @field_order = qw|   price bedrooms bathrooms phone
                            address arrangement
                            location paper date |;

    # add double quotes (")
    foreach my $value ( values %$ad ) {
        $value = qq("$value");
    }

    # use a hash slice to order the fields
    return join ',', @$ad{ @field_order };
}

    One advantage to csv() is the ease of adding column
names to the csv file:

# column names
print csv( {
    date            => 'Date',
    price           => 'Price',
    bedrooms        => 'Bedrooms',
    bathrooms       => 'Bathrooms',
    living_areas    => 'Living_areas',
    phone           => 'Phone',
    address         => 'Address',
    arrangement     => 'Arrangement',
    location        => 'Location',
    paper           => 'Paper',         } ), "\n";


============================================================
NOTE: This response is long. If you want to ask a question
      be responsible and delete anything that is not
      relevant to your question. Do NOT top post and repeat
      the entire message.
============================================================

    Now let's look at how to get the ads into a hash. Here's
two ads for those just joining us:

----------------------------------------------------------------------------
----
 Ads basket. Check to add. Uncheck to remove.

Price: $725     BD: 3     BA: 1     LA:
Home 3bd-1ba-1,ch/a new carpet & paint fence stove $725+dep. 972-289-2098

Mesquite  -  The Dallas Morning News  -  04/08/2004


----------------------------------------------------------------------------
----
 Ads basket. Check to add. Uncheck to remove.

Price: $950     BD: 3     BA: 1.5     LA:
Home 3bd/1.5ba/2 1620 Summitt $950+dep Sec 8ok 469-525-0204; 214-957-8176

Mesquite  -  The Dallas Morning News  -  04/08/2004

    At first glance the 80 '-' dashes look like the start to
a new advertisement. But someone might enter that into the
description field. So let's use the Price line as the first
line of each ad.

    For this solution I will place the data below the
__END__ tag in the script. This allows us to retrieve the
data with the special file handle DATA. I won't add the data
and the sub to each rewrite of the script to save space. In
fact, let's just rewrite the loop only.

    We start with a simple loop. (It doesn't do anything
yet!):

# column names
print csv( {
    date            => 'Date',
    price           => 'Price',
    bedrooms        => 'Bedrooms',
    bathrooms       => 'Bathrooms',
    living_areas    => 'Living_areas',
    phone           => 'Phone',
    address         => 'Address',
    arrangement     => 'Arrangement',
    location        => 'Location',
    paper           => 'Paper',         } ), "\n";

while ( <DATA> ) {
}

sub csv {
    my $ad = shift;

    # if you need to change field order
    # do it here
    my @field_order = qw|   price bedrooms bathrooms
                            phone address location
                            paper date |;

    # add double quotes (")
    foreach my $value ( values %$ad ) {
        $value = qq("$value");
    }

    # use a hash slice to order the fields
    return join ',', @$ad{ @field_order };
}


__END__
----------------------------------------------------------------------------
----
 Ads basket. Check to add. Uncheck to remove.

Price: $725     BD: 3     BA: 1     LA:
Home 3bd-1ba-1,ch/a new carpet & paint fence stove $725+dep. 972-289-2098

Mesquite  -  The Dallas Morning News  -  04/08/2004



----------------------------------------------------------------------------
----
 Ads basket. Check to add. Uncheck to remove.

Price: $950     BD: 3     BA: 1.5     LA:
Home 3bd/1.5ba/2 1620 Summitt $950+dep Sec 8ok 469-525-0204; 214-957-8176

Mesquite  -  The Dallas Morning News  -  04/08/2004

============================================================
NOTE: This response is long. If you want to ask a question
      be responsible and delete anything that is not
      relevant to your question. Do NOT top post and repeat
      the entire message.
============================================================

    Now let's rewrite the loop to check for a new ad:

while ( <DATA> ) {
    print "new ad\n" if /^Price/;
}

    I get this (the first line may wrap):

"Price","Bedrooms","Bathrooms","Phone","Address","Location","Paper","Date"
new ad
new ad


    That seems to be the correct number of ads. So now we
know when we are starting a new ad. Let's assume that each
in each ad we can't count on any field existing. So it is
possible to get an ad that looks like this:

{
    date            => '',
    price           => '',
    bedrooms        => '',
    bathrooms       => '',
    living_areas    => '',
    phone           => '',
    address         => '',
    arrangement     => '',
    location        => '',
    paper           => '',
}

    Let's assume that we don't want to include such an ad in
our csv file. We'll create a function that tells us if %ad
has any values that are filled in. We call it is_valid_ad()
and it will return true if the ad has at least one field
filled in and false otherwise.

sub is_valid_ad {
    my $ad = shift;
    foreach $value ( values @$ad ) {
        return 1 unless $value eq '';
    }
    return 0;
}

    There are a lot of other ways to write this, but we'll
use this for now. This part of our program might look like
this now:

my %ad;
while ( <DATA> ) {

    if ( /^Price/ ) {

        # add the last ad if it exists
        print csv( \%ad ), "\n" if is_valid_ad( \%ad );

        # reset ad
        %ad = (
            date            => '',
            price           => '',
            bedrooms        => '',
            bathrooms       => '',
            living_areas    => '',
            phone           => '',
            address         => '',
            arrangement     => '',
            location        => '',
            paper           => '',
        );

    }
}

sub is_valid_ad {
    my $ad = shift;
    foreach my $value ( values %$ad ) {
        
        # return true if one field is valid 
        return 1 unless $value eq '';
    }
    
    # return false if we get here
    return;
}

    Notice that this neatly bypasses printing an ad on the
first pass.

    There are now a whole bunch smarty pants readers with
there hand in the air shouting "Oooh, Oooh, Oooh!". The
problem with our loop is the last ad won't print. We need to
add a final after the while loop:

print csv( \%ad ), "\n" if is_valid_ad( \%ad );

    [Please put your hands down!]


    In the original data provided, the Home section can
sometimes be multiple lines.

    We are going to skip extacting the address because it
would probably a depend ona routine bordering on Artificial
Intelligence to be 99% accurate.

    The same goes for 'arrangement' actually. Being a real
estate investor as well as a perl programmer, I can tell you
that each person has a different idea about what 3/2/2/3
means. It also changes by locale.

    For this response, let's skip those two fields
completely, it is up to the reader to find a reliable method
of extracting them. We will include those fields in the
script, we just won't fill them in.

    Finally, we are assuming that we have permission from
the newspaper to do this or that we are doing it for
personal use only.


    Let's look at the regex needed to get the price and
number of rooms first. Remember that any field could be
blank and that each field and name are separated by white
space. I am going to take a longer approach to this regex.
Readers might find it inefficient. Feel free to play with it
and post it to the list. (Remember not to quote this entire
message though.)

    Here's a script I wrote to test the "price" regex:

#!/usr/bin/perl

use strict;
use warnings;
#use diagnostics;

use Data::Dumper 'Dumper';

my @test = (
    'Price: $725     BD: 3     BA: 1     LA:',
    'Price:          BD: 3     BA: 1     LA: 2',
    'Price: $725     BD: 3     BA: 1     LA: 2',
    'Price: $725     BD: 3     BA:       LA:',
);

foreach my $sample ( @test ) {
    my @fields = split /Price:|BD:|BA:|LA:/, $sample;

    print Dumper [EMAIL PROTECTED];
}

__END__

    It came up with this output (I edited it to save space):

$VAR1 = [ '', ' $725     ', ' 3     ', ' 1     '  ];
$VAR1 = [ '', '          ', ' 3     ', ' 1     ',  ' 2' ];
$VAR1 = [ '', ' $725     ', ' 3     ', ' 1     ',  ' 2'  ];
$VAR1 = [ '', ' $725     ', ' 3     ' ];


    That's close but I found this better:

foreach my $sample ( @test ) {
    $sample =~ s/ //g;
    my @fields = split /Price:|BD:|BA:|LA:/, $sample;
    shift @fields;
    push @fields, ( '' ) x ( 4 - @fields );

    print Dumper [EMAIL PROTECTED];
}

$VAR1 = [ '$725', '3', '1', ''  ];
$VAR1 = [ '',     '3', '1', '2' ];
$VAR1 = [ '$725', '3', '1', '2' ];
$VAR1 = [ '$725', '3', '',  ''  ];

    Let's make this a subroutine and test it:

foreach my $sample ( @test ) {
    print Dumper [ price_fields( $sample ) ];
}

sub price_fields {
    my $line = shift;

        # collapse all spaces
    $line =~ s/\s//g;
    my @fields = split /Price:|BD:|BA:|LA:/, $line;
    
    # delete extra first field
    shift @fields;
    
    # pad missing fields
    push @fields, ( '' ) x ( 4 - @fields );

    return @fields;
}

__END__

    I get the same result. Now let's add that to our script:

my %ad;
while ( <DATA> ) {

    if ( /^Price/ ) {

        # add the last ad if it exists
        print csv( \%ad ), "\n" if is_valid_ad( \%ad );

        # Set ad defaults
        %ad = (
            date            => '',
            price           => '',
            bedrooms        => '',
            bathrooms       => '',
            living_areas    => '',
            phone           => '',
            address         => '',
            arrangement     => '',
            location        => '',
            paper           => '',
        );

        # update %ad for this line
        @ad{ qw| price bedrooms bathrooms living_areas | }
                = price_fields( $_ );
    }
}

sub price_fields {
    my $line = shift;

        # collapse all spaces
    $line =~ s/\s//g;
    my @fields = split /Price:|BD:|BA:|LA:/, $line;
    
    # delete extra first field
    shift @fields;
    
    # pad missing fields
    push @fields, ( '' ) x ( 4 - @fields );

    return @fields;
}

    As I mentioned earlier we will be skipping the 'Home'
line so I can cheat. :) So let's add that:


my %ad;
while ( <DATA> ) {

    if ( /^Price/ ) {

        # add the last ad if it exists
        print csv( \%ad ), "\n" if is_valid_ad( \%ad );

        # Set ad defaults
        %ad = (
            date            => '',
            price           => '',
            bedrooms        => '',
            bathrooms       => '',
            living_areas    => '',
            phone           => '',
            address         => '',
            arrangement     => '',
            location        => '',
            paper           => '',
        );

        # update %ad for this line
        @ad{ qw| price bedrooms bathrooms living_areas | }
                = price_fields( $_ );
    }

    if ( /^Home/ ) {
        @ad{ qw| address arrangement | } = home_fields( $_ );
    }
}

sub home_fields {
        # This is a dummy function
    return ( '', '' );
}


    And now for the location line. I used this regex. The x
option allows us to add white space:

m/
    (\S+)               # location ($1)
    \s+-\s+
    ([^-]+)             # paper with trailing spaces ($2)
    \s+
    (\d\d\/\d\d\/\d{4}) # date ($3)
/x

    So we add the following to our loop:

    if (
        m/
            (\S+)               # location ($1)
            \s+-\s+
            ([^-]+)             # paper and spaces ($2)
            -\s+
            (\d\d\/\d\d\/\d{4}) # date ($3)
        /x ) {

                # use hash slice to set fields
        @ad{ qw| location paper date | } = ( $1, $2, $3 );

        # trim trailing spaces in paper field
        $ad{paper} =~ s/\s+$//;
    }

    Boy, that looks messy!
    But it works! :)

    Let's take a final look at the script and let's look at
some ways to improve future script maintenance.

#!/usr/bin/perl

use strict;
use warnings;
#use diagnostics;

use Data::Dumper 'Dumper';

# column names
print csv( {
    date            => 'Date',
    price           => 'Price',
    bedrooms        => 'Bedrooms',
    bathrooms       => 'Bathrooms',
    living_areas    => 'Living_areas',
    phone           => 'Phone',
    address         => 'Address',
    location        => 'Location',
    arrangement     => 'Arrangemant',
    paper           => 'Paper',         } ), "\n";

my %ad;
while ( <DATA> ) {

    if ( /^Price/ ) {

        # add the last ad if it exists
        print csv( \%ad ), "\n" if is_valid_ad( \%ad );

        # Set field defaults
        %ad = (
            date            => '',
            price           => '',
            bedrooms        => '',
            bathrooms       => '',
            living_areas    => '',
            phone           => '',
            address         => '',
            arrangement     => '',
            location        => '',
            paper           => '',
        );

        # update %ad for this line
        @ad{ qw| price bedrooms bathrooms living_areas | }
                = price_fields( $_ );
    }

    if ( /^Home/ ) {
        @ad{ qw| address arrangement | } = home_fields( $_ );
    }

    if (
                m/
                    (\S+)               # location ($1)
                    \s+-\s+
                    ([^-]+)             # paper with spaces ($2)
                    -\s+
                    (\d\d\/\d\d\/\d{4}) # date ($3)
                /x
                ) {

                # use hash slice to set fields
        @ad{ qw| location paper date | } = ( $1, $2, $3 );

        # trim trailing spaces in paper field
        $ad{paper} =~ s/\s+$//;
    }
}

print csv( \%ad ), "\n" if is_valid_ad( \%ad );

sub home_fields {
        # This is a dummy function
    return ( '', '' );
}

sub price_fields {
    my $line = shift;

        # collapse all spaces
    $line =~ s/\s//g;
    my @fields = split /Price:|BD:|BA:|LA:/, $line;
    
    # delete extra first field
    shift @fields;
    
    # pad missing fields
    push @fields, ( '' ) x ( 4 - @fields );

    return @fields;
}

sub is_valid_ad {
    my $ad = shift;
    foreach my $value ( values %$ad ) {
        
        # return true if one field is valid 
        return 1 unless $value eq '';
    }
    
    # return false if we get here
    return;
}

sub csv {
    my $ad = shift;

    # if you need to change field order
    # do it here
    my @field_order = qw|   price bedrooms bathrooms phone
                            address arrangement
                            location paper date |;

    # add double quotes (")
    foreach my $value ( values %$ad ) {
        $value = qq("$value");
    }

    # use a hash slice to order the fields
    return join ',', @$ad{ @field_order };
}

__END__
----------------------------------------------------------------------------
----
 Ads basket. Check to add. Uncheck to remove.

Price: $725     BD: 3     BA: 1     LA:
Home 3bd-1ba-1,ch/a new carpet & paint fence stove $725+dep. 972-289-2098

Mesquite  -  The Dallas Morning News  -  04/08/2004


----------------------------------------------------------------------------
----
 Ads basket. Check to add. Uncheck to remove.

Price: $950     BD: 3     BA: 1.5     LA:
Home 3bd/1.5ba/2 1620 Summitt $950+dep Sec 8ok 469-525-0204; 214-957-8176

Mesquite  -  The Dallas Morning News  -  04/08/2004


    Here's the output I got:

"Price","Bedrooms","Bathrooms","Phone","Address","Arrangemant","Location","P
aper","Date"
"$725","3","1","","","","Mesquite","The Dallas Morning News","04/08/2004"
"$950","3","1.5","","","","Mesquite","The Dallas Morning News","04/08/2004"

    Right off we can see your intrepid instructor forgot the
'phone' field. I'll leave that up to you. The other thing I
notice is that we need to update too many areas to add or
change fields. That could really suck down the line, but
this reply is pushing 750 lines, so I'll give everyone a
break and shut up. :)


HTH,

Charles K. Clarkson
-- 
Mobile Homes Specialist
254 968-8328


============================================================
NOTE: This response is long. If you want to ask a question
      be responsible and delete anything that is not
      relevant to your question. Do NOT top post and repeat
      the entire message. Do NOT use the Outlook Excuse(TM).
============================================================


--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>

RE: Help with Variable Scope, Regular Expressions

Reply via email to