Bug#608930: Dpkg::Log - log file parsing support for dpkg log files

Raphael Hertzog Wed, 25 Jan 2012 02:33:28 -0800

Hi,

On Fri, 13 Jan 2012, Patrick Schoenfeld wrote:
> we had the topic quiet a while ago and I must confess I haven't made a
> lot of progress on bringing forward the merge of DPKG::Log in the dpkg
> code base.
> 
> However, I feel, I should somehow try to get this forward and so I'm
> sending a patch, which could be reviewed, so actually *some* progress
> is starting to happen.


Thanks for this!

Here's my summary (after the review):

- the patch is much too big for a simple functionality like this one, you
  have to cut some code away, there are useless checks (code will end up
  failing if users submit something that's not expected, you don't have
  to hardcode checks for all possible mistakes that user might make),
  there are too many classes and intermediary objects, etc.
  Do your best to be "concise" and "readable".

- the amount of submitted code is so big that you should split the patch
  in multiple patches: first add the "base" module
  (Dpkg::Log/Dpkg::Log::Entry), then add one type of derived module after
  another.

  Start with a basic design that does only parsing and storage of
  information, come back to me with that and with a few iterations we'll
  get something clean as a base to build upon.

  We'll add the query features later on.

- You should not need so many modules,let me suggest:

  Dpkg::Log (introductory doc + maybe helper functions)
  Dpkg::Log::Base (your actual Dpkg::Log + Dpkg::Log::Analyse)
  Dpkg::Log::Base::Entry (your actual Dpkg::Log::Entry + 
  Dpkg::Log::Status (
  Dpkg::Log::Status::Entry

  i.e. somehow we'll have to find a way to merge your "Analyse*" modules
  into the basic modules.


You'll find belowe many more comments noted while I was doing the review.
There might be more to say but honestly I don't want to be too nitpicky
while the basic design doesn't suit me for now.

> - It includes a dpkg-report script, but its probably not yet useful
>   for anyone, because its missing a template and a manpage.
>   With regard to the template: I'm unsure if dpkg maintainers would be
>   ok to stay with libtemplate-perl, which is used in the script so far.
>   Feedback needed.

What kind of output formatting do you expect people to need? What
advanced feature of libtemplate-perl do you expect people to use?

If there's nothing more than what some plain substitutions can do, then
I don't think that the dependency is warranted.

In any case, it's probably best to keep this for later and deal with the
modules first.

> +  # Return all entries as <ENTRY PACKAGE NAME> objects
> +  @entries = $dpkg_log->entries;
> +  
> +  # Loop over entries
> +  while ($entry = $dpkg_log->next_entry) {
> +      ...
> +  }

I don't think that this integrated iterator brings anything (in particular
since it can't be manually reset apparently).

> +  # Get datetime from logfile or object, depending on weither
> +  # object stores from/to values or not
> +  ($from, $to) = $dpkg_log->get_datetime_info();

It's not clear what those "$from/$to" are used for. Is it to keep only a
subset of the entries parsed?

> +use Carp;

[...]

> +    croak "odd number of arguments" if not $entry_class;
> +    croak "wrong argument type: argument '$entry_class' should be a module"
> +        unless UNIVERSAL::can($entry_class, 'new');

We have Dpkg::ErrorHandling and you should use that and not croak directly.
Maybe you want to extend Dpkg::ErrorHandling so that we have a standard
function that croaks... not sure how to best name it. Maybe "caller_error"
to show that it's to be used when we want to fail because the caller of
the function made a mistake.

> +    my $self = {
> +        entry_class => $entry_class,
> +        entries => [],
> +        invalid_lines => [],
> +        from => undef,
> +        to => undef,
> +        offset => 0,
> +        filename => undef,
> +        parse => 0,
> +        time_zone => 'local',
> +        timestamp_pattern => '%F %T',
> +        from => 0,
> +        to => 0,
> +        %params
> +    };

from and to are listed twice here.

> +=item $dpkg_log->parse()
> +
> +This is a stub method which has to be implemented by an inherting class.
> +
> +=back
> +=cut

I think that you need an empty line between back and cut. Check with
podchecker. There are other similar problems from what I saw.

> +sub parse {
> +    croak "Not yet implemented."
> +}

internerr() instead of croak and the messages should explain what's not
implemented.

> +=head2 Working with the logfile
> +
> +These methods are to retrieve entries from the parsed logfile. All these
> +methods require that the parse method has been run.
> +
> +Be aware that if the object does not store entries these methods will die.
> +This is also true, if a parsed logfile is empty or contains invalid lines 
> only.
> +
> +=over 4
> +
> +=item @entries = $dpkg_log->entries();
> +
> +Returns all entries, stored in the object, as entry objects. The identity of
> +an entry object is defined by the parameter passed to the object constructor.
> +
> +This method accepts a hash with params, which can contain the keys B<from>
> +and B<to>, which, if specified, will be used to filter the entries by date 
> and
> +time.
> +=cut
> +sub entries {
> +    my ($self, %params) = @_;
> +    my $package = $self->{entry_class};
> +
> +    croak "Object does not store entries. Eventually parse function were 
> not" .
> +        " run or log is empty." if (not @{$self->{entries}});

Why is it a problem if the log is empty ? I would not fail and let the
function return something empty...

> +
> +    if (not ($params{from} or $params{to})) {
> +        return map { $package->new($_) } @{$self->{entries}};
> +    } else {
> +        return $self->filter_by_time($package, %params);
> +    }

Invert the clauses to get rid of the not() in the test.

> +}
> +
> +=item $entry = $dpkg_log->next_entry;
> +
> +Returns the next entry, stored in the object, as entry object.
> +The identity of an entry object is defined by the parameter
> +passed to the object constructor.
> +
> +=cut
> +sub next_entry {
> +    my ($self) = @_;
> +    my $package = $self->{entry_class};
> +   
> +    croak "Object does not store entries. Eventually parse function were 
> not" .
> +        " run or log is empty." if (not @{$self->{entries}});
> +
> +    my $offset = $self->{offset}++;
> +    if (not defined(@{$self->{entries}}[$offset])) {
> +        return undef;
> +    } else {
> +        return $package->new(@{$self->{entries}}[$offset]);
> +    }
> +}

As I said, I would get rid of that.

> +
> +=item @entries = $dpkg_log->filter_by_time(from => ts, to => ts)
> +
> +=item @entries = $dpkg_log->filter_by_time(from => ts)
> +
> +=item @entries = $dpkg_log->filter_by_time(to => ts)
> +
> +=item @entries = $dpkg_log->filter_by_time(from => ts, entry_ref => 
> $entry_ref)

Use a single line like the one below and explain %params in the comment:

=item @entries = $dpkg_log->filter_by_time(%params)

Listing all combinations is error-prone (and you're lacking some).

> +This method filters entries by a certain date/time range and returns
> +entry object. The identity of an entry object is defined by the parameter
> +passed to the object constructor.
> +
> +The range can be specified by passing B<from> and B<to> arguments,
> +otherwise it will use whatever is stored in the object.

This fallback to values stored within the object seems counter-intuitive
to me. I don't see why the object should store such values...

> +If for any value no explicit parameter is given the timestamp used for 
> filtering
> +will be measured by the first and last entry in the logfile.

Why do they have to be measured? If I don't specify one of both values, I
expect to have an unbounded interval (on one side)... it will give the
same result but it's clearer IMO.

> +
> +If the given from and to values are not DateTime objects, they will be
> +interpreted as strings and will be passed to DateTime::Format::Strptime.
> +
> +=cut
> +sub filter_by_time {
> +    my ($self, %params) = @_;
> +    my $package = $self->{entry_class};
> +
> +    my $entry_ref;
> +    if (not defined $params{entry_ref}) {
> +        $entry_ref = $self->{entries};
> +    } else {
> +        $entry_ref = $params{entry_ref};
> +    }
> +
> +    croak "Object does not store entries. Eventually parse function were 
> not" .
> +        " run or log is empty." if (not @{$entry_ref});

Same here, filtering an empty set should not fail and return something
empty too.

> +
> +    my $ts_parser = DateTime::Format::Strptime->new(
> +        pattern => $self->{timestamp_pattern},
> +        time_zone => $self->{time_zone}
> +    );
> +
> +    for my $field qw(from to) {
> +        if (not defined $params{$field} and not defined $self->{$field}) {
> +            my $e_index = ($field eq "from") ? 0 : -1;
> +            $params{$field} = 
> $package->new($entry_ref->[$e_index])->timestamp;
> +        } elsif (defined $self->{$field} and not defined $params{$field}) {
> +            $params{$field} = $self->{$field};
> +        } elsif (defined $params{$field} and not reftype($params{$field})) {
> +            $params{$field} = $ts_parser->parse_datetime($params{$field});
> +        }
> +    }
> +    my ($from, $to) = ($params{from}, $params{to});
> +
> +    my @entries = map { $package->new($_) } @{$entry_ref};
> +    return grep { $_->timestamp >= $from and $_->timestamp <= $to } @entries;
> +}

The are ways to simplify this code (and make it more readable) IMO.

> +
> +=item ($from, $to) = $dpkg_log->get_datetime_info()
> +
> +Returns the date/time period of the logfile or the one
> +defined in the object.
> +If object is initialized with from and/or to parameters it should return
> +these parameters, otherwise the timestamp of the first and the last entry
> +are returned.

I don't see (yet) what this function is used for. It's doesn't look like
a useful feature at least.

> --- /dev/null
> +++ b/scripts/Dpkg/Log/Analyse.pm
> @@ -0,0 +1,93 @@
> +package Dpkg::Log::Analyse;

Analyser or Analyzer. Not sure if we have picked en-US or en-GB by default
in dpkg...

> +use Dpkg::Log::Analyse;
> +
> +my $analyser = Dpkg::Log::Analyse->new('filename' => 'dpkg.log');
> +$analyser->analyse;
> +
> +=head1 DESCRIPTION
> +
> +This module is used to analyse a dpkg log.

I don't see what this module is useful for. Is it just to create a
Dkpg::Log and call parse on it?

> +Dpkg::Log::Analyse::Package - Describe a package as analysed from a dpkg.log
> +

What's the purpose of this module?

[...]
> +use overload (
> +    '""' => 'as_string',
> +    'eq' => 'equals',
> +    'cmp' => 'compare',
> +    '<=>' => 'compare'
> +);
[...]

> +        if (defined $params{$v}) {
> +            croak "wrong argument type: argument '$v' should be a 
> Dpkg::Version"
> +                unless reftype($params{$v} eq "Dpkg::Version");
> +        }

Your usage of reftype here (and probably elsewhere in the code) is wrong.

$ perl -MDpkg::Version -M'Scalar::Util qw(reftype blessed)' -e 
'$v=Dpkg::Version->new("1.2"); print reftype($v) . "\n"; print blessed($v)'
HASH
Dpkg::Version

But it might be better to use $obj->isa("Dpkg::Version").

> +    for my $s qw(package status) {
> +        if (defined $params{$s}) {
> +            croak "wrong argument type: argument '$s' must not be a ref"
> +                if reftype($params{$s});
> +        }
> +    }

Are those kind of checks really necessary ?

> +    my ($self, $version) = @_;
> +    if ($version) {

What if we have version 0 ? (use defined() in the check)

> +        $self->_set_version('version', $version);
> +    } else {
> +        $version = $self->{version};
> +    }
> +    return $version;
> +}
> +
> +=item $package->previous_version
> +
> +Return or set the previous version of this package.
> +
> +=cut
> +sub previous_version {
> +    my ($self, $previous_version) = @_;
> +    if ($previous_version) {

Likewise.

> +        $self->_set_version('previous_version', $previous_version);

Those _set_version are really useless. You can do the affectation
immediately 

$self->{previous_version} = Dpkg::Version->new($previous_version);

It will directly with it correctly even if we already have a Dpkg::Version.

> +=item equals($package1, $package2);

Don't document internal methods, instead just document the operator (eq, ==).

> +=item print "equal" if $package1 eq $package2
> +
> +Compares two packages in their string representation.
> +
> +=cut
> +sub equals {
> +    my ($first, $second) = @_;
> +    return ($first->as_string eq $second->as_string);
> +}
> +
> +
> +=item compare($package1, $package2)
> +
> +=item print "greater" if $package1 > $package2
> +
> +Compare two packages. See B<OVERLOADING> for details on how
> +the comparison works.
> +=cut
> +sub compare {
> +    my ($first, $second) = @_;
> +    return -1 if ($first->name ne $second->name);

That's wrong. You want to return "$first->name cmp $second->name" if you
want to be able to order anything based on your comparison operator...

> +    if ((not $first->previous_version) and (not $second->previous_version)) {
> +        return ($first->version <=> $second->version);
> +    } elsif ((not $first->previous_version) or (not 
> $second->previous_version)) {
> +        return -1;
> +    } elsif ($first->previous_version != $second->previous_version) {
> +        return -1;
> +    }
> +    
> +    return (($first->version <=> $second->version));
> +
> +}

And what if the versions are equal ? The same version of the same package
might be listed multiple times in a log file

> +
> +=item $package_str = $package->as_string
> +
> +=item printf("Package name: %s", $package);

I simply used “=item "$package"” to document the conversion to string.
Don't put sample code that is unrelated and might confuse people
(likewise with the “print "equals"” and similar stuff you used before).
Put only the operator and explains what's the return value.

> +Return this package as a string. This will return the package name
> +and the version (if set) in the form package_name/version.
> +If version is not set, it will return the package name only.
> +
> +=cut
> +sub as_string {
> +    my $self = shift;
> +
> +    my $string = $self->{package};
> +    if ($self->version) {

Need a defined() again?

> +        $string = $string . "/" . $self->version;
> +    }
> +    return $string;
> +}
> +
> +sub _set_version {

As I said, this should not be required.

> +++ b/scripts/Dpkg/Log/Analyse/Status.pm
> @@ -0,0 +1,235 @@
> +package Dpkg::Log::Analyse::Status;

Again, I don't see the need for this module. It's doing basic queries
on Dpkg::Log in a way that's not generic enough to be suitable for all use
cases that applications that might have.

You should better have good querying capabilities integrated in your 
Dpkg::Log::Status.pm and let the application decide what it wants to
extract and how.

> +package Dpkg::Log::Entry;
> +
> +use strict;
> +use warnings;
> +
> +use overload ( '""' => 'line' );
> +
> +=head1 METHODS
> +
> +=over 4
> +
> +=item $dpkg_log_entry = PACKAGE->new($line, $lineno, %params)
> +
> +Returns a new PACKAGE object.
> +The arguments B<line> and B<lineno> are mandatory.
> +They store the complete line as stored in the log and the line number.

Documentation should not be a template. Use Dpkg::Log::Entry and not
"PACKAGE". People who are writing derived classes know how it works
and will know how to instantiate their own class.

> +Call the parser.
> +
> +The B<time_zone> parameter is optional and specifies in which time zone
> +the dpkg log timestamps are.  If its omitted it will use the default
> +local time zone.
> +Its possible to specify either a DateTime::TimeZone object or a string.
> +=cut

You should fix your design so that the timestamp is parsed by
Dpkg::Log since all lines of dpkg log files will start with a timestamp.
It's also logical since you have put the timestamp attribute in the
base class Dpkg::Log::Entry.

> +sub parse {
> +    my ($self, %params) = @_;
> +    open(my $log_fh, "<", $self->{filename})
> +        or croak("unable to open logfile for reading: $!");
> + 
> +    my $params = {
> +        'from' => $self->{from},
> +        'to' => $self->{to}, 
> +        'time_zone' => $self->{time_zone},
> +        'timestamp_pattern' => $self->{timestamp_pattern},
> +        %params
> +    };
> +
> +    # Determine system timezone
> +    my $tz;
> +    if (ref($params->{time_zone}) and (ref($params->{time_zone}) eq 
> "DateTime::TimeZone")) {
> +        $tz = $params->{time_zone};
> +    } elsif (ref($params->{time_zone})) {
> +        croak "time_zone argument has to be a string or a DateTime::TimeZone 
> object";
> +    } else {
> +        $tz =  DateTime::TimeZone->new( 'name' => $params->{time_zone} );
> +    }
> +    my $ts_parser = DateTime::Format::Strptime->new( 
> +        pattern => $params->{timestamp_pattern},
> +        time_zone => $params->{time_zone}
> +    );
> +
> +    my $lineno = 0;
> +    while  (my $line = <$log_fh>) {
> +        $lineno++;
> +        chomp $line;
> +        next if $line =~ /^$/;
> +

You could put the content of the loop in a parse_line sub.
And it would also call $self->SUPER::parse_line() to let the
parent class extract the timestamp and fill the object that
you pass it.

(And it would return the remaining of the line to be parsed in case of
success, undef otherwise)

> +        if ($entry[2] eq "update-alternatives:") {
> +            next;

update-alternatives no longer writes to dpkg.log.

> +++ b/scripts/Dpkg/Log/Status/Entry.pm
> @@ -0,0 +1,297 @@
> +=head1 NAME
> +
> +Dpkg::Log::Status::Entry - Describe a log entry in a dpkg.log

All the handling of attributes on *::Entry could be generalized
and stuffed into the base class if it's really only a storage
class. Using dedicated methods like $entry->type() doesn't bring anything
compared to $entry->attribute("type").

I'm skipping dpkg-report.pl and the test-suite for now. There enough work
to do on the basic design of the modules already.

Cheers,
-- 
Raphaël Hertzog ◈ Debian Developer

Pre-order a copy of the Debian Administrator's Handbook and help
liberate it: http://debian-handbook.info/liberation/



--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Bug#608930: Dpkg::Log - log file parsing support for dpkg log files

Reply via email to