On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 07:19 PM, [email protected] wrote:
>
> > I have a list of tuples that contains a tool_id, a time, and a message. I
> > want to select from this list all the elements where the message matches
> > some string, and all the other elements where the time is within some diff
> > of any matching message for that tool.
>
> > Here is how I am currently doing this:
>
> No, it's not. This is a fragment of code, without enough clues as to
> what else is going on. We can guess, but that's likely to make a mess.
Of course it's a fragment - it's part of a large program and I was just showing
the relevant parts.
> First question is whether this code works exactly correctly?
Yes, the code works. I end up with just the rows I want.
> Are you only concerned about speed, not fixing features?
Don't know what you mean by 'fixing features'. The code does what I want, it
just takes too long.
> As far as I can tell, the logic that includes the time comparison is bogus.
Not at all.
> You don't do anything there to worry about the value of tup[2], just
> whether some item has a nearby time. Of course, I could misunderstand
> the spec.
The data comes from a database. tup[1] is a datetime column, and tdiff comes
from a datetime.timedelta().
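To make the types concrete, the comparison in determine() ends up being
timedelta against timedelta - a toy illustration with made-up values:

    from datetime import datetime, timedelta

    tdiff = timedelta(minutes=5)              # stand-in for my real tdiff
    t_match = datetime(2012, 12, 20, 18, 0)   # a time from messageTimes
    t_row = datetime(2012, 12, 20, 18, 3)     # tup[1] for some row

    # subtracting datetimes gives a timedelta, and abs() of a timedelta works,
    # so the <= comparison against tdiff is well defined
    print(abs(t_match - t_row) <= tdiff)      # True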
> Are you making a global called 'self'? That name is by convention only
> used in methods to designate the instance object. What's the attribute
> on self?
Yes, self is my instance object. self.message contains the string of interest
that I need to look for.
> Can cdata have duplicates, and are they significant?
No, it will not have duplicates.
> Are you including the time building that as part of your 2 hour measurement?
No, the 2 hours is just the time to run the
cdata[:] = [tup for tup in cdata if determine(tup)]
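Roughly measured like this, just to show which statement the 2 hours covers
(the time.time() wrapper here is only illustrative):

    import time

    start = time.time()
    cdata[:] = [tup for tup in cdata if determine(tup)]
    print("filtering took %.1f seconds" % (time.time() - start))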
> Is the list sorted in any way?
Yes, the list is sorted by tool and datetime.
> Chances are your performance bottleneck is the doubly-nested loop. You
> have a list comprehension at top-level code, and inside it calls a
> function that also loops over the 600,000 items. So the inner loop gets
> executed 360 billion times. You can cut this down drastically by some
> judicious sorting, as well as by having a map of lists, where the map is
> keyed by the tool.
Thanks. I will try that.
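Here is a rough sketch of what I understand you to be suggesting - group the
matching times per tool, sort them, and use bisect so each row is only compared
against the nearest match times for its own tool. The names (times_by_tool,
window_match) are mine, and I'm assuming the rows are (tool, time, message)
tuples as in my code quoted below:

    import bisect
    from collections import defaultdict

    # collect the times of matching messages, grouped by tool and sorted
    times_by_tool = defaultdict(list)
    for tool, dt, msg in cdata:
        if self.message in msg:
            times_by_tool[tool].append(dt)
    for times in times_by_tool.values():
        times.sort()

    def window_match(tool, dt):
        # True if dt is within tdiff of any matching time for this tool
        times = times_by_tool.get(tool)
        if not times:
            return False
        # only the two neighbours of the insertion point can be closest to dt
        i = bisect.bisect_left(times, dt)
        if i < len(times) and times[i] - dt <= tdiff:
            return True
        if i > 0 and dt - times[i - 1] <= tdiff:
            return True
        return False

    cdata[:] = [row for row in cdata
                if self.message in row[2] or window_match(row[0], row[1])]

If I'm reading it right, that replaces the scan over all 600,000 items inside
determine() with a couple of binary searches per row.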
> > # record time for each message matching the specified message for each tool
> > messageTimes = {}
>
> You're building a dictionary; are you actually using the value (1), or
> is only the key relevant?
Only the keys.
> A set is a dict without a value.
Yes, I could use a set, but I don't think that would make it measurably faster.
> But more importantly, you never look up anything in this dictionary. So why
> isn't it a list? For that matter, why don't you just use the
> messageTimes list?
Yes, it could be a list too.
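For reference, the set version of that first loop (quoted just below) would be
something like:

    # same membership data, but as a set instead of a dict of dummy 1 values
    messageTimes = set()
    for row in cdata:                    # tool, time, message
        if self.message in row[2]:
            messageTimes.add((row[0], row[1]))

determine() would then loop over the set exactly the way it loops over the
dict's keys now.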
> > for row in cdata:     # tool, time, message
> >     if self.message in row[2]:
> >         messageTimes[row[0], row[1]] = 1
> >
> > # now pull out each message that is within the time diff for each matched message
> > # as well as the matched messages themselves
> >
> > def determine(tup):
> >     if self.message in tup[2]: return True      # matched message
> >
> >     for (tool, date_time) in messageTimes:
> >         if tool == tup[0]:
> >             if abs(date_time - tup[1]) <= tdiff:
> >                 return True
> >
> >     return False
> >
> > cdata[:] = [tup for tup in cdata if determine(tup)]
>
> As the code exists, there's no need to copy the list. Just do a simple
> bind.
This statement is to remove the items from cdata that I don't want. I don't
know what you mean by 'bind' - I'm not familiar with that term in Python.
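If by 'bind' you just mean assigning the filtered list back to the name,
instead of doing the slice assignment, I suppose it would look like this:

    # rebind the name to the new filtered list instead of overwriting
    # the old list's contents in place
    cdata = [tup for tup in cdata if determine(tup)]

Is that all you mean? I used cdata[:] because I wasn't sure whether anything
else still holds a reference to the original list.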
>
> > This code works, but it takes way too long to run - e.g. when cdata has
> > 600,000 elements (which is typical for my app) it takes 2 hours for this to
> > run.
> >
> > Can anyone give me some suggestions on speeding this up?