hi Richard,
your email was hard to follow, and I don't have real answers for
you but maybe my simpleton's view of the situation might offer
you new avenues of thought to consider.
Richard Lynch wrote:
> It's been 20+ years since I took a stats class...
20 years ago I was mostly riding a push bike ... and I've never
taken a stats class as such (bear this in mind :-)
>
> I didn't enjoy that class, and doubt if I remember 1% of what was
> covered.
you'd be 1 up on me ;-)
>
...
>
> And the sheer number of functions in the stats package is making my
> head spin.
>
...
>
> Some fools have their PC clock set to, like, 1970 or whatever. So
> let's be generous and assume their CMOS battery has died, and they
> haven't had a chance to change it. Fine. Deal with it.
>
> Okay, so *NOW* the algorithm is to do this:
>
> Take the Date: header, or Sent: header if no Date: header -> $whatdate
>
> Parse the Received: headers for the MTA date-stamps -> $fromdates[]
>
> Compare the values in $fromdates array with $whatdate.
>
> If the variance is "too high", then ignore the $whatdate, and take
> the, errr, first?, average?, $fromdates[].
does it matter so long as you're consistent in what you pick/use/calculate?
I would tend to go for the oldest date in any given array of processed dates
as this would seem to be the closest to the likely actual send date.
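
a rough sketch of that idea, assuming the dates have already been parsed
into unix timestamps (oldest_timestamp() is a made-up name, not something
from the stats package):

```php
<?php
// Hypothetical helper: returns the oldest (smallest) Unix timestamp
// from an array of already-parsed dates, or null if the array is empty.
function oldest_timestamp(array $timestamps): ?int
{
    return $timestamps ? min($timestamps) : null;
}

// Made-up sample values standing in for the parsed $fromdates.
$fromdates = [1160000300, 1160000120, 1160000180];
echo gmdate('Y-m-d H:i:s', oldest_timestamp($fromdates)), "\n";
```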
>
> No, wait, maybe I should do a variance within the $fromdates in case
> some stupid MTA server has a bad clock?
I would start by setting out a few acceptable boundaries and 'knowns'
for instance:
1. the first mail was sent no earlier than timestampX
(so any timestamp encountered that is earlier than this is bogus.)
2. a maximum time an email could be expected to hang out at any given MTA
whilst waiting to be moved on.
(could be used to drop an outer timestamp [oldest or newest] from a given
array of timestamps extracted from a mail when the difference to its
'neighbour' is greater than this agreed maximum period.)
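
something like the following might do for those two boundaries. it's only a
sketch: plausible_timestamps(), $earliest and $maxGap are names I made up,
the inputs are assumed to be unix timestamps, and it stops trimming once
only two values are left (at that point there is no way to tell which of
the pair is the bad one):

```php
<?php
// Hypothetical helper, not from the original thread.
// $earliest: no mail can be older than this (boundary 1).
// $maxGap:   longest plausible wait at one MTA, in seconds (boundary 2).
function plausible_timestamps(array $timestamps, int $earliest, int $maxGap): array
{
    // Boundary 1: anything before the first possible send time is bogus.
    $ts = array_values(array_filter(
        $timestamps,
        function ($t) use ($earliest) {
            return $t >= $earliest;
        }
    ));
    sort($ts);

    // Boundary 2: drop the oldest value while it sits too far
    // from its neighbour...
    while (count($ts) > 2 && $ts[1] - $ts[0] > $maxGap) {
        array_shift($ts);
    }
    // ...and do the same for the newest value.
    while (count($ts) > 2 && $ts[count($ts) - 1] - $ts[count($ts) - 2] > $maxGap) {
        array_pop($ts);
    }
    return $ts;
}

// Made-up sample: a dead-CMOS 1970 stamp and one stamp years too late.
$dates = [0, 1160000120, 1160000180, 1160000300, 1260000000];
print_r(plausible_timestamps($dates, 1000000000, 86400));
```

the cut-off of 1000000000 and the one-day gap are arbitrary example values;
the real ones would have to come from what you know about the archive.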
>
> Any advice?
1. don't forget to normalize all found dates in a given mail's array of
dates into UTC (if that is even an issue) before doing any actual
processing/analysis of the collected dates.
2. I would tar the dates found in the Date: and/or Sent: headers with the
same brush as any dates found in the Received: headers - your explanation
suggests that no one header could be construed as being more reliable than
another.
3. er there is no 3, unless you consider 'buy a bigger brain' real advice ;-)
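
for points 1 and 2, a sketch along these lines might be a starting point.
it assumes the date part has already been cut out of each header as a
string, and header_dates_to_timestamps() plus the sample dates are invented
for illustration. strtotime() returns a unix timestamp, which is
zone-independent, so the +0200/-0500 offsets get folded away:

```php
<?php
// Hypothetical helper: turns raw header date strings into a sorted
// array of Unix timestamps, skipping anything strtotime() cannot parse.
function header_dates_to_timestamps(array $dateStrings): array
{
    $out = [];
    foreach ($dateStrings as $s) {
        $t = strtotime(trim($s));
        if ($t !== false) {
            $out[] = $t;
        }
    }
    sort($out);
    return $out;
}

// The Date:/Sent: value is pooled with the Received: stamps, so no
// single header is treated as more trustworthy than the others.
$whatdate  = 'Thu, 05 Oct 2006 10:15:20 +0200';
$fromdates = [
    'Thu, 05 Oct 2006 03:15:45 -0500',
    'Thu, 05 Oct 2006 08:16:10 +0000',
];
$all = header_dates_to_timestamps(array_merge([$whatdate], $fromdates));

foreach ($all as $t) {
    echo gmdate('D, d M Y H:i:s', $t), " UTC\n";
}
```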
>
> Anybody got a good "variance" function to do what I'm trying to do?
>
> Am I on the entirely wrong path here?
dunno - but it's another typical Lynch problem that was just too interesting
for me to let slide :-) please do keep us posted as to your progress!
> Sheesh!
>
> We may just ignore any obviously wrong dates, and process those by
> hand...
indeed anything that is blatantly 'dodgy' with regard to dates is probably
easier (and more accurate) to process by hand than it is to create some
wizzo algo. for it - it's a matter of getting the number of 'dodgy' ones
down to an acceptable level of course.
>
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php