New proposed fix:
In collab-qa/upload-history/munge_ddc.py , if the Message-Date we were
going to emit is not within the year of the envelope From, plus/minus one
year, we drop Message-Date.
This permits the udd/upload_history_gatherer.py code to use its existing
logic about dropping replacing "Message-Date" with "Date" when
"Message-Date" is visible.
I've tested that on a subset of debian-devel-changes.201307 and confirmed
the patched code has identical output, and that the crazy "The year is
2019" message has N/A as its Message-Date. So I believe the behavior is
correct.
I've also tested the performance of this. I re-ran the patched + unpatched
code until runtimes were close to consistent (in an attempt to remove disk
or other caches as the difference) and then I got these numbers after one
run of each:
Unpatched: 22.994sec
➜ upload-history time python munge_ddc.py
/tmp/debian-devel-changes/debian-devel-changes.201307 >
/tmp/debian-devel-changes.201307.out.without-patch
17.93s user 4.59s system 97% cpu 22.994 total
Patched: 22.983sec
➜ upload-history time python munge_ddc.py
/tmp/debian-devel-changes/debian-devel-changes.201307 >
/tmp/debian-devel-changes.201307.out.with-patch
18.00s user 4.52s system 97% cpu 22.983 total
I'm not realistically arguing that my code runs faster than the original
code, but I believe there is less than 1% runtime difference between the
two, and probably way less than 1%.
Patch attached for munge_ddc.py. Please review and merge, if you like!
-- Asheesh.
Index: munge_ddc.py
===================================================================
--- munge_ddc.py (revision 2767)
+++ munge_ddc.py (working copy)
@@ -105,6 +105,55 @@
return ret
+def get_message_date_if_it_is_sane(msg, rc):
+ # See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=715216 for the reason
+ # this function exists at all.
+ raw_msg_date = msg['Date']
+ message_date = normalize_date(msg['Date'])
+
+ # If there is no usable message date, then we should bail out.
+ # This is preferable to emitting any other value as the message date;
+ # the other date data is also emitted (look for "Message-Date" vs. "Date"
+ # in munge_ddc()), so downstream consumers of this data can decide for
+ # themselves how to proceed.
+ if message_date is None:
+ return raw_msg_date
+
+ # If there is a message date, we want to check how that compares
+ # to the date embedded in the Envelope from, if there is one.
+ #
+ # In the past, there have been situations where the message date
+ # is far ahead of the actual date, probably due to a server error
+ # somewhere.
+ envelope_from = msg.get_unixfrom()
+
+ # If we have no envelope from, then we cannot do the sanity check. So we
+ # let the original data through.
+ if envelope_from is None:
+ return raw_msg_date
+
+ # If the we have an envelope from, try to find a date. If we don't have
+ # one, then we also have to bail out.
+ envelope_from_date = None
+ if envelope_from.count(' ') >= 2:
+ envelope_from_date = normalize_date(
+ envelope_from.split(' ', 2)[2])
+ if envelope_from_date is None:
+ return raw_msg_date
+
+ # Given that we have both, we use the following heuristic: if the
+ # Message-Date is more than one year different from the date in the
+ # envelope From, we assume the Message-Date is corrupt.
+ message_date_year = emailutils.parsedate_tz(message_date)[0]
+ envelope_from_year = emailutils.parsedate_tz(envelope_from_date)[0]
+
+ # If the message date year is within a sane range, we return it.
+ if ((envelope_from_year - 1) < message_date_year <
+ (envelope_from_year + 1)):
+ return raw_msg_date
+ else:
+ return 'N/A'
+
def build_keyring_list(dir):
""" Scan dir and return a list of every file ending with .gpg or .pgp """
keyrings = []
@@ -141,7 +190,7 @@
c['Signed-By'] = 'N/A'
c['Message-Id'] = str(msg['Message-Id']).strip('\n')
- c['Message-Date'] = str(msg['Date']).strip('\n')
+ c['Message-Date'] = get_message_date_if_it_is_sane(msg, rc)
if not set(['Source', 'Architecture', 'Version', 'Date', 'Changes']).issubset(rc.keys()):
sys.stderr.write("Required fields not found, skipping.\n")