Re: [Mailman-Users] [Mailman-cabal] GDPR

2018-05-22 Thread Stephen J. Turnbull
Grant Taylor via Mailman-Users writes:
 > On 05/14/2018 06:33 AM, Andrew Hodgson wrote:

 > > Current advice from the GDPR people is we may have to delete the whole 
 > > thread.
 > 
 > What is their working definition of "thread"?

I would imagine that it is the subthread rooted at the first post
containing complainant's PII -- "Personally Identifying Information".

 > Why can't just the individual's message(s) be deleted?  Or better 
 > redacted to not reflect them?

That is going to depend on the presence of PII in the messages.  If
*whole messages* are to be deleted, that would presumably involve
content that somehow identifies the person.  I would expect that we
don't have to delete whole bug reports on this list just because
somebody requests their PII be redacted.

What worries me more is the implications for blockchain, or more
precisely, DAG-based VCSes that use hashes for integrity check like
git: the identity of commits will change if authors and emails are
redacted, including if a commit log refers to PII of a bug reporter as
they often do.  I guess you'd need to maintain an index of pointers
from old commit ids, or at least for branches and tags (we do have the
reflog in git).
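
To make the point about commit identity concrete, here is a minimal
Python sketch of how git derives a commit id in a SHA-1 repository: it
hashes a "commit <size>\0" header plus the commit text, which includes
the author line, the log message, and the parent ids.  The function
name, the people, and the addresses are made up for illustration, the
tree id is git's well-known empty tree used as a placeholder, and
signature headers are left out.

```python
import hashlib

# Git's well-known id for the empty tree; used here only as a placeholder.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def commit_id(tree, author, committer, message, parents=()):
    """Return the SHA-1 id git would assign to a commit with these fields."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode("utf-8")
    # Object ids hash a "commit <size>\0" header followed by the body.
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()


stamp = "1526947200 +0000"
maintainer = f"Maintainer <maint@example.org> {stamp}"
original = commit_id(EMPTY_TREE, f"Jane Example <jane@example.org> {stamp}",
                     maintainer, "Fix crash reported by Jane Example\n")
redacted = commit_id(EMPTY_TREE, f"REDACTED <redacted@example.invalid> {stamp}",
                     maintainer, "Fix crash reported by REDACTED\n")

# A child commit records its parent's id, so the change propagates.
child_before = commit_id(EMPTY_TREE, maintainer, maintainer, "Follow-up\n", [original])
child_after = commit_id(EMPTY_TREE, maintainer, maintainer, "Follow-up\n", [redacted])

# The "index of pointers from old commit ids" mentioned above.
old_to_new = {original: redacted, child_before: child_after}
```

Redacting one old commit therefore changes the id of every descendant,
which is why the mapping has to cover the whole history after the first
redacted commit, not just the commits that were edited.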

And heaven help you if you're a security conscious group like the
Linux kernel and use signed commits.  I guess the person who does the
redaction would sign the new commits, but that's pretty yucky -- that
person could do anything and nobody would know when it happened
because you have to delete the old commits and blobs that get redacted.

 > > Still under discussion, this is also complex because threads and
 > > subjects change, if we delete the whole thread there may be
 > > messages from the same author in other threads that don't have
 > > correct attribution etc.

As I understand the "right to be forgotten", it's *not* a right to
arbitrarily edit content stored by someone else, it's the right to
redact *all* PII in that content.  It's not just messages from a
person, it's headers containing their name and email address,
attribution lines for quoted material, quoted .sigs, etc etc.

 > I see six modes of access to the data:
 > 
 > 1)  List subscribers
 > 2)  List owners / administrators
 > 3)  Host system administrators
 > 4)  Administrators that are in the downstream SMTP / HTTP path and can 
 > track things.
 > 5)  Backups.
 > 6)  Ongoing Discovery.

You're missing

0)  Randos accessing public archives.

For (0), the only logging would be IP addresses in the webserver.

 > I would expect that #1 requires authentication to MM for
 > subscribers to see data, and I expect that this is logged in some
 > (indirect) capacity.

No.  The accessing IPs will be in the webserver logs, but I don't
think there is any logging in either Mailman 2 or Mailman 3 of
authentication data.  All there would be is the implication that
authentication was successful if that data were accessed.  In Mailman
2 there's no PII data whatsoever except for email address and (maybe)
display name in the subscriber data.  I suppose you could put phone #s
and junk like that in the display name, but GDPR is more concerned
with the database fields that might store PII than the actual content.

 > I would expect that #2 would have access to the data as part of their 
 > role of owning / administering a mailing list.

However, in Mailman 2 the various list passwords are shared, and would
not identify individuals in cases with multiple moderators or list
owners.

 > I would also expect that #3 has the capability to access the data.  But 
 > I would also expect that #3 would not access the data in normal day to 
 > day operations.

Indeed.  The problem is identifying them if they do, since they can
just use normal filesystem operations from the shell, which are not
normally logged at all.  In Mailman 3, we can configure databases like
PostgreSQL, which I suppose can log access to the subscriber
databases, and which make it hard (but not impossible) to access data
via ordinary filesystem operations.

However, I think that the issue here is basically moot.  You keep host
access logs to check for suspicious IP addresses (attempting to) log
in, and otherwise (for #2 and #3) you just give the list of all the
people who can access that data in the normal course of their duties.
I don't think the issue with logging is pinning down a particular
access to specific data, but rather determining who *could* access
that data.  The relevant access might have been by a long-since fired
engineer who did a Snowden on your database.  How could you possibly
know?

 > Are you saying that GDPR is going to complicate things related to
 > #3 and make it such that there is more of a union between #2 and
 > #3?  I.e. exclude 3rd party site hosters from being able to be #3?

I don't understand the "exclude third party site hosters".  The
GDPR requirement is not to *limit* access, it's to *log* access.

 > What is their working definition of "marketing"?

I'm pretty sure they're r

Re: [Mailman-Users] Encoding issues when importing archives

2018-05-22 Thread Stephen J. Turnbull
Mark Sapiro writes:

 > > content = content.encode(decoding)
 > > 
 > > UnicodeEncodeError: 'gb2312' codec can't encode character '\ufffd' in 
 > > position 3131: illegal multibyte sequence
 > > 
 > > Apparently the offending attachments are specified as gb2312 (a common
 > > Chinese encoding).
 > > 
 > > Is there something I can do to somehow preprocess the archive mboxes, or
 > > otherwise re-encode the attachments?
 > 
 > Possibly there is, but this is a bug in the hyperkitty_import process.

Technically, it's a bug in common Chinese MUAs.  We can work around it
if we want to, of course, and I think we do.


The backstory is that Chinese (simplified, aka mainland) has three
major encoding standards: GB 2312, GBK, and GB 18030.  GBK is not
really an encoding, it's an encoding schema which says "future Chinese
encodings shall be supersets of GB 2312" but doesn't assign any new
characters, and GB 18030 is not only a superset of GB 2312 that
actually defines the new characters compatibly with GBK, but it is
also a superset of Unicode that folds Unicode into the GBK code space
algorithmically (GB 2312 and Unicode are incompatible in page 0).

Whew!

So, because GB 18030 is backward compatible with GB 2312, a lot of
Chinese MUAs get away with incorrectly labeling the extended character
set "GB 2312", and you get the error above.  The same thing happens
with Shift JIS, by the way.

OTOH, for that exact reason, we can do what Webencodings does, and
promote GB 2312 claims, and *decode* with GB 18030.  I think this is
safe, as there's really no alternative encoding to worry about, and
since this stuff is presumably all text/plain or text/html, we should be
OK on security stuff (although I guess in theory it could be source
code or executable scripts that are doing something sneaky).

(On the other hand, I *am* worried about the fact that there is a
REPLACEMENT CHARACTER in the content at this point.  Presumably that's
because we *decoded* the original mail with errors=who-gives-a-fsck,
which is not appropriate here---we can be almost sure that the content
is *not* corrupt, rather it's mislabeled.)
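
A minimal sketch of the "promote and decode" idea in Python, assuming
the payload bytes and the declared charset have already been pulled out
of the message.  The function name is hypothetical and this is not the
actual hyperkitty_import code.  Decoding is strict on purpose, so no
REPLACEMENT CHARACTER is introduced silently.

```python
import codecs


def decode_promoting_gb2312(payload: bytes, charset: str) -> str:
    """Decode payload, treating a GB 2312 label as GB 18030."""
    name = codecs.lookup(charset).name  # normalizes aliases such as "GB2312"
    if name == "gb2312":
        name = "gb18030"
    # Strict decoding: a failure here means real corruption, not mislabeling.
    return payload.decode(name)


# Text that needs a four-byte GB 18030 sequence, which the GB 2312 codec
# cannot decode, followed by two characters that are in GB 2312.
mislabeled = "\U0001F600 \u4e2d\u6587".encode("gb18030")
text = decode_promoting_gb2312(mislabeled, "GB2312")
```

Other charsets pass through unchanged, and an unknown charset label
still raises LookupError, so the caller decides what to do with it.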

The OP can do a poor man's version, by going through the existing mbox
and case-independently regexp-replacing r"=\?GB2312\?" with
r"=\?GB18030\?", and r'charset=("?)GB2312' with r'charset=\1GB18030'.

I'm still jet-lagged from PyCon, so I'm not going to do more now, and
if you want some Python code to do this, please feel free to ping me
on or off list.
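
For reference, a minimal Python sketch of that poor man's version.  It
works on raw bytes so that it does not have to guess the encoding of
the mbox, and the function names are made up for illustration.  Note
that in Python the replacement text is written without backslashes,
because re.sub would copy "\?" into the output literally.

```python
import re

# Patterns from the message above, applied case-insensitively to raw bytes.
_ENCODED_WORD = re.compile(rb"=\?GB2312\?", re.IGNORECASE)
_CHARSET_PARAM = re.compile(rb'charset=("?)GB2312', re.IGNORECASE)


def relabel_gb2312(raw: bytes) -> bytes:
    """Relabel GB2312 as GB18030 in encoded-words and charset parameters."""
    raw = _ENCODED_WORD.sub(rb"=?GB18030?", raw)
    return _CHARSET_PARAM.sub(rb"charset=\1GB18030", raw)


def relabel_mbox(src_path, dst_path):
    """Write a relabeled copy of an mbox file (reads it all into memory)."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(relabel_gb2312(data))
```

This is deliberately crude: it also rewrites matching text inside
message bodies (for example, a post that discusses "charset=GB2312"),
so keep the original mbox and write to a copy.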

 > It would help if you file an issue at
 >  with enough
 > information for us to reproduce it.

print("""
Subject: nothing to see here: =?GB2312?Q?=FF=FD?=

Oops!
""")

should do the trick. ;-)

I'll be looking for this issue, or you can assign it to me.

Steve

--
Mailman-Users mailing list Mailman-Users@python.org
https://mail.python.org/mailman/listinfo/mailman-users
Mailman FAQ: http://wiki.list.org/x/AgA3
Security Policy: http://wiki.list.org/x/QIA9
Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/
Unsubscribe: 
https://mail.python.org/mailman/options/mailman-users/archive%40jab.org


Re: [Mailman-Users] GDPR

2018-05-22 Thread Stephen J. Turnbull
Ángel writes:

 > First of all, and I think it hasn't been mentioned yet is the Right
 > to access, ie. of letting people know which data you have about
 > them.
 > 
 > I would consider that listing all posts by email address X would
 > fulfill it, plus a search feature (*) in case they want to search
 > by other terms, like looking for posts with their name in it.

Many posts will include their names in CCs, especially on lists that
munge Reply-To.  Some of these may be hidden (eg, Reply-To is normally
not displayed; I don't know offhand if it's in the mbox files).

However, I think that what that clause means is not "all data items
that mention you," but rather "what personally identifying information
(PII) is stored," ie, name, email, postal address (.sig!), phone
number (.sig!), blog and other website URLs, etc.  The right to be
forgotten would imply at least redacting *all* instances of such PII.

 > (*) It is my understanding that just providing the mbox and
 > expecting them to grep through it just as the sysadmin would have
 > to do would be sufficient (OTOH if you had an advanced system for
 > completely tracking a guy, and provide him just a crude interface
 > that's probably not ok).

If the archives are private, this is seriously problematic if it
provides access to nonsubscribers who "are afraid" they were
mentioned.  Do you really want a stalker trawling through your private
lists just because somebody "might" have called him out by name?

 > Having to find out "anything and everything" where the user was
 > mentioned may imho require what the GDPR calls "a disproportionate
 > effort", and could even result in some liability for not finding some
 > instance.

What "disproportionate" means will have to be decided by courts or
further legislation (I'm not familiar with how this works in the EU).
I suspect that a sed script redacting name, nickname, email addresses,
SNS aliases, phone, postal address, and geographical address (perhaps
even as minimal as city) will be the bare minimum expected for mailing
list archives to the extent that they are covered by GDPR.

 > As such, wrt redacting archives my view is that they should provide
 > all the urls to the content they want removed (which they should
 > have been able to easily find per above).

This could easily be thousands of posts in a long-running mailing
list.  Really, you'd want it done in bulk, using sed on an mbox or SQL
on a database, rather than URL by URL in the HTML.
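
As a rough illustration of the bulk approach, here is a Python stand-in
for such a sed script.  It assumes the requester has supplied the exact
strings to redact; the helper name, the placeholder, and the sample
data are all hypothetical, and it only sees text that appears literally
(not encoded or wrapped) in the input.

```python
import re


def build_redactor(pii_terms, placeholder="[REDACTED]"):
    """Return a function that replaces every listed term in a string."""
    # Longest first, so a full name wins over a first name it contains.
    terms = sorted((t for t in pii_terms if t), key=len, reverse=True)
    if not terms:
        raise ValueError("no PII terms given")
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return lambda text: pattern.sub(placeholder, text)


redact = build_redactor(["Jane Example", "jane@example.org", "+1 555 0100"])
sample = (
    "From: Jane Example <jane@example.org>\n"
    "> Jane Example wrote:\n"
    "Call +1 555 0100\n"
)
cleaned = redact(sample)
```

The same function covers headers, attribution lines, and quoted .sigs
in one pass, which is the advantage over working URL by URL.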

 > If I detected that there was a follow-up top-posting email containing
 > the original content I would probably also truncate it, but strictly as
 > a courtesy matter and with no guarantees that I would do that.

Consider the example provided later in the thread of a private email
forwarded to the list by a subscriber.  Through no action of their
own, the private mail's author's PII was distributed over dozens (and
in really extreme cases it could be 100s) of posts in a long thread.
Anyway, as pointed out above, I'm pretty sure GDPR envisions *all*
instances of PII being redacted.

 > If they failed to find themselves, why would I need to dig through
 > the archives, not even knowing what I am looking for?

Because if it turns out later that that PII was found in your
archives, you will definitely be considered guilty of negligence or
worse.  You really cannot expect either users who want their PII
redacted or courts to be at all sympathetic to the mailing list
managers on this point.

 > There are too many ways to refer to someone, the email address,
 > different names and abbreviations (and misspellings!), which would
 > not even be unique, plus all kind of references (just suppose that
 > the people to which Julian referred claimed that his email contains
 > PII about them!).

The proverb, "the law is an ass", applies.  But that doesn't mean
people of ill-will can't abuse it, and people in a panic (eg, stalking
victims) may not care about your problems when they are literally at
risk of being murdered if found out.

This applies to several of your other comments implying that you can't
believe that the law means what it says, so I'm eliding them.

 > I would expect reasonable requests not to be a problem, though
 > (eg. just removing an address from a mail signature).

GDPR is not reasonable for mailing list operators who maintain
archives, period.  The problem is not the intent of lawmakers, who
mostly are horrified by the abuses that hackers have made of private
information leaked from various databases, and want to address those
problems as well as stalkers of various types.  The problem is that
people who would use such querying and redaction facilities are
likely to be in an "unreasonable" state of mind, as described above.

Unless we somehow have a blanket exemption, or "click-wrap" "I waive
my GDPR rights with respect to posts to this list" Subscriber
Agreements are deemed valid, I half-expect GDPR will kill volunteer-
maintained mailing lists in Europe, a

Re: [Mailman-Users] [Mailman-cabal] GDPR

2018-05-22 Thread Grant Taylor via Mailman-Users

On 05/22/2018 07:33 PM, Stephen J. Turnbull wrote:
 > I would imagine that it is the subthread rooted at the first post
 > containing complainant's PII -- "Personally Identifying Information".


I feel like that's a self referencing definition.

A "thread" is "a subthread rooted at the first post containing PII".

I agree that's where the focus should start.  But I don't think it 
defines a thread in the way that I'm asking.


What is their working definition of "thread"?

Let's say:

1)  Bla
2)   +--- Re: Bla
3)   +--- Re: Bla
4)   | +--- BlaBlaBla
5)   +--- Re: Bla
6) +--- I hijacked this thread because I need help!!!

Let's say the PII was in message 3 and the person replying to it in 
message 4 removed the PII.  Do messages 3 and 4 need to be removed (or 
otherwise modified)?


Let's say that message 1 had the PII, messages 2, 3, and 5 quoted it, 
but 4 did not and 6 is a hijacker that hit reply on the most convenient 
message (under his cursor) and removed all content.  Do messages 4 and 6 
need to be removed?


What is the "(sub)thread" that needs to be removed?

 > That is going to depend on the presence of PII in the messages.  If *whole
 > messages* are to be deleted, that would presumably involve content that
 > somehow identifies the person.  I would expect that we don't have to
 > delete whole bug reports on this list just because somebody requests
 > their PII be redacted.


I agree that it's possible to remove / redact PII without deleting the 
items containing the PII.


Think about it this way, spooks don't shred the entire sheet of paper, 
instead they take a black marker and redact just the pieces that need to 
be removed.


I'm afraid that the infinite wisdom of politicians will say that the 
entire paper needs to be shredded.


I think it also significantly depends on what needs to be redacted. 
Removing "supercalifragilisticexpialidocious" is a LOT different than 
removing "Grant Taylor" from the Mailman-Users archive. 
"supercalifragilisticexpialidocious" would be like reference to an 
event.  "Grant Taylor" would be any mention of my (or an impostor's) name.


The former is likely MUCH simpler to do than the latter.  The latter 
will also impact MANY more messages.


 > What worries me more is the implications for blockchain, or more
 > precisely, DAG-based VCSes that use hashes for integrity check like git:
 > the identity of commits will change if authors and emails are redacted,
 > including if a commit log refers to PII of a bug reporter as they often
 > do.  I guess you'd need to maintain an index of pointers from old commit
 > ids, or at least for branches and tags (we do have the reflog in git).


I don't want to try to work that out.

 > And heaven help you if you're a security conscious group like the Linux
 > kernel and use signed commits.  I guess the person who does the redaction
 > would sign the new commits, but that's pretty yucky -- that person could
 > do anything and nobody would know when it happened because you have to
 > delete the old commits and blobs that get redacted.


Yep.

 > As I understand the "right to be forgotten", it's *not* a right to
 > arbitrarily edit content stored by someone else, it's the right to redact
 > *all* PII in that content.


Agreed.

In this case, I don't think that supercalifragilisticexpialidocious 
qualifies under GDPR's right to be forgotten.  }:-)


 > It's not just messages from a person, it's headers containing their name
 > and email address, attribution lines for quoted material, quoted .sigs,
 > etc etc.


Agreed.

What about headers containing message ID from an uncommon / single user 
domain like mine?  I'd say that anything that can be used to identify 
less than a group of 1000 people would probably need to be redacted.  (I 
just chose 1000 arbitrarily, but it's a starting point.)



 > You're missing
 > 
 > 0)  Randos accessing public archives.


What other modes have we collectively missed?


 > For (0), the only logging would be IP addresses in the webserver.


True.

 > No.  The accessing IPs will be in the webserver logs, but I don't think
 > there is any logging in either Mailman 2 or Mailman 3 of authentication
 > data.  All there would be is the implication that authentication was
 > successful if that data were accessed.


Okay.

I wonder if there's any correlation between the IP that authenticated 
and the IP that accessed data.


 > In Mailman 2 there's no PII data whatsoever except for email address
 > and (maybe) display name in the subscriber data.


I expect that either of those, the email address -or- the display name 
are enough to count as PII.


I believe it's fair to say that people expect gtaylor (at) 
tnetconsulting (dot) net to reference a single person.  I also believe 
it's fair to say that most people expect most email addresses to 
be associated with one person.  The only exceptions to the rule 
being things like positional addresses; sales@ or info@ or webmaster@.


 > I suppose you could put phone #s and junk like that in the display name,
 > but GDPR 

Re: [Mailman-Users] GDPR

2018-05-22 Thread Grant Taylor via Mailman-Users

On 05/22/2018 07:46 PM, Stephen J. Turnbull wrote:
 > Many posts will include their names in CCs, especially on lists that
 > munge Reply-To.


Don't forget the munged reply.  }:-)

 > Some of these may be hidden (eg, Reply-To is normally not displayed;
 > I don't know offhand if it's in the mbox files).


Yes, Reply-To: is a standard header and included in mbox files.

 > However, I think that what that clause means is not "all data items
 > that mention you," but rather "what personally identifying information
 > (PII) is stored," ie, name, email, postal address (.sig!), phone number
 > (.sig!), blog and other website URLs, etc.  The right to be forgotten
 > would imply at least redacting *all* instances of such PII.


Agreed.

 > If the archives are private, this is seriously problematic if it provides
 > access to nonsubscribers who "are afraid" they were mentioned.  Do you
 > really want a stalker trawling through your private lists just because
 > somebody "might" have called him out by name?


Yep.  There are all sorts of implications here.

 > What "disproportionate" means will have to be decided by courts or
 > further legislation (I'm not familiar with how this works in the EU).
 > I suspect that a sed script redacting name, nickname, email addresses,
 > SNS aliases, phone, postal address, and geographical address (perhaps
 > even as minimal as city) will be the bare minimum expected for mailing
 > list archives to the extent that they are covered by GDPR.


The technical implications of that in and of itself astound.

What if part of the data is wrapped across lines?  What if it's quoted 
printable encoded with =20 breaking sed scripts trying to deal with line 
breaks?  What if it's base 64 encoded?  What if it's hosted on an 
Exchange server (or something else that uses a massive SIS type DB)?
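
The quoted-printable case is easy to reproduce.  The Python sketch
below uses a hand-written sample message in which a soft line break
splits a made-up name, so a byte-level search of the raw message misses
it, while the standard library's email package finds it after undoing
the transfer encoding.  The helper name and the sample are assumptions
for illustration only.

```python
from email import message_from_bytes, policy


def decoded_text_parts(raw: bytes):
    """Yield each text part of a message with transfer encoding undone."""
    msg = message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            yield part.get_content()


# Hand-written sample: a soft line break ("=" at end of line) splits the name.
raw_qp = (
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"Please contact Jane=20=\n"
    b"Example about this.\n"
)

found_in_raw = b"Jane Example" in raw_qp
found_decoded = any("Jane Example" in text for text in decoded_text_parts(raw_qp))
```

Finding the text this way is the easy half.  Writing a redacted version
back means re-encoding each changed part, which a line-oriented tool
like sed cannot do.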


... trying to think about ways to do this ...
...
... failing ...
...
... giving up

Nope.  I want to NOT go there.

 > This could easily be thousands of posts in a long-running mailing list.
 > Really, you'd want it done in bulk, using sed on an mbox or SQL on a
 > database, rather than URL by URL in the HTML.


Wasn't it the owner of Lavabit that gave the master decryption key to 
the courts in tiny font printed on hundreds of pages of paper?  —  He 
complied with the court order, but did not make it easy.


 > Consider the example provided later in the thread of a private email
 > forwarded to the list by a subscriber.  Through no action of their
 > own, the private mail's author's PII was distributed over dozens (and
 > in really extreme cases it could be 100s) of posts in a long thread.


Or if it's Gmail (or the likes) where the messages being replied to are 
hidden and perpetually added to in each reply.  *HEAVYsigh*


 > Anyway, as pointed out above, I'm pretty sure GDPR envisions *all*
 > instances of PII being redacted.


It's my (mis)understanding that it's the right for $individual to be 
forgotten, which means anything and /everything/ that identifies them. 
Emphasis on "everything".


 > Because if it turns out later that that PII was found in your archives,
 > you will definitely be considered guilty of negligence or worse.  You
 > really cannot expect either users who want their PII redacted or courts
 > to be at all sympathetic to the mailing list managers on this point.


I mostly agree.

I think there is some small room for good faith effort.  I.e. we found 
and removed 10,000 instances of $plaintiff's PII.  We're sorry for 9 
that we missed.  We've removed them and contracted with 
$external3rdparty to see if we missed anything.


 > The proverb, "the law is an ass", applies.  But that doesn't mean people
 > of ill-will can't abuse it, and people in a panic (eg, stalking victims)
 > may not care about your problems when they are literally at risk of
 > being murdered if found out.


I would hope there is some small room

 > GDPR is not reasonable for mailing list operators who maintain archives,
 > period.  The problem is not the intent of lawmakers, who mostly are
 > horrified by the abuses that hackers have made of private information
 > leaked from various databases, and want to address those problems as
 > well as stalkers of various types.


I agree that it's black hat hackers that do a lot of the exfiltration. 
But I think it's more the B2B selling of information that causes more 
concern (to me) than what hackers do with it.


I think we've seen enough breaches here in the US (I'm not up on the 
rest of the world) where little if anything makes the news about what is 
done with the leaked information or what comes of it.


I've heard more about businesses using contact info for marketing.

I follow someone on Twitter who was complaining about Yubico and Linode 
because they used his information from business consumer / contractual 
information for pure marketing purposes.  —  IMHO that's a breach of 
intended use of the information.  —  That being said, it's within the 
CAN-SPAM Act in that there is an established business relationship.


The problem is tha