Hi,

The recent GDPR introduced a new right for individuals: the right to be
forgotten. This right means that if a customer asks an organization to delete
all their data, the organization has to comply most of the time (there are
conditions which can suspend this right, but that's beside my point).

Now, HDFS being WORM (Write Once, Read Many), I guess you see
where I'm going. What would be the best way to implement this line deletion
feature (supposing that when a customer asks for deletion of all their data,
the organization has to delete some lines in some HDFS files)?

Right now I'm going for the following:

   - Create a key-value store (user, [files]).
   - On file writing, feed this store with the users and the file location (by
   appending or updating a key).
   - When deletion is requested by the user "john", look up the "john" key in
   that store and rewrite all the files listed under it (read each file into
   memory, remove the lines belonging to "john", write the file back).
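
To make the three steps above concrete, here is a minimal sketch in Python. It is only an illustration under several assumptions: local files stand in for HDFS paths, the key-value store is a plain in-memory dictionary, and each line is assumed to be a comma-separated record whose first field is the user id. The names `UserFileIndex`, `line_user`, `write_records` and `erase_user` are hypothetical, not part of any Hadoop API.

```python
import os
import tempfile
from collections import defaultdict


class UserFileIndex:
    """Hypothetical in-memory stand-in for the (user, [files]) key-value store."""

    def __init__(self):
        self._files_by_user = defaultdict(set)

    def record(self, user, path):
        # Called at write time: remember that `path` holds lines for `user`.
        self._files_by_user[user].add(path)

    def files_for(self, user):
        return sorted(self._files_by_user.get(user, ()))

    def forget(self, user):
        self._files_by_user.pop(user, None)


def line_user(line):
    # Assumption: the user id is the first comma-separated field of a line.
    return line.split(",", 1)[0]


def write_records(index, path, lines):
    """Write a file and feed the index with every user that appears in it."""
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
            index.record(line_user(line), path)


def erase_user(index, user):
    """Rewrite every file indexed under `user` without that user's lines."""
    for path in index.files_for(user):
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as tmp, \
                open(path, encoding="utf-8") as src:
            for line in src:
                if line_user(line.rstrip("\n")) != user:
                    tmp.write(line)
        # Swap the filtered copy into place (local filesystem only).
        os.replace(tmp_path, path)
    index.forget(user)
```

This sketch streams each file line by line into a temporary copy instead of loading it into memory, which I suspect would matter for large files. On HDFS itself the final swap would have to go through the filesystem's own rename, whose overwrite behaviour is not the same as `os.replace`, so that part is the least transferable.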


Would this be the most Hadoop-like way to do that?
I discarded crypto-shredding-like solutions because the HDFS data has to
be readable by multiple proprietary software packages and by users at some
point, and I'm not sure how to incorporate a decryption step for all those
use cases.
Also, I came up with this table solution because a brute-force grep for some
key over the whole HDFS tree seemed unlikely to scale, but maybe I'm mistaken?

Thanks for your help,
Best regards
