Hi,

The recent GDPR introduced a new right for individuals: the right to be forgotten. This right means that if a customer asks an organization to delete all of their data, the organization has to comply most of the time (there are conditions that can suspend this right, but that's beside the point).
Now, HDFS being WORM (Write Once, Read Many), I guess you see where I'm going. What would be the best way to implement this line deletion feature (supposing that when a customer asks for all of their data to be deleted, the organization has to delete some lines in some HDFS files)?

Right now I'm going for the following:
- Create a key-value store (user, [files]).
- On each file write, feed this store with the users and the file location (by appending to or updating a key).
- When user "john" requests deletion, look up the "john" key in that store and rewrite all the files listed under it (read the file into memory, remove the lines belonging to "john", write the file back).

Would this be the most "Hadoop" way to do it?

I discarded crypto-shredding-like solutions because the HDFS data has to be readable by multiple proprietary software products, and by users at some point, and I'm not sure how to incorporate a decryption step into all of those use cases.

Also, I came up with this lookup-table solution because a brute-force grep for some key over the whole HDFS tree seemed unlikely to scale, but maybe I'm mistaken?

Thanks for your help,
Best regards
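
P.S. To make the approach above concrete, here is a minimal sketch of the index-and-rewrite idea. It is only an illustration under several assumptions: the local filesystem stands in for HDFS, the index is a plain in-memory dict rather than a real key-value store, and each line is assumed to start with the user id followed by a comma. The names `record_write`, `line_user` and `forget_user` are made up for this example.

```python
import os
import tempfile
from collections import defaultdict

# Hypothetical index: user -> set of file paths that contain lines for that user.
# In a real setup this would be a persistent key-value store.
index = defaultdict(set)


def record_write(index, user, path):
    """Register that `path` contains at least one line for `user`."""
    index[user].add(path)


def line_user(line):
    """Assumed line format: the user id is the first comma-separated field."""
    return line.rstrip("\n").split(",", 1)[0]


def forget_user(index, user):
    """Rewrite every file indexed for `user` without that user's lines."""
    for path in sorted(index.pop(user, ())):
        directory = os.path.dirname(path) or "."
        # Stream into a temporary file in the same directory, then swap it in,
        # so the whole file never has to fit in memory.
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as out, open(path) as src:
            for line in src:
                if line_user(line) != user:
                    out.write(line)
        os.replace(tmp_path, path)
```

Here the rewrite streams line by line instead of loading the whole file into memory, which is a deliberate change from the steps I listed. On HDFS the equivalent would presumably be writing a new file and renaming it over the old one, but I have not tested that part.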
