Wow, Chao, I didn't realize you guys were bringing Hudi into Apache :)

HDFS is generally not a good fit for this use case. I've seen people use Kudu for GDPR compliance.
On Mon, Apr 15, 2019 at 11:11 AM Chao Sun <[email protected]> wrote:

> Check out Hudi (https://github.com/apache/incubator-hudi), which adds
> upsert functionality on top of columnar data such as Parquet.
>
> Chao
>
> On Mon, Apr 15, 2019 at 10:49 AM Vinod Kumar Vavilapalli <
> [email protected]> wrote:
>
>> If one uses HDFS as raw file storage where a single file intermingles
>> data from all users, it's not easy to achieve what you are trying to do.
>>
>> Instead, using systems (e.g. HBase, Hive) that support updates and
>> deletes to individual records is the only way to go.
>>
>> +Vinod
>>
>> On Apr 15, 2019, at 1:32 AM, Ivan Panico <[email protected]> wrote:
>>
>> Hi,
>>
>> The recent GDPR introduced a new right for people: the right to be
>> forgotten. This right means that if an organization is asked by a customer
>> to delete all his data, the organization has to comply most of the time
>> (there are conditions which can suspend this right, but that's beside my
>> point).
>>
>> Now, HDFS being WORM (Write Once Read Many), I guess you see
>> where I'm going. What would be the best way to implement this line deletion
>> feature (supposing that when a customer asks for a delete of all his data,
>> the organization would have to delete some lines in some HDFS files)?
>>
>> Right now I'm going for the following:
>>
>> - Create a key-value base (user, [files])
>> - On file writing, feed this base with the users and file location
>> (by appending or updating a key).
>> - When the deletion is requested by the user "john", look in that
>> base and rewrite all the files of the "john" key (read the file in
>> memory, suppress the lines of "john", rewrite the files)
>>
>> Would this be the most Hadoop way to do that?
>>
>> I discarded crypto-shredding-like solutions because the HDFS data has
>> to be readable by multiple proprietary software packages and by users at
>> some point, and I'm not sure how to incorporate a deciphering step for all
>> those use cases.
>>
>> Also, I came up with this table solution because a brute-force grep for
>> some key on the whole HDFS tree seemed unlikely to scale, but maybe I'm
>> mistaken?
>>
>> Thanks for your help,
>> Best regards
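The index-and-rewrite approach listed in the original question can be sketched roughly as follows. This is only an illustration under stated assumptions, not how any of the systems mentioned in the thread work: the local filesystem stands in for HDFS, the (user, [files]) base is a plain in-memory dict, each record is assumed to be one `user,payload` line, and the names `index`, `user_of`, `write_records` and `forget_user` are hypothetical. It also streams each file line by line rather than loading it into memory as the question describes.

```python
import os
import tempfile
from collections import defaultdict

# Hypothetical in-memory stand-in for the (user, [files]) key-value base.
index = defaultdict(set)


def user_of(line):
    """Return the user key of a record (assumed layout: 'user,payload')."""
    return line.split(",", 1)[0]


def write_records(path, lines):
    """Write a file and register every user that appears in it."""
    with open(path, "w") as f:
        for line in lines:
            f.write(line + "\n")
            index[user_of(line)].add(path)


def forget_user(user):
    """Rewrite each file indexed under `user`, dropping that user's lines."""
    for path in sorted(index.pop(user, set())):
        # Write the filtered copy next to the original, then swap it in.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as out, open(path) as src:
            for line in src:
                if user_of(line.rstrip("\n")) != user:
                    out.write(line)
        os.replace(tmp, path)
```

Calling `forget_user("john")` touches only the files recorded under the "john" key, which is the point of keeping the index instead of scanning the whole tree. On a real cluster the write-then-rename step would have to go through an HDFS client, and the index itself would need to live in a durable store.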
