Re: Right to be forgotten and HDFS

Wei-Chiu Chuang Mon, 15 Apr 2019 13:44:21 -0700

Wow, Chao, didn't realize you guys are making Hudi into Apache :)
HDFS is generally not a good fit for this use case. I've seen people using
Kudu for GDPR compliance.


On Mon, Apr 15, 2019 at 11:11 AM Chao Sun <[email protected]> wrote:

> Check out Hudi (https://github.com/apache/incubator-hudi), which adds
> upsert functionality on top of columnar data such as Parquet.
>
> Chao
>
> On Mon, Apr 15, 2019 at 10:49 AM Vinod Kumar Vavilapalli <
> [email protected]> wrote:
>
>> If one uses HDFS as raw file storage where a single file intermingles
>> data from all users, it's not easy to achieve what you are trying to do.
>>
>> Instead, using systems (e.g. HBase, Hive) that support updates and
>> deletes to individual records is the only way to go.
>>
>> +Vinod
>>
>> On Apr 15, 2019, at 1:32 AM, Ivan Panico <[email protected]> wrote:
>>
>> Hi,
>>
>> The recent GDPR introduced a new right for people: the right to be
>> forgotten. This right means that if a customer asks an organization to
>> delete all of their data, the organization has to comply most of the time
>> (there are conditions which can suspend this right, but that's beside my
>> point).
>>
>> Now, HDFS being WORM (Write Once, Read Many), I guess you see where I'm
>> going. What would be the best way to implement this line deletion feature
>> (supposing that when a customer asks for all of their data to be deleted,
>> the organization would have to delete some lines in some HDFS files)?
>>
>> Right now I'm going for the following:
>>
>>    - Create a key-value store (user, [files])
>>    - On file writing, feed this store with the users and file location
>>    (by appending or updating a key).
>>    - When the deletion is requested by the user "john", look in that
>>    store and rewrite all the files of the "john" key (read the file into
>>    memory, remove the lines of "john", rewrite the files)
>>
>>
>> Would this be the most Hadoop-like way to do that?
>> I discarded crypto-shredding-like solutions because the HDFS data has to
>> be readable by multiple proprietary software packages and by users at
>> some point, and I'm not sure how to incorporate a decryption step for all
>> those use cases.
>> Also, I came up with this table solution because a brute-force grep for
>> some key on the whole HDFS tree seemed unlikely to scale, but maybe I'm
>> mistaken?
>>
>> Thanks for your help,
>> Best regards
>>
>>
>>
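
For illustration, the (user, [files]) index and rewrite-on-delete approach described in the original question could be sketched roughly as follows. This is only a minimal sketch under several assumptions that are not in the thread: local files stand in for HDFS paths, each record is one "user<TAB>payload" line, the index lives in memory, and the names UserFileIndex, write_records, erase_user and demo are hypothetical. It streams each file line by line instead of loading it into memory, and swaps the rewritten copy into place with a rename.

```python
import os
import tempfile
from collections import defaultdict


class UserFileIndex:
    """In-memory stand-in for the (user -> [files]) key-value store."""

    def __init__(self):
        self._files = defaultdict(set)

    def record_write(self, user, path):
        # Called on every write so the index knows where a user's lines live.
        self._files[user].add(path)

    def files_for(self, user):
        return sorted(self._files.get(user, ()))

    def forget(self, user):
        self._files.pop(user, None)


def write_records(index, path, records):
    """Write (user, payload) pairs as 'user<TAB>payload' lines and index them."""
    with open(path, "w", encoding="utf-8") as out:
        for user, payload in records:
            out.write(f"{user}\t{payload}\n")
            index.record_write(user, path)


def erase_user(index, user):
    """Rewrite every file listed for `user`, dropping that user's lines."""
    removed = 0
    for path in index.files_for(user):
        tmp_path = path + ".rewrite"
        with open(path, encoding="utf-8") as src, \
                open(tmp_path, "w", encoding="utf-8") as dst:
            for line in src:
                if line.split("\t", 1)[0] == user:
                    removed += 1  # skip the erased user's line
                else:
                    dst.write(line)
        # Replace the original file with the filtered copy.
        os.replace(tmp_path, path)
    index.forget(user)
    return removed


def demo():
    index = UserFileIndex()
    with tempfile.TemporaryDirectory() as root:
        part0 = os.path.join(root, "part-0.tsv")
        part1 = os.path.join(root, "part-1.tsv")
        write_records(index, part0, [("john", "x"), ("mary", "y")])
        write_records(index, part1, [("john", "z")])
        removed = erase_user(index, "john")
        with open(part0, encoding="utf-8") as f:
            remaining = f.read()
    return removed, remaining
```

On a real cluster the same shape would presumably go through an HDFS client and a durable key-value store, and old copies kept in HDFS trash or snapshots would need to be handled separately.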
