[jira] [Updated] (PIG-5041) RoundRobinPartitioner is not deterministic when order of input records change

Rohini Palaniswamy (JIRA) Mon, 17 Oct 2016 12:13:48 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rohini Palaniswamy updated PIG-5041:
------------------------------------
    Attachment: PIG-5041-branch0.16.patch

> RoundRobinPartitioner is not deterministic when order of input records change
> -----------------------------------------------------------------------------
>
>                 Key: PIG-5041
>                 URL: https://issues.apache.org/jira/browse/PIG-5041
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>            Priority: Critical
>             Fix For: 0.17.0, 0.16.1
>
>         Attachments: PIG-5041-1.patch, PIG-5041-2.patch, 
> PIG-5041-branch0.16.patch
>
>
> Maps can be rerun due to shuffle fetch failures. Half of the reducers can end 
> up successfully pulling partitions from first run of the map while other half 
> could pull from the rerun after shuffle fetch failures. If the data is not 
> partitioned by the Partitioner exactly the same way every time then it could 
> lead to incorrect results (loss of records and duplicated records). 
> There is a good probability of order of input records changing
>     - With OrderedGroupedMergedKVInput (shuffle input), they keys are sorted 
> but values can be in any order as the shuffle and merge depends on the order 
> in which inputs are fetched. Anything involving FLATTEN can produce different 
> order of output records.
>     - With UnorderedKVInput, the records could be in any order depending on 
> order of shuffle fetch. 
> RoundRobinPartitioner can partition records differently everytime as order of 
> input records change which is very bad. We need to get rid of 
> RoundRobinPartitioner. Since the key is empty whenever we use  
> RoundRobinPartitioner we need to partitioning based on hashcode of values to 
> produce consistent partitioning.
> Partitioning based on hashcode is required for correctness, but disadvantage 
> is that it
>     - adds a lot of performance overhead with hashcode computation
>     - with the random distribution due to hashcode (as opposed to batched 
> round robin) input records sorted on some column could get distributed to 
> different reducers and if union is followed by a store, the output can have 
> bad compression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (PIG-5041) RoundRobinPartitioner is not deterministic when order of input records change

Reply via email to