[
https://issues.apache.org/jira/browse/PIG-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rohini Palaniswamy updated PIG-5041:
------------------------------------
Attachment: PIG-5041-1.patch
> RoundRobinPartitioner is not deterministic when order of input records change
> -----------------------------------------------------------------------------
>
> Key: PIG-5041
> URL: https://issues.apache.org/jira/browse/PIG-5041
> Project: Pig
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Priority: Critical
> Fix For: 0.16.1
>
> Attachments: PIG-5041-1.patch
>
>
> Maps can be rerun due to shuffle fetch failures. Half of the reducers can end
> up successfully pulling partitions from first run of the map while other half
> could pull from the rerun after shuffle fetch failures. If the data is not
> partitioned by the Partitioner exactly the same way every time then it could
> lead to incorrect results (loss of records and duplicated records).
> There is a good probability of order of input records changing
> - With OrderedGroupedMergedKVInput (shuffle input), they keys are sorted
> but values can be in any order as the shuffle and merge depends on the order
> in which inputs are fetched. Anything involving FLATTEN can produce different
> order of output records.
> - With UnorderedKVInput, the records could be in any order depending on
> order of shuffle fetch.
> RoundRobinPartitioner can partition records differently everytime as order of
> input records change which is very bad. We need to get rid of
> RoundRobinPartitioner. Since the key is empty whenever we use
> RoundRobinPartitioner we need to partitioning based on hashcode of values to
> produce consistent partitioning. It adds a lot of performance overhead, but
> required for correctness.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)