[ 
https://issues.apache.org/jira/browse/HBASE-30115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HBASE-30115:
-----------------------------------
    Labels: pull-request-available  (was: )

> Introduce approximate progress estimation for TableRecordReader based on row 
> key position
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-30115
>                 URL: https://issues.apache.org/jira/browse/HBASE-30115
>             Project: HBase
>          Issue Type: Task
>          Components: mapreduce
>            Reporter: JinHyuk Kim
>            Assignee: JinHyuk Kim
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: mapreduce-progress-0.png, mapreduce-progress-after.png
>
>
> h1. Background
> Currently, {{TableRecordReaderImpl.getProgress()}} always returns *0*, 
> providing no progress feedback to the MapReduce framework. This makes it 
> impossible for users to monitor scan progress during long-running jobs.
> !mapreduce-progress-0.png|width=1095,height=236!
>  
> h1. Suggestion
> This patch estimates progress by converting row keys to numeric values and 
> computing the fraction of the key space covered so far: 
> {{(current - start) / (stop - start)}}.
> Since the {{TableInputFormat}} splitter sets start/stop row keys from region 
> boundaries, they are only empty for the table's very first region (empty 
> start) or last region (empty stop). In those cases, we *probe* the table with 
> a forward or reverse scan (limit 1) to discover the actual boundary row key.
>
> The implementation is pluggable via {{hbase.mapreduce.rowkey.progress.class}} 
> configuration:
>  * {{ByteBasedRowKeyProgress}} (default): treats row keys as raw bytes. 
> Works well for most key designs.
>  * {{HexPrefixRowKeyProgress}}: interprets leading bytes as hex characters 
> ([0-9a-f]). Gives accurate linear progress for tables using hex-encoded hash 
> prefixes (e.g. MD5). The raw byte approach is inaccurate for hex keys because 
> there are large byte gaps between '9'→'a' (0x39→0x61) and between "0f"→"10" 
> (0x3066→0x3130) that don't correspond to actual key distance. The prefix 
> length is configurable via 
> {{hbase.mapreduce.rowkey.progress.hex.prefix.length}} (default 4). Bytes 
> beyond the prefix are ignored, so non-hex suffixes do not affect progress.
>  * Users can implement the {{RowKeyProgress}} interface for custom key 
> encoding strategies.
>
> After this change, scan progress can be monitored as shown below:
>  
> !mapreduce-progress-after.png|width=1792,height=119!
>  
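The difference between the raw-byte and hex-prefix strategies described in the issue can be illustrated with a small standalone sketch. This is not the code from the patch: the class name {{RowKeyProgressSketch}}, its method names, and the choice to treat non-hex characters as digit 0 are all assumptions made for illustration. It only applies the {{(current - start) / (stop - start)}} formula with two different key-to-number conversions.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names come from the HBase patch.
public class RowKeyProgressSketch {

    /** Reads the first 'width' bytes of a key as an unsigned big-endian number, zero-padding short keys. */
    static BigInteger bytesToNumber(byte[] key, int width) {
        byte[] padded = new byte[width];
        System.arraycopy(key, 0, padded, 0, Math.min(key.length, width));
        return new BigInteger(1, padded);
    }

    /** Reads the first 'prefixLength' bytes as hex digits; bytes beyond the prefix are ignored. */
    static BigInteger hexPrefixToNumber(byte[] key, int prefixLength) {
        BigInteger value = BigInteger.ZERO;
        for (int i = 0; i < prefixLength; i++) {
            int digit = i < key.length ? Character.digit((char) (key[i] & 0xff), 16) : 0;
            if (digit < 0) {
                digit = 0; // assumption: a non-hex character counts as 0
            }
            value = value.shiftLeft(4).or(BigInteger.valueOf(digit));
        }
        return value;
    }

    /** (current - start) / (stop - start), clamped to [0, 1]; returns 0 for an empty or inverted range. */
    static float fraction(BigInteger start, BigInteger current, BigInteger stop) {
        BigInteger range = stop.subtract(start);
        if (range.signum() <= 0) {
            return 0f;
        }
        double f = current.subtract(start).doubleValue() / range.doubleValue();
        return (float) Math.max(0.0, Math.min(1.0, f));
    }

    static float byteProgress(String start, String current, String stop) {
        byte[] s = start.getBytes(StandardCharsets.UTF_8);
        byte[] c = current.getBytes(StandardCharsets.UTF_8);
        byte[] e = stop.getBytes(StandardCharsets.UTF_8);
        // Compare all keys at the same width so shorter keys are padded consistently.
        int width = Math.max(s.length, Math.max(c.length, e.length));
        return fraction(bytesToNumber(s, width), bytesToNumber(c, width), bytesToNumber(e, width));
    }

    static float hexProgress(String start, String current, String stop, int prefixLength) {
        return fraction(
            hexPrefixToNumber(start.getBytes(StandardCharsets.UTF_8), prefixLength),
            hexPrefixToNumber(current.getBytes(StandardCharsets.UTF_8), prefixLength),
            hexPrefixToNumber(stop.getBytes(StandardCharsets.UTF_8), prefixLength));
    }

    public static void main(String[] args) {
        // A key halfway through a hex-encoded key space from "0000" to "ffff".
        System.out.println("byte-based: " + byteProgress("0000", "8000", "ffff"));
        System.out.println("hex-prefix: " + hexProgress("0000", "8000", "ffff", 4));
    }
}
```

For the key {{8000}} in a range from {{0000}} to {{ffff}}, the raw-byte conversion reports roughly 0.15 while the hex-prefix conversion reports roughly 0.50, which reflects the byte gap between '9' and 'a' mentioned in the description.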



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
