[
https://issues.apache.org/jira/browse/HBASE-29863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Himanshu Gwalani updated HBASE-29863:
-------------------------------------
Description:
*Goal:* Introduce a mechanism to track and expose the specific HFiles involved
in a scan operation.
*Use case:* This is essential for client-side validation that the right set of
files was scanned (when a source of truth is available, for example the
snapshot data manifest during snapshot-based scans), for debugging
performance issues, and for analyzing data access patterns.
*Proposed API:* Add {{Set<Path> getScannerInitializedFiles()}} to the
{{KeyValueScanner}} interface.
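The client-side validation described in the use case could look roughly like the sketch below. It is illustrative only: {{FileTrackingScanner}} and {{ScannedFileValidator}} are hypothetical names, the interface is a trimmed-down stand-in for {{KeyValueScanner}} that carries only the proposed method, and {{java.nio.file.Path}} is used in place of the Hadoop {{Path}} type so the example compiles on its own.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for KeyValueScanner; only the proposed method is modeled.
interface FileTrackingScanner {
    Set<Path> getScannerInitializedFiles();
}

// Hypothetical helper comparing scanned files with a source of truth,
// e.g. the file list taken from a snapshot manifest.
final class ScannedFileValidator {
    private ScannedFileValidator() {}

    // Files that were expected but not reported by the scanner.
    static Set<Path> missing(Set<Path> expected, Set<Path> scanned) {
        Set<Path> diff = new TreeSet<>(expected);
        diff.removeAll(scanned);
        return diff;
    }

    // Files reported by the scanner that are not in the source of truth.
    static Set<Path> unexpected(Set<Path> expected, Set<Path> scanned) {
        Set<Path> diff = new TreeSet<>(scanned);
        diff.removeAll(expected);
        return diff;
    }

    // True only when the scanner reports exactly the expected files.
    static boolean matches(FileTrackingScanner scanner, Set<Path> expected) {
        return expected.equals(scanner.getScannerInitializedFiles());
    }
}

final class ScannedFileValidatorDemo {
    public static void main(String[] args) {
        // Made-up paths standing in for a snapshot manifest.
        Set<Path> manifest = Set.of(Paths.get("/snap/cf/a"), Paths.get("/snap/cf/b"));
        // A scanner that only opened one of the two files.
        FileTrackingScanner scanner = () -> Set.of(Paths.get("/snap/cf/a"));

        System.out.println("matches: " + ScannedFileValidator.matches(scanner, manifest));
        System.out.println("missing: "
            + ScannedFileValidator.missing(manifest, scanner.getScannerInitializedFiles()));
    }
}
```

Reporting the missing and unexpected files separately, rather than a single boolean, is a design choice of this sketch; it makes a mismatch easier to debug.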
*Implementation Details*
* *Capturing the list of files when the scanner is initialized*
** Leaf scanners
*** StoreFileScanner: Returns a singleton set containing the path of the
associated {{HFile}}.
*** SnapshotSegmentScanner / CollectionBackedScanner / SegmentScanner: Return
an empty set.
** Composite scanners
*** StoreScanner & ReversedStoreScanner: Aggregate files from all active
{{StoreFileScanner}} instances.
*** KeyValueHeap & ReversedKeyValueHeap: Aggregate files from the scanners in
their internal priority queue.
** Abstract scanners
*** NonLazyKeyValueScanner / NonReversedNonLazyKeyValueScanner: Return an
empty set.
* *Exposing via RegionScanner & TableSnapshotRecordReader*
** RegionScanner: Aggregates files from all underlying StoreScanners.
** TableSnapshotRecordReader: Proxies the call through ClientSideRegionScanner
so that MapReduce jobs can access the file set for snapshot-based scans.
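The leaf/composite split above can be sketched as follows. This is a simplified model, not the actual HBase classes: all type names are hypothetical, {{java.nio.file.Path}} stands in for the Hadoop {{Path}} type, and capturing the file set in the constructor is one reading of "when the scanner is initialized".

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for KeyValueScanner with only the proposed method.
interface SketchScanner {
    Set<Path> getScannerInitializedFiles();
}

// Leaf backed by a single file (models the StoreFileScanner case).
final class FileBackedSketchScanner implements SketchScanner {
    private final Path file;

    FileBackedSketchScanner(Path file) {
        this.file = file;
    }

    @Override
    public Set<Path> getScannerInitializedFiles() {
        return Collections.singleton(file);
    }
}

// Leaf backed by in-memory data (models the segment/collection scanners).
final class MemoryBackedSketchScanner implements SketchScanner {
    @Override
    public Set<Path> getScannerInitializedFiles() {
        return Collections.emptySet();
    }
}

// Composite that unions the files of its children (models StoreScanner,
// KeyValueHeap, and, one level up, RegionScanner).
final class CompositeSketchScanner implements SketchScanner {
    private final Set<Path> files;

    CompositeSketchScanner(List<? extends SketchScanner> children) {
        // Assumption: the file set is captured once, at construction time.
        Set<Path> collected = new HashSet<>();
        for (SketchScanner child : children) {
            collected.addAll(child.getScannerInitializedFiles());
        }
        this.files = Collections.unmodifiableSet(collected);
    }

    @Override
    public Set<Path> getScannerInitializedFiles() {
        return files;
    }
}

final class ScannedFilesSketchDemo {
    public static void main(String[] args) {
        // Made-up paths; two store-level composites under one region-level composite.
        SketchScanner storeA = new CompositeSketchScanner(List.of(
            new FileBackedSketchScanner(Paths.get("/data/cf1/hfile-1")),
            new MemoryBackedSketchScanner()));
        SketchScanner storeB = new CompositeSketchScanner(List.of(
            new FileBackedSketchScanner(Paths.get("/data/cf2/hfile-2")),
            new FileBackedSketchScanner(Paths.get("/data/cf2/hfile-3"))));
        SketchScanner region = new CompositeSketchScanner(List.of(storeA, storeB));

        System.out.println(region.getScannerInitializedFiles());
    }
}
```

Because every level returns a {{Set}}, a file reachable through more than one child is reported once, and in-memory scanners contribute nothing to the result.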
was: Need to do this before removing some deprecated methods in Cell, as we
still need to use these methods on the server side.
> Add API to KeyValueScanner to retrieve the set of StoreFiles accessed during
> a scan
> -----------------------------------------------------------------------------------
>
> Key: HBASE-29863
> URL: https://issues.apache.org/jira/browse/HBASE-29863
> Project: HBase
> Issue Type: New Feature
> Components: API, regionserver, Scanners
> Reporter: Himanshu Gwalani
> Assignee: Himanshu Gwalani
> Priority: Major
> Fix For: 2.7.0, 3.0.0-beta-2