anjakefala opened a new issue, #44641: URL: https://github.com/apache/arrow/issues/44641
### Describe the enhancement requested

There has been recent work to move the [ChunkResolver](https://github.com/apache/arrow/issues/34535) into the public API. `ChunkResolver` uses an `O(log(num_chunks))` binary search to identify chunks, which is optimised for random access. For sequential row-by-row access, resolving every row this way would be inefficient, yet sometimes a user needs to do row-major processing of the data. To that end, the proposal is to add these [helper methods](https://github.com/apache/arrow/issues/34535#issuecomment-1977304538) to the `ChunkResolver` API for more efficient sequential traversal. These helper methods were written by @felipecrv:

```
/// \pre loc.chunk_index >= 0
/// \pre loc.index_in_chunk is assumed valid if chunk_index is not the last one
inline bool Valid(ChunkLocation loc) const {
  const int64_t last_chunk_index = static_cast<int64_t>(offsets_.size()) - 1;
  return loc.chunk_index + 1 < last_chunk_index ||
         (loc.chunk_index + 1 == last_chunk_index &&
          loc.index_in_chunk < offsets_[last_chunk_index]);
}

/// \pre Valid(loc)
inline ChunkLocation Next(ChunkLocation loc) const {
  const int64_t next_index_in_chunk = loc.index_in_chunk + 1;
  return (next_index_in_chunk < offsets_[loc.chunk_index + 1])
             ? ChunkLocation{loc.chunk_index, next_index_in_chunk}
             : ChunkLocation{loc.chunk_index + 1, 0};
}
```

with the resulting loop:

```
ChunkResolver resolver(batches);
for (ChunkLocation loc; resolver.Valid(loc); loc = resolver.Next(loc)) {
  // re-use loc for all the typed columns since they are split on the same offsets
}
```

### Component(s)

C++
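
For illustration, here is a minimal sketch of what such a sequential traversal could look like. It assumes the proposed `Valid()`/`Next()` helpers land with the signatures above, and that `ChunkResolver` and `ChunkLocation` are exposed as `arrow::ChunkResolver`/`arrow::ChunkLocation` from `arrow/chunk_resolver.h` per the public-API work linked above; `SumSequential` is a hypothetical helper, not part of the proposal:

```
#include <cstdint>

#include <arrow/array.h>
#include <arrow/chunk_resolver.h>
#include <arrow/chunked_array.h>

// Sum an int64 ChunkedArray row by row. Each row is visited exactly once and
// the loop only switches chunks when the current chunk is exhausted, so no
// per-row binary search (Resolve) is needed.
int64_t SumSequential(const arrow::ChunkedArray& values) {
  arrow::ChunkResolver resolver(values.chunks());
  int64_t sum = 0;
  // loc starts at the first row of the first chunk, as in the loop above;
  // Valid()/Next() are the helpers proposed in this issue (assumed to exist).
  for (arrow::ChunkLocation loc; resolver.Valid(loc); loc = resolver.Next(loc)) {
    const auto& chunk = static_cast<const arrow::Int64Array&>(
        *values.chunk(static_cast<int>(loc.chunk_index)));
    sum += chunk.Value(loc.index_in_chunk);  // null checking omitted for brevity
  }
  return sum;
}
```

The same `loc` could index every typed column of a record batch vector in a single pass, since all columns are split on the same chunk boundaries, which is the row-major use case described above.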