[PR] API, Core: Optimize CharSequenceMap for file paths [iceberg]

via GitHub Tue, 21 Nov 2023 13:48:05 -0800


aokolnychyi opened a new pull request, #9126:
URL: https://github.com/apache/iceberg/pull/9126


   I am using `CharSequenceMap` in #8755 to build a map of position delete 
indexes for a delete file. While profiling that change, I noticed we spend 
noticeably more time computing hash codes for file paths than `String` would. 
This is because our logic in `CharSequenceWrapper` is generic and reads one 
char at a time, while `String` can compute a hash code for Latin-1-only 
content faster by iterating over its bytes directly. This PR optimizes the 
hash code computation for file paths by taking only the file name into 
account. This speeds up the computation without increasing the chances of 
collisions.
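
   To illustrate the idea, here is a minimal sketch, not the code in this PR. 
It assumes `/` is the path separator and that file names are distinctive 
enough to hash on their own. The class and method names 
(`FileNameHashSketch`, `fileNameHash`, `fullHash`) are hypothetical and exist 
only for this example.

```java
public class FileNameHashSketch {

  // Hypothetical helper: hashes only the characters after the last '/'.
  // Uses the same polynomial as String.hashCode(), so the result equals
  // the hash code of the file name taken as a String.
  static int fileNameHash(CharSequence path) {
    int start = 0;
    // Scan backwards so the long common directory prefix is never hashed.
    for (int i = path.length() - 1; i >= 0; i--) {
      if (path.charAt(i) == '/') {
        start = i + 1;
        break;
      }
    }

    int hash = 0;
    for (int i = start; i < path.length(); i++) {
      hash = 31 * hash + path.charAt(i);
    }
    return hash;
  }

  // Generic baseline for comparison: hashes every character of the sequence.
  static int fullHash(CharSequence seq) {
    int hash = 0;
    for (int i = 0; i < seq.length(); i++) {
      hash = 31 * hash + seq.charAt(i);
    }
    return hash;
  }
}
```

   In this sketch, two paths with the same file name in different directories 
get the same hash code, so key equality must still compare the full path. The 
sketch only reduces the number of characters visited per hash computation.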
   
   This PR comes with tests and a benchmark.
   
   ```
   Benchmark                                          Mode  Cnt  Score   Error  Units
   CharSequenceMapBenchmark.defaultCharSequenceMap      ss   10  2.742 ± 0.499   s/op
   CharSequenceMapBenchmark.filePathCharSequenceMap     ss   10  1.420 ± 0.234   s/op
   ```
   
   I am planning to use `CharSequenceMap` in `DeleteFileIndex`, so this will 
be a common pattern.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
