neoremind opened a new pull request #91:
URL: https://github.com/apache/lucene/pull/91


   https://issues.apache.org/jira/browse/LUCENE-9932
   
   # Description
   
   In BKD index building, the input bytes must be sorted before calling BKD 
writer related API. The sorting method leverages MSB Radix Sort algorithm, and 
the comparing method takes both the bytes itself and the DocId, but in real 
cases, DocIds are usually monotonically increasing. This could yield one 
possible performance enhancer. I found this enhancement when I dig into one 
performance issue in our system. Then I research on the possible solution.
   
   # Solution
   
   DocId is usually increased by one when building index in a thread-safe way, 
by assuming such condition, the comparing method can eliminate the unnecessary 
comparing input - DocId, only leave the bytes itself to compare. In order to do 
so, MSB radix sorting and its fallback sorting method must be stable, so that 
when elements are the same, the sorting method maintains its original order 
when added, which makes DocId still monotonically increasing. To make MSB Radix 
Sort stable, it needs a trivial update; to make fallback sort table, use merge 
sort instead of quick sort. Meanwhile, there should introduce a switch which is 
able to turn the stable option on or off.
   
   # Tests
   
   Add testcases.
   
   Add performance testcase (maybe removed when merging).
   
   Test environment:
   MacBook Pro (Retina, 15-inch, Mid 2015), 2.2 GHz Intel Core i7, 16 GB 1600 
MHz DDR3
   
   Java version:
   java version "1.8.0_161"
   Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
   Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
   
   Testcase:
   bytesPerDim = [1, 2, 3, 4, 8, 16, 32]
   dim = 1
   doc num = 2,000,000
   warm up 5 time, run 10 times to calculate average time used.
   
   bytesPerDim\scenario | disable sort doc id (PR branch) | enable sort doc id 
(master branch)
   -- | -- | --
   1 | 30989.594 us | 1151149.9 us
   2 | 313469.47 us | 1115595.1 us
   3 | 844617.8 us | 1465465.1 us
   4 | 1350946.8 us | 1465465.1 us
   8 | 1344814.6 us | 1458115.5 us
   16 | 1344516.6 us | 1459849.6 us
   32 | 1386847.8 us | 1583097.5 us
   
   When there are many duplicate bytes (bytesPerDim = 1 or 2 or 3) which means 
data cardinality is low. By disable sorting DocId, it runs 1.73x to 37x faster.
   When data cardinality is high, the performance does not go backward, also 
runs after.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to