[GitHub] [lucene] neoremind commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building

GitBox Wed, 21 Apr 2021 08:43:40 -0700


neoremind commented on pull request #91:
URL: https://github.com/apache/lucene/pull/91#issuecomment-824164245



   > +1 to always use the stable version of the algorithm. This would only use 
transient memory and in reasonable amounts, so I'm not concerned with the 
memory usage.
   
   Per comment, I make stable reorder in MSB Radix Sort as default. Things 
become a little tricky.
   
   The benchmark of this PR branch goes like below. Starting from bytePerDim=4, 
non-stable MSB Radix Sort runs faster than stable version.
   ```
     -------------------------------------------------
   | bytesPerDim | isDocIdIncremental | avg time(us) |
    -------------------------------------------------
   |      1      |         N          |    981732.7  |
   |      1      |         Y          |     56766.1  |
   |      2      |         N          |    941666.1  |
   |      2      |         Y          |    304395.9  |
   |      3      |         N          |   1306374.9  |
   |      3      |         Y          |    853744.8  |
   |      4      |         N          |   1262810.3  |
   |      4      |         Y          |   1345605.8  |
   |      8      |         N          |   1282193.9  |
   |      8      |         Y          |   1364078.3  |
   |     16      |         N          |   1269967.4  |
   |     16      |         Y          |   1433378.3  |
   |     32      |         N          |   1293686.0  |
   |     32      |         Y          |   1468472.1  |
    -------------------------------------------------
   ```
   
   To make it more clear, I also add the new result into the graph I made above.
   
   
![](https://issues.apache.org/jira/secure/attachment/13024398/refined-code-benchmark2.png)
   
   I try to analysis, according to the flame graph I generated (see above 
picture). Sorting includes the following stages.
   - build histogram
   - fallback intro sort (quick sort for non-stable, in place merge sort for 
stable)
   - reorder
   
   Let's illustrate the total time partitioned by the cost per stage. 
   
   For main branch `MSBRadixSort`, it shows as below.
   ```
   
+--------------------------------------------------------------------------------------------------------+
   |  build histogram |      fallback sort (quick sort)     |             non 
stable reorder                | 
   
+--------------------------------------------------------------------------------------------------------+
   
   ```
   
   For PR branch `StableMSBRadixSort`, it shows as below
   ```
   
+-------------------------------------------------------------------------------------------------+
   |  build histogram |         fallback sort (in place merge sort)        |    
stable reorder       | 
   
+-------------------------------------------------------------------------------------------------+
 
   ```
   
   For PR branch `MSBRadixSort` making stable reorder as default, it shows as 
below. 
   ```
   
+----------------------------------------------------------------------------------+
   |  build histogram |       fallback sort (quick sort)     |    stable 
reorder      | 
   
+----------------------------------------------------------------------------------+
   
   ```
   
   I think that stable reorder contributes most part of the speedup, but in 
place merge sort is slower than 3-pivot quick sort.
   
   After all, regardless of the order of doc IDs, stable reorder is definitely 
a good choice. If data are spread evenly, not many duplicates, 
`StableMSBRadixSort` is a little slower (10%) than `MSBRadixSorter`. If not, 
`StableMSBRadixSort` will out-perform far better than `MSBRadixSorter`. This 
makes me reluctant on the way we should go, @jpountz could you give me some 
advice?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] neoremind commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building

Reply via email to