gf2121 commented on PR #12661:
URL: https://github.com/apache/lucene/pull/12661#issuecomment-1765808689

   Hi @jpountz, thanks a lot for the suggestion!
   
   > another option could be to encode the number of supplementary bytes using 
unary coding (like UTF8).
   
   This is a great idea that would probably make `readMSBVLong` faster!
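
To make the quoted suggestion concrete, here is a minimal sketch of one possible reading of it: the count of leading one bits in the first byte gives the number of supplementary bytes, as in UTF-8, so the reader knows the length after a single byte. The class and method names are hypothetical, and this is not the format used by Lucene or by this PR.

```java
final class UnaryPrefixVLong {
  // Encodes a non-negative long. The first byte starts with `extra` one bits
  // (the unary length), followed by a zero bit when extra < 8, then payload bits.
  static byte[] encode(long v) {
    if (v < 0) {
      throw new IllegalArgumentException("value must be non-negative");
    }
    int bits = 64 - Long.numberOfLeadingZeros(v | 1); // significant bits, 1..63
    int extra = (bits - 1) / 7; // supplementary bytes, 0..8
    byte[] out = new byte[1 + extra];
    for (int i = extra; i >= 1; i--) {
      out[i] = (byte) v; // low-order payload bytes go last
      v >>>= 8;
    }
    int unaryPrefix = (0xFF << (8 - extra)) & 0xFF;
    out[0] = (byte) (unaryPrefix | v); // remaining high bits fit next to the prefix
    return out;
  }

  // Decodes a value produced by encode(); the length is known after the first byte.
  static long decode(byte[] in) {
    int first = in[0] & 0xFF;
    // Count the leading one bits of the first byte (capped at 8 for 0xFF).
    int extra = Math.min(8, Integer.numberOfLeadingZeros((~first & 0xFF) << 24));
    long v = first & (0x7F >>> extra);
    for (int i = 1; i <= extra; i++) {
      v = (v << 8) | (in[i] & 0xFF);
    }
    return v;
  }
}
```

Compared with a continuation bit on every byte, the decoder branches once on the length instead of once per byte, which is where any speedup would come from.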
   
   FYI, my current theory is that the regression is caused not by "decoding the 
MSB VLong" but by "how the MSB VLong changes the FST structure": 
   
   * For LSB VLong output, most or all of the bytes are stored in a single arc.
   * For MSB VLong output, the bytes are split across many arcs for prefix 
sharing.
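
As an illustration of why the byte order matters for prefix sharing, here is a small self-contained sketch. It is not the Lucene implementation: the class and method names are hypothetical, and the MSB encoder is a simplified stand-in. It encodes two nearby values both ways and counts the leading bytes they share.

```java
import java.io.ByteArrayOutputStream;

final class VLongPrefixDemo {
  // LSB-first: low 7 bits first; the high bit marks "more bytes follow".
  static byte[] encodeLsb(long v) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((v & ~0x7FL) != 0) {
      out.write((int) ((v & 0x7F) | 0x80));
      v >>>= 7;
    }
    out.write((int) v);
    return out.toByteArray();
  }

  // MSB-first: the same 7-bit groups, most significant group first.
  static byte[] encodeMsb(long v) {
    int groups = (63 - Long.numberOfLeadingZeros(v | 1)) / 7 + 1;
    byte[] out = new byte[groups];
    for (int i = 0; i < groups; i++) {
      int shift = 7 * (groups - 1 - i);
      int b = (int) ((v >>> shift) & 0x7F);
      out[i] = (byte) (i < groups - 1 ? (b | 0x80) : b);
    }
    return out;
  }

  // Number of leading bytes that two encodings have in common.
  static int commonPrefix(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    int i = 0;
    while (i < n && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  public static void main(String[] args) {
    long x = 1_000_000L;
    long y = 1_000_001L;
    System.out.println("LSB shared bytes: " + commonPrefix(encodeLsb(x), encodeLsb(y))); // 0
    System.out.println("MSB shared bytes: " + commonPrefix(encodeMsb(x), encodeMsb(y))); // 2
  }
}
```

Nearby values differ in their low-order bits, so only the MSB-first form gives them a common byte prefix. That shared prefix is what ends up spread over several arcs instead of sitting on one.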
   
   So we need more `Outputs#read` and `Outputs#add` calls for `MSBVLong` to 
get the whole output. Here is a comparison of call counts between LSB VLong 
(before #12631) and MSB VLong (after #12631):
   
   | | LSB VLong | MSB VLong | diff |
   | -- | -- | -- | -- |
   | `Outputs#read` calls | 116097 | 149803 | +29.03% |
   | `Outputs#add` calls | 144 | 111568 | +77377.78% |
   
   Unfortunately, `ByteSequenceOutputs#add` and `ByteSequenceOutputs#read` 
always need to construct new `BytesRef` objects, which is not efficient enough. 
This patch tries to speed up `ByteSequenceOutputs#add` a bit, which gives the 
tiny improvement [mentioned 
above](https://github.com/apache/lucene/pull/12661#issuecomment-1764814636). 
But we still see the regression because `add` is still needed, whereas the 
original code (before #12631) could just ignore the NO_OUTPUT arcs.
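
The cost difference can be sketched with a toy model. This is not Lucene code: `OutputAccumulationDemo`, its `add`, and its `NO_OUTPUT` sentinel are hypothetical stand-ins that use plain byte arrays instead of `BytesRef`. The sketch walks the arcs of one path, skips arcs with no output, and allocates a new array for every other arc.

```java
import java.util.List;

final class OutputAccumulationDemo {
  // Sentinel for "this arc carries no output"; compared by identity.
  static final byte[] NO_OUTPUT = new byte[0];

  // Counts how many times add() ran; reset it before each measurement.
  static int addCalls = 0;

  // Concatenates two outputs into a freshly allocated array.
  static byte[] add(byte[] prefix, byte[] output) {
    addCalls++;
    byte[] result = new byte[prefix.length + output.length];
    System.arraycopy(prefix, 0, result, 0, prefix.length);
    System.arraycopy(output, 0, result, prefix.length, output.length);
    return result;
  }

  // Accumulates the outputs along one path, skipping arcs without output.
  static byte[] accumulate(List<byte[]> arcOutputs) {
    byte[] acc = NO_OUTPUT;
    for (byte[] out : arcOutputs) {
      if (out != NO_OUTPUT) {
        acc = add(acc, out);
      }
    }
    return acc;
  }

  public static void main(String[] args) {
    // Whole output on one arc: one add, the other arcs are skipped.
    addCalls = 0;
    accumulate(List.of(new byte[] {1, 2, 3}, NO_OUTPUT, NO_OUTPUT));
    System.out.println("single-arc adds: " + addCalls); // 1

    // Same output split one byte per arc: one add and one allocation per arc.
    addCalls = 0;
    accumulate(List.of(new byte[] {1}, new byte[] {2}, new byte[] {3}));
    System.out.println("split-arc adds: " + addCalls); // 3
  }
}
```

Both paths produce the same three bytes, but the split layout pays for an allocation on every arc, and no change to the decoding step removes that cost.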
   
   So I'm not sure that optimizing the output decoding can resolve the 
regression, since decoding does not look like the bottleneck to me, but I'd 
like to give it a try if you still think it is worthwhile :)
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

