uschindler commented on PR #888:
URL: https://github.com/apache/lucene/pull/888#issuecomment-1792570482

   > @mikemccand: If you want to see the changes I reverted, see the above 
comparison: 
https://github.com/apache/lucene/compare/36de2bb7fa7a0587a102cf5c4d35ac8f94976bbd..c1b626c0636821f4d7c085895359489e7dfa330f
   > 
   > Those changes need to be re-applied to the repo in correct files (not sure 
where this code now lives, looks like BytesRefBlockPool, but no idea, sorry)
   
   After looking into those changes, I think I know what the problem was. 
Internally, BytesRefHash uses big endian because some parts of the byte array 
are encoded "UTF-8 like" (if the highest bit is set, another byte follows). As 
this encoding buys little and dropping it costs only a few bytes more storage, 
I removed it to always use shorts instead of "byte or BE short". Once the 
encoding no longer needs to be "UTF-8 like", the byte order no longer matters 
and it can use native order. But for safety you could also use LE encoding, 
which matches most current CPUs (ARM is also LE now).
   
   So we have 2 possibilities:
   - Change the internal encoding of BytesRefHash: remove the big-endian 
"UTF-8 like" encoding (or call it vShort) and switch to little-endian shorts.
   - Use native byte order, which also works and additionally helps 
big-endian CPUs like s390. This PR supports this, but it is questionable for 
the reasons Robert said.
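
   To make the difference concrete, here is a rough sketch of the two 
layouts. It is not the actual Lucene code: the class and method names are 
made up for illustration, the 0x7FFF limit is an assumption that follows from 
having 15 usable bits, and it only relies on the JDK's 
`MethodHandles.byteArrayViewVarHandle`. It shows why the "byte or BE short" 
variant is tied to big endian (the flag bit has to land in the first byte), 
while a fixed short works with any byte order.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

/** Hypothetical sketch; names are illustrative and not taken from Lucene. */
public class LengthPrefixSketch {
  public static final VarHandle BE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.BIG_ENDIAN);
  public static final VarHandle LE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);
  public static final VarHandle NATIVE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.nativeOrder());

  /** "byte or BE short": 1 byte if length < 128, else 2 bytes with the top bit set. */
  public static int writeVarLength(byte[] buf, int pos, int length) {
    if (length < 0 || length > 0x7FFF) {
      throw new IllegalArgumentException("length out of range: " + length);
    }
    if (length < 128) {
      buf[pos] = (byte) length;
      return 1;
    }
    // Big endian puts the high byte first, so the reader sees the flag bit
    // in buf[pos] before deciding whether a second byte follows.
    BE_SHORT.set(buf, pos, (short) (length | 0x8000));
    return 2;
  }

  public static int readVarLength(byte[] buf, int pos) {
    if ((buf[pos] & 0x80) == 0) {
      return buf[pos];
    }
    return ((short) BE_SHORT.get(buf, pos)) & 0x7FFF;
  }

  /** Fixed 2-byte length: no flag bit, so any byte order works if both sides agree. */
  public static void writeFixedLength(VarHandle order, byte[] buf, int pos, int length) {
    if (length < 0 || length > 0xFFFF) {
      throw new IllegalArgumentException("length out of range: " + length);
    }
    order.set(buf, pos, (short) length);
  }

  public static int readFixedLength(VarHandle order, byte[] buf, int pos) {
    return ((short) order.get(buf, pos)) & 0xFFFF;
  }

  public static void main(String[] args) {
    byte[] buf = new byte[2];
    int written = writeVarLength(buf, 0, 300);
    System.out.println("var-length bytes written: " + written);
    System.out.println("var-length read back: " + readVarLength(buf, 0));
    writeFixedLength(LE_SHORT, buf, 0, 300);
    System.out.println("fixed LE read back: " + readFixedLength(LE_SHORT, buf, 0));
  }
}
```

   The first possibility corresponds to passing `LE_SHORT` to the fixed 
variant, the second to passing `NATIVE_SHORT`. The fixed variant costs one 
extra byte for every length below 128.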


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


