uschindler commented on PR #888: URL: https://github.com/apache/lucene/pull/888#issuecomment-1792570482
> @mikemccand: If you want to see the changes I reverted, see the above comparison: https://github.com/apache/lucene/compare/36de2bb7fa7a0587a102cf5c4d35ac8f94976bbd..c1b626c0636821f4d7c085895359489e7dfa330f
>
> Those changes need to be re-applied to the repo in the correct files (not sure where this code now lives, looks like BytesRefBlockPool, but no idea, sorry)

After looking into those changes, I think I know what the problem was. Internally, BytesRefHash uses BIG ENDIAN because some parts of the byte array use a "UTF-8 like" encoding (if the highest bit is set, another byte follows). This is a stupid thing to do, and dropping it costs only a few bytes more storage, so I removed that encoding to always use shorts instead of "byte or BE short".

Once the encoding no longer has to be "UTF-8 like", the byte order no longer matters and it can use native order. But for safety you could also use LE encoding, which matches current CPUs (ARM is also LE now).

So we have 2 possibilities:

- Change the internal encoding of BytesRefHash: remove the Big Endian "UTF-8 like" encoding (or call it vShort) and switch to Little Endian shorts.
- Use native byte order, which also helps big-endian CPUs like s390 (this also works). This PR supports this, but it is questionable for the reasons Robert said.
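To make the difference between the two layouts concrete, here is a minimal sketch. It is not the actual BytesRefHash code; the class name `LengthPrefixSketch` and all method names are hypothetical, and it assumes lengths fit in 15 bits for the variable-width form. It shows why the "byte or BE short" form is tied to big endian (the flag bit has to sit in the first byte that is read), while a fixed 2-byte short round-trips in any byte order as long as writer and reader agree.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

/** Hypothetical illustration only; names and layout are assumptions, not Lucene's code. */
final class LengthPrefixSketch {
  static final VarHandle BE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.BIG_ENDIAN);
  static final VarHandle LE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);
  static final VarHandle NATIVE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.nativeOrder());

  /**
   * "Byte or BE short" (vShort-like) prefix: one byte for lengths below 128,
   * otherwise a big-endian short with the top bit set. Returns bytes written.
   */
  static int writeVShort(byte[] buf, int off, int length) {
    if (length < 0 || length > 0x7FFF) {
      throw new IllegalArgumentException("length out of range: " + length);
    }
    if (length < 128) {
      buf[off] = (byte) length;
      return 1;
    }
    // Big endian puts the high byte (carrying the flag bit) first.
    BE_SHORT.set(buf, off, (short) (length | 0x8000));
    return 2;
  }

  /** Reads a prefix written by writeVShort; the first byte decides the width. */
  static int readVShort(byte[] buf, int off) {
    if ((buf[off] & 0x80) == 0) {
      return buf[off];
    }
    return ((short) BE_SHORT.get(buf, off)) & 0x7FFF;
  }

  /** Fixed 2-byte prefix: no flag bit, so any byte order works. Returns bytes written. */
  static int writeFixedShort(VarHandle order, byte[] buf, int off, int length) {
    if (length < 0 || length > 0xFFFF) {
      throw new IllegalArgumentException("length out of range: " + length);
    }
    order.set(buf, off, (short) length);
    return 2;
  }

  /** Reads a fixed 2-byte prefix; must use the same byte order as the writer. */
  static int readFixedShort(VarHandle order, byte[] buf, int off) {
    return ((short) order.get(buf, off)) & 0xFFFF;
  }
}
```

With the fixed form, passing `LE_SHORT` corresponds to the first option above and `NATIVE_SHORT` to the second. The trade-off in the sketch is one extra byte for every length below 128.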