mikemccand commented on issue #12355: URL: https://github.com/apache/lucene/issues/12355#issuecomment-1808084756
+1 to finding a way to reverse the bytes at compilation time. The reversal of bytes during FST compilation is so hard to think about! It happens because the FST is logically append-only and sort of grows backwards (from the suffixes, inwards onto prefixes), so newly written nodes always point backwards to the already written ones (appended to a growing `byte[]` or, soon, a `DataOutput`).

Logically we ought to be able to write all the bytes backwards and then reverse them, but then when resolving absolute or relative node addresses at FST read time, we'd need to re-reverse those addresses. Or, we could try to rewrite the embedded node address references during/after reversal so we don't need to re-reverse on each node read? The pointers would necessarily be different (take a different number of bytes after reversal), since small node addresses would become big node addresses and take more bytes to encode as absolute addresses. It might even make the FST larger, since the common suffixes today have smallish/earlyish node addresses. This is similar to what `pack` used to do (actually rewrite addresses), and it was hairy.

So maybe for starters we do the simple "reverse `byte[]` after writing them all" and then "re-reverse addresses on decode"?

I wonder if Tantivy FST has some sort of post-write-reverse step? Or does it always do cache-unfriendly backwards reads during FST traversal?
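
A minimal sketch of the simple option ("reverse `byte[]` after writing them all", then "re-reverse addresses on decode"), assuming single-byte positions only. The class and method names here are hypothetical and are not part of Lucene's FST API. A real implementation would also have to deal with multi-byte encoded values, whose bytes get flipped by the reversal too.

```java
// Hypothetical illustration only; not Lucene code.
final class ReversedAddressSketch {

    // Reverse the fully written buffer once, after compilation finishes.
    static void reverseInPlace(byte[] bytes) {
        for (int i = 0, j = bytes.length - 1; i < j; i++, j--) {
            byte tmp = bytes[i];
            bytes[i] = bytes[j];
            bytes[j] = tmp;
        }
    }

    // Map an address recorded at write time to the position of the same
    // byte in the reversed buffer. Applying it twice returns the original
    // address, so the same function serves in both directions.
    static long flipAddress(long writtenAddress, long totalLength) {
        return totalLength - 1 - writtenAddress;
    }
}
```

For example, with a 5-byte buffer, the byte written at address 1 sits at address 3 after reversal. The flip is a single subtraction per address decode, which is the cost this option accepts in exchange for not rewriting the embedded addresses.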