jpountz opened a new pull request, #875: URL: https://github.com/apache/lucene/pull/875
I benchmarked OrdinalMap construction over high-cardinality fields, and a lot of time is spent in `PriorityQueue#downHeap` due to entry comparisons. I added a small hack that speeds up these comparisons a bit by extracting the first 8 bytes of each term as a comparable unsigned long and using this long for comparisons whenever possible.

On a dataset of 100M documents and 10M unique values consisting of 16 random bytes each, OrdinalMap construction went from 9.4s to 6.0s. On the same number of docs/values where every value shares the same 8-byte prefix followed by 8 random bytes, to simulate a worst-case scenario for this change, OrdinalMap construction went from 9.6s to 10.1s. So this looks like it can yield a significant speedup in some scenarios, while the slowdown stays contained in the worst case? Unfortunately, this worst-case scenario is not exactly unlikely, e.g. it is what you would get with a dataset of IPv4-mapped IPv6 addresses, where all values share the same 12-byte prefix.
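
To illustrate the idea (this is only a rough sketch under my own naming, not the actual code in this PR; `PrefixComparator`, `prefixAsLong`, and `compare` are hypothetical), the fast path packs the first 8 bytes of a `BytesRef` into a big-endian long and compares the longs as unsigned values, falling back to a full byte comparison only when the two prefixes are equal:

```java
import org.apache.lucene.util.BytesRef;

/**
 * Illustrative sketch: resolve most term comparisons via an 8-byte prefix
 * packed into an unsigned-comparable long, falling back to a full
 * byte-by-byte comparison only when the prefixes tie.
 */
final class PrefixComparator {

  /** Pack up to the first 8 bytes of {@code term} into a big-endian long. */
  static long prefixAsLong(BytesRef term) {
    long prefix = 0;
    int n = Math.min(8, term.length);
    for (int i = 0; i < n; i++) {
      prefix |= (term.bytes[term.offset + i] & 0xFFL) << (56 - (i << 3));
    }
    return prefix;
  }

  /** Compare two terms, using their precomputed 8-byte prefixes as a fast path. */
  static int compare(BytesRef a, long aPrefix, BytesRef b, long bPrefix) {
    int cmp = Long.compareUnsigned(aPrefix, bPrefix);
    if (cmp != 0) {
      return cmp; // resolved without walking the byte arrays
    }
    // Prefixes are equal: fall back to the full comparison. This is the
    // worst case described above, e.g. values sharing a long common prefix.
    return a.compareTo(b);
  }
}
```

Because the prefix is zero-padded big-endian and compared as unsigned, a non-zero result from the long comparison always agrees with the byte-wise order, so the fallback is only needed on ties.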