rmuir opened a new issue, #12176: URL: https://github.com/apache/lucene/issues/12176
### Description

`TermInSetQuery` currently "ping-pong" intersects a sorted list of terms against the term dictionary. Instead of a sorted list, it could possibly use a Daciuk-Mihov automaton, which can be built in linear time from the already-sorted terms. The query could then leverage `Terms.intersect` (e.g. `TermInSetQuery` could be an `AutomatonQuery` subclass).

This should give faster intersection of the terms, which is usually the heavy part of this query. For example, the BlockTree terms dictionary has a very efficient `Terms.intersect` that makes use of the underlying structure.

The annoying part: `DaciukMihovAutomatonBuilder` currently requires Unicode strings and makes a UTF-32 automaton, which would then be converted to a UTF-8 (binary) automaton via `UTF32ToUTF8`. But I think `TermInSetQuery` may allow arbitrary non-Unicode binary strings? In order to support arbitrary binary terms (and to avoid conversions), the Daciuk-Mihov code would have to be modified to support construction of a binary automaton directly. Probably this is actually simpler?

This is just an idea to get more performance; it hasn't been tested. Feel free to close the issue if it doesn't work out.
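For readers unfamiliar with the "ping-pong" pattern mentioned above, here is a minimal, self-contained sketch of the idea. It is not Lucene code: the class `PingPongIntersect` and its method are hypothetical, a `TreeSet<String>` stands in for the term dictionary, and `ceiling` plays the role of a `seekCeil`-style lookup. Lucene compares terms as bytes, whereas this toy model uses Java's `String` ordering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical illustration only; not part of Lucene.
final class PingPongIntersect {

  // Intersects a sorted list of query terms with a sorted "dictionary".
  // Each side leaps forward to the other's current position in turn.
  static List<String> intersect(List<String> sortedQueryTerms,
                                NavigableSet<String> dictionary) {
    List<String> hits = new ArrayList<>();
    int i = 0;
    while (i < sortedQueryTerms.size()) {
      String wanted = sortedQueryTerms.get(i);
      // "Ping": ask the dictionary for the smallest term >= wanted.
      String found = dictionary.ceiling(wanted);
      if (found == null) {
        break; // dictionary exhausted, no further matches possible
      }
      if (found.equals(wanted)) {
        hits.add(wanted);
        i++;
      } else {
        // "Pong": skip query terms that sort before the dictionary's term.
        i++;
        while (i < sortedQueryTerms.size()
            && sortedQueryTerms.get(i).compareTo(found) < 0) {
          i++;
        }
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    NavigableSet<String> dictionary = new TreeSet<>(List.of("b", "d", "e", "f"));
    List<String> query = List.of("a", "b", "c", "d", "z");
    System.out.println(intersect(query, dictionary));
  }
}
```

Every query term that survives the skipping costs one dictionary seek here. The proposal in this issue would replace that loop with a single automaton handed to `Terms.intersect`, so that the terms dictionary itself can prune whole blocks that cannot match.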