rmuir opened a new issue, #12176:
URL: https://github.com/apache/lucene/issues/12176

   ### Description
   
   TermInSetQuery currently "ping-pong" intersects a sorted list against the 
term dictionary.
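
As a rough illustration of the "ping-pong" pattern, here is a toy model in plain Java. It is not Lucene's code: a `TreeSet<String>` stands in for the term dictionary, `ceiling` plays the role that a seek such as `TermsEnum.seekCeil` would play, and the class name `PingPongIntersect` is made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of "ping-pong" intersection; a TreeSet stands in for the term dictionary. */
final class PingPongIntersect {
  /** Returns the query terms present in the dictionary; queryTerms must be sorted ascending. */
  static List<String> intersect(TreeSet<String> dictionary, List<String> queryTerms) {
    List<String> matches = new ArrayList<>();
    int i = 0;
    while (i < queryTerms.size()) {
      String wanted = queryTerms.get(i);
      // "Ping": seek the dictionary to the smallest term >= wanted.
      String found = dictionary.ceiling(wanted);
      if (found == null) {
        break; // dictionary exhausted, no later query term can match
      }
      if (found.equals(wanted)) {
        matches.add(found);
        i++;
      } else {
        // "Pong": skip query terms that sort before the dictionary's current term.
        while (i < queryTerms.size() && queryTerms.get(i).compareTo(found) < 0) {
          i++;
        }
      }
    }
    return matches;
  }
}
```

Every query term that survives the skipping step costs one seek into the dictionary, which is the per-term work the automaton approach would hope to reduce.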
   
   Instead of a sorted list, it could possibly use a Daciuk-Mihov automaton, which 
can be built in linear time. The query could then leverage `Terms.intersect` (e.g. 
TermInSetQuery could be an AutomatonQuery subclass).
   
   This should give faster intersection of the terms, which is usually the 
heavy part of this query. For example, the BlockTree terms dictionary has a very 
efficient `Terms.intersect` that makes use of the underlying structure.
   
   The annoying part: `DaciukMihovAutomatonBuilder` currently requires Unicode 
strings and makes a UTF-32 automaton, which would then be converted to a UTF-8 
(binary) automaton via `UTF32ToUTF8`. But I think `TermInSetQuery` may allow 
arbitrary non-Unicode binary strings?
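
To make the concern concrete, the following plain-Java sketch (no Lucene involved, class and method names invented here) shows that a byte sequence which is not valid UTF-8 does not survive a trip through a Unicode string, so such a term could not be fed to a builder that only accepts Unicode input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Checks whether a binary term can pass through a Unicode string without being altered. */
final class BinaryTermRoundTrip {
  /** True if the bytes are unchanged after decoding as UTF-8 and encoding again. */
  static boolean survivesUnicodeRoundTrip(byte[] term) {
    // new String(bytes, charset) replaces malformed input with U+FFFD instead of failing.
    String decoded = new String(term, StandardCharsets.UTF_8);
    return Arrays.equals(term, decoded.getBytes(StandardCharsets.UTF_8));
  }
}
```

For example, the two-byte term `{0xFF, 0xFE}` is not valid UTF-8 and fails this check, while the UTF-8 bytes of an ordinary word pass it.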
   
   In order to support arbitrary binary terms (and to avoid conversions), the 
DaciukMihov code would have to be modified to support construction of a binary 
automaton directly. Probably this is actually simpler?
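
A minimal sketch of what direct binary construction might look like, assuming the usual incremental algorithm for sorted input (register finished states, reuse equivalent ones) with transitions labelled by unsigned bytes instead of code points. This is not Lucene's `DaciukMihovAutomatonBuilder`; `ByteDafsa` and all of its members are hypothetical names for illustration only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Minimal acyclic automaton over byte labels, built from terms in unsigned byte order. */
final class ByteDafsa {
  static final class State {
    final TreeMap<Integer, State> next = new TreeMap<>(); // label is an unsigned byte, 0..255
    boolean accept;

    // Register equivalence: same accept flag, same labels, identical (already canonical) targets.
    @Override public boolean equals(Object o) {
      if (!(o instanceof State)) return false;
      State s = (State) o;
      if (accept != s.accept || next.size() != s.next.size()) return false;
      for (Map.Entry<Integer, State> e : next.entrySet()) {
        if (s.next.get(e.getKey()) != e.getValue()) return false;
      }
      return true;
    }

    @Override public int hashCode() {
      int h = accept ? 1 : 0;
      for (Map.Entry<Integer, State> e : next.entrySet()) {
        h = 31 * h + e.getKey();
        h = 31 * h + System.identityHashCode(e.getValue());
      }
      return h;
    }
  }

  private final State root = new State();
  private final Map<State, State> register = new HashMap<>();
  private byte[] previous;

  static ByteDafsa build(List<byte[]> sortedTerms) {
    ByteDafsa a = new ByteDafsa();
    for (byte[] term : sortedTerms) a.add(term);
    if (!a.root.next.isEmpty()) a.replaceOrRegister(a.root);
    return a;
  }

  private void add(byte[] term) {
    if (previous != null && Arrays.compareUnsigned(previous, term) > 0) {
      throw new IllegalArgumentException("terms must be sorted in unsigned byte order");
    }
    previous = term;
    // Follow the prefix shared with the previously added term.
    State state = root;
    int i = 0;
    while (i < term.length) {
      State child = state.next.get(term[i] & 0xFF);
      if (child == null) break;
      state = child;
      i++;
    }
    // The previous term's suffix below this state is complete, so minimize it now.
    if (!state.next.isEmpty()) replaceOrRegister(state);
    // Append fresh states for the remaining bytes of the new term.
    for (; i < term.length; i++) {
      State fresh = new State();
      state.next.put(term[i] & 0xFF, fresh);
      state = fresh;
    }
    state.accept = true;
  }

  private void replaceOrRegister(State state) {
    Map.Entry<Integer, State> last = state.next.lastEntry();
    State child = last.getValue();
    if (!child.next.isEmpty()) replaceOrRegister(child);
    State existing = register.putIfAbsent(child, child);
    if (existing != null) state.next.put(last.getKey(), existing); // reuse the equivalent state
  }

  boolean accepts(byte[] term) {
    State s = root;
    for (byte b : term) {
      s = s.next.get(b & 0xFF);
      if (s == null) return false;
    }
    return s.accept;
  }

  /** Number of states: the root plus every registered (canonical) state. */
  int stateCount() {
    return register.size() + 1;
  }
}
```

Because labels are plain bytes, terms such as `{0x00, 0xFF}`, `{0x01, 0xFF}` and `{0xFF}` can be added without any Unicode step; in this sketch they collapse to three states since the shared `0xFF` suffix is reused. The `TreeMap` per state is chosen for brevity, not speed.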
   
   This is just an idea to get more performance; it hasn't been tested. Feel 
free to close the issue if it doesn't work out.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


