bjacobowitz commented on issue #14427:
URL: https://github.com/apache/lucene/issues/14427#issuecomment-2851859048

   On further reflection / investigation, I think updating the documentation is 
the way to go here.
   
   I tried permitting the filter fields on both sides of the presearcher 
query (in the terms clause and the filter clause). That fixes the 
correctness issue, but at a potentially unacceptable performance cost. 
See the attached patch 
[here](https://github.com/user-attachments/files/20041277/0001-Support-filter-fields-appearing-in-Lucene-Monitor-qu.patch)
 (or 
[this commit](https://github.com/bjacobowitz/lucene/commit/adf256bca6f0b3903f8d92ab9d844cbadac90257))
 for the implementation I tried out.
   
   With this (inadvisable) change, the filter field is allowed on both sides, 
and we can end up with a presearcher query like this:
   
   ```
   +((field:(test) language:(en)) __anytokenfield:__ANYTOKEN__) #(+(language:en))
   ```
   
   This presearcher query would still match a stored query involving 
`field:test` and the filter field `language:en`, but by including the language 
on both sides the presearcher may return _any_ query that includes 
`language:en` for matching. That runs the risk of entirely subverting the 
optimization of only running queries involving `field:test`.
   
   The magnitude of the performance cost here depends on how specific the 
filter field's values are. With a fairly specific filter field, where 
relatively few queries share the same value, inadvertently running all queries 
with that value incurs a relatively low cost. With a common filter field like 
`language`, where many queries share the same value, the cost of running those 
extra queries would be high, and is probably not acceptable.
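   
   As a rough illustration of that cost, here is a toy model of candidate 
selection in plain Java. It does not use Lucene and is not how the monitor is 
implemented: `PresearcherSketch`, `Stored`, `candidates` and `sampleIndex` are 
hypothetical names, each stored query is reduced to a set of indexed terms plus 
a `language` metadata value, and the `__anytokenfield` clause is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy model of presearcher candidate selection; NOT Lucene's implementation. */
public class PresearcherSketch {

    /** Hypothetical stand-in for an indexed stored query: its terms plus one filter value. */
    record Stored(String id, Set<String> terms, String language) {}

    /**
     * A stored query is selected when its filter value equals the document's
     * (the filter clause) and it shares at least one term with the document
     * (the terms clause).
     */
    static List<String> candidates(List<Stored> index, Set<String> docTerms, String docLanguage) {
        List<String> ids = new ArrayList<>();
        for (Stored q : index) {
            boolean filterMatches = q.language().equals(docLanguage);
            boolean termMatches = q.terms().stream().anyMatch(docTerms::contains);
            if (filterMatches && termMatches) {
                ids.add(q.id());
            }
        }
        return ids;
    }

    /** Made-up sample data: q2 and q3 mention the filter field in the query text itself. */
    static List<Stored> sampleIndex() {
        return List.of(
            new Stored("q1", Set.of("field:test"), "en"),
            new Stored("q2", Set.of("field:other", "language:en"), "en"),
            new Stored("q3", Set.of("language:en"), "en"),
            new Stored("q4", Set.of("field:test"), "fr"));
    }

    public static void main(String[] args) {
        List<Stored> index = sampleIndex();
        // Filter field kept out of the terms clause: only queries sharing field:test are selected.
        System.out.println(candidates(index, Set.of("field:test"), "en"));
        // Filter field also in the terms clause: every "en" query mentioning language:en is selected.
        System.out.println(candidates(index, Set.of("field:test", "language:en"), "en"));
    }
}
```

   In this sample the first call selects only `q1`, while the second also 
selects `q2` and `q3`. The more stored queries mention the same filter value, 
the more of them are selected this way.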
   
   There is a larger question around "correctness" when a filter field appears 
in the query itself, because the query could contradict the filter field's 
value from metadata! For example, if the query's filter field in metadata 
indicates `+language:en` and the query itself has a clause `-language:en`, what 
is the correct behavior? Hard to say. Allowing that field to float freely in 
the stored query doesn't really make sense and should be avoided, but I 
wouldn't say this is obvious to the user.
   
   I think it would be ideal to block the monitor from storing a query which 
contains a filter field in the query itself, to prevent the user from stumbling 
into errors like I did. To do that, though, we would potentially need to throw 
a new exception when storing queries, which runs the risk of breaking existing 
users with a new error at runtime, so I don't think it's such a clean option.
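   
   For illustration only, such a check could look roughly like the plain-Java 
sketch below. `FilterFieldGuard`, `StoredQuery` and `checkNoFilterFields` are 
hypothetical names, not Lucene Monitor API, and the sketch assumes the set of 
fields used by the query text has already been collected somehow; that step is 
left out.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical guard that rejects stored queries mentioning a filter field; not part of Lucene. */
public class FilterFieldGuard {

    /** Stand-in for a stored query: the fields its query text refers to, plus its metadata. */
    record StoredQuery(String id, Set<String> queryFields, Map<String, String> metadata) {}

    /**
     * Throws if the query text itself refers to any configured filter field.
     * This is the new runtime error that could break existing users.
     */
    static void checkNoFilterFields(StoredQuery query, Set<String> filterFields) {
        for (String field : query.queryFields()) {
            if (filterFields.contains(field)) {
                throw new IllegalArgumentException(
                    "Query '" + query.id() + "' uses filter field '" + field + "' in its query text");
            }
        }
    }

    public static void main(String[] args) {
        Set<String> filterFields = Set.of("language");
        // Filter value only in metadata: accepted.
        checkNoFilterFields(
            new StoredQuery("q1", Set.of("field"), Map.of("language", "en")), filterFields);
        // Filter field also used in the query text: rejected.
        try {
            checkNoFilterFields(
                new StoredQuery("q2", Set.of("field", "language"), Map.of("language", "en")),
                filterFields);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

   Any application that already stores such queries would start seeing this 
exception after upgrading, which is the compatibility risk described above.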
   
   In the absence of a more direct solution here, I think we should update the 
documentation around filter fields, so that anyone newly learning about them 
will also learn that they should not appear in the query. I will open a PR to 
do this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

