[jira] [Comment Edited] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

Uwe Schindler (Jira) Sat, 11 Jun 2022 10:52:04 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553131#comment-17553131
 ]


Uwe Schindler edited comment on LUCENE-10610 at 6/11/22 5:51 PM:
-----------------------------------------------------------------

I looked at the code again:
- Automaton class has no equals and no hashCode
- RunAutomaton has equals and hashCode

I don't want to refactor or decide whether equals or hashCode is needed. I would 
just make the existing hashCode bug-free: hashCode should compute its value from 
the same fields that equals compares. That would make the query cache work 
fine, which is all that is needed.
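
A minimal sketch of that idea, not the actual Lucene patch. SketchRunAutomaton 
is a hypothetical stand-in class: the fields alphabetSize, size and points come 
from the issue description, while the transitions field and the exact set of 
fields compared by equals are assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical stand-in for RunAutomaton; not the Lucene class. */
final class SketchRunAutomaton {
  final int alphabetSize;
  final int size; // number of states
  final int[] points; // character class boundaries
  final int[] transitions; // assumed flattened transition table

  SketchRunAutomaton(int alphabetSize, int size, int[] points, int[] transitions) {
    this.alphabetSize = alphabetSize;
    this.size = size;
    this.points = points;
    this.transitions = transitions;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) return true;
    if (!(obj instanceof SketchRunAutomaton)) return false;
    SketchRunAutomaton other = (SketchRunAutomaton) obj;
    return alphabetSize == other.alphabetSize
        && size == other.size
        && Arrays.equals(points, other.points)
        && Arrays.equals(transitions, other.transitions);
  }

  /** Hashes the same fields that equals compares, including array contents. */
  @Override
  public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + alphabetSize;
    result = prime * result + size;
    result = prime * result + Arrays.hashCode(points);
    result = prime * result + Arrays.hashCode(transitions);
    return result;
  }

  /** The hash from the issue description: only lengths and sizes, kept for comparison. */
  int lengthOnlyHashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + alphabetSize;
    result = prime * result + points.length;
    result = prime * result + size;
    return result;
  }
}
```

With this shape, two instances that differ only in the contents of points 
still collide under lengthOnlyHashCode(), but would normally get different 
values from hashCode(), while equal instances always hash the same.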

I do not think we need to discuss whether equals/hashCode ensures that two 
automatons are semantically equal (that is, describe state machines with the 
same behaviour). For the query cache we only need to make sure that a query 
created with the same input has a RunAutomaton that equals the one of the other 
query (I think equals already guarantees that; only hashCode is affected). We 
don't need cache hits for cases where the automaton looks different because the 
regex was different but functionally the same.

If we need it for the query cache, I think maybe the RunAutomaton should not be 
used at all by the query, and only the direct query inputs (like the regex 
string, prefix/wildcard, or fuzzy term) should be used for the Query's 
equals/hashCode.
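
As a rough illustration of that alternative, here is a sketch of a query whose 
equals/hashCode depend only on its direct inputs. SketchPrefixQuery is a 
hypothetical class and does not mirror Lucene's PrefixQuery; the field and 
prefix members are assumptions chosen for the example.

```java
import java.util.Objects;

/** Hypothetical query type; identity is defined by its inputs, not by a compiled automaton. */
final class SketchPrefixQuery {
  private final String field;
  private final String prefix;

  SketchPrefixQuery(String field, String prefix) {
    this.field = Objects.requireNonNull(field);
    this.prefix = Objects.requireNonNull(prefix);
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) return true;
    if (!(obj instanceof SketchPrefixQuery)) return false;
    SketchPrefixQuery other = (SketchPrefixQuery) obj;
    return field.equals(other.field) && prefix.equals(other.prefix);
  }

  @Override
  public int hashCode() {
    // Same inputs as equals, so equal queries always land in the same cache bucket.
    return Objects.hash(field, prefix);
  }
}
```

Two queries built from the same field and prefix are then equal and hash 
identically without any automaton being consulted.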


was (Author: thetaphi):
I looked at the code again:
- Automaton class has no equals and no hashCode
- RunAutomaton has equals and hashCode

I don't want to refactor or decide whether equals or hashCode is needed. I would 
just make the existing hashCode bug-free: hashCode should compute its value from 
the same fields that equals compares. That would make the query cache work 
fine, which is all that is needed.

I do not think we need to discuss whether equals/hashCode ensures that two 
automatons are semantically equal (that is, describe state machines with the 
same behaviour). For the query cache we only need to make sure that a query 
created with the same input has a RunAutomaton that equals the one of the other 
query (I think equals already guarantees that; only hashCode is affected). We 
don't need cache hits for cases where the automaton looks different because the 
regex was different but functionally the same.

If we need it for the query cache, I think maybe the RunAutomaton should not be 
used at all by the query, and only the direct query inputs (like the regex 
string, prefix/wildcard, or fuzzy term) should be cached.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-10610
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10610
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Tomoko Uchida
>            Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
>     final int prime = 31;
>     int result = 1;
>     result = prime * result + alphabetSize;
>     result = prime * result + points.length;
>     result = prime * result + size;
>     return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
>     PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
>     PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
>     assert q1.compiled.runAutomaton.hashCode()
>         == q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one: all callers of this {{hashCode()}} 
> take additional information into account when calculating their own hash 
> value, so there seems to be no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
