[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553063#comment-17553063
 ] 

Uwe Schindler commented on LUCENE-10610:


I think the problem may be query caching. This would mean all automatons cannot 
be cached. Equals is currently implemented for RunAutomaton, it is just that 
hashCode is inconsistent. Which is unfortunate, but not a bug. You just get 
more hash collisions.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553075#comment-17553075
 ] 

Tomoko Uchida commented on LUCENE-10610:


Sorry if I casually dag up an old discussion. I had a chance to play around 
with AutomatonQuery very recently and I'm not knowledgeable about the history 
or background around it.

I don't strongly argue that it's a bug - I marked it as "Bug"  without 
confidence since I had to choose the issue type when opening this. If it's 
expected behavior, I'm fine with it. 
All sub-classes of AutomatonQuery indirectly rely on this 
RunAutomaton#hashCode() though, they override hashCode() to take account of 
extra information (the specific fields in the sub-classes).

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553131#comment-17553131
 ] 

Uwe Schindler commented on LUCENE-10610:


I looked at the code again:
- Automaton class has no equals and no hashCode
- RunAutomaton has equals and hashCode

I don't want to refactor or decide if equals or hashCode is needed. I would 
just make the already existing hashCode bug free. hashCode should take the same 
fields to calculate the hashcode that are also used by equals. This would make 
query cache work fine, that's all needed.

I do not think we need to discuss if equals/hashCode ensures that two 
automatons are semantically equal (describe state machine with same behaviour). 
For query cache we only need to make sure that a query thats created with the 
same input has the RunAutomaton. We don't need to cache cases where the 
automaton looks different because the regex was different but functionally same.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553131#comment-17553131
 ] 

Uwe Schindler edited comment on LUCENE-10610 at 6/11/22 5:49 PM:
-

I looked at the code again:
- Automaton class has no equals and no hashCode
- RunAutomaton has equals and hashCode

I don't want to refactor or decide if equals or hashCode is needed. I would 
just make the already existing hashCode bug free. hashCode should take the same 
fields to calculate the hashcode that are also used by equals. This would make 
query cache work fine, that's all needed.

I do not think we need to discuss if equals/hashCode ensures that two 
automatons are semantically equal (describe state machine with same behaviour). 
For query cache we only need to make sure that a query thats created with the 
same input has a RunAutomaton that equals the one of other query (I think 
that's given, only hashCode). We don't need to cache cases where the automaton 
looks different because the regex was different but functionally same.

If we need it for query cache, i think maybe the RunAutomaton should not be 
used at all by the query and only the direct query inputs be cached (like regex 
string or prefix/wildcard or fuzzy term).


was (Author: thetaphi):
I looked at the code again:
- Automaton class has no equals and no hashCode
- RunAutomaton has equals and hashCode

I don't want to refactor or decide if equals or hashCode is needed. I would 
just make the already existing hashCode bug free. hashCode should take the same 
fields to calculate the hashcode that are also used by equals. This would make 
query cache work fine, that's all needed.

I do not think we need to discuss if equals/hashCode ensures that two 
automatons are semantically equal (describe state machine with same behaviour). 
For query cache we only need to make sure that a query thats created with the 
same input has the RunAutomaton. We don't need to cache cases where the 
automaton looks different because the regex was different but functionally same.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553131#comment-17553131
 ] 

Uwe Schindler edited comment on LUCENE-10610 at 6/11/22 5:51 PM:
-

I looked at the code again:
- Automaton class has no equals and no hashCode
- RunAutomaton has equals and hashCode

I don't want to refactor or decide if equals or hashCode is needed. I would 
just make the already existing hashCode bug free. hashCode should take the same 
fields to calculate the hashcode that are also used by equals. This would make 
query cache work fine, that's all needed.

I do not think we need to discuss if equals/hashCode ensures that two 
automatons are semantically equal (describe state machine with same behaviour). 
For query cache we only need to make sure that a query thats created with the 
same input has a RunAutomaton that equals the one of other query (I think 
that's given, only hashCode). We don't need to cache cases where the automaton 
looks different because the regex was different but functionally same.

If we need it for query cache, i think maybe the RunAutomaton should not be 
used at all by the query and only the direct query inputs used for the Query's 
equals/hashcode (like regex string or prefix/wildcard or fuzzy term).


was (Author: thetaphi):
I looked at the code again:
- Automaton class has no equals and no hashCode
- RunAutomaton has equals and hashCode

I don't want to refactor or decide if equals or hashCode is needed. I would 
just make the already existing hashCode bug free. hashCode should take the same 
fields to calculate the hashcode that are also used by equals. This would make 
query cache work fine, that's all needed.

I do not think we need to discuss if equals/hashCode ensures that two 
automatons are semantically equal (describe state machine with same behaviour). 
For query cache we only need to make sure that a query thats created with the 
same input has a RunAutomaton that equals the one of other query (I think 
that's given, only hashCode). We don't need to cache cases where the automaton 
looks different because the regex was different but functionally same.

If we need it for query cache, i think maybe the RunAutomaton should not be 
used at all by the query and only the direct query inputs be cached (like regex 
string or prefix/wildcard or fuzzy term).

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553133#comment-17553133
 ] 

Dawid Weiss commented on LUCENE-10610:
--

> I do not think we need to discuss if equals/hashCode ensures that two 
> automatons are semantically equal (describe state machine with same behaviour)

This is, in general, a hard problem.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553136#comment-17553136
 ] 

Tomoko Uchida commented on LUCENE-10610:


bq. I would just make the already existing hashCode bug free. hashCode should 
take the same fields to calculate the hashcode that are also used by equals. 

I think this correctly expresses my first intention when opening this, thanks 
so much...

bq. I do not think we need to discuss if equals/hashCode ensures that two 
automatons are semantically equal (describe state machine with same behaviour).

Sure, it's another problem on another level and it must not be the 
responsibility of Object#equals() and #hashCode().

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553142#comment-17553142
 ] 

Tomoko Uchida commented on LUCENE-10610:


I've been thinking about those three options, for what it's worth.

1. Do nothing
* it's not a bug, at least it's harmless
2. Fix RunAutomaton#hashCode()
* we can quickly fix to take account of all elements of the array
3. Fix AutomatonQuery and its sub-classes to not to use RunAutomaton#hashCode()
* this could help to avoid future confusion for other newcomers (someone 
like me) 

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553142#comment-17553142
 ] 

Tomoko Uchida edited comment on LUCENE-10610 at 6/11/22 6:47 PM:
-

I've been thinking about those three options, for what it's worth.

1. Do nothing
 * it's not a bug, at least it's harmless

2. Fix RunAutomaton#hashCode()
 * we can quickly fix to take account of all elements of the array

3. Fix AutomatonQuery and its sub-classes to not to use RunAutomaton#hashCode()
 * this could help to avoid future confusion for other newcomers (someone like 
me)


was (Author: tomoko uchida):
I've been thinking about those three options, for what it's worth.

1. Do nothing
* it's not a bug, at least it's harmless
2. Fix RunAutomaton#hashCode()
* we can quickly fix to take account of all elements of the array
3. Fix AutomatonQuery and its sub-classes to not to use RunAutomaton#hashCode()
* this could help to avoid future confusion for other newcomers (someone 
like me) 

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553142#comment-17553142
 ] 

Tomoko Uchida edited comment on LUCENE-10610 at 6/11/22 7:13 PM:
-

I've been thinking about those three options, for what it's worth.

1. Do nothing
 * it's not a bug, at least it's harmless

2. Fix RunAutomaton#hashCode()
 * we can quickly fix to take account of all elements of the array

3. Rewrite AutomatonQuery and its sub-classes not to use RunAutomaton#hashCode()
 * this could help to avoid future confusion for other newcomers (someone like 
me)


was (Author: tomoko uchida):
I've been thinking about those three options, for what it's worth.

1. Do nothing
 * it's not a bug, at least it's harmless

2. Fix RunAutomaton#hashCode()
 * we can quickly fix to take account of all elements of the array

3. Fix AutomatonQuery and its sub-classes to not to use RunAutomaton#hashCode()
 * this could help to avoid future confusion for other newcomers (someone like 
me)

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553142#comment-17553142
 ] 

Tomoko Uchida edited comment on LUCENE-10610 at 6/11/22 7:54 PM:
-

I've been thinking about those three options, for what it's worth.

1. Do nothing
 * it's not a bug, at least it's harmless

2. Fix RunAutomaton#hashCode()
 * we can quickly fix to take account of all elements of the array

3. Rewrite #hashCode() in AutomatonQuery and its sub-classes not to use 
RunAutomaton#hashCode()
 * this could help to avoid future confusion for other newcomers (someone like 
me)


was (Author: tomoko uchida):
I've been thinking about those three options, for what it's worth.

1. Do nothing
 * it's not a bug, at least it's harmless

2. Fix RunAutomaton#hashCode()
 * we can quickly fix to take account of all elements of the array

3. Rewrite AutomatonQuery and its sub-classes not to use RunAutomaton#hashCode()
 * this could help to avoid future confusion for other newcomers (someone like 
me)

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] shaie commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-11 Thread GitBox


shaie commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1153077923

   > I feel like having 2 `matches` functions in this case would make the API 
unnecessarily complex
   
   As I wrote before "_After we decide whether to stick w/ the long[] or byte[] 
API we'll remove the unneeded variant._". We won't have 2 APIs in the end.
   
   > [...] and then benchmark to see if the byte-based approach is _actually_ 
more optimal
   
   We will need to benchmark this of course, but I thought about this a bit and 
I feel like `long[]` will perform better for these reasons:
   
   1. The `FSM` will store a `long[]` and will iterate it instead of `byte[]`. 
Less array accesses and since it happens **for every** facet set in every 
matching docs, I feel like it's going to be more efficient than re-iterating 
the `byte[]`
   2. Likewise, `MFSC` will convert the BDV values **once** to `long[]` and 
then all `FSMs` will be given that `long[]` to match.
   
   It just feels like more optimal, unless you have a single `FSM`, but I don't 
think that will be a common case. Anyway, let's benchmark it, but with the 
analysis above, I also agree we should actually start with the `long[]` API, 
and replace it with a `byte[]` one only if actually performs better.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] shaie commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-11 Thread GitBox


shaie commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1153078375

   > pushed a change just now that shows a different way we _could_ approach 
this.
   
   If I understand your change correctly, then it creates a new `long[]` in 
each call to `matches()` right? I see two main problems here:
   
   1. It will potentially create millions of these small transient arrays if 
there are many hits and few facet sets per her.
   2. If there's more than one `FSM` then each one will decode the `byte[]` 
itself, right?
   
   Do you see any advantage of this impl over the one I pushed?
   
   BTW, we can remove the `numDims` parameter from my `matches(long[])` 
version, since `numDims == dimValues.length`, so it's redundant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org