[jira] [Created] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Tomoko Uchida (Jira)
Tomoko Uchida created LUCENE-10610:
--

 Summary: RunAutomaton#hashCode() can easily cause hash collision 
for different Automatons
 Key: LUCENE-10610
 URL: https://issues.apache.org/jira/browse/LUCENE-10610
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Tomoko Uchida


Current RunAutomaton#hashCode() is:
{code:java}
  @Override
  public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + alphabetSize;
result = prime * result + points.length;
result = prime * result + size;
return result;
  }
{code}
Since it does not take account of the contents of the {{points}} array, this 
returns the same value for different automatons when their alphabet size and 
state size are the same.

For example, this test code passes.
{code:java}
  public void testHashCode() throws IOException {
PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
assert q1.compiled.runAutomaton.hashCode() == 
q2.compiled.runAutomaton.hashCode();
  }
{code}
I suspect this is a bug?

Note that I think it's not a serious one; all callers of this {{hashCode()}} 
take account of additional information when calculating their own hash value, 
it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on pull request #953: LUCENE-10605: fix error in 32bit jvm object alignment gap calculation…

2022-06-10 Thread GitBox


mocobeta commented on PR #953:
URL: https://github.com/apache/lucene/pull/953#issuecomment-1152255099

   @wormday you are right. For your information, external contributors do not 
have to worry about backporting - it's a maintainers' work. Backporting is 
often hard for contributions when it requires additional work for various 
reasons (such as backward compatibility).
   See 
https://github.com/apache/lucene/blob/main/CONTRIBUTING.md#opening-a-pull-request
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552702#comment-17552702
 ] 

Uwe Schindler commented on LUCENE-10610:


The hashCode does not need to be unique. If 2 have same hash code the equals 
relation must be used.

But in general the hashcode should also contain the points, so yet it is a kind 
of bug - but only makes lookups ineffective. The problem is that it is 
expensive to create that from huge FSAs. Normally you should have a separate 
private field in the class marked "privat transient int hashCode" and you 
lazily cache the calculated hashcode there (works if object is unmodifiable). 
The Java String class has this to not recreate the hashCode on every call. The 
method does not need to be synchronized (although can concurrently updated), 
because the hash code is atomic size (32 bits) and it does not matter if 2 
threads recreate the hash code at same time.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552709#comment-17552709
 ] 

Tomoko Uchida commented on LUCENE-10610:


Or, it might be a cleaner approach to have {{hashCode()}} in {{Automaton}} 
class and call it from {{RunAutomaton}}?

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552710#comment-17552710
 ] 

Uwe Schindler commented on LUCENE-10610:


Yes, but Automaton should cache the hashcode.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552713#comment-17552713
 ] 

Tomoko Uchida commented on LUCENE-10610:


bq. The problem is that it is expensive to create that from huge FSAs.

Yes, I thought this could be an intentional code for performance reasons, but I 
was a bit surprised that it causes hash collisions too easily.

bq. Yes, but Automaton should cache the hashcode.

Sure, I agree with that. Maybe I'll go back here later - it might be a 
potential bug but there is no substantial effect on APIs for now I think.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552723#comment-17552723
 ] 

Tomoko Uchida commented on LUCENE-10610:


Hmm Automaton is mutable even after calling {{finishState()}}... it'd be not 
trivial to implement hashCode() for it.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552725#comment-17552725
 ] 

Uwe Schindler commented on LUCENE-10610:


But the RunAutomaton is not modifiable, right. Then maybe cache hashCode only 
there. If its only an array thats easy using Arrays#hashCode() :-)

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552728#comment-17552728
 ] 

Tomoko Uchida commented on LUCENE-10610:


Right, RunAutomaton has no setters - a cached hash value can be there; thanks 
for your suggestion.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552727#comment-17552727
 ] 

Uwe Schindler commented on LUCENE-10610:


I checked the code. If you look at equals you see the automaton is never party 
of the game. it is just referred to in the constructor and only used for some 
calculations in derived objects.

The equals of RunAutomaton only comapres the local arrays. Therefor the 
hashCode must also do this. As creating hashCodes of array is expensive, those 
should be cached.

Actually it is quite easy: either create the integer hashCode in ctor and just 
return it, or do lazy init. The arrays and bitsets in equals don't change 
anymore.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552733#comment-17552733
 ] 

Uwe Schindler commented on LUCENE-10610:


Thanks for finding this. The solution is:
- Make equals and hashCode symmetric in what it includes
- cache the hashCode for performance by either calculating it in constructor or 
do lazy init using a transient field. A SetOnce may also be a good 
candidate for this. No synchronization needed, as different threads may create 
the same cached value in parallel which won't hurt

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552733#comment-17552733
 ] 

Uwe Schindler edited comment on LUCENE-10610 at 6/10/22 12:49 PM:
--

Thanks for finding this. The solution is:
- Make equals and hashCode symmetric in what it includes
- cache the hashCode for performance by either calculating it in constructor or 
do lazy init using a transient field. A Integer object (initially null) may 
also be a good candidate for this. No synchronization needed, as different 
threads may create the same cached value in parallel which won't hurt


was (Author: thetaphi):
Thanks for finding this. The solution is:
- Make equals and hashCode symmetric in what it includes
- cache the hashCode for performance by either calculating it in constructor or 
do lazy init using a transient field. A SetOnce may also be a good 
candidate for this. No synchronization needed, as different threads may create 
the same cached value in parallel which won't hurt

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller opened a new pull request, #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

2022-06-10 Thread GitBox


gsmiller opened a new pull request, #954:
URL: https://github.com/apache/lucene/pull/954

   This PR is to migrate the facets module to using the newly-added 
`SortedSetDocValues#docValueCount()` for iteration, as described in 
LUCENE-10603. It doesn't attempt to move all `SSDV` iteration, just the facets 
module.
   
   Benchmark results show a potentially small win, and no regressions, so I 
think we should move forward with this.
   
   ```
   TaskQPS baseline  StdDevQPS candidate  
StdDevPct diff p-value
 TermDTSort   99.36 (13.9%)   93.27 
(12.9%)   -6.1% ( -28% -   24%) 0.149
  HighTermDayOfYearSort   99.67 (13.9%)   96.80 
(12.2%)   -2.9% ( -25% -   27%) 0.487
  BrowseDayOfYearSSDVFacets   14.07 (18.4%)   13.82 
(13.8%)   -1.8% ( -28% -   37%) 0.729
   HighTermTitleBDVSort  126.78 (21.9%)  124.79 
(25.4%)   -1.6% ( -40% -   58%) 0.835
 IntNRQ   73.42  (4.7%)   72.57  
(5.1%)   -1.2% ( -10% -9%) 0.454
  OrHighMed   99.39  (3.5%)   98.39  
(3.2%)   -1.0% (  -7% -5%) 0.345
   OrHighNotMed 1295.86  (3.2%) 1285.26  
(5.0%)   -0.8% (  -8% -7%) 0.535
 OrHighHigh   46.15  (3.1%)   45.78  
(3.0%)   -0.8% (  -6% -5%) 0.400
  BrowseMonthSSDVFacets   16.26 (15.1%)   16.13  
(9.7%)   -0.8% ( -22% -   28%) 0.848
  OrHighLow  970.97  (2.8%)  964.35  
(1.9%)   -0.7% (  -5% -4%) 0.375
   OrNotHighMed  945.22  (3.4%)  939.06  
(4.1%)   -0.7% (  -7% -7%) 0.582
MedTerm 2116.21  (5.1%) 2103.04  
(4.5%)   -0.6% (  -9% -9%) 0.684
   PKLookup  169.76  (3.4%)  168.71  
(3.8%)   -0.6% (  -7% -6%) 0.588
AndHighHigh   44.04  (3.0%)   43.78  
(5.5%)   -0.6% (  -8% -8%) 0.677
MedIntervalsOrdered   10.35  (5.8%)   10.31  
(5.2%)   -0.4% ( -10% -   11%) 0.820
  OrNotHighHigh 1077.87  (4.1%) 1074.45  
(5.0%)   -0.3% (  -9% -9%) 0.827
   HighTerm 2923.10  (4.7%) 2914.62  
(4.3%)   -0.3% (  -8% -9%) 0.838
LowTerm 1969.85  (4.9%) 1964.63  
(5.6%)   -0.3% ( -10% -   10%) 0.873
MedSpanNear   59.53  (2.6%)   59.38  
(3.1%)   -0.2% (  -5% -5%) 0.784
   HighIntervalsOrdered   12.23  (8.2%)   12.20  
(7.6%)   -0.2% ( -14% -   16%) 0.920
   HighSpanNear5.30  (2.7%)5.29  
(3.2%)   -0.1% (  -5% -5%) 0.902
   OrNotHighLow 1213.60  (2.8%) 1212.64  
(2.8%)   -0.1% (  -5% -5%) 0.928
LowSloppyPhrase   24.51  (3.3%)   24.49  
(3.3%)   -0.1% (  -6% -6%) 0.953
 OrHighMedDayTaxoFacets   12.99  (4.9%)   12.98  
(6.2%)   -0.1% ( -10% -   11%) 0.974
   MedTermDayTaxoFacets   23.69  (4.8%)   23.68  
(4.1%)   -0.1% (  -8% -9%) 0.971
LowIntervalsOrdered  107.55  (5.3%)  107.51  
(3.8%)   -0.0% (  -8% -9%) 0.980
   OrHighNotLow 1064.18  (4.6%) 1064.82  
(5.6%)0.1% (  -9% -   10%) 0.970
LowSpanNear  190.49  (3.1%)  190.62  
(3.9%)0.1% (  -6% -7%) 0.951
AndHighMedDayTaxoFacets   39.56  (2.1%)   39.60  
(1.6%)0.1% (  -3% -3%) 0.868
  MedPhrase  379.28  (2.1%)  379.69  
(2.5%)0.1% (  -4% -4%) 0.883
 HighPhrase  223.12  (2.5%)  223.61  
(2.8%)0.2% (  -4% -5%) 0.795
  HighTermMonthSort  121.98 (16.5%)  122.28 
(13.8%)0.2% ( -25% -   36%) 0.959
  LowPhrase   66.70  (2.8%)   66.89  
(3.9%)0.3% (  -6% -7%) 0.792
 Fuzzy1   93.42  (1.8%)   93.69  
(1.2%)0.3% (  -2% -3%) 0.556
Respell   53.91  (1.8%)   54.07  
(1.4%)0.3% (  -2% -3%) 0.552
MedSloppyPhrase   16.33  (3.2%)   16.38  
(3.3%)0.3% (  -6% -7%) 0.763
 AndHighMed   90.72  (3.2%)   91.05  
(4.0%)0.4% (  -6% -7%) 0.753
   HighSloppyPhrase   32.08  (4.7%)   32.21  
(4.4%)0.4% (  -8% -9%) 0.774
  OrHighNotHigh  895.72  (4.9%)  

[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552827#comment-17552827
 ] 

Robert Muir commented on LUCENE-10610:
--

what uses this hashcode (anything?).

Let's please not go back to trying to implement hashcode/equals in Automaton. 
There are good reasons why we don't do that.

If possible, let's try to find/prevent anything using 
RunAutomaton.hashCode/equals too. 

This class is already complex and can be resource-intensive, so I'd rather us 
try to avoid adding more of it "just because its java". I am ok with throwing 
exceptions from these methods, too.




> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552844#comment-17552844
 ] 

Tomoko Uchida commented on LUCENE-10610:


bq. Let's please not go back to trying to implement hashcode/equals in 
Automaton. There are good reasons why we don't do that.

Yes, I looked through Automaton class and understand it shouldn't be hash keys.

bq. If possible, let's try to find/prevent anything using 
RunAutomaton.hashCode/equals too.

RunAutomaton.hashCode() is (indirectly) used by AutomatonQuery.hashCode(). I'll 
check if it's possible to implement AutomatonQuery.hashCode() without depending 
CompiledAutomaton.hashCode() - it isn't called from anyware else.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-06-10 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552851#comment-17552851
 ] 

Greg Miller commented on LUCENE-10603:
--

OK, thanks [~ChrisLu]! +1 to doing this for consistency.

I took a pass at making this change within the faceting module since there are 
a number of places we rely on SSDV ordinal iteration. I figured we could 
probably tackle this change through multiple PRs, so I figured I'd lend a hand 
with faceting: https://github.com/apache/lucene/pull/954

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After SortedSetDocValues#docValueCount added since Lucene 9.2, should we 
> refactor the implementation of ords iterations using docValueCount instead of 
> NO_MORE_ORDS?
> Similar how SortedNumericDocValues did
> From 
> {code:java}
> for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord 
> = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552856#comment-17552856
 ] 

Robert Muir commented on LUCENE-10610:
--

A simple/fast improvement might be to incorporate {{Arrays.hashCode(points)}} 
into the hashCode()

This is gonna be the smallest of the arrays, its size can be roughly thought 
about as "how many unique letters/ranges are in regex string" and doesn't blow 
up big. It would differ in your example above, because of different letters 
used.

No need for caching or anything IMO, i think it would be fast enough without 
that. 

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552858#comment-17552858
 ] 

Tomoko Uchida commented on LUCENE-10610:


Ah, if my understanding is correct it seems that it's structually difficult to 
make {{AutomatonQuery.hashCode()}} not depend on RunAutomaton.hashCode() - it's 
just a kind of container for Automaton and CompiledAutomaton.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552861#comment-17552861
 ] 

Robert Muir commented on LUCENE-10610:
--

Also i honestly think the current hashcode based on points.length could be 
considered good enough for a hashcode. The example above is two exact same 
automatons (same shape), exact same number of states (same string length), 
exact same number of unique symbols/ranges (points.length). I personally feel a 
collision is OK, from automaton perspective, due to how similar they are. And 
especially since stuff like PrefixQuery doesn't even use this hashcode anyway. 

But again, if you want to make it more refined, substitute 
{{Arrays.hashCode(points)}} for {{points.length}}. It shouldnt be much slower.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #929: LUCENE-10584: Properly support #getSpecificValue for hierarchical dims in SSDV faceting

2022-06-10 Thread GitBox


mdmarshmallow commented on code in PR #929:
URL: https://github.com/apache/lucene/pull/929#discussion_r894746323


##
lucene/facet/src/java/org/apache/lucene/facet/sortedset/AbstractSortedSetDocValueFacetCounts.java:
##
@@ -74,7 +74,7 @@ public FacetResult getTopChildren(int topN, String dim, 
String... path) throws I
 
   @Override
   public Number getSpecificValue(String dim, String... path) throws 
IOException {
-if (path.length != 1) {
+if (stateConfig.getDimConfig(dim).hierarchical == false && path.length != 
1) {
   throw new IllegalArgumentException("path must be length=1");

Review Comment:
   Maybe we should update this error message? Something like `dim + "is not 
configured to be hierarchical, path must be length=1"`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mdmarshmallow commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-10 Thread GitBox


mdmarshmallow commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1152603748

   >Is this about providing enough information to optimize with KD/R-Trees? If 
that's the case, I don't think we want a matches(long[]) method right? We just 
need a way for the FSM instances to expose their bounding-boxes. I don't think 
we even need to store the original long[] array to do this.
   
   I just wanted to point out here that the reason I wanted a `long[]` API was 
to support KD/R-Trees, but if we don't need it to add support, then I think 
there is no reason for `long[]`. I feel like having 2 `matches` functions in 
this case would make the API unnecessarily complex. If the user is extending 
the FSM, then I feel like they could figure out the `byte[]` API?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on a diff in pull request #929: LUCENE-10584: Properly support #getSpecificValue for hierarchical dims in SSDV faceting

2022-06-10 Thread GitBox


gsmiller commented on code in PR #929:
URL: https://github.com/apache/lucene/pull/929#discussion_r894787988


##
lucene/facet/src/java/org/apache/lucene/facet/sortedset/AbstractSortedSetDocValueFacetCounts.java:
##
@@ -74,7 +74,7 @@ public FacetResult getTopChildren(int topN, String dim, 
String... path) throws I
 
   @Override
   public Number getSpecificValue(String dim, String... path) throws 
IOException {
-if (path.length != 1) {
+if (stateConfig.getDimConfig(dim).hierarchical == false && path.length != 
1) {
   throw new IllegalArgumentException("path must be length=1");

Review Comment:
   Good suggestion! Pushed a rev.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552909#comment-17552909
 ] 

Tomoko Uchida commented on LUCENE-10610:


I may completely miss the point so correct me if I'm wrong - but possibly does 
it make sense to make all methods that change the internal state of 
{{Automaton}}, and make it immutable (from the perspective of the outside 
package)? It looks {{Automaton.Builder}} has all the necessary methods to build 
an Automaton instance, so I wonder if it is sufficient to only expose the 
builder class. I feel a bit awkward that the built Automaton instance can be 
still modified after "finishing" by the builder.

If we make it immutable, we could safely set a (pre-computed or on-the-fly) 
hash value to it, and in return for that, we can freely remove {{hashCode()}} 
and {{equals()}} from {{CompiledAutomaton}} and {{RunAutomaton}} classes. 
Rather than having them in the classes to run actual matching operations, I 
guess it could be more natural to have it in {{Automaton}} like {{Query}} class 
- the prototype and most higher interface?

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552909#comment-17552909
 ] 

Tomoko Uchida edited comment on LUCENE-10610 at 6/10/22 6:25 PM:
-

I may completely miss the point so correct me if I'm wrong - but possibly does 
it make sense to make all methods that change the internal state of 
{{Automaton}} private, and make the class virtually immutable (from the 
perspective of the outside package)? It looks {{Automaton.Builder}} has all the 
necessary methods to build an Automaton instance, so I wonder if it is 
sufficient to only expose the builder class. I feel a bit awkward that the 
built Automaton instance can be still modified after "finishing" by the builder.

If we make it immutable, we could safely set a (pre-computed or on-the-fly) 
hash value to it, and in return for that, we can freely remove {{hashCode()}} 
and {{equals()}} from {{CompiledAutomaton}} and {{RunAutomaton}} classes. 
Rather than having them in the classes to run actual matching operations, I 
guess it could be more natural to have it in {{Automaton}} like {{Query}} class 
- the prototype and most higher interface?


was (Author: tomoko uchida):
I may completely miss the point so correct me if I'm wrong - but possibly does 
it make sense to make all methods that change the internal state of 
{{Automaton}}, and make it immutable (from the perspective of the outside 
package)? It looks {{Automaton.Builder}} has all the necessary methods to build 
an Automaton instance, so I wonder if it is sufficient to only expose the 
builder class. I feel a bit awkward that the built Automaton instance can be 
still modified after "finishing" by the builder.

If we make it immutable, we could safely set a (pre-computed or on-the-fly) 
hash value to it, and in return for that, we can freely remove {{hashCode()}} 
and {{equals()}} from {{CompiledAutomaton}} and {{RunAutomaton}} classes. 
Rather than having them in the classes to run actual matching operations, I 
guess it could be more natural to have it in {{Automaton}} like {{Query}} class 
- the prototype and most higher interface?

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-10 Thread GitBox


gsmiller commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1152648521

   > So again, purely from an API perspective, we tell the user "You give us 
long[] at indexing time, we'll give it you back at aggregation time". It's 
simple, readable, intuitive.
   
   Hmm, yeah this is a good point. +1 that it would be more intuitive for users 
to implement a long-based match method if they're writing their own `FSM` 
implementations. And +1 that it also provides a layer of separation between the 
encoding and the logic (so we could change the encoding of the point doc values 
without breaking `FSM`s).
   
   I pushed a change just now that shows a different way we _could_ approach 
this. I'm not sure I think we _should_, but wanted to float it out there to see 
what you all think. I'm also not opposed to moving forward with _only_ a 
long-based API and then benchmark to see if the byte-based approach is 
_actually_ more optimal. Honestly, that's probably the right decision here...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552943#comment-17552943
 ] 

Robert Muir commented on LUCENE-10610:
--

it is much more complicated. I really don't think we should do equals/hashcode 
on automaton, we've been down that road before and it gets bad. The problem has 
to do with things like determinization, minimization, etc.

originally equals was implemented by "sameLanguage" but it requires very 
heavy-duty stuff to compute. Otherwise, if you arent going to define it that 
way, equals is meaningless as there are zillions of possible automata that "do 
the same thing".

It isn't about mutability. I prefer for automaton to simply implement 
Object.equals/hashcode (identity) as it makes just as much sense.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552947#comment-17552947
 ] 

Robert Muir commented on LUCENE-10610:
--

and for the same reason, again, we can do something else for runautomaton too. 
Such as remove hashCode/equals methods entirely and let it be defined by 
identity. it is perfectly reasonable just like it is for automaton.

maybe some tests have to be changed or modified to use 
{{Operations.sameLanguage}} for assertions instead of {{equals()}}, that's fine.



> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] wormday commented on pull request #953: LUCENE-10605: fix error in 32bit jvm object alignment gap calculation…

2022-06-10 Thread GitBox


wormday commented on PR #953:
URL: https://github.com/apache/lucene/pull/953#issuecomment-1152830515

   @zhaih @mocobeta
   Thanks for everyone's help!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org