[jira] [Created] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
Tomoko Uchida created LUCENE-10610: -- Summary: RunAutomaton#hashCode() can easily cause hash collision for different Automatons Key: LUCENE-10610 URL: https://issues.apache.org/jira/browse/LUCENE-10610 Project: Lucene - Core Issue Type: Bug Reporter: Tomoko Uchida Current RunAutomaton#hashCode() is: {code:java} @Override public int hashCode() { final int prime = 31; int result = 1; result = prime * result + alphabetSize; result = prime * result + points.length; result = prime * result + size; return result; } {code} Since it does not take account of the contents of the {{points}} array, this returns the same value for different automatons when their alphabet size and state size are the same. For example, this test code passes. {code:java} public void testHashCode() throws IOException { PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); assert q1.compiled.runAutomaton.hashCode() == q2.compiled.runAutomaton.hashCode(); } {code} I suspect this is a bug? Note that I think it's not a serious one; all callers of this {{hashCode()}} take account of additional information when calculating their own hash value, it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on pull request #953: LUCENE-10605: fix error in 32bit jvm object alignment gap calculation…
mocobeta commented on PR #953: URL: https://github.com/apache/lucene/pull/953#issuecomment-1152255099 @wormday you are right. For your information, external contributors do not have to worry about backporting - it's a maintainers' work. Backporting is often hard for contributions when it requires additional work for various reasons (such as backward compatibility). See https://github.com/apache/lucene/blob/main/CONTRIBUTING.md#opening-a-pull-request -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552702#comment-17552702 ] Uwe Schindler commented on LUCENE-10610: The hashCode does not need to be unique. If 2 have same hash code the equals relation must be used. But in general the hashcode should also contain the points, so yet it is a kind of bug - but only makes lookups ineffective. The problem is that it is expensive to create that from huge FSAs. Normally you should have a separate private field in the class marked "privat transient int hashCode" and you lazily cache the calculated hashcode there (works if object is unmodifiable). The Java String class has this to not recreate the hashCode on every call. The method does not need to be synchronized (although can concurrently updated), because the hash code is atomic size (32 bits) and it does not matter if 2 threads recreate the hash code at same time. > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552709#comment-17552709 ] Tomoko Uchida commented on LUCENE-10610: Or, it might be a cleaner approach to have {{hashCode()}} in {{Automaton}} class and call it from {{RunAutomaton}}? > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552710#comment-17552710 ] Uwe Schindler commented on LUCENE-10610: Yes, but Automaton should cache the hashcode. > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552713#comment-17552713 ] Tomoko Uchida commented on LUCENE-10610: bq. The problem is that it is expensive to create that from huge FSAs. Yes, I thought this could be an intentional code for performance reasons, but I was a bit surprised that it causes hash collisions too easily. bq. Yes, but Automaton should cache the hashcode. Sure, I agree with that. Maybe I'll go back here later - it might be a potential bug but there is no substantial effect on APIs for now I think. > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552723#comment-17552723 ] Tomoko Uchida commented on LUCENE-10610: Hmm Automaton is mutable even after calling {{finishState()}}... it'd be not trivial to implement hashCode() for it. > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552725#comment-17552725 ] Uwe Schindler commented on LUCENE-10610: But the RunAutomaton is not modifiable, right. Then maybe cache hashCode only there. If its only an array thats easy using Arrays#hashCode() :-) > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552728#comment-17552728 ] Tomoko Uchida commented on LUCENE-10610: Right, RunAutomaton has no setters - a cached hash value can be there; thanks for your suggestion. > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552727#comment-17552727 ] Uwe Schindler commented on LUCENE-10610: I checked the code. If you look at equals you see the automaton is never party of the game. it is just referred to in the constructor and only used for some calculations in derived objects. The equals of RunAutomaton only comapres the local arrays. Therefor the hashCode must also do this. As creating hashCodes of array is expensive, those should be cached. Actually it is quite easy: either create the integer hashCode in ctor and just return it, or do lazy init. The arrays and bitsets in equals don't change anymore. > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552733#comment-17552733 ] Uwe Schindler commented on LUCENE-10610: Thanks for finding this. The solution is: - Make equals and hashCode symmetric in what it includes - cache the hashCode for performance by either calculating it in constructor or do lazy init using a transient field. A SetOnce may also be a good candidate for this. No synchronization needed, as different threads may create the same cached value in parallel which won't hurt > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552733#comment-17552733 ] Uwe Schindler edited comment on LUCENE-10610 at 6/10/22 12:49 PM: -- Thanks for finding this. The solution is: - Make equals and hashCode symmetric in what it includes - cache the hashCode for performance by either calculating it in constructor or do lazy init using a transient field. A Integer object (initially null) may also be a good candidate for this. No synchronization needed, as different threads may create the same cached value in parallel which won't hurt was (Author: thetaphi): Thanks for finding this. The solution is: - Make equals and hashCode symmetric in what it includes - cache the hashCode for performance by either calculating it in constructor or do lazy init using a transient field. A SetOnce may also be a good candidate for this. No synchronization needed, as different threads may create the same cached value in parallel which won't hurt > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller opened a new pull request, #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
gsmiller opened a new pull request, #954: URL: https://github.com/apache/lucene/pull/954 This PR is to migrate the facets module to using the newly-added `SortedSetDocValues#docValueCount()` for iteration, as described in LUCENE-10603. It doesn't attempt to move all `SSDV` iteration, just the facets module. Benchmark results show a potentially small win, and no regressions, so I think we should move forward with this. ``` TaskQPS baseline StdDevQPS candidate StdDevPct diff p-value TermDTSort 99.36 (13.9%) 93.27 (12.9%) -6.1% ( -28% - 24%) 0.149 HighTermDayOfYearSort 99.67 (13.9%) 96.80 (12.2%) -2.9% ( -25% - 27%) 0.487 BrowseDayOfYearSSDVFacets 14.07 (18.4%) 13.82 (13.8%) -1.8% ( -28% - 37%) 0.729 HighTermTitleBDVSort 126.78 (21.9%) 124.79 (25.4%) -1.6% ( -40% - 58%) 0.835 IntNRQ 73.42 (4.7%) 72.57 (5.1%) -1.2% ( -10% -9%) 0.454 OrHighMed 99.39 (3.5%) 98.39 (3.2%) -1.0% ( -7% -5%) 0.345 OrHighNotMed 1295.86 (3.2%) 1285.26 (5.0%) -0.8% ( -8% -7%) 0.535 OrHighHigh 46.15 (3.1%) 45.78 (3.0%) -0.8% ( -6% -5%) 0.400 BrowseMonthSSDVFacets 16.26 (15.1%) 16.13 (9.7%) -0.8% ( -22% - 28%) 0.848 OrHighLow 970.97 (2.8%) 964.35 (1.9%) -0.7% ( -5% -4%) 0.375 OrNotHighMed 945.22 (3.4%) 939.06 (4.1%) -0.7% ( -7% -7%) 0.582 MedTerm 2116.21 (5.1%) 2103.04 (4.5%) -0.6% ( -9% -9%) 0.684 PKLookup 169.76 (3.4%) 168.71 (3.8%) -0.6% ( -7% -6%) 0.588 AndHighHigh 44.04 (3.0%) 43.78 (5.5%) -0.6% ( -8% -8%) 0.677 MedIntervalsOrdered 10.35 (5.8%) 10.31 (5.2%) -0.4% ( -10% - 11%) 0.820 OrNotHighHigh 1077.87 (4.1%) 1074.45 (5.0%) -0.3% ( -9% -9%) 0.827 HighTerm 2923.10 (4.7%) 2914.62 (4.3%) -0.3% ( -8% -9%) 0.838 LowTerm 1969.85 (4.9%) 1964.63 (5.6%) -0.3% ( -10% - 10%) 0.873 MedSpanNear 59.53 (2.6%) 59.38 (3.1%) -0.2% ( -5% -5%) 0.784 HighIntervalsOrdered 12.23 (8.2%) 12.20 (7.6%) -0.2% ( -14% - 16%) 0.920 HighSpanNear5.30 (2.7%)5.29 (3.2%) -0.1% ( -5% -5%) 0.902 OrNotHighLow 1213.60 (2.8%) 1212.64 (2.8%) -0.1% ( -5% -5%) 0.928 LowSloppyPhrase 24.51 (3.3%) 24.49 (3.3%) -0.1% ( -6% -6%) 0.953 OrHighMedDayTaxoFacets 12.99 (4.9%) 12.98 (6.2%) -0.1% ( -10% - 11%) 0.974 MedTermDayTaxoFacets 23.69 (4.8%) 23.68 (4.1%) -0.1% ( -8% -9%) 0.971 LowIntervalsOrdered 107.55 (5.3%) 107.51 (3.8%) -0.0% ( -8% -9%) 0.980 OrHighNotLow 1064.18 (4.6%) 1064.82 (5.6%)0.1% ( -9% - 10%) 0.970 LowSpanNear 190.49 (3.1%) 190.62 (3.9%)0.1% ( -6% -7%) 0.951 AndHighMedDayTaxoFacets 39.56 (2.1%) 39.60 (1.6%)0.1% ( -3% -3%) 0.868 MedPhrase 379.28 (2.1%) 379.69 (2.5%)0.1% ( -4% -4%) 0.883 HighPhrase 223.12 (2.5%) 223.61 (2.8%)0.2% ( -4% -5%) 0.795 HighTermMonthSort 121.98 (16.5%) 122.28 (13.8%)0.2% ( -25% - 36%) 0.959 LowPhrase 66.70 (2.8%) 66.89 (3.9%)0.3% ( -6% -7%) 0.792 Fuzzy1 93.42 (1.8%) 93.69 (1.2%)0.3% ( -2% -3%) 0.556 Respell 53.91 (1.8%) 54.07 (1.4%)0.3% ( -2% -3%) 0.552 MedSloppyPhrase 16.33 (3.2%) 16.38 (3.3%)0.3% ( -6% -7%) 0.763 AndHighMed 90.72 (3.2%) 91.05 (4.0%)0.4% ( -6% -7%) 0.753 HighSloppyPhrase 32.08 (4.7%) 32.21 (4.4%)0.4% ( -8% -9%) 0.774 OrHighNotHigh 895.72 (4.9%)
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552827#comment-17552827 ] Robert Muir commented on LUCENE-10610: -- what uses this hashcode (anything?). Let's please not go back to trying to implement hashcode/equals in Automaton. There are good reasons why we don't do that. If possible, let's try to find/prevent anything using RunAutomaton.hashCode/equals too. This class is already complex and can be resource-intensive, so I'd rather us try to avoid adding more of it "just because its java". I am ok with throwing exceptions from these methods, too. > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552844#comment-17552844 ] Tomoko Uchida commented on LUCENE-10610: bq. Let's please not go back to trying to implement hashcode/equals in Automaton. There are good reasons why we don't do that. Yes, I looked through Automaton class and understand it shouldn't be hash keys. bq. If possible, let's try to find/prevent anything using RunAutomaton.hashCode/equals too. RunAutomaton.hashCode() is (indirectly) used by AutomatonQuery.hashCode(). I'll check if it's possible to implement AutomatonQuery.hashCode() without depending CompiledAutomaton.hashCode() - it isn't called from anyware else. > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552851#comment-17552851 ] Greg Miller commented on LUCENE-10603: -- OK, thanks [~ChrisLu]! +1 to doing this for consistency. I took a pass at making this change within the faceting module since there are a number of places we rely on SSDV ordinal iteration. I figured we could probably tackle this change through multiple PRs, so I figured I'd lend a hand with faceting: https://github.com/apache/lucene/pull/954 > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Priority: Trivial > Time Spent: 10m > Remaining Estimate: 0h > > After SortedSetDocValues#docValueCount added since Lucene 9.2, should we > refactor the implementation of ords iterations using docValueCount instead of > NO_MORE_ORDS? > Similar how SortedNumericDocValues did > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552856#comment-17552856 ] Robert Muir commented on LUCENE-10610: -- A simple/fast improvement might be to incorporate {{Arrays.hashCode(points)}} into the hashCode() This is gonna be the smallest of the arrays, its size can be roughly thought about as "how many unique letters/ranges are in regex string" and doesn't blow up big. It would differ in your example above, because of different letters used. No need for caching or anything IMO, i think it would be fast enough without that. > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552858#comment-17552858 ] Tomoko Uchida commented on LUCENE-10610: Ah, if my understanding is correct it seems that it's structually difficult to make {{AutomatonQuery.hashCode()}} not depend on RunAutomaton.hashCode() - it's just a kind of container for Automaton and CompiledAutomaton. > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552861#comment-17552861 ] Robert Muir commented on LUCENE-10610: -- Also i honestly think the current hashcode based on points.length could be considered good enough for a hashcode. The example above is two exact same automatons (same shape), exact same number of states (same string length), exact same number of unique symbols/ranges (points.length). I personally feel a collision is OK, from automaton perspective, due to how similar they are. And especially since stuff like PrefixQuery doesn't even use this hashcode anyway. But again, if you want to make it more refined, substitute {{Arrays.hashCode(points)}} for {{points.length}}. It shouldnt be much slower. > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #929: LUCENE-10584: Properly support #getSpecificValue for hierarchical dims in SSDV faceting
mdmarshmallow commented on code in PR #929: URL: https://github.com/apache/lucene/pull/929#discussion_r894746323 ## lucene/facet/src/java/org/apache/lucene/facet/sortedset/AbstractSortedSetDocValueFacetCounts.java: ## @@ -74,7 +74,7 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I @Override public Number getSpecificValue(String dim, String... path) throws IOException { -if (path.length != 1) { +if (stateConfig.getDimConfig(dim).hierarchical == false && path.length != 1) { throw new IllegalArgumentException("path must be length=1"); Review Comment: Maybe we should update this error message? Something like `dim + "is not configured to be hierarchical, path must be length=1"`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mdmarshmallow commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
mdmarshmallow commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1152603748 >Is this about providing enough information to optimize with KD/R-Trees? If that's the case, I don't think we want a matches(long[]) method right? We just need a way for the FSM instances to expose their bounding-boxes. I don't think we even need to store the original long[] array to do this. I just wanted to point out here that the reason I wanted a `long[]` API was to support KD/R-Trees, but if we don't need it to add support, then I think there is no reason for `long[]`. I feel like having 2 `matches` functions in this case would make the API unnecessarily complex. If the user is extending the FSM, then I feel like they could figure out the `byte[]` API? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on a diff in pull request #929: LUCENE-10584: Properly support #getSpecificValue for hierarchical dims in SSDV faceting
gsmiller commented on code in PR #929: URL: https://github.com/apache/lucene/pull/929#discussion_r894787988 ## lucene/facet/src/java/org/apache/lucene/facet/sortedset/AbstractSortedSetDocValueFacetCounts.java: ## @@ -74,7 +74,7 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I @Override public Number getSpecificValue(String dim, String... path) throws IOException { -if (path.length != 1) { +if (stateConfig.getDimConfig(dim).hierarchical == false && path.length != 1) { throw new IllegalArgumentException("path must be length=1"); Review Comment: Good suggestion! Pushed a rev. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552909#comment-17552909 ] Tomoko Uchida commented on LUCENE-10610: I may completely miss the point so correct me if I'm wrong - but possibly does it make sense to make all methods that change the internal state of {{Automaton}}, and make it immutable (from the perspective of the outside package)? It looks {{Automaton.Builder}} has all the necessary methods to build an Automaton instance, so I wonder if it is sufficient to only expose the builder class. I feel a bit awkward that the built Automaton instance can be still modified after "finishing" by the builder. If we make it immutable, we could safely set a (pre-computed or on-the-fly) hash value to it, and in return for that, we can freely remove {{hashCode()}} and {{equals()}} from {{CompiledAutomaton}} and {{RunAutomaton}} classes. Rather than having them in the classes to run actual matching operations, I guess it could be more natural to have it in {{Automaton}} like {{Query}} class - the prototype and most higher interface? > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552909#comment-17552909 ] Tomoko Uchida edited comment on LUCENE-10610 at 6/10/22 6:25 PM: - I may completely miss the point so correct me if I'm wrong - but possibly does it make sense to make all methods that change the internal state of {{Automaton}} private, and make the class virtually immutable (from the perspective of the outside package)? It looks {{Automaton.Builder}} has all the necessary methods to build an Automaton instance, so I wonder if it is sufficient to only expose the builder class. I feel a bit awkward that the built Automaton instance can be still modified after "finishing" by the builder. If we make it immutable, we could safely set a (pre-computed or on-the-fly) hash value to it, and in return for that, we can freely remove {{hashCode()}} and {{equals()}} from {{CompiledAutomaton}} and {{RunAutomaton}} classes. Rather than having them in the classes to run actual matching operations, I guess it could be more natural to have it in {{Automaton}} like {{Query}} class - the prototype and most higher interface? was (Author: tomoko uchida): I may completely miss the point so correct me if I'm wrong - but possibly does it make sense to make all methods that change the internal state of {{Automaton}}, and make it immutable (from the perspective of the outside package)? It looks {{Automaton.Builder}} has all the necessary methods to build an Automaton instance, so I wonder if it is sufficient to only expose the builder class. I feel a bit awkward that the built Automaton instance can be still modified after "finishing" by the builder. If we make it immutable, we could safely set a (pre-computed or on-the-fly) hash value to it, and in return for that, we can freely remove {{hashCode()}} and {{equals()}} from {{CompiledAutomaton}} and {{RunAutomaton}} classes. Rather than having them in the classes to run actual matching operations, I guess it could be more natural to have it in {{Automaton}} like {{Query}} class - the prototype and most higher interface? > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1152648521 > So again, purely from an API perspective, we tell the user "You give us long[] at indexing time, we'll give it you back at aggregation time". It's simple, readable, intuitive. Hmm, yeah this is a good point. +1 that it would be more intuitive for users to implement a long-based match method if they're writing their own `FSM` implementations. And +1 that it also provides a layer of separation between the encoding and the logic (so we could change the encoding of the point doc values without breaking `FSM`s). I pushed a change just now that shows a different way we _could_ approach this. I'm not sure I think we _should_, but wanted to float it out there to see what you all think. I'm also not opposed to moving forward with _only_ a long-based API and then benchmark to see if the byte-based approach is _actually_ more optimal. Honestly, that's probably the right decision here... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552943#comment-17552943 ] Robert Muir commented on LUCENE-10610: -- it is much more complicated. I really don't think we should do equals/hashcode on automaton, we've been down that road before and it gets bad. The problem has to do with things like determinization, minimization, etc. originally equals was implemented by "sameLanguage" but it requires very heavy-duty stuff to compute. Otherwise, if you arent going to define it that way, equals is meaningless as there are zillions of possible automata that "do the same thing". It isn't about mutability. I prefer for automaton to simply implement Object.equals/hashcode (identity) as it makes just as much sense. > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons
[ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552947#comment-17552947 ] Robert Muir commented on LUCENE-10610: -- and for the same reason, again, we can do something else for runautomaton too. Such as remove hashCode/equals methods entirely and let it be defined by identity. it is perfectly reasonable just like it is for automaton. maybe some tests have to be changed or modified to use {{Operations.sameLanguage}} for assertions instead of {{equals()}}, that's fine. > RunAutomaton#hashCode() can easily cause hash collision for different > Automatons > > > Key: LUCENE-10610 > URL: https://issues.apache.org/jira/browse/LUCENE-10610 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tomoko Uchida >Priority: Minor > > Current RunAutomaton#hashCode() is: > {code:java} > @Override > public int hashCode() { > final int prime = 31; > int result = 1; > result = prime * result + alphabetSize; > result = prime * result + points.length; > result = prime * result + size; > return result; > } > {code} > Since it does not take account of the contents of the {{points}} array, this > returns the same value for different automatons when their alphabet size and > state size are the same. > For example, this test code passes. > {code:java} > public void testHashCode() throws IOException { > PrefixQuery q1 = new PrefixQuery(new Term("field", "aba")); > PrefixQuery q2 = new PrefixQuery(new Term("field", "fee")); > assert q1.compiled.runAutomaton.hashCode() == > q2.compiled.runAutomaton.hashCode(); > } > {code} > I suspect this is a bug? > Note that I think it's not a serious one; all callers of this {{hashCode()}} > take account of additional information when calculating their own hash value, > it seems there is no substantial impact on higher-level APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] wormday commented on pull request #953: LUCENE-10605: fix error in 32bit jvm object alignment gap calculation…
wormday commented on PR #953: URL: https://github.com/apache/lucene/pull/953#issuecomment-1152830515 @zhaih @mocobeta Thanks for everyone's help! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org