[jira] [Updated] (LUCENE-10233) Store docIds as bitset when doc IDs are strictly sorted and dense
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-10233:
----------------------------------
    Summary: Store docIds as bitset when doc IDs are strictly sorted and dense  (was: Store docIds as bitset when leafCardinality = 1 to speed up addAll)

> Store docIds as bitset when doc IDs are strictly sorted and dense
> ------------------------------------------------------------------
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Reporter: Feng Guo
> Priority: Major
> Fix For: 9.1
>
> Attachments: SparseFixedBitSet.png
>
> Time Spent: 3h 50m
> Remaining Estimate: 0h
>
> In low cardinality points cases, id blocks will usually store doc ids that
> have the same point value, and {{intersect}} will get into {{addAll}} logic.
> If we store ids as bitset, and give the IntersectVisitor bulk visiting
> ability, we can speed up addAll because we can just execute the 'or' logic
> between the result and the block ids.
> Optimization will be triggered when the following conditions are met at the
> same time:
> # doc IDs are sorted strictly
> # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding
> too much storage)
> I mocked a field that has 10,000,000 docs per value and search it with a 1
> term PointInSetQuery, the build scorer time decreased from 151ms to 5ms.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
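The two trigger conditions in the LUCENE-10233 description can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses `java.util.BitSet` from the JDK rather than Lucene's own bit set classes, and the class and method names (`DocIdBitSetSketch`, `canStoreAsBitSet`, `toBitSet`, `addAll`) are hypothetical, not the names used in the actual patch.

```java
import java.util.BitSet;

// Hypothetical sketch; not the code from the LUCENE-10233 patch.
class DocIdBitSetSketch {

  /** Returns true if the first count ids are strictly increasing and dense enough. */
  static boolean canStoreAsBitSet(int[] docIds, int count) {
    if (count <= 0) {
      return false;
    }
    for (int i = 1; i < count; i++) {
      if (docIds[i] <= docIds[i - 1]) {
        return false; // condition 1: doc IDs must be strictly sorted
      }
    }
    // With strictly sorted ids, min is the first id and max is the last one.
    long span = (long) docIds[count - 1] - docIds[0];
    return span <= 16L * count; // condition 2: bound the storage expansion
  }

  /** Copies the first count ids into a bit set. */
  static BitSet toBitSet(int[] docIds, int count) {
    BitSet bits = new BitSet();
    for (int i = 0; i < count; i++) {
      bits.set(docIds[i]);
    }
    return bits;
  }

  /** Bulk visit: merges a whole block into the result with a single 'or'. */
  static void addAll(BitSet result, BitSet block) {
    result.or(block);
  }

  public static void main(String[] args) {
    int[] block = {10, 11, 13, 20};
    BitSet result = new BitSet();
    result.set(3);
    if (canStoreAsBitSet(block, block.length)) {
      addAll(result, toBitSet(block, block.length));
    }
    System.out.println(result);
  }
}
```

The point of the sketch is that once a block qualifies, merging it costs one word-wise `or` instead of one visitor call per doc ID.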
[jira] [Resolved] (LUCENE-10500) StringValueFacetCounts relies on sequential collection
[ https://issues.apache.org/jira/browse/LUCENE-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-10500.
-----------------------------------
    Fix Version/s: 9.2
       Resolution: Fixed

> StringValueFacetCounts relies on sequential collection
> ------------------------------------------------------
>
> Key: LUCENE-10500
> URL: https://issues.apache.org/jira/browse/LUCENE-10500
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Luca Cavanna
> Priority: Major
> Fix For: 9.2
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> We recently moved some of the facets tests to use IndexSearcher#search(Query,
> CollectorManager) providing a FacetsCollectorManager instead of a
> FacetsCollector. Whenever newIndexSearcher(IndexReader) is used in tests,
> concurrent search may now be exercised while it was not before.
> This caused some build failures on TestStringValueFacetCounts:
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
>   at __randomizedtesting.SeedInfo.seed([ED8BF8281FCE5C02:9FC7DD27AEAEEA71]:0)
>   at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.util.packed.Packed64.get(Packed64.java:81)
>   at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.OrdinalMap$2.get(OrdinalMap.java:346)
>   at org.apache.lucene.facet.StringValueFacetCounts.countOneSegment(StringValueFacetCounts.java:440)
>   at org.apache.lucene.facet.StringValueFacetCounts.count(StringValueFacetCounts.java:295)
>   at org.apache.lucene.facet.StringValueFacetCounts.<init>(StringValueFacetCounts.java:123)
>   at org.apache.lucene.facet.TestStringValueFacetCounts.checkFacetResult(TestStringValueFacetCounts.java:349)
>   at org.apache.lucene.facet.TestStringValueFacetCounts.testRandom(TestStringValueFacetCounts.java:325)
> {code}
> This looks like a real bug, as StringValueFacetCounts#countOneSegment is
> called once providing the index of the current loop instead of the ordinal
> taken from the matching hits that we are analyzing. That works fine with
> single threaded collection as we will go sequentially and the two indices
> will always be the same. With multi-threaded search, the order of the
> returned matching hits (one per segment) is not deterministic.
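The failure mode described in LUCENE-10500 can be reproduced in miniature. The sketch below is a simplified stand-in, not the StringValueFacetCounts code: `SegmentHits`, `countByLoopIndex`, and `countBySegmentOrd` are hypothetical names, and the per-segment ordinal maps are plain `int[]` arrays.

```java
import java.util.List;

// Hypothetical miniature of the bug; not the Lucene implementation.
class SegmentOrdinalSketch {

  /** Matching hits of one segment, tagged with the ordinal of that segment. */
  static final class SegmentHits {
    final int segmentOrd;
    final int[] localOrds;

    SegmentHits(int segmentOrd, int[] localOrds) {
      this.segmentOrd = segmentOrd;
      this.localOrds = localOrds;
    }
  }

  /** Buggy variant: assumes hits.get(i) always belongs to segment i. */
  static int[] countByLoopIndex(List<SegmentHits> hits, int[][] segmentToGlobal, int globalSize) {
    int[] counts = new int[globalSize];
    for (int i = 0; i < hits.size(); i++) {
      int[] toGlobal = segmentToGlobal[i]; // wrong map if hits are out of order
      for (int local : hits.get(i).localOrds) {
        counts[toGlobal[local]]++;
      }
    }
    return counts;
  }

  /** Fixed variant: looks the map up by the ordinal carried by the hits. */
  static int[] countBySegmentOrd(List<SegmentHits> hits, int[][] segmentToGlobal, int globalSize) {
    int[] counts = new int[globalSize];
    for (SegmentHits h : hits) {
      int[] toGlobal = segmentToGlobal[h.segmentOrd];
      for (int local : h.localOrds) {
        counts[toGlobal[local]]++;
      }
    }
    return counts;
  }
}
```

With sequential collection the hits arrive in segment order, so both variants agree. If the hits for segment 1 arrive before those for segment 0, the loop-index variant consults the wrong map and can run past its end, which is the same kind of `ArrayIndexOutOfBoundsException` as in the trace above.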
[jira] [Created] (LUCENE-10502) Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
Lu Xugang created LUCENE-10502:
----------------------------------
    Summary: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
    Key: LUCENE-10502
    URL: https://issues.apache.org/jira/browse/LUCENE-10502
    Project: Lucene - Core
    Issue Type: Improvement
    Affects Versions: 9.1
    Reporter: Lu Xugang

Now
[jira] [Updated] (LUCENE-10502) Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
[ https://issues.apache.org/jira/browse/LUCENE-10502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lu Xugang updated LUCENE-10502:
-------------------------------
    Description: Since at search phase, vector's all docs of all fields will be fully loaded into memory, could we use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc?  (was: Since at search phase, vector's all docs of all fields will be fully loaded into memory, could we )

> Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
> -----------------------------------------------------------------------------------
>
> Key: LUCENE-10502
> URL: https://issues.apache.org/jira/browse/LUCENE-10502
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 9.1
> Reporter: Lu Xugang
> Priority: Major
>
> Since at search phase, vector's all docs of all fields will be fully loaded
> into memory, could we use IndexedDISI to store docIds and
> DirectMonotonicWriter/Reader to handle ordToDoc?
[jira] [Updated] (LUCENE-10502) Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
[ https://issues.apache.org/jira/browse/LUCENE-10502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lu Xugang updated LUCENE-10502:
-------------------------------
    Description: Since at search phase, vector's all docs of all fields will be fully loaded into memory, could we  (was: Now )

> Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
> -----------------------------------------------------------------------------------
>
> Key: LUCENE-10502
> URL: https://issues.apache.org/jira/browse/LUCENE-10502
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 9.1
> Reporter: Lu Xugang
> Priority: Major
>
> Since at search phase, vector's all docs of all fields will be fully loaded
> into memory, could we
[jira] [Updated] (LUCENE-10502) Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
[ https://issues.apache.org/jira/browse/LUCENE-10502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lu Xugang updated LUCENE-10502:
-------------------------------
    Description: Since at search phase, vector's all docs of all fields will be fully loaded into memory, could we use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc mapping?  (was: Since at search phase, vector's all docs of all fields will be fully loaded into memory, could we use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc?)

> Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
> -----------------------------------------------------------------------------------
>
> Key: LUCENE-10502
> URL: https://issues.apache.org/jira/browse/LUCENE-10502
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 9.1
> Reporter: Lu Xugang
> Priority: Major
>
> Since at search phase, vector's all docs of all fields will be fully loaded
> into memory, could we use IndexedDISI to store docIds and
> DirectMonotonicWriter/Reader to handle ordToDoc mapping?
[GitHub] [lucene] LuXugang opened a new pull request, #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
LuXugang opened a new pull request, #792:
URL: https://github.com/apache/lucene/pull/792

   Since all docs of all vector fields are fully loaded into memory at search time, could we use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle the ordToDoc mapping?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on pull request #788: LUCENE-10500: StringValueFacetCounts to not rely on sequential collection
javanna commented on PR #788:
URL: https://github.com/apache/lucene/pull/788#issuecomment-1089977302

   oh well if I had to apologize for every bug I committed... happy to help! Also good to see that using collector managers in tests helped uncover this.
[GitHub] [lucene] mocobeta opened a new pull request, #793: LUCENE-10493: add 'backWordPos' array to JapaneseTokenizer.Position
mocobeta opened a new pull request, #793:
URL: https://github.com/apache/lucene/pull/793

   `JapaneseTokenizer.Position` and `KoreanTokenizer.Position` are almost the same except for the `backWordPos` array, which only exists in KoreanTokenizer. To factor out the Viterbi algorithm, the two `Position` classes have to be made identical, at least for the moment.

   I'm sorry that this adds an extra int array to JapaneseTokenizer (kuromoji), but I think the integration is well worth it and optimization can come later.
[jira] [Created] (LUCENE-10503) Preserve more significant bits of scores in WANDScorer
Adrien Grand created LUCENE-10503:
-------------------------------------
    Summary: Preserve more significant bits of scores in WANDScorer
    Key: LUCENE-10503
    URL: https://issues.apache.org/jira/browse/LUCENE-10503
    Project: Lucene - Core
    Issue Type: Improvement
    Reporter: Adrien Grand

WANDScorer operates on longs to avoid accuracy issues with floating-point numbers. The current process loses more accuracy bits than it could, and making it better could help skip in a few more situations.
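To make the idea of "scaling float scores to longs" concrete, here is a minimal sketch. It is not the WANDScorer code: the names `ScoreScalingSketch`, `scalingFactor`, `scaleMax`, and `scaleMin`, and the choice of 24 target bits (the width of a float mantissa including the implicit bit), are assumptions made for illustration. The sketch scales by a power of two so that normal float inputs keep their significant bits, rounds upper bounds up, and rounds lower bounds down so that the integer comparison stays conservative.

```java
// Hypothetical sketch of power-of-two score scaling; not the WANDScorer implementation.
class ScoreScalingSketch {

  /** Number of significant bits to keep (float mantissa width, assumed for this sketch). */
  static final int TARGET_BITS = 24;

  /**
   * Returns k such that a normal float f, multiplied by 2^k, lands in
   * [2^(TARGET_BITS-1), 2^TARGET_BITS).
   */
  static int scalingFactor(float f) {
    if (!(f > 0) || Float.isInfinite(f)) {
      throw new IllegalArgumentException("expected a positive, finite score, got " + f);
    }
    return TARGET_BITS - 1 - Math.getExponent(f);
  }

  /** Scales an upper bound on a score, rounding up so the bound stays valid. */
  static long scaleMax(float score, int factor) {
    return (long) Math.ceil(Math.scalb((double) score, factor));
  }

  /** Scales a minimum competitive score, rounding down so nothing competitive is skipped. */
  static long scaleMin(float score, int factor) {
    return (long) Math.floor(Math.scalb((double) score, factor));
  }
}
```

In this sketch a smaller scaling factor throws away low-order bits and widens the gap between `scaleMax` and `scaleMin`; keeping more bits narrows that gap, which is the kind of accuracy gain the issue refers to when it says better scaling "could help skip in a few more situations".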
[GitHub] [lucene] jpountz opened a new pull request, #794: LUCENE-10153: Improve accuracy of scaled scores in WANDScorer.
jpountz opened a new pull request, #794:
URL: https://github.com/apache/lucene/pull/794

   See https://issues.apache.org/jira/browse/LUCENE-10503.
[GitHub] [lucene] jpountz merged pull request #785: LUCENE-10002: move MemoryIndex to search(Query, CollectorManager)
jpountz merged PR #785:
URL: https://github.com/apache/lucene/pull/785
[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)
[ https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517958#comment-17517958 ]

ASF subversion and git services commented on LUCENE-10002:
----------------------------------------------------------

Commit 74e9716aec74e862b3073e01d3ccbccb199b41e0 in lucene's branch refs/heads/main from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=74e9716aec7 ]

LUCENE-10002: move MemoryIndex to search(Query, CollectorManager) (#785)

> Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)
> -----------------------------------------------------------------------------------------------------
>
> Key: LUCENE-10002
> URL: https://issues.apache.org/jira/browse/LUCENE-10002
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 16h 50m
> Remaining Estimate: 0h
>
> It's a bit trappy that you can create an IndexSearcher with an executor, but
> that it would always search on the caller thread when calling
> {{IndexSearcher#search(Query,Collector)}}.
> Let's remove {{IndexSearcher#search(Query,Collector)}}, point our users to
> {{IndexSearcher#search(Query,CollectorManager)}} instead, and change factory
> methods of our main collectors (e.g. {{TopScoreDocCollector#create}}) to
> return a {{CollectorManager}} instead of a {{Collector}}?
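Why a `CollectorManager` lets a search use the executor while a single `Collector` cannot may be easier to see in a stripped-down model. The interfaces below are simplified stand-ins written for this sketch, not Lucene's real `Collector` and `CollectorManager` (which work on leaf contexts and scorers); `search`, `CountingCollector`, and `CountingManager` are hypothetical names, and each "slice" is just an array of doc IDs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simplified stand-ins for illustration only; not the Lucene interfaces.
class CollectorManagerSketch {

  interface Collector {
    void collect(int doc);
  }

  interface CollectorManager<C extends Collector, T> {
    C newCollector(); // one collector per slice, so no shared mutable state

    T reduce(List<C> collectors); // merge per-slice results on the caller thread
  }

  static final class CountingCollector implements Collector {
    int count;

    @Override
    public void collect(int doc) {
      count++;
    }
  }

  static final class CountingManager implements CollectorManager<CountingCollector, Integer> {
    @Override
    public CountingCollector newCollector() {
      return new CountingCollector();
    }

    @Override
    public Integer reduce(List<CountingCollector> collectors) {
      int total = 0;
      for (CountingCollector c : collectors) {
        total += c.count;
      }
      return total;
    }
  }

  /** Collects every slice on the executor, then reduces once all slices are done. */
  static <C extends Collector, T> T search(
      int[][] slices, CollectorManager<C, T> manager, ExecutorService executor) {
    List<C> collectors = new ArrayList<>();
    List<Future<?>> futures = new ArrayList<>();
    for (int[] slice : slices) {
      C collector = manager.newCollector();
      collectors.add(collector);
      futures.add(executor.submit(() -> {
        for (int doc : slice) {
          collector.collect(doc);
        }
      }));
    }
    try {
      for (Future<?> f : futures) {
        f.get(); // also makes each collector's writes visible to this thread
      }
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    }
    return manager.reduce(collectors);
  }

  public static void main(String[] args) {
    ExecutorService executor = Executors.newFixedThreadPool(2);
    try {
      int total = search(new int[][] {{1, 2, 3}, {4, 5}}, new CountingManager(), executor);
      System.out.println("hits: " + total);
    } finally {
      executor.shutdown();
    }
  }
}
```

A lone collector would have to be shared by all slices and made thread-safe, which is presumably why the `Collector`-based method stays on the caller thread; the manager sidesteps that by handing out one collector per slice and merging afterwards.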
[GitHub] [lucene] jpountz merged pull request #787: LUCENE-10002: replace more usages of search(Query, Collector) in tests
jpountz merged PR #787:
URL: https://github.com/apache/lucene/pull/787
[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)
[ https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517962#comment-17517962 ]

ASF subversion and git services commented on LUCENE-10002:
----------------------------------------------------------

Commit 1cf1b301af050c9aaedec6bfcbaaebafa6fa3241 in lucene's branch refs/heads/main from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1cf1b301af0 ]

LUCENE-10002: replace more usages of search(Query, Collector) in tests (#787)

This commit replaces more usages of search(Query, Collector) with calling the corresponding search(Query, CollectorManager) instead. This round focuses on tests that implement custom collector, that need a corresponding collector manager.
[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)
[ https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517976#comment-17517976 ]

ASF subversion and git services commented on LUCENE-10002:
----------------------------------------------------------

Commit 37434ffb1fcaf5e7a9096b13204fd640a9c8113e in lucene's branch refs/heads/branch_9x from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=37434ffb1fc ]

LUCENE-10002: move MemoryIndex to search(Query, CollectorManager) (#785)
[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)
[ https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517977#comment-17517977 ]

ASF subversion and git services commented on LUCENE-10002:
----------------------------------------------------------

Commit ccd21fa5d9df7f2a30cd81784f49b3f08116c300 in lucene's branch refs/heads/branch_9x from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ccd21fa5d9d ]

LUCENE-10002: replace more usages of search(Query, Collector) in tests (#787)

This commit replaces more usages of search(Query, Collector) with calling the corresponding search(Query, CollectorManager) instead. This round focuses on tests that implement custom collector, that need a corresponding collector manager.
[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
[ https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517988#comment-17517988 ]

Tomoko Uchida commented on LUCENE-10493:
----------------------------------------

I'm starting this with small steps. I'll try to keep the commits self-contained, and also as small as possible for safety.

https://github.com/apache/lucene/pull/793

Let me know if there is any feedback, thanks!

> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -----------------------------------------------------------------------------
>
> Key: LUCENE-10493
> URL: https://issues.apache.org/jira/browse/LUCENE-10493
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We now have common dictionary interfaces for kuromoji and nori
> ([LUCENE-10393]). A natural question would be: is it possible to unify the
> Japanese/Korean tokenizers?
> The core methods of the two tokenizers are `parse()` and `backtrace()`, which
> calculate the minimum cost path by Viterbi search. I'd set the goal of this
> issue to factoring them out into a separate class (in analysis-common) that
> is shared between JapaneseTokenizer and KoreanTokenizer.
> The algorithm to solve the minimum cost path itself is of course
> language-agnostic, so I think it should be theoretically possible; the most
> difficult part here might be the N-best path calculation, which is supported
> only by JapaneseTokenizer and not by KoreanTokenizer.
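The language-agnostic core mentioned in the issue, a forward pass that fills in minimum costs followed by a backtrace, can be sketched in a few dozen lines. This is a toy model and not the kuromoji or nori code: `ViterbiSketch`, `Word`, `ConnectionCosts`, and `bestPath` are hypothetical names, candidate words are given up front rather than looked up in a dictionary, and N-best search is left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Toy minimum-cost path over a word lattice; not the Lucene tokenizer code.
class ViterbiSketch {

  /** A candidate word covering [start, end) with its own cost and a type id. */
  static final class Word {
    final int start, end, cost, type;

    Word(int start, int end, int cost, int type) {
      if (end <= start) {
        throw new IllegalArgumentException("a word must cover at least one position");
      }
      this.start = start;
      this.end = end;
      this.cost = cost;
      this.type = type;
    }
  }

  /** Cost of placing a word of rightType directly after a word of leftType. */
  interface ConnectionCosts {
    int get(int leftType, int rightType);
  }

  /** Returns the cheapest word sequence covering [0, length), or an empty list if none exists. */
  static List<Word> bestPath(int length, List<Word> candidates, ConnectionCosts conn) {
    List<Word> words = new ArrayList<>(candidates);
    words.sort(Comparator.comparingInt(w -> w.start)); // predecessors come first
    long[] best = new long[words.size()];
    int[] back = new int[words.size()];
    Arrays.fill(best, Long.MAX_VALUE);
    Arrays.fill(back, -1);

    // Forward pass: cheapest way to arrive at the end of each word.
    for (int i = 0; i < words.size(); i++) {
      Word w = words.get(i);
      if (w.start == 0) {
        best[i] = w.cost;
        continue;
      }
      for (int j = 0; j < i; j++) {
        Word p = words.get(j);
        if (p.end != w.start || best[j] == Long.MAX_VALUE) {
          continue; // not adjacent, or unreachable from the start
        }
        long candidate = best[j] + conn.get(p.type, w.type) + w.cost;
        if (candidate < best[i]) {
          best[i] = candidate;
          back[i] = j;
        }
      }
    }

    // Pick the cheapest word that ends exactly at the end of the input.
    int last = -1;
    for (int i = 0; i < words.size(); i++) {
      if (words.get(i).end == length
          && best[i] != Long.MAX_VALUE
          && (last == -1 || best[i] < best[last])) {
        last = i;
      }
    }

    // Backtrace: follow the back pointers and restore left-to-right order.
    List<Word> path = new ArrayList<>();
    for (int i = last; i != -1; i = back[i]) {
      path.add(words.get(i));
    }
    Collections.reverse(path);
    return path;
  }
}
```

Nothing in the sketch depends on the language of the input; only the candidate words and the connection costs do, which is consistent with the issue's argument that the search itself could live in a shared class.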
[GitHub] [lucene] mocobeta opened a new pull request, #795: LUCENE-10493: Unify TokenInfoFST in kuromoji and nori
mocobeta opened a new pull request, #795:
URL: https://github.com/apache/lucene/pull/795

   `org.apache.lucene.analysis.[ja|ko].dict.TokenInfoFST` are exactly the same except for the range of cached FST root arcs; we can safely unify the cache logic and I need this for LUCENE-10493.
[jira] [Comment Edited] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
[ https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517988#comment-17517988 ]

Tomoko Uchida edited comment on LUCENE-10493 at 4/6/22 11:42 AM:
-----------------------------------------------------------------

I'm starting this with small steps. I'll try to keep the commits self-contained, and also as small as possible for safety.

https://github.com/apache/lucene/pull/793
https://github.com/apache/lucene/pull/795

Let me know if there is any feedback, thanks!

was (Author: tomoko uchida):
I'm starting this with small steps. I'll try to keep the commits self-contained, and also as small as possible for safety.

https://github.com/apache/lucene/pull/793

Let me know if there is any feedback, thanks!
[GitHub] [lucene] wjp719 commented on pull request #786: LUCENE-10499: reduce unnecessary copy data overhead when growing array size
wjp719 commented on PR #786:
URL: https://github.com/apache/lucene/pull/786#issuecomment-1090235148

   @jpountz Hi, could you take some time to review this PR? Thanks a lot.
[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
[ https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518271#comment-17518271 ]

Tomoko Uchida commented on LUCENE-10493:
----------------------------------------

I'm trying to factor out the core algorithm from Japanese/Korean Tokenizers with the above modifications - it is still a very rough patch but anyhow, seems to work... I'd merge #793 and #795 after waiting for one or two days and then prepare the main PR.

The next step can't be small to show the full picture (creating a base `Viterbi` class in analysis-common, moving the common logic to it, and rewriting Japanese/Korean Tokenizers upon it), though, I will try to sort out the interfaces for review.
[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil
[ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518272#comment-17518272 ] Feng Guo commented on LUCENE-10315: --- Thanks [~ivera], [~jpountz] for all effort and suggestions here! FYI, here is something interesting: I tried to change {code:java} @Benchmark public void readInts24ForUtilVisitor(IntDecodeState state, Blackhole bh) { decode24(state); for (int i = 0; i < state.count; i++) { bh.consume(state.outputInts[i]); } } {code} To {code:java} @Benchmark public void readInts24ForUtilVisitorImproved(IntDecodeState state, Blackhole bh) { decode24(state); int[] ints = state.outputInts; for (int i = 0; i < state.count; i++) { bh.consume(ints[i]); } } {code} And here is the result: {code:java} Benchmark Mode Cnt Score Error Units ReadInts24Benchmark.readInts24ForUtilVisitor thrpt 10 0.776 ± 0.012 ops/us ReadInts24Benchmark.readInts24ForUtilVisitorImproved thrpt 10 0.848 ± 0.012 ops/us ReadInts24Benchmark.readInts24Visitor thrpt 10 0.786 ± 0.006 ops/us $ java -version openjdk version "17.0.2" 2022-01-18 OpenJDK Runtime Environment (build 17.0.2+8-86) OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing) {code} > Speed up BKD leaf block ids codec by a 512 ints ForUtil > --- > > Key: LUCENE-10315 > URL: https://issues.apache.org/jira/browse/LUCENE-10315 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Feng Guo >Assignee: Feng Guo >Priority: Major > Attachments: addall.svg, cpu_profile_baseline.html, > cpu_profile_path.html > > Time Spent: 6h 20m > Remaining Estimate: 0h > > Elasticsearch (which based on lucene) can automatically infers types for > users with its dynamic mapping feature. When users index some low cardinality > fields, such as gender / age / status... they often use some numbers to > represent the values, while ES will infer these fields as {{{}long{}}}, and > ES uses BKD as the index of {{long}} fields. 
When the data volume grows, > building the result set of low-cardinality fields will make the CPU usage and > load very high. > This is a flame graph we obtained from the production environment: > [^addall.svg] > It can be seen that almost all CPU is used in addAll. When we reindex > {{long}} to {{{}keyword{}}}, the cluster load and search latency are greatly > reduced ( We spent weeks of time to reindex all indices... ). I know that ES > recommended to use {{keyword}} for term/terms query and {{long}} for range > query in the document, but there are always some users who didn't realize > this and keep their habit of using sql database, or dynamic mapping > automatically selects the type for them. All in all, users won't realize that > there would be such a big difference in performance between {{long}} and > {{keyword}} fields in low cardinality fields. So from my point of view it > will make sense if we can make BKD works better for the low/medium > cardinality fields. > As far as i can see, for low cardinality fields, there are two advantages of > {{keyword}} over {{{}long{}}}: > 1. {{ForUtil}} used in {{keyword}} postings is much more efficient than BKD's > delta VInt, because its batch reading (readLongs) and SIMD decode. > 2. When the query term count is less than 16, {{TermsInSetQuery}} can lazily > materialize of its result set, and when another small result clause > intersects with this low cardinality condition, the low cardinality field can > avoid reading all docIds into memory. > This ISSUE is targeting to solve the first point. The basic idea is trying to > use a 512 ints {{ForUtil}} for BKD ids codec. I benchmarked this optimization > by mocking some random {{LongPoint}} and querying them with > {{PointInSetQuery}}. 
> *Benchmark Result*
> |doc count|field cardinality|query point|baseline QPS|candidate QPS|diff percentage|
> |1|32|1|51.44|148.26|188.22%|
> |1|32|2|26.8|101.88|280.15%|
> |1|32|4|14.04|53.52|281.20%|
> |1|32|8|7.04|28.54|305.40%|
> |1|32|16|3.54|14.61|312.71%|
> |1|128|1|110.56|350.26|216.81%|
> |1|128|8|16.6|89.81|441.02%|
> |1|128|16|8.45|48.07|468.88%|
> |1|128|32|4.2|25.35|503.57%|
> |1|128|64|2.13|13.02|511.27%|
> |1|1024|1|536.19|843.88|57.38%|
> |1|1024|8|109.71|251.89|129.60%|
> |1|1024|32|33.24|104.11|213.21%|
> |1|1024|128|8.87|30.47|243.52%|
> |1|1024|512|2.24|8.3|270.54%|
> |1|8192|1|.33|5000|50.00%|
> |1|8192|32|139.47|214.59|53.86%|
> |1|8192|128|54.59|109.23|100.09%|
> |1
[GitHub] [lucene] zhaih commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide
zhaih commented on code in PR #762: URL: https://github.com/apache/lucene/pull/762#discussion_r844154121 ## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestAlwaysRefreshDirectoryTaxonomyReader.java: ## @@ -0,0 +1,95 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.taxonomy.directory; + +import java.io.IOException; +import java.nio.file.Path; +import org.apache.lucene.facet.FacetTestCase; +import org.apache.lucene.facet.FacetsCollector; +import org.apache.lucene.facet.FacetsConfig; +import org.apache.lucene.facet.taxonomy.FacetLabel; +import org.apache.lucene.facet.taxonomy.SearcherTaxonomyManager; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.index.IndexWriterConfig; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.util.IOUtils; + +public class TestAlwaysRefreshDirectoryTaxonomyReader extends FacetTestCase { + + /** + * Tests the behavior of the {@link AlwaysRefreshDirectoryTaxonomyReader} by testing if the + * associated {@link SearcherTaxonomyManager} can successfully refresh and serve queries if the + * underlying taxonomy index is changed to an older checkpoint. 
Ideally, each checkpoint should be + * self-sufficient and should allow serving search queries when {@link + * SearcherTaxonomyManager#maybeRefresh()} is called. + * + * It does not check whether the private taxoArrays were actually recreated or no. We are + * (correctly) hiding away that complexity away from the user. + */ + public void testAlwaysRefreshDirectoryTaxonomyReader() throws IOException { +final Path taxoPath1 = createTempDir("dir1"); +final Directory dir1 = newFSDirectory(taxoPath1); +final DirectoryTaxonomyWriter tw1 = +new DirectoryTaxonomyWriter(dir1, IndexWriterConfig.OpenMode.CREATE); +tw1.addCategory(new FacetLabel("a")); +tw1.commit(); // commit1 + +final Path taxoPath2 = createTempDir("commit1"); +final Directory commit1 = newFSDirectory(taxoPath2); +// copy all index files from dir1 +for (String file : dir1.listAll()) { + commit1.copyFrom(dir1, file, file, IOContext.READ); +} + +tw1.addCategory(new FacetLabel("b")); +tw1.commit(); // commit2 +tw1.close(); + +final DirectoryReader dr1 = DirectoryReader.open(dir1); +// using a DirectoryTaxonomyReader here will cause the test to fail and throw a AIOOB exception Review Comment: I guess I would write a generic function like ``` void testCase(Function dtrProducer, Class exceptionType) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] Yuti-G commented on a diff in pull request #778: LUCENE-10495: Fix bug in TaxonomyFacets
Yuti-G commented on code in PR #778: URL: https://github.com/apache/lucene/pull/778#discussion_r844156099 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java: ## @@ -109,7 +109,7 @@ public boolean childrenLoaded() { * @lucene.experimental */ public boolean siblingsLoaded() { -return children != null; +return siblings != null; Review Comment: Hi @gsmiller, thanks for your feedback! There is only one use case that I could find where `siblingsLoaded()` and `childrenLoaded()` can return different boolean values, and I added a test for it. Please let me know if there is any question. Thanks again!
[jira] [Updated] (LUCENE-10495) Fix bug in TaxonomyFacets
[ https://issues.apache.org/jira/browse/LUCENE-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuting Gan updated LUCENE-10495: Description: Found a bug in TaxonomyFacets when trying to use the siblingsLoaded function. siblingsLoaded() should return siblings != null and it returns children != null currently. was: Found a bug in TaxonomyFacets when trying to use the siblingsLoaded function. siblingsLoaded() should return siblings != null; > Fix bug in TaxonomyFacets > - > > Key: LUCENE-10495 > URL: https://issues.apache.org/jira/browse/LUCENE-10495 > Project: Lucene - Core > Issue Type: Bug >Reporter: Yuting Gan >Priority: Minor > Attachments: Screen Shot 2022-03-30 at 8.02.15 PM.png > > Time Spent: 40m > Remaining Estimate: 0h > > Found a bug in TaxonomyFacets when trying to use the siblingsLoaded function. > siblingsLoaded() should return siblings != null and it returns children != > null currently.
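The mistake reported in LUCENE-10495 can be illustrated with a stripped-down sketch. `LazyArraysSketch` is a hypothetical class, not Lucene's `TaxonomyFacets`; it only assumes two arrays that are loaded independently, which is the one situation where a copy-pasted null check gives the wrong answer.

```java
/** Hypothetical stand-in for a class that holds two lazily loaded arrays. */
final class LazyArraysSketch {
  private int[] children; // loaded on demand
  private int[] siblings; // loaded on demand, independently of children

  void loadChildren() {
    children = new int[] {1, 2};
  }

  void loadSiblings() {
    siblings = new int[] {3};
  }

  boolean childrenLoaded() {
    return children != null;
  }

  boolean siblingsLoaded() {
    // The reported bug was the equivalent of `return children != null;` here.
    return siblings != null;
  }
}
```

With only `loadChildren()` called, the corrected `siblingsLoaded()` returns false while `childrenLoaded()` returns true; the buggy check would have reported both as loaded.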
[jira] [Updated] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris M. Hostetter updated LUCENE-10292:
Attachment: LUCENE-10292-1.patch
Status: Open (was: Open)

{quote}I originally tried to replace the "R/W" locking of SearcherManager with an AtomicReference so we wouldn't need to have any synchronization blocks in {{lookup()}} at all; but I couldn't figure out a "safe" way to do that w/o ref counting the SearcherManager ... {quote}

I don't know why it didn't occur to me yesterday, but the obvious solution to this type of situation is a {{ReadWriteLock}} ... patch updated to use a writeLock() when replacing the {{SearcherManager}} and a {{readLock()}} in {{lookup()}} (and {{getCount()}})

> AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
>
> Key: LUCENE-10292
> URL: https://issues.apache.org/jira/browse/LUCENE-10292
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Chris M. Hostetter
> Assignee: Chris M. Hostetter
> Priority: Major
> Attachments: LUCENE-10292-1.patch, LUCENE-10292.patch
>
> I'm filing this based on anecdotal information from a Solr user w/o experiencing it first hand (and I don't have a test case to demonstrate it) but based on a reading of the code the underlying problem seems self evident...
> With all other Lookup implementations I've examined, it is possible to call {{lookup()}} regardless of whether another thread is concurrently calling {{build()}}; in all cases I've seen, it is even possible to call {{lookup()}} even if {{build()}} has never been called: the result is just an "empty" {{List}}
> Typically this works because the {{build()}} method uses temporary datastructures until its "build logic" is complete, at which point it atomically replaces the datastructures used by the {{lookup()}} method. In the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method starts by closing & null'ing out the {{protected SearcherManager searcherMgr}} (which it only populates again once it's completed building up its index) and then the lookup method starts with...
> {code:java}
> if (searcherMgr == null) {
>   throw new IllegalStateException("suggester was not built");
> }
> {code}
> ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any situation where another thread may be calling {{AnalyzingInfixSuggester.build()}}
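The locking approach mentioned in this update can be sketched as follows. This is not the attached patch: `LockedSuggesterSketch` is a hypothetical class, a plain `List<String>` stands in for the `SearcherManager`, and returning an empty list before the first build is an assumption taken from the behavior the issue describes for other `Lookup` implementations.

```java
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.stream.Collectors;

/** Hypothetical suggester; the List field stands in for the SearcherManager. */
final class LockedSuggesterSketch {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private List<String> index = null; // null until the first build completes

  /** Builds the replacement without holding a lock, then swaps it in under the write lock. */
  void build(List<String> entries) {
    List<String> fresh = List.copyOf(entries); // stands in for the expensive rebuild
    lock.writeLock().lock();
    try {
      index = fresh;
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Readers share the read lock, so they never observe a half-replaced index. */
  List<String> lookup(String prefix) {
    lock.readLock().lock();
    try {
      if (index == null) {
        return List.of(); // assumption: behave like an "empty" suggester
      }
      return index.stream().filter(s -> s.startsWith(prefix)).collect(Collectors.toList());
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

Because the write lock is held only for the reference swap, concurrent `lookup()` calls are blocked for the duration of an assignment rather than for the whole rebuild.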
[GitHub] [lucene] gsmiller commented on pull request #718: LUCENE-10444: Support alternate aggregation functions in association facets
gsmiller commented on PR #718: URL: https://github.com/apache/lucene/pull/718#issuecomment-1090586015 @mikemccand or @msokolov, did either of you have additional feedback? It didn't really look like it beyond the pre-existing bug (which I've since addressed), but I wanted to check before merging to make sure. Thanks again for the reviews!
[jira] [Created] (LUCENE-10504) KnnGraphTester should use KnnVectorQuery
Michael Sokolov created LUCENE-10504: Summary: KnnGraphTester should use KnnVectorQuery Key: LUCENE-10504 URL: https://issues.apache.org/jira/browse/LUCENE-10504 Project: Lucene - Core Issue Type: Improvement Reporter: Michael Sokolov To get a more realistic picture, and to track developments in the query implementation, the tester should use that rather than implementing its own per-segment search and merging logic.
[GitHub] [lucene] msokolov opened a new pull request, #796: LUCENE-10504: KnnGraphTester to use KnnVectorQuery
msokolov opened a new pull request, #796: URL: https://github.com/apache/lucene/pull/796

This really has two changes:
1. it switches the vector searches it runs to use the Query impl, as the description says
2. it becomes a bit more clever about managing its cache of "exact" NN that are used for recall comparisons. Previously, if you changed the source data files it would still potentially re-use the cached NN file. Now it stores a hash of the file name and looks at the modification times to see if it should regenerate the NN file
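The cache check described in the second change could look roughly like the sketch below. The PR text does not give the details, so everything here is an assumption for illustration: the class `NnCacheSketch`, the `nn-<hash>.bin` naming scheme, and the rule that the cache must be at least as new as both input files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for deciding whether a cached "exact NN" file can be reused. */
final class NnCacheSketch {
  /** Derives the cache file name from a hash of the input file names (assumed scheme). */
  static Path cachePath(Path dir, Path docs, Path queries) {
    String key = docs.getFileName() + "|" + queries.getFileName();
    return dir.resolve("nn-" + Integer.toHexString(key.hashCode()) + ".bin");
  }

  /** The cache is reusable only if it exists and is at least as new as both inputs. */
  static boolean isFresh(Path cache, Path docs, Path queries) {
    try {
      if (!Files.exists(cache)) {
        return false;
      }
      long cacheTime = Files.getLastModifiedTime(cache).toMillis();
      return cacheTime >= Files.getLastModifiedTime(docs).toMillis()
          && cacheTime >= Files.getLastModifiedTime(queries).toMillis();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Hashing the names covers the case where a different data file is supplied, and the modification-time comparison covers the case where the same file is regenerated in place.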
[GitHub] [lucene] msokolov commented on pull request #786: LUCENE-10499: reduce unnecessary copy data overhead when growing array size
msokolov commented on PR #786: URL: https://github.com/apache/lucene/pull/786#issuecomment-1090789978 I don't much like the name either. I wouldn't block, but perhaps `growWithoutCopying`? or `growNoCopy`? The whole idea that we are growing the existing array is deceptive though because really we are just creating a new array (maybe), but I can't improve on the name much either.
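The distinction being named in this thread can be sketched as two helpers. `GrowSketch` and its `oversize` policy are hypothetical and are not the code under review; the sketch only shows that the no-copy variant allocates a fresh array when needed and deliberately discards the old contents.

```java
import java.util.Arrays;

/** Hypothetical helpers contrasting "grow and keep contents" with "grow without copying". */
final class GrowSketch {
  /** Returns an array of at least minSize that preserves the existing contents. */
  static int[] grow(int[] array, int minSize) {
    if (array.length >= minSize) {
      return array;
    }
    return Arrays.copyOf(array, oversize(minSize)); // allocate and copy
  }

  /** Returns an array of at least minSize; old contents are NOT preserved if it reallocates. */
  static int[] growNoCopy(int[] array, int minSize) {
    if (array.length >= minSize) {
      return array;
    }
    return new int[oversize(minSize)]; // allocate only, the caller overwrites everything
  }

  /** Assumed over-allocation policy for illustration: one eighth of extra headroom. */
  private static int oversize(int minSize) {
    return minSize + (minSize >>> 3);
  }
}
```

The no-copy variant is only safe when the caller is about to overwrite the whole array, which is why the comment above notes that "growing" is a somewhat deceptive name for it.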
[GitHub] [lucene] jtibshirani merged pull request #789: Add release wizard step around build failures
jtibshirani merged PR #789: URL: https://github.com/apache/lucene/pull/789
[GitHub] [lucene] jtibshirani commented on a diff in pull request #796: LUCENE-10504: KnnGraphTester to use KnnVectorQuery
jtibshirani commented on code in PR #796: URL: https://github.com/apache/lucene/pull/796#discussion_r844431197 ## lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java: ## @@ -362,18 +367,19 @@ private void testSearch(Path indexPath, Path queryPath, Path outputPath, int[][] long cpuTimeStartNs; try (Directory dir = FSDirectory.open(indexPath); DirectoryReader reader = DirectoryReader.open(dir)) { +IndexSearcher searcher = new IndexSearcher(reader); numDocs = reader.maxDoc(); for (int i = 0; i < warmCount; i++) { // warm up targets.get(target); - results[i] = doKnnSearch(reader, KNN_FIELD, target, topK, fanout); + doKnnSearch(reader, KNN_FIELD, target, topK, fanout); Review Comment: Would it be fine to use `doKnnVectorQuery` here so we could delete `doKnnSearch`? ## lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java: ## @@ -349,8 +353,9 @@ private void testSearch(Path indexPath, Path queryPath, Path outputPath, int[][] TopDocs[] results = new TopDocs[numIters]; long elapsed, totalCpuTime, totalVisited = 0; try (FileChannel q = FileChannel.open(queryPath)) { + int bufferSize = Math.max(numIters, warmCount) * dim * Float.BYTES; Review Comment: Maybe we could just assert warmCount < numIters, seems unusual to warm up with queries that you don't use in the benchmark?
[GitHub] [lucene] gf2121 opened a new pull request, #797: LUCENE-10315: Speed up DocIdsWriter by ForUtil
gf2121 opened a new pull request, #797: URL: https://github.com/apache/lucene/pull/797 https://issues.apache.org/jira/browse/LUCENE-10315
[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil
[ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518473#comment-17518473 ]

Feng Guo commented on LUCENE-10315:
---

Here is the benchmark result I got on my machine by [https://github.com/iverase/benchmark_forutil].

{code:java}
Benchmark                                            Mode  Cnt   Score   Error   Units
ReadInts24Benchmark.readInts24ForUtil               thrpt   25   9.086 ± 0.089  ops/us
ReadInts24Benchmark.readInts24ForUtilVisitor        thrpt   25   0.764 ± 0.005  ops/us
ReadInts24Benchmark.readInts24Legacy                thrpt   25   2.877 ± 0.013  ops/us
ReadInts24Benchmark.readInts24Visitor               thrpt   25   0.778 ± 0.006  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong1         thrpt   25   3.329 ± 0.023  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong2         thrpt   25   3.218 ± 0.037  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong3         thrpt   25   3.755 ± 0.017  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong4         thrpt   25   3.862 ± 0.025  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor1  thrpt   25   0.710 ± 0.008  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor2  thrpt   25   0.849 ± 0.013  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor3  thrpt   25   0.804 ± 0.006  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor4  thrpt   25   0.768 ± 0.007  ops/us
ReadIntsBenchmark.readIntsForUtil                   thrpt   25  18.957 ± 0.194  ops/us
ReadIntsBenchmark.readIntsForUtilVisitor            thrpt   25   0.817 ± 0.004  ops/us
ReadIntsBenchmark.readIntsLegacy                    thrpt   25   2.456 ± 0.016  ops/us
ReadIntsBenchmark.readIntsLegacyVisitor             thrpt   25   0.608 ± 0.007  ops/us
{code}

In this result, I'm seeing {{readInts24ForUtil}} run 3 times faster than {{readInts24Legacy}}. This speed is attractive to me, so I'm trying to find some way to solve the regression when calling the visitor. One way I'm thinking about is to introduce {{visit(int[] docs, int count)}} for {{IntersectVisitor}}. The benefits of this method:
1. It can help reduce the number of virtual function calls.
2. {{BufferAdder}} can directly use {{System#arraycopy}} to append doc ids.
3. {{InverseIntersectVisitor}} can count cost faster.

Based on luceneutil, I reproduced the regression successfully on my local machine by nightly benchmark tasks and random seed = 10:

{code:java}
Task    QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff               p-value
IntNRQ  27.43         (1.8%)  24.12                    (1.1%)  -12.1% ( -14% - -9%)   0.000
{code}

After the optimization, I can see the speed up with the same seed:

{code:java}
Task    QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff               p-value
IntNRQ  27.68         (1.7%)  31.89                    (2.0%)  15.2% ( 11% - 19%)     0.000
{code}

I posted the draft code here: [https://github.com/apache/lucene/pull/797]. This commit [https://github.com/apache/lucene/pull/797/commits/7fb6ac3f5901a29d87e9fa427ba429d1e1749b14] shows what was changed.

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> ---
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Feng Guo
> Assignee: Feng Guo
> Priority: Major
> Attachments: addall.svg, cpu_profile_baseline.html, cpu_profile_path.html
> Time Spent: 6.5h
> Remaining Estimate: 0h
>
> Elasticsearch (which is based on Lucene) can automatically infer types for users with its dynamic mapping feature. When users index some low cardinality fields, such as gender / age / status... they often use some numbers to represent the values, while ES will infer these fields as {{long}}, and ES uses BKD as the index of {{long}} fields. When the data volume grows, building the result set of low-cardinality fields will make the CPU usage and load very high.
> This is a flame graph we obtained from the production environment:
> [^addall.svg]
> It can be seen that almost all CPU is used in addAll. When we reindex {{long}} to {{keyword}}, the cluster load and search latency are greatly reduced (we spent weeks reindexing all indices...). I know that ES recommends using {{keyword}} for term/terms queries and {{long}} for range queries in its documentation, but there are always some users who don't realize this and keep their habits from SQL databases, or dynamic mapping automatically selects the type for them. All in all, users won't realize that there would be
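The bulk-visit idea proposed in this comment can be sketched independently of Lucene. The `DocVisitor` interface and `BufferAdder` class below are hypothetical stand-ins, not the code in PR #797; they assume only what the comment states, namely a `visit(int[] docs, int count)` method whose buffer-backed implementation appends with `System.arraycopy`.

```java
import java.util.Arrays;

final class BulkVisitSketch {
  /** Hypothetical visitor; the bulk method defaults to one call per doc id. */
  interface DocVisitor {
    void visit(int docID);

    default void visit(int[] docs, int count) {
      for (int i = 0; i < count; i++) {
        visit(docs[i]);
      }
    }
  }

  /** Buffer-backed collector that replaces the per-doc loop with a single array copy. */
  static final class BufferAdder implements DocVisitor {
    int[] buffer = new int[16];
    int size;

    @Override
    public void visit(int docID) {
      ensureCapacity(1);
      buffer[size++] = docID;
    }

    @Override
    public void visit(int[] docs, int count) {
      ensureCapacity(count);
      System.arraycopy(docs, 0, buffer, size, count); // one bulk append per block
      size += count;
    }

    private void ensureCapacity(int extra) {
      if (size + extra > buffer.length) {
        buffer = Arrays.copyOf(buffer, Math.max(buffer.length * 2, size + extra));
      }
    }
  }
}
```

With a default method on the interface, existing visitors keep working unchanged, while implementations that can do better override the bulk path and receive one virtual call per decoded block instead of one per doc id.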
[GitHub] [lucene] msokolov commented on a diff in pull request #796: LUCENE-10504: KnnGraphTester to use KnnVectorQuery
msokolov commented on code in PR #796: URL: https://github.com/apache/lucene/pull/796#discussion_r89176 ## lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java: ## @@ -362,18 +367,19 @@ private void testSearch(Path indexPath, Path queryPath, Path outputPath, int[][] long cpuTimeStartNs; try (Directory dir = FSDirectory.open(indexPath); DirectoryReader reader = DirectoryReader.open(dir)) { +IndexSearcher searcher = new IndexSearcher(reader); numDocs = reader.maxDoc(); for (int i = 0; i < warmCount; i++) { // warm up targets.get(target); - results[i] = doKnnSearch(reader, KNN_FIELD, target, topK, fanout); + doKnnSearch(reader, KNN_FIELD, target, topK, fanout); Review Comment: Yes, that makes sense, I don't see why not
[GitHub] [lucene] msokolov commented on a diff in pull request #796: LUCENE-10504: KnnGraphTester to use KnnVectorQuery
msokolov commented on code in PR #796: URL: https://github.com/apache/lucene/pull/796#discussion_r844451157 ## lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java: ## @@ -349,8 +353,9 @@ private void testSearch(Path indexPath, Path queryPath, Path outputPath, int[][] TopDocs[] results = new TopDocs[numIters]; long elapsed, totalCpuTime, totalVisited = 0; try (FileChannel q = FileChannel.open(queryPath)) { + int bufferSize = Math.max(numIters, warmCount) * dim * Float.BYTES; Review Comment: Yeah, I think we can simply replace warmCount with numIters
[GitHub] [lucene] gsmiller merged pull request #718: LUCENE-10444: Support alternate aggregation functions in association facets
gsmiller merged PR #718: URL: https://github.com/apache/lucene/pull/718
[jira] [Commented] (LUCENE-10444) Support alternate aggregation functions in association facets
[ https://issues.apache.org/jira/browse/LUCENE-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518474#comment-17518474 ] ASF subversion and git services commented on LUCENE-10444: -- Commit f870edf2fe26cffcd4bcddc760b8436c13424103 in lucene's branch refs/heads/main from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f870edf2fe2 ] LUCENE-10444: Support alternate aggregation functions in association facets (#718) > Support alternate aggregation functions in association facets > - > > Key: LUCENE-10444 > URL: https://issues.apache.org/jira/browse/LUCENE-10444 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Minor > Time Spent: 2h > Remaining Estimate: 0h > > We currently only support {{sum}} aggregations in the various association > facet implementations. I'd be really interested in extending the association > facet implementations to support other aggregations, starting with {{max}} > and {{min}} (in addition to {{sum}}). > I've been sketching up a prototype of this and I think I have a reasonable > way to introduce this idea. Will get a PR out for feedback soon.
[GitHub] [lucene] mikemccand commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch deci
mikemccand commented on code in PR #762: URL: https://github.com/apache/lucene/pull/762#discussion_r844479016 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/AlwaysRefreshDirectoryTaxonomyReader.java: ## @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.taxonomy.directory; + +import java.io.IOException; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.store.Directory; +import org.apache.lucene.util.IOUtils; + +/** + * A modified DirectoryTaxonomyReader that always recreates a new {@link + * AlwaysRefreshDirectoryTaxonomyReader} instance when {@link + * AlwaysRefreshDirectoryTaxonomyReader#doOpenIfChanged()} is called. This enables us to easily go + * forward or backward in time by re-computing the ordinal space during each refresh. Review Comment: Hmm in the previous revision was this a test-only class? I'm nervous about making this available in the `facet` jar -- this class should only be used in exceptional cases, while most (normal) cases should use the normal `DTR` that optimizes the very common "refresh forward" case. 
Can we maybe make `DTR` smarter to detect when a "roll backwards / revert" situation is happening on refresh? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide
zhaih commented on code in PR #762: URL: https://github.com/apache/lucene/pull/762#discussion_r844486980 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/AlwaysRefreshDirectoryTaxonomyReader.java: ## @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.taxonomy.directory; + +import java.io.IOException; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.store.Directory; +import org.apache.lucene.util.IOUtils; + +/** + * A modified DirectoryTaxonomyReader that always recreates a new {@link + * AlwaysRefreshDirectoryTaxonomyReader} instance when {@link + * AlwaysRefreshDirectoryTaxonomyReader#doOpenIfChanged()} is called. This enables us to easily go + * forward or backward in time by re-computing the ordinal space during each refresh. Review Comment: Yeah I guess that's possible, right now we have a single `EPOCH` to represent the version of taxonomy index. And `EPOCH` is increased only when the taxonomy index was recreated. I think we might be able to further define `VERSION` and it will be increased at each commit, and reset to 0 when `EPOCH` is bumped. 
So that then we can inherit an index iff `EPOCH` is the same and `VERSION` is newer?
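The inheritance rule proposed in this comment can be sketched as below. This is only an illustration of the idea: `TaxoStamp` and `canInherit` are hypothetical names, and the `VERSION` counter does not exist in the taxonomy index today.

```java
// Hypothetical sketch of the EPOCH/VERSION rule; not part of DirectoryTaxonomyReader.
final class TaxoStamp {
  final long epoch;   // bumped only when the taxonomy index is recreated
  final long version; // assumed: bumped on each commit, reset to 0 when epoch is bumped

  TaxoStamp(long epoch, long version) {
    this.epoch = epoch;
    this.version = version;
  }

  /**
   * Returns true iff a reader opened at {@code current} could inherit its
   * in-memory state when refreshing to {@code target}: same epoch and a
   * strictly newer version. Anything else (rollback, recreated index)
   * would require recomputing the ordinal space.
   */
  static boolean canInherit(TaxoStamp current, TaxoStamp target) {
    return current.epoch == target.epoch && target.version > current.version;
  }
}
```

Under this rule a "roll backwards" refresh (same `EPOCH`, older `VERSION`) and a recreated index (different `EPOCH`) both fall through to a full rebuild.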
[GitHub] [lucene] gautamworah96 commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch d
gautamworah96 commented on code in PR #762: URL: https://github.com/apache/lucene/pull/762#discussion_r844502429 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/AlwaysRefreshDirectoryTaxonomyReader.java: ## @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.taxonomy.directory; + +import java.io.IOException; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.store.Directory; +import org.apache.lucene.util.IOUtils; + +/** + * A modified DirectoryTaxonomyReader that always recreates a new {@link + * AlwaysRefreshDirectoryTaxonomyReader} instance when {@link + * AlwaysRefreshDirectoryTaxonomyReader#doOpenIfChanged()} is called. This enables us to easily go + * forward or backward in time by re-computing the ordinal space during each refresh. Review Comment: ++ I separated it to a different class because in an earlier [comment](https://github.com/apache/lucene/pull/762#discussion_r840008690) we thought that it could be a good idea to directly expose it to our users. Let's keep it as a test class for now. 
We can explore the idea of making `DTR` smarter at detecting a "roll backwards / revert" situation in another issue (we will also have to think about how to handle older indexes, etc.). One thing at a time. I'll revert it back to a test class.
[jira] [Updated] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter updated LUCENE-10292: Attachment: LUCENE-10292-2.patch Status: Open (was: Open) I refactored the test code so that it could be applied to all other {{Lookup}} impls (that have a {{build()}} method) and found that while none of the other impls had the same problem of {{.lookup()}} failing to return suggestions during a (re)build, a few FST based {{Lookup}}s have {{getCount()}} impls that return results that are inconsistent with {{.lookup()}} due to incrementing a {{count}} variable gradually during {{build()}}. This latest patch (in addition to the expanded testing) fixes those {{build()}} methods to update their {{count}} value only after replacing the {{fst}} in use. > AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() > > > Key: LUCENE-10292 > URL: https://issues.apache.org/jira/browse/LUCENE-10292 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, > LUCENE-10292.patch > > > I'm filing this based on anecdotal information from a Solr user w/o > experiencing it first hand (and I don't have a test case to demonstrate it) > but based on a reading of the code the underlying problem seems self > evident... > With all other Lookup implementations I've examined, it is possible to call > {{lookup()}} regardless of whether another thread is concurrently calling > {{build()}} – in all cases I've seen, it is even possible to call > {{lookup()}} even if {{build()}} has never been called: the result is just an > "empty" {{List}} > Typically this works because the {{build()}} method uses temporary > datastructures until its "build logic" is complete, at which point it > atomically replaces the datastructures used by the {{lookup()}} method. 
In > the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method > starts by closing & null'ing out the {{protected SearcherManager > searcherMgr}} (which it only populates again once it's completed building up > its index) and then the lookup method starts with... > {code:java} > if (searcherMgr == null) { > throw new IllegalStateException("suggester was not built"); > } > {code} > ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any > situation where another thread may be calling > {{AnalyzingInfixSuggester.build()}}
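The "build into temporary structures, then swap atomically" pattern described in this issue can be sketched as follows. This is a minimal illustration, not the code from the attached patches: `SwappingSuggester` and `Snapshot` are hypothetical stand-ins for a {{Lookup}} impl, and a plain list replaces the FST or index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Lookup implementation; not Lucene's API.
final class SwappingSuggester {
  /** Immutable snapshot: the entries and their count are published together. */
  private static final class Snapshot {
    final List<String> entries;
    final int count;

    Snapshot(List<String> entries) {
      this.entries = List.copyOf(entries);
      this.count = this.entries.size();
    }
  }

  // volatile so that readers only ever observe a fully built snapshot
  private volatile Snapshot current = new Snapshot(List.of());

  /** Builds into a temporary list, then publishes it with a single write. */
  void build(Iterable<String> input) {
    List<String> tmp = new ArrayList<>();
    for (String s : input) {
      tmp.add(s);
    }
    current = new Snapshot(tmp); // lookup() never sees a half-built state
  }

  /** Safe before, during and after build(); returns an empty list if never built. */
  List<String> lookup(String prefix) {
    Snapshot snap = current; // read the reference once
    List<String> out = new ArrayList<>();
    for (String e : snap.entries) {
      if (e.startsWith(prefix)) {
        out.add(e);
      }
    }
    return out;
  }

  /** Always consistent with what lookup() can return from the same snapshot. */
  int getCount() {
    return current.count;
  }
}
```

Keeping the count inside the snapshot is one way to avoid the {{getCount()}} inconsistency mentioned in the comment above, since the count can never run ahead of the data that {{lookup()}} reads.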
[GitHub] [lucene] wjp719 commented on pull request #786: LUCENE-10499: reduce unnecessary copy data overhead when growing array size
wjp719 commented on PR #786: URL: https://github.com/apache/lucene/pull/786#issuecomment-1091011881 @msokolov @jpountz Thank you for your comments. I renamed the method to `ArrayUtil#growNoCopy`.
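To illustrate the idea behind the PR title (avoid copying data when the old contents are not needed after growing), here is a minimal sketch. The class `GrowSketch`, the signatures and the growth policy are assumptions made for illustration; they are not taken from the PR and may differ from the real `ArrayUtil#growNoCopy`.

```java
import java.util.Arrays;

// Hypothetical sketch; signatures and growth policy are illustrative assumptions.
final class GrowSketch {
  /** Grows the array to at least minSize, preserving existing contents. */
  static int[] grow(int[] array, int minSize) {
    if (array.length >= minSize) {
      return array;
    }
    return Arrays.copyOf(array, newSize(minSize)); // allocates and copies
  }

  /**
   * Grows the array to at least minSize without preserving contents.
   * Useful when the caller is about to overwrite the whole array anyway,
   * so the copy done by grow() would be wasted work.
   */
  static int[] growNoCopy(int[] array, int minSize) {
    if (array.length >= minSize) {
      return array;
    }
    return new int[newSize(minSize)]; // allocates only, no copy
  }

  // Simplistic over-allocation (about 12.5%); ignores overflow near Integer.MAX_VALUE.
  private static int newSize(int minSize) {
    return minSize + (minSize >>> 3);
  }
}
```

A caller that refills a scratch buffer on every iteration would use `growNoCopy`, while a caller that appends to existing data would keep using `grow`.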
[GitHub] [lucene] zacharymorn commented on pull request #790: LUCENE-10436: Remove deprecated DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery
zacharymorn commented on PR #790: URL: https://github.com/apache/lucene/pull/790#issuecomment-1091119984 > > For the change entry, I assume this should go into version 10.0.0? > > Yes, we need a CHANGES entry under 10.0.0 and a new entry in `lucene/MIGRATE.txt` that recommends replacing `DocValuesFieldExistsQuery` and others with `FieldExistsQuery`. Sounds good. Added.
[GitHub] [lucene] zacharymorn commented on a diff in pull request #790: LUCENE-10436: Remove deprecated DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery
zacharymorn commented on code in PR #790: URL: https://github.com/apache/lucene/pull/790#discussion_r844733404 ## lucene/core/src/java/org/apache/lucene/search/UsageTrackingQueryCachingPolicy.java: ## @@ -58,12 +58,6 @@ private static boolean shouldNeverCache(Query query) { return true; } -if (query instanceof DocValuesFieldExistsQuery) { - // We do not bother caching DocValuesFieldExistsQuery queries since they are already plenty - // fast. - return true; -} Review Comment: Oh sorry, I should have added a nocommit for this. Given that `FieldExistsQuery` now supports norms and vectors in addition to doc values, would skipping the cache for norms and vectors as well hurt performance if we added a similar `instanceof` check for `FieldExistsQuery` here? I'm also wondering whether there's a luceneutil-like benchmark for these.
[GitHub] [lucene] zacharymorn merged pull request #791: LUCENE-10436: (Backporting) Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery
zacharymorn merged PR #791: URL: https://github.com/apache/lucene/pull/791
[jira] [Commented] (LUCENE-10436) Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery into a single FieldExistsQuery?
[ https://issues.apache.org/jira/browse/LUCENE-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518601#comment-17518601 ] ASF subversion and git services commented on LUCENE-10436: -- Commit a42326b9ef90a77910a7dcaf46997b53da6266b1 in lucene's branch refs/heads/branch_9x from zacharymorn [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a42326b9ef9 ] LUCENE-10436: Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery (#767) (#791) > Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and > KnnVectorFieldExistsQuery into a single FieldExistsQuery? > -- > > Key: LUCENE-10436 > URL: https://issues.apache.org/jira/browse/LUCENE-10436 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Time Spent: 5h 20m > Remaining Estimate: 0h > > Now that we require consistency across data structures, we could merge > DocValuesFieldExistsQuery, NormsFieldExistsQuery and > KnnVectorFieldExistsQuery together into a FieldExistsQuery that would require > that the field indexes either norms, doc values or vectors?
[GitHub] [lucene] zacharymorn opened a new pull request, #798: LUCENE-10436: (Backport) Remove usage of DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery
zacharymorn opened a new pull request, #798: URL: https://github.com/apache/lucene/pull/798 Backporting PR https://github.com/apache/lucene/pull/790 without removal of the deprecated queries.
[GitHub] [lucene] zacharymorn commented on pull request #790: LUCENE-10436: Remove deprecated DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery
zacharymorn commented on PR #790: URL: https://github.com/apache/lucene/pull/790#issuecomment-1091130105 > Great. We should backport these changes, except for the actual removals, to 9.x to address deprecation warnings. Thanks for the review! I've created the backporting PR for 9.x here: https://github.com/apache/lucene/pull/798
[GitHub] [lucene] jpountz commented on a diff in pull request #790: LUCENE-10436: Remove deprecated DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery
jpountz commented on code in PR #790: URL: https://github.com/apache/lucene/pull/790#discussion_r844759861 ## lucene/core/src/java/org/apache/lucene/search/UsageTrackingQueryCachingPolicy.java: ## @@ -58,12 +58,6 @@ private static boolean shouldNeverCache(Query query) { return true; } -if (query instanceof DocValuesFieldExistsQuery) { - // We do not bother caching DocValuesFieldExistsQuery queries since they are already plenty - // fast. - return true; -} Review Comment: I feel good about not having a benchmark for this. The reasoning is that if the index has a data structure that supports running the query very efficiently, then we should just use it and skip caching. And we have this for doc values, norms and vectors. In contrast, boolean queries for instance need to reconcile multiple queries together, which has overhead. So +1 to exclude FieldExistsQuery from caching entirely.
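The outcome of this thread (exclude `FieldExistsQuery` from caching entirely) could look roughly like the sketch below. The types here are self-contained stand-ins with hypothetical names, not Lucene's `Query` classes, and the real `shouldNeverCache` contains other checks that are omitted.

```java
// Stand-in types so the sketch compiles on its own; not Lucene's classes.
interface SketchQuery {}

final class FieldExistsSketch implements SketchQuery {}

final class BooleanSketch implements SketchQuery {}

final class CachingPolicySketch {
  /**
   * Returns true for queries that should never be cached. Field-exists
   * queries are assumed to be backed by a data structure (doc values,
   * norms or vectors) that already answers them efficiently, so caching
   * would add overhead without a clear benefit.
   */
  static boolean shouldNeverCache(SketchQuery query) {
    return query instanceof FieldExistsSketch;
  }
}
```

Boolean queries stay eligible for caching in this sketch because, as noted above, they have to reconcile several sub-queries, which is where a cache can pay off.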