[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?
[ https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556683#comment-17556683 ]

Adrien Grand commented on LUCENE-10507:
---------------------------------------

It looks like this change helped find a reproducible test failure:

./gradlew test --tests TestElevationComparator.testSorting -Dtests.seed=3AC6BE539DA8C1F3 -Dtests.locale=sg-CF -Dtests.timezone=America/Indiana/Knox -Dtests.asserts=true -Dtests.file.encoding=UTF-8

I don't understand the reason yet.

> Should it be more likely to search concurrently in tests?
> ---------------------------------------------------------
>
>                 Key: LUCENE-10507
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10507
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Luca Cavanna
>            Priority: Minor
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> As part of LUCENE-10002 we are migrating test usages of
> IndexSearcher#search(Query, Collector) to use the corresponding search method
> that takes a CollectorManager in place of a Collector. As part of such
> changes, I've been paying attention to whether searchers are created through
> LuceneTestCase#newSearcher and migrating to it when possible.
> This caused some recent test failures following test changes, which were in
> most cases test issues, although they were quite rare because we only rarely
> exercise the concurrent code path in tests.
> One recent failure uncovered LUCENE-10500, an actual bug that affected
> concurrent searches only; it was uncovered by a test run that indexed a
> considerable number of docs and was lucky enough to get an executor set on
> its index searcher as well as get multiple slices.
> LuceneTestCase#newIndexSearcher(IndexReader) uses threads only rarely, and
> even when useThreads is true, the searcher may not get an executor set. Also,
> it can often happen that despite an executor being set, the searcher will
> hold only one slice, as not enough documents are indexed. Some nightly tests
> index enough documents, and LuceneTestCase also lowers the slice limits, but
> only 50% of the time and only when wrapWithAssertions is false. Also I wonder
> if the lower limits are low enough:
> {code:java}
> int maxDocPerSlice = 1 + random.nextInt(10);
> int maxSegmentsPerSlice = 1 + random.nextInt(20);
> {code}
> All in all, I wonder if we should make it more likely for real concurrent
> searches to happen while testing across multiple slices. It seems like it
> could be useful, especially as we'd like users to use collector managers
> instead of collectors (although that does not necessarily translate to
> concurrent search).

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
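To see why such small limits matter, here is a toy model of slice formation (my own sketch, not Lucene's actual IndexSearcher slicing code): segments are greedily grouped into a slice until either the doc limit or the segment limit is hit. With generous limits, a small test index collapses into a single slice and no concurrency is exercised; only the lowered randomized limits produce multiple slices.

```java
import java.util.ArrayList;
import java.util.List;

public class SliceSketch {
    // Greedily group per-segment doc counts into slices, closing a slice once it
    // reaches either the doc limit or the segment limit. This mimics the spirit
    // of IndexSearcher's slicing, not its exact algorithm.
    static List<List<Integer>> slices(int[] segmentDocCounts, int maxDocsPerSlice, int maxSegmentsPerSlice) {
        List<List<Integer>> slices = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int docsInCurrent = 0;
        for (int docCount : segmentDocCounts) {
            current.add(docCount);
            docsInCurrent += docCount;
            if (docsInCurrent >= maxDocsPerSlice || current.size() >= maxSegmentsPerSlice) {
                slices.add(current);
                current = new ArrayList<>();
                docsInCurrent = 0;
            }
        }
        if (!current.isEmpty()) {
            slices.add(current);
        }
        return slices;
    }

    public static void main(String[] args) {
        int[] segments = {5, 3, 8, 2, 6}; // five small segments, as in a typical test index
        // With generous limits everything collapses into one slice: no concurrency.
        System.out.println(slices(segments, 1000, 100).size()); // 1
        // With limits in the lowered randomized range, several slices emerge.
        System.out.println(slices(segments, 10, 20).size()); // 2
    }
}
```

The point of the sketch: unless both the doc counts and the slice limits line up, the executor (even when set) receives a single task, which is why the concurrent code path was so rarely exercised.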
[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?
[ https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556684#comment-17556684 ]

Adrien Grand commented on LUCENE-10507:
---------------------------------------

Also we wondered if this change could affect the time it takes to run tests, but things look good so far: http://people.apache.org/~mikemccand/lucenebench/antcleantest.html
[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?
[ https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556688#comment-17556688 ]

ASF subversion and git services commented on LUCENE-10507:
----------------------------------------------------------

Commit adcf58fe8751c4af51e6dd841995e61065fa56e6 in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=adcf58fe875 ]

LUCENE-10507: Fix test failure.
[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?
[ https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556689#comment-17556689 ]

ASF subversion and git services commented on LUCENE-10507:
----------------------------------------------------------

Commit 4fab62b6b8766f42957c9ebb537ac380d5bd7af3 in lucene's branch refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4fab62b6b87 ]

LUCENE-10507: Fix test failure.
[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?
[ https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556690#comment-17556690 ]

Adrien Grand commented on LUCENE-10507:
---------------------------------------

OK, I found the issue with the test. The comparator was not correctly implemented: {{compareValues}} would sort values in the opposite order from {{compare}}. I pushed a fix.
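The failure mode described in this comment can be illustrated with a small hypothetical comparator (this is a sketch of the inconsistency, not the actual TestElevationComparator code): within a slice, hits are ordered with one method, and slice results are merged with another, so the two must agree in sign for every pair of values.

```java
public class ComparatorConsistency {
    // A field comparator reduced to its two ordering methods.
    interface SimpleFieldComparator {
        int compare(int a, int b);        // used to order hits within a slice
        int compareValues(int a, int b);  // used when merging results across slices
    }

    // Buggy version: compareValues reverses the order that compare uses, so the
    // merged top hit can differ from what single-threaded collection produces.
    static final SimpleFieldComparator BUGGY = new SimpleFieldComparator() {
        public int compare(int a, int b) { return Integer.compare(a, b); }
        public int compareValues(int a, int b) { return Integer.compare(b, a); } // reversed!
    };

    // Fixed version: both methods agree.
    static final SimpleFieldComparator FIXED = new SimpleFieldComparator() {
        public int compare(int a, int b) { return Integer.compare(a, b); }
        public int compareValues(int a, int b) { return Integer.compare(a, b); }
    };

    // The invariant a correct comparator must satisfy for every pair of values.
    static boolean consistent(SimpleFieldComparator c, int a, int b) {
        return Integer.signum(c.compare(a, b)) == Integer.signum(c.compareValues(a, b));
    }

    public static void main(String[] args) {
        System.out.println(consistent(BUGGY, 1, 2)); // false
        System.out.println(consistent(FIXED, 1, 2)); // true
    }
}
```

This is also why the bug only surfaced once tests actually ran with multiple slices: with a single slice, {{compareValues}} is never used to merge, so the inconsistency stays invisible.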
[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556475#comment-17556475 ]

Tomoko Uchida edited comment on LUCENE-10557 at 6/21/22 7:18 AM:
-----------------------------------------------------------------

I browsed through several JSON dumps of Jira issues. These are some observations.
- It'd be easy to extract various metadata of issues (reporter id, status, created timestamp, etc.)
- It'd be easy to extract all linked issue ids and sub-task ids
- It'd be easy to extract all attached file URLs
-- Can't estimate how many hours it will take to download all of the files
- It'd be easy to extract all comments in an issue
-- -Perhaps pagination is needed for issues with many comments- Comments in an issue can be retrieved all at once.
- We can apply parser/converter tools to convert the Jira markup to markdown
-- I think this can be error-prone
- It'd be cumbersome to extract GitHub PR links. The links to PRs only appear in the github bot's comments in the Work Log.

On the GitHub side, there are no difficulties in dealing with the APIs.
- It'd be a bit tedious to work with milestones via the APIs. They can't be referred to by their text; an id-to-text mapping is needed.
- It might take some trial and error to place attached files in their right place.

As for the cross-link conversion and account mapping script:
- To "embed" GitHub issue links / accounts in their right place (maybe next to the Jira issue keys / user names), we need to modify the original text. This can be tricky and is the riskiest part to me. Instead of modifying the original text, we could just add footnotes for the issues/comments - but that could considerably damage readability.

Yes, it should be possible with a set of small scripts. Maybe one problem is that it'd be difficult to detect conversion errors/omissions, and we can't correct them ourselves if we notice migration errors later (it seems we are not allowed to have the GitHub token of the ASF repository).
> Migrate to GitHub issue from Jira
> ---------------------------------
>
>                 Key: LUCENE-10557
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10557
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Major
>
> A few (not the majority of) Apache projects already use GitHub issues instead
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible for us to move to GitHub issues. I
> have little knowledge of how to proceed with it; I'd like to discuss whether
> we should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
> * (/) Get a consensus about the migration among committers
> * Choose issues that should be moved to GitHub
> ** Discussion thread
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
> ** -Conclusion for now: We don't migrate any issues. Only new issues should
> be opened
[jira] [Commented] (LUCENE-10622) Prepare complete migration script to GitHub issue from Jira (best effort)
[ https://issues.apache.org/jira/browse/LUCENE-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556723#comment-17556723 ]

Tomoko Uchida commented on LUCENE-10622:
----------------------------------------

Looks like cross-issue links and sub-tasks are fine. There are also issue links to outside projects (e.g. Solr, LEGAL, INFRA, etc.). Do we have to have fallback links to Jira from GitHub?

https://github.com/mocobeta/migration-test-1/issues/24

> Prepare complete migration script to GitHub issue from Jira (best effort)
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-10622
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10622
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Major
>
> If we intend to move the history to GitHub, it should be as close to perfect
> as possible - significantly degraded copies of history are harmful, rather
> than helpful, for future contributors, I think.
[GitHub] [lucene] kaivalnp commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection
kaivalnp commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r902336738

## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:

@@ -121,35 +140,15 @@ public Query rewrite(IndexReader reader) throws IOException {
     return createRewrittenQuery(reader, topK);
   }

-  private TopDocs searchLeaf(LeafReaderContext ctx, BitSetCollector filterCollector)
-      throws IOException {
-
-    if (filterCollector == null) {
-      Bits acceptDocs = ctx.reader().getLiveDocs();
-      return approximateSearch(ctx, acceptDocs, Integer.MAX_VALUE);
+  private TopDocs searchLeaf(LeafReaderContext ctx, Bits acceptDocs, int cost) throws IOException {
+    TopDocs results = approximateSearch(ctx, acceptDocs, cost);

Review Comment:
   Yes, makes sense! Will add it

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kaivalnp commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection
kaivalnp commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r902366042

## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:

@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
-    BitSetCollector filterCollector = null;
+    Weight filterWeight = null;
     if (filter != null) {
-      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
       BooleanQuery booleanQuery =
           new BooleanQuery.Builder()
               .add(filter, BooleanClause.Occur.FILTER)
               .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
               .build();
-      indexSearcher.search(booleanQuery, filterCollector);
+      Query rewritten = indexSearcher.rewrite(booleanQuery);
+      filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
     }
     for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterCollector);
+      Bits acceptDocs;
+      int cost;
+      if (filterWeight != null) {
+        Scorer scorer = filterWeight.scorer(ctx);
+        if (scorer != null) {
+          DocIdSetIterator iterator = scorer.iterator();
+          if (iterator instanceof BitSetIterator) {
+            acceptDocs = ((BitSetIterator) iterator).getBitSet();
+          } else {
+            acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+          }
+          cost = (int) iterator.cost();

Review Comment:
   You're right.. the `scorer` seems to be overestimating quite a lot! I changed it to the `cardinality` of the `BitSet`, and it only adds a small latency.

   However, as @jpountz pointed out, it does not include `liveDocs` yet. We need some way of incorporating these `liveDocs` into our `BitSet` without iterating one-by-one over matching bits. Any suggestions for this?
[GitHub] [lucene] jpountz commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection
jpountz commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r902383800

## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:

@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
-    BitSetCollector filterCollector = null;
+    Weight filterWeight = null;
     if (filter != null) {
-      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
       BooleanQuery booleanQuery =
           new BooleanQuery.Builder()
               .add(filter, BooleanClause.Occur.FILTER)
               .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
               .build();
-      indexSearcher.search(booleanQuery, filterCollector);
+      Query rewritten = indexSearcher.rewrite(booleanQuery);
+      filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
     }
     for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterCollector);
+      Bits acceptDocs;
+      int cost;
+      if (filterWeight != null) {
+        Scorer scorer = filterWeight.scorer(ctx);
+        if (scorer != null) {
+          DocIdSetIterator iterator = scorer.iterator();
+          if (iterator instanceof BitSetIterator) {
+            acceptDocs = ((BitSetIterator) iterator).getBitSet();
+          } else {
+            acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+          }

Review Comment:
   Is it a problem? `exactSearch` doesn't need a `BitSet` but a `DocIdSetIterator`, which should be easy to create by filtering the scorer's iterator to exclude deleted docs?
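The cost-vs-cardinality point in this thread can be illustrated with plain java.util.BitSet (hypothetical method names; this is not Lucene's BitSetIterator API): an iterator's cost() is only an upper-bound estimate, while cardinality() over the materialized bit set is exact, and live docs can be applied in one bulk intersection rather than bit by bit.

```java
import java.util.BitSet;
import java.util.stream.IntStream;

public class FilterBitsSketch {
    // Build a bit set of accepted docs from a stream of matching doc ids,
    // standing in for materializing a DocIdSetIterator into a bit set.
    static BitSet collect(IntStream matchingDocs, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        matchingDocs.forEach(bits::set);
        return bits;
    }

    // Intersect with live docs in one bulk AND instead of iterating over
    // every set bit individually.
    static BitSet applyLiveDocs(BitSet accepted, BitSet liveDocs) {
        BitSet result = (BitSet) accepted.clone();
        result.and(liveDocs);
        return result;
    }

    public static void main(String[] args) {
        int maxDoc = 16;
        BitSet accepted = collect(IntStream.of(1, 3, 5, 7, 9), maxDoc);
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc);
        live.clear(5); // doc 5 was deleted
        BitSet result = applyLiveDocs(accepted, live);
        System.out.println(accepted.cardinality()); // 5 - exact, unlike a cost() estimate
        System.out.println(result.cardinality());   // 4 - after removing the deleted doc
    }
}
```

The bulk AND is the kind of operation the reviewers are asking for: it touches whole 64-bit words at a time, so incorporating deletions costs far less than visiting each matching bit.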
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556848#comment-17556848 ]

Robert Muir commented on LUCENE-10577:
--------------------------------------

Seems like the codec API needs to be fixed so that ppl can use 8 or 16 bit vectors, etc. I am -1 against adding any additional similarity functions. The current codec keeps getting more and more bloated instead of scaling out horizontally with more codecs. And more bullshit (eg cosine) keeps getting piled into this wonder-do-it-all design, perpetuating the argument that it's too difficult to make more codecs, and should be avoided.

> Quantize vector values
> ----------------------
>
>                 Key: LUCENE-10577
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10577
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Michael Sokolov
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values.
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to
> support that it is not really necessary to store vectors in full precision.
> Perhaps users may also be willing to retrieve values in lower precision for
> whatever purpose those serve, if they are able to store more samples. We know
> that 8 bits is enough to provide a very near approximation to the same
> recall/performance tradeoff that is achieved with the full-precision vectors.
> I'd like to explore how we could enable 4:1 compression of these fields by
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide
> their data in reduced-precision format and give control over the quantization
> to them. It would have a major impact on the Lucene API surface though,
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would
> require no or perhaps very limited change to the existing API to enable the
> feature.
> I've been exploring (2), and what I find is that we can achieve very good
> recall results using dot-product similarity scoring by simple linear scaling
> + quantization of the vector values, so long as we choose the scale that
> minimizes the quantization error. Dot-product is amenable to this treatment
> since vectors are required to be unit-length when used with that similarity
> function.
> Even still there is variability in the ideal scale over different data sets.
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course
> this assumes that the data set doesn't have a few outlier data points. A
> theoretical range can be obtained by 1/sqrt(dimension), but this is only
> useful when the samples are normally distributed. We could in theory
> determine the ideal scale when flushing a segment and manage this
> quantization per-segment, but then numerical error could creep in when
> merging.
> I'll post a patch/PR with an experimental setup I've been using for
> evaluation purposes. It is pretty self-contained and simple, but has some
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we
> should think about doing this compression under the hood, or expose a
> byte-oriented API. Whatever we do, I think a 4:1 compression ratio is pretty
> compelling and we should pursue something.
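The linear-scaling scheme the issue describes can be sketched in a few lines — pick scale = max(abs(min-value), abs(max-value)), map floats to signed bytes, and compute the dot product directly on bytes. This is an illustration of the idea only, under the assumption of roughly unit-length input vectors, not the patch's actual code:

```java
public class QuantizeSketch {
    // Choose the scale as the largest component magnitude, i.e.
    // max(abs(min-value), abs(max-value)) for this vector.
    static float scale(float[] v) {
        float s = 0;
        for (float x : v) s = Math.max(s, Math.abs(x));
        return s;
    }

    // Linear quantization to signed bytes: the largest-magnitude component
    // maps to +/-127, everything else scales proportionally.
    static byte[] quantize(float[] v, float s) {
        byte[] q = new byte[v.length];
        for (int i = 0; i < v.length; i++) {
            q[i] = (byte) Math.round(v[i] / s * 127f);
        }
        return q;
    }

    // Integer dot product over the quantized bytes; multiplying by
    // (scaleA * scaleB) / (127 * 127) recovers an approximate float score.
    static int dotProduct(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    public static void main(String[] args) {
        float[] v = {0.6f, -0.8f}; // a unit-length vector, as dot-product similarity requires
        float s = scale(v);
        byte[] q = quantize(v, s);
        // Self-similarity of a unit vector is exactly 1.0 in float; the
        // dequantized byte dot product lands very close to it.
        double approx = dotProduct(q, q) * (double) (s * s) / (127.0 * 127.0);
        System.out.println(approx); // close to 1.0
    }
}
```

With one byte per component this is the 4:1 compression the issue targets, at the cost of the small dequantization error visible above.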
[GitHub] [lucene] gsmiller opened a new pull request, #969: LUCENE-10603: Mark SortedSetDocValues#NO_MORE_ORDS deprecated
gsmiller opened a new pull request, #969:
URL: https://github.com/apache/lucene/pull/969

   Let's get this marked as deprecated ASAP if we want to actually remove it in a 10.0 release. Unless we remove it, we won't see any performance benefits of LUCENE-10603, since we'll still need to do the internal book-keeping in `Lucene90DocValuesProducer` to surface `NO_MORE_ORDS` as long as it exists as part of the API.
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556891#comment-17556891 ]

Greg Miller commented on LUCENE-10603:
--------------------------------------

[~ChrisLu] thanks again for proposing this. I've merged the work in the {{facets}} module to use the new style of iteration, but there are still plenty more locations in our code base that need updating. Let me know if you want any help with this. I'm happy to divide up some of the modules if you'd like (or maybe we can recruit others if interested as well).

In the meantime, I propose we get this {{NO_MORE_ORDS}} constant marked as {{deprecated}} so we have a shot at removing it in a 10.0 release. By removing it, as [~jpountz] points out in [#954|https://github.com/apache/lucene/pull/954], we may see a performance benefit since we won't need the book-keeping to keep it updated. I opened another PR for this: [#969|https://github.com/apache/lucene/pull/969].

> Improve iteration of ords for SortedSetDocValues
> ------------------------------------------------
>
>                 Key: LUCENE-10603
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10603
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Lu Xugang
>            Assignee: Lu Xugang
>            Priority: Trivial
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> After SortedSetDocValues#docValueCount was added in Lucene 9.2, should we
> refactor the implementation of ord iteration to use docValueCount instead of
> NO_MORE_ORDS, similar to how SortedNumericDocValues works?
> From
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }
> {code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }
> {code}
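The two iteration styles from the issue can be contrasted with a minimal stand-in for the doc-values cursor (a hypothetical class, not the real SortedSetDocValues): the sentinel style forces the producer to track and emit NO_MORE_ORDS, while the count-based style just consumes exactly docValueCount() ords and needs no sentinel bookkeeping at all.

```java
import java.util.ArrayList;
import java.util.List;

public class OrdIterationSketch {
    static final long NO_MORE_ORDS = -1;

    // Minimal stand-in for a per-document ord cursor.
    static class FakeDocValues {
        private final long[] ords;
        private int pos = 0;
        FakeDocValues(long... ords) { this.ords = ords; }
        int docValueCount() { return ords.length; }
        long nextOrd() { return pos < ords.length ? ords[pos++] : NO_MORE_ORDS; }
        void reset() { pos = 0; }
    }

    // Old style: iterate until the sentinel; the producer must keep emitting it.
    static List<Long> sentinelStyle(FakeDocValues values) {
        List<Long> out = new ArrayList<>();
        for (long ord = values.nextOrd(); ord != NO_MORE_ORDS; ord = values.nextOrd()) {
            out.add(ord);
        }
        return out;
    }

    // New style: consume exactly docValueCount() ords (hoisted so it isn't
    // re-read on every iteration); no sentinel check needed.
    static List<Long> countStyle(FakeDocValues values) {
        List<Long> out = new ArrayList<>();
        int count = values.docValueCount();
        for (int i = 0; i < count; i++) {
            out.add(values.nextOrd());
        }
        return out;
    }

    public static void main(String[] args) {
        FakeDocValues values = new FakeDocValues(3, 7, 42);
        List<Long> a = sentinelStyle(values);
        values.reset();
        List<Long> b = countStyle(values);
        System.out.println(a.equals(b)); // true: both styles see the same ords
    }
}
```

Both loops visit the same ords, which is why dropping the sentinel is purely a win once all call sites use the count-based form: the producer no longer has to track when to return NO_MORE_ORDS.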
[GitHub] [lucene] gsmiller commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1161727452 Thanks @shaie! I've been away from my computer since Thursday but should have time to catch up on this today, respond to your comments, and do another review pass. Agreed that we're close on this. Finish line is in sight! :)
[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556966#comment-17556966 ] Tomoko Uchida commented on LUCENE-10557: I was trying to figure out how to upload attachments (patches, images, etc.) to GitHub issues with the API for hours. {*}There is no way to upload files to GitHub with REST APIs{*}; it is only allowed via the Web Interface. If you want to programmatically port attachment files from Jira to GitHub, you have to [commit the files to the repository|https://docs.github.com/en/rest/repos/contents]. See [https://github.com/isaacs/github/issues/1133] > Migrate to GitHub issue from Jira > - > > Key: LUCENE-10557 > URL: https://issues.apache.org/jira/browse/LUCENE-10557 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > > A few (not the majority) Apache projects already use the GitHub issue instead > of Jira. For example, > Airflow: [https://github.com/apache/airflow/issues] > BookKeeper: [https://github.com/apache/bookkeeper/issues] > So I think it'd be technically possible that we move to GitHub issue. I have > little knowledge of how to proceed with it, I'd like to discuss whether we > should migrate to it, and if so, how to smoothly handle the migration. > The major tasks would be: > * (/) Get a consensus about the migration among committers > * Choose issues that should be moved to GitHub > ** Discussion thread > [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12] > ** -Conclusion for now: We don't migrate any issues. Only new issues should > be opened on GitHub.- > ** Write a prototype migration script - the decision could be made on that. > Things to consider: > *** version numbers - labels or milestones? 
> *** add a comment/ prepend a link to the source Jira issue on github side, > *** add a comment/ prepend a link on the jira side to the new issue on > github side (for people who access jira from blogs, mailing list archives and > other sources that will have stale links), > *** convert cross-issue automatic links in comments/ descriptions (as > suggested by Robert), > *** strategy to deal with sub-issues (hierarchies), > *** maybe prefix (or postfix) the issue title on github side with the > original LUCENE-XYZ key so that it is easier to search for a particular issue > there? > *** how to deal with user IDs (author, reporter, commenters)? Do they have > to be github users? Will information about people not registered on github be > lost? > *** create an extra mapping file of old-issue-new-issue URLs for any > potential future uses. > *** what to do with issue numbers in git/svn commits? These could be > rewritten but it'd change the entire git history tree - I don't think this is > practical, while doable. > * Build the convention for issue label/milestone management > ** Do some experiments on a sandbox repository > [https://github.com/mocobeta/sandbox-lucene-10557] > ** Make documentation for metadata (label/milestone) management > * Enable Github issue on the lucene's repository > ** Raise an issue on INFRA > ** (Create an issue-only private repository for sensitive issues if it's > needed and allowed) > ** Set a mail hook to > [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to > the general mail group name) > * Set a schedule for migration > ** Give some time to committers to play around with issues/labels/milestones > before the actual migration > ** Make an announcement on the mail lists > ** Show some text messages when opening a new Jira issue (in issue template?) 
[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556966#comment-17556966 ] Tomoko Uchida edited comment on LUCENE-10557 at 6/21/22 3:07 PM: - I was trying to figure out how to upload attachments (patches, images, etc.) to GitHub issues with the API for hours. {*}There is no way to upload files to GitHub with REST APIs{*}; it is only allowed via the Web Interface. If you want to programmatically port attachment files from Jira to GitHub, you have to [commit the files to the repository|https://docs.github.com/en/rest/repos/contents]. See [https://github.com/isaacs/github/issues/1133] was (Author: tomoko uchida): I was trying to figure out how to upload attachments (patches, images, etc.) to Github issue with API for hours. {*}There is no way to upload files to GitHub with REST APIs{*}; it is only allowed via the Web Interface. If you want to programmatically port attachment files in Jira to refer to GitHub, you have to [commit the files to the repository|https://docs.github.com/en/rest/repos/contents]. See [https://github.com/isaacs/github/issues/1133] > Migrate to GitHub issue from Jira > - > > Key: LUCENE-10557 > URL: https://issues.apache.org/jira/browse/LUCENE-10557 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > > A few (not the majority) Apache projects already use the GitHub issue instead > of Jira. For example, > Airflow: [https://github.com/apache/airflow/issues] > BookKeeper: [https://github.com/apache/bookkeeper/issues] > So I think it'd be technically possible that we move to GitHub issue. I have > little knowledge of how to proceed with it, I'd like to discuss whether we > should migrate to it, and if so, how to smoothly handle the migration. 
[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556475#comment-17556475 ] Tomoko Uchida edited comment on LUCENE-10557 at 6/21/22 3:11 PM: - I browsed through several JSON dumps of Jira issues. These are some observations. - It'd be easy to extract various metadata of issues (reporter id, status, created timestamp, etc.) - It'd be easy to extract all linked issue ids and sub-task ids - It'd be easy to extract all attached file URLs -- Can't estimate how many hours it will take to download all of the files - It'd be easy to extract all comments in an issue -- -Perhaps pagination is needed for issues with many comments- Comments in an issue can be retrieved all at once. - We can apply parser/converter tools to convert the jira markups to markdown -- I think this can be error-prone - It'd be cumbersome to extract GitHub PR links. The links to PRs only appear in the github bot's comments in the Work Log. On the GitHub side, there are no difficulties in dealing with the APIs. - It'd be a bit tedious to work with milestones via APIs. They can't be referred to by their text. Id - text mapping is needed - -It might need some trials and errors to properly place attached files in their right place- This is not possible (we can't programmatically migrate attachment files to GitHub). As for the cross-link conversion and account mapping script: - To "embed" github issue links / accounts in their right place (maybe next to the Jira issue keys / user names), we need to modify the original text. This can be tricky and is the riskiest part to me. Instead of modifying the original text, we could just add some footnotes for the issues/comments - but it could considerably damage the readability. Yes it should be possible with a set of small scripts. 
Maybe one problem is that it'd be difficult to detect conversion errors/omissions and we can't correct them ourselves if we notice migration errors later (it seems we are not allowed to have the github token of the ASF repository). was (Author: tomoko uchida): I browsed through several JSON dumps of Jira issues. These are some observations. - It'd be easy to extract various metadata of issues (reporter id, status, created timestamp, etc.) - It'd be easy to extract all linked issue ids and sub-task ids - It'd be easy to extract all attached file URLs -- Can't estimate how many hours it will take to download all of the files - it'd be easy to extract all comments in an issue -- -Perhaps pagination is needed for issues with many comments- Comments in an issue can be retrieved all at once. - We can apply parser/converter tools to convert the jira markups to markdown -- I think this can be error-prone - It'd be cumbersome to extract GitHub PR links. The links to PRs only appear in the github bot's comments in the Work Log. On GitHub side, there are no difficulties in dealing with the APIs. - It'd be a bit tedious to work with milestones via APIs. They can't be referred to by their text. Id - text mapping is needed - It might need some trials and errors to properly place attached files in their right place As for the cross-link conversion and account mapping script: - To "embed" github issue links / accounts in their right place (maybe next to the Jira issue keys / user names), we need to modify the original text. This can be tricky and the riskiest part to me. Instead of modifying the original text, we could just add some footnotes for the issues/comments - but it could considerably damage the readability. Yes it should be possible with a set of small scripts. 
Maybe one problem is that it'd be difficult to detect conversion errors/omissions and we can't correct them ourselves if we notice migration errors later (it seems we are not allowed to have the github token of the ASF repository).
[GitHub] [lucene] jpountz merged pull request #961: Handle more cases in `BooleanWeight#count`.
jpountz merged PR #961: URL: https://github.com/apache/lucene/pull/961
[GitHub] [lucene] gsmiller merged pull request #969: LUCENE-10603: Mark SortedSetDocValues#NO_MORE_ORDS deprecated
gsmiller merged PR #969: URL: https://github.com/apache/lucene/pull/969
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556988#comment-17556988 ] ASF subversion and git services commented on LUCENE-10603: -- Commit 8f459eb0f9d219af5610642c1027ec704b094dc3 in lucene's branch refs/heads/main from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8f459eb0f9d ] LUCENE-10603: Mark SortedSetDocValues#NO_MORE_ORDS deprecated (#969)
[GitHub] [lucene] jpountz commented on pull request #964: LUCENE-10620: Pass the Weight to Collectors.
jpountz commented on PR #964: URL: https://github.com/apache/lucene/pull/964#issuecomment-1161979978 I reverted changes to top-docs collectors. This means the new `Collector#setWeight` API is only useful to `TotalHitCountCollector`. I've been wondering whether it was worth adding a new API only for `TotalHitCountCollector`, but looking at how facets use this collector, I suspect that many users set up their collectors manually instead of using `IndexSearcher#count` and so do not benefit from this optimization, so maybe it's worth the increased API surface. I was chatting about the API with @romseygeek, and we wondered if having the counting logic on `Scorable` would help. I looked into it, and it's not very practical: `LeafCollector#setScorer` is not currently a place where throwing a `CollectionTerminatedException` is supported (though this could be addressed), and this method can be called multiple times per segment, so we would need to introduce tracking to make sure that we only increment the count the first time `setScorer` is called on a segment. For these reasons, I would prefer moving forward with the current API on `Collector`.
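The trade-off being discussed can be sketched with simplified stand-ins (hypothetical `MockWeight` and `MockTotalHitCountCollector` classes, not the real Lucene interfaces): once the collector is handed the `Weight`, it can use a precomputed per-segment count where one is available instead of iterating matching docs.

```java
import java.util.List;

// Hypothetical stand-ins sketching the idea behind Collector#setWeight.
class MockWeight {
    private final int[] perSegmentCounts; // -1 means "count unknown, must iterate"
    MockWeight(int... perSegmentCounts) { this.perSegmentCounts = perSegmentCounts; }
    int count(int segment) { return perSegmentCounts[segment]; }
}

class MockTotalHitCountCollector {
    private MockWeight weight;
    private int total;

    void setWeight(MockWeight weight) { this.weight = weight; }

    // Collect one segment: shortcut through Weight#count when it is known,
    // otherwise fall back to counting the matching docs one by one.
    void collectSegment(int segment, List<Integer> matchingDocs) {
        int shortcut = (weight == null) ? -1 : weight.count(segment);
        total += (shortcut >= 0) ? shortcut : matchingDocs.size();
    }

    int getTotalHits() { return total; }
}
```

This also shows why the shortcut helps users who wire up the collector manually rather than going through `IndexSearcher#count`: the fast path lives in the collector itself.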
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557000#comment-17557000 ] ASF subversion and git services commented on LUCENE-10603: -- Commit 4de355bd04374bbd6c9ca5fe26b00f4f3dfe74a7 in lucene's branch refs/heads/branch_9x from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4de355bd043 ] LUCENE-10603: Mark SortedSetDocValues#NO_MORE_ORDS deprecated
[GitHub] [lucene] gsmiller commented on pull request #969: LUCENE-10603: Mark SortedSetDocValues#NO_MORE_ORDS deprecated
gsmiller commented on PR #969: URL: https://github.com/apache/lucene/pull/969#issuecomment-1161982047 Thanks @jpountz !
[GitHub] [lucene] tang-hi opened a new pull request, #970: LUCENE-10607: Fix potential integer overflow in maxArcs computations
tang-hi opened a new pull request, #970: URL: https://github.com/apache/lucene/pull/970 ### Description (or a Jira issue link if you have one) https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-10607?filter=allopenissues
[jira] [Commented] (LUCENE-10607) Integer overflow when NRTSuggesterBuilder expands input
[ https://issues.apache.org/jira/browse/LUCENE-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557029#comment-17557029 ] tangdh commented on LUCENE-10607: - Hi, I've raised a PR to fix the potential integer overflow, [~ChasenY] [~dweiss] https://github.com/apache/lucene/pull/970 > Integer overflow when NRTSuggesterBuilder expands input > - > > Key: LUCENE-10607 > URL: https://issues.apache.org/jira/browse/LUCENE-10607 > Project: Lucene - Core > Issue Type: Bug > Components: core/FSTs >Affects Versions: 9.2 >Reporter: chaseny >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > When building the suggest index, the suggest module calls > NRTSuggesterBuilder#finishTerm to write the suggest entries. finishTerm calls the > maxNumArcsForDedupByte function to grow analyzed, expanding by 3, 5, 7, ... 255. > When entries is very long (e.g. 900), the expansion in maxNumArcsForDedupByte overflows: > > private static int maxNumArcsForDedupByte(int currentNumDedupBytes) { > int maxArcs = 1 + (2 * currentNumDedupBytes); > if (currentNumDedupBytes > 5) { > // when currentNumDedupBytes >= 32768, the int multiplication exceeds Integer.MAX_VALUE > maxArcs *= currentNumDedupBytes; > } > return Math.min(maxArcs, 255); > } > > Also, when expanding, could we grow by a fixed 4 bytes each time, instead of the > 3, 5, 7, ... 255 expansion scheme?
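A self-contained sketch of the reported overflow and one possible fix (an illustration, not necessarily the exact change in PR #970): doing the intermediate arithmetic in `long` before clamping prevents the wrap-around.

```java
// Sketch of the reported bug and one possible fix.
class MaxArcs {
    // Buggy version from the report: the int multiplication wraps negative once
    // currentNumDedupBytes reaches 32768 (65537 * 32768 > Integer.MAX_VALUE),
    // and Math.min then returns the negative value instead of 255.
    static int maxNumArcsForDedupByteBuggy(int currentNumDedupBytes) {
        int maxArcs = 1 + (2 * currentNumDedupBytes);
        if (currentNumDedupBytes > 5) {
            maxArcs *= currentNumDedupBytes;
        }
        return Math.min(maxArcs, 255);
    }

    // Possible fix: compute in long so the product cannot overflow before
    // clamping to 255.
    static int maxNumArcsForDedupByteFixed(int currentNumDedupBytes) {
        long maxArcs = 1L + 2L * currentNumDedupBytes;
        if (currentNumDedupBytes > 5) {
            maxArcs *= currentNumDedupBytes;
        }
        return (int) Math.min(maxArcs, 255L);
    }
}
```

Since the result is clamped to 255 anyway, any fix that keeps the intermediate product from wrapping (or clamps earlier) suffices.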
[GitHub] [lucene] gsmiller commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1162137822 OK, I think I understand the intention with `FSD` long/int decoding better now, but I think it could be a little confusing in the API currently. If I were a user, I'd expect there to be four implementations that correspond with the four types being supported out-of-the-box (int/long/float/double). But this is _really_ about knowing the width of the encoded "sortable longs" in the doc value field. So, with my better understanding, 1) I think the current approach is reasonable, and I can't think of any better suggestion, but 2) maybe we could update the javadocs in `FSD` to make it a little clearer that it's about decoding the stored bytes into *comparable longs*?
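For reference, the "sortable longs" idea works roughly like this (a self-contained sketch equivalent in spirit to Lucene's `NumericUtils#doubleToSortableLong`; treat the exact bit trick below as an assumption for illustration): IEEE-754 bits are remapped so that the numeric order of doubles matches the signed order of the encoded longs.

```java
// Sketch of a "sortable long" encoding for doubles.
class SortableLongs {
    static long doubleToSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // Positive doubles are left as-is (their bit patterns already sort
        // correctly); for negative doubles the lower 63 bits are flipped so
        // that more-negative values map to smaller signed longs.
        return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
    }
}
```

This is why the decoder only needs to know the byte width of the encoded values, not the original numeric type: after encoding, everything compares as plain longs.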
[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557061#comment-17557061 ] Tomoko Uchida commented on LUCENE-10557: Maybe GitHub's API call rate limit would be another consideration. {quote}If you're making a large number of {{{}POST{}}}, {{{}PATCH{}}}, {{{}PUT{}}}, or {{DELETE}} requests for a single user or client ID, wait at least one second between each request. {quote} [https://docs.github.com/en/rest/guides/best-practices-for-integrators#dealing-with-secondary-rate-limits|https://docs.github.com/en/rest/overview/resources-in-the-rest-api#secondary-rate-limits] We can't really "bulk" import to GitHub. Every issue and comment has to be posted one by one and between the API calls, at least one-second sleep is required. > Migrate to GitHub issue from Jira > - > > Key: LUCENE-10557 > URL: https://issues.apache.org/jira/browse/LUCENE-10557 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > > A few (not the majority) Apache projects already use the GitHub issue instead > of Jira. For example, > Airflow: [https://github.com/apache/airflow/issues] > BookKeeper: [https://github.com/apache/bookkeeper/issues] > So I think it'd be technically possible that we move to GitHub issue. I have > little knowledge of how to proceed with it, I'd like to discuss whether we > should migrate to it, and if so, how to smoothly handle the migration. > The major tasks would be: > * (/) Get a consensus about the migration among committers > * Choose issues that should be moved to GitHub > ** Discussion thread > [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12] > ** -Conclusion for now: We don't migrate any issues. Only new issues should > be opened on GitHub.- > ** Write a prototype migration script - the decision could be made on that. > Things to consider: > *** version numbers - labels or milestones? 
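The pacing constraint described above can be sketched as a tiny helper (hypothetical `PacedImporter`, not part of any actual migration script): consecutive write calls are kept at least a fixed interval apart, per GitHub's "wait at least one second between each request" guidance for write requests.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical pacing helper for sequential issue/comment imports.
class PacedImporter {
    static <T> void importAll(List<T> items, Consumer<T> post, long minGapMillis) {
        long last = Long.MIN_VALUE;
        for (T item : items) {
            if (last != Long.MIN_VALUE) {
                // Sleep only for the remainder of the gap, if any.
                long wait = minGapMillis - (System.currentTimeMillis() - last);
                if (wait > 0) {
                    try {
                        Thread.sleep(wait);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return; // stop importing if interrupted
                    }
                }
            }
            post.accept(item); // e.g. a POST to the issues endpoint
            last = System.currentTimeMillis();
        }
    }
}
```

With a one-second minimum gap, an import of N issues plus M comments takes at least N+M seconds, which is why a bulk Jira-to-GitHub migration cannot be fast.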
[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557061#comment-17557061 ] Tomoko Uchida edited comment on LUCENE-10557 at 6/21/22 7:00 PM: - Maybe GitHub's API call rate limit would be another consideration. {quote}If you're making a large number of {{{}POST{}}}, {{{}PATCH{}}}, {{{}PUT{}}}, or {{DELETE}} requests for a single user or client ID, wait at least one second between each request. {quote} [https://docs.github.com/en/rest/guides/best-practices-for-integrators#dealing-with-secondary-rate-limits|https://docs.github.com/en/rest/overview/resources-in-the-rest-api#secondary-rate-limits] We can't really "bulk" import to GitHub. Every issue and comment has to be posted one by one and between the API calls, at least one-second sleep is required. I encountered this rate limit many times - actually it seems that the rate limit is strictly monitored. was (Author: tomoko uchida): Maybe GitHub's API call rate limit would be another consideration. {quote}If you're making a large number of {{{}POST{}}}, {{{}PATCH{}}}, {{{}PUT{}}}, or {{DELETE}} requests for a single user or client ID, wait at least one second between each request. {quote} [https://docs.github.com/en/rest/guides/best-practices-for-integrators#dealing-with-secondary-rate-limits|https://docs.github.com/en/rest/overview/resources-in-the-rest-api#secondary-rate-limits] We can't really "bulk" import to GitHub. Every issue and comment has to be posted one by one and between the API calls, at least one-second sleep is required. > Migrate to GitHub issue from Jira > - > > Key: LUCENE-10557 > URL: https://issues.apache.org/jira/browse/LUCENE-10557 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > > A few (not the majority) Apache projects already use the GitHub issue instead > of Jira.
For example, > Airflow: [https://github.com/apache/airflow/issues] > BookKeeper: [https://github.com/apache/bookkeeper/issues] > So I think it'd be technically possible that we move to GitHub issue. I have > little knowledge of how to proceed with it, I'd like to discuss whether we > should migrate to it, and if so, how to smoothly handle the migration. > The major tasks would be: > * (/) Get a consensus about the migration among committers > * Choose issues that should be moved to GitHub > ** Discussion thread > [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12] > ** -Conclusion for now: We don't migrate any issues. Only new issues should > be opened on GitHub.- > ** Write a prototype migration script - the decision could be made on that. > Things to consider: > *** version numbers - labels or milestones? > *** add a comment/ prepend a link to the source Jira issue on github side, > *** add a comment/ prepend a link on the jira side to the new issue on > github side (for people who access jira from blogs, mailing list archives and > other sources that will have stale links), > *** convert cross-issue automatic links in comments/ descriptions (as > suggested by Robert), > *** strategy to deal with sub-issues (hierarchies), > *** maybe prefix (or postfix) the issue title on github side with the > original LUCENE-XYZ key so that it is easier to search for a particular issue > there? > *** how to deal with user IDs (author, reporter, commenters)? Do they have > to be github users? Will information about people not registered on github be > lost? > *** create an extra mapping file of old-issue-new-issue URLs for any > potential future uses. > *** what to do with issue numbers in git/svn commits? These could be > rewritten but it'd change the entire git history tree - I don't think this is > practical, while doable. 
> * Build the convention for issue label/milestone management > ** Do some experiments on a sandbox repository > [https://github.com/mocobeta/sandbox-lucene-10557] > ** Make documentation for metadata (label/milestone) management > * Enable Github issue on the lucene's repository > ** Raise an issue on INFRA > ** (Create an issue-only private repository for sensitive issues if it's > needed and allowed) > ** Set a mail hook to > [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to > the general mail group name) > * Set a schedule for migration > ** Give some time to committers to play around with issues/labels/milestones > before the actual migration > ** Make an announcement on the mail lists > ** Show some text messages when opening a new Jira issue (in issue template?) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To uns
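The one-second spacing between mutating API calls discussed in the rate-limit comment above can be enforced with a tiny throttle. A minimal sketch, assuming hypothetical class and method names (this is not part of any actual migration script):

```java
/**
 * Minimal throttle sketch for GitHub's secondary rate limit: keep at least a
 * fixed gap (e.g. 1000 ms) between successive mutating API calls.
 * Class and method names are illustrative only.
 */
public class SecondaryRateLimitThrottle {
    private final long minGapMillis;
    private long lastCallMillis = Long.MIN_VALUE;

    public SecondaryRateLimitThrottle(long minGapMillis) {
        this.minGapMillis = minGapMillis;
    }

    /** How long the caller must still wait before issuing the next request. */
    public long requiredDelay(long nowMillis) {
        if (lastCallMillis == Long.MIN_VALUE) {
            return 0; // first call, no waiting needed
        }
        return Math.max(0, minGapMillis - (nowMillis - lastCallMillis));
    }

    /** Record that a request was just issued. */
    public void recordCall(long nowMillis) {
        lastCallMillis = nowMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        SecondaryRateLimitThrottle throttle = new SecondaryRateLimitThrottle(1000);
        for (int i = 0; i < 3; i++) {
            Thread.sleep(throttle.requiredDelay(System.currentTimeMillis()));
            // ... issue one POST here (create one issue or one comment) ...
            throttle.recordCall(System.currentTimeMillis());
        }
    }
}
```

With roughly 10k Jira issues plus their comments, this spacing alone puts a lower bound of several hours on a full import, which is the practical point of the comment above.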
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
mdmarshmallow commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r903051834 ## lucene/facet/src/java/org/apache/lucene/facet/facetset/RangeFacetSetMatcher.java: ## @@ -0,0 +1,166 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.facetset; + +import java.util.Arrays; +import org.apache.lucene.util.NumericUtils; + +/** + * A {@link FacetSetMatcher} which considers a set as a match if all dimensions fall within the + * given corresponding range. + * + * @lucene.experimental + */ +public class RangeFacetSetMatcher extends FacetSetMatcher { + + private final long[] lowerRanges; + private final long[] upperRanges; + + /** + * Constructs an instance to match facet sets with dimensions that fall within the given ranges. + */ + public RangeFacetSetMatcher(String label, DimRange... 
dimRanges) { +super(label, getDims(dimRanges)); +this.lowerRanges = Arrays.stream(dimRanges).mapToLong(range -> range.min).toArray(); +this.upperRanges = Arrays.stream(dimRanges).mapToLong(range -> range.max).toArray(); + } + + @Override + public boolean matches(long[] dimValues) { +assert dimValues.length == dims +: "Encoded dimensions (dims=" ++ dimValues.length ++ ") is incompatible with range dimensions (dims=" ++ dims ++ ")"; + +for (int i = 0; i < dimValues.length; i++) { + if (dimValues[i] < lowerRanges[i]) { +// Doc's value is too low in this dimension +return false; + } + if (dimValues[i] > upperRanges[i]) { +// Doc's value is too high in this dimension +return false; + } +} +return true; + } + + private static int getDims(DimRange... dimRanges) { +if (dimRanges == null || dimRanges.length == 0) { + throw new IllegalArgumentException("dimRanges cannot be null or empty"); +} +return dimRanges.length; + } + + /** + * Creates a {@link DimRange} for the given min and max long values. This method is also suitable + * for int values. + */ + public static DimRange fromLongs(long min, boolean minInclusive, long max, boolean maxInclusive) { Review Comment: Yeah I think it makes sense in that case to extract DimRange. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
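The per-dimension check in `RangeFacetSetMatcher#matches` from the diff above is easy to exercise standalone. A small sketch of the same logic with illustrative names (not the Lucene API):

```java
/** Standalone sketch of the per-dimension range check in RangeFacetSetMatcher#matches. */
public class RangeMatchSketch {
    /** True only if every value falls within its corresponding [lower, upper] range. */
    static boolean matches(long[] values, long[] lower, long[] upper) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] < lower[i]) {
                return false; // value too low in this dimension
            }
            if (values[i] > upper[i]) {
                return false; // value too high in this dimension
            }
        }
        return true;
    }
}
```

A facet set matches the hyperrectangle only when all dimensions pass, so the loop can short-circuit on the first failing dimension.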
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
mdmarshmallow commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r903055633 ## lucene/facet/src/java/org/apache/lucene/facet/facetset/FacetSetsField.java: ## @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.facetset; + +import org.apache.lucene.document.BinaryDocValuesField; +import org.apache.lucene.document.IntPoint; +import org.apache.lucene.util.BytesRef; + +/** + * A {@link BinaryDocValuesField} which encodes a list of {@link FacetSet facet sets}. The encoding + * scheme consists of a packed {@code byte[]} where the first value denotes the number of dimensions + * in all the sets, followed by each set's values. + * + * @lucene.experimental + */ +public class FacetSetsField extends BinaryDocValuesField { + + /** + * Create a new FacetSets field. + * + * @param name field name + * @param facetSets the {@link FacetSet facet sets} to index in that field. All must have the same + * number of dimensions + * @throws IllegalArgumentException if the field name is null or the given facet sets are invalid + */ + public static FacetSetsField create(String name, FacetSet... 
facetSets) { +if (facetSets == null || facetSets.length == 0) { + throw new IllegalArgumentException("FacetSets cannot be null or empty!"); +} + +return new FacetSetsField(name, toPackedValues(facetSets)); + } + + private FacetSetsField(String name, BytesRef value) { +super(name, value); + } + + private static BytesRef toPackedValues(FacetSet... facetSets) { +int numDims = facetSets[0].dims; +Class expectedClass = facetSets[0].getClass(); +byte[] buf = new byte[Integer.BYTES + facetSets[0].sizePackedBytes() * facetSets.length]; +IntPoint.encodeDimension(numDims, buf, 0); +int offset = Integer.BYTES; +for (FacetSet facetSet : facetSets) { + if (facetSet.dims != numDims) { +throw new IllegalArgumentException( +"All FacetSets must have the same number of dimensions. Expected " ++ numDims ++ " found " ++ facetSet.dims); + } + // It doesn't make sense to index facet sets of different types in the same field + if (facetSet.getClass() != expectedClass) { Review Comment: Took a look at this again and yeah, it doesn't make sense to generify here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
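The packed layout described in the `FacetSetsField` javadoc above — the dimension count first, then each set's values — can be sketched independently of Lucene. Here `ByteBuffer` stands in for `IntPoint.encodeDimension`, and all names are illustrative, not the actual encoding:

```java
import java.nio.ByteBuffer;

/** Sketch of the FacetSetsField packing idea: [numDims as int][each set's long values]. */
public class FacetSetPackingSketch {
    static byte[] pack(int numDims, long[][] sets) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Long.BYTES * numDims * sets.length);
        buf.putInt(numDims); // first value: number of dimensions shared by all sets
        for (long[] set : sets) {
            if (set.length != numDims) {
                throw new IllegalArgumentException(
                    "All sets must have the same number of dimensions. Expected "
                        + numDims + " found " + set.length);
            }
            for (long v : set) {
                buf.putLong(v);
            }
        }
        return buf.array();
    }
}
```

Storing the dimension count once up front is what lets a reader validate every set against it, mirroring the `facetSet.dims != numDims` check in the diff above.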
[GitHub] [lucene] gsmiller commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r903051691 ## lucene/demo/src/java/org/apache/lucene/demo/facet/CustomFacetSetExample.java: ## @@ -0,0 +1,303 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.demo.facet; + +import java.io.IOException; +import java.time.LocalDate; +import java.time.ZoneOffset; +import java.util.Collections; +import java.util.List; +import org.apache.lucene.analysis.core.WhitespaceAnalyzer; +import org.apache.lucene.document.*; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.Facets; +import org.apache.lucene.facet.FacetsCollector; +import org.apache.lucene.facet.FacetsCollectorManager; +import org.apache.lucene.facet.facetset.*; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.index.IndexWriter; +import org.apache.lucene.index.IndexWriterConfig; +import org.apache.lucene.index.IndexWriterConfig.OpenMode; +import org.apache.lucene.search.IndexSearcher; +import org.apache.lucene.search.MatchAllDocsQuery; +import org.apache.lucene.store.ByteBuffersDirectory; +import org.apache.lucene.store.Directory; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.NumericUtils; + +/** + * Shows usage of indexing and searching {@link FacetSetsField} with a custom {@link FacetSet} + * implementation. Unlike the out of the box {@link FacetSet} implementations, this example shows + * how to mix and match dimensions of different types, as well as implementing a custom {@link + * FacetSetMatcher}. + */ +public class CustomFacetSetExample { + + private static final long MAY_SECOND_2022 = date("2022-05-02"); + private static final long JUNE_SECOND_2022 = date("2022-06-02"); + private static final long JULY_SECOND_2022 = date("2022-07-02"); + private static final float HUNDRED_TWENTY_DEGREES = fahrenheitToCelsius(120); + private static final float HUNDRED_DEGREES = fahrenheitToCelsius(100); + private static final float EIGHTY_DEGREES = fahrenheitToCelsius(80); + + private final Directory indexDir = new ByteBuffersDirectory(); + + /** Empty constructor */ + public CustomFacetSetExample() {} + + /** Build the example index. 
*/ + private void index() throws IOException { +IndexWriter indexWriter = +new IndexWriter( +indexDir, new IndexWriterConfig(new WhitespaceAnalyzer()).setOpenMode(OpenMode.CREATE)); + +// Every document holds the temperature measures for a City by Date + +Document doc = new Document(); +doc.add(new StringField("city", "city1", Field.Store.YES)); +doc.add( +FacetSetsField.create( +"temperature", +new TemperatureReadingFacetSet(MAY_SECOND_2022, HUNDRED_DEGREES), +new TemperatureReadingFacetSet(JUNE_SECOND_2022, EIGHTY_DEGREES), +new TemperatureReadingFacetSet(JULY_SECOND_2022, HUNDRED_TWENTY_DEGREES))); +indexWriter.addDocument(doc); + +doc = new Document(); +doc.add(new StringField("city", "city2", Field.Store.YES)); +doc.add( +FacetSetsField.create( +"temperature", +new TemperatureReadingFacetSet(MAY_SECOND_2022, EIGHTY_DEGREES), +new TemperatureReadingFacetSet(JUNE_SECOND_2022, HUNDRED_DEGREES), +new TemperatureReadingFacetSet(JULY_SECOND_2022, HUNDRED_TWENTY_DEGREES))); +indexWriter.addDocument(doc); + +indexWriter.close(); + } + + /** Counting documents which exactly match a given {@link FacetSet}. */ + private List exactMatching() throws IOException { +DirectoryReader indexReader = DirectoryReader.open(indexDir); +IndexSearcher searcher = new IndexSearcher(indexReader); + +// MatchAllDocsQuery is for "browsing" (counts facets +// for all non-deleted docs in the index); normally +// you'd use a "normal" query: +FacetsCollector fc = searcher.search(new MatchAllDocsQuery(), new FacetsCollectorManager()); + +// Count both "Publish Date" and "Author" dimensions +Facets facets = +new MatchingFacetSetsCounts( +"temperature", +fc, +TemperatureReadingFacetSet::decodeTemperatureReading, +
[GitHub] [lucene] gsmiller commented on a diff in pull request #914: LUCENE-10550: Add getAllChildren functionality to facets
gsmiller commented on code in PR #914: URL: https://github.com/apache/lucene/pull/914#discussion_r903092231 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java: ## @@ -163,6 +164,76 @@ public Number getSpecificValue(String dim, String... path) throws IOException { return getValue(ord); } + @Override + public FacetResult getAllChildren(String dim, String... path) throws IOException { +DimConfig dimConfig = verifyDim(dim); +FacetLabel cp = new FacetLabel(dim, path); +int dimOrd = taxoReader.getOrdinal(cp); +if (dimOrd == -1) { + return null; +} + +int aggregatedValue = 0; +int childCount = 0; + +List ordinals = new ArrayList<>(); +List ordValues = new ArrayList<>(); + +if (sparseValues != null) { + for (IntIntCursor c : sparseValues) { +int value = c.value; +int ord = c.key; +if (parents[ord] == dimOrd && value > 0) { + aggregatedValue = aggregationFunction.aggregate(aggregatedValue, value); + childCount++; + ordinals.add(ord); + ordValues.add(value); +} + } +} else { + int[] children = getChildren(); + int[] siblings = getSiblings(); + int ord = children[dimOrd]; + while (ord != TaxonomyReader.INVALID_ORDINAL) { +int value = values[ord]; +if (value > 0) { + aggregatedValue = aggregationFunction.aggregate(aggregatedValue, value); + childCount++; + ordinals.add(ord); + ordValues.add(value); +} +ord = siblings[ord]; + } +} + +if (aggregatedValue == 0) { + return null; +} + +if (dimConfig.multiValued) { + if (dimConfig.requireDimCount) { +aggregatedValue = getValue(dimOrd); + } else { +// Our sum'd value is not correct, in general: +aggregatedValue = -1; + } +} else { + // Our sum'd dim value is accurate, so we keep it +} + +int[] ordinalArray = new int[ordinals.size()]; +for (int i = 0; i < ordinals.size(); i++) { + ordinalArray[i] = ordinals.get(i); +} Review Comment: Ah, I see. Shoot. It bugs me that we need to copy these ordinals from a list to an array just to do this bulk path lookup, but I see what you're saying. 
It would be nice if `TaxonomyReader` could directly support `List` in addition to an array, but I don't think this use-case justifies trying to add that right now. Would you mind adding a `TODO` comment here to mention that it would be nice if we didn't need to do this copy just to look up bulk paths? We can leave it at that for now and optimize later if/as necessary. Thanks for pointing this out! ## lucene/facet/src/java/org/apache/lucene/facet/sortedset/AbstractSortedSetDocValueFacetCounts.java: ## @@ -72,6 +72,40 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I return createFacetResult(topChildrenForPath, dim, path); } + @Override + public FacetResult getAllChildren(String dim, String... path) throws IOException { +FacetsConfig.DimConfig dimConfig = stateConfig.getDimConfig(dim); + +if (dimConfig.hierarchical) { + int pathOrd = (int) dv.lookupTerm(new BytesRef(FacetsConfig.pathToString(dim, path))); + if (pathOrd < 0) { +// path was never indexed +return null; + } + SortedSetDocValuesReaderState.DimTree dimTree = state.getDimTree(dim); + return getPathResult(dimConfig, dim, path, pathOrd, dimTree.iterator(pathOrd)); +} else { + if (path.length > 0) { +throw new IllegalArgumentException( +"Field is not configured as hierarchical, path should be 0 length"); + } + OrdRange ordRange = state.getOrdRange(dim); + if (ordRange == null) { +// means dimension was never indexed +return null; + } + int dimOrd = ordRange.start; + PrimitiveIterator.OfInt childIt = ordRange.iterator(); + if (dimConfig.multiValued && dimConfig.requireDimCount) { +// If the dim is multi-valued and requires dim counts, we know we've explicitly indexed +// the dimension and we need to skip past it so the iterator is positioned on the first +// child: +childIt.next(); + } + return getPathResult(dimConfig, dim, null, dimOrd, childIt); +} + } Review Comment: Of course! 
There was a lot of change happening while you were working on this, so I'm sure you were working against an earlier version and just didn't notice some of the change to getTopChildren. Happy to point them out. ## lucene/facet/src/java/org/apache/lucene/facet/Facets.java: ## @@ -29,6 +29,12 @@ public abstract class Facets { /** Default constructor. */ public Facets() {} + /** + * Returns all the children labels with non-zero counts under the specified path in the unsorted + * order. Returns null if the spe
[GitHub] [lucene] Yuti-G commented on pull request #914: LUCENE-10550: Add getAllChildren functionality to facets
Yuti-G commented on PR #914: URL: https://github.com/apache/lucene/pull/914#issuecomment-1162454475 Thank you so much for the last check! I added more javadoc and a new entry to the CHANGES.txt. For back-porting, should I wait until this PR is merged and checkout a new branch against the latest branch_9x to cherrypick the merged commit? Thanks again! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #914: LUCENE-10550: Add getAllChildren functionality to facets
gsmiller commented on PR #914: URL: https://github.com/apache/lucene/pull/914#issuecomment-1162469235 > For back-porting, should I wait until this PR is merged and checkout a new branch against the latest branch_9x to cherrypick the merged commit? Thanks again! Exactly! Then you can open a PR with that branch against `origin/branch_9x` (github will automatically select `origin/main` as the suggested destination so just change that). And just mention that it's a backport PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10550) Add getAllChildren functionality to facets
[ https://issues.apache.org/jira/browse/LUCENE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557161#comment-17557161 ] ASF subversion and git services commented on LUCENE-10550: -- Commit bdcb4b37164ba07e87e2e987f7fd4c9c50690601 in lucene's branch refs/heads/main from Yuting Gan [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=bdcb4b37164 ] LUCENE-10550: Add getAllChildren functionality to facets (#914) > Add getAllChildren functionality to facets > -- > > Key: LUCENE-10550 > URL: https://issues.apache.org/jira/browse/LUCENE-10550 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Yuting Gan >Priority: Minor > Time Spent: 2h > Remaining Estimate: 0h > > Currently Lucene does not support returning range counts sorted by label > values, but there are use cases demanding this feature. For example, a user > specifies ranges (e.g., [0, 10], [10, 20]) and wants to get range counts > without changing the range order. Today we can only call getTopChildren to > populate range counts, but it would return ranges sorted by counts (e.g., > [10, 20] 100, [0, 10] 50) instead of range values. > Lucene has an API, getAllChildrenSortByValue, that returns numeric values with > counts sorted by label values; please see > [LUCENE-7927|https://issues.apache.org/jira/browse/LUCENE-7927] for details. > Therefore, it would be nice if we could also have a similar API to support > range counts. The proposed getAllChildren API is to return value/range counts > sorted by label values instead of counts. > This proposal was inspired by the discussions with [~gsmiller] when I was > working on the LUCENE-10538 [PR|https://github.com/apache/lucene/pull/843], > and we believe users would benefit from adding this API to Facets. > Hope I can get some feedback from the community since this proposal would > require changes to the getTopChildren API in RangeFacetCounts. Thanks! 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #914: LUCENE-10550: Add getAllChildren functionality to facets
gsmiller commented on PR #914: URL: https://github.com/apache/lucene/pull/914#issuecomment-1162469491 Merged onto `main`. Thanks again @Yuti-G! Exciting to see this new functionality available :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller merged pull request #914: LUCENE-10550: Add getAllChildren functionality to facets
gsmiller merged PR #914: URL: https://github.com/apache/lucene/pull/914 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] Yuti-G opened a new pull request, #971: LUCENE-10550: Add getAllChildren functionality to facets (#914)
Yuti-G opened a new pull request, #971: URL: https://github.com/apache/lucene/pull/971 Just using to backport. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557164#comment-17557164 ] Yuting Gan commented on LUCENE-10614: - Thank you so much for reviewing and merging the LUCENE-10550 PR! I will start working on this issue and will create a PR to properly return topNChildren in RangeFacetCounts. > Properly support getTopChildren in RangeFacetCounts > --- > > Key: LUCENE-10614 > URL: https://issues.apache.org/jira/browse/LUCENE-10614 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 10.0 (main) >Reporter: Greg Miller >Priority: Minor > > As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing > {{getTopChildren}}. Instead of returning "top" ranges, it returns all > user-provided ranges in the order the user specified them when instantiating. > This is probably more useful functionality, but it would be nice to support > {{getTopChildren}} as well. > LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that > lands, we can replace the current implementation of {{getTopChildren}} with > an actual "top children" implementation and direct users to > {{getAllChildren}} if they want to maintain the current behavior. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller merged pull request #971: LUCENE-10550: Add getAllChildren functionality to facets (#914)
gsmiller merged PR #971: URL: https://github.com/apache/lucene/pull/971 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #971: LUCENE-10550: Add getAllChildren functionality to facets (#914)
gsmiller commented on PR #971: URL: https://github.com/apache/lucene/pull/971#issuecomment-1162519460 Thanks @Yuti-G ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10550) Add getAllChildren functionality to facets
[ https://issues.apache.org/jira/browse/LUCENE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557174#comment-17557174 ] ASF subversion and git services commented on LUCENE-10550: -- Commit b2c454c8be1549fedd455632a43cea18ff975755 in lucene's branch refs/heads/branch_9x from Yuting Gan [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b2c454c8be1 ] LUCENE-10550: Add getAllChildren functionality to facets > Add getAllChildren functionality to facets > -- > > Key: LUCENE-10550 > URL: https://issues.apache.org/jira/browse/LUCENE-10550 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Yuting Gan >Priority: Minor > Time Spent: 2.5h > Remaining Estimate: 0h > > Currently Lucene does not support returning range counts sorted by label > values, but there are use cases demanding this feature. For example, a user > specifies ranges (e.g., [0, 10], [10, 20]) and wants to get range counts > without changing the range order. Today we can only call getTopChildren to > populate range counts, but it would return ranges sorted by counts (e.g., > [10, 20] 100, [0, 10] 50) instead of range values. > Lucene has an API, getAllChildrenSortByValue, that returns numeric values with > counts sorted by label values; please see > [LUCENE-7927|https://issues.apache.org/jira/browse/LUCENE-7927] for details. > Therefore, it would be nice if we could also have a similar API to support > range counts. The proposed getAllChildren API is to return value/range counts > sorted by label values instead of counts. > This proposal was inspired by the discussions with [~gsmiller] when I was > working on the LUCENE-10538 [PR|https://github.com/apache/lucene/pull/843], > and we believe users would benefit from adding this API to Facets. > Hope I can get some feedback from the community since this proposal would > require changes to the getTopChildren API in RangeFacetCounts. Thanks! 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #968: [LUCENE-10624] Binary Search for Sparse IndexedDISI advanceWithinBloc…
zacharymorn commented on PR #968: URL: https://github.com/apache/lucene/pull/968#issuecomment-1162539088 Thanks @wuwm for opening this PR! The improvement idea makes sense to me. Quick question though, given the similarities of the binary search implementations in the two methods, is it possible to extract them out into a common method? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
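The shared helper suggested in the review comment above would be a classic lower-bound binary search. A hedged sketch of what such an extracted method could look like — this is an illustration, not the actual IndexedDISI code:

```java
/**
 * Sketch of a reusable lower-bound binary search, the kind of helper the
 * review comment above suggests extracting: returns the smallest index in
 * [lo, hi) whose value is >= target, or hi if no such index exists.
 * Assumes the slice of `sorted` is in ascending order.
 */
public class LowerBoundSketch {
    static int firstGreaterOrEqual(long[] sorted, int lo, int hi, long target) {
        while (lo < hi) {
            int mid = (lo + hi) >>> 1; // unsigned shift avoids (lo + hi) int overflow
            if (sorted[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

Both advance methods could then call this helper with their own bounds and slot accessors, which is the deduplication the comment asks about.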
[jira] [Commented] (LUCENE-10607) Overflow when NRTSuggesterBuilder expands the input
[ https://issues.apache.org/jira/browse/LUCENE-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557197#comment-17557197 ] chaseny commented on LUCENE-10607: -- (y) > Overflow when NRTSuggesterBuilder expands the input > - > > Key: LUCENE-10607 > URL: https://issues.apache.org/jira/browse/LUCENE-10607 > Project: Lucene - Core > Issue Type: Bug > Components: core/FSTs >Affects Versions: 9.2 >Reporter: chaseny >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > When the suggest module builds its index, it calls NRTSuggesterBuilder#finishTerm to write the suggest index. > finishTerm calls maxNumArcsForDedupByte to expand `analyzed`, growing it through 3, 5, 7, ... 255. > When the entries get long (e.g. 900), the expansion via maxNumArcsForDedupByte overflows: > > private static int maxNumArcsForDedupByte(int currentNumDedupBytes) { > int maxArcs = 1 + (2 * currentNumDedupBytes); > if (currentNumDedupBytes > 5) { > maxArcs *= currentNumDedupBytes; > // when currentNumDedupBytes >= 32768, the int multiplication exceeds Integer.MAX_VALUE > } > return Math.min(maxArcs, 255); > } > > Also, when expanding, could we grow by a fixed 4 bytes each time, in order, instead of the 3, 5, 7, ... 255 scheme? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
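One way to address the int overflow in maxNumArcsForDedupByte reported above is to widen the arithmetic to long before clamping. A sketch of that idea — an illustration under the assumption described in the report, not the actual Lucene patch:

```java
/**
 * Sketch of an overflow-safe variant of maxNumArcsForDedupByte: doing the
 * multiplication in long arithmetic keeps the intermediate product from
 * wrapping when currentNumDedupBytes is large (>= ~32768), as described in
 * the bug report above. The result is clamped to 255 either way, so only
 * the intermediate type changes. Illustrative, not the actual Lucene fix.
 */
public class MaxArcsSketch {
    static int maxNumArcsForDedupByteSafe(int currentNumDedupBytes) {
        long maxArcs = 1L + 2L * currentNumDedupBytes; // long math throughout
        if (currentNumDedupBytes > 5) {
            maxArcs *= currentNumDedupBytes; // cannot overflow long for any int input
        }
        return (int) Math.min(maxArcs, 255);
    }
}
```

Since the return value is clamped to 255, the long intermediate never escapes; the cast back to int is safe.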
[GitHub] [lucene] zacharymorn opened a new pull request, #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction
zacharymorn opened a new pull request, #972: URL: https://github.com/apache/lucene/pull/972 ### Description (or a Jira issue link if you have one) Use Block-Max-Maxscore algorithm for 2 clauses disjunction. Adapted from PR https://github.com/apache/lucene/pull/101 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557223#comment-17557223 ] Lu Xugang commented on LUCENE-10603: Hi, [~gsmiller] when I started to work on the rest of the modules, I found a new issue, LUCENE-10623, which should be resolved first. {quote}I'm happy to divide up some of the modules {quote} LUCENE-10623 will affect the modules that use *SortingSortedDocValues*; if you have free time, you could make the change in the modules that are not affected, and I will take care of the rest of the modules after LUCENE-10623 is merged. > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Assignee: Lu Xugang >Priority: Trivial > Time Spent: 2h > Remaining Estimate: 0h > > Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should we > refactor the implementation of ords iteration to use docValueCount instead of > NO_MORE_ORDS, similar to what SortedNumericDocValues does? > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
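The refactoring proposed in LUCENE-10603 above replaces a sentinel-terminated loop with a counted loop. A standalone sketch of the two equivalent shapes, with a plain iterator standing in for SortedSetDocValues (illustrative only, not the Lucene API):

```java
import java.util.Iterator;
import java.util.List;

/** Sketch contrasting the NO_MORE_ORDS sentinel loop with the counted (docValueCount-style) loop. */
public class OrdIterationSketch {
    static final long NO_MORE_ORDS = -1;

    /** Old style: pull ords until the sentinel value appears. */
    static long sumSentinel(List<Long> ords) {
        Iterator<Long> it = ords.iterator();
        long sum = 0;
        for (long ord = next(it); ord != NO_MORE_ORDS; ord = next(it)) {
            sum += ord;
        }
        return sum;
    }

    /** New style: iterate exactly the known count of ords (docValueCount() analogue). */
    static long sumCounted(List<Long> ords) {
        Iterator<Long> it = ords.iterator();
        long sum = 0;
        for (int i = 0; i < ords.size(); i++) {
            sum += it.next();
        }
        return sum;
    }

    private static long next(Iterator<Long> it) {
        return it.hasNext() ? it.next() : NO_MORE_ORDS;
    }
}
```

Both loops visit the same ords; the counted form simply never needs a reserved sentinel value, which is the motivation of the issue.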
[GitHub] [lucene] zacharymorn commented on pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction
zacharymorn commented on PR #972: URL: https://github.com/apache/lucene/pull/972#issuecomment-1162613846

Hi @jpountz, I have adapted the original BMM PR https://github.com/apache/lucene/pull/101 to the latest codebase and run further experiments on using it for 2-clause disjunctions. The results look both encouraging and strange :D

When I run `python3 src/python/localrun.py -source wikimedium10m` with only the `OrHighLow`, `OrHighHigh` and `OrHighMed` tasks from `tasks/wikimedium.10M.nostopwords.tasks` (by removing the other tasks), I got a pretty impressive speedup on average:

```
Task          QPS baseline  StdDev   QPS my_modified_version  StdDev     Pct diff           p-value
PKLookup            173.31  (24.6%)            181.79  (26.8%)    4.9% ( -37% -   74%)  0.547
OrHighLow           166.70  (62.8%)            385.94 (101.5%)  131.5% ( -20% -  794%)  0.000
OrHighHigh            9.27  (48.9%)             23.44  (85.9%)  152.9% (  12% -  562%)  0.000
OrHighMed            18.45  (61.3%)             55.92 (137.3%)  203.0% (   2% - 1037%)  0.000
```

However, when I run all the tasks, `OrHighLow`, `OrHighHigh` and `OrHighMed` see only a moderate speedup on average, and are sometimes even slightly negatively impacted:

```
Task                          QPS baseline  StdDev   QPS my_modified_version  StdDev     Pct diff          p-value
OrHighHigh                           35.23   (7.2%)             23.86   (7.0%)  -32.3% ( -43% -  -19%)  0.000
OrHighLow                           898.97   (4.4%)            788.65   (4.2%)  -12.3% ( -20% -   -3%)  0.000
BrowseDateSSDVFacets                  2.62  (27.0%)              2.43  (18.8%)   -7.4% ( -41% -   52%)  0.312
HighSpanNear                         21.86   (6.4%)             21.00   (6.1%)   -4.0% ( -15% -    9%)  0.045
Fuzzy2                               94.11  (12.4%)             90.59   (9.8%)   -3.7% ( -23% -   21%)  0.290
LowSloppyPhrase                      65.63   (8.2%)             63.99   (8.6%)   -2.5% ( -17% -   15%)  0.347
HighSloppyPhrase                     17.25   (5.3%)             16.84   (5.3%)   -2.4% ( -12% -    8%)  0.154
TermDTSort                          160.18   (8.2%)            156.49   (9.9%)   -2.3% ( -18% -   17%)  0.423
HighTermDayOfYearSort               164.86   (6.8%)            161.77  (10.1%)   -1.9% ( -17% -   16%)  0.490
OrHighMedDayTaxoFacets               11.05   (7.1%)             10.86   (7.3%)   -1.7% ( -15% -   13%)  0.465
AndHighLow                         1482.47   (4.0%)           1459.63  (10.6%)   -1.5% ( -15% -   13%)  0.544
MedSpanNear                          27.77   (7.2%)             27.49   (6.1%)   -1.0% ( -13% -   13%)  0.628
HighTermTitleBDVSort                197.53   (7.4%)            195.53   (6.3%)   -1.0% ( -13% -   13%)  0.640
AndHighMedDayTaxoFacets              43.61   (8.7%)             43.19  (10.1%)   -1.0% ( -18% -   19%)  0.745
HighIntervalsOrdered                 17.38   (8.7%)             17.26   (7.5%)   -0.7% ( -15% -   16%)  0.782
HighPhrase                          454.15   (5.0%)            451.67   (8.7%)   -0.5% ( -13% -   13%)  0.807
BrowseRandomLabelSSDVFacets          15.40   (8.1%)             15.32   (7.3%)   -0.5% ( -14% -   16%)  0.837
AndHighHighDayTaxoFacets             16.94   (7.0%)             16.87   (6.6%)   -0.5% ( -13% -   14%)  0.834
LowSpanNear                           9.08   (4.8%)              9.05   (4.3%)   -0.3% (  -9% -    9%)  0.838
Wildcard                             55.15  (11.3%)             55.01  (12.0%)   -0.2% ( -21% -   26%)  0.947
MedPhrase                           976.56   (2.8%)            977.29   (3.3%)    0.1% (  -5% -    6%)  0.939
MedTermDayTaxoFacets                 77.21   (8.6%)             77.46   (8.7%)    0.3% ( -15% -   19%)  0.908
OrNotHighLow                       1187.34   (5.1%)           1191.80   (5.3%)    0.4% (  -9% -   11%)  0.819
OrHighNotHigh                      1556.42   (4.4%)           1566.26   (4.5%)    0.6% (  -7% -    9%)  0.654
LowIntervalsOrdered                 158.96   (6.4%)            160.03   (8.9%)    0.7% ( -13% -   17%)  0.785
OrNotHighHigh                      1427.22   (3.8%)           1436.97   (5.0%)    0.7% (  -7% -    9%)  0.628
Fuzzy1                              116.55  (11.4%)            117.41   (9.4%)    0.7% ( -18% -   24%)  0.823
LowTerm                            3470.46   (5.9%)           3500.25   (5.9%)    0.9% ( -10% -   13%)  0.644
HighTermMonthSort                   169.22  (10.4%)            170.68  (14.9%)    0.9% ( -22% -   29%)  0.832
IntNRQ                              115.77  (22.6%)            116.95  (21.3%)    1.0% ( -34% -   57
```
[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
shaie commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r903277452

## lucene/demo/src/java/org/apache/lucene/demo/facet/CustomFacetSetExample.java:
@@ -0,0 +1,303 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.demo.facet;
+
+import java.io.IOException;
+import java.time.LocalDate;
+import java.time.ZoneOffset;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
+import org.apache.lucene.document.*;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.FacetsCollectorManager;
+import org.apache.lucene.facet.facetset.*;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.IndexWriterConfig.OpenMode;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.store.ByteBuffersDirectory;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.NumericUtils;
+
+/**
+ * Shows usage of indexing and searching {@link FacetSetsField} with a custom {@link FacetSet}
+ * implementation. Unlike the out of the box {@link FacetSet} implementations, this example shows
+ * how to mix and match dimensions of different types, as well as implementing a custom {@link
+ * FacetSetMatcher}.
+ */
+public class CustomFacetSetExample {
+
+  private static final long MAY_SECOND_2022 = date("2022-05-02");
+  private static final long JUNE_SECOND_2022 = date("2022-06-02");
+  private static final long JULY_SECOND_2022 = date("2022-07-02");
+  private static final float HUNDRED_TWENTY_DEGREES = fahrenheitToCelsius(120);
+  private static final float HUNDRED_DEGREES = fahrenheitToCelsius(100);
+  private static final float EIGHTY_DEGREES = fahrenheitToCelsius(80);
+
+  private final Directory indexDir = new ByteBuffersDirectory();
+
+  /** Empty constructor */
+  public CustomFacetSetExample() {}
+
+  /** Build the example index. */
+  private void index() throws IOException {
+    IndexWriter indexWriter =
+        new IndexWriter(
+            indexDir, new IndexWriterConfig(new WhitespaceAnalyzer()).setOpenMode(OpenMode.CREATE));
+
+    // Every document holds the temperature measures for a City by Date
+
+    Document doc = new Document();
+    doc.add(new StringField("city", "city1", Field.Store.YES));
+    doc.add(
+        FacetSetsField.create(
+            "temperature",
+            new TemperatureReadingFacetSet(MAY_SECOND_2022, HUNDRED_DEGREES),
+            new TemperatureReadingFacetSet(JUNE_SECOND_2022, EIGHTY_DEGREES),
+            new TemperatureReadingFacetSet(JULY_SECOND_2022, HUNDRED_TWENTY_DEGREES)));
+    indexWriter.addDocument(doc);
+
+    doc = new Document();
+    doc.add(new StringField("city", "city2", Field.Store.YES));
+    doc.add(
+        FacetSetsField.create(
+            "temperature",
+            new TemperatureReadingFacetSet(MAY_SECOND_2022, EIGHTY_DEGREES),
+            new TemperatureReadingFacetSet(JUNE_SECOND_2022, HUNDRED_DEGREES),
+            new TemperatureReadingFacetSet(JULY_SECOND_2022, HUNDRED_TWENTY_DEGREES)));
+    indexWriter.addDocument(doc);
+
+    indexWriter.close();
+  }
+
+  /** Counting documents which exactly match a given {@link FacetSet}. */
+  private List<FacetResult> exactMatching() throws IOException {
+    DirectoryReader indexReader = DirectoryReader.open(indexDir);
+    IndexSearcher searcher = new IndexSearcher(indexReader);
+
+    // MatchAllDocsQuery is for "browsing" (counts facets
+    // for all non-deleted docs in the index); normally
+    // you'd use a "normal" query:
+    FacetsCollector fc = searcher.search(new MatchAllDocsQuery(), new FacetsCollectorManager());
+
+    // Count both "Publish Date" and "Author" dimensions

Review Comment:
Indeed :), I copied the simple faceting example and didn't cover up my tracks very well :D.

-- This is an automated message from the Apache Git Service. To res
[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
shaie commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r903277577

## lucene/demo/src/java/org/apache/lucene/demo/facet/CustomFacetSetExample.java (quoting the same new file as the previous comment; the additional quoted lines are):

+    // Count both "Publish Date" and "Author" dimensions
+    Facets facets =
+        new MatchingFacetSetsCounts(
+            "temperature",
+            fc,
+            TemperatureReadingFacetSet::decodeTemperatureReading,
[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
shaie commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r903278541

## lucene/facet/docs/FacetSets.adoc:
@@ -0,0 +1,130 @@
+= FacetSets Overview
+:toc:
+
+This document describes the `FacetSets` capability, which allows aggregating on multidimensional values. It starts
+by outlining a few example use cases to showcase the motivation for this capability, and follows with an API
+walkthrough.
+
+== Motivation
+
+[#movie-actors]
+=== Movie Actors DB
+
+Suppose that you want to build a search engine for movie actors which allows you to search for actors by name and see
+movie titles they appeared in. You might want to index standard fields such as `actorName`, `genre` and `releaseYear`,
+which will let you search by the actor's name or see all actors who appeared in movies during 2021. Similarly, you can
+index facet fields that will let you aggregate by "Genre" and "Year" so that you can show how many actors appeared in
+each year or genre. A few example documents:
+
+[source]
+{ "name": "Tom Hanks", "genre": ["Comedy", "Drama", …], "year": [1988, 2000, …] }
+{ "name": "Harrison Ford", "genre": ["Action", "Adventure", …], "year": [1977, 1981, …] }
+
+However, these facet fields do not allow you to show the following aggregation:
+
+.Number of Actors performing in movies by Genre and Year
+[cols="4*"]
+|===
+|           | 2020 | 2021 | 2022
+| Thriller  | 121  | 43   | 97
+| Action    | 145  | 52   | 130
+| Adventure | 87   | 21   | 32
+|===
+
+The reason is that each "genre" or "releaseYear" facet field is indexed in its own data structure, and therefore if an
+actor appeared in a "Thriller" movie in "2020" and an "Action" movie in "2021", there's no way for you to tell that they
+didn't appear in an "Action" movie in "2020".
+
+[#automotive-parts]
+=== Automotive Parts Store
+
+Say you're building a search engine for an automotive parts store where customers can search for different car parts.
+For simplicity, let's assume that each item in the catalog contains a searchable "type" field and the "car model" it fits,
+which consists of two separate fields: "manufacturer" and "year". This lets you search for parts by their type as well
+as filter parts that fit only a certain manufacturer or year. A few example documents:
+
+[source]
+{
+  "type": "Wiper Blades V1",
+  "models": [
+    { "manufacturer": "Ford", "year": 2010 },
+    { "manufacturer": "Chevy", "year": 2011 }
+  ]
+}
+{
+  "type": "Wiper Blades V2",
+  "models": [
+    { "manufacturer": "Ford", "year": 2011 },
+    { "manufacturer": "Chevy", "year": 2010 }
+  ]
+}
+
+By breaking up the "models" field into its sub-fields "manufacturer" and "year", you can easily aggregate on parts that
+fit a certain manufacturer or year. However, if a user would like to aggregate on parts that can fit either a "Ford
+2010" or a "Chevy 2011", then aggregating on the sub-fields will lead to a wrong count of 2 (in the above example) instead
+of 1.
+
+[#movie-awards]
+=== Movie Awards
+
+To showcase a 3-D multidimensional aggregation, let's expand the <<movie-actors>> example with awards an actor has
+received over the years. For this aggregation we will use four dimensions: Award Type ("Oscar", "Grammy", "Emmy"),
+Award Category ("Best Actor", "Best Supporting Actress"), Year and Genre. One interesting aggregation is to show how
+many "Best Actor" vs. "Best Supporting Actor" awards one has received in the "Oscar" or "Emmy" for each year. Another
+aggregation is slicing the number of these awards by Genre over all the years.
+
+Building on these examples, one might be able to come up with an interesting use case for an N-dimensional aggregation
+(where `N > 3`). The higher `N` is, the harder it is to aggregate all the dimensions correctly and efficiently without
+`FacetSets`.
+
+== FacetSets API
+
+The `facetset` package consists of a few components which allow you to index and aggregate multidimensional facet sets:
+
+=== FacetSet
+
+Holds a set of facet dimension values. Implementations are required to convert the dimensions into a comparable long
+representation, and can also implement how the values are packed (encoded). The package offers four implementations:
+`Int/Float/Long/DoubleFacetSet` for `int`, `float`, `long` and `double` values respectively. You can also look at
+`org.apache.lucene.demo.facet.CustomFacetSetExample` in the `lucene/demo` package for a custom implementation of a
+`FacetSet`.
+
+=== FacetSetsField
+
+A `BinaryDocValues` field which lets you index a list of `FacetSet`. This field can be added to a document only once, so
+you will need to construct all the facet sets in advance.
+
+=== FacetSetMatcher
+
+Responsible for matching an encoded `FacetSet` against a given criteria. For example, `ExactFacetSetMatcher` only
+considers an encoded facet set a match if all dimension values are equal to a given one. `RangeFacetSetMatcher`
+considers an encoded facet set as
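The core point of the automotive-parts example quoted above — counting over (manufacturer, year) *pairs* rather than over each sub-field independently — can be illustrated with a toy model. The `FacetSetCountSketch` class and its pipe-encoded pairs are hypothetical illustrations, not the Lucene facetset API:

```java
import java.util.List;
import java.util.Set;

/**
 * Toy illustration of facet-set counting: each document stores its
 * (manufacturer, year) combinations as encoded pairs like "Ford|2010".
 */
class FacetSetCountSketch {
  /** Set-based count: a doc matches if it contains one of the wanted pairs. */
  static long countExactMatches(List<Set<String>> docs, Set<String> wanted) {
    return docs.stream().filter(doc -> doc.stream().anyMatch(wanted::contains)).count();
  }

  /**
   * Naive per-field count: a doc matches if any manufacturer AND any year match,
   * even when they come from different pairs -- the over-counting described above.
   */
  static long countPerField(
      List<Set<String>> docs, Set<String> manufacturers, Set<Integer> years) {
    return docs.stream()
        .filter(
            doc ->
                doc.stream().anyMatch(p -> manufacturers.contains(p.split("\\|")[0]))
                    && doc.stream()
                        .anyMatch(p -> years.contains(Integer.parseInt(p.split("\\|")[1]))))
        .count();
  }
}
```

With the two wiper-blade documents from the example, asking for "Ford 2010 or Chevy 2011" counts 1 document when matching whole pairs, but 2 when the sub-fields are aggregated independently.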
[GitHub] [lucene] LuXugang commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues
LuXugang commented on code in PR #967: URL: https://github.com/apache/lucene/pull/967#discussion_r903314253

## lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
@@ -439,29 +433,42 @@ private void set() {
   static final class DocOrds {
     final long[] offsets;
     final PackedLongValues ords;
+    final GrowableWriter growableWriter;
+
+    public static final int START_BITS_PER_VALUE = 2;

Review Comment:
A `bitsPerValue` is required for `GrowableWriter`. We could count `maxBitsRequired` while adding values in `SortedSetDocValuesWriter`, but `SortingSortedSetDocValues` is also used in `SortingCodecReader#getDocValuesReader#getSortedSet`, which cannot supply a `bitsPerValue`, so we have to use a default value. Could you suggest, based on practice, a good value for `START_BITS_PER_VALUE`, @jpountz?
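The trade-off behind choosing `START_BITS_PER_VALUE` can be sketched with a minimal "growable" writer. The `GrowableSketch` class below is an illustration, not Lucene's `GrowableWriter`: it only tracks the current width, whereas the real implementation stores values bit-packed and copies them into a wider layout on upgrade. A small starting width wastes little memory up front but may pay repeated re-packing costs as larger values arrive:

```java
/**
 * Illustration of a growable packed-value writer: values start at a small
 * bitsPerValue and the layout widens whenever a value that does not fit arrives.
 */
class GrowableSketch {
  private int bitsPerValue;
  private final long[] values; // plain array stand-in for a bit-packed store
  private int size = 0;

  GrowableSketch(int startBitsPerValue, int capacity) {
    this.bitsPerValue = startBitsPerValue;
    this.values = new long[capacity];
  }

  void add(long v) {
    // Number of bits needed to represent v (at least 1).
    int needed = 64 - Long.numberOfLeadingZeros(Math.max(v, 1));
    if (needed > bitsPerValue) {
      bitsPerValue = needed; // real impl: allocate a wider packed array and copy
    }
    values[size++] = v;
  }

  int bitsPerValue() {
    return bitsPerValue;
  }
}
```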
[GitHub] [lucene] kaivalnp commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection
kaivalnp commented on code in PR #951: URL: https://github.com/apache/lucene/pull/951#discussion_r903319874

## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
-    BitSetCollector filterCollector = null;
+    Weight filterWeight = null;
     if (filter != null) {
-      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
       BooleanQuery booleanQuery =
           new BooleanQuery.Builder()
               .add(filter, BooleanClause.Occur.FILTER)
               .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
               .build();
-      indexSearcher.search(booleanQuery, filterCollector);
+      Query rewritten = indexSearcher.rewrite(booleanQuery);
+      filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
     }
     for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterCollector);
+      Bits acceptDocs;
+      int cost;
+      if (filterWeight != null) {
+        Scorer scorer = filterWeight.scorer(ctx);
+        if (scorer != null) {
+          DocIdSetIterator iterator = scorer.iterator();
+          if (iterator instanceof BitSetIterator) {
+            acceptDocs = ((BitSetIterator) iterator).getBitSet();
+          } else {
+            acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+          }

Review Comment:
We can extend the `BitSetIterator` so that it also incorporates `liveDocs` (return the `nextSetBit` only if it is live, else move on to the next bit in a loop). But then we can't find an accurate estimate of the number of matching + live docs (which is needed as the `visitedLimit` to switch over to `exactSearch`)?
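The liveDocs intersection discussed in this review can be sketched with `java.util.BitSet` as a stand-in for Lucene's bitset types (the `AcceptDocsSketch` class and its names are illustrative, not the PR's API). With materialized bitsets, both the intersection and its exact cardinality — the matching + live count the reviewer wants for `visitedLimit` — are cheap; the difficulty in the PR is that an arbitrary `DocIdSetIterator` only exposes an approximate cost, so the exact count is unknown without consuming the iterator:

```java
import java.util.BitSet;

/** Stand-in illustration: intersect the filter's matching docs with liveDocs. */
class AcceptDocsSketch {
  /** Docs that match the filter AND are not deleted. */
  static BitSet acceptDocs(BitSet filterMatches, BitSet liveDocs) {
    BitSet accept = (BitSet) filterMatches.clone(); // keep the filter's bits intact
    accept.and(liveDocs); // in-place intersection
    return accept;
  }
}
```

Here `acceptDocs(...).cardinality()` gives the exact matching-and-live count directly, which is what an iterator-backed set cannot provide up front.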