[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?
[ https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556683#comment-17556683 ]

Adrien Grand commented on LUCENE-10507:
---------------------------------------

It looks like this change helped find a reproducible test failure:

./gradlew test --tests TestElevationComparator.testSorting -Dtests.seed=3AC6BE539DA8C1F3 -Dtests.locale=sg-CF -Dtests.timezone=America/Indiana/Knox -Dtests.asserts=true -Dtests.file.encoding=UTF-8

I don't understand the reason yet.

> Should it be more likely to search concurrently in tests?
> ---------------------------------------------------------
>
>                 Key: LUCENE-10507
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10507
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Luca Cavanna
>            Priority: Minor
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> As part of LUCENE-10002 we are migrating test usages of
> IndexSearcher#search(Query, Collector) to use the corresponding search method
> that takes a CollectorManager in place of a Collector. As part of such
> changes, I've been paying attention to whether searchers are created through
> LuceneTestCase#newSearcher and migrating to it when possible.
> This caused some recent test failures following test changes, which were in
> most cases test issues, although they were quite rare because we only rarely
> exercise the concurrent code path in tests.
> One recent failure uncovered LUCENE-10500, an actual bug that affected
> concurrent searches only; it was uncovered by a test run that indexed a
> considerable number of docs and was lucky enough to get an executor set on
> its index searcher as well as get multiple slices.
> LuceneTestCase#newIndexSearcher(IndexReader) uses threads only rarely, and
> even when useThreads is true, the searcher may not get an executor set. Also,
> it can often happen that despite an executor being set, the searcher will
> hold only one slice, as not enough documents are indexed. Some nightly tests
> index enough documents, and LuceneTestCase also lowers the slice limits, but
> only 50% of the time and only when wrapWithAssertions is false. Also I wonder
> if the lower limits are low enough:
> {code:java}
> int maxDocPerSlice = 1 + random.nextInt(10);
> int maxSegmentsPerSlice = 1 + random.nextInt(20);
> {code}
> All in all, I wonder if we should make it more likely for real concurrent
> searches to happen while testing across multiple slices. It seems like it
> could be useful, especially as we'd like users to use collector managers
> instead of collectors (although that does not necessarily translate to
> concurrent search).

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
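To see why such small limits matter, here is a toy model of slice formation (my own sketch, not Lucene's actual IndexSearcher slicing code): segments are greedily grouped into a slice until either the doc limit or the segment limit is hit. With generous limits, a small test index collapses into a single slice and no concurrency is exercised; only the lowered randomized limits produce multiple slices.

```java
import java.util.ArrayList;
import java.util.List;

public class SliceSketch {
    // Greedily group per-segment doc counts into slices, closing a slice once it
    // reaches either the doc limit or the segment limit. This mimics the spirit
    // of IndexSearcher's slicing, not its exact algorithm.
    static List<List<Integer>> slices(int[] segmentDocCounts, int maxDocsPerSlice, int maxSegmentsPerSlice) {
        List<List<Integer>> slices = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int docsInCurrent = 0;
        for (int docCount : segmentDocCounts) {
            current.add(docCount);
            docsInCurrent += docCount;
            if (docsInCurrent >= maxDocsPerSlice || current.size() >= maxSegmentsPerSlice) {
                slices.add(current);
                current = new ArrayList<>();
                docsInCurrent = 0;
            }
        }
        if (!current.isEmpty()) {
            slices.add(current);
        }
        return slices;
    }

    public static void main(String[] args) {
        int[] segments = {5, 3, 8, 2, 6}; // five small segments, as in a typical test index
        // With generous limits everything collapses into one slice: no concurrency.
        System.out.println(slices(segments, 1000, 100).size()); // 1
        // With limits in the lowered randomized range, several slices emerge.
        System.out.println(slices(segments, 10, 20).size()); // 2
    }
}
```

The point of the sketch: unless both the doc counts and the slice limits line up, the executor (even when set) receives a single task, which is why the concurrent code path was so rarely exercised.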
[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?
[ https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556684#comment-17556684 ]

Adrien Grand commented on LUCENE-10507:
---------------------------------------

Also we wondered if this change could affect the time it takes to run tests, but things look good so far: http://people.apache.org/~mikemccand/lucenebench/antcleantest.html
[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?
[ https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556688#comment-17556688 ]

ASF subversion and git services commented on LUCENE-10507:
----------------------------------------------------------

Commit adcf58fe8751c4af51e6dd841995e61065fa56e6 in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=adcf58fe875 ]

LUCENE-10507: Fix test failure.
[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?
[ https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556689#comment-17556689 ]

ASF subversion and git services commented on LUCENE-10507:
----------------------------------------------------------

Commit 4fab62b6b8766f42957c9ebb537ac380d5bd7af3 in lucene's branch refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4fab62b6b87 ]

LUCENE-10507: Fix test failure.
[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?
[ https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556690#comment-17556690 ]

Adrien Grand commented on LUCENE-10507:
---------------------------------------

OK, I found the issue with the test. The comparator was not correctly implemented: {{compareValues}} would sort values in the opposite order from {{compare}}. I pushed a fix.
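The failure mode described in this comment can be illustrated with a small hypothetical comparator (this is a sketch of the inconsistency, not the actual TestElevationComparator code): within a slice, hits are ordered with one method, and slice results are merged with another, so the two must agree in sign for every pair of values.

```java
public class ComparatorConsistency {
    // A field comparator reduced to its two ordering methods.
    interface SimpleFieldComparator {
        int compare(int a, int b);        // used to order hits within a slice
        int compareValues(int a, int b);  // used when merging results across slices
    }

    // Buggy version: compareValues reverses the order that compare uses, so the
    // merged top hit can differ from what single-threaded collection produces.
    static final SimpleFieldComparator BUGGY = new SimpleFieldComparator() {
        public int compare(int a, int b) { return Integer.compare(a, b); }
        public int compareValues(int a, int b) { return Integer.compare(b, a); } // reversed!
    };

    // Fixed version: both methods agree.
    static final SimpleFieldComparator FIXED = new SimpleFieldComparator() {
        public int compare(int a, int b) { return Integer.compare(a, b); }
        public int compareValues(int a, int b) { return Integer.compare(a, b); }
    };

    // The invariant a correct comparator must satisfy for every pair of values.
    static boolean consistent(SimpleFieldComparator c, int a, int b) {
        return Integer.signum(c.compare(a, b)) == Integer.signum(c.compareValues(a, b));
    }

    public static void main(String[] args) {
        System.out.println(consistent(BUGGY, 1, 2)); // false
        System.out.println(consistent(FIXED, 1, 2)); // true
    }
}
```

This is also why the bug only surfaced once tests actually ran with multiple slices: with a single slice, {{compareValues}} is never used to merge, so the inconsistency stays invisible.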
[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556475#comment-17556475 ]

Tomoko Uchida edited comment on LUCENE-10557 at 6/21/22 7:18 AM:
-----------------------------------------------------------------

I browsed through several JSON dumps of Jira issues. These are some observations.
- It'd be easy to extract various metadata of issues (reporter id, status, created timestamp, etc.)
- It'd be easy to extract all linked issue ids and sub-task ids
- It'd be easy to extract all attached file URLs
-- Can't estimate how many hours it will take to download all of the files
- It'd be easy to extract all comments in an issue
-- -Perhaps pagination is needed for issues with many comments- Comments in an issue can be retrieved all at once.
- We can apply parser/converter tools to convert the Jira markup to markdown
-- I think this can be error-prone
- It'd be cumbersome to extract GitHub PR links. The links to PRs only appear in the github bot's comments in the Work Log.

On the GitHub side, there are no difficulties in dealing with the APIs.
- It'd be a bit tedious to work with milestones via the APIs. They can't be referred to by their text; an id-to-text mapping is needed.
- It might take some trial and error to place attached files in their right place.

As for the cross-link conversion and account mapping script:
- To "embed" GitHub issue links / accounts in their right place (maybe next to the Jira issue keys / user names), we need to modify the original text. This can be tricky and is the riskiest part to me. Instead of modifying the original text, we could just add footnotes for the issues/comments - but that could considerably damage readability.

Yes, it should be possible with a set of small scripts. Maybe one problem is that it'd be difficult to detect conversion errors/omissions, and we can't correct them ourselves if we notice migration errors later (it seems we are not allowed to have the GitHub token of the ASF repository).
> Migrate to GitHub issue from Jira
> ---------------------------------
>
>                 Key: LUCENE-10557
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10557
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Major
>
> A few (not the majority of) Apache projects already use GitHub issues instead
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible for us to move to GitHub issues. I
> have little knowledge of how to proceed with it; I'd like to discuss whether
> we should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
> * (/) Get a consensus about the migration among committers
> * Choose issues that should be moved to GitHub
> ** Discussion thread
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
> ** -Conclusion for now: We don't migrate any issues. Only new issues should
> be opened
[jira] [Commented] (LUCENE-10622) Prepare complete migration script to GitHub issue from Jira (best effort)
[ https://issues.apache.org/jira/browse/LUCENE-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556723#comment-17556723 ]

Tomoko Uchida commented on LUCENE-10622:
----------------------------------------

Looks like cross-issue links and sub-tasks are fine. There are also issue links to outside projects (e.g. Solr, LEGAL, INFRA, etc.). Do we have to have fallback links to Jira from GitHub?

https://github.com/mocobeta/migration-test-1/issues/24

> Prepare complete migration script to GitHub issue from Jira (best effort)
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-10622
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10622
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Major
>
> If we intend to move the history to GitHub, it should be as close to perfect
> as possible - significantly degraded copies of history are harmful, rather
> than helpful, for future contributors, I think.
[GitHub] [lucene] kaivalnp commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection
kaivalnp commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r902336738

## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:

@@ -121,35 +140,15 @@ public Query rewrite(IndexReader reader) throws IOException {
     return createRewrittenQuery(reader, topK);
   }

-  private TopDocs searchLeaf(LeafReaderContext ctx, BitSetCollector filterCollector)
-      throws IOException {
-
-    if (filterCollector == null) {
-      Bits acceptDocs = ctx.reader().getLiveDocs();
-      return approximateSearch(ctx, acceptDocs, Integer.MAX_VALUE);
+  private TopDocs searchLeaf(LeafReaderContext ctx, Bits acceptDocs, int cost) throws IOException {
+    TopDocs results = approximateSearch(ctx, acceptDocs, cost);

Review Comment:
   Yes, makes sense! Will add it

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kaivalnp commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection
kaivalnp commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r902366042

## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:

@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
-    BitSetCollector filterCollector = null;
+    Weight filterWeight = null;
     if (filter != null) {
-      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
       BooleanQuery booleanQuery =
           new BooleanQuery.Builder()
               .add(filter, BooleanClause.Occur.FILTER)
               .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
               .build();
-      indexSearcher.search(booleanQuery, filterCollector);
+      Query rewritten = indexSearcher.rewrite(booleanQuery);
+      filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
     }
     for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterCollector);
+      Bits acceptDocs;
+      int cost;
+      if (filterWeight != null) {
+        Scorer scorer = filterWeight.scorer(ctx);
+        if (scorer != null) {
+          DocIdSetIterator iterator = scorer.iterator();
+          if (iterator instanceof BitSetIterator) {
+            acceptDocs = ((BitSetIterator) iterator).getBitSet();
+          } else {
+            acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+          }
+          cost = (int) iterator.cost();

Review Comment:
   You're right.. the `scorer` seems to be overestimating quite a lot! I changed it to the `cardinality` of the `BitSet`, and it only adds a small latency.

   However, as @jpountz pointed out, it does not include `liveDocs` yet. We need some way of incorporating these `liveDocs` into our `BitSet` without iterating one-by-one over matching bits. Any suggestions for this?
[GitHub] [lucene] jpountz commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection
jpountz commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r902383800

## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:

@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
-    BitSetCollector filterCollector = null;
+    Weight filterWeight = null;
     if (filter != null) {
-      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
       BooleanQuery booleanQuery =
           new BooleanQuery.Builder()
               .add(filter, BooleanClause.Occur.FILTER)
               .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
               .build();
-      indexSearcher.search(booleanQuery, filterCollector);
+      Query rewritten = indexSearcher.rewrite(booleanQuery);
+      filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
     }
     for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterCollector);
+      Bits acceptDocs;
+      int cost;
+      if (filterWeight != null) {
+        Scorer scorer = filterWeight.scorer(ctx);
+        if (scorer != null) {
+          DocIdSetIterator iterator = scorer.iterator();
+          if (iterator instanceof BitSetIterator) {
+            acceptDocs = ((BitSetIterator) iterator).getBitSet();
+          } else {
+            acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+          }

Review Comment:
   Is it a problem? `exactSearch` doesn't need a `BitSet` but a `DocIdSetIterator`, which should be easy to create by filtering the scorer's iterator to exclude deleted docs?
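The cost-vs-cardinality point in this thread can be illustrated with plain java.util.BitSet (hypothetical method names; this is not Lucene's BitSetIterator API): an iterator's cost() is only an upper-bound estimate, while cardinality() over the materialized bit set is exact, and live docs can be applied in one bulk intersection rather than bit by bit.

```java
import java.util.BitSet;
import java.util.stream.IntStream;

public class FilterBitsSketch {
    // Build a bit set of accepted docs from a stream of matching doc ids,
    // standing in for materializing a DocIdSetIterator into a bit set.
    static BitSet collect(IntStream matchingDocs, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        matchingDocs.forEach(bits::set);
        return bits;
    }

    // Intersect with live docs in one bulk AND instead of iterating over
    // every set bit individually.
    static BitSet applyLiveDocs(BitSet accepted, BitSet liveDocs) {
        BitSet result = (BitSet) accepted.clone();
        result.and(liveDocs);
        return result;
    }

    public static void main(String[] args) {
        int maxDoc = 16;
        BitSet accepted = collect(IntStream.of(1, 3, 5, 7, 9), maxDoc);
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc);
        live.clear(5); // doc 5 was deleted
        BitSet result = applyLiveDocs(accepted, live);
        System.out.println(accepted.cardinality()); // 5 - exact, unlike a cost() estimate
        System.out.println(result.cardinality());   // 4 - after removing the deleted doc
    }
}
```

The bulk AND is the kind of operation the reviewers are asking for: it touches whole 64-bit words at a time, so incorporating deletions costs far less than visiting each matching bit.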
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556848#comment-17556848 ]

Robert Muir commented on LUCENE-10577:
--------------------------------------

Seems like the codec API needs to be fixed so that ppl can use 8 or 16 bit vectors, etc. I am -1 against adding any additional similarity functions. The current codec keeps getting more and more bloated instead of scaling out horizontally with more codecs. And more bullshit (eg cosine) keeps getting piled into this wonder-do-it-all design, perpetuating the argument that it's too difficult to make more codecs, and should be avoided.

> Quantize vector values
> ----------------------
>
>                 Key: LUCENE-10577
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10577
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Michael Sokolov
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values.
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to
> support that it is not really necessary to store vectors in full precision.
> Perhaps users may also be willing to retrieve values in lower precision for
> whatever purpose those serve, if they are able to store more samples. We know
> that 8 bits is enough to provide a very near approximation to the same
> recall/performance tradeoff that is achieved with the full-precision vectors.
> I'd like to explore how we could enable 4:1 compression of these fields by
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide
> their data in reduced-precision format and give control over the quantization
> to them. It would have a major impact on the Lucene API surface though,
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would
> require no or perhaps very limited change to the existing API to enable the
> feature.
> I've been exploring (2), and what I find is that we can achieve very good
> recall results using dot-product similarity scoring by simple linear scaling
> + quantization of the vector values, so long as we choose the scale that
> minimizes the quantization error. Dot-product is amenable to this treatment
> since vectors are required to be unit-length when used with that similarity
> function.
> Even still there is variability in the ideal scale over different data sets.
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course
> this assumes that the data set doesn't have a few outlier data points. A
> theoretical range can be obtained by 1/sqrt(dimension), but this is only
> useful when the samples are normally distributed. We could in theory
> determine the ideal scale when flushing a segment and manage this
> quantization per-segment, but then numerical error could creep in when
> merging.
> I'll post a patch/PR with an experimental setup I've been using for
> evaluation purposes. It is pretty self-contained and simple, but has some
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we
> should think about doing this compression under the hood, or expose a
> byte-oriented API. Whatever we do, I think a 4:1 compression ratio is pretty
> compelling and we should pursue something.
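The linear-scaling scheme the issue describes can be sketched in a few lines — pick scale = max(abs(min-value), abs(max-value)), map floats to signed bytes, and compute the dot product directly on bytes. This is an illustration of the idea only, under the assumption of roughly unit-length input vectors, not the patch's actual code:

```java
public class QuantizeSketch {
    // Choose the scale as the largest component magnitude, i.e.
    // max(abs(min-value), abs(max-value)) for this vector.
    static float scale(float[] v) {
        float s = 0;
        for (float x : v) s = Math.max(s, Math.abs(x));
        return s;
    }

    // Linear quantization to signed bytes: the largest-magnitude component
    // maps to +/-127, everything else scales proportionally.
    static byte[] quantize(float[] v, float s) {
        byte[] q = new byte[v.length];
        for (int i = 0; i < v.length; i++) {
            q[i] = (byte) Math.round(v[i] / s * 127f);
        }
        return q;
    }

    // Integer dot product over the quantized bytes; multiplying by
    // (scaleA * scaleB) / (127 * 127) recovers an approximate float score.
    static int dotProduct(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    public static void main(String[] args) {
        float[] v = {0.6f, -0.8f}; // a unit-length vector, as dot-product similarity requires
        float s = scale(v);
        byte[] q = quantize(v, s);
        // Self-similarity of a unit vector is exactly 1.0 in float; the
        // dequantized byte dot product lands very close to it.
        double approx = dotProduct(q, q) * (double) (s * s) / (127.0 * 127.0);
        System.out.println(approx); // close to 1.0
    }
}
```

With one byte per component this is the 4:1 compression the issue targets, at the cost of the small dequantization error visible above.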
[GitHub] [lucene] gsmiller opened a new pull request, #969: LUCENE-10603: Mark SortedSetDocValues#NO_MORE_ORDS deprecated
gsmiller opened a new pull request, #969:
URL: https://github.com/apache/lucene/pull/969

   Let's get this marked as deprecated ASAP if we want to actually remove it in a 10.0 release. Unless we remove it, we won't see any performance benefits of LUCENE-10603, since we'll still need to do the internal book-keeping in `Lucene90DocValuesProducer` to surface `NO_MORE_ORDS` as long as it exists as part of the API.
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556891#comment-17556891 ]

Greg Miller commented on LUCENE-10603:
--------------------------------------

[~ChrisLu] thanks again for proposing this. I've merged the work in the {{facets}} module to use the new style of iteration, but there are still plenty more locations in our code base that need updating. Let me know if you want any help with this. I'm happy to divide up some of the modules if you'd like (or maybe we can recruit others if interested as well).

In the meantime, I propose we get this {{NO_MORE_ORDS}} constant marked as {{deprecated}} so we have a shot at removing it in a 10.0 release. By removing it, as [~jpountz] points out in [#954|https://github.com/apache/lucene/pull/954], we may see a performance benefit since we won't need the book-keeping to keep it updated. I opened another PR for this: [#969|https://github.com/apache/lucene/pull/969].

> Improve iteration of ords for SortedSetDocValues
> ------------------------------------------------
>
>                 Key: LUCENE-10603
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10603
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Lu Xugang
>            Assignee: Lu Xugang
>            Priority: Trivial
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> After SortedSetDocValues#docValueCount was added in Lucene 9.2, should we
> refactor the implementation of ord iteration to use docValueCount instead of
> NO_MORE_ORDS, similar to how SortedNumericDocValues works?
> From
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }
> {code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }
> {code}
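The two iteration styles from the issue can be contrasted with a minimal stand-in for the doc-values cursor (a hypothetical class, not the real SortedSetDocValues): the sentinel style forces the producer to track and emit NO_MORE_ORDS, while the count-based style just consumes exactly docValueCount() ords and needs no sentinel bookkeeping at all.

```java
import java.util.ArrayList;
import java.util.List;

public class OrdIterationSketch {
    static final long NO_MORE_ORDS = -1;

    // Minimal stand-in for a per-document ord cursor.
    static class FakeDocValues {
        private final long[] ords;
        private int pos = 0;
        FakeDocValues(long... ords) { this.ords = ords; }
        int docValueCount() { return ords.length; }
        long nextOrd() { return pos < ords.length ? ords[pos++] : NO_MORE_ORDS; }
        void reset() { pos = 0; }
    }

    // Old style: iterate until the sentinel; the producer must keep emitting it.
    static List<Long> sentinelStyle(FakeDocValues values) {
        List<Long> out = new ArrayList<>();
        for (long ord = values.nextOrd(); ord != NO_MORE_ORDS; ord = values.nextOrd()) {
            out.add(ord);
        }
        return out;
    }

    // New style: consume exactly docValueCount() ords (hoisted so it isn't
    // re-read on every iteration); no sentinel check needed.
    static List<Long> countStyle(FakeDocValues values) {
        List<Long> out = new ArrayList<>();
        int count = values.docValueCount();
        for (int i = 0; i < count; i++) {
            out.add(values.nextOrd());
        }
        return out;
    }

    public static void main(String[] args) {
        FakeDocValues values = new FakeDocValues(3, 7, 42);
        List<Long> a = sentinelStyle(values);
        values.reset();
        List<Long> b = countStyle(values);
        System.out.println(a.equals(b)); // true: both styles see the same ords
    }
}
```

Both loops visit the same ords, which is why dropping the sentinel is purely a win once all call sites use the count-based form: the producer no longer has to track when to return NO_MORE_ORDS.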
[GitHub] [lucene] gsmiller commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1161727452 Thanks @shaie! I've been away from my computer since Thursday but should have time to catch up on this today, respond to your comments, and do another review pass. Agreed that we're close on this. Finish line is in sight! :)
[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556966#comment-17556966 ] Tomoko Uchida commented on LUCENE-10557: I was trying to figure out how to upload attachments (patches, images, etc.) to GitHub issues with the API for hours. {*}There is no way to upload files to GitHub with REST APIs{*}; it is only allowed via the Web Interface. If you want to programmatically port attachment files from Jira to GitHub, you have to [commit the files to the repository|https://docs.github.com/en/rest/repos/contents]. See [https://github.com/isaacs/github/issues/1133] > Migrate to GitHub issue from Jira > - > > Key: LUCENE-10557 > URL: https://issues.apache.org/jira/browse/LUCENE-10557 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > > A few (not the majority) Apache projects already use the GitHub issue instead > of Jira. For example, > Airflow: [https://github.com/apache/airflow/issues] > BookKeeper: [https://github.com/apache/bookkeeper/issues] > So I think it'd be technically possible that we move to GitHub issue. I have > little knowledge of how to proceed with it, I'd like to discuss whether we > should migrate to it, and if so, how to smoothly handle the migration. > The major tasks would be: > * (/) Get a consensus about the migration among committers > * Choose issues that should be moved to GitHub > ** Discussion thread > [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12] > ** -Conclusion for now: We don't migrate any issues. Only new issues should > be opened on GitHub.- > ** Write a prototype migration script - the decision could be made on that. > Things to consider: > *** version numbers - labels or milestones? 
> *** add a comment/ prepend a link to the source Jira issue on github side, > *** add a comment/ prepend a link on the jira side to the new issue on > github side (for people who access jira from blogs, mailing list archives and > other sources that will have stale links), > *** convert cross-issue automatic links in comments/ descriptions (as > suggested by Robert), > *** strategy to deal with sub-issues (hierarchies), > *** maybe prefix (or postfix) the issue title on github side with the > original LUCENE-XYZ key so that it is easier to search for a particular issue > there? > *** how to deal with user IDs (author, reporter, commenters)? Do they have > to be github users? Will information about people not registered on github be > lost? > *** create an extra mapping file of old-issue-new-issue URLs for any > potential future uses. > *** what to do with issue numbers in git/svn commits? These could be > rewritten but it'd change the entire git history tree - I don't think this is > practical, while doable. > * Build the convention for issue label/milestone management > ** Do some experiments on a sandbox repository > [https://github.com/mocobeta/sandbox-lucene-10557] > ** Make documentation for metadata (label/milestone) management > * Enable Github issue on the lucene's repository > ** Raise an issue on INFRA > ** (Create an issue-only private repository for sensitive issues if it's > needed and allowed) > ** Set a mail hook to > [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to > the general mail group name) > * Set a schedule for migration > ** Give some time to committers to play around with issues/labels/milestones > before the actual migration > ** Make an announcement on the mail lists > ** Show some text messages when opening a new Jira issue (in issue template?) 
[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556966#comment-17556966 ] Tomoko Uchida edited comment on LUCENE-10557 at 6/21/22 3:07 PM: - I was trying to figure out how to upload attachments (patches, images, etc.) to GitHub issues with the API for hours. {*}There is no way to upload files to GitHub with REST APIs{*}; it is only allowed via the Web Interface. If you want to programmatically port attachment files from Jira to GitHub, you have to [commit the files to the repository|https://docs.github.com/en/rest/repos/contents]. See [https://github.com/isaacs/github/issues/1133] was (Author: tomoko uchida): I was trying to figure out how to upload attachments (patches, images, etc.) to Github issue with API for hours. {*}There is no way to upload files to GitHub with REST APIs{*}; it is only allowed via the Web Interface. If you want to programmatically port attachment files in Jira to refer to GitHub, you have to [commit the files to the repository|https://docs.github.com/en/rest/repos/contents]. See [https://github.com/isaacs/github/issues/1133] > Migrate to GitHub issue from Jira > - > > Key: LUCENE-10557 > URL: https://issues.apache.org/jira/browse/LUCENE-10557 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > > A few (not the majority) Apache projects already use the GitHub issue instead > of Jira. For example, > Airflow: [https://github.com/apache/airflow/issues] > BookKeeper: [https://github.com/apache/bookkeeper/issues] > So I think it'd be technically possible that we move to GitHub issue. I have > little knowledge of how to proceed with it, I'd like to discuss whether we > should migrate to it, and if so, how to smoothly handle the migration. 
[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556475#comment-17556475 ] Tomoko Uchida edited comment on LUCENE-10557 at 6/21/22 3:11 PM: - I browsed through several JSON dumps of Jira issues. These are some observations. - It'd be easy to extract various metadata of issues (reporter id, status, created timestamp, etc.) - It'd be easy to extract all linked issue ids and sub-task ids - It'd be easy to extract all attached file URLs -- Can't estimate how many hours it will take to download all of the files - It'd be easy to extract all comments in an issue -- -Perhaps pagination is needed for issues with many comments- Comments in an issue can be retrieved all at once. - We can apply parser/converter tools to convert the jira markups to markdown -- I think this can be error-prone - It'd be cumbersome to extract GitHub PR links. The links to PRs only appear in the github bot's comments in the Work Log. On the GitHub side, there are no difficulties in dealing with the APIs. - It'd be a bit tedious to work with milestones via APIs. They can't be referred to by their text. Id - text mapping is needed - -It might need some trials and errors to properly place attached files in their right place- This is not possible (we can't programmatically migrate attachment files to GitHub). As for the cross-link conversion and account mapping script: - To "embed" github issue links / accounts in their right place (maybe next to the Jira issue keys / user names), we need to modify the original text. This can be tricky and is the riskiest part to me. Instead of modifying the original text, we could just add some footnotes for the issues/comments - but it could considerably damage the readability. Yes it should be possible with a set of small scripts. 
Maybe one problem is that it'd be difficult to detect conversion errors/omissions and we can't correct them ourselves if we notice migration errors later (it seems we are not allowed to have the github token of the ASF repository). was (Author: tomoko uchida): I browsed through several JSON dumps of Jira issues. These are some observations. - It'd be easy to extract various metadata of issues (reporter id, status, created timestamp, etc.) - It'd be easy to extract all linked issue ids and sub-task ids - It'd be easy to extract all attached file URLs -- Can't estimate how many hours it will take to download all of the files - it'd be easy to extract all comments in an issue -- -Perhaps pagination is needed for issues with many comments- Comments in an issue can be retrieved all at once. - We can apply parser/converter tools to convert the jira markups to markdown -- I think this can be error-prone - It'd be cumbersome to extract GitHub PR links. The links to PRs only appear in the github bot's comments in the Work Log. On GitHub side, there are no difficulties in dealing with the APIs. - It'd be a bit tedious to work with milestones via APIs. They can't be referred to by their text. Id - text mapping is needed - It might need some trials and errors to properly place attached files in their right place As for the cross-link conversion and account mapping script: - To "embed" github issue links / accounts in their right place (maybe next to the Jira issue keys / user names), we need to modify the original text. This can be tricky and the riskiest part to me. Instead of modifying the original text, we could just add some footnotes for the issues/comments - but it could considerably damage the readability. Yes it should be possible with a set of small scripts. 
Maybe one problem is that it'd be difficult to detect conversion errors/omissions and we can't correct them ourselves if we notice migration errors later (it seems we are not allowed to have the github token of the ASF repository).
[GitHub] [lucene] jpountz merged pull request #961: Handle more cases in `BooleanWeight#count`.
jpountz merged PR #961: URL: https://github.com/apache/lucene/pull/961
[GitHub] [lucene] gsmiller merged pull request #969: LUCENE-10603: Mark SortedSetDocValues#NO_MORE_ORDS deprecated
gsmiller merged PR #969: URL: https://github.com/apache/lucene/pull/969
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556988#comment-17556988 ] ASF subversion and git services commented on LUCENE-10603: -- Commit 8f459eb0f9d219af5610642c1027ec704b094dc3 in lucene's branch refs/heads/main from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8f459eb0f9d ] LUCENE-10603: Mark SortedSetDocValues#NO_MORE_ORDS deprecated (#969)
[GitHub] [lucene] jpountz commented on pull request #964: LUCENE-10620: Pass the Weight to Collectors.
jpountz commented on PR #964: URL: https://github.com/apache/lucene/pull/964#issuecomment-1161979978 I reverted changes to top-docs collectors. This means the new `Collector#setWeight` API is only useful to `TotalHitCountCollector`. I've been wondering whether it was worth adding a new API only for `TotalHitCountCollector`, but looking at how facets use this collector, I suspect that many users set up their collectors manually instead of using `IndexSearcher#count` and so do not benefit from this optimization, so maybe it's worth the increased API surface. I was chatting about the API with @romseygeek, and we wondered if having the counting logic on `Scorable` would help. I looked into it, and it's not very practical: `LeafCollector#setScorer` is not currently a place where throwing a `CollectionTerminatedException` is supported (though this could be addressed), and this method can be called multiple times per segment, so we would need to introduce tracking to make sure that we only increment the count the first time `setScorer` is called on a segment. For these reasons, I would prefer moving forward with the current API on `Collector`.
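The trade-off being discussed can be sketched with simplified stand-ins (hypothetical `MockWeight` and `MockTotalHitCountCollector` classes, not the real Lucene interfaces): once the collector is handed the `Weight`, it can use a precomputed per-segment count where one is available instead of iterating matching docs.

```java
import java.util.List;

// Hypothetical stand-ins sketching the idea behind Collector#setWeight.
class MockWeight {
    private final int[] perSegmentCounts; // -1 means "count unknown, must iterate"
    MockWeight(int... perSegmentCounts) { this.perSegmentCounts = perSegmentCounts; }
    int count(int segment) { return perSegmentCounts[segment]; }
}

class MockTotalHitCountCollector {
    private MockWeight weight;
    private int total;

    void setWeight(MockWeight weight) { this.weight = weight; }

    // Collect one segment: shortcut through Weight#count when it is known,
    // otherwise fall back to counting the matching docs one by one.
    void collectSegment(int segment, List<Integer> matchingDocs) {
        int shortcut = (weight == null) ? -1 : weight.count(segment);
        total += (shortcut >= 0) ? shortcut : matchingDocs.size();
    }

    int getTotalHits() { return total; }
}
```

This also shows why the shortcut helps users who wire up the collector manually rather than going through `IndexSearcher#count`: the fast path lives in the collector itself.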
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557000#comment-17557000 ] ASF subversion and git services commented on LUCENE-10603: -- Commit 4de355bd04374bbd6c9ca5fe26b00f4f3dfe74a7 in lucene's branch refs/heads/branch_9x from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4de355bd043 ] LUCENE-10603: Mark SortedSetDocValues#NO_MORE_ORDS deprecated
[GitHub] [lucene] gsmiller commented on pull request #969: LUCENE-10603: Mark SortedSetDocValues#NO_MORE_ORDS deprecated
gsmiller commented on PR #969: URL: https://github.com/apache/lucene/pull/969#issuecomment-1161982047 Thanks @jpountz !
[GitHub] [lucene] tang-hi opened a new pull request, #970: LUCENE-10607: Fix potential integer overflow in maxArcs computations
tang-hi opened a new pull request, #970: URL: https://github.com/apache/lucene/pull/970 ### Description (or a Jira issue link if you have one) https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-10607?filter=allopenissues
[jira] [Commented] (LUCENE-10607) Integer overflow when NRTSuggesterBuilder expands input
[ https://issues.apache.org/jira/browse/LUCENE-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557029#comment-17557029 ] tangdh commented on LUCENE-10607: - Hi, I've raised a PR to fix the potential integer overflow, [~ChasenY] [~dweiss] https://github.com/apache/lucene/pull/970 > Integer overflow when NRTSuggesterBuilder expands input > - > > Key: LUCENE-10607 > URL: https://issues.apache.org/jira/browse/LUCENE-10607 > Project: Lucene - Core > Issue Type: Bug > Components: core/FSTs >Affects Versions: 9.2 >Reporter: chaseny >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > When building the suggest index, the suggest module calls > NRTSuggesterBuilder#finishTerm to write the suggest entries. finishTerm calls the > maxNumArcsForDedupByte function to grow analyzed, expanding by 3, 5, 7, ... 255. > When entries is very long (e.g. 900), the expansion in maxNumArcsForDedupByte overflows: > > private static int maxNumArcsForDedupByte(int currentNumDedupBytes) { > int maxArcs = 1 + (2 * currentNumDedupBytes); > if (currentNumDedupBytes > 5) { > // when currentNumDedupBytes >= 32768, the int multiplication exceeds Integer.MAX_VALUE > maxArcs *= currentNumDedupBytes; > } > return Math.min(maxArcs, 255); > } > > Also, when expanding, could we grow by a fixed 4 bytes each time, instead of the > 3, 5, 7, ... 255 expansion scheme?
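A self-contained sketch of the reported overflow and one possible fix (an illustration, not necessarily the exact change in PR #970): doing the intermediate arithmetic in `long` before clamping prevents the wrap-around.

```java
// Sketch of the reported bug and one possible fix.
class MaxArcs {
    // Buggy version from the report: the int multiplication wraps negative once
    // currentNumDedupBytes reaches 32768 (65537 * 32768 > Integer.MAX_VALUE),
    // and Math.min then returns the negative value instead of 255.
    static int maxNumArcsForDedupByteBuggy(int currentNumDedupBytes) {
        int maxArcs = 1 + (2 * currentNumDedupBytes);
        if (currentNumDedupBytes > 5) {
            maxArcs *= currentNumDedupBytes;
        }
        return Math.min(maxArcs, 255);
    }

    // Possible fix: compute in long so the product cannot overflow before
    // clamping to 255.
    static int maxNumArcsForDedupByteFixed(int currentNumDedupBytes) {
        long maxArcs = 1L + 2L * currentNumDedupBytes;
        if (currentNumDedupBytes > 5) {
            maxArcs *= currentNumDedupBytes;
        }
        return (int) Math.min(maxArcs, 255L);
    }
}
```

Since the result is clamped to 255 anyway, any fix that keeps the intermediate product from wrapping (or clamps earlier) suffices.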
[GitHub] [lucene] gsmiller commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1162137822 OK, I think I understand the intention with `FSD` long/int decoding better now, but I think it could be a little confusing in the API currently. If I were a user, I'd expect there to be four implementations that correspond with the four types being supported out-of-the-box (int/long/float/double). But this is _really_ about knowing the width of the encoded "sortable longs" in the doc value field. So, with my better understanding, 1) I think the current approach is reasonable, and I can't think of any better suggestion, but 2) maybe we could update the javadocs in `FSD` to make it a little clearer that it's about decoding the stored bytes into *comparable longs*?
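For reference, the "sortable longs" idea works roughly like this (a self-contained sketch equivalent in spirit to Lucene's `NumericUtils#doubleToSortableLong`; treat the exact bit trick below as an assumption for illustration): IEEE-754 bits are remapped so that the numeric order of doubles matches the signed order of the encoded longs.

```java
// Sketch of a "sortable long" encoding for doubles.
class SortableLongs {
    static long doubleToSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // Positive doubles are left as-is (their bit patterns already sort
        // correctly); for negative doubles the lower 63 bits are flipped so
        // that more-negative values map to smaller signed longs.
        return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
    }
}
```

This is why the decoder only needs to know the byte width of the encoded values, not the original numeric type: after encoding, everything compares as plain longs.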
[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557061#comment-17557061 ] Tomoko Uchida commented on LUCENE-10557: Maybe GitHub's API call rate limit would be another consideration. {quote}If you're making a large number of {{{}POST{}}}, {{{}PATCH{}}}, {{{}PUT{}}}, or {{DELETE}} requests for a single user or client ID, wait at least one second between each request. {quote} [https://docs.github.com/en/rest/guides/best-practices-for-integrators#dealing-with-secondary-rate-limits|https://docs.github.com/en/rest/overview/resources-in-the-rest-api#secondary-rate-limits] We can't really "bulk" import to GitHub. Every issue and comment has to be posted one by one and between the API calls, at least one-second sleep is required. > Migrate to GitHub issue from Jira > - > > Key: LUCENE-10557 > URL: https://issues.apache.org/jira/browse/LUCENE-10557 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > > A few (not the majority) Apache projects already use the GitHub issue instead > of Jira. For example, > Airflow: [https://github.com/apache/airflow/issues] > BookKeeper: [https://github.com/apache/bookkeeper/issues] > So I think it'd be technically possible that we move to GitHub issue. I have > little knowledge of how to proceed with it, I'd like to discuss whether we > should migrate to it, and if so, how to smoothly handle the migration. > The major tasks would be: > * (/) Get a consensus about the migration among committers > * Choose issues that should be moved to GitHub > ** Discussion thread > [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12] > ** -Conclusion for now: We don't migrate any issues. Only new issues should > be opened on GitHub.- > ** Write a prototype migration script - the decision could be made on that. > Things to consider: > *** version numbers - labels or milestones? 
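The pacing constraint described above can be sketched as a tiny helper (hypothetical `PacedImporter`, not part of any actual migration script): consecutive write calls are kept at least a fixed interval apart, per GitHub's "wait at least one second between each request" guidance for write requests.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical pacing helper for sequential issue/comment imports.
class PacedImporter {
    static <T> void importAll(List<T> items, Consumer<T> post, long minGapMillis) {
        long last = Long.MIN_VALUE;
        for (T item : items) {
            if (last != Long.MIN_VALUE) {
                // Sleep only for the remainder of the gap, if any.
                long wait = minGapMillis - (System.currentTimeMillis() - last);
                if (wait > 0) {
                    try {
                        Thread.sleep(wait);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return; // stop importing if interrupted
                    }
                }
            }
            post.accept(item); // e.g. a POST to the issues endpoint
            last = System.currentTimeMillis();
        }
    }
}
```

With a one-second minimum gap, an import of N issues plus M comments takes at least N+M seconds, which is why a bulk Jira-to-GitHub migration cannot be fast.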
[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557061#comment-17557061 ] Tomoko Uchida edited comment on LUCENE-10557 at 6/21/22 7:00 PM: - Maybe GitHub's API call rate limit would be another consideration. {quote}If you're making a large number of {{{}POST{}}}, {{{}PATCH{}}}, {{{}PUT{}}}, or {{DELETE}} requests for a single user or client ID, wait at least one second between each request. {quote} [https://docs.github.com/en/rest/guides/best-practices-for-integrators#dealing-with-secondary-rate-limits|https://docs.github.com/en/rest/overview/resources-in-the-rest-api#secondary-rate-limits] We can't really "bulk" import to GitHub. Every issue and comment has to be posted one by one and between the API calls, at least one-second sleep is required. I encountered this rate limit many times - actually it seems that the rate limit is strictly monitored. was (Author: tomoko uchida): Maybe GitHub's API call rate limit would be another consideration. {quote}If you're making a large number of {{{}POST{}}}, {{{}PATCH{}}}, {{{}PUT{}}}, or {{DELETE}} requests for a single user or client ID, wait at least one second between each request. {quote} [https://docs.github.com/en/rest/guides/best-practices-for-integrators#dealing-with-secondary-rate-limits|https://docs.github.com/en/rest/overview/resources-in-the-rest-api#secondary-rate-limits] We can't really "bulk" import to GitHub. Every issue and comment has to be posted one by one and between the API calls, at least one-second sleep is required. > Migrate to GitHub issue from Jira > - > > Key: LUCENE-10557 > URL: https://issues.apache.org/jira/browse/LUCENE-10557 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > > A few (not the majority) Apache projects already use the GitHub issue instead > of Jira.
For example, > Airflow: [https://github.com/apache/airflow/issues] > BookKeeper: [https://github.com/apache/bookkeeper/issues] > So I think it'd be technically possible that we move to GitHub issue. I have > little knowledge of how to proceed with it, I'd like to discuss whether we > should migrate to it, and if so, how to smoothly handle the migration. > The major tasks would be: > * (/) Get a consensus about the migration among committers > * Choose issues that should be moved to GitHub > ** Discussion thread > [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12] > ** -Conclusion for now: We don't migrate any issues. Only new issues should > be opened on GitHub.- > ** Write a prototype migration script - the decision could be made on that. > Things to consider: > *** version numbers - labels or milestones? > *** add a comment/ prepend a link to the source Jira issue on github side, > *** add a comment/ prepend a link on the jira side to the new issue on > github side (for people who access jira from blogs, mailing list archives and > other sources that will have stale links), > *** convert cross-issue automatic links in comments/ descriptions (as > suggested by Robert), > *** strategy to deal with sub-issues (hierarchies), > *** maybe prefix (or postfix) the issue title on github side with the > original LUCENE-XYZ key so that it is easier to search for a particular issue > there? > *** how to deal with user IDs (author, reporter, commenters)? Do they have > to be github users? Will information about people not registered on github be > lost? > *** create an extra mapping file of old-issue-new-issue URLs for any > potential future uses. > *** what to do with issue numbers in git/svn commits? These could be > rewritten but it'd change the entire git history tree - I don't think this is > practical, while doable. 
> * Build the convention for issue label/milestone management > ** Do some experiments on a sandbox repository > [https://github.com/mocobeta/sandbox-lucene-10557] > ** Make documentation for metadata (label/milestone) management > * Enable Github issue on the lucene's repository > ** Raise an issue on INFRA > ** (Create an issue-only private repository for sensitive issues if it's > needed and allowed) > ** Set a mail hook to > [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to > the general mail group name) > * Set a schedule for migration > ** Give some time to committers to play around with issues/labels/milestones > before the actual migration > ** Make an announcement on the mail lists > ** Show some text messages when opening a new Jira issue (in issue template?) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To uns
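The one-second spacing between mutating API calls discussed in the rate-limit comment above can be enforced with a tiny throttle. A minimal sketch, assuming hypothetical class and method names (this is not part of any actual migration script):

```java
/**
 * Minimal throttle sketch for GitHub's secondary rate limit: keep at least a
 * fixed gap (e.g. 1000 ms) between successive mutating API calls.
 * Class and method names are illustrative only.
 */
public class SecondaryRateLimitThrottle {
    private final long minGapMillis;
    private long lastCallMillis = Long.MIN_VALUE;

    public SecondaryRateLimitThrottle(long minGapMillis) {
        this.minGapMillis = minGapMillis;
    }

    /** How long the caller must still wait before issuing the next request. */
    public long requiredDelay(long nowMillis) {
        if (lastCallMillis == Long.MIN_VALUE) {
            return 0; // first call, no waiting needed
        }
        return Math.max(0, minGapMillis - (nowMillis - lastCallMillis));
    }

    /** Record that a request was just issued. */
    public void recordCall(long nowMillis) {
        lastCallMillis = nowMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        SecondaryRateLimitThrottle throttle = new SecondaryRateLimitThrottle(1000);
        for (int i = 0; i < 3; i++) {
            Thread.sleep(throttle.requiredDelay(System.currentTimeMillis()));
            // ... issue one POST here (create one issue or one comment) ...
            throttle.recordCall(System.currentTimeMillis());
        }
    }
}
```

With roughly 10k Jira issues plus their comments, this spacing alone puts a lower bound of several hours on a full import, which is the practical point of the comment above.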
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
mdmarshmallow commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r903051834 ## lucene/facet/src/java/org/apache/lucene/facet/facetset/RangeFacetSetMatcher.java: ## @@ -0,0 +1,166 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.facetset; + +import java.util.Arrays; +import org.apache.lucene.util.NumericUtils; + +/** + * A {@link FacetSetMatcher} which considers a set as a match if all dimensions fall within the + * given corresponding range. + * + * @lucene.experimental + */ +public class RangeFacetSetMatcher extends FacetSetMatcher { + + private final long[] lowerRanges; + private final long[] upperRanges; + + /** + * Constructs an instance to match facet sets with dimensions that fall within the given ranges. + */ + public RangeFacetSetMatcher(String label, DimRange... 
dimRanges) { +super(label, getDims(dimRanges)); +this.lowerRanges = Arrays.stream(dimRanges).mapToLong(range -> range.min).toArray(); +this.upperRanges = Arrays.stream(dimRanges).mapToLong(range -> range.max).toArray(); + } + + @Override + public boolean matches(long[] dimValues) { +assert dimValues.length == dims +: "Encoded dimensions (dims=" ++ dimValues.length ++ ") is incompatible with range dimensions (dims=" ++ dims ++ ")"; + +for (int i = 0; i < dimValues.length; i++) { + if (dimValues[i] < lowerRanges[i]) { +// Doc's value is too low in this dimension +return false; + } + if (dimValues[i] > upperRanges[i]) { +// Doc's value is too high in this dimension +return false; + } +} +return true; + } + + private static int getDims(DimRange... dimRanges) { +if (dimRanges == null || dimRanges.length == 0) { + throw new IllegalArgumentException("dimRanges cannot be null or empty"); +} +return dimRanges.length; + } + + /** + * Creates a {@link DimRange} for the given min and max long values. This method is also suitable + * for int values. + */ + public static DimRange fromLongs(long min, boolean minInclusive, long max, boolean maxInclusive) { Review Comment: Yeah I think it makes sense in that case to extract DimRange. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
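The per-dimension check in `RangeFacetSetMatcher#matches` from the diff above is easy to exercise standalone. A small sketch of the same logic with illustrative names (not the Lucene API):

```java
/** Standalone sketch of the per-dimension range check in RangeFacetSetMatcher#matches. */
public class RangeMatchSketch {
    /** True only if every value falls within its corresponding [lower, upper] range. */
    static boolean matches(long[] values, long[] lower, long[] upper) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] < lower[i]) {
                return false; // value too low in this dimension
            }
            if (values[i] > upper[i]) {
                return false; // value too high in this dimension
            }
        }
        return true;
    }
}
```

A facet set matches the hyperrectangle only when all dimensions pass, so the loop can short-circuit on the first failing dimension.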
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
mdmarshmallow commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r903055633 ## lucene/facet/src/java/org/apache/lucene/facet/facetset/FacetSetsField.java: ## @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.facetset; + +import org.apache.lucene.document.BinaryDocValuesField; +import org.apache.lucene.document.IntPoint; +import org.apache.lucene.util.BytesRef; + +/** + * A {@link BinaryDocValuesField} which encodes a list of {@link FacetSet facet sets}. The encoding + * scheme consists of a packed {@code byte[]} where the first value denotes the number of dimensions + * in all the sets, followed by each set's values. + * + * @lucene.experimental + */ +public class FacetSetsField extends BinaryDocValuesField { + + /** + * Create a new FacetSets field. + * + * @param name field name + * @param facetSets the {@link FacetSet facet sets} to index in that field. All must have the same + * number of dimensions + * @throws IllegalArgumentException if the field name is null or the given facet sets are invalid + */ + public static FacetSetsField create(String name, FacetSet... 
facetSets) { +if (facetSets == null || facetSets.length == 0) { + throw new IllegalArgumentException("FacetSets cannot be null or empty!"); +} + +return new FacetSetsField(name, toPackedValues(facetSets)); + } + + private FacetSetsField(String name, BytesRef value) { +super(name, value); + } + + private static BytesRef toPackedValues(FacetSet... facetSets) { +int numDims = facetSets[0].dims; +Class expectedClass = facetSets[0].getClass(); +byte[] buf = new byte[Integer.BYTES + facetSets[0].sizePackedBytes() * facetSets.length]; +IntPoint.encodeDimension(numDims, buf, 0); +int offset = Integer.BYTES; +for (FacetSet facetSet : facetSets) { + if (facetSet.dims != numDims) { +throw new IllegalArgumentException( +"All FacetSets must have the same number of dimensions. Expected " ++ numDims ++ " found " ++ facetSet.dims); + } + // It doesn't make sense to index facet sets of different types in the same field + if (facetSet.getClass() != expectedClass) { Review Comment: Took a look at this again and yeah, it doesn't make sense to generify here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
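The packed layout described in the `FacetSetsField` javadoc above — the dimension count first, then each set's values — can be sketched independently of Lucene. Here `ByteBuffer` stands in for `IntPoint.encodeDimension`, and all names are illustrative, not the actual encoding:

```java
import java.nio.ByteBuffer;

/** Sketch of the FacetSetsField packing idea: [numDims as int][each set's long values]. */
public class FacetSetPackingSketch {
    static byte[] pack(int numDims, long[][] sets) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Long.BYTES * numDims * sets.length);
        buf.putInt(numDims); // first value: number of dimensions shared by all sets
        for (long[] set : sets) {
            if (set.length != numDims) {
                throw new IllegalArgumentException(
                    "All sets must have the same number of dimensions. Expected "
                        + numDims + " found " + set.length);
            }
            for (long v : set) {
                buf.putLong(v);
            }
        }
        return buf.array();
    }
}
```

Storing the dimension count once up front is what lets a reader validate every set against it, mirroring the `facetSet.dims != numDims` check in the diff above.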
[GitHub] [lucene] gsmiller commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r903051691 ## lucene/demo/src/java/org/apache/lucene/demo/facet/CustomFacetSetExample.java: ## @@ -0,0 +1,303 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.demo.facet; + +import java.io.IOException; +import java.time.LocalDate; +import java.time.ZoneOffset; +import java.util.Collections; +import java.util.List; +import org.apache.lucene.analysis.core.WhitespaceAnalyzer; +import org.apache.lucene.document.*; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.Facets; +import org.apache.lucene.facet.FacetsCollector; +import org.apache.lucene.facet.FacetsCollectorManager; +import org.apache.lucene.facet.facetset.*; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.index.IndexWriter; +import org.apache.lucene.index.IndexWriterConfig; +import org.apache.lucene.index.IndexWriterConfig.OpenMode; +import org.apache.lucene.search.IndexSearcher; +import org.apache.lucene.search.MatchAllDocsQuery; +import org.apache.lucene.store.ByteBuffersDirectory; +import org.apache.lucene.store.Directory; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.NumericUtils; + +/** + * Shows usage of indexing and searching {@link FacetSetsField} with a custom {@link FacetSet} + * implementation. Unlike the out of the box {@link FacetSet} implementations, this example shows + * how to mix and match dimensions of different types, as well as implementing a custom {@link + * FacetSetMatcher}. + */ +public class CustomFacetSetExample { + + private static final long MAY_SECOND_2022 = date("2022-05-02"); + private static final long JUNE_SECOND_2022 = date("2022-06-02"); + private static final long JULY_SECOND_2022 = date("2022-07-02"); + private static final float HUNDRED_TWENTY_DEGREES = fahrenheitToCelsius(120); + private static final float HUNDRED_DEGREES = fahrenheitToCelsius(100); + private static final float EIGHTY_DEGREES = fahrenheitToCelsius(80); + + private final Directory indexDir = new ByteBuffersDirectory(); + + /** Empty constructor */ + public CustomFacetSetExample() {} + + /** Build the example index. 
*/ + private void index() throws IOException { +IndexWriter indexWriter = +new IndexWriter( +indexDir, new IndexWriterConfig(new WhitespaceAnalyzer()).setOpenMode(OpenMode.CREATE)); + +// Every document holds the temperature measures for a City by Date + +Document doc = new Document(); +doc.add(new StringField("city", "city1", Field.Store.YES)); +doc.add( +FacetSetsField.create( +"temperature", +new TemperatureReadingFacetSet(MAY_SECOND_2022, HUNDRED_DEGREES), +new TemperatureReadingFacetSet(JUNE_SECOND_2022, EIGHTY_DEGREES), +new TemperatureReadingFacetSet(JULY_SECOND_2022, HUNDRED_TWENTY_DEGREES))); +indexWriter.addDocument(doc); + +doc = new Document(); +doc.add(new StringField("city", "city2", Field.Store.YES)); +doc.add( +FacetSetsField.create( +"temperature", +new TemperatureReadingFacetSet(MAY_SECOND_2022, EIGHTY_DEGREES), +new TemperatureReadingFacetSet(JUNE_SECOND_2022, HUNDRED_DEGREES), +new TemperatureReadingFacetSet(JULY_SECOND_2022, HUNDRED_TWENTY_DEGREES))); +indexWriter.addDocument(doc); + +indexWriter.close(); + } + + /** Counting documents which exactly match a given {@link FacetSet}. */ + private List exactMatching() throws IOException { +DirectoryReader indexReader = DirectoryReader.open(indexDir); +IndexSearcher searcher = new IndexSearcher(indexReader); + +// MatchAllDocsQuery is for "browsing" (counts facets +// for all non-deleted docs in the index); normally +// you'd use a "normal" query: +FacetsCollector fc = searcher.search(new MatchAllDocsQuery(), new FacetsCollectorManager()); + +// Count both "Publish Date" and "Author" dimensions +Facets facets = +new MatchingFacetSetsCounts( +"temperature", +fc, +TemperatureReadingFacetSet::decodeTemperatureReading, +
[GitHub] [lucene] gsmiller commented on a diff in pull request #914: LUCENE-10550: Add getAllChildren functionality to facets
gsmiller commented on code in PR #914: URL: https://github.com/apache/lucene/pull/914#discussion_r903092231 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java: ## @@ -163,6 +164,76 @@ public Number getSpecificValue(String dim, String... path) throws IOException { return getValue(ord); } + @Override + public FacetResult getAllChildren(String dim, String... path) throws IOException { +DimConfig dimConfig = verifyDim(dim); +FacetLabel cp = new FacetLabel(dim, path); +int dimOrd = taxoReader.getOrdinal(cp); +if (dimOrd == -1) { + return null; +} + +int aggregatedValue = 0; +int childCount = 0; + +List ordinals = new ArrayList<>(); +List ordValues = new ArrayList<>(); + +if (sparseValues != null) { + for (IntIntCursor c : sparseValues) { +int value = c.value; +int ord = c.key; +if (parents[ord] == dimOrd && value > 0) { + aggregatedValue = aggregationFunction.aggregate(aggregatedValue, value); + childCount++; + ordinals.add(ord); + ordValues.add(value); +} + } +} else { + int[] children = getChildren(); + int[] siblings = getSiblings(); + int ord = children[dimOrd]; + while (ord != TaxonomyReader.INVALID_ORDINAL) { +int value = values[ord]; +if (value > 0) { + aggregatedValue = aggregationFunction.aggregate(aggregatedValue, value); + childCount++; + ordinals.add(ord); + ordValues.add(value); +} +ord = siblings[ord]; + } +} + +if (aggregatedValue == 0) { + return null; +} + +if (dimConfig.multiValued) { + if (dimConfig.requireDimCount) { +aggregatedValue = getValue(dimOrd); + } else { +// Our sum'd value is not correct, in general: +aggregatedValue = -1; + } +} else { + // Our sum'd dim value is accurate, so we keep it +} + +int[] ordinalArray = new int[ordinals.size()]; +for (int i = 0; i < ordinals.size(); i++) { + ordinalArray[i] = ordinals.get(i); +} Review Comment: Ah, I see. Shoot. It bugs me that we need to copy these ordinals from a list to an array just to do this bulk path lookup, but I see what you're saying. 
It would be nice if `TaxonomyReader` could directly support `List` in addition to an array, but I don't think this use-case justifies trying to add that right now. Would you mind adding a `TODO` comment here to mention that it would be nice if we didn't need to do this copy just to look up bulk paths? We can leave it at that for now and optimize later if/as necessary. Thanks for pointing this out! ## lucene/facet/src/java/org/apache/lucene/facet/sortedset/AbstractSortedSetDocValueFacetCounts.java: ## @@ -72,6 +72,40 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I return createFacetResult(topChildrenForPath, dim, path); } + @Override + public FacetResult getAllChildren(String dim, String... path) throws IOException { +FacetsConfig.DimConfig dimConfig = stateConfig.getDimConfig(dim); + +if (dimConfig.hierarchical) { + int pathOrd = (int) dv.lookupTerm(new BytesRef(FacetsConfig.pathToString(dim, path))); + if (pathOrd < 0) { +// path was never indexed +return null; + } + SortedSetDocValuesReaderState.DimTree dimTree = state.getDimTree(dim); + return getPathResult(dimConfig, dim, path, pathOrd, dimTree.iterator(pathOrd)); +} else { + if (path.length > 0) { +throw new IllegalArgumentException( +"Field is not configured as hierarchical, path should be 0 length"); + } + OrdRange ordRange = state.getOrdRange(dim); + if (ordRange == null) { +// means dimension was never indexed +return null; + } + int dimOrd = ordRange.start; + PrimitiveIterator.OfInt childIt = ordRange.iterator(); + if (dimConfig.multiValued && dimConfig.requireDimCount) { +// If the dim is multi-valued and requires dim counts, we know we've explicitly indexed +// the dimension and we need to skip past it so the iterator is positioned on the first +// child: +childIt.next(); + } + return getPathResult(dimConfig, dim, null, dimOrd, childIt); +} + } Review Comment: Of course! 
There was a lot of change happening while you were working on this, so I'm sure you were working against an earlier version and just didn't notice some of the change to getTopChildren. Happy to point them out. ## lucene/facet/src/java/org/apache/lucene/facet/Facets.java: ## @@ -29,6 +29,12 @@ public abstract class Facets { /** Default constructor. */ public Facets() {} + /** + * Returns all the children labels with non-zero counts under the specified path in the unsorted + * order. Returns null if the spe
[GitHub] [lucene] Yuti-G commented on pull request #914: LUCENE-10550: Add getAllChildren functionality to facets
Yuti-G commented on PR #914: URL: https://github.com/apache/lucene/pull/914#issuecomment-1162454475 Thank you so much for the last check! I added more javadoc and a new entry to the CHANGES.txt. For back-porting, should I wait until this PR is merged and checkout a new branch against the latest branch_9x to cherrypick the merged commit? Thanks again! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #914: LUCENE-10550: Add getAllChildren functionality to facets
gsmiller commented on PR #914: URL: https://github.com/apache/lucene/pull/914#issuecomment-1162469235 > For back-porting, should I wait until this PR is merged and checkout a new branch against the latest branch_9x to cherrypick the merged commit? Thanks again! Exactly! Then you can open a PR with that branch against `origin/branch_9x` (github will automatically select `origin/main` as the suggested destination so just change that). And just mention that it's a backport PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10550) Add getAllChildren functionality to facets
[ https://issues.apache.org/jira/browse/LUCENE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557161#comment-17557161 ] ASF subversion and git services commented on LUCENE-10550: -- Commit bdcb4b37164ba07e87e2e987f7fd4c9c50690601 in lucene's branch refs/heads/main from Yuting Gan [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=bdcb4b37164 ] LUCENE-10550: Add getAllChildren functionality to facets (#914) > Add getAllChildren functionality to facets > -- > > Key: LUCENE-10550 > URL: https://issues.apache.org/jira/browse/LUCENE-10550 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Yuting Gan >Priority: Minor > Time Spent: 2h > Remaining Estimate: 0h > > Currently Lucene does not support returning range counts sorted by label > values, but there are use cases demanding this feature. For example, a user > specifies ranges (e.g., [0, 10], [10, 20]) and wants to get range counts > without changing the range order. Today we can only call getTopChildren to > populate range counts, but it would return ranges sorted by counts (e.g., > [10, 20] 100, [0, 10] 50) instead of range values. > Lucene has an API, getAllChildrenSortByValue, that returns numeric values with > counts sorted by label values; please see > [LUCENE-7927|https://issues.apache.org/jira/browse/LUCENE-7927] for details. > Therefore, it would be nice if we could also have a similar API to support > range counts. The proposed getAllChildren API is to return value/range counts > sorted by label values instead of counts. > This proposal was inspired by the discussions with [~gsmiller] when I was > working on the LUCENE-10538 [PR|https://github.com/apache/lucene/pull/843], > and we believe users would benefit from adding this API to Facets. > Hope I can get some feedback from the community since this proposal would > require changes to the getTopChildren API in RangeFacetCounts. Thanks! 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #914: LUCENE-10550: Add getAllChildren functionality to facets
gsmiller commented on PR #914: URL: https://github.com/apache/lucene/pull/914#issuecomment-1162469491 Merged onto `main`. Thanks again @Yuti-G! Exciting to see this new functionality available :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller merged pull request #914: LUCENE-10550: Add getAllChildren functionality to facets
gsmiller merged PR #914: URL: https://github.com/apache/lucene/pull/914 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] Yuti-G opened a new pull request, #971: LUCENE-10550: Add getAllChildren functionality to facets (#914)
Yuti-G opened a new pull request, #971: URL: https://github.com/apache/lucene/pull/971 Just using to backport. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557164#comment-17557164 ] Yuting Gan commented on LUCENE-10614: - Thank you so much for reviewing and merging the LUCENE-10550 PR! I will start working on this issue and will create a PR to properly return topNChildren in RangeFacetCounts. > Properly support getTopChildren in RangeFacetCounts > --- > > Key: LUCENE-10614 > URL: https://issues.apache.org/jira/browse/LUCENE-10614 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 10.0 (main) >Reporter: Greg Miller >Priority: Minor > > As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing > {{getTopChildren}}. Instead of returning "top" ranges, it returns all > user-provided ranges in the order the user specified them when instantiating. > This is probably more useful functionality, but it would be nice to support > {{getTopChildren}} as well. > LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that > lands, we can replace the current implementation of {{getTopChildren}} with > an actual "top children" implementation and direct users to > {{getAllChildren}} if they want to maintain the current behavior. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller merged pull request #971: LUCENE-10550: Add getAllChildren functionality to facets (#914)
gsmiller merged PR #971: URL: https://github.com/apache/lucene/pull/971 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #971: LUCENE-10550: Add getAllChildren functionality to facets (#914)
gsmiller commented on PR #971: URL: https://github.com/apache/lucene/pull/971#issuecomment-1162519460 Thanks @Yuti-G ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10550) Add getAllChildren functionality to facets
[ https://issues.apache.org/jira/browse/LUCENE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557174#comment-17557174 ] ASF subversion and git services commented on LUCENE-10550: -- Commit b2c454c8be1549fedd455632a43cea18ff975755 in lucene's branch refs/heads/branch_9x from Yuting Gan [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b2c454c8be1 ] LUCENE-10550: Add getAllChildren functionality to facets > Add getAllChildren functionality to facets > -- > > Key: LUCENE-10550 > URL: https://issues.apache.org/jira/browse/LUCENE-10550 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Yuting Gan >Priority: Minor > Time Spent: 2.5h > Remaining Estimate: 0h > > Currently Lucene does not support returning range counts sorted by label > values, but there are use cases demanding this feature. For example, a user > specifies ranges (e.g., [0, 10], [10, 20]) and wants to get range counts > without changing the range order. Today we can only call getTopChildren to > populate range counts, but it would return ranges sorted by counts (e.g., > [10, 20] 100, [0, 10] 50) instead of range values. > Lucene has an API, getAllChildrenSortByValue, that returns numeric values with > counts sorted by label values; please see > [LUCENE-7927|https://issues.apache.org/jira/browse/LUCENE-7927] for details. > Therefore, it would be nice if we could also have a similar API to support > range counts. The proposed getAllChildren API is to return value/range counts > sorted by label values instead of counts. > This proposal was inspired by the discussions with [~gsmiller] when I was > working on the LUCENE-10538 [PR|https://github.com/apache/lucene/pull/843], > and we believe users would benefit from adding this API to Facets. > Hope I can get some feedback from the community since this proposal would > require changes to the getTopChildren API in RangeFacetCounts. Thanks! 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #968: [LUCENE-10624] Binary Search for Sparse IndexedDISI advanceWithinBloc…
zacharymorn commented on PR #968: URL: https://github.com/apache/lucene/pull/968#issuecomment-1162539088 Thanks @wuwm for opening this PR! The improvement idea makes sense to me. Quick question though, given the similarities of the binary search implementations in the two methods, is it possible to extract them out into a common method? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
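The shared helper suggested in the review comment above would be a classic lower-bound binary search. A hedged sketch of what such an extracted method could look like — this is an illustration, not the actual IndexedDISI code:

```java
/**
 * Sketch of a reusable lower-bound binary search, the kind of helper the
 * review comment above suggests extracting: returns the smallest index in
 * [lo, hi) whose value is >= target, or hi if no such index exists.
 * Assumes the slice of `sorted` is in ascending order.
 */
public class LowerBoundSketch {
    static int firstGreaterOrEqual(long[] sorted, int lo, int hi, long target) {
        while (lo < hi) {
            int mid = (lo + hi) >>> 1; // unsigned shift avoids (lo + hi) int overflow
            if (sorted[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

Both advance methods could then call this helper with their own bounds and slot accessors, which is the deduplication the comment asks about.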
[jira] [Commented] (LUCENE-10607) Overflow when NRTSuggesterBuilder expands the input
[ https://issues.apache.org/jira/browse/LUCENE-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557197#comment-17557197 ] chaseny commented on LUCENE-10607: -- (y) > Overflow when NRTSuggesterBuilder expands the input > - > > Key: LUCENE-10607 > URL: https://issues.apache.org/jira/browse/LUCENE-10607 > Project: Lucene - Core > Issue Type: Bug > Components: core/FSTs >Affects Versions: 9.2 >Reporter: chaseny >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > When the suggest module builds its index, it calls NRTSuggesterBuilder#finishTerm to write the suggest index. > finishTerm calls maxNumArcsForDedupByte to expand `analyzed`, growing it through 3, 5, 7, ... 255. > When the entries get long (e.g. 900), the expansion via maxNumArcsForDedupByte overflows: > > private static int maxNumArcsForDedupByte(int currentNumDedupBytes) { > int maxArcs = 1 + (2 * currentNumDedupBytes); > if (currentNumDedupBytes > 5) { > maxArcs *= currentNumDedupBytes; > // when currentNumDedupBytes >= 32768, the int multiplication exceeds Integer.MAX_VALUE > } > return Math.min(maxArcs, 255); > } > > Also, when expanding, could we grow by a fixed 4 bytes each time, in order, instead of the 3, 5, 7, ... 255 scheme? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
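One way to address the int overflow in maxNumArcsForDedupByte reported above is to widen the arithmetic to long before clamping. A sketch of that idea — an illustration under the assumption described in the report, not the actual Lucene patch:

```java
/**
 * Sketch of an overflow-safe variant of maxNumArcsForDedupByte: doing the
 * multiplication in long arithmetic keeps the intermediate product from
 * wrapping when currentNumDedupBytes is large (>= ~32768), as described in
 * the bug report above. The result is clamped to 255 either way, so only
 * the intermediate type changes. Illustrative, not the actual Lucene fix.
 */
public class MaxArcsSketch {
    static int maxNumArcsForDedupByteSafe(int currentNumDedupBytes) {
        long maxArcs = 1L + 2L * currentNumDedupBytes; // long math throughout
        if (currentNumDedupBytes > 5) {
            maxArcs *= currentNumDedupBytes; // cannot overflow long for any int input
        }
        return (int) Math.min(maxArcs, 255);
    }
}
```

Since the return value is clamped to 255, the long intermediate never escapes; the cast back to int is safe.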
[GitHub] [lucene] zacharymorn opened a new pull request, #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction
zacharymorn opened a new pull request, #972: URL: https://github.com/apache/lucene/pull/972 ### Description (or a Jira issue link if you have one) Use Block-Max-Maxscore algorithm for 2 clauses disjunction. Adapted from PR https://github.com/apache/lucene/pull/101 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557223#comment-17557223 ] Lu Xugang commented on LUCENE-10603: Hi, [~gsmiller] when I started to work on the rest of the modules, I found a new issue, LUCENE-10623, which should be resolved first. {quote}I'm happy to divide up some of the modules {quote} LUCENE-10623 will affect the modules that use *SortingSortedDocValues*; if you have free time, you could make the change in the modules that are not affected, and I will take care of the rest of the modules after LUCENE-10623 is merged. > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Assignee: Lu Xugang >Priority: Trivial > Time Spent: 2h > Remaining Estimate: 0h > > Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should we > refactor the implementation of ords iteration to use docValueCount instead of > NO_MORE_ORDS, similar to what SortedNumericDocValues does? > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
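The refactoring proposed in LUCENE-10603 above replaces a sentinel-terminated loop with a counted loop. A standalone sketch of the two equivalent shapes, with a plain iterator standing in for SortedSetDocValues (illustrative only, not the Lucene API):

```java
import java.util.Iterator;
import java.util.List;

/** Sketch contrasting the NO_MORE_ORDS sentinel loop with the counted (docValueCount-style) loop. */
public class OrdIterationSketch {
    static final long NO_MORE_ORDS = -1;

    /** Old style: pull ords until the sentinel value appears. */
    static long sumSentinel(List<Long> ords) {
        Iterator<Long> it = ords.iterator();
        long sum = 0;
        for (long ord = next(it); ord != NO_MORE_ORDS; ord = next(it)) {
            sum += ord;
        }
        return sum;
    }

    /** New style: iterate exactly the known count of ords (docValueCount() analogue). */
    static long sumCounted(List<Long> ords) {
        Iterator<Long> it = ords.iterator();
        long sum = 0;
        for (int i = 0; i < ords.size(); i++) {
            sum += it.next();
        }
        return sum;
    }

    private static long next(Iterator<Long> it) {
        return it.hasNext() ? it.next() : NO_MORE_ORDS;
    }
}
```

Both loops visit the same ords; the counted form simply never needs a reserved sentinel value, which is the motivation of the issue.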
[GitHub] [lucene] zacharymorn commented on pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction
zacharymorn commented on PR #972: URL: https://github.com/apache/lucene/pull/972#issuecomment-1162613846

Hi @jpountz, I have adapted the original BMM PR https://github.com/apache/lucene/pull/101 to the latest codebase and run further experiments on using it for 2-clause disjunctions. The results look both encouraging and strange :D

When I run `python3 src/python/localrun.py -source wikimedium10m` with only the `OrHighLow`, `OrHighHigh` and `OrHighMed` tasks from `tasks/wikimedium.10M.nostopwords.tasks` (by removing the other tasks), I got a pretty impressive speedup on average:

```
Task          QPS baseline  StdDev   QPS my_modified_version  StdDev     Pct diff           p-value
PKLookup            173.31  (24.6%)            181.79  (26.8%)    4.9% ( -37% -   74%)  0.547
OrHighLow           166.70  (62.8%)            385.94 (101.5%)  131.5% ( -20% -  794%)  0.000
OrHighHigh            9.27  (48.9%)             23.44  (85.9%)  152.9% (  12% -  562%)  0.000
OrHighMed            18.45  (61.3%)             55.92 (137.3%)  203.0% (   2% - 1037%)  0.000
```

However, when I run all the tasks, `OrHighLow`, `OrHighHigh` and `OrHighMed` see only a moderate speedup on average, and are sometimes even slightly negatively impacted:

```
Task                          QPS baseline  StdDev   QPS my_modified_version  StdDev     Pct diff          p-value
OrHighHigh                           35.23   (7.2%)             23.86   (7.0%)  -32.3% ( -43% -  -19%)  0.000
OrHighLow                           898.97   (4.4%)            788.65   (4.2%)  -12.3% ( -20% -   -3%)  0.000
BrowseDateSSDVFacets                  2.62  (27.0%)              2.43  (18.8%)   -7.4% ( -41% -   52%)  0.312
HighSpanNear                         21.86   (6.4%)             21.00   (6.1%)   -4.0% ( -15% -    9%)  0.045
Fuzzy2                               94.11  (12.4%)             90.59   (9.8%)   -3.7% ( -23% -   21%)  0.290
LowSloppyPhrase                      65.63   (8.2%)             63.99   (8.6%)   -2.5% ( -17% -   15%)  0.347
HighSloppyPhrase                     17.25   (5.3%)             16.84   (5.3%)   -2.4% ( -12% -    8%)  0.154
TermDTSort                          160.18   (8.2%)            156.49   (9.9%)   -2.3% ( -18% -   17%)  0.423
HighTermDayOfYearSort               164.86   (6.8%)            161.77  (10.1%)   -1.9% ( -17% -   16%)  0.490
OrHighMedDayTaxoFacets               11.05   (7.1%)             10.86   (7.3%)   -1.7% ( -15% -   13%)  0.465
AndHighLow                         1482.47   (4.0%)           1459.63  (10.6%)   -1.5% ( -15% -   13%)  0.544
MedSpanNear                          27.77   (7.2%)             27.49   (6.1%)   -1.0% ( -13% -   13%)  0.628
HighTermTitleBDVSort                197.53   (7.4%)            195.53   (6.3%)   -1.0% ( -13% -   13%)  0.640
AndHighMedDayTaxoFacets              43.61   (8.7%)             43.19  (10.1%)   -1.0% ( -18% -   19%)  0.745
HighIntervalsOrdered                 17.38   (8.7%)             17.26   (7.5%)   -0.7% ( -15% -   16%)  0.782
HighPhrase                          454.15   (5.0%)            451.67   (8.7%)   -0.5% ( -13% -   13%)  0.807
BrowseRandomLabelSSDVFacets          15.40   (8.1%)             15.32   (7.3%)   -0.5% ( -14% -   16%)  0.837
AndHighHighDayTaxoFacets             16.94   (7.0%)             16.87   (6.6%)   -0.5% ( -13% -   14%)  0.834
LowSpanNear                           9.08   (4.8%)              9.05   (4.3%)   -0.3% (  -9% -    9%)  0.838
Wildcard                             55.15  (11.3%)             55.01  (12.0%)   -0.2% ( -21% -   26%)  0.947
MedPhrase                           976.56   (2.8%)            977.29   (3.3%)    0.1% (  -5% -    6%)  0.939
MedTermDayTaxoFacets                 77.21   (8.6%)             77.46   (8.7%)    0.3% ( -15% -   19%)  0.908
OrNotHighLow                       1187.34   (5.1%)           1191.80   (5.3%)    0.4% (  -9% -   11%)  0.819
OrHighNotHigh                      1556.42   (4.4%)           1566.26   (4.5%)    0.6% (  -7% -    9%)  0.654
LowIntervalsOrdered                 158.96   (6.4%)            160.03   (8.9%)    0.7% ( -13% -   17%)  0.785
OrNotHighHigh                      1427.22   (3.8%)           1436.97   (5.0%)    0.7% (  -7% -    9%)  0.628
Fuzzy1                              116.55  (11.4%)            117.41   (9.4%)    0.7% ( -18% -   24%)  0.823
LowTerm                            3470.46   (5.9%)           3500.25   (5.9%)    0.9% ( -10% -   13%)  0.644
HighTermMonthSort                   169.22  (10.4%)            170.68  (14.9%)    0.9% ( -22% -   29%)  0.832
IntNRQ                              115.77  (22.6%)            116.95  (21.3%)    1.0% ( -34% -   57
```
[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
shaie commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r903277452

## lucene/demo/src/java/org/apache/lucene/demo/facet/CustomFacetSetExample.java:
@@ -0,0 +1,303 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.demo.facet;
+
+import java.io.IOException;
+import java.time.LocalDate;
+import java.time.ZoneOffset;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
+import org.apache.lucene.document.*;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.FacetsCollectorManager;
+import org.apache.lucene.facet.facetset.*;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.IndexWriterConfig.OpenMode;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.store.ByteBuffersDirectory;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.NumericUtils;
+
+/**
+ * Shows usage of indexing and searching {@link FacetSetsField} with a custom {@link FacetSet}
+ * implementation. Unlike the out of the box {@link FacetSet} implementations, this example shows
+ * how to mix and match dimensions of different types, as well as implementing a custom {@link
+ * FacetSetMatcher}.
+ */
+public class CustomFacetSetExample {
+
+  private static final long MAY_SECOND_2022 = date("2022-05-02");
+  private static final long JUNE_SECOND_2022 = date("2022-06-02");
+  private static final long JULY_SECOND_2022 = date("2022-07-02");
+  private static final float HUNDRED_TWENTY_DEGREES = fahrenheitToCelsius(120);
+  private static final float HUNDRED_DEGREES = fahrenheitToCelsius(100);
+  private static final float EIGHTY_DEGREES = fahrenheitToCelsius(80);
+
+  private final Directory indexDir = new ByteBuffersDirectory();
+
+  /** Empty constructor */
+  public CustomFacetSetExample() {}
+
+  /** Build the example index. */
+  private void index() throws IOException {
+    IndexWriter indexWriter =
+        new IndexWriter(
+            indexDir, new IndexWriterConfig(new WhitespaceAnalyzer()).setOpenMode(OpenMode.CREATE));
+
+    // Every document holds the temperature measures for a City by Date
+
+    Document doc = new Document();
+    doc.add(new StringField("city", "city1", Field.Store.YES));
+    doc.add(
+        FacetSetsField.create(
+            "temperature",
+            new TemperatureReadingFacetSet(MAY_SECOND_2022, HUNDRED_DEGREES),
+            new TemperatureReadingFacetSet(JUNE_SECOND_2022, EIGHTY_DEGREES),
+            new TemperatureReadingFacetSet(JULY_SECOND_2022, HUNDRED_TWENTY_DEGREES)));
+    indexWriter.addDocument(doc);
+
+    doc = new Document();
+    doc.add(new StringField("city", "city2", Field.Store.YES));
+    doc.add(
+        FacetSetsField.create(
+            "temperature",
+            new TemperatureReadingFacetSet(MAY_SECOND_2022, EIGHTY_DEGREES),
+            new TemperatureReadingFacetSet(JUNE_SECOND_2022, HUNDRED_DEGREES),
+            new TemperatureReadingFacetSet(JULY_SECOND_2022, HUNDRED_TWENTY_DEGREES)));
+    indexWriter.addDocument(doc);
+
+    indexWriter.close();
+  }
+
+  /** Counting documents which exactly match a given {@link FacetSet}. */
+  private List<FacetResult> exactMatching() throws IOException {
+    DirectoryReader indexReader = DirectoryReader.open(indexDir);
+    IndexSearcher searcher = new IndexSearcher(indexReader);
+
+    // MatchAllDocsQuery is for "browsing" (counts facets
+    // for all non-deleted docs in the index); normally
+    // you'd use a "normal" query:
+    FacetsCollector fc = searcher.search(new MatchAllDocsQuery(), new FacetsCollectorManager());
+
+    // Count both "Publish Date" and "Author" dimensions

Review Comment:
Indeed :), I copied the simple faceting example and didn't cover up my tracks very well :D.

-- This is an automated message from the Apache Git Service. To res
[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
shaie commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r903277577

## lucene/demo/src/java/org/apache/lucene/demo/facet/CustomFacetSetExample.java (quoting the same new file as the previous comment; the additional quoted lines are):

+    // Count both "Publish Date" and "Author" dimensions
+    Facets facets =
+        new MatchingFacetSetsCounts(
+            "temperature",
+            fc,
+            TemperatureReadingFacetSet::decodeTemperatureReading,
[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
shaie commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r903278541

## lucene/facet/docs/FacetSets.adoc:
@@ -0,0 +1,130 @@
+= FacetSets Overview
+:toc:
+
+This document describes the `FacetSets` capability, which allows aggregating on multidimensional values. It starts
+by outlining a few example use cases to showcase the motivation for this capability, and follows with an API
+walkthrough.
+
+== Motivation
+
+[#movie-actors]
+=== Movie Actors DB
+
+Suppose that you want to build a search engine for movie actors which allows you to search for actors by name and see
+movie titles they appeared in. You might want to index standard fields such as `actorName`, `genre` and `releaseYear`,
+which will let you search by the actor's name or see all actors who appeared in movies during 2021. Similarly, you can
+index facet fields that will let you aggregate by "Genre" and "Year" so that you can show how many actors appeared in
+each year or genre. A few example documents:
+
+[source]
+{ "name": "Tom Hanks", "genre": ["Comedy", "Drama", …], "year": [1988, 2000, …] }
+{ "name": "Harrison Ford", "genre": ["Action", "Adventure", …], "year": [1977, 1981, …] }
+
+However, these facet fields do not allow you to show the following aggregation:
+
+.Number of Actors performing in movies by Genre and Year
+[cols="4*"]
+|===
+|           | 2020 | 2021 | 2022
+| Thriller  | 121  | 43   | 97
+| Action    | 145  | 52   | 130
+| Adventure | 87   | 21   | 32
+|===
+
+The reason is that each "genre" or "releaseYear" facet field is indexed in its own data structure, and therefore if an
+actor appeared in a "Thriller" movie in "2020" and an "Action" movie in "2021", there's no way for you to tell that they
+didn't appear in an "Action" movie in "2020".
+
+[#automotive-parts]
+=== Automotive Parts Store
+
+Say you're building a search engine for an automotive parts store where customers can search for different car parts.
+For simplicity, let's assume that each item in the catalog contains a searchable "type" field and the "car model" it fits,
+which consists of two separate fields: "manufacturer" and "year". This lets you search for parts by their type as well
+as filter parts that fit only a certain manufacturer or year. A few example documents:
+
+[source]
+{
+  "type": "Wiper Blades V1",
+  "models": [
+    { "manufacturer": "Ford", "year": 2010 },
+    { "manufacturer": "Chevy", "year": 2011 }
+  ]
+}
+{
+  "type": "Wiper Blades V2",
+  "models": [
+    { "manufacturer": "Ford", "year": 2011 },
+    { "manufacturer": "Chevy", "year": 2010 }
+  ]
+}
+
+By breaking up the "models" field into its sub-fields "manufacturer" and "year", you can easily aggregate on parts that
+fit a certain manufacturer or year. However, if a user would like to aggregate on parts that can fit either a "Ford
+2010" or a "Chevy 2011", then aggregating on the sub-fields will lead to a wrong count of 2 (in the above example) instead
+of 1.
+
+[#movie-awards]
+=== Movie Awards
+
+To showcase a 3-D multidimensional aggregation, let's expand the <<movie-actors>> example with awards an actor has
+received over the years. For this aggregation we will use four dimensions: Award Type ("Oscar", "Grammy", "Emmy"),
+Award Category ("Best Actor", "Best Supporting Actress"), Year and Genre. One interesting aggregation is to show how
+many "Best Actor" vs. "Best Supporting Actor" awards one has received in the "Oscar" or "Emmy" for each year. Another
+aggregation is slicing the number of these awards by Genre over all the years.
+
+Building on these examples, one might be able to come up with an interesting use case for an N-dimensional aggregation
+(where `N > 3`). The higher `N` is, the harder it is to aggregate all the dimensions correctly and efficiently without
+`FacetSets`.
+
+== FacetSets API
+
+The `facetset` package consists of a few components which allow you to index and aggregate multidimensional facet sets:
+
+=== FacetSet
+
+Holds a set of facet dimension values. Implementations are required to convert the dimensions into a comparable long
+representation, and can also implement how the values are packed (encoded). The package offers four implementations:
+`Int/Float/Long/DoubleFacetSet` for `int`, `float`, `long` and `double` values respectively. You can also look at
+`org.apache.lucene.demo.facet.CustomFacetSetExample` in the `lucene/demo` package for a custom implementation of a
+`FacetSet`.
+
+=== FacetSetsField
+
+A `BinaryDocValues` field which lets you index a list of `FacetSet`. This field can be added to a document only once, so
+you will need to construct all the facet sets in advance.
+
+=== FacetSetMatcher
+
+Responsible for matching an encoded `FacetSet` against a given criteria. For example, `ExactFacetSetMatcher` only
+considers an encoded facet set a match if all dimension values are equal to a given one. `RangeFacetSetMatcher`
+considers an encoded facet set as
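The core point of the automotive-parts example quoted above — counting over (manufacturer, year) *pairs* rather than over each sub-field independently — can be illustrated with a toy model. The `FacetSetCountSketch` class and its pipe-encoded pairs are hypothetical illustrations, not the Lucene facetset API:

```java
import java.util.List;
import java.util.Set;

/**
 * Toy illustration of facet-set counting: each document stores its
 * (manufacturer, year) combinations as encoded pairs like "Ford|2010".
 */
class FacetSetCountSketch {
  /** Set-based count: a doc matches if it contains one of the wanted pairs. */
  static long countExactMatches(List<Set<String>> docs, Set<String> wanted) {
    return docs.stream().filter(doc -> doc.stream().anyMatch(wanted::contains)).count();
  }

  /**
   * Naive per-field count: a doc matches if any manufacturer AND any year match,
   * even when they come from different pairs -- the over-counting described above.
   */
  static long countPerField(
      List<Set<String>> docs, Set<String> manufacturers, Set<Integer> years) {
    return docs.stream()
        .filter(
            doc ->
                doc.stream().anyMatch(p -> manufacturers.contains(p.split("\\|")[0]))
                    && doc.stream()
                        .anyMatch(p -> years.contains(Integer.parseInt(p.split("\\|")[1]))))
        .count();
  }
}
```

With the two wiper-blade documents from the example, asking for "Ford 2010 or Chevy 2011" counts 1 document when matching whole pairs, but 2 when the sub-fields are aggregated independently.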
[GitHub] [lucene] LuXugang commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues
LuXugang commented on code in PR #967: URL: https://github.com/apache/lucene/pull/967#discussion_r903314253

## lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
@@ -439,29 +433,42 @@ private void set() {
   static final class DocOrds {
     final long[] offsets;
     final PackedLongValues ords;
+    final GrowableWriter growableWriter;
+
+    public static final int START_BITS_PER_VALUE = 2;

Review Comment:
A `bitsPerValue` is required for `GrowableWriter`. We could count `maxBitsRequired` while adding values in `SortedSetDocValuesWriter`, but `SortingSortedSetDocValues` is also used in `SortingCodecReader#getDocValuesReader#getSortedSet`, which cannot supply a `bitsPerValue`, so we have to use a default value. Could you suggest, based on practice, a good value for `START_BITS_PER_VALUE`, @jpountz?
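The trade-off behind choosing `START_BITS_PER_VALUE` can be sketched with a minimal "growable" writer. The `GrowableSketch` class below is an illustration, not Lucene's `GrowableWriter`: it only tracks the current width, whereas the real implementation stores values bit-packed and copies them into a wider layout on upgrade. A small starting width wastes little memory up front but may pay repeated re-packing costs as larger values arrive:

```java
/**
 * Illustration of a growable packed-value writer: values start at a small
 * bitsPerValue and the layout widens whenever a value that does not fit arrives.
 */
class GrowableSketch {
  private int bitsPerValue;
  private final long[] values; // plain array stand-in for a bit-packed store
  private int size = 0;

  GrowableSketch(int startBitsPerValue, int capacity) {
    this.bitsPerValue = startBitsPerValue;
    this.values = new long[capacity];
  }

  void add(long v) {
    // Number of bits needed to represent v (at least 1).
    int needed = 64 - Long.numberOfLeadingZeros(Math.max(v, 1));
    if (needed > bitsPerValue) {
      bitsPerValue = needed; // real impl: allocate a wider packed array and copy
    }
    values[size++] = v;
  }

  int bitsPerValue() {
    return bitsPerValue;
  }
}
```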
[GitHub] [lucene] kaivalnp commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection
kaivalnp commented on code in PR #951: URL: https://github.com/apache/lucene/pull/951#discussion_r903319874

## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
-    BitSetCollector filterCollector = null;
+    Weight filterWeight = null;
     if (filter != null) {
-      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
       BooleanQuery booleanQuery =
           new BooleanQuery.Builder()
               .add(filter, BooleanClause.Occur.FILTER)
               .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
               .build();
-      indexSearcher.search(booleanQuery, filterCollector);
+      Query rewritten = indexSearcher.rewrite(booleanQuery);
+      filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
     }
     for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterCollector);
+      Bits acceptDocs;
+      int cost;
+      if (filterWeight != null) {
+        Scorer scorer = filterWeight.scorer(ctx);
+        if (scorer != null) {
+          DocIdSetIterator iterator = scorer.iterator();
+          if (iterator instanceof BitSetIterator) {
+            acceptDocs = ((BitSetIterator) iterator).getBitSet();
+          } else {
+            acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+          }

Review Comment:
We can extend the `BitSetIterator` so that it also incorporates `liveDocs` (return the `nextSetBit` only if it is live, else move on to the next bit in a loop). But then we can't find an accurate estimate of the number of matching + live docs (which is needed as the `visitedLimit` to switch over to `exactSearch`)?
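The liveDocs intersection discussed in this review can be sketched with `java.util.BitSet` as a stand-in for Lucene's bitset types (the `AcceptDocsSketch` class and its names are illustrative, not the PR's API). With materialized bitsets, both the intersection and its exact cardinality — the matching + live count the reviewer wants for `visitedLimit` — are cheap; the difficulty in the PR is that an arbitrary `DocIdSetIterator` only exposes an approximate cost, so the exact count is unknown without consuming the iterator:

```java
import java.util.BitSet;

/** Stand-in illustration: intersect the filter's matching docs with liveDocs. */
class AcceptDocsSketch {
  /** Docs that match the filter AND are not deleted. */
  static BitSet acceptDocs(BitSet filterMatches, BitSet liveDocs) {
    BitSet accept = (BitSet) filterMatches.clone(); // keep the filter's bits intact
    accept.and(liveDocs); // in-place intersection
    return accept;
  }
}
```

Here `acceptDocs(...).cardinality()` gives the exact matching-and-live count directly, which is what an iterator-backed set cannot provide up front.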