date:20220614

[GitHub] [lucene] jpountz commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

2022-06-14 Thread GitBox



jpountz commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1154822015

   This is exactly the testing that I had in mind, thanks for running these 
benchmarks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz merged pull request #950: LUCENE-10608: Implement Weight#count on pure conjunctions.

2022-06-14 Thread GitBox



jpountz merged PR #950:
URL: https://github.com/apache/lucene/pull/950



[jira] [Commented] (LUCENE-10608) Implement Weight#count for pure conjunctions

2022-06-14 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553949#comment-17553949
 ] 

ASF subversion and git services commented on LUCENE-10608:
--

Commit 83461601adb08ff410c32a870cb0381b6b0857f2 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=83461601adb ]

LUCENE-10608: Implement Weight#count on pure conjunctions. (#950)



> Implement Weight#count for pure conjunctions
> 
>
> Key: LUCENE-10608
> URL: https://issues.apache.org/jira/browse/LUCENE-10608
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> It's common for Elasticsearch to ingest time-based data where newer segments 
> contain recent data and older segments contain older data. On such indices, 
> it's common for range queries on the time field to match either all of or 
> none of the documents in the segment.
> We could implement Weight#count on pure conjunctions to take advantage of 
> this by either returning 0 if any of the clauses has a match count of 0, or 
> the count of the only clause that doesn't have a match count that is equal to 
> maxDoc.
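The counting rule described above can be sketched in plain Java without depending on Lucene. This is only an illustration of the idea, not the code merged in #950: the class name, the method name, and the use of -1 for "count not cheaply known" are assumptions made for this sketch, and it ignores deleted documents.

```java
final class ConjunctionCountSketch {

  /**
   * Derives the match count of a pure conjunction from per-clause counts.
   * A clause count of -1 means "unknown"; the result is -1 when the
   * conjunction count cannot be derived without iterating documents.
   */
  static int count(int[] clauseCounts, int maxDoc) {
    if (clauseCounts.length == 0) {
      throw new IllegalArgumentException("a conjunction needs at least one clause");
    }
    // Any clause that matches nothing makes the whole conjunction empty.
    for (int c : clauseCounts) {
      if (c == 0) {
        return 0;
      }
    }
    int result = maxDoc; // value if every clause matches all documents
    int restrictive = 0; // number of clauses that do not match every document
    for (int c : clauseCounts) {
      if (c != maxDoc) {
        restrictive++;
        result = c; // may itself be -1 (unknown)
      }
    }
    // With two or more restrictive clauses the size of the overlap is unknown.
    return restrictive <= 1 ? result : -1;
  }

  public static void main(String[] args) {
    System.out.println(count(new int[] {100, 0, 42}, 100));   // 0
    System.out.println(count(new int[] {100, 42, 100}, 100)); // 42
    System.out.println(count(new int[] {42, 17}, 100));       // -1
  }
}
```

On time-based indices, most segments would hit one of the first two cases for a range clause on the time field, which is where the shortcut pays off.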



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira?

2022-06-14 Thread Tomoko Uchida (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553953#comment-17553953
 ] 

Tomoko Uchida commented on LUCENE-10557:


Vote thread:

[https://lists.apache.org/thread/124nfzjmz2vqtw7kl6xohd2jct57m6tr]

Vote count:

[https://docs.google.com/spreadsheets/d/1MnRO-Kfbglj00liFDqaboyAvseI19_jWj5QwxuLiXWE/edit?usp=sharing]

 

 

> Migrate to GitHub issue from Jira?
> --
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>
> A few (not the majority) Apache projects already use GitHub issues instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible for us to move to GitHub issues. I have 
> little knowledge of how to proceed with it, so I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * Get a consensus about the migration among committers
>  * Enable GitHub issues on the lucene repository (currently, it is disabled 
> on it)
>  * Build the convention or rules for issue label/milestone management
>  * Choose issues that should be moved to GitHub (I think too old or obsolete 
> issues can remain in Jira.)




[jira] [Commented] (LUCENE-10608) Implement Weight#count for pure conjunctions

2022-06-14 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553970#comment-17553970
 ] 

ASF subversion and git services commented on LUCENE-10608:
--

Commit 4da1a16835d36b322bbd359e5ddc21f71c4fe3aa in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4da1a16835d ]

LUCENE-10608: Implement Weight#count on pure conjunctions. (#950)




[GitHub] [lucene] kaivalnp opened a new pull request, #958: LUCENE-10611: Fix Heap Error in HnswGraphSearcher

2022-06-14 Thread GitBox



kaivalnp opened a new pull request, #958:
URL: https://github.com/apache/lucene/pull/958

   ### Description
   
   Link to [Jira](https://issues.apache.org/jira/browse/LUCENE-10611)
   
   The HNSW graph search does not account for visitedLimit being reached 
while it is still searching the upper levels of the graph.
   
   This occurs when the pre-filter is too restrictive (its match count sets the 
visitedLimit). Instead of switching over to exactSearch, the search tries to [pop 
from an empty 
heap](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90)
 and throws an error.
   
   ### Solution
   
   We can check whether the results are incomplete after searching the upper 
levels, and break out accordingly. This way the search does not throw heap 
errors and gracefully switches to exactSearch instead.
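
   The guard can be illustrated with a small self-contained sketch. It is a toy model, not the code in this PR: `LevelResult`, `searchLevel`, and `descend` are hypothetical names, the "graph" is just an array of neighbor lists per upper level, and returning -1 stands in for "fall back to exactSearch".

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class UpperLevelGuardSketch {

  /** Toy per-level result; a hypothetical stand-in, not Lucene's NeighborQueue. */
  static final class LevelResult {
    final Deque<Integer> nodes = new ArrayDeque<>();
    int visitedCount;
    boolean incomplete;
  }

  /** Visits the entry point, then the given neighbors, until the visit budget runs out. */
  static LevelResult searchLevel(int entryPoint, int[] neighbors, int visitedLimit) {
    LevelResult result = new LevelResult();
    if (visitedLimit <= 0) {
      result.incomplete = true; // nothing could be visited: the queue stays empty
      return result;
    }
    result.nodes.push(entryPoint);
    result.visitedCount = 1;
    for (int neighbor : neighbors) {
      if (result.visitedCount >= visitedLimit) {
        result.incomplete = true;
        break;
      }
      result.visitedCount++;
      result.nodes.push(neighbor); // toy assumption: the last neighbor visited is the closest
    }
    return result;
  }

  /** Returns the entry point for the bottom level, or -1 to signal a fallback to exact search. */
  static int descend(int[][] upperLevels, int entryPoint, int visitedLimit) {
    int current = entryPoint;
    for (int[] neighbors : upperLevels) {
      LevelResult result = searchLevel(current, neighbors, visitedLimit);
      if (result.incomplete) {
        return -1; // guard: never pop from a queue that may be empty
      }
      current = result.nodes.pop();
      visitedLimit -= result.visitedCount;
    }
    return current;
  }
}
```

   Without the `incomplete` check, the second level in a call such as `descend(new int[][] {{1, 2}, {3}}, 0, 3)` would start with a budget of zero, leave the queue empty, and fail on `pop()`.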



[jira] [Commented] (LUCENE-10611) KnnVectorQuery throwing Heap Error for Restrictive Filters

2022-06-14 Thread Kaival Parikh (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553973#comment-17553973
 ] 

Kaival Parikh commented on LUCENE-10611:


Thanks, I have added the fix! As for the test, I feel that 
*testRandomWithFilter* is ideal for this (as it checks switching over to 
{*}exactSearch{*}, and we should extend it for higher levels as well).

If we increase *numDocs* to a reasonably high value (~2000), we start getting 
heap errors (as the *visitedLimit* is exhausted in upper levels). With the fix, 
we can check whether it still switches to *exactSearch*.

Here is the [PR|https://github.com/apache/lucene/pull/958]

> KnnVectorQuery throwing Heap Error for Restrictive Filters
> --
>
> Key: LUCENE-10611
> URL: https://issues.apache.org/jira/browse/LUCENE-10611
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kaival Parikh
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The HNSW graph search does not consider that visitedLimit may be reached in 
> the upper levels of graph search itself
> This occurs when the pre-filter is too restrictive (and its count sets the 
> visitedLimit). So instead of switching over to exactSearch, it tries to [pop 
> from an empty 
> heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90]
>  and throws an error
>  
> To reproduce this error, we can increase the numDocs 
> [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500]
>  to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached 
> faster)
>  
> Stacktrace:
> {code:java}
> The heap is empty
> java.lang.IllegalStateException: The heap is empty
> at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
> at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
> at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
> at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
> at org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
> at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
> at org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
> at org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)
>  {code}




[GitHub] [lucene] jpountz commented on pull request #950: LUCENE-10608: Implement Weight#count on pure conjunctions.

2022-06-14 Thread GitBox



jpountz commented on PR #950:
URL: https://github.com/apache/lucene/pull/950#issuecomment-1154864922

   Thanks @zhaih !



[jira] [Resolved] (LUCENE-10608) Implement Weight#count for pure conjunctions

2022-06-14 Thread Adrien Grand (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10608.
---
Fix Version/s: 9.3
   Resolution: Fixed


[GitHub] [lucene] jpountz commented on a diff in pull request #907: LUCENE-10357 Ghost fields and postings/points

2022-06-14 Thread GitBox



jpountz commented on code in PR #907:
URL: https://github.com/apache/lucene/pull/907#discussion_r896529395


##
lucene/codecs/src/java/org/apache/lucene/codecs/bloom/BloomFilteringPostingsFormat.java:
##
@@ -200,8 +200,8 @@ public Terms terms(String field) throws IOException {
 return delegateFieldsProducer.terms(field);
   } else {
 Terms result = delegateFieldsProducer.terms(field);
-if (result == null) {
-  return null;
+if (result == null || result == Terms.EMPTY) {

Review Comment:
   This case is a bit special indeed, but I think we should fix it too to make 
sure that it only returns a `null` `Terms` instance if the field doesn't exist 
(fieldInfo == null) or if the field doesn't index terms (indexOptions == NONE).
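
   The contract described in this comment can be sketched with plain collections. This is a hypothetical model, not Lucene code: the class, its `IndexOptions` enum, and the use of a `List<String>` in place of a `Terms` instance are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TermsContractSketch {

  /** Hypothetical, reduced stand-in for a field's index options. */
  enum IndexOptions { NONE, DOCS }

  private final Map<String, IndexOptions> fieldInfos = new HashMap<>();
  private final Map<String, List<String>> indexedTerms = new HashMap<>();

  /** Registers a field; terms may be null when the field has no postings in this segment. */
  void addField(String name, IndexOptions options, List<String> terms) {
    fieldInfos.put(name, options);
    if (terms != null) {
      indexedTerms.put(name, terms);
    }
  }

  /**
   * Returns null only if the field is unknown or does not index terms;
   * a field that indexes terms but has none here yields an empty list.
   */
  List<String> terms(String field) {
    IndexOptions options = fieldInfos.get(field);
    if (options == null || options == IndexOptions.NONE) {
      return null;
    }
    List<String> terms = indexedTerms.get(field);
    return terms != null ? terms : Collections.<String>emptyList();
  }
}
```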




[jira] [Commented] (LUCENE-10600) SortedSetDocValues#docValueCount should be an int, not long

2022-06-14 Thread Lu Xugang (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553987#comment-17553987
 ] 

Lu Xugang commented on LUCENE-10600:


Hi [~jpountz], should we also make SortedSetDocValues#nextOrd() return an int, 
because SSDV values are represented by termID (an int) in 
SortedDocValuesWriter?

> SortedSetDocValues#docValueCount should be an int, not long
> ---
>
> Key: LUCENE-10600
> URL: https://issues.apache.org/jira/browse/LUCENE-10600
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Lu Xugang
>Priority: Minor
>





[jira] [Commented] (LUCENE-10600) SortedSetDocValues#docValueCount should be an int, not long

2022-06-14 Thread Lu Xugang (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553993#comment-17553993
 ] 

Lu Xugang commented on LUCENE-10600:


{quote}Is it that the unique count of ordinals will always fit inside an int
{quote}
Hi [~vigyas], yes. During the indexing phase, SSDV values are represented as 
termIDs and only non-duplicate termIDs are collected. For the detailed 
implementation, see SortedSetDocValuesWriter#finishCurrentDoc.
{quote}I guess it stores integer ordinals compressed as PackedLongValues? 
Should this also be changed to an int
{quote}
We could first change the long to an int and then think about what you 
mentioned. Have you already started on this work? If not, I will fix it in the 
next few days.


[jira] [Commented] (LUCENE-10600) SortedSetDocValues#docValueCount should be an int, not long

2022-06-14 Thread Adrien Grand (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554004#comment-17554004
 ] 

Adrien Grand commented on LUCENE-10600:
---

bq. should we also make SortedSetDocValues#nextOrd() return an int

No, SORTED_SET doc values could have more than Integer.MAX_VALUE unique values 
overall. SortedSetDocValuesWriter does indeed use ints to represent term IDs, 
but this class is only used for flushes and flushes have a hard bound of ~2GB 
per thread so you can't have more than Integer.MAX_VALUE unique terms in a 
flush. However, the unique count of terms can grow through merges beyond 
Integer.MAX_VALUE.
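
The arithmetic behind this can be sketched as follows. The segment sizes are made-up numbers and the sketch assumes the worst case where no term is shared between segments; the class and method names are hypothetical.

```java
final class MergedOrdCountSketch {

  /** Sums per-segment unique term counts, assuming no term is shared between segments. */
  static long mergedUniqueTerms(int[] perSegmentUniqueTerms) {
    long total = 0;
    for (int count : perSegmentUniqueTerms) {
      total += count;
    }
    return total;
  }

  /** True if the value can be represented as an int ordinal count. */
  static boolean fitsInInt(long value) {
    return value <= Integer.MAX_VALUE;
  }

  public static void main(String[] args) {
    // Each count fits in an int, as it would for a single flush.
    int[] segments = {1_500_000_000, 1_500_000_000};
    long merged = mergedUniqueTerms(segments);
    System.out.println(merged);            // 3000000000
    System.out.println(fitsInInt(merged)); // false
  }
}
```

Since ordinals index into the dictionary of the merged segment, they need the wider type even though each flushed segment stays within the int range.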


[jira] [Created] (LUCENE-10616) Moving to dictionaries has made stored fields slower at skipping

2022-06-14 Thread Adrien Grand (Jira)

Adrien Grand created LUCENE-10616:
-

 Summary: Moving to dictionaries has made stored fields slower at 
skipping
 Key: LUCENE-10616
 URL: https://issues.apache.org/jira/browse/LUCENE-10616
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand


[~ywelsch] has been digging into a regression of stored fields retrieval that 
is caused by LUCENE-9486.

Say your documents have two stored fields, one that is 100B and is stored 
first, and the other one that is 100kB, and you are only interested in the 
first one. While the idea behind blocks of stored fields is to store multiple 
documents in the same block to leverage redundancy across documents, sometimes 
documents are larger than the block size. As soon as documents are larger than 
2x the block size, our stored fields format splits such large documents into 
multiple blocks, so that you wouldn't need to decompress everything only to 
retrieve a couple small fields.

Before LUCENE-9486, BEST_SPEED had a block size of 16kB, so only retrieving the 
first field value would only need to decompress 16kB of data. With the move to 
preset dictionaries in LUCENE-9486 and then LUCENE-9917, we now have blocks of 
80kB, so stored fields would now need to decompress 80kB of data, 5x more than 
before.

With dictionaries, our blocks are now split into 10 sub blocks. We happen to 
eagerly decompress all sub blocks that intersect with the stored document, 
which is why we would decompress 80kB of data, but this is an implementation 
detail. It should be possible to decompress these sub blocks lazily so that we 
would only decompress those that intersect with one of the field values that 
the user is interested in retrieving?
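
To make the numbers concrete, here is a rough sketch of how many sub blocks a byte range touches. It assumes 10 equally sized sub blocks of 8kB each (80kB / 10), 1kB = 1024 bytes, and that the small field sits at the very start of the block; the class and method names are hypothetical and this is not the stored fields reader's code.

```java
final class SubBlockSketch {

  /** Number of fixed-size sub blocks that intersect the byte range [offset, offset + length). */
  static int subBlocksTouched(long offset, long length, int subBlockSize) {
    if (length <= 0) {
      return 0;
    }
    long first = offset / subBlockSize;
    long last = (offset + length - 1) / subBlockSize;
    return (int) (last - first + 1);
  }

  public static void main(String[] args) {
    int subBlockSize = 8 * 1024;          // assumed: 80kB block split into 10 equal sub blocks
    long blockSize = 10L * subBlockSize;

    // Eager: decompress every sub block that intersects the (large) document.
    int eager = subBlocksTouched(0, blockSize, subBlockSize);
    // Lazy: decompress only the sub blocks covering the 100B field stored first.
    int lazy = subBlocksTouched(0, 100, subBlockSize);

    System.out.println(eager * subBlockSize); // 81920
    System.out.println(lazy * subBlockSize);  // 8192
  }
}
```

Under these assumptions, lazy decompression would read 8kB for the first field, which is less than the 16kB that BEST_SPEED needed before LUCENE-9486.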




[jira] [Resolved] (LUCENE-10615) Add license information for SmartChineseAnalyzer to NOTICE.txt

2022-06-14 Thread Robert Muir (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-10615.
--
Resolution: Invalid

Please don't use jira for questions like this. We won't be adding unnecessary 
stuff to NOTICE.txt. Look at the source code files if you want to see the 
license.

> Add license information for SmartChineseAnalyzer to NOTICE.txt
> --
>
> Key: LUCENE-10615
> URL: https://issues.apache.org/jira/browse/LUCENE-10615
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Jan Dornseifer
>Priority: Trivial
>
> The Lucene NOTICE file contains the statement
> The SmartChineseAnalyzer source code (smartcn) was
> provided by Xiaoping Gao and copyright 2009 by 
> [www.imdict.net.|http://www.imdict.net./]
> without providing license information. Can this information be supplemented 
> or is it even outdated?
> We are using Apache Lucene v8.4.1. We are currently subject to a license 
> audit of our software, in which 3rd party FOSS components are also checked for 
> usage. Among other things, this part came to our attention. I would be very 
> grateful for information.




[jira] [Commented] (LUCENE-10615) Add license information for SmartChineseAnalyzer to NOTICE.txt

2022-06-14 Thread Dawid Weiss (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554039#comment-17554039
 ] 

Dawid Weiss commented on LUCENE-10615:
--

I think the reference you're looking for is here:
https://github.com/apache/lucene/blob/main/lucene/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.java#L44-L45

although these web sites and their associated resources vanish over time.


[jira] [Commented] (LUCENE-10615) Add license information for SmartChineseAnalyzer to NOTICE.txt

2022-06-14 Thread Jan Dornseifer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554067#comment-17554067
 ] 

Jan Dornseifer commented on LUCENE-10615:
-

[~rcmuir] thanks for taking the time. I don't want the Lucene team to add 
unnecessary stuff to NOTICE.txt but in my opinion this is sensitive 
information. My intent was more of a question, as I cannot verify this 
information (dead link) and do not know under what terms this contribution was 
made. Because we, as the users, have to comply.

[~dweiss] thanks. Yes, I checked the source code before creating the issue and 
found this information, too. This is where my thoughts came from, that the 
other information from the NOTICE file may be outdated. However, this part only 
refers to the dictionary data. I see, the source code files of the package 
itself contain the Apache-2.0 header, so my thoughts now are the following:

The source code of the org/apache/lucene/analysis/cn/smart package is licensed 
under Apache-2.0 and
(either the code was contributed by Xiaoping Gao and copyright 
[www.imdict.net|http://www.imdict.net/]
or this information from the NOTICE.txt is outdated and the copyright of the 
code is Apache Software Foundation).
Additionally, dictionary data is copyright ICTCLAS and also licensed under 
Apache-2.0.

Can either one or the other be confirmed?

Licensing issues are always annoying and I am not a lawyer, but we as users 
depend on this information being complete to stay out of trouble. Hence my 
question.


[jira] [Commented] (LUCENE-10615) Add license information for SmartChineseAnalyzer to NOTICE.txt

2022-06-14 Thread Robert Muir (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554070#comment-17554070
 ] 

Robert Muir commented on LUCENE-10615:
--

> Please don't use jira for questions like this.


[jira] [Commented] (LUCENE-10615) Add license information for SmartChineseAnalyzer to NOTICE.txt

2022-06-14 Thread Jan Dornseifer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554072#comment-17554072
 ] 

Jan Dornseifer commented on LUCENE-10615:
-

Where should such questions be asked instead?


[jira] [Commented] (LUCENE-10615) Add license information for SmartChineseAnalyzer to NOTICE.txt

2022-06-14 Thread Tomoko Uchida (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554077#comment-17554077
 ] 

Tomoko Uchida commented on LUCENE-10615:


Hi, looks like the webpage (dead link) was moved to 
[http://ictclas.nlpir.org/index_e.html].

Also, you can find the original license file (ALv2) in this repository 
[https://github.com/NLPIR-team/nlpir-analysis-cn-ictclas] (this can be reached 
from the above website).

I don't think we can help any further - the licensing of language models or 
dictionaries is sometimes very complicated and difficult to decouple from the 
source code... if you need more help please contact the site owner.


[jira] [Comment Edited] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec

2022-06-14 Thread Elia Porciani (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553593#comment-17553593
 ] 

Elia Porciani edited comment on LUCENE-10612 at 6/14/22 12:32 PM:
--

However, I understand the concern about backward compatibility. I don't think 
it is harmful at this time to have custom HNSW parameters, but things might be 
different in future releases.

Even if we decide not to move forward, I have created this PR for making the 
proposal clearer: 
[https://github.com/apache/lucene/pull/955|https://github.com/apache/lucene/pull/955]



> Add parameters for HNSW codec in Lucene93Codec
> --
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Reporter: Elia Porciani
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, it is possible to specify only the compression mode for stored 
> fields in the LuceneXXCodec constructors.
> With the introduction of HNSW graph, and the LuceneXXHnswCodecFormat, 
> LuceneXXCodec should provide an easy way to specify custom parameters for 
> HNSW graph layout:
> * maxConn
> * beamWidth




[jira] [Commented] (LUCENE-10615) Add license information for SmartChineseAnalyzer to NOTICE.txt

2022-06-14 Thread Jan Dornseifer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554092#comment-17554092
 ] 

Jan Dornseifer commented on LUCENE-10615:
-

[~tomoko] thanks for providing this information. I will update the license 
information in our distribution of Apache Lucene.

> Add license information for SmartChineseAnalyzer to NOTICE.txt
> --
>
> Key: LUCENE-10615
> URL: https://issues.apache.org/jira/browse/LUCENE-10615
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Jan Dornseifer
>Priority: Trivial
>
> The Lucene NOTICE file contains the statement
> The SmartChineseAnalyzer source code (smartcn) was
> provided by Xiaoping Gao and copyright 2009 by 
> [www.imdict.net.|http://www.imdict.net./]
> without providing license information. Can this information be supplemented 
> or is it even outdated?
> We are using Apache Lucene v8.4.1. We are currently subject to a license 
> audit of our software, where also 3rd party FOSS components are checked for 
> usage. Among other things, this part came to our attention. I would be very 
> grateful for information.




[GitHub] [lucene] msokolov commented on a diff in pull request #927: LUCENE-10151: Adding Timeout Support to IndexSearcher

2022-06-14 Thread GitBox



msokolov commented on code in PR #927:
URL: https://github.com/apache/lucene/pull/927#discussion_r896898656


##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -766,18 +778,29 @@ protected void search(List<LeafReaderContext> leaves, Weight weight, Collector c
   }
   BulkScorer scorer = weight.bulkScorer(ctx);
   if (scorer != null) {
-try {
-  scorer.score(leafCollector, ctx.reader().getLiveDocs());
-} catch (
-@SuppressWarnings("unused")
-CollectionTerminatedException e) {
-  // collection was terminated prematurely
-  // continue with the following leaf
+if (isTimeoutEnabled) {
+  TimeLimitingBulkScorer timeLimitingBulkScorer =
+  new TimeLimitingBulkScorer(scorer, queryTimeout);
+  try {
+timeLimitingBulkScorer.score(leafCollector, ctx.reader().getLiveDocs());
+  } catch (
+  @SuppressWarnings("unused")
+  TimeLimitingBulkScorer.TimeExceededException e) {
+partialResult = true;

Review Comment:
   perhaps we could return the time anyway?
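
For readers following the diff, here is a minimal, self-contained sketch of the pattern it introduces: score in fixed-size windows, check a timeout between windows, and mark the result as partial when the timeout fires. This is not the PR's code. `WindowScorer`, `Result`, the window size, and the local `TimeExceededException` are hypothetical stand-ins, and the real `TimeLimitingBulkScorer` may differ in its details.

```java
import java.util.function.BooleanSupplier;

final class TimeLimitedScoringSketch {

  /** Hypothetical stand-in for the exception thrown when the time budget is exhausted. */
  static final class TimeExceededException extends RuntimeException {}

  /** Hypothetical minimal scorer: scores every doc in [min, max). */
  interface WindowScorer {
    void score(int min, int max);
  }

  /** Outcome of a search: how many docs were scored and whether the timeout cut it short. */
  static final class Result {
    final int docsScored;
    final boolean partialResult;

    Result(int docsScored, boolean partialResult) {
      this.docsScored = docsScored;
      this.partialResult = partialResult;
    }
  }

  /** Scores [0, maxDoc) window by window, checking the timeout before each window. */
  static void scoreWithTimeout(
      WindowScorer scorer, int maxDoc, int interval, BooleanSupplier shouldExit) {
    if (interval <= 0) {
      throw new IllegalArgumentException("interval must be positive");
    }
    for (int min = 0; min < maxDoc; min += interval) {
      if (shouldExit.getAsBoolean()) {
        throw new TimeExceededException();
      }
      scorer.score(min, Math.min(min + interval, maxDoc));
    }
  }

  /** Mirrors the try/catch in the diff: a timeout yields partial results, not a failure. */
  static Result search(int maxDoc, int interval, BooleanSupplier shouldExit) {
    int[] docsScored = {0};
    boolean partialResult = false;
    try {
      scoreWithTimeout((min, max) -> docsScored[0] += max - min, maxDoc, interval, shouldExit);
    } catch (TimeExceededException e) {
      partialResult = true;
    }
    return new Result(docsScored[0], partialResult);
  }
}
```

In this sketch the docs scored before the timeout are kept, which is the behaviour the `partialResult = true` line in the diff suggests; whether elapsed time should also be reported is the open question in the review comment.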




[jira] [Resolved] (LUCENE-8193) Deprecate LowercaseTokenizer

2022-06-14 Thread Alan Woodward (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Woodward resolved LUCENE-8193.
---
Resolution: Duplicate

It is indeed a duplicate, thanks [~asalamon74] 

> Deprecate LowercaseTokenizer
> 
>
> Key: LUCENE-8193
> URL: https://issues.apache.org/jira/browse/LUCENE-8193
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tim Allison
>Priority: Minor
>
> On LUCENE-8186, discussion favored deprecating and eventually removing 
> LowercaseTokenizer.




[GitHub] [lucene] jtibshirani merged pull request #956: Make sure KnnVectorQuery applies search boost

2022-06-14 Thread GitBox



jtibshirani merged PR #956:
URL: https://github.com/apache/lucene/pull/956



[jira] [Commented] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec

2022-06-14 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554232#comment-17554232
 ] 

Michael Sokolov commented on LUCENE-10612:
--

> Actually, the change I'm proposing is to make it possible to specify the 
> parameters for HNSW without the need to know which HNSW codec is used 
> underneath.

 

I think the idea is that we might choose in the future to use a different 
nearest-neighbor algorithm that would not support the same configuration 
parameters as HNSW. The public-facing API is deliberately not specific to HNSW.

> Add parameters for HNSW codec in Lucene93Codec
> --
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Reporter: Elia Porciani
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, it is possible to specify only the compression mode for stored 
> fields in the LuceneXXCodec constructors.
> With the introduction of HNSW graph, and the LuceneXXHnswCodecFormat, 
> LuceneXXCodec should provide an easy way to specify custom parameters for 
> HNSW graph layout:
> * maxConn
> * beamWidth




[GitHub] [lucene] gsmiller commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-14 Thread GitBox



gsmiller commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r897216959


##
lucene/facet/src/java/org/apache/lucene/facet/facetset/ExactFacetSetMatcher.java:
##
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.facetset;
+
+/**
+ * A {@link FacetSetMatcher} which considers a set as a match only if all dimension values are equal
+ * to the given one.
+ *
+ * @lucene.experimental
+ */
+public class ExactFacetSetMatcher extends FacetSetMatcher {

Review Comment:
   Should we include `Long` as part of the naming scheme for this (and 
`RangeFacetSetMatcher`) to note that it expects long points? I imagine we may 
want to create a "double" version of this in the future as well. Since we have 
different point types (`LongPoint`, `DoublePoint`, `IntPoint`, `FloatPoint`), 
we might need corresponding versions of these matchers for all of them right?



##
lucene/facet/src/java/org/apache/lucene/facet/facetset/RangeFacetSetMatcher.java:
##
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.facetset;
+
+import java.util.Arrays;
+
+/**
+ * A {@link FacetSetMatcher} which considers a set as a match if all dimensions fall within the
+ * given corresponding range.
+ *
+ * @lucene.experimental
+ */
+public class RangeFacetSetMatcher extends FacetSetMatcher {
+
+  private final long[] lowerRanges;
+  private final long[] upperRanges;
+
+  /**
+   * Constructs and instance to match facet sets with dimensions that fall within the given ranges.

Review Comment:
   typo: "an instance" not "and instance"



##
lucene/facet/src/java/org/apache/lucene/facet/facetset/MatchingFacetSetsCounts.java:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.facetset;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.search.ConjunctionUtils;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.BytesRef;
+
+/**
+ * Returns the counts for each given {@link FacetSet}
+ *
+ * @lucene.experimental
+ */
+public class MatchingFacetSetsCounts extends Facets {
+
+  private final FacetSetMatcher[] facetSetMatchers;
+  private fi

[GitHub] [lucene] javanna opened a new pull request, #959: LUCENE-10507: Make it more likely to perform concurrent search in tests

2022-06-14 Thread GitBox



javanna opened a new pull request, #959:
URL: https://github.com/apache/lucene/pull/959

   I took a stab at this; these are the changes that I made:
   
   1) Replace default useThreads value: rarely() -> randomBoolean()
   2) Apply lower slice thresholds more frequently: randomBoolean() -> frequently
   3) Lower the maxDocsPerSlice and maxSegmentsPerSlice thresholds when applied
   4) Apply lower maxSegments and maxSegmentsPerSlice also when wrapWithAssertions is true
   
   Please let me know what you think. Would it be better to make one change at 
a time, or to make less aggressive changes?



[GitHub] [lucene] vigyasharma commented on pull request #738: LUCENE-10448: Avoid instant rate write bursts by writing bytes buffer in chunks

2022-06-14 Thread GitBox



vigyasharma commented on PR #738:
URL: https://github.com/apache/lucene/pull/738#issuecomment-1155751419

   Based on @jpountz's response in 
[LUCENE-10448](https://issues.apache.org/jira/browse/LUCENE-10448), it looks like 
it is unusual for Lucene to write byte[] arrays that are longer than the 
`chunk` size in this PR. 
   
   Since chunking with pauses in between would add memory pressure by delaying 
GC on these arrays, and it is probably not an expected normal scenario, I am 
planning to close this PR. I will wait a couple of days in case there are 
follow-up comments on this.
   
   On a related note, since `writeBytes()` is the only API that doesn't pause, 
it may be useful to add a note about this in a comment or docstring somewhere, 
perhaps with a reference to the Jira comment.
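
To make the trade-off concrete, here is a minimal sketch of what "writing a bytes buffer in chunks" means, using only the JDK. It is not the PR's implementation: `writeInChunks` and the `pauseHook` callback are hypothetical names, and the hook merely stands in for whatever rate limiter decides how long to pause.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.function.LongConsumer;

final class ChunkedWriteSketch {

  /**
   * Writes b[offset, offset + length) in pieces of at most chunkSize bytes and reports each
   * piece to pauseHook, which may sleep to smooth the write rate. Returns the number of chunks.
   *
   * <p>Note that the whole array stays reachable until the last chunk is written, which is the
   * delayed-GC concern raised in the comment above.
   */
  static int writeInChunks(
      OutputStream out, byte[] b, int offset, int length, int chunkSize, LongConsumer pauseHook) {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    int chunks = 0;
    int end = offset + length;
    try {
      while (offset < end) {
        int n = Math.min(chunkSize, end - offset);
        out.write(b, offset, n);
        pauseHook.accept(n); // a real rate limiter would decide here whether to pause
        offset += n;
        chunks++;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return chunks;
  }
}
```

A single unchunked `out.write(b, offset, length)` corresponds to the current `writeBytes()` behaviour described above: one burst, with no opportunity to pause in between.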



[GitHub] [lucene] shaie commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-14 Thread GitBox



shaie commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1155985365

   > but the one bigger question I'd like to discuss is how we envision handling 
different point types?
   
   I think there are two sides of supporting additional numeric types: indexing 
and aggregation. IMO it's still fine if `FSM` handles a `long[]`: indexing 
`doubles` will be done as `toSortableLong` and reading `int` and `float` into 
`long` is doable. Therefore on the aggregation side I feel like it's fine to 
keep the `long[]` matching API.
   
   For indexing we just need to convert the values to `byte[]`. We can do that 
by making `FacetSet` abstract with a `toBytes()` method and the current impl 
will be changed to `LongFacetSet`. To complement that on the aggregation side 
we will need to pass a _reader_ which can convert the `BytesRef` to a `long[]`. 
I'm thinking that the `Int/Float/Long/DoubleFacetSet` impls will do that.
   
   As for "mix-and-match" I think this provides a solution too, since the user 
will be able to implement their own `FacetSet` and convert their, for example, 
`int, long, long, float` facet set to `byte[]` and decode that back. I'll give 
it a try to see how it works.
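
As a rough illustration of the idea (not the PR's code), the sketch below maps doubles to order-preserving longs, packs a `long[]` into a `byte[]`, and reads it back. The bit trick is the one I believe Lucene's `NumericUtils.doubleToSortableLong` uses, re-implemented here so the example needs only the JDK. `toBytes` and `fromBytes` are hypothetical helpers that use plain big-endian longs, which is an assumption and not necessarily the encoding the PR ends up with.

```java
import java.nio.ByteBuffer;

final class FacetSetEncodingSketch {

  /** Maps a double to a long whose signed ordering matches the ordering of the doubles. */
  static long doubleToSortableLong(double value) {
    long bits = Double.doubleToLongBits(value);
    // For negative values, flip every bit except the sign bit so larger magnitudes sort lower.
    return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
  }

  /** Hypothetical "indexing side": packs the dimension values into a byte[]. */
  static byte[] toBytes(long... values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES);
    for (long v : values) {
      buf.putLong(v);
    }
    return buf.array();
  }

  /** Hypothetical "aggregation side" reader: decodes the byte[] back into a long[]. */
  static long[] fromBytes(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    long[] values = new long[bytes.length / Long.BYTES];
    for (int i = 0; i < values.length; i++) {
      values[i] = buf.getLong();
    }
    return values;
  }
}
```

With this shape, an `int` dimension can simply be widened to `long`, and a `double` dimension goes through `doubleToSortableLong` first, so the matcher only ever compares `long` values.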



[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-14 Thread GitBox



shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r897533011


##
lucene/facet/docs/FacetSets.adoc:
##
@@ -0,0 +1,90 @@
+= FacetSets Overview
+:toc:
+
+This document describes the `FacetSets` capability, which allows to aggregate on multi dimensional values. It starts
+with outlining a few example use cases to showcase the motivation for this capability and follows with an API
+walk through.
+
+== Motivation
+
+[#movie-actors]
+=== Movie Actors DB
+
+Suppose that you want to build a search engine for movie actors which allows you to search for actors by name and see
+movie titles they appeared in. You might want to index standard fields such as `actorName`, `genre` and `releaseYear`
+which will let you search by the actor's name or see all actors who appeared in movies during 2021. Similarly, you can
+index facet fields that will let you aggregate by “Genre” and “Year” so that you can show how many actors appeared in
+each year or genre. Few example documents:
+
+[source]
+
+{ "name": "Tom Hanks", "genre": ["Comedy", "Drama", …], "year": [1988, 2000,…] }
+{ "name": "Harrison Ford", "genre": ["Action", "Adventure", …], "year": [1977, 1981, …] }
+
+
+However, these facet fields do not allow you to show the following aggregation:
+
+.Number of Actors performing in movies by Genre and Year
+[cols="4*"]
+|===
+| | 2020 | 2021 | 2022
+| Thriller | 121 | 43 | 97
+| Action | 145 | 52 | 130
+| Adventure | 87 | 21 | 32
+|===
+
+The reason is that each “genre” or “releaseYear” facet field is indexed in its own data structure, and therefore if an
+actor appeared in a "Thriller" movie in "2020" and "Action" movie in "2021", there's no way for you to tell that they
+didn't appear in an "Action" movie in "2020".
+
+[#automotive-parts]
+=== Automotive Parts Store
+
+Say you're building a search engine for an automotive parts store where customers can search for different car parts.
+For simplicity let's assume that each item in the catalog contains a searchable “type” field and “car model” it fits
+which consists of two separate fields: “manufacturer” and “year”. This lets you search for parts by their type as well
+as filter parts that fit only a certain manufacturer or year. Few example documents:
+
+[source]
+
+{
+  "type": "Wiper Blades V1",
+  "models": [
+{ "manufacturer": "Ford", "year": 2010 },
+{ "manufacturer": "Chevy", "year": 2011 }
+  ]
+}
+{
+  "type": "Wiper Blades V2",
+  "models": [
+{ "manufacturer": "Ford", "year": 2011 },
+{ "manufacturer": "Chevy", "year": 2010 }
+  ]
+}
+
+
+By breaking up the "models" field into its sub-fields "manufacturer" and "year", you can easily aggregate on parts that
+fit a certain manufacturer or year. However, if a user would like to aggregate on parts that can fit either a "Ford
+2010" or "Chevy 2011", then aggregating on the sub-fields will lead to a wrong count of 2 (in the above example) instead
+of 1.
+
+[#movie-awards]
+=== Movie Awards
+
+To showcase a 3-D multi-dimensional aggregation, let's expand the <<movie-actors>> example with awards an actor has
+received over the years. For this aggregation we will use four dimensions: Award Type ("Oscar", "Grammy", "Emmy"),
+Award Category ("Best Actor", "Best Supporting Actress"), Year and Genre. One interesting aggregation is to show how
+many "Best Actor" vs "Best Supporting Actor" awards one has received in the "Oscar" or "Emmy" for each year. Another
+aggregation is slicing the number of these awards by Genre over all the years.
+
+Building on these examples, one might be able to come up with an interesting use case for an N-dimensional aggregation
+(where `N > 3`). The higher `N` is, the harder it is to aggregate all the dimensions correctly and efficiently without
+`FacetSets`.
+
+== FacetSets API
+
+TBD
+
+== FacetSets Under the Hood
+
+TBD

Review Comment:
   I intended to do that, just wanted us to finalize the API first.
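
Until the API sections are written, the miscount described in the quoted "Automotive Parts Store" section can be reproduced with plain collections. The sketch below is only an illustration under stated assumptions: `Model`, `countBySubFields`, and `countByTuples` are hypothetical names and do not reflect the `FacetSets` API. It contrasts matching each sub-field independently with matching whole (manufacturer, year) tuples.

```java
import java.util.Arrays;
import java.util.List;

final class TupleCountSketch {

  /** One (manufacturer, year) pair that a part fits. */
  static final class Model {
    final String manufacturer;
    final int year;

    Model(String manufacturer, int year) {
      this.manufacturer = manufacturer;
      this.year = year;
    }
  }

  /** Sub-field matching: a part counts if some manufacturer matches and some year matches. */
  static int countBySubFields(List<List<Model>> parts, List<Model> wanted) {
    int count = 0;
    for (List<Model> part : parts) {
      boolean manufacturerHit = false;
      boolean yearHit = false;
      for (Model m : part) {
        for (Model w : wanted) {
          manufacturerHit |= m.manufacturer.equals(w.manufacturer);
          yearHit |= m.year == w.year;
        }
      }
      if (manufacturerHit && yearHit) {
        count++;
      }
    }
    return count;
  }

  /** Tuple matching: a part counts only if one of its models equals a wanted pair exactly. */
  static int countByTuples(List<List<Model>> parts, List<Model> wanted) {
    int count = 0;
    for (List<Model> part : parts) {
      boolean hit = false;
      for (Model m : part) {
        for (Model w : wanted) {
          hit |= m.manufacturer.equals(w.manufacturer) && m.year == w.year;
        }
      }
      if (hit) {
        count++;
      }
    }
    return count;
  }

  /** The two wiper-blade documents from the quoted example. */
  static List<List<Model>> exampleParts() {
    return Arrays.asList(
        Arrays.asList(new Model("Ford", 2010), new Model("Chevy", 2011)),
        Arrays.asList(new Model("Ford", 2011), new Model("Chevy", 2010)));
  }

  /** The query from the quoted example: "Ford 2010" or "Chevy 2011". */
  static List<Model> exampleWanted() {
    return Arrays.asList(new Model("Ford", 2010), new Model("Chevy", 2011));
  }
}
```

On the two example documents, the sub-field count includes "Wiper Blades V2" even though it fits neither requested model, while the tuple count keeps only "Wiper Blades V1".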




[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-14 Thread GitBox



shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r897535013


##
lucene/facet/src/java/org/apache/lucene/facet/facetset/MatchingFacetSetsCounts.java:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.facetset;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.search.ConjunctionUtils;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.BytesRef;
+
+/**
+ * Returns the counts for each given {@link FacetSet}
+ *
+ * @lucene.experimental
+ */
+public class MatchingFacetSetsCounts extends Facets {
+
+  private final FacetSetMatcher[] facetSetMatchers;
+  private final int[] counts;
+  private final String field;
+  private final int totCount;
+
+  /**
+   * Constructs a new instance of matching facet set counts which calculates the countBytes for each
+   * given facet set matcher.
+   */
+  public MatchingFacetSetsCounts(
+      String field, FacetsCollector hits, FacetSetMatcher... facetSetMatchers) throws IOException {
+    if (facetSetMatchers == null || facetSetMatchers.length == 0) {
+      throw new IllegalArgumentException("facetSetMatchers cannot be null or empty");
+    }
+    if (areFacetSetMatcherDimensionsInconsistent(facetSetMatchers)) {
+      throw new IllegalArgumentException("All facet set matchers must be the same dimensionality");
+    }
+    this.field = field;
+    this.facetSetMatchers = facetSetMatchers;
+    this.counts = new int[facetSetMatchers.length];
+    this.totCount = count(field, hits.getMatchingDocs());
+  }
+
+  /** Counts from the provided field. */
+  private int count(String field, List<FacetsCollector.MatchingDocs> matchingDocs)

Review Comment:
   I see your point. I did that mainly to keep fields `final` to denote that 
they do not change after initialization. I realize there's a "side effect" of 
populating the counts array in the method, which sucks (because we can't return 
two values from a method). Is that better, though, than having all fields `final`?
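
One possible way to keep every field `final` without the side effect is to have the counting method return a small holder object. The sketch below is only an illustration of that option, not what the PR does: `CountResult` and `FinalFieldsSketch` are hypothetical names, and `LongPredicate` stands in for `FacetSetMatcher` so the example needs only the JDK.

```java
import java.util.function.LongPredicate;

final class FinalFieldsSketch {

  /** Hypothetical holder for the two values that count() needs to return. */
  static final class CountResult {
    final int[] counts;
    final int totCount;

    CountResult(int[] counts, int totCount) {
      this.counts = counts;
      this.totCount = totCount;
    }
  }

  private final int[] counts;
  private final int totCount;

  FinalFieldsSketch(long[] values, LongPredicate... matchers) {
    // count() is static and touches no fields, so both fields are assigned exactly once here.
    CountResult result = count(values, matchers);
    this.counts = result.counts;
    this.totCount = result.totCount;
  }

  /** Counts matches per matcher, plus the number of values matched by at least one matcher. */
  private static CountResult count(long[] values, LongPredicate[] matchers) {
    int[] counts = new int[matchers.length];
    int totCount = 0;
    for (long value : values) {
      boolean matched = false;
      for (int i = 0; i < matchers.length; i++) {
        if (matchers[i].test(value)) {
          counts[i]++;
          matched = true;
        }
      }
      if (matched) {
        totCount++;
      }
    }
    return new CountResult(counts, totCount);
  }

  int[] counts() {
    return counts.clone();
  }

  int totCount() {
    return totCount;
  }
}
```

The cost is one short-lived allocation per constructor call; on newer Java versions a `record` would make the holder a one-liner.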




[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-14 Thread GitBox



shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r897535157


##
lucene/facet/src/java/org/apache/lucene/facet/facetset/MatchingFacetSetsCounts.java:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.facetset;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.search.ConjunctionUtils;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.BytesRef;
+
+/**
+ * Returns the counts for each given {@link FacetSet}
+ *
+ * @lucene.experimental
+ */
+public class MatchingFacetSetsCounts extends Facets {
+
+  private final FacetSetMatcher[] facetSetMatchers;
+  private final int[] counts;
+  private final String field;
+  private final int totCount;
+
+  /**
+   * Constructs a new instance of matching facet set counts which calculates the countBytes for each

Review Comment:
   This is an IDEA refactor side-effect, obviously an error :).




[GitHub] [lucene] LuXugang merged pull request #957: LUCENE-10598: (backport) SortedSetDocValues#docValueCount() should be always greater than zero

2022-06-14 Thread GitBox



LuXugang merged PR #957:
URL: https://github.com/apache/lucene/pull/957



[jira] [Commented] (LUCENE-10598) SortedSetDocValues#docValueCount() should be always greater than zero

2022-06-14 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554367#comment-17554367
 ] 

ASF subversion and git services commented on LUCENE-10598:
--

Commit 90b5d5383f1ced8d567dc02462ac7632a5e5949d in lucene's branch 
refs/heads/branch_9x from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=90b5d5383f1 ]

LUCENE-10598: (backport) SortedSetDocValues#docValueCount() should be always 
greater than zero (#957)

* LUCENE-10598: SortedSetDocValues#docValueCount() should be always greater 
than zero (#934)

* LUCENE-10598: Use count to record docValueCount similar to 
SortedNumericDocValues did (#942)

* Fix docValueCount() on Lucene70  sorted set doc values.

> SortedSetDocValues#docValueCount() should be always greater than zero
> -
>
> Key: LUCENE-10598
> URL: https://issues.apache.org/jira/browse/LUCENE-10598
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Lu Xugang
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This test fails.
> {code:java}
>   public void testDocValueCount() throws IOException {
>   try (Directory d = newDirectory()) {
> try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
>   for (int j = 0; j < 1; j++) {
> Document doc = new Document();
> doc.add(new SortedSetDocValuesField("field", new BytesRef("a")));
> doc.add(new SortedSetDocValuesField("field", new BytesRef("a")));
> doc.add(new SortedSetDocValuesField("field", new BytesRef("b")));
> w.addDocument(doc);
>   }
> }
> try (IndexReader reader = DirectoryReader.open(d)) {
>   assertEquals(1, reader.leaves().size());
>   for (LeafReaderContext leaf : reader.leaves()) {
> SortedSetDocValues docValues = leaf.reader().getSortedSetDocValues("field");
> for (int doc1 = docValues.nextDoc(); doc1 != DocIdSetIterator.NO_MORE_DOCS; doc1 = docValues.nextDoc()) {
>   assert docValues.docValueCount() > 0;
> }
>   }
> }
> }
>   }
> {code}




[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-14 Thread GitBox



shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r897535559


##
lucene/facet/src/java/org/apache/lucene/facet/facetset/RangeFacetSetMatcher.java:
##
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.facetset;
+
+import java.util.Arrays;
+
+/**
+ * A {@link FacetSetMatcher} which considers a set as a match if all dimensions fall within the
+ * given corresponding range.
+ *
+ * @lucene.experimental
+ */
+public class RangeFacetSetMatcher extends FacetSetMatcher {
+
+  private final long[] lowerRanges;
+  private final long[] upperRanges;
+
+  /**
+   * Constructs and instance to match facet sets with dimensions that fall within the given ranges.

Review Comment:
   Done




[jira] [Commented] (LUCENE-10600) SortedSetDocValues#docValueCount should be an int, not long

2022-06-14 Thread Lu Xugang (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554374#comment-17554374
 ] 

Lu Xugang commented on LUCENE-10600:


{quote}but this class is only used for flushes and flushes have a hard bound of 
~2GB per thread so you can't have more than Integer.MAX_VALUE unique terms in a 
flush. However, the unique count of terms can grow through merges beyond 
Integer.MAX_VALUE{quote}

Thanks for the explanation!
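
The arithmetic behind the quoted explanation can be sketched as follows. This is only an illustration under stated assumptions: `TermCountSketch` and `mergedUpperBound` are hypothetical names, and the sum is an upper bound because a real merge deduplicates terms shared between segments.

```java
/** Hypothetical illustration of why a merged term count may need a long. */
final class TermCountSketch {

  /** Sums per-flush unique-term counts (each fits in an int) into a long total. */
  static long mergedUpperBound(int... perFlushCounts) {
    long total = 0;
    for (int count : perFlushCounts) {
      total += count; // widened to long, so the sum cannot wrap around
    }
    return total;
  }

  public static void main(String[] args) {
    long total = mergedUpperBound(Integer.MAX_VALUE, Integer.MAX_VALUE);
    System.out.println(total);       // exceeds Integer.MAX_VALUE
    System.out.println((int) total); // narrowing to int loses the value
  }
}
```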

> SortedSetDocValues#docValueCount should be an int, not long
> ---
>
> Key: LUCENE-10600
> URL: https://issues.apache.org/jira/browse/LUCENE-10600
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Lu Xugang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-14 Thread GitBox



shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r897538192


##
lucene/facet/src/java/org/apache/lucene/facet/facetset/FacetSetsField.java:
##
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.facetset;
+
+import java.util.Arrays;
+import org.apache.lucene.document.BinaryDocValuesField;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.util.BytesRef;
+
+/**
+ * A {@link BinaryDocValuesField} which encodes a list of {@link FacetSet facet sets}. The encoding
+ * scheme consists of a packed {@code long[]} where the first value denotes the number of dimensions
+ * in all the sets, followed by each set's values.
+ *
+ * @lucene.experimental
+ */
+public class FacetSetsField extends BinaryDocValuesField {
+
+  /**
+   * Create a new FacetSets field.
+   *
+   * @param name field name
+   * @param facetSets the {@link FacetSet} to index in that field. All must have the same number of
+   * dimensions
+   * @throws IllegalArgumentException if the field name is null or the given facet sets are invalid
+   */
+  public static FacetSetsField create(String name, FacetSet... facetSets) {
+    validateFacetSets(facetSets);
+
+    return new FacetSetsField(name, toPackedLongs(facetSets));
+  }
+
+  private FacetSetsField(String name, BytesRef value) {
+    super(name, value);
+  }
+
+  private static void validateFacetSets(FacetSet... facetSets) {

Review Comment:
   Good idea
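
The encoding described in the Javadoc above (a packed `long[]` whose first value is the number of dimensions, followed by each set's values) can be sketched with plain JDK classes. The names `PackedFacetSetsSketch`, `pack`, and `unpack` are hypothetical, and `ByteBuffer`'s default big-endian layout is an assumption. The PR imports `LongPoint`, so its actual byte encoding may well differ from this.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of a "dims first, then values" packing; not the PR's toPackedLongs. */
final class PackedFacetSetsSketch {

  /** Packs the sets as [dims, set0 values..., set1 values..., ...], 8 bytes per long. */
  static byte[] pack(long[]... sets) {
    if (sets == null || sets.length == 0) {
      throw new IllegalArgumentException("sets cannot be null or empty");
    }
    int dims = sets[0].length;
    if (dims == 0) {
      throw new IllegalArgumentException("sets must have at least one dimension");
    }
    ByteBuffer buf = ByteBuffer.allocate(Long.BYTES * (1 + dims * sets.length));
    buf.putLong(dims);
    for (long[] set : sets) {
      if (set.length != dims) {
        throw new IllegalArgumentException("all sets must have the same number of dimensions");
      }
      for (long value : set) {
        buf.putLong(value);
      }
    }
    return buf.array();
  }

  /** Reverses pack: reads dims, then derives the number of sets from the remaining bytes. */
  static long[][] unpack(byte[] packed) {
    ByteBuffer buf = ByteBuffer.wrap(packed);
    int dims = (int) buf.getLong();
    int numSets = buf.remaining() / (Long.BYTES * dims);
    long[][] sets = new long[numSets][dims];
    for (long[] set : sets) {
      for (int i = 0; i < dims; i++) {
        set[i] = buf.getLong();
      }
    }
    return sets;
  }

  public static void main(String[] args) {
    byte[] packed = pack(new long[] {1, 2}, new long[] {3, 4});
    System.out.println(packed.length + " bytes, " + unpack(packed).length + " sets");
  }
}
```

Storing the dimension count once up front is what lets a reader recover the set boundaries without a per-set length prefix, which is why all sets must share the same number of dimensions.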




[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-14 Thread GitBox



shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r897537151


##
lucene/facet/src/java/org/apache/lucene/facet/facetset/FacetSetsField.java:
##
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.facetset;
+
+import java.util.Arrays;
+import org.apache.lucene.document.BinaryDocValuesField;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.util.BytesRef;
+
+/**
+ * A {@link BinaryDocValuesField} which encodes a list of {@link FacetSet facet sets}. The encoding
+ * scheme consists of a packed {@code long[]} where the first value denotes the number of dimensions
+ * in all the sets, followed by each set's values.
+ *
+ * @lucene.experimental
+ */
+public class FacetSetsField extends BinaryDocValuesField {
+
+  /**
+   * Create a new FacetSets field.
+   *
+   * @param name field name
+   * @param facetSets the {@link FacetSet} to index in that field. All must have the same number of
+   * dimensions
+   * @throws IllegalArgumentException if the field name is null or the given facet sets are invalid
+   */
+  public static FacetSetsField create(String name, FacetSet... facetSets) {
+    validateFacetSets(facetSets);
+
+    return new FacetSetsField(name, toPackedLongs(facetSets));
+  }
+
+  private FacetSetsField(String name, BytesRef value) {
+    super(name, value);
+  }
+
+  private static void validateFacetSets(FacetSet... facetSets) {
+    if (facetSets == null || facetSets.length == 0) {
+      throw new IllegalArgumentException("FacetSets cannot be null or empty!");
+    }
+
+    int dims = facetSets[0].values.length;
+    if (!Arrays.stream(facetSets).allMatch(facetSet -> facetSet.values.length == dims)) {

Review Comment:
   Wasn't aware of this preference in the code base, will change to `anyMatch`
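
The check in the hunk above negates `allMatch`. By De Morgan's law the same condition can be expressed without the leading `!` by using `anyMatch` with the inverted predicate. A small sketch of the equivalence, with plain `long[]` arrays standing in for `FacetSet.values` (the class and method names here are hypothetical):

```java
import java.util.Arrays;

/** Hypothetical sketch comparing two ways to detect a dimension mismatch. */
final class DimsCheckSketch {

  /** The form in the quoted hunk: "not all sets have the expected length". */
  static boolean mismatchWithNegatedAllMatch(long[][] sets) {
    int dims = sets[0].length;
    return !Arrays.stream(sets).allMatch(set -> set.length == dims);
  }

  /** Equivalent form: "some set has a different length". */
  static boolean mismatchWithAnyMatch(long[][] sets) {
    int dims = sets[0].length;
    return Arrays.stream(sets).anyMatch(set -> set.length != dims);
  }

  public static void main(String[] args) {
    long[][] uniform = {{1, 2}, {3, 4}};
    long[][] ragged = {{1, 2}, {3}};
    System.out.println(mismatchWithNegatedAllMatch(uniform) == mismatchWithAnyMatch(uniform));
    System.out.println(mismatchWithNegatedAllMatch(ragged) == mismatchWithAnyMatch(ragged));
  }
}
```

Note that `noneMatch` with the original predicate (`length == dims`) would test something different: it is true only when no set has the expected length.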




[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-14 Thread GitBox



shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r897535976


##
lucene/facet/src/java/org/apache/lucene/facet/facetset/ExactFacetSetMatcher.java:
##
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.facetset;
+
+/**
+ * A {@link FacetSetMatcher} which considers a set as a match only if all dimension values are equal
+ * to the given one.
+ *
+ * @lucene.experimental
+ */
+public class ExactFacetSetMatcher extends FacetSetMatcher {
+
+  private final long[] values;
+
+  /** Constructs an instance to match the given facet set. */
+  public ExactFacetSetMatcher(String label, FacetSet facetSet) {
+    super(label, facetSet.values.length);
+    this.values = facetSet.values;
+  }
+
+  @Override
+  public boolean matches(long[] dimValues) {
+    assert dimValues.length == dims
+        : "Encoded dimensions (dims="
+            + dimValues.length
+            + ") is incompatible with FacetSet dimensions (dims="
+            + dims
+            + ")";
+
+    for (int i = 0; i < dimValues.length; i++) {
+      if (dimValues[i] != values[i]) {
+        // Field's dimension value is not equal to given dimension, the entire set is rejected
+        return false;
+      }
+    }
+    return true;

Review Comment:
   I thought we wanted to avoid calling other methods from such hot code, but 
yeah, `Arrays.equals` may well be faster. 👍 
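
For reference, here is a sketch of the two forms being compared, using plain arrays rather than the PR's classes (`ExactMatchSketch` and its method names are hypothetical). For arrays of equal length, which the assertion in the hunk requires, both return the same result. Whether `Arrays.equals` is actually faster depends on the JDK and should be measured rather than assumed.

```java
import java.util.Arrays;

/** Hypothetical sketch comparing a manual loop with Arrays.equals for long[] equality. */
final class ExactMatchSketch {

  /** Manual element-by-element comparison; assumes both arrays have the same length. */
  static boolean matchesWithLoop(long[] values, long[] dimValues) {
    for (int i = 0; i < dimValues.length; i++) {
      if (dimValues[i] != values[i]) {
        return false; // first differing dimension rejects the set
      }
    }
    return true;
  }

  /** Same check through the JDK; also returns false when the lengths differ. */
  static boolean matchesWithArraysEquals(long[] values, long[] dimValues) {
    return Arrays.equals(values, dimValues);
  }

  public static void main(String[] args) {
    long[] values = {1, 2, 3};
    System.out.println(matchesWithLoop(values, new long[] {1, 2, 3}));
    System.out.println(matchesWithArraysEquals(values, new long[] {1, 2, 4}));
  }
}
```

One behavioral difference worth keeping in mind: `Arrays.equals` compares lengths first, while the loop relies on the caller to guarantee them.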




[GitHub] [lucene] gautamworah96 commented on a diff in pull request #922: Index only the docs for FacetField posting list

2022-06-14 Thread GitBox



gautamworah96 commented on code in PR #922:
URL: https://github.com/apache/lucene/pull/922#discussion_r897570270


##
lucene/facet/src/java/org/apache/lucene/facet/FacetField.java:
##
@@ -30,14 +30,12 @@
  */
 public class FacetField extends Field {
 
-  /** Field type used for storing facet values: docs, freqs, and positions. */
+  /**
+   * Field type used for storing facet values. Actual field type used for indexing is determined in
+   * {@link FacetsConfig#build(TaxonomyWriter, Document)}
+   */
   public static final FieldType TYPE = new FieldType();
 
-  static {

Review Comment:
   Yeah, tbh, I was debating whether this change was even needed or not. Let's 
keep it.




[GitHub] [lucene] gautamworah96 commented on pull request #922: Index only the docs for FacetField posting list

2022-06-14 Thread GitBox



gautamworah96 commented on PR #922:
URL: https://github.com/apache/lucene/pull/922#issuecomment-1156032542

   > I think we might be doing the right thing already? If you look at 
StringField, we are setting: setIndexOptions(IndexOptions.DOCS)
   
   Yes, that is indeed the case.
   
   Thanks for taking a look at this change @gsmiller. I had committed my 
changes so as to not lose them with time. I'll explicitly request a review 
through the UI the next time I commit my partial work :)   
   
   



[jira] [Created] (LUCENE-10617) Investigate recent Jenkins build failures in TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler

2022-06-14 Thread Gautam Worah (Jira)

Gautam Worah created LUCENE-10617:
-

 Summary: Investigate recent Jenkins build failures in 
TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler
 Key: LUCENE-10617
 URL: https://issues.apache.org/jira/browse/LUCENE-10617
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Gautam Worah


Sample failures: https://jenkins.thetaphi.de/job/Lucene-9.x-MacOSX/692/ and 
https://jenkins.thetaphi.de/job/Lucene-main-MacOSX/8177/

 



