[jira] [Created] (LUCENE-10680) UnifiedHighlighter's term extraction not working for some query rewrites

2022-08-17 Thread Yannick Welsch (Jira)
Yannick Welsch created LUCENE-10680:
---

 Summary: UnifiedHighlighter's term extraction not working for some 
query rewrites
 Key: LUCENE-10680
 URL: https://issues.apache.org/jira/browse/LUCENE-10680
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/highlighter
Reporter: Yannick Welsch


UnifiedHighlighter rewrites the query against an empty index when extracting 
the terms from the query (see 
https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java#L149).

The rewrite step can unfortunately drop the terms that are to be extracted.

Take for example the boolean query "+field:value 
-ConstantScore(FieldExistsQuery [field=other_field])" when highlighting on 
"field".

On an empty index, the `FieldExistsQuery` rewrites to a `MatchAllDocsQuery`; as a 
`MUST_NOT` clause, this in turn causes the overall boolean query to rewrite to a 
`MatchNoDocsQuery`, dropping the `MUST` clause in the process, which means the 
`field:value` term is never extracted.
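
To make the failure mode concrete, here is a minimal sketch that mirrors the example 
above (my own illustration, not the highlighter's code; the class name and the 
expected outputs in the comments are assumptions based on the description):

{noformat}
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class RewriteDropsTermSketch {
  public static void main(String[] args) throws Exception {
    // +field:value -ConstantScore(FieldExistsQuery [field=other_field])
    Query query =
        new BooleanQuery.Builder()
            .add(new TermQuery(new Term("field", "value")), BooleanClause.Occur.MUST)
            .add(
                new ConstantScoreQuery(new FieldExistsQuery("other_field")),
                BooleanClause.Occur.MUST_NOT)
            .build();

    // Rewrite against an empty reader, similar in spirit to what the highlighter does.
    IndexSearcher emptySearcher = new IndexSearcher(new MultiReader());
    Query rewritten = emptySearcher.rewrite(query);
    System.out.println(rewritten); // expected: a MatchNoDocsQuery, per the description above

    // Term extraction on the rewritten query then finds nothing to highlight.
    Set<Term> terms = new HashSet<>();
    rewritten.visit(QueryVisitor.termCollector(terms));
    System.out.println(terms); // expected: empty, even though field:value should be highlighted
  }
}
{noformat}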



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10680) UnifiedHighlighter's term extraction not working for some query rewrites

2022-08-17 Thread Alan Woodward (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580735#comment-17580735
 ] 

Alan Woodward commented on LUCENE-10680:


I think the `rewrite` call here is actually unnecessary, and indeed has been 
since we switched to using QueryVisitors. Removing it doesn't cause any tests 
to fail either.
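
For reference, a rough sketch of what visitor-based extraction can look like 
without any rewrite (illustrative only, not the actual UnifiedHighlighter code; 
the helper class and method names are made up):

{noformat}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;

class TermExtractionSketch {
  // Collect the terms of one field directly from the original (un-rewritten) query.
  static Set<Term> extractTerms(Query query, String highlightedField) {
    Set<Term> terms = new HashSet<>();
    query.visit(
        new QueryVisitor() {
          @Override
          public boolean acceptField(String field) {
            return highlightedField.equals(field);
          }

          @Override
          public void consumeTerms(Query q, Term... queryTerms) {
            terms.addAll(Arrays.asList(queryTerms));
          }
        });
    // The default getSubVisitor() skips MUST_NOT sub-queries, which is what
    // highlighting wants anyway.
    return terms;
  }
}
{noformat}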

> UnifiedHighlighter's term extraction not working for some query rewrites
> 
>
> Key: LUCENE-10680
> URL: https://issues.apache.org/jira/browse/LUCENE-10680
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/highlighter
>Reporter: Yannick Welsch
>Priority: Minor
>
> UnifiedHighlighter rewrites the query against an empty index when extracting 
> the terms from the query (see 
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java#L149).
> The rewrite step can unfortunately drop the terms that are to be extracted.
> Take for example the boolean query "+field:value 
> -ConstantScore(FieldExistsQuery [field=other_field])" when highlighting on 
> "field".
> The `FieldExistsQuery` rewrites on an empty index to a `MatchAllDocsQuery`, 
> and as a `MUST_NOT` clause rewrites the overall boolean query to a 
> `MatchNoDocsQuery`, dropping the `MUST` clause in the process, which means 
> that the `field:value` term is not being extracted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10680) UnifiedHighlighter's term extraction not working for some query rewrites

2022-08-17 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580874#comment-17580874
 ] 

Julie Tibshirani commented on LUCENE-10680:
---

Thanks for debugging this [~ywelsch]. It seems like the same problem as 
https://issues.apache.org/jira/browse/LUCENE-10454. Maybe we could close this 
in favor of that one to keep discussion in one place.

> UnifiedHighlighter's term extraction not working for some query rewrites
> 
>
> Key: LUCENE-10680
> URL: https://issues.apache.org/jira/browse/LUCENE-10680
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/highlighter
>Reporter: Yannick Welsch
>Priority: Minor
>
> UnifiedHighlighter rewrites the query against an empty index when extracting 
> the terms from the query (see 
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java#L149).
> The rewrite step can unfortunately drop the terms that are to be extracted.
> Take for example the boolean query "+field:value 
> -ConstantScore(FieldExistsQuery [field=other_field])" when highlighting on 
> "field".
> The `FieldExistsQuery` rewrites on an empty index to a `MatchAllDocsQuery`, 
> and as a `MUST_NOT` clause rewrites the overall boolean query to a 
> `MatchNoDocsQuery`, dropping the `MUST` clause in the process, which means 
> that the `field:value` term is not being extracted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file

2022-08-17 Thread Jira
Luís Filipe Nassif created LUCENE-10681:
---

 Summary: ArrayIndexOutOfBoundsException while indexing large 
binary file
 Key: LUCENE-10681
 URL: https://issues.apache.org/jira/browse/LUCENE-10681
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/index
Affects Versions: 9.2
 Environment: Linux Ubuntu (will check the user version), java x64 
version 11.0.16.1
Reporter: Luís Filipe Nassif


Hello,

I looked for a similar issue but didn't find one, so I'm creating this; sorry if 
it was reported before. We upgraded from Lucene 5.5.5 to 9.2.0 recently, and a 
user reported the error below while indexing a huge binary file in a 
parent-children schema, where strings extracted from the huge binary file (using 
the strings command) are indexed as thousands of ~10MB children docs of the 
parent metadata document:

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds 
for length 71428
    at 
org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at iped.engine.task.index.IndexTask.process(IndexTask.java:148) 
~[iped-engine-4.0.2.jar:?]
    at 
iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) 
~[iped-engine-4.0.2.jar:?]

 

This seems like an integer overflow to me, but I'm not sure... It didn't happen 
with the previous Lucene 5.5.5, and indexing files like this is pretty common for 
us, although with Lucene 5.5.5 we used to break that huge file manually before 
indexing, calling the IndexWriter.addDocument(Document) method several times, 
once for each 10MB chunk; now we are using the IndexWriter.addDocuments(Iterable) 
method with Lucene 9.2.0... Any thoughts?
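
For context, here is a minimal sketch of the two indexing patterns being compared 
(the helper class, method names, and the "content" field are my own illustration, 
not the actual IPED code):

{noformat}
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

class ChunkIndexingSketch {
  // Lucene 5.x-era pattern: each ~10MB chunk is added as its own top-level document.
  static void addPerChunk(IndexWriter writer, List<String> chunks, Document parentDoc)
      throws Exception {
    for (String chunk : chunks) {
      Document child = new Document();
      child.add(new TextField("content", chunk, Field.Store.NO));
      writer.addDocument(child);
    }
    writer.addDocument(parentDoc);
  }

  // Lucene 9.x pattern from this report: all children plus the parent are indexed
  // together as one atomic block via addDocuments(Iterable).
  static void addAsBlock(IndexWriter writer, List<String> chunks, Document parentDoc)
      throws Exception {
    List<Document> block = new ArrayList<>();
    for (String chunk : chunks) {
      Document child = new Document();
      child.add(new TextField("content", chunk, Field.Store.NO));
      block.add(child);
    }
    block.add(parentDoc);
    writer.addDocuments(block);
  }
}
{noformat}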



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order

2022-08-17 Thread GitBox


gsmiller commented on code in PR #1013:
URL: https://github.com/apache/lucene/pull/1013#discussion_r948197598


##
lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeFacetCounts.java:
##
@@ -455,9 +500,9 @@ public void testEmptyRangesMultiValued() throws Exception {
 
 Facets facets = new LongRangeFacetCounts("field", fc);
 
-FacetResult result = facets.getAllChildren("field");
-assertEquals("dim=field path=[] value=0 childCount=0\n", 
result.toString());
-result = facets.getTopChildren(1, "field");
+assertFacetResult(
+facets.getAllChildren("field"), "field", new String[0], 0, 0, new 
LabelAndValue[] {});

Review Comment:
   minor: `new LabelAndValue[0]`?



##
lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeFacetCounts.java:
##
@@ -424,9 +469,9 @@ public void testEmptyRangesSingleValued() throws Exception {
 
 Facets facets = new LongRangeFacetCounts("field", fc);
 
-FacetResult result = facets.getAllChildren("field");
-assertEquals("dim=field path=[] value=0 childCount=0\n", 
result.toString());
-result = facets.getTopChildren(1, "field");
+assertFacetResult(
+facets.getAllChildren("field"), "field", new String[0], 0, 0, new 
LabelAndValue[] {});

Review Comment:
   minor: for consistency here, I'd suggest `new LabelAndValue[0]`?



##
lucene/CHANGES.txt:
##
@@ -52,6 +52,8 @@ Improvements
 
 * LUCENE-10614: Properly support getTopChildren in RangeFacetCounts. (Yuting 
Gan)
 
+* LUCENE-10644: Facets#getAllChildren testing should ignore child order. 
(Yuting Gan)

Review Comment:
   We don't need to wait for 10.0 to release this do we? Should we try to 
release this with 9.4?



##
lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeFacetCounts.java:
##
@@ -100,12 +100,21 @@ public void testBasicLong() throws Exception {
 new LongRange("90 or above", 90L, true, 100L, false),
 new LongRange("over 1000", 1000L, false, Long.MAX_VALUE, true));
 
-FacetResult result = facets.getAllChildren("field");
-assertEquals(
-"dim=field path=[] value=22 childCount=5\n  less than 10 (10)\n  less 
than or equal to 10 (11)\n  over 90 (9)\n  90 or above (10)\n  over 1000 (1)\n",
-result.toString());
+assertFacetResult(

Review Comment:
   So our implementation (and javadoc) for `RangeFacetCounts#getAllChildren` 
specifies that we _do_ actually guarantee child ordering. We should probably 
make sure our tests actually _do_ check ordering for range faceting, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file

2022-08-17 Thread Jira


 [ 
https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luís Filipe Nassif updated LUCENE-10681:

Description: 
Hello,

I looked for a similar issue, but didn't find one, so I'm creating this, sorry 
if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 recently and 
an user reported error below while indexing a huge binary file in a 
parent-children schema where strings extracted from the huge binary file (using 
strings command) are indexed as thousands of ~10MB children text docs of the 
parent metadata document:

 

{{Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of 
bounds for length 71428}}
{{    at 
org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at iped.engine.task.index.IndexTask.process(IndexTask.java:148) 
~[iped-engine-4.0.2.jar:?]}}
{{    at 
iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) 
~[iped-engine-4.0.2.jar:?]}}

 

This seems an integer overflow to me, not sure... It didn't use to happen with 
previous lucene-5.5.5 and indexing files like this is pretty common to us, 
although with lucene-5.5.5 we used to break that huge file manually before 
indexing and to index using IndexWriter.addDocument(Document) method several 
times for each 10MB chunk, now we are using the 
IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts?

  was:
Hello,

I looked for a similar issue, but didn't find one, so I'm creating this, sorry 
if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 recently and 
an user reported error below while indexing a huge binary file in a 
parent-children schema where strings extracted from the huge binary file (using 
strings command) are indexed as thousands of ~10MB children docs of the parent 
metadata document:

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds 
for length 71428
    at 
org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apa

[jira] [Updated] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file

2022-08-17 Thread Jira


 [ 
https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luís Filipe Nassif updated LUCENE-10681:

Description: 
Hello,

I looked for a similar issue, but didn't find one, so I'm creating this, sorry 
if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 recently and 
an user reported error below while indexing a huge binary file in a 
parent-children schema where strings extracted from the huge binary file (using 
strings command) are indexed as thousands of ~10MB children text docs of the 
parent metadata document:

 
{noformat}
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds 
for length 71428
    at 
org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432)
 ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at 
org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]
    at iped.engine.task.index.IndexTask.process(IndexTask.java:148) 
~[iped-engine-4.0.2.jar:?]
    at 
iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) 
~[iped-engine-4.0.2.jar:?]{noformat}
 

This seems an integer overflow to me, not sure... It didn't use to happen with 
previous lucene-5.5.5 and indexing files like this is pretty common to us, 
although with lucene-5.5.5 we used to break that huge file manually before 
indexing and to index using IndexWriter.addDocument(Document) method several 
times for each 10MB chunk, now we are using the 
IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts?

  was:
Hello,

I looked for a similar issue, but didn't find one, so I'm creating this, sorry 
if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 recently and 
an user reported error below while indexing a huge binary file in a 
parent-children schema where strings extracted from the huge binary file (using 
strings command) are indexed as thousands of ~10MB children text docs of the 
parent metadata document:

 

{{Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of 
bounds for length 71428}}
{{    at 
org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
romseygeek - 2022-05-19 15:10:13]}}
{{    at 
org.apache.lucene.index.FreqPro

[GitHub] [lucene] gsmiller commented on a diff in pull request #1058: LUCENE-10207: TermInSetQuery now provides a ScoreSupplier with cost estimation for use in TermInSetQuery

2022-08-17 Thread GitBox


gsmiller commented on code in PR #1058:
URL: https://github.com/apache/lucene/pull/1058#discussion_r948265844


##
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##
@@ -345,15 +345,62 @@ public BulkScorer bulkScorer(LeafReaderContext context) 
throws IOException {
   }
 
   @Override
-  public Scorer scorer(LeafReaderContext context) throws IOException {
-final WeightOrDocIdSet weightOrBitSet = rewrite(context);
-if (weightOrBitSet == null) {
-  return null;
-} else if (weightOrBitSet.weight != null) {
-  return weightOrBitSet.weight.scorer(context);
-} else {
-  return scorer(weightOrBitSet.set);
+  public ScorerSupplier scorerSupplier(LeafReaderContext context) throws 
IOException {
+// Cost estimation reasoning is:
+//  1. Assume every query term matches at least one document 
(queryTermsCount).
+//  2. Determine the total number of docs beyond the first one for 
each term.
+// That count provides a ceiling on the number of extra docs that 
could match beyond
+// that first one. (We omit the first since it's already been 
counted in #1).
+// This approach still provides correct worst-case cost in general, 
but provides tighter
+// estimates for primary-key-like fields. See: LUCENE-10207
+
+// TODO: This cost estimation may grossly overestimate since we have 
no index statistics
+// for the specific query terms. While it's nice to avoid the cost of 
intersecting the
+// query terms with the index, it could be beneficial to do that work 
and get better
+// cost estimates.
+final long cost;
+final long queryTermsCount = termData.size();
+Terms indexTerms = context.reader().terms(field);
+long potentialExtraCost = indexTerms.getSumDocFreq();
+final long indexedTermCount = indexTerms.size();
+if (indexedTermCount != -1) {
+  potentialExtraCost -= indexedTermCount;
 }
+cost = queryTermsCount + potentialExtraCost;
+
+final Weight weight = this;
+return new ScorerSupplier() {
+  @Override
+  public Scorer get(long leadCost) throws IOException {
+WeightOrDocIdSet weightOrDocIdSet = rewrite(context);
+if (weightOrDocIdSet == null) {
+  return null;
+}
+
+final Scorer scorer;
+if (weightOrDocIdSet.weight != null) {
+  scorer = weightOrDocIdSet.weight.scorer(context);
+} else {
+  scorer = scorer(weightOrDocIdSet.set);
+}
+
+return Objects.requireNonNullElseGet(
+scorer,
+() ->
+new ConstantScoreScorer(weight, score(), scoreMode, 
DocIdSetIterator.empty()));
+  }
+
+  @Override
+  public long cost() {
+return cost;

Review Comment:
   @msokolov when you have a chance, I'm curious what you think about this ^^. 
Thanks!
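   
   To make the estimate in the diff above concrete (hypothetical numbers, not from 
any real index): for a query with queryTermsCount = 3 against a segment whose field 
has sumDocFreq = 1,000 and 400 indexed terms, potentialExtraCost = 1,000 - 400 = 600 
and cost = 3 + 600 = 603; if the indexed term count is unavailable (size() returns 
-1), the estimate falls back to 3 + 1,000 = 1,003.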



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10318) Reuse HNSW graphs when merging segments?

2022-08-17 Thread Jack Mazanec (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580935#comment-17580935
 ] 

Jack Mazanec commented on LUCENE-10318:
---

Hi [~julietibs], I was thinking about something similar and would be interested 
in working on this. I can run some experiments to see if this would improve 
performance, if you haven’t already started to do so.

Additionally, I am wondering if it would make sense to extend this to support 
graphs that contain deleted nodes. I can think of an approach, but it is a 
little messy. It would follow the same idea for merging — add vectors from 
smaller graph into larger graph. However, before adding vectors from smaller 
graph, all of the deleted nodes would need to be removed from the larger graph.

In order to remove a node from the graph, I think we would need to remove it 
from the list of neighbor arrays for each level it is in. In addition to this, 
because removal would break the ordinals, we would have to update all of the 
ordinals in the graph, which for OnHeapHNSW graph would mean updating all nodes 
by levels and also potentially each neighbor in each NeighborArray in the 
graph. 

Because removing a node could cause a number of nodes in the graph to lose a 
neighbor, we would need to repair the graph. To do this, I think we could 
create a _repair_list_ that tracks the nodes that lost a connection due to the 
deleted node. To fill the list, we would need to iterate over all of the 
nodes in the graph and then check if any of their _m_ connections are to the 
deleted node (I think this could be done when the ordinals are being updated). 
If so, remove the connection and then add the node to the _repair_list_.

Once the _repair_list_ is complete, for each node in the list, search the graph 
to get new neighbors to fill up the node’s connections to the desired amount. 
At this point, I would expect the time it takes to finish merging to be equal 
to the time it takes to insert the number of live vectors in the smaller graph 
plus the size of the repair list into the large graph.

All that being said, I am not sure if removing deleted nodes in the graph would 
be faster than just building the graph from scratch. From the logic above, we 
would need to at least iterate over each connection in the graph and 
potentially perform several list deletions. My guess is that when the repair 
list is small it would be faster, but when it is large, probably not. 
I am going to try to start playing around with this idea, but please let me 
know what you think!
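
As a very rough sketch of the two passes described above (all types and methods 
here are hypothetical stand-ins to make the idea concrete; Lucene's HnswGraph is 
not a mutable structure like this):

{noformat}
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical mutable-graph interface, only for illustrating the repair idea.
interface MutableGraph {
  Iterable<Integer> nodes();
  List<Integer> neighborsOf(int node);
  void removeConnection(int node, int neighbor);
  void addConnection(int node, int neighbor);
  void removeNodes(Set<Integer> ords); // also remaps the surviving ordinals
  List<Integer> searchNearest(int node, int k); // graph search from the node's vector
}

class GraphRepairSketch {
  // Pass 1: drop connections that point at deleted nodes and remember which
  // nodes lost a connection (the "repair list").
  static List<Integer> removeDeletedAndCollectRepairs(MutableGraph graph, Set<Integer> deleted) {
    Set<Integer> repairList = new LinkedHashSet<>();
    for (int node : graph.nodes()) {
      if (deleted.contains(node)) {
        continue; // the deleted node itself is dropped below
      }
      for (int neighbor : new ArrayList<>(graph.neighborsOf(node))) {
        if (deleted.contains(neighbor)) {
          graph.removeConnection(node, neighbor);
          repairList.add(node);
        }
      }
    }
    graph.removeNodes(deleted);
    // In a real implementation the repair list would need remapping too, since
    // removeNodes changes ordinals.
    return new ArrayList<>(repairList);
  }

  // Pass 2: top each damaged node back up to the desired fan-out m by searching
  // the graph for replacement neighbors.
  static void repair(MutableGraph graph, List<Integer> repairList, int m) {
    for (int node : repairList) {
      int missing = m - graph.neighborsOf(node).size();
      for (int candidate : graph.searchNearest(node, missing)) {
        graph.addConnection(node, candidate);
      }
    }
  }
}
{noformat}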

> Reuse HNSW graphs when merging segments?
> 
>
> Key: LUCENE-10318
> URL: https://issues.apache.org/jira/browse/LUCENE-10318
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>
> Currently when merging segments, the HNSW vectors format rebuilds the entire 
> graph from scratch. In general, building these graphs is very expensive, and 
> it'd be nice to optimize it in any way we can. I was wondering if during 
> merge, we could choose the largest segment with no deletes, and load its HNSW 
> graph into heap. Then we'd add vectors from the other segments to this graph, 
> through the normal build process. This could cut down on the number of 
> operations we need to perform when building the graph.
> This is just an early idea, I haven't run experiments to see if it would 
> help. I'd guess that whether it helps would also depend on details of the 
> MergePolicy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

2022-08-17 Thread GitBox


gsmiller commented on code in PR #1062:
URL: https://github.com/apache/lucene/pull/1062#discussion_r948336077


##
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##
@@ -363,6 +370,29 @@ public boolean isCacheable(LeafReaderContext ctx) {
 // sets.
 return ramBytesUsed() <= 
RamUsageEstimator.QUERY_DEFAULT_RAM_BYTES_USED;
   }
+
+  static final class MatchAllDocIdSet extends DocIdSet {
+private final int size;

Review Comment:
   Thanks for the suggestion @LuXugang. Yeah, I think exposing an `ALL` 
`DocIdSet` for general use is reasonable. I'll update the PR.
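   
   For reference, a minimal sketch of what such an exposed match-all `DocIdSet` 
could look like (class name and details are assumptions, not necessarily what the 
updated PR will do):

{noformat}
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.RamUsageEstimator;

// Sketch of a DocIdSet whose iterator matches every doc id in [0, maxDoc).
final class AllDocIdSet extends DocIdSet {
  private final int maxDoc;

  AllDocIdSet(int maxDoc) {
    this.maxDoc = maxDoc;
  }

  @Override
  public DocIdSetIterator iterator() {
    return DocIdSetIterator.all(maxDoc);
  }

  @Override
  public long ramBytesUsed() {
    return RamUsageEstimator.shallowSizeOfInstance(AllDocIdSet.class);
  }
}
{noformat}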



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file

2022-08-17 Thread Jira


 [ 
https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luís Filipe Nassif updated LUCENE-10681:

Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1  (was: Linux 
Ubuntu (will check the user version), java x64 version 11.0.16.1)

> ArrayIndexOutOfBoundsException while indexing large binary file
> ---
>
> Key: LUCENE-10681
> URL: https://issues.apache.org/jira/browse/LUCENE-10681
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.2
> Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1
>Reporter: Luís Filipe Nassif
>Priority: Minor
>
> Hello,
> I looked for a similar issue, but didn't find one, so I'm creating this, 
> sorry if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 
> recently and an user reported error below while indexing a huge binary file 
> in a parent-children schema where strings extracted from the huge binary file 
> (using strings command) are indexed as thousands of ~10MB children text docs 
> of the parent metadata document:
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of 
> bounds for length 71428
>     at 
> org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at iped.engine.task.index.IndexTask.process(IndexTask.java:148) 
> ~[iped-engine-4.0.2.jar:?]
>     at 
> iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) 
> ~[iped-engine-4.0.2.jar:?]{noformat}
>  
> This seems an integer overflow to me, not sure... It didn't use to happen 
> with previous lucene-5.5.5 and indexing files like this is pretty common to 
> us, although with lucene-5.5.5 we used to break that huge file manually 
> before indexing and to index using IndexWriter.addDocument(Document) method 
> several times for each 10MB chunk, now we are using the 
> IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.

[jira] [Updated] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file

2022-08-17 Thread Jira


 [ 
https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luís Filipe Nassif updated LUCENE-10681:

Priority: Major  (was: Minor)

> ArrayIndexOutOfBoundsException while indexing large binary file
> ---
>
> Key: LUCENE-10681
> URL: https://issues.apache.org/jira/browse/LUCENE-10681
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.2
> Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1
>Reporter: Luís Filipe Nassif
>Priority: Major
>
> Hello,
> I looked for a similar issue, but didn't find one, so I'm creating this, 
> sorry if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 
> recently and an user reported error below while indexing a huge binary file 
> in a parent-children schema where strings extracted from the huge binary file 
> (using strings command) are indexed as thousands of ~10MB children text docs 
> of the parent metadata document:
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of 
> bounds for length 71428
>     at 
> org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at iped.engine.task.index.IndexTask.process(IndexTask.java:148) 
> ~[iped-engine-4.0.2.jar:?]
>     at 
> iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) 
> ~[iped-engine-4.0.2.jar:?]{noformat}
>  
> This seems an integer overflow to me, not sure... It didn't use to happen 
> with previous lucene-5.5.5 and indexing files like this is pretty common to 
> us, although with lucene-5.5.5 we used to break that huge file manually 
> before indexing and to index using IndexWriter.addDocument(Document) method 
> several times for each 10MB chunk, now we are using the 
> IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file

2022-08-17 Thread Jira


[ 
https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580973#comment-17580973
 ] 

Luís Filipe Nassif commented on LUCENE-10681:
-

Just changed the priority back to the default (Major); I had changed it 
accidentally, but I'm not sure if that is OK.

> ArrayIndexOutOfBoundsException while indexing large binary file
> ---
>
> Key: LUCENE-10681
> URL: https://issues.apache.org/jira/browse/LUCENE-10681
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.2
> Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1
>Reporter: Luís Filipe Nassif
>Priority: Major
>
> Hello,
> I looked for a similar issue, but didn't find one, so I'm creating this, 
> sorry if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 
> recently and an user reported error below while indexing a huge binary file 
> in a parent-children schema where strings extracted from the huge binary file 
> (using strings command) are indexed as thousands of ~10MB children text docs 
> of the parent metadata document:
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of 
> bounds for length 71428
>     at 
> org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at iped.engine.task.index.IndexTask.process(IndexTask.java:148) 
> ~[iped-engine-4.0.2.jar:?]
>     at 
> iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) 
> ~[iped-engine-4.0.2.jar:?]{noformat}
>  
> This seems an integer overflow to me, not sure... It didn't use to happen 
> with previous lucene-5.5.5 and indexing files like this is pretty common to 
> us, although with lucene-5.5.5 we used to break that huge file manually 
> before indexing and to index using IndexWriter.addDocument(Document) method 
> several times for each 10MB chunk, now we are using the 
> IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr

[jira] [Commented] (LUCENE-10318) Reuse HNSW graphs when merging segments?

2022-08-17 Thread Mayya Sharipova (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580979#comment-17580979
 ] 

Mayya Sharipova commented on LUCENE-10318:
--

Thanks for looking into this, Jack. 

We have not done any development on this, but some thoughts from us (maybe Julie 
can add more):
 * Given the way MergePolicy works, it seems to choose segments of approximately 
the same size. So during a merge, we may not have one single big segment whose 
graph we can reuse. So I would imagine that for many use cases it may not be 
worth reusing graphs (especially if the segments are relatively small) - the 
extra complexity would not justify a very small speedup.
 * I agree with your thoughts on deletions: it may also not be worth reusing 
graphs if heavy deletions are present.

So maybe a good start could be to have a very lean prototype with a lot of 
performance benchmarks.

> Reuse HNSW graphs when merging segments?
> 
>
> Key: LUCENE-10318
> URL: https://issues.apache.org/jira/browse/LUCENE-10318
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>
> Currently when merging segments, the HNSW vectors format rebuilds the entire 
> graph from scratch. In general, building these graphs is very expensive, and 
> it'd be nice to optimize it in any way we can. I was wondering if during 
> merge, we could choose the largest segment with no deletes, and load its HNSW 
> graph into heap. Then we'd add vectors from the other segments to this graph, 
> through the normal build process. This could cut down on the number of 
> operations we need to perform when building the graph.
> This is just an early idea, I haven't run experiments to see if it would 
> help. I'd guess that whether it helps would also depend on details of the 
> MergePolicy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10318) Reuse HNSW graphs when merging segments?

2022-08-17 Thread Mayya Sharipova (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580979#comment-17580979
 ] 

Mayya Sharipova edited comment on LUCENE-10318 at 8/17/22 8:01 PM:
---

Thanks for looking into this, Jack. 

We have not done any development on this, but some thoughts from us:
 * Given the way MergePolicy works, it seems to choose segments of approximately 
the same size. So during a merge, we may not have one single big segment whose 
graph we can reuse. So I would imagine that for many use cases it may not be 
worth reusing graphs (especially if the segments are relatively small) - the 
extra complexity would not justify a very small speedup.
 * I agree with your thoughts on deletions: it may also not be worth reusing 
graphs if heavy deletions are present.

So maybe a good start could be to have a very lean prototype with a lot of 
performance benchmarks.


was (Author: mayyas):
Thanks for looking into this, Jack. 

We have not done any development on this, but some thoughts from us (may be 
Julie can add more):
 * Looks like the way MergePolicy works, it chooses segments of approximately 
same size. So during merge, we may not have one single big segment, whose graph 
we can reuse.  So I would imagine for many uses case it may not worth reusing 
graphs (especially if segments are relative small) - extra complexity would not 
justify a very small speedups. 
 * I agree with your thoughts on deletions that it may also not worth reusing 
graphs is some heavy deletions are present.

So may be, a good start  could be have a very lean prototype with a lot of  
performance benchmarks. 

> Reuse HNSW graphs when merging segments?
> 
>
> Key: LUCENE-10318
> URL: https://issues.apache.org/jira/browse/LUCENE-10318
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>
> Currently when merging segments, the HNSW vectors format rebuilds the entire 
> graph from scratch. In general, building these graphs is very expensive, and 
> it'd be nice to optimize it in any way we can. I was wondering if during 
> merge, we could choose the largest segment with no deletes, and load its HNSW 
> graph into heap. Then we'd add vectors from the other segments to this graph, 
> through the normal build process. This could cut down on the number of 
> operations we need to perform when building the graph.
> This is just an early idea, I haven't run experiments to see if it would 
> help. I'd guess that whether it helps would also depend on details of the 
> MergePolicy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10318) Reuse HNSW graphs when merging segments?

2022-08-17 Thread Mayya Sharipova (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580979#comment-17580979
 ] 

Mayya Sharipova edited comment on LUCENE-10318 at 8/17/22 8:02 PM:
---

Thanks for looking into this, Jack. 

We have not done any development on this, but some thoughts from us:
 * Given the way MergePolicy works, it seems to choose segments of approximately 
the same size. So during a merge, we may not have one single big segment whose 
graph we can reuse. So I would imagine that for many use cases it may not be 
worth reusing graphs (especially if the segments are relatively small) - the 
extra complexity would not justify a very small speedup.
 * I agree with your thoughts on deletions: it may also not be worth reusing 
graphs if heavy deletions are present.

So maybe a good start could be to have a very lean prototype with a lot of 
performance benchmarks.


was (Author: mayyas):
Thanks for looking into this, Jack. 

We have not done any development on this, but some thoughts from us:
 * Looks like the way MergePolicy works, it chooses segments of approximately 
same size. So during merge, we may not have one single big segment, whose graph 
we can reuse.  So I would imagine for many uses case it may not worth reusing 
graphs (especially if segments are relative small) - extra complexity would not 
justify a very small speedups. 
 * I agree with your thoughts on deletions that it may also not worth reusing 
graphs if some heavy deletions are present.

So may be, a good start  could be have a very lean prototype with a lot of  
performance benchmarks. 

> Reuse HNSW graphs when merging segments?
> 
>
> Key: LUCENE-10318
> URL: https://issues.apache.org/jira/browse/LUCENE-10318
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>
> Currently when merging segments, the HNSW vectors format rebuilds the entire 
> graph from scratch. In general, building these graphs is very expensive, and 
> it'd be nice to optimize it in any way we can. I was wondering if during 
> merge, we could choose the largest segment with no deletes, and load its HNSW 
> graph into heap. Then we'd add vectors from the other segments to this graph, 
> through the normal build process. This could cut down on the number of 
> operations we need to perform when building the graph.
> This is just an early idea, I haven't run experiments to see if it would 
> help. I'd guess that whether it helps would also depend on details of the 
> MergePolicy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10318) Reuse HNSW graphs when merging segments?

2022-08-17 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581016#comment-17581016
 ] 

Julie Tibshirani commented on LUCENE-10318:
---

[~jmazanec15] it's great you're interested in looking into this! I don't have 
any prototype or experiments, you're welcome to pick it up.

Removing nodes and repairing the graph could be a nice direction. But for now 
we can keep things simple and assume there's a segment without deletes. If 
that's looking good and shows a nice improvement in index/merge benchmarks, 
then we can handle deletes in a follow-up.

> Reuse HNSW graphs when merging segments?
> 
>
> Key: LUCENE-10318
> URL: https://issues.apache.org/jira/browse/LUCENE-10318
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>
> Currently when merging segments, the HNSW vectors format rebuilds the entire 
> graph from scratch. In general, building these graphs is very expensive, and 
> it'd be nice to optimize it in any way we can. I was wondering if during 
> merge, we could choose the largest segment with no deletes, and load its HNSW 
> graph into heap. Then we'd add vectors from the other segments to this graph, 
> through the normal build process. This could cut down on the number of 
> operations we need to perform when building the graph.
> This is just an early idea, I haven't run experiments to see if it would 
> help. I'd guess that whether it helps would also depend on details of the 
> MergePolicy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10318) Reuse HNSW graphs when merging segments?

2022-08-17 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581016#comment-17581016
 ] 

Julie Tibshirani edited comment on LUCENE-10318 at 8/17/22 8:51 PM:


[~jmazanec15] it's great you're interested in looking into this! I don't have 
any prototype or experiments, you're welcome to pick it up.

Removing nodes and repairing the graph could be a nice direction. But for now 
we can keep things simple and assume there's a segment without deletes. If 
that's looking good and shows a nice improvement in index/ merge benchmarks, 
then we can handle deletes in a follow-up.

Edit: Oops, I didn't refresh the page so I missed Mayya's comment. It looks 
like we're in agreement!


was (Author: julietibs):
[~jmazanec15] it's great you're interested in looking into this! I don't have 
any prototype or experiments, you're welcome to pick it up.

Removing nodes and repairing the graph could be a nice direction. But for now 
we can keep things simple and assume there's a segment without deletes. If 
that's looking good and shows a nice improvement in index/ merge benchmarks, 
then we can handle deletes in a follow-up.

> Reuse HNSW graphs when merging segments?
> 
>
> Key: LUCENE-10318
> URL: https://issues.apache.org/jira/browse/LUCENE-10318
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>
> Currently when merging segments, the HNSW vectors format rebuilds the entire 
> graph from scratch. In general, building these graphs is very expensive, and 
> it'd be nice to optimize it in any way we can. I was wondering if during 
> merge, we could choose the largest segment with no deletes, and load its HNSW 
> graph into heap. Then we'd add vectors from the other segments to this graph, 
> through the normal build process. This could cut down on the number of 
> operations we need to perform when building the graph.
> This is just an early idea, I haven't run experiments to see if it would 
> help. I'd guess that whether it helps would also depend on details of the 
> MergePolicy.






[GitHub] [lucene] Yuti-G commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order

2022-08-17 Thread GitBox


Yuti-G commented on code in PR #1013:
URL: https://github.com/apache/lucene/pull/1013#discussion_r948407058


##
lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeFacetCounts.java:
##
@@ -100,12 +100,21 @@ public void testBasicLong() throws Exception {
 new LongRange("90 or above", 90L, true, 100L, false),
 new LongRange("over 1000", 1000L, false, Long.MAX_VALUE, true));
 
-FacetResult result = facets.getAllChildren("field");
-assertEquals(
-"dim=field path=[] value=22 childCount=5\n  less than 10 (10)\n  less 
than or equal to 10 (11)\n  over 90 (9)\n  90 or above (10)\n  over 1000 (1)\n",
-result.toString());
+assertFacetResult(

Review Comment:
   Sorry, I am confused. Our javadoc for Facets#getAllChildren explicitly calls 
out that callers should make _**NO**_ assumptions about child ordering. Isn't 
the purpose of this PR to address the previous tests by having them ignore 
child order, as you described in the LUCENE-10644 Jira issue? 
   
   I know it's been a while, but please refer to our PR comments and confirm 
whether we are misunderstanding something here. Thanks! 
   
   > Thanks @Yuti-G! This approach looks good to me. Is your plan to iterate on 
this PR to stop enforcing the ordering checks in all the tests?
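
   For illustration, a hypothetical order-insensitive check (not the project's actual assertFacetResult utility) could look like this:

```java
import static org.junit.Assert.assertEquals;

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.facet.LabelAndValue;

// Hypothetical helper: compares a FacetResult's children against expected
// label/count pairs without assuming any particular child ordering.
public class OrderInsensitiveFacetAsserts {
  public static void assertChildrenIgnoringOrder(
      String[] expectedLabels, int[] expectedCounts, LabelAndValue[] actualChildren) {
    Map<String, Integer> expected = new HashMap<>();
    for (int i = 0; i < expectedLabels.length; i++) {
      expected.put(expectedLabels[i], expectedCounts[i]);
    }
    Map<String, Integer> actual = new HashMap<>();
    for (LabelAndValue child : actualChildren) {
      actual.put(child.label, child.value.intValue());
    }
    // Equal maps mean the same labels with the same counts, in any order.
    assertEquals(expected, actual);
  }
}
```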






[GitHub] [lucene] jtibshirani opened a new pull request, #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer

2022-08-17 Thread GitBox


jtibshirani opened a new pull request, #1071:
URL: https://github.com/apache/lucene/pull/1071

   This change folds the `RandomAccessVectorValuesProducer` interface into
   `RandomAccessVectorValues`. This reduces the number of interfaces and clarifies
   the cloning/copying behavior.
   
   This is a small simplification related to LUCENE-9583, but does not address the
   main issue.
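
   Roughly, the before/after shape of the interfaces (simplified sketch; the exact method set in the PR may differ):

```java
import java.io.IOException;

// Before: a separate producer interface handed out random-access views of the vectors.
interface RandomAccessVectorValuesProducerSketch {
  RandomAccessVectorValuesSketch randomAccess() throws IOException;
}

// After: the random-access interface itself exposes copy(), so callers that need an
// independent cursor (e.g. during graph construction) no longer need a producer.
interface RandomAccessVectorValuesSketch {
  int size();
  int dimension();
  float[] vectorValue(int targetOrd) throws IOException;
  RandomAccessVectorValuesSketch copy() throws IOException;
}
```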
   





[GitHub] [lucene] jtibshirani commented on a diff in pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer

2022-08-17 Thread GitBox


jtibshirani commented on code in PR #1071:
URL: https://github.com/apache/lucene/pull/1071#discussion_r948528112


##
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##
@@ -783,66 +742,6 @@ private static void usage() {
 System.exit(1);
   }
 
-  class BinaryFileVectors implements RandomAccessVectorValuesProducer, 
Closeable {

Review Comment:
   I wasn't sure this functionality was worth preserving. Let me know though 
and I can restore and refactor it.






[jira] [Updated] (LUCENE-10318) Reuse HNSW graphs when merging segments?

2022-08-17 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-10318:
--
Labels: vector-based-search  (was: )

> Reuse HNSW graphs when merging segments?
> 
>
> Key: LUCENE-10318
> URL: https://issues.apache.org/jira/browse/LUCENE-10318
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>  Labels: vector-based-search
>
> Currently when merging segments, the HNSW vectors format rebuilds the entire 
> graph from scratch. In general, building these graphs is very expensive, and 
> it'd be nice to optimize it in any way we can. I was wondering if during 
> merge, we could choose the largest segment with no deletes, and load its HNSW 
> graph into heap. Then we'd add vectors from the other segments to this graph, 
> through the normal build process. This could cut down on the number of 
> operations we need to perform when building the graph.
> This is just an early idea, I haven't run experiments to see if it would 
> help. I'd guess that whether it helps would also depend on details of the 
> MergePolicy.






[GitHub] [lucene] gsmiller commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order

2022-08-17 Thread GitBox


gsmiller commented on code in PR #1013:
URL: https://github.com/apache/lucene/pull/1013#discussion_r948541210


##
lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeFacetCounts.java:
##
@@ -100,12 +100,21 @@ public void testBasicLong() throws Exception {
 new LongRange("90 or above", 90L, true, 100L, false),
 new LongRange("over 1000", 1000L, false, Long.MAX_VALUE, true));
 
-FacetResult result = facets.getAllChildren("field");
-assertEquals(
-"dim=field path=[] value=22 childCount=5\n  less than 10 (10)\n  less 
than or equal to 10 (11)\n  over 90 (9)\n  90 or above (10)\n  over 1000 (1)\n",
-result.toString());
+assertFacetResult(

Review Comment:
   @Yuti-G I'm referring to the javadoc on `RangeFacetCounts#getAllChildren`, 
which notes an exception to this rule in range counting.






[GitHub] [lucene] Yuti-G commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order

2022-08-17 Thread GitBox


Yuti-G commented on code in PR #1013:
URL: https://github.com/apache/lucene/pull/1013#discussion_r948557786


##
lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeFacetCounts.java:
##
@@ -100,12 +100,21 @@ public void testBasicLong() throws Exception {
 new LongRange("90 or above", 90L, true, 100L, false),
 new LongRange("over 1000", 1000L, false, Long.MAX_VALUE, true));
 
-FacetResult result = facets.getAllChildren("field");
-assertEquals(
-"dim=field path=[] value=22 childCount=5\n  less than 10 (10)\n  less 
than or equal to 10 (11)\n  over 90 (9)\n  90 or above (10)\n  over 1000 (1)\n",
-result.toString());
+assertFacetResult(

Review Comment:
   Thanks for catching this! Sorry for overlooking `range` in the comment. I 
reverted the changes in TestRangeFacetCounts. Please let me know if there is 
any question. Thank you so much for your time!






[GitHub] [lucene] jtibshirani commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-17 Thread GitBox


jtibshirani commented on code in PR #1054:
URL: https://github.com/apache/lucene/pull/1054#discussion_r948548244


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -133,22 +130,21 @@ private TopDocs searchLeaf(LeafReaderContext ctx, Weight 
filterWeight) throws IO
   return NO_RESULTS;
 }
 
-BitSet bitSet = createBitSet(scorer.iterator(), liveDocs, maxDoc);
-BitSetIterator filterIterator = new BitSetIterator(bitSet, 
bitSet.cardinality());
+BitSet acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc);
 
-if (filterIterator.cost() <= k) {
+if (acceptDocs.cardinality() <= k) {

Review Comment:
   Whenever possible, we should avoid calling `cardinality` multiple times 
since it can run in linear time. I thought the original logic was clearer (but 
I'm biased since I wrote it 😊 )
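
   For example, computing it once and reusing the cached value (standalone sketch using FixedBitSet/BitSetIterator, not the actual KnnVectorQuery code):

```java
import org.apache.lucene.util.BitSetIterator;
import org.apache.lucene.util.FixedBitSet;

public class CardinalityOnce {
  public static void main(String[] args) {
    FixedBitSet acceptDocs = new FixedBitSet(1_000_000);
    acceptDocs.set(3);
    acceptDocs.set(42);
    acceptDocs.set(999_999);

    int k = 10;
    // cardinality() can take time linear in the number of bits, so compute it once...
    int cardinality = acceptDocs.cardinality();
    if (cardinality <= k) {
      // ...and reuse the cached value, here as the iterator's cost estimate.
      BitSetIterator it = new BitSetIterator(acceptDocs, cardinality);
      System.out.println("exact search over " + it.cost() + " candidate docs");
    }
  }
}
```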



##
lucene/core/src/java/org/apache/lucene/index/VectorEncoding.java:
##
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+/** The numeric datatype of the vector values. */
+public enum VectorEncoding {
+
+  /**
+   * Encodes vector using 8 bits of precision per sample. Use only with 
DOT_PRODUCT similarity.

Review Comment:
   Is it still true that it should only be used with DOT_PRODUCT similarity?



##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsFormat.java:
##
@@ -76,6 +78,15 @@ public static KnnVectorsFormat forName(String name) {
   /** Returns a {@link KnnVectorsReader} to read the vectors from the index. */
   public abstract KnnVectorsReader fieldsReader(SegmentReadState state) throws 
IOException;
 
+  /**
+   * Returns the current KnnVectorsFormat version number. Indexes written 
using the format will be
+   * "stamped" with this version.
+   */
+  public int currentVersion() {

Review Comment:
   It seems confusing to have a new concept of "version" separate from the 
codec version. It's only used in `BaseKnnVectorsFormatTestCase` -- could we 
instead make the `randomVectorEncoding` overridable? It would default to all 
encodings but older codecs could override it and just return float32?
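
   As a hypothetical sketch of that hook (class names made up; not committed code):

```java
import java.util.Random;
import org.apache.lucene.index.VectorEncoding;

// Base test case: by default, pick any supported encoding at random.
class BaseKnnFormatTestSketch {
  final Random random = new Random();

  protected VectorEncoding randomVectorEncoding() {
    VectorEncoding[] encodings = VectorEncoding.values();
    return encodings[random.nextInt(encodings.length)];
  }
}

// Test for an older codec: restrict the choice to float32, which is all it supports.
class OlderCodecTestSketch extends BaseKnnFormatTestSketch {
  @Override
  protected VectorEncoding randomVectorEncoding() {
    return VectorEncoding.FLOAT32;
  }
}
```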



##
lucene/core/src/java/org/apache/lucene/document/KnnVectorField.java:
##
@@ -117,6 +160,21 @@ public KnnVectorField(String name, float[] vector, 
FieldType fieldType) {
 fieldsData = vector;
   }
 
+  /**
+   * Creates a numeric vector field. Fields are single-valued: each document 
has either one value or
+   * no value. Vectors of a single field share the same dimension and 
similarity function.
+   *
+   * @param name field name
+   * @param vector value
+   * @param fieldType field type
+   * @throws IllegalArgumentException if any parameter is null, or the vector 
is empty or has
+   * dimension > 1024.
+   */
+  public KnnVectorField(String name, BytesRef vector, FieldType fieldType) {

Review Comment:
   I think this method is only meant to be used with `VectorEncoding.BYTE`? 
Then it'd be good to validate this on the `FieldType`. The same thought applies 
to the float-oriented constructor.
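
   For illustration, the check could look roughly like this (sketch only; it assumes the FieldType exposes its encoding via a vectorEncoding() accessor, which may not match the PR exactly):

```java
// Inside the BytesRef-taking constructor (sketch, not the committed code):
if (vector == null) {
  throw new IllegalArgumentException("vector value must not be null");
}
if (fieldType.vectorEncoding() != VectorEncoding.BYTE) {
  // Reject a byte-encoded value when the field type declares a different encoding.
  throw new IllegalArgumentException(
      "cannot set a BytesRef vector value on a field typed for " + fieldType.vectorEncoding());
}
fieldsData = vector;
```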



##
lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java:
##
@@ -249,6 +261,29 @@ private void writeSortingField(FieldWriter fieldData, int 
maxDoc, Sorter.DocMap
 mockGraph);
   }
 
+  private long writeSortedFloat32Vectors(FieldWriter fieldData, int[] 
ordMap)
+  throws IOException {
+long vectorDataOffset = vectorData.alignFilePointer(Float.BYTES);
+final ByteBuffer buffer =
+ByteBuffer.allocate(fieldData.dim * 
Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
+final BytesRef binaryValue = new BytesRef(buffer.array());
+for (int ordinal : ordMap) {
+  float[] vector = (float[]) fieldData.vectors.get(ordinal);
+  buffer.asFloatBuffer().put(vector);
+  vectorData.writeBytes(binaryValue.bytes, binaryValue.offset, 
binaryValue.length);
+}
+return vectorDataOffset;
+  }
+
+  private long writeSortedByteVectors(FieldWriter fieldData, int[] ordMap) 
throws IOException {
+long vectorDataOffset = vectorData.alignFilePointer(Float.BYTES);
+for (int ordinal : ordMap) {
+  by