[jira] [Created] (LUCENE-10680) UnifiedHighlighter's term extraction not working for some query rewrites
Yannick Welsch created LUCENE-10680: --- Summary: UnifiedHighlighter's term extraction not working for some query rewrites Key: LUCENE-10680 URL: https://issues.apache.org/jira/browse/LUCENE-10680 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Reporter: Yannick Welsch UnifiedHighlighter rewrites the query against an empty index when extracting the terms from the query (see https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java#L149). The rewrite step can unfortunately drop the terms that are to be extracted. Take, for example, the boolean query "+field:value -ConstantScore(FieldExistsQuery [field=other_field])" when highlighting on "field". The `FieldExistsQuery` rewrites to a `MatchAllDocsQuery` on an empty index, and as a `MUST_NOT` clause it rewrites the overall boolean query to a `MatchNoDocsQuery`, dropping the `MUST` clause in the process, which means the `field:value` term is never extracted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
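[Editor's note] A minimal standalone sketch of the situation the report describes: the same kind of query is built and rewritten against an empty reader, the way UnifiedHighlighter does internally. The field names and the printed result are illustrative only; the collapse to MatchNoDocsQuery is what the report states, not something this sketch guarantees on every Lucene version.

```java
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.FieldExistsQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class EmptyIndexRewriteSketch {
  public static void main(String[] args) throws Exception {
    // "+field:value -ConstantScore(FieldExistsQuery [field=other_field])"
    Query query =
        new BooleanQuery.Builder()
            .add(new TermQuery(new Term("field", "value")), BooleanClause.Occur.MUST)
            .add(
                new ConstantScoreQuery(new FieldExistsQuery("other_field")),
                BooleanClause.Occur.MUST_NOT)
            .build();

    // An empty MultiReader stands in for the empty index the highlighter rewrites against.
    IndexSearcher emptySearcher = new IndexSearcher(new MultiReader());
    Query rewritten = emptySearcher.rewrite(query);

    // Per the report: FieldExistsQuery becomes MatchAllDocsQuery on the empty index, and a
    // MUST_NOT MatchAllDocsQuery collapses the whole query to MatchNoDocsQuery, so the
    // field:value term is no longer visible for extraction.
    System.out.println(rewritten);
  }
}
```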
[jira] [Commented] (LUCENE-10680) UnifiedHighlighter's term extraction not working for some query rewrites
[ https://issues.apache.org/jira/browse/LUCENE-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580735#comment-17580735 ] Alan Woodward commented on LUCENE-10680: I think the `rewrite` call here is actually unnecessary, and indeed has been since we switched to using QueryVisitors. Removing it doesn't cause any tests to fail either > UnifiedHighlighter's term extraction not working for some query rewrites > > > Key: LUCENE-10680 > URL: https://issues.apache.org/jira/browse/LUCENE-10680 > Project: Lucene - Core > Issue Type: Bug > Components: modules/highlighter >Reporter: Yannick Welsch >Priority: Minor > > UnifiedHighlighter rewrites the query against an empty index when extracting > the terms from the query (see > [https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java#L149).|https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java#L149)] > The rewrite step can unfortunately drop the terms that are to be extracted. > Take for example the boolean query "+field:value > -ConstantScore(FieldExistsQuery [field=other_field])" when highlighting on > "field". > The `FieldExistsQuery` rewrites on an empty index to a `MatchAllDocsQuery`, > and as a `MUST_NOT` clause rewrites the overall boolean query to a > `MatchNoDocsQuery`, dropping the `MUST` clause in the process, which means > that the `field:value` term is not being extracted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
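[Editor's note] A small hedged sketch of the QueryVisitor-based extraction Alan refers to (not the highlighter's actual code): terms are collected straight off the unrewritten query, so no index-dependent rewrite is involved, and the default visitor skips MUST_NOT branches anyway.

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;

final class VisitorTermExtraction {
  private VisitorTermExtraction() {}

  /** Collects the positive terms of a query without rewriting it against any index. */
  static Set<Term> extractTerms(Query query) {
    Set<Term> terms = new HashSet<>();
    // QueryVisitor.termCollector feeds matched terms into the set; MUST_NOT sub-queries
    // receive an empty sub-visitor by default and are ignored.
    query.visit(QueryVisitor.termCollector(terms));
    return terms;
  }
}
```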
[jira] [Commented] (LUCENE-10680) UnifiedHighlighter's term extraction not working for some query rewrites
[ https://issues.apache.org/jira/browse/LUCENE-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580874#comment-17580874 ] Julie Tibshirani commented on LUCENE-10680: --- Thanks for debugging this [~ywelsch]. It seems like the same problem as https://issues.apache.org/jira/browse/LUCENE-10454. Maybe we could close this in favor of that one to keep discussion in one place. > UnifiedHighlighter's term extraction not working for some query rewrites > > > Key: LUCENE-10680 > URL: https://issues.apache.org/jira/browse/LUCENE-10680 > Project: Lucene - Core > Issue Type: Bug > Components: modules/highlighter >Reporter: Yannick Welsch >Priority: Minor > > UnifiedHighlighter rewrites the query against an empty index when extracting > the terms from the query (see > [https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java#L149).|https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java#L149)] > The rewrite step can unfortunately drop the terms that are to be extracted. > Take for example the boolean query "+field:value > -ConstantScore(FieldExistsQuery [field=other_field])" when highlighting on > "field". > The `FieldExistsQuery` rewrites on an empty index to a `MatchAllDocsQuery`, > and as a `MUST_NOT` clause rewrites the overall boolean query to a > `MatchNoDocsQuery`, dropping the `MUST` clause in the process, which means > that the `field:value` term is not being extracted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file
Luís Filipe Nassif created LUCENE-10681: --- Summary: ArrayIndexOutOfBoundsException while indexing large binary file Key: LUCENE-10681 URL: https://issues.apache.org/jira/browse/LUCENE-10681 Project: Lucene - Core Issue Type: Bug Components: core/index Affects Versions: 9.2 Environment: Linux Ubuntu (will check the user version), java x64 version 11.0.16.1 Reporter: Luís Filipe Nassif Hello, I looked for a similar issue, but didn't find one, so I'm creating this, sorry if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 recently and an user reported error below while indexing a huge binary file in a parent-children schema where strings extracted from the huge binary file (using strings command) are indexed as thousands of ~10MB children docs of the parent metadata document: Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428 at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at iped.engine.task.index.IndexTask.process(IndexTask.java:148) ~[iped-engine-4.0.2.jar:?] 
at iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) ~[iped-engine-4.0.2.jar:?] This looks like an integer overflow to me, but I'm not sure. It didn't happen with the previous lucene-5.5.5, and indexing files like this is pretty common for us, although with lucene-5.5.5 we used to break that huge file up manually before indexing and call the IndexWriter.addDocument(Document) method once per 10MB chunk; now we are using the IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
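[Editor's note] To make the two indexing patterns being contrasted concrete, here is a hedged sketch; the field names, the chunking, and the parent-last convention are assumptions for illustration, not the reporter's actual code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

final class ChunkIndexingSketch {
  private ChunkIndexingSketch() {}

  /** 5.5.5-era pattern: one independent addDocument call per ~10MB text chunk. */
  static void indexChunksIndividually(IndexWriter writer, String parentId, List<String> chunks)
      throws IOException {
    for (String chunk : chunks) {
      Document child = new Document();
      child.add(new StringField("parentId", parentId, Field.Store.NO));
      child.add(new TextField("content", chunk, Field.Store.NO));
      writer.addDocument(child);
    }
  }

  /** 9.2.0 pattern: the parent metadata doc and all its children indexed as one block. */
  static void indexAsBlock(IndexWriter writer, Document parentMetadata, List<String> chunks)
      throws IOException {
    List<Document> block = new ArrayList<>();
    for (String chunk : chunks) {
      Document child = new Document();
      child.add(new TextField("content", chunk, Field.Store.NO));
      block.add(child);
    }
    block.add(parentMetadata); // the parent is conventionally the last document in the block
    writer.addDocuments(block);
  }
}
```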
[GitHub] [lucene] gsmiller commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order
gsmiller commented on code in PR #1013: URL: https://github.com/apache/lucene/pull/1013#discussion_r948197598 ## lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeFacetCounts.java: ## @@ -455,9 +500,9 @@ public void testEmptyRangesMultiValued() throws Exception { Facets facets = new LongRangeFacetCounts("field", fc); -FacetResult result = facets.getAllChildren("field"); -assertEquals("dim=field path=[] value=0 childCount=0\n", result.toString()); -result = facets.getTopChildren(1, "field"); +assertFacetResult( +facets.getAllChildren("field"), "field", new String[0], 0, 0, new LabelAndValue[] {}); Review Comment: minor: `new LabelAndValue[0]`? ## lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeFacetCounts.java: ## @@ -424,9 +469,9 @@ public void testEmptyRangesSingleValued() throws Exception { Facets facets = new LongRangeFacetCounts("field", fc); -FacetResult result = facets.getAllChildren("field"); -assertEquals("dim=field path=[] value=0 childCount=0\n", result.toString()); -result = facets.getTopChildren(1, "field"); +assertFacetResult( +facets.getAllChildren("field"), "field", new String[0], 0, 0, new LabelAndValue[] {}); Review Comment: minor: for consistency here, I'd suggest `new LabelAndValue[0]`? ## lucene/CHANGES.txt: ## @@ -52,6 +52,8 @@ Improvements * LUCENE-10614: Properly support getTopChildren in RangeFacetCounts. (Yuting Gan) +* LUCENE-10644: Facets#getAllChildren testing should ignore child order. (Yuting Gan) Review Comment: We don't need to wait for 10.0 to release this do we? Should we try to release this with 9.4? ## lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeFacetCounts.java: ## @@ -100,12 +100,21 @@ public void testBasicLong() throws Exception { new LongRange("90 or above", 90L, true, 100L, false), new LongRange("over 1000", 1000L, false, Long.MAX_VALUE, true)); -FacetResult result = facets.getAllChildren("field"); -assertEquals( -"dim=field path=[] value=22 childCount=5\n less than 10 (10)\n less than or equal to 10 (11)\n over 90 (9)\n 90 or above (10)\n over 1000 (1)\n", -result.toString()); +assertFacetResult( Review Comment: So our implementation (and javadoc) for `RangeFacetCounts#getAllChildren` specifies that we _do_ actually guarantee child ordering. We should probably make sure our tests actually _do_ check ordering for range faceting right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file
[ https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif updated LUCENE-10681: Description: Hello, I looked for a similar issue, but didn't find one, so I'm creating this, sorry if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 recently and an user reported error below while indexing a huge binary file in a parent-children schema where strings extracted from the huge binary file (using strings command) are indexed as thousands of ~10MB children text docs of the parent metadata document: {{Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428}} {{ at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at iped.engine.task.index.IndexTask.process(IndexTask.java:148) ~[iped-engine-4.0.2.jar:?]}} {{ at iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) ~[iped-engine-4.0.2.jar:?]}} This seems an integer overflow to me, not sure... 
It didn't use to happen with previous lucene-5.5.5 and indexing files like this is pretty common to us, although with lucene-5.5.5 we used to break that huge file manually before indexing and to index using IndexWriter.addDocument(Document) method several times for each 10MB chunk, now we are using the IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts? was: Hello, I looked for a similar issue, but didn't find one, so I'm creating this, sorry if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 recently and an user reported error below while indexing a huge binary file in a parent-children schema where strings extracted from the huge binary file (using strings command) are indexed as thousands of ~10MB children docs of the parent metadata document: Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428 at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apa
[jira] [Updated] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file
[ https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif updated LUCENE-10681: Description: Hello, I looked for a similar issue, but didn't find one, so I'm creating this, sorry if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 recently and an user reported error below while indexing a huge binary file in a parent-children schema where strings extracted from the huge binary file (using strings command) are indexed as thousands of ~10MB children text docs of the parent metadata document: {noformat} Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428 at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13] at iped.engine.task.index.IndexTask.process(IndexTask.java:148) ~[iped-engine-4.0.2.jar:?] at iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) ~[iped-engine-4.0.2.jar:?]{noformat} This seems an integer overflow to me, not sure... 
It didn't use to happen with previous lucene-5.5.5 and indexing files like this is pretty common to us, although with lucene-5.5.5 we used to break that huge file manually before indexing and to index using IndexWriter.addDocument(Document) method several times for each 10MB chunk, now we are using the IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts? was: Hello, I looked for a similar issue, but didn't find one, so I'm creating this, sorry if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 recently and an user reported error below while indexing a huge binary file in a parent-children schema where strings extracted from the huge binary file (using strings command) are indexed as thousands of ~10MB children text docs of the parent metadata document: {{Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428}} {{ at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]}} {{ at org.apache.lucene.index.FreqPro
[GitHub] [lucene] gsmiller commented on a diff in pull request #1058: LUCENE-10207: TermInSetQuery now provides a ScoreSupplier with cost estimation for use in TermInSetQuery
gsmiller commented on code in PR #1058: URL: https://github.com/apache/lucene/pull/1058#discussion_r948265844 ## lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java: ## @@ -345,15 +345,62 @@ public BulkScorer bulkScorer(LeafReaderContext context) throws IOException { } @Override - public Scorer scorer(LeafReaderContext context) throws IOException { -final WeightOrDocIdSet weightOrBitSet = rewrite(context); -if (weightOrBitSet == null) { - return null; -} else if (weightOrBitSet.weight != null) { - return weightOrBitSet.weight.scorer(context); -} else { - return scorer(weightOrBitSet.set); + public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException { +// Cost estimation reasoning is: +// 1. Assume every query term matches at least one document (queryTermsCount). +// 2. Determine the total number of docs beyond the first one for each term. +// That count provides a ceiling on the number of extra docs that could match beyond +// that first one. (We omit the first since it's already been counted in #1). +// This approach still provides correct worst-case cost in general, but provides tighter +// estimates for primary-key-like fields. See: LUCENE-10207 + +// TODO: This cost estimation may grossly overestimate since we have no index statistics +// for the specific query terms. While it's nice to avoid the cost of intersecting the +// query terms with the index, it could be beneficial to do that work and get better +// cost estimates. +final long cost; +final long queryTermsCount = termData.size(); +Terms indexTerms = context.reader().terms(field); +long potentialExtraCost = indexTerms.getSumDocFreq(); +final long indexedTermCount = indexTerms.size(); +if (indexedTermCount != -1) { + potentialExtraCost -= indexedTermCount; } +cost = queryTermsCount + potentialExtraCost; + +final Weight weight = this; +return new ScorerSupplier() { + @Override + public Scorer get(long leadCost) throws IOException { +WeightOrDocIdSet weightOrDocIdSet = rewrite(context); +if (weightOrDocIdSet == null) { + return null; +} + +final Scorer scorer; +if (weightOrDocIdSet.weight != null) { + scorer = weightOrDocIdSet.weight.scorer(context); +} else { + scorer = scorer(weightOrDocIdSet.set); +} + +return Objects.requireNonNullElseGet( +scorer, +() -> +new ConstantScoreScorer(weight, score(), scoreMode, DocIdSetIterator.empty())); + } + + @Override + public long cost() { +return cost; Review Comment: @msokolov when you have a chance, I'm curious what you think about this ^^. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
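[Editor's note] A small standalone sketch that mirrors the cost arithmetic in the diff above, using made-up per-segment statistics (the numbers are assumptions, not from the PR). For a primary-key-like field, sumDocFreq is close to the indexed term count, so the estimate collapses to roughly the number of query terms, which is the point of LUCENE-10207.

```java
public class CostEstimateSketch {
  public static void main(String[] args) {
    long queryTermsCount = 1_000;      // terms in the TermInSetQuery
    long sumDocFreq = 5_000_000;       // Terms#getSumDocFreq() for the field
    long indexedTermCount = 4_800_000; // Terms#size(); -1 when the codec cannot report it

    long potentialExtraCost = sumDocFreq;
    if (indexedTermCount != -1) {
      // sumDocFreq - indexedTermCount = postings beyond the first for each indexed term,
      // an upper bound on matches beyond the one doc per query term assumed above.
      potentialExtraCost -= indexedTermCount;
    }
    long cost = queryTermsCount + potentialExtraCost;
    System.out.println(cost); // 1_000 + 200_000 = 201_000
  }
}
```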
[jira] [Commented] (LUCENE-10318) Reuse HNSW graphs when merging segments?
[ https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580935#comment-17580935 ] Jack Mazanec commented on LUCENE-10318: --- Hi [~julietibs], I was thinking about something similar and would be interested in working on this. I can run some experiments to see if this would improve performance, if you haven’t already started to do so. Additionally, I am wondering if it would make sense to extend this to support graphs that contain deleted nodes. I can think of an approach, but it is a little messy. It would follow the same idea for merging: add vectors from the smaller graph into the larger graph. However, before adding vectors from the smaller graph, all of the deleted nodes would need to be removed from the larger graph. In order to remove a node from the graph, I think we would need to remove it from the list of neighbor arrays for each level it is in. In addition to this, because removal would break the ordinals, we would have to update all of the ordinals in the graph, which for the OnHeapHNSW graph would mean updating all nodes by levels and also potentially each neighbor in each NeighborArray in the graph. Because removing a node could cause a number of nodes in the graph to lose a neighbor, we would need to repair the graph. To do this, I think we could create a _repair_list_ that tracks the nodes that lost a connection due to the deleted node. To fill the list, we would need to iterate over all of the nodes in the graph and then check if any of their _m_ connections are to the deleted node (I think this could be done when the ordinals are being updated). If so, remove the connection and then add the node to the _repair_list_. Once the _repair_list_ is complete, for each node in the list, search the graph to get new neighbors to fill up the node’s connections to the desired amount. At this point, I would expect the time it takes to finish merging to be equal to the time it takes to insert the number of live vectors in the smaller graph plus the size of the repair list into the larger graph. All that being said, I am not sure if removing deleted nodes from the graph would be faster than just building the graph from scratch. From the logic above, we would need to at least iterate over each connection in the graph and potentially perform several list deletions. My guess is that when the repair list is small it would be faster, but when it is large, probably not. I am going to try to start playing around with this idea, but please let me know what you think! > Reuse HNSW graphs when merging segments? > > > Key: LUCENE-10318 > URL: https://issues.apache.org/jira/browse/LUCENE-10318 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Major > > Currently when merging segments, the HNSW vectors format rebuilds the entire > graph from scratch. In general, building these graphs is very expensive, and > it'd be nice to optimize it in any way we can. I was wondering if during > merge, we could choose the largest segment with no deletes, and load its HNSW > graph into heap. Then we'd add vectors from the other segments to this graph, > through the normal build process. This could cut down on the number of > operations we need to perform when building the graph. > This is just an early idea, I haven't run experiments to see if it would > help. I'd guess that whether it helps would also depend on details of the > MergePolicy.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
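[Editor's note] A rough, hypothetical sketch of the repair-list idea from the comment above, written against a plain single-level adjacency map rather than Lucene's actual OnHeapHnswGraph/NeighborArray classes; ordinal remapping and the re-search that tops connections back up to m are deliberately left out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy single-level graph: node ordinal -> mutable list of neighbor ordinals. */
final class GraphRepairSketch {
  final Map<Integer, List<Integer>> neighbors = new HashMap<>();

  /**
   * Drops the deleted nodes and returns the "repair list": surviving nodes that lost at
   * least one connection and need new neighbors searched for them.
   */
  Set<Integer> removeDeletedNodes(Set<Integer> deleted) {
    Set<Integer> repairList = new HashSet<>();
    neighbors.keySet().removeAll(deleted); // drop the deleted nodes themselves
    for (Map.Entry<Integer, List<Integer>> entry : neighbors.entrySet()) {
      // removeAll returns true if this node pointed at any deleted node
      if (entry.getValue().removeAll(deleted)) {
        repairList.add(entry.getKey());
      }
    }
    return repairList;
  }
}
```

Topping the connections of every node on the repair list back up would reuse the normal HNSW neighbor search, which is where the trade-off the comment describes (repairing versus rebuilding from scratch) comes from.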
[GitHub] [lucene] gsmiller commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment
gsmiller commented on code in PR #1062: URL: https://github.com/apache/lucene/pull/1062#discussion_r948336077 ## lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java: ## @@ -363,6 +370,29 @@ public boolean isCacheable(LeafReaderContext ctx) { // sets. return ramBytesUsed() <= RamUsageEstimator.QUERY_DEFAULT_RAM_BYTES_USED; } + + static final class MatchAllDocIdSet extends DocIdSet { +private final int size; Review Comment: Thanks for the suggestion @LuXugang. Yeah, I think exposing an `ALL` `DocIdSet` for general use is reasonable. I'll update the PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
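[Editor's note] For illustration, a hedged sketch of what a reusable "match all docs" DocIdSet could look like, built on DocIdSetIterator.all; this is an assumption about the shape of the change, not the PR's actual code.

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.RamUsageEstimator;

/** Hypothetical match-all DocIdSet over a segment with maxDoc documents. */
final class MatchAllDocIdSetSketch extends DocIdSet {
  private final int maxDoc;

  MatchAllDocIdSetSketch(int maxDoc) {
    this.maxDoc = maxDoc;
  }

  @Override
  public DocIdSetIterator iterator() throws IOException {
    return DocIdSetIterator.all(maxDoc); // iterates doc IDs 0..maxDoc-1
  }

  @Override
  public long ramBytesUsed() {
    return RamUsageEstimator.shallowSizeOfInstance(MatchAllDocIdSetSketch.class);
  }
}
```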
[jira] [Updated] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file
[ https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif updated LUCENE-10681: Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1 (was: Linux Ubuntu (will check the user version), java x64 version 11.0.16.1) > ArrayIndexOutOfBoundsException while indexing large binary file > --- > > Key: LUCENE-10681 > URL: https://issues.apache.org/jira/browse/LUCENE-10681 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 9.2 > Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1 >Reporter: Luís Filipe Nassif >Priority: Minor > > Hello, > I looked for a similar issue, but didn't find one, so I'm creating this, > sorry if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 > recently and an user reported error below while indexing a huge binary file > in a parent-children schema where strings extracted from the huge binary file > (using strings command) are indexed as thousands of ~10MB children text docs > of the parent metadata document: > > {noformat} > Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of > bounds for length 71428 > at > org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) > ~[lucene-core-9.2.0.jar:9.2.0 
ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at iped.engine.task.index.IndexTask.process(IndexTask.java:148) > ~[iped-engine-4.0.2.jar:?] > at > iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) > ~[iped-engine-4.0.2.jar:?]{noformat} > > This seems an integer overflow to me, not sure... It didn't use to happen > with previous lucene-5.5.5 and indexing files like this is pretty common to > us, although with lucene-5.5.5 we used to break that huge file manually > before indexing and to index using IndexWriter.addDocument(Document) method > several times for each 10MB chunk, now we are using the > IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.
[jira] [Updated] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file
[ https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif updated LUCENE-10681: Priority: Major (was: Minor) > ArrayIndexOutOfBoundsException while indexing large binary file > --- > > Key: LUCENE-10681 > URL: https://issues.apache.org/jira/browse/LUCENE-10681 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 9.2 > Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1 >Reporter: Luís Filipe Nassif >Priority: Major > > Hello, > I looked for a similar issue, but didn't find one, so I'm creating this, > sorry if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 > recently and an user reported error below while indexing a huge binary file > in a parent-children schema where strings extracted from the huge binary file > (using strings command) are indexed as thousands of ~10MB children text docs > of the parent metadata document: > > {noformat} > Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of > bounds for length 71428 > at > org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > 
org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at iped.engine.task.index.IndexTask.process(IndexTask.java:148) > ~[iped-engine-4.0.2.jar:?] > at > iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) > ~[iped-engine-4.0.2.jar:?]{noformat} > > This seems an integer overflow to me, not sure... It didn't use to happen > with previous lucene-5.5.5 and indexing files like this is pretty common to > us, although with lucene-5.5.5 we used to break that huge file manually > before indexing and to index using IndexWriter.addDocument(Document) method > several times for each 10MB chunk, now we are using the > IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file
[ https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580973#comment-17580973 ] Luís Filipe Nassif commented on LUCENE-10681: - Just changed the priority to the default (major), I changed it accidentally, but not sure if it is ok. > ArrayIndexOutOfBoundsException while indexing large binary file > --- > > Key: LUCENE-10681 > URL: https://issues.apache.org/jira/browse/LUCENE-10681 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 9.2 > Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1 >Reporter: Luís Filipe Nassif >Priority: Major > > Hello, > I looked for a similar issue, but didn't find one, so I'm creating this, > sorry if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 > recently and an user reported error below while indexing a huge binary file > in a parent-children schema where strings extracted from the huge binary file > (using strings command) are indexed as thousands of ~10MB children text docs > of the parent metadata document: > > {noformat} > Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of > bounds for length 71428 > at > org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) > 
~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at iped.engine.task.index.IndexTask.process(IndexTask.java:148) > ~[iped-engine-4.0.2.jar:?] > at > iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) > ~[iped-engine-4.0.2.jar:?]{noformat} > > This seems an integer overflow to me, not sure... It didn't use to happen > with previous lucene-5.5.5 and indexing files like this is pretty common to > us, although with lucene-5.5.5 we used to break that huge file manually > before indexing and to index using IndexWriter.addDocument(Document) method > several times for each 10MB chunk, now we are using the > IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr
[jira] [Commented] (LUCENE-10318) Reuse HNSW graphs when merging segments?
[ https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580979#comment-17580979 ] Mayya Sharipova commented on LUCENE-10318: -- Thanks for looking into this, Jack. We have not done any development on this, but some thoughts from us (may be Julie can add more): * Looks like the way MergePolicy works, it chooses segments of approximately same size. So during merge, we may not have one single big segment, whose graph we can reuse. So I would imagine for many uses case it may not worth reusing graphs (especially if segments are relative small) - extra complexity would not justify a very small speedups. * I agree with your thoughts on deletions that it may also not worth reusing graphs is some heavy deletions are present. So may be, a good start could be have a very lean prototype with a lot of performance benchmarks. > Reuse HNSW graphs when merging segments? > > > Key: LUCENE-10318 > URL: https://issues.apache.org/jira/browse/LUCENE-10318 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Major > > Currently when merging segments, the HNSW vectors format rebuilds the entire > graph from scratch. In general, building these graphs is very expensive, and > it'd be nice to optimize it in any way we can. I was wondering if during > merge, we could choose the largest segment with no deletes, and load its HNSW > graph into heap. Then we'd add vectors from the other segments to this graph, > through the normal build process. This could cut down on the number of > operations we need to perform when building the graph. > This is just an early idea, I haven't run experiments to see if it would > help. I'd guess that whether it helps would also depend on details of the > MergePolicy. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10318) Reuse HNSW graphs when merging segments?
[ https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580979#comment-17580979 ] Mayya Sharipova edited comment on LUCENE-10318 at 8/17/22 8:01 PM: --- Thanks for looking into this, Jack. We have not done any development on this, but some thoughts from us: * Looks like the way MergePolicy works, it chooses segments of approximately same size. So during merge, we may not have one single big segment, whose graph we can reuse. So I would imagine for many uses case it may not worth reusing graphs (especially if segments are relative small) - extra complexity would not justify a very small speedups. * I agree with your thoughts on deletions that it may also not worth reusing graphs if some heavy deletions are present. So may be, a good start could be have a very lean prototype with a lot of performance benchmarks. was (Author: mayyas): Thanks for looking into this, Jack. We have not done any development on this, but some thoughts from us (may be Julie can add more): * Looks like the way MergePolicy works, it chooses segments of approximately same size. So during merge, we may not have one single big segment, whose graph we can reuse. So I would imagine for many uses case it may not worth reusing graphs (especially if segments are relative small) - extra complexity would not justify a very small speedups. * I agree with your thoughts on deletions that it may also not worth reusing graphs is some heavy deletions are present. So may be, a good start could be have a very lean prototype with a lot of performance benchmarks. > Reuse HNSW graphs when merging segments? > > > Key: LUCENE-10318 > URL: https://issues.apache.org/jira/browse/LUCENE-10318 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Major > > Currently when merging segments, the HNSW vectors format rebuilds the entire > graph from scratch. In general, building these graphs is very expensive, and > it'd be nice to optimize it in any way we can. I was wondering if during > merge, we could choose the largest segment with no deletes, and load its HNSW > graph into heap. Then we'd add vectors from the other segments to this graph, > through the normal build process. This could cut down on the number of > operations we need to perform when building the graph. > This is just an early idea, I haven't run experiments to see if it would > help. I'd guess that whether it helps would also depend on details of the > MergePolicy. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10318) Reuse HNSW graphs when merging segments?
[ https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580979#comment-17580979 ] Mayya Sharipova edited comment on LUCENE-10318 at 8/17/22 8:02 PM: --- Thanks for looking into this, Jack. We have not done any development on this, but some thoughts from us: * Looks like the way MergePolicy works, it chooses segments of approximately same size. So during merge, we may not have one single big segment, whose graph we can reuse. So I would imagine for many uses case it may not worth reusing graphs (especially if segments are relative small) - extra complexity would not justify a very small speedups. * I agree with your thoughts on deletions that it may also not worth reusing graphs if some heavy deletions are present. So may be, a good start could be to have a very lean prototype with a lot of performance benchmarks. was (Author: mayyas): Thanks for looking into this, Jack. We have not done any development on this, but some thoughts from us: * Looks like the way MergePolicy works, it chooses segments of approximately same size. So during merge, we may not have one single big segment, whose graph we can reuse. So I would imagine for many uses case it may not worth reusing graphs (especially if segments are relative small) - extra complexity would not justify a very small speedups. * I agree with your thoughts on deletions that it may also not worth reusing graphs if some heavy deletions are present. So may be, a good start could be have a very lean prototype with a lot of performance benchmarks. > Reuse HNSW graphs when merging segments? > > > Key: LUCENE-10318 > URL: https://issues.apache.org/jira/browse/LUCENE-10318 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Major > > Currently when merging segments, the HNSW vectors format rebuilds the entire > graph from scratch. In general, building these graphs is very expensive, and > it'd be nice to optimize it in any way we can. I was wondering if during > merge, we could choose the largest segment with no deletes, and load its HNSW > graph into heap. Then we'd add vectors from the other segments to this graph, > through the normal build process. This could cut down on the number of > operations we need to perform when building the graph. > This is just an early idea, I haven't run experiments to see if it would > help. I'd guess that whether it helps would also depend on details of the > MergePolicy. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10318) Reuse HNSW graphs when merging segments?
[ https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581016#comment-17581016 ] Julie Tibshirani commented on LUCENE-10318: --- [~jmazanec15] it's great you're interested in looking into this! I don't have any prototype or experiments, you're welcome to pick it up. Removing nodes and repairing the graph could be a nice direction. But for now we can keep things simple and assume there's a segment without deletes. If that's looking good and shows a nice improvement in index/ merge benchmarks, then we can handle deletes in a follow-up. > Reuse HNSW graphs when merging segments? > > > Key: LUCENE-10318 > URL: https://issues.apache.org/jira/browse/LUCENE-10318 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Major > > Currently when merging segments, the HNSW vectors format rebuilds the entire > graph from scratch. In general, building these graphs is very expensive, and > it'd be nice to optimize it in any way we can. I was wondering if during > merge, we could choose the largest segment with no deletes, and load its HNSW > graph into heap. Then we'd add vectors from the other segments to this graph, > through the normal build process. This could cut down on the number of > operations we need to perform when building the graph. > This is just an early idea, I haven't run experiments to see if it would > help. I'd guess that whether it helps would also depend on details of the > MergePolicy. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10318) Reuse HNSW graphs when merging segments?
[ https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581016#comment-17581016 ] Julie Tibshirani edited comment on LUCENE-10318 at 8/17/22 8:51 PM: [~jmazanec15] it's great you're interested in looking into this! I don't have any prototype or experiments, you're welcome to pick it up. Removing nodes and repairing the graph could be a nice direction. But for now we can keep things simple and assume there's a segment without deletes. If that's looking good and shows a nice improvement in index/ merge benchmarks, then we can handle deletes in a follow-up. Edit: Oops, I didn't refresh the page so I missed Mayya's comment. It looks like we're in agreement! was (Author: julietibs): [~jmazanec15] it's great you're interested in looking into this! I don't have any prototype or experiments, you're welcome to pick it up. Removing nodes and repairing the graph could be a nice direction. But for now we can keep things simple and assume there's a segment without deletes. If that's looking good and shows a nice improvement in index/ merge benchmarks, then we can handle deletes in a follow-up. > Reuse HNSW graphs when merging segments? > > > Key: LUCENE-10318 > URL: https://issues.apache.org/jira/browse/LUCENE-10318 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Major > > Currently when merging segments, the HNSW vectors format rebuilds the entire > graph from scratch. In general, building these graphs is very expensive, and > it'd be nice to optimize it in any way we can. I was wondering if during > merge, we could choose the largest segment with no deletes, and load its HNSW > graph into heap. Then we'd add vectors from the other segments to this graph, > through the normal build process. This could cut down on the number of > operations we need to perform when building the graph. > This is just an early idea, I haven't run experiments to see if it would > help. I'd guess that whether it helps would also depend on details of the > MergePolicy. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] Yuti-G commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order
Yuti-G commented on code in PR #1013: URL: https://github.com/apache/lucene/pull/1013#discussion_r948407058 ## lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeFacetCounts.java: ## @@ -100,12 +100,21 @@ public void testBasicLong() throws Exception { new LongRange("90 or above", 90L, true, 100L, false), new LongRange("over 1000", 1000L, false, Long.MAX_VALUE, true)); -FacetResult result = facets.getAllChildren("field"); -assertEquals( -"dim=field path=[] value=22 childCount=5\n less than 10 (10)\n less than or equal to 10 (11)\n over 90 (9)\n 90 or above (10)\n over 1000 (1)\n", -result.toString()); +assertFacetResult( Review Comment: Sorry, I am confused. Our javadoc for Facets#getAllChildren explicitly calls out that callers should make _**NO**_ assumptions about child ordering. Isn't the purpose here to address the previous tests by ignoring child order, like you described in the LUCENE-10644 Jira issue? I know it's been a while, but please refer to our PR comments and confirm whether we misunderstand something here. Thanks! > Thanks @Yuti-G! This approach looks good to me. Is your plan to iterate on this PR to stop enforcing the ordering checks in all the tests? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani opened a new pull request, #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer
jtibshirani opened a new pull request, #1071: URL: https://github.com/apache/lucene/pull/1071 This change folds the `RandomAccessVectorValuesProducer` interface into `RandomAccessVectorValues`. This reduces the number of interfaces and clarifies the cloning/copying behavior. This is a small simplification related to LUCENE-9583, but does not address the main issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
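A rough sketch of the consolidated shape (method names here are illustrative, not the exact API in the PR): the values interface itself exposes a `copy()` for callers that need an independent view, so a separate producer interface is no longer needed.

```java
// Illustrative sketch of the consolidated interface; method names here do not
// claim to match the PR exactly.
import java.io.IOException;

interface RandomAccessVectorValuesSketch {
  /** Number of vectors in this set. */
  int size();

  /** Dimension of the vectors. */
  int dimension();

  /** Returns the vector for the given ordinal. */
  float[] vectorValue(int targetOrd) throws IOException;

  /**
   * Returns an independent copy over the same data, so that graph construction
   * and search can each advance their own state without a separate producer type.
   */
  RandomAccessVectorValuesSketch copy() throws IOException;
}
```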
[GitHub] [lucene] jtibshirani commented on a diff in pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer
jtibshirani commented on code in PR #1071: URL: https://github.com/apache/lucene/pull/1071#discussion_r948528112 ## lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java: ## @@ -783,66 +742,6 @@ private static void usage() { System.exit(1); } - class BinaryFileVectors implements RandomAccessVectorValuesProducer, Closeable { Review Comment: I wasn't sure this functionality was worth preserving. Let me know though and I can restore and refactor it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10318) Reuse HNSW graphs when merging segments?
[ https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-10318: -- Labels: vector-based-search (was: ) > Reuse HNSW graphs when merging segments? > > > Key: LUCENE-10318 > URL: https://issues.apache.org/jira/browse/LUCENE-10318 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Major > Labels: vector-based-search > > Currently when merging segments, the HNSW vectors format rebuilds the entire > graph from scratch. In general, building these graphs is very expensive, and > it'd be nice to optimize it in any way we can. I was wondering if during > merge, we could choose the largest segment with no deletes, and load its HNSW > graph into heap. Then we'd add vectors from the other segments to this graph, > through the normal build process. This could cut down on the number of > operations we need to perform when building the graph. > This is just an early idea, I haven't run experiments to see if it would > help. I'd guess that whether it helps would also depend on details of the > MergePolicy. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order
gsmiller commented on code in PR #1013: URL: https://github.com/apache/lucene/pull/1013#discussion_r948541210 ## lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeFacetCounts.java: ## @@ -100,12 +100,21 @@ public void testBasicLong() throws Exception { new LongRange("90 or above", 90L, true, 100L, false), new LongRange("over 1000", 1000L, false, Long.MAX_VALUE, true)); -FacetResult result = facets.getAllChildren("field"); -assertEquals( -"dim=field path=[] value=22 childCount=5\n less than 10 (10)\n less than or equal to 10 (11)\n over 90 (9)\n 90 or above (10)\n over 1000 (1)\n", -result.toString()); +assertFacetResult( Review Comment: @Yuti-G I'm referring to the javadoc on `RangeFacetCounts#getAllChildren`, which notes an exception to this rule in range counting. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] Yuti-G commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order
Yuti-G commented on code in PR #1013: URL: https://github.com/apache/lucene/pull/1013#discussion_r948557786 ## lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeFacetCounts.java: ## @@ -100,12 +100,21 @@ public void testBasicLong() throws Exception { new LongRange("90 or above", 90L, true, 100L, false), new LongRange("over 1000", 1000L, false, Long.MAX_VALUE, true)); -FacetResult result = facets.getAllChildren("field"); -assertEquals( -"dim=field path=[] value=22 childCount=5\n less than 10 (10)\n less than or equal to 10 (11)\n over 90 (9)\n 90 or above (10)\n over 1000 (1)\n", -result.toString()); +assertFacetResult( Review Comment: Thanks for catching this! Sorry for overlooking `range` in the comment. I reverted the changes in TestRangeFacetCounts. Please let me know if there is any question. Thank you so much for your time! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
jtibshirani commented on code in PR #1054: URL: https://github.com/apache/lucene/pull/1054#discussion_r948548244 ## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ## @@ -133,22 +130,21 @@ private TopDocs searchLeaf(LeafReaderContext ctx, Weight filterWeight) throws IO return NO_RESULTS; } -BitSet bitSet = createBitSet(scorer.iterator(), liveDocs, maxDoc); -BitSetIterator filterIterator = new BitSetIterator(bitSet, bitSet.cardinality()); +BitSet acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc); -if (filterIterator.cost() <= k) { +if (acceptDocs.cardinality() <= k) { Review Comment: Whenever possible, we should avoid calling `cardinality` multiple times since it can run in linear time. I thought the original logic was clearer (but I'm biased since I wrote it 😊 ) ## lucene/core/src/java/org/apache/lucene/index/VectorEncoding.java: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +/** The numeric datatype of the vector values. */ +public enum VectorEncoding { + + /** + * Encodes vector using 8 bits of precision per sample. Use only with DOT_PRODUCT similarity. Review Comment: Is it still true that it should only be used with DOT_PRODUCT similarity? ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsFormat.java: ## @@ -76,6 +78,15 @@ public static KnnVectorsFormat forName(String name) { /** Returns a {@link KnnVectorsReader} to read the vectors from the index. */ public abstract KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException; + /** + * Returns the current KnnVectorsFormat version number. Indexes written using the format will be + * "stamped" with this version. + */ + public int currentVersion() { Review Comment: It seems confusing to have a new concept of "version" separate from the codec version. It's only used in `BaseKnnVectorsFormatTestCase` -- could we instead make the `randomVectorEncoding` overridable? It would default to all encodings but older codecs could override it and just return float32? ## lucene/core/src/java/org/apache/lucene/document/KnnVectorField.java: ## @@ -117,6 +160,21 @@ public KnnVectorField(String name, float[] vector, FieldType fieldType) { fieldsData = vector; } + /** + * Creates a numeric vector field. Fields are single-valued: each document has either one value or + * no value. Vectors of a single field share the same dimension and similarity function. + * + * @param name field name + * @param vector value + * @param fieldType field type + * @throws IllegalArgumentException if any parameter is null, or the vector is empty or has + * dimension > 1024.
+ */ + public KnnVectorField(String name, BytesRef vector, FieldType fieldType) { Review Comment: I think this method is only meant to be used with `VectorEncoding.BYTE`? Then it'd be good to validate this on the `FieldType`. The same thought applies to the float-oriented constructor. ## lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java: ## @@ -249,6 +261,29 @@ private void writeSortingField(FieldWriter fieldData, int maxDoc, Sorter.DocMap mockGraph); } + private long writeSortedFloat32Vectors(FieldWriter fieldData, int[] ordMap) + throws IOException { +long vectorDataOffset = vectorData.alignFilePointer(Float.BYTES); +final ByteBuffer buffer = +ByteBuffer.allocate(fieldData.dim * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN); +final BytesRef binaryValue = new BytesRef(buffer.array()); +for (int ordinal : ordMap) { + float[] vector = (float[]) fieldData.vectors.get(ordinal); + buffer.asFloatBuffer().put(vector); + vectorData.writeBytes(binaryValue.bytes, binaryValue.offset, binaryValue.length); +} +return vectorDataOffset; + } + + private long writeSortedByteVectors(FieldWriter fieldData, int[] ordMap) throws IOException { +long vectorDataOffset = vectorData.alignFilePointer(Float.BYTES); +for (int ordinal : ordMap) { + by
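On the `cardinality` comment in the review above, a minimal sketch (not the actual `KnnVectorQuery` code) of computing the count once and reusing it for both the `k` check and the iterator's cost:

```java
// Sketch only, not the actual KnnVectorQuery code: cache cardinality() in a
// local so the (possibly linear-time) count runs once, then reuse it both for
// the comparison against k and as the iterator's cost.
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.BitSetIterator;

final class CardinalitySketch {

  static BitSetIterator smallFilterIterator(BitSet acceptDocs, int k) {
    int cardinality = acceptDocs.cardinality();
    if (cardinality > k) {
      return null; // caller would run the approximate graph search instead
    }
    return new BitSetIterator(acceptDocs, cardinality);
  }
}
```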