[GitHub] [lucene] shaie commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
shaie commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1149554008 > `facetset` for the exact implementation and `hyperrectangle` for the range implementation If I understand correctly, @gsmiller's proposal was more about efficiency, right? So if you have a large number of points that you'd like to index, an R-Tree might be a better data structure for these range matches than the current linear scan we do in `RangeFSM`. But that doesn't mean we shouldn't have `RangeFSM` for small-dimensionality cases, or cases with a small number of FacetSets per doc, right? They might even perform better? IMO, and to echo Greg's "progress not perfection", I feel it's better if we introduce the `HyperRectangle` impls (1) when we have a good use case to demonstrate the need for it (it just helps to reason about it) and (2) when we're actually ready to implement it "more efficiently". That is, in this PR I feel like we can do w/ `RangeFSM` and later (if and when we have a better alternative for large dims) document and reference the alternative? > For the second question, I think we should keep this as a `long[]` based API as we know we want to make the KD tree and R tree optimizations in the future If we think that KD-Trees can be more efficient for `FacetSetMatcher` impls and we're sure that we'll get there _soon_, then yeah, we can move to a `long[]` based API. Another alternative I thought of is to keep the `byte[]` API, but introduce a helper method on `FSM` which converts the `byte[]` to `long[]` for whoever needs it. To re-iterate what I previously wrote about API back-compat, I feel that putting the `@lucene.experimental` tag on the classes is enough to have us not worry about changing it. Especially if the API hasn't been released yet, and definitely if we intend to follow with a KD-Tree impl very soon. A year from now (just an example) it's a different story, but for now I don't think we need to finalize the API for good. 
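The `byte[]`-to-`long[]` helper floated above could be sketched roughly as follows. This is only an illustration, not the actual `FacetSetMatcher` API: the class name `FacetSetBytes`, the method name `toLongs`, and the assumption that each dimension is packed as 8 big-endian bytes are all hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical helper sketching the suggested byte[] -> long[] conversion.
// Assumes each dimension is stored as 8 big-endian bytes; this matches the
// fixed-width packing discussed in the PR but is not a committed API.
public class FacetSetBytes {
  public static long[] toLongs(byte[] packed) {
    long[] dims = new long[packed.length / Long.BYTES];
    ByteBuffer buf = ByteBuffer.wrap(packed); // big-endian by default
    for (int i = 0; i < dims.length; i++) {
      dims[i] = buf.getLong();
    }
    return dims;
  }
}
```

A matcher that prefers `long`-based comparisons could call such a helper once per candidate, leaving the on-disk `byte[]` encoding unchanged.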
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551429#comment-17551429 ] Adrien Grand commented on LUCENE-10396: --- Another potential use-case we are interested in for Elasticsearch would be to have the ability to visit a single document per field value for a field that is the primary index sort. This has a few applications, one of them is to compute the number of unique values of this primary sort field for documents that match a query. The collector could implement {{LeafCollector#competitiveIterator}} by using the sparse index to skip all documents that have the same value as the last collected hit. > Automatically create sparse indexes for sort fields > --- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: sorted_conjunction.png > > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. 
In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
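The "store every N-th value" idea can be sketched in a few lines. This is only an illustration of the concept (the class name, method names, and sampling interval are made up, and a real sparse index would get its own file format abstraction as the issue says):

```java
import java.util.Arrays;

// Sketch of the sparse-index idea: sample every N-th value of the primary
// sort field, then binary-search the samples to bound where a value can occur.
public class SparseIndexSketch {
  static final int N = 4; // sampling interval, chosen arbitrarily for the demo

  // sortedValues: the index-sorted doc values; returns every N-th value.
  static long[] sample(long[] sortedValues) {
    long[] samples = new long[(sortedValues.length + N - 1) / N];
    for (int i = 0; i < samples.length; i++) {
      samples[i] = sortedValues[i * N];
    }
    return samples;
  }

  // Index of the sampled block that may contain `target`: binary-search the
  // samples; the true boundary then lies within N docs of the block start.
  static int candidateBlock(long[] samples, long target) {
    int idx = Arrays.binarySearch(samples, target);
    return idx >= 0 ? idx : Math.max(0, -idx - 2);
  }
}
```

A range query on the primary sort field would binary-search the samples to find a candidate block, then scan at most N docs to locate the exact boundary, instead of touching every value; this is what would make `IndexSortSortedNumericDocValuesRangeQuery` less I/O intensive.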
[jira] [Created] (LUCENE-10604) Improve ability to test and debug triangulation algorithm in Tessellator
Craig Taverner created LUCENE-10604: --- Summary: Improve ability to test and debug triangulation algorithm in Tessellator Key: LUCENE-10604 URL: https://issues.apache.org/jira/browse/LUCENE-10604 Project: Lucene - Core Issue Type: Test Components: core/other Reporter: Craig Taverner While working on https://issues.apache.org/jira/browse/LUCENE-10563, it was found to be extremely difficult to debug the inner workings of the triangulation algorithm in the Tessellator class. To make things easier, we made use of a monitor interface that tests could use to instrument the algorithm and receive detailed information on the state of the algorithm's progress. This issue requests that this interface and a test implementation of it be included in Lucene to facilitate future bug-fixing of this complex algorithm.
[jira] [Commented] (LUCENE-10604) Improve ability to test and debug triangulation algorithm in Tessellator
[ https://issues.apache.org/jira/browse/LUCENE-10604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551470#comment-17551470 ] Ignacio Vera commented on LUCENE-10604: --- +1 This algorithm is very difficult to debug for big, complex polygons. I have seen the monitor interface in action and it was extremely useful to understand what the algorithm was doing. > Improve ability to test and debug triangulation algorithm in Tessellator > > > Key: LUCENE-10604 > URL: https://issues.apache.org/jira/browse/LUCENE-10604 > Project: Lucene - Core > Issue Type: Test > Components: core/other >Reporter: Craig Taverner >Priority: Minor > Labels: geo, pull-request-available > > While working on https://issues.apache.org/jira/browse/LUCENE-10563, it was > found to be extremely difficult to debug the inner workings of the > triangulation algorithm in the Tessellator class. To make things easier, we > made use of a monitor interface that tests could use to instrument the > algorithm and receive detailed information on the state of the algorithms > progress. This issue requests that this interface and a test implementation > of it be included in Lucene to facilitate future bug-fixing of this complex > algorithm.
[GitHub] [lucene] iverase merged pull request #933: LUCENE-10563: Unable to Tessellate polygon
iverase merged PR #933: URL: https://github.com/apache/lucene/pull/933
[jira] [Commented] (LUCENE-10563) Unable to Tessellate polygon
[ https://issues.apache.org/jira/browse/LUCENE-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551492#comment-17551492 ] ASF subversion and git services commented on LUCENE-10563: -- Commit 1dceff12c8bb764255927be55e46f4d435f47aed in lucene's branch refs/heads/main from Craig Taverner [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1dceff12c8b ] LUCENE-10563: Fix failure to tessellate complex polygon (#933) > Unable to Tessellate polygon > > > Key: LUCENE-10563 > URL: https://issues.apache.org/jira/browse/LUCENE-10563 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 9.1 >Reporter: Yixun Xu >Priority: Major > Attachments: polygon-1.json, polygon-2.json, polygon-3.json > > Time Spent: 0.5h > Remaining Estimate: 0h > > Following up to LUCENE-10470, I found some more polygons that cause > {{Tessellator.tessellate}} to throw "Unable to Tessellate shape", which are > not covered by the fix to LUCENE-10470. I attached the geojson of 3 failing > shapes that I got, and this is the > [branch|https://github.com/apache/lucene/compare/main...yixunx:yx/reproduce-tessellator-error?expand=1#diff-5e8e8052af8b8618e7e4325b7d69def4d562a356acbfea3e983198327c7c8d18R17-R19] > I am testing on that demonstrates the tessellation failures. > > [^polygon-1.json] > [^polygon-2.json] > [^polygon-3.json]
[jira] [Commented] (LUCENE-10563) Unable to Tessellate polygon
[ https://issues.apache.org/jira/browse/LUCENE-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551493#comment-17551493 ] ASF subversion and git services commented on LUCENE-10563: -- Commit b7231bb54884f9ce0232430c4a60cdb5753c6b82 in lucene's branch refs/heads/branch_9x from Craig Taverner [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b7231bb5488 ] LUCENE-10563: Fix failure to tessellate complex polygon (#933) # Conflicts: # lucene/CHANGES.txt > Unable to Tessellate polygon > > > Key: LUCENE-10563 > URL: https://issues.apache.org/jira/browse/LUCENE-10563 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 9.1 >Reporter: Yixun Xu >Priority: Major > Attachments: polygon-1.json, polygon-2.json, polygon-3.json > > Time Spent: 0.5h > Remaining Estimate: 0h > > Following up to LUCENE-10470, I found some more polygons that cause > {{Tessellator.tessellate}} to throw "Unable to Tessellate shape", which are > not covered by the fix to LUCENE-10470. I attached the geojson of 3 failing > shapes that I got, and this is the > [branch|https://github.com/apache/lucene/compare/main...yixunx:yx/reproduce-tessellator-error?expand=1#diff-5e8e8052af8b8618e7e4325b7d69def4d562a356acbfea3e983198327c7c8d18R17-R19] > I am testing on that demonstrates the tessellation failures. > > [^polygon-1.json] > [^polygon-2.json] > [^polygon-3.json]
[jira] [Commented] (LUCENE-10604) Improve ability to test and debug triangulation algorithm in Tessellator
[ https://issues.apache.org/jira/browse/LUCENE-10604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551498#comment-17551498 ] Craig Taverner commented on LUCENE-10604: - I have created a PR for this at https://github.com/apache/lucene/pull/946 > Improve ability to test and debug triangulation algorithm in Tessellator > > > Key: LUCENE-10604 > URL: https://issues.apache.org/jira/browse/LUCENE-10604 > Project: Lucene - Core > Issue Type: Test > Components: core/other >Reporter: Craig Taverner >Priority: Minor > Labels: geo, pull-request-available > > While working on https://issues.apache.org/jira/browse/LUCENE-10563, it was > found to be extremely difficult to debug the inner workings of the > triangulation algorithm in the Tessellator class. To make things easier, we > made use of a monitor interface that tests could use to instrument the > algorithm and receive detailed information on the state of the algorithms > progress. This issue requests that this interface and a test implementation > of it be included in Lucene to facilitate future bug-fixing of this complex > algorithm.
[jira] [Resolved] (LUCENE-10563) Unable to Tessellate polygon
[ https://issues.apache.org/jira/browse/LUCENE-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Vera resolved LUCENE-10563. --- Fix Version/s: 9.3 Assignee: Ignacio Vera Resolution: Fixed Note that polygon 1 was a real bug but the other 2 polygons are illegal and the error that is thrown is legit. > Unable to Tessellate polygon > > > Key: LUCENE-10563 > URL: https://issues.apache.org/jira/browse/LUCENE-10563 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 9.1 >Reporter: Yixun Xu >Assignee: Ignacio Vera >Priority: Major > Fix For: 9.3 > > Attachments: polygon-1.json, polygon-2.json, polygon-3.json > > Time Spent: 0.5h > Remaining Estimate: 0h > > Following up to LUCENE-10470, I found some more polygons that cause > {{Tessellator.tessellate}} to throw "Unable to Tessellate shape", which are > not covered by the fix to LUCENE-10470. I attached the geojson of 3 failing > shapes that I got, and this is the > [branch|https://github.com/apache/lucene/compare/main...yixunx:yx/reproduce-tessellator-error?expand=1#diff-5e8e8052af8b8618e7e4325b7d69def4d562a356acbfea3e983198327c7c8d18R17-R19] > I am testing on that demonstrates the tessellation failures. > > [^polygon-1.json] > [^polygon-2.json] > [^polygon-3.json]
[GitHub] [lucene] jpountz commented on a diff in pull request #935: LUCENE-10599: Improve LogMergePolicy's handling of maxMergeSize.
jpountz commented on code in PR #935: URL: https://github.com/apache/lucene/pull/935#discussion_r892145811 ## lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java: ## @@ -568,23 +568,41 @@ public MergeSpecification findMerges( // Finally, record all merges that are viable at this level: int end = start + mergeFactor; while (end <= 1 + upto) { -boolean anyTooLarge = false; boolean anyMerging = false; +long mergeSize = 0; +long mergeDocs = 0; for (int i = start; i < end; i++) { final SegmentInfoAndLevel segLevel = levels.get(i); final SegmentCommitInfo info = segLevel.info; - anyTooLarge |= - (size(info, mergeContext) >= maxMergeSize - || sizeDocs(info, mergeContext) >= maxMergeDocs); if (mergingSegments.contains(info)) { anyMerging = true; break; } + long segmentSize = size(info, mergeContext); + long segmentDocs = sizeDocs(info, mergeContext); + if (mergeSize + segmentSize > maxMergeSize || mergeDocs + segmentDocs > maxMergeDocs) { +// This merge is full, stop adding more segments to it +if (i == start) { + // This segment alone is too large, return a singleton merge + if (verbose(mergeContext)) { +message( +"" + i + " is larger than the max merge size/docs; ignoring", mergeContext); + } + end = i + 1; +} else { + // Previous segments are under the max merge size, return them + end = i; +} +break; + } + mergeSize += segmentSize; + mergeDocs += segmentDocs; } -if (anyMerging) { - // skip -} else if (!anyTooLarge) { +if (anyMerging || end - start <= 1) { + // skip: there is an ongoing merge at the current level or the computed merge has a single + // segment and this merge policy doesn't do singleton merges +} else { Review Comment: Correct. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [lucene] jpountz commented on a diff in pull request #935: LUCENE-10599: Improve LogMergePolicy's handling of maxMergeSize.
jpountz commented on code in PR #935: URL: https://github.com/apache/lucene/pull/935#discussion_r892164198 ## lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java: ## @@ -568,23 +568,41 @@ public MergeSpecification findMerges( // Finally, record all merges that are viable at this level: Review Comment: > how could it guarantee that the level decided in range of [start,end) won't contain segments that have lower level than levelBottom? LogMergePolicy doesn't try to provide this guarantee. It's actually important it doesn't try to provide this guarantee, otherwise it could end up with lots of unmerged segments. For instance imagine that you have 9 segments (S1..S9) on level 10 then one segment (S10) on level 9, one more segment on level 10 (S11) and then potentially other segments. If LogMergePolicy refused to merge segments that are on a lower level then it could never merge together segments S1..S10. This is because segment S10 can only be merged with segments that are on a higher level because both the previous and the next segment are on a higher level, and LogMergePolicy only merges adjacent segments. This is a downside of LogMergePolicy compared to TieredMergePolicy: because it only performs merges of adjacent segments, it sometimes has to return merges where not all segments are on the same level.
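The S1..S11 example can be checked mechanically. The snippet below is illustrative only (not LogMergePolicy code): it verifies that the level-9 segment S10 (0-based index 9) has no window of adjacent segments at its own level or lower, so an adjacent-only merge policy must merge it with higher-level neighbors.

```java
// Illustration of the adjacency constraint from the comment above.
public class AdjacentMergeDemo {
  // Levels of segments S1..S11: nine at level 10, S10 at level 9, S11 at level 10.
  static final int[] LEVELS = {10, 10, 10, 10, 10, 10, 10, 10, 10, 9, 10};

  // True if some contiguous window of `size` segments containing index `i`
  // consists only of segments at LEVELS[i]'s level or lower.
  static boolean canMergeAtOwnLevel(int i, int size) {
    for (int start = Math.max(0, i - size + 1);
        start <= i && start + size <= LEVELS.length;
        start++) {
      boolean ok = true;
      for (int j = start; j < start + size; j++) {
        if (LEVELS[j] > LEVELS[i]) { // a higher-level neighbor blocks this window
          ok = false;
          break;
        }
      }
      if (ok) {
        return true;
      }
    }
    return false;
  }
}
```

Since no window around S10 stays at level 9 or lower, an adjacent-only policy either mixes levels in a merge or leaves S10 unmerged forever, which is exactly the trade-off described above.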
[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester
kaivalnp commented on code in PR #932: URL: https://github.com/apache/lucene/pull/932#discussion_r892189591 ## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ## @@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) { return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]); } +public void setBitSet(BitSet bitSet, int cost) { + bitSets[ord] = bitSet; Review Comment: Thank you @jtibshirani @msokolov for your input! I agree that we shouldn't modify core classes for tests, and we can create a separate issue for possible collection optimizations However I feel that this test is important as well, for deciding which query would be suitable for pre-filtering (by comparing with post-filtering + over selection time) and we should address this change
[GitHub] [lucene] mikemccand commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
mikemccand commented on PR #633: URL: https://github.com/apache/lucene/pull/633#issuecomment-1149911664 My beasting has not uncovered any more failures! I think this is ready! I'll try to merge soon, just to `main` ... let's let it bake there for a while before backporting to 9.x? Thanks @vigyasharma.
[GitHub] [lucene] iverase commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
iverase commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892375896 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); Review Comment: Maybe we should just pass the depth here so we are not allocating those strings every time?
[GitHub] [lucene] iverase commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
iverase commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892379156 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -131,16 +136,24 @@ public static List tessellate(final Polygon polygon, boolean checkSelf } // Calculate the tessellation using the doubly LinkedList. List result = -earcutLinkedList(polygon, outerNode, new ArrayList<>(), State.INIT, mortonOptimized); +earcutLinkedList( +polygon, outerNode, new ArrayList<>(), State.INIT, mortonOptimized, monitor, 0); if (result.size() == 0) { + notifyMonitor("FAILED", monitor, null, result); Review Comment: Shall we extract static strings instead for FAILED, COMPLETED, ...? ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); + earcutLinkedList( + polygon, searchNode, tessellation, State.INIT, mortonOptimized, monitor, depth); + earcutLinkedList( + polygon, splitNode, tessellation, State.INIT, mortonOptimized, monitor, depth); + notifyMonitorSplitEnd("SPLIT[" + depth + "]", monitor); Review Comment: same here ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1472,60 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algoirithm. This is of use for debugging complex cases, as well as gaining Review Comment: s/algoirithm/algorithm -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[jira] [Created] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
sun wuqiang created LUCENE-10605: Summary: fix error in 32bit jvm object alignment gap calculation Key: LUCENE-10605 URL: https://issues.apache.org/jira/browse/LUCENE-10605 Project: Lucene - Core Issue Type: Improvement Components: core/other Affects Versions: 9.2 Environment: jdk 7 32-bit jdk 8 32-bit Reporter: sun wuqiang Attachments: image-2022-06-08-20-50-27-712.png, image-2022-06-08-21-24-57-674.png ArrayUtil.*oversize*(int minTargetSize, int bytesPerElement) is used to calculate the optimal length of an array during expansion. According to the current logic, in order to avoid the space wasted by the *object alignment gap*, on a *32-bit* JVM the array length will be selected from the numbers in the "current optional" column of the table below. But the results aren't perfect. See the table below. !image-2022-06-08-20-50-27-712.png! I used *jol-core* to calculate the object alignment gap:
{code:xml}
<dependency>
  <groupId>org.openjdk.jol</groupId>
  <artifactId>jol-core</artifactId>
  <version>0.16</version>
  <scope>compile</scope>
</dependency>
{code}
Execute the following code:
{code:java}
System.out.println(ClassLayout.parseInstance(new byte[6]).toPrintable());
{code}
!image-2022-06-08-21-24-57-674.png! To further verify that the tool's results are correct, I wrote the following code to infer how much space arrays of different lengths actually occupy, based on when the OOM occurs. The conclusion is consistent with jol-core.
{code:java}
// -Xms16m -Xmx16m
// Used to infer the memory space occupied
// by arrays of various lengths
public static void main(String[] args) {
  byte[][] arr = new byte[1024 * 1024][];
  for (int i = 0; i < arr.length; i++) {
    if (i % 100 == 0) {
      System.out.println(i);
    }
    // According to when the OOM occurs
    // on a 32-bit JVM,
    // arrays with lengths from 5 to 12
    // occupy the same amount of memory
    arr[i] = new byte[5];
  }
}
{code}
*new byte[5]* and *new byte[12]* use the same amount of memory. In addition, *-XX:ObjectAlignmentInBytes* should also affect the return value of this method, but I don't know whether supporting that is necessary.
If necessary, I will modify that as well. Thank you very much!
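The report's numbers can be reproduced with simple arithmetic. The constants below are assumptions, not measured values: the usual 32-bit layout of 12 header bytes for a byte[] (8-byte header plus 4-byte length field) and the default 8-byte object alignment.

```java
// Back-of-the-envelope check of the claim that new byte[5] and new byte[12]
// occupy the same memory on a 32-bit JVM.
public class AlignmentCheck {
  static final int HEADER = 12;   // assumed: 8-byte header + 4-byte array length
  static final int ALIGNMENT = 8; // default -XX:ObjectAlignmentInBytes

  // Bytes actually occupied by a byte[] of the given length, rounded up
  // to the alignment boundary.
  static long occupied(int arrayLength) {
    long raw = HEADER + (long) arrayLength;
    return (raw + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
  }
}
```

Under these assumptions, lengths 5 through 12 all round up to 24 bytes, so an oversize routine that stops at length 5 leaves 7 bytes of alignment padding unused; length 13 crosses into the next 8-byte bucket.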
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892406529 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); Review Comment: Since the split function is called very infrequently (a couple of times for an entire tessellation session), I presume this object allocation is negligible (note how a few lines later we have `return true` so even though this is a while loop, it is called only once). The option of passing in the depth only and creating the string deeper in the stack will result in exactly the same number of object constructions. However, I can see the advantage of test code not creating the strings at all. Since this is called very little, it is only a tiny saving, but a saving nevertheless. I kept it as a string here mainly for symmetry with the other functions (notifyMonitor).
[GitHub] [lucene] wormday opened a new pull request, #949: LUCENE-10605: fix error in 32bit jvm object alignment gap calculation
wormday opened a new pull request, #949: URL: https://github.com/apache/lucene/pull/949 ### Description https://issues.apache.org/jira/browse/LUCENE-10605 ArrayUtil.oversize() generates array lengths that do not eliminate the object alignment gap, wasting some space
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892428974 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -131,16 +136,24 @@ public static List tessellate(final Polygon polygon, boolean checkSelf } // Calculate the tessellation using the doubly LinkedList. List result = -earcutLinkedList(polygon, outerNode, new ArrayList<>(), State.INIT, mortonOptimized); +earcutLinkedList( +polygon, outerNode, new ArrayList<>(), State.INIT, mortonOptimized, monitor, 0); if (result.size() == 0) { + notifyMonitor("FAILED", monitor, null, result); Review Comment: Good idea
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892429207 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1472,60 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algorithm. This is of use for debugging complex cases, as well as gaining Review Comment: Definitely!
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892430679 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); Review Comment: I realize that even though we create these strings infrequently, we should not create them at all if `monitor==null`, so I moved the creation one level down in the stack to get that behaviour. Hopefully this is sufficient?
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892431465 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); + earcutLinkedList( + polygon, searchNode, tessellation, State.INIT, mortonOptimized, monitor, depth); + earcutLinkedList( + polygon, splitNode, tessellation, State.INIT, mortonOptimized, monitor, depth); + notifyMonitorSplitEnd("SPLIT[" + depth + "]", monitor); Review Comment: Moved creation down one level to get it after the `if(monitor==null)` check.
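The allocation-avoidance pattern discussed in this thread can be sketched in isolation (class and method names here are illustrative, not the actual Tessellator code): the status string is concatenated only after the null check, so a disabled monitor costs nothing.

```java
// Illustrative sketch of the pattern from the review thread, not the real
// Tessellator code: build the "SPLIT[depth]" label only when a monitor is
// actually attached, so no String garbage is created otherwise.
final class MonitorNotifySketch {
  interface Monitor {
    void currentState(String status);
  }

  static void notifyMonitorSplit(Monitor monitor, int depth) {
    if (monitor == null) {
      return; // fast path: no string concatenation, no allocation
    }
    monitor.currentState("SPLIT[" + depth + "]"); // built only when needed
  }
}
```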
[GitHub] [lucene] iverase commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
iverase commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892480730 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1471,71 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algorithm. This is of use for debugging complex cases, as well as gaining insight + * into the way the algorithm works. Data provided includes a status string containing the current + * mode, list of points representing the current linked-list of internal nodes used for + * triangulation, and a list of triangles so far created by the algorithm. + */ + public interface Monitor { +String FAILED = "FAILED"; +String COMPLETED = "COMPLETED"; Review Comment: static?
[GitHub] [lucene] iverase commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
iverase commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892481023 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1471,71 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algorithm. This is of use for debugging complex cases, as well as gaining insight + * into the way the algorithm works. Data provided includes a status string containing the current + * mode, list of points representing the current linked-list of internal nodes used for + * triangulation, and a list of triangles so far created by the algorithm. + */ + public interface Monitor { +String FAILED = "FAILED"; +String COMPLETED = "COMPLETED"; Review Comment: and final
[GitHub] [lucene] iverase commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
iverase commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892482033 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); Review Comment: yes, that is what I was thinking, thanks!
[GitHub] [lucene] iverase commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
iverase commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892487112 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1471,71 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algorithm. This is of use for debugging complex cases, as well as gaining insight + * into the way the algorithm works. Data provided includes a status string containing the current + * mode, list of points representing the current linked-list of internal nodes used for + * triangulation, and a list of triangles so far created by the algorithm. + */ + public interface Monitor { +String FAILED = "FAILED"; +String COMPLETED = "COMPLETED"; Review Comment: I see, it is an interface so they are static and final by default, sorry :)
[GitHub] [lucene] msokolov commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
msokolov commented on PR #947: URL: https://github.com/apache/lucene/pull/947#issuecomment-1150043099 If folks have time to review, that would be great. The main things to focus on, I think, are: 1. how the HnswGraphSearcher/Writer implementation is forked into `float[]` and `BytesRef` using generics; 2. adding a VERSION to the index format so we can handle back-compat in a reasonable way; 3. the fact that some of the details of VectorSimilarityFunction are now directly handled in util.hnsw -- I didn't see a better way, but it blurs the abstraction boundary
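Point 1 can be illustrated roughly in miniature (hypothetical names; the actual HnswGraphSearcher refactor is in the PR itself): the traversal/scoring call site is written once against a type parameter, and only the similarity computation differs between `float[]` and byte-based vectors.

```java
// Standalone sketch of forking an implementation over the vector type via
// generics: one generic scoring entry point, two type-specific similarity
// functions. Names are illustrative, not Lucene's actual classes.
import java.util.function.BiFunction;

final class VectorScoreSketch {
  // The shared code path: works for any vector representation T.
  static <T> float score(T query, T candidate, BiFunction<T, T, Float> similarity) {
    return similarity.apply(query, candidate);
  }

  // float[]-specific similarity (dot product).
  static float dotFloat(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  // byte[]-specific similarity, e.g. for 8-bit quantized vectors.
  static float dotBytes(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  public static void main(String[] args) {
    float f = score(new float[] {1f, 2f}, new float[] {3f, 4f}, VectorScoreSketch::dotFloat);
    float b = score(new byte[] {1, 2}, new byte[] {3, 4}, VectorScoreSketch::dotBytes);
    System.out.println(f + " " + b); // 11.0 11.0
  }
}
```

This is also where the abstraction blur mentioned in point 3 shows up: the similarity function must know the concrete vector representation, so it cannot live entirely behind the generic interface.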
[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester
jtibshirani commented on code in PR #932: URL: https://github.com/apache/lucene/pull/932#discussion_r892605221 ## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ## @@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) { return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]); } +public void setBitSet(BitSet bitSet, int cost) { + bitSets[ord] = bitSet; Review Comment: Sounds good, it definitely makes sense to do both changes. @kaivalnp would you like to open a new JIRA issue focused on using cached/ precomputed filters directly? I'm also happy to file the issue and tag you for credit + awareness.
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892608782 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1471,71 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algorithm. This is of use for debugging complex cases, as well as gaining insight + * into the way the algorithm works. Data provided includes a status string containing the current + * mode, list of points representing the current linked-list of internal nodes used for + * triangulation, and a list of triangles so far created by the algorithm. + */ + public interface Monitor { +String FAILED = "FAILED"; +String COMPLETED = "COMPLETED"; Review Comment: It is in an interface, so `public static final` is implied.
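For readers following along, the rule the thread settled on can be shown in two lines (a standalone example, not the Tessellator source):

```java
// Fields declared in an interface are implicitly public, static and final,
// so the two declarations below are equivalent.
interface ConstantsDemo {
  String FAILED = "FAILED";
  public static final String COMPLETED = "COMPLETED"; // redundant modifiers
}
```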
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892626127 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); Review Comment: Great. I made a similar change to the main call too, so all string creation (other than failed) is done after the if statement.
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892627220 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1471,71 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algorithm. This is of use for debugging complex cases, as well as gaining insight + * into the way the algorithm works. Data provided includes a status string containing the current + * mode, list of points representing the current linked-list of internal nodes used for + * triangulation, and a list of triangles so far created by the algorithm. + */ + public interface Monitor { +String FAILED = "FAILED"; +String COMPLETED = "COMPLETED"; Review Comment: Woops, I see you already realized that. I responded once I saw the first message, but my UI had not updated to show the others.
[jira] [Commented] (LUCENE-10602) Dynamic Index Cache Sizing
[ https://issues.apache.org/jira/browse/LUCENE-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551712#comment-17551712 ] Adrien Grand commented on LUCENE-10602: --- I work with Chris and I suggested that he open this issue; I'll try to provide a bit more context. It's not uncommon for us to have nodes that handle several TBs of data. With documents that need ~250 bytes each, which is also typical, this gives ~4.4B documents per TB of data. Caching a query for 4.4B documents requires ~520MB of memory assuming one bit per document. So if we want to be able to cache, say, 4 queries across 1TB of data, then we need ~2GB of heap for the query cache. We could give less memory to the cache, but this would increase the risk that every new entry in the cache evicts a hot entry from the cache. This is potentially an issue for the query cache since computing cache entries has overhead: it requires evaluating all documents that match the query, while the query that is being cached might be used in a conjunction that only requires evaluating a subset of the matching docs. But we're also seeing the opposite case, when the cache is oversized for the amount of data that a node handles. And because the cache only evicts when it's full or when segments get closed, the cache will often grow until it's completely full, even though most cache entries never get used. We don't know at node startup time how much data this node is going to handle eventually, which makes it impossible to size the query cache correctly. So if Lucene's query cache could evict entries from the cache when they appear to be very little used, this would help avoid spending large amounts of heap on useless cache entries. 
> Dynamic Index Cache Sizing > -- > > Key: LUCENE-10602 > URL: https://issues.apache.org/jira/browse/LUCENE-10602 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Chris Earle >Priority: Major > > Working with Lucene's filter cache, it has become apparent that it can be an > enormous drain on the heap and therefore the JVM. After extensive usage of an > index, it is not uncommon to tune performance by shrinking or altogether > removing the filter cache. > Lucene tracks hit/miss stats of the filter cache, but it does nothing with > the data other than inform an interested user about the effectiveness of > their index's caching. > It would be interesting if Lucene would be able to tune the index filter > cache heuristically based on actual usage (age, frequency, and value). > This could ultimately be used to give GBs of heap back to an individual > Lucene instance instead of burning it on cache storage that's not effectively > used (or useful). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
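The sizing figures quoted in the discussion (1 TiB of data, ~250 bytes per document, one bit per document per cached query) can be checked with quick arithmetic:

```java
// Back-of-the-envelope check of the heap numbers quoted in the discussion.
final class QueryCacheSizingMath {
  public static void main(String[] args) {
    long oneTiB = 1L << 40;             // bytes in 1 TiB
    long docs = oneTiB / 250;           // ~4.4 billion documents
    long bitsetBytes = docs / 8;        // one bit per doc per cache entry
    long fourEntries = 4 * bitsetBytes; // heap needed to cache 4 queries
    System.out.println(docs);                     // 4398046511
    System.out.println(bitsetBytes / (1 << 20));  // 524 (~520 MiB per entry)
    System.out.println(fourEntries / (1L << 30)); // 2 (~2 GiB total)
  }
}
```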
[jira] [Commented] (LUCENE-10602) Dynamic Index Cache Sizing
[ https://issues.apache.org/jira/browse/LUCENE-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551717#comment-17551717 ] Uwe Schindler commented on LUCENE-10602: If all this is needed, why not implement your own QueryCache instance for ES? Then you have more control over when and how something is cached or evicted. There's no need to use IndexSearcher#DEFAULT_QUERY_CACHE; you can set your own (globally, as the default per node, or per searcher). Maybe it would be good to get more context, but actually the caching policy is more up to the application. IMHO, it should be enough to implement a sophisticated https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/QueryCache.java The LRUQueryCache in Lucene is just an example.
[jira] [Commented] (LUCENE-10602) Dynamic Index Cache Sizing
[ https://issues.apache.org/jira/browse/LUCENE-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551719#comment-17551719 ] Uwe Schindler commented on LUCENE-10602: Elasticsearch could for example subclass LRUQueryCache and disallow caching of results for some (huge) shards or make sure no smaller entries are expunged.
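Subclassing `LRUQueryCache` is one route; the least-recently-used bookkeeping such a cache builds on can be sketched independently with a plain access-ordered `LinkedHashMap`. This is a standalone toy that budgets by entry count (Lucene's real cache also accounts for RAM usage), illustrating usage-based eviction as the issue suggests:

```java
// Toy usage-aware cache: an access-ordered LinkedHashMap drops the
// least-recently-used entry once the size budget is exceeded.
import java.util.LinkedHashMap;
import java.util.Map;

final class UsageLruSketch<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  UsageLruSketch(int maxEntries) {
    super(16, 0.75f, true); // accessOrder=true: iteration order tracks usage
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries; // evict the least-recently-used entry
  }
}
```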
[jira] [Commented] (LUCENE-10602) Dynamic Index Cache Sizing
[ https://issues.apache.org/jira/browse/LUCENE-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551720#comment-17551720 ] Uwe Schindler commented on LUCENE-10602: Theoretically you could also serialize huge Bitsets to disk (temp directory) and, with MMapDirectory (and/or Java 19 MemorySegments -> next week at BBUZZ), make an off-heap cache for large shards. I think there is a lot of flexibility in that small interface.
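A bare-bones sketch of that off-heap direction (standalone code using only the JDK, not tied to MMapDirectory or the QueryCache interface): write the bitset's words to a temp file and test bits through a memory-mapped view rather than a heap array.

```java
// Toy illustration: a bitset persisted to a temp file and queried through a
// memory-mapped buffer, so the bits live off-heap.
import java.io.IOException;
import java.nio.LongBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedBitSetSketch {
  // Test one bit in a long-word bitset exposed as a LongBuffer.
  static boolean get(LongBuffer words, long bit) {
    return (words.get((int) (bit >>> 6)) & (1L << (bit & 63))) != 0;
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("bitset", ".bin");
    long[] words = new long[4]; // capacity for 256 bits
    words[1] |= 1L << 5;        // set bit 69 (word 1, bit 5)
    try (FileChannel ch =
        FileChannel.open(file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, words.length * 8L);
      buf.asLongBuffer().put(words);        // write the words through the mapping
      LongBuffer view = buf.asLongBuffer(); // fresh read view over the same region
      System.out.println(get(view, 69) + " " + get(view, 70)); // true false
    }
    Files.deleteIfExists(file);
  }
}
```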
[GitHub] [lucene] zhaih commented on a diff in pull request #935: LUCENE-10599: Improve LogMergePolicy's handling of maxMergeSize.
zhaih commented on code in PR #935: URL: https://github.com/apache/lucene/pull/935#discussion_r892682597 ## lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java: ## @@ -568,23 +568,41 @@ public MergeSpecification findMerges( // Finally, record all merges that are viable at this level: Review Comment: Got it, thanks!
[jira] [Commented] (LUCENE-4574) FunctionQuery ValueSource value computed twice per document
[ https://issues.apache.org/jira/browse/LUCENE-4574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551732#comment-17551732 ] Kevin Risden commented on LUCENE-4574: -- [~dsmiley] / [~jpountz] - is it possible this was addressed by LUCENE-6263 (and somewhat related LUCENE-8405)? I don't see this behavior currently. Even when requesting the score, the score is now cached, so the function isn't computed twice. > FunctionQuery ValueSource value computed twice per document > --- > > Key: LUCENE-4574 > URL: https://issues.apache.org/jira/browse/LUCENE-4574 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 4.0, 4.1 >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Attachments: LUCENE-4574.patch, LUCENE-4574.patch, LUCENE-4574.patch, > LUCENE-4574.patch, Test_for_LUCENE-4574.patch > > > I was working on a custom ValueSource and did some basic profiling and > debugging to see if it was being used optimally. To my surprise, the value > was being fetched twice per document in a row. This computation isn't > exactly cheap to calculate so this is a big problem. I was able to > work-around this problem trivially on my end by caching the last value with > corresponding docid in my FunctionValues implementation. 
> Here is an excerpt of the code path to the first execution: > {noformat} > at > org.apache.lucene.queries.function.docvalues.DoubleDocValues.floatVal(DoubleDocValues.java:48) > at > org.apache.lucene.queries.function.FunctionQuery$AllScorer.score(FunctionQuery.java:153) > at > org.apache.lucene.search.TopFieldCollector$OneComparatorScoringMaxScoreCollector.collect(TopFieldCollector.java:291) > at org.apache.lucene.search.Scorer.score(Scorer.java:62) > at > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:588) > at > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:280) > {noformat} > And here is the 2nd call: > {noformat} > at > org.apache.lucene.queries.function.docvalues.DoubleDocValues.floatVal(DoubleDocValues.java:48) > at > org.apache.lucene.queries.function.FunctionQuery$AllScorer.score(FunctionQuery.java:153) > at > org.apache.lucene.search.ScoreCachingWrappingScorer.score(ScoreCachingWrappingScorer.java:56) > at > org.apache.lucene.search.FieldComparator$RelevanceComparator.copy(FieldComparator.java:951) > at > org.apache.lucene.search.TopFieldCollector$OneComparatorScoringMaxScoreCollector.collect(TopFieldCollector.java:312) > at org.apache.lucene.search.Scorer.score(Scorer.java:62) > at > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:588) > at > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:280) > {noformat} > The 2nd call appears to use some score caching mechanism, which is all well > and good, but that same mechanism wasn't used in the first call so there's no > cached value to retrieve.
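The workaround the reporter describes, caching the last computed value keyed by docid inside a FunctionValues implementation, can be sketched standalone (hypothetical class, not Lucene's actual FunctionValues API):

```java
// Illustrative sketch of the reporter's workaround: memoize the last value
// by docid so a second call for the same document skips the expensive
// computation. Not Lucene's FunctionValues API.
import java.util.function.IntToDoubleFunction;

final class LastValueCachingValues {
  private final IntToDoubleFunction expensive;
  private int lastDoc = -1;
  private double lastValue;
  int computations = 0; // for demonstration only

  LastValueCachingValues(IntToDoubleFunction expensive) {
    this.expensive = expensive;
  }

  double doubleVal(int doc) {
    if (doc != lastDoc) { // only recompute when the docid changes
      lastValue = expensive.applyAsDouble(doc);
      lastDoc = doc;
      computations++;
    }
    return lastValue;
  }
}
```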
[GitHub] [lucene] pminkov commented on pull request #940: Use similarity.tf() in MoreLikeThis
pminkov commented on PR #940: URL: https://github.com/apache/lucene/pull/940#issuecomment-1150241668 I'm uploading the selected words file: [mlt-selected-words.txt](https://github.com/apache/lucene/files/8863975/mlt-selected-words.txt) And the input file: [plots.txt](https://github.com/apache/lucene/files/8863978/plots.txt)
[GitHub] [lucene] msokolov commented on pull request #948: Fix typo in PostingsReaderBase docstring
msokolov commented on PR #948: URL: https://github.com/apache/lucene/pull/948#issuecomment-1150259511 It looks to me as if the writer may have intended `ImpactsEnum` there? Maybe check git blame to see when that edit was made...
[jira] [Commented] (LUCENE-4574) FunctionQuery ValueSource value computed twice per document
[ https://issues.apache.org/jira/browse/LUCENE-4574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551745#comment-17551745 ] Kevin Risden commented on LUCENE-4574: -- Ugh, actually just after I sent that I found a reproducing case in Solr - I don't know if this is Solr specific. Still digging. It looks like bf=functionquery and sort=SOME_FIELD and fl=score. TopFieldCollector.populateScores is the secondary score calculation after MaxScore is already trying to calculate the scores.
[GitHub] [lucene] msokolov commented on pull request #945: Try to fix the gradle compilation in idea
msokolov commented on PR #945:
URL: https://github.com/apache/lucene/pull/945#issuecomment-1150273382

Thanks, @dweiss -- still seems to work for me after your latest fixes.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (LUCENE-4574) FunctionQuery ValueSource value computed twice per document
[ https://issues.apache.org/jira/browse/LUCENE-4574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551750#comment-17551750 ]

David Smiley commented on LUCENE-4574:
--------------------------------------

Maybe, but more recently see LUCENE-10252, which Solr does not yet have, as it was done in 9.1 (Solr is still on Lucene 9.0).

> FunctionQuery ValueSource value computed twice per document
>
>                 Key: LUCENE-4574
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4574
[GitHub] [lucene] vigyasharma commented on pull request #948: Fix typo in PostingsReaderBase docstring
vigyasharma commented on PR #948:
URL: https://github.com/apache/lucene/pull/948#issuecomment-1150395272

> It looks to me as if the writer may have intended `ImpactsEnum` there? Maybe check git blame to see when that edit was made...

Good idea. I checked git history and found [this commit](https://github.com/apache/lucene/commit/c13216934c58870b487a9bd04b2fd4ea24431000#diff-dde065fe5990e588a2a00cb720b3df719655db2cb84d0cb2168519445fedf993), where

```java
/**
 * of this class to manage creation of {@link DocsEnum} and
 * {@link DocsAndPositionsEnum} instances.
 */
```

was replaced by

```java
/**
 * of this class to manage creation of {@link org.apache.lucene.index.PostingsEnum} and
 * {@link org.apache.lucene.index.PostingsEnum} instances.
 */
```

I guess earlier we had separate `DocsEnum` and `DocsAndPositionsEnum`, but now both can be handled within `PostingsEnum`? The class, however, does provide for creating `ImpactsEnum` (added later [here](https://github.com/apache/lucene/commit/af680af77f3f80c779e038a0ad8a136c9dcb9f5d)), and we can mention that in the docstring. I'll update the PR.
[GitHub] [lucene] vigyasharma commented on pull request #948: Fix typo in PostingsReaderBase docstring
vigyasharma commented on PR #948:
URL: https://github.com/apache/lucene/pull/948#issuecomment-1150432604

The "force push" was a rebase on main.
[GitHub] [lucene] gsmiller commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1150439971

Let me try to summarize my understanding of the "future optimization" debate as it pertains to this proposed API/implementation and see if we're on the same page or if I'm overlooking something.

The current proposal encapsulates/delegates matching logic behind `FacetSetMatcher`, which is responsible for determining whether or not that `FSM` instance matches a provided point. There's `RangeFSM`, which knows how to match based on whether or not the point is contained in the n-dim range, and there's `ExactFSM`, which just does exact point equivalence. The "protocol" for doing facet aggregation/counting is implemented inside `MatchingFacetSetsCounts`, which delegates to the `FSM#matches` method.

The inefficiency is that, because `MatchingFacetSetsCounts` can make no assumptions about the `FSM` instances provided to it, it must iterate _all_ provided `FSM` instances for every indexed point for every provided hit. For the single point case (`ExactFSM`), this is crazy, right? Even if there are a small number of provided `ExactFSM` instances we're matching against, doing a linear scan of all of them for every point is pretty dang inefficient, especially for a case where there are many provided hits with many indexed points each. I think the same logic generally holds true for the range case as well, but maybe that's more debatable. But the problem is, `MatchingFacetSetsCounts` doesn't know anything about these `FSM` instances it gets and can't do anything other than just ask them all, "hey, do you match this point?"

And so the debate seems to be how to set up the API to allow for future optimizations, or whether we should even worry about it at all. I personally think we should design with this optimization in mind, but I think we're close, and I don't actually think the current proposal needs to really change to allow for future optimizations.
This is where I get a little fuzzy on the conversation that's taken place, as I haven't totally caught up on the various proposals and conversations. But if we kept the implementation as it currently exists, and in the future we want to put these optimizations in place, could we not just add a method to `FSM` that exposes the min/max values for each dimension? Then `MatchingFacetSetsCounts` could inspect the underlying "hyperrectangles" represented by each `FSM` by looking at these min/max values before it counts, and decide how it wants to proceed. The `matches` method is potentially still useful/needed depending on how flexible we want this thing to be: if a user creates an `FSM` implementation that's more complicated than just a "hyperrectangle" (e.g., some complicated geometry), the contract could be that the provided min/max is a bounding box, but `matches` still needs to be called to actually confirm the match.

I point this out not to propose implementing it now, but to say that I think we have options to extend what is currently proposed here if/when we want to optimize. Does this make sense, or am I talking about a completely different problem or missing key parts of the conversation that's happened? Apologies if I have.
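The extension Greg sketches in words (keep `matches` as the source of truth, additionally expose a per-dimension bounding box) might look roughly like the following. These interface and class names are hypothetical illustrations, not the API in the PR:

```java
// Illustrative sketch only; names are made up and not the PR's API. The idea:
// matches() stays the exact test, while boundingMin()/boundingMax() expose a
// bounding box that the counting code could use to organize matchers into an
// R-tree/KD-tree or do cheap reject tests before calling matches().
interface FacetSetMatcherSketch {
  /** Exact test: does this matcher accept the given point? */
  boolean matches(long[] point);

  /** Bounding box contract: min[i] <= point[i] <= max[i] for every accepted point. */
  long[] boundingMin();
  long[] boundingMax();
}

final class RangeMatcherSketch implements FacetSetMatcherSketch {
  private final long[] min, max;

  RangeMatcherSketch(long[] min, long[] max) {
    this.min = min;
    this.max = max;
  }

  @Override
  public boolean matches(long[] point) {
    // For a pure range matcher the bounding box IS the match criterion.
    for (int i = 0; i < point.length; i++) {
      if (point[i] < min[i] || point[i] > max[i]) return false;
    }
    return true;
  }

  @Override public long[] boundingMin() { return min; }
  @Override public long[] boundingMax() { return max; }
}
```

For a matcher with more complicated geometry, the box would only be a coarse filter and `matches` would still confirm the hit, which is the two-phase contract described above.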
[GitHub] [lucene] mdmarshmallow commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
mdmarshmallow commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1150471678

So based on everyone's comments:

1. It seems like we should ditch the `hyperrectangle` implementation, and `facetset` does everything we need for right now.
2. When we decide to optimize this (right after this PR is merged, ideally), we would let `MatchingFacetSetsCount` take a look at the `FSM`s passed to it and then determine whether it should put them into an R-tree, a KD-tree, or just linearly scan, based on the `min` and `max` of each `FSM`.

I think this makes sense, but we also shouldn't discuss it too much here, as I think this is for another PR. The point is that we can optimize the `facetsets` package in its current state. With that being said, I do plan on writing the KD-tree and R-tree optimizations as soon as this is merged, so I am still for this remaining a `long[]` API.
[jira] [Updated] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
[ https://issues.apache.org/jira/browse/LUCENE-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sun wuqiang updated LUCENE-10605:
---------------------------------
    Issue Type: Bug  (was: Improvement)

> fix error in 32bit jvm object alignment gap calculation
> -------------------------------------------------------
>
>                 Key: LUCENE-10605
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10605
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/other
>    Affects Versions: 9.2
>         Environment: jdk 7 32-bit
>                      jdk 8 32-bit
>            Reporter: sun wuqiang
>            Priority: Trivial
>         Attachments: image-2022-06-08-20-50-27-712.png, image-2022-06-08-21-24-57-674.png
>
> ArrayUtil.{*}oversize{*}(int minTargetSize, int bytesPerElement)
> This method is used to calculate the optimal length of an array during expansion.
>
> According to the current logic, in order to avoid space waste caused by the *object alignment gap*, in a *32-bit* JVM the array length will be chosen from the numbers in the +current optional+ columns of the table below. But the results aren't perfect.
> See the table below.
> !image-2022-06-08-20-50-27-712.png!
>
> I used *jol-core* to calculate the object alignment gap:
> {code:xml}
> <dependency>
>     <groupId>org.openjdk.jol</groupId>
>     <artifactId>jol-core</artifactId>
>     <version>0.16</version>
>     <scope>compile</scope>
> </dependency>
> {code}
>
> Execute the following code:
> {code:java}
> System.out.println(ClassLayout.parseInstance(new byte[6]).toPrintable());
> {code}
>
> !image-2022-06-08-21-24-57-674.png!
>
> To further verify that the tool's results are correct, I wrote the following code to infer how much space arrays of different lengths actually occupy, based on when the OOM occurs. The conclusion is consistent with jol-core.
> {code:java}
> // -Xms16m -Xmx16m
> // Used to infer the memory space occupied
> // by arrays of various lengths
> public static void main(String[] args) {
>     byte[][] arr = new byte[1024 * 1024][];
>     for (int i = 0; i < arr.length; i++) {
>         if (i % 100 == 0) {
>             System.out.println(i);
>         }
>         // According to the OOM occurrence time in a 32-bit JVM,
>         // arrays with lengths from 5 to 12 occupy
>         // the same amount of memory
>         arr[i] = new byte[5];
>     }
> }
> {code}
> *new byte[5]* and *new byte[12]* use the same amount of memory.
>
> In addition, +*-XX:ObjectAlignmentInBytes*+ should also affect the return value of this method, but I don't know whether it is necessary to handle that. If necessary, I will modify it together. Thank you very much!
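The alignment arithmetic behind the report can be sketched as follows. The header size and alignment are assumptions for a 32-bit JVM (12-byte `byte[]` header, default 8-byte object alignment); actual values depend on the JVM and `-XX:ObjectAlignmentInBytes`:

```java
// Sketch of the alignment arithmetic discussed above, assuming a 32-bit JVM
// where a byte[] has a 12-byte header and objects are padded to an 8-byte
// boundary. Under these assumptions byte[5] through byte[12] all occupy
// 24 bytes, so when growing past byte[5] the "best" next length is 12, not 8.
final class AlignmentSketch {
  static final int ARRAY_HEADER_BYTES = 12; // assumed 32-bit byte[] header size
  static final int OBJECT_ALIGNMENT = 8;    // assumed -XX:ObjectAlignmentInBytes

  /** Total heap bytes occupied by a byte[length], padded to the alignment. */
  static int occupiedBytes(int length) {
    int raw = ARRAY_HEADER_BYTES + length;
    // round up to the next multiple of OBJECT_ALIGNMENT
    return ((raw + OBJECT_ALIGNMENT - 1) / OBJECT_ALIGNMENT) * OBJECT_ALIGNMENT;
  }

  /** Largest array length that still fits in the same padded footprint. */
  static int bestLengthForSameFootprint(int length) {
    return occupiedBytes(length) - ARRAY_HEADER_BYTES;
  }
}
```

For example, `occupiedBytes(8)` and `occupiedBytes(12)` both evaluate to 24 under these assumptions, which is the waste the issue says `oversize` leaves on the table when it returns 8.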
[jira] [Updated] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
[ https://issues.apache.org/jira/browse/LUCENE-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sun wuqiang updated LUCENE-10605:
---------------------------------
    Attachment: image-2022-06-09-08-25-55-289.png
[jira] [Updated] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
[ https://issues.apache.org/jira/browse/LUCENE-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sun wuqiang updated LUCENE-10605:
---------------------------------
    Attachment: image-2022-06-09-08-26-36-528.png
[jira] [Updated] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
[ https://issues.apache.org/jira/browse/LUCENE-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sun wuqiang updated LUCENE-10605:
---------------------------------
    Description: edited to add an example:

For example, if I want to expand byte[2], I call the method oversize(2,1) to get the size of the next array, which returns 8. But byte[8] is not the best result: since byte[8] and byte[12] use the same memory space (both are 24 bytes due to the alignment gap), it's best to return 12 here. See the table below.

!image-2022-06-09-08-26-36-528.png!

[remainder of the description unchanged]
[jira] [Updated] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
[ https://issues.apache.org/jira/browse/LUCENE-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sun wuqiang updated LUCENE-10605:
---------------------------------
    Description: minor wording edits [otherwise unchanged]
[jira] [Updated] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
[ https://issues.apache.org/jira/browse/LUCENE-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sun wuqiang updated LUCENE-10605:
---------------------------------
    Affects Version/s: 8.11.1
                       (was: 9.2)
[jira] [Created] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries
Kaival Parikh created LUCENE-10606:
-----------------------------------

             Summary: Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries
                 Key: LUCENE-10606
                 URL: https://issues.apache.org/jira/browse/LUCENE-10606
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/search
            Reporter: Kaival Parikh

While working on this [PR|https://github.com/apache/lucene/pull/932] to add prefilter testing support, we saw that hit collection took a long time for BitSetIterator-backed scorers, due to iteration over the entire underlying BitSet and copying it into an internal one. (Link to [numbers|https://github.com/apache/lucene/pull/932#discussion_r96850], second table.)

These BitSetIterators can be frequent (as they are used in LRUQueryCache), and bulk collection can be optimized with more knowledge of the underlying iterator.
[jira] [Commented] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries
[ https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551976#comment-17551976 ]

Kaival Parikh commented on LUCENE-10606:

Instead of collecting hit-by-hit using a LeafCollector, we can break down the search by instantiating a weight, creating scorers, and checking the underlying iterator. If it is backed by a BitSet, we can directly update the reference (as we won't be editing it). Otherwise, we can create a new BitSet from the iterator using BitSet.of.

This way the collection is optimized (and can be advantageous, as LRUQueryCache internally uses a BitSet, so such iterators will be common). Sample [code|https://github.com/apache/lucene/compare/main...kaivalnp:alternate_collection]
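The reuse-or-materialize branch described in the comment can be sketched roughly as follows. This is a hedged illustration only: `DocIterator`, `BitSetBackedIterator`, and `PrefilterCollector` are hypothetical stand-ins built on `java.util.BitSet`, not Lucene's actual `DocIdSetIterator`/`BitSetIterator`/`BitSet.of` API.

```java
import java.util.BitSet;

// Hypothetical minimal stand-in for a doc-id iterator.
interface DocIterator {
    int nextDoc(); // next matching doc id, or -1 when exhausted
}

// Stand-in for an iterator that is backed by a BitSet
// (analogous to Lucene's BitSetIterator, e.g. from LRUQueryCache).
final class BitSetBackedIterator implements DocIterator {
    private final BitSet bits;
    private int cur = -1;

    BitSetBackedIterator(BitSet bits) { this.bits = bits; }

    BitSet getBitSet() { return bits; }

    @Override
    public int nextDoc() {
        cur = bits.nextSetBit(cur + 1);
        return cur;
    }
}

final class PrefilterCollector {
    // If the iterator is already backed by a BitSet, reuse the reference
    // (it is never modified, so no copy is needed: O(1)). Otherwise
    // materialize a BitSet by iterating, analogous to BitSet.of.
    static BitSet collect(DocIterator it, int maxDoc) {
        if (it instanceof BitSetBackedIterator) {
            return ((BitSetBackedIterator) it).getBitSet();
        }
        BitSet bits = new BitSet(maxDoc);
        for (int doc = it.nextDoc(); doc >= 0; doc = it.nextDoc()) {
            bits.set(doc);
        }
        return bits;
    }
}
```

The fast path skips the per-document iteration entirely, which is where the large pre-filter speedups in the benchmark tables below plausibly come from when the filter is served out of the query cache.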
[jira] [Updated] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries
[ https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaival Parikh updated LUCENE-10606:
---
Description:
While working on this [PR|https://github.com/apache/lucene/pull/932] to add prefilter testing support, we saw that hit collection took a long time for BitSetIterator-backed scorers (due to iteration over the entire underlying BitSet, and copying it into an internal one)

These BitSetIterators can be frequent (as they are used in LRUQueryCache), and bulk collection can be optimized with more knowledge of the underlying iterator

was:
While working on this [PR|https://github.com/apache/lucene/pull/932] to add prefilter testing support, we saw that hit collection took a long time for BitSetIterator-backed scorers (due to iteration over the entire underlying BitSet, and copying it into an internal one) (Link to [numbers|https://github.com/apache/lucene/pull/932#discussion_r96850], second table)

These BitSetIterators can be frequent (as they are used in LRUQueryCache), and bulk collection can be optimized with more knowledge of the underlying iterator
[jira] [Comment Edited] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries
[ https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551976#comment-17551976 ]

Kaival Parikh edited comment on LUCENE-10606 at 6/9/22 6:21 AM:

Instead of collecting hit-by-hit using a LeafCollector, we can break down the search by instantiating a weight, creating scorers, and checking the underlying iterator. If it is backed by a BitSet, we can directly update the reference (as we won't be editing it). Otherwise, we can create a new BitSet from the iterator using BitSet.of.

This way the collection is optimized (and can be advantageous, as LRUQueryCache internally uses a BitSet, so such iterators will be common). Sample [code|https://github.com/apache/lucene/compare/main...kaivalnp:alternate_collection]

This is the baseline:
||selectivity||effective topK||post-filter recall||post-filter time||pre-filter recall||pre-filter time||
|0.8|125|0.967|1.55|0.976|19.65|
|0.6|166|0.964|1.94|0.981|17.79|
|0.4|250|0.961|2.69|0.986|14.71|
|0.2|500|0.958|4.78|0.992|11.19|
|0.1|1000|0.959|8.53|0.994|11.50|
|0.01|1|0.937|58.32|1.000|10.34|

This is after alternate collection:
||selectivity||effective topK||post-filter recall||post-filter time||pre-filter recall||pre-filter time||
|0.8|125|0.961|1.56|0.976|1.64|
|0.6|166|0.960|1.93|0.980|1.98|
|0.4|250|0.961|2.68|0.987|2.68|
|0.2|500|0.960|4.76|0.991|4.55|
|0.1|1000|0.961|8.50|0.995|7.84|
|0.01|1|0.953|58.17|1.000|9.71|
[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester
kaivalnp commented on code in PR #932: URL: https://github.com/apache/lucene/pull/932#discussion_r893116012

## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
@@ -225,6 +225,11 @@
   public BitSetIterator getIterator(int contextOrd) {
     return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
   }

+  public void setBitSet(BitSet bitSet, int cost) {
+    bitSets[ord] = bitSet;

Review Comment:
Thank you! I have opened a JIRA [issue](https://issues.apache.org/jira/browse/LUCENE-10606), hope it is okay. Please feel free to suggest edits / alternate approaches / provide feedback.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries
[ https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551976#comment-17551976 ]

Kaival Parikh edited comment on LUCENE-10606 at 6/9/22 6:29 AM:

Instead of collecting hit-by-hit using a LeafCollector, we can break down the search by instantiating a weight, creating scorers, and checking the underlying iterator. If it is backed by a BitSet, we can directly update the reference (as we won't be editing it). Otherwise, we can create a new BitSet from the iterator using BitSet.of.

This way the collection is optimized (and can be advantageous, as LRUQueryCache internally uses a BitSet, so such iterators will be common). Sample [code|https://github.com/apache/lucene/compare/main...kaivalnp:alternate_collection]

This is the baseline:
||selectivity||pre-filter recall||pre-filter time||
|0.8|0.976|19.65|
|0.6|0.981|17.79|
|0.4|0.986|14.71|
|0.2|0.992|11.19|
|0.1|0.994|11.50|
|0.01|1.000|10.34|

This is after alternate collection:
||selectivity||pre-filter recall||pre-filter time||
|0.8|0.976|1.64|
|0.6|0.980|1.98|
|0.4|0.987|2.68|
|0.2|0.991|4.55|
|0.1|0.995|7.84|
|0.01|1.000|9.71|