[GitHub] [lucene] gsmiller opened a new issue, #11742: MatchingFacetSetsCounts doesn't properly implement getTopChildren

2022-09-02 Thread GitBox


gsmiller opened a new issue, #11742:
URL: https://github.com/apache/lucene/issues/11742

   ### Description
   
   `MatchingFacetSetsCounts#getTopChildren` is currently just delegating to 
`#getAllChildren`, which isn't really the correct thing to do. We should 
properly implement "top children."
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on issue #11574: TopN is not being used in getTopChildren() [LUCENE-10538]

2022-09-02 Thread GitBox


gsmiller commented on issue #11574:
URL: https://github.com/apache/lucene/issues/11574#issuecomment-1235534712

   Let's resolve this out. This issue grew into a few different spin-offs, 
which are now all resolved. But... it looks like we introduced another case of 
not properly implementing "top children" with facet sets. I created a specific 
issue for this to carry it forward: #11742


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller closed issue #11574: TopN is not being used in getTopChildren() [LUCENE-10538]

2022-09-02 Thread GitBox


gsmiller closed issue #11574: TopN is not being used in getTopChildren() 
[LUCENE-10538]
URL: https://github.com/apache/lucene/issues/11574


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova opened a new pull request, #11743: LUCENE-10592 Better estimate memory for HNSW graph

2022-09-02 Thread GitBox


mayya-sharipova opened a new pull request, #11743:
URL: https://github.com/apache/lucene/pull/11743

   Better estimate memory used for OnHeapHnswGraph,
   as well as add tests.
   
   Also don't over-allocate arrays in NeighborArray.
   
   Relates to #992


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller opened a new pull request, #11744: Remove LongValueFacetCounts#getTopChildrenSortByCount since it provides redundant functionality

2022-09-02 Thread GitBox


gsmiller opened a new pull request, #11744:
URL: https://github.com/apache/lucene/pull/11744

   ### Description
   
   `LongValueFacetCounts#getTopChildrenSortByCount` does exactly the same thing 
as the more standard `LongValueFacetCounts#getTopChildren`, so we can tighten 
up our API.
   
   Note that this PR would go with 10.0, and I'll put up a separate PR to mark 
it deprecated against 9.x.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rishabhmaurya opened a new issue, #11745: Store summarized results in internal nodes of BKD for time series points

2022-09-02 Thread GitBox


rishabhmaurya opened a new issue, #11745:
URL: https://github.com/apache/lucene/issues/11745

   ### Description
   
   Time series points have a timestamp, measurement and dimensions associated 
with them. The common queries are range queries on timestamp, metric 
aggregation on measurement and grouping on dimensions. Or similar query with 
histogram on timestamp. 
   
   **Proposal:**
   
   Prototype can be found [here](https://github.com/rishabhmaurya/lucene/pull/1)
   
   1. Introduce a new time series point as a field in lucene - `TSPoint` which 
can be added as - 
   
   ```java
   
   Document doc = new Document(); doc.add(new TSIntPoint("tsid1", "cpu", 
timestamp, measurement));
   ```
   
   `tsid1` is the time series ID of the point. It will be the unit of storage 
for time series points and for prototype each of them represents a unique field 
in lucene. 
   
   `timestamp` is the actual point in BKD on which the index is created. 
   
   Full definition can be found 
[here](https://github.com/rishabhmaurya/lucene/blob/0c07f677c4caf45b1499f236adc0217992e46722/lucene/core/src/java/org/apache/lucene/document/TSIntPoint.java).
   
   2. Interface for decomposable aggregate function, which can be defined as 
part of index configuration - 
   
   Sum function - 
   
   ```java
   
   new BKDSummaryWriter.SummaryMergeFunction() {
   
         @Override
         public int getSummarySize()
   
   {         return Integer.BYTES;       }
   
         @Override
         public void merge(byte[] a, byte[] b, byte[] c)
   
   {         packBytes(unpackBytes(a) + unpackBytes(b), c);       }
   
         @Override
         public Integer unpackBytes(byte[] val)
   
   {         return NumericUtils.sortableBytesToInt(val, 0);       }
   
         @Override
         public void packBytes(Integer val, byte[] res)
   
   {         NumericUtils.intToSortableBytes(val, res, 0);       }
   
       };
   
   ```
   
   3. New query per `LeafReader` to perform range queries on TSPoint and 
retrieve summarized results - 
   
   
   ```java
   
   LeafReader leafReader; PointValues points = 
leafReader.getPointValues("tsid1");
   
   TSPointQuery tsPointQuery = new TSPointQuery("tsid1", lowerBoundTimestamp, 
upperBoundTimestamp);
   
   byte[] res = tsPointQuery.getSummary((BKDWithSummaryReader.BKDSummaryTree) 
points.getPointTree(), mergeFunction);
   
   ```
   
   Instead of BKDReader and BKDWriter, we will be using 
[BKDSummaryWriter](https://github.com/rishabhmaurya/lucene/blob/0c07f677c4caf45b1499f236adc0217992e46722/lucene/core/src/java/org/apache/lucene/util/bkd/BKDSummaryWriter.java)
 
[BKDSummaryReader](https://github.com/rishabhmaurya/lucene/blob/0c07f677c4caf45b1499f236adc0217992e46722/lucene/core/src/java/org/apache/lucene/util/bkd/BKDWithSummaryReader.java)
 which supports writing summaries in internal nodes of the tree. 
   
   Changes in 
[IntersectVisitor](https://github.com/rishabhmaurya/lucene/blob/0c07f677c4caf45b1499f236adc0217992e46722/lucene/core/src/java/org/apache/lucene/index/PointValues.java#L320-L337)
 interface here.
   
    Comparison with DocValues
   Below is the comparison of running unit test for 
[DocValue](https://github.com/rishabhmaurya/lucene/pull/1/commits/157215cd2c4787787748625e10b58001583c6e6e#diff-87c31ac3b1cd1ef45b15c84d9cf1c5bab03feb94f199256f859f14ed4747abd2R129)
 approach vs 
[TSPoint](https://github.com/rishabhmaurya/lucene/pull/1/commits/157215cd2c4787787748625e10b58001583c6e6e#diff-87c31ac3b1cd1ef45b15c84d9cf1c5bab03feb94f199256f859f14ed4747abd2R52)
 approach - 
   
   This test ingests `1000` docs against a given TSID and performs a range 
query on timestamp 100 times against the same TSID. Merge function used is 
`sum`.
   
   | DocValues approach |  TSPoint approach|
   | --- | -- |
   |Indexing took: 42948 ms  | Indexing took: 32985   |
   |Matching docs count:1304624 \| Segments:3 \| DiskAccess: 1304624 | Matching 
docs count:8784032 \| Segments:10 \| DiskAccess: 302   |
   |Search took: 12382 ms | Search took: 50ms |
   
   This is not apple to apple comparison since number of segments are 3 in 
DocValues approach whereas its 10 in TSPoint approach. 
   
    Limitation of this feature
   * Doc deletion currently not supported. We need to evaluate how important is 
it and possibly find a way to support it in future.
   * Only 
[decomposable](https://en.wikipedia.org/wiki/Aggregate_function#Decomposable_aggregate_functions)
 aggregation functions can be supported. E.g. min, max, sum, avg, count. 
   
    Todos
   * Implementation for multiple TSIDs. For now we need to create a new field 
with the name same as TSID for a timeseries.
   * Segment merge for BKD with summaries. Currently, the UTs disables merge 
and perform search across multiple segments and cumulate the results.
   * Pluggable merge function to merge 2 `TSPoint`. Currently its hardcoded in 
`FieldInfo.java` which isn't the right pl

[GitHub] [lucene] gsmiller opened a new pull request, #11746: Deprecate LongValueFacetCounts#getTopChildrenSortByCount since it provides redundant functionality

2022-09-02 Thread GitBox


gsmiller opened a new pull request, #11746:
URL: https://github.com/apache/lucene/pull/11746

   ### Description
   
   This is a companion (backport) PR to #11744 that marks functionality 
deprecated instead of outright removing it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

2022-09-02 Thread GitBox


msokolov commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r961747487


##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -104,8 +104,8 @@ public void removeLast() {
   }
 
   public void removeIndex(int idx) {
-System.arraycopy(node, idx + 1, node, idx, size - idx);
-System.arraycopy(score, idx + 1, score, idx, size - idx);
+System.arraycopy(node, idx + 1, node, idx, size - idx - 1);

Review Comment:
   oh, good! Is it OK when `size - idx - 1` == 0? I think so, at least I 
checked javadocs and they say only length < 0 raises an error



##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -175,20 +175,28 @@ public long ramBytesUsed() {
 long neighborArrayBytes0 =
 nsize0 * (Integer.BYTES + Float.BYTES)
 + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2
-+ RamUsageEstimator.NUM_BYTES_OBJECT_REF;
++ RamUsageEstimator.NUM_BYTES_OBJECT_REF
++ Integer.BYTES * 2;
 long neighborArrayBytes =
 nsize * (Integer.BYTES + Float.BYTES)
 + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2
-+ RamUsageEstimator.NUM_BYTES_OBJECT_REF;
-
++ RamUsageEstimator.NUM_BYTES_OBJECT_REF
++ Integer.BYTES * 2;
 long total = 0;
 for (int l = 0; l < numLevels; l++) {
   int numNodesOnLevel = graph.get(l).size();

Review Comment:
   were we overestimating before because not all the nodes were populated?



##
lucene/core/src/test/org/apache/lucene/util/TestRamUsageEstimator.java:
##
@@ -222,6 +229,33 @@ public void testPrintValues() {
 System.out.println("LONG_SIZE = " + LONG_SIZE);
   }
 
+  public void testHnswGraph() throws IOException {
+int size = atLeast(2000);
+int dim = randomIntBetween(100, 1024);
+int M = randomIntBetween(4, 96);
+VectorSimilarityFunction similarityFunction =
+VectorSimilarityFunction.values()[
+random().nextInt(VectorSimilarityFunction.values().length - 1) + 
1];
+VectorEncoding vectorEncoding;
+if (similarityFunction == VectorSimilarityFunction.DOT_PRODUCT) {

Review Comment:
   I don't think you need this check -- it's OK now to use BYTE encoding with 
other similarities



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller merged pull request #11737: Simplify dense optimization check in TermInSetQuery

2022-09-02 Thread GitBox


gsmiller merged PR #11737:
URL: https://github.com/apache/lucene/pull/11737


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov opened a new pull request, #11747: update DOAP and releaseWizard to reflect migration to github

2022-09-02 Thread GitBox


msokolov opened a new pull request, #11747:
URL: https://github.com/apache/lucene/pull/11747

   While preparing the 9.4.0 release, I ran across some references to JIRA and 
update those. I also found that the region names used in the holiday.py module 
installed by pip for me were different than those used in releaseWizard.py, and 
the ones in the latest module seem to be more standard (two-letter ISO codes), 
so I switched to those. Finally I fixed some references to outdated locations 
in the DOAP.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

2022-09-02 Thread GitBox


mayya-sharipova commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r962070814


##
lucene/core/src/test/org/apache/lucene/util/TestRamUsageEstimator.java:
##
@@ -222,6 +229,33 @@ public void testPrintValues() {
 System.out.println("LONG_SIZE = " + LONG_SIZE);
   }
 
+  public void testHnswGraph() throws IOException {
+int size = atLeast(2000);
+int dim = randomIntBetween(100, 1024);
+int M = randomIntBetween(4, 96);
+VectorSimilarityFunction similarityFunction =
+VectorSimilarityFunction.values()[
+random().nextInt(VectorSimilarityFunction.values().length - 1) + 
1];
+VectorEncoding vectorEncoding;
+if (similarityFunction == VectorSimilarityFunction.DOT_PRODUCT) {

Review Comment:
   Thanks, good to know. I copied that part from  another test in 
`TestHnswGraph`. I've corrected here as well in as in that test. Addressed in 
2e01deab2e1678b4687250d22f62ad0cd6196bc9



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

2022-09-02 Thread GitBox


mayya-sharipova commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r962071131


##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -104,8 +104,8 @@ public void removeLast() {
   }
 
   public void removeIndex(int idx) {
-System.arraycopy(node, idx + 1, node, idx, size - idx);
-System.arraycopy(score, idx + 1, score, idx, size - idx);
+System.arraycopy(node, idx + 1, node, idx, size - idx - 1);

Review Comment:
   Thank you for bringing my attention to this, I checked, and indeed 
`length=0` is permissible in `System.arraycopy` function. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on issue #11553: Improve performance of SortedSetDV faceting by iterating on class types [LUCENE-10517]

2022-09-02 Thread GitBox


gsmiller commented on issue #11553:
URL: https://github.com/apache/lucene/issues/11553#issuecomment-1235993045

   @ChrisHegarty I was looking back through faceting improvements and trying to 
catch up on this change, which I hadn't followed closely when it was being 
made. When you get a chance (not urgent), could you help me understand where 
`invokeinterface` was coming into play before your change? I can't reason about 
where an interface-defined method would be involved with `#nextDoc()`, but I'm 
sure this is just my detailed JVM ignorance surfacing. Hoping you could help me 
understand some detail here. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

2022-09-02 Thread GitBox


mayya-sharipova commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r962072588


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -175,20 +175,28 @@ public long ramBytesUsed() {
 long neighborArrayBytes0 =
 nsize0 * (Integer.BYTES + Float.BYTES)
 + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2
-+ RamUsageEstimator.NUM_BYTES_OBJECT_REF;
++ RamUsageEstimator.NUM_BYTES_OBJECT_REF
++ Integer.BYTES * 2;
 long neighborArrayBytes =
 nsize * (Integer.BYTES + Float.BYTES)
 + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2
-+ RamUsageEstimator.NUM_BYTES_OBJECT_REF;
-
++ RamUsageEstimator.NUM_BYTES_OBJECT_REF
++ Integer.BYTES * 2;
 long total = 0;
 for (int l = 0; l < numLevels; l++) {
   int numNodesOnLevel = graph.get(l).size();

Review Comment:
   In comparison with what `RamUsageTester::ramUsed` reports, we are actually 
underestimating. I still need to do more investigation.
   Also, we preemptively allocate memory for all `NeighborArray` objects which 
may never get populated with real data.  We probably need to look into this. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org