[GitHub] [lucene] mdmarshmallow commented on a change in pull request #747: LUCENE-10325: Add getTopDims functionality to Facets

2022-03-19 Thread GitBox


mdmarshmallow commented on a change in pull request #747:
URL: https://github.com/apache/lucene/pull/747#discussion_r830444294



##
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##
@@ -114,7 +116,7 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
       return null;
     }
     DimTree dimTree = state.getDimTree(dim);
-    return getPathResult(dimConfig, dim, path, pathOrd, dimTree.iterator(pathOrd), topN);
+    return getPathResult(dimConfig, dim, path, pathOrd, dimTree.iterator(pathOrd), topN, null);

Review comment:
   Small nit, but could we have an overloaded method signature here? I 
think it will make the code a bit cleaner than just passing `null`.
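   For illustration, a minimal sketch of the overload pattern being suggested (parameter types are inferred from the call sites in this diff, so treat them as assumptions):
   ```java
   // Hypothetical convenience overload: callers that have no cached child ords
   // delegate to the full signature instead of passing null inline.
   private FacetResult getPathResult(
       FacetsConfig.DimConfig dimConfig,
       String dim,
       String[] path,
       int pathOrd,
       PrimitiveIterator.OfInt childOrds,
       int topN)
       throws IOException {
     return getPathResult(dimConfig, dim, path, pathOrd, childOrds, topN, null);
   }
   ```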

##
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##
@@ -133,7 +135,7 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
       // child:
       childIt.next();
     }
-    return getPathResult(dimConfig, dim, null, dimOrd, childIt, topN);
+    return getPathResult(dimConfig, dim, null, dimOrd, childIt, topN, null);

Review comment:
   Same thing here. Actually, there are a few places where `null` is just 
passed into various methods. I think new signatures excluding the 
`cacheChildOrdsResult` parameter, with comments explaining what they are, would 
make the code clearer.

##
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##
@@ -411,7 +484,109 @@ public int compare(FacetResult a, FacetResult b) {
             }
           }
         });
-
     return results;
   }
 
+  @Override
+  public List<FacetResult> getTopDims(int topNDims, int topNChildren) throws IOException {
+    // Creates priority queue to store top dimensions and sort by their aggregated values/hits and
+    // string values.
+    PriorityQueue<DimValueResult> pq =
+        new PriorityQueue<>(topNDims) {
+          @Override
+          protected boolean lessThan(DimValueResult a, DimValueResult b) {
+            if (a.value > b.value) {
+              return false;
+            } else if (a.value < b.value) {
+              return true;
+            } else {
+              return a.dim.compareTo(b.dim) > 0;
+            }
+          }
+        };
+
+    HashMap<String, ChildOrdsResult> cacheChildOrdsResult = new HashMap<>();
+    int dimCount;
+
+    for (String dim : state.getDims()) {
+      FacetsConfig.DimConfig dimConfig = stateConfig.getDimConfig(dim);
+      if (dimConfig.hierarchical) {
+        DimTree dimTree = state.getDimTree(dim);
+        int dimOrd = dimTree.dimStartOrd;
+        // get dim value
+        dimCount =
+            getDimValue(
+                dimConfig, dim, dimOrd, dimTree.iterator(), topNChildren, cacheChildOrdsResult);
+      } else {
+        OrdRange ordRange = state.getOrdRange(dim);
+        int dimOrd = ordRange.start;
+        PrimitiveIterator.OfInt childIt = ordRange.iterator();
+        if (dimConfig.multiValued && dimConfig.requireDimCount) {
+          // If the dim is multi-valued and requires dim counts, we know we've explicitly indexed
+          // the dimension and we need to skip past it so the iterator is positioned on the first
+          // child:
+          childIt.next();
+        }
+        dimCount = getDimValue(dimConfig, dim, dimOrd, childIt, topNChildren, cacheChildOrdsResult);
+      }
+
+      if (dimCount != 0) {
+        // use priority queue to store DimValueResult for topNDims
+        if (pq.size() < topNDims) {
+          pq.insertWithOverflow(new DimValueResult(dim, dimCount));
+        } else {
+          if (dimCount > pq.top().value
+              || (dimCount == pq.top().value && dim.compareTo(pq.top().dim) < 0)) {
+            DimValueResult bottomDim = pq.pop();
+            bottomDim.dim = dim;
+            bottomDim.value = dimCount;
+            pq.insertWithOverflow(bottomDim);
+          }
+        }
+      }
+    }
+
+    // get FacetResult for topNDims
+    int resultSize = pq.size();
+    FacetResult[] results = new FacetResult[resultSize];
+
+    while (pq.size() > 0) {
+      DimValueResult dimValueResult = pq.pop();
+      FacetResult facetResult =
+          getFacetResultForDim(
+              dimValueResult.dim, topNChildren, cacheChildOrdsResult.get(dimValueResult.dim));
+      results[--resultSize] = facetResult;
+    }
+    return Arrays.asList(results);
+  }
+
+  /**
+   * Creates ChildOrdsResult to store dimCount, childCount, and TopOrdAndIntQueue q for
+   * getPathResult.

Review comment:
   The `TopOrdAndIntQueue` is a queue for the dimension's top children, 
right? Maybe just include that in the comments to make it a bit clearer 
what it is.
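   For example, the javadoc might spell that out along these lines (suggested wording only, not the committed text):
   ```java
   /**
    * Creates a ChildOrdsResult to store dimCount, childCount, and the TopOrdAndIntQueue q
    * (a queue of the dimension's top children) for getPathResult.
    */
   ```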


[GitHub] [lucene-solr] uschindler commented on a change in pull request #2649: Remove '-' between base.version and version.suffix and change common-build to allow the new format

2022-03-19 Thread GitBox


uschindler commented on a change in pull request #2649:
URL: https://github.com/apache/lucene-solr/pull/2649#discussion_r830460575



##
File path: lucene/common-build.xml
##
@@ -63,7 +63,7 @@
 
   
 
-  
+  
 

Review comment:
   At least there must be a `\b` (boundary) instead of the dash. The current 
change would match a version like 1.2.34 as valid for a 1.2.3 base version.
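   A standalone illustration of the concern (the surrounding pattern is hypothetical, since the actual property text is missing from this excerpt):
   ```java
   import java.util.regex.Pattern;

   public class VersionBoundaryDemo {
     public static void main(String[] args) {
       String base = Pattern.quote("1.2.3");
       Pattern withoutBoundary = Pattern.compile(base + ".*");
       Pattern withBoundary = Pattern.compile(base + "\\b.*");

       System.out.println(withoutBoundary.matcher("1.2.34").matches()); // true: false positive
       System.out.println(withBoundary.matcher("1.2.34").matches());    // false: rejected
       System.out.println(withBoundary.matcher("1.2.3.1").matches());   // true: legitimate suffix
     }
   }
   ```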








[GitHub] [lucene-solr] uschindler commented on pull request #2649: Remove '-' between base.version and version.suffix and change common-build to allow the new format

2022-03-19 Thread GitBox


uschindler commented on pull request #2649:
URL: https://github.com/apache/lucene-solr/pull/2649#issuecomment-1072980556


   I am not sure about the consequences for scripts, but this variant should be 
fine. The spec version inside the jar files stays the base version, so 
compatibility is ensured.
   
   I still don't understand why somebody doing a fork or custom build needs to 
have a dot. Lucene says "appendix with dash", so downstream users should 
respect this. If you create a custom JDK you also have to use their very strict 
versioning patterns (e.g., AdoptOpenJDK, Corretto). They also require plain base 
version numbers, and strings like Corretto or bugfix versions must match the 
main branch.
   
   If one makes an incompatible fork, the Maven coordinates should be adapted.
   
   This is just my personal view on it; others may see it differently.





[GitHub] [lucene-solr] uschindler edited a comment on pull request #2649: Remove '-' between base.version and version.suffix and change common-build to allow the new format

2022-03-19 Thread GitBox


uschindler edited a comment on pull request #2649:
URL: https://github.com/apache/lucene-solr/pull/2649#issuecomment-1072980556


   I am not sure about the consequences for scripts, but this variant should be 
fine. The spec version inside the jar files stays the base version, so 
compatibility is ensured.
   
   I still don't understand why somebody doing a fork or custom build needs to 
have a dot. Lucene says "appendix with dash", so downstream users should 
respect this. If you create a custom JDK you also have to use their very strict 
versioning patterns (e.g., AdoptOpenJDK, Corretto). They also require plain base 
version numbers, and strings like Corretto or bugfix versions must be in 
clearly specified implementation appendixes.
   
   If one makes an incompatible fork, the Maven coordinates should be adapted.
   
   This is just my personal view on it; others may see it differently.





[GitHub] [lucene] mocobeta opened a new pull request #755: Add note for --tmp-dir option in releaseWizard

2022-03-19 Thread GitBox


mocobeta opened a new pull request #755:
URL: https://github.com/apache/lucene/pull/755


   I have encountered a "disk full" error several times when running the smoke 
tester.
   From my observation, there is a test that (intentionally) creates very large 
index files. It might be helpful to have a notice about that in the rc 
announcement mails and to encourage explicit use of the `--tmp-dir` option to 
avoid unnecessary retries.
   
   Here is a snapshot of the disk usage in a smoke tester run.
   ```
   $ dust -b ~/tmp/smoketester/
 47M   ┌── lucene-9.1.0-src.tgz
 64M   ├── lucene-9.1.0.tgz
 37M   │ ┌── test-framework
 59M   │ │ ┌── lucene-9.1.0-SNAPSHOT-itests
 59M   │ │   ┌─┴ packages
 59M   │ │ ┌─┴ build
 59M   │ ├─┴ distribution
 51M   │ │ ┌── build
 63M   │ ├─┴ backward-codecs
125M   │ │   ┌── site
125M   │ │ ┌─┴ build
125M   │ ├─┴ documentation
 55M   │ │ ┌── nori
133M   │ ├─┴ analysis
3.0G   │ │   ┌── _0.cfs
4.0G   │ │   ├── _0.fdt
7.1G   │ │ ┌─┴ 4GBStoredFields-001
 7.1G   │ │   ┌─┴ lucene.index.Test4GBStoredFields_1FCE4B52E7A84AF3-001
7.1G   │ │ ┌─┴ tests-tmp
7.1G   │ │   ┌─┴ tmp
7.1G   │ │ ┌─┴ build
7.1G   │ ├─┴ core
7.6G   │   ┌─┴ lucene
7.6G   │ ┌─┴ lucene-9.1.0
7.6G   ├─┴ unpack
7.7G ┌─┴ smoketester
   ```
   
   I didn't (couldn't) test this but just grepped the scripts to find where to 
make the change; could you tell me the correct file if I'm modifying the wrong 
one?





[GitHub] [lucene] msokolov commented on pull request #755: Add note for smoke tester --tmp-dir option in rc announcing

2022-03-19 Thread GitBox


msokolov commented on pull request #755:
URL: https://github.com/apache/lucene/pull/755#issuecomment-1073041616


   Big plus one for this. I can never complete the smoke test without it, and I
   always forget the parameter name and where it has to go on the command
   line. I don't think it can be tacked on the end.
   





[GitHub] [lucene] Yuti-G commented on a change in pull request #747: LUCENE-10325: Add getTopDims functionality to Facets

2022-03-19 Thread GitBox


Yuti-G commented on a change in pull request #747:
URL: https://github.com/apache/lucene/pull/747#discussion_r830539688



##
File path: lucene/facet/src/test/org/apache/lucene/facet/sortedset/TestSortedSetDocValuesFacets.java
##
@@ -1221,6 +1326,12 @@ public void testRandomHierarchicalFlatMix() throws Exception {
 
       assertEquals(expectedAllDims, actualAllDims);
 
+      // test getTopDims(1, 10)
+      if (actualAllDims.size() > 0) {
+        List<FacetResult> topDimsResults1 = facets.getTopDims(1, 10);
Review comment:
   > I didn't really have a lot of real feedback, just minor nits here and 
there. I did want to say that I have a PR out for a benchmark that compares the 
faceting API performance for SSDV vs taxonomy for high cardinality hierarchical 
faceting. Here is the link: 
[mikemccand/luceneutil#166](https://github.com/mikemccand/luceneutil/pull/166). 
We should add `getTopDims` to this benchmark once it is done :)
   
   Thanks @mdmarshmallow for all the great feedback! This is my first official 
PR, and I really appreciate all the suggestions. I'd like to add `getTopDims` to 
the benchmark once it is done; may I ask you questions if I come across any? 
Thanks again :) 







[GitHub] [lucene] Yuti-G commented on a change in pull request #747: LUCENE-10325: Add getTopDims functionality to Facets

2022-03-19 Thread GitBox


Yuti-G commented on a change in pull request #747:
URL: https://github.com/apache/lucene/pull/747#discussion_r830539760



##
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##
@@ -114,7 +116,7 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
       return null;
     }
     DimTree dimTree = state.getDimTree(dim);
-    return getPathResult(dimConfig, dim, path, pathOrd, dimTree.iterator(pathOrd), topN);
+    return getPathResult(dimConfig, dim, path, pathOrd, dimTree.iterator(pathOrd), topN, null);

Review comment:
   Great idea, thanks!







[GitHub] [lucene] Yuti-G commented on a change in pull request #747: LUCENE-10325: Add getTopDims functionality to Facets

2022-03-19 Thread GitBox


Yuti-G commented on a change in pull request #747:
URL: https://github.com/apache/lucene/pull/747#discussion_r830539828



##
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##
@@ -411,7 +484,109 @@ public int compare(FacetResult a, FacetResult b) {
             }
           }
         });
-
     return results;
   }
 
+  @Override
+  public List<FacetResult> getTopDims(int topNDims, int topNChildren) throws IOException {
+    // Creates priority queue to store top dimensions and sort by their aggregated values/hits and
+    // string values.
+    PriorityQueue<DimValueResult> pq =
+        new PriorityQueue<>(topNDims) {
+          @Override
+          protected boolean lessThan(DimValueResult a, DimValueResult b) {
+            if (a.value > b.value) {
+              return false;
+            } else if (a.value < b.value) {
+              return true;
+            } else {
+              return a.dim.compareTo(b.dim) > 0;
+            }
+          }
+        };
+
+    HashMap<String, ChildOrdsResult> cacheChildOrdsResult = new HashMap<>();
+    int dimCount;
+
+    for (String dim : state.getDims()) {
+      FacetsConfig.DimConfig dimConfig = stateConfig.getDimConfig(dim);
+      if (dimConfig.hierarchical) {
+        DimTree dimTree = state.getDimTree(dim);
+        int dimOrd = dimTree.dimStartOrd;
+        // get dim value
+        dimCount =
+            getDimValue(
+                dimConfig, dim, dimOrd, dimTree.iterator(), topNChildren, cacheChildOrdsResult);
+      } else {
+        OrdRange ordRange = state.getOrdRange(dim);
+        int dimOrd = ordRange.start;
+        PrimitiveIterator.OfInt childIt = ordRange.iterator();
+        if (dimConfig.multiValued && dimConfig.requireDimCount) {
+          // If the dim is multi-valued and requires dim counts, we know we've explicitly indexed
+          // the dimension and we need to skip past it so the iterator is positioned on the first
+          // child:
+          childIt.next();
+        }
+        dimCount = getDimValue(dimConfig, dim, dimOrd, childIt, topNChildren, cacheChildOrdsResult);
+      }
+
+      if (dimCount != 0) {
+        // use priority queue to store DimValueResult for topNDims
+        if (pq.size() < topNDims) {
+          pq.insertWithOverflow(new DimValueResult(dim, dimCount));
+        } else {
+          if (dimCount > pq.top().value
+              || (dimCount == pq.top().value && dim.compareTo(pq.top().dim) < 0)) {
+            DimValueResult bottomDim = pq.pop();
+            bottomDim.dim = dim;
+            bottomDim.value = dimCount;
+            pq.insertWithOverflow(bottomDim);
+          }
+        }
+      }
+    }
+
+    // get FacetResult for topNDims
+    int resultSize = pq.size();
+    FacetResult[] results = new FacetResult[resultSize];
+
+    while (pq.size() > 0) {
+      DimValueResult dimValueResult = pq.pop();
+      FacetResult facetResult =
+          getFacetResultForDim(
+              dimValueResult.dim, topNChildren, cacheChildOrdsResult.get(dimValueResult.dim));
+      results[--resultSize] = facetResult;

Review comment:
   Thanks! I should've paid more attention to good code style :)







[GitHub] [lucene] Yuti-G commented on a change in pull request #747: LUCENE-10325: Add getTopDims functionality to Facets

2022-03-19 Thread GitBox


Yuti-G commented on a change in pull request #747:
URL: https://github.com/apache/lucene/pull/747#discussion_r830539916



##
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##
@@ -411,7 +484,109 @@ public int compare(FacetResult a, FacetResult b) {
             }
           }
         });
-
     return results;
   }
 
+  @Override
+  public List<FacetResult> getTopDims(int topNDims, int topNChildren) throws IOException {
+    // Creates priority queue to store top dimensions and sort by their aggregated values/hits and
+    // string values.
+    PriorityQueue<DimValueResult> pq =
+        new PriorityQueue<>(topNDims) {
+          @Override
+          protected boolean lessThan(DimValueResult a, DimValueResult b) {
+            if (a.value > b.value) {
+              return false;
+            } else if (a.value < b.value) {
+              return true;
+            } else {
+              return a.dim.compareTo(b.dim) > 0;
+            }
+          }
+        };
+
+    HashMap<String, ChildOrdsResult> cacheChildOrdsResult = new HashMap<>();
+    int dimCount;
+
+    for (String dim : state.getDims()) {
+      FacetsConfig.DimConfig dimConfig = stateConfig.getDimConfig(dim);
+      if (dimConfig.hierarchical) {
+        DimTree dimTree = state.getDimTree(dim);
+        int dimOrd = dimTree.dimStartOrd;
+        // get dim value
+        dimCount =
+            getDimValue(
+                dimConfig, dim, dimOrd, dimTree.iterator(), topNChildren, cacheChildOrdsResult);
+      } else {
+        OrdRange ordRange = state.getOrdRange(dim);
+        int dimOrd = ordRange.start;
+        PrimitiveIterator.OfInt childIt = ordRange.iterator();
+        if (dimConfig.multiValued && dimConfig.requireDimCount) {
+          // If the dim is multi-valued and requires dim counts, we know we've explicitly indexed
+          // the dimension and we need to skip past it so the iterator is positioned on the first
+          // child:
+          childIt.next();
+        }
+        dimCount = getDimValue(dimConfig, dim, dimOrd, childIt, topNChildren, cacheChildOrdsResult);
+      }
+
+      if (dimCount != 0) {
+        // use priority queue to store DimValueResult for topNDims
+        if (pq.size() < topNDims) {
+          pq.insertWithOverflow(new DimValueResult(dim, dimCount));
+        } else {
+          if (dimCount > pq.top().value
+              || (dimCount == pq.top().value && dim.compareTo(pq.top().dim) < 0)) {
+            DimValueResult bottomDim = pq.pop();
+            bottomDim.dim = dim;
+            bottomDim.value = dimCount;
+            pq.insertWithOverflow(bottomDim);
+          }
+        }
+      }
+    }
+
+    // get FacetResult for topNDims
+    int resultSize = pq.size();
+    FacetResult[] results = new FacetResult[resultSize];
+
+    while (pq.size() > 0) {
+      DimValueResult dimValueResult = pq.pop();
+      FacetResult facetResult =
+          getFacetResultForDim(
+              dimValueResult.dim, topNChildren, cacheChildOrdsResult.get(dimValueResult.dim));
+      results[--resultSize] = facetResult;
+    }
+    return Arrays.asList(results);
+  }
+
+  /**
+   * Creates ChildOrdsResult to store dimCount, childCount, and TopOrdAndIntQueue q for
+   * getPathResult.

Review comment:
   Thanks! Added it to all the javadocs.







[GitHub] [lucene] Yuti-G commented on pull request #747: LUCENE-10325: Add getTopDims functionality to Facets

2022-03-19 Thread GitBox


Yuti-G commented on pull request #747:
URL: https://github.com/apache/lucene/pull/747#issuecomment-1073132559


   > I didn't really have a lot of real feedback, just minor nits here and 
there. I did want to say that I have a PR out for a benchmark that compares the 
faceting API performance for SSDV vs taxonomy for high cardinality hierarchical 
faceting. Here is the link: 
[mikemccand/luceneutil#166](https://github.com/mikemccand/luceneutil/pull/166). 
We should add `getTopDims` to this benchmark once it is done :)
   
   Thanks @mdmarshmallow for all the great feedback! This is my first official 
PR, and I really appreciate all the suggestions. I'd like to add `getTopDims` to 
the benchmark once it is done; may I ask you questions if I come across any? 
Thanks again :) 







[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm

2022-03-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509368#comment-17509368
 ] 

ASF subversion and git services commented on LUCENE-9905:
-

Commit a4b30b4cf4c0c9b4de0c27893ea2350af498b1c0 in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a4b30b4 ]

LUCENE-9905: Fix check in TestPerFieldKnnVectorsFormat#testMergeUsesNewFormat

Before the assertion checked if two sets were equal, which resulted in rare
failures. Now we use 'contains' from hamcrest matchers.
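
For illustration, the difference looks roughly like this (a sketch with made-up values, not the actual test code):

{code:java}
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.contains;

import java.util.List;

public class ContainsDemo {
  public static void main(String[] args) {
    // A set-equality assertion collapses duplicate entries; 'contains' verifies
    // the exact elements (and order) of the iterable:
    List<String> perFieldFormats = List.of("Lucene90", "Lucene90");
    assertThat(perFieldFormats, contains("Lucene90", "Lucene90"));
  }
}
{code}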


> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Julie Tibshirani
>Priority: Blocker
> Fix For: 9.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
> particular nearest-neighbor search data structure and algorithm. This 
> flexibility is important since NN search is a developing area and we'd like 
> to be able to experiment and evolve the algorithm. Right now we only have one 
> algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for 
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation 
> is expected to handle multiple algorithms. Instead we could have one format 
> implementation per algorithm. Our current implementation would be 
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another 
> algorithm you could create a new implementation like {{ClusterVectorFormat}}. 
> This would be better aligned with the codec framework, and help avoid 
> exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is 
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just 
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something 
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
> parameters to be configured per-field \(?\)
> One note: the current HNSW-based format includes logic for storing a numeric 
> vector per document, as well as constructing + storing a HNSW graph. When 
> adding another implementation, it’d be nice to be able to reuse logic for 
> reading/writing numeric vectors. I don’t think we need to design for this 
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started: 
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]






[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm

2022-03-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509371#comment-17509371
 ] 

ASF subversion and git services commented on LUCENE-9905:
-

Commit f09b08563074379da2b6f24054ffe7243ba365a1 in lucene's branch 
refs/heads/branch_9x from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f09b085 ]

LUCENE-9905: Fix check in TestPerFieldKnnVectorsFormat#testMergeUsesNewFormat

Before the assertion checked if two sets were equal, which resulted in rare
failures. Now we use 'contains' from hamcrest matchers.


> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905






[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm

2022-03-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509373#comment-17509373
 ] 

ASF subversion and git services commented on LUCENE-9905:
-

Commit fcacd22a80565758155a6cb6973a5ca92918d957 in lucene's branch 
refs/heads/branch_9_1 from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fcacd22 ]

LUCENE-9905: Fix check in TestPerFieldKnnVectorsFormat#testMergeUsesNewFormat

Before the assertion checked if two sets were equal, which resulted in rare
failures. Now we use 'contains' from hamcrest matchers.


> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905






[GitHub] [lucene] mocobeta commented on pull request #755: Add note for smoke tester --tmp-dir option in rc announcing

2022-03-19 Thread GitBox


mocobeta commented on pull request #755:
URL: https://github.com/apache/lucene/pull/755#issuecomment-1073174420


   My `tmpfs` partition (where the system `/tmp` is mounted) has 8GB of 
available disk space.
   The test itself has been there for a long time, and I think I had not seen 
this "disk full" error before 9.0. Could the disk usage have increased since 
9.x for some reason?





[GitHub] [lucene] mocobeta edited a comment on pull request #755: Add note for smoke tester --tmp-dir option in rc announcing

2022-03-19 Thread GitBox


mocobeta edited a comment on pull request #755:
URL: https://github.com/apache/lucene/pull/755#issuecomment-1073174420


   My `tmpfs` partition (where the system `/tmp` is mounted) has 8GB of 
available disk space.
   The test itself has been there for a long time, and I think I had not seen 
this "disk full" error before 9.0. Could the disk usage in tests have increased 
since 9.x for some reason?





[jira] [Commented] (LUCENE-10448) MergeRateLimiter doesn't always limit instant rate.

2022-03-19 Thread Vigya Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509380#comment-17509380
 ] 

Vigya Sharma commented on LUCENE-10448:
---

{color:#172b4d}Thanks for making the changes and confirming, [~kkewwei] !{color}

 

Re: reordering the lines, I think it is better to check and pause before 
actually writing instead of after. Sometimes, a write may arrive after a long 
time gap; in such cases, MergeRateLimiter does not pause, as the write is 
already arriving after a long pause (the gap). _If we write first and checkRate 
later_, then we will end up not pausing even for the next subsequent write.

Just my understanding from working on this part of the code, and I might have 
missed something. Like you, I'm also curious to learn about the design choices 
here from more seasoned Lucene developers.
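
A small sketch of the two orderings under discussion (illustrative names, not the actual RateLimitedIndexOutput code):

{code:java}
import java.io.IOException;

abstract class RateLimitedWriter {
  interface Limiter {
    // May sleep so the instantaneous write rate stays under the target.
    void maybePause(long bytes, long curNS) throws IOException;
  }

  private final Limiter limiter;

  RateLimitedWriter(Limiter limiter) {
    this.limiter = limiter;
  }

  abstract void writeToDisk(byte[] bytes) throws IOException;

  // Check-then-write: a long natural gap since the previous write already
  // "pays for" the pending bytes, and the limiter throttles them up front.
  final void writePauseFirst(byte[] bytes) throws IOException {
    limiter.maybePause(bytes.length, System.nanoTime());
    writeToDisk(bytes);
  }

  // Write-then-check: the bytes land on disk unthrottled, and after a long
  // idle gap the limiter skips the pause for the next write too.
  final void writePauseAfter(byte[] bytes) throws IOException {
    writeToDisk(bytes);
    limiter.maybePause(bytes.length, System.nanoTime());
  }
}
{code}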


> MergeRateLimiter doesn't always limit instant rate.
> ---
>
> Key: LUCENE-10448
> URL: https://issues.apache.org/jira/browse/LUCENE-10448
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 8.11.1
>Reporter: kkewwei
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We can see the code in *MergeRateLimiter*:
> {code:java}
> private long maybePause(long bytes, long curNS) throws MergePolicy.MergeAbortedException {
>   ...
>   double rate = mbPerSec;
>   double secondsToPause = (bytes / 1024. / 1024.) / rate;
>   long targetNS = lastNS + (long) (1000000000 * secondsToPause);
>   long curPauseNS = targetNS - curNS;
>   // We don't bother with thread pausing if the pause is smaller than 2 msec.
>   if (curPauseNS <= MIN_PAUSE_NS) {
>     // Set to curNS, not targetNS, to enforce the instant rate, not
>     // the "averaged over all history" rate:
>     lastNS = curNS;
>     return -1;
>   }
>   ...
> }
> {code}
> If a segment is being merged and *maybePause* is called at 7:00, then 
> lastNS=7:00. When *maybePause* is called again at 7:05, the value of 
> *targetNS = lastNS + (long) (1000000000 * secondsToPause)* must be smaller than 
> *curNS* no matter how big the bytes are, so we return -1 and skip the 
> pause. 
> I counted the total number of calls to *maybePause* (callTimes), the ignored 
> pauses (ignorePauseTimes), and the ignored bytes in detail (detailBytes):
> {code:java}
> [2022-03-02T15:16:51,972][DEBUG][o.e.i.e.I.EngineMergeScheduler] [node1] 
> [index1][21] merge segment [_4h] done: took [26.8s], [123.6 MB], [61,219 
> docs], [0s stopped], [24.4s throttled], [242.5 MB written], [11.2 MB/sec 
> throttle], [callTimes=857], [ignorePauseTimes=25],  [detailBytes(mb) = 
> [0.28899956, 0.28140354, 0.28015518, 0.27990818, 0.2801447, 0.27991104, 
> 0.27990723, 0.27990913, 0.2799101, 0.28010082, 0.2799921, 0.2799673, 
> 0.28144264, 0.27991295, 0.27990818, 0.27993107, 0.2799387, 0.27998447, 
> 0.28002167, 0.27992058, 0.27998066, 0.28098202, 0.28125, 0.28125, 0.28125]]
> {code}
> There are 857 calls to *maybePause*, of which 25 skipped the pause; we can see 
> that the ignored byte counts (such as 0.28125 MB) are 
> not small.
> As long as the interval between two *maybePause* calls is relatively long, 
> the pause that should be executed will not be executed.
>  
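
To make the timing concrete, here is a worked example consistent with the quoted code (the rate and gap are illustrative; only the ~0.28 MB write size is taken from the log above):

{code:java}
public class MaybePauseGapDemo {
  static final long MIN_PAUSE_NS = 2_000_000; // 2 msec, as in MergeRateLimiter

  public static void main(String[] args) {
    double mbPerSec = 10.0;      // hypothetical throttle rate
    long bytes = 294_912;        // 0.28125 MB, matching the sizes in the log
    long lastNS = 0;             // time of the previous write
    long curNS = 5_000_000_000L; // this write arrives 5 seconds later

    double secondsToPause = (bytes / 1024. / 1024.) / mbPerSec;    // 0.028125 s
    long targetNS = lastNS + (long) (1000000000 * secondsToPause); // ~28 ms after lastNS
    long curPauseNS = targetNS - curNS;                            // hugely negative

    // maybePause() takes the early return and never sleeps:
    System.out.println(curPauseNS <= MIN_PAUSE_NS); // true -> pause skipped
  }
}
{code}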






[jira] [Comment Edited] (LUCENE-10448) MergeRateLimiter doesn't always limit instant rate.

2022-03-19 Thread Vigya Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509380#comment-17509380
 ] 

Vigya Sharma edited comment on LUCENE-10448 at 3/20/22, 6:11 AM:
-

{color:#172b4d}Thanks for making the changes and confirming, [~kkewwei] !{color}

 

Re: reordering the lines, I think it is better to check and pause before 
actually writing instead of after. Sometimes, a write may arrive after a long 
time gap; in such cases, MergeRateLimiter does not pause, as the write is 
already arriving after a long pause (the gap). _If we write first and checkRate 
later_, then we will end up not pausing even for the next subsequent write.

Just my understanding from working on this part of the code, and I might have 
missed something. Like you, I'm also curious to learn about the design choices 
here from more seasoned Lucene developers.


was (Author: vigyas):
{color:#172b4d}Thanks for making the changes and confirming, [~kkewwei] !{color}

 

Re: reordering the lines, I think it is better to check and pause before 
actually writing instead of after. Sometimes, a write may arrive after a long 
time gap; in such cases, MergeRateLimiter does not pause, as the write is 
already arriving after a long pause (the gap). _If we write first and checkRate 
later_, then we will end up not pausing even for the next subsequent write.

Just my understanding from working on this part of the code, and I might have 
missed something. Like you, I'm also curious to learn about the design choices 
here from more seasoned Lucene developers.


> MergeRateLimiter doesn't always limit instant rate.
> ---
>
> Key: LUCENE-10448
> URL: https://issues.apache.org/jira/browse/LUCENE-10448






[GitHub] [lucene] vigyasharma commented on pull request #738: LUCENE-10448: Avoid instant rate write bursts by writing bytes buffer in chunks

2022-03-19 Thread GitBox


vigyasharma commented on pull request #738:
URL: https://github.com/apache/lucene/pull/738#issuecomment-1073176564


   > Since a release candidate is out for testing, I'll wait until that process 
completes before merging this. It would be safe anyway, and this is small, but 
just in case there is some problem and we need to do reverts or something it 
may be simpler for the release management.
   
   Thanks for the review, @msokolov, sounds good. Should I add a CHANGES file 
entry for this? If so, which release version section should it go to?





[GitHub] [lucene] vigyasharma commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-03-19 Thread GitBox


vigyasharma commented on pull request #633:
URL: https://github.com/apache/lucene/pull/633#issuecomment-1073177695


   Thanks @mikemccand. I figured out that I had changed the exception signature 
for addIndexes() in my last revision. I was throwing a blanket 
MergePolicyException when addIndexes failed to merge all the readers in, and a 
lot of these tests were designed to look for specific exceptions that would've 
earlier gotten thrown from the SegmentMerger code running in the main addIndexes 
thread. This was causing the randomly invoked tests to fail.
   
   I changed that to rethrow the exceptions seen by any failed background 
OneMerges, and that appeased the gradle gods: builds began to pass again. 
   
   I think this is better, because now the API throws the actual error seen by 
a thread doing the merge, instead of some blanket failure message. In general, 
I have to say, I'm fascinated by the breadth of test cases around this API and 
the code framework present for adding more tests.





[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API

2022-03-19 Thread Vigya Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509383#comment-17509383
 ] 

Vigya Sharma commented on LUCENE-10216:
---

Updated the PR to retain transactionality, while using the MergePolicy and 
the configured MergeScheduler to trigger merges for this API.

> Add concurrency to addIndexes(CodecReader…) API
> ---
>
> Key: LUCENE-10216
> URL: https://issues.apache.org/jira/browse/LUCENE-10216
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Vigya Sharma
>Priority: Major
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> I work at Amazon Product Search, and we use Lucene to power search for the 
> e-commerce platform. I’m working on a project that involves applying 
> metadata+ETL transforms and indexing documents on n different _indexing_ 
> boxes, combining them into a single index on a separate _reducer_ box, and 
> making it available for queries on m different _search_ boxes (replicas). 
> Segments are asynchronously copied from indexers to reducers to searchers as 
> they become available for the next layer to consume.
> I am using the addIndexes API to combine multiple indexes into one on the 
> reducer boxes. Since we also have taxonomy data, we need to remap facet field 
> ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version 
> of this API. The API leverages {{SegmentMerger.merge()}} to create segments 
> with new ordinal values while also merging all provided segments in the 
> process.
> _This is however a blocking call that runs in a single thread._ Until we have 
> written segments with new ordinal values, we cannot copy them to searcher 
> boxes, which increases the time to make documents available for search.
> I was playing around with the API by creating multiple concurrent merges, 
> each with only a single reader, creating a concurrently running 1:1 
> conversion from old segments to new ones (with new ordinal values). We follow 
> this up with non-blocking background merges. This lets us copy the segments 
> to searchers and replicas as soon as they are available, and later replace 
> them with merged segments as background jobs complete. On the Amazon dataset 
> I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. 
> Each call was given about 5 readers to add on average.
> This might be a useful addition to Lucene. We could create another {{addIndexes()}} 
> API with a {{boolean}} flag for concurrency that internally submits multiple 
> merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, 
> and waits for them to complete before returning.
> While this is doable from outside Lucene by using your own thread pool, starting 
> multiple addIndexes() calls and waiting for them to complete, I felt it needs 
> some understanding of what addIndexes does, why you need to wait on the merge, 
> and why it makes sense to pass a single reader in the addIndexes API.
> Out-of-box support in Lucene could simplify this for folks with a similar use case.
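
As a rough sketch, the "from outside Lucene" workaround the description mentions looks like this (names and pool sizing are illustrative):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.IndexWriter;

public final class ConcurrentAddIndexes {
  // Submit one single-reader addIndexes() call per task and wait for all of
  // them, approximating the proposed concurrent behavior from outside Lucene.
  public static void addAll(IndexWriter writer, List<CodecReader> readers) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(readers.size());
    try {
      List<Future<Void>> futures = new ArrayList<>();
      for (CodecReader reader : readers) {
        futures.add(pool.submit((Callable<Void>) () -> {
          writer.addIndexes(reader); // varargs: a single reader per call
          return null;
        }));
      }
      for (Future<Void> f : futures) {
        f.get(); // propagate any merge failure
      }
    } finally {
      pool.shutdown();
    }
  }
}
{code}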






[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API

2022-03-19 Thread Vigya Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509384#comment-17509384
 ] 

Vigya Sharma commented on LUCENE-10216:
---

While making this change, I found an existing 
[TODO|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L3158-L3159],
 which I think we're able to address now.
{code:java}
// TODO: somehow we should fix this merge so it's  
// abortable so that IW.close(false) is able to stop it {code}
Earlier, {{addIndexes(CodecReader...)}} triggered merges directly via 
SegmentMerger. As a result, the running merges could be tracked in 
{{Set<SegmentMerger> runningAddIndexesMerges}}, but any pending merge 
operations could not be aborted preemptively.

Now, merges are defined via {{MergePolicy.OneMerge}} objects and scheduled by 
the {{MergeScheduler}}. We are able to leverage the OneMerge abstraction 
to abort any pending merges; they are aborted when the IndexWriter is rolled 
back or closed.

> Add concurrency to addIndexes(CodecReader…) API
> ---
>
> Key: LUCENE-10216
> URL: https://issues.apache.org/jira/browse/LUCENE-10216
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Vigya Sharma
>Priority: Major
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> I work at Amazon Product Search, and we use Lucene to power search for the 
> e-commerce platform. I’m working on a project that involves applying 
> metadata+ETL transforms and indexing documents on n different _indexing_ 
> boxes, combining them into a single index on a separate _reducer_ box, and 
> making it available for queries on m different _search_ boxes (replicas). 
> Segments are asynchronously copied from indexers to reducers to searchers as 
> they become available for the next layer to consume.
> I am using the addIndexes API to combine multiple indexes into one on the 
> reducer boxes. Since we also have taxonomy data, we need to remap facet field 
> ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version 
> of this API. The API leverages {{SegmentMerger.merge()}} to create segments 
> with new ordinal values while also merging all provided segments in the 
> process.
> _This is however a blocking call that runs in a single thread._ Until we have 
> written segments with new ordinal values, we cannot copy them to searcher 
> boxes, which increases the time to make documents available for search.
> I was playing around with the API by creating multiple concurrent merges, 
> each with only a single reader, yielding a concurrently running 1:1 
> conversion from old segments to new ones (with new ordinal values). We follow 
> this up with non-blocking background merges. This lets us copy the segments 
> to searchers and replicas as soon as they are available, and later replace 
> them with merged segments as background jobs complete. On the Amazon dataset 
> I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. 
> Each call was given about 5 readers to add on average.
> This might be a useful addition to Lucene. We could create another {{addIndexes()}} 
> API with a {{boolean}} flag for concurrency, that internally submits multiple 
> merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, 
> and waits for them to complete before returning.
> While this is doable from outside Lucene by using your own thread pool, starting 
> multiple addIndexes() calls and waiting for them to complete, I felt it requires 
> some understanding of what addIndexes does, why you need to wait on the merge, 
> and why it makes sense to pass a single reader in the addIndexes API.
> Out-of-the-box support in Lucene could simplify this for folks with a similar use case.




[jira] [Commented] (LUCENE-8580) Make segment merging parallel in SegmentMerger

2022-03-19 Thread Vigya Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509385#comment-17509385
 ] 

Vigya Sharma commented on LUCENE-8580:
--

I started looking into this and had some early questions: what would be the 
right benchmark to look at for measuring any improvements here? Would it be 
the Indexing Throughput benchmark?

Should we add some benchmark specifically for merge-related performance, like 
triggering a force merge? I suppose it won't be a measure of QPS, but rather 
the time to complete a merge on a given index. Do we have such benchmarks 
already?

Apologies if these are obvious things; I've only recently started looking at 
benchmarks. I will update here if I'm able to figure something out myself.
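For illustration, one way to take that measurement (a rough sketch assuming an
index built elsewhere; this is not one of the official luceneutil benchmarks):
{code:java}
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ForceMergeTimer {
  public static void main(String[] args) throws Exception {
    // args[0]: path to an existing index
    try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
        IndexWriter writer =
            new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      long start = System.nanoTime();
      writer.forceMerge(1); // merge everything down to a single segment
      System.out.println(
          "forceMerge(1) took " + (System.nanoTime() - start) / 1_000_000 + " msec");
    }
  }
}
{code}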

> Make segment merging parallel in SegmentMerger
> --
>
> Key: LUCENE-8580
> URL: https://issues.apache.org/jira/browse/LUCENE-8580
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Attachments: LUCENE-8580.patch
>
>
> A placeholder issue stemming from the discussion on the mailing list [1]. Not 
> of any high priority.
> At the moment any merging from N segments into one will happen sequentially 
> for each data structure involved in a segment (postings, norms, points, 
> etc.). If the input segments are large, the CPU (and I/O) are mostly unused 
> and the process takes a long time. 
> Merging of these data structures is mostly independent of each other, so it'd 
> be interesting to see if we can speed things up by allowing them to run 
> concurrently. I investigated this on a 40GB index with 22 segments, 
> force-merging this into 1 segment (of similar size). Quick and dirty patch 
> attached.
> I see some improvement, although it's not by much; the largest component 
> dominates everything else.
> Results from an 8-core CPU.
> Before:
> {code}
> SM 0 [2018-11-30T09:21:11.662Z; main]: 347237 msec to merge stored fields 
> [41922110 docs]
> SM 0 [2018-11-30T09:21:18.236Z; main]: 6562 msec to merge norms [41922110 
> docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 755507 msec to merge postings 
> [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge doc values [41922110 
> docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge points [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 7 msec to write field infos [41922110 
> docs]
> IW 0 [2018-11-30T09:33:56.124Z; main]: merge time 1112238 msec for 41922110 
> docs
> {code}
> After:
> {code}
> SM 0 [2018-11-30T10:16:42.179Z; ForkJoinPool.commonPool-worker-1]: 8189 msec 
> to merge norms
> SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to 
> merge doc values
> SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to 
> merge points
> SM 0 [2018-11-30T10:16:42.211Z; ForkJoinPool.commonPool-worker-1]: merge 
> store matchedCount=22 vs 22
> SM 0 [2018-11-30T10:23:24.574Z; ForkJoinPool.commonPool-worker-1]: 402381 
> msec to merge stored fields [41922110 docs]
> SM 0 [2018-11-30T10:32:20.862Z; ForkJoinPool.commonPool-worker-2]: 938668 
> msec to merge postings
> IW 0 [2018-11-30T10:32:23.513Z; main]: merge time  950249 msec for 41922110 
> docs
> {code}
> Ideally, one would need to push fork-join into individual subroutines so that, 
> for example, postings utilize concurrency when merging (pulling blocks of 
> terms concurrently from the input, calculating statistics, etc., and then 
> pushing to the codec in an ordered fashion).
> [1] https://markmail.org/thread/dtejwq42qagykeac
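To make the shape of the attached patch concrete, here is a hedged sketch of
the parallelism described above, with each per-format merge step submitted as
an independent task (the Runnable parameters stand in for SegmentMerger's
actual merge steps, which are named differently in the real code):
{code:java}
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative only: run the independent per-format merge steps concurrently,
// so wall-clock time approaches max(step) instead of sum(steps).
final class ParallelMergeSketch {
  static void mergeAll(Runnable storedFields, Runnable norms, Runnable postings,
      Runnable docValues, Runnable points) {
    List<CompletableFuture<Void>> tasks = List.of(
        CompletableFuture.runAsync(storedFields),
        CompletableFuture.runAsync(norms),
        CompletableFuture.runAsync(postings), // usually dominates
        CompletableFuture.runAsync(docValues),
        CompletableFuture.runAsync(points));
    CompletableFuture.allOf(tasks.toArray(new CompletableFuture[0])).join();
  }
}
{code}
As the numbers above show, the win is bounded by the slowest component
(postings here), which is why the overall improvement is modest.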






[GitHub] [lucene] vigyasharma commented on a change in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-03-19 Thread GitBox


vigyasharma commented on a change in pull request #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r830568250



##
File path: lucene/core/src/java/org/apache/lucene/index/MergePolicy.java
##
@@ -813,12 +866,24 @@ protected final boolean verbose(MergeContext 
mergeContext) {
   }
 
   static final class MergeReader {
+final CodecReader codecReader;
 final SegmentReader reader;

Review comment:
   I looked into this more, and I'm not able to get rid of either of them 
in the current structure. SegmentReader seems to hold important information 
about segments that is required by consumers in different places, like the 
regular segment merges, so I can't always just use CodecReader in OneMerge 
objects.
   
   AddIndexes() can only work with CodecReaders, as there are no segments to 
begin with (which means I also cannot create a SegmentReader from the provided 
CodecReaders here).
   
   Maybe I can extend `OneMerge` to create a `ReaderOneMerge` class that only 
works with CodecReaders? I would love to hear your thoughts, or any other 
ideas.
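   For concreteness, a hypothetical sketch of the pairing being discussed (the
class and field names are placeholders, not the PR's final design):
```java
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.SegmentReader;

// Hypothetical: keep both reader views, leaving the SegmentReader absent
// on the addIndexes(CodecReader...) path where no segment exists yet.
final class MergeReaderSketch {
  final CodecReader codecReader;     // always present
  final SegmentReader segmentReader; // null for external CodecReaders

  MergeReaderSketch(CodecReader codecReader, SegmentReader segmentReader) {
    this.codecReader = codecReader;
    this.segmentReader = segmentReader;
  }
}
```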







[jira] [Created] (LUCENE-10476) Modify the findMerges policy for addIndexes(CodecReader[]) to create 1 segment per reader

2022-03-19 Thread Vigya Sharma (Jira)
Vigya Sharma created LUCENE-10476:
-

 Summary: Modify the findMerges policy for 
addIndexes(CodecReader[]) to create 1 segment per reader
 Key: LUCENE-10476
 URL: https://issues.apache.org/jira/browse/LUCENE-10476
 Project: Lucene - Core
  Issue Type: Sub-task
Reporter: Vigya Sharma


Once we have LUCENE-10216, we can trigger concurrent merges for the 
addIndexes(CodecReader...) API. These merges are defined in the MergePolicy 
and triggered by the MergeScheduler.

We can now create a merge policy that does a 1:1 addIndexes call, where the 
merger creates one segment for each provided reader. It would create a 
faster, more concurrent {{addIndexes(CodecReader...)}}, at the cost of 
deferring some merges to be done later in the background.

This, I believe, is similar to the behavior of {{addIndexes(Directory...)}}: 
all incoming segments are simply added to the IndexWriter, and any merging 
happens later in the background.

The change would make both APIs behave consistently.
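A hedged sketch of what such a policy override might look like, written
against the draft findMerges(List<CodecReader>) API from the LUCENE-10216 PR
(the exact signatures may change before it lands):
{code:java}
// Sketch only, based on the in-progress API: emit one OneMerge per incoming
// reader instead of a single merge over all of them. Real merging is then
// deferred to background merges, as with addIndexes(Directory...).
@Override
public MergeSpecification findMerges(List<CodecReader> readers) throws IOException {
  MergeSpecification spec = new MergeSpecification();
  for (CodecReader reader : readers) {
    spec.add(new OneMerge(List.of(reader), leaf -> new MergeReader(leaf, leaf.getLiveDocs())));
  }
  return spec;
}
{code}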






[GitHub] [lucene] vigyasharma commented on a change in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-03-19 Thread GitBox


vigyasharma commented on a change in pull request #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r830568703



##
File path: lucene/core/src/java/org/apache/lucene/index/MergePolicy.java
##
@@ -567,6 +605,21 @@ public abstract MergeSpecification findMerges(
   MergeTrigger mergeTrigger, SegmentInfos segmentInfos, MergeContext 
mergeContext)
   throws IOException;
 
+  /**
+   * Define {@link OneMerge} operations for a list of codec readers. This call 
is used to
+   * define merges for input readers in {@link 
IndexWriter#addIndexes(CodecReader...)}.
+   * Default implementation adds all readers to a single merge. This can be 
overridden in custom
+   * merge policies.
+   *
+   * @param readers set of readers to merge into the main index
+   */
+  public MergeSpecification findMerges(List<CodecReader> readers) throws 
IOException {
+OneMerge merge = new OneMerge(readers, leaf -> new MergeReader(leaf, 
leaf.getLiveDocs()));

Review comment:
   Created https://issues.apache.org/jira/browse/LUCENE-10476 for the 
follow-up task.



