[GitHub] [lucene] vsop-479 commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

2022-09-28 Thread GitBox


vsop-479 commented on PR #11722:
URL: https://github.com/apache/lucene/pull/11722#issuecomment-1260490249

   @jpountz Thranks for your review and suggestion. I have added a CHANGES 
entry and assert term value code. 
   Please have a review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on issue #11827: Release manager should review lucene benchmarks before building release candidates

2022-09-28 Thread GitBox


uschindler commented on issue #11827:
URL: https://github.com/apache/lucene/issues/11827#issuecomment-1260501918

   I fully agree, some checks should be done.
   But here are a few bits that came into my mind:
   - The @mikemccand benchmarks are running against main branch only. So the 
first check should be *before* creating the release branch, so we know that the 
start of branch is "clean".
   - Anything else that gets included into release branch should be monitored 
by the RM if it negatively affects perf on main.
   - Is there a benchmark available for the 9.x branch? If it would be 
configurable / modular we could have a base branch (against main and in 
parallel against 9.x). When a release branch is created we could compare the 
changes on the release branch with the baselines!?
   
   These are just ideas, but basically the important thing would be: Before 
creating the release branch do a quick check through the benchmark suite on 
Mike's web page and find potential issues. During the release process when 
stuff is merged into release branch monitor all merged changes manually on 
main's benchmark.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

2022-09-28 Thread GitBox


jpountz commented on code in PR #11722:
URL: https://github.com/apache/lucene/pull/11722#discussion_r982050404


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/BasePostingsFormatTestCase.java:
##
@@ -367,6 +367,53 @@ public void testGhosts() throws Exception {
 dir.close();
   }
 
+  public void testBinarySearchTermLeaf() throws Exception {
+Directory dir = newDirectory();
+
+IndexWriterConfig iwc = newIndexWriterConfig(null);
+iwc.setCodec(getCodec());
+iwc.setMergePolicy(newTieredMergePolicy());
+IndexWriter iw = new IndexWriter(dir, iwc);
+
+for (int i = 10; i <= 100400; i++){
+  // only add odd number
+  if (i % 2 == 1) {
+Document document = new Document();
+document.add(new StringField("id", i + "", Field.Store.NO));
+iw.addDocument(document);
+  }
+}
+iw.commit();
+iw.forceMerge(1);
+
+DirectoryReader reader = DirectoryReader.open(iw);
+TermsEnum termsEnum = getOnlyLeafReader(reader).terms("id").iterator();
+// test seekExact
+for (int i = 10; i <= 100400; i++){
+  BytesRef target = new BytesRef(i + "");
+  if (i % 2 == 1) {
+assertTrue(termsEnum.seekExact(target));
+assertEquals(termsEnum.term(), target);
+  } else {
+assertFalse(termsEnum.seekExact(target));
+  }
+}
+// test seekCeil
+for (int i = 10; i < 100400; i++){
+  BytesRef target = new BytesRef(i + "");
+  if (i % 2 == 1) {
+assertEquals(SeekStatus.FOUND, termsEnum.seekCeil(target));
+assertEquals(termsEnum.term(), target);
+  } else {
+assertEquals(SeekStatus.NOT_FOUND, termsEnum.seekCeil(target));

Review Comment:
   maybe also assert here that the terms enum is positioned on the next term?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] Mahdi-Seeker commented on issue #10177: Introduce IVFFlat to Lucene for ANN similarity search [LUCENE-9136]

2022-09-28 Thread GitBox


Mahdi-Seeker commented on issue #10177:
URL: https://github.com/apache/lucene/issues/10177#issuecomment-1260514435

   Hi guys
   Thanks for your great job on Lucene and specially this ANN search!
   
   Any progress on this issue? We're trying to use vector search, but HNSW 
seems to take too much memory...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase commented on issue #11824: Performance regression on LatLonPoint#newPolygonQuery

2022-09-28 Thread GitBox


iverase commented on issue #11824:
URL: https://github.com/apache/lucene/issues/11824#issuecomment-1260667934

   Fix seems to bring performance back to previous levels:
   
   https://user-images.githubusercontent.com/29038686/192749125-0ac0b341-b2cb-4395-991c-9f676322592a.png";>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #1039: LUCENE-10635: Ensure test coverage for WANDScorer by using a test query

2022-09-28 Thread GitBox


jpountz commented on code in PR #1039:
URL: https://github.com/apache/lucene/pull/1039#discussion_r982215123


##
lucene/core/src/test/org/apache/lucene/search/TestWANDScorer.java:
##
@@ -947,4 +988,82 @@ public long cost() {
   };
 }
   }
+
+  private static class WANDScorerQuery extends Query {
+private final BooleanQuery query;
+
+private WANDScorerQuery(BooleanQuery query) {

Review Comment:
   I was thinking of something like that indeed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on issue #11827: Release manager should review lucene benchmarks before building release candidates

2022-09-28 Thread GitBox


mikemccand commented on issue #11827:
URL: https://github.com/apache/lucene/issues/11827#issuecomment-1260778045

   > yup. Possibly too if Mike M is bored he could implement an alarming system 
:) or export the data somehow so we could bolt one on the side?
   
   Actually I rather like the alarm idea.  Maybe we could start with something 
simple -- more than X% change over Y data points?  Maybe for starters the 
nightly bench just writes these alarms in (blinking red font, heh) at the [top 
of the index page](https://home.apache.org/~mikemccand/lucenebench/)?  If it 
seems like that is not too noisy, maybe we find some stronger way to "notify" 
us.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on issue #11824: Performance regression on LatLonPoint#newPolygonQuery

2022-09-28 Thread GitBox


mikemccand commented on issue #11824:
URL: https://github.com/apache/lucene/issues/11824#issuecomment-1260787630

   Thanks for catching this @iverase and the quick fix, and the follow-on issue 
to better detect such regressions before release: #11827


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on issue #11827: Release manager should review lucene benchmarks before building release candidates

2022-09-28 Thread GitBox


mikemccand commented on issue #11827:
URL: https://github.com/apache/lucene/issues/11827#issuecomment-1260788095

   This was a spinoff from #11824.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz opened a new issue, #11829: Reproducible TestShapeDocValues failure

2022-09-28 Thread GitBox


jpountz opened a new issue, #11829:
URL: https://github.com/apache/lucene/issues/11829

   ### Description
   
   The test seems to be creating invalid polygons.
   
   ```
   03:40:22 org.apache.lucene.document.TestShapeDocValues > 
testXYPolygonCentroid FAILED
   03:40:22 WARNING: The Security Manager is deprecated and will be removed in 
a future release
   03:40:22 java.lang.IllegalArgumentException: Polygon self-intersection 
at lat=2.9067992002286562 lon=6.0168174085684076E7
   03:40:22 at 
__randomizedtesting.SeedInfo.seed([29AE25A2D4FB94F9:CC0CD5ABA2A74207]:0)
   03:40:22 at 
org.apache.lucene.geo.Tessellator.checkIntersectionPoint(Tessellator.java:958)
   03:40:22 at 
org.apache.lucene.geo.Tessellator.checkIntersection(Tessellator.java:870)
   03:40:22 at 
org.apache.lucene.geo.Tessellator.tessellate(Tessellator.java:198)
   03:40:22 at 
org.apache.lucene.geo.Tessellator.tessellate(Tessellator.java:152)
   03:40:22 at 
org.apache.lucene.document.XYShape.createIndexableFields(XYShape.java:99)
   03:40:22 at 
org.apache.lucene.document.TestShapeDocValues.getTessellation(TestShapeDocValues.java:249)
   03:40:22 at 
org.apache.lucene.document.TestShapeDocValues.getTessellation(TestShapeDocValues.java:232)
   03:40:22 at 
org.apache.lucene.document.TestShapeDocValues.computeCentroid(TestShapeDocValues.java:98)
   03:40:22 at 
org.apache.lucene.document.TestShapeDocValues.testXYPolygonCentroid(TestShapeDocValues.java:90)
   03:40:22 at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   03:40:22 at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
   03:40:22 at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   03:40:22 at 
java.base/java.lang.reflect.Method.invoke(Method.java:568)
   03:40:22 at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
   03:40:22 at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
   03:40:22 at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
   03:40:22 at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
   03:40:22 at 
org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
   03:40:22 at 
org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   03:40:22 at 
org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
   03:40:22 at 
org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   03:40:22 at 
org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   03:40:22 at org.junit.rules.RunRules.evaluate(RunRules.java:20)
   03:40:22 at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   03:40:22 at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   03:40:22 at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
   03:40:22 at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
   03:40:22 at 
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
   03:40:22 at 
com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
   03:40:22 at 
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
   03:40:22 at 
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
   03:40:22 at 
org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   03:40:22 at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   03:40:22 at 
org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
   03:40:22 at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   03:40:22 at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   03:40:22 at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   03:40:22 at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   03:40:22 at 
org.a

[GitHub] [lucene] nknize commented on issue #11824: Performance regression on LatLonPoint#newPolygonQuery

2022-09-28 Thread GitBox


nknize commented on issue #11824:
URL: https://github.com/apache/lucene/issues/11824#issuecomment-1260938633

   What's annoying is how incredibly trappy this override logic is. That a 
method call literally moving from `createWeight` to `getScorerSupplier` results 
in a 72.2% regression even slipped by me before merge doesn't bode well for new 
committers. And then it sat in regression until an entire company was 
interested in releasing. 
   
   I wonder if we can do better? Like maybe figure out better guardrails in 
these methods? Perhaps by something as simple as a rename to signal one happens 
per segment? This isn't the first and will certainly not be the last time an 
expensive operation accidentally slips to a critical path. Any other ideas how 
to lower the bar here for new committers? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on a diff in pull request #11780: GH#11601: Add ability to compute reader states after refresh

2022-09-28 Thread GitBox


mikemccand commented on code in PR #11780:
URL: https://github.com/apache/lucene/pull/11780#discussion_r982553986


##
lucene/core/src/java/org/apache/lucene/search/ReferenceManager.java:
##
@@ -219,6 +219,36 @@ public final boolean maybeRefresh() throws IOException {
 return doTryRefresh;
   }
 
+  /** Compute some state of the reference using a parameter. */
+  @FunctionalInterface
+  public interface StateCalculator {
+R calculate(G current, P param);
+  }
+
+  /**
+   * If you need to compute something after the refresh, you can use this 
method instead of {@link
+   * #maybeRefresh()}.
+   *
+   * @throws IOException if refreshing the resource causes an {@link 
IOException}
+   */
+  public final  R maybeRefreshAndReturnState(
+  StateCalculator refreshedStateCalculator, P param) throws 
IOException {
+ensureOpen();
+
+// Ensure only 1 thread does refresh at once; other threads just return 
immediately:
+final boolean doTryRefresh = refreshLock.tryLock();
+if (doTryRefresh) {
+  try {
+doMaybeRefresh();
+return refreshedStateCalculator.calculate(current, param);

Review Comment:
   Hmm why is it necessary to compute this state under lock?  Can't the caller 
just compute the state themselves when they get the refreshed reader?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] stefanvodita commented on a diff in pull request #11780: GH#11601: Add ability to compute reader states after refresh

2022-09-28 Thread GitBox


stefanvodita commented on code in PR #11780:
URL: https://github.com/apache/lucene/pull/11780#discussion_r982630324


##
lucene/core/src/java/org/apache/lucene/search/ReferenceManager.java:
##
@@ -219,6 +219,36 @@ public final boolean maybeRefresh() throws IOException {
 return doTryRefresh;
   }
 
+  /** Compute some state of the reference using a parameter. */
+  @FunctionalInterface
+  public interface StateCalculator {
+R calculate(G current, P param);
+  }
+
+  /**
+   * If you need to compute something after the refresh, you can use this 
method instead of {@link
+   * #maybeRefresh()}.
+   *
+   * @throws IOException if refreshing the resource causes an {@link 
IOException}
+   */
+  public final  R maybeRefreshAndReturnState(
+  StateCalculator refreshedStateCalculator, P param) throws 
IOException {
+ensureOpen();
+
+// Ensure only 1 thread does refresh at once; other threads just return 
immediately:
+final boolean doTryRefresh = refreshLock.tryLock();
+if (doTryRefresh) {
+  try {
+doMaybeRefresh();
+return refreshedStateCalculator.calculate(current, param);

Review Comment:
   We’re guaranteeing that no other refresh happens before computing the reader 
states. Maybe that’s not important to do? If so, then the user can always 
refresh and then compute their own reader states.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #11828: TermInSetQuery optimization when all docs in a field match a term

2022-09-28 Thread GitBox


gsmiller commented on PR #11828:
URL: https://github.com/apache/lucene/pull/11828#issuecomment-1261195636

   > I assume we already have tests that cover this case?
   
   Good question. I'm going to go tweak our tests. We added tests that cover 
the completely dense case (i.e., all docs in a segment) when adding that 
optimization, but we should augment the tests for this more specific 
optimization. Will rev the PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #11803: DrillSideways optimizations

2022-09-28 Thread GitBox


gsmiller commented on PR #11803:
URL: https://github.com/apache/lucene/pull/11803#issuecomment-1261198752

   @zhaih I reexamined our test coverage and think we're in good shape already 
actually. We've got good coverage for covering drill-sideways correctness with 
multiple dimensions, etc. (including random and non-random). We could try to 
take these further by somehow asserting that advance is being used in favor of 
nextDoc when appropriate, but I think those tests would be reasonably complex 
to write and I'm not sure they add tremendous value. I'd rather we spent time 
building drill-sideways benchmarks that focus on ensuring our performance 
doesn't regress. But that's just my opinion. Please let me know if you feel 
differently and we can keep discussing. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-28 Thread GitBox


zhaih commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r982692228


##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FilterDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexOutput;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+  private final AtomicLong realTimeFlushedBytes = new AtomicLong();

Review Comment:
   This real time counter could potentially harm the performance since it is 
increased every byte you write for every thread.
   
   Since the IndexOutput is already not thread-safe, we probably can let 
`ByteTrackingIndexOutput` keep a private (volatile?) counter of bytes, and 
expose a public getter. Then in this directory wrapper keep a list of tracking 
output so that when we need we can just call the getter methods and sum up the 
numbers. Also to avoid tracking too much index output we can deregister them 
when they're closed.
   
   It's just another idea and could be a follow-up, but I think we need to be 
careful about using the atomics.



##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FilterDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexOutput;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+  private final AtomicLong realTimeFlushedBytes = new AtomicLong();
+  private final AtomicLong realTimeMergedBytes = new AtomicLong();
+
+  public final boolean trackTempOutput;
+
+  /**
+   * Constructor defaults to not tracking temp outputs
+   *
+   * @param in input Directory
+   */
+  public ByteWritesTrackingDirectoryWrapper(Directory in) {
+this(in, false);
+  }
+
+  /**
+   * Constructor with option to track tempOutput
+   *
+   * @param in input Directory
+   * @param trackTempOutput if true, will also track temporary outputs created 
by this directory
+   */
+  public ByteWritesTrackingDirectoryWrapper(Directory in, boolean 
trackTempOutput) {
+super(in);
+this.trackTempOutput = trackTempOutput;
+  }
+
+  @Override
+  public IndexOutput createOutput(String name, IOContext ioContext) throws 
IOException {
+IndexOutput output = in.createOutput(name, ioContext);
+IndexOutput byteTrackingIndexOutput;
+if (ioContext.context.equals(IOContext.Context.FLUSH)) {
+  byteTrackingIndexOutput =
+  new ByteTrackingIndexOutput(output, flushedBy

[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-28 Thread GitBox


mdmarshmallow commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r982759362


##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FilterDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexOutput;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+  private final AtomicLong realTimeFlushedBytes = new AtomicLong();
+  private final AtomicLong realTimeMergedBytes = new AtomicLong();
+
+  public final boolean trackTempOutput;
+
+  /**
+   * Constructor defaults to not tracking temp outputs
+   *
+   * @param in input Directory
+   */
+  public ByteWritesTrackingDirectoryWrapper(Directory in) {
+this(in, false);
+  }
+
+  /**
+   * Constructor with option to track tempOutput
+   *
+   * @param in input Directory
+   * @param trackTempOutput if true, will also track temporary outputs created 
by this directory
+   */
+  public ByteWritesTrackingDirectoryWrapper(Directory in, boolean 
trackTempOutput) {
+super(in);
+this.trackTempOutput = trackTempOutput;
+  }
+
+  @Override
+  public IndexOutput createOutput(String name, IOContext ioContext) throws 
IOException {
+IndexOutput output = in.createOutput(name, ioContext);
+IndexOutput byteTrackingIndexOutput;
+if (ioContext.context.equals(IOContext.Context.FLUSH)) {
+  byteTrackingIndexOutput =
+  new ByteTrackingIndexOutput(output, flushedBytes, 
realTimeFlushedBytes);
+} else if (ioContext.context.equals(IOContext.Context.MERGE)) {
+  byteTrackingIndexOutput =
+  new ByteTrackingIndexOutput(output, mergedBytes, 
realTimeMergedBytes);
+} else {
+  return output;
+}
+return byteTrackingIndexOutput;
+  }
+
+  @Override
+  public IndexOutput createTempOutput(String prefix, String suffix, IOContext 
ioContext)
+  throws IOException {
+IndexOutput output = in.createTempOutput(prefix, suffix, ioContext);
+if (trackTempOutput) {
+  IndexOutput byteTrackingIndexOutput;
+  if (ioContext.context.equals(IOContext.Context.FLUSH)) {
+byteTrackingIndexOutput =
+new ByteTrackingIndexOutput(output, flushedBytes, 
realTimeFlushedBytes);
+  } else if (ioContext.context.equals(IOContext.Context.MERGE)) {
+byteTrackingIndexOutput =
+new ByteTrackingIndexOutput(output, mergedBytes, 
realTimeMergedBytes);
+  } else {
+return output;
+  }
+  return byteTrackingIndexOutput;
+}
+return output;
+  }
+
+  public double getApproximateWriteAmplificationFactor() {
+double flushedBytes = (double) this.flushedBytes.get();
+if (flushedBytes == 0.0) {
+  return 1.0;
+}
+double mergedBytes = (double) this.mergedBytes.get();
+return (flushedBytes + mergedBytes) / flushedBytes;
+  }
+
+  /** Gets a more up-to-date but less accurate write amplification factor */

Review Comment:
   My understanding is that the real time metric would track partially 
completed flushed and merges, so it wouldn't be the "true" write amplification 
factor of all completed flushes and merges. I'm not sure if there are any cases 
where a flush or merge would get cancelled before it's completed though to be 
honest.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For addition

[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-28 Thread GitBox


mdmarshmallow commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r982764818


##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FilterDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexOutput;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+  private final AtomicLong realTimeFlushedBytes = new AtomicLong();

Review Comment:
   I think that's a good idea. Not sure we need to the counter to be volatile 
though since each `ByteTrackingIndexOutput` operates on a single thread right? 
So if the counter is private it should be fine I think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-28 Thread GitBox


zhaih commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r982825433


##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FilterDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexOutput;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+  private final AtomicLong realTimeFlushedBytes = new AtomicLong();

Review Comment:
   I think volatile is necessary because it need to be read by another thread 
possibly



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-28 Thread GitBox


zhaih commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r982825989


##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FilterDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexOutput;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+  private final AtomicLong realTimeFlushedBytes = new AtomicLong();
+  private final AtomicLong realTimeMergedBytes = new AtomicLong();
+
+  public final boolean trackTempOutput;
+
+  /**
+   * Constructor defaults to not tracking temp outputs
+   *
+   * @param in input Directory
+   */
+  public ByteWritesTrackingDirectoryWrapper(Directory in) {
+this(in, false);
+  }
+
+  /**
+   * Constructor with option to track tempOutput
+   *
+   * @param in input Directory
+   * @param trackTempOutput if true, will also track temporary outputs created 
by this directory
+   */
+  public ByteWritesTrackingDirectoryWrapper(Directory in, boolean 
trackTempOutput) {
+super(in);
+this.trackTempOutput = trackTempOutput;
+  }
+
+  @Override
+  public IndexOutput createOutput(String name, IOContext ioContext) throws 
IOException {
+IndexOutput output = in.createOutput(name, ioContext);
+IndexOutput byteTrackingIndexOutput;
+if (ioContext.context.equals(IOContext.Context.FLUSH)) {
+  byteTrackingIndexOutput =
+  new ByteTrackingIndexOutput(output, flushedBytes, 
realTimeFlushedBytes);
+} else if (ioContext.context.equals(IOContext.Context.MERGE)) {
+  byteTrackingIndexOutput =
+  new ByteTrackingIndexOutput(output, mergedBytes, 
realTimeMergedBytes);
+} else {
+  return output;
+}
+return byteTrackingIndexOutput;
+  }
+
+  @Override
+  public IndexOutput createTempOutput(String prefix, String suffix, IOContext 
ioContext)
+  throws IOException {
+IndexOutput output = in.createTempOutput(prefix, suffix, ioContext);
+if (trackTempOutput) {
+  IndexOutput byteTrackingIndexOutput;
+  if (ioContext.context.equals(IOContext.Context.FLUSH)) {
+byteTrackingIndexOutput =
+new ByteTrackingIndexOutput(output, flushedBytes, 
realTimeFlushedBytes);
+  } else if (ioContext.context.equals(IOContext.Context.MERGE)) {
+byteTrackingIndexOutput =
+new ByteTrackingIndexOutput(output, mergedBytes, 
realTimeMergedBytes);
+  } else {
+return output;
+  }
+  return byteTrackingIndexOutput;
+}
+return output;
+  }
+
+  public double getApproximateWriteAmplificationFactor() {
+double flushedBytes = (double) this.flushedBytes.get();
+if (flushedBytes == 0.0) {
+  return 1.0;
+}
+double mergedBytes = (double) this.mergedBytes.get();
+return (flushedBytes + mergedBytes) / flushedBytes;
+  }
+
+  /** Gets a more up-to-date but less accurate write amplification factor */

Review Comment:
   OK, in case any file are cancelled then this one will be inaccurate, makes 
sense



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-28 Thread GitBox


zhaih commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r982826316


##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FilterDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexOutput;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+  private final AtomicLong realTimeFlushedBytes = new AtomicLong();
+  private final AtomicLong realTimeMergedBytes = new AtomicLong();
+
+  public final boolean trackTempOutput;
+
+  /**
+   * Constructor defaults to not tracking temp outputs
+   *
+   * @param in input Directory
+   */
+  public ByteWritesTrackingDirectoryWrapper(Directory in) {
+this(in, false);
+  }
+
+  /**
+   * Constructor with option to track tempOutput
+   *
+   * @param in input Directory
+   * @param trackTempOutput if true, will also track temporary outputs created 
by this directory
+   */
+  public ByteWritesTrackingDirectoryWrapper(Directory in, boolean 
trackTempOutput) {
+super(in);
+this.trackTempOutput = trackTempOutput;
+  }
+
+  @Override
+  public IndexOutput createOutput(String name, IOContext ioContext) throws 
IOException {
+IndexOutput output = in.createOutput(name, ioContext);
+IndexOutput byteTrackingIndexOutput;
+if (ioContext.context.equals(IOContext.Context.FLUSH)) {
+  byteTrackingIndexOutput =
+  new ByteTrackingIndexOutput(output, flushedBytes, 
realTimeFlushedBytes);
+} else if (ioContext.context.equals(IOContext.Context.MERGE)) {
+  byteTrackingIndexOutput =
+  new ByteTrackingIndexOutput(output, mergedBytes, 
realTimeMergedBytes);
+} else {
+  return output;
+}
+return byteTrackingIndexOutput;
+  }
+
+  @Override
+  public IndexOutput createTempOutput(String prefix, String suffix, IOContext 
ioContext)
+  throws IOException {
+IndexOutput output = in.createTempOutput(prefix, suffix, ioContext);
+if (trackTempOutput) {
+  IndexOutput byteTrackingIndexOutput;
+  if (ioContext.context.equals(IOContext.Context.FLUSH)) {
+byteTrackingIndexOutput =
+new ByteTrackingIndexOutput(output, flushedBytes, 
realTimeFlushedBytes);
+  } else if (ioContext.context.equals(IOContext.Context.MERGE)) {
+byteTrackingIndexOutput =
+new ByteTrackingIndexOutput(output, mergedBytes, 
realTimeMergedBytes);
+  } else {
+return output;
+  }
+  return byteTrackingIndexOutput;
+}
+return output;
+  }
+
+  public double getApproximateWriteAmplificationFactor() {
+double flushedBytes = (double) this.flushedBytes.get();
+if (flushedBytes == 0.0) {
+  return 1.0;
+}
+double mergedBytes = (double) this.mergedBytes.get();
+return (flushedBytes + mergedBytes) / flushedBytes;
+  }
+
+  /** Gets a more up-to-date but less accurate write amplification factor */

Review Comment:
   Maybe add the explanation to javadoc?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #11803: DrillSideways optimizations

2022-09-28 Thread GitBox


gsmiller commented on PR #11803:
URL: https://github.com/apache/lucene/pull/11803#issuecomment-1261458849

   @zhaih that's a good point and valid concern. I dug into the existing tests 
and it looks like we have lots of coverage _except_ that the majority of the 
coverage is using basic, single-phase drill-down dimensions. I'm going to 
augment our randomized testing to randomly use two-phase drill-downs to broaden 
coverage. Thanks for the discussion!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-28 Thread GitBox


mdmarshmallow commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r982891007


##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FilterDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexOutput;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+  private final AtomicLong realTimeFlushedBytes = new AtomicLong();

Review Comment:
   Oh I think you're right. I'll make that change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-28 Thread GitBox


uschindler commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r982891128


##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FilterDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexOutput;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+  private final AtomicLong realTimeFlushedBytes = new AtomicLong();

Review Comment:
   Volatile is worse than AtomicLong, because it has a barrier on both read and 
write. So the atomic update is better.
   



##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FilterDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexOutput;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+  private final AtomicLong realTimeFlushedBytes = new AtomicLong();

Review Comment:
   Do we really read from another thread while writing? Updating the counter is 
fine as it only happens in one thread. The problem is reading so we get a value 
that's up-to-date. The question is how up-to-date it needs to be.
   
   Maybe we can have a standard long value that gets updated by the writing 
thread. Every n~1000 bytes written we update the atomic one. What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-28 Thread GitBox


mdmarshmallow commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r982902771


##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FilterDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexOutput;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+  private final AtomicLong realTimeFlushedBytes = new AtomicLong();

Review Comment:
   I don't see any issue with having a real time counter lagging by 1000 bytes, 
though maybe we can also provide that as an option that can be changed and have 
the default at 1000?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mdmarshmallow commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-28 Thread GitBox


mdmarshmallow commented on PR #11796:
URL: https://github.com/apache/lucene/pull/11796#issuecomment-1261519604

   @jpountz , in response to this:
   
   > I'm considering exposing write amplification separately for flushes (as 
flushedBytes / totalIndexSize), merges (as (totalIndexSize + mergedBytes) / 
totalIndexSize) and temporary files (as (totalIndexSize + tempBytes) / 
totalIndexSize) and pushing the responsibility to users of whether and how they 
would like to combine these various metrics?
   
   We could add these in addition to the write amplification factor methods 
that are already present?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-28 Thread GitBox


uschindler commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r982909584


##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FilterDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexOutput;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+  private final AtomicLong realTimeFlushedBytes = new AtomicLong();

Review Comment:
   Alternatively use a varhandle and mostly call get/set to update it in writer 
thread and setVolatile from time to time to persist value so it is visible by 
other threads. Other threads always use getVolatile to check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

2022-09-28 Thread GitBox


uschindler commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1261531792

   I was also doing consulting for an huge Elasticsearch user and they also had 
this problem of keeping deletes as low as possible and the 20% limit was way 
too high. 20% looks like an arbitrary limitation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-28 Thread GitBox


zhaih commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r982921333


##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FilterDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexOutput;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+  private final AtomicLong realTimeFlushedBytes = new AtomicLong();

Review Comment:
   > Volatile is worse than AtomicLong, because it has a barrier on both read 
and write. So the atomic update is better.
   
   Thanks for pointing that out, new knowledge got :)
   > Maybe we can have a standard long value that gets updated by the writing 
thread. Every n~1000 bytes written we update the atomic one. What do you think?
   
   I'll vote for this approach, it sounds neat and I don't think a 1KB error 
will make too much difference, and we can make it configurable even.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on pull request #11803: DrillSideways optimizations

2022-09-28 Thread GitBox


zhaih commented on PR #11803:
URL: https://github.com/apache/lucene/pull/11803#issuecomment-1261541919

   @gsmiller Thank you for checking and continuous effort!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #11803: DrillSideways optimizations

2022-09-28 Thread GitBox


gsmiller commented on PR #11803:
URL: https://github.com/apache/lucene/pull/11803#issuecomment-1261579341

   @zhaih well, thank you for keeping me honest with testing. I think I've 
already found an insidious, potential bug with some beefier tests. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani opened a new issue, #11830: Store HNSW graph connections more compactly

2022-09-28 Thread GitBox


jtibshirani opened a new issue, #11830:
URL: https://github.com/apache/lucene/issues/11830

   ### Description
   
   HNSW search is most efficient when all vector data fits in page cache. So 
good to keep the size of vector files as small as possible.
   
   We currently write all HNSW graph connections as fixed-size integers. This 
is wasteful since most graphs have far fewer nodes than the max integer value:
   
https://github.com/apache/lucene/blob/d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956/lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java#L478
   
   Maybe instead we could store the connection list using `PackedInts.Writer`. 
This would decrease the bits needed to store each connection. We could still 
ensure that every connection list takes the same number of bytes, to continue 
being able to index into the graph data easily.
   
   I quickly tested a version of the idea on the 1 million vector GloVe 
dataset, and saw the graph data size decrease by ~30%:
   
   ```
   Baseline
   103M 
luceneknn-100-16-100.train-16-100.index/_6_Lucene94HnswVectorsFormat_0.vex
   
   Hacky patch
   155M 
luceneknn-100-16-100.train-16-100.index/_5_Lucene94HnswVectorsFormat_0.vex
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org