[GitHub] [lucene] vsop-479 commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.
vsop-479 commented on PR #11722: URL: https://github.com/apache/lucene/pull/11722#issuecomment-1260490249 @jpountz Thranks for your review and suggestion. I have added a CHANGES entry and assert term value code. Please have a review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on issue #11827: Release manager should review lucene benchmarks before building release candidates
uschindler commented on issue #11827: URL: https://github.com/apache/lucene/issues/11827#issuecomment-1260501918 I fully agree, some checks should be done. But here are a few bits that came into my mind: - The @mikemccand benchmarks are running against main branch only. So the first check should be *before* creating the release branch, so we know that the start of branch is "clean". - Anything else that gets included into release branch should be monitored by the RM if it negatively affects perf on main. - Is there a benchmark available for the 9.x branch? If it would be configurable / modular we could have a base branch (against main and in parallel against 9.x). When a release branch is created we could compare the changes on the release branch with the baselines!? These are just ideas, but basically the important thing would be: Before creating the release branch do a quick check through the benchmark suite on Mike's web page and find potential issues. During the release process when stuff is merged into release branch monitor all merged changes manually on main's benchmark. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.
jpountz commented on code in PR #11722: URL: https://github.com/apache/lucene/pull/11722#discussion_r982050404 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/BasePostingsFormatTestCase.java: ## @@ -367,6 +367,53 @@ public void testGhosts() throws Exception { dir.close(); } + public void testBinarySearchTermLeaf() throws Exception { +Directory dir = newDirectory(); + +IndexWriterConfig iwc = newIndexWriterConfig(null); +iwc.setCodec(getCodec()); +iwc.setMergePolicy(newTieredMergePolicy()); +IndexWriter iw = new IndexWriter(dir, iwc); + +for (int i = 10; i <= 100400; i++){ + // only add odd number + if (i % 2 == 1) { +Document document = new Document(); +document.add(new StringField("id", i + "", Field.Store.NO)); +iw.addDocument(document); + } +} +iw.commit(); +iw.forceMerge(1); + +DirectoryReader reader = DirectoryReader.open(iw); +TermsEnum termsEnum = getOnlyLeafReader(reader).terms("id").iterator(); +// test seekExact +for (int i = 10; i <= 100400; i++){ + BytesRef target = new BytesRef(i + ""); + if (i % 2 == 1) { +assertTrue(termsEnum.seekExact(target)); +assertEquals(termsEnum.term(), target); + } else { +assertFalse(termsEnum.seekExact(target)); + } +} +// test seekCeil +for (int i = 10; i < 100400; i++){ + BytesRef target = new BytesRef(i + ""); + if (i % 2 == 1) { +assertEquals(SeekStatus.FOUND, termsEnum.seekCeil(target)); +assertEquals(termsEnum.term(), target); + } else { +assertEquals(SeekStatus.NOT_FOUND, termsEnum.seekCeil(target)); Review Comment: maybe also assert here that the terms enum is positioned on the next term? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] Mahdi-Seeker commented on issue #10177: Introduce IVFFlat to Lucene for ANN similarity search [LUCENE-9136]
Mahdi-Seeker commented on issue #10177: URL: https://github.com/apache/lucene/issues/10177#issuecomment-1260514435 Hi guys Thanks for your great job on Lucene and specially this ANN search! Any progress on this issue? We're trying to use vector search, but HNSW seems to take too much memory... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase commented on issue #11824: Performance regression on LatLonPoint#newPolygonQuery
iverase commented on issue #11824: URL: https://github.com/apache/lucene/issues/11824#issuecomment-1260667934 Fix seems to bring performance back to previous levels: https://user-images.githubusercontent.com/29038686/192749125-0ac0b341-b2cb-4395-991c-9f676322592a.png";> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #1039: LUCENE-10635: Ensure test coverage for WANDScorer by using a test query
jpountz commented on code in PR #1039: URL: https://github.com/apache/lucene/pull/1039#discussion_r982215123 ## lucene/core/src/test/org/apache/lucene/search/TestWANDScorer.java: ## @@ -947,4 +988,82 @@ public long cost() { }; } } + + private static class WANDScorerQuery extends Query { +private final BooleanQuery query; + +private WANDScorerQuery(BooleanQuery query) { Review Comment: I was thinking of something like that indeed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on issue #11827: Release manager should review lucene benchmarks before building release candidates
mikemccand commented on issue #11827: URL: https://github.com/apache/lucene/issues/11827#issuecomment-1260778045 > yup. Possibly too if Mike M is bored he could implement an alarming system :) or export the data somehow so we could bolt one on the side? Actually I rather like the alarm idea. Maybe we could start with something simple -- more than X% change over Y data points? Maybe for starters the nightly bench just writes these alarms in (blinking red font, heh) at the [top of the index page](https://home.apache.org/~mikemccand/lucenebench/)? If it seems like that is not too noisy, maybe we find some stronger way to "notify" us. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on issue #11824: Performance regression on LatLonPoint#newPolygonQuery
mikemccand commented on issue #11824: URL: https://github.com/apache/lucene/issues/11824#issuecomment-1260787630 Thanks for catching this @iverase and the quick fix, and the follow-on issue to better detect such regressions before release: #11827 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on issue #11827: Release manager should review lucene benchmarks before building release candidates
mikemccand commented on issue #11827: URL: https://github.com/apache/lucene/issues/11827#issuecomment-1260788095 This was a spinoff from #11824. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz opened a new issue, #11829: Reproducible TestShapeDocValues failure
jpountz opened a new issue, #11829: URL: https://github.com/apache/lucene/issues/11829 ### Description The test seems to be creating invalid polygons. ``` 03:40:22 org.apache.lucene.document.TestShapeDocValues > testXYPolygonCentroid FAILED 03:40:22 WARNING: The Security Manager is deprecated and will be removed in a future release 03:40:22 java.lang.IllegalArgumentException: Polygon self-intersection at lat=2.9067992002286562 lon=6.0168174085684076E7 03:40:22 at __randomizedtesting.SeedInfo.seed([29AE25A2D4FB94F9:CC0CD5ABA2A74207]:0) 03:40:22 at org.apache.lucene.geo.Tessellator.checkIntersectionPoint(Tessellator.java:958) 03:40:22 at org.apache.lucene.geo.Tessellator.checkIntersection(Tessellator.java:870) 03:40:22 at org.apache.lucene.geo.Tessellator.tessellate(Tessellator.java:198) 03:40:22 at org.apache.lucene.geo.Tessellator.tessellate(Tessellator.java:152) 03:40:22 at org.apache.lucene.document.XYShape.createIndexableFields(XYShape.java:99) 03:40:22 at org.apache.lucene.document.TestShapeDocValues.getTessellation(TestShapeDocValues.java:249) 03:40:22 at org.apache.lucene.document.TestShapeDocValues.getTessellation(TestShapeDocValues.java:232) 03:40:22 at org.apache.lucene.document.TestShapeDocValues.computeCentroid(TestShapeDocValues.java:98) 03:40:22 at org.apache.lucene.document.TestShapeDocValues.testXYPolygonCentroid(TestShapeDocValues.java:90) 03:40:22 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 03:40:22 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) 03:40:22 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 03:40:22 at java.base/java.lang.reflect.Method.invoke(Method.java:568) 03:40:22 at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758) 03:40:22 at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946) 03:40:22 at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982) 03:40:22 at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996) 03:40:22 at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48) 03:40:22 at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) 03:40:22 at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45) 03:40:22 at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60) 03:40:22 at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44) 03:40:22 at org.junit.rules.RunRules.evaluate(RunRules.java:20) 03:40:22 at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) 03:40:22 at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390) 03:40:22 at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843) 03:40:22 at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490) 03:40:22 at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955) 03:40:22 at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840) 03:40:22 at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891) 03:40:22 at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902) 03:40:22 at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) 03:40:22 at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) 03:40:22 at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38) 03:40:22 at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) 03:40:22 at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) 03:40:22 at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) 03:40:22 at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) 03:40:22 at org.a
[GitHub] [lucene] nknize commented on issue #11824: Performance regression on LatLonPoint#newPolygonQuery
nknize commented on issue #11824: URL: https://github.com/apache/lucene/issues/11824#issuecomment-1260938633 What's annoying is how incredibly trappy this override logic is. That a method call literally moving from `createWeight` to `getScorerSupplier` results in a 72.2% regression even slipped by me before merge doesn't bode well for new committers. And then it sat in regression until an entire company was interested in releasing. I wonder if we can do better? Like maybe figure out better guardrails in these methods? Perhaps by something as simple as a rename to signal one happens per segment? This isn't the first and will certainly not be the last time an expensive operation accidentally slips to a critical path. Any other ideas how to lower the bar here for new committers? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on a diff in pull request #11780: GH#11601: Add ability to compute reader states after refresh
mikemccand commented on code in PR #11780: URL: https://github.com/apache/lucene/pull/11780#discussion_r982553986 ## lucene/core/src/java/org/apache/lucene/search/ReferenceManager.java: ## @@ -219,6 +219,36 @@ public final boolean maybeRefresh() throws IOException { return doTryRefresh; } + /** Compute some state of the reference using a parameter. */ + @FunctionalInterface + public interface StateCalculator { +R calculate(G current, P param); + } + + /** + * If you need to compute something after the refresh, you can use this method instead of {@link + * #maybeRefresh()}. + * + * @throws IOException if refreshing the resource causes an {@link IOException} + */ + public final R maybeRefreshAndReturnState( + StateCalculator refreshedStateCalculator, P param) throws IOException { +ensureOpen(); + +// Ensure only 1 thread does refresh at once; other threads just return immediately: +final boolean doTryRefresh = refreshLock.tryLock(); +if (doTryRefresh) { + try { +doMaybeRefresh(); +return refreshedStateCalculator.calculate(current, param); Review Comment: Hmm why is it necessary to compute this state under lock? Can't the caller just compute the state themselves when they get the refreshed reader? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] stefanvodita commented on a diff in pull request #11780: GH#11601: Add ability to compute reader states after refresh
stefanvodita commented on code in PR #11780: URL: https://github.com/apache/lucene/pull/11780#discussion_r982630324 ## lucene/core/src/java/org/apache/lucene/search/ReferenceManager.java: ## @@ -219,6 +219,36 @@ public final boolean maybeRefresh() throws IOException { return doTryRefresh; } + /** Compute some state of the reference using a parameter. */ + @FunctionalInterface + public interface StateCalculator { +R calculate(G current, P param); + } + + /** + * If you need to compute something after the refresh, you can use this method instead of {@link + * #maybeRefresh()}. + * + * @throws IOException if refreshing the resource causes an {@link IOException} + */ + public final R maybeRefreshAndReturnState( + StateCalculator refreshedStateCalculator, P param) throws IOException { +ensureOpen(); + +// Ensure only 1 thread does refresh at once; other threads just return immediately: +final boolean doTryRefresh = refreshLock.tryLock(); +if (doTryRefresh) { + try { +doMaybeRefresh(); +return refreshedStateCalculator.calculate(current, param); Review Comment: We’re guaranteeing that no other refresh happens before computing the reader states. Maybe that’s not important to do? If so, then the user can always refresh and then compute their own reader states. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #11828: TermInSetQuery optimization when all docs in a field match a term
gsmiller commented on PR #11828: URL: https://github.com/apache/lucene/pull/11828#issuecomment-1261195636 > I assume we already have tests that cover this case? Good question. I'm going to go tweak our tests. We added tests that cover the completely dense case (i.e., all docs in a segment) when adding that optimization, but we should augment the tests for this more specific optimization. Will rev the PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #11803: DrillSideways optimizations
gsmiller commented on PR #11803: URL: https://github.com/apache/lucene/pull/11803#issuecomment-1261198752 @zhaih I reexamined our test coverage and think we're in good shape already actually. We've got good coverage for covering drill-sideways correctness with multiple dimensions, etc. (including random and non-random). We could try to take these further by somehow asserting that advance is being used in favor of nextDoc when appropriate, but I think those tests would be reasonably complex to write and I'm not sure they add tremendous value. I'd rather we spent time building drill-sideways benchmarks that focus on ensuring our performance doesn't regress. But that's just my opinion. Please let me know if you feel differently and we can keep discussing. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
zhaih commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r982692228 ## lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FilterDirectory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexOutput; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + private final AtomicLong realTimeFlushedBytes = new AtomicLong(); Review Comment: This real time counter could potentially harm the performance since it is increased every byte you write for every thread. Since the IndexOutput is already not thread-safe, we probably can let `ByteTrackingIndexOutput` keep a private (volatile?) counter of bytes, and expose a public getter. Then in this directory wrapper keep a list of tracking output so that when we need we can just call the getter methods and sum up the numbers. Also to avoid tracking too much index output we can deregister them when they're closed. It's just another idea and could be a follow-up, but I think we need to be careful about using the atomics. ## lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FilterDirectory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexOutput; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + private final AtomicLong realTimeFlushedBytes = new AtomicLong(); + private final AtomicLong realTimeMergedBytes = new AtomicLong(); + + public final boolean trackTempOutput; + + /** + * Constructor defaults to not tracking temp outputs + * + * @param in input Directory + */ + public ByteWritesTrackingDirectoryWrapper(Directory in) { +this(in, false); + } + + /** + * Constructor with option to track tempOutput + * + * @param in input Directory + * @param trackTempOutput if true, will also track temporary outputs created by this directory + */ + public ByteWritesTrackingDirectoryWrapper(Directory in, boolean trackTempOutput) { +super(in); +this.trackTempOutput = trackTempOutput; + } + + @Override + public IndexOutput createOutput(String name, IOContext ioContext) throws IOException { +IndexOutput output = in.createOutput(name, ioContext); +IndexOutput byteTrackingIndexOutput; +if (ioContext.context.equals(IOContext.Context.FLUSH)) { + byteTrackingIndexOutput = + new ByteTrackingIndexOutput(output, flushedBy
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mdmarshmallow commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r982759362 ## lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FilterDirectory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexOutput; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + private final AtomicLong realTimeFlushedBytes = new AtomicLong(); + private final AtomicLong realTimeMergedBytes = new AtomicLong(); + + public final boolean trackTempOutput; + + /** + * Constructor defaults to not tracking temp outputs + * + * @param in input Directory + */ + public ByteWritesTrackingDirectoryWrapper(Directory in) { +this(in, false); + } + + /** + * Constructor with option to track tempOutput + * + * @param in input Directory + * @param trackTempOutput if true, will also track temporary outputs created by this directory + */ + public ByteWritesTrackingDirectoryWrapper(Directory in, boolean trackTempOutput) { +super(in); +this.trackTempOutput = trackTempOutput; + } + + @Override + public IndexOutput createOutput(String name, IOContext ioContext) throws IOException { +IndexOutput output = in.createOutput(name, ioContext); +IndexOutput byteTrackingIndexOutput; +if (ioContext.context.equals(IOContext.Context.FLUSH)) { + byteTrackingIndexOutput = + new ByteTrackingIndexOutput(output, flushedBytes, realTimeFlushedBytes); +} else if (ioContext.context.equals(IOContext.Context.MERGE)) { + byteTrackingIndexOutput = + new ByteTrackingIndexOutput(output, mergedBytes, realTimeMergedBytes); +} else { + return output; +} +return byteTrackingIndexOutput; + } + + @Override + public IndexOutput createTempOutput(String prefix, String suffix, IOContext ioContext) + throws IOException { +IndexOutput output = in.createTempOutput(prefix, suffix, ioContext); +if (trackTempOutput) { + IndexOutput byteTrackingIndexOutput; + if (ioContext.context.equals(IOContext.Context.FLUSH)) { +byteTrackingIndexOutput = +new ByteTrackingIndexOutput(output, flushedBytes, realTimeFlushedBytes); + } else if (ioContext.context.equals(IOContext.Context.MERGE)) { +byteTrackingIndexOutput = +new ByteTrackingIndexOutput(output, mergedBytes, realTimeMergedBytes); + } else { +return output; + } + return byteTrackingIndexOutput; +} +return output; + } + + public double getApproximateWriteAmplificationFactor() { +double flushedBytes = (double) this.flushedBytes.get(); +if (flushedBytes == 0.0) { + return 1.0; +} +double mergedBytes = (double) this.mergedBytes.get(); +return (flushedBytes + mergedBytes) / flushedBytes; + } + + /** Gets a more up-to-date but less accurate write amplification factor */ Review Comment: My understanding is that the real time metric would track partially completed flushed and merges, so it wouldn't be the "true" write amplification factor of all completed flushes and merges. I'm not sure if there are any cases where a flush or merge would get cancelled before it's completed though to be honest. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For addition
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mdmarshmallow commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r982764818 ## lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FilterDirectory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexOutput; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + private final AtomicLong realTimeFlushedBytes = new AtomicLong(); Review Comment: I think that's a good idea. Not sure we need to the counter to be volatile though since each `ByteTrackingIndexOutput` operates on a single thread right? So if the counter is private it should be fine I think. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
zhaih commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r982825433 ## lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FilterDirectory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexOutput; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + private final AtomicLong realTimeFlushedBytes = new AtomicLong(); Review Comment: I think volatile is necessary because it need to be read by another thread possibly -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
zhaih commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r982825989 ## lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FilterDirectory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexOutput; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + private final AtomicLong realTimeFlushedBytes = new AtomicLong(); + private final AtomicLong realTimeMergedBytes = new AtomicLong(); + + public final boolean trackTempOutput; + + /** + * Constructor defaults to not tracking temp outputs + * + * @param in input Directory + */ + public ByteWritesTrackingDirectoryWrapper(Directory in) { +this(in, false); + } + + /** + * Constructor with option to track tempOutput + * + * @param in input Directory + * @param trackTempOutput if true, will also track temporary outputs created by this directory + */ + public ByteWritesTrackingDirectoryWrapper(Directory in, boolean trackTempOutput) { +super(in); +this.trackTempOutput = trackTempOutput; + } + + @Override + public IndexOutput createOutput(String name, IOContext ioContext) throws IOException { +IndexOutput output = in.createOutput(name, ioContext); +IndexOutput byteTrackingIndexOutput; +if (ioContext.context.equals(IOContext.Context.FLUSH)) { + byteTrackingIndexOutput = + new ByteTrackingIndexOutput(output, flushedBytes, realTimeFlushedBytes); +} else if (ioContext.context.equals(IOContext.Context.MERGE)) { + byteTrackingIndexOutput = + new ByteTrackingIndexOutput(output, mergedBytes, realTimeMergedBytes); +} else { + return output; +} +return byteTrackingIndexOutput; + } + + @Override + public IndexOutput createTempOutput(String prefix, String suffix, IOContext ioContext) + throws IOException { +IndexOutput output = in.createTempOutput(prefix, suffix, ioContext); +if (trackTempOutput) { + IndexOutput byteTrackingIndexOutput; + if (ioContext.context.equals(IOContext.Context.FLUSH)) { +byteTrackingIndexOutput = +new ByteTrackingIndexOutput(output, flushedBytes, realTimeFlushedBytes); + } else if (ioContext.context.equals(IOContext.Context.MERGE)) { +byteTrackingIndexOutput = +new ByteTrackingIndexOutput(output, mergedBytes, realTimeMergedBytes); + } else { +return output; + } + return byteTrackingIndexOutput; +} +return output; + } + + public double getApproximateWriteAmplificationFactor() { +double flushedBytes = (double) this.flushedBytes.get(); +if (flushedBytes == 0.0) { + return 1.0; +} +double mergedBytes = (double) this.mergedBytes.get(); +return (flushedBytes + mergedBytes) / flushedBytes; + } + + /** Gets a more up-to-date but less accurate write amplification factor */ Review Comment: OK, in case any file are cancelled then this one will be inaccurate, makes sense -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
zhaih commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r982826316 ## lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FilterDirectory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexOutput; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + private final AtomicLong realTimeFlushedBytes = new AtomicLong(); + private final AtomicLong realTimeMergedBytes = new AtomicLong(); + + public final boolean trackTempOutput; + + /** + * Constructor defaults to not tracking temp outputs + * + * @param in input Directory + */ + public ByteWritesTrackingDirectoryWrapper(Directory in) { +this(in, false); + } + + /** + * Constructor with option to track tempOutput + * + * @param in input Directory + * @param trackTempOutput if true, will also track temporary outputs created by this directory + */ + public ByteWritesTrackingDirectoryWrapper(Directory in, boolean trackTempOutput) { +super(in); +this.trackTempOutput = trackTempOutput; + } + + @Override + public IndexOutput createOutput(String name, IOContext ioContext) throws IOException { +IndexOutput output = in.createOutput(name, ioContext); +IndexOutput byteTrackingIndexOutput; +if (ioContext.context.equals(IOContext.Context.FLUSH)) { + byteTrackingIndexOutput = + new ByteTrackingIndexOutput(output, flushedBytes, realTimeFlushedBytes); +} else if (ioContext.context.equals(IOContext.Context.MERGE)) { + byteTrackingIndexOutput = + new ByteTrackingIndexOutput(output, mergedBytes, realTimeMergedBytes); +} else { + return output; +} +return byteTrackingIndexOutput; + } + + @Override + public IndexOutput createTempOutput(String prefix, String suffix, IOContext ioContext) + throws IOException { +IndexOutput output = in.createTempOutput(prefix, suffix, ioContext); +if (trackTempOutput) { + IndexOutput byteTrackingIndexOutput; + if (ioContext.context.equals(IOContext.Context.FLUSH)) { +byteTrackingIndexOutput = +new ByteTrackingIndexOutput(output, flushedBytes, realTimeFlushedBytes); + } else if (ioContext.context.equals(IOContext.Context.MERGE)) { +byteTrackingIndexOutput = +new ByteTrackingIndexOutput(output, mergedBytes, realTimeMergedBytes); + } else { +return output; + } + return byteTrackingIndexOutput; +} +return output; + } + + public double getApproximateWriteAmplificationFactor() { +double flushedBytes = (double) this.flushedBytes.get(); +if (flushedBytes == 0.0) { + return 1.0; +} +double mergedBytes = (double) this.mergedBytes.get(); +return (flushedBytes + mergedBytes) / flushedBytes; + } + + /** Gets a more up-to-date but less accurate write amplification factor */ Review Comment: Maybe add the explanation to javadoc? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #11803: DrillSideways optimizations
gsmiller commented on PR #11803: URL: https://github.com/apache/lucene/pull/11803#issuecomment-1261458849 @zhaih that's a good point and valid concern. I dug into the existing tests and it looks like we have lots of coverage _except_ that the majority of the coverage is using basic, single-phase drill-down dimensions. I'm going to augment our randomized testing to randomly use two-phase drill-downs to broaden coverage. Thanks for the discussion! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mdmarshmallow commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r982891007 ## lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FilterDirectory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexOutput; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + private final AtomicLong realTimeFlushedBytes = new AtomicLong(); Review Comment: Oh I think you're right. I'll make that change -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
uschindler commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r982891128 ## lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FilterDirectory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexOutput; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + private final AtomicLong realTimeFlushedBytes = new AtomicLong(); Review Comment: Volatile is worse than AtomicLong, because it has a barrier on both read and write. So the atomic update is better. ## lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FilterDirectory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexOutput; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + private final AtomicLong realTimeFlushedBytes = new AtomicLong(); Review Comment: Do we really read from another thread while writing? Updating the counter is fine as it only happens in one thread. The problem is reading so we get a value that's up-to-date. The question is how up-to-date it needs to be. Maybe we can have a standard long value that gets updated by the writing thread. Every n~1000 bytes written we update the atomic one. What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mdmarshmallow commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r982902771 ## lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FilterDirectory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexOutput; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + private final AtomicLong realTimeFlushedBytes = new AtomicLong(); Review Comment: I don't see any issue with having a real time counter lagging by 1000 bytes, though maybe we can also provide that as an option that can be changed and have the default at 1000? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mdmarshmallow commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mdmarshmallow commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1261519604 @jpountz , in response to this: > I'm considering exposing write amplification separately for flushes (as flushedBytes / totalIndexSize), merges (as (totalIndexSize + mergedBytes) / totalIndexSize) and temporary files (as (totalIndexSize + tempBytes) / totalIndexSize) and pushing the responsibility to users of whether and how they would like to combine these various metrics? We could add these in addition to the write amplification factor methods that are already present? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
uschindler commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r982909584 ## lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FilterDirectory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexOutput; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + private final AtomicLong realTimeFlushedBytes = new AtomicLong(); Review Comment: Alternatively use a varhandle and mostly call get/set to update it in writer thread and setVolatile from time to time to persist value so it is visible by other threads. Other threads always use getVolatile to check. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits
uschindler commented on issue #11761: URL: https://github.com/apache/lucene/issues/11761#issuecomment-1261531792 I was also doing consulting for an huge Elasticsearch user and they also had this problem of keeping deletes as low as possible and the 20% limit was way too high. 20% looks like an arbitrary limitation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
zhaih commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r982921333 ## lucene/misc/src/java/org/apache/lucene/misc/store/ByteWritesTrackingDirectoryWrapper.java: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FilterDirectory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexOutput; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class ByteWritesTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + private final AtomicLong realTimeFlushedBytes = new AtomicLong(); Review Comment: > Volatile is worse than AtomicLong, because it has a barrier on both read and write. So the atomic update is better. Thanks for pointing that out, new knowledge got :) > Maybe we can have a standard long value that gets updated by the writing thread. Every n~1000 bytes written we update the atomic one. What do you think? I'll vote for this approach, it sounds neat and I don't think a 1KB error will make too much difference, and we can make it configurable even. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on pull request #11803: DrillSideways optimizations
zhaih commented on PR #11803: URL: https://github.com/apache/lucene/pull/11803#issuecomment-1261541919 @gsmiller Thank you for checking and continuous effort! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #11803: DrillSideways optimizations
gsmiller commented on PR #11803: URL: https://github.com/apache/lucene/pull/11803#issuecomment-1261579341 @zhaih well, thank you for keeping me honest with testing. I think I've already found an insidious, potential bug with some beefier tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani opened a new issue, #11830: Store HNSW graph connections more compactly
jtibshirani opened a new issue, #11830: URL: https://github.com/apache/lucene/issues/11830 ### Description HNSW search is most efficient when all vector data fits in page cache. So good to keep the size of vector files as small as possible. We currently write all HNSW graph connections as fixed-size integers. This is wasteful since most graphs have far fewer nodes than the max integer value: https://github.com/apache/lucene/blob/d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956/lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java#L478 Maybe instead we could store the connection list using `PackedInts.Writer`. This would decrease the bits needed to store each connection. We could still ensure that every connection list takes the same number of bytes, to continue being able to index into the graph data easily. I quickly tested a version of the idea on the 1 million vector GloVe dataset, and saw the graph data size decrease by ~30%: ``` Baseline 103M luceneknn-100-16-100.train-16-100.index/_6_Lucene94HnswVectorsFormat_0.vex Hacky patch 155M luceneknn-100-16-100.train-16-100.index/_5_Lucene94HnswVectorsFormat_0.vex ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org