[GitHub] [lucene] mayya-sharipova merged pull request #734: LUCENE-10408 Test: correct type of checksum
mayya-sharipova merged pull request #734: URL: https://github.com/apache/lucene/pull/734

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
[ https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503366#comment-17503366 ] ASF subversion and git services commented on LUCENE-10408:
--
Commit e5717cddfda68dace6e45357f5e33d81c368db31 in lucene's branch refs/heads/main from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e5717cd ]
LUCENE-10408 Test correction checksum (#734)

Use double instead of float to test vector values checksum

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> -------------------------------------------------------------
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mayya Sharipova
> Assignee: Mayya Sharipova
> Priority: Minor
> Fix For: 9.1
>
> Time Spent: 6.5h
> Remaining Estimate: 0h
>
> Currently we write doc Ids of all documents that have vectors as is. We
> should improve their encoding either using delta encoding or bitset.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
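The fix above replaces a float accumulator with a double one. The precision issue behind it can be reproduced in a small standalone example (illustrative only, not the actual Lucene test code): a float running sum silently stops growing once it reaches 2^24, while a double sum stays exact for the same inputs.

```java
// Illustration of why a float checksum accumulator is unreliable: float has a
// 24-bit significand, so once the running sum reaches 2^24 = 16777216, adding
// 1.0f no longer changes it. A double accumulator has no such problem here.
public class ChecksumPrecision {
  public static float floatSum(int n, float v) {
    float s = 0f;
    for (int i = 0; i < n; i++) s += v; // saturates at 2^24 for v = 1.0f
    return s;
  }

  public static double doubleSum(int n, float v) {
    double s = 0d;
    for (int i = 0; i < n; i++) s += v; // exact for these magnitudes
    return s;
  }

  public static void main(String[] args) {
    System.out.println(floatSum(100_000_000, 1f));  // 1.6777216E7, far below the true sum
    System.out.println(doubleSum(100_000_000, 1f)); // 1.0E8
  }
}
```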
[GitHub] [lucene] romseygeek commented on a change in pull request #679: Monitor Improvements LUCENE-10422
romseygeek commented on a change in pull request #679: URL: https://github.com/apache/lucene/pull/679#discussion_r822427351

## File path: lucene/monitor/src/java/org/apache/lucene/monitor/ReadonlyQueryIndex.java
## @@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.monitor;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.search.*;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.NamedThreadFactory;
+
+class ReadonlyQueryIndex extends QueryIndex {
+
+  private final ScheduledExecutorService refreshExecutor;
+
+  public ReadonlyQueryIndex(MonitorConfiguration configuration) throws IOException {
+    if (configuration.getDirectoryProvider() == null) {
+      throw new IllegalStateException(
+          "You must specify a Directory when configuring a Monitor as read-only.");
+    }
+    Directory directory = configuration.getDirectoryProvider().get();
+    this.manager = new SearcherManager(directory, new TermsHashBuilder(termFilters));
+    this.decomposer = configuration.getQueryDecomposer();
+    this.serializer = configuration.getQuerySerializer();
+    this.refreshExecutor =
+        Executors.newSingleThreadScheduledExecutor(new NamedThreadFactory("cache-purge"));
+    long refreshFrequency = configuration.getPurgeFrequency();
+    this.refreshExecutor.scheduleAtFixedRate(
+        () -> {
+          try {
+            manager.maybeRefresh();
+          } catch (IOException e) {
+            throw new RuntimeException(e);
+          }
+        },
+        refreshFrequency,
+        refreshFrequency,
+        configuration.getPurgeFrequencyUnits());
+  }
+
+  @Override
+  public void commit(List<MonitorQuery> updates) throws IOException {
+    throw new IllegalStateException("Monitor is readOnly cannot commit");
+  }
+
+  @Override
+  public long search(QueryBuilder queryBuilder, QueryCollector matcher) throws IOException {
+    IndexSearcher searcher = null;
+    try {
+      searcher = manager.acquire();
+      LazyMonitorQueryCollector collector =
+          new LazyMonitorQueryCollector(matcher, serializer, decomposer);
+      long buildTime = System.nanoTime();
+      Query query =
+          queryBuilder.buildQuery(
+              termFilters.get(searcher.getIndexReader().getReaderCacheHelper().getKey()));
+      buildTime = System.nanoTime() - buildTime;
+      searcher.search(query, collector);
+      return buildTime;
+    } finally {
+      if (searcher != null) {
+        manager.release(searcher);
+      }
+    }
+  }
+
+  @Override
+  public void purgeCache() {
+    throw new IllegalStateException("Monitor is readOnly, it has no cache");

Review comment: Let's make this call `manager.maybeRefresh`
[jira] [Commented] (LUCENE-10311) Should DocIdSetBuilder have different implementations for point and terms?
[ https://issues.apache.org/jira/browse/LUCENE-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503480#comment-17503480 ] ASF subversion and git services commented on LUCENE-10311:
--
Commit e999056c19d98b5dbd6434f6986e19c69cdb28ab in lucene's branch refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e999056 ]
LUCENE-10311: avoid division by zero on small sets.

> Should DocIdSetBuilder have different implementations for point and terms?
> --------------------------------------------------------------------------
>
> Key: LUCENE-10311
> URL: https://issues.apache.org/jira/browse/LUCENE-10311
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ignacio Vera
> Priority: Major
> Time Spent: 9.5h
> Remaining Estimate: 0h
>
> DocIdSetBuilder has two API implementations, one for terms queries and one
> for point values queries. In each case they are used in totally different
> ways.
> For terms the API looks like:
>
> {code:java}
> /**
>  * Add the content of the provided {@link DocIdSetIterator} to this builder. NOTE: if you need to
>  * build a {@link DocIdSet} out of a single {@link DocIdSetIterator}, you should rather use {@link
>  * RoaringDocIdSet.Builder}.
>  */
> void add(DocIdSetIterator iter) throws IOException;
>
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build()
> {code}
>
> For Point Values it looks like:
>
> {code:java}
> /**
>  * Utility class to efficiently add many docs in one go.
>  *
>  * @see DocIdSetBuilder#grow
>  */
> public abstract static class BulkAdder {
>   public abstract void add(int doc);
>
>   public void add(DocIdSetIterator iterator) throws IOException {
>     int docID;
>     while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
>       add(docID);
>     }
>   }
> }
>
> /**
>  * Reserve space and return a {@link BulkAdder} object that can be used to add up to {@code
>  * numDocs} documents.
>  */
> public BulkAdder grow(int numDocs)
>
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build()
> {code}
>
> This is becoming trappy for new developments in the PointValues API.
>
> 1) When we call #grow() from the PointValues API, we are not telling the
> builder how many docs we are going to add (as we don't really know it) but
> the number of points we are about to visit. This number can be bigger than
> Integer.MAX_VALUE. Until now, we get around this issue by making sure we
> don't call this API when we need to add more than Integer.MAX_VALUE points.
> In that case we will navigate the tree down until the number of points is
> reduced and they can fit in an int.
> This has worked well until now because we are calling grow from inside the BKD
> reader, and the BKD writer/reader makes sure that the number of points in a
> leaf can fit in an int. In LUCENE-, we are moving into a cursor-like API which
> does not enforce that the number of points on a leaf needs to fit in an int.
> This causes friction and inconsistency in the API.
>
> 2) This is a secondary issue that I found when thinking about this issue. In
> Lucene- we added the possibility to add a `DocIdSetIterator` from the
> PointValues API. Therefore there are two ways to add those kinds of objects
> to a DocIdSetBuilder, which can end up in different results:
>
> {code:java}
> {
>   // Terms API
>   docIdSetBuilder.add(docIdSetIterator);
> }
> {
>   // Point values API
>   docIdSetBuilder.grow(doc).add(docIdSetIterator)
> }{code}
>
> I wonder if we need to rethink this API; should we have different
> implementations for Terms and Point values?
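The two entry points described in the issue can be modeled with a small toy builder (the names and types below are simplified stand-ins, not Lucene's actual `DocIdSetBuilder`): the terms-style `add(iterator)` consumes a whole iterator, while the points-style `grow(n)` followed by per-doc `add(doc)` reserves space first.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the two DocIdSetBuilder usage styles quoted above. Both paths
// should accumulate the same doc IDs; the friction is only in the API shape.
public class DocIdSetBuilderSketch {
  interface DocIdIterator { int nextDoc(); } // returns -1 when exhausted

  static class Builder {
    final ArrayList<Integer> docs = new ArrayList<>();

    // Terms-style API: hand over the whole iterator.
    void add(DocIdIterator it) {
      int d;
      while ((d = it.nextDoc()) != -1) docs.add(d);
    }

    // Points-style API: reserve space, then add docs one at a time.
    Builder grow(int numDocs) { docs.ensureCapacity(docs.size() + numDocs); return this; }
    void add(int doc) { docs.add(doc); }
  }

  static DocIdIterator of(int... ids) {
    return new DocIdIterator() {
      int i = 0;
      public int nextDoc() { return i < ids.length ? ids[i++] : -1; }
    };
  }

  public static List<Integer> viaTermsApi() {
    Builder b = new Builder();
    b.add(of(1, 5, 9)); // docIdSetBuilder.add(docIdSetIterator)
    return b.docs;
  }

  public static List<Integer> viaPointsApi() {
    Builder b = new Builder();
    Builder adder = b.grow(3); // docIdSetBuilder.grow(n), then per-doc add
    DocIdIterator it = of(1, 5, 9);
    int d;
    while ((d = it.nextDoc()) != -1) adder.add(d);
    return b.docs;
  }
}
```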
[jira] [Commented] (LUCENE-10311) Should DocIdSetBuilder have different implementations for point and terms?
[ https://issues.apache.org/jira/browse/LUCENE-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503484#comment-17503484 ] ASF subversion and git services commented on LUCENE-10311:
--
Commit 38b4bbf74e25a5e578486ba434751a3f361912f5 in lucene's branch refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=38b4bbf ]
LUCENE-10311: avoid division by zero on small sets.
[jira] [Commented] (LUCENE-10457) LuceneTestCase.createTempDir could randomly return symbolic links
[ https://issues.apache.org/jira/browse/LUCENE-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503533#comment-17503533 ] Robert Muir commented on LUCENE-10457:
--
A symlink to a file is not a file; a symlink to a directory is not a directory. It is its own horrible thing. Simple facts. We can argue all day, but my position will stand firm due to this. I'll also do everything in my power to keep symlink support out of Lucene, too.

We all make choices in life. For example, users/developers can decide to use finalizers, shutdown hooks, and other horrible things available in Java; it is a bad idea. We don't need tests around this, we just don't support it.

If you decide to use these in your design (NOT RECOMMENDED), then your tests need to be explicit. There is tons of system-specific stuff about them, and you basically need OS-specific tests. If Java tries to hide this with abstractions, that doesn't change a thing. Java is wrong if they do that, that's all.

> LuceneTestCase.createTempDir could randomly return symbolic links
> -----------------------------------------------------------------
>
> Key: LUCENE-10457
> URL: https://issues.apache.org/jira/browse/LUCENE-10457
> Project: Lucene - Core
> Issue Type: Task
> Components: general/test
> Reporter: Mike Drob
> Priority: Major
>
> When we are creating temporary directories to use for other Lucene functions,
> we could occasionally provide symbolic links instead of direct references to
> directories. If the system running tests doesn't support symbolic links, then
> we should ignore this option.
> Providing links would be useful to test scenarios, for example, where users
> have a symbolic link for the "current" index directory and then rotate that
> over time while applications still use the same link.
[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
[ https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503543#comment-17503543 ] ASF subversion and git services commented on LUCENE-10408:
--
Commit 1f497819e6b60db7908657056512e4c65fef420a in lucene's branch refs/heads/branch_9x from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1f49781 ]
LUCENE-10408 Test correction checksum (#734)

Use double instead of float to test vector values checksum
[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
[ https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503547#comment-17503547 ] ASF subversion and git services commented on LUCENE-10408:
--
Commit 8f399572c99786b859123bca8ff50e99692d4ae3 in lucene's branch refs/heads/branch_9_1 from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8f39957 ]
LUCENE-10408 Test correction checksum (#734)

Use double instead of float to test vector values checksum
[jira] [Commented] (LUCENE-10457) LuceneTestCase.createTempDir could randomly return symbolic links
[ https://issues.apache.org/jira/browse/LUCENE-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503608#comment-17503608 ] Uwe Schindler commented on LUCENE-10457:
--
No symlinks, please. The symlinks in the git checkout of Solr for the refguide already cause heavy issues on Windows (see my complaints). On Windows you need admin rights, or need to set the computer or Jenkins server into "Win10 developer mode", to create them (which is debatable, but Microsoft's reasons for doing this are legitimate because of malware). E.g., Git on Windows can't handle symlinks unless you set a specific option on checkout. In Solr's repo I get just text files with the target of the symlink as contents when checking out the refguide of Solr!

Better to sometimes add a whitespace into the temporary directory. Much more efficient at finding bugs! :D
[GitHub] [lucene-solr] thelabdude opened a new pull request #2645: Set cluster size to 2 and maxShardsPerNode to -1 for HttpClusterStateSSLTest
thelabdude opened a new pull request #2645: URL: https://github.com/apache/lucene-solr/pull/2645

Test requires maxShardsPerNode to be -1 to allow multiple shards / replicas in a single node.
[GitHub] [lucene-solr] thelabdude merged pull request #2645: Set cluster size to 2 and maxShardsPerNode to -1 for HttpClusterStateSSLTest
thelabdude merged pull request #2645: URL: https://github.com/apache/lucene-solr/pull/2645
[GitHub] [lucene] cpoerschke opened a new pull request #737: Reduce for-loop in WeightedSpanTermExtractor.extractWeightedSpanTerms method.
cpoerschke opened a new pull request #737: URL: https://github.com/apache/lucene/pull/737

Draft pull request only for today; no JIRA issue as yet, and things are not yet fully analysed, but from code reading it seems the query rewrite and term collecting need not happen in a loop. The https://github.com/apache/lucene/commit/81c7ba4601a9aaf16e2255fe493ee582abe72a90 change included
```
- final SpanQuery rewrittenQuery = (SpanQuery) spanQuery.rewrite(getLeafContextForField(field).reader());
+ final SpanQuery rewrittenQuery = (SpanQuery) spanQuery.rewrite(getLeafContext().reader());
```
as a change, i.e. previously more needed to happen in the loop.
[jira] [Commented] (LUCENE-10448) MergeRateLimiter doesn't always limit instant rate.
[ https://issues.apache.org/jira/browse/LUCENE-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503809#comment-17503809 ] Vigya Sharma commented on LUCENE-10448:
--
Hi [~kkewwei] - Thanks for collecting and sharing these detailed metrics and logs.

Can you help me understand why {{detailBytes(mb)}} are always {{~0.73mb}} when we are seeing an instant rate of up to {{460 mb/s}} in these runs? I was expecting to find large values in "detailBytes(mb)", corresponding to the high instant-rate entries in "detailRate(mb/s)". Maybe "detailRate(mb/s)" is logged from {{writeBytes()}} while "detailBytes(mb)" is logged from the other APIs - in which case, it is curious how we have exactly 49 entries for both of them.

Separately, I see that you calculated the instant rate from {{writeBytes()}}, which, in theory, can end up writing large bursts of data. I believe there is some scope for improvement there.

> MergeRateLimiter doesn't always limit instant rate.
> ---------------------------------------------------
>
> Key: LUCENE-10448
> URL: https://issues.apache.org/jira/browse/LUCENE-10448
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/other
> Affects Versions: 8.11.1
> Reporter: kkewwei
> Priority: Major
>
> We can see the code in *MergeRateLimiter*:
> {code:java}
> private long maybePause(long bytes, long curNS) throws MergePolicy.MergeAbortedException {
>   double rate = mbPerSec;
>   double secondsToPause = (bytes / 1024. / 1024.) / rate;
>   long targetNS = lastNS + (long) (1000000000 * secondsToPause);
>   long curPauseNS = targetNS - curNS;
>   // We don't bother with thread pausing if the pause is smaller than 2 msec.
>   if (curPauseNS <= MIN_PAUSE_NS) {
>     // Set to curNS, not targetNS, to enforce the instant rate, not
>     // the "averaged over all history" rate:
>     lastNS = curNS;
>     return -1;
>   }
>   ...
> }
> {code}
> If a segment is being merged, *maybePause* is called at 7:00 (lastNS=7:00), and then
> *maybePause* is called again at 7:05; the value of
> *targetNS = lastNS + (long) (1000000000 * secondsToPause)* must then be smaller than
> *curNS*, so no matter how big bytes is, we return -1 and skip the pause.
> I counted the total number of calls (callTimes) to *maybePause*, the number of skipped
> pauses (ignorePauseTimes), and the detailed skipped bytes (detailBytes):
> {code:java}
> [2022-03-02T15:16:51,972][DEBUG][o.e.i.e.I.EngineMergeScheduler] [node1] [index1][21] merge segment [_4h] done: took [26.8s], [123.6 MB], [61,219 docs], [0s stopped], [24.4s throttled], [242.5 MB written], [11.2 MB/sec throttle], [callTimes=857], [ignorePauseTimes=25], [detailBytes(mb) = [0.28899956, 0.28140354, 0.28015518, 0.27990818, 0.2801447, 0.27991104, 0.27990723, 0.27990913, 0.2799101, 0.28010082, 0.2799921, 0.2799673, 0.28144264, 0.27991295, 0.27990818, 0.27993107, 0.2799387, 0.27998447, 0.28002167, 0.27992058, 0.27998066, 0.28098202, 0.28125, 0.28125, 0.28125]]
> {code}
> There are 857 calls to *maybePause*, including 25 in which the pause was skipped; we can
> see that the skipped detail bytes (such as 0.28125mb) are not small.
> As long as the interval between two *maybePause* calls is relatively long, the pause
> action that should be executed will not be executed.
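The skip described in the report can be reproduced with a simplified standalone version of the pause logic (reconstructed from the quoted snippet; the field names and explicit clock parameter are stand-ins, not the real MergeRateLimiter):

```java
// Simplified maybePause: when a long gap separates two calls, targetNS
// (computed from the stale lastNS) falls below curNS, so the pause is skipped
// regardless of how many bytes were just written.
public class MaybePauseSketch {
  static final long MIN_PAUSE_NS = 2_000_000L; // 2 msec, as in the quoted code
  public static long lastNS;

  // Returns the pause in nanoseconds, or -1 if the pause is skipped.
  public static long maybePause(long bytes, long curNS, double mbPerSec) {
    double secondsToPause = (bytes / 1024.0 / 1024.0) / mbPerSec;
    long targetNS = lastNS + (long) (1_000_000_000 * secondsToPause);
    long curPauseNS = targetNS - curNS;
    if (curPauseNS <= MIN_PAUSE_NS) {
      lastNS = curNS; // enforce the instant rate, not the averaged rate
      return -1;
    }
    lastNS = targetNS;
    return curPauseNS;
  }

  public static void main(String[] args) {
    lastNS = 0;
    // 5 seconds elapse before the next call: a 10 MB write at a 10 MB/s limit
    // should pause ~1s, but targetNS (1e9) < curNS (5e9), so it is skipped.
    System.out.println(maybePause(10L * 1024 * 1024, 5_000_000_000L, 10.0)); // -1
    // With a fresh lastNS, the same write does pause (~999 ms).
    lastNS = 0;
    System.out.println(maybePause(10L * 1024 * 1024, 1_000_000L, 10.0));
  }
}
```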
[GitHub] [lucene] vigyasharma opened a new pull request #738: LUCENE-10448: Avoid instant rate write bursts by writing bytes buffer in chunks
vigyasharma opened a new pull request #738: URL: https://github.com/apache/lucene/pull/738

## Description
`RateLimitedIndexOutput#writeBytes()` checks for rate only at the start of writing the byte buffer. This can result in large instant-rate write bursts if the provided byte buffer is large. To avoid this, we write bytes in chunks and check for rate between each chunk.

## Tests
Added `TestRateLimitedIndexOutput` to verify the rate check between chunks.

## Checklist
Please review the following and check all that apply:
- [ ] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [ ] I have created a Jira issue and added the issue ID to my pull request title.
- [ ] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [ ] I have developed this patch against the `main` branch.
- [ ] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
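The chunking idea in this description can be sketched independently of Lucene (the chunk size, `checkRate()` hook, and delegate call below are illustrative assumptions, not the actual patch):

```java
// Sketch of chunked writeBytes: instead of one rate check before a large
// write, split the buffer into fixed-size chunks and check the rate between
// chunks, bounding the size of any instant write burst.
public class ChunkedWriteSketch {
  static final int CHUNK_SIZE = 8192; // assumed chunk size for illustration
  public static int rateChecks = 0;

  static void checkRate() { rateChecks++; } // stand-in for the limiter's pause

  public static void writeBytes(byte[] b, int offset, int length) {
    while (length > 0) {
      checkRate(); // enforced per chunk, not once per call
      int chunk = Math.min(length, CHUNK_SIZE);
      // a real implementation would call delegate.writeBytes(b, offset, chunk) here
      offset += chunk;
      length -= chunk;
    }
  }

  public static void main(String[] args) {
    writeBytes(new byte[20_000], 0, 20_000);
    System.out.println(rateChecks); // 3: two full chunks plus a 3616-byte tail
  }
}
```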
[jira] [Commented] (LUCENE-10448) MergeRateLimiter doesn't always limit instant rate.
[ https://issues.apache.org/jira/browse/LUCENE-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503881#comment-17503881 ] Vigya Sharma commented on LUCENE-10448:
--
I raised [PR #738|https://github.com/apache/lucene/pull/738] with changes to write the byte buffer in chunks and check for rate between each write. This could help avoid potentially large instant-rate write bursts due to big byte buffers in {{writeBytes()}}. Let me know if you think this would help - [~kkewwei], [~jpountz]
[GitHub] [lucene] LuXugang closed pull request #422: LUCENE-10120: Lazy initialize FixedBitSet in LRUQueryCache
LuXugang closed pull request #422: URL: https://github.com/apache/lucene/pull/422
[jira] [Updated] (LUCENE-10458) BoundedDocSetIdIterator may supply a wrong count in Weight#count(LeafReaderContext) when missingValue is enabled
[ https://issues.apache.org/jira/browse/LUCENE-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Xugang updated LUCENE-10458:
---
Issue Type: Bug (was: Improvement)

> BoundedDocSetIdIterator may supply a wrong count in Weight#count(LeafReaderContext) when missingValue is enabled
> ---
>
> Key: LUCENE-10458
> URL: https://issues.apache.org/jira/browse/LUCENE-10458
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Lu Xugang
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When IndexSortSortedNumericDocValuesRangeQuery can take advantage of the index sort, Weight#count uses BoundedDocSetIdIterator's lastDoc and firstDoc to calculate the count, but if a missingValue is enabled, documents that do not contain doc values may be included in that count.
[jira] [Updated] (LUCENE-10458) BoundedDocSetIdIterator may supply a wrong count in Weight#count(LeafReaderContext) when missingValue is enabled
[ https://issues.apache.org/jira/browse/LUCENE-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Xugang updated LUCENE-10458:
---
Fix Version/s: 9.1

> BoundedDocSetIdIterator may supply a wrong count in Weight#count(LeafReaderContext) when missingValue is enabled
> ---
>
> Key: LUCENE-10458
> URL: https://issues.apache.org/jira/browse/LUCENE-10458
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Lu Xugang
> Priority: Major
> Fix For: 9.1
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When IndexSortSortedNumericDocValuesRangeQuery can take advantage of the index sort, Weight#count uses BoundedDocSetIdIterator's lastDoc and firstDoc to calculate the count, but if a missingValue is enabled, documents that do not contain doc values may be included in that count.
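The over-count described in this issue can be shown with a toy example. The sketch below is purely illustrative: the array, the hard-coded firstDoc/lastDoc boundaries, and the class name are stand-ins for the idea, not Lucene's actual BoundedDocSetIdIterator logic.

```java
import java.util.Arrays;

// Toy model: doc values laid out in index-sort order, where docs without a
// value (null) were positioned as if they held missingValue = 2.
public class BoundedCountSketch {
  public static void main(String[] args) {
    Long[] valuesBySortedDoc = {1L, 2L, null, null, 3L, 4L};
    long lower = 2, upper = 3; // the range query bounds

    // Boundary-style count: subtract the first matching doc position from the
    // last. Because the two null docs (docs 2 and 3) were sorted between
    // doc 1 (value 2) and doc 4 (value 3), the boundaries span them too.
    int firstDoc = 1, lastDoc = 4;
    System.out.println("boundary count = " + (lastDoc - firstDoc + 1)); // 4

    // Correct count: only docs that actually carry a value inside the range.
    long actual = Arrays.stream(valuesBySortedDoc)
        .filter(v -> v != null && v >= lower && v <= upper)
        .count();
    System.out.println("actual count = " + actual); // 2
  }
}
```

The boundary arithmetic reports 4 matches while only 2 documents truly have an in-range value, which is the discrepancy the issue attributes to missingValue-sorted documents.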
[GitHub] [lucene] jtibshirani opened a new pull request #739: Adapt release smoke tester for 9.1
jtibshirani opened a new pull request #739: URL: https://github.com/apache/lucene/pull/739
This PR shows what parts of the smoke tester break when run on 9.1:
* We ship a new directory `module-test-framework`. Do we mean to include this in the binary distribution?
* There are several new test folders like `analysis.tests`, `core.tests`, etc. that we could also omit.
* The demo can't be launched because it's missing external dependencies (org.carrotsearch.hppc for `:lucene:facets` and antlr/asm for `:lucene:expressions`). How should we make these available?
NOTE: this PR is very hacky and is not meant to be merged; it just demonstrates the problems.
[jira] [Created] (LUCENE-10459) Update smoke tester for 9.1
Julie Tibshirani created LUCENE-10459:
---
Summary: Update smoke tester for 9.1
Key: LUCENE-10459
URL: https://issues.apache.org/jira/browse/LUCENE-10459
Project: Lucene - Core
Issue Type: Bug
Reporter: Julie Tibshirani

While working on the 9.1 release, I ran into several failures in the smoke tester that seem related to our move to the module system. At a high level, they include:
* Including test directories in the binary distribution
* Missing dependencies for the demo
I opened this PR to show the details of the issues: https://github.com/apache/lucene/pull/739.
[jira] [Updated] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-10382:
---
Fix Version/s: 9.1

> Allow KnnVectorQuery to operate over a subset of liveDocs
> ---
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 9.0
> Reporter: Joel Bernstein
> Priority: Major
> Fix For: 9.1
> Time Spent: 7h 50m
> Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs. This ticket will change the interface to make it possible for the top K vectors to be selected from a subset of the live docs.
[jira] [Commented] (LUCENE-10459) Update smoke tester for 9.1
[ https://issues.apache.org/jira/browse/LUCENE-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504037#comment-17504037 ] Dawid Weiss commented on LUCENE-10459:
---
It was an intentional decision _not_ to include all of the binary dependencies for each and every module - we only ship the ones for Luke in the binary distribution so that it works. I also believe we did verify the smoke tester when the change was made and everything worked - have there been any dependencies added that broke this?

> Update smoke tester for 9.1
> ---
>
> Key: LUCENE-10459
> URL: https://issues.apache.org/jira/browse/LUCENE-10459
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Julie Tibshirani
> Priority: Major
>
> While working on the 9.1 release, I ran into several failures in the smoke tester that seem related to our move to the module system. At a high level, they include:
> * Including test directories in the binary distribution
> * Missing dependencies for the demo
> I opened this PR to show the details of the issues: https://github.com/apache/lucene/pull/739.
[GitHub] [lucene] dweiss commented on a change in pull request #739: Adapt release smoke tester for 9.1
dweiss commented on a change in pull request #739: URL: https://github.com/apache/lucene/pull/739#discussion_r823407371

File path: dev-tools/scripts/smokeTestRelease.py
@@ -574,10 +574,11 @@ def verifyUnpacked(java, artifact, unpackPath, gitRevision, version, testArgs):
 #        raise RuntimeError('lucene: file "%s" is missing from artifact %s' % (fileName, artifact))
 #      in_root_folder.remove(fileName)
-  expected_folders = ['analysis', 'backward-codecs', 'benchmark', 'classification', 'codecs', 'core',
-                      'demo', 'expressions', 'facet', 'grouping', 'highlighter', 'join',
-                      'luke', 'memory', 'misc', 'monitor', 'queries', 'queryparser', 'replicator',
-                      'sandbox', 'spatial-extras', 'spatial3d', 'suggest', 'test-framework', 'licenses']
+  expected_folders = ['analysis', 'analysis.tests', 'backward-codecs', 'benchmark', 'classification', 'codecs',

Review comment: These "*.tests" folders should be excluded from the binary distribution at the build level - they're integration tests.