[GitHub] [lucene] mayya-sharipova merged pull request #734: LUCENE-10408 Test: correct type of checksum

2022-03-09 Thread GitBox


mayya-sharipova merged pull request #734:
URL: https://github.com/apache/lucene/pull/734


   





[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat

2022-03-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503366#comment-17503366
 ] 

ASF subversion and git services commented on LUCENE-10408:
--

Commit e5717cddfda68dace6e45357f5e33d81c368db31 in lucene's branch 
refs/heads/main from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e5717cd ]

LUCENE-10408 Test correction checksum (#734)

Use double instead of float to test vector values checksum
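A minimal sketch of why the accumulator type matters here (illustrative values,
not the actual test code): summing many floats into a float accumulator loses
low-order bits, while a double accumulator keeps the sum accurate.

{code:java}
float floatSum = 0f;
double doubleSum = 0d;
for (int i = 0; i < 10_000_000; i++) {
  float v = 1e-3f;
  floatSum += v;  // float carries ~7 decimal digits; this sum drifts visibly
  doubleSum += v; // double carries ~15-16 digits; this sum stays ~10000.0
}
{code}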

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> -
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Currently we write the doc IDs of all documents that have vectors as-is. We
> should improve their encoding using either delta encoding or a bitset (see
> the sketch below).
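
A hedged illustration of the delta-encoding idea mentioned above (not the
actual Lucene91HnswVectorsFormat code): doc IDs are sorted, so storing each gap
from the previous ID yields small integers that compress well, while a fully
dense range could instead be marked with a bitset or just its bounds.

{code:java}
int[] docIds = {3, 7, 8, 42, 100};     // sorted doc IDs that have vectors
int[] deltas = new int[docIds.length];
int prev = 0;
for (int i = 0; i < docIds.length; i++) {
  deltas[i] = docIds[i] - prev;        // {3, 4, 1, 34, 58}
  prev = docIds[i];
}
{code}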






[GitHub] [lucene] romseygeek commented on a change in pull request #679: Monitor Improvements LUCENE-10422

2022-03-09 Thread GitBox


romseygeek commented on a change in pull request #679:
URL: https://github.com/apache/lucene/pull/679#discussion_r822427351



##
File path: 
lucene/monitor/src/java/org/apache/lucene/monitor/ReadonlyQueryIndex.java
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.monitor;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.search.*;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.NamedThreadFactory;
+
+class ReadonlyQueryIndex extends QueryIndex {
+
+  private final ScheduledExecutorService refreshExecutor;
+
+  public ReadonlyQueryIndex(MonitorConfiguration configuration) throws IOException {
+    if (configuration.getDirectoryProvider() == null) {
+      throw new IllegalStateException(
+          "You must specify a Directory when configuring a Monitor as read-only.");
+    }
+    Directory directory = configuration.getDirectoryProvider().get();
+    this.manager = new SearcherManager(directory, new TermsHashBuilder(termFilters));
+    this.decomposer = configuration.getQueryDecomposer();
+    this.serializer = configuration.getQuerySerializer();
+    this.refreshExecutor =
+        Executors.newSingleThreadScheduledExecutor(new NamedThreadFactory("cache-purge"));
+    long refreshFrequency = configuration.getPurgeFrequency();
+    this.refreshExecutor.scheduleAtFixedRate(
+        () -> {
+          try {
+            manager.maybeRefresh();
+          } catch (IOException e) {
+            throw new RuntimeException(e);
+          }
+        },
+        refreshFrequency,
+        refreshFrequency,
+        configuration.getPurgeFrequencyUnits());
+  }
+
+  @Override
+  public void commit(List<MonitorQuery> updates) throws IOException {
+    throw new IllegalStateException("Monitor is readOnly cannot commit");
+  }
+
+  @Override
+  public long search(QueryBuilder queryBuilder, QueryCollector matcher) throws IOException {
+    IndexSearcher searcher = null;
+    try {
+      searcher = manager.acquire();
+      LazyMonitorQueryCollector collector =
+          new LazyMonitorQueryCollector(matcher, serializer, decomposer);
+      long buildTime = System.nanoTime();
+      Query query =
+          queryBuilder.buildQuery(
+              termFilters.get(searcher.getIndexReader().getReaderCacheHelper().getKey()));
+      buildTime = System.nanoTime() - buildTime;
+      searcher.search(query, collector);
+      return buildTime;
+    } finally {
+      if (searcher != null) {
+        manager.release(searcher);
+      }
+    }
+  }
+
+  @Override
+  public void purgeCache() {
+    throw new IllegalStateException("Monitor is readOnly, it has no cache");

Review comment:
   Let's make this call `manager.maybeRefresh`
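
   A hedged sketch of what that suggestion could look like (names taken from
the diff above; the exact shape is an assumption, not the final code):

   ```java
   @Override
   public void purgeCache() {
     try {
       // Refresh the searcher so changes in the underlying Directory become
       // visible, instead of rejecting the call outright.
       manager.maybeRefresh();
     } catch (IOException e) {
       throw new RuntimeException(e);
     }
   }
   ```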


[jira] [Commented] (LUCENE-10311) Should DocIdSetBuilder have different implementations for point and terms?

2022-03-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503480#comment-17503480
 ] 

ASF subversion and git services commented on LUCENE-10311:
--

Commit e999056c19d98b5dbd6434f6986e19c69cdb28ab in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e999056 ]

LUCENE-10311: avoid division by zero on small sets.


> Should DocIdSetBuilder have different implementations for point and terms?
> --
>
> Key: LUCENE-10311
> URL: https://issues.apache.org/jira/browse/LUCENE-10311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder has two API implementations, one for terms queries and one
> for point values queries. In each case they are used in totally different
> ways.
> For terms the API looks like:
>  
> {code:java}
> /**
>  * Add the content of the provided {@link DocIdSetIterator} to this builder. 
> NOTE: if you need to
>  * build a {@link DocIdSet} out of a single {@link DocIdSetIterator}, you 
> should rather use {@link
>  * RoaringDocIdSet.Builder}.
>  */
> void add(DocIdSetIterator iter) throws IOException;
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build() 
> {code}
>  
> For Point Values it looks like:
>  
> {code:java}
> /**
>  * Utility class to efficiently add many docs in one go.
>  *
>  * @see DocIdSetBuilder#grow
>  */
> public abstract static class BulkAdder {
>   public abstract void add(int doc);
>
>   public void add(DocIdSetIterator iterator) throws IOException {
>     int docID;
>     while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
>       add(docID);
>     }
>   }
> }
>
> /**
>  * Reserve space and return a {@link BulkAdder} object that can be used to
>  * add up to {@code numDocs} documents.
>  */
> public BulkAdder grow(int numDocs)
>
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build()
> {code}
>  
>  
> This is becoming trappy for new developments in the PointValues API.
> 1) When we call #grow() from the PointValues API, we are not telling the
> builder how many docs we are going to add (as we don't really know it) but
> the number of points we are about to visit. This number can be bigger than
> Integer.MAX_VALUE. Until now, we get around this issue by making sure we
> don't call this API when we need to add more than Integer.MAX_VALUE points.
> In that case we will navigate the tree down until the number of points is
> reduced and they can fit in an int (see the sketch after this quoted
> description).
> This has worked well until now because we are calling grow from inside the
> BKD reader, and the BKD writer/reader makes sure that the number of points
> in a leaf can fit in an int. In LUCENE-, we are moving to a cursor-like API
> which does not enforce that the number of points on a leaf needs to fit in
> an int. This causes friction and inconsistency in the API.
>  
> 2) This is a secondary issue that I found while thinking about this issue. In
> Lucene- we added the possibility to add a `DocIdSetIterator` from the
> PointValues API. Therefore there are two ways to add those kinds of objects
> to a DocIdSetBuilder, which can end up with different results:
>  
> {code:java}
> {
>   // Terms API
>   docIdSetBuilder.add(docIdSetIterator); 
> }
> {
>   // Point values API
>   docIdSetBuilder.grow(doc).add(docIdSetIterator)
> }{code}
>  
> I wonder if we need to rethink this API: should we have different
> implementations for terms and point values?
>  
>  
>  
>  
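
A hedged sketch of the int-overflow friction from point 1 above (illustrative
numbers only):

{code:java}
long pointCount = 3_000_000_000L; // points to visit: > Integer.MAX_VALUE
int asInt = (int) pointCount;     // overflows to a negative number
// So callers must keep descending the BKD tree until the remaining point
// count fits in an int before calling DocIdSetBuilder.grow(int).
{code}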






[jira] [Commented] (LUCENE-10311) Should DocIdSetBuilder have different implementations for point and terms?

2022-03-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503484#comment-17503484
 ] 

ASF subversion and git services commented on LUCENE-10311:
--

Commit 38b4bbf74e25a5e578486ba434751a3f361912f5 in lucene's branch 
refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=38b4bbf ]

LUCENE-10311: avoid division by zero on small sets.


> Should DocIdSetBuilder have different implementations for point and terms?
> --
>
> Key: LUCENE-10311
> URL: https://issues.apache.org/jira/browse/LUCENE-10311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder has two API implementations, one for terms queries and one
> for point values queries. In each case they are used in totally different
> ways.
> For terms the API looks like:
>  
> {code:java}
> /**
>  * Add the content of the provided {@link DocIdSetIterator} to this builder. 
> NOTE: if you need to
>  * build a {@link DocIdSet} out of a single {@link DocIdSetIterator}, you 
> should rather use {@link
>  * RoaringDocIdSet.Builder}.
>  */
> void add(DocIdSetIterator iter) throws IOException;
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build() 
> {code}
>  
> For Point Values it looks like:
>  
> {code:java}
> /**
>  * Utility class to efficiently add many docs in one go.
>  *
>  * @see DocIdSetBuilder#grow
>  */
> public abstract static class BulkAdder {
>   public abstract void add(int doc);
>
>   public void add(DocIdSetIterator iterator) throws IOException {
>     int docID;
>     while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
>       add(docID);
>     }
>   }
> }
>
> /**
>  * Reserve space and return a {@link BulkAdder} object that can be used to
>  * add up to {@code numDocs} documents.
>  */
> public BulkAdder grow(int numDocs)
>
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build()
> {code}
>  
>  
> This is becoming trappy for new developments in the PointValues API.
> 1) When we call #grow() from the PointValues API, we are not telling the
> builder how many docs we are going to add (as we don't really know it) but
> the number of points we are about to visit. This number can be bigger than
> Integer.MAX_VALUE. Until now, we get around this issue by making sure we
> don't call this API when we need to add more than Integer.MAX_VALUE points.
> In that case we will navigate the tree down until the number of points is
> reduced and they can fit in an int.
> This has worked well until now because we are calling grow from inside the
> BKD reader, and the BKD writer/reader makes sure that the number of points
> in a leaf can fit in an int. In LUCENE-, we are moving to a cursor-like API
> which does not enforce that the number of points on a leaf needs to fit in
> an int. This causes friction and inconsistency in the API.
>  
> 2) This is a secondary issue that I found while thinking about this issue. In
> Lucene- we added the possibility to add a `DocIdSetIterator` from the
> PointValues API. Therefore there are two ways to add those kinds of objects
> to a DocIdSetBuilder, which can end up with different results:
>  
> {code:java}
> {
>   // Terms API
>   docIdSetBuilder.add(docIdSetIterator); 
> }
> {
>   // Point values API
>   docIdSetBuilder.grow(doc).add(docIdSetIterator)
> }{code}
>  
> I wonder if we need to rethink this API: should we have different
> implementations for terms and point values?
>  
>  
>  
>  






[jira] [Commented] (LUCENE-10457) LuceneTestCase.createTempDir could randomly return symbolic links

2022-03-09 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503533#comment-17503533
 ] 

Robert Muir commented on LUCENE-10457:
--

A symlink to a file is not a file, and a symlink to a directory is not a
directory. It is its own horrible thing.

Simple facts. We can argue all day, but my position will stand firm because of
this.

I'll also do everything in my power to keep symlink support out of Lucene. We
all make choices in life. For example, users/developers can decide to use
finalizers, shutdown hooks, and other horrible things available in Java; it is
a bad idea. We don't need tests around this, we just don't support it.

If you decide to use these in your design (NOT RECOMMENDED), then your tests
need to be explicit. There is tons of system-specific stuff about them, and
you basically need OS-specific tests. If Java tries to hide this with
abstractions, that doesn't change a thing. Java is wrong if they do that,
that's all.


> LuceneTestCase.createTempDir could randomly return symbolic links
> -
>
> Key: LUCENE-10457
> URL: https://issues.apache.org/jira/browse/LUCENE-10457
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/test
>Reporter: Mike Drob
>Priority: Major
>
> When we are creating temporary directories to use for other Lucene functions,
> we could occasionally provide symbolic links instead of direct references to
> directories. If the system running the tests doesn't support symbolic links,
> then we should ignore this option.
> Providing links would be useful to test scenarios where, for example, users
> have a symbolic link for the "current" index directory and rotate it over
> time while applications keep using the same link (a sketch of this pattern
> follows below).
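
A hedged sketch of the rotation pattern described above (paths are illustrative
assumptions; requires an OS and filesystem that support symlinks):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CurrentIndexRotation {
  public static void main(String[] args) throws IOException {
    Path v1 = Files.createDirectories(Path.of("data/index-v1"));
    Path v2 = Files.createDirectories(Path.of("data/index-v2"));
    Path current = Path.of("data/current");
    Files.createSymbolicLink(current, v1); // applications always open "current"
    // Rotate to the new index generation; the application's path never changes.
    Files.delete(current);
    Files.createSymbolicLink(current, v2);
  }
}
{code}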






[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat

2022-03-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503543#comment-17503543
 ] 

ASF subversion and git services commented on LUCENE-10408:
--

Commit 1f497819e6b60db7908657056512e4c65fef420a in lucene's branch 
refs/heads/branch_9x from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1f49781 ]

LUCENE-10408 Test correction checksum (#734)

Use double instead of float to test vector values checksum

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> -
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Currently we write the doc IDs of all documents that have vectors as-is. We
> should improve their encoding using either delta encoding or a bitset.






[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat

2022-03-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503547#comment-17503547
 ] 

ASF subversion and git services commented on LUCENE-10408:
--

Commit 8f399572c99786b859123bca8ff50e99692d4ae3 in lucene's branch 
refs/heads/branch_9_1 from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8f39957 ]

LUCENE-10408 Test correction checksum (#734)

Use double instead of float to test vector values checksum

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> -
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Currently we write the doc IDs of all documents that have vectors as-is. We
> should improve their encoding using either delta encoding or a bitset.






[jira] [Commented] (LUCENE-10457) LuceneTestCase.createTempDir could randomly return symbolic links

2022-03-09 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503608#comment-17503608
 ] 

Uwe Schindler commented on LUCENE-10457:


No symlinks please. The symlinks in the git checkout of Solr for the refguide
already cause heavy issues on Windows (see my complaints). On Windows you need
admin rights, or you need to put the computer or Jenkins server into "Win10
developer mode", to create them (which is debatable, but Microsoft's reasons
for doing this are legitimate because of malware). E.g., Git on Windows can't
handle symlinks unless you set a specific option when checking out. In Solr's
repo I just get text files with the symlink target as contents when checking
out the Solr refguide!

Better to sometimes add whitespace to the temporary directory name instead.
Much more efficient for finding bugs! :D

> LuceneTestCase.createTempDir could randomly return symbolic links
> -
>
> Key: LUCENE-10457
> URL: https://issues.apache.org/jira/browse/LUCENE-10457
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/test
>Reporter: Mike Drob
>Priority: Major
>
> When we are creating temporary directories to use for other Lucene functions,
> we could occasionally provide symbolic links instead of direct references to
> directories. If the system running the tests doesn't support symbolic links,
> then we should ignore this option.
> Providing links would be useful to test scenarios where, for example, users
> have a symbolic link for the "current" index directory and rotate it over
> time while applications keep using the same link.






[GitHub] [lucene-solr] thelabdude opened a new pull request #2645: Set cluster size to 2 and maxShardsPerNode to -1 for HttpClusterStateSSLTest

2022-03-09 Thread GitBox


thelabdude opened a new pull request #2645:
URL: https://github.com/apache/lucene-solr/pull/2645


   Test requires maxShardsPerNode to be -1 to allow multiple shards / replicas 
in a single node.





[GitHub] [lucene-solr] thelabdude merged pull request #2645: Set cluster size to 2 and maxShardsPerNode to -1 for HttpClusterStateSSLTest

2022-03-09 Thread GitBox


thelabdude merged pull request #2645:
URL: https://github.com/apache/lucene-solr/pull/2645


   





[GitHub] [lucene] cpoerschke opened a new pull request #737: Reduce for-loop in WeightedSpanTermExtractor.extractWeightedSpanTerms method.

2022-03-09 Thread GitBox


cpoerschke opened a new pull request #737:
URL: https://github.com/apache/lucene/pull/737


   Draft pull request for now: no JIRA issue as yet, and I have not fully
analysed things yet, but from code reading it seems the query rewrite and term
collecting need not happen in a loop?

   The
https://github.com/apache/lucene/commit/81c7ba4601a9aaf16e2255fe493ee582abe72a90
change included

   ```
   - final SpanQuery rewrittenQuery = (SpanQuery) spanQuery.rewrite(getLeafContextForField(field).reader());
   + final SpanQuery rewrittenQuery = (SpanQuery) spanQuery.rewrite(getLeafContext().reader());
   ```

   as a change, i.e. previously more needed to happen inside the loop.
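
   A hedged sketch of the resulting hoisting (not the actual patch; the
`collectTermsForField` helper below is hypothetical, for illustration only):

   ```java
   // Rewrite once, outside the loop, since the result no longer depends on `field`.
   final SpanQuery rewrittenQuery = (SpanQuery) spanQuery.rewrite(getLeafContext().reader());
   for (final String field : fieldNames) {
     collectTermsForField(rewrittenQuery, field); // hypothetical helper
   }
   ```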





[jira] [Commented] (LUCENE-10448) MergeRateLimiter doesn't always limit instant rate.

2022-03-09 Thread Vigya Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503809#comment-17503809
 ] 

Vigya Sharma commented on LUCENE-10448:
---

Hi [~kkewwei] - Thanks for collecting and sharing these detailed metrics and
logs.

Can you help me understand why the {{detailBytes(mb)}} values are always
{{~0.73mb}} when we are seeing an instant rate of up to {{460 mb/s}} in these
runs? I was expecting to find large values in "detailBytes(mb)" corresponding
to the high instant-rate entries in "detailRate(mb/s)".
Maybe "detailRate(mb/s)" is logged from {{writeBytes()}} while
"detailBytes(mb)" is logged from the other APIs - in which case, it is curious
how we have exactly 49 entries for both of them.

Separately, I see that you calculated the instant rate from {{writeBytes()}},
which, in theory, can end up writing large bursts of data. I believe there is
some scope for improvement there.

 

> MergeRateLimiter doesn't always limit instant rate.
> ---
>
> Key: LUCENE-10448
> URL: https://issues.apache.org/jira/browse/LUCENE-10448
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 8.11.1
>Reporter: kkewwei
>Priority: Major
>
> We can see the code in *MergeRateLimiter*:
> {code:java}
> private long maybePause(long bytes, long curNS) throws
> MergePolicy.MergeAbortedException {
>   ...
>   double rate = mbPerSec;
>   double secondsToPause = (bytes / 1024. / 1024.) / rate;
>   long targetNS = lastNS + (long) (1000000000 * secondsToPause);
>   long curPauseNS = targetNS - curNS;
>   // We don't bother with thread pausing if the pause is smaller than 2 msec.
>   if (curPauseNS <= MIN_PAUSE_NS) {
>     // Set to curNS, not targetNS, to enforce the instant rate, not
>     // the "averaged over all history" rate:
>     lastNS = curNS;
>     return -1;
>   }
>   ...
> }
> {code}
> If a segment is being merged and *maybePause* is called at 7:00, then
> lastNS=7:00; when *maybePause* is called again at 7:05, the value of
> *targetNS = lastNS + (long) (1000000000 * secondsToPause)* must be smaller
> than *curNS*, so no matter how big bytes is, we return -1 and skip the
> pause.
> I counted the total number of *maybePause* calls (callTimes), the number of
> ignored pauses (ignorePauseTimes), and the ignored bytes in detail
> (detailBytes):
> {code:java}
> [2022-03-02T15:16:51,972][DEBUG][o.e.i.e.I.EngineMergeScheduler] [node1] 
> [index1][21] merge segment [_4h] done: took [26.8s], [123.6 MB], [61,219 
> docs], [0s stopped], [24.4s throttled], [242.5 MB written], [11.2 MB/sec 
> throttle], [callTimes=857], [ignorePauseTimes=25],  [detailBytes(mb) = 
> [0.28899956, 0.28140354, 0.28015518, 0.27990818, 0.2801447, 0.27991104, 
> 0.27990723, 0.27990913, 0.2799101, 0.28010082, 0.2799921, 0.2799673, 
> 0.28144264, 0.27991295, 0.27990818, 0.27993107, 0.2799387, 0.27998447, 
> 0.28002167, 0.27992058, 0.27998066, 0.28098202, 0.28125, 0.28125, 0.28125]]
> {code}
> *maybePause* was called 857 times, and 25 of those calls skipped the pause;
> we can see that the ignored byte counts (such as 0.28125mb) are not small.
> As long as the interval between two *maybePause* calls is relatively long,
> the pause that should be executed will not be executed.
>  






[GitHub] [lucene] vigyasharma opened a new pull request #738: LUCENE-10448: Avoid instant rate write bursts by writing bytes buffer in chunks

2022-03-09 Thread GitBox


vigyasharma opened a new pull request #738:
URL: https://github.com/apache/lucene/pull/738


   
   
   
   ## Description
   
   `RateLimitedIndexOutput#writeBytes()` checks the rate only at the start of
writing the byte buffer. This can result in large instant-rate write bursts if
the provided byte buffer is large.
   To avoid this, we write the bytes in chunks and check the rate between
chunks, as sketched below.
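
   A hedged sketch of the chunking idea (not the PR's actual code; the chunk
size and the `checkRate` helper are assumptions):

   ```java
   private static final int CHUNK_SIZE = 8192; // assumed for illustration

   @Override
   public void writeBytes(byte[] b, int offset, int length) throws IOException {
     while (length > 0) {
       checkRate(); // may pause, so no single chunk can burst past the limit
       int chunk = Math.min(length, CHUNK_SIZE);
       delegate.writeBytes(b, offset, chunk);
       offset += chunk;
       length -= chunk;
     }
   }
   ```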
   
   ## Tests
   
   Added `TestRateLimitedIndexOutput` to verify the rate check between chunks.
   
   ## Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [ ] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `main` branch.
   - [ ] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   





[jira] [Commented] (LUCENE-10448) MergeRateLimiter doesn't always limit instant rate.

2022-03-09 Thread Vigya Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503881#comment-17503881
 ] 

Vigya Sharma commented on LUCENE-10448:
---

I raised [PR #738|https://github.com/apache/lucene/pull/738] with changes to
write the byte buffer in chunks and check the rate between writes. This could
help avoid potentially large instant-rate write bursts due to big byte buffers
in {{writeBytes()}}.

Let me know if you think this would help - [~kkewwei], [~jpountz]

> MergeRateLimiter doesn't always limit instant rate.
> ---
>
> Key: LUCENE-10448
> URL: https://issues.apache.org/jira/browse/LUCENE-10448
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 8.11.1
>Reporter: kkewwei
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We can see the code in *MergeRateLimiter*:
> {code:java}
> private long maybePause(long bytes, long curNS) throws
> MergePolicy.MergeAbortedException {
>   ...
>   double rate = mbPerSec;
>   double secondsToPause = (bytes / 1024. / 1024.) / rate;
>   long targetNS = lastNS + (long) (1000000000 * secondsToPause);
>   long curPauseNS = targetNS - curNS;
>   // We don't bother with thread pausing if the pause is smaller than 2 msec.
>   if (curPauseNS <= MIN_PAUSE_NS) {
>     // Set to curNS, not targetNS, to enforce the instant rate, not
>     // the "averaged over all history" rate:
>     lastNS = curNS;
>     return -1;
>   }
>   ...
> }
> {code}
> If a segment is being merged and *maybePause* is called at 7:00, then
> lastNS=7:00; when *maybePause* is called again at 7:05, the value of
> *targetNS = lastNS + (long) (1000000000 * secondsToPause)* must be smaller
> than *curNS*, so no matter how big bytes is, we return -1 and skip the
> pause.
> I counted the total number of *maybePause* calls (callTimes), the number of
> ignored pauses (ignorePauseTimes), and the ignored bytes in detail
> (detailBytes):
> {code:java}
> [2022-03-02T15:16:51,972][DEBUG][o.e.i.e.I.EngineMergeScheduler] [node1] 
> [index1][21] merge segment [_4h] done: took [26.8s], [123.6 MB], [61,219 
> docs], [0s stopped], [24.4s throttled], [242.5 MB written], [11.2 MB/sec 
> throttle], [callTimes=857], [ignorePauseTimes=25],  [detailBytes(mb) = 
> [0.28899956, 0.28140354, 0.28015518, 0.27990818, 0.2801447, 0.27991104, 
> 0.27990723, 0.27990913, 0.2799101, 0.28010082, 0.2799921, 0.2799673, 
> 0.28144264, 0.27991295, 0.27990818, 0.27993107, 0.2799387, 0.27998447, 
> 0.28002167, 0.27992058, 0.27998066, 0.28098202, 0.28125, 0.28125, 0.28125]]
> {code}
> *maybePause* was called 857 times, and 25 of those calls skipped the pause;
> we can see that the ignored byte counts (such as 0.28125mb) are not small.
> As long as the interval between two *maybePause* calls is relatively long,
> the pause that should be executed will not be executed.
>  






[GitHub] [lucene] LuXugang closed pull request #422: LUCENE-10120: Lazy initialize FixedBitSet in LRUQueryCache

2022-03-09 Thread GitBox


LuXugang closed pull request #422:
URL: https://github.com/apache/lucene/pull/422


   





[jira] [Updated] (LUCENE-10458) BoundedDocSetIdIterator may supply a wrong count in Weight#count(LeafReaderContext) when missingValue is enabled

2022-03-09 Thread Lu Xugang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-10458:
---
Issue Type: Bug  (was: Improvement)

> BoundedDocSetIdIterator may supply a wrong count in
> Weight#count(LeafReaderContext) when missingValue is enabled
> ---
>
> Key: LUCENE-10458
> URL: https://issues.apache.org/jira/browse/LUCENE-10458
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Lu Xugang
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When IndexSortSortedNumericDocValuesRangeQuery can take advantage of the
> index sort, Weight#count will use BoundedDocSetIdIterator's lastDoc and
> firstDoc to calculate the count, but if missingValue is enabled, documents
> that do not contain doc values may be included in the count (see the sketch
> below).
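
A hedged sketch of the over-count (names mirror the description above; this is
illustrative, not the actual Lucene code):

{code:java}
// The bounded iterator's count is derived from its doc-id bounds:
int count = lastDoc - firstDoc;
// With missingValue set, docs that have no value for the field can still be
// sorted into [firstDoc, lastDoc) and are then wrongly included in this count.
{code}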






[jira] [Updated] (LUCENE-10458) BoundedDocSetIdIterator may supply a wrong count in Weight#count(LeafReaderContext) when missingValue is enabled

2022-03-09 Thread Lu Xugang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-10458:
---
Fix Version/s: 9.1

> BoundedDocSetIdIterator may supply a wrong count in
> Weight#count(LeafReaderContext) when missingValue is enabled
> ---
>
> Key: LUCENE-10458
> URL: https://issues.apache.org/jira/browse/LUCENE-10458
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Lu Xugang
>Priority: Major
> Fix For: 9.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When IndexSortSortedNumericDocValuesRangeQuery can take advantage of the
> index sort, Weight#count will use BoundedDocSetIdIterator's lastDoc and
> firstDoc to calculate the count, but if missingValue is enabled, documents
> that do not contain doc values may be included in the count.






[GitHub] [lucene] jtibshirani opened a new pull request #739: Adapt release smoke tester for 9.1

2022-03-09 Thread GitBox


jtibshirani opened a new pull request #739:
URL: https://github.com/apache/lucene/pull/739


   This PR shows what parts of the smoke tester break when run on 9.1:
   * We ship a new directory `module-test-framework`. Do we mean to be 
including this in the binary distribution?
   * There are several new test folders like `analysis.tests`, `core.tests`, 
etc. that we could also omit.
   * The demo can't be launched because it's missing external dependencies
(org.carrotsearch.hppc for `:lucene:facets` and antlr/asm for
`:lucene:expressions`). How should we make these available?
   
   
   NOTE: this PR is very hacky and is not meant to be merged, it just 
demonstrates the problems.





[jira] [Created] (LUCENE-10459) Update smoke tester for 9.1

2022-03-09 Thread Julie Tibshirani (Jira)
Julie Tibshirani created LUCENE-10459:
-

 Summary: Update smoke tester for 9.1
 Key: LUCENE-10459
 URL: https://issues.apache.org/jira/browse/LUCENE-10459
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Julie Tibshirani


While working on the 9.1 release, I ran into several failures in the smoke 
tester that seem related to our move to the module system. At a high level, 
they include:
* Including test directories in the binary distribution
* Missing dependencies for the demo

I opened this PR to show the details of the issues: 
https://github.com/apache/lucene/pull/739.






[jira] [Updated] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-03-09 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-10382:
--
Fix Version/s: 9.1

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
> Fix For: 9.1
>
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.






[jira] [Commented] (LUCENE-10459) Update smoke tester for 9.1

2022-03-09 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504037#comment-17504037
 ] 

Dawid Weiss commented on LUCENE-10459:
--

It was an intentional decision _not_ to include all of the binary dependencies
for each and every module - we only ship the ones for Luke in the binary
distribution so that it works. I also believe we did verify the smoke tester
when the change was made and everything worked - have any dependencies been
added since then that broke this?

> Update smoke tester for 9.1
> ---
>
> Key: LUCENE-10459
> URL: https://issues.apache.org/jira/browse/LUCENE-10459
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Major
>
> While working on the 9.1 release, I ran into several failures in the smoke 
> tester that seem related to our move to the module system. At a high level, 
> they include:
> * Including test directories in the binary distribution
> * Missing dependencies for the demo
> I opened this PR to show the details of the issues: 
> https://github.com/apache/lucene/pull/739.






[GitHub] [lucene] dweiss commented on a change in pull request #739: Adapt release smoke tester for 9.1

2022-03-09 Thread GitBox


dweiss commented on a change in pull request #739:
URL: https://github.com/apache/lucene/pull/739#discussion_r823407371



##
File path: dev-tools/scripts/smokeTestRelease.py
##
@@ -574,10 +574,11 @@ def verifyUnpacked(java, artifact, unpackPath, 
gitRevision, version, testArgs):
   #   raise RuntimeError('lucene: file "%s" is missing from artifact %s' % 
(fileName, artifact))
   # in_root_folder.remove(fileName)
 
-  expected_folders = ['analysis', 'backward-codecs', 'benchmark', 
'classification', 'codecs', 'core',
-  'demo', 'expressions', 'facet', 'grouping', 
'highlighter', 'join',
-  'luke', 'memory', 'misc', 'monitor', 'queries', 
'queryparser', 'replicator',
-  'sandbox', 'spatial-extras', 'spatial3d', 'suggest', 
'test-framework', 'licenses']
+  expected_folders = ['analysis', 'analysis.tests', 'backward-codecs', 
'benchmark', 'classification', 'codecs',

Review comment:
   These "*.tests" folders should be excluded from the binary distribution 
at the build level - they're integration tests.



