[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

2021-04-01 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313013#comment-17313013
 ] 

Adrien Grand commented on LUCENE-9850:
--

I'm looking forward to JDK17 being out so that we can more easily leverage the 
vector API for postings formats and in particular better optimize the prefix 
sum logic. Maybe this will help reduce the gap between FOR and PFOR?

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> ---
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Affects Versions: main (9.0)
> Reporter: Greg Miller
> Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.
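
To make the trade-off discussed above concrete, here is a minimal Java sketch. It is not Lucene's actual FOR/PFOR code: the class name, the exception budget and the sample deltas are all hypothetical. It compares the bit width FOR needs for a block of doc ID deltas with the width a PFOR-style encoding could use if a few outliers are stored separately as exceptions, and shows the prefix sum that turns deltas back into doc IDs.

```java
import java.util.Arrays;

// Illustrative only: a simplified model of FOR vs PFOR bit widths,
// not the encoding used by Lucene's postings format.
public class ForVsPforSketch {

  /** Bits needed to represent a non-negative value (at least 1 bit). */
  static int bitsRequired(long value) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(value));
  }

  /** FOR: every value in the block uses the width of the largest delta. */
  static int forBitsPerValue(long[] deltas) {
    long max = 0;
    for (long d : deltas) {
      max = Math.max(max, d);
    }
    return bitsRequired(max);
  }

  /**
   * PFOR-style: the width only has to fit all but the maxExceptions largest
   * deltas; those outliers would be patched in separately as exceptions.
   * Assumes a non-empty block (hypothetical helper).
   */
  static int pforBitsPerValue(long[] deltas, int maxExceptions) {
    long[] sorted = deltas.clone();
    Arrays.sort(sorted);
    int idx = Math.max(0, sorted.length - 1 - maxExceptions);
    return bitsRequired(sorted[idx]);
  }

  /** Prefix sum: turns deltas back into absolute doc IDs, starting from base. */
  static long[] prefixSum(long base, long[] deltas) {
    long[] docIds = new long[deltas.length];
    long current = base;
    for (int i = 0; i < deltas.length; i++) {
      current += deltas[i];
      docIds[i] = current;
    }
    return docIds;
  }

  public static void main(String[] args) {
    long[] deltas = {3, 1, 2, 1000, 4, 2, 1, 5};
    System.out.println("FOR bits per value:  " + forBitsPerValue(deltas));
    System.out.println("PFOR bits per value: " + pforBitsPerValue(deltas, 1));
    System.out.println("doc IDs: " + Arrays.toString(prefixSum(0, deltas)));
  }
}
```

In this toy block a single large delta forces FOR to use a wide encoding for every value, while the PFOR-style width is set by the remaining small deltas. The prefix sum loop is the step that FOR can fuse with decoding and that a patched format has to run after applying exceptions.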



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9899) Numeric DV block compression ignores the gcd when computing the number of bits required

2021-04-01 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313041#comment-17313041
 ] 

Adrien Grand commented on LUCENE-9899:
--

We have a test for GCD compression in 
{{BaseCompressingDocValuesFormatTestCase#testDateCompression}}; I wonder if we 
can have a similar test for block+GCD compression. Otherwise +1 to the patch, 
great catch!

> Numeric DV block compression ignores the gcd when computing the number of 
> bits required
> ---
>
> Key: LUCENE-9899
> URL: https://issues.apache.org/jira/browse/LUCENE-9899
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Jim Ferenczi
> Priority: Major
> Attachments: LUCENE-9899.patch
>
>
> When numeric doc values are split per block, we compute the number of bits 
> per value [from the minimum and maximum value present in the 
> block|https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesConsumer.java#L390].
>  However, the greatest common divisor is not taken into account, so the number 
> of bits is overestimated in cases where the GCD is greater than 1.
>  
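
As an illustration of the effect described above, here is a minimal Java sketch. The class and method names are hypothetical, and this is neither the code in {{Lucene90DocValuesConsumer}} nor the attached patch. It assumes that values in a block are encoded as (value - min) / gcd, and compares the bit width computed from max - min alone with the width obtained after dividing by the block's GCD.

```java
// Illustrative only: shows why ignoring the GCD overestimates bits per value.
public class GcdBitsSketch {

  /** Bits needed to represent a non-negative value; 0 when the value is 0. */
  static int bitsRequired(long value) {
    return 64 - Long.numberOfLeadingZeros(value);
  }

  /** Greatest common divisor of two non-negative longs (Euclid's algorithm). */
  static long gcd(long a, long b) {
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return a;
  }

  static long min(long[] values) {
    long m = Long.MAX_VALUE;
    for (long v : values) {
      m = Math.min(m, v);
    }
    return m;
  }

  static long max(long[] values) {
    long m = Long.MIN_VALUE;
    for (long v : values) {
      m = Math.max(m, v);
    }
    return m;
  }

  /** GCD of all (value - min) offsets in the block; 0 if all values are equal. */
  static long blockGcd(long[] values, long min) {
    long g = 0;
    for (long v : values) {
      g = gcd(g, v - min);
    }
    return g;
  }

  /** Width when only min and max are considered. Assumes max - min fits in a long. */
  static int bitsIgnoringGcd(long[] values) {
    return bitsRequired(max(values) - min(values));
  }

  /** Width when each offset is divided by the block GCD before encoding. */
  static int bitsWithGcd(long[] values) {
    long min = min(values);
    long range = max(values) - min;
    long g = blockGcd(values, min);
    return bitsRequired(g > 1 ? range / g : range);
  }

  public static void main(String[] args) {
    long day = 86_400_000L; // milliseconds per day
    long[] values = {10 * day, 11 * day, 13 * day, 20 * day};
    System.out.println("bits ignoring GCD: " + bitsIgnoringGcd(values)); // 30
    System.out.println("bits with GCD:     " + bitsWithGcd(values));     // 4
  }
}
```

Day-granularity timestamps stored in milliseconds are the kind of data where the difference is large, which is also what a date-based test would exercise.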






[GitHub] [lucene] iverase opened a new pull request #59: LUCENE-9705: Move all test classes under lucene90 package

2021-04-01 Thread GitBox


iverase opened a new pull request #59:
URL: https://github.com/apache/lucene/pull/59


   This PR moves three test classes under the lucene90 package and adds an entry 
in CHANGES.txt. With this we can close the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[GitHub] [lucene-solr] noblepaul commented on a change in pull request #2479: SOLR-15288 Replicas stay DOWN after a new node is restarted when using the same directory

2021-04-01 Thread GitBox


noblepaul commented on a change in pull request #2479:
URL: https://github.com/apache/lucene-solr/pull/2479#discussion_r605514906



##
File path: solr/solrj/src/test/org/apache/solr/client/solrj/impl/CloudSolrClientTest.java
##
@@ -1099,7 +1102,7 @@ public void testPerReplicaStateCollection() throws Exception {
     c.forEachReplica((s, replica) -> assertNotNull(replica.getReplicaState()));
     PerReplicaStates prs = PerReplicaStates.fetch(ZkStateReader.getCollectionPath(testCollection), cluster.getZkClient(), null);
     assertEquals(4, prs.states.size());
-
+    JettySolrRunner jsr = cluster.startJettySolrRunner();

Review comment:
   This is a cluster created for this particular test.

##
File path: solr/core/src/test/org/apache/solr/cloud/NodeMutatorTest.java
##
@@ -43,7 +43,7 @@
 
   @Test
   public void downNodeReportsAllImpactedCollectionsAndNothingElse() throws IOException {
-    NodeMutator nm = new NodeMutator();
+    NodeMutator nm = new NodeMutator(null);

Review comment:
   It's just a test case. Ideally, nobody should create it without a ZkClient.







[GitHub] [lucene] mocobeta commented on pull request #56: LUCENE-9883: Turn on ecj missingEnumCaseDespiteDefault setting

2021-04-01 Thread GitBox


mocobeta commented on pull request #56:
URL: https://github.com/apache/lucene/pull/56#issuecomment-811843729


   OK, it doesn't seem to be a fun job anyway; let's leave it as it is, and 
please turn your efforts towards things you find more enjoyable or beneficial, 
@zacharymorn  :)





[GitHub] [lucene] dweiss opened a new pull request #60: LUCENE-9872: Make the most painful tasks in regenerate fully incremental

2021-04-01 Thread GitBox


dweiss opened a new pull request #60:
URL: https://github.com/apache/lucene/pull/60


   This also adds persistence of checksums and checksum validation for selected 
generation tasks.





[jira] [Commented] (LUCENE-9872) Make the most painful tasks in regenerate fully incremental

2021-04-01 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313129#comment-17313129
 ] 

Dawid Weiss commented on LUCENE-9872:
-

I filed a pull request. Getting this right (and easy) is surprisingly 
difficult. I've added checksum validation and skipping to the most problematic 
pieces (jflex, javacc, moman-derived sources). Works for me. Making it work for 
everything (binary derived resources) requires jumping through hoops. I think 
I'm happy with how it is now.
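
The idea of "checksum validation and skipping" can be sketched in plain Java as follows. This is only an illustration under assumptions: the class and method names are hypothetical, and the real logic lives in Lucene's Gradle build scripts, which this does not reproduce. The sketch hashes a task's files with SHA-256 and reports that regeneration can be skipped when the checksums match the ones saved after the previous run.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a hypothetical up-to-date check based on file checksums.
public class ChecksumSkipSketch {

  /** Hex-encoded SHA-256 of the given bytes. */
  static String sha256Hex(byte[] data) {
    try {
      byte[] hash = MessageDigest.getInstance("SHA-256").digest(data);
      StringBuilder hex = new StringBuilder();
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  /** Checksums of all given files, keyed by path, in a stable order. */
  static Map<String, String> checksums(List<Path> files) {
    Map<String, String> result = new TreeMap<>();
    for (Path file : files) {
      try {
        result.put(file.toString(), sha256Hex(Files.readAllBytes(file)));
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return result;
  }

  /** True when the current checksums equal the saved ones, so the task can be skipped. */
  static boolean upToDate(Map<String, String> saved, List<Path> files) {
    return saved.equals(checksums(files));
  }

  public static void main(String[] args) throws IOException {
    Path input = Files.createTempFile("grammar", ".jflex");
    Files.write(input, "v1".getBytes(StandardCharsets.UTF_8));
    Map<String, String> saved = checksums(List.of(input));
    System.out.println("unchanged, skip: " + upToDate(saved, List.of(input)));

    // Any edit to an input (or a hand-edit of a generated output) changes its checksum.
    Files.write(input, "v2".getBytes(StandardCharsets.UTF_8));
    System.out.println("modified, skip:  " + upToDate(saved, List.of(input)));
    Files.delete(input);
  }
}
```

Tracking the generated outputs alongside the inputs is what lets the same check detect accidental manual changes to generated files, not just stale inputs.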

> Make the most painful tasks in regenerate fully incremental
> ---
>
> Key: LUCENE-9872
> URL: https://issues.apache.org/jira/browse/LUCENE-9872
> Project: Lucene - Core
> Issue Type: Sub-task
> Affects Versions: main (9.0)
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This is particularly important for that one jflex task that is currently a 
> mood-killer.






[jira] [Resolved] (LUCENE-9868) Verify checksums on generated files

2021-04-01 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-9868.
-
Resolution: Duplicate

> Verify checksums on generated files
> ---
>
> Key: LUCENE-9868
> URL: https://issues.apache.org/jira/browse/LUCENE-9868
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Attachments: quietExec.patch
>
>
> This would prevent accidental changes to generated resources/ files.






[jira] [Updated] (LUCENE-9897) Use gradle's built-in artifact checksum verification

2021-04-01 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-9897:

Parent: (was: LUCENE-9871)
Issue Type: Task  (was: Sub-task)

> Use gradle's built-in artifact checksum verification
> 
>
> Key: LUCENE-9897
> URL: https://issues.apache.org/jira/browse/LUCENE-9897
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Dawid Weiss
> Priority: Minor
>
> This is not something I'll be working on, but just for reference: 
> https://docs.gradle.org/current/userguide/dependency_verification.html
> This could replace the manual code we currently use to validate dependency 
> JARs. I know nothing about how gradle's system works (or if it's going to 
> clash with palantir's version resolution, for example).






[GitHub] [lucene] mayya-sharipova edited a comment on pull request #11: LUCENE-9334 Consistency of field data structures

2021-04-01 Thread GitBox


mayya-sharipova edited a comment on pull request #11:
URL: https://github.com/apache/lucene/pull/11#issuecomment-805296177


   I've run indexing benchmarks using 
[luceneutil](https://github.com/mikemccand/luceneutil). Here are the 
results:
   
   - indexing time in ms
   - baseline: master branch
   - candidate: this PR
   
   
   | Dataset        | Baseline | Candidate | Difference |
   | :---           |     ---: |      ---: |       ---: |
   | wikimedium500k |    98387 |    106189 |       7.9% |
   | wikimedium1m   |   174246 |    177075 |       1.6% |
   | wikimedium10m  |  1356184 |   1359149 |       0.2% |
   
   ---
   
   [wikimedium1m profiler](https://gist.github.com/mayya-sharipova/b4c8f47165a4bde8d2625487d2132319)
   
   | CPU profile % samples, Baseline | CPU profile % samples, Candidate |
   | :--- | :--- |
   | 0.73% 783 `IndexingChain$PerField#invert` | 0.80% 864 `IndexingChain#getOrAddPerField` |
   | | 0.61% 658 `IndexingChain$FieldSchema#` |
   | | 0.58% 633 `IndexingChain#processField` |
   
   
   ---
   [wikimedium10m profiler](https://gist.github.com/mayya-sharipova/68cf6d543863029777ad3028c662ccd1):
   
   Extracting everything related to `IndexingChain` from the CPU profile, we can 
see that in **Candidate** there is overhead from `assertSameSchema`, which is 
part of `processDocument`.
   
   | CPU profile % samples, Baseline | CPU profile % samples, Candidate |
   | :--- | :--- |
   | 0.90% 8259 `IndexingChain$PerField#invert` | 1.00% 9162 `IndexingChain#getOrAddPerField` |
   | 0.65% 5956 `IndexingChain#getOrAddField` | 0.89% 8091 `IndexingChain#processDocument` |
   | 0.56% 5161 `IndexingChain#processField` | 0.69% 6255 `IndexingChain$PerField#invert` |
   | | 0.55% 5044 `IndexingChain$FieldSchema#` |
   | | 0.52% 4744 `IndexingChain$FieldSchema#assertSameSchema` |
   
   cc @jpountz 





[GitHub] [lucene-solr] bruno-roustant merged pull request #2472: SOLR-15217: Use shardsWhitelist in ReplicationHandler.

2021-04-01 Thread GitBox


bruno-roustant merged pull request #2472:
URL: https://github.com/apache/lucene-solr/pull/2472


   





[GitHub] [lucene] jpountz commented on pull request #11: LUCENE-9334 Consistency of field data structures

2021-04-01 Thread GitBox


jpountz commented on pull request #11:
URL: https://github.com/apache/lucene/pull/11#issuecomment-811987053


   Thanks @mayya-sharipova for the benchmark. The overhead looks very 
reasonable to me; I don't think it should be a reason not to proceed with this 
change.





[GitHub] [lucene] mocobeta merged pull request #58: Ignore sdkmanrc file on Git

2021-04-01 Thread GitBox


mocobeta merged pull request #58:
URL: https://github.com/apache/lucene/pull/58


   





[GitHub] [lucene] jtibshirani commented on a change in pull request #59: LUCENE-9705: Move all test classes under lucene90 package

2021-04-01 Thread GitBox


jtibshirani commented on a change in pull request #59:
URL: https://github.com/apache/lucene/pull/59#discussion_r605784161



##
File path: lucene/core/src/test/org/apache/lucene/codecs/lucene90/TestLucene90LiveDocsFormat.java
##
@@ -14,7 +14,7 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.lucene.codecs.lucene50;
+package org.apache.lucene.codecs.lucene90;

Review comment:
   Oops, thanks for fixing these.







[GitHub] [lucene] jpountz commented on a change in pull request #11: LUCENE-9334 Consistency of field data structures

2021-04-01 Thread GitBox


jpountz commented on a change in pull request #11:
URL: https://github.com/apache/lucene/pull/11#discussion_r605809392



##
File path: lucene/core/src/test/org/apache/lucene/document/TestPerFieldConsistency.java
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.document;
+
+import static com.carrotsearch.randomizedtesting.RandomizedTest.randomDouble;
+import static com.carrotsearch.randomizedtesting.RandomizedTest.randomFloat;
+import static com.carrotsearch.randomizedtesting.RandomizedTest.randomInt;
+import static com.carrotsearch.randomizedtesting.RandomizedTest.randomIntBetween;
+import static com.carrotsearch.randomizedtesting.RandomizedTest.randomLong;
+
+import com.carrotsearch.randomizedtesting.generators.RandomPicks;
+import java.io.IOException;
+import java.util.Random;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestPerFieldConsistency extends LuceneTestCase {
+
+  private static Field randomIndexedField(Random random, String fieldName) {
+    FieldType fieldType = new FieldType();
+    IndexOptions indexOptions = RandomPicks.randomFrom(random, IndexOptions.values());
+    while (indexOptions == IndexOptions.NONE) {
+      indexOptions = RandomPicks.randomFrom(random, IndexOptions.values());
+    }
+    fieldType.setIndexOptions(indexOptions);
+    fieldType.setStoreTermVectors(random.nextBoolean());
+    if (fieldType.storeTermVectors()) {
+      fieldType.setStoreTermVectorPositions(random.nextBoolean());
+      if (fieldType.storeTermVectorPositions()) {
+        fieldType.setStoreTermVectorPayloads(random.nextBoolean());
+        fieldType.setStoreTermVectorOffsets(random.nextBoolean());
+      }
+    }
+    fieldType.setOmitNorms(random.nextBoolean());
+    fieldType.setStored(random.nextBoolean());
+    fieldType.freeze();
+
+    return new Field(fieldName, "randomValue", fieldType);
+  }
+
+  private static Field randomPointField(Random random, String fieldName) {
+    switch (random.nextInt(4)) {
+      case 0:
+        return new LongPoint(fieldName, randomLong());
+      case 1:
+        return new IntPoint(fieldName, randomInt());
+      case 2:
+        return new DoublePoint(fieldName, randomDouble());
+      default:
+        return new FloatPoint(fieldName, randomFloat());
+    }
+  }
+
+  private static Field randomDocValuesField(Random random, String fieldName) {
+    switch (random.nextInt(4)) {
+      case 0:
+        return new BinaryDocValuesField(fieldName, new BytesRef("randomValue"));
+      case 1:
+        return new NumericDocValuesField(fieldName, randomLong());
+      case 2:
+        return new DoubleDocValuesField(fieldName, randomDouble());
+      default:
+        return new SortedSetDocValuesField(fieldName, new BytesRef("randomValue"));
+    }
+  }
+
+  private static Field randomVectorField(Random random, String fieldName) {
+    VectorValues.SearchStrategy searchStrategy =
+        RandomPicks.randomFrom(random, VectorValues.SearchStrategy.values());
+    while (searchStrategy == VectorValues.SearchStrategy.NONE) {
+      searchStrategy = RandomPicks.randomFrom(random, VectorValues.SearchStrategy.values());
+    }
+    float[] values = new float[randomIntBetween(1, 10)];
+    for (int i = 0; i < values.length; i++) {
+      values[i] = randomFloat();
+    }
+    return new VectorField(fieldName, values, searchStrategy);
+  }
+
+  private static Field[] randomFieldsWithTheSameName(String fieldName) {
+    final Field textField = randomIndexedField(random(), fieldName);
+    final Field docValuesField = randomDocValuesField(random(), fieldName);
+    final Field pointField = randomPointField(random(), fieldName);
+    final Field vectorField = randomVectorField(random(), fieldName);
+    return new Field[] {textFi

[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat

2021-04-01 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313509#comment-17313509
 ] 

Julie Tibshirani commented on LUCENE-9855:
--

To me it seems best to avoid tying the format name to HNSW. It's very possible 
that we'll evolve the implementation, as ANN is a developing area. I don't 
think we typically mention a specific algorithm/data structure in format 
names; for example, {{PointsFormat}} doesn't mention BKD trees.

{{NeighborsFormat}} also doesn't feel precise to me. We support NN search on 
points, so the name doesn't clearly distinguish this format. And in the future, 
the format may offer other operations on high-dimensional vectors, like radius 
queries.

My current favorite is {{NumericVectorsFormat}}, then {{VectorValuesFormat}}. 
{{DenseVectorFormat}} could work too (as long as we don't add sparse 
high-dimensional vectors!) but I understand [~sokolov]'s concern around 'dense' 
having multiple meanings.

> Reconsider codec name VectorFormat
> --
>
> Key: LUCENE-9855
> URL: https://issues.apache.org/jira/browse/LUCENE-9855
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: main (9.0)
> Reporter: Tomoko Uchida
> Priority: Blocker
>
> There is some discussion about the codec name for ANN search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> The main points here are 1) use the plural form for consistency, and 2) use a 
> more specific name for ANN search (the second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat


