[GitHub] [lucene] dweiss commented on pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification

2021-04-27 Thread GitBox


dweiss commented on pull request #108:
URL: https://github.com/apache/lucene/pull/108#issuecomment-827381909


   Thanks, this looks suspiciously simple!... :) I'll be glad to experiment 
with it a bit. 
   
   I'm not a big fan of the monolithic checksum file - the expanded version 
(per-jar checksum) seems easier. 
   
   Checksums should only be generated for a subset of configurations - I don't 
think it's realistic to assume we can get checksums of everything (detached 
configurations, etc.).
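[Editor's note] The "expanded" per-jar approach dweiss prefers boils down to recording one SHA-256 per artifact and comparing it at build time. A minimal sketch with plain JDK classes (the class and method names here are illustrative, not the actual build logic):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class JarChecksum {
  // Hex-encoded SHA-256 of a jar's bytes, as a per-jar ".sha256"
  // sidecar file would store it.
  static String sha256(byte[] jarBytes) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest(jarBytes)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // SHA-256 is available on every JRE
    }
  }
}
```

Verification then reduces to comparing this string against the recorded checksum for each dependency, one small file per jar rather than one monolithic list.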


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on a change in pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification

2021-04-27 Thread GitBox


dweiss commented on a change in pull request #108:
URL: https://github.com/apache/lucene/pull/108#discussion_r620938812



##
File path: gradle/validation/jar-checks.gradle
##
@@ -140,41 +139,6 @@ subprojects {
 }
   }
 
-  // Verifies that each JAR has a corresponding checksum and that it matches 
actual JAR available for this dependency.
-  task validateJarChecksums() {

Review comment:
   Is there any way we can leave this task in place (empty), with a dependency on 
whatever task gradle generates for checksum validation?

##
File path: gradle/validation/jar-checks.gradle
##
@@ -242,62 +206,14 @@ subprojects {
 }
   }
 
-  licenses.dependsOn validateJarChecksums, validateJarLicenses
+  licenses.dependsOn validateJarLicenses
 }
 
 // Add top-project level tasks validating dangling files
 // and regenerating dependency checksums.
 
 configure(project(":lucene")) {
   def validationTasks = subprojects.collectMany { it.tasks.matching { it.name 
== "licenses" } }
-  def jarInfoTasks = subprojects.collectMany { it.tasks.matching { it.name == 
"collectJarInfos" } }
-
-  // Update dependency checksums.
-  task updateLicenses() {

Review comment:
   Same here. I'd leave this task and use:
   ```
   ./gradlew --write-verification-metadata sha256 updateLicenses
   ```
   
   I hate having to remember these option switches... the task could verify in 
doFirst that they're in place, maybe with a hint on how to issue the full command 
properly if they're missing. Or, alternatively, it could be a GradleBuild task 
that would recursively invoke the same build with the right options...

##
File path: gradle/verification-metadata.xml
##
@@ -0,0 +1,2198 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<verification-metadata xmlns="https://schema.gradle.org/dependency-verification" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://schema.gradle.org/dependency-verification https://schema.gradle.org/dependency-verification/dependency-verification-1.0.xsd">
+   <configuration>
+      <verify-metadata>true</verify-metadata>
+      <verify-signatures>false</verify-signatures>
+   </configuration>
+   <components>
+      [per-component checksum entries were stripped in the plain-text conversion of this archive]

Review comment:
   We only need checksums for a subset of configurations (like before). I'm 
pretty sure this dependency is from a plugin somewhere, not from Lucene code.







[GitHub] [lucene] jpountz commented on a change in pull request #107: LUCENE-9047: Move the Directory APIs to be little endian (take 2)

2021-04-27 Thread GitBox


jpountz commented on a change in pull request #107:
URL: https://github.com/apache/lucene/pull/107#discussion_r620941684



##
File path: lucene/core/src/java/org/apache/lucene/codecs/CodecUtil.java
##
@@ -640,6 +640,33 @@ static void writeCRC(IndexOutput output) throws 
IOException {
   throw new IllegalStateException(
   "Illegal CRC-32 checksum: " + value + " (resource=" + output + ")");
 }
-output.writeLong(value);
+writeLong(output, value);
+  }
+
+  /** write int value on header / footer */
+  public static void writeInt(DataOutput out, int i) throws IOException {

Review comment:
   Maybe say explicitly on these methods that they write in big-endian 
order?
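[Editor's note] "Big-endian order" here means the most significant byte is written first, which is what the traditional DataOutput encoding does. A standalone illustration (not part of the patch):

```java
public class BigEndianInt {
  // Encode an int as 4 bytes, most significant byte first (big-endian).
  static byte[] writeInt(int i) {
    return new byte[] {(byte) (i >>> 24), (byte) (i >>> 16), (byte) (i >>> 8), (byte) i};
  }

  // Decode 4 big-endian bytes back into an int.
  static int readInt(byte[] b) {
    return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16) | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
  }
}
```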

##
File path: 
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/store/DirectoryUtil.java
##
@@ -0,0 +1,56 @@
+package org.apache.lucene.backward_codecs.store;
+
+import java.io.IOException;
+import org.apache.lucene.store.ChecksumIndexInput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+
+/**
+ * Utility class to wrap open files
+ *
+ * @lucene.internal
+ */
+public final class DirectoryUtil {

Review comment:
   Give it a more descriptive name, e.g. `EndiannessReverserUtil` or 
something along these lines for consistency with the input/output wrapper names?
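[Editor's note] An "EndiannessReverser"-style wrapper simply swaps byte order before delegating to the underlying input/output. Sketched with the JDK's built-ins (hypothetical class name, not the actual wrapper):

```java
public class EndiannessReverser {
  // Reversing the byte order of a value converts between the big- and
  // little-endian representations of the same bits.
  static int reverse(int value) {
    return Integer.reverseBytes(value);
  }

  static long reverse(long value) {
    return Long.reverseBytes(value);
  }
}
```

Applying the reversal twice is the identity, which is why a reader wrapper and a writer wrapper built this way round-trip cleanly.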







[GitHub] [lucene] jpountz commented on a change in pull request #107: LUCENE-9047: Move the Directory APIs to be little endian (take 2)

2021-04-27 Thread GitBox


jpountz commented on a change in pull request #107:
URL: https://github.com/apache/lucene/pull/107#discussion_r620951695



##
File path: 
lucene/core/src/java/org/apache/lucene/store/ByteBufferIndexInput.java
##
@@ -177,7 +177,7 @@ public void readLELongs(long[] dst, int offset, int length) 
throws IOException {
   }
 
   @Override
-  public final void readLEFloats(float[] floats, int offset, int len) throws 
IOException {
+  public final void readFloats(float[] floats, int offset, int len) throws 
IOException {
 // See notes about readELongs above

Review comment:
   ```suggestion
   // See notes about readLongs above
   ```







[GitHub] [lucene] dweiss commented on a change in pull request #107: LUCENE-9047: Move the Directory APIs to be little endian (take 2)

2021-04-27 Thread GitBox


dweiss commented on a change in pull request #107:
URL: https://github.com/apache/lucene/pull/107#discussion_r620962715



##
File path: 
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/packed/LegacyDirectReader.java
##
@@ -0,0 +1,368 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.backward_codecs.packed;
+
+import java.io.IOException;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.LongValues;
+
+/**
+ * Retrieves an instance previously written by {@link LegacyDirectWriter}
+ *
+ * Example usage:
+ *
+ * 
+ *   int bitsPerValue = 100;
+ *   IndexInput in = dir.openInput("packed", IOContext.DEFAULT);
+ *   LongValues values = 
LegacyDirectReader.getInstance(in.randomAccessSlice(start, end), bitsPerValue);
+ *   for (int i = 0; i < numValues; i++) {
+ * long value = values.get(i);
+ *   }
+ * 
+ *
+ * @see LegacyDirectWriter
+ */
+public class LegacyDirectReader {
+
+  private LegacyDirectReader() {
+// no instances
+  }
+
+  /**
+   * Retrieves an instance from the specified slice written decoding {@code 
bitsPerValue} for each
+   * value
+   */
+  public static LongValues getInstance(RandomAccessInput slice, int 
bitsPerValue) {
+return getInstance(slice, bitsPerValue, 0);
+  }
+
+  /**
+   * Retrieves an instance from the specified {@code offset} of the given 
slice decoding {@code
+   * bitsPerValue} for each value
+   */
+  public static LongValues getInstance(RandomAccessInput slice, int 
bitsPerValue, long offset) {
+switch (bitsPerValue) {
+  case 1:
+return new DirectPackedReader1(slice, offset);
+  case 2:
+return new DirectPackedReader2(slice, offset);
+  case 4:
+return new DirectPackedReader4(slice, offset);
+  case 8:
+return new DirectPackedReader8(slice, offset);
+  case 12:
+return new DirectPackedReader12(slice, offset);
+  case 16:
+return new DirectPackedReader16(slice, offset);
+  case 20:
+return new DirectPackedReader20(slice, offset);
+  case 24:
+return new DirectPackedReader24(slice, offset);
+  case 28:
+return new DirectPackedReader28(slice, offset);
+  case 32:
+return new DirectPackedReader32(slice, offset);
+  case 40:
+return new DirectPackedReader40(slice, offset);
+  case 48:
+return new DirectPackedReader48(slice, offset);
+  case 56:
+return new DirectPackedReader56(slice, offset);
+  case 64:
+return new DirectPackedReader64(slice, offset);
+  default:
+throw new IllegalArgumentException("unsupported bitsPerValue: " + 
bitsPerValue);
+}
+  }
+
+  static final class DirectPackedReader1 extends LongValues {
+final RandomAccessInput in;
+final long offset;
+
+DirectPackedReader1(RandomAccessInput in, long offset) {
+  this.in = in;
+  this.offset = offset;
+}
+
+@Override
+public long get(long index) {
+  try {
+int shift = 7 - (int) (index & 7);
+return (in.readByte(offset + (index >>> 3)) >>> shift) & 0x1;
+  } catch (IOException e) {
+throw new RuntimeException(e);
+  }
+}
+  }
+
+  static final class DirectPackedReader2 extends LongValues {
+final RandomAccessInput in;
+final long offset;
+
+DirectPackedReader2(RandomAccessInput in, long offset) {
+  this.in = in;
+  this.offset = offset;
+}
+
+@Override
+public long get(long index) {
+  try {
+int shift = (3 - (int) (index & 3)) << 1;
+return (in.readByte(offset + (index >>> 2)) >>> shift) & 0x3;
+  } catch (IOException e) {
+throw new RuntimeException(e);
+  }
+}
+  }
+
+  static final class DirectPackedReader4 extends LongValues {
+final RandomAccessInput in;
+final long offset;
+
+DirectPackedReader4(RandomAccessInput in, long offset) {
+  this.in = in;
+  this.offset = offset;
+}
+
+@Override
+public long get(long index) {
+  try {
+int shift = (int) ((index + 1) & 1) << 2;
+return (in.readByte(offset + (index >>> 1)) >>> shift) & 0xF;
+  } catch (IOException e) {
+   

[jira] [Commented] (LUCENE-8069) Allow index sorting by field length

2021-04-27 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333040#comment-17333040
 ] 

Adrien Grand commented on LUCENE-8069:
--

Since I was playing with the MSMarco passages dataset for other reasons I 
wanted to give this change a try again with the first 1000 queries from the 
`eval` file. Unlike the wikipedia tasks file, queries in this dataset have many 
terms, often 5+, sometimes even 10+. All of them are disjunctions.

Lucene defaults:
 - avg: 11ms
 - median: 6ms
 - p90: 28ms
 - p99: 80ms

Index sorted by increasing field length:
 - avg: 7ms
 - median: 2ms
 - p90: 6ms
 - p99: 17ms

This seems to confirm that this approach could be very valuable.

> Allow index sorting by field length
> ---
>
> Key: LUCENE-8069
> URL: https://issues.apache.org/jira/browse/LUCENE-8069
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by 
> field length would mean we would be likely to collect best matches first. 
> Depending on the similarity implementation, this might even allow to early 
> terminate collection of top documents on term queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8069) Allow index sorting by field length

2021-04-27 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333044#comment-17333044
 ] 

Adrien Grand commented on LUCENE-8069:
--

bq. I guess people wanting these benefits today without any changes to Lucene 
could simply add a norm-like field (e.g. sum of raw char lengths of all 
tokenized fields) and then configure Lucene to sort on that. Would that work?

One thing that occurred to me recently is that we could make indexing faster if 
we actually used the norm instead of requiring users to index some form of proxy 
for the length normalization factor: because Lucene encodes norms on bytes, 
norms are low-cardinality fields, which in turn gives us more options to make 
indexing faster when sorting is enabled via something like LUCENE-9935 (stored 
fields merging is currently a major bottleneck when doing bulk indexing with 
index sorting enabled).

> Allow index sorting by field length
> ---
>
> Key: LUCENE-8069
> URL: https://issues.apache.org/jira/browse/LUCENE-8069
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by 
> field length would mean we would be likely to collect best matches first. 
> Depending on the similarity implementation, this might even allow to early 
> terminate collection of top documents on term queries.






[jira] [Updated] (LUCENE-9939) Proper ASCII folding of Danish/Norwegian characters Ø, Å

2021-04-27 Thread Jacob Lauritzen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Lauritzen updated LUCENE-9939:

Status: Patch Available  (was: Open)

> Proper ASCII folding of Danish/Norwegian characters Ø, Å
> 
>
> Key: LUCENE-9939
> URL: https://issues.apache.org/jira/browse/LUCENE-9939
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Jacob Lauritzen
>Priority: Minor
>  Labels: easyfix
>
> The current version of the ASCIIFoldingFilter sets Å, å to A, a and Ø, ø to 
> O, o which I believe is incorrect.
> Å was added by Norway as a replacement for the Aa (which is mapped to aa in 
> the AsciiFoldingFilter) in 1917 and by Denmark in 1948. Aa is still used in a 
> lot of names (as an example the second largest city in Denmark was originally 
> named Aarhus, renamed to Århus in 1948 and named back to AArhus in 2010 for 
> internationalization purposes).
> The story of Ø is similar. It's equivalent to Œ (which is mapped to oe), not 
> ö (which is mapped to o) and is generally mapped to oe in ascii text.
> The third Danish character Æ is already properly mapped to AE.






[jira] [Created] (LUCENE-9939) Proper ASCII folding of Danish/Norwegian characters Ø, Å

2021-04-27 Thread Jacob Lauritzen (Jira)
Jacob Lauritzen created LUCENE-9939:
---

 Summary: Proper ASCII folding of Danish/Norwegian characters Ø, Å
 Key: LUCENE-9939
 URL: https://issues.apache.org/jira/browse/LUCENE-9939
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Jacob Lauritzen


The current version of the ASCIIFoldingFilter sets Å, å to A, a and Ø, ø to O, 
o which I believe is incorrect.

Å was added by Norway as a replacement for the Aa (which is mapped to aa in the 
AsciiFoldingFilter) in 1917 and by Denmark in 1948. Aa is still used in a lot 
of names (as an example the second largest city in Denmark was originally named 
Aarhus, renamed to Århus in 1948 and named back to AArhus in 2010 for 
internationalization purposes).

The story of Ø is similar. It's equivalent to Œ (which is mapped to oe), not ö 
(which is mapped to o) and is generally mapped to oe in ascii text.

The third Danish character Æ is already properly mapped to AE.
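[Editor's note] The mapping proposed in this issue can be sketched as a simple character expansion. This is purely illustrative of the proposal, not what Lucene's ASCIIFoldingFilter currently does (which maps å to a and ø to o):

```java
public class ProposedDanishFolding {
  // Folds the Danish/Norwegian letters as the issue proposes:
  // å -> aa, ø -> oe, and æ -> ae as already done today.
  static String fold(String s) {
    StringBuilder out = new StringBuilder();
    for (char c : s.toCharArray()) {
      switch (c) {
        case '\u00e5': out.append("aa"); break; // å
        case '\u00c5': out.append("Aa"); break; // Å
        case '\u00f8': out.append("oe"); break; // ø
        case '\u00d8': out.append("Oe"); break; // Ø
        case '\u00e6': out.append("ae"); break; // æ
        case '\u00c6': out.append("AE"); break; // Æ (unchanged from today)
        default: out.append(c);
      }
    }
    return out.toString();
  }
}
```

Under this mapping the issue's Århus example folds to "Aarhus" rather than today's "Arhus".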






[jira] [Updated] (LUCENE-9939) Proper ASCII folding of Danish/Norwegian characters Ø, Å

2021-04-27 Thread Jacob Lauritzen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Lauritzen updated LUCENE-9939:

Attachment: LUCENE-9939.patch
Labels: easyfix patch patch-available  (was: easyfix)
Status: Patch Available  (was: Patch Available)

> Proper ASCII folding of Danish/Norwegian characters Ø, Å
> 
>
> Key: LUCENE-9939
> URL: https://issues.apache.org/jira/browse/LUCENE-9939
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Jacob Lauritzen
>Priority: Minor
>  Labels: easyfix, patch, patch-available
> Attachments: LUCENE-9939.patch
>
>
> The current version of the ASCIIFoldingFilter sets Å, å to A, a and Ø, ø to 
> O, o which I believe is incorrect.
> Å was added by Norway as a replacement for the Aa (which is mapped to aa in 
> the AsciiFoldingFilter) in 1917 and by Denmark in 1948. Aa is still used in a 
> lot of names (as an example the second largest city in Denmark was originally 
> named Aarhus, renamed to Århus in 1948 and named back to AArhus in 2010 for 
> internationalization purposes).
> The story of Ø is similar. It's equivalent to Œ (which is mapped to oe), not 
> ö (which is mapped to o) and is generally mapped to oe in ascii text.
> The third Danish character Æ is already properly mapped to AE.






[jira] [Commented] (LUCENE-9939) Proper ASCII folding of Danish/Norwegian characters Ø, Å

2021-04-27 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333128#comment-17333128
 ] 

Robert Muir commented on LUCENE-9939:
-

This isn't the way to go: these aren't the only languages that use these letters, 
so we shouldn't change the mapping in a way that only makes sense for them.

Place ScandinavianFoldingFilter or ScandinavianNormalizationFilter in your 
analysis chain before this thing: 
* 
https://lucene.apache.org/core/8_8_2/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html
* 
https://lucene.apache.org/core/8_8_2/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html

> Proper ASCII folding of Danish/Norwegian characters Ø, Å
> 
>
> Key: LUCENE-9939
> URL: https://issues.apache.org/jira/browse/LUCENE-9939
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Jacob Lauritzen
>Priority: Minor
>  Labels: easyfix, patch, patch-available
> Attachments: LUCENE-9939.patch
>
>
> The current version of the ASCIIFoldingFilter sets Å, å to A, a and Ø, ø to 
> O, o which I believe is incorrect.
> Å was added by Norway as a replacement for the Aa (which is mapped to aa in 
> the AsciiFoldingFilter) in 1917 and by Denmark in 1948. Aa is still used in a 
> lot of names (as an example the second largest city in Denmark was originally 
> named Aarhus, renamed to Århus in 1948 and named back to AArhus in 2010 for 
> internationalization purposes).
> The story of Ø is similar. It's equivalent to Œ (which is mapped to oe), not 
> ö (which is mapped to o) and is generally mapped to oe in ascii text.
> The third Danish character Æ is already properly mapped to AE.






[GitHub] [lucene-solr] noblepaul merged pull request #2481: SOLR-15337 Avoid XPath in solrconfig.xml parsing

2021-04-27 Thread GitBox


noblepaul merged pull request #2481:
URL: https://github.com/apache/lucene-solr/pull/2481


   





[jira] [Commented] (LUCENE-8069) Allow index sorting by field length

2021-04-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333213#comment-17333213
 ] 

Michael McCandless commented on LUCENE-8069:


{quote}This seems to confirm that this approach could be very valuable.
{quote}
+1

I would expect that NOT insisting on the total hit count (the way Lucene now 
defaults) is the more common use case, so this optimization is indeed compelling.

> Allow index sorting by field length
> ---
>
> Key: LUCENE-8069
> URL: https://issues.apache.org/jira/browse/LUCENE-8069
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by 
> field length would mean we would be likely to collect best matches first. 
> Depending on the similarity implementation, this might even allow to early 
> terminate collection of top documents on term queries.






[jira] [Created] (LUCENE-9940) The order of disjuncts in DisjunctionMaxQuery affects equals() impl

2021-04-27 Thread Alan Woodward (Jira)
Alan Woodward created LUCENE-9940:
-

 Summary: The order of disjuncts in DisjunctionMaxQuery affects 
equals() impl
 Key: LUCENE-9940
 URL: https://issues.apache.org/jira/browse/LUCENE-9940
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Alan Woodward
Assignee: Alan Woodward


DisjunctionMaxQuery stores its disjuncts in a java array, and its equals() 
implementation uses Arrays.equals() when checking equality.  This means that two 
queries with the same disjuncts but added in a different order will compare as 
different, even though their results will be identical.  We should replace the 
array with a Set.






[jira] [Commented] (LUCENE-9940) The order of disjuncts in DisjunctionMaxQuery affects equals() impl

2021-04-27 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333278#comment-17333278
 ] 

Adrien Grand commented on LUCENE-9940:
--

+1, it might need to be a MultiSet in order to preserve scoring in the case 
when tieBreakerMultiplier is not 0?
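[Editor's note] The MultiSet idea can be sketched as an occurrence-count comparison, shown here with strings standing in for Query objects (an illustration, not the eventual patch): order is ignored, but duplicate disjuncts, which affect scoring when tieBreakerMultiplier is non-zero, still matter.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DisjunctEquality {
  // Two disjunct lists compare equal iff each element occurs the same
  // number of times in both -- order-insensitive, duplicate-sensitive.
  static <T> boolean sameDisjuncts(List<T> a, List<T> b) {
    return counts(a).equals(counts(b));
  }

  private static <T> Map<T, Integer> counts(List<T> list) {
    Map<T, Integer> m = new HashMap<>();
    for (T t : list) {
      m.merge(t, 1, Integer::sum);
    }
    return m;
  }
}
```

A plain Set would collapse duplicates and change scoring; the count map keeps them distinct.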

> The order of disjuncts in DisjunctionMaxQuery affects equals() impl
> ---
>
> Key: LUCENE-9940
> URL: https://issues.apache.org/jira/browse/LUCENE-9940
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
>
> DisjunctionMaxQuery stores its disjuncts in a java array, and its equals() 
> implementation uses Arrays.equals() when checking equality.  This means that 
> two queries with the same disjuncts but added in a different order will 
> compare as different, even though their results will be identical.  We should 
> replace the array with a Set.






[jira] [Commented] (LUCENE-9204) Move span queries to the queries module

2021-04-27 Thread Alan Woodward (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1725#comment-1725
 ] 

Alan Woodward commented on LUCENE-9204:
---

I'd like to try and get this change in for 9.0, which seems like as good a time 
as any to move groups of queries around.

> Move span queries to the queries module
> ---
>
> Key: LUCENE-9204
> URL: https://issues.apache.org/jira/browse/LUCENE-9204
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
>
> We have a slightly odd situation currently, with two parallel query 
> structures for building complex positional queries: the long-standing span 
> queries, in core; and interval queries, in the queries module.  Given that 
> interval queries solve at least some of the problems we've had with Spans, I 
> think we should be pushing users more towards these implementations.  It's 
> counter-intuitive to do that when Spans are in core though.  I've opened this 
> issue to discuss moving the spans package as a whole to the queries module.






[GitHub] [lucene] neoremind commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building

2021-04-27 Thread GitBox


neoremind commented on pull request #91:
URL: https://github.com/apache/lucene/pull/91#issuecomment-827678981


   I spent some time trying to run the real-case benchmark. The speedup of 
`IndexWriter` is what we expected: compared to the main branch, total elapsed 
time (including adding docs, building the index, and merging) decreased by about 
20%. If we consider only `flush_time`, the speedup is more obvious: time cost 
drops by about 40% - 50%.
   
   1) Run 
[IndexAndSearchOpenStreetMaps1D.java](https://github.com/neoremind/luceneutil/blob/master/src/main/perf/IndexAndSearchOpenStreetMaps1D.java)
 against the two branches and capture the 
[log](https://github.com/neoremind/luceneutil/tree/master/log/OpenStreetMaps).
   _note: the query stage is commented out, and some of the code is modified to 
adapt to the latest Lucene main branch._
   
   main branch:
   ```
   # egrep "flush time|sec to build index" open-street-maps.log
   DWPT 0 [2021-04-27T11:33:04.518908Z; main]: flush time 17284.537739 msec
   DWPT 0 [2021-04-27T11:33:37.888449Z; main]: flush time 12039.476885 msec
   72.49147722 sec to build index
   ```
   PR branch:
   ```
   #egrep "flush time|sec to build index" open-street-maps-optimized.log
   DWPT 0 [2021-04-27T11:35:00.619683Z; main]: flush time 9313.007647 msec
   DWPT 0 [2021-04-27T11:35:29.575254Z; main]: flush time 8631.820226 msec
   59.252797133 sec to build index
   ```
   
   2) Furthermore, I came up with the idea of using the TPC-H LINEITEM table to 
verify. I have a 10GB TPC-H dataset and developed a new test case that imports 
the first 5 INT fields, which is more typical of real-world data.
   
   Run 
[IndexAndSearchTpcHLineItem.java](https://github.com/neoremind/luceneutil/blob/master/src/main/perf/IndexAndSearchTpcHLineItem.java)
 against the two branches and capture the 
[log](https://github.com/neoremind/luceneutil/tree/master/log/TPC-H-LINEITEM).
   
   main branch:
   ```
   egrep "flush time|sec to build index" tpch-lineitem.log
   DWPT 0 [2021-04-27T11:17:25.329006Z; main]: flush time 13850.23328 msec
   DWPT 0 [2021-04-27T11:17:50.289370Z; main]: flush time 12228.723665 msec
   DWPT 0 [2021-04-27T11:18:15.546002Z; main]: flush time 12537.085005 msec
   DWPT 0 [2021-04-27T11:18:40.140413Z; main]: flush time 11819.225223 msec
   DWPT 0 [2021-04-27T11:19:04.850989Z; main]: flush time 12004.380921 msec
   DWPT 0 [2021-04-27T11:19:29.435183Z; main]: flush time 11850.273453 msec
   DWPT 0 [2021-04-27T11:19:54.016951Z; main]: flush time 11882.316067 msec
   DWPT 0 [2021-04-27T11:20:18.932727Z; main]: flush time 12223.151464 msec
   DWPT 0 [2021-04-27T11:20:43.522117Z; main]: flush time 11871.276323 msec
   DWPT 0 [2021-04-27T11:20:52.060300Z; main]: flush time 3422.434221 msec
   271.188917715 sec to build index
   ```
   PR branch:
   ```
   # egrep "flush time|sec to build index" tpch-lineitem-optimized.log
   DWPT 0 [2021-04-27T11:24:00.362128Z; main]: flush time 7573.05091 msec
   DWPT 0 [2021-04-27T11:24:19.498948Z; main]: flush time 7355.376016 msec
   DWPT 0 [2021-04-27T11:24:38.602117Z; main]: flush time 7287.306154 msec
   DWPT 0 [2021-04-27T11:24:57.541930Z; main]: flush time 7227.514396 msec
   DWPT 0 [2021-04-27T11:25:16.474158Z; main]: flush time 7236.208865 msec
   DWPT 0 [2021-04-27T11:25:35.339855Z; main]: flush time 7152.876269 msec
   DWPT 0 [2021-04-27T11:25:54.10Z; main]: flush time 7080.405571 msec
   DWPT 0 [2021-04-27T11:26:12.985489Z; main]: flush time 7188.012278 msec
   DWPT 0 [2021-04-27T11:26:31.857053Z; main]: flush time 7176.303704 msec
   DWPT 0 [2021-04-27T11:26:38.838771Z; main]: flush time 2185.742347 msec
   213.175509249 sec to build index
   ```
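[Editor's note] As a sanity check on the claimed percentages (an editorial helper, not part of the benchmark code): from the totals above, 72.49 s to 59.25 s is about an 18% drop and 271.19 s to 213.18 s about 21%, consistent with the "about 20%" figure; the first flush, 17284.5 ms to 9313.0 ms, drops about 46%.

```java
public class Reduction {
  // Percent reduction going from `before` to `after`, rounded to the
  // nearest integer.
  static long percentDrop(double before, double after) {
    return Math.round(100.0 * (before - after) / before);
  }
}
```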
   
   For benchmark command, please refer to [my 
document](https://github.com/neoremind/luceneutil/tree/master/command). 
   
   Test environment:
   ```
   CPU: 
   Architecture:  x86_64
   CPU op-mode(s):32-bit, 64-bit
   Byte Order:Little Endian
   CPU(s):32
   On-line CPU(s) list:   0-31
   Thread(s) per core:2
   Core(s) per socket:16
   Socket(s): 1
   NUMA node(s):  1
   Vendor ID: GenuineIntel
   CPU family:6
   Model: 85
   Model name:Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
   Stepping:  4
   CPU MHz:   2500.000
   BogoMIPS:  5000.00
   Hypervisor vendor: KVM
   Virtualization type:   full
   L1d cache: 32K
   L1i cache: 32K
   L2 cache:  1024K
   L3 cache:  33792K
   NUMA node0 CPU(s): 0-31
   
   Memory: 
   $cat /proc/meminfo
   MemTotal:   65703704 kB
   
   Disk: SATA 
   $fdisk -l | grep Disk
   Disk /dev/vdb: 35184.4 GB, 35184372088832 bytes, 68719476736 sectors
   
   OS: 
   Linux 4.19.57-15.1.al7.x86_64
   
   JDK:
   openjdk version "11.0.11" 2021-04-20 LTS
   OpenJDK Runtime Environment 18.9 (build 11.0.11+9-LTS)
   OpenJDK 64-Bit Server VM 18.9 (build 11.0.11+9-LTS, mixed mode, sharing)
   ```


-- 

[GitHub] [lucene] mayya-sharipova commented on pull request #103: Fix regression to account payloads while merging

2021-04-27 Thread GitBox


mayya-sharipova commented on pull request #103:
URL: https://github.com/apache/lucene/pull/103#issuecomment-827737112


   @jpountz Thank you for the review. I've added the test to `TestTermVectors` 
in dc660968003cbaf6bb80c59c78b34af67fdedc03





[GitHub] [lucene] gautamworah96 commented on a change in pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification

2021-04-27 Thread GitBox


gautamworah96 commented on a change in pull request #108:
URL: https://github.com/apache/lucene/pull/108#discussion_r621414185



##
File path: gradle/verification-metadata.xml
##
@@ -0,0 +1,2198 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<verification-metadata xmlns="https://schema.gradle.org/dependency-verification" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://schema.gradle.org/dependency-verification https://schema.gradle.org/dependency-verification/dependency-verification-1.0.xsd">
+   <configuration>
+      <verify-metadata>true</verify-metadata>
+      <verify-signatures>false</verify-signatures>
+   </configuration>
+   <components>
+      <!-- per-dependency component/artifact/sha256 checksum entries elided -->

Review comment:
   Yes. I decided to keep the `<verify-metadata>` flag on, which causes 
gradle to track metadata and transitive dependencies as well. Let me see if 
disabling this flag removes these plugins.
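For context, the flag being discussed lives in the `<configuration>` block of `gradle/verification-metadata.xml`. A minimal sketch of turning it off, based on the Gradle dependency-verification docs (values illustrative):

```xml
<verification-metadata xmlns="https://schema.gradle.org/dependency-verification">
   <configuration>
      <!-- false: verify only artifact checksums, not module metadata (POMs etc.) -->
      <verify-metadata>false</verify-metadata>
      <verify-signatures>false</verify-signatures>
   </configuration>
</verification-metadata>
```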







[GitHub] [lucene] dweiss commented on a change in pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification

2021-04-27 Thread GitBox


dweiss commented on a change in pull request #108:
URL: https://github.com/apache/lucene/pull/108#discussion_r621486745



##
File path: gradle/verification-metadata.xml
##
@@ -0,0 +1,2198 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<verification-metadata xmlns="https://schema.gradle.org/dependency-verification" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://schema.gradle.org/dependency-verification https://schema.gradle.org/dependency-verification/dependency-verification-1.0.xsd">
+   <configuration>
+      <verify-metadata>true</verify-metadata>
+      <verify-signatures>false</verify-signatures>
+   </configuration>
+   <components>
+      <!-- per-dependency component/artifact/sha256 checksum entries elided -->

Review comment:
   There should be a way to restrict this to only selected configurations, 
right? That is, only the dependencies of selected configurations. This would 
make things simpler, as it would correspond to what we had before, for example.







[GitHub] [lucene] gautamworah96 commented on a change in pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification

2021-04-27 Thread GitBox


gautamworah96 commented on a change in pull request #108:
URL: https://github.com/apache/lucene/pull/108#discussion_r621512372



##
File path: gradle/verification-metadata.xml
##
@@ -0,0 +1,2198 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<verification-metadata xmlns="https://schema.gradle.org/dependency-verification" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://schema.gradle.org/dependency-verification https://schema.gradle.org/dependency-verification/dependency-verification-1.0.xsd">
+   <configuration>
+      <verify-metadata>true</verify-metadata>
+      <verify-signatures>false</verify-signatures>
+   </configuration>
+   <components>
+      <!-- per-dependency component/artifact/sha256 checksum entries elided -->

Review comment:
   There 
[is](https://docs.gradle.org/6.8.1/userguide/dependency_verification.html#sub:disabling-specific-verification)!
 I'll tinker with it and see how it turns out. 
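The linked docs section covers trusted-artifact rules, which exempt matching artifacts from verification. A hedged sketch of what that looks like (the file patterns here are illustrative, taken from the format the Gradle docs describe, not from this PR):

```xml
<verification-metadata xmlns="https://schema.gradle.org/dependency-verification">
   <configuration>
      <trusted-artifacts>
         <!-- illustrative: skip checksum verification for sources/javadoc jars -->
         <trust file=".*-sources[.]jar" regex="true"/>
         <trust file=".*-javadoc[.]jar" regex="true"/>
      </trusted-artifacts>
   </configuration>
</verification-metadata>
```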







[GitHub] [lucene] gautamworah96 commented on pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification

2021-04-27 Thread GitBox


gautamworah96 commented on pull request #108:
URL: https://github.com/apache/lucene/pull/108#issuecomment-827847811


   > Thanks, this looks suspiciously simple!... :) I'll be glad to experiment 
with it a bit.
   
   💯 
   
   > 
   > I'm not a big fan of the monolithic checksum file - the expanded version 
(per-jar checksum) seems easier.
   
   I actually thought a single file would be better for editing, and for 
understanding all dependencies in one place.
   I don't think there is a way to give gradle multiple checksum file inputs 
at the moment.
   
   





[GitHub] [lucene] dweiss commented on pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification

2021-04-27 Thread GitBox


dweiss commented on pull request #108:
URL: https://github.com/apache/lucene/pull/108#issuecomment-827851298


   It's fine. I kind of prefer a filesystem (file name)-based correspondence of 
checksums to files, but I can live with a monolithic file too. 
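The file-name-based correspondence mentioned here (the approach of the old `jar-checks.gradle`: each jar sits next to a `<jar>.sha1` file holding its expected digest) can be sketched as follows. This is an illustration of the idea, not the actual build logic:

```python
import hashlib
from pathlib import Path

def verify_jar_checksum(jar_path: Path) -> bool:
    """Compare a jar's SHA-1 digest against a sibling '<name>.jar.sha1' file."""
    expected = Path(str(jar_path) + ".sha1").read_text().split()[0].strip().lower()
    actual = hashlib.sha1(jar_path.read_bytes()).hexdigest()
    return actual == expected
```

One upside of per-file checksums is that a mismatch points directly at the offending jar; with a monolithic file, the tooling has to report which entry failed.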





[jira] [Created] (LUCENE-9941) ann-benchmarks results for HNSW indexing

2021-04-27 Thread Julie Tibshirani (Jira)
Julie Tibshirani created LUCENE-9941:


 Summary: ann-benchmarks results for HNSW indexing
 Key: LUCENE-9941
 URL: https://issues.apache.org/jira/browse/LUCENE-9941
 Project: Lucene - Core
  Issue Type: Task
Reporter: Julie Tibshirani


This is a continuation of LUCENE-9937, but for HNSW index performance.

Approaches
 * LuceneVectorsOnly: a baseline that only indexes vectors
 * LuceneHnsw: our HNSW implementation, with a force merge to one segment
 * LuceneHnswNoForceMerge: our HNSW implementation without the force merge
 * hnswlib: a C++ HNSW implementation from the author of the paper

Datasets
 * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, euclidean 
distance
 * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, euclidean 
distance

*Results on sift-128-euclidean*
 Parameters: M=16, efConstruction=500
{code:java}
Approach                Index time (sec)
LuceneVectorsOnly                  14.93
LuceneHnsw                       3191.16
LuceneHnswNoForceMerge           1194.31
hnswlib                           311.09
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                Index time (sec)
LuceneVectorsOnly                  14.17
LuceneHnsw                       8940.41
LuceneHnswNoForceMerge           3623.68
hnswlib                           587.23
{code}
We force merge to one segment to emulate a case where vectors aren't 
continually being indexed. In these situations, it seems likely users would 
force merge to optimize search speed: searching a single large graph is 
expected to be faster than searching several small ones serially. To see how 
long the force merge takes, we can subtract LuceneHnswNoForceMerge from 
LuceneHnsw.
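Doing that subtraction with the numbers above gives an estimate of the force-merge (graph rebuild) cost, which lines up with the ~33 min and ~88 min figures cited for building only the final graph:

```python
# Force-merge time estimate = LuceneHnsw - LuceneHnswNoForceMerge (seconds)
sift_merge = 3191.16 - 1194.31   # sift-128-euclidean
glove_merge = 8940.41 - 3623.68  # glove-100-angular
print(round(sift_merge, 2), round(sift_merge / 60, 1))    # ~33 min
print(round(glove_merge, 2), round(glove_merge / 60, 1))  # ~88.6 min
```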

The construction parameters match those in LUCENE-9937 and are optimized for 
search recall + QPS instead of index speed, as I figured this would be a common 
set-up.

Some observations:
 * In cases when segments are eventually force merged, we do a lot of extra 
work building intermediate graphs that are eventually merged away. This is a 
difficult problem, and one that's been raised in the past. As a simple step, I 
wonder if we should not build graphs for segments that are below a certain 
size. For sufficiently small segments, it could be a better trade-off to avoid 
building a graph and support nearest-neighbor search through a brute-force scan?
 * Indexing is slow compared to what we're used to for other formats, even if 
we disregard the extra work mentioned above. For sift-128-euclidean, building 
only the final graph takes ~33 min, whereas for glove-100-angular it's ~88 min.
 * As a note, graph indexing uses ANN searches in order to add each new vector 
to the graph. So the slower search speed between Lucene and hnswlib may 
contribute to slower indexing.
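The "skip the graph for tiny segments" idea in the first observation could look roughly like this; the cutoff value and function names are hypothetical sketches, not Lucene APIs:

```python
import heapq

def knn_brute_force(query, vectors, k):
    """Exact k-NN by scanning every vector (squared euclidean distance)."""
    def dist2(v):
        return sum((a - b) ** 2 for a, b in zip(query, v))
    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: dist2(vectors[i]))

# Hypothetical cutoff: below this many vectors, a flat scan at search time
# may be a better trade-off than building (and later merging away) an HNSW graph.
GRAPH_MIN_SIZE = 10_000

def choose_strategy(num_vectors):
    return "hnsw" if num_vectors >= GRAPH_MIN_SIZE else "brute-force"
```

The appeal is that small segments are exactly the ones most likely to be merged away, so the graph-construction cost paid for them is the most likely to be wasted.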



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (LUCENE-9941) ann-benchmarks results for HNSW indexing

2021-04-27 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-9941:
-
Description: 
This is a continuation of LUCENE-9937, but for HNSW index performance.

Approaches
 * LuceneVectorsOnly: a baseline that only indexes vectors
 * LuceneHnsw: our HNSW implementation, with a force merge to one segment
 * LuceneHnswNoForceMerge: our HNSW implementation without the force merge
 * hnswlib: a C++ HNSW implementation from the author of the paper

Datasets
 * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, comparing 
euclidean distance
 * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, comparing 
cosine similarity

*Results on sift-128-euclidean*
 Parameters: M=16, efConstruction=500
{code:java}
Approach                Index time (sec)
LuceneVectorsOnly                  14.93
LuceneHnsw                       3191.16
LuceneHnswNoForceMerge           1194.31
hnswlib                           311.09
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                Index time (sec)
LuceneVectorsOnly                  14.17
LuceneHnsw                       8940.41
LuceneHnswNoForceMerge           3623.68
hnswlib                           587.23
{code}
We force merge to one segment to emulate a case where vectors aren't 
continually being indexed. In these situations, it seems likely users would 
force merge to optimize search speed: searching a single large graph is 
expected to be faster than searching several small ones serially. To see how 
long the force merge takes, we can subtract LuceneHnswNoForceMerge from 
LuceneHnsw.

The construction parameters match those in LUCENE-9937 and are optimized for 
search recall + QPS instead of index speed, as I figured this would be a common 
set-up.

Some observations:
 * In cases when segments are eventually force merged, we do a lot of extra 
work building intermediate graphs that are eventually merged away. This is a 
difficult problem, and one that's been raised in the past. As a simple step, I 
wonder if we should not build graphs for segments that are below a certain 
size. For sufficiently small segments, it could be a better trade-off to avoid 
building a graph and support nearest-neighbor search through a brute-force scan?
 * Indexing is slow compared to what we're used to for other formats, even if 
we disregard the extra work mentioned above. For sift-128-euclidean, building 
only the final graph takes ~33 min, whereas for glove-100-angular it's ~88 min.
 * As a note, graph indexing uses ANN searches in order to add each new vector 
to the graph. So the slower search speed between Lucene and hnswlib may 
contribute to slower indexing.

  was:
This is a continuation of LUCENE-9937, but for HNSW index performance.

Approaches
 * LuceneVectorsOnly: a baseline that only indexes vectors
 * LuceneHnsw: our HNSW implementation, with a force merge to one segment
 * LuceneHnswNoForceMerge: our HNSW implementation without the force merge
 * hnswlib: a C++ HNSW implementation from the author of the paper

Datasets
 * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, euclidean 
distance
 * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, euclidean 
distance

*Results on sift-128-euclidean*
 Parameters: M=16, efConstruction=500
{code:java}
Approach                Index time (sec)
LuceneVectorsOnly                  14.93
LuceneHnsw                       3191.16
LuceneHnswNoForceMerge           1194.31
hnswlib                           311.09
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                Index time (sec)
LuceneVectorsOnly                  14.17
LuceneHnsw                       8940.41
LuceneHnswNoForceMerge           3623.68
hnswlib                           587.23
{code}
We force merge to one segment to emulate a case where vectors aren't 
continually being indexed. In these situations, it seems likely users would 
force merge to optimize search speed: searching a single large graph is 
expected to be faster than searching several small ones serially. To see how 
long the force merge takes, we can subtract LuceneHnswNoForceMerge from 
LuceneHnsw.

The construction parameters match those in LUCENE-9937 and are optimized for 
search recall + QPS instead of index speed, as I figured this would be a common 
set-up.

Some observations:
 * In cases when segments are eventually force merged, we do a lot of extra 
work building intermediate graphs that are eventually merged away. This is a 
difficult problem, and one that's been raised in the past. As a simple step, I 
wonder if we should not build graphs for segments that are below a certain 
size. For sufficiently small segments, it could be a better trade-off to avoid 
building a graph and support nearest-neighbor search through a brute-force scan?
 * Indexing is slow compared to what we're used to for other formats, even if 
we disregard the extra work mentioned above. For sift-128-euclidean, building 
only the final graph takes ~33 min, whereas for glove-100-angular it's ~88 min.
 * As a note, graph indexing uses ANN searches in order to add each new vector 
to the graph. So the slower search speed between Lucene and hnswlib may 
contribute to slower indexing.

[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm

2021-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334374#comment-17334374
 ] 

ASF subversion and git services commented on LUCENE-9905:
-

Commit 6d4b5eaba359d4b09114484bb144a724a920c122 in lucene's branch 
refs/heads/main from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=6d4b5ea ]

LUCENE-9905: rename VectorValues.SearchStrategy to 
VectorValues.SimilarityFunction


> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: main (9.0)
>Reporter: Julie Tibshirani
>Priority: Blocker
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
> particular nearest-neighbor search data structure and algorithm. This 
> flexibility is important since NN search is a developing area and we'd like 
> to be able to experiment and evolve the algorithm. Right now we only have one 
> algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for 
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation 
> is expected to handle multiple algorithms. Instead we could have one format 
> implementation per algorithm. Our current implementation would be 
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another 
> algorithm you could create a new implementation like {{ClusterVectorFormat}}. 
> This would be better aligned with the codec framework, and help avoid 
> exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is 
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just 
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something 
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
> parameters to be configured per-field \(?\)
> One note: the current HNSW-based format includes logic for storing a numeric 
> vector per document, as well as constructing + storing a HNSW graph. When 
> adding another implementation, it’d be nice to be able to reuse logic for 
> reading/ writing numeric vectors. I don’t think we need to design for this 
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started: 
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]






[GitHub] [lucene] msokolov commented on a change in pull request #106: LUCENE-9905: rename VectorValues.SearchStrategy to VectorValues.SimilarityFunction

2021-04-27 Thread GitBox


msokolov commented on a change in pull request #106:
URL: https://github.com/apache/lucene/pull/106#discussion_r621665922



##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -2336,6 +2338,29 @@ static void checkImpacts(Impacts impacts, int lastTarget) {
                   + docCount
                   + " docs with values");
             }
+            VectorReader vectorReader = reader.getVectorReader();
+            if (vectorReader instanceof Lucene90HnswVectorReader) {
+              KnnGraphValues graphValues =
+                  ((Lucene90HnswVectorReader) vectorReader).getGraphValues(fieldInfo.name);
+              int size = graphValues.size();
+              for (int i = 0; i < size; i++) {
+                graphValues.seek(i);
+                for (int neighbor = graphValues.nextNeighbor();
+                    neighbor != NO_MORE_DOCS;
+                    neighbor = graphValues.nextNeighbor()) {
+                  if (neighbor < 0 || neighbor >= size) {
+                    throw new RuntimeException(
+                        "Field \""
+                            + fieldInfo.name
+                            + "\" has an invalid neighbor ordinal: "
+                            + neighbor
+                            + " which should be in [0,"
+                            + size
+                            + ")");
+                  }
+                }
+              }
+            }

Review comment:
   Ah, this slipped in here by accident. I'll remove it and add it back in a 
separate commit. My understanding of CheckIndex may be incomplete - I 
thought it was mostly intended as an operational testing and recovery tool, but 
I think you're saying it's part of the unit test framework?







[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm

2021-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334375#comment-17334375
 ] 

ASF subversion and git services commented on LUCENE-9905:
-

Commit 45bd06c8041a2ce7af13e5f1b985ee7cfbb38e7c in lucene's branch 
refs/heads/main from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=45bd06c ]

LUCENE-9905: rename Lucene90VectorFormat and its reader and writer


> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: main (9.0)
>Reporter: Julie Tibshirani
>Priority: Blocker
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
> particular nearest-neighbor search data structure and algorithm. This 
> flexibility is important since NN search is a developing area and we'd like 
> to be able to experiment and evolve the algorithm. Right now we only have one 
> algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for 
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation 
> is expected to handle multiple algorithms. Instead we could have one format 
> implementation per algorithm. Our current implementation would be 
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another 
> algorithm you could create a new implementation like {{ClusterVectorFormat}}. 
> This would be better aligned with the codec framework, and help avoid 
> exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is 
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just 
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something 
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
> parameters to be configured per-field \(?\)
> One note: the current HNSW-based format includes logic for storing a numeric 
> vector per document, as well as constructing + storing a HNSW graph. When 
> adding another implementation, it’d be nice to be able to reuse logic for 
> reading/ writing numeric vectors. I don’t think we need to design for this 
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started: 
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]






[GitHub] [lucene-solr] gus-asf opened a new pull request #2483: LUCENE-9574 - Add a token filter to drop tokens based on flags.

2021-04-27 Thread GitBox


gus-asf opened a new pull request #2483:
URL: https://github.com/apache/lucene-solr/pull/2483


   Backport





[GitHub] [lucene-solr] gus-asf merged pull request #2483: LUCENE-9574 - Add a token filter to drop tokens based on flags.

2021-04-27 Thread GitBox


gus-asf merged pull request #2483:
URL: https://github.com/apache/lucene-solr/pull/2483


   





[jira] [Commented] (LUCENE-9574) Add a token filter to drop tokens based on flags.

2021-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334403#comment-17334403
 ] 

ASF subversion and git services commented on LUCENE-9574:
-

Commit 1c815fb788d604ff440686581caa7ef9c48e757f in lucene-solr's branch 
refs/heads/branch_8x from Gus Heck
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1c815fb ]

Backport LUCENE-9574 - Add a token filter to drop tokens based on flags. (#2483)



> Add a token filter to drop tokens based on flags.
> -
>
> Key: LUCENE-9574
> URL: https://issues.apache.org/jira/browse/LUCENE-9574
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Gus Heck
>Assignee: Gus Heck
>Priority: Major
>  Time Spent: 8.5h
>  Remaining Estimate: 0h
>
> (Breaking this off of SOLR-14597 for independent review)
> A filter that tests flags on tokens vs a bitmask and drops tokens that have 
> all specified flags.






[jira] [Resolved] (LUCENE-9574) Add a token filter to drop tokens based on flags.

2021-04-27 Thread Gus Heck (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gus Heck resolved LUCENE-9574.
--
Resolution: Implemented

> Add a token filter to drop tokens based on flags.
> -
>
> Key: LUCENE-9574
> URL: https://issues.apache.org/jira/browse/LUCENE-9574
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Gus Heck
>Assignee: Gus Heck
>Priority: Major
>  Time Spent: 8.5h
>  Remaining Estimate: 0h
>
> (Breaking this off of SOLR-14597 for independent review)
> A filter that tests flags on tokens vs a bitmask and drops tokens that have 
> all specified flags.






[GitHub] [lucene-solr] gus-asf opened a new pull request #2484: LUCENE-9574 CHANGES.txt entry

2021-04-27 Thread GitBox


gus-asf opened a new pull request #2484:
URL: https://github.com/apache/lucene-solr/pull/2484


   





[GitHub] [lucene-solr] gus-asf merged pull request #2484: LUCENE-9574 CHANGES.txt entry

2021-04-27 Thread GitBox


gus-asf merged pull request #2484:
URL: https://github.com/apache/lucene-solr/pull/2484


   





[jira] [Commented] (LUCENE-9574) Add a token filter to drop tokens based on flags.

2021-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334425#comment-17334425
 ] 

ASF subversion and git services commented on LUCENE-9574:
-

Commit 958b9f5850a4d2954e6eaa081abf46735aea5645 in lucene-solr's branch 
refs/heads/branch_8x from Gus Heck
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=958b9f5 ]

LUCENE-9574 CHANGES.txt entry (#2484)



> Add a token filter to drop tokens based on flags.
> -
>
> Key: LUCENE-9574
> URL: https://issues.apache.org/jira/browse/LUCENE-9574
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Gus Heck
>Assignee: Gus Heck
>Priority: Major
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> (Breaking this off of SOLR-14597 for independent review)
> A filter that tests flags on tokens vs a bitmask and drops tokens that have 
> all specified flags.






[jira] [Commented] (LUCENE-9574) Add a token filter to drop tokens based on flags.

2021-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334429#comment-17334429
 ] 

ASF subversion and git services commented on LUCENE-9574:
-

Commit 0c33e621f9b9da18a996a45bde6ef59e97150f23 in lucene's branch 
refs/heads/main from Gus Heck
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0c33e62 ]

LUCENE-9574 adjust changes entry


> Add a token filter to drop tokens based on flags.
> -
>
> Key: LUCENE-9574
> URL: https://issues.apache.org/jira/browse/LUCENE-9574
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Gus Heck
>Assignee: Gus Heck
>Priority: Major
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> (Breaking this off of SOLR-14597 for independent review)
> A filter that tests flags on tokens vs a bitmask and drops tokens that have 
> all specified flags.






[jira] [Updated] (LUCENE-9574) Add a token filter to drop tokens based on flags.

2021-04-27 Thread Gus Heck (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gus Heck updated LUCENE-9574:
-
Fix Version/s: 8.9

> Add a token filter to drop tokens based on flags.
> -
>
> Key: LUCENE-9574
> URL: https://issues.apache.org/jira/browse/LUCENE-9574
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Gus Heck
>Assignee: Gus Heck
>Priority: Major
> Fix For: 8.9
>
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> (Breaking this off of SOLR-14597 for independent review)
> A filter that tests flags on tokens vs a bitmask and drops tokens that have 
> all specified flags.


