[GitHub] [lucene] dweiss commented on pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification
dweiss commented on pull request #108: URL: https://github.com/apache/lucene/pull/108#issuecomment-827381909 Thanks, this looks suspiciously simple!... :) I'll be glad to experiment with it a bit. I'm not a big fan of the monolithic checksum file - the expanded version (per-jar checksum) seems easier. Checksums should only be generated for a subset of configurations - I don't think it's realistic to assume we can get checksums of everything (detached configurations, etc.). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on a change in pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification
dweiss commented on a change in pull request #108: URL: https://github.com/apache/lucene/pull/108#discussion_r620938812 ## File path: gradle/validation/jar-checks.gradle ## @@ -140,41 +139,6 @@ subprojects { } } - // Verifies that each JAR has a corresponding checksum and that it matches actual JAR available for this dependency. - task validateJarChecksums() { Review comment: Is there any way we can leave this task (empty) and with a dependency on whatever task gradle generates for checksum validation? ## File path: gradle/validation/jar-checks.gradle ## @@ -242,62 +206,14 @@ subprojects { } } - licenses.dependsOn validateJarChecksums, validateJarLicenses + licenses.dependsOn validateJarLicenses } // Add top-project level tasks validating dangling files // and regenerating dependency checksums. configure(project(":lucene")) { def validationTasks = subprojects.collectMany { it.tasks.matching { it.name == "licenses" } } - def jarInfoTasks = subprojects.collectMany { it.tasks.matching { it.name == "collectJarInfos" } } - - // Update dependency checksums. - task updateLicenses() { Review comment: Same here. I'd leave this task and use: ``` ./gradlew --write-verification-metadata sha256 updateLicenses ``` I hate to remember these option switches... the task could verify if they're in place in doFirst and maybe with a hint on how to issue the full command properly if they're missing. Or, alternatively, it could be a GradleBuild task that would recursively invoke the same build with the right options... ## File path: gradle/verification-metadata.xml ## @@ -0,0 +1,2198 @@ (new verification-metadata.xml: dependency-verification schema declaration, verification configuration flags, and per-dependency checksum entries; the XML markup was stripped by the mail archive) Review comment: We only need checksums for a subset of configurations (like before). I'm pretty sure this dependency is from a plugin somewhere, not from Lucene code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #107: LUCENE-9047: Move the Directory APIs to be little endian (take 2)
jpountz commented on a change in pull request #107: URL: https://github.com/apache/lucene/pull/107#discussion_r620941684 ## File path: lucene/core/src/java/org/apache/lucene/codecs/CodecUtil.java ## @@ -640,6 +640,33 @@ static void writeCRC(IndexOutput output) throws IOException { throw new IllegalStateException( "Illegal CRC-32 checksum: " + value + " (resource=" + output + ")"); } -output.writeLong(value); +writeLong(output, value); + } + + /** write int value on header / footer */ + public static void writeInt(DataOutput out, int i) throws IOException { Review comment: Maybe say explicitly on these methods that they write in big-endian order? ## File path: lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/store/DirectoryUtil.java ## @@ -0,0 +1,56 @@ +package org.apache.lucene.backward_codecs.store; + +import java.io.IOException; +import org.apache.lucene.store.ChecksumIndexInput; +import org.apache.lucene.store.DataInput; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.store.IndexOutput; + +/** + * Utility class to wrap open files + * + * @lucene.internal + */ +public final class DirectoryUtil { Review comment: Give it a more descriptive name, e.g. `EndiannessReverserUtil` or something along these lines for consistency with the input/output wrapper names? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
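A note on the big-endian point above: "writes in big-endian order" means the most significant byte goes out first, whatever the Directory's new native (little-endian) order is. A minimal sketch, assuming only DataOutput.writeByte and using hypothetical helper names (this is not the patch's code):

```java
import java.io.IOException;
import org.apache.lucene.store.DataOutput;

final class BigEndianWrites {
  private BigEndianWrites() {}

  /** Writes {@code i} in big-endian byte order, regardless of the output's native order. */
  static void writeIntBE(DataOutput out, int i) throws IOException {
    out.writeByte((byte) (i >>> 24)); // most significant byte first
    out.writeByte((byte) (i >>> 16));
    out.writeByte((byte) (i >>> 8));
    out.writeByte((byte) i);
  }

  /** Writes {@code l} in big-endian byte order. */
  static void writeLongBE(DataOutput out, long l) throws IOException {
    writeIntBE(out, (int) (l >>> 32)); // high 32 bits first
    writeIntBE(out, (int) l);
  }
}
```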
[GitHub] [lucene] jpountz commented on a change in pull request #107: LUCENE-9047: Move the Directory APIs to be little endian (take 2)
jpountz commented on a change in pull request #107: URL: https://github.com/apache/lucene/pull/107#discussion_r620951695 ## File path: lucene/core/src/java/org/apache/lucene/store/ByteBufferIndexInput.java ## @@ -177,7 +177,7 @@ public void readLELongs(long[] dst, int offset, int length) throws IOException { } @Override - public final void readLEFloats(float[] floats, int offset, int len) throws IOException { + public final void readFloats(float[] floats, int offset, int len) throws IOException { // See notes about readELongs above Review comment: ```suggestion // See notes about readLongs above ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on a change in pull request #107: LUCENE-9047: Move the Directory APIs to be little endian (take 2)
dweiss commented on a change in pull request #107: URL: https://github.com/apache/lucene/pull/107#discussion_r620962715 ## File path: lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/packed/LegacyDirectReader.java ## @@ -0,0 +1,368 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.backward_codecs.packed; + +import java.io.IOException; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.LongValues; + +/** + * Retrieves an instance previously written by {@link LegacyDirectWriter} + * + * Example usage: + * + * + * int bitsPerValue = 100; + * IndexInput in = dir.openInput("packed", IOContext.DEFAULT); + * LongValues values = LegacyDirectReader.getInstance(in.randomAccessSlice(start, end), bitsPerValue); + * for (int i = 0; i < numValues; i++) { + * long value = values.get(i); + * } + * + * + * @see LegacyDirectWriter + */ +public class LegacyDirectReader { + + private LegacyDirectReader() { +// no instances + } + + /** + * Retrieves an instance from the specified slice written decoding {@code bitsPerValue} for each + * value + */ + public static LongValues getInstance(RandomAccessInput slice, int bitsPerValue) { +return getInstance(slice, bitsPerValue, 0); + } + + /** + * Retrieves an instance from the specified {@code offset} of the given slice decoding {@code + * bitsPerValue} for each value + */ + public static LongValues getInstance(RandomAccessInput slice, int bitsPerValue, long offset) { +switch (bitsPerValue) { + case 1: +return new DirectPackedReader1(slice, offset); + case 2: +return new DirectPackedReader2(slice, offset); + case 4: +return new DirectPackedReader4(slice, offset); + case 8: +return new DirectPackedReader8(slice, offset); + case 12: +return new DirectPackedReader12(slice, offset); + case 16: +return new DirectPackedReader16(slice, offset); + case 20: +return new DirectPackedReader20(slice, offset); + case 24: +return new DirectPackedReader24(slice, offset); + case 28: +return new DirectPackedReader28(slice, offset); + case 32: +return new DirectPackedReader32(slice, offset); + case 40: +return new DirectPackedReader40(slice, offset); + case 48: +return new DirectPackedReader48(slice, offset); + case 56: +return new DirectPackedReader56(slice, offset); + case 64: +return new DirectPackedReader64(slice, offset); + default: +throw new IllegalArgumentException("unsupported bitsPerValue: " + bitsPerValue); +} + } + + static final class DirectPackedReader1 extends LongValues { +final RandomAccessInput in; +final long offset; + +DirectPackedReader1(RandomAccessInput in, long offset) { + this.in = in; + this.offset = offset; +} + +@Override +public long get(long index) { + try { +int shift = 7 - (int) (index & 7); +return (in.readByte(offset + (index >>> 
3)) >>> shift) & 0x1; + } catch (IOException e) { +throw new RuntimeException(e); + } +} + } + + static final class DirectPackedReader2 extends LongValues { +final RandomAccessInput in; +final long offset; + +DirectPackedReader2(RandomAccessInput in, long offset) { + this.in = in; + this.offset = offset; +} + +@Override +public long get(long index) { + try { +int shift = (3 - (int) (index & 3)) << 1; +return (in.readByte(offset + (index >>> 2)) >>> shift) & 0x3; + } catch (IOException e) { +throw new RuntimeException(e); + } +} + } + + static final class DirectPackedReader4 extends LongValues { +final RandomAccessInput in; +final long offset; + +DirectPackedReader4(RandomAccessInput in, long offset) { + this.in = in; + this.offset = offset; +} + +@Override +public long get(long index) { + try { +int shift = (int) ((index + 1) & 1) << 2; +return (in.readByte(offset + (index >>> 1)) >>> shift) & 0xF; + } catch (IOException e) { +
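The readers quoted above decode fixed-width packed values with shift/mask arithmetic. A hedged, self-contained sketch of the 4-bits-per-value case over a plain byte[] (not the Lucene classes): each byte holds two values, even ordinals take the high nibble and odd ordinals the low nibble.

```java
// Sketch of the DirectPackedReader4 arithmetic, over a plain byte[] instead of RandomAccessInput.
final class Packed4Example {
  static long get(byte[] packed, long index) {
    // Even index -> shift 4 (high nibble), odd index -> shift 0 (low nibble).
    int shift = (int) ((index + 1) & 1) << 2;
    return (packed[(int) (index >>> 1)] >>> shift) & 0xF;
  }

  public static void main(String[] args) {
    byte[] packed = {(byte) 0xAB, (byte) 0xCD}; // packed values: 0xA, 0xB, 0xC, 0xD
    for (long i = 0; i < 4; i++) {
      System.out.println(get(packed, i)); // prints 10, 11, 12, 13
    }
  }
}
```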
[jira] [Commented] (LUCENE-8069) Allow index sorting by field length
[ https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333040#comment-17333040 ] Adrien Grand commented on LUCENE-8069: -- Since I was playing with the MSMarco passages dataset for other reasons I wanted to give this change a try again with the first 1000 queries from the `eval` file. Unlike the wikipedia tasks file, queries in this dataset have many terms, often 5+, sometimes even 10+. All of them are disjunctions. Lucene defaults: - avg: 11ms - median: 6ms - p90: 28ms - p99: 80ms Index sorted by increasing field length: - avg: 7ms - median: 2ms - p90: 6ms - p99: 17ms This seems to confirm that this approach could be very valuable. > Allow index sorting by field length > --- > > Key: LUCENE-8069 > URL: https://issues.apache.org/jira/browse/LUCENE-8069 > Project: Lucene - Core > Issue Type: Wish >Reporter: Adrien Grand >Priority: Minor > > Short documents are more likely to get higher scores, so sorting an index by > field length would mean we would be likely to collect best matches first. > Depending on the similarity implementation, this might even allow to early > terminate collection of top documents on term queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8069) Allow index sorting by field length
[ https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333044#comment-17333044 ] Adrien Grand commented on LUCENE-8069: -- bq. I guess people wanting these benefits today without any changes to Lucene could simply add a norm-like field (e.g. sum of raw char lengths of all tokenized fields) and then configure Lucene to sort on that. Would that work? One thing that occurred to me recently is that we could make indexing faster if we actually used the norm instead of requiring users to index some form of proxy for the length normalization factor: because Lucene encodes norms on bytes, norms are low-cardinality fields, which in turn gives us more options to make indexing faster when sorting is enabled via something like LUCENE-9935 (stored fields merging is currently a major bottleneck when doing bulk indexing with index sorting enabled). > Allow index sorting by field length > --- > > Key: LUCENE-8069 > URL: https://issues.apache.org/jira/browse/LUCENE-8069 > Project: Lucene - Core > Issue Type: Wish >Reporter: Adrien Grand >Priority: Minor > > Short documents are more likely to get higher scores, so sorting an index by > field length would mean we would be likely to collect best matches first. > Depending on the similarity implementation, this might even allow to early > terminate collection of top documents on term queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
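For the "index a proxy for length and sort on it" idea quoted above, a hedged sketch of the wiring with today's index-sorting API; the field name "len" and the on-disk path are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.FSDirectory;

public class LengthSortedIndex {
  public static void main(String[] args) throws IOException {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // Sort the index by the "len" doc-values field, shortest documents first.
    iwc.setIndexSort(new Sort(new SortField("len", SortField.Type.LONG)));
    try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/len-sorted"));
        IndexWriter writer = new IndexWriter(dir, iwc)) {
      String body = "some example passage text";
      Document doc = new Document();
      doc.add(new TextField("body", body, Field.Store.NO));
      // Proxy for the length normalization factor: raw character length of the field.
      doc.add(new NumericDocValuesField("len", body.length()));
      writer.addDocument(doc);
    }
  }
}
```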
[jira] [Updated] (LUCENE-9939) Proper ASCII folding of Danish/Norwegian characters Ø, Å
[ https://issues.apache.org/jira/browse/LUCENE-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacob Lauritzen updated LUCENE-9939: Status: Patch Available (was: Open) > Proper ASCII folding of Danish/Norwegian characters Ø, Å > > > Key: LUCENE-9939 > URL: https://issues.apache.org/jira/browse/LUCENE-9939 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Jacob Lauritzen >Priority: Minor > Labels: easyfix > > The current version of the ASCIIFoldingFilter sets Å, å to A, a and Ø, ø to > O, o which I believe is incorrect. > Å was added by Norway as a replacement for the Aa (which is mapped to aa in > the AsciiFoldingFilter) in 1917 and by Denmark in 1948. Aa is still used in a > lot of names (as an example the second largest city in Denmark was originally > named Aarhus, renamed to Århus in 1948 and named back to AArhus in 2010 for > internationalization purposes). > The story of Ø is similar. It's equivalent to Œ (which is mapped to oe), not > ö (which is mapped to o) and is generally mapped to oe in ascii text. > The third Danish character Æ is already properly mapped to AE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9939) Proper ASCII folding of Danish/Norwegian characters Ø, Å
Jacob Lauritzen created LUCENE-9939: --- Summary: Proper ASCII folding of Danish/Norwegian characters Ø, Å Key: LUCENE-9939 URL: https://issues.apache.org/jira/browse/LUCENE-9939 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Reporter: Jacob Lauritzen The current version of the ASCIIFoldingFilter sets Å, å to A, a and Ø, ø to O, o which I believe is incorrect. Å was added by Norway as a replacement for the Aa (which is mapped to aa in the AsciiFoldingFilter) in 1917 and by Denmark in 1948. Aa is still used in a lot of names (as an example the second largest city in Denmark was originally named Aarhus, renamed to Århus in 1948 and named back to AArhus in 2010 for internationalization purposes). The story of Ø is similar. It's equivalent to Œ (which is mapped to oe), not ö (which is mapped to o) and is generally mapped to oe in ascii text. The third Danish character Æ is already properly mapped to AE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9939) Proper ASCII folding of Danish/Norwegian characters Ø, Å
[ https://issues.apache.org/jira/browse/LUCENE-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacob Lauritzen updated LUCENE-9939: Attachment: LUCENE-9939.patch Labels: easyfix patch patch-available (was: easyfix) Status: Patch Available (was: Patch Available) > Proper ASCII folding of Danish/Norwegian characters Ø, Å > > > Key: LUCENE-9939 > URL: https://issues.apache.org/jira/browse/LUCENE-9939 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Jacob Lauritzen >Priority: Minor > Labels: easyfix, patch, patch-available > Attachments: LUCENE-9939.patch > > > The current version of the ASCIIFoldingFilter sets Å, å to A, a and Ø, ø to > O, o which I believe is incorrect. > Å was added by Norway as a replacement for the Aa (which is mapped to aa in > the AsciiFoldingFilter) in 1917 and by Denmark in 1948. Aa is still used in a > lot of names (as an example the second largest city in Denmark was originally > named Aarhus, renamed to Århus in 1948 and named back to AArhus in 2010 for > internationalization purposes). > The story of Ø is similar. It's equivalent to Œ (which is mapped to oe), not > ö (which is mapped to o) and is generally mapped to oe in ascii text. > The third Danish character Æ is already properly mapped to AE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9939) Proper ASCII folding of Danish/Norwegian characters Ø, Å
[ https://issues.apache.org/jira/browse/LUCENE-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333128#comment-17333128 ] Robert Muir commented on LUCENE-9939: - This isn't the way to go: these aren't the only languages using the letter. So we shouldn't change it in some way that only makes sense for these languages. Place ScandinavianFoldingFilter or ScandinavianNormalizationFilter in your analysis chain before this thing: * https://lucene.apache.org/core/8_8_2/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html * https://lucene.apache.org/core/8_8_2/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html > Proper ASCII folding of Danish/Norwegian characters Ø, Å > > > Key: LUCENE-9939 > URL: https://issues.apache.org/jira/browse/LUCENE-9939 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Jacob Lauritzen >Priority: Minor > Labels: easyfix, patch, patch-available > Attachments: LUCENE-9939.patch > > > The current version of the ASCIIFoldingFilter sets Å, å to A, a and Ø, ø to > O, o which I believe is incorrect. > Å was added by Norway as a replacement for the Aa (which is mapped to aa in > the AsciiFoldingFilter) in 1917 and by Denmark in 1948. Aa is still used in a > lot of names (as an example the second largest city in Denmark was originally > named Aarhus, renamed to Århus in 1948 and named back to AArhus in 2010 for > internationalization purposes). > The story of Ø is similar. It's equivalent to Œ (which is mapped to oe), not > ö (which is mapped to o) and is generally mapped to oe in ascii text. > The third Danish character Æ is already properly mapped to AE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
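A hedged sketch of the ordering Robert describes: a custom analyzer (class name invented for illustration) that runs ScandinavianFoldingFilter before ASCIIFoldingFilter, so the Scandinavian-aware handling of å/ø applies before the generic ASCII folding does.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.miscellaneous.ScandinavianFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/** Sketch only: place the Scandinavian-aware filter ahead of ASCIIFoldingFilter in the chain. */
public class ScandinavianThenAsciiAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    // Scandinavian folding sees å/ø (and friends) first...
    TokenStream filtered = new ScandinavianFoldingFilter(source);
    // ...then ASCIIFoldingFilter handles everything else.
    filtered = new ASCIIFoldingFilter(filtered);
    return new TokenStreamComponents(source, filtered);
  }
}
```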
[GitHub] [lucene-solr] noblepaul merged pull request #2481: SOLR-15337 Avoid XPath in solrconfig.xml parsing
noblepaul merged pull request #2481: URL: https://github.com/apache/lucene-solr/pull/2481 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8069) Allow index sorting by field length
[ https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333213#comment-17333213 ] Michael McCandless commented on LUCENE-8069: {quote}This seems to confirm that this approach could be very valuable. {quote} +1 I would expect that NOT insisting on total hit count (the way Lucene defaults now) is more common use case, so this optimization is indeed compelling. > Allow index sorting by field length > --- > > Key: LUCENE-8069 > URL: https://issues.apache.org/jira/browse/LUCENE-8069 > Project: Lucene - Core > Issue Type: Wish >Reporter: Adrien Grand >Priority: Minor > > Short documents are more likely to get higher scores, so sorting an index by > field length would mean we would be likely to collect best matches first. > Depending on the similarity implementation, this might even allow to early > terminate collection of top documents on term queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9940) The order of disjuncts in DisjunctionMaxQuery affects equals() impl
Alan Woodward created LUCENE-9940: - Summary: The order of disjuncts in DisjunctionMaxQuery affects equals() impl Key: LUCENE-9940 URL: https://issues.apache.org/jira/browse/LUCENE-9940 Project: Lucene - Core Issue Type: Bug Reporter: Alan Woodward Assignee: Alan Woodward DisjunctionMaxQuery stores its disjuncts in a java array, and its equals() implementation uses Arrays.equal() when checking equality. This means that two queries with the same disjuncts but added in a different order will compare as different, even though their results will be identical. We should replace the array with a Set. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
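A hedged sketch reproducing the reported behaviour (field and term values are made up):

```java
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class DisMaxEqualsDemo {
  public static void main(String[] args) {
    Query a = new TermQuery(new Term("body", "lucene"));
    Query b = new TermQuery(new Term("title", "lucene"));
    // Same disjuncts, different insertion order.
    Query q1 = new DisjunctionMaxQuery(List.of(a, b), 0.0f);
    Query q2 = new DisjunctionMaxQuery(List.of(b, a), 0.0f);
    // With the array-backed equals() described above this prints "false",
    // even though both queries match and score identically.
    System.out.println(q1.equals(q2));
  }
}
```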
[jira] [Commented] (LUCENE-9940) The order of disjuncts in DisjunctionMaxQuery affects equals() impl
[ https://issues.apache.org/jira/browse/LUCENE-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333278#comment-17333278 ] Adrien Grand commented on LUCENE-9940: -- +1, it might need to be a MultiSet in order to preserve scoring in the case when tieBreakerMultiplier is not 0? > The order of disjuncts in DisjunctionMaxQuery affects equals() impl > --- > > Key: LUCENE-9940 > URL: https://issues.apache.org/jira/browse/LUCENE-9940 > Project: Lucene - Core > Issue Type: Bug >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > > DisjunctionMaxQuery stores its disjuncts in a java array, and its equals() > implementation uses Arrays.equal() when checking equality. This means that > two queries with the same disjuncts but added in a different order will > compare as different, even though their results will be identical. We should > replace the array with a Set. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9204) Move span queries to the queries module
[ https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1725#comment-1725 ] Alan Woodward commented on LUCENE-9204: --- I'd like to try and get this change in for 9.0, which seems like as good a time as any to move groups of queries around. > Move span queries to the queries module > --- > > Key: LUCENE-9204 > URL: https://issues.apache.org/jira/browse/LUCENE-9204 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > > We have a slightly odd situation currently, with two parallel query > structures for building complex positional queries: the long-standing span > queries, in core; and interval queries, in the queries module. Given that > interval queries solve at least some of the problems we've had with Spans, I > think we should be pushing users more towards these implementations. It's > counter-intuitive to do that when Spans are in core though. I've opened this > issue to discuss moving the spans package as a whole to the queries module. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] neoremind commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building
neoremind commented on pull request #91: URL: https://github.com/apache/lucene/pull/91#issuecomment-827678981 I spent some time trying to use the real case benchmark. The speedup of `IndexWriter` is what we expected, faster than main branch, total time elapsed (include adding doc, building index and merging) decreased by about 20%. If we only consider `flush_time`, the speedup is more obvious, time cost drops about 40% - 50%. 1) Run [IndexAndSearchOpenStreetMaps1D.java](https://github.com/neoremind/luceneutil/blob/master/src/main/perf/IndexAndSearchOpenStreetMaps1D.java) against the two branches and take down the [log](https://github.com/neoremind/luceneutil/tree/master/log/OpenStreetMaps). _note: comment query stage, modify some of the code to adapt to latest Lucene main branch._ main branch: ``` # egrep "flush time|sec to build index" open-street-maps.log DWPT 0 [2021-04-27T11:33:04.518908Z; main]: flush time 17284.537739 msec DWPT 0 [2021-04-27T11:33:37.888449Z; main]: flush time 12039.476885 msec 72.49147722 sec to build index ``` PR branch: ``` #egrep "flush time|sec to build index" open-street-maps-optimized.log DWPT 0 [2021-04-27T11:35:00.619683Z; main]: flush time 9313.007647 msec DWPT 0 [2021-04-27T11:35:29.575254Z; main]: flush time 8631.820226 msec 59.252797133 sec to build index ``` 2) Further more, I come up with an idea to use TPC-H LINEITEM to verify. I have a 10GB TPC-H dataset and develop a new test case to import the first 5 INT fields, which is more typical in real case. Run [IndexAndSearchTpcHLineItem.java](https://github.com/neoremind/luceneutil/blob/master/src/main/perf/IndexAndSearchTpcHLineItem.java) against the two branches and take down the [log](https://github.com/neoremind/luceneutil/tree/master/log/TPC-H-LINEITEM). main branch: ``` egrep "flush time|sec to build index" tpch-lineitem.log DWPT 0 [2021-04-27T11:17:25.329006Z; main]: flush time 13850.23328 msec DWPT 0 [2021-04-27T11:17:50.289370Z; main]: flush time 12228.723665 msec DWPT 0 [2021-04-27T11:18:15.546002Z; main]: flush time 12537.085005 msec DWPT 0 [2021-04-27T11:18:40.140413Z; main]: flush time 11819.225223 msec DWPT 0 [2021-04-27T11:19:04.850989Z; main]: flush time 12004.380921 msec DWPT 0 [2021-04-27T11:19:29.435183Z; main]: flush time 11850.273453 msec DWPT 0 [2021-04-27T11:19:54.016951Z; main]: flush time 11882.316067 msec DWPT 0 [2021-04-27T11:20:18.932727Z; main]: flush time 12223.151464 msec DWPT 0 [2021-04-27T11:20:43.522117Z; main]: flush time 11871.276323 msec DWPT 0 [2021-04-27T11:20:52.060300Z; main]: flush time 3422.434221 msec 271.188917715 sec to build index ``` PR branch: ``` egrep "flush time|sec to build index" tpch-lineitem-optimized.log DWPT 0 [2021-04-27T11:24:00.362128Z; main]: flush time 7573.05091 msec DWPT 0 [2021-04-27T11:24:19.498948Z; main]: flush time 7355.376016 msec DWPT 0 [2021-04-27T11:24:38.602117Z; main]: flush time 7287.306154 msec DWPT 0 [2021-04-27T11:24:57.541930Z; main]: flush time 7227.514396 msec DWPT 0 [2021-04-27T11:25:16.474158Z; main]: flush time 7236.208865 msec DWPT 0 [2021-04-27T11:25:35.339855Z; main]: flush time 7152.876269 msec DWPT 0 [2021-04-27T11:25:54.10Z; main]: flush time 7080.405571 msec DWPT 0 [2021-04-27T11:26:12.985489Z; main]: flush time 7188.012278 msec DWPT 0 [2021-04-27T11:26:31.857053Z; main]: flush time 7176.303704 msec DWPT 0 [2021-04-27T11:26:38.838771Z; main]: flush time 2185.742347 msec 213.175509249 sec to build index ``` For benchmark command, please refer to [my 
document](https://github.com/neoremind/luceneutil/tree/master/command). Test environment: ``` CPU: Architecture: x86_64 CPU op-mode(s):32-bit, 64-bit Byte Order:Little Endian CPU(s):32 On-line CPU(s) list: 0-31 Thread(s) per core:2 Core(s) per socket:16 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family:6 Model: 85 Model name:Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz Stepping: 4 CPU MHz: 2500.000 BogoMIPS: 5000.00 Hypervisor vendor: KVM Virtualization type: full L1d cache: 32K L1i cache: 32K L2 cache: 1024K L3 cache: 33792K NUMA node0 CPU(s): 0-31 Memory: $cat /proc/meminfo MemTotal: 65703704 kB Disk: SATA $fdisk -l | grep Disk Disk /dev/vdb: 35184.4 GB, 35184372088832 bytes, 68719476736 sectors OS: Linux 4.19.57-15.1.al7.x86_64 JDK: openjdk version "11.0.11" 2021-04-20 LTS OpenJDK Runtime Environment 18.9 (build 11.0.11+9-LTS) OpenJDK 64-Bit Server VM 18.9 (build 11.0.11+9-LTS, mixed mode, sharing) ``` --
[GitHub] [lucene] mayya-sharipova commented on pull request #103: Fix regression to account payloads while merging
mayya-sharipova commented on pull request #103: URL: https://github.com/apache/lucene/pull/103#issuecomment-827737112 @jpountz Thank you for the review. I've added the test to `TestTermVectors` in dc660968003cbaf6bb80c59c78b34af67fdedc03 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification
gautamworah96 commented on a change in pull request #108: URL: https://github.com/apache/lucene/pull/108#discussion_r621414185 ## File path: gradle/verification-metadata.xml ## @@ -0,0 +1,2198 @@ (new verification-metadata.xml: schema declaration, verification configuration flags, and per-dependency checksum entries; the XML markup was stripped by the mail archive) Review comment: Yes. I decided to keep the `` flag on which causes gradle to track metadata and transitive dependencies as well. Let me see if disabling this flag removes these plugins. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on a change in pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification
dweiss commented on a change in pull request #108: URL: https://github.com/apache/lucene/pull/108#discussion_r621486745 ## File path: gradle/verification-metadata.xml ## @@ -0,0 +1,2198 @@ (new verification-metadata.xml: schema declaration, verification configuration flags, and per-dependency checksum entries; the XML markup was stripped by the mail archive) Review comment: There should be a way to restrict this to only selected configurations, right? So dependencies of selected configurations. This would make things simpler as you would point at what it was before, for example. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification
gautamworah96 commented on a change in pull request #108: URL: https://github.com/apache/lucene/pull/108#discussion_r621512372 ## File path: gradle/verification-metadata.xml ## @@ -0,0 +1,2198 @@ (new verification-metadata.xml: schema declaration, verification configuration flags, and per-dependency checksum entries; the XML markup was stripped by the mail archive) Review comment: There [is](https://docs.gradle.org/6.8.1/userguide/dependency_verification.html#sub:disabling-specific-verification)! I'll try to tinker with it and see what turns out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification
gautamworah96 commented on pull request #108: URL: https://github.com/apache/lucene/pull/108#issuecomment-827847811 > Thanks, this looks suspiciously simple!... :) I'll be glad to experiment with it a bit. 💯 > > I'm not a big fan of the monolithic checksum file - the expanded version (per-jar checksum) seems easier. I actually thought having a single file would be better for editing and understanding dependencies from a single place. I don't think there is a way to give multiple checksum file inputs to gradle at this moment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification
dweiss commented on pull request #108: URL: https://github.com/apache/lucene/pull/108#issuecomment-827851298 It's fine. I kind of prefer filesystem (file name)-based correspondence of checksums to files but I can live with a monolithic file too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9941) ann-benchmarks results for HNSW indexing
Julie Tibshirani created LUCENE-9941: Summary: ann-benchmarks results for HNSW indexing Key: LUCENE-9941 URL: https://issues.apache.org/jira/browse/LUCENE-9941 Project: Lucene - Core Issue Type: Task Reporter: Julie Tibshirani This is a continuation of LUCENE-9937, but for HNSW index performance. Approaches * LuceneVectorsOnly: a baseline that only indexes vectors * LuceneHnsw: our HNSW implementation, with a force merge to one segment * LuceneHnswNoForceMerge: our HNSW implementation without the force merge * hnswlib: a C++ HNSW implementation from the author of the paper Datasets * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, euclidean distance * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, euclidean distance *Results on sift-128-euclidean* Parameters: M=16, efConstruction=500 {code:java} Approach Index time (sec) LuceneVectorsOnly 14.93 LuceneHnsw 3191.16 LuceneHnswNoForceMerge 1194.31 hnswlib 311.09 {code} *Results on glove-100-angular* Parameters: M=32, efConstruction=500 {code:java} Approach Index time (sec) LuceneVectorsOnly 14.17 LuceneHnsw 8940.41 LuceneHnswNoForceMerge 3623.68 hnswlib 587.23 {code} We force merge to one segment to emulate a case where vectors aren't continually being indexed. In these situations, it seems likely users would force merge to optimize search speed: searching a single large graph is expected to be faster than searching several small ones serially. To see how long the force merge takes, we can subtract LuceneHnswNoForceMerge from LuceneHnsw. The construction parameters match those in LUCENE-9937 and are optimized for search recall + QPS instead of index speed, as I figured this would be a common set-up. Some observations: * In cases when segments are eventually force merged, we do a lot of extra work building intermediate graphs that are eventually merged away. This is a difficult problem, and one that's been raised in the past. As a simple step, I wonder if we should not build graphs for segments that are below a certain size. For sufficiently small segments, it could be a better trade-off to avoid building a graph and support nearest-neighbor search through a brute-force scan? * Indexing is slow compared to what we're used to for other formats, even if we disregard the extra work mentioned above. For sift-128-euclidean, building only the final graph takes ~33 min, whereas for glove-100-angular it's ~88 min. * As a note, graph indexing uses ANN searches in order to add each new vector to the graph. So the slower search speed between Lucene and hnswlib may contribute to slower indexing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
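On the "brute-force scan for sufficiently small segments" observation above, a hedged sketch of what such an exact scan looks like, in plain Java over float[] vectors with dot-product similarity (deliberately not tied to the Lucene vector APIs):

```java
import java.util.PriorityQueue;

/** Sketch: exact top-k nearest-neighbor scan, the fallback suggested for tiny segments. */
final class BruteForceKnn {
  /** Returns the ordinals of the {@code k} vectors most similar to {@code query} by dot product. */
  static int[] topK(float[][] vectors, float[] query, int k) {
    // Min-heap on score, so the worst of the current top-k is evicted first.
    PriorityQueue<float[]> heap = new PriorityQueue<>((x, y) -> Float.compare(x[0], y[0]));
    for (int ord = 0; ord < vectors.length; ord++) {
      float score = 0;
      for (int i = 0; i < query.length; i++) {
        score += vectors[ord][i] * query[i];
      }
      heap.offer(new float[] {score, ord});
      if (heap.size() > k) {
        heap.poll(); // drop the current worst candidate
      }
    }
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = (int) heap.poll()[1]; // best scores end up at the front
    }
    return result;
  }
}
```

The trade-off is that this costs O(n · dim) work per query, which is only attractive while a segment stays small enough that building and searching a graph would not pay off.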
[jira] [Updated] (LUCENE-9941) ann-benchmarks results for HNSW indexing
[ https://issues.apache.org/jira/browse/LUCENE-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-9941: - Description: This is a continuation of LUCENE-9937, but for HNSW index performance. Approaches * LuceneVectorsOnly: a baseline that only indexes vectors * LuceneHnsw: our HNSW implementation, with a force merge to one segment * LuceneHnswNoForceMerge: our HNSW implementation without the force merge * hnswlib: a C++ HNSW implementation from the author of the paper Datasets * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, comparing euclidean distance * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, comparing cosine similarity *Results on sift-128-euclidean* Parameters: M=16, efConstruction=500 {code:java} Approach Index time (sec) LuceneVectorsOnly 14.93 LuceneHnsw 3191.16 LuceneHnswNoForceMerge 1194.31 hnswlib 311.09 {code} *Results on glove-100-angular* Parameters: M=32, efConstruction=500 {code:java} Approach Index time (sec) LuceneVectorsOnly 14.17 LuceneHnsw 8940.41 LuceneHnswNoForceMerge 3623.68 hnswlib 587.23 {code} We force merge to one segment to emulate a case where vectors aren't continually being indexed. In these situations, it seems likely users would force merge to optimize search speed: searching a single large graph is expected to be faster than searching several small ones serially. To see how long the force merge takes, we can subtract LuceneHnswNoForceMerge from LuceneHnsw. The construction parameters match those in LUCENE-9937 and are optimized for search recall + QPS instead of index speed, as I figured this would be a common set-up. Some observations: * In cases when segments are eventually force merged, we do a lot of extra work building intermediate graphs that are eventually merged away. This is a difficult problem, and one that's been raised in the past. As a simple step, I wonder if we should not build graphs for segments that are below a certain size. For sufficiently small segments, it could be a better trade-off to avoid building a graph and support nearest-neighbor search through a brute-force scan? * Indexing is slow compared to what we're used to for other formats, even if we disregard the extra work mentioned above. For sift-128-euclidean, building only the final graph takes ~33 min, whereas for glove-100-angular it's ~88 min. * As a note, graph indexing uses ANN searches in order to add each new vector to the graph. So the slower search speed between Lucene and hnswlib may contribute to slower indexing. was: This is a continuation of LUCENE-9937, but for HNSW index performance. 
Approaches * LuceneVectorsOnly: a baseline that only indexes vectors * LuceneHnsw: our HNSW implementation, with a force merge to one segment * LuceneHnswNoForceMerge: our HNSW implementation without the force merge * hnswlib: a C++ HNSW implementation from the author of the paper Datasets * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, euclidean distance * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, euclidean distance *Results on sift-128-euclidean* Parameters: M=16, efConstruction=500 {code:java} Approach Index time (sec) LuceneVectorsOnly 14.93 LuceneHnsw 3191.16 LuceneHnswNoForceMerge 1194.31 hnswlib 311.09 {code} *Results on glove-100-angular* Parameters: M=32, efConstruction=500 {code:java} Approach Index time (sec) LuceneVectorsOnly 14.17 LuceneHnsw 8940.41 LuceneHnswNoForceMerge 3623.68 hnswlib 587.23 {code} We force merge to one segment to emulate a case where vectors aren't continually being indexed. In these situations, it seems likely users would force merge to optimize search speed: searching a single large graph is expected to be faster than searching several small ones serially. To see how long the force merge takes, we can subtract LuceneHnswNoForceMerge from LuceneHnsw. The construction parameters match those in LUCENE-9937 and are optimized for search recall + QPS instead of index speed, as I figured this would be a common set-up. Some observations: * In cases when segments are eventually force merged, we do a lot of extra work building intermediate graphs that are eventually merged away. This is a difficult problem, and one that's been raised in the past. As a simple step, I wonder if we should not build graphs for segments that are below a certain size. For sufficiently small segments, it could be a better trade-off to avoid building a graph and support nearest-neighbor search through a brute-force scan? * Indexing is slow comp
[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm
[ https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334374#comment-17334374 ] ASF subversion and git services commented on LUCENE-9905: - Commit 6d4b5eaba359d4b09114484bb144a724a920c122 in lucene's branch refs/heads/main from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=6d4b5ea ] LUCENE-9905: rename VectorValues.SearchStrategy to VectorValues.SimilarityFunction > Revise approach to specifying NN algorithm > -- > > Key: LUCENE-9905 > URL: https://issues.apache.org/jira/browse/LUCENE-9905 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: main (9.0) >Reporter: Julie Tibshirani >Priority: Blocker > Time Spent: 0.5h > Remaining Estimate: 0h > > In LUCENE-9322 we decided that the new vectors API shouldn’t assume a > particular nearest-neighbor search data structure and algorithm. This > flexibility is important since NN search is a developing area and we'd like > to be able to experiment and evolve the algorithm. Right now we only have one > algorithm (HNSW), but we want to maintain the ability to use another. > Currently the algorithm to use is specified through {{SearchStrategy}}, for > example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation > is expected to handle multiple algorithms. Instead we could have one format > implementation per algorithm. Our current implementation would be > HNSW-specific like {{HnswVectorFormat}}, and to experiment with another > algorithm you could create a new implementation like {{ClusterVectorFormat}}. > This would be better aligned with the codec framework, and help avoid > exposing algorithm details in the API. > A concrete proposal (note many of these names will change when LUCENE-9855 is > addressed): > # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add > HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}. > # Remove references to HNSW in {{SearchStrategy}}, so there is just > {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something > like {{SimilarityFunction}}. > # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and > beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}. > # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or > parameters to be configured per-field \(?\) > One note: the current HNSW-based format includes logic for storing a numeric > vector per document, as well as constructing + storing a HNSW graph. When > adding another implementation, it’d be nice to be able to reuse logic for > reading/ writing numeric vectors. I don’t think we need to design for this > right now, but we can keep it in mind for the future? > This issue is based on a thread [~jpountz] started: > [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a change in pull request #106: LUCENE-9905: rename VectorValues.SearchStrategy to VectorValues.SimilarityFunction
msokolov commented on a change in pull request #106: URL: https://github.com/apache/lucene/pull/106#discussion_r621665922 ## File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java ## @@ -2336,6 +2338,29 @@ static void checkImpacts(Impacts impacts, int lastTarget) { + docCount + " docs with values"); } +VectorReader vectorReader = reader.getVectorReader(); +if (vectorReader instanceof Lucene90HnswVectorReader) { + KnnGraphValues graphValues = + ((Lucene90HnswVectorReader) vectorReader).getGraphValues(fieldInfo.name); + int size = graphValues.size(); + for (int i = 0; i < size; i++) { +graphValues.seek(i); +for (int neighbor = graphValues.nextNeighbor(); +neighbor != NO_MORE_DOCS; +neighbor = graphValues.nextNeighbor()) { + if (neighbor < 0 || neighbor >= size) { +throw new RuntimeException( +"Field \"" ++ fieldInfo.name ++ "\" has an invalid neighbor ordinal: " ++ neighbor ++ " which should be in [0," ++ size ++ ")"); + } +} + } +} Review comment: Ah, this slipped in here by accident. I'll remove and add back in a separate commit. My understanding about CheckIndex may be incomplete - I thought it was mostly intended as an operational testing and recovery tool, but I think you're saying it's part if the unit test framework? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm
[ https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334375#comment-17334375 ] ASF subversion and git services commented on LUCENE-9905: - Commit 45bd06c8041a2ce7af13e5f1b985ee7cfbb38e7c in lucene's branch refs/heads/main from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=45bd06c ] LUCENE-9905: rename Lucene90VectorFormat and its reader and writer > Revise approach to specifying NN algorithm > -- > > Key: LUCENE-9905 > URL: https://issues.apache.org/jira/browse/LUCENE-9905 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: main (9.0) >Reporter: Julie Tibshirani >Priority: Blocker > Time Spent: 0.5h > Remaining Estimate: 0h > > In LUCENE-9322 we decided that the new vectors API shouldn’t assume a > particular nearest-neighbor search data structure and algorithm. This > flexibility is important since NN search is a developing area and we'd like > to be able to experiment and evolve the algorithm. Right now we only have one > algorithm (HNSW), but we want to maintain the ability to use another. > Currently the algorithm to use is specified through {{SearchStrategy}}, for > example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation > is expected to handle multiple algorithms. Instead we could have one format > implementation per algorithm. Our current implementation would be > HNSW-specific like {{HnswVectorFormat}}, and to experiment with another > algorithm you could create a new implementation like {{ClusterVectorFormat}}. > This would be better aligned with the codec framework, and help avoid > exposing algorithm details in the API. > A concrete proposal (note many of these names will change when LUCENE-9855 is > addressed): > # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add > HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}. > # Remove references to HNSW in {{SearchStrategy}}, so there is just > {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something > like {{SimilarityFunction}}. > # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and > beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}. > # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or > parameters to be configured per-field \(?\) > One note: the current HNSW-based format includes logic for storing a numeric > vector per document, as well as constructing + storing a HNSW graph. When > adding another implementation, it’d be nice to be able to reuse logic for > reading/ writing numeric vectors. I don’t think we need to design for this > right now, but we can keep it in mind for the future? > This issue is based on a thread [~jpountz] started: > [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] gus-asf opened a new pull request #2483: LUCENE-9574 - Add a token filter to drop tokens based on flags.
gus-asf opened a new pull request #2483: URL: https://github.com/apache/lucene-solr/pull/2483 Backport -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] gus-asf merged pull request #2483: LUCENE-9574 - Add a token filter to drop tokens based on flags.
gus-asf merged pull request #2483: URL: https://github.com/apache/lucene-solr/pull/2483 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9574) Add a token filter to drop tokens based on flags.
[ https://issues.apache.org/jira/browse/LUCENE-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334403#comment-17334403 ] ASF subversion and git services commented on LUCENE-9574: - Commit 1c815fb788d604ff440686581caa7ef9c48e757f in lucene-solr's branch refs/heads/branch_8x from Gus Heck [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1c815fb ] Backport LUCENE-9574 - Add a token filter to drop tokens based on flags. (#2483) > Add a token filter to drop tokens based on flags. > - > > Key: LUCENE-9574 > URL: https://issues.apache.org/jira/browse/LUCENE-9574 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: Gus Heck >Assignee: Gus Heck >Priority: Major > Time Spent: 8.5h > Remaining Estimate: 0h > > (Breaking this off of SOLR-14597 for independent review) > A filter that tests flags on tokens vs a bitmask and drops tokens that have > all specified flags. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
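A hedged sketch of the kind of filter described above (class and variable names are illustrative, not the committed code): a FilteringTokenFilter that drops a token when its FlagsAttribute carries every bit of the configured bitmask.

```java
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;

/** Sketch: drops tokens whose flags contain every bit of {@code dropFlags}. */
public final class DropByFlagsFilterSketch extends FilteringTokenFilter {
  private final int dropFlags;
  private final FlagsAttribute flagsAtt = addAttribute(FlagsAttribute.class);

  public DropByFlagsFilterSketch(TokenStream in, int dropFlags) {
    super(in);
    this.dropFlags = dropFlags;
  }

  @Override
  protected boolean accept() {
    // Keep the token unless all of the requested flag bits are set on it.
    return (flagsAtt.getFlags() & dropFlags) != dropFlags;
  }
}
```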
[jira] [Resolved] (LUCENE-9574) Add a token filter to drop tokens based on flags.
[ https://issues.apache.org/jira/browse/LUCENE-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gus Heck resolved LUCENE-9574. -- Resolution: Implemented > Add a token filter to drop tokens based on flags. > - > > Key: LUCENE-9574 > URL: https://issues.apache.org/jira/browse/LUCENE-9574 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: Gus Heck >Assignee: Gus Heck >Priority: Major > Time Spent: 8.5h > Remaining Estimate: 0h > > (Breaking this off of SOLR-14597 for independent review) > A filter that tests flags on tokens vs a bitmask and drops tokens that have > all specified flags. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] gus-asf opened a new pull request #2484: LUCENE-9574 CHANGES.txt entry
gus-asf opened a new pull request #2484: URL: https://github.com/apache/lucene-solr/pull/2484 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] gus-asf merged pull request #2484: LUCENE-9574 CHANGES.txt entry
gus-asf merged pull request #2484: URL: https://github.com/apache/lucene-solr/pull/2484 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9574) Add a token filter to drop tokens based on flags.
[ https://issues.apache.org/jira/browse/LUCENE-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334425#comment-17334425 ] ASF subversion and git services commented on LUCENE-9574: - Commit 958b9f5850a4d2954e6eaa081abf46735aea5645 in lucene-solr's branch refs/heads/branch_8x from Gus Heck [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=958b9f5 ] LUCENE-9574 CHANGES.txt entry (#2484) > Add a token filter to drop tokens based on flags. > - > > Key: LUCENE-9574 > URL: https://issues.apache.org/jira/browse/LUCENE-9574 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: Gus Heck >Assignee: Gus Heck >Priority: Major > Time Spent: 8h 50m > Remaining Estimate: 0h > > (Breaking this off of SOLR-14597 for independent review) > A filter that tests flags on tokens vs a bitmask and drops tokens that have > all specified flags. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9574) Add a token filter to drop tokens based on flags.
[ https://issues.apache.org/jira/browse/LUCENE-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334429#comment-17334429 ] ASF subversion and git services commented on LUCENE-9574: - Commit 0c33e621f9b9da18a996a45bde6ef59e97150f23 in lucene's branch refs/heads/main from Gus Heck [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0c33e62 ] LUCENE-9574 adjust changes entry > Add a token filter to drop tokens based on flags. > - > > Key: LUCENE-9574 > URL: https://issues.apache.org/jira/browse/LUCENE-9574 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: Gus Heck >Assignee: Gus Heck >Priority: Major > Time Spent: 8h 50m > Remaining Estimate: 0h > > (Breaking this off of SOLR-14597 for independent review) > A filter that tests flags on tokens vs a bitmask and drops tokens that have > all specified flags. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9574) Add a token filter to drop tokens based on flags.
[ https://issues.apache.org/jira/browse/LUCENE-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gus Heck updated LUCENE-9574: - Fix Version/s: 8.9 > Add a token filter to drop tokens based on flags. > - > > Key: LUCENE-9574 > URL: https://issues.apache.org/jira/browse/LUCENE-9574 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: Gus Heck >Assignee: Gus Heck >Priority: Major > Fix For: 8.9 > > Time Spent: 8h 50m > Remaining Estimate: 0h > > (Breaking this off of SOLR-14597 for independent review) > A filter that tests flags on tokens vs a bitmask and drops tokens that have > all specified flags. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org