[GitHub] [lucene-jira-archive] mikemccand commented on issue #61: Should we carry over Jira "labels"?
mikemccand commented on issue #61: URL: https://github.com/apache/lucene-jira-archive/issues/61#issuecomment-1192414741 > Personally, I don't think we should bloat issue labels in GitHub... should we port all Jira "Labels" to GitHub labels? Well, I don't think it is our position (too much lol?) to decide what is bloat and what is not? I myself do not use Jira labels much, but others seem to. I feel we should port what is in Jira as best we can (as GitHub's metadata model and import API allow) and not judge what is bloat or not. Or do you think somehow all the labels in Jira might actually be problematic for GitHub? Like too high cardinality or something? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mikemccand commented on issue #59: Module label is sometimes missing?
mikemccand commented on issue #59: URL: https://github.com/apache/lucene-jira-archive/issues/59#issuecomment-1192415309 > This is a bug (typo) in the label mapping; I'll fix this. Oooh! Awesome, thanks for tracking it down!!
[GitHub] [lucene-jira-archive] mikemccand commented on issue #59: Module label is sometimes missing?
mikemccand commented on issue #59: URL: https://github.com/apache/lucene-jira-archive/issues/59#issuecomment-1192416865 Could we maybe make this map picky/brittle (throw an exception during conversion if we come across a component/module that isn't in this map)? It would catch such mistakes sooner I think.
[GitHub] [lucene-jira-archive] mikemccand commented on issue #58: Errors setting assignee when running `import_github_issues.py`
mikemccand commented on issue #58: URL: https://github.com/apache/lucene-jira-archive/issues/58#issuecomment-1192417612 > You cannot assign accounts that have no push access to the repository. > This is the reason I invited you to my test repository in #8. Aha! OK thanks, so this is a non-issue. I'll close it.
[GitHub] [lucene-jira-archive] mikemccand closed issue #58: Errors setting assignee when running `import_github_issues.py`
mikemccand closed issue #58: Errors setting assignee when running `import_github_issues.py` URL: https://github.com/apache/lucene-jira-archive/issues/58
[GitHub] [lucene-jira-archive] mikemccand commented on issue #59: Module label is sometimes missing?
mikemccand commented on issue #59: URL: https://github.com/apache/lucene-jira-archive/issues/59#issuecomment-1192419660 I'm trying to find a way in our Jira instance to see all labels Lucene's issues use, to get a global sense of label usage. But the closest I can come to is the autosuggest by label, and there I think it might be all labels across all Apache projects since there seem to be way too many labels that would not usually apply to Lucene (C++ etc.) ;)
[GitHub] [lucene-jira-archive] mocobeta commented on issue #59: Module label is sometimes missing?
mocobeta commented on issue #59: URL: https://github.com/apache/lucene-jira-archive/issues/59#issuecomment-1192434735 > Could we maybe make this map picky/brittle (throw an exception during conversion if we come across a component/module that isn't in this map)? Yes, I think it'd be better/safe to raise an exception if there is no corresponding key for Jira's Component in the map.
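The "picky/brittle" map discussed in issue #59 could look roughly like the sketch below. This is only an illustration of the idea, not the conversion script's actual code: the mapping entries, `COMPONENT_TO_LABEL`, `UnknownComponentError`, `component_to_label`, and `labels_for_issue` are all hypothetical names made up for this example.

```python
# Hypothetical mapping from Jira component names to GitHub labels.
# The entries are illustrative only, not the real table used by the conversion.
COMPONENT_TO_LABEL = {
    "core/index": "module:core/index",
    "core/search": "module:core/search",
    "modules/analysis": "module:analysis",
}


class UnknownComponentError(KeyError):
    """Raised when a Jira component has no entry in the mapping."""


def component_to_label(component: str) -> str:
    """Return the GitHub label for a Jira component, failing loudly if unmapped."""
    try:
        return COMPONENT_TO_LABEL[component]
    except KeyError:
        # Surface the offending name instead of silently dropping the label.
        raise UnknownComponentError(
            f"No GitHub label mapped for Jira component {component!r}"
        ) from None


def labels_for_issue(components: list[str]) -> list[str]:
    """Convert all components of one issue; any unmapped name aborts the conversion."""
    return [component_to_label(c) for c in components]
```

With a lenient lookup such as `dict.get`, a typo in a map key just produces an issue with a missing label; with the strict lookup, the first affected issue stops the run and names the component that needs a mapping.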
[GitHub] [lucene] iverase commented on a diff in pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape
iverase commented on code in PR #1017: URL: https://github.com/apache/lucene/pull/1017#discussion_r927333254 ## lucene/core/src/java/org/apache/lucene/document/ShapeDocValuesField.java: ## @@ -0,0 +1,896 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.document; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Comparator; +import java.util.List; +import org.apache.lucene.analysis.Analyzer; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.document.ShapeField.DecodedTriangle.TYPE; +import org.apache.lucene.document.ShapeField.QueryRelation; +import org.apache.lucene.document.SpatialQuery.EncodedRectangle; +import org.apache.lucene.index.DocValuesType; +import org.apache.lucene.index.IndexableFieldType; +import org.apache.lucene.index.PointValues.Relation; +import org.apache.lucene.search.Query; +import org.apache.lucene.store.ByteArrayDataInput; +import org.apache.lucene.store.ByteBuffersDataOutput; +import org.apache.lucene.store.DataInput; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; + +/** A doc values field representation for {@link LatLonShape} and {@link XYShape} */ +public final class ShapeDocValuesField extends Field { + private final ShapeComparator shapeComparator; + + private static final FieldType FIELD_TYPE = new FieldType(); + + static { +FIELD_TYPE.setDocValuesType(DocValuesType.BINARY); +FIELD_TYPE.setOmitNorms(true); +FIELD_TYPE.freeze(); + } + + /** + * Creates a {@ShapeDocValueField} instance from a shape tessellation + * + * @param name The Field Name (must not be null) + * @param tessellation The tessellation (must not be null) + */ + ShapeDocValuesField(String name, List tessellation) { +super(name, FIELD_TYPE); +BytesRef b = computeBinaryValue(tessellation); +this.fieldsData = b; +try { + this.shapeComparator = new ShapeComparator(b); +} catch (IOException e) { + throw new IllegalArgumentException("unable to read binary shape doc value field. 
", e); +} + } + + /** Creates a {@code ShapeDocValue} field from a given serialized value */ + ShapeDocValuesField(String name, BytesRef binaryValue) { +super(name, FIELD_TYPE); +this.fieldsData = binaryValue; +try { + this.shapeComparator = new ShapeComparator(binaryValue); +} catch (IOException e) { + throw new IllegalArgumentException("unable to read binary shape doc value field. ", e); +} + } + + /** The name of the field */ + @Override + public String name() { +return name; + } + + /** Gets the {@code IndexableFieldType} for this ShapeDocValue field */ + @Override + public IndexableFieldType fieldType() { +return FIELD_TYPE; + } + + /** Currently there is no string representation for the ShapeDocValueField */ + @Override + public String stringValue() { +return null; + } + + /** TokenStreams are not yet supported */ + @Override + public TokenStream tokenStream(Analyzer analyzer, TokenStream reuse) { +return null; + } + + /** create a shape docvalue field from indexable fields */ + public static ShapeDocValuesField createDocValueField(String fieldName, Field[] indexableFields) { +ArrayList tess = new ArrayList<>(indexableFields.length); +final byte[] scratch = new byte[7 * Integer.BYTES]; +for (Field f : indexableFields) { + BytesRef br = f.binaryValue(); + assert br.length == 7 * ShapeField.BYTES; + System.arraycopy(br.bytes, br.offset, scratch, 0, 7 * ShapeField.BYTES); + ShapeField.DecodedTriangle t = new ShapeField.DecodedTriangle(); + ShapeField.decodeTriangle(scratch, t); + tess.add(t); +} +return new ShapeDocValuesField(fieldName, tess); + } + + /** Returns the number of terms (tessellated triangles) for this shape */ + public int numberOfTerms() { +return shapeComparator.numberOfTerms(); + } + + /** Creates a geometry query for shape docvalues */ + public static Query newGeometryQuery( + final String field, final QueryRelation relation, Object... geometries) { +return null; +// TODO +// return new ShapeDocValuesQuery(field, relation,
[GitHub] [lucene-jira-archive] mikemccand commented on issue #59: Module label is sometimes missing?
mikemccand commented on issue #59: URL: https://github.com/apache/lucene-jira-archive/issues/59#issuecomment-1192483254 > I'm trying to find a way in our Jira instance to see all labels Lucene's issues use, to get a global sense of label usage. But the closest I can come to is the autosuggest by label, and there I think it might be all labels across all Apache projects since there seem to be way too many labels that would not usually apply to Lucene (C++ etc.) ;) Woops, wrong issue ;)
[GitHub] [lucene-jira-archive] mikemccand commented on issue #61: Should we carry over Jira "labels"?
mikemccand commented on issue #61: URL: https://github.com/apache/lucene-jira-archive/issues/61#issuecomment-1192483422 I'm trying to find a way in our Jira instance to see all labels Lucene's issues use, to get a global sense of label usage. But the closest I can come to is the autosuggest by label, and there I think it might be all labels across all Apache projects since there seem to be way too many labels that would not usually apply to Lucene (C++ etc.) ;)
[GitHub] [lucene] mayya-sharipova merged pull request #1041: Create Lucene94 Codec and move Lucene92 to backwards_codecs
mayya-sharipova merged PR #1041: URL: https://github.com/apache/lucene/pull/1041
[GitHub] [lucene] mayya-sharipova merged pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova merged PR #992: URL: https://github.com/apache/lucene/pull/992
[jira] [Commented] (LUCENE-10592) Should we build HNSW graph on the fly during indexing
[ https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570081#comment-17570081 ] ASF subversion and git services commented on LUCENE-10592: -- Commit ba4bc0427146669ffd1c41fc0151db33e5a5be33 in lucene's branch refs/heads/main from Mayya Sharipova [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ba4bc042714 ] LUCENE-10592 Build HNSW Graph on indexing (#992) Currently, when indexing knn vectors, we buffer them in memory and on flush during a segment construction we build an HNSW graph. As building an HNSW graph is very expensive, this makes flush operation take a lot of time. This also makes overall indexing performance quite unpredictable – some indexing operations return almost instantly while others that trigger flush take a lot of time. This happens because flushes are unpredictable and triggered by memory used, presence of concurrent searches etc. Building an HNSW graph as we index documents avoids these problems, as the load of HNSW graph construction is spread evenly during indexing. Co-authored-by: Adrien Grand > Should we build HNSW graph on the fly during indexing > - > > Key: LUCENE-10592 > URL: https://issues.apache.org/jira/browse/LUCENE-10592 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Mayya Sharipova >Assignee: Mayya Sharipova >Priority: Minor > Time Spent: 7h 40m > Remaining Estimate: 0h > > Currently, when we index vectors for KnnVectorField, we buffer those vectors > in memory and on flush during a segment construction we build an HNSW graph. > As building an HNSW graph is very expensive, this makes flush operation take > a lot of time. This also makes overall indexing performance quite > unpredictable (as the number of flushes is defined by memory used, and the > presence of concurrent searches), e.g. some indexing operations return almost > instantly while others that trigger flush take a lot of time. 
> Building an HNSW graph on the fly as we index vectors allows us to avoid this > problem, and spread the load of HNSW graph construction evenly during indexing. > This will also supersede LUCENE-10194 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mayya-sharipova opened a new pull request, #1043: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova opened a new pull request, #1043: URL: https://github.com/apache/lucene/pull/1043 Currently, when indexing knn vectors, we buffer them in memory and on flush during a segment construction we build an HNSW graph. As building an HNSW graph is very expensive, this makes flush operation take a lot of time. This also makes overall indexing performance quite unpredictable – some indexing operations return almost instantly while others that trigger flush take a lot of time. This happens because flushes are unpredictable and triggered by memory used, presence of concurrent searches etc. Building an HNSW graph as we index vectors avoids these problems, as the load of HNSW graph construction is spread evenly during indexing. Backport for #992
[GitHub] [lucene] mayya-sharipova merged pull request #1043: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova merged PR #1043: URL: https://github.com/apache/lucene/pull/1043
[jira] [Commented] (LUCENE-10592) Should we build HNSW graph on the fly during indexing
[ https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570098#comment-17570098 ] ASF subversion and git services commented on LUCENE-10592: -- Commit a65a41855c7f7e93a5852d8af34d37fa01e0972b in lucene's branch refs/heads/branch_9x from Mayya Sharipova [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a65a41855c7 ] LUCENE-10592 Build HNSW Graph on indexing (#1043) Currently, when indexing knn vectors, we buffer them in memory and on flush during a segment construction we build an HNSW graph. As building an HNSW graph is very expensive, this makes flush operation take a lot of time. This also makes overall indexing performance quite unpredictable – some indexing operations return almost instantly while others that trigger flush take a lot of time. This happens because flushes are unpredictable and triggered by memory used, presence of concurrent searches etc. Building an HNSW graph as we index documents avoids these problems, as the load of HNSW graph construction is spread evenly during indexing. Co-authored-by: Adrien Grand > Should we build HNSW graph on the fly during indexing > - > > Key: LUCENE-10592 > URL: https://issues.apache.org/jira/browse/LUCENE-10592 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Mayya Sharipova >Assignee: Mayya Sharipova >Priority: Minor > Time Spent: 8h > Remaining Estimate: 0h > > Currently, when we index vectors for KnnVectorField, we buffer those vectors > in memory and on flush during a segment construction we build an HNSW graph. > As building an HNSW graph is very expensive, this makes flush operation take > a lot of time. This also makes overall indexing performance quite > unpredictable (as the number of flushes is defined by memory used, and the > presence of concurrent searches), e.g. some indexing operations return almost > instantly while others that trigger flush take a lot of time. 
> Building an HNSW graph on the fly as we index vectors allows us to avoid this > problem, and spread the load of HNSW graph construction evenly during indexing. > This will also supersede LUCENE-10194
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570111#comment-17570111 ] Julie Tibshirani commented on LUCENE-10404: --- Those numbers look good! Is my understanding right that these experiments use k=10, and fanout = 0 and 50? Maybe we could also try with a high fanout (like 100 or 500) to double-check the case when we need to visit a larger number of nodes. > Use hash set for visited nodes in HNSW search? > -- > > Key: LUCENE-10404 > URL: https://issues.apache.org/jira/browse/LUCENE-10404 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Minor > > While searching each layer, HNSW tracks the nodes it has already visited > using a BitSet. We could look into using something like IntHashSet instead. I > tried out the idea quickly by switching to IntIntHashMap (which has already > been copied from hppc) and saw an improvement in index performance. > *Baseline:* 760896 msec to write vectors > *Using IntIntHashMap:* 733017 msec to write vectors > I noticed search performance actually got a little bit worse with the change > -- that is something to look into. > For background, it's good to be aware that HNSW can visit a lot of nodes. For > example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search > visits ~1000 - 15,000 docs depending on the recall. This number can increase > when searching with deleted docs, especially if you hit a "pathological" case > where the deleted docs happen to be closest to the query vector.
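To make the trade-off in LUCENE-10404 concrete, here is a minimal Python sketch of the two visited-node strategies behind one small interface. It is not Lucene's Java code and not an HNSW search: `BitSetVisited`, `HashSetVisited`, and `count_visited` are hypothetical names, and the traversal is a plain reachability walk that only shows where the visited set is consulted.

```python
class BitSetVisited:
    """Dense visited set: one bit per node in the whole graph."""

    def __init__(self, num_nodes: int) -> None:
        self._bits = bytearray((num_nodes + 7) // 8)

    def add(self, node: int) -> bool:
        """Mark node as visited; return True only if it was not visited before."""
        byte, mask = node >> 3, 1 << (node & 7)
        if self._bits[byte] & mask:
            return False
        self._bits[byte] |= mask
        return True

    def clear(self) -> None:
        # Cost is proportional to the graph size, not to the number of visits.
        self._bits = bytearray(len(self._bits))


class HashSetVisited:
    """Sparse visited set: memory grows with the number of visited nodes."""

    def __init__(self) -> None:
        self._nodes: set[int] = set()

    def add(self, node: int) -> bool:
        """Mark node as visited; return True only if it was not visited before."""
        if node in self._nodes:
            return False
        self._nodes.add(node)
        return True

    def clear(self) -> None:
        self._nodes.clear()


def count_visited(neighbors: dict[int, list[int]], entry: int, visited) -> int:
    """Walk everything reachable from entry, using `visited` to skip repeats."""
    visited.add(entry)
    stack = [entry]
    count = 1
    while stack:
        node = stack.pop()
        for nb in neighbors[node]:
            if visited.add(nb):  # first time we see this neighbor
                stack.append(nb)
                count += 1
    return count
```

The dense variant pays for allocation and clearing in proportion to the total number of nodes, while the sparse variant pays per visited node plus hashing overhead. That is consistent with the question raised in the issue: which one wins should depend on how many nodes a search actually visits relative to the graph size.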
[jira] [Commented] (LUCENE-10592) Should we build HNSW graph on the fly during indexing
[ https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570112#comment-17570112 ] ASF subversion and git services commented on LUCENE-10592: -- Commit bd06cebfc2815bb508314ed8a4215e9da7f36de6 in lucene's branch refs/heads/main from Mayya Sharipova [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=bd06cebfc28 ] Add change log for LUCENE-10592 > Should we build HNSW graph on the fly during indexing > - > > Key: LUCENE-10592 > URL: https://issues.apache.org/jira/browse/LUCENE-10592 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Mayya Sharipova >Assignee: Mayya Sharipova >Priority: Minor > Time Spent: 8h > Remaining Estimate: 0h > > Currently, when we index vectors for KnnVectorField, we buffer those vectors > in memory and on flush during a segment construction we build an HNSW graph. > As building an HNSW graph is very expensive, this makes flush operation take > a lot of time. This also makes overall indexing performance quite > unpredictable (as the number of flushes is defined by memory used, and the > presence of concurrent searches), e.g. some indexing operations return almost > instantly while others that trigger flush take a lot of time. > Building an HNSW graph on the fly as we index vectors allows us to avoid this > problem, and spread the load of HNSW graph construction evenly during indexing. > This will also supersede LUCENE-10194
[jira] [Closed] (LUCENE-10592) Should we build HNSW graph on the fly during indexing
[ https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-10592. > Should we build HNSW graph on the fly during indexing > - > > Key: LUCENE-10592 > URL: https://issues.apache.org/jira/browse/LUCENE-10592 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Mayya Sharipova >Assignee: Mayya Sharipova >Priority: Minor > Fix For: 9.4 > > Time Spent: 8h > Remaining Estimate: 0h > > Currently, when we index vectors for KnnVectorField, we buffer those vectors > in memory and on flush during a segment construction we build an HNSW graph. > As building an HNSW graph is very expensive, this makes flush operation take > a lot of time. This also makes overall indexing performance quite > unpredictable (as the number of flushes is defined by memory used, and the > presence of concurrent searches), e.g. some indexing operations return almost > instantly while others that trigger flush take a lot of time. > Building an HNSW graph on the fly as we index vectors allows us to avoid this > problem, and spread the load of HNSW graph construction evenly during indexing. > This will also supersede LUCENE-10194
[jira] [Resolved] (LUCENE-10592) Should we build HNSW graph on the fly during indexing
[ https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova resolved LUCENE-10592. -- Fix Version/s: 9.4 Resolution: Fixed > Should we build HNSW graph on the fly during indexing > - > > Key: LUCENE-10592 > URL: https://issues.apache.org/jira/browse/LUCENE-10592 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Mayya Sharipova >Assignee: Mayya Sharipova >Priority: Minor > Fix For: 9.4 > > Time Spent: 8h > Remaining Estimate: 0h > > Currently, when we index vectors for KnnVectorField, we buffer those vectors > in memory and on flush during a segment construction we build an HNSW graph. > As building an HNSW graph is very expensive, this makes flush operation take > a lot of time. This also makes overall indexing performance quite > unpredictable (as the number of flushes is defined by memory used, and the > presence of concurrent searches), e.g. some indexing operations return almost > instantly while others that trigger flush take a lot of time. > Building an HNSW graph on the fly as we index vectors allows us to avoid this > problem, and spread the load of HNSW graph construction evenly during indexing. > This will also supersede LUCENE-10194
[jira] [Reopened] (LUCENE-10659) Fix random TestDisiPriorityQueue bug
[ https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller reopened LUCENE-10659: -- Assignee: Greg Miller There's still an issue with the test. Tripped it again last night. Working on a fix now. Let's block 9.3 until this fix is in. PR will be up shortly. > Fix random TestDisiPriorityQueue bug > > > Key: LUCENE-10659 > URL: https://issues.apache.org/jira/browse/LUCENE-10659 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 9.3 >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Blocker > Fix For: 9.3 > > > A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly > trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we > should roll it into the 9.3 release. I'll prepare a PR, but raising it here > for visibility.
[GitHub] [lucene] nknize commented on a diff in pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape
nknize commented on code in PR #1017: URL: https://github.com/apache/lucene/pull/1017#discussion_r928013152 ## lucene/core/src/java/org/apache/lucene/document/ShapeDocValuesField.java: ## @@ -0,0 +1,896 @@
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570195#comment-17570195 ]

Michael Sokolov commented on LUCENE-10404:

The default `topK` in KnnGraphTester is 100, so these test runs maintain result queues of 100 or 150 when searching. During indexing the queue size is driven by beamWidth, and 32/64 is lower than is typical, I think. Still, I think it's encouraging that we see gains in both searching (when the queue size is 100-150) and indexing (when it is 32-64). I won't be able to run more tests for a few days, but I agree that it would be interesting to see how the gains correlate with the queue sizes. I was just motivated to get a quick look! I will run some more exhaustive tests next week.

> Use hash set for visited nodes in HNSW search?
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Julie Tibshirani
> Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited
> using a BitSet. We could look into using something like IntHashSet instead. I
> tried out the idea quickly by switching to IntIntHashMap (which has already
> been copied from hppc) and saw an improvement in index performance.
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change;
> that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search
> visits ~1,000 to 15,000 docs depending on the recall. This number can increase
> when searching with deleted docs, especially if you hit a "pathological" case
> where the deleted docs happen to be closest to the query vector.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
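To make the trade-off discussed in LUCENE-10404 concrete, here is a minimal sketch of the two ways to track visited nodes during a graph walk. It is not Lucene's HNSW code: it uses `java.util.BitSet` and `java.util.HashSet` as stand-ins for Lucene's bit set and the hppc-derived hash structures, a plain breadth-first walk instead of HNSW's best-first search, and the class and method names (`VisitedSetSketch`, `bitSetTracker`, `hashSetTracker`, `countReachable`) are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntPredicate;

public class VisitedSetSketch {

  /** Dense tracker: memory is sized by the total number of nodes in the graph. */
  public static IntPredicate bitSetTracker(int numNodes) {
    BitSet visited = new BitSet(numNodes);
    return node -> {
      if (visited.get(node)) {
        return false; // already seen
      }
      visited.set(node);
      return true; // first visit
    };
  }

  /** Sparse tracker: memory grows only with the nodes actually visited. */
  public static IntPredicate hashSetTracker() {
    Set<Integer> visited = new HashSet<>();
    // Set.add returns true only when the element was not present before.
    return node -> visited.add(node);
  }

  /** Walks the graph from entry and returns the number of distinct nodes reached. */
  public static int countReachable(int[][] neighbors, int entry, IntPredicate firstVisit) {
    ArrayDeque<Integer> queue = new ArrayDeque<>();
    int count = 0;
    if (firstVisit.test(entry)) {
      queue.add(entry);
      count++;
    }
    while (!queue.isEmpty()) {
      int node = queue.poll();
      for (int next : neighbors[node]) {
        if (firstVisit.test(next)) {
          queue.add(next);
          count++;
        }
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Tiny hypothetical graph: nodes 0-3 are connected, node 4 is isolated.
    int[][] neighbors = {{1, 2}, {0, 3}, {0, 3}, {1, 2}, {}};
    System.out.println(countReachable(neighbors, 0, bitSetTracker(neighbors.length)));
    System.out.println(countReachable(neighbors, 0, hashSetTracker()));
  }
}
```

Both trackers give the same traversal; the difference is cost. The bit set has to be allocated (or cleared) for every node in the graph, while the hash set only pays for the nodes touched, which is why the ratio of visited nodes to total nodes matters for which one wins.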
[jira] [Commented] (LUCENE-10659) Fix random TestDisiPriorityQueue bug
[ https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570209#comment-17570209 ]

Greg Miller commented on LUCENE-10659:

Another fix here: https://github.com/apache/lucene/pull/1044

> Fix random TestDisiPriorityQueue bug
>
> Key: LUCENE-10659
> URL: https://issues.apache.org/jira/browse/LUCENE-10659
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 9.3
> Reporter: Greg Miller
> Assignee: Greg Miller
> Priority: Blocker
> Fix For: 9.3
>
> A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly
> trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we
> should roll it into the 9.3 release. I'll prepare a PR, but raising it here
> for visibility.
[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #64: Cover all Jira components in module label mapping
mocobeta opened a new pull request, #64: URL: https://github.com/apache/lucene-jira-archive/pull/64

Close #59

- Add missing module labels for Jira "Components". If there is no suitable module label, map the component to `None` (some obsolete Components are no longer used).
- Log an error if a component has no corresponding entry in the module label mapping.

I ran the conversion script over the whole Jira dump and confirmed that all Components are covered.
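The mapping behaviour described in this pull request can be sketched as follows. This is not the conversion script itself: it is a hypothetical Java illustration in which the component names, the label strings, and the names `ComponentLabelSketch` and `labelsFor` are all assumptions, `Optional.empty()` stands in for the `None` mapping, and errors are collected in a list rather than written to a log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ComponentLabelSketch {

  // Hypothetical mapping: every known component has an entry.
  // An empty Optional means "known, but deliberately has no module label".
  static final Map<String, Optional<String>> MAPPING = new HashMap<>();

  static {
    MAPPING.put("example/search", Optional.of("module:example/search"));
    MAPPING.put("example/obsolete", Optional.empty());
  }

  /**
   * Returns the module labels for the given components. Components that are
   * missing from the mapping entirely are reported in errors.
   */
  public static List<String> labelsFor(List<String> components, List<String> errors) {
    List<String> labels = new ArrayList<>();
    for (String component : components) {
      Optional<String> label = MAPPING.get(component);
      if (label == null) {
        // Not covered by the mapping at all: surface it instead of dropping it silently.
        errors.add("no module label mapping for component: " + component);
      } else {
        label.ifPresent(labels::add);
      }
    }
    return labels;
  }

  public static void main(String[] args) {
    List<String> errors = new ArrayList<>();
    List<String> labels =
        labelsFor(List.of("example/search", "example/obsolete", "example/unknown"), errors);
    System.out.println(labels);
    System.out.println(errors);
  }
}
```

Keeping "mapped to nothing" distinct from "absent from the mapping" is what lets a full run over the dump confirm coverage: an empty error list means every component was considered.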
[GitHub] [lucene-jira-archive] mocobeta merged pull request #64: Cover all Jira components in module label mapping
mocobeta merged PR #64: URL: https://github.com/apache/lucene-jira-archive/pull/64
[GitHub] [lucene-jira-archive] mocobeta closed issue #59: Module label is sometimes missing?
mocobeta closed issue #59: Module label is sometimes missing? URL: https://github.com/apache/lucene-jira-archive/issues/59