date:20220722

[GitHub] [lucene-jira-archive] mikemccand commented on issue #61: Should we carry over Jira "labels"?

2022-07-22 Thread GitBox



mikemccand commented on issue #61:
URL: 
https://github.com/apache/lucene-jira-archive/issues/61#issuecomment-1192414741

   > Personally, I don't think we should bloat issue labels in GitHub... should 
we port all Jira "Labels" to GitHub labels?
   
   Well, I don't think it is our position (too much lol?) to decide what is 
bloat and what is not?
   
   I myself do not use Jira labels much, but others seem to.  I feel we should 
port what is in Jira as best we can (as GitHub's metadata model and import API 
allow) and not judge what is bloat or not.
   
   Or do you think somehow all the labels in Jira might actually be problematic 
for GitHub?  Like too high cardinality or something?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene-jira-archive] mikemccand commented on issue #59: Module label is sometimes missing?

2022-07-22 Thread GitBox



mikemccand commented on issue #59:
URL: 
https://github.com/apache/lucene-jira-archive/issues/59#issuecomment-1192415309

   > This is a bug (typo) in the label mapping; I'll fix this.
   
   Oooh!  Awesome, thanks for tracking it down!!



[GitHub] [lucene-jira-archive] mikemccand commented on issue #59: Module label is sometimes missing?

2022-07-22 Thread GitBox



mikemccand commented on issue #59:
URL: 
https://github.com/apache/lucene-jira-archive/issues/59#issuecomment-1192416865

   Could we maybe make this map picky/brittle (throw an exception during 
conversion if we come across a component/module that isn't in this map)?  It 
would catch such mistakes sooner I think.



[GitHub] [lucene-jira-archive] mikemccand commented on issue #58: Errors setting assignee when running `import_github_issues.py`

2022-07-22 Thread GitBox



mikemccand commented on issue #58:
URL: 
https://github.com/apache/lucene-jira-archive/issues/58#issuecomment-1192417612

   > You cannot assign accounts that have no push access to the repository.
   > This is the reason I invited you to my test repository in #8.
   
   Aha!  OK thanks, so this is a non-issue.  I'll close it.



[GitHub] [lucene-jira-archive] mikemccand closed issue #58: Errors setting assignee when running `import_github_issues.py`

2022-07-22 Thread GitBox



mikemccand closed issue #58: Errors setting assignee when running 
`import_github_issues.py`
URL: https://github.com/apache/lucene-jira-archive/issues/58



[GitHub] [lucene-jira-archive] mikemccand commented on issue #59: Module label is sometimes missing?

2022-07-22 Thread GitBox



mikemccand commented on issue #59:
URL: 
https://github.com/apache/lucene-jira-archive/issues/59#issuecomment-1192419660

   I'm trying to find a way in our Jira instance to see all labels Lucene's 
issues use, to get a global sense of label usage.  But the closest I can come 
to is the autosuggest by label, and there I think it might be all labels across 
all Apache projects since there seem to be way too many labels that would not 
usually apply to Lucene (C++ etc.) ;)



[GitHub] [lucene-jira-archive] mocobeta commented on issue #59: Module label is sometimes missing?

2022-07-22 Thread GitBox



mocobeta commented on issue #59:
URL: 
https://github.com/apache/lucene-jira-archive/issues/59#issuecomment-1192434735

   > Could we maybe make this map picky/brittle (throw an exception during 
conversion if we come across a component/module that isn't in this map)?
   
   Yes, I think it'd be better/safe to raise an exception if there is no 
corresponding key for Jira's Component in the map.
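
To illustrate the fail-fast idea discussed above, here is a minimal sketch of a strict lookup that throws as soon as a Jira component has no mapping. It is written in Java purely for illustration (the conversion scripts in lucene-jira-archive are Python), and the class name, method name, and map entries are all hypothetical, not taken from the repository.

```java
import java.util.Map;

/** Hypothetical sketch: a strict lookup from Jira component names to GitHub labels. */
final class StrictComponentLabels {
  // Illustrative entries only (assumed); the real mapping lives in the conversion scripts.
  private static final Map<String, String> COMPONENT_TO_LABEL =
      Map.of(
          "core/index", "module:core/index",
          "core/search", "module:core/search",
          "modules/facet", "module:facet");

  /** Returns the label for a component, or fails fast if the component is unmapped. */
  static String labelFor(String jiraComponent) {
    String label = COMPONENT_TO_LABEL.get(jiraComponent);
    if (label == null) {
      // Stopping here surfaces a typo in the map immediately,
      // instead of silently producing an issue without its module label.
      throw new IllegalArgumentException("No label mapping for Jira component: " + jiraComponent);
    }
    return label;
  }

  public static void main(String[] args) {
    System.out.println(labelFor("core/index"));
    try {
      labelFor("core/serach"); // deliberate typo to show the failure mode
    } catch (IllegalArgumentException e) {
      System.out.println("conversion stopped: " + e.getMessage());
    }
  }
}
```

The trade-off is that one unknown component aborts the whole conversion run, which is the point here: the mistake is caught before any issue is imported with a missing label.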



[GitHub] [lucene] iverase commented on a diff in pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape

2022-07-22 Thread GitBox



iverase commented on code in PR #1017:
URL: https://github.com/apache/lucene/pull/1017#discussion_r927333254


##
lucene/core/src/java/org/apache/lucene/document/ShapeDocValuesField.java:
##
@@ -0,0 +1,896 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.document.ShapeField.DecodedTriangle.TYPE;
+import org.apache.lucene.document.ShapeField.QueryRelation;
+import org.apache.lucene.document.SpatialQuery.EncodedRectangle;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.IndexableFieldType;
+import org.apache.lucene.index.PointValues.Relation;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+
+/** A doc values field representation for {@link LatLonShape} and {@link XYShape} */
+public final class ShapeDocValuesField extends Field {
+  private final ShapeComparator shapeComparator;
+
+  private static final FieldType FIELD_TYPE = new FieldType();
+
+  static {
+    FIELD_TYPE.setDocValuesType(DocValuesType.BINARY);
+    FIELD_TYPE.setOmitNorms(true);
+    FIELD_TYPE.freeze();
+  }
+
+  /**
+   * Creates a {@ShapeDocValueField} instance from a shape tessellation
+   *
+   * @param name The Field Name (must not be null)
+   * @param tessellation The tessellation (must not be null)
+   */
+  ShapeDocValuesField(String name, List<ShapeField.DecodedTriangle> tessellation) {
+    super(name, FIELD_TYPE);
+    BytesRef b = computeBinaryValue(tessellation);
+    this.fieldsData = b;
+    try {
+      this.shapeComparator = new ShapeComparator(b);
+    } catch (IOException e) {
+      throw new IllegalArgumentException("unable to read binary shape doc value field. ", e);
+    }
+  }
+
+  /** Creates a {@code ShapeDocValue} field from a given serialized value */
+  ShapeDocValuesField(String name, BytesRef binaryValue) {
+    super(name, FIELD_TYPE);
+    this.fieldsData = binaryValue;
+    try {
+      this.shapeComparator = new ShapeComparator(binaryValue);
+    } catch (IOException e) {
+      throw new IllegalArgumentException("unable to read binary shape doc value field. ", e);
+    }
+  }
+
+  /** The name of the field */
+  @Override
+  public String name() {
+    return name;
+  }
+
+  /** Gets the {@code IndexableFieldType} for this ShapeDocValue field */
+  @Override
+  public IndexableFieldType fieldType() {
+    return FIELD_TYPE;
+  }
+
+  /** Currently there is no string representation for the ShapeDocValueField */
+  @Override
+  public String stringValue() {
+    return null;
+  }
+
+  /** TokenStreams are not yet supported */
+  @Override
+  public TokenStream tokenStream(Analyzer analyzer, TokenStream reuse) {
+    return null;
+  }
+
+  /** create a shape docvalue field from indexable fields */
+  public static ShapeDocValuesField createDocValueField(String fieldName, Field[] indexableFields) {
+    ArrayList<ShapeField.DecodedTriangle> tess = new ArrayList<>(indexableFields.length);
+    final byte[] scratch = new byte[7 * Integer.BYTES];
+    for (Field f : indexableFields) {
+      BytesRef br = f.binaryValue();
+      assert br.length == 7 * ShapeField.BYTES;
+      System.arraycopy(br.bytes, br.offset, scratch, 0, 7 * ShapeField.BYTES);
+      ShapeField.DecodedTriangle t = new ShapeField.DecodedTriangle();
+      ShapeField.decodeTriangle(scratch, t);
+      tess.add(t);
+    }
+    return new ShapeDocValuesField(fieldName, tess);
+  }
+
+  /** Returns the number of terms (tessellated triangles) for this shape */
+  public int numberOfTerms() {
+    return shapeComparator.numberOfTerms();
+  }
+
+  /** Creates a geometry query for shape docvalues */
+  public static Query newGeometryQuery(
+      final String field, final QueryRelation relation, Object... geometries) {
+    return null;
+    // TODO
+    //  return new ShapeDocValuesQuery(field, relation,

[GitHub] [lucene-jira-archive] mikemccand commented on issue #59: Module label is sometimes missing?

2022-07-22 Thread GitBox



mikemccand commented on issue #59:
URL: 
https://github.com/apache/lucene-jira-archive/issues/59#issuecomment-1192483254

   > I'm trying to find a way in our Jira instance to see all labels Lucene's 
issues use, to get a global sense of label usage. But the closest I can come to 
is the autosuggest by label, and there I think it might be all labels across 
all Apache projects since there seem to be way too many labels that would not 
usually apply to Lucene (C++ etc.) ;)
   
   Woops, wrong issue ;)



[GitHub] [lucene-jira-archive] mikemccand commented on issue #61: Should we carry over Jira "labels"?

2022-07-22 Thread GitBox



mikemccand commented on issue #61:
URL: 
https://github.com/apache/lucene-jira-archive/issues/61#issuecomment-1192483422

   I'm trying to find a way in our Jira instance to see all labels Lucene's 
issues use, to get a global sense of label usage. But the closest I can come to 
is the autosuggest by label, and there I think it might be all labels across 
all Apache projects since there seem to be way too many labels that would not 
usually apply to Lucene (C++ etc.) ;)



[GitHub] [lucene] mayya-sharipova merged pull request #1041: Create Lucene94 Codec and move Lucene92 to backwards_codecs

2022-07-22 Thread GitBox



mayya-sharipova merged PR #1041:
URL: https://github.com/apache/lucene/pull/1041



[GitHub] [lucene] mayya-sharipova merged pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-22 Thread GitBox



mayya-sharipova merged PR #992:
URL: https://github.com/apache/lucene/pull/992



[jira] [Commented] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

2022-07-22 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570081#comment-17570081
 ] 

ASF subversion and git services commented on LUCENE-10592:
--

Commit ba4bc0427146669ffd1c41fc0151db33e5a5be33 in lucene's branch 
refs/heads/main from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ba4bc042714 ]

LUCENE-10592 Build HNSW Graph on indexing (#992)

Currently, when indexing knn vectors, we buffer them in memory and
on flush during a segment construction we build an HNSW graph.
As building an HNSW graph is very expensive, this makes flush
operation take a lot of time. This also makes overall indexing
performance quite unpredictable – some indexing operations return
almost instantly while others that trigger flush take a lot of time.
This happens because flushes are unpredictable and triggered
by memory used, presence of concurrent searches etc.

Building an HNSW graph as we index documents avoids these problems,
as the load of HNSW graph construction is spread evenly during indexing.

Co-authored-by: Adrien Grand 

> Should we build HNSW graph on the fly during indexing
> -
>
> Key: LUCENE-10592
> URL: https://issues.apache.org/jira/browse/LUCENE-10592
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Currently, when we index vectors for KnnVectorField, we buffer those vectors 
> in memory, and on flush, during segment construction, we build an HNSW graph.  
> As building an HNSW graph is very expensive, this makes the flush operation take 
> a lot of time. This also makes overall indexing performance quite 
> unpredictable (as the number of flushes is determined by memory used and the 
> presence of concurrent searches), e.g. some indexing operations return almost 
> instantly while others that trigger a flush take a lot of time. 
> Building an HNSW graph on the fly as we index vectors avoids this 
> problem and spreads the load of HNSW graph construction evenly during indexing.
> This will also supersede LUCENE-10194



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [lucene] mayya-sharipova opened a new pull request, #1043: LUCENE-10592 Build HNSW Graph on indexing

2022-07-22 Thread GitBox



mayya-sharipova opened a new pull request, #1043:
URL: https://github.com/apache/lucene/pull/1043

   Currently, when indexing knn vectors, we buffer them in memory and
   on flush during a segment construction we build an HNSW graph.
   As building an HNSW graph is very expensive, this makes flush
   operation take a lot of time. This also makes overall indexing
   performance quite unpredictable – some indexing operations return
   almost instantly while others that trigger flush take a lot of time.
   This happens because flushes are unpredictable and triggered
   by memory used, presence of concurrent searches etc.
   
   Building an HNSW graph as we index vectors avoids these problems,
   as the load of HNSW graph construction is spread evenly during indexing.
   
   Backport for #992
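
The difference between the two strategies described above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyGraph` is a stand-in that counts one abstract work unit per insertion (real HNSW insertion runs a graph search and costs far more), and the class and method names are hypothetical rather than Lucene's writer API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch contrasting build-on-flush with build-on-index. */
final class GraphBuildTiming {

  /** Stand-in for an expensive graph; each insertion counts as one abstract work unit. */
  static final class ToyGraph {
    long work = 0;

    void insert(float[] vector) {
      work++; // assumption: unit cost per insertion, only to show where the cost lands
    }
  }

  /** Buffers vectors on add and builds the whole graph at flush time. */
  static final class BuildOnFlushWriter {
    private final List<float[]> buffer = new ArrayList<>();
    private final ToyGraph graph = new ToyGraph();

    /** Returns the graph work done by this call (none, the vector is only buffered). */
    long addValue(float[] vector) {
      buffer.add(vector);
      return 0;
    }

    /** Returns the graph work done by this call (all of it). */
    long flush() {
      long before = graph.work;
      for (float[] vector : buffer) {
        graph.insert(vector);
      }
      buffer.clear();
      return graph.work - before;
    }
  }

  /** Inserts each vector into the graph as it is indexed. */
  static final class BuildOnIndexWriter {
    private final ToyGraph graph = new ToyGraph();

    /** Returns the graph work done by this call (one insertion). */
    long addValue(float[] vector) {
      long before = graph.work;
      graph.insert(vector);
      return graph.work - before;
    }

    /** Returns the graph work done by this call (none, the graph already exists). */
    long flush() {
      return 0;
    }
  }

  public static void main(String[] args) {
    BuildOnFlushWriter onFlush = new BuildOnFlushWriter();
    BuildOnIndexWriter onIndex = new BuildOnIndexWriter();
    for (int i = 0; i < 1000; i++) {
      float[] vector = {i, i + 1};
      onFlush.addValue(vector);
      onIndex.addValue(vector);
    }
    System.out.println("work at flush, build-on-flush: " + onFlush.flush());
    System.out.println("work at flush, build-on-index: " + onIndex.flush());
  }
}
```

In this model the total work is the same for both writers; what changes is that the build-on-index writer pays a small, steady cost on every add, so no single call absorbs the whole graph construction.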



[GitHub] [lucene] mayya-sharipova merged pull request #1043: LUCENE-10592 Build HNSW Graph on indexing

2022-07-22 Thread GitBox



mayya-sharipova merged PR #1043:
URL: https://github.com/apache/lucene/pull/1043



[jira] [Commented] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

2022-07-22 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570098#comment-17570098
 ] 

ASF subversion and git services commented on LUCENE-10592:
--

Commit a65a41855c7f7e93a5852d8af34d37fa01e0972b in lucene's branch 
refs/heads/branch_9x from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a65a41855c7 ]

LUCENE-10592 Build HNSW Graph on indexing  (#1043)

Currently, when indexing knn vectors, we buffer them in memory and
on flush during a segment construction we build an HNSW graph.
As building an HNSW graph is very expensive, this makes flush
operation take a lot of time. This also makes overall indexing
performance quite unpredictable – some indexing operations return
almost instantly while others that trigger flush take a lot of time.
This happens because flushes are unpredictable and triggered
by memory used, presence of concurrent searches etc.

Building an HNSW graph as we index documents avoids these problems,
as the load of HNSW graph construction is spread evenly during indexing.

Co-authored-by: Adrien Grand 

> Should we build HNSW graph on the fly during indexing
> -
>
> Key: LUCENE-10592
> URL: https://issues.apache.org/jira/browse/LUCENE-10592
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> Currently, when we index vectors for KnnVectorField, we buffer those vectors 
> in memory, and on flush, during segment construction, we build an HNSW graph.  
> As building an HNSW graph is very expensive, this makes the flush operation take 
> a lot of time. This also makes overall indexing performance quite 
> unpredictable (as the number of flushes is determined by memory used and the 
> presence of concurrent searches), e.g. some indexing operations return almost 
> instantly while others that trigger a flush take a lot of time. 
> Building an HNSW graph on the fly as we index vectors avoids this 
> problem and spreads the load of HNSW graph construction evenly during indexing.
> This will also supersede LUCENE-10194




[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-22 Thread Julie Tibshirani (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570111#comment-17570111
 ] 

Julie Tibshirani commented on LUCENE-10404:
---

Those numbers look good! Is my understanding right that these experiments use 
k=10, and fanout = 0 and 50? Maybe we could also try with a high fanout (like 
100 or 500) to double-check the case when we need to visit a larger number of 
nodes.

> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted docs, especially if you hit a "pathological" case 
> where the deleted docs happen to be closest to the query vector.
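
The two ways of tracking visited nodes mentioned in the issue can be sketched side by side. This is a rough illustration only: it uses `java.util.BitSet` and `HashSet<Integer>` from the JDK as stand-ins for Lucene's bit set and the hppc-derived int hash structures, and the interface and class names are hypothetical.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of two visited-node trackers for a graph search. */
final class VisitedSets {

  interface Visited {
    /** Marks the node as visited; returns true if it had been visited before. */
    boolean getAndSet(int node);
  }

  /** Memory is proportional to the number of nodes in the graph. */
  static final class BitSetVisited implements Visited {
    private final BitSet bits;

    BitSetVisited(int numNodes) {
      this.bits = new BitSet(numNodes);
    }

    @Override
    public boolean getAndSet(int node) {
      boolean seen = bits.get(node);
      bits.set(node);
      return seen;
    }
  }

  /** Memory is proportional to the number of nodes actually visited. */
  static final class HashSetVisited implements Visited {
    private final Set<Integer> seen = new HashSet<>();

    @Override
    public boolean getAndSet(int node) {
      // add() returns false when the element was already present
      return !seen.add(node);
    }
  }

  /** Counts how many candidates are visited for the first time. */
  static int countDistinct(Visited visited, int[] candidates) {
    int distinct = 0;
    for (int node : candidates) {
      if (!visited.getAndSet(node)) {
        distinct++;
      }
    }
    return distinct;
  }

  public static void main(String[] args) {
    int[] candidates = {5, 9, 5, 1_000_000, 9};
    System.out.println(countDistinct(new BitSetVisited(1_200_000), candidates));
    System.out.println(countDistinct(new HashSetVisited(), candidates));
  }
}
```

Both trackers give the same answers; the design question is cost. A bit set must be sized (and cleared) for the whole graph even when a search touches only a few thousand nodes, while a hash set grows with the visited count but pays for hashing, and with `HashSet<Integer>` also for boxing, which a primitive int set would avoid.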




[jira] [Commented] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

2022-07-22 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570112#comment-17570112
 ] 

ASF subversion and git services commented on LUCENE-10592:
--

Commit bd06cebfc2815bb508314ed8a4215e9da7f36de6 in lucene's branch 
refs/heads/main from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=bd06cebfc28 ]

Add change log for LUCENE-10592






[jira] [Closed] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

2022-07-22 Thread Mayya Sharipova (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayya Sharipova closed LUCENE-10592.


> Should we build HNSW graph on the fly during indexing
> -
>
> Key: LUCENE-10592
> URL: https://issues.apache.org/jira/browse/LUCENE-10592
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.4
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> Currently, when we index vectors for KnnVectorField, we buffer those vectors 
> in memory, and on flush, during segment construction, we build an HNSW graph.  
> As building an HNSW graph is very expensive, this makes the flush operation take 
> a lot of time. This also makes overall indexing performance quite 
> unpredictable (as the number of flushes is determined by memory used and the 
> presence of concurrent searches), e.g. some indexing operations return almost 
> instantly while others that trigger a flush take a lot of time. 
> Building an HNSW graph on the fly as we index vectors avoids this 
> problem and spreads the load of HNSW graph construction evenly during indexing.
> This will also supersede LUCENE-10194




[jira] [Resolved] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

2022-07-22 Thread Mayya Sharipova (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayya Sharipova resolved LUCENE-10592.
--
Fix Version/s: 9.4
   Resolution: Fixed





[jira] [Reopened] (LUCENE-10659) Fix random TestDisiPriorityQueue bug

2022-07-22 Thread Greg Miller (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller reopened LUCENE-10659:
--
  Assignee: Greg Miller

There's still an issue with the test. Tripped it again last night. Working on a 
fix now. Let's block 9.3 until this fix is in. PR will be up shortly.

> Fix random TestDisiPriorityQueue bug
> 
>
> Key: LUCENE-10659
> URL: https://issues.apache.org/jira/browse/LUCENE-10659
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.3
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Blocker
> Fix For: 9.3
>
>
> A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly 
> trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we 
> should roll it into the 9.3 release. I'll prepare a PR, but raising it here 
> for visibility.




[GitHub] [lucene] nknize commented on a diff in pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape

2022-07-22 Thread GitBox



nknize commented on code in PR #1017:
URL: https://github.com/apache/lucene/pull/1017#discussion_r928013152


##
lucene/core/src/java/org/apache/lucene/document/ShapeDocValuesField.java:
##
@@ -0,0 +1,896 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.document.ShapeField.DecodedTriangle.TYPE;
+import org.apache.lucene.document.ShapeField.QueryRelation;
+import org.apache.lucene.document.SpatialQuery.EncodedRectangle;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.IndexableFieldType;
+import org.apache.lucene.index.PointValues.Relation;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+
+/** A doc values field representation for {@link LatLonShape} and {@link XYShape} */
+public final class ShapeDocValuesField extends Field {
+  private final ShapeComparator shapeComparator;
+
+  private static final FieldType FIELD_TYPE = new FieldType();
+
+  static {
+    FIELD_TYPE.setDocValuesType(DocValuesType.BINARY);
+    FIELD_TYPE.setOmitNorms(true);
+    FIELD_TYPE.freeze();
+  }
+
+  /**
+   * Creates a {@code ShapeDocValuesField} instance from a shape tessellation
+   *
+   * @param name The Field Name (must not be null)
+   * @param tessellation The tessellation (must not be null)
+   */
+  ShapeDocValuesField(String name, List<ShapeField.DecodedTriangle> tessellation) {
+    super(name, FIELD_TYPE);
+    BytesRef b = computeBinaryValue(tessellation);
+    this.fieldsData = b;
+    try {
+      this.shapeComparator = new ShapeComparator(b);
+    } catch (IOException e) {
+      throw new IllegalArgumentException("unable to read binary shape doc value field. ", e);
+    }
+  }
+
+  /** Creates a {@code ShapeDocValue} field from a given serialized value */
+  ShapeDocValuesField(String name, BytesRef binaryValue) {
+    super(name, FIELD_TYPE);
+    this.fieldsData = binaryValue;
+    try {
+      this.shapeComparator = new ShapeComparator(binaryValue);
+    } catch (IOException e) {
+      throw new IllegalArgumentException("unable to read binary shape doc value field. ", e);
+    }
+  }
+
+  /** The name of the field */
+  @Override
+  public String name() {
+    return name;
+  }
+
+  /** Gets the {@code IndexableFieldType} for this ShapeDocValue field */
+  @Override
+  public IndexableFieldType fieldType() {
+    return FIELD_TYPE;
+  }
+
+  /** Currently there is no string representation for the ShapeDocValueField */
+  @Override
+  public String stringValue() {
+    return null;
+  }
+
+  /** TokenStreams are not yet supported */
+  @Override
+  public TokenStream tokenStream(Analyzer analyzer, TokenStream reuse) {
+    return null;
+  }
+
+  /** create a shape docvalue field from indexable fields */
+  public static ShapeDocValuesField createDocValueField(String fieldName, Field[] indexableFields) {
+    ArrayList<ShapeField.DecodedTriangle> tess = new ArrayList<>(indexableFields.length);
+    final byte[] scratch = new byte[7 * Integer.BYTES];
+    for (Field f : indexableFields) {
+      BytesRef br = f.binaryValue();
+      assert br.length == 7 * ShapeField.BYTES;
+      System.arraycopy(br.bytes, br.offset, scratch, 0, 7 * ShapeField.BYTES);
+      ShapeField.DecodedTriangle t = new ShapeField.DecodedTriangle();
+      ShapeField.decodeTriangle(scratch, t);
+      tess.add(t);
+    }
+    return new ShapeDocValuesField(fieldName, tess);
+  }
+
+  /** Returns the number of terms (tessellated triangles) for this shape */
+  public int numberOfTerms() {
+    return shapeComparator.numberOfTerms();
+  }
+
+  /** Creates a geometry query for shape docvalues */
+  public static Query newGeometryQuery(
+      final String field, final QueryRelation relation, Object... geometries) {
+    return null;
+    // TODO
+    //  return new ShapeDocValuesQuery(field, relation,

[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-22 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570195#comment-17570195
 ] 

Michael Sokolov commented on LUCENE-10404:
--

The default `topK` in KnnGraphTester is 100, so these test runs are maintaining 
results queues of 100 or 150 (when searching). During indexing the queue size is 
driven by beamWidth, and 32/64 is lower than is typical, I think. Still, I think 
it's encouraging that we see gains both in searching (when the queue size is 
100-150) and in indexing (when it is 32-64).

I won't be able to run more tests for a few days, but I agree that it would be 
interesting to see how the gains correlate with the queue sizes. I was 
motivated to get a quick look, though! I will run some more exhaustive tests 
next week.

> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted docs, especially if you hit a "pathological" case 
> where the deleted docs happen to be closest to the query vector.
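
The trade-off described in the issue above can be sketched as follows. This is a minimal illustration, not Lucene's HNSW searcher: `VisitedSetSketch`, `VisitedSet`, `dense`, `sparse`, and `countDistinct` are hypothetical names, and `java.util.HashSet` stands in for a primitive `IntHashSet` (unlike hppc collections, it boxes every node id). The dense variant pays for one bit per node in the graph, while the sparse variant pays only for the nodes a search actually touches.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of two ways to track visited graph nodes; not Lucene code. */
public class VisitedSetSketch {

  /** Tracks which graph nodes a search has already seen. */
  public interface VisitedSet {
    /** Marks the node as visited; returns true only the first time the node is seen. */
    boolean markVisited(int node);

    /** Forgets all nodes so the instance can be reused for the next search. */
    void clear();
  }

  /** Dense variant: one bit per node, so memory scales with the size of the graph. */
  public static VisitedSet dense(int graphSize) {
    final BitSet bits = new BitSet(graphSize);
    return new VisitedSet() {
      @Override
      public boolean markVisited(int node) {
        if (bits.get(node)) {
          return false;
        }
        bits.set(node);
        return true;
      }

      @Override
      public void clear() {
        bits.clear();
      }
    };
  }

  /** Sparse variant: memory scales with the number of nodes actually visited. */
  public static VisitedSet sparse() {
    final Set<Integer> seen = new HashSet<>();
    return new VisitedSet() {
      @Override
      public boolean markVisited(int node) {
        return seen.add(node); // add() returns false if the node was already present
      }

      @Override
      public void clear() {
        seen.clear();
      }
    };
  }

  /** Counts how many candidate nodes are seen for the first time. */
  public static int countDistinct(VisitedSet visited, int[] candidates) {
    int count = 0;
    for (int node : candidates) {
      if (visited.markVisited(node)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    int[] candidates = {3, 17, 3, 999_999, 17, 42};
    // Both variants agree on the result; they differ only in memory and per-operation cost.
    System.out.println(countDistinct(dense(1_200_000), candidates));
    System.out.println(countDistinct(sparse(), candidates));
  }
}
```

Which variant is faster depends on how many nodes a search visits relative to the graph size, which is why the benchmark numbers in this thread matter more than the sketch.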




[jira] [Commented] (LUCENE-10659) Fix random TestDisiPriorityQueue bug

2022-07-22 Thread Greg Miller (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570209#comment-17570209
 ] 

Greg Miller commented on LUCENE-10659:
--

Another fix here: https://github.com/apache/lucene/pull/1044

> Fix random TestDisiPriorityQueue bug
> 
>
> Key: LUCENE-10659
> URL: https://issues.apache.org/jira/browse/LUCENE-10659
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.3
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Blocker
> Fix For: 9.3
>
>
> A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly 
> trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we 
> should roll it into the 9.3 release. I'll prepare a PR, but raising it here 
> for visibility.




[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #64: Cover all Jira components in module label mapping

2022-07-22 Thread GitBox



mocobeta opened a new pull request, #64:
URL: https://github.com/apache/lucene-jira-archive/pull/64

   Close #59 
   
   - Add missing module labels for Jira "Component". If there is no suitable 
module label, map the component to `None` (there are obsolete Components no 
longer used).
   - Log an error if there is no corresponding module label in the mapping.
   
   I ran the conversion script for the whole Jira dump and confirmed that all 
Components are covered.
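
The two behaviours listed in this description can be sketched roughly as follows. This is only an illustration under stated assumptions: the real conversion script is not shown in this thread, so the class `ComponentLabelSketch`, the method `labelsFor`, the component names, and the label strings are all hypothetical. A `null` map value plays the role of the `None` entry for obsolete components, and a component missing from the map entirely is reported as an error.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a Jira component to module label mapping; not the real script. */
public class ComponentLabelSketch {

  // Illustrative entries only. A null value means "known component, deliberately no label".
  private static final Map<String, String> COMPONENT_TO_LABEL = new HashMap<>();

  static {
    COMPONENT_TO_LABEL.put("core/index", "module:core/index");
    COMPONENT_TO_LABEL.put("obsolete-example", null);
  }

  /**
   * Returns the labels for the given components. Components mapped to null are skipped
   * silently; components with no mapping entry at all are recorded in errors.
   */
  public static List<String> labelsFor(List<String> components, List<String> errors) {
    List<String> labels = new ArrayList<>();
    for (String component : components) {
      if (!COMPONENT_TO_LABEL.containsKey(component)) {
        // Missing entry: surface it instead of silently dropping the component.
        String message = "no module label mapping for component: " + component;
        errors.add(message);
        System.err.println(message);
        continue;
      }
      String label = COMPONENT_TO_LABEL.get(component);
      if (label != null) {
        labels.add(label);
      }
    }
    return labels;
  }

  public static void main(String[] args) {
    List<String> errors = new ArrayList<>();
    List<String> labels =
        labelsFor(List.of("core/index", "obsolete-example", "unmapped-example"), errors);
    System.out.println(labels);
    System.out.println(errors);
  }
}
```

Keeping "mapped to nothing" distinct from "not in the mapping" is what lets a full run over the Jira dump confirm coverage: any component still missing an entry shows up in the error output.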


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene-jira-archive] mocobeta merged pull request #64: Cover all Jira components in module label mapping

2022-07-22 Thread GitBox



mocobeta merged PR #64:
URL: https://github.com/apache/lucene-jira-archive/pull/64



[GitHub] [lucene-jira-archive] mocobeta closed issue #59: Module label is sometimes missing?

2022-07-22 Thread GitBox



mocobeta closed issue #59: Module label is sometimes missing?
URL: https://github.com/apache/lucene-jira-archive/issues/59



