[GitHub] [lucene] iverase merged pull request #12512: Remove unused variable in BKDWriter

2023-08-22 Thread via GitHub


iverase merged PR #12512:
URL: https://github.com/apache/lucene/pull/12512


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] easyice closed issue #12514: Could we add more index for BKD LeafNode?

2023-08-22 Thread via GitHub


easyice closed issue #12514: Could we add more index for BKD LeafNode?
URL: https://github.com/apache/lucene/issues/12514


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] easyice commented on issue #12514: Could we add more index for BKD LeafNode?

2023-08-22 Thread via GitHub


easyice commented on issue #12514:
URL: https://github.com/apache/lucene/issues/12514#issuecomment-1687673132

   Thanks for explaining, I will close the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] azagniotov opened a new pull request, #12517: LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary

2023-08-22 Thread via GitHub


azagniotov opened a new pull request, #12517:
URL: https://github.com/apache/lucene/pull/12517

   ## TL;DR
   
   The current PR attempts to remediate 
https://issues.apache.org/jira/browse/LUCENE-4056 (`Japanese Tokenizer 
(Kuromoji) cannot build UniDic dictionary`)
   
   ## Description
   
   
   
   The current PR builds up upon @johtani's past PR: 
https://github.com/apache/lucene-solr/pull/935 and attempts to bring the code 
into a mergeable state, or at least to re-ignite the conversation about 
building a UniDic dictionary. 
   
   Before exporting the current PR, I verified the @johtani's added behavior in 
the aforementioned PR (plus some changes of my own) by successfully building a 
number of dictionaries outlined below and posted my findings in the comment: 
https://github.com/apache/lucene-solr/pull/935#issuecomment-1685887305 
   
   ### Philosophy of changes
   
   I tried to pick up @johtani's added behavior as-is, while minimizing the 
amount of changes that I added additionally.
   
   To be honest, I do not really like how `DictionaryBuilder.DictionaryFormat 
format` is being passed everywhere, as it creates these small decision trees 
`if IPADIC do this else do that`. If this PR gets merged, I would be happy to 
refactor in order to make the code look more elegant and perhaps with better 
separation of concerns. But, I defer to the reviewers of the Lucene library on 
this one. ๐Ÿ™‡๐Ÿผโ€โ™€๏ธ 
   
   The commits are fairly atomic and I hope this can make the reviewing 
experience more easier
   
   ## Built Dictionaries 
   
   - [unidic-cwj-202302_full](https://clrd.ninjal.ac.jp/unidic_archive/2302/) 
(NINJAL)
   - 
[unidic-cwj-3.1.1-full](https://clrd.ninjal.ac.jp/unidic_archive/cwj/3.1.1/) 
(NINJAL) (See **Caveat** below ๐Ÿ‘‡๐Ÿผ )
   - 
[unidic-mecab-2.1.2_src](https://clrd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/) 
(NINJAL)
   - [mecab-ipadic-2.7.0-20070801](https://taku910.github.io/mecab/)
   
   ### Caveat
   RE: Building the 
[unidic-cwj-3.1.1-full](https://clrd.ninjal.ac.jp/unidic_archive/cwj/3.1.1/): 
   
   1. I had to increase the the [length check of the 
baseForm](https://github.com/apache/lucene/commit/4bb6000cdc85aa4cc266d0a32a149d4d517d2e95)
 from `16` to `35`
   2. I had to stop [throwing an exception for multiple entries of 
LeftID](https://github.com/apache/lucene/commit/011efbe480dce5eafc2c81cdbf44a3fc2f07b61c),
 which, to be honest I do not understand the full ramifications of. Thus, I 
would value the input of subject matter experts here ๐Ÿ™‡๐Ÿผโ€โ™€๏ธ 
   
   ## Building a dictionary
   
   ### Gradle command
   
   My build command leverages the new Gradle setup and the [DictionaryBuilder 
JavaDoc 
comment](https://github.com/apache/lucene/commit/963ddc66f6d822bf877c7e6155d903babf937c13)
 about how to do it:
   
   I added in `lucene/analysis/kuromoji/build.gradle` a [default run task from 
the Gradle's application 
plugin](https://github.com/apache/lucene/commit/8d52f66d57b354fad6426c7c02b05a3b736d7dcd),
 which allows to build a dictionary are as follows. The command should be 
executed under the root directory `lucene`, where the `gradlew` file is.
   
   For example, the following is my command when building 
`unidic-cwj-202302_full` dictionary without NFKD normalization:
   ```
   ./gradlew -p lucene/analysis/kuromoji run --args='unidic 
"/Users/azagniotov/Downloads/unidic-cwj-202302_full" 
"/Users/azagniotov/Downloads/unidic-cwj-202302_full/lucene-kuromoji-built" 
"UTF-8" false'
   ```
   
   ## Unit testing
   
   Unfortunately, the current PR does not include unit tests because the built 
dictionaries files are very big, e.g.: built `unidic-cwj-202302_full ` is 
around ~700MB. Thus, the following unit tests were added and ran locally on my 
machine to verify that the built dictionaries can be used at runtime. 
   
   I did see there there is a 
[main/gradle/generation/kuromoji.gradle](https://github.com/apache/lucene/blob/main/gradle/generation/kuromoji.gradle)
 that downloads dictionaries and compiles them, but for now, I resorted not to 
add changes there as I think a dictionary should be decoupled.
   
   The built dictionaries were tested using the following Japanese strings (no 
particular reason, I just picked these four strings): 
   
   - `"ใซใ˜ใ•ใ‚“ใ˜"`
   - `"ใกใ„ใ‹ใ‚"`
   - `"ๆกƒๅคช้ƒŽ้›ป้‰„"`
   - `"่–ๅท็œŸๆ–—"`
   
   The dictionaries metadata  were placed (dictionary after dictionary) under 
the 
[lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict](https://github.com/apache/lucene/tree/main/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict)
 and a few unit test cases were added: 
   
    Existing default dictionary already included in Lucene
   ```
   assertAnalyzesTo(analyzerNoPunct, "ใซใ˜ใ•ใ‚“ใ˜", new String[] {"ใซ", "ใ˜ใ•", "ใ‚“", 
"ใ˜"}, new int[] {1, 1, 1, 1});
   assertAnalyzesTo(analyzerNoPunct, "ใกใ„ใ‹ใ‚", new String[] {"ใกใ„", "ใ‹", "ใ‚"}, new 
int[] {1, 1, 1});
   assertAnalyzesTo(analyzerNoPunct, "ๆกƒๅคช้ƒŽ้›ป้‰„", new String[] {"ๆกƒๅคช้ƒŽ", "้›ป้‰„"}, new 
int[] {1,

[GitHub] [lucene] mikemccand commented on pull request #12465: Potential bug in IndexedDISI90#SPARSE->advanceExactWithinBlock

2023-08-22 Thread via GitHub


mikemccand commented on PR #12465:
URL: https://github.com/apache/lucene/pull/12465#issuecomment-1688183077

   It's awesome that Lucene changed to be more picky (removing leniency) on 
catching invalid usage of the API!  Such improvements are great at ferreting 
out bugs (instead of masking them) in the application layer (Amazon Search in 
this case) using Lucene!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a diff in pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-22 Thread via GitHub


zhaih commented on code in PR #12480:
URL: https://github.com/apache/lucene/pull/12480#discussion_r1302276710


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -401,8 +404,23 @@ private boolean isDiverse(byte[] candidate, NeighborArray 
neighbors, float score
* Find first non-diverse neighbour among the list of neighbors starting 
from the most distant
* neighbours
*/
-  private int findWorstNonDiverse(NeighborArray neighbors) throws IOException {
-int[] uncheckedIndexes = neighbors.sort();
+  private int findWorstNonDiverse(NeighborArray neighbors, int nodeOrd) throws 
IOException {
+int[] uncheckedIndexes = neighbors.sort(nbrOrd -> {
+  float[] vectorValue = null;
+  byte[] binaryValue = null;
+  switch (this.vectorEncoding) {
+case FLOAT32 -> vectorValue = (float[]) vectors.vectorValue(nodeOrd);

Review Comment:
   Let's take this part outside of lambda to reduce number of times we call 
`vectorValue`, this operation involves some seek and parse operation on 
off-heap memory.



##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -197,4 +206,8 @@ private int descSortFindRightMostInsertionPoint(float 
newScore, int bound) {
 }
 return start;
   }
+
+  interface ScoringFunction {

Review Comment:
   Javadoc for both interface and the method?



##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -211,16 +214,16 @@ private void initializeFromGraph(
 oldNeighbor != NO_MORE_DOCS;
 oldNeighbor = initializerGraph.nextNeighbor()) {
   int newNeighbor = oldToNewOrdinalMap.get(oldNeighbor);
-  float score =
-  switch (this.vectorEncoding) {
-case FLOAT32 -> this.similarityFunction.compare(
-vectorValue, (float[]) 
vectorsCopy.vectorValue(newNeighbor));
-case BYTE -> this.similarityFunction.compare(
-binaryValue, (byte[]) 
vectorsCopy.vectorValue(newNeighbor));
-  };
+//  float score =

Review Comment:
   Let's remove those commented code?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a diff in pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-22 Thread via GitHub


zhaih commented on code in PR #12480:
URL: https://github.com/apache/lucene/pull/12480#discussion_r1302278519


##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -108,10 +111,16 @@ public int[] sort() {
   }
 
   /** insert the first unsorted node into its sorted position */
-  private int insertSortedInternal() {
+  private int insertSortedInternal(ScoringFunction scoringFunction) throws 
IOException {
 assert sortedNodeSize < size : "Call this method only when there's 
unsorted node";
 int tmpNode = node[sortedNodeSize];
 float tmpScore = score[sortedNodeSize];
+
+if (Float.isNaN(tmpScore)){
+  tmpScore = scoringFunction.computeScore(tmpNode);
+  System.out.println("Node: " + tmpNode + "   Score: " + tmpScore);

Review Comment:
   remove this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-22 Thread via GitHub


zhaih commented on PR #12480:
URL: https://github.com/apache/lucene/pull/12480#issuecomment-1689049573

   Ah sorry I almost forget:
   don't forget to create a CHANGES.txt entry :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org