[GitHub] [lucene] iverase merged pull request #12512: Remove unused variable in BKDWriter
iverase merged PR #12512: URL: https://github.com/apache/lucene/pull/12512 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] easyice closed issue #12514: Could we add more index for BKD LeafNode?
easyice closed issue #12514: Could we add more index for BKD LeafNode? URL: https://github.com/apache/lucene/issues/12514 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] easyice commented on issue #12514: Could we add more index for BKD LeafNode?
easyice commented on issue #12514: URL: https://github.com/apache/lucene/issues/12514#issuecomment-1687673132 Thanks for explaining, I will close the issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] azagniotov opened a new pull request, #12517: LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
azagniotov opened a new pull request, #12517: URL: https://github.com/apache/lucene/pull/12517 ## TL;DR The current PR attempts to remediate https://issues.apache.org/jira/browse/LUCENE-4056 (`Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary`) ## Description The current PR builds up upon @johtani's past PR: https://github.com/apache/lucene-solr/pull/935 and attempts to bring the code into a mergeable state, or at least to re-ignite the conversation about building a UniDic dictionary. Before exporting the current PR, I verified the @johtani's added behavior in the aforementioned PR (plus some changes of my own) by successfully building a number of dictionaries outlined below and posted my findings in the comment: https://github.com/apache/lucene-solr/pull/935#issuecomment-1685887305 ### Philosophy of changes I tried to pick up @johtani's added behavior as-is, while minimizing the amount of changes that I added additionally. To be honest, I do not really like how `DictionaryBuilder.DictionaryFormat format` is being passed everywhere, as it creates these small decision trees `if IPADIC do this else do that`. If this PR gets merged, I would be happy to refactor in order to make the code look more elegant and perhaps with better separation of concerns. But, I defer to the reviewers of the Lucene library on this one. ๐๐ผโโ๏ธ The commits are fairly atomic and I hope this can make the reviewing experience more easier ## Built Dictionaries - [unidic-cwj-202302_full](https://clrd.ninjal.ac.jp/unidic_archive/2302/) (NINJAL) - [unidic-cwj-3.1.1-full](https://clrd.ninjal.ac.jp/unidic_archive/cwj/3.1.1/) (NINJAL) (See **Caveat** below ๐๐ผ ) - [unidic-mecab-2.1.2_src](https://clrd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/) (NINJAL) - [mecab-ipadic-2.7.0-20070801](https://taku910.github.io/mecab/) ### Caveat RE: Building the [unidic-cwj-3.1.1-full](https://clrd.ninjal.ac.jp/unidic_archive/cwj/3.1.1/): 1. I had to increase the the [length check of the baseForm](https://github.com/apache/lucene/commit/4bb6000cdc85aa4cc266d0a32a149d4d517d2e95) from `16` to `35` 2. I had to stop [throwing an exception for multiple entries of LeftID](https://github.com/apache/lucene/commit/011efbe480dce5eafc2c81cdbf44a3fc2f07b61c), which, to be honest I do not understand the full ramifications of. Thus, I would value the input of subject matter experts here ๐๐ผโโ๏ธ ## Building a dictionary ### Gradle command My build command leverages the new Gradle setup and the [DictionaryBuilder JavaDoc comment](https://github.com/apache/lucene/commit/963ddc66f6d822bf877c7e6155d903babf937c13) about how to do it: I added in `lucene/analysis/kuromoji/build.gradle` a [default run task from the Gradle's application plugin](https://github.com/apache/lucene/commit/8d52f66d57b354fad6426c7c02b05a3b736d7dcd), which allows to build a dictionary are as follows. The command should be executed under the root directory `lucene`, where the `gradlew` file is. For example, the following is my command when building `unidic-cwj-202302_full` dictionary without NFKD normalization: ``` ./gradlew -p lucene/analysis/kuromoji run --args='unidic "/Users/azagniotov/Downloads/unidic-cwj-202302_full" "/Users/azagniotov/Downloads/unidic-cwj-202302_full/lucene-kuromoji-built" "UTF-8" false' ``` ## Unit testing Unfortunately, the current PR does not include unit tests because the built dictionaries files are very big, e.g.: built `unidic-cwj-202302_full ` is around ~700MB. Thus, the following unit tests were added and ran locally on my machine to verify that the built dictionaries can be used at runtime. I did see there there is a [main/gradle/generation/kuromoji.gradle](https://github.com/apache/lucene/blob/main/gradle/generation/kuromoji.gradle) that downloads dictionaries and compiles them, but for now, I resorted not to add changes there as I think a dictionary should be decoupled. The built dictionaries were tested using the following Japanese strings (no particular reason, I just picked these four strings): - `"ใซใใใใ"` - `"ใกใใใ"` - `"ๆกๅคช้้ป้"` - `"่ๅท็ๆ"` The dictionaries metadata were placed (dictionary after dictionary) under the [lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict](https://github.com/apache/lucene/tree/main/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict) and a few unit test cases were added: Existing default dictionary already included in Lucene ``` assertAnalyzesTo(analyzerNoPunct, "ใซใใใใ", new String[] {"ใซ", "ใใ", "ใ", "ใ"}, new int[] {1, 1, 1, 1}); assertAnalyzesTo(analyzerNoPunct, "ใกใใใ", new String[] {"ใกใ", "ใ", "ใ"}, new int[] {1, 1, 1}); assertAnalyzesTo(analyzerNoPunct, "ๆกๅคช้้ป้", new String[] {"ๆกๅคช้", "้ป้"}, new int[] {1,
[GitHub] [lucene] mikemccand commented on pull request #12465: Potential bug in IndexedDISI90#SPARSE->advanceExactWithinBlock
mikemccand commented on PR #12465: URL: https://github.com/apache/lucene/pull/12465#issuecomment-1688183077 It's awesome that Lucene changed to be more picky (removing leniency) on catching invalid usage of the API! Such improvements are great at ferreting out bugs (instead of masking them) in the application layer (Amazon Search in this case) using Lucene! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a diff in pull request #12480: Enhancement 11236 lazy compute similarity score
zhaih commented on code in PR #12480: URL: https://github.com/apache/lucene/pull/12480#discussion_r1302276710 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -401,8 +404,23 @@ private boolean isDiverse(byte[] candidate, NeighborArray neighbors, float score * Find first non-diverse neighbour among the list of neighbors starting from the most distant * neighbours */ - private int findWorstNonDiverse(NeighborArray neighbors) throws IOException { -int[] uncheckedIndexes = neighbors.sort(); + private int findWorstNonDiverse(NeighborArray neighbors, int nodeOrd) throws IOException { +int[] uncheckedIndexes = neighbors.sort(nbrOrd -> { + float[] vectorValue = null; + byte[] binaryValue = null; + switch (this.vectorEncoding) { +case FLOAT32 -> vectorValue = (float[]) vectors.vectorValue(nodeOrd); Review Comment: Let's take this part outside of lambda to reduce number of times we call `vectorValue`, this operation involves some seek and parse operation on off-heap memory. ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -197,4 +206,8 @@ private int descSortFindRightMostInsertionPoint(float newScore, int bound) { } return start; } + + interface ScoringFunction { Review Comment: Javadoc for both interface and the method? ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -211,16 +214,16 @@ private void initializeFromGraph( oldNeighbor != NO_MORE_DOCS; oldNeighbor = initializerGraph.nextNeighbor()) { int newNeighbor = oldToNewOrdinalMap.get(oldNeighbor); - float score = - switch (this.vectorEncoding) { -case FLOAT32 -> this.similarityFunction.compare( -vectorValue, (float[]) vectorsCopy.vectorValue(newNeighbor)); -case BYTE -> this.similarityFunction.compare( -binaryValue, (byte[]) vectorsCopy.vectorValue(newNeighbor)); - }; +// float score = Review Comment: Let's remove those commented code? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a diff in pull request #12480: Enhancement 11236 lazy compute similarity score
zhaih commented on code in PR #12480: URL: https://github.com/apache/lucene/pull/12480#discussion_r1302278519 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -108,10 +111,16 @@ public int[] sort() { } /** insert the first unsorted node into its sorted position */ - private int insertSortedInternal() { + private int insertSortedInternal(ScoringFunction scoringFunction) throws IOException { assert sortedNodeSize < size : "Call this method only when there's unsorted node"; int tmpNode = node[sortedNodeSize]; float tmpScore = score[sortedNodeSize]; + +if (Float.isNaN(tmpScore)){ + tmpScore = scoringFunction.computeScore(tmpNode); + System.out.println("Node: " + tmpNode + " Score: " + tmpScore); Review Comment: remove this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on pull request #12480: Enhancement 11236 lazy compute similarity score
zhaih commented on PR #12480: URL: https://github.com/apache/lucene/pull/12480#issuecomment-1689049573 Ah sorry I almost forget: don't forget to create a CHANGES.txt entry :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org