[GitHub] [lucene] mikemccand opened a new issue, #12487: Can/should `KnnByte/FloatVectorQuery` carry some human-meaningful opaque `toString` fragment?
mikemccand opened a new issue, #12487:
URL: https://github.com/apache/lucene/issues/12487

### Description

Over in https://github.com/mikemccand/luceneutil/issues/226 while trying to fix a sneaky and long-standing Lucene nightly benchmark non-determinism that affected `VectorSearch` and some `*TaxoFacets` performance measures, I struggled and failed/cheated to pick which `VectorSearch` queries to keep for disambiguation. The tasks file has:

```
VectorSearch: vector//publisher backstory # freq=194856 freq=148
VectorSearch: vector//many geografia # freq=99550 freq=104
VectorSearch: vector//many foundation # freq=99550 freq=10894
VectorSearch: vector//this school # freq=238551 freq=29912
VectorSearch: vector//such 2007 # freq=111526 freq=90200 1.2
VectorSearch: vector//year work # freq=175324 freq=102732 1.7
VectorSearch: vector//interviews # freq=31768
VectorSearch: vector//golf # freq=31760
VectorSearch: vector//http # freq=389790
```

The benchy then computes embeddings from each of these lexical terms, and creates `KnnFloatVectorQuery` for each. But then later, if something goes wrong, the `toString` of these queries just renders the first dimension float:

```
TASK: cat=VectorSearch q=KnnFloatVectorQuery:vector[0.02625591,...][100] s=null group=null hits=100 facets=[]
```

I realize from the machine's standpoint it really is only this vector that "matters", but we humans still think in terms of words (so far, anyways, heh). Could we maybe allow for an optional opaque and not counting towards `hashCode`/`equals`/etc. string that is then regurgitated back out in `toString` to help we humans that still need to interact with the machines?

If we had this, I could have made the correct fix over in https://github.com/mikemccand/luceneutil/issues/226 to try to gain back some continuity in the vector nightly charts. But instead I just picked the top 5 vector queries, which is most likely wrong.
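To make the request concrete, here is a minimal Java sketch of the idea, under stated assumptions. `LabeledVectorQuery` and its `description` parameter are hypothetical names invented for illustration; this is not Lucene's `KnnFloatVectorQuery` and not a proposed patch. The label is printed by `toString` but deliberately left out of `equals` and `hashCode`, so two queries over the same vector compare equal regardless of their labels.

```java
import java.util.Arrays;
import java.util.Objects;

/** Hypothetical stand-in for a KNN vector query; not Lucene's actual class. */
final class LabeledVectorQuery {
    private final String field;
    private final float[] target;
    private final int k;
    // Optional, for humans only; may be null. Never used for equality or hashing.
    private final String description;

    LabeledVectorQuery(String field, float[] target, int k, String description) {
        this.field = Objects.requireNonNull(field, "field");
        if (target == null || target.length == 0) {
            throw new IllegalArgumentException("target must have at least one dimension");
        }
        this.target = target.clone(); // defensive copy
        this.k = k;
        this.description = description;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LabeledVectorQuery)) return false;
        LabeledVectorQuery other = (LabeledVectorQuery) o;
        // The description is intentionally ignored here.
        return k == other.k
            && field.equals(other.field)
            && Arrays.equals(target, other.target);
    }

    @Override
    public int hashCode() {
        // Same fields as equals(); the description is intentionally ignored.
        return 31 * Objects.hash(field, k) + Arrays.hashCode(target);
    }

    @Override
    public String toString() {
        // Mirrors the "first dimension only" rendering quoted in the issue,
        // then appends the human-readable label when one was supplied.
        String base = "LabeledVectorQuery:" + field + "[" + target[0] + ",...][" + k + "]";
        return description == null ? base : base + " (" + description + ")";
    }
}
```

With something like this, a benchmark could pass the original lexical terms (for example `publisher backstory`) as the label and recover them from the task log, without changing how queries are cached or compared.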
Also, there is precedent in Lucene for such "opaque for-human strings": the `String resourceDescription` passed to base `IndexInput` constructor. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] turingmachine commented on pull request #12485: Fix onlyLongestMatch in DictionaryCompoundWordTokenFilter
turingmachine commented on PR #12485: URL: https://github.com/apache/lucene/pull/12485#issuecomment-1663953429 +1
[GitHub] [lucene] turingmachine commented on pull request #12478: Add Option to Set Subtoken Position Increment for Dictonary Decompounder
turingmachine commented on PR #12478: URL: https://github.com/apache/lucene/pull/12478#issuecomment-1663954051 +1
[GitHub] [lucene] Frostfire25 commented on issue #12463: Learned sorting algorithm for Lucene
Frostfire25 commented on issue #12463: URL: https://github.com/apache/lucene/issues/12463#issuecomment-1664368419 Hey, very interested in assisting with the implementation of this algorithm.
[GitHub] [lucene] Jackyrie2 commented on pull request #12480: Enhancement 11236 lazy compute similarity score
Jackyrie2 commented on PR #12480:
URL: https://github.com/apache/lucene/pull/12480#issuecomment-1664597207

Here is a quick re-run of the benchmark (100-dim vectors) on the optimized code with a 90% - 10% split on documents addition:

Baseline -> old candidate -> optimized candidate:
253,234,833 -> 265,153,500 -> 246,373,375
3,845,039,583 -> 4,338,130,875 -> 3,674,674,583
10,131,094,959 -> 9,869,441,083 -> 9,569,986,875

I will run the benchmark on 768-dim vectors later.
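For readers who want the figures above as relative changes, the following Java sketch computes the percentage difference between the baseline and the optimized candidate for each of the three rows. The class and method names are made up for illustration. The message does not state the unit of the measurements or whether lower is better, so the sketch only reports the signed change.

```java
/** Illustrative helper; names are hypothetical and not part of the PR. */
final class RelativeChange {
    /** Percentage change from baseline to candidate; negative means the candidate value is lower. */
    static double percentChange(long baseline, long candidate) {
        return 100.0 * (candidate - baseline) / baseline;
    }

    public static void main(String[] args) {
        // Baseline and optimized-candidate values copied from the comment above.
        long[][] rows = {
            {253_234_833L, 246_373_375L},
            {3_845_039_583L, 3_674_674_583L},
            {10_131_094_959L, 9_569_986_875L},
        };
        for (long[] row : rows) {
            System.out.printf("%,d -> %,d: %+.2f%%%n",
                row[0], row[1], percentChange(row[0], row[1]));
        }
    }
}
```

In all three rows the optimized candidate's value is lower than the baseline's.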