date:20230803

[GitHub] [lucene] mikemccand opened a new issue, #12487: Can/should `KnnByte/FloatVectorQuery` carry some human-meaningful opaque `toString` fragment?

2023-08-03 Thread via GitHub



mikemccand opened a new issue, #12487:
URL: https://github.com/apache/lucene/issues/12487

   ### Description
   
   Over in https://github.com/mikemccand/luceneutil/issues/226 while trying to 
fix a sneaky and long-standing Lucene nightly benchmark non-determinism that 
affected `VectorSearch` and some `*TaxoFacets` performance measures, I 
struggled and failed/cheated to pick which `VectorSearch` queries to keep for 
disambiguation.
   
   The tasks file has:
   
   ```
   VectorSearch: vector//publisher backstory # freq=194856 freq=148
   VectorSearch: vector//many geografia # freq=99550 freq=104
   VectorSearch: vector//many foundation # freq=99550 freq=10894
   VectorSearch: vector//this school # freq=238551 freq=29912
   VectorSearch: vector//such 2007 # freq=111526 freq=90200 1.2
   VectorSearch: vector//year work # freq=175324 freq=102732 1.7
   VectorSearch: vector//interviews # freq=31768
   VectorSearch: vector//golf # freq=31760
   VectorSearch: vector//http # freq=389790
   ```
   
   The benchy then computes embeddings from each of these lexical terms, and 
creates a `KnnFloatVectorQuery` for each.
   
   But then later, if something goes wrong, the `toString` of these queries 
just renders the first dimension float:
   
   ```
   TASK: cat=VectorSearch q=KnnFloatVectorQuery:vector[0.02625591,...][100] 
s=null group=null hits=100 facets=[]
   ```
   
   I realize from the machine's standpoint it really is only this vector that 
"matters", but we humans still think in terms of words (so far, anyways, heh).  
Could we maybe allow for an optional, opaque string that does not count towards 
`hashCode`/`equals`/etc. and is then regurgitated back out in `toString`, to 
help us humans who still need to interact with the machines?
   
   If we had this, I could have made the correct fix over in 
https://github.com/mikemccand/luceneutil/issues/226 to try to gain back some 
continuity in the vector nightly charts.  But instead I just picked the top 5 
vector queries, which is most likely wrong.  Also, there is precedent in Lucene 
for such "opaque for-human strings": the `String resourceDescription` passed to 
base `IndexInput` constructor.
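
A minimal sketch of the idea, for illustration only. `DescribedVectorQuery` is a hypothetical standalone class, not Lucene's `KnnFloatVectorQuery` and not a proposed patch; its constructor shape and the `(description)` suffix in `toString` are assumptions. It shows an optional description that is echoed by `toString` but deliberately left out of `equals` and `hashCode`:

```java
import java.util.Arrays;
import java.util.Objects;

/** Hypothetical stand-in for a KNN vector query; NOT part of Lucene's API. */
final class DescribedVectorQuery {
  private final String field;
  private final float[] target;
  private final int k;
  // Opaque, human-readable hint (e.g. the original query text); may be null.
  private final String description;

  DescribedVectorQuery(String field, float[] target, int k, String description) {
    this.field = Objects.requireNonNull(field);
    if (target.length == 0) {
      throw new IllegalArgumentException("target must not be empty");
    }
    this.target = target.clone();
    this.k = k;
    this.description = description;
  }

  // The description is intentionally ignored here: two queries with the
  // same field, vector and k are equal regardless of their labels.
  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof DescribedVectorQuery)) return false;
    DescribedVectorQuery other = (DescribedVectorQuery) o;
    return k == other.k
        && field.equals(other.field)
        && Arrays.equals(target, other.target);
  }

  @Override
  public int hashCode() {
    return Objects.hash(field, k, Arrays.hashCode(target));
  }

  // Renders only the first dimension, as in the log line quoted above,
  // then appends the description when one was supplied.
  @Override
  public String toString() {
    String base =
        getClass().getSimpleName() + ":" + field + "[" + target[0] + ",...][" + k + "]";
    return description == null ? base : base + " (" + description + ")";
  }
}
```

With something like this, a query built from the `golf` task line could carry `"golf"` as its description, so a failing task's log line would name the lexical term while caching and deduplication based on `equals`/`hashCode` stay unchanged.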
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] turingmachine commented on pull request #12485: Fix onlyLongestMatch in DictionaryCompoundWordTokenFilter

2023-08-03 Thread via GitHub



turingmachine commented on PR #12485:
URL: https://github.com/apache/lucene/pull/12485#issuecomment-1663953429

   +1



[GitHub] [lucene] turingmachine commented on pull request #12478: Add Option to Set Subtoken Position Increment for Dictonary Decompounder

2023-08-03 Thread via GitHub



turingmachine commented on PR #12478:
URL: https://github.com/apache/lucene/pull/12478#issuecomment-1663954051

   +1



[GitHub] [lucene] Frostfire25 commented on issue #12463: Learned sorting algorithm for Lucene

2023-08-03 Thread via GitHub



Frostfire25 commented on issue #12463:
URL: https://github.com/apache/lucene/issues/12463#issuecomment-1664368419

   Hey, very interested in assisting with the implementation of this algorithm.



[GitHub] [lucene] Jackyrie2 commented on pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-03 Thread via GitHub



Jackyrie2 commented on PR #12480:
URL: https://github.com/apache/lucene/pull/12480#issuecomment-1664597207

   Here is a quick re-run of the benchmark (100-dim vectors) on the optimized 
code with a 90%/10% split on document addition:
   
   Baseline -> old candidate -> optimized candidate:
   253,234,833 -> 265,153,500 -> 246,373,375
   3,845,039,583 -> 4,338,130,875 -> 3,674,674,583
   10,131,094,959 -> 9,869,441,083 -> 9,569,986,875
   
   I will run the benchmark on 768 dim vectors later.
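
To make the three rows above easier to compare, here is a small sketch that computes the relative change of a candidate against the baseline. The class and method names (`BenchDelta`, `percentChange`) are hypothetical, and the comment does not state the unit of the raw numbers, so the sketch only assumes that lower is better:

```java
/** Hypothetical helper for comparing benchmark numbers; not from the PR. */
final class BenchDelta {
  private BenchDelta() {}

  /**
   * Relative change of candidate versus baseline, in percent.
   * A negative value means the candidate's number is smaller than the baseline's.
   */
  static double percentChange(long baseline, long candidate) {
    if (baseline == 0) {
      throw new IllegalArgumentException("baseline must be non-zero");
    }
    return 100.0 * (candidate - baseline) / baseline;
  }
}
```

For example, `BenchDelta.percentChange(253_234_833L, 246_373_375L)` compares the first row's baseline with its optimized candidate.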


