mikemccand opened a new issue, #12487:
URL: https://github.com/apache/lucene/issues/12487

   ### Description
   
   Over in https://github.com/mikemccand/luceneutil/issues/226, while trying to 
fix a sneaky and long-standing non-determinism in the Lucene nightly benchmarks 
that affected `VectorSearch` and some `*TaxoFacets` performance measures, I 
struggled, and in the end failed (cheated), to pick which `VectorSearch` queries 
to keep for disambiguation.
   
   The tasks file has:
   
   ```
   VectorSearch: vector//publisher backstory # freq=194856 freq=148
   VectorSearch: vector//many geografia # freq=99550 freq=104
   VectorSearch: vector//many foundation # freq=99550 freq=10894
   VectorSearch: vector//this school # freq=238551 freq=29912
   VectorSearch: vector//such 2007 # freq=111526 freq=90200 1.2
   VectorSearch: vector//year work # freq=175324 freq=102732 1.7
   VectorSearch: vector//interviews # freq=31768
   VectorSearch: vector//golf # freq=31760
   VectorSearch: vector//http # freq=389790
   ```
   
   The benchy then computes an embedding from each of these lexical terms, and 
creates a `KnnFloatVectorQuery` for each.
   
   But then later, if something goes wrong, the `toString` of these queries 
renders only the first dimension's float, with no hint of the original term:
   
   ```
   TASK: cat=VectorSearch q=KnnFloatVectorQuery:vector[0.02625591,...][100] 
s=null group=null hits=100 facets=[]
   ```
   
   I realize that from the machine's standpoint it really is only this vector 
that "matters", but we humans still think in terms of words (so far, anyways, 
heh).  Could we maybe allow an optional, opaque string that does not count 
towards `hashCode`/`equals`/etc. and is simply regurgitated back out in 
`toString`, to help us humans who still need to interact with the machines?
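
To make the request concrete, here is a minimal sketch of the idea in plain Java. It is not Lucene code: `LabeledVectorQuery` and its `description` parameter are hypothetical names invented for illustration, and the class only stands in for a KNN vector query. The point is that the label is excluded from `equals`/`hashCode` and only shows up in `toString`.

```java
import java.util.Arrays;
import java.util.Objects;

/** Hypothetical stand-in for a KNN vector query; NOT Lucene's actual class or API. */
final class LabeledVectorQuery {
  private final String field;
  private final float[] target;
  private final int k;
  // Opaque, for-humans-only label (assumption: may be null when no label is given).
  private final String description;

  LabeledVectorQuery(String field, float[] target, int k, String description) {
    this.field = Objects.requireNonNull(field);
    if (target.length == 0) {
      throw new IllegalArgumentException("target vector must not be empty");
    }
    this.target = target.clone();
    this.k = k;
    this.description = description;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof LabeledVectorQuery)) return false;
    LabeledVectorQuery other = (LabeledVectorQuery) o;
    // The description is deliberately ignored: two queries with the same
    // field, vector and k are equal no matter how they are labeled.
    return k == other.k
        && field.equals(other.field)
        && Arrays.equals(target, other.target);
  }

  @Override
  public int hashCode() {
    // Must stay consistent with equals, so the description is left out here too.
    return Objects.hash(field, k, Arrays.hashCode(target));
  }

  @Override
  public String toString() {
    String base =
        getClass().getSimpleName() + ":" + field + "[" + target[0] + ",...][" + k + "]";
    // Echo the label back only when one was supplied.
    return description == null ? base : base + " (" + description + ")";
  }
}
```

With something like this, the benchmark could pass the original lexical term (for example `"publisher backstory"`) as the label and read it back from the `TASK:` line, without changing how queries compare or hash.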
   
   If we had this, I could have made the correct fix over in 
https://github.com/mikemccand/luceneutil/issues/226 to try to gain back some 
continuity in the vector nightly charts.  But instead I just picked the top 5 
vector queries, which is most likely wrong.  Also, there is precedent in Lucene 
for such "opaque for-human strings": the `String resourceDescription` passed to 
the base `IndexInput` constructor.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

