[GitHub] [lucene] mikemccand opened a new issue, #12476: Can we improve the linear scan part of skipping to possibly compile to CMOVcc?

2023-07-31 Thread via GitHub
mikemccand opened a new issue, #12476: URL: https://github.com/apache/lucene/issues/12476 ### Description @fulmicoton (Tantivy creator) reached out to me after our [fun discussion about how to tap into branchless CPU instructions (CMOVcc on x86-64)](https://markmail.org/message/rqktb
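
For readers skimming this thread, the idea behind the issue title can be illustrated roughly as follows. This is a minimal sketch, not Lucene's skipping code: the class and method names are hypothetical, and it assumes a sorted block of non-negative doc IDs and a non-negative target. Whether the JIT actually emits CMOVcc or similar branch-free instructions for the second loop is not guaranteed; the sketch only shows a loop body with no data-dependent branch.

```java
// Hypothetical illustration; none of these names come from Lucene.
final class BranchlessScanSketch {

  /** Branchy baseline: index of the first element >= target, or length if none. */
  static int scanBranchy(int[] sortedDocs, int target) {
    for (int i = 0; i < sortedDocs.length; i++) {
      if (sortedDocs[i] >= target) {
        return i;
      }
    }
    return sortedDocs.length;
  }

  /**
   * Branch-free loop body: counts the elements that are < target.
   * For sorted input this count equals the index returned by scanBranchy.
   */
  static int scanBranchless(int[] sortedDocs, int target) {
    int count = 0;
    for (int doc : sortedDocs) {
      // Assumption: doc and target are both non-negative, so (doc - target)
      // cannot overflow and its sign bit is 1 exactly when doc < target.
      count += (doc - target) >>> 31;
    }
    return count;
  }

  public static void main(String[] args) {
    int[] docs = {2, 5, 9, 14, 20};
    System.out.println(scanBranchy(docs, 10));
    System.out.println(scanBranchless(docs, 10));
  }
}
```

The branch-free variant always visits the whole block, so it trades an early exit for a predictable instruction stream; whether that is a win would have to be measured.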

[GitHub] [lucene] mikemccand commented on issue #12476: Can we improve the linear scan part of skipping to possibly compile to CMOVcc?

2023-07-31 Thread via GitHub
mikemccand commented on issue #12476: URL: https://github.com/apache/lucene/issues/12476#issuecomment-1658179453 Thank you for the pointer @fulmicoton! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to g

[GitHub] [lucene] mikemccand opened a new issue, #12477: Could we encode postings the way we encode monotonic long doc values?

2023-07-31 Thread via GitHub
mikemccand opened a new issue, #12477: URL: https://github.com/apache/lucene/issues/12477 ### Description Lucene has an efficient (storage and CPU) compressor for monotonic long values, that simply makes "best fit" (ish?) linear model to the N monotonic values, and then encodes the p
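
As a rough illustration of the "linear model plus corrections" idea mentioned in this snippet, the sketch below fits a straight line through the first and last values and stores a small non-negative residual per value. It is an assumption-laden sketch, not Lucene's implementation: the class, the `Encoded` holder, and the choice of fitting the line through the endpoints are all hypothetical, and a real encoder would bit-pack the residuals rather than keep them in a `long[]`.

```java
// Hypothetical illustration; not Lucene's monotonic encoder.
final class MonotonicLinearSketch {

  /** A line (base, slope) plus one non-negative residual per value. */
  static final class Encoded {
    final long base;
    final double slope;
    final long[] residuals;

    Encoded(long base, double slope, long[] residuals) {
      this.base = base;
      this.slope = slope;
      this.residuals = residuals;
    }
  }

  static Encoded encode(long[] values) {
    int n = values.length;
    if (n == 0) {
      return new Encoded(0L, 0.0, new long[0]);
    }
    // Assumption: use the average increment between first and last value as the slope.
    double slope = n > 1 ? (double) (values[n - 1] - values[0]) / (n - 1) : 0.0;
    long[] residuals = new long[n];
    long min = Long.MAX_VALUE;
    for (int i = 0; i < n; i++) {
      residuals[i] = values[i] - (long) (slope * i); // distance from the line
      min = Math.min(min, residuals[i]);
    }
    for (int i = 0; i < n; i++) {
      residuals[i] -= min; // shift so every residual is >= 0
    }
    return new Encoded(min, slope, residuals);
  }

  /** Random access: no need to decode the preceding values. */
  static long decode(Encoded e, int index) {
    return e.base + (long) (e.slope * index) + e.residuals[index];
  }

  public static void main(String[] args) {
    long[] values = {3, 10, 17, 25, 31};
    Encoded e = encode(values);
    for (int i = 0; i < values.length; i++) {
      System.out.println(values[i] + " -> residual " + e.residuals[i] + " -> " + decode(e, i));
    }
  }
}
```

When the values are close to evenly spaced, the residuals stay small and need few bits each, and any single value can be decoded without touching the ones before it.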

[GitHub] [lucene] mikemccand commented on issue #12477: Could we encode postings the way we encode monotonic long doc values?

2023-07-31 Thread via GitHub
mikemccand commented on issue #12477: URL: https://github.com/apache/lucene/issues/12477#issuecomment-1658195108 Note that Tantivy uses binary search to locate the target docid in the block of docs -- somehow Tantivy uses SIMD to decode (docid-delta encoded) postings into absolute docids fi
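
The approach described in this comment (decode a delta-encoded block into absolute doc IDs, then binary search it) might look roughly like the sketch below. It is only an illustration under stated assumptions: the decode step here is a plain scalar prefix sum rather than SIMD, the names are hypothetical, and nothing in it is taken from Tantivy or Lucene.

```java
// Hypothetical illustration; scalar prefix sum stands in for a SIMD decoder.
final class BlockAdvanceSketch {

  /** Turns doc ID deltas into absolute doc IDs, starting from the doc ID before the block. */
  static int[] decodeDeltas(int previousDoc, int[] deltas) {
    int[] docs = new int[deltas.length];
    int doc = previousDoc;
    for (int i = 0; i < deltas.length; i++) {
      doc += deltas[i];
      docs[i] = doc;
    }
    return docs;
  }

  /** Binary search: index of the first doc ID >= target, or docs.length if none. */
  static int advance(int[] docs, int target) {
    int lo = 0;
    int hi = docs.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (docs[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  public static void main(String[] args) {
    int[] docs = decodeDeltas(100, new int[] {3, 1, 7, 2, 10});
    System.out.println(java.util.Arrays.toString(docs));
    System.out.println(advance(docs, 112));
  }
}
```

Binary search only becomes possible once the whole block holds absolute doc IDs, which is why the decode step comes first.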

[GitHub] [lucene] tang-hi commented on issue #12477: Could we encode postings the way we encode monotonic long doc values?

2023-07-31 Thread via GitHub
tang-hi commented on issue #12477: URL: https://github.com/apache/lucene/issues/12477#issuecomment-1658224212 I have attempted to encode/decode the post block using SIMD instructions. However, I believe it may not be the opportune moment to vectorize it. This is because we are currently una

[GitHub] [lucene] easyice commented on pull request #12435: Remove sort for uniqueValues in NumericDocValues

2023-07-31 Thread via GitHub
easyice commented on PR #12435: URL: https://github.com/apache/lucene/pull/12435#issuecomment-1658292502 I'm sorry for the late reply. I agree with you; it has no impact on performance -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

[GitHub] [lucene] busykoala opened a new pull request, #12478: Add Option to Set Subtoken Position Increment for Dictonary Decompounder

2023-07-31 Thread via GitHub
busykoala opened a new pull request, #12478: URL: https://github.com/apache/lucene/pull/12478 ### Description This pull request adds a new feature to Lucene's DictionaryDecompounder. Now, you can set the position increment of subtokens to one. This feature is required when you're doi

[GitHub] [lucene] benwtrent commented on pull request #12434: Add ParentJoin KNN support

2023-07-31 Thread via GitHub
benwtrent commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1658402079 > If it's a small number (say c children per parent), it may be better to use KNN search with K' = c * K. It would be interesting to compare these two approaches to see if we can prov

[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-31 Thread via GitHub
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1279318027 ## lucene/core/src/test/org/apache/lucene/util/hnsw/TestNeighborQueue.java: ## @@ -114,6 +114,38 @@ public void testUnboundedQueue() { assertEquals(maxNode, nn.

[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-31 Thread via GitHub
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1279321413 ## lucene/join/src/java/org/apache/lucene/search/join/ToParentJoinKnnCollector.java: ## @@ -0,0 +1,294 @@ +/* + * Licensed to the Apache Software Foundation (ASF) un

[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-31 Thread via GitHub
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1279430631 ## lucene/core/src/java/org/apache/lucene/search/KnnCollector.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more +

[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-31 Thread via GitHub
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1279441401 ## lucene/join/src/java/org/apache/lucene/search/join/ToParentJoinKnnCollector.java: ## @@ -0,0 +1,294 @@ +/* + * Licensed to the Apache Software Foundation (ASF) un

[GitHub] [lucene] msokolov commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-31 Thread via GitHub
msokolov commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1279450672 ## lucene/join/src/java/org/apache/lucene/search/join/ToParentJoinKnnCollector.java: ## @@ -0,0 +1,294 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

[GitHub] [lucene] msokolov commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-31 Thread via GitHub
msokolov commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1279451735 ## lucene/core/src/java/org/apache/lucene/search/KnnCollector.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + *

[GitHub] [lucene] nreimers commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores

2023-07-31 Thread via GitHub
nreimers commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1658640222 @msokolov In our BEIR paper we talked about this: https://arxiv.org/abs/2104.08663 The issue with cosine similarity is that it just encodes the topic. For the query `What

[GitHub] [lucene] benwtrent opened a new pull request, #12479: GITHUB#12342 Add new maximum inner product vector similarity method

2023-07-31 Thread via GitHub
benwtrent opened a new pull request, #12479: URL: https://github.com/apache/lucene/pull/12479 The current dot-product score scaling and similarity implementation assumes normalized vectors. This disregards information that the model may store within the magnitude. See: https://githu

[GitHub] [lucene] benwtrent commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores

2023-07-31 Thread via GitHub
benwtrent commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1659123368 I found another dataset, Yandex Text-to-image: https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search I tested against the first 500_000 values in t