[GitHub] [lucene] uschindler commented on pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-17 Thread GitBox


uschindler commented on PR #895:
URL: https://github.com/apache/lucene/pull/895#issuecomment-1128502847

   What I wanted to add: merging is mostly an I/O-bound operation. More cores 
would not necessarily make it faster (your SSD has a limited amount of 
parallelism). It may help with different indexes placed on different SSDs, but 
those would have separate merge schedulers anyway. If you look a few lines up 
in the code: if it's a spinning hard disk, the maximum number of threads is 1.
   
   P.S.: If we really want to change this, the documentation (javadocs) needs 
updating, too.
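   The heuristic described above can be sketched as follows. This is an 
illustration of the documented behavior, not the actual ConcurrentMergeScheduler 
source; the method name and the spinning-disk flag are this example's own:

   ```java
// Sketch of the merge-thread heuristic discussed above: one thread on
// spinning disks (they handle concurrent I/O poorly), otherwise half the
// cores, capped at 4. Illustrative only, not Lucene's actual code.
public class MergeThreads {
    static int maxThreadCount(boolean diskSpins, int coreCount) {
        if (diskSpins) {
            return 1; // a spinning disk gains nothing from concurrent merges
        }
        return Math.max(1, Math.min(4, coreCount / 2));
    }

    public static void main(String[] args) {
        System.out.println(maxThreadCount(true, 32));  // 1
        System.out.println(maxThreadCount(false, 4));  // 2
        System.out.println(maxThreadCount(false, 64)); // 4
    }
}
   ```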


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-17 Thread GitBox


mocobeta commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1128594358

   The previous CI run result looks good to me. I also enabled GUI tests in the 
smoke tester.





[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-17 Thread GitBox


mocobeta commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1128599401

   How about renaming the action's directory name 
`.github/actions/yarn-caches/` to `.github/actions/gradle-caches/`? @dweiss 





[GitHub] [lucene] dweiss commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-17 Thread GitBox


dweiss commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1128601383

   up to you, entirely. :)





[GitHub] [lucene] dweiss commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-17 Thread GitBox


dweiss commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1128601991

   The yarn-caches name is wrong - it's something else I was working on (!).





[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-17 Thread GitBox


mocobeta commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1128605453

   I'll update the directory name later.
   
   This time, the test timed out on Windows... I think this could occasionally 
happen :/





[GitHub] [lucene] dweiss commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-17 Thread GitBox


dweiss commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1128610332

   Increase the timeout, maybe? Windows boxes on github are slow.





[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-17 Thread GitBox


mocobeta commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1128612813

   I increased the timeout to 120 seconds.





[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-17 Thread GitBox


mocobeta commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1128664466

   sorry couldn't resist.
   
   "Build Duke" 
   
![build_duke](https://user-images.githubusercontent.com/1825333/168782488-06cca107-7052-42e7-983c-843b43e70a28.png)
   





[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-17 Thread GitBox


mocobeta commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1128676036

   All checks passed, and I think we've done everything I wanted to do here: we 
disabled the GUI test in the mandatory test runs and instead enabled it on all 
CI runs (Jenkins, GH Actions) and in the smoke tester.





[jira] [Resolved] (LUCENE-10575) Broken links in some javadocs

2022-05-17 Thread Alan Woodward (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Woodward resolved LUCENE-10575.

Fix Version/s: 9.2
   Resolution: Fixed

> Broken links in some javadocs
> -
>
> Key: LUCENE-10575
> URL: https://issues.apache.org/jira/browse/LUCENE-10575
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Alan Woodward
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The release wizard for 9.2 has found some broken javadoc links:
>  * ExternalRefSorter refers to package-private implementations when it should 
> probably refer to the relevant interfaces instead
>  * STMergingTermsEnum refers to package-private classes.  I think we can 
> solve this by making the whole class package-private, given that it's an 
> implementation detail within a Codec?
>  * MatchRegionRetriever links to an internal implementation, which should 
> just be described rather than linked.
>  
> These are all fairly simple to fix, and I will open a PR to do so.  Slightly 
> more worrying is that running `./gradlew 
> lucene:documentation:checkBrokenLinks` does not seem to consistently find 
> these problems.  The release wizard runs against an entirely clean checkout 
> and fails, but attempting to reproduce the failure on an existing checkout 
> produces a green build.  Some of these broken links have been around for a 
> while - the STMergingTermsEnum ones since 2019 - so it may just be luck that 
> I found them this time round.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-17 Thread GitBox


jpountz commented on PR #895:
URL: https://github.com/apache/lucene/pull/895#issuecomment-1128736566

   The current calculation makes sense to me. Merge policies like to organize 
segments into tiers, where the number of segments on each tier is typically 
also the number of segments that can be merged together, so it doesn't make 
much sense to perform multiple merges on the same tier concurrently. The way 
I'm reading the current formula is that we scale the number of merge threads 
with the number of processors, but stop at 4 because that already allows 
Lucene to perform merges on 4 different tiers concurrently - a lot, given that 
tiers have exponential sizes and that TieredMergePolicy has a max merged 
segment size of 5GB.
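   The exponential-tiers point can be made concrete with a back-of-the-envelope 
count. The 2MB floor segment size and 10x per-tier growth used here are 
illustrative assumptions, not TieredMergePolicy's exact defaults:

   ```java
// Counts how many exponentially-growing tiers fit between a floor segment
// size and the 5GB max merged segment size. With ~4 tiers, 4 merge threads
// already cover one concurrent merge per tier.
public class TierCount {
    static int tierCount(long floorBytes, long maxBytes, int mergeFactor) {
        int tiers = 0;
        for (long size = floorBytes; size <= maxBytes; size *= mergeFactor) {
            tiers++;
        }
        return tiers;
    }

    public static void main(String[] args) {
        long mb = 1L << 20, gb = 1L << 30;
        // 2MB, 20MB, 200MB, ~2GB -> 4 tiers below the 5GB cap
        System.out.println(tierCount(2 * mb, 5 * gb, 10)); // 4
    }
}
   ```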





[GitHub] [lucene] jpountz closed pull request #892: LUCENE-10573: Improve stored fields bulk merge for degenerate O(n^2) merges.

2022-05-17 Thread GitBox


jpountz closed pull request #892: LUCENE-10573: Improve stored fields bulk 
merge for degenerate O(n^2) merges.
URL: https://github.com/apache/lucene/pull/892





[jira] [Resolved] (LUCENE-10573) Improve stored fields bulk merge for degenerate O(n^2) merges

2022-05-17 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10573.
---
Resolution: Won't Fix

> Improve stored fields bulk merge for degenerate O(n^2) merges
> -
>
> Key: LUCENE-10573
> URL: https://issues.apache.org/jira/browse/LUCENE-10573
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Spin-off from LUCENE-10556.
> For small merges that are below the floor segment size, TieredMergePolicy may 
> merge segments that have vastly different sizes, e.g. one 10k-docs segment 
> with 9 100-docs segments. 
> While we might be able to improve TieredMergePolicy (LUCENE-10569), there are 
> also improvements we could make to stored fields, such as bulk-copying chunks 
> of the first segment until the first dirty chunk. In this scenario where 
> segments keep being rewritten, this would help significantly.






[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538131#comment-17538131
 ] 

Adrien Grand commented on LUCENE-10572:
---

If this is memory-bound, I wonder if we could get benefits e.g. by splitting 
the hash table into one hash table for short terms and another for long terms. 
Since the most frequent terms are usually short, maybe this would help reduce 
the number of cache misses and in turn improve indexing speed. And if that 
makes indexing less memory-bound, maybe changes like Uwe's would start making 
a difference.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Attachments: Screen Shot 2022-05-16 at 10.28.22 AM.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!
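The first bullet above (always writing the term length as two bytes instead of 
a 1-2 byte vInt) can be sketched as follows. Method names are illustrative, 
not Lucene's:

```java
// Sketch of the length-encoding trade-off from the first bullet: a vInt
// (7 data bits per byte, high bit = continuation) needs a branch and writes
// 1 or 2 bytes, while a fixed little-endian short always writes 2 bytes
// with no data-dependent branching. Since IndexWriter caps term length
// around 32K (below the 64K an unsigned short covers), two bytes always fit.
public class TermLengthEncoding {
    static int writeVInt(byte[] buf, int offset, int length) {
        int pos = offset;
        while ((length & ~0x7F) != 0) {
            buf[pos++] = (byte) ((length & 0x7F) | 0x80); // continuation bit
            length >>>= 7;
        }
        buf[pos++] = (byte) length;
        return pos - offset; // number of bytes written
    }

    static int writeFixedShort(byte[] buf, int offset, int length) {
        buf[offset] = (byte) length;
        buf[offset + 1] = (byte) (length >>> 8);
        return 2; // always two bytes, no branch on the value
    }

    public static void main(String[] args) {
        byte[] buf = new byte[4];
        System.out.println(writeVInt(buf, 0, 100));       // 1
        System.out.println(writeVInt(buf, 0, 300));       // 2
        System.out.println(writeFixedShort(buf, 0, 100)); // 2
    }
}
```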






[jira] [Commented] (LUCENE-10392) Handle soft deletes via LiveDocsFormat

2022-05-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538134#comment-17538134
 ] 

Adrien Grand commented on LUCENE-10392:
---

[~shahrs87] I set the priority to minor, but in my opinion this is a pretty 
hard task, so I'm not sure it's a good fit for a 2nd issue unless you're 
already very familiar with how Lucene handles file formats.

> Handle soft deletes via LiveDocsFormat
> --
>
> Key: LUCENE-10392
> URL: https://issues.apache.org/jira/browse/LUCENE-10392
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> We have been using doc values to handle soft deletes until now, but this is a 
> bit of a hack as it:
>  - forces users to reserve a field name for doc values
>  - generally doesn't read directly from doc values, instead docs values help 
> populate bitsets and then reads are performed via these bitsets
> It would also be more natural to have both hard and soft deletes handled by 
> the same file format?






[GitHub] [lucene] jpountz commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID

2022-05-17 Thread GitBox


jpountz commented on PR #873:
URL: https://github.com/apache/lucene/pull/873#issuecomment-1128797150

   When the order is reversed, your change negates the `node` twice so that we 
keep tie-breaking by increasing node IDs in all cases. With this fix, I wonder 
if we could simplify the encoding logic by only making the `score` affected by 
the order, not the `node`? I'm thinking of something like this (which may be 
incorrect, I haven't tested it):
   
   ```
   float multiplicator = reversed ? -1f : 1f; // could be precomputed
   int sortableScore = NumericUtils.floatToSortableInt(multiplicator * score);
   long encoded = ((long) sortableScore << 32) | (Integer.MAX_VALUE - node);
   ```
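   The suggestion relies on the float-to-sortable-int bit trick preserving 
float order under integer comparison. A self-contained illustration follows; 
the `floatToSortableInt` helper reimplements locally what Lucene's 
`NumericUtils.floatToSortableInt` does, and the class and method names are 
this example's own, not the PR's:

   ```java
// Self-contained sketch of the encoding suggested above: only the score is
// affected by the sort order, while ties are always broken by preferring
// the smaller node id. The bit trick flips the low 31 bits of negative
// floats so that signed int order matches float order.
public class ScoreNodeEncoding {
    static int floatToSortableInt(float value) {
        int bits = Float.floatToIntBits(value);
        return bits ^ ((bits >> 31) & 0x7fffffff);
    }

    static long encode(float score, int node, boolean reversed) {
        float multiplier = reversed ? -1f : 1f; // could be precomputed
        int sortableScore = floatToSortableInt(multiplier * score);
        // high 32 bits: score order; low 32 bits: smaller node wins ties
        return ((long) sortableScore << 32) | (Integer.MAX_VALUE - node);
    }

    public static void main(String[] args) {
        // higher score sorts higher; ties prefer the smaller node id
        System.out.println(encode(2f, 0, false) > encode(1f, 0, false)); // true
        System.out.println(encode(1f, 3, false) > encode(1f, 7, false)); // true
        // reversed order flips score comparisons but not tie-breaking
        System.out.println(encode(1f, 0, true) < encode(0.5f, 0, true)); // true
        System.out.println(encode(1f, 3, true) > encode(1f, 7, true));   // true
    }
}
   ```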





[jira] [Commented] (LUCENE-10266) Move nearest-neighbor search on points to core?

2022-05-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538150#comment-17538150
 ] 

Adrien Grand commented on LUCENE-10266:
---

Let's add a method to `LatLonPoint` with the following signature and remove 
similar logic from sandbox?

{code}
  public static TopFieldDocs nearest(String field, double latitude, double 
longitude, IndexReader reader, int n);
{code}

> Move nearest-neighbor search on points to core?
> ---
>
> Key: LUCENE-10266
> URL: https://issues.apache.org/jira/browse/LUCENE-10266
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> Now that the Points' public API supports running nearest-nearest neighbor 
> search, should we move it to core via helper methods on {{LatLonPoint}} and 
> {{XYPoint}}?






[GitHub] [lucene] jpountz merged pull request #876: LUCENE-9356: Change test to detect mismatched checksums instead of byte flips.

2022-05-17 Thread GitBox


jpountz merged PR #876:
URL: https://github.com/apache/lucene/pull/876





[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips

2022-05-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538153#comment-17538153
 ] 

ASF subversion and git services commented on LUCENE-9356:
-

Commit e65c0c777b61a964483d1f9ed645d91973a1540e in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e65c0c777b6 ]

LUCENE-9356: Change test to detect mismatched checksums instead of byte flips. 
(#876)

This makes the test more robust and gives a good sense of whether file formats
are implementing `checkIntegrity` correctly.

> Add tests for corruptions caused by byte flips
> --
>
> Key: LUCENE-9356
> URL: https://issues.apache.org/jira/browse/LUCENE-9356
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We already have tests that file truncation and modification of the index 
> headers are caught correctly. I'd like to add another test that flipping a 
> byte in a way that modifies the checksum of the file is always caught 
> gracefully by Lucene.






[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips

2022-05-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538158#comment-17538158
 ] 

ASF subversion and git services commented on LUCENE-9356:
-

Commit f69dc58befea40f1cd802d8b0502748cc7daad96 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f69dc58befe ]

LUCENE-9356: Change test to detect mismatched checksums instead of byte flips. 
(#876)

This makes the test more robust and gives a good sense of whether file formats
are implementing `checkIntegrity` correctly.

> Add tests for corruptions caused by byte flips
> --
>
> Key: LUCENE-9356
> URL: https://issues.apache.org/jira/browse/LUCENE-9356
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We already have tests that file truncation and modification of the index 
> headers are caught correctly. I'd like to add another test that flipping a 
> byte in a way that modifies the checksum of the file is always caught 
> gracefully by Lucene.






[jira] [Resolved] (LUCENE-9356) Add tests for corruptions caused by byte flips

2022-05-17 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-9356.
--
Fix Version/s: 9.2
   Resolution: Fixed

I pushed to the 9.2 branch since it included some fixes for vector file formats.

> Add tests for corruptions caused by byte flips
> --
>
> Key: LUCENE-9356
> URL: https://issues.apache.org/jira/browse/LUCENE-9356
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We already have tests that file truncation and modification of the index 
> headers are caught correctly. I'd like to add another test that flipping a 
> byte in a way that modifies the checksum of the file is always caught 
> gracefully by Lucene.






[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips

2022-05-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538160#comment-17538160
 ] 

ASF subversion and git services commented on LUCENE-9356:
-

Commit 978eef5459c7683038ddcca4ec56e4baa63715d0 in lucene's branch 
refs/heads/branch_9_2 from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=978eef5459c ]

LUCENE-9356: Change test to detect mismatched checksums instead of byte flips. 
(#876)

This makes the test more robust and gives a good sense of whether file formats
are implementing `checkIntegrity` correctly.

> Add tests for corruptions caused by byte flips
> --
>
> Key: LUCENE-9356
> URL: https://issues.apache.org/jira/browse/LUCENE-9356
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We already have tests that file truncation and modification of the index 
> headers are caught correctly. I'd like to add another test that flipping a 
> byte in a way that modifies the checksum of the file is always caught 
> gracefully by Lucene.






[jira] [Updated] (LUCENE-9356) Add tests for mismatched checksums

2022-05-17 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-9356:
-
Summary: Add tests for mismatched checksums  (was: Add tests for 
corruptions caused by byte flips)

> Add tests for mismatched checksums
> --
>
> Key: LUCENE-9356
> URL: https://issues.apache.org/jira/browse/LUCENE-9356
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We already have tests that file truncation and modification of the index 
> headers are caught correctly. I'd like to add another test that flipping a 
> byte in a way that modifies the checksum of the file is always caught 
> gracefully by Lucene.






[GitHub] [lucene] risdenk commented on pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-17 Thread GitBox


risdenk commented on PR #895:
URL: https://github.com/apache/lucene/pull/895#issuecomment-1128827348

   Fair enough - I appreciate all the comments and the additional context I 
couldn't find in the linked Jiras.





[GitHub] [lucene] risdenk closed pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-17 Thread GitBox


risdenk closed pull request #895: LUCENE-10576: ConcurrentMergeScheduler 
maxThreadCount calculation is artificially low
URL: https://github.com/apache/lucene/pull/895





[jira] [Updated] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-17 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10576:
--
Resolution: Won't Fix
Status: Resolved  (was: Patch Available)

> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> ---
>
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>Priority: Minor
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177]
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This 
> doesn't take into account higher coreCount.
> I can't tell whether this is by design or just a mix-up of logic in the 
> calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.
> 
> Based on the history I could find, this has been this way forever.
>  * LUCENE-6437
>  * LUCENE-6119
>  * LUCENE-5951
>  ** Introduced as "maxThreadCount = Math.max(1, Math.min(3, 
> Runtime.getRuntime().availableProcessors()/2));"
>  ** 
> https://github.com/apache/lucene/commit/33410e30c1af7105a6b8b922255af047d13be626#diff-ceb8ec6fe5807682cfb691a8ec52bcc672fb7c5eeb6922c80da4c075f7f003c8R147
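To make the quoted formulas concrete, here is a quick local evaluation (a 
sketch, not Lucene code). Note that the literal 1-and-4 swap does not simplify 
to Math.max(4, coreCount / 2): since Math.min(1, coreCount / 2) can never 
exceed 1, the swapped expression collapses to the constant 4.

```java
// Evaluates the shipped formula next to the literal 1<->4 swap from the
// issue text. currentFormula scales with cores and caps at 4; the swap
// always returns 4, because min(1, coreCount / 2) is at most 1.
public class MergeThreadFormulas {
    static int currentFormula(int coreCount) {
        return Math.max(1, Math.min(4, coreCount / 2));
    }

    static int swappedFormula(int coreCount) {
        return Math.max(4, Math.min(1, coreCount / 2));
    }

    public static void main(String[] args) {
        for (int cores : new int[] {1, 2, 8, 64}) {
            System.out.println(cores + " cores -> current="
                + currentFormula(cores) + ", swapped=" + swappedFormula(cores));
        }
        // current: 1, 1, 4, 4   swapped: 4, 4, 4, 4
    }
}
```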






[jira] [Commented] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-17 Thread Kevin Risden (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538164#comment-17538164
 ] 

Kevin Risden commented on LUCENE-10576:
---

This is marked as won't fix since some reasonable items were brought up on the 
PR - https://github.com/apache/lucene/pull/895







[GitHub] [lucene] jpountz opened a new pull request, #896: LUCENE-9409: Reenable TestAllFilesDetectTruncation.

2022-05-17 Thread GitBox


jpountz opened a new pull request, #896:
URL: https://github.com/apache/lucene/pull/896

- Removed dependency on LineFileDocs to improve reproducibility.
- Relaxed the expected exception type: any exception is ok.
- Ignore rare cases when a file still appears to have a well-formed footer
  after truncation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538183#comment-17538183
 ] 

Adrien Grand commented on LUCENE-10574:
---

I was assuming we wanted to have strong guarantees about the number of segments 
in the index at search time, but it's a fair point that degrading to O(n^2) 
merging to meet this guarantee is not a good trade-off.

I tried to think of ways we could do this. One obvious option is to remove 
{{floorSegmentBytes}}, but this might be a bit too extreme as it would allow 
any index to have a long tail of small segments? One idea I started playing 
with consists of ensuring that every merge grows the largest input segment by 
at least some fraction, e.g. 50%. It tries to strike a balance between avoiding 
pathological merging and still trying to keep the number of segments contained 
at search time. I quickly hacked this into TieredMergePolicy and this made the 
StoredFieldsBenchmark more than 2x faster. I wonder if there are other 
approaches we should consider.
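The "every merge must grow its largest input by some fraction" idea can be sketched as a simple admission check over a candidate merge's segment sizes (illustrative code only, not the actual TieredMergePolicy patch; names are mine):

```java
import java.util.List;

public class MergeGrowthCheck {
    /** Returns true if merging the candidate segments grows the largest
     *  input segment by at least minGrowth (e.g. 0.5 for 50%). */
    static boolean growsEnough(List<Long> segmentSizes, double minGrowth) {
        long total = 0;
        long largest = 0;
        for (long size : segmentSizes) {
            total += size;
            largest = Math.max(largest, size);
        }
        // merged size ~= sum of inputs; require it to exceed the largest
        // input by the configured fraction
        return total >= largest * (1 + minGrowth);
    }

    public static void main(String[] args) {
        // pathological merge: one big segment plus a tiny one barely grows it
        System.out.println(growsEnough(List.of(1000L, 10L), 0.5));  // false
        System.out.println(growsEnough(List.of(1000L, 600L), 0.5)); // true
    }
}
```

A check like this rejects the pathological "big segment swallows a tiny one" merges while still allowing merges that make real progress toward fewer segments.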







[GitHub] [lucene] mayya-sharipova commented on pull request #876: LUCENE-9356: Change test to detect mismatched checksums instead of byte flips.

2022-05-17 Thread GitBox


mayya-sharipova commented on PR #876:
URL: https://github.com/apache/lucene/pull/876#issuecomment-1128871222

   Thanks Adrien for catching errors with vector files. 





[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-17 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538194#comment-17538194
 ] 

Robert Muir commented on LUCENE-10574:
--

I think another approach is to actually remove the {{O(n^2)}}, remove 
{{floorSegmentBytes}}, let it kick into all the benchmarks. Now that the bad 
algorithm is gone, follow up by looking at alternative, safe methods to try to 
keep the number of segments "contained" that don't cause pathological 
performance issues.

It seems we all just accept this {{O(n^2)}} as a necessity, but I really don't 
know why: I'm not sold on it at all.







[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-17 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538198#comment-17538198
 ] 

Uwe Schindler commented on LUCENE-10572:


If we have 2 hash tables, we could have one for short terms up to 255 bytes (we 
could also make the limit smaller, but 255 is the limit for the 1-byte length 
encoding), and all longer ones in a separate hash (where the comparisons are 
also more expensive).

I am not sure if the additional complexity is worth it.

About changing the hash algorithm: we may add a counter into the hash table to 
actually measure how many collisions we have during indexing wikipedia. But 
actually when inserting a term that is already in the hash table, we get a hash 
collision and have to confirm with Arrays.equals() that the term is already 
there. I tend to think that the smaller terms are more often duplicates than 
larger ones, so having them in a separate table may be a good idea.

Maybe we should have some statistics during wikipedia indexing:

- how many hash collisions do we have (where term is actually not already in 
table)? => this ratio should be low. We can compare hash algorithms for that.
- how many hash collisions do we get because the term is already in table? => 
this is most expensive memory-wise, because hash AND equals have to be 
calculated.
- how many inserts of new terms without a collision do we get?
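A hypothetical instrumentation sketch of these three counters, using a toy linear-probing table of Strings rather than BytesRefHash itself (all names are illustrative):

```java
/** Classifies inserts into the three categories suggested above.
 *  A toy linear-probing table, not Lucene's BytesRefHash. */
public class CollisionStats {
    long trueCollisions;   // probe hit a slot holding a *different* term
    long duplicateInserts; // term was already present (hash + equals both paid)
    long cleanInserts;     // landed in an empty slot with no probing

    private final String[] slots = new String[1 << 16];

    void add(String term) {
        int slot = (term.hashCode() & Integer.MAX_VALUE) % slots.length;
        boolean collided = false;
        while (slots[slot] != null) {
            if (slots[slot].equals(term)) {
                duplicateInserts++;
                return;
            }
            collided = true;               // occupied by a different term: probe on
            slot = (slot + 1) % slots.length;
        }
        slots[slot] = term;
        if (collided) trueCollisions++; else cleanInserts++;
    }

    public static void main(String[] args) {
        CollisionStats stats = new CollisionStats();
        for (String t : new String[] {"the", "quick", "the", "fox"}) {
            stats.add(t);
        }
        System.out.println("clean=" + stats.cleanInserts
            + " dup=" + stats.duplicateInserts
            + " collision=" + stats.trueCollisions);
    }
}
```

Running counters like these over a wikipedia indexing run would let different hash functions be compared on the first ratio while measuring how much of the cost comes from duplicate terms.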

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Attachments: Screen Shot 2022-05-16 at 10.28.22 AM.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-17 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538211#comment-17538211
 ] 

Robert Muir commented on LUCENE-10572:
--

These measurements are also going to be strange because of how that wikipedia 
indexing works. The stopwords are going to skew everything. If someone is 
removing them, the distribution of tokens will look much different.







[GitHub] [lucene] jpountz commented on pull request #896: LUCENE-9409: Reenable TestAllFilesDetectTruncation.

2022-05-17 Thread GitBox


jpountz commented on PR #896:
URL: https://github.com/apache/lucene/pull/896#issuecomment-1128913704

   The test would still pass without the new checks (another check would fail 
later), but I thought it was more consistent if we call `checkFooter` for every 
`IndexInput` we open across all file formats.





[jira] [Commented] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring

2022-05-17 Thread Mike Drob (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538284#comment-17538284
 ] 

Mike Drob commented on LUCENE-10236:


[~zacharymorn] is this still relevant for 8.11? 
https://github.com/apache/lucene-solr/pull/2637

> CombinedFieldsQuery to use fieldAndWeights.values() when constructing 
> MultiNormsLeafSimScorer for scoring
> -
>
> Key: LUCENE-10236
> URL: https://issues.apache.org/jira/browse/LUCENE-10236
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Zach Chen
>Assignee: Zach Chen
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> This is a spin-off issue from discussion in 
> [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a 
> quick fix in CombinedFieldsQuery scoring.
> Currently CombinedFieldsQuery would use a constructed 
> [fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421]
>  object to create a MultiNormsLeafSimScorer for scoring, but the fields 
> object may contain duplicated field-weight pairs as it is [built from looping 
> over 
> fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414],
>  resulting into duplicated norms being added during scoring calculation in 
> MultiNormsLeafSimScorer. 
> E.g. for CombinedFieldsQuery with two fields and two values matching a 
> particular doc:
> {code:java}
> CombinedFieldQuery query =
> new CombinedFieldQuery.Builder()
> .addField("field1", (float) 1.0)
> .addField("field2", (float) 1.0)
> .addTerm(new BytesRef("foo"))
> .addTerm(new BytesRef("zoo"))
> .build(); {code}
> I would imagine the scoring to be based on the following:
>  # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + 
> freq(field1:zoo) + freq(field2:zoo)
>  # Sum of norms on doc = norm(field1) + norm(field2)
> but the current logic would use the following for scoring:
>  # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + 
> freq(field1:zoo) + freq(field2:zoo)
>  # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + 
> norm(field2)
>  
> In addition, this differs from how MultiNormsLeafSimScorer is constructed 
> from CombinedFieldsQuery explain function, which [uses 
> fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389]
>  and does not contain duplicated field-weight pairs. 
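The difference can be illustrated with plain Java collections (illustrative only, not the actual MultiNormsLeafSimScorer code): deduplicating the per-term (field, weight) pairs by field name, as fieldAndWeights.values() effectively does, collapses the four entries of the two-field/two-term example above to two:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NormsDedup {
    record FieldWeight(String field, float weight) {}

    /** Deduplicates (field, weight) pairs by field name, keeping the first
     *  weight seen, so each field contributes its norm exactly once. */
    static List<FieldWeight> dedup(List<FieldWeight> perTerm) {
        Map<String, FieldWeight> byField = new LinkedHashMap<>();
        for (FieldWeight fw : perTerm) {
            byField.putIfAbsent(fw.field(), fw);
        }
        return List.copyOf(byField.values());
    }

    public static void main(String[] args) {
        // looping over (field, term) pairs repeats each field once per term:
        // two fields x two terms ("foo", "zoo") -> four entries
        List<FieldWeight> perTerm = List.of(
            new FieldWeight("field1", 1f), new FieldWeight("field2", 1f),
            new FieldWeight("field1", 1f), new FieldWeight("field2", 1f));
        System.out.println(perTerm.size() + " entries -> "
            + dedup(perTerm).size() + " distinct fields");
    }
}
```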






[GitHub] [lucene-solr] madrob commented on pull request #2649: Remove '-' between base.version and version.suffix and change common-build to allow the new format

2022-05-17 Thread GitBox


madrob commented on PR #2649:
URL: https://github.com/apache/lucene-solr/pull/2649#issuecomment-1129013348

   @anshumg does 8.11.2 need this, or should we close this PR?





[jira] [Commented] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-17 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538321#comment-17538321
 ] 

Chris M. Hostetter commented on LUCENE-10576:
-

Should those "reasonable items" be added as comments to the code so they aren't 
lost to time?







[GitHub] [lucene] shahrs87 opened a new pull request, #897: LUCENE-10266 Move nearest-neighbor search on points to core

2022-05-17 Thread GitBox


shahrs87 opened a new pull request, #897:
URL: https://github.com/apache/lucene/pull/897

   
   
   
   # Description
   
   Please provide a short description of the changes you're making with this 
pull request.
   
   # Solution
   
   Please provide a short description of the approach taken to implement your 
solution.
   
   # Tests
   
   Please describe the tests you've developed or run to confirm this patch 
implements the feature or solves the problem.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my 
code conforms to the standards described there to the best of my ability.
   - [ ] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `main` branch.
   - [ ] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   





[GitHub] [lucene] shahrs87 commented on pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core

2022-05-17 Thread GitBox


shahrs87 commented on PR #897:
URL: https://github.com/apache/lucene/pull/897#issuecomment-1129107476

   @jpountz I have created this PR as per your suggestion in LUCENE-10266 jira. 
I have made the following assumptions. Please correct me if needed.
   1. I have deleted the LatLonPointPrototypeQueries class since there is no 
other sandbox query in that class. Should I keep the empty class?
   2. I see we have a FloatPointNearestNeighbor implementation in sandbox which 
is similar to NearestNeighbor. Do I need to remove FloatPointNearestNeighbor 
from sandbox and move it to lucene/core?
   3. I have added this change to the API Changes section in CHANGES.txt. 
Please correct me if it belongs somewhere else.
   
   Thank you.





[jira] [Commented] (LUCENE-10392) Handle soft deletes via LiveDocsFormat

2022-05-17 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538347#comment-17538347
 ] 

Rushabh Shah commented on LUCENE-10392:
---

> unless you're already very familiar with how Lucene handles file formats.

[~jpountz] Thank you for the reply. I am not at all familiar with the file 
formats. Can you suggest a blog/article or some class names where I can 
learn more about the different file formats?

> Handle soft deletes via LiveDocsFormat
> --
>
> Key: LUCENE-10392
> URL: https://issues.apache.org/jira/browse/LUCENE-10392
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> We have been using doc values to handle soft deletes until now, but this is a 
> bit of a hack as it:
>  - forces users to reserve a field name for doc values
>  - generally doesn't read directly from doc values, instead docs values help 
> populate bitsets and then reads are performed via these bitsets
> It would also be more natural to have both hard and soft deletes handled by 
> the same file format?






[GitHub] [lucene-solr] cpoerschke opened a new pull request, #2656: LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms to rewrite sufficiently

2022-05-17 Thread GitBox


cpoerschke opened a new pull request, #2656:
URL: https://github.com/apache/lucene-solr/pull/2656

   backport of https://github.com/apache/lucene/pull/737 and 
https://github.com/apache/lucene/pull/758
   
   for https://issues.apache.org/jira/browse/LUCENE-10477 and 
https://issues.apache.org/jira/browse/LUCENE-10464
   





[jira] [Reopened] (LUCENE-10477) SpanBoostQuery.rewrite was incomplete for boost==1 factor

2022-05-17 Thread Christine Poerschke (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christine Poerschke reopened LUCENE-10477:
--

re-opening for potential backport: 
https://github.com/apache/lucene-solr/pull/2656

> SpanBoostQuery.rewrite was incomplete for boost==1 factor
> -
>
> Key: LUCENE-10477
> URL: https://issues.apache.org/jira/browse/LUCENE-10477
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
> Fix For: 10.0 (main), 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> _(This bug report concerns pre-9.0 code only but it's so subtle that it 
> warrants sharing I think and maybe fixing if there was to be a 8.11.2 release 
> in future.)_
> Some existing code e.g. 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/queryparser/src/java/org/apache/lucene/queryparser/xml/builders/SpanNearBuilder.java#L54]
>  adds a {{SpanBoostQuery}} even if there is no boost or the boost factor is 
> {{1.0}} i.e. technically wrapping is unnecessary.
> Query rewriting should counteract this somewhat except it might not e.g. note 
> at 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanBoostQuery.java#L81-L83]
>  how the rewrite is a no-op i.e. {{this.query.rewrite}} is not called!
> This can then manifest in strange ways e.g. during highlighting:
> {code:java}
> ...
> java.lang.IllegalArgumentException: Rewrite first!
>   at 
> org.apache.lucene.search.spans.SpanMultiTermQueryWrapper.createWeight(SpanMultiTermQueryWrapper.java:99)
>   at 
> org.apache.lucene.search.spans.SpanNearQuery.createWeight(SpanNearQuery.java:183)
>   at 
> org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:295)
>   ...
> {code}
> This stacktrace is not from 8.11.1 code but the general logic is that at line 
> 293 rewrite was called (except it didn't a full rewrite because of 
> {{SpanBoostQuery}} wrapping around the {{{}SpanNearQuery{}}}) and so then at 
> line 295 the {{IllegalArgumentException("Rewrite first!")}} arises: 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanMultiTermQueryWrapper.java#L101]
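The contract being violated can be reduced to a tiny model (hypothetical Query types, not Lucene's actual API): callers rewrite to a fixpoint, so a wrapper whose rewrite() returns itself without rewriting its inner query, as the pre-9.0 SpanBoostQuery did for boost == 1, leaves the inner query unrewritten and triggers "Rewrite first!" style failures downstream:

```java
/** Minimal model of the rewrite-to-fixpoint contract. */
interface Query { Query rewrite(); }

final class LeafQuery implements Query {
    public Query rewrite() { return this; } // already fully rewritten
}

final class NeedsRewrite implements Query {
    public Query rewrite() { return new LeafQuery(); }
}

/** Buggy no-op wrapper: never rewrites the wrapped query. */
final class BuggyBoost implements Query {
    final Query inner;
    BuggyBoost(Query inner) { this.inner = inner; }
    public Query rewrite() { return this; } // inner is never rewritten
}

/** Fixed wrapper: for a no-op boost, unwrap and rewrite the inner query. */
final class FixedBoost implements Query {
    final Query inner;
    FixedBoost(Query inner) { this.inner = inner; }
    public Query rewrite() { return inner.rewrite(); }
}

public class RewriteDemo {
    /** Rewrite until the query stops changing, as search code does. */
    static Query rewriteToFixpoint(Query q) {
        Query rewritten = q.rewrite();
        while (rewritten != q) {
            q = rewritten;
            rewritten = q.rewrite();
        }
        return rewritten;
    }

    public static void main(String[] args) {
        Query buggy = rewriteToFixpoint(new BuggyBoost(new NeedsRewrite()));
        Query fixed = rewriteToFixpoint(new FixedBoost(new NeedsRewrite()));
        System.out.println(buggy instanceof BuggyBoost); // still unrewritten
        System.out.println(fixed instanceof LeafQuery);
    }
}
```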






[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-17 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538389#comment-17538389
 ] 

Michael Sokolov commented on LUCENE-10574:
--

I'm not sure if I understand, but are we seeing O(N^2) because tiny segments 
get merged into small segments, which get merged into smallish segments, and so 
on, and because the original segments were so tiny we end up merging the same 
document(s) many times?
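One way to see how tiny-segment merging can become quadratic: if each tiny flush is merged into the same growing segment, every previously merged document is rewritten again on each merge. A toy simulation of that worst case (not TieredMergePolicy's actual behavior) shows the write amplification:

```java
public class MergeAmplification {
    /** Worst case: every new flush of docsPerFlush docs is immediately merged
     *  into one big segment, rewriting everything merged so far. Returns the
     *  total number of docs written by merges. */
    static long docsRewritten(int flushes, int docsPerFlush) {
        long big = 0, rewritten = 0;
        for (int i = 0; i < flushes; i++) {
            big += docsPerFlush;   // the merge output contains all docs so far...
            rewritten += big;      // ...and all of them are written again
        }
        return rewritten;
    }

    public static void main(String[] args) {
        int flushes = 1000, docsPerFlush = 10;
        long indexed = (long) flushes * docsPerFlush;
        long rewritten = docsRewritten(flushes, docsPerFlush);
        // rewritten grows ~ flushes^2: each doc is rewritten O(flushes) times
        System.out.println("indexed=" + indexed + " rewritten=" + rewritten);
    }
}
```

With 1000 flushes of 10 docs each, 10,000 indexed docs cause over 5 million docs to be rewritten by merges, i.e. each document is rewritten hundreds of times.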







[jira] [Commented] (LUCENE-9625) Benchmark KNN search with ann-benchmarks

2022-05-17 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538395#comment-17538395
 ] 

Michael Sokolov commented on LUCENE-9625:
-

There's no support for using an existing index; creating the index is an 
important part of the benchmark, I think? As for threading, no, it would be 
necessary to modify the test harness. But maybe you should consider 
contributing to ann-benchmarks?

> Benchmark KNN search with ann-benchmarks
> 
>
> Key: LUCENE-9625
> URL: https://issues.apache.org/jira/browse/LUCENE-9625
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In addition to benchmarking with luceneutil, it would be good to be able to 
> make use of ann-benchmarks, which is publishing results from many approximate 
> knn algorithms, including the hnsw implementation from its authors. We don't 
> expect to challenge the performance of these native code libraries, however 
> it would be good to know just how far off we are.
> I started looking into this and posted a fork of ann-benchmarks that uses 
> KnnGraphTester  class to run these: 
> https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to 
> manually copy jars and the KnnGraphTester.class to the test host machine 
> rather than downloading from a distribution. KnnGraphTester needs some 
> modifications in order to support this process - this issue is mostly about 
> that.
> One thing I noticed is that some of the index builds with higher fanout 
> (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is 
> concerning and I'll open a separate issue for trying to improve that.






[GitHub] [lucene] shahrs87 opened a new pull request, #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-17 Thread GitBox


shahrs87 opened a new pull request, #898:
URL: https://github.com/apache/lucene/pull/898

   
   
   
   
   # Description
   
   Please provide a short description of the changes you're making with this 
pull request.
   
   # Solution
   
   Please provide a short description of the approach taken to implement your 
solution.
   
   # Tests
   
   Please describe the tests you've developed or run to confirm this patch 
implements the feature or solves the problem.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my 
code conforms to the standards described there to the best of my ability.
   - [ ] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `main` branch.
   - [ ] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] shahrs87 commented on pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-17 Thread GitBox


shahrs87 commented on PR #898:
URL: https://github.com/apache/lucene/pull/898#issuecomment-1129196482

   Hi @dsmiley 
Can you please help me review this patch? I have tried to implement it 
using your suggestion in the Jira. Thank you.





[GitHub] [lucene-solr] thelabdude opened a new pull request, #2657: SOLR-16199: Fix query syntax for LIKE queries with wildcard

2022-05-17 Thread GitBox


thelabdude opened a new pull request, #2657:
URL: https://github.com/apache/lucene-solr/pull/2657

   Backport of https://github.com/apache/solr/pull/865





[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-05-17 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538424#comment-17538424
 ] 

Greg Miller commented on LUCENE-10544:
--

+1 to pursuing this delegating bulk scorer suggestion. I really like that idea 
[~jpountz]. Seems like a simple, easy to understand approach that still allows 
queries to provide their own custom bulk scoring logic as necessary. 

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.






[jira] [Created] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Michael Sokolov (Jira)
Michael Sokolov created LUCENE-10577:


 Summary: Quantize vector values
 Key: LUCENE-10577
 URL: https://issues.apache.org/jira/browse/LUCENE-10577
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Reporter: Michael Sokolov


The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
These fields can be used (via {{KnnVectorsReader}}) in two main ways:

1. The {{VectorValues}} iterator enables retrieving values
2. Approximate nearest-neighbor search

The main point of this addition was to provide the search capability, and to 
support that it is not really necessary to store vectors in full precision. 
Perhaps users may also be willing to retrieve values in lower precision for 
whatever purpose those serve, if they are able to store more samples. We know 
that 8 bits is enough to provide a very near approximation to the same 
recall/performance tradeoff that is achieved with the full-precision vectors. 
I'd like to explore how we could enable 4:1 compression of these fields by 
reducing their precision.

A few ways I can imagine this would be done:

1. Provide a parallel byte-oriented API. This would allow users to provide 
their data in reduced-precision format and give control over the quantization 
to them. It would have a major impact on the Lucene API surface though, 
essentially requiring us to duplicate all of the vector APIs.
2. Automatically quantize the stored vector data when we can. This would 
require no or perhaps very limited change to the existing API to enable the 
feature.

I've been exploring (2), and what I find is that we can achieve very good 
recall results using dot-product similarity scoring by simple linear scaling + 
quantization of the vector values, so long as we choose the scale that 
minimizes the quantization error. Dot-product is amenable to this treatment 
since vectors are required to be unit-length when used with that similarity 
function. 

 Even still there is variability in the ideal scale over different data sets. A 
good choice seems to be max(abs(min-value), abs(max-value)), but of course this 
assumes that the data set doesn't have a few outlier data points. A theoretical 
range can be obtained by 1/sqrt(dimension), but this is only useful when the 
samples are normally distributed. We could in theory determine the ideal scale 
when flushing a segment and manage this quantization per-segment, but then 
numerical error could creep in when merging.
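The linear scaling + quantization idea above can be sketched as follows (Python, purely for illustration; this is not Lucene code, and the max-absolute-value scale choice is the one discussed in the paragraph above):

```python
import math
import random

def unit_vector(dim):
    # Random unit-length vector; dot-product similarity assumes unit length.
    v = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def quantize(vec, scale):
    # Linearly scale each component and round to a signed byte (-128..127).
    return [max(-128, min(127, round(v * scale))) for v in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(42)
a, b = unit_vector(128), unit_vector(128)

# Scale chosen from the observed extremes: 127 / max(|min-value|, |max-value|).
limit = max(abs(min(a + b)), abs(max(a + b)))
scale = 127.0 / limit

qa, qb = quantize(a, scale), quantize(b, scale)
exact = dot(a, b)
# Undo the scaling so the quantized score is comparable to the exact one.
approx = dot(qa, qb) / (scale * scale)
print(f"exact={exact:.4f} approx={approx:.4f} err={abs(exact - approx):.4f}")
```

With unit-length inputs the per-component quantization error is at most 0.5/scale, which is why the recovered dot product tracks the exact one closely.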

I'll post a patch/PR with an experimental setup I've been using for evaluation 
purposes. It is pretty self-contained and simple, but has some drawbacks that 
need to be addressed:

1. No automated mechanism for determining quantization scale (it's a constant 
that I have been playing with)
2. Converts from byte/float when computing dot-product instead of directly 
computing on byte values

I'd like to get people's feedback on the approach and whether in general we 
should think about doing this compression under the hood, or expose a 
byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
compelling and we should pursue something.






[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538448#comment-17538448
 ] 

Adrien Grand commented on LUCENE-10574:
---

It's not about absolute segment sizes, it's more about computing balanced 
merges. Say you have N 1-document segments and want to merge them down to a 
single segment, 10 segments at a time. If you always compute perfectly balanced 
merges then each document participates in O(log(N)) merges so it takes O(N 
log(N)) to get down to a single segment. If you take the naive approach of 
always merging the biggest segment you got so far with 9 1-document segments 
then each document participates in O(N) merges so it takes O(N^2) to get down 
to a single segment.

As bad as the second approach sounds, this is what TieredMergePolicy does with 
segments that are below the floor segment size.
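This argument can be checked with a toy cost model (Python, illustrative only; "cost" here is simply the number of documents rewritten, and n is assumed to be a power of the fanout for the balanced case):

```python
def balanced_cost(n, fanout=10):
    # Always merge equal-size segments fanout at a time: each doc is
    # rewritten once per level, giving O(n log n) total work.
    cost, size, count = 0, 1, n
    while count > 1:
        merges = count // fanout
        cost += merges * fanout * size
        count, size = merges, size * fanout
    return cost

def naive_cost(n, fanout=10):
    # Always merge the one big segment with (fanout - 1) 1-doc segments:
    # the big segment is rewritten on every merge, giving O(n^2) total work.
    cost, big = 0, 1
    while big < n:
        big += min(fanout - 1, n - big)
        cost += big
    return cost

print(balanced_cost(10_000), naive_cost(10_000))
```

For 10,000 one-document segments the naive policy rewrites documents more than a hundred times as often as the balanced one.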







[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538451#comment-17538451
 ] 

Robert Muir commented on LUCENE-10577:
--

I think a 2-byte float would be a better design than 1-byte float. We should 
design for things that have actual hardware support, not make up our own 
floating point formats for something like this, otherwise it will never get 
vectorized by hotspot and never scale.

We still don't even have 4-byte float support from openjdk vectors, so I think 
it would be better to first wait and see if java exposes half-float 
vectorization in some way we can use.








[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538456#comment-17538456
 ] 

Robert Muir commented on LUCENE-10577:
--

at least for fp16 we see some movement on openjdk (open pull request, java 
issue): 
https://bugs.openjdk.java.net/browse/JDK-8277304
https://github.com/openjdk/panama-vector/pull/164







[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538457#comment-17538457
 ] 

Michael Sokolov commented on LUCENE-10577:
--

Actually what I have in mind is signed byte values (-128..127), not any kind of 
8-bit floating point. But perhaps your point still holds - I don't know whether 
there is hardware support for byte arithmetic?







[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538460#comment-17538460
 ] 

Adrien Grand commented on LUCENE-10577:
---

Would it be possible to implement (1) with a float API by making the format 
detect when all float values across a segment are effectively integers in 
0..255?
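Such a detection pass could look like this (Python, hypothetical sketch only; the bounds are parameters, so the signed-byte variant is the same check with -128..127):

```python
def effectively_byte_valued(vectors, lo=0, hi=255):
    # True when every component is an exact integer in [lo, hi]; a codec could
    # then transparently store one byte per dimension instead of a 4-byte float.
    return all(
        lo <= v <= hi and float(v).is_integer()
        for vec in vectors for v in vec
    )

print(effectively_byte_valued([[0.0, 17.0, 255.0]]))   # integral, in range
print(effectively_byte_valued([[0.5, 2.0]]))           # non-integral component
```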







[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538459#comment-17538459
 ] 

Robert Muir commented on LUCENE-10577:
--

the actual operations you want to do need to be supported. E.g. if you want to 
work on bytes, look at ByteVector and try to write standalone vectorized 
prototype and see how it compares to e.g. dot-product on FloatVector. 
https://docs.oracle.com/en/java/javase/16/docs/api/jdk.incubator.vector/jdk/incubator/vector/ByteVector.html

I'm just saying we can at least make use of the incubating stuff to "design for 
tomorrow". Index format has to be supported for a long time, so I don't think 
we should introduce vectors format that... can't be vectorized :)
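A scalar prototype of the byte dot product (Python, illustrative only; a real vectorized version would use jdk.incubator.vector.ByteVector as linked above). One detail any byte format must get right: int8 x int8 products reach +/-16384, so accumulation has to happen in a wider type than the inputs:

```python
import array

def byte_dot(a, b):
    # Python ints are arbitrary precision, so summing is safe here; the point
    # for Java is to widen each int8*int8 product before accumulating.
    assert len(a) == len(b)
    return sum(int(x) * int(y) for x, y in zip(a, b))

# array type 'b' enforces the signed-byte range -128..127.
a = array.array('b', [127, -128, 5, -1])
b = array.array('b', [127, 127, -5, 1])
print(byte_dot(a, b))
```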

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest -neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
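The scale-and-quantize step described in the quoted issue (linear scaling with the scale chosen from the observed value range, then rounding to signed bytes) can be sketched in a few lines. This is a hypothetical, self-contained illustration; the class and method names are invented for illustration and are not Lucene API.

```java
// Illustrative sketch only (not Lucene code) of the quantization the
// issue describes: choose scale = max(abs(min-value), abs(max-value)),
// then linearly map [-scale, scale] onto signed bytes [-127, 127].
public class QuantizeSketch {

  /** Pick the scale suggested in the issue: max(abs(min), abs(max)). */
  static float chooseScale(float[] values) {
    float max = 0f;
    for (float v : values) {
      max = Math.max(max, Math.abs(v));
    }
    return max;
  }

  /** Quantize each value to a signed byte in [-127, 127]. */
  static byte[] quantize(float[] values, float scale) {
    byte[] out = new byte[values.length];
    for (int i = 0; i < values.length; i++) {
      out[i] = (byte) Math.round(values[i] / scale * 127f);
    }
    return out;
  }

  /** Recover an approximation of the original value. */
  static float dequantize(byte b, float scale) {
    return b * scale / 127f;
  }

  public static void main(String[] args) {
    float[] v = {0.5f, -0.25f, 0.125f};
    float scale = chooseScale(v); // 0.5 for this sample
    byte[] q = quantize(v, scale); // {127, -63, 32}
    System.out.println(scale + " -> " + q[0] + ", " + q[1] + ", " + q[2]);
  }
}
```

Per-segment quantization, as the issue suggests, would then amount to running something like chooseScale over a segment's vectors at flush time and storing the scale alongside the quantized bytes.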



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538467#comment-17538467
 ] 

Michael Sokolov commented on LUCENE-10577:
--

Okay, thanks for the link, [~rcmuir]. I do see ByteVector.mul(ByteVector) and so 
on. And yes, [~jpountz], I think that could work for an API. It would be nice to 
let users worry about getting their data into the right shape. I think it might 
make more sense to expect signed values, though?

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.






[jira] [Comment Edited] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538467#comment-17538467
 ] 

Michael Sokolov edited comment on LUCENE-10577 at 5/17/22 8:57 PM:
---

Okay, thanks for the link, [~rcmuir]. I do see ByteVector.mul(ByteVector) and so 
on. And yes, [~jpountz], I think that could work for an API. It would be nice to 
let users worry about getting their data into the right shape. I think it might 
make more sense to expect signed values, though?

There do seem to be 8-bit vectorized instructions for Intel chips at least 
https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions-2/intrinsics-for-arithmetic-operations-2/mm256-add-epi8-16-32-64.html
 

I agree we should measure, but also the JDK support here seems to be a moving 
target. Perhaps it's time to give it another whirl and see where we are now 
with JDK 18/19.


was (Author: sokolov):
Okay, thanks for the link, [~rcmuir]. I do see ByteVector.mul(ByteVector) and so 
on. And yes, [~jpountz], I think that could work for an API. It would be nice to 
let users worry about getting their data into the right shape. I think it might 
make more sense to expect signed values, though?







[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538474#comment-17538474
 ] 

Robert Muir commented on LUCENE-10577:
--

My main concern with some custom encoding would be if it requires some slow 
scalar conversion.

Currently with simple float representation you can do everything from a 
float[], byte[], or mmaped data directly. See 
https://issues.apache.org/jira/browse/LUCENE-9838

So if you can do stuff directly with ByteVector that would be fine. Also if you 
can use "poor man's vector" with varhandles and a 64-bit long to operate on the 
byte values, that's fine too. But please nothing that only works "one at a time".







[GitHub] [lucene] dsmiley commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-17 Thread GitBox


dsmiley commented on code in PR #898:
URL: https://github.com/apache/lucene/pull/898#discussion_r875264422


##
lucene/CHANGES.txt:
##
@@ -38,6 +38,8 @@ Improvements
 * LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for 
Nori.
   (Uihyun Kim)
 
+* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos 
(Rushabh Shah)

Review Comment:
   But this is an Optimization (should thus go right below); no?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-05-17 Thread GitBox


mdmarshmallow commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r875205786


##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java:
##
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.search.DocIdSetIterator;
+
+/** Get counts given a list of HyperRectangles (which must be of the same 
type) */
+public class HyperRectangleFacetCounts extends Facets {
+  /** Hyper rectangles passed to constructor. */
+  protected final HyperRectangle[] hyperRectangles;
+
+  /** Counts, initialized by subclass. */
+  protected final int[] counts;
+
+  /** Our field name. */
+  protected final String field;
+
+  /** Number of dimensions for field */
+  protected final int dims;
+
+  /** Total number of hits. */
+  protected int totCount;
+
+  /**
+   * Create HyperRectangleFacetCounts using
+   *
+   * @param field Field name
+   * @param hits Hits to facet on
+   * @param hyperRectangles List of long hyper rectangle facets
+   * @throws IOException If there is a problem reading the field
+   */
+  public HyperRectangleFacetCounts(
+  String field, FacetsCollector hits, LongHyperRectangle... 
hyperRectangles)
+  throws IOException {
+this(true, field, hits, hyperRectangles);
+  }
+
+  /**
+   * Create HyperRectangleFacetCounts using
+   *
+   * @param field Field name
+   * @param hits Hits to facet on
+   * @param hyperRectangles List of double hyper rectangle facets
+   * @throws IOException If there is a problem reading the field
+   */
+  public HyperRectangleFacetCounts(
+  String field, FacetsCollector hits, DoubleHyperRectangle... 
hyperRectangles)
+  throws IOException {
+this(true, field, hits, hyperRectangles);
+  }
+
+  private HyperRectangleFacetCounts(
+  boolean discarded, String field, FacetsCollector hits, HyperRectangle... 
hyperRectangles)

Review Comment:
   Nothing really, I just wanted to make all the `HyperRectangle`s be of the 
same subclass, though we could also leave it up to the user to decide whether 
they want that or not, in which case I could just do `HyperRectangle...`



##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangle.java:
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+/** Holds the name and the number of dims for a HyperRectangle */
+public abstract class HyperRectangle {
+  /** Label that identifies this range. */
+  public final String label;
+
+  /** How many dimensions this hyper rectangle has (i.e. a regular rectangle 
would have dims=2) */
+  public final int dims;
+
+  /** Sole constructor. */
+  protected HyperRectangle(String label, int dims) {
+if (label == null) {
+  throw new IllegalArgumentException("label must not be null");
+}
+if (dims <= 0) {
+  throw new IllegalArgumentException("Dims must be greater than 0. Dims=

[GitHub] [lucene] shahrs87 commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-17 Thread GitBox


shahrs87 commented on code in PR #898:
URL: https://github.com/apache/lucene/pull/898#discussion_r875270812


##
lucene/CHANGES.txt:
##
@@ -38,6 +38,8 @@ Improvements
 * LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for 
Nori.
   (Uihyun Kim)
 
+* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos 
(Rushabh Shah)

Review Comment:
   True. Changed it in latest commit. Please review again.






[GitHub] [lucene] dsmiley commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-17 Thread GitBox


dsmiley commented on code in PR #898:
URL: https://github.com/apache/lucene/pull/898#discussion_r875271168


##
lucene/CHANGES.txt:
##
@@ -38,6 +38,8 @@ Improvements
 * LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for 
Nori.
   (Uihyun Kim)
 
+* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos 
(Rushabh Shah)

Review Comment:
   And you've put this in the 10.0 changes but I see no reason not to backport 
to 9.x.  There's a feature-freeze for 9.2 (it's going to be released) so... we 
could just wait a week or two here for the 9.3 section to appear by @romseygeek 
(the RM).






[GitHub] [lucene] shahrs87 commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-17 Thread GitBox


shahrs87 commented on code in PR #898:
URL: https://github.com/apache/lucene/pull/898#discussion_r875273788


##
lucene/CHANGES.txt:
##
@@ -38,6 +38,8 @@ Improvements
 * LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for 
Nori.
   (Uihyun Kim)
 
+* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos 
(Rushabh Shah)

Review Comment:
   I am pretty new to this project. This is my 2nd commit, so I don't know much 
about the release versions. Just so I understand clearly: we will wait a 
couple of weeks, and once 9.2 is released and the 9.3 section is created, I need 
to update CHANGES.txt and then we will merge this PR?






[GitHub] [lucene] msokolov opened a new pull request, #899: Lucene 10577

2022-05-17 Thread GitBox


msokolov opened a new pull request, #899:
URL: https://github.com/apache/lucene/pull/899

   This is SCRATCH - not to be committed. It has numerous problems, but was 
useful for testing, and I share it as a first, broken impl that can be improved. 
Things TBD:
   
   1. work out a better way to figure out scaling (maybe let the customer pass in 
8-bit values, perhaps *as* floats).
   2. do the vector math directly on the mmapped bytes using ByteVector
   3. fix the tests so they can handle quantized data better





[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538494#comment-17538494
 ] 

Michael Sokolov commented on LUCENE-10577:
--

> So if you can do stuff directly with ByteVector that would be fine. Also if 
> you can use "poor man's vector" with varhandles and a 64-bit long to operate 
> on the byte values, thats fine too. But please nothing that only works "one 
> at a time".

+1 -- that is what I have done in my prototype (one-at-a-time conversion from 
byte to float), but it is not what we would ship.

By the way, I tried out the attached prototype on some sample data from work 
plus also on Stanford GloVe 200 data and got reasonable results. For the best 
scale value, recall stays within about 1% of baseline. Latency increased a bit 
in some cases (as much as 25%) but decreased in others?!







[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538499#comment-17538499
 ] 

Robert Muir commented on LUCENE-10577:
--

Well, but comparing latency to the current dog-slow one-at-a-time float :)

The difference is, although the current encoding is slow, it can easily be fast 
in the future, whenever the Vector API is released. We need to keep this option 
open and not be in a situation where our vectors can't be vectorized, 
especially with the push to constantly increase vector sizes into the thousands. 
One-at-a-time is no good...

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest -neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
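To make the scheme concrete, here is a minimal sketch of the scale-and-quantize step described above (class and method names are mine, purely illustrative, not from the patch). It follows the convention used in the experimental code, where the float value is recovered as `byte * scale`, so a sensible scale is roughly `max(abs(min-value), abs(max-value)) / 127`:

```java
/**
 * Illustrative sketch only: linear scalar quantization of float vectors to
 * bytes, recoverable as {@code byte * scale} (the convention the patch uses).
 */
public class QuantizeSketch {

  /** Quantizes to bytes; callers choose scale, e.g. max(abs(min), abs(max)) / 127. */
  static byte[] quantize(float[] vector, float scale) {
    byte[] quantized = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      // round to the nearest byte, clamping outliers beyond the chosen scale
      int q = Math.round(vector[i] / scale);
      quantized[i] = (byte) Math.max(-128, Math.min(127, q));
    }
    return quantized;
  }

  /** Recovers approximate float values, mirroring the expansion at read time. */
  static float[] dequantize(byte[] quantized, float scale) {
    float[] expanded = new float[quantized.length];
    for (int i = 0; i < quantized.length; i++) {
      expanded[i] = quantized[i] * scale;
    }
    return expanded;
  }
}
```

Since dot-product vectors are unit length, every component lies in [-1, 1], which is why a single global scale near 1/127 already yields the very near approximation of recall/performance mentioned above.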



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on a diff in pull request #899: Lucene 10577

2022-05-17 Thread GitBox


rmuir commented on code in PR #899:
URL: https://github.com/apache/lucene/pull/899#discussion_r875319253


##
lucene/core/src/java/org/apache/lucene/codecs/lucene92/ExpandingRandomAccessVectorValues.java:
##
@@ -0,0 +1,57 @@
+package org.apache.lucene.codecs.lucene92;
+
+import org.apache.lucene.index.RandomAccessVectorValues;
+import org.apache.lucene.index.RandomAccessVectorValuesProducer;
+import org.apache.lucene.util.BytesRef;
+
+import java.io.IOException;
+
+public class ExpandingRandomAccessVectorValues implements 
RandomAccessVectorValuesProducer {
+
+  private final RandomAccessVectorValuesProducer delegate;
+  private final float scale;
+
+  /**
+   * Wraps an existing vector values producer. Floating point vector values 
will be produced by scaling
+   * byte-quantized values read from the values produced by the input.
+   */
+  protected ExpandingRandomAccessVectorValues(RandomAccessVectorValuesProducer 
in, float scale) {
+this.delegate = in;
+assert scale != 0;
+this.scale = scale;
+  }
+
+  @Override
+  public RandomAccessVectorValues randomAccess() throws IOException {
+RandomAccessVectorValues delegateValues = delegate.randomAccess();
+float[] value = new float[delegateValues.dimension()];
+
+return new RandomAccessVectorValues() {
+
+  @Override
+  public int size() {
+return delegateValues.size();
+  }
+
+  @Override
+  public int dimension() {
+return delegateValues.dimension();
+  }
+
+  @Override
+  public float[] vectorValue(int targetOrd) throws IOException {
+BytesRef binaryValue = delegateValues.binaryValue(targetOrd);
+byte[] bytes = binaryValue.bytes;
+for (int i = 0, j = binaryValue.offset; i < value.length; i++, j++) {
+  value[i] = bytes[j] * scale;

Review Comment:
   Seems to me that moving dotProduct etc out of `org.apache.lucene.util` could 
help. It could be in the codec.
   
   at a glance, i would modify dotproduct vectors patch and try something like:
   ```
   FloatVector floats = ByteVector.fromArray(bytes).reinterpretAsFloats();
   floats = floats.mul(scale);
   ... remainder of existing algorithm from patch ...
   ```
   
   I have no idea how this would perform off the top of my head, but we can try 
it.






[GitHub] [lucene] rmuir commented on a diff in pull request #899: Lucene 10577

2022-05-17 Thread GitBox


rmuir commented on code in PR #899:
URL: https://github.com/apache/lucene/pull/899#discussion_r875320987


##
lucene/core/src/java/org/apache/lucene/codecs/lucene92/ExpandingRandomAccessVectorValues.java:
##
@@ -0,0 +1,57 @@
+package org.apache.lucene.codecs.lucene92;
+
+import org.apache.lucene.index.RandomAccessVectorValues;
+import org.apache.lucene.index.RandomAccessVectorValuesProducer;
+import org.apache.lucene.util.BytesRef;
+
+import java.io.IOException;
+
+public class ExpandingRandomAccessVectorValues implements 
RandomAccessVectorValuesProducer {
+
+  private final RandomAccessVectorValuesProducer delegate;
+  private final float scale;
+
+  /**
+   * Wraps an existing vector values producer. Floating point vector values 
will be produced by scaling
+   * byte-quantized values read from the values produced by the input.
+   */
+  protected ExpandingRandomAccessVectorValues(RandomAccessVectorValuesProducer 
in, float scale) {
+this.delegate = in;
+assert scale != 0;
+this.scale = scale;
+  }
+
+  @Override
+  public RandomAccessVectorValues randomAccess() throws IOException {
+RandomAccessVectorValues delegateValues = delegate.randomAccess();
+float[] value = new float[delegateValues.dimension()];
+
+return new RandomAccessVectorValues() {
+
+  @Override
+  public int size() {
+return delegateValues.size();
+  }
+
+  @Override
+  public int dimension() {
+return delegateValues.dimension();
+  }
+
+  @Override
+  public float[] vectorValue(int targetOrd) throws IOException {
+BytesRef binaryValue = delegateValues.binaryValue(targetOrd);
+byte[] bytes = binaryValue.bytes;
+for (int i = 0, j = binaryValue.offset; i < value.length; i++, j++) {
+  value[i] = bytes[j] * scale;

Review Comment:
   and i think we don't want reinterpret, but this one: 
https://docs.oracle.com/en/java/javase/16/docs/api/jdk.incubator.vector/jdk/incubator/vector/ByteVector.html#viewAsFloatingLanes()






[GitHub] [lucene] rmuir commented on a diff in pull request #899: Lucene 10577

2022-05-17 Thread GitBox


rmuir commented on code in PR #899:
URL: https://github.com/apache/lucene/pull/899#discussion_r875321513


##
lucene/core/src/java/org/apache/lucene/codecs/lucene92/ExpandingRandomAccessVectorValues.java:
##
@@ -0,0 +1,57 @@
+package org.apache.lucene.codecs.lucene92;
+
+import org.apache.lucene.index.RandomAccessVectorValues;
+import org.apache.lucene.index.RandomAccessVectorValuesProducer;
+import org.apache.lucene.util.BytesRef;
+
+import java.io.IOException;
+
+public class ExpandingRandomAccessVectorValues implements 
RandomAccessVectorValuesProducer {
+
+  private final RandomAccessVectorValuesProducer delegate;
+  private final float scale;
+
+  /**
+   * Wraps an existing vector values producer. Floating point vector values 
will be produced by scaling
+   * byte-quantized values read from the values produced by the input.
+   */
+  protected ExpandingRandomAccessVectorValues(RandomAccessVectorValuesProducer 
in, float scale) {
+this.delegate = in;
+assert scale != 0;
+this.scale = scale;
+  }
+
+  @Override
+  public RandomAccessVectorValues randomAccess() throws IOException {
+RandomAccessVectorValues delegateValues = delegate.randomAccess();
+float[] value = new float[delegateValues.dimension()];
+
+return new RandomAccessVectorValues() {
+
+  @Override
+  public int size() {
+return delegateValues.size();
+  }
+
+  @Override
+  public int dimension() {
+return delegateValues.dimension();
+  }
+
+  @Override
+  public float[] vectorValue(int targetOrd) throws IOException {
+BytesRef binaryValue = delegateValues.binaryValue(targetOrd);
+byte[] bytes = binaryValue.bytes;
+for (int i = 0, j = binaryValue.offset; i < value.length; i++, j++) {
+  value[i] = bytes[j] * scale;

Review Comment:
   the javadoc illustrates the challenge: "This method always throws 
UnsupportedOperationException, because there is no floating point type of the 
same size as byte. The return type of this method is arbitrarily designated as 
Vector. Future versions of this API may change the return type if additional 
floating point types become available."
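Whichever lane-widening conversion ends up working, the scalar baseline helps frame it: because each float is recovered as `byte * scale`, a dot product can be computed entirely in integer arithmetic with the scale applied once at the end. A hedged sketch (my naming, not part of the patch):

```java
/** Illustrative sketch: dot product over quantized bytes with deferred scaling. */
public class ByteDotProduct {

  /**
   * If a[i] ~ qa[i] * scale and b[i] ~ qb[i] * scale, then
   * dot(a, b) ~ dot(qa, qb) * scale * scale.
   */
  static float scaledDotProduct(byte[] qa, byte[] qb, float scale) {
    int dot = 0;
    for (int i = 0; i < qa.length; i++) {
      dot += qa[i] * qb[i]; // bytes widen to int; the integer sum stays exact
    }
    return dot * scale * scale;
  }
}
```

A vectorized version would need a genuine widening conversion of the byte lanes (e.g. `VectorOperators.B2F` via `convertShape`) rather than a reinterpreting view, which is exactly the limitation the javadoc above describes.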






[GitHub] [lucene] jtibshirani commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID

2022-05-17 Thread GitBox


jtibshirani commented on PR #873:
URL: https://github.com/apache/lucene/pull/873#issuecomment-1129427428

   Sorry for jumping in late with some thoughts. Because of the approximate 
nature of HNSW, we are not guaranteed that the graph search will collect all 
documents with the same score. There could always be a document with a lower 
doc ID that the graph search misses, because it decided not to explore that 
part of the graph. So while this PR makes it more likely to return the lowest 
doc IDs, I still don't think we can state a helpful guarantee to the user. This 
makes me wonder if we should even be trying to tiebreak by doc ID during the 
graph search?





[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-17 Thread GitBox


mocobeta commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1129442757

   I'm merging this only to main - let me know if it's worth backporting.





[GitHub] [lucene] mocobeta merged pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-17 Thread GitBox


mocobeta merged PR #893:
URL: https://github.com/apache/lucene/pull/893





[jira] [Commented] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it

2022-05-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538521#comment-17538521
 ] 

ASF subversion and git services commented on LUCENE-10531:
--

Commit b911d1d47c592a51cd3b0c3f59eea6e24455cea3 in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b911d1d47c5 ]

LUCENE-10531: Add @RequiresGUI test group for GUI tests (#893)

Co-authored-by: Dawid Weiss 

> Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI 
> workflow for it
> ---
>
> Key: LUCENE-10531
> URL: https://issues.apache.org/jira/browse/LUCENE-10531
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/test
>Reporter: Tomoko Uchida
>Priority: Minor
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> We are going to allow running the test on Xvfb (a virtual display that speaks 
> X protocol) in [LUCENE-10528], this tweak is available only on Linux.
> I'm just guessing but it could confuse or bother also Mac and Windows users 
> (we can't know what window manager developers are using); it may be better to 
> make it opt-in by marking it as slow tests. 
> Instead, I think we can enable a dedicated Github actions workflow for the 
> distribution test that is triggered only when the related files are changed. 
> Besides Linux, we could run it both on Mac and Windows which most users run 
> the app on - it'd be slow, but if we limit the scope of the test I suppose it 
> works functionally just fine (I'm running actions workflows on mac and 
> windows elsewhere).
> To make it "slow test", we could add the same {{@Slow}} annotation as the 
> {{test-framework}} to the distribution tests, for consistency.






[jira] [Commented] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it

2022-05-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538522#comment-17538522
 ] 

ASF subversion and git services commented on LUCENE-10531:
--

Commit 34446c40c4ab97bff75b2e85cf6e0dfab6b6c37a in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=34446c40c4a ]

LUCENE-10531: small follow-up for b911d1d47


> Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI 
> workflow for it
> ---
>
> Key: LUCENE-10531
> URL: https://issues.apache.org/jira/browse/LUCENE-10531
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/test
>Reporter: Tomoko Uchida
>Priority: Minor
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> We are going to allow running the test on Xvfb (a virtual display that speaks 
> X protocol) in [LUCENE-10528], this tweak is available only on Linux.
> I'm just guessing but it could confuse or bother also Mac and Windows users 
> (we can't know what window manager developers are using); it may be better to 
> make it opt-in by marking it as slow tests. 
> Instead, I think we can enable a dedicated Github actions workflow for the 
> distribution test that is triggered only when the related files are changed. 
> Besides Linux, we could run it both on Mac and Windows which most users run 
> the app on - it'd be slow, but if we limit the scope of the test I suppose it 
> works functionally just fine (I'm running actions workflows on mac and 
> windows elsewhere).
> To make it "slow test", we could add the same {{@Slow}} annotation as the 
> {{test-framework}} to the distribution tests, for consistency.






[jira] [Resolved] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it

2022-05-17 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-10531.

Fix Version/s: 10.0 (main)
   Resolution: Fixed

> Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI 
> workflow for it
> ---
>
> Key: LUCENE-10531
> URL: https://issues.apache.org/jira/browse/LUCENE-10531
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/test
>Reporter: Tomoko Uchida
>Priority: Minor
> Fix For: 10.0 (main)
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> We are going to allow running the test on Xvfb (a virtual display that speaks 
> X protocol) in [LUCENE-10528], this tweak is available only on Linux.
> I'm just guessing but it could confuse or bother also Mac and Windows users 
> (we can't know what window manager developers are using); it may be better to 
> make it opt-in by marking it as slow tests. 
> Instead, I think we can enable a dedicated Github actions workflow for the 
> distribution test that is triggered only when the related files are changed. 
> Besides Linux, we could run it both on Mac and Windows which most users run 
> the app on - it'd be slow, but if we limit the scope of the test I suppose it 
> works functionally just fine (I'm running actions workflows on mac and 
> windows elsewhere).
> To make it "slow test", we could add the same {{@Slow}} annotation as the 
> {{test-framework}} to the distribution tests, for consistency.






[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-05-17 Thread GitBox


shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r875458505


##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/DoubleHyperRectangle.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import java.util.Arrays;
+import org.apache.lucene.util.NumericUtils;
+
+/** Stores a hyper rectangle as an array of DoubleRangePairs */
+public class DoubleHyperRectangle extends HyperRectangle {
+
+  /** Creates DoubleHyperRectangle */
+  public DoubleHyperRectangle(String label, DoubleRangePair... pairs) {
+super(label, convertToLongRangePairArray(pairs));
+  }
+
+  private static LongRangePair[] 
convertToLongRangePairArray(DoubleRangePair... pairs) {

Review Comment:
   nit: I find `Array` redundant, maybe `convertToLongRangePairs`? Or 
`toLongRangePairs`?



##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java:
##
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.search.DocIdSetIterator;
+
+/** Get counts given a list of HyperRectangles (which must be of the same 
type) */
+public class HyperRectangleFacetCounts extends Facets {
+  /** Hyper rectangles passed to constructor. */
+  protected final HyperRectangle[] hyperRectangles;
+
+  /** Counts, initialized in subclass. */
+  protected final int[] counts;
+
+  /** Our field name. */
+  protected final String field;
+
+  /** Number of dimensions for field */
+  protected final int dims;
+
+  /** Total number of hits. */
+  protected int totCount;
+
+  /**
+   * Create HyperRectangleFacetCounts using
+   *
+   * @param field Field name
+   * @param hits Hits to facet on
+   * @param hyperRectangles List of long hyper rectangle facets
+   * @throws IOException If there is a problem reading the field
+   */
+  public HyperRectangleFacetCounts(
+  String field, FacetsCollector hits, LongHyperRectangle... 
hyperRectangles)
+  throws IOException {
+this(true, field, hits, hyperRectangles);
+  }
+
+  /**
+   * Create HyperRectangleFacetCounts using
+   *
+   * @param field Field name
+   * @param hits Hits to facet on
+   * @param hyperRectangles List of double hyper rectangle facets
+   * @throws IOException If there is a problem reading the field
+   */
+  public HyperRectangleFacetCounts(
+  String field, FacetsCollector hits, DoubleHyperRectangle... 
hyperRectangles)
+  throws IOException {
+this(true, field, hits, hyperRectangles);
+  }
+
+  private HyperRectangleFacetCounts(
+  boolean discarded, String field, FacetsCollector hits, HyperRectangle... 
hyperRectangles)
+  throws IOException {
+assert hyperRectangles.length > 0 : "Hyper rectangle ranges cannot be 
empty";
+assert isHyperRectangleDimsConsistent(hyperRectangles)
+: "All hyper rectangles must be the same dimensionality";
+this

[GitHub] [lucene] dweiss commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-17 Thread GitBox


dweiss commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1129612104

   I'd apply this to 9x as well since it'll ease backports of other things/ 
decrease the potential of a conflict in the future?





[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-17 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538598#comment-17538598
 ] 

Uwe Schindler commented on LUCENE-10572:


bq. The stopwords are going to skew everything. If someone is removing them, 
the distribution of tokens will look much different.

If wikipedia has so many stopwords, this would explain what Mike is seeing. 
Every stop word produces a hash that's already known. So the Arrays.equals() 
code runs on each stopword every time it is seen over and over.

Maybe let's just change the analyzer that Mike uses to remove those stopwords? 
Or are there many stopwords we do not know about?

Nevertheless, this is a valid use case: Text without stopwords and text with 
stopwords (especially because we recommend to users not to remove stopwords 
anymore).

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Attachments: Screen Shot 2022-05-16 at 10.28.22 AM.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!
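The first bullet's two length encodings can be sketched side by side (a simplified illustration with made-up helpers, not Lucene's actual code):

```java
/** Illustrative sketch: vInt vs. fixed two-byte length prefixes for short terms. */
public class LengthPrefix {

  /** Classic vInt: 7 payload bits per byte, high bit set on continuation bytes. */
  static int writeVInt(byte[] buf, int pos, int value) {
    while ((value & ~0x7F) != 0) {
      buf[pos++] = (byte) ((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    buf[pos++] = (byte) value;
    return pos;
  }

  /** Fixed two-byte little-endian length; always valid since term length < 64K. */
  static int writeFixedShort(byte[] buf, int pos, int value) {
    buf[pos] = (byte) value;
    buf[pos + 1] = (byte) (value >>> 8);
    return pos + 2;
  }
}
```

The fixed-width write has no data-dependent branching or bit twiddling, which is its appeal in a hotspot like this; the cost is one extra byte for terms shorter than 128 bytes.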






[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-17 Thread GitBox


mocobeta commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1129641582

   Ok I'll backport it to the 9x branch.

