[GitHub] [lucene] uschindler commented on pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low
uschindler commented on PR #895: URL: https://github.com/apache/lucene/pull/895#issuecomment-1128502847 What I wanted to add: merging is mostly an I/O-bound operation. More cores would not necessarily make it faster (your SSD has a limited amount of parallelism). It may do better with different indexes placed on different SSDs, but those would have separate merge schedulers anyway. If you look a few lines up in the code: if it's a hard disk and spins, the maximum number of threads is 1. P.S.: If we really want to change this, the documentation (javadocs) needs updating, too.
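To make the heuristic concrete, here is a minimal sketch of the dynamic default described above (an illustration of the described behavior, not the exact ConcurrentMergeScheduler source): spinning disks get a single merge thread, while SSDs scale with core count up to the cap of 4.

```java
// Minimal sketch of the dynamic-default heuristic (illustrative, not Lucene's exact code).
static int defaultMaxMergeThreads(int coreCount, boolean diskSpins) {
  if (diskSpins) {
    return 1; // spinning media: concurrent merges would mostly cause seek thrashing
  }
  return Math.max(1, Math.min(4, coreCount / 2)); // SSD: scale with cores, capped at 4
}
```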
[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
mocobeta commented on PR #893: URL: https://github.com/apache/lucene/pull/893#issuecomment-1128594358 The previous CI run result looks good to me. I also enabled GUI tests in the smoke tester.
[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
mocobeta commented on PR #893: URL: https://github.com/apache/lucene/pull/893#issuecomment-1128599401 How about renaming the action's directory name `.github/actions/yarn-caches/` to `.github/actions/gradle-caches/`? @dweiss
[GitHub] [lucene] dweiss commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
dweiss commented on PR #893: URL: https://github.com/apache/lucene/pull/893#issuecomment-1128601383 up to you, entirely. :)
[GitHub] [lucene] dweiss commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
dweiss commented on PR #893: URL: https://github.com/apache/lucene/pull/893#issuecomment-1128601991 The yarn-caches name is wrong - it's something else I was working on (!).
[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
mocobeta commented on PR #893: URL: https://github.com/apache/lucene/pull/893#issuecomment-1128605453 I'll update the directory name later. This time, the test timed out on Windows... I think this could occasionally happen :/
[GitHub] [lucene] dweiss commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
dweiss commented on PR #893: URL: https://github.com/apache/lucene/pull/893#issuecomment-1128610332 Increase the timeout, maybe? Windows boxes on GitHub are slow.
[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
mocobeta commented on PR #893: URL: https://github.com/apache/lucene/pull/893#issuecomment-1128612813 I increased the timeout to 120 seconds.
[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
mocobeta commented on PR #893: URL: https://github.com/apache/lucene/pull/893#issuecomment-1128664466 sorry couldn't resist. "Build Duke"
[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
mocobeta commented on PR #893: URL: https://github.com/apache/lucene/pull/893#issuecomment-1128676036 Passed all checks, and I think we've done everything I wanted to do here: we disabled the GUI test in the mandatory test runs and instead enabled it on all CI runs (Jenkins, GH Actions) and in the smoke tester.
[jira] [Resolved] (LUCENE-10575) Broken links in some javadocs
[ https://issues.apache.org/jira/browse/LUCENE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Woodward resolved LUCENE-10575.
Fix Version/s: 9.2
Resolution: Fixed

> Broken links in some javadocs
> Key: LUCENE-10575
> URL: https://issues.apache.org/jira/browse/LUCENE-10575
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Alan Woodward
> Priority: Major
> Fix For: 9.2
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> The release wizard for 9.2 has found some broken javadoc links:
> * ExternalRefSorter refers to package-private implementations when it should probably refer to the relevant interfaces instead
> * STMergingTermsEnum refers to package-private classes. I think we can solve this by making the whole class package-private, given that it's an implementation detail within a Codec?
> * MatchRegionRetriever links to an internal implementation, which should just be described rather than linked.
> These are all fairly simple to fix, and I will open a PR to do so. Slightly more worrying is that running `./gradlew lucene:documentation:checkBrokenLinks` does not seem to consistently find these problems. The release wizard runs against an entirely clean checkout and fails, but attempting to reproduce the failure on an existing checkout produces a green build. Some of these broken links have been around for a while - the STMergingTermsEnum ones since 2019 - so it may just be luck that I found them this time round.
[GitHub] [lucene] jpountz commented on pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low
jpountz commented on PR #895: URL: https://github.com/apache/lucene/pull/895#issuecomment-1128736566 The current calculation makes sense to me. Merge policies like to organize segments into tiers, where the number of segments on each tier is typically also the number of segments that can be merged together, so it doesn't make much sense to perform multiple merges on the same tier concurrently. The way I read the current formula, we scale the number of merge threads with the number of processors but stop at 4 anyway, because 4 threads already allow Lucene to perform merges on 4 different tiers concurrently. That is already a lot, given that tiers have exponential sizes and that TieredMergePolicy has a max merged segment size of 5GB.
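As a rough illustration of why four concurrent merge threads go a long way (my arithmetic, assuming TieredMergePolicy's defaults of a ~2MB floor, 10 segments per tier, and a 5GB max merged segment size), the tiers look roughly like:

```
tier 0:   ~2MB segments (floor)
tier 1:  ~20MB segments
tier 2: ~200MB segments
tier 3:   ~2GB segments
tier 4: capped at 5GB (max merged segment size)
```

Under those assumptions an index only has about five tiers in total, so 4 threads can service nearly all of them concurrently.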
[GitHub] [lucene] jpountz closed pull request #892: LUCENE-10573: Improve stored fields bulk merge for degenerate O(n^2) merges.
jpountz closed pull request #892: LUCENE-10573: Improve stored fields bulk merge for degenerate O(n^2) merges. URL: https://github.com/apache/lucene/pull/892
[jira] [Resolved] (LUCENE-10573) Improve stored fields bulk merge for degenerate O(n^2) merges
[ https://issues.apache.org/jira/browse/LUCENE-10573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand resolved LUCENE-10573.
Resolution: Won't Fix

> Improve stored fields bulk merge for degenerate O(n^2) merges
> Key: LUCENE-10573
> URL: https://issues.apache.org/jira/browse/LUCENE-10573
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Spin-off from LUCENE-10556.
> For small merges that are below the floor segment size, TieredMergePolicy may merge segments that have vastly different sizes, e.g. one 10k-docs segment with 9 100-docs segments.
> While we might be able to improve TieredMergePolicy (LUCENE-10569), there are also improvements we could make to stored fields, such as bulk-copying chunks of the first segment until the first dirty chunk. In this scenario where segments keep being rewritten, this would help significantly.
[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?
[ https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538131#comment-17538131 ]
Adrien Grand commented on LUCENE-10572:
If this is memory-bound, I wonder if we could get benefits e.g. by splitting the hash table into a hash table for short terms and another one for long terms. Since most frequent terms are usually short, maybe this would help reduce the number of cache misses and in turn improve indexing speed. And then, if it helps indexing be less memory-bound, maybe changes like Uwe's would start making a difference.

> Can we optimize BytesRefHash?
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
> Attachments: Screen Shot 2022-05-16 at 10.28.22 AM.png
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR profiling that the hottest method is this:
> {noformat}
> PERCENT  CPU SAMPLES  STACK
> 9.28%    53848        org.apache.lucene.util.BytesRefHash#equals()
>                       at org.apache.lucene.util.BytesRefHash#findHash()
>                       at org.apache.lucene.util.BytesRefHash#add()
>                       at org.apache.lucene.index.TermsHashPerField#add()
>                       at org.apache.lucene.index.IndexingChain$PerField#invert()
>                       at org.apache.lucene.index.IndexingChain#processField()
>                       at org.apache.lucene.index.IndexingChain#processDocument()
>                       at org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments()
> {noformat}
> This is kinda crazy – comparing whether the term to be inserted into the inverted index hash equals the term already added to {{BytesRefHash}} is the hottest method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz], they noticed a few questionable things about our current implementation:
> * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the inserted term into the hash? Let's just use two bytes always, since IW limits term length to 32 K (< 64K that an unsigned short can cover)
> * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} (BitUtil.VH_BE_SHORT.get)?
> * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not aggressive enough? Or the initial sizing of the hash is too small?
> * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too many {{equals}} calls as a result)? {{Fnv}} and {{xxhash}} are possible "upgrades"
> * If we stick with {{MurmurHash}}, why are we using the 32 bit version ({{murmurhash3_x86_32}})?
> * Are we using the JVM's intrinsics to compare multiple bytes in a single SIMD instruction ([~rcmuir] is quite sure we are indeed)?
> * [~jpountz] suggested maybe the hash insert is simply memory bound
> * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!
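To see why {{equals()}} dominates, here is a minimal open-addressing sketch of the add path (a simplification with made-up field and helper names, not the actual {{BytesRefHash}} code): every add of a term that is already in the table pays for a hash computation plus a byte-wise comparison against the byte pool.

{code:java}
// Simplified sketch of the hot path; field and helper names are hypothetical.
int[] ids;    // slot -> term id, -1 if empty
int hashSize; // power of two
int count;    // number of unique terms so far

int add(BytesRef term) {
  int slot = murmurhash3_x86_32(term) & (hashSize - 1);
  while (ids[slot] != -1) {               // slot occupied: possible collision
    if (poolEquals(ids[slot], term)) {    // byte-wise compare -> the hot equals()
      return -ids[slot] - 1;              // term already present
    }
    slot = (slot + 1) & (hashSize - 1);   // probe the next slot
  }
  ids[slot] = count;                      // empty slot: new term
  writeLengthPrefixedBytes(term);         // 1-2 byte vInt length, then term bytes
  return count++;
}
{code}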
[jira] [Commented] (LUCENE-10392) Handle soft deletes via LiveDocsFormat
[ https://issues.apache.org/jira/browse/LUCENE-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538134#comment-17538134 ]
Adrien Grand commented on LUCENE-10392:
[~shahrs87] I set the priority to minor, but in my opinion this is a pretty hard task, so I'm not sure it's a good fit for a 2nd issue unless you're already very familiar with how Lucene handles file formats.

> Handle soft deletes via LiveDocsFormat
> Key: LUCENE-10392
> URL: https://issues.apache.org/jira/browse/LUCENE-10392
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Adrien Grand
> Priority: Minor
>
> We have been using doc values to handle soft deletes until now, but this is a bit of a hack as it:
> - forces users to reserve a field name for doc values
> - generally doesn't read directly from doc values; instead doc values help populate bitsets and then reads are performed via these bitsets
> It would also be more natural to have both hard and soft deletes handled by the same file format?
[GitHub] [lucene] jpountz commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID
jpountz commented on PR #873: URL: https://github.com/apache/lucene/pull/873#issuecomment-1128797150 When the order is reversed, your change negates the `node` twice so that we keep tie-breaking by increasing node IDs in all cases. With this fix, I wonder if we could simplify the encoding logic by only making the `score` affected by the order, not the `node`? I'm thinking of something like this (which may be incorrect, I haven't tested it):

```java
float multiplicator = reversed ? -1f : 1f; // could be precomputed
int sortableScore = NumericUtils.floatToSortableInt(multiplicator * score);
long encoded = ((long) sortableScore << 32) | (Integer.MAX_VALUE - node);
```
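For what it's worth, that encoding would round-trip like this (equally untested sketch, using the same variables as above):

```java
// Recover node and score from the packed long above.
int node = Integer.MAX_VALUE - (int) (encoded & 0xFFFFFFFFL);
float score = multiplicator * NumericUtils.sortableIntToFloat((int) (encoded >>> 32));
```

And if I read it right, sorting the encoded longs in descending order breaks score ties by smaller node ID, which matches the tie-break by increasing node IDs.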
[jira] [Commented] (LUCENE-10266) Move nearest-neighbor search on points to core?
[ https://issues.apache.org/jira/browse/LUCENE-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538150#comment-17538150 ]
Adrien Grand commented on LUCENE-10266:
Let's add a method to `LatLonPoint` with the following signature and remove similar logic from sandbox?
{code}
public static TopFieldDocs nearest(String field, double latitude, double longitude, IndexReader reader, int n);
{code}

> Move nearest-neighbor search on points to core?
> Key: LUCENE-10266
> URL: https://issues.apache.org/jira/browse/LUCENE-10266
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Adrien Grand
> Priority: Minor
>
> Now that the Points' public API supports running nearest-neighbor search, should we move it to core via helper methods on {{LatLonPoint}} and {{XYPoint}}?
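Hypothetical usage of the proposed method, for illustration (only the signature comes from the comment above; the field name and coordinates are made up):

{code:java}
// Ten documents closest to a point, assuming a LatLonPoint field named "location".
TopFieldDocs hits = LatLonPoint.nearest("location", 40.7128, -74.0060, reader, 10);
{code}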
[GitHub] [lucene] jpountz merged pull request #876: LUCENE-9356: Change test to detect mismatched checksums instead of byte flips.
jpountz merged PR #876: URL: https://github.com/apache/lucene/pull/876
[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips
[ https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538153#comment-17538153 ]
ASF subversion and git services commented on LUCENE-9356:
Commit e65c0c777b61a964483d1f9ed645d91973a1540e in lucene's branch refs/heads/main from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e65c0c777b6 ]
LUCENE-9356: Change test to detect mismatched checksums instead of byte flips. (#876)
This makes the test more robust and gives a good sense of whether file formats are implementing `checkIntegrity` correctly.

> Add tests for corruptions caused by byte flips
> Key: LUCENE-9356
> URL: https://issues.apache.org/jira/browse/LUCENE-9356
> Project: Lucene - Core
> Issue Type: Test
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> We already have tests that file truncation and modification of the index headers are caught correctly. I'd like to add another test that flipping a byte in a way that modifies the checksum of the file is always caught gracefully by Lucene.
[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips
[ https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538158#comment-17538158 ]
ASF subversion and git services commented on LUCENE-9356:
Commit f69dc58befea40f1cd802d8b0502748cc7daad96 in lucene's branch refs/heads/branch_9x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f69dc58befe ]
LUCENE-9356: Change test to detect mismatched checksums instead of byte flips. (#876)
This makes the test more robust and gives a good sense of whether file formats are implementing `checkIntegrity` correctly.
[jira] [Resolved] (LUCENE-9356) Add tests for corruptions caused by byte flips
[ https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand resolved LUCENE-9356.
Fix Version/s: 9.2
Resolution: Fixed

I pushed to the 9.2 branch since it included some fixes for vector file formats.
[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips
[ https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538160#comment-17538160 ]
ASF subversion and git services commented on LUCENE-9356:
Commit 978eef5459c7683038ddcca4ec56e4baa63715d0 in lucene's branch refs/heads/branch_9_2 from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=978eef5459c ]
LUCENE-9356: Change test to detect mismatched checksums instead of byte flips. (#876)
This makes the test more robust and gives a good sense of whether file formats are implementing `checkIntegrity` correctly.
[jira] [Updated] (LUCENE-9356) Add tests for mismatched checksums
[ https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-9356:
Summary: Add tests for mismatched checksums (was: Add tests for corruptions caused by byte flips)
[GitHub] [lucene] risdenk commented on pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low
risdenk commented on PR #895: URL: https://github.com/apache/lucene/pull/895#issuecomment-1128827348 Fair enough - appreciate all the comments and additional context I couldn't find in the linked jiras.
[GitHub] [lucene] risdenk closed pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low
risdenk closed pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low URL: https://github.com/apache/lucene/pull/895
[jira] [Updated] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low
[ https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kevin Risden updated LUCENE-10576:
Resolution: Won't Fix
Status: Resolved (was: Patch Available)

> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Kevin Risden
> Assignee: Kevin Risden
> Priority: Minor
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> [https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177]
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This doesn't take into account higher coreCount.
> I can't seem to tell if this is by design or this is just a mix up of logic during the calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.
> Based on the history I could find, this has been this way forever.
> * LUCENE-6437
> * LUCENE-6119
> * LUCENE-5951
> ** Introduced as "maxThreadCount = Math.max(1, Math.min(3, Runtime.getRuntime().availableProcessors()/2));"
> ** https://github.com/apache/lucene/commit/33410e30c1af7105a6b8b922255af047d13be626#diff-ceb8ec6fe5807682cfb691a8ec52bcc672fb7c5eeb6922c80da4c075f7f003c8R147
[jira] [Commented] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low
[ https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538164#comment-17538164 ]
Kevin Risden commented on LUCENE-10576:
This is marked as won't fix since some reasonable items were brought up on the PR - https://github.com/apache/lucene/pull/895
[GitHub] [lucene] jpountz opened a new pull request, #896: LUCENE-9409: Reenable TestAllFilesDetectTruncation.
jpountz opened a new pull request, #896: URL: https://github.com/apache/lucene/pull/896
- Removed dependency on LineFileDocs to improve reproducibility.
- Relaxed the expected exception type: any exception is ok.
- Ignore rare cases when a file still appears to have a well-formed footer after truncation.
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538183#comment-17538183 ]
Adrien Grand commented on LUCENE-10574:
I was assuming we wanted strong guarantees about the number of segments in the index at search time, but it's a fair point that degrading to O(n^2) merging to meet this guarantee is not a good trade-off. I tried to think of ways we could do this. One obvious option is to remove {{floorSegmentBytes}}, but this might be a bit too extreme, as it would allow any index to have a long tail of small segments? One idea I started playing with consists of ensuring that every merge grows the largest input segment by at least some fraction, e.g. 50%. It tries to strike a balance between avoiding pathological merging and still keeping the number of segments contained at search time. I quickly hacked this into TieredMergePolicy and it made the StoredFieldsBenchmark more than 2x faster. I wonder if there are other approaches we should consider.

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Priority: Major
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed to this crazy O(n^2) behavior.
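A minimal sketch of the growth constraint floated above (my reading of the idea, not actual TieredMergePolicy code): a candidate merge is only allowed if its combined size grows its largest input by at least the configured fraction.

{code:java}
// Hypothetical check: does this candidate grow its largest input by >= minGrowth?
static boolean growsLargestInputEnough(long[] segmentBytes, double minGrowth) {
  long largest = 0, total = 0;
  for (long bytes : segmentBytes) {
    largest = Math.max(largest, bytes);
    total += bytes;
  }
  // With minGrowth = 0.5 this rejects the pathological case from LUCENE-10573 of
  // merging one 10k-doc segment with nine 100-doc segments over and over.
  return total >= (long) (largest * (1 + minGrowth));
}
{code}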
[GitHub] [lucene] mayya-sharipova commented on pull request #876: LUCENE-9356: Change test to detect mismatched checksums instead of byte flips.
mayya-sharipova commented on PR #876: URL: https://github.com/apache/lucene/pull/876#issuecomment-1128871222 Thanks Adrien for catching errors with vector files.
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538194#comment-17538194 ]
Robert Muir commented on LUCENE-10574:
I think another approach is to actually remove the {{O(n^2)}}, remove {{floorSegmentBytes}}, and let it kick into all the benchmarks. Now that the bad algorithm is gone, follow up by looking at alternative, safe methods to keep the number of segments "contained" that don't cause pathological performance issues. It seems we all just accept this {{O(n^2)}} as a necessity, but I really don't know why: I'm not sold on it at all.
[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?
[ https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538198#comment-17538198 ]
Uwe Schindler commented on LUCENE-10572:
If we have 2 hash tables, we could have one for short terms up to 255 bytes (for sure we could also make the limit smaller, but 255 is the limit to get the 1-byte length encoding), and all longer ones in a separate hash (where the comparisons are also more expensive). I am not sure if the additional complexity is worth it.
About changing the hash algorithm: we could add a counter to the hash table to actually measure how many collisions we have while indexing wikipedia. But actually, when inserting a term already in the hash table, we get a hash collision and have to confirm with Arrays.equals() that the term is already there. I tend to think that the smaller terms are more often duplicates than larger ones, so having them in a separate table may be a good idea. Maybe we should gather some statistics during wikipedia indexing:
- how many hash collisions do we have where the term is actually not already in the table? => this ratio should be low. We can compare hash algorithms on that.
- how many hash collisions do we get because the term is already in the table? => this is the most expensive case memory-wise, because hash AND equals have to be calculated.
- how many inserts of new terms without a collision do we get?
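Gathering those three statistics would only take counters along the add path; a sketch of where they would increment (hypothetical fields, keyed to the add() sketch earlier in this issue, not anything in BytesRefHash today):

{code:java}
// Hypothetical counters for the three cases above.
long falseCollisions; // ++ in the probe loop when the byte-wise compare says "different term"
long duplicateHits;   // ++ when the compare says "already present" (hash AND equals both paid)
long freshInserts;    // ++ when an empty slot is taken without any equals() call
{code}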
[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?
[ https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538211#comment-17538211 ]
Robert Muir commented on LUCENE-10572:
These measurements are also going to be strange because of how that wikipedia indexing works. The stopwords are going to skew everything. If someone is removing them, the distribution of tokens will look much different.
[GitHub] [lucene] jpountz commented on pull request #896: LUCENE-9409: Reenable TestAllFilesDetectTruncation.
jpountz commented on PR #896: URL: https://github.com/apache/lucene/pull/896#issuecomment-1128913704 The test would still pass without the new checks (another check would fail later), but I thought it was more consistent to call `checkFooter` for every `IndexInput` we open across all file formats.
[jira] [Commented] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring
[ https://issues.apache.org/jira/browse/LUCENE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538284#comment-17538284 ]
Mike Drob commented on LUCENE-10236:
[~zacharymorn] is this still relevant for 8.11? https://github.com/apache/lucene-solr/pull/2637

> CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring
> Key: LUCENE-10236
> URL: https://issues.apache.org/jira/browse/LUCENE-10236
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/sandbox
> Reporter: Zach Chen
> Assignee: Zach Chen
> Priority: Minor
> Fix For: 9.1
> Time Spent: 6h 50m
> Remaining Estimate: 0h
>
> This is a spin-off issue from discussion in [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a quick fix in CombinedFieldsQuery scoring.
> Currently CombinedFieldsQuery would use a constructed [fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421] object to create a MultiNormsLeafSimScorer for scoring, but the fields object may contain duplicated field-weight pairs, as it is [built from looping over fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414], resulting in duplicated norms being added during scoring calculation in MultiNormsLeafSimScorer.
> E.g. for CombinedFieldsQuery with two fields and two values matching a particular doc:
> {code:java}
> CombinedFieldQuery query =
>     new CombinedFieldQuery.Builder()
>         .addField("field1", (float) 1.0)
>         .addField("field2", (float) 1.0)
>         .addTerm(new BytesRef("foo"))
>         .addTerm(new BytesRef("zoo"))
>         .build();
> {code}
> I would imagine the scoring to be based on the following:
> # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
> # Sum of norms on doc = norm(field1) + norm(field2)
> but the current logic would use the following for scoring:
> # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
> # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + norm(field2)
> In addition, this differs from how MultiNormsLeafSimScorer is constructed from CombinedFieldsQuery's explain function, which [uses fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389] and does not contain duplicated field-weight pairs.
[GitHub] [lucene-solr] madrob commented on pull request #2649: Remove '-' between base.version and version.suffix and change common-build to allow the new format
madrob commented on PR #2649: URL: https://github.com/apache/lucene-solr/pull/2649#issuecomment-1129013348 @anshumg does 8.11.2 need this, or should we close this PR?
[jira] [Commented] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low
[ https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538321#comment-17538321 ]
Chris M. Hostetter commented on LUCENE-10576:
Should those "reasonable items" be added as comments to the code so they aren't lost to time?
[GitHub] [lucene] shahrs87 opened a new pull request, #897: LUCENE-10266 Move nearest-neighbor search on points to core
shahrs87 opened a new pull request, #897: URL: https://github.com/apache/lucene/pull/897

# Description
Please provide a short description of the changes you're making with this pull request.

# Solution
Please provide a short description of the approach taken to implement your solution.

# Tests
Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem.

# Checklist
Please review the following and check all that apply:
- [ ] I have reviewed the guidelines for [How to Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my code conforms to the standards described there to the best of my ability.
- [ ] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [ ] I have developed this patch against the `main` branch.
- [ ] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
[GitHub] [lucene] shahrs87 commented on pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core
shahrs87 commented on PR #897: URL: https://github.com/apache/lucene/pull/897#issuecomment-1129107476 @jpountz I have created this PR as per your suggestion in the LUCENE-10266 jira. I have made the following assumptions. Please correct me if needed.
1. I have deleted the LatLonPointPrototypeQueries class since there is no other sandbox query in that class. Should I keep the empty class?
2. I see we have a FloatPointNearestNeighbor implementation in sandbox which is similar to NearestNeighbor. Do I need to remove FloatPointNearestNeighbor from sandbox and move it to lucene/core?
3. I have added this change to the API Changes section in CHANGES.txt. Please correct if it belongs somewhere else.
Thank you.
[jira] [Commented] (LUCENE-10392) Handle soft deletes via LiveDocsFormat
[ https://issues.apache.org/jira/browse/LUCENE-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538347#comment-17538347 ]
Rushabh Shah commented on LUCENE-10392:
> unless you're already very familiar with how Lucene handles file formats.
[~jpountz] Thank you for the reply. I am not at all familiar with the file formats. Can you suggest some blog/article or some class names where I can learn more about the different file formats?
[GitHub] [lucene-solr] cpoerschke opened a new pull request, #2656: LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms to rewrite sufficiently
cpoerschke opened a new pull request, #2656: URL: https://github.com/apache/lucene-solr/pull/2656 backport of https://github.com/apache/lucene/pull/737 and https://github.com/apache/lucene/pull/758 for https://issues.apache.org/jira/browse/LUCENE-10477 and https://issues.apache.org/jira/browse/LUCENE-10464
[jira] [Reopened] (LUCENE-10477) SpanBoostQuery.rewrite was incomplete for boost==1 factor
[ https://issues.apache.org/jira/browse/LUCENE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christine Poerschke reopened LUCENE-10477: -- re-opening for potential backport: https://github.com/apache/lucene-solr/pull/2656 > SpanBoostQuery.rewrite was incomplete for boost==1 factor > - > > Key: LUCENE-10477 > URL: https://issues.apache.org/jira/browse/LUCENE-10477 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.11.1 >Reporter: Christine Poerschke >Assignee: Christine Poerschke >Priority: Minor > Fix For: 10.0 (main), 9.2 > > Time Spent: 50m > Remaining Estimate: 0h > > _(This bug report concerns pre-9.0 code only but it's so subtle that it > warrants sharing I think and maybe fixing if there was to be a 8.11.2 release > in future.)_ > Some existing code e.g. > [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/queryparser/src/java/org/apache/lucene/queryparser/xml/builders/SpanNearBuilder.java#L54] > adds a {{SpanBoostQuery}} even if there is no boost or the boost factor is > {{1.0}} i.e. technically wrapping is unnecessary. > Query rewriting should counteract this somewhat except it might not e.g. note > at > [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanBoostQuery.java#L81-L83] > how the rewrite is a no-op i.e. {{this.query.rewrite}} is not called! > This can then manifest in strange ways e.g. during highlighting: > {code:java} > ... > java.lang.IllegalArgumentException: Rewrite first! > at > org.apache.lucene.search.spans.SpanMultiTermQueryWrapper.createWeight(SpanMultiTermQueryWrapper.java:99) > at > org.apache.lucene.search.spans.SpanNearQuery.createWeight(SpanNearQuery.java:183) > at > org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:295) > ... > {code} > This stacktrace is not from 8.11.1 code but the general logic is that at line > 293 rewrite was called (except it didn't do a full rewrite because of > {{SpanBoostQuery}} wrapping around the {{SpanNearQuery}}) and so then at > line 295 the {{IllegalArgumentException("Rewrite first!")}} arises: > [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanMultiTermQueryWrapper.java#L101]
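The underlying fix pattern is to rewrite to a fixed point rather than trust a single rewrite() call, so a wrapper whose rewrite is a no-op (like SpanBoostQuery with boost == 1) cannot hide an unrewritten inner query. A minimal sketch against the pre-9.0 Query.rewrite(IndexReader) signature; the helper class and method names are illustrative, not the literal patch:

{code:java}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;

final class RewriteUtil {
  // Keep rewriting until the query stops changing; rewrite() returns the
  // same instance once no further rewriting is needed, so identity
  // comparison is the termination condition.
  static Query rewriteFully(Query query, IndexReader reader) throws IOException {
    Query rewritten = query;
    for (Query previous = null; rewritten != previous; ) {
      previous = rewritten;
      rewritten = rewritten.rewrite(reader);
    }
    return rewritten; // now safe to pass to createWeight: no "Rewrite first!"
  }
}
{code}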
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538389#comment-17538389 ] Michael Sokolov commented on LUCENE-10574: -- I'm not sure if I understand, but are we seeing O(N^2) because tiny segments get merged into small segments, which get merged into smallish segments, and so on, and because the original segments were so tiny we end up merging the same document(s) many times? > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior.
[jira] [Commented] (LUCENE-9625) Benchmark KNN search with ann-benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538395#comment-17538395 ] Michael Sokolov commented on LUCENE-9625: - There's no support for using an existing index; creating the index is an important part of the benchmark, I think? As for threading, no, it would be necessary to modify the test harness. But maybe you should consider contributing to ann-benchmarks? > Benchmark KNN search with ann-benchmarks > > > Key: LUCENE-9625 > URL: https://issues.apache.org/jira/browse/LUCENE-9625 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In addition to benchmarking with luceneutil, it would be good to be able to > make use of ann-benchmarks, which is publishing results from many approximate > knn algorithms, including the hnsw implementation from its authors. We don't > expect to challenge the performance of these native code libraries, however > it would be good to know just how far off we are. > I started looking into this and posted a fork of ann-benchmarks that uses > KnnGraphTester class to run these: > https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to > manually copy jars and the KnnGraphTester.class to the test host machine > rather than downloading from a distribution. KnnGraphTester needs some > modifications in order to support this process - this issue is mostly about > that. > One thing I noticed is that some of the index builds with higher fanout > (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is > concerning and I'll open a separate issue for trying to improve that.
[GitHub] [lucene] shahrs87 opened a new pull request, #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
shahrs87 opened a new pull request, #898: URL: https://github.com/apache/lucene/pull/898
[GitHub] [lucene] shahrs87 commented on pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
shahrs87 commented on PR #898: URL: https://github.com/apache/lucene/pull/898#issuecomment-1129196482 Hi @dsmiley, can you please help me review this patch? I have tried to implement this using your suggestion in the jira. Thank you.
[GitHub] [lucene-solr] thelabdude opened a new pull request, #2657: SOLR-16199: Fix query syntax for LIKE queries with wildcard
thelabdude opened a new pull request, #2657: URL: https://github.com/apache/lucene-solr/pull/2657 Backport of https://github.com/apache/solr/pull/865
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538424#comment-17538424 ] Greg Miller commented on LUCENE-10544: -- +1 to pursuing this delegating bulk scorer suggestion. I really like that idea [~jpountz]. Seems like a simple, easy to understand approach that still allows queries to provide their own custom bulk scoring logic as necessary. > Should ExitableTermsEnum wrap postings and impacts? > --- > > Key: LUCENE-10544 > URL: https://issues.apache.org/jira/browse/LUCENE-10544 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Reporter: Greg Miller >Priority: Major > > While looking into options for LUCENE-10151, I noticed that > {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you > start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} > wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do > anything to wrap postings or impacts. So timeouts will be enforced when > moving to the "next" term, but not when iterating the postings/impacts > associated with a term. > I think we ought to wrap the postings/impacts as well with some form of > timeout checking so timeouts can be enforced on long-running queries. I'm not > sure why this wasn't done originally (back in 2014), but it was questioned > back in 2020 on the original Jira SOLR-5986. Does anyone know of a good > reason why we shouldn't enforce timeouts in this way? > Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} > given that only {{next}} is being wrapped currently.
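For context on the idea being endorsed here: a delegating bulk scorer would wrap the query's own BulkScorer, score a window of doc IDs at a time, and check the timeout between windows. A sketch against the public BulkScorer and QueryTimeout APIs; the window size and the placeholder exception are assumptions, not the eventual patch:

{code:java}
import java.io.IOException;
import org.apache.lucene.index.QueryTimeout;
import org.apache.lucene.search.BulkScorer;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.util.Bits;

// Delegates to another BulkScorer, scoring one window of doc IDs at a time
// and checking the timeout between windows. Illustrative sketch only.
final class TimeLimitingBulkScorer extends BulkScorer {
  private static final int WINDOW = 1024; // docs scored between timeout checks

  private final BulkScorer in;
  private final QueryTimeout timeout;

  TimeLimitingBulkScorer(BulkScorer in, QueryTimeout timeout) {
    this.in = in;
    this.timeout = timeout;
  }

  @Override
  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
    while (min < max) {
      if (timeout.shouldExit()) {
        // real code would use a dedicated exception type
        throw new RuntimeException("Query timed out");
      }
      int windowMax = (int) Math.min((long) min + WINDOW, max);
      min = in.score(collector, acceptDocs, min, windowMax);
    }
    return min;
  }

  @Override
  public long cost() {
    return in.cost();
  }
}
{code}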
[jira] [Created] (LUCENE-10577) Quantize vector values
Michael Sokolov created LUCENE-10577: Summary: Quantize vector values Key: LUCENE-10577 URL: https://issues.apache.org/jira/browse/LUCENE-10577 Project: Lucene - Core Issue Type: Improvement Components: core/codecs Reporter: Michael Sokolov The {{KnnVectorField}} api handles vectors with 4-byte floating point values. These fields can be used (via {{KnnVectorsReader}}) in two main ways: 1. The {{VectorValues}} iterator enables retrieving values 2. Approximate nearest-neighbor search The main point of this addition was to provide the search capability, and to support that it is not really necessary to store vectors in full precision. Perhaps users may also be willing to retrieve values in lower precision for whatever purpose those serve, if they are able to store more samples. We know that 8 bits is enough to provide a very near approximation to the same recall/performance tradeoff that is achieved with the full-precision vectors. I'd like to explore how we could enable 4:1 compression of these fields by reducing their precision. A few ways I can imagine this would be done: 1. Provide a parallel byte-oriented API. This would allow users to provide their data in reduced-precision format and give control over the quantization to them. It would have a major impact on the Lucene API surface though, essentially requiring us to duplicate all of the vector APIs. 2. Automatically quantize the stored vector data when we can. This would require no or perhaps very limited change to the existing API to enable the feature. I've been exploring (2), and what I find is that we can achieve very good recall results using dot-product similarity scoring by simple linear scaling + quantization of the vector values, so long as we choose the scale that minimizes the quantization error. Dot-product is amenable to this treatment since vectors are required to be unit-length when used with that similarity function. Even still there is variability in the ideal scale over different data sets. A good choice seems to be max(abs(min-value), abs(max-value)), but of course this assumes that the data set doesn't have a few outlier data points. A theoretical range can be obtained by 1/sqrt(dimension), but this is only useful when the samples are normally distributed. We could in theory determine the ideal scale when flushing a segment and manage this quantization per-segment, but then numerical error could creep in when merging. I'll post a patch/PR with an experimental setup I've been using for evaluation purposes. It is pretty self-contained and simple, but has some drawbacks that need to be addressed: 1. No automated mechanism for determining quantization scale (it's a constant that I have been playing with) 2. Converts from byte/float when computing dot-product instead of directly computing on byte values I'd like to get people's feedback on the approach and whether in general we should think about doing this compression under the hood, or expose a byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty compelling and we should pursue something.
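To make the scale-and-quantize step concrete: a minimal sketch, assuming signed bytes as the storage type and the max-magnitude scale choice described above. The class and method names are made up for illustration and are not from the forthcoming patch:

{code:java}
// Minimal sketch of linear scaling + quantization to signed bytes.
// Hypothetical names; not from the actual patch.
public final class VectorQuantizer {

  // Scale that maps the largest-magnitude component to roughly +/-127,
  // i.e. 127 / max(abs(min-value), abs(max-value)) from the description.
  static float chooseScale(float[] vector) {
    float maxAbs = 0f;
    for (float v : vector) {
      maxAbs = Math.max(maxAbs, Math.abs(v));
    }
    return maxAbs == 0f ? 1f : 127f / maxAbs;
  }

  // Quantizes each component; dot products over the resulting bytes then
  // approximate scale^2 times the original float dot product. The clamp is
  // symmetric ([-127, 127]) to avoid the -128 asymmetry of two's complement.
  static byte[] quantize(float[] vector, float scale) {
    byte[] quantized = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      int q = Math.round(vector[i] * scale);
      quantized[i] = (byte) Math.max(-127, Math.min(127, q));
    }
    return quantized;
  }
}
{code}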
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538448#comment-17538448 ] Adrien Grand commented on LUCENE-10574: --- It's not about absolute segment sizes, it's more about computing balanced merges. Say you have N 1-document segments and want to merge them down to a single segment, 10 segments at a time. If you always compute perfectly balanced merges then each document participates in O(log(N)) merges so it takes O(N log(N)) to get down to a single segment. If you take the naive approach of always merging the biggest segment you got so far with 9 1-document segments then each document participates in O(N) merges so it takes O(N^2) to get down to a single segment. As bad as the second approach sounds, this is what TieredMergePolicy does with segments that are below the floor segment size.
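A toy cost model makes the asymptotic difference easy to verify: count how many times documents are rewritten while merging N one-document segments, ten at a time, under the two strategies Adrien describes. Purely illustrative; this is not TieredMergePolicy code:

{code:java}
// Toy cost model: "cost" counts how many times documents are rewritten
// while merging n one-document segments down to one, 10 at a time.
public final class MergeCostDemo {
  public static void main(String[] args) {
    final int n = 10_000;

    // Naive: always merge the one big segment with 9 one-doc segments.
    long naiveCost = 0;
    long big = 1;
    while (big + 9 <= n) {
      big += 9;          // merge big segment with 9 singletons
      naiveCost += big;  // every doc in the merged result is rewritten
    }

    // Balanced: merge 10 equal-sized segments at a time, level by level.
    long balancedCost = 0;
    long segments = n;
    long size = 1;
    while (segments > 1) {
      balancedCost += segments * size; // each doc rewritten once per level
      segments = (segments + 9) / 10;
      size *= 10;
    }

    System.out.println("naive, ~O(N^2):        " + naiveCost);   // ~5.5M
    System.out.println("balanced, ~O(N log N): " + balancedCost); // ~40K
  }
}
{code}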
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538451#comment-17538451 ] Robert Muir commented on LUCENE-10577: -- I think a 2-byte float would be a better design than 1-byte float. We should design for things that have actual hardware support, not make up our own floating point formats for something like this, otherwise it will never get vectorized by hotspot and never scale. We still don't even have 4-byte float support from openjdk vectors, so I think it would be better to first wait and see if java exposes half-float vectorization in some way we can use.
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538456#comment-17538456 ] Robert Muir commented on LUCENE-10577: -- at least for fp16 we see some movement on openjdk (open pull request, java issue): https://bugs.openjdk.java.net/browse/JDK-8277304 https://github.com/openjdk/panama-vector/pull/164
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538457#comment-17538457 ] Michael Sokolov commented on LUCENE-10577: -- Actually what I have in mind is signed byte values (-128..127), not any kind of 8-bit floating point. But perhaps your point still holds - I don't know if there is hardware support for byte arithmetic?
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538460#comment-17538460 ] Adrien Grand commented on LUCENE-10577: --- Would it be possible to implement (1) with a float API by making the format detect when all float values across a segment are effectively integers in 0..255?
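That detection could be a simple per-segment scan, for example at flush time; a hedged sketch with a hypothetical helper name:

{code:java}
final class ByteRangeCheck {
  // True if every component is an integer in 0..255, i.e. the float vector
  // could be stored losslessly as unsigned bytes. Hypothetical helper,
  // illustrating the detection only.
  static boolean fitsUnsignedByte(float[] vector) {
    for (float v : vector) {
      if (v < 0f || v > 255f || v != (float) Math.rint(v)) {
        return false; // out of range, fractional, or NaN
      }
    }
    return true;
  }
}
{code}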
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538459#comment-17538459 ] Robert Muir commented on LUCENE-10577: -- the actual operations you want to do need to be supported. E.g. if you want to work on bytes, look at ByteVector and try to write a standalone vectorized prototype and see how it compares to e.g. dot-product on FloatVector. https://docs.oracle.com/en/java/javase/16/docs/api/jdk.incubator.vector/jdk/incubator/vector/ByteVector.html I'm just saying we can at least make use of the incubating stuff to "design for tomorrow". Index format has to be supported for a long time, so I don't think we should introduce a vectors format that... can't be vectorized :)
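A standalone prototype of that comparison might start like the following, using the incubating Panama Vector API (run with --add-modules jdk.incubator.vector). Widening bytes to ints and hardwiring the species sizes are simplifying assumptions for clarity, not a committed design:

{code:java}
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class ByteDotProduct {
  private static final VectorSpecies<Byte> BYTES = ByteVector.SPECIES_64; // 8 byte lanes

  // Dot product over signed bytes, widening each 8-byte chunk to 8 ints so
  // the per-lane multiply cannot overflow. Prototype sketch, not Lucene code.
  static int dot(byte[] a, byte[] b) {
    int i = 0;
    int sum = 0;
    for (int bound = BYTES.loopBound(a.length); i < bound; i += BYTES.length()) {
      IntVector va = (IntVector) ByteVector.fromArray(BYTES, a, i)
          .convertShape(VectorOperators.B2I, IntVector.SPECIES_256, 0);
      IntVector vb = (IntVector) ByteVector.fromArray(BYTES, b, i)
          .convertShape(VectorOperators.B2I, IntVector.SPECIES_256, 0);
      sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
    }
    for (; i < a.length; i++) { // scalar tail
      sum += a[i] * b[i];
    }
    return sum;
  }
}
{code}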
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538467#comment-17538467 ] Michael Sokolov commented on LUCENE-10577: -- Okay, thanks for the link [~rcmuir], I do see ByteVector.mul(ByteVector) and so on. And, yes [~jpountz], I think that could work for an API. It would be nice to let users worry about making their data in the right shape. I think it might make more sense to expect signed values though?
[jira] [Comment Edited] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538467#comment-17538467 ] Michael Sokolov edited comment on LUCENE-10577 at 5/17/22 8:57 PM: --- Okay, thanks for the link [~rcmuir], I do see ByteVector.mul(ByteVector) and so on. And, yes [~jpountz], I think that could work for an API. It would be nice to let users worry about making their data in the right shape. I think it might make more sense to expect signed values though? There do seem to be 8-bit vectorized instructions for Intel chips at least https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions-2/intrinsics-for-arithmetic-operations-2/mm256-add-epi8-16-32-64.html I agree we should measure, but also the JDK support here seems to be a moving target. Perhaps it's time to give it another whirl and see where we are now with JDK 18/19.
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538474#comment-17538474 ] Robert Muir commented on LUCENE-10577: -- My main concern with some custom encoding would be if it requires some slow scalar conversion. Currently with the simple float representation you can do everything from a float[], byte[], or mmapped data directly. See https://issues.apache.org/jira/browse/LUCENE-9838 So if you can do stuff directly with ByteVector that would be fine. Also if you can use a "poor man's vector" with varhandles and a 64-bit long to operate on the byte values, that's fine too. But please nothing that only works "one at a time".
[GitHub] [lucene] dsmiley commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
dsmiley commented on code in PR #898: URL: https://github.com/apache/lucene/pull/898#discussion_r875264422 ## lucene/CHANGES.txt: ## @@ -38,6 +38,8 @@ Improvements * LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori. (Uihyun Kim) +* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos (Rushabh Shah) Review Comment: But this is an Optimization (should thus go right below); no?
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
mdmarshmallow commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r875205786 ## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java: ## @@ -0,0 +1,163 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.hyperrectangle; + +import java.io.IOException; +import java.util.Collections; +import java.util.List; +import org.apache.lucene.document.LongPoint; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.Facets; +import org.apache.lucene.facet.FacetsCollector; +import org.apache.lucene.facet.LabelAndValue; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.search.DocIdSetIterator; + +/** Get counts given a list of HyperRectangles (which must be of the same type) */ +public class HyperRectangleFacetCounts extends Facets { + /** Hyper rectangles passed to constructor. */ + protected final HyperRectangle[] hyperRectangles; + + /** Counts, initialized by subclass. */ + protected final int[] counts; + + /** Our field name. */ + protected final String field; + + /** Number of dimensions for field */ + protected final int dims; + + /** Total number of hits. */ + protected int totCount; + + /** + * Create HyperRectangleFacetCounts using + * + * @param field Field name + * @param hits Hits to facet on + * @param hyperRectangles List of long hyper rectangle facets + * @throws IOException If there is a problem reading the field + */ + public HyperRectangleFacetCounts( + String field, FacetsCollector hits, LongHyperRectangle... hyperRectangles) + throws IOException { +this(true, field, hits, hyperRectangles); + } + + /** + * Create HyperRectangleFacetCounts using + * + * @param field Field name + * @param hits Hits to facet on + * @param hyperRectangles List of double hyper rectangle facets + * @throws IOException If there is a problem reading the field + */ + public HyperRectangleFacetCounts( + String field, FacetsCollector hits, DoubleHyperRectangle... hyperRectangles) + throws IOException { +this(true, field, hits, hyperRectangles); + } + + private HyperRectangleFacetCounts( + boolean discarded, String field, FacetsCollector hits, HyperRectangle... hyperRectangles) Review Comment: Nothing really, I just wanted to make all the `HyperRectangle`s be of the same subclass, though we could also leave it up to the user to decide whether they want that or not, in which case I could just do `HyperRectangle...` ## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangle.java: ## @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.hyperrectangle; + +/** Holds the name and the number of dims for a HyperRectangle */ +public abstract class HyperRectangle { + /** Label that identifies this range. */ + public final String label; + + /** How many dimensions this hyper rectangle has (i.e. a regular rectangle would have dims=2) */ + public final int dims; + + /** Sole constructor. */ + protected HyperRectangle(String label, int dims) { +if (label == null) { + throw new IllegalArgumentException("label must not be null"); +} +if (dims <= 0) { + throw new IllegalArgumentException("Dims must be greater than 0. Dims=
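An aside on the private constructor with the unused boolean discussed above: delegating from the LongHyperRectangle... overload to a same-arity HyperRectangle... constructor via this(...) would resolve to the overload itself (the most specific applicable varargs match), which javac rejects as a recursive constructor invocation; the extra flag gives the shared constructor a distinct signature. A distilled illustration with hypothetical names:

{code:java}
class Rect {}
class LongRect extends Rect {}
class DoubleRect extends Rect {}

class Counts {
  Counts(String field, LongRect... rects) {
    this(true, field, rects); // the boolean steers resolution to the private ctor
  }

  Counts(String field, DoubleRect... rects) {
    this(true, field, rects);
  }

  // Without the unused boolean, this(field, rects) above would pick the
  // LongRect/DoubleRect overload itself (the most specific match) and javac
  // would reject it as a recursive constructor invocation.
  private Counts(boolean discarded, String field, Rect... rects) {
    // shared initialization goes here
  }
}
{code}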
[GitHub] [lucene] shahrs87 commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
shahrs87 commented on code in PR #898: URL: https://github.com/apache/lucene/pull/898#discussion_r875270812 ## lucene/CHANGES.txt: ## @@ -38,6 +38,8 @@ Improvements * LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori. (Uihyun Kim) +* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos (Rushabh Shah) Review Comment: True. Changed it in latest commit. Please review again.
[GitHub] [lucene] dsmiley commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
dsmiley commented on code in PR #898: URL: https://github.com/apache/lucene/pull/898#discussion_r875271168 ## lucene/CHANGES.txt: ## @@ -38,6 +38,8 @@ Improvements * LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori. (Uihyun Kim) +* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos (Rushabh Shah) Review Comment: And you've put this in the 10.0 changes but I see no reason not to backport to 9.x. There's a feature-freeze for 9.2 (it's going to be released) so... we could just wait a week or two here for the 9.3 section to appear by @romseygeek (the RM).
[GitHub] [lucene] shahrs87 commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
shahrs87 commented on code in PR #898: URL: https://github.com/apache/lucene/pull/898#discussion_r875273788 ## lucene/CHANGES.txt: ## @@ -38,6 +38,8 @@ Improvements * LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori. (Uihyun Kim) +* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos (Rushabh Shah) Review Comment: I am pretty new to this project. This is my 2nd commit, so I don't know much about the release versions. Just so I understand clearly: we will wait for a couple of weeks, and once 9.2 is released and the 9.3 section is created, I need to update CHANGES.txt and then we will merge this PR?
[GitHub] [lucene] msokolov opened a new pull request, #899: Lucene 10577
msokolov opened a new pull request, #899: URL: https://github.com/apache/lucene/pull/899 This is SCRATCH - not to be committed. It has numerous problems, but it was useful for testing, and I'm sharing it as a first, broken impl that can be improved. Things TBD: 1. work out a better way to figure out scaling (maybe let the customer pass in 8-bit values, perhaps *as* floats). 2. do the vector math directly on the mmapped bytes using ByteVector 3. fix the tests so they can handle quantized data better -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538494#comment-17538494 ] Michael Sokolov commented on LUCENE-10577: -- > So if you can do stuff directly with ByteVector that would be fine. Also if > you can use "poor man's vector" with varhandles and a 64-bit long to operate > on the byte values, that's fine too. But please nothing that only works "one > at a time". +1 -- that is what I have done in my prototype (one-at-a-time conversion from byte to float), but it is not what we would ship. By the way, I tried out the attached prototype on some sample data from work and also on the Stanford GloVe 200 data and got reasonable results. For the best scale value, recall stays within about 1% of baseline. Latency increased a bit in some cases (as much as 25%) but decreased in others?! > Quantize vector values > -- > > Key: LUCENE-10577 > URL: https://issues.apache.org/jira/browse/LUCENE-10577 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Michael Sokolov >Priority: Major > > The {{KnnVectorField}} api handles vectors with 4-byte floating point values. > These fields can be used (via {{KnnVectorsReader}}) in two main ways: > 1. The {{VectorValues}} iterator enables retrieving values > 2. Approximate nearest-neighbor search > The main point of this addition was to provide the search capability, and to > support that it is not really necessary to store vectors in full precision. > Perhaps users may also be willing to retrieve values in lower precision for > whatever purpose those serve, if they are able to store more samples. We know > that 8 bits is enough to provide a very near approximation to the same > recall/performance tradeoff that is achieved with the full-precision vectors. > I'd like to explore how we could enable 4:1 compression of these fields by > reducing their precision. > A few ways I can imagine this would be done: > 1. Provide a parallel byte-oriented API. This would allow users to provide > their data in reduced-precision format and give control over the quantization > to them. It would have a major impact on the Lucene API surface though, > essentially requiring us to duplicate all of the vector APIs. > 2. Automatically quantize the stored vector data when we can. This would > require no or perhaps very limited change to the existing API to enable the > feature. > I've been exploring (2), and what I find is that we can achieve very good > recall results using dot-product similarity scoring by simple linear scaling > + quantization of the vector values, so long as we choose the scale that > minimizes the quantization error. Dot-product is amenable to this treatment > since vectors are required to be unit-length when used with that similarity > function. > Even still there is variability in the ideal scale over different data sets. > A good choice seems to be max(abs(min-value), abs(max-value)), but of course > this assumes that the data set doesn't have a few outlier data points. A > theoretical range can be obtained by 1/sqrt(dimension), but this is only > useful when the samples are normally distributed. We could in theory > determine the ideal scale when flushing a segment and manage this > quantization per-segment, but then numerical error could creep in when > merging. > I'll post a patch/PR with an experimental setup I've been using for > evaluation purposes. 
It is pretty self-contained and simple, but has some > drawbacks that need to be addressed: > 1. No automated mechanism for determining quantization scale (it's a constant > that I have been playing with) > 2. Converts from byte/float when computing dot-product instead of directly > computing on byte values > I'd like to get people's feedback on the approach and whether in general we > should think about doing this compression under the hood, or expose a > byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty > compelling and we should pursue something. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
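As an aside for readers following the thread: the scale-and-quantize scheme described above is easy to sketch. Below is a minimal, hypothetical illustration; the class and method names are made up for this example and are not from the attached patch, while the scale heuristic max(abs(min-value), abs(max-value)) is the one mentioned in the issue description.

```java
// Hypothetical sketch of the linear scale + quantize idea; not from the patch.
final class QuantizationSketch {

  /** Scale heuristic from the discussion: max(abs(min), abs(max)) over the data. */
  static float chooseScale(float[][] vectors) {
    float maxAbs = 0f;
    for (float[] v : vectors) {
      for (float x : v) {
        maxAbs = Math.max(maxAbs, Math.abs(x));
      }
    }
    return maxAbs / 127f; // map [-maxAbs, maxAbs] onto the signed-byte range
  }

  /** Quantize a float vector to bytes, clamping to the byte range. */
  static byte[] quantize(float[] v, float scale) {
    byte[] out = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      int q = Math.round(v[i] / scale);
      out[i] = (byte) Math.max(Byte.MIN_VALUE, Math.min(Byte.MAX_VALUE, q));
    }
    return out;
  }

  /** Expand bytes back to floats, as the wrapper in the draft PR does. */
  static float[] dequantize(byte[] b, float scale) {
    float[] out = new float[b.length];
    for (int i = 0; i < b.length; i++) {
      out[i] = b[i] * scale;
    }
    return out;
  }
}
```

Note that a dot product can also be computed directly on the quantized bytes and rescaled once at the end, since dot(quantize(a), quantize(b)) * scale * scale approximates dot(a, b); that is the "compute directly on byte values" direction discussed in the comments that follow.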
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538499#comment-17538499 ] Robert Muir commented on LUCENE-10577: -- Well, but that's comparing latency to the current dog-slow one-at-a-time float :) The difference is, although the current encoding is slow, it can easily be fast in the future, whenever the vector api is released. We need to keep this option open and not be in a situation where our vectors can't be vectorized, especially with the push to constantly increase the size into the thousands. One-at-a-time is no good... > Quantize vector values > -- > > Key: LUCENE-10577 > URL: https://issues.apache.org/jira/browse/LUCENE-10577 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #899: Lucene 10577
rmuir commented on code in PR #899: URL: https://github.com/apache/lucene/pull/899#discussion_r875319253 ## lucene/core/src/java/org/apache/lucene/codecs/lucene92/ExpandingRandomAccessVectorValues.java: ## @@ -0,0 +1,57 @@ +package org.apache.lucene.codecs.lucene92; + +import org.apache.lucene.index.RandomAccessVectorValues; +import org.apache.lucene.index.RandomAccessVectorValuesProducer; +import org.apache.lucene.util.BytesRef; + +import java.io.IOException; + +public class ExpandingRandomAccessVectorValues implements RandomAccessVectorValuesProducer { + + private final RandomAccessVectorValuesProducer delegate; + private final float scale; + + /** + * Wraps an existing vector values producer. Floating point vector values will be produced by scaling + * byte-quantized values read from the values produced by the input. + */ + protected ExpandingRandomAccessVectorValues(RandomAccessVectorValuesProducer in, float scale) { +this.delegate = in; +assert scale != 0; +this.scale = scale; + } + + @Override + public RandomAccessVectorValues randomAccess() throws IOException { +RandomAccessVectorValues delegateValues = delegate.randomAccess(); +float[] value = new float[delegateValues.dimension()]; + +return new RandomAccessVectorValues() { + + @Override + public int size() { +return delegateValues.size(); + } + + @Override + public int dimension() { +return delegateValues.dimension(); + } + + @Override + public float[] vectorValue(int targetOrd) throws IOException { +BytesRef binaryValue = delegateValues.binaryValue(targetOrd); +byte[] bytes = binaryValue.bytes; +for (int i = 0, j = binaryValue.offset; i < value.length; i++, j++) { + value[i] = bytes[j] * scale; Review Comment: Seems to me that moving dotProduct etc out of `org.apache.lucene.util` could help. It could be in the codec. At a glance, I would modify the dot-product vectors patch and try something like: ``` FloatVector floats = ByteVector.fromArray(bytes).reinterpretAsFloats(); floats = floats.mul(scale); ... remainder of existing algorithm from patch ... ``` I have no idea how this would perform off the top of my head, but we can try it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #899: Lucene 10577
rmuir commented on code in PR #899: URL: https://github.com/apache/lucene/pull/899#discussion_r875320987 ## lucene/core/src/java/org/apache/lucene/codecs/lucene92/ExpandingRandomAccessVectorValues.java: ## Review Comment: and i think we don't want reinterpret, but this one: https://docs.oracle.com/en/java/javase/16/docs/api/jdk.incubator.vector/jdk/incubator/vector/ByteVector.html#viewAsFloatingLanes() -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #899: Lucene 10577
rmuir commented on code in PR #899: URL: https://github.com/apache/lucene/pull/899#discussion_r875321513 ## lucene/core/src/java/org/apache/lucene/codecs/lucene92/ExpandingRandomAccessVectorValues.java: ## Review Comment: the javadoc illustrates the challenge: "This method always throws UnsupportedOperationException, because there is no floating point type of the same size as byte. The return type of this method is arbitrarily designated as Vector. Future versions of this API may change the return type if additional floating point types become available." -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
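Pulling the three comments above together: reinterpretAsFloats() only reinterprets bits, and viewAsFloatingLanes() throws for byte lanes, so the workable route in the incubating jdk.incubator.vector API is an explicit widening conversion (VectorOperators.B2F) from a byte species to a float species with the same lane count. Here is a hedged sketch of that approach, assuming a 64-bit byte species widened to a 256-bit float species and the usual --add-modules jdk.incubator.vector setup; it is an illustration of the technique, not code from the patch:

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical sketch: dot product over byte-quantized vectors via B2F widening.
final class ExpandingDotProductSketch {
  private static final VectorSpecies<Byte> BYTES = ByteVector.SPECIES_64;     // 8 byte lanes
  private static final VectorSpecies<Float> FLOATS = FloatVector.SPECIES_256; // 8 float lanes

  static float dotProduct(byte[] a, byte[] b, float scale) {
    FloatVector acc = FloatVector.zero(FLOATS);
    int i = 0;
    int bound = BYTES.loopBound(a.length);
    for (; i < bound; i += BYTES.length()) {
      // Widen 8 bytes to 8 floats; part 0 because the lane counts match.
      FloatVector va = (FloatVector)
          ByteVector.fromArray(BYTES, a, i).convertShape(VectorOperators.B2F, FLOATS, 0);
      FloatVector vb = (FloatVector)
          ByteVector.fromArray(BYTES, b, i).convertShape(VectorOperators.B2F, FLOATS, 0);
      acc = va.fma(vb, acc); // acc = va * vb + acc, lane-wise
    }
    float dot = acc.reduceLanes(VectorOperators.ADD);
    for (; i < a.length; i++) { // scalar tail
      dot += a[i] * b[i];
    }
    return dot * scale * scale; // apply the quantization scale once, at the end
  }
}
```

Whether the widening conversion is cheap enough in practice is exactly the open performance question raised in the comments above.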
[GitHub] [lucene] jtibshirani commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID
jtibshirani commented on PR #873: URL: https://github.com/apache/lucene/pull/873#issuecomment-1129427428 Sorry for jumping in late with some thoughts. Because of the approximate nature of HNSW, we are not guaranteed that the graph search will collect all documents with the same score. There could always be a document with a lower doc ID that the graph search misses, because it decided not to explore that part of the graph. So while this PR makes it more likely to return the lowest doc IDs, I still don't think we can state a helpful guarantee to the user. This makes me wonder if we should even be trying to tiebreak by doc ID during the graph search? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
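For context, the tie-break being debated amounts to the usual "score descending, then doc ID ascending" ordering over whatever candidates the search collects. A tiny illustrative sketch (not the PR's code) of that ordering, including the caveat from the comment above:

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch (not the PR's code): score descending, then doc ID ascending.
record Candidate(int doc, float score) {}

final class TieBreakSketch {
  static final Comparator<Candidate> BY_SCORE_THEN_DOC =
      Comparator.comparingDouble((Candidate c) -> c.score())
          .reversed()
          .thenComparingInt(Candidate::doc);

  // Sort collected candidates and keep the top k. Note this only orders what the
  // graph search actually visited; a lower doc ID with an equal score that the
  // search never explored can still be missing, which is the caveat raised above.
  static Candidate[] topK(Candidate[] collected, int k) {
    Arrays.sort(collected, BY_SCORE_THEN_DOC);
    return Arrays.copyOf(collected, Math.min(k, collected.length));
  }
}
```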
[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
mocobeta commented on PR #893: URL: https://github.com/apache/lucene/pull/893#issuecomment-1129442757 I'm merging this only to main - let me know if it's worth backporting. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta merged pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
mocobeta merged PR #893: URL: https://github.com/apache/lucene/pull/893 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it
[ https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538521#comment-17538521 ] ASF subversion and git services commented on LUCENE-10531: -- Commit b911d1d47c592a51cd3b0c3f59eea6e24455cea3 in lucene's branch refs/heads/main from Tomoko Uchida [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b911d1d47c5 ] LUCENE-10531: Add @RequiresGUI test group for GUI tests (#893) Co-authored-by: Dawid Weiss > Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI > workflow for it > --- > > Key: LUCENE-10531 > URL: https://issues.apache.org/jira/browse/LUCENE-10531 > Project: Lucene - Core > Issue Type: Task > Components: general/test >Reporter: Tomoko Uchida >Priority: Minor > Time Spent: 6h 10m > Remaining Estimate: 0h > > We are going to allow running the test on Xvfb (a virtual display that speaks > the X protocol) in [LUCENE-10528]; this tweak is available only on Linux. > I'm just guessing, but it could also confuse or bother Mac and Windows users > (we can't know what window manager developers are using); it may be better to > make it opt-in by marking it as a slow test. > Instead, I think we can enable a dedicated Github actions workflow for the > distribution test that is triggered only when the related files are changed. > Besides Linux, we could run it on both Mac and Windows, which most users run > the app on - it'd be slow, but if we limit the scope of the test I suppose it > works functionally just fine (I'm running actions workflows on mac and > windows elsewhere). > To make it a "slow test", we could add the same {{@Slow}} annotation as the > {{test-framework}} to the distribution tests, for consistency. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
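For readers who haven't used randomizedtesting's test groups: an opt-in group is just an annotation meta-annotated with @TestGroup, disabled by default and switched on via a system property. A rough sketch of the pattern follows; the annotation name matches this issue, but the exact attributes here are assumptions, not necessarily what was committed:

```java
import java.lang.annotation.Documented;
import java.lang.annotation.Inherited;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import com.carrotsearch.randomizedtesting.annotations.TestGroup;

// Sketch of a disabled-by-default test group; tests annotated with it run only
// when the group is enabled, e.g. with -Dtests.gui=true on the gradle command line.
@Documented
@Inherited
@Retention(RetentionPolicy.RUNTIME)
@TestGroup(enabled = false, sysProperty = "tests.gui")
public @interface RequiresGUI {}
```

A test class then opts in by declaring @RequiresGUI at the class level, and a dedicated CI workflow enables the group explicitly.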
[jira] [Commented] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it
[ https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538522#comment-17538522 ] ASF subversion and git services commented on LUCENE-10531: -- Commit 34446c40c4ab97bff75b2e85cf6e0dfab6b6c37a in lucene's branch refs/heads/main from Tomoko Uchida [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=34446c40c4a ] LUCENE-10531: small follow-up for b911d1d47 > Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI > workflow for it > > Key: LUCENE-10531 > URL: https://issues.apache.org/jira/browse/LUCENE-10531 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it
[ https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida resolved LUCENE-10531. Fix Version/s: 10.0 (main) Resolution: Fixed > Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI > workflow for it > > Key: LUCENE-10531 > URL: https://issues.apache.org/jira/browse/LUCENE-10531 > Fix For: 10.0 (main) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
shaie commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r875458505 ## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/DoubleHyperRectangle.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.hyperrectangle; + +import java.util.Arrays; +import org.apache.lucene.util.NumericUtils; + +/** Stores a hyper rectangle as an array of DoubleRangePairs */ +public class DoubleHyperRectangle extends HyperRectangle { + + /** Creates DoubleHyperRectangle */ + public DoubleHyperRectangle(String label, DoubleRangePair... pairs) { +super(label, convertToLongRangePairArray(pairs)); + } + + private static LongRangePair[] convertToLongRangePairArray(DoubleRangePair... pairs) { Review Comment: nit: I find `Array` redundant, maybe `convertToLongRangePairs`? Or `toLongRangePairs`? ## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java: ## @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.hyperrectangle; + +import java.io.IOException; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import org.apache.lucene.document.LongPoint; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.Facets; +import org.apache.lucene.facet.FacetsCollector; +import org.apache.lucene.facet.LabelAndValue; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.search.DocIdSetIterator; + +/** Get counts given a list of HyperRectangles (which must be of the same type) */ +public class HyperRectangleFacetCounts extends Facets { + /** Hypper rectangles passed to constructor. */ + protected final HyperRectangle[] hyperRectangles; + + /** Counts, initialized in subclass. */ + protected final int[] counts; + + /** Our field name. 
*/ + protected final String field; + + /** Number of dimensions for field */ + protected final int dims; + + /** Total number of hits. */ + protected int totCount; + + /** + * Create HyperRectangleFacetCounts using + * + * @param field Field name + * @param hits Hits to facet on + * @param hyperRectangles List of long hyper rectangle facets + * @throws IOException If there is a problem reading the field + */ + public HyperRectangleFacetCounts( + String field, FacetsCollector hits, LongHyperRectangle... hyperRectangles) + throws IOException { +this(true, field, hits, hyperRectangles); + } + + /** + * Create HyperRectangleFacetCounts using + * + * @param field Field name + * @param hits Hits to facet on + * @param hyperRectangles List of double hyper rectangle facets + * @throws IOException If there is a problem reading the field + */ + public HyperRectangleFacetCounts( + String field, FacetsCollector hits, DoubleHyperRectangle... hyperRectangles) + throws IOException { +this(true, field, hits, hyperRectangles); + } + + private HyperRectangleFacetCounts( + boolean discarded, String field, FacetsCollector hits, HyperRectangle... hyperRectangles) + throws IOException { +assert hyperRectangles.length > 0 : "Hyper rectangle ranges cannot be empty"; +assert isHyperRectangleDimsConsistent(hyperRectangles) +: "All hyper rectangles must be the same dimensionality"; +this
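Stepping back from the review nits, end-to-end usage of the API under review would look roughly like the following. The names come from the quoted diff, but the exact LongRangePair constructor shape is an assumption (the diff is cut off above), so treat this as a sketch rather than the PR's own example:

```java
import java.io.IOException;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.hyperrectangle.HyperRectangle;
import org.apache.lucene.facet.hyperrectangle.HyperRectangleFacetCounts;
import org.apache.lucene.facet.hyperrectangle.LongHyperRectangle;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;

final class HyperRectangleUsageSketch {
  // Count documents whose 2-dimensional long point falls inside [0,10] x [0,10].
  static FacetResult countBox(IndexSearcher searcher, String field) throws IOException {
    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
    Facets facets =
        new HyperRectangleFacetCounts(
            field,
            fc,
            new LongHyperRectangle(
                "small box",
                new HyperRectangle.LongRangePair(0L, true, 10L, true),   // dim 0
                new HyperRectangle.LongRangePair(0L, true, 10L, true))); // dim 1
    return facets.getTopChildren(10, field);
  }
}
```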
[GitHub] [lucene] dweiss commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
dweiss commented on PR #893: URL: https://github.com/apache/lucene/pull/893#issuecomment-1129612104 I'd apply this to 9x as well, since it'll ease backports of other things and decrease the potential for conflicts in the future? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?
[ https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538598#comment-17538598 ] Uwe Schindler commented on LUCENE-10572: bq. The stopwords are going to skew everything. If someone is removing them, the distribution of tokens will look much different. If Wikipedia has so many stopwords, this would explain what Mike is seeing. Every stop word produces a hash that's already known, so the Arrays.equals() code runs on each stopword every time it is seen, over and over. Maybe let's just change the analyzer that Mike uses to remove those stopwords? Or are there many stopwords we do not know about? Nevertheless, this is a valid use case: text without stopwords and text with stopwords (especially because we recommend to users not to remove stopwords anymore). > Can we optimize BytesRefHash? > - > > Key: LUCENE-10572 > URL: https://issues.apache.org/jira/browse/LUCENE-10572 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Attachments: Screen Shot 2022-05-16 at 10.28.22 AM.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > I was poking around in our nightly benchmarks > ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR > profiling that the hottest method is this: > {noformat} > PERCENT CPU SAMPLES STACK > 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals() > at > org.apache.lucene.util.BytesRefHash#findHash() > at org.apache.lucene.util.BytesRefHash#add() > at > org.apache.lucene.index.TermsHashPerField#add() > at > org.apache.lucene.index.IndexingChain$PerField#invert() > at > org.apache.lucene.index.IndexingChain#processField() > at > org.apache.lucene.index.IndexingChain#processDocument() > at > org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat} > This is kinda crazy – comparing if the term to be inserted into the inverted > index hash equals the term already added to {{BytesRefHash}} is the hottest > method during nightly benchmarks. > Discussing offline with [~rcmuir] and [~jpountz] they noticed a few > questionable things about our current implementation: > * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the > inserted term into the hash? Let's just use two bytes always, since IW > limits term length to 32 K (< 64K that an unsigned short can cover) > * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} > (BitUtil.VH_BE_SHORT.get) > * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not > aggressive enough? Or the initial sizing of the hash is too small? > * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too > many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible > "upgrades"? > * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version > ({{{}murmurhash3_x86_32{}}})? > * Are we using the JVM's intrinsics to compare multiple bytes in a single > SIMD instruction ([~rcmuir] is quite sure we are indeed)? > * [~jpountz] suggested maybe the hash insert is simply memory bound > * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total > CPU cost) > I pulled these observations from a recent (5/6/22) profiler output: > [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html] > Maybe we can improve our performance on this crazy hotspot? > Or maybe this is a "healthy" hotspot and we should leave it be! 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
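To make the stopword effect concrete: in an open-addressing hash like BytesRefHash, an already-present term (every occurrence of "the", say) hashes to an occupied slot, so the byte-wise equals runs on every single occurrence before the term can be recognized as a duplicate. Here is a simplified model of that lookup path; it is a schematic illustration, not org.apache.lucene.util.BytesRefHash itself:

```java
import java.util.Arrays;

// Simplified model of an open-addressing byte[] hash; not Lucene's BytesRefHash.
final class TinyBytesHashSketch {
  private final byte[][] terms = new byte[1 << 16][]; // power-of-two table
  private final int mask = terms.length - 1;

  /** Linear probing: on a repeated term, equals() runs before we know it is a duplicate. */
  private int findSlot(byte[] term, int hash) {
    int slot = hash & mask;
    while (terms[slot] != null && !Arrays.equals(terms[slot], term)) {
      slot = (slot + 1) & mask; // probe the next slot
    }
    return slot;
  }

  /** Adds the term if absent; returns false for repeats (e.g. stopwords). */
  boolean add(byte[] term, int hash) {
    int slot = findSlot(term, hash);
    if (terms[slot] != null) {
      return false; // duplicate: the equals() above was paid anyway
    }
    terms[slot] = term.clone();
    return true;
  }
}
```

This is why high-frequency terms dominate the equals() samples in the profile, largely independent of which hash function is chosen.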
[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
mocobeta commented on PR #893: URL: https://github.com/apache/lucene/pull/893#issuecomment-1129641582 Ok I'll backport it to the 9x branch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org