[jira] [Commented] (LUCENE-10658) Merges should periodically check for abort
[ https://issues.apache.org/jira/browse/LUCENE-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568962#comment-17568962 ] Michael McCandless commented on LUCENE-10658: - +1, merges should abort promptly. But it is indeed only a "best effort" mechanism. I guess Lucene's completion field is building FSTs during merging and not writing bytes to disk as it builds the large FST, until the end? Maybe there are other parts of Lucene merging that also fail to check promptly enough, e.g. maybe when dimensional points are doing a (large) offline sort before writing anything to the output files? Maybe we could instrument {{MergeRateLimiter}} to write a WARNING into {{infoStream}} whenever too much time has elapsed between visits to its {{maybePause}} API? We could use that to tease out other places that are failing to write bytes frequently enough for abort checking. Lucene used to check for merge abort deep inside {{IndexWriter}} and merging code (e.g. merging postings would check periodically, same for doc values, etc.), but I think we refactored that down to the rate limiter only in LUCENE-7700 which was a nice cleanup / step forward. > Merges should periodically check for abort > -- > > Key: LUCENE-10658 > URL: https://issues.apache.org/jira/browse/LUCENE-10658 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 9.3 >Reporter: Nhat Nguyen >Priority: Major > > Rolling back an IndexWriter without committing shouldn't take long (i.e., > less than several seconds), and Elasticsearch cluster coordination [relies > on|https://github.com/elastic/elasticsearch/issues/88055] this assumption. If > some merges are taking place, the rollback can take several minutes as merges > only check for abort when writing to files via > [MergeRateLimiter|https://github.com/apache/lucene/blob/3d7d85f245381f84c46c766119695a8645cde2b8/lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java#L117-L119]. 
> Merging a completion field, for example, can take a long time without > touching output files. Another reason merges should periodically check for > abort is its outputs will be discarded. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
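The instrumentation idea from the comment above (warn when too much time passes between visits to `maybePause`) could be sketched roughly as follows. This is only an illustration under stated assumptions: `PauseGapMonitor` is a hypothetical helper, not a Lucene class, and the `Consumer<String>` sink merely stands in for `infoStream`. The caller passes in the clock reading so the logic stays easy to test.

```java
import java.util.function.Consumer;

/** Hypothetical helper, not part of Lucene: flags long gaps between abort checks. */
final class PauseGapMonitor {
  private final long maxGapNanos;
  private final Consumer<String> warningSink; // stands in for infoStream (assumption)
  private boolean started = false;
  private long lastVisitNanos;

  PauseGapMonitor(long maxGapNanos, Consumer<String> warningSink) {
    this.maxGapNanos = maxGapNanos;
    this.warningSink = warningSink;
  }

  /** Call at every point where the merge checks for abort, e.g. with System.nanoTime(). */
  void onVisit(long nowNanos) {
    if (started) {
      long gapNanos = nowNanos - lastVisitNanos;
      if (gapNanos > maxGapNanos) {
        // Report the gap so slow code paths can be identified from the log.
        warningSink.accept(
            "WARNING: " + (gapNanos / 1_000_000) + " ms elapsed since the last abort check");
      }
    }
    started = true;
    lastVisitNanos = nowNanos;
  }
}
```

In a real merge, the warnings would point at code paths (FST building, offline sorts) that go a long time without writing bytes, which is the goal described in the comment.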
[jira] [Updated] (LUCENE-10656) It is unnecessary that using `limit` to check boundary
[ https://issues.apache.org/jira/browse/LUCENE-10656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Xugang updated LUCENE-10656: --- Fix Version/s: (was: 9.3) > It is unnecessary that using `limit` to check boundary > -- > > Key: LUCENE-10656 > URL: https://issues.apache.org/jira/browse/LUCENE-10656 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Priority: Trivial > Time Spent: 20m > Remaining Estimate: 0h > > follow-up discussion in [https://github.com/apache/lucene/pull/1021]
[GitHub] [lucene] luyuncheng commented on a diff in pull request #987: LUCENE-10627: Using ByteBuffersDataInput reduce memory copy on compressing data
luyuncheng commented on code in PR #987: URL: https://github.com/apache/lucene/pull/987#discussion_r925476227 ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java: ## @@ -165,6 +165,36 @@ public void readBytes(byte[] arr, int off, int len) throws EOFException { } } } + /** Review Comment: Wondering +1, it seems like spotless can not format this situation. and i manually fixed it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mocobeta commented on issue #27: Improve the `Jira Information` header?
mocobeta commented on issue #27: URL: https://github.com/apache/lucene-jira-archive/issues/27#issuecomment-1190202161 Hi @mikemccand, I plan to start the next (hopefully last) migration test for all issues on July 25th. Could you merge any improvements you have in mind for `jira2github_import.py` before then, if possible? Thanks.
[GitHub] [lucene-jira-archive] mocobeta commented on issue #54: Hyperlinks are sometimes not actual links on import
mocobeta commented on issue #54: URL: https://github.com/apache/lucene-jira-archive/issues/54#issuecomment-1190208864 I'll take a look.
[GitHub] [lucene-jira-archive] mocobeta commented on issue #54: Hyperlinks are sometimes not actual links on import
mocobeta commented on issue #54: URL: https://github.com/apache/lucene-jira-archive/issues/54#issuecomment-1190235323 It looks like this is expected behavior of GitHub's Markdown rendering. Any string surrounded by `[]` is not interpreted as a hyperlink. We could remove `[` and `]` if and only if they enclose a URL, but I think that would be a bit risky?
[GitHub] [lucene-jira-archive] mikemccand commented on issue #54: Hyperlinks are sometimes not actual links on import
mikemccand commented on issue #54: URL: https://github.com/apache/lucene-jira-archive/issues/54#issuecomment-1190282496 Ahh thanks @mocobeta! I wonder why GitHub doesn't render links in [ .. ]? The problem is, this is a fairly frequent occurrence since `commitbot` formats the link in its comments that way. I agree it'd be risky to do this on the input we send to the Jira -> MD converter, but maybe not so risky if we do it on the output? I.e. if, after conversion, we see a URL-like string (hmm, regexp parsing these is tricky), then we turn it into a hyperlink? Or perhaps we specialize this to commitbot comments only?
[GitHub] [lucene-jira-archive] mikemccand commented on issue #27: Improve the `Jira Information` header?
mikemccand commented on issue #27: URL: https://github.com/apache/lucene-jira-archive/issues/27#issuecomment-1190337880 > Hi @mikemccand, I plan to start the next (hopefully last) migration test for all issues on July 25th. Could you merge any improvements you have in mind for `jira2github_import.py` before then, if possible? Will do. Thanks @mocobeta!
[GitHub] [lucene-jira-archive] mocobeta commented on issue #54: Hyperlinks are sometimes not actual links on import
mocobeta commented on issue #54: URL: https://github.com/apache/lucene-jira-archive/issues/54#issuecomment-1190347139 I think humans are unlikely to make comments such as `[ http:// ]`? I'm +1 to only apply a special conversion for commitbot's comments; it's easy to filter comments by their author, and it'd be fairly safe to tweak the templated comments?

```
"comments": [
  {
    "self": "https://issues.apache.org/jira/rest/api/2/issue/13469553/comment/17562453",
    "id": "17562453",
    "author": {
      "self": "https://issues.apache.org/jira/rest/api/2/user?username=jira-bot",
      "name": "jira-bot",
      "key": "jira-bot",
      "avatarUrls": {
        "48x48": "https://issues.apache.org/jira/secure/useravatar?avatarId=10452",
        "24x24": "https://issues.apache.org/jira/secure/useravatar?size=small&avatarId=10452",
        "16x16": "https://issues.apache.org/jira/secure/useravatar?size=xsmall&avatarId=10452",
        "32x32": "https://issues.apache.org/jira/secure/useravatar?size=medium&avatarId=10452"
      },
      "displayName": "ASF subversion and git services",
      "active": true,
      "timeZone": "Etc/UTC"
    },
    "body": "Commit 3dd9a5487c2c3994abdaf5ab0553a3d78ebe50ab in lucene's branch refs/heads/main from Adrien Grand\n[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3dd9a5487c2 ]\n\nLUCENE-10636: Avoid computing the same scores multiple times. (#1005)\n\n`BlockMaxMaxscoreScorer` would previously compute the score twice for essential\r\nscorers.\r\n\r\nCo-authored-by: zacharymorn ",
    "updateAuthor": {
      "self": "https://issues.apache.org/jira/rest/api/2/user?username=jira-bot",
      "name": "jira-bot",
      "key": "jira-bot",
      "avatarUrls": {
        "48x48": "https://issues.apache.org/jira/secure/useravatar?avatarId=10452",
        "24x24": "https://issues.apache.org/jira/secure/useravatar?size=small&avatarId=10452",
        "16x16": "https://issues.apache.org/jira/secure/useravatar?size=xsmall&avatarId=10452",
        "32x32": "https://issues.apache.org/jira/secure/useravatar?size=medium&avatarId=10452"
      },
      "displayName": "ASF subversion and git services",
      "active": true,
      "timeZone": "Etc/UTC"
    },
    "created": "2022-07-05T08:14:08.255+",
    "updated": "2022-07-05T08:14:08.255+"
  }
```
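The author-filtered conversion proposed here could look roughly like the sketch below. The real import script (`jira2github_import.py`) is Python; Java is used here purely to illustrate the logic, and `CommitBotLinks` is a hypothetical name. The sketch assumes the bot is identified by the `author.name` value `jira-bot` shown in the sample, and that the templated link always has the form `[ URL ]` with single spaces inside the brackets. It rewrites the link as a Markdown autolink (`<URL>`).

```java
import java.util.regex.Pattern;

/** Hypothetical helper illustrating an author-filtered link fix-up. */
final class CommitBotLinks {
  // Matches "[ http(s)://... ]" as it appears in the sample comment body.
  private static final Pattern BRACKETED_URL = Pattern.compile("\\[ (https?://\\S+) \\]");

  // Assumption: the commit bot's comments carry this author name.
  private static final String BOT_AUTHOR = "jira-bot";

  /** Rewrites bracketed URLs as autolinks, but only in the bot's templated comments. */
  static String convert(String authorName, String body) {
    if (!BOT_AUTHOR.equals(authorName)) {
      return body; // leave human-written comments untouched
    }
    return BRACKETED_URL.matcher(body).replaceAll("<$1>");
  }
}
```

Restricting the rewrite to one author keeps the regular expression away from free-form text, which is the risk raised earlier in the thread.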
[GitHub] [lucene-jira-archive] mocobeta commented on issue #54: Hyperlinks are sometimes not actual links on import
mocobeta commented on issue #54: URL: https://github.com/apache/lucene-jira-archive/issues/54#issuecomment-1190348875 > Or perhaps we specialize this to commitbot comments only? If this is okay with you, I can make a PR.
[GitHub] [lucene] mayya-sharipova commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on PR #992: URL: https://github.com/apache/lucene/pull/992#issuecomment-1190373448 > Would you be able to check how the indexing rate compares when index sorting is enabled? @jpountz Thanks, I will do the comparison.
[GitHub] [lucene] JoeHF commented on a diff in pull request #1003: LUCENE-10616: optimizing decompress when only retrieving some fields
JoeHF commented on code in PR #1003: URL: https://github.com/apache/lucene/pull/1003#discussion_r925768424

## lucene/core/src/java/org/apache/lucene/document/DocumentStoredFieldVisitor.java: ##
@@ -98,6 +100,16 @@ public void doubleField(FieldInfo fieldInfo, double value) {
   @Override
   public Status needsField(FieldInfo fieldInfo) throws IOException {
+    // return stop after collected all needed fields
+    if (fieldsToAdd != null
+        && !fieldsToAdd.contains(fieldInfo.name)
+        && fieldsToAdd.size()
+            == doc.getFields().stream()
+                .map(IndexableField::name)
+                .collect(Collectors.toSet())
+                .size()) {
+      return Status.STOP;

Review Comment: You are right, this produced errors in the test case https://github.com/apache/lucene/pull/1003/files#diff-4439cae82856043dfe05c058daac8c23433110d9b0cf7a783edf0b63c1bc423dR100 The only way I can think of is to sort by field name before writing stored fields, so that multiple values for the same field are adjacent to each other. Only in that sorted case can we stop early. But that reader mode is not backward compatible with old stored fields files. How do we solve this issue? Or do we have other easy options to return early? @jpountz
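The failure mode discussed in this review comment can be reproduced without Lucene. The sketch below is an assumption-laden model, not the PR's code: `EarlyStopDemo` is a hypothetical class, and stored fields are reduced to a list of field names in stored order. It applies the same rule as the diff (stop at the first unwanted field once every wanted name has been seen) and shows that a later value of a multi-valued field is lost unless values of the same field are adjacent.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the early-stop rule; field values are omitted for brevity. */
final class EarlyStopDemo {
  /** Returns the names of the fields that would be collected before stopping. */
  static List<String> collectWithEarlyStop(List<String> storedOrder, Set<String> wanted) {
    List<String> collected = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    for (String name : storedOrder) {
      boolean isWanted = wanted.contains(name);
      if (!isWanted && seen.size() == wanted.size()) {
        break; // corresponds to returning Status.STOP in the diff
      }
      if (isWanted) {
        collected.add(name);
        seen.add(name);
      }
    }
    return collected;
  }
}
```

With stored order `a, b, c, a` and wanted fields `{a, b}`, the loop stops at `c` and never reaches the second `a`. With the sorted order `a, a, b, c`, both values of `a` are collected before the stop, which is why the comment ties early termination to sorted field names.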
[GitHub] [lucene] gsmiller commented on pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order
gsmiller commented on PR #1013: URL: https://github.com/apache/lucene/pull/1013#issuecomment-1190614346 Sounds good. Thanks @Yuti-G!
[jira] [Created] (LUCENE-10659) Fix random TestDisiPriorityQueue bug
Greg Miller created LUCENE-10659: Summary: Fix random TestDisiPriorityQueue bug Key: LUCENE-10659 URL: https://issues.apache.org/jira/browse/LUCENE-10659 Project: Lucene - Core Issue Type: Bug Affects Versions: 9.3 Reporter: Greg Miller A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we should roll it into the 9.3 release. I'll prepare a PR, but raising it here for visibility.
[GitHub] [lucene] mayya-sharipova commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on PR #992: URL: https://github.com/apache/lucene/pull/992#issuecomment-1190803247

@jpountz I have run another set of benchmarks on dataset **sift-128-euclidean M:16 efConstruction:100 with index sort on SortField.Type.LONG**, where I added an extra index sort field: `NumericDocValuesField` with random long values.

Observed results:
- the whole indexing + flush is slightly faster on the candidate (548s in candidate vs 654s in baseline)
- baseline: indexing is fast, but flush takes 653 sec
- candidate: indexing takes most of the time, and flush is very fast: 3 sec

Comparison with the [unsorted case](https://github.com/apache/lucene/pull/992#issuecomment-1178060346) that was done before:
- baseline: indexing time increased from 533s to 654s
- candidate: indexing time increased from 538s to 548s
- in particular, reconstructing the graph using new ordinals doesn't seem to take much time: 866 ms (about 0.9 s)

**Baseline**
```bash
IW 0 [2022-07-20T21:00:49.727575Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
Done indexing 100 documents; now flush
IW 0 [2022-07-20T21:00:51.099538Z; main]: now flush at close
IW 0 [2022-07-20T21:00:51.100162Z; main]: start flush: applyAllDeletes=true
IW 0 [2022-07-20T21:00:51.100936Z; main]: index before flush
DW 0 [2022-07-20T21:00:51.101006Z; main]: startFullFlush
DW 0 [2022-07-20T21:00:51.107445Z; main]: anyChanges? numDocsInRam=100 deletes=false hasTickets:false pendingChangesInFullFlush: false
DWPT 0 [2022-07-20T21:00:51.119428Z; main]: flush postings as segment _3 numDocs=100
IW 0 [2022-07-20T21:00:51.715470Z; main]: 0 msec to write norms
IW 0 [2022-07-20T21:00:51.852081Z; main]: 136 msec to write docValues
IW 0 [2022-07-20T21:00:51.852305Z; main]: 0 msec to write points
HNSW 0 [2022-07-20T21:00:53.264684Z; main]: build graph from 100 vectors
HNSW 0 [2022-07-20T21:11:34.590292Z; main]: built 99 in 7288/641320 ms
HNSW 0 [2022-07-20T21:11:34.590292Z; main]: built 99 in 7288/641320 ms
IW 0 [2022-07-20T21:11:42.662461Z; main]: 650804 msec to write vectors
IW 0 [2022-07-20T21:11:43.334377Z; main]: 671 msec to finish stored fields
IW 0 [2022-07-20T21:11:43.334611Z; main]: 0 msec to write postings and finish vectors
IW 0 [2022-07-20T21:11:43.336506Z; main]: 0 msec to write fieldInfos
DWPT 0 [2022-07-20T21:11:44.244388Z; main]: flush time 653120.381917 msec
IW 0 [2022-07-20T21:11:44.247650Z; main]: publishFlushedSegment _3(10.0.0):c100:[indexSort=]:...
Indexed 100 documents in 654s
```

**Candidate**
```bash
IW 0 [2022-07-20T18:35:41.879858Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
Done indexing 100 documents; now flush
IW 0 [2022-07-20T18:44:46.109074Z; main]: now flush at close
IW 0 [2022-07-20T18:44:46.109804Z; main]: start flush: applyAllDeletes=true
IW 0 [2022-07-20T18:44:46.110587Z; main]: index before flush
DW 0 [2022-07-20T18:44:46.110689Z; main]: startFullFlush
DW 0 [2022-07-20T18:44:46.115672Z; main]: anyChanges? numDocsInRam=100 deletes=false hasTickets:false pendingChangesInFullFlush: false
DWPT 0 [2022-07-20T18:44:46.126626Z; main]: flush postings as segment _2 numDocs=100
IW 0 [2022-07-20T18:44:46.741747Z; main]: 0 msec to write norms
IW 0 [2022-07-20T18:44:46.864200Z; main]: 121 msec to write docValues
IW 0 [2022-07-20T18:44:46.864364Z; main]: 0 msec to write points
IndexWriter 0 [2022-07-20T18:44:47.609637Z; main]: starting reconstructing graph ordinals 63362025298959
IndexWriter 0 [2022-07-20T18:44:48.476035Z; main]: finished reconstructing graph ordinals 63362892156709
IW 0 [2022-07-20T18:44:48.481920Z; main]: 1617 msec to write vectors
IW 0 [2022-07-20T18:44:49.166673Z; main]: 683 msec to finish stored fields
IW 0 [2022-07-20T18:44:49.167432Z; main]: 0 msec to write postings and finish vectors
IW 0 [2022-07-20T18:44:49.174701Z; main]: 6 msec to write fieldInfos
IFD 0 [2022-07-20T18:44:50.072852Z; main]: now checkpoint "_2(10.0.0):c100:[indexSort=]:..
DWPT 0 [2022-07-20T18:44:50.058801Z; main]: flush time 3931.69475 msec
Indexed 100 documents in 548s
```
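The durations quoted in this comment can be cross-checked against the log lines themselves. The sketch below is illustrative only: `FlushLogTimings` is a hypothetical helper, and it assumes the trailing numbers on the two "reconstructing graph ordinals" lines are nanosecond clock readings (for example from `System.nanoTime()`), which the log does not state explicitly.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for checking durations quoted from infoStream output. */
final class FlushLogTimings {
  /** Elapsed whole milliseconds between two ISO-8601 log timestamps. */
  static long millisBetween(String startIso, String endIso) {
    return Duration.between(Instant.parse(startIso), Instant.parse(endIso)).toMillis();
  }

  /** Elapsed whole milliseconds between two nanosecond clock readings. */
  static long nanosToMillis(long startNanos, long endNanos) {
    return Duration.ofNanos(endNanos - startNanos).toMillis();
  }
}
```

Under that assumption, `nanosToMillis(63362025298959L, 63362892156709L)` gives 866, matching the reported remapping time. `millisBetween("2022-07-20T21:00:53.264684Z", "2022-07-20T21:11:34.590292Z")` gives 641325, close to the 641320 ms the baseline log reports for graph building.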
[jira] [Commented] (LUCENE-10659) Fix random TestDisiPriorityQueue bug
[ https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569184#comment-17569184 ] Greg Miller commented on LUCENE-10659: -- PR for pulling this fix into 9.3: https://github.com/apache/lucene/pull/1038 > Fix random TestDisiPriorityQueue bug > > > Key: LUCENE-10659 > URL: https://issues.apache.org/jira/browse/LUCENE-10659 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 9.3 >Reporter: Greg Miller >Priority: Minor > > A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly > trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we > should roll it into the 9.3 release. I'll prepare a PR, but raising it here > for visibility.
[jira] [Updated] (LUCENE-10659) Fix random TestDisiPriorityQueue bug
[ https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-10659: - Priority: Blocker (was: Minor) > Fix random TestDisiPriorityQueue bug > > > Key: LUCENE-10659 > URL: https://issues.apache.org/jira/browse/LUCENE-10659 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 9.3 >Reporter: Greg Miller >Priority: Blocker > > A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly > trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we > should roll it into the 9.3 release. I'll prepare a PR, but raising it here > for visibility.
[GitHub] [lucene-jira-archive] mikemccand commented on issue #54: Hyperlinks are sometimes not actual links on import
mikemccand commented on issue #54: URL: https://github.com/apache/lucene-jira-archive/issues/54#issuecomment-1190854358 > If this is okay with you, I can make a PR. +1, thank you @mocobeta!
[jira] [Commented] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?
[ https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569208#comment-17569208 ] Julie Tibshirani commented on LUCENE-10655: --- [~sokolov] I ran a bunch of similar experiments when putting together https://github.com/apache/lucene/pull/641. I reached the same conclusions. For the hash set question, I opened https://issues.apache.org/jira/browse/LUCENE-10404 -- we've been discussing a bit there. > can we optimize visited bitset usage in HNSW graph search/indexing? > --- > > Key: LUCENE-10655 > URL: https://issues.apache.org/jira/browse/LUCENE-10655 > Project: Lucene - Core > Issue Type: Improvement > Components: core/hnsw >Reporter: Michael Sokolov >Priority: Major > > When running {{luceneutil}} I noticed that {{FixedBitSet.clear()}} dominates > the CPU profiler output. I had a few ideas: > # In upper graph layers, the occupied nodes are very sparse - maybe > {{SparseFixedBitSet}} would be a better fit for those > # We are caching these bitsets, but they are only used for a single search > (single document insert, during indexing). Should we cache across searches? > We would need to pool them though, and they would vary by field since fields > can have different numbers of vector nodes. This starts to get complex > # Are we sure that clearing a bitset is more efficient than allocating a new > one? Maybe the JDK maintains a pool of already-zeroed memory for us > I think we could try specializing the bitset type by graph level, and then I > think we ought to measure the performance of allocation vs the limited reuse > that we currently have.
[GitHub] [lucene] jtibshirani commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing
jtibshirani commented on PR #992: URL: https://github.com/apache/lucene/pull/992#issuecomment-1190948650 Thanks for running these new benchmarks. It's good to see that the remapping time isn't too high. It's a bit confusing that the baseline slows down so much from 533s to 654s, which is almost 2 minutes slower. Do you have a sense for why this is? I wonder if graph building time can vary a lot based on what order the vectors are processed.
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569237#comment-17569237 ] Julie Tibshirani commented on LUCENE-10404: --- As a note, LUCENE-10592 changes the index strategy so that we build the graph as each document is added, instead of waiting until 'flush'. In the PR, graph building still shares a single FixedBitSet to track the 'visited' set, but it's continuously resized since we don't know the full number of docs up-front. So maybe switching to a hash set could help even more after that change is merged. > Use hash set for visited nodes in HNSW search? > -- > > Key: LUCENE-10404 > URL: https://issues.apache.org/jira/browse/LUCENE-10404 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Minor > > While searching each layer, HNSW tracks the nodes it has already visited > using a BitSet. We could look into using something like IntHashSet instead. I > tried out the idea quickly by switching to IntIntHashMap (which has already > been copied from hppc) and saw an improvement in index performance. > *Baseline:* 760896 msec to write vectors > *Using IntIntHashMap:* 733017 msec to write vectors > I noticed search performance actually got a little bit worse with the change > -- that is something to look into. > For background, it's good to be aware that HNSW can visit a lot of nodes. For > example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search > visits ~1000 - 15,000 docs depending on the recall. This number can increase > when searching with deleted docs, especially if you hit a "pathological" case > where the deleted docs happen to be closest to the query vector.
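The trade-off described in this issue (a bitset sized to the whole graph versus a hash set sized to the nodes actually visited) can be sketched with JDK types. Note the swap: `java.util.BitSet` and `HashSet<Integer>` are used here instead of Lucene's `FixedBitSet` and the hppc-derived primitive collections, so this shows the shape of the idea, not its performance. `VisitedNodes` and both implementations are hypothetical names.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical abstraction over the 'visited' set used during a graph search. */
interface VisitedNodes {
  /** Marks the node as visited; returns true if it had not been visited before. */
  boolean markVisited(int node);

  /** Forgets all visited nodes so the structure can be reused for the next search. */
  void reset();
}

/** Dense variant: memory is proportional to the number of nodes in the graph. */
final class BitSetVisited implements VisitedNodes {
  private final BitSet bits;

  BitSetVisited(int numNodes) {
    this.bits = new BitSet(numNodes);
  }

  @Override
  public boolean markVisited(int node) {
    if (bits.get(node)) {
      return false;
    }
    bits.set(node);
    return true;
  }

  @Override
  public void reset() {
    bits.clear(); // work depends on the highest node id set, not on how many were visited
  }
}

/** Sparse variant: memory follows the number of nodes a search actually touches. */
final class HashSetVisited implements VisitedNodes {
  private final Set<Integer> seen = new HashSet<>();

  @Override
  public boolean markVisited(int node) {
    return seen.add(node); // boxes the int; a primitive int set would avoid that
  }

  @Override
  public void reset() {
    seen.clear();
  }
}
```

Since a search visits on the order of thousands of nodes out of roughly a million, the sparse variant touches far less memory per search, though the issue notes that search performance got slightly worse in the quick experiment, so measurement is still needed.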
[GitHub] [lucene] jtibshirani commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
jtibshirani commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r926196192

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java: ##
@@ -203,8 +204,11 @@ private NeighborQueue searchLevel(
     return results;
   }
-  private void clearScratchState() {
+  private void clearScratchState(int capacity) {
     candidates.clear();
+    if (visited.length() < capacity) {
+      visited = FixedBitSet.ensureCapacity((FixedBitSet) visited, capacity);

Review Comment: I just realized that we're doing a cast, which is pretty tricky/fragile. The check `visited.length() < capacity` is only true if we are building the graph (not searching), and `HnswGraphBuilder` happens to always use `FixedBitSet`. As a follow-up maybe we should consider LUCENE-10404 or something similar, which chooses a better 'visited' data structure and doesn't require us to do this cast + resize.
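One way to avoid the cast flagged in this review is to let the 'visited' structure own its resizing, so the caller only states the capacity it needs. The sketch below is an assumption, not the PR's code or a Lucene API: `GrowableVisitedBits` is a hypothetical class backed by a plain `long[]`.

```java
import java.util.Arrays;

/** Hypothetical 'visited' bitset that grows on demand, so callers never cast or resize. */
final class GrowableVisitedBits {
  private long[] words;

  GrowableVisitedBits(int initialCapacity) {
    this.words = new long[wordsFor(initialCapacity)];
  }

  private static int wordsFor(int numBits) {
    return (numBits + 63) >>> 6; // 64 bits per long, rounded up
  }

  /** Number of node ids that can currently be tracked. */
  int capacity() {
    return words.length << 6;
  }

  /** Clears all bits and makes room for node ids in [0, capacity). */
  void resetFor(int capacity) {
    int needed = wordsFor(capacity);
    if (needed > words.length) {
      words = new long[needed]; // a fresh array is already zeroed
    } else {
      Arrays.fill(words, 0L);
    }
  }

  /** Marks the node as visited and returns whether it was already visited. */
  boolean getAndSet(int node) {
    int wordIndex = node >>> 6;
    long mask = 1L << node; // long shifts use only the low 6 bits of the distance
    boolean wasSet = (words[wordIndex] & mask) != 0;
    words[wordIndex] |= mask;
    return wasSet;
  }
}
```

With something like this, a method such as `clearScratchState(int capacity)` could call `visited.resetFor(capacity)` for both searching and graph building, without depending on which concrete bitset type the builder happens to use.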
[GitHub] [lucene] zacharymorn opened a new pull request, #1039: LUCENE-10635: Ensure test coverage for WANDScorer by using a test query
zacharymorn opened a new pull request, #1039: URL: https://github.com/apache/lucene/pull/1039 ### Description (or a Jira issue link if you have one) Ensure test coverage for WANDScorer by using a test query
[GitHub] [lucene] zacharymorn commented on pull request #1039: LUCENE-10635: Ensure test coverage for WANDScorer by using a test query
zacharymorn commented on PR #1039: URL: https://github.com/apache/lucene/pull/1039#issuecomment-1191043324 I guess this will go into `10.0.0`, as `9.3` has already been cut and the PR is test-only?