date:20220720

[jira] [Commented] (LUCENE-10658) Merges should periodically check for abort

2022-07-20 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568962#comment-17568962
 ] 

Michael McCandless commented on LUCENE-10658:
-

+1, merges should abort promptly.  But it is indeed only a "best effort" 
mechanism.

I guess Lucene's completion field is building FSTs during merging and not 
writing bytes to disk as it builds the large FST, until the end?

Maybe there are other parts of Lucene merging that also fail to check promptly 
enough, e.g. maybe when dimensional points are doing a (large) offline sort 
before writing anything to the output files?

Maybe we could instrument {{MergeRateLimiter}} to write a WARNING into 
{{infoStream}} whenever too much time has elapsed between visits to its 
{{maybePause}} API?  We could use that to tease out other places that are 
failing to write bytes frequently enough for abort checking.

Lucene used to check for merge abort deep inside {{IndexWriter}} and merging 
code (e.g. merging postings would check periodically, same for doc values, 
etc.), but I think we refactored that down to the rate limiter only in 
LUCENE-7700 which was a nice cleanup / step forward.

> Merges should periodically check for abort
> --
>
> Key: LUCENE-10658
> URL: https://issues.apache.org/jira/browse/LUCENE-10658
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.3
>Reporter: Nhat Nguyen
>Priority: Major
>
> Rolling back an IndexWriter without committing shouldn't take long (i.e., 
> less than several seconds), and Elasticsearch cluster coordination [relies 
> on|https://github.com/elastic/elasticsearch/issues/88055] this assumption. If 
> some merges are taking place, the rollback can take several minutes as merges 
> only check for abort when writing to files via 
> [MergeRateLimiter|https://github.com/apache/lucene/blob/3d7d85f245381f84c46c766119695a8645cde2b8/lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java#L117-L119].
>  Merging a completion field, for example, can take a long time without 
> touching output files. Another reason merges should periodically check for 
> abort is that, on rollback, their outputs will be discarded anyway.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Updated] (LUCENE-10656) It is unnecessary that using `limit` to check boundary

2022-07-20 Thread Lu Xugang (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-10656:
---
Fix Version/s: (was: 9.3)

> It is unnecessary that using `limit` to check boundary
> --
>
> Key: LUCENE-10656
> URL: https://issues.apache.org/jira/browse/LUCENE-10656
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> follow-up discussion in [https://github.com/apache/lucene/pull/1021]




[GitHub] [lucene] luyuncheng commented on a diff in pull request #987: LUCENE-10627: Using ByteBuffersDataInput reduce memory copy on compressing data

2022-07-20 Thread GitBox



luyuncheng commented on code in PR #987:
URL: https://github.com/apache/lucene/pull/987#discussion_r925476227


##
lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java:
##
@@ -165,6 +165,36 @@ public void readBytes(byte[] arr, int off, int len) throws EOFException {
   }
 }
   }
+  /**

Review Comment:
   +1. It seems that spotless cannot format this case, so I fixed it 
manually.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene-jira-archive] mocobeta commented on issue #27: Improve the `Jira Information` header?

2022-07-20 Thread GitBox



mocobeta commented on issue #27:
URL: 
https://github.com/apache/lucene-jira-archive/issues/27#issuecomment-1190202161

   Hi @mikemccand, I plan to start the next (hopefully last) migration test for 
all issues on July 25th. Could you merge any improvements you have in mind for 
`jira2github_import.py` by then, if possible?
   Thanks.



[GitHub] [lucene-jira-archive] mocobeta commented on issue #54: Hyperlinks are sometimes not actual links on import

2022-07-20 Thread GitBox



mocobeta commented on issue #54:
URL: 
https://github.com/apache/lucene-jira-archive/issues/54#issuecomment-1190208864

   I'll take a look.



[GitHub] [lucene-jira-archive] mocobeta commented on issue #54: Hyperlinks are sometimes not actual links on import

2022-07-20 Thread GitBox



mocobeta commented on issue #54:
URL: 
https://github.com/apache/lucene-jira-archive/issues/54#issuecomment-1190235323

   It looks like this is expected behavior of GitHub's Markdown rendering. A 
string surrounded by `[]` is not interpreted as a hyperlink.
   
   ![Screenshot from 2022-07-20 
21-32-25](https://user-images.githubusercontent.com/1825333/179982896-e0ec79df-074a-486d-9521-573c7b8c9f89.png)
   
   We could remove `[` and `]` if and only if they enclose a URL, but I think 
it'd be a bit risky?
   
   



[GitHub] [lucene-jira-archive] mikemccand commented on issue #54: Hyperlinks are sometimes not actual links on import

2022-07-20 Thread GitBox



mikemccand commented on issue #54:
URL: 
https://github.com/apache/lucene-jira-archive/issues/54#issuecomment-1190282496

   Ahh thanks @mocobeta!  I wonder why GitHub doesn't render links in [ .. ]?
   
   The problem is, this is a fairly frequent occurrence since `commitbot` 
formats the link in its comments that way.
   
   I agree it'd be risky to do this on the input we send to the Jira -> MD 
converter, but maybe not so risky if we do it on the output?  I.e. if, after 
conversion, we see a URL-like string (hmm, regexp parsing these is tricky), 
then we turn it into a hyperlink?
   
   Or perhaps we specialize this to commitbot comments only?
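
A minimal sketch of the output-side idea, assuming a hypothetical `BracketedUrlFixer` post-processing step (it is not taken from `jira2github_import.py`). It rewrites `[ URL ]` as a Markdown autolink `<URL>` and skips brackets that are already the text of a Markdown link.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical post-processing step applied to the converted Markdown. */
final class BracketedUrlFixer {
  // "[ http(s)://... ]" where the URL contains no whitespace or "]".
  // The negative lookahead leaves existing "[text](target)" links alone.
  private static final Pattern BRACKETED_URL =
      Pattern.compile("\\[\\s*(https?://[^\\s\\]]+)\\s*\\](?!\\()");

  /** Rewrites each "[ URL ]" as "<URL>"; all other text is returned unchanged. */
  static String fix(String markdown) {
    Matcher m = BRACKETED_URL.matcher(markdown);
    return m.replaceAll("<$1>");
  }
}
```

The pattern is deliberately narrow: it only fires when the brackets contain nothing but a single URL, which limits the risk of touching text a human wrote on purpose.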



[GitHub] [lucene-jira-archive] mikemccand commented on issue #27: Improve the `Jira Information` header?

2022-07-20 Thread GitBox



mikemccand commented on issue #27:
URL: 
https://github.com/apache/lucene-jira-archive/issues/27#issuecomment-1190337880

   > Hi @mikemccand, I plan to start the next (hopefully last) migration test 
for all issues on July 25th. Could you merge any improvements you have in mind 
for `jira2github_import.py` by then, if possible?
   
   Will do.  Thanks @mocobeta!



[GitHub] [lucene-jira-archive] mocobeta commented on issue #54: Hyperlinks are sometimes not actual links on import

2022-07-20 Thread GitBox



mocobeta commented on issue #54:
URL: 
https://github.com/apache/lucene-jira-archive/issues/54#issuecomment-1190347139

   I think humans are unlikely to make comments such as `[ http:// ]`?
   
   I'm +1 to only apply a special conversion for commitbot's comments; it's 
easy to filter comments by their author, and it'd be fairly safe to tweak the 
templated comments?
   ```
 "comments": [
   {
     "self": "https://issues.apache.org/jira/rest/api/2/issue/13469553/comment/17562453",
     "id": "17562453",
     "author": {
       "self": "https://issues.apache.org/jira/rest/api/2/user?username=jira-bot",
       "name": "jira-bot",
       "key": "jira-bot",
       "avatarUrls": {
         "48x48": "https://issues.apache.org/jira/secure/useravatar?avatarId=10452",
         "24x24": "https://issues.apache.org/jira/secure/useravatar?size=small&avatarId=10452",
         "16x16": "https://issues.apache.org/jira/secure/useravatar?size=xsmall&avatarId=10452",
         "32x32": "https://issues.apache.org/jira/secure/useravatar?size=medium&avatarId=10452"
       },
       "displayName": "ASF subversion and git services",
       "active": true,
       "timeZone": "Etc/UTC"
     },
     "body": "Commit 3dd9a5487c2c3994abdaf5ab0553a3d78ebe50ab in lucene's branch refs/heads/main from Adrien Grand\n[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3dd9a5487c2 ]\n\nLUCENE-10636: Avoid computing the same scores multiple times. (#1005)\n\n`BlockMaxMaxscoreScorer` would previously compute the score twice for essential\r\nscorers.\r\n\r\nCo-authored-by: zacharymorn ",
     "updateAuthor": {
       "self": "https://issues.apache.org/jira/rest/api/2/user?username=jira-bot",
       "name": "jira-bot",
       "key": "jira-bot",
       "avatarUrls": {
         "48x48": "https://issues.apache.org/jira/secure/useravatar?avatarId=10452",
         "24x24": "https://issues.apache.org/jira/secure/useravatar?size=small&avatarId=10452",
         "16x16": "https://issues.apache.org/jira/secure/useravatar?size=xsmall&avatarId=10452",
         "32x32": "https://issues.apache.org/jira/secure/useravatar?size=medium&avatarId=10452"
       },
       "displayName": "ASF subversion and git services",
       "active": true,
       "timeZone": "Etc/UTC"
     },
     "created": "2022-07-05T08:14:08.255+",
     "updated": "2022-07-05T08:14:08.255+"
   }
   ```
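
A minimal sketch of the author-gated conversion, assuming commitbot comments can be recognized by the author name `jira-bot` shown in the JSON above. The class `CommitBotComments` and its method are hypothetical and only illustrate the filtering step; the real importer is a Python script and may be structured differently.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: apply the link rewrite only to templated bot comments. */
final class CommitBotComments {
  // Assumption: the bot's comments carry this author name, as in the JSON sample.
  private static final String BOT_NAME = "jira-bot";
  // The bot template puts the URL alone between "[ " and " ]".
  private static final Pattern BRACKETED_URL = Pattern.compile("\\[ (https?://\\S+) \\]");

  /** Returns the body unchanged for human authors; rewrites "[ URL ]" for the bot. */
  static String convertBody(String authorName, String body) {
    if (!BOT_NAME.equals(authorName)) {
      return body; // human comments are left alone
    }
    return BRACKETED_URL.matcher(body).replaceAll("<$1>");
  }
}
```

Gating on the author keeps the rewrite away from free-form text, so the pattern only ever sees the bot's fixed template.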



[GitHub] [lucene-jira-archive] mocobeta commented on issue #54: Hyperlinks are sometimes not actual links on import

2022-07-20 Thread GitBox



mocobeta commented on issue #54:
URL: 
https://github.com/apache/lucene-jira-archive/issues/54#issuecomment-1190348875

   > Or perhaps we specialize this to commitbot comments only?
   
   If this is okay with you, I can make a PR.



[GitHub] [lucene] mayya-sharipova commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-20 Thread GitBox



mayya-sharipova commented on PR #992:
URL: https://github.com/apache/lucene/pull/992#issuecomment-1190373448

   > Would you be able to check how the indexing rate compares when index 
sorting is enabled?
   
   @jpountz Thanks, I will do the comparison. 



[GitHub] [lucene] JoeHF commented on a diff in pull request #1003: LUCENE-10616: optimizing decompress when only retrieving some fields

2022-07-20 Thread GitBox



JoeHF commented on code in PR #1003:
URL: https://github.com/apache/lucene/pull/1003#discussion_r925768424


##
lucene/core/src/java/org/apache/lucene/document/DocumentStoredFieldVisitor.java:
##
@@ -98,6 +100,16 @@ public void doubleField(FieldInfo fieldInfo, double value) {
 
   @Override
   public Status needsField(FieldInfo fieldInfo) throws IOException {
+// return stop after collected all needed fields
+if (fieldsToAdd != null
+&& !fieldsToAdd.contains(fieldInfo.name)
+&& fieldsToAdd.size()
+== doc.getFields().stream()
+.map(IndexableField::name)
+.collect(Collectors.toSet())
+.size()) {
+  return Status.STOP;

Review Comment:
   You are right, this produces errors in the test case 
https://github.com/apache/lucene/pull/1003/files#diff-4439cae82856043dfe05c058daac8c23433110d9b0cf7a783edf0b63c1bc423dR100
   The only way I can think of is to sort fields by name before writing stored 
fields, so that multiple values for the same field are adjacent. Only in 
that sorted case can we stop early. But this read mode is not backward 
compatible with existing stored fields files. How do we solve this issue? Or do 
we have other easy options to return early? @jpountz 
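
A small self-contained simulation of the problem described above, assuming stored fields are visited in on-disk order. `EarlyStopDemo` is a hypothetical stand-in for the visitor and does not use Lucene classes; its stop condition mirrors the one in the diff (stop at the first unwanted field once every wanted name has been seen).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical simulation of the early-stop condition from the diff. */
final class EarlyStopDemo {
  /**
   * Walks field names in stored order and collects the wanted ones. With earlyStop,
   * it stops at the first unwanted field after each wanted name was seen at least once.
   */
  static List<String> collect(List<String> storedOrder, Set<String> wanted, boolean earlyStop) {
    List<String> collected = new ArrayList<>();
    Set<String> seenNames = new HashSet<>();
    for (String name : storedOrder) {
      if (wanted.contains(name)) {
        collected.add(name);
        seenNames.add(name);
      } else if (earlyStop && seenNames.size() == wanted.size()) {
        break; // corresponds to returning Status.STOP
      }
    }
    return collected;
  }
}
```

With stored order `title, body, title` and only `title` wanted, the early stop returns one value instead of two. With values of the same field adjacent (`body, title, title, ...`), stopping early loses nothing.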




[GitHub] [lucene] gsmiller commented on pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order

2022-07-20 Thread GitBox



gsmiller commented on PR #1013:
URL: https://github.com/apache/lucene/pull/1013#issuecomment-1190614346

   Sounds good. Thanks @Yuti-G!



[jira] [Created] (LUCENE-10659) Fix random TestDisiPriorityQueue bug

2022-07-20 Thread Greg Miller (Jira)

Greg Miller created LUCENE-10659:


 Summary: Fix random TestDisiPriorityQueue bug
 Key: LUCENE-10659
 URL: https://issues.apache.org/jira/browse/LUCENE-10659
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 9.3
Reporter: Greg Miller


A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly 
trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we 
should roll it into the 9.3 release. I'll prepare a PR, but raising it here for 
visibility.




[GitHub] [lucene] mayya-sharipova commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-20 Thread GitBox



mayya-sharipova commented on PR #992:
URL: https://github.com/apache/lucene/pull/992#issuecomment-1190803247

   @jpountz  I have run another set of benchmarks on  dataset 
   **sift-128-euclidean M:16 efConstruction:100 with index sort on 
SortField.Type.LONG**, where I added an extra index sort field: 
`NumericDocValuesField` with random long values. 
   
   Observed results:
   
   - the whole indexing + flush is slightly faster on the candidate (548 s 
in candidate vs. 654 s in baseline)
   - baseline: indexing is fast, but flush takes 653 sec
   - candidate: indexing takes most time, and flush is very fast - 3 sec
   
   
   Comparison with [unsorted 
case](https://github.com/apache/lucene/pull/992#issuecomment-1178060346) that 
was done before:
   - baseline: indexing time increased from 533 s to 654 s
   - candidate: indexing time increased from 538 s to 548 s
     - in particular, reconstructing the graph using new ordinals doesn't 
seem to take much time: 866 ms (about 0.87 s)
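
The numbers above can be re-derived with a few lines of arithmetic. This sketch assumes the two values in the candidate log's "reconstructing graph ordinals" lines (63362025298959 and 63362892156709) are nanosecond clock readings; `BenchmarkMath` is a hypothetical helper introduced only for this check.

```java
/** Hypothetical helper for checking the benchmark arithmetic. */
final class BenchmarkMath {
  /** Whole milliseconds elapsed between two nanosecond readings. */
  static long elapsedMillis(long startNanos, long endNanos) {
    return (endNanos - startNanos) / 1_000_000L;
  }

  /** Relative change from before to after, in percent. */
  static double percentChange(double before, double after) {
    return (after - before) / before * 100.0;
  }
}
```

Under that assumption the remapping step comes out at 866 ms, and index sorting costs the baseline roughly 22.7% (533 s to 654 s) versus roughly 1.9% for the candidate (538 s to 548 s).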
   
   
   **Baseline**
   
   ```bash
   IW 0 [2022-07-20T21:00:49.727575Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
   Done indexing 100 documents; now flush
   IW 0 [2022-07-20T21:00:51.099538Z; main]: now flush at close
   IW 0 [2022-07-20T21:00:51.100162Z; main]:   start flush: applyAllDeletes=true
   IW 0 [2022-07-20T21:00:51.100936Z; main]:   index before flush
   DW 0 [2022-07-20T21:00:51.101006Z; main]: startFullFlush
   DW 0 [2022-07-20T21:00:51.107445Z; main]: anyChanges? numDocsInRam=100 
deletes=false hasTickets:false pendingChangesInFullFlush: false
   DWPT 0 [2022-07-20T21:00:51.119428Z; main]: flush postings as segment _3 
numDocs=100
   IW 0 [2022-07-20T21:00:51.715470Z; main]: 0 msec to write norms
   IW 0 [2022-07-20T21:00:51.852081Z; main]: 136 msec to write docValues
   IW 0 [2022-07-20T21:00:51.852305Z; main]: 0 msec to write points
   HNSW 0 [2022-07-20T21:00:53.264684Z; main]: build graph from 100 vectors
   
   HNSW 0 [2022-07-20T21:11:34.590292Z; main]: built 99 in 7288/641320 ms
   HNSW 0 [2022-07-20T21:11:34.590292Z; main]: built 99 in 7288/641320 ms
   IW 0 [2022-07-20T21:11:42.662461Z; main]: 650804 msec to write vectors
   IW 0 [2022-07-20T21:11:43.334377Z; main]: 671 msec to finish stored fields
   IW 0 [2022-07-20T21:11:43.334611Z; main]: 0 msec to write postings and 
finish vectors
   IW 0 [2022-07-20T21:11:43.336506Z; main]: 0 msec to write fieldInfos
   
   DWPT 0 [2022-07-20T21:11:44.244388Z; main]: flush time 653120.381917 msec
   IW 0 [2022-07-20T21:11:44.247650Z; main]: publishFlushedSegment 
_3(10.0.0):c100:[indexSort=]:...
   
   Indexed 100 documents in 654s
   ```
   
   **Candidate**
   
   ```bash
   IW 0 [2022-07-20T18:35:41.879858Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
   Done indexing 100 documents; now flush
   IW 0 [2022-07-20T18:44:46.109074Z; main]: now flush at close
   IW 0 [2022-07-20T18:44:46.109804Z; main]:   start flush: applyAllDeletes=true
   IW 0 [2022-07-20T18:44:46.110587Z; main]:   index before flush
   DW 0 [2022-07-20T18:44:46.110689Z; main]: startFullFlush
   DW 0 [2022-07-20T18:44:46.115672Z; main]: anyChanges? numDocsInRam=100 
deletes=false hasTickets:false pendingChangesInFullFlush: false
   DWPT 0 [2022-07-20T18:44:46.126626Z; main]: flush postings as segment _2 
numDocs=100
   IW 0 [2022-07-20T18:44:46.741747Z; main]: 0 msec to write norms
   IW 0 [2022-07-20T18:44:46.864200Z; main]: 121 msec to write docValues
   IW 0 [2022-07-20T18:44:46.864364Z; main]: 0 msec to write points
   IndexWriter 0 [2022-07-20T18:44:47.609637Z; main]: starting reconstructing 
graph ordinals 63362025298959
   IndexWriter 0 [2022-07-20T18:44:48.476035Z; main]: finished reconstructing 
graph ordinals 63362892156709
   IW 0 [2022-07-20T18:44:48.481920Z; main]: 1617 msec to write vectors
   IW 0 [2022-07-20T18:44:49.166673Z; main]: 683 msec to finish stored fields
   IW 0 [2022-07-20T18:44:49.167432Z; main]: 0 msec to write postings and 
finish vectors
   IW 0 [2022-07-20T18:44:49.174701Z; main]: 6 msec to write fieldInfos
   
   IFD 0 [2022-07-20T18:44:50.072852Z; main]: now checkpoint 
"_2(10.0.0):c100:[indexSort=]:..
   
   DWPT 0 [2022-07-20T18:44:50.058801Z; main]: flush time 3931.69475 msec
   Indexed 100 documents in 548s
   ```



[jira] [Commented] (LUCENE-10659) Fix random TestDisiPriorityQueue bug

2022-07-20 Thread Greg Miller (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569184#comment-17569184
 ] 

Greg Miller commented on LUCENE-10659:
--

PR for pulling this fix into 9.3: https://github.com/apache/lucene/pull/1038

> Fix random TestDisiPriorityQueue bug
> 
>
> Key: LUCENE-10659
> URL: https://issues.apache.org/jira/browse/LUCENE-10659
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.3
>Reporter: Greg Miller
>Priority: Minor
>
> A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly 
> trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we 
> should roll it into the 9.3 release. I'll prepare a PR, but raising it here 
> for visibility.




[jira] [Updated] (LUCENE-10659) Fix random TestDisiPriorityQueue bug

2022-07-20 Thread Greg Miller (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-10659:
-
Priority: Blocker  (was: Minor)

> Fix random TestDisiPriorityQueue bug
> 
>
> Key: LUCENE-10659
> URL: https://issues.apache.org/jira/browse/LUCENE-10659
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.3
>Reporter: Greg Miller
>Priority: Blocker
>
> A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly 
> trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we 
> should roll it into the 9.3 release. I'll prepare a PR, but raising it here 
> for visibility.




[GitHub] [lucene-jira-archive] mikemccand commented on issue #54: Hyperlinks are sometimes not actual links on import

2022-07-20 Thread GitBox



mikemccand commented on issue #54:
URL: 
https://github.com/apache/lucene-jira-archive/issues/54#issuecomment-1190854358

   > If this is okay with you, I can make a PR.
   
   +1, thank you @mocobeta!



[jira] [Commented] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-20 Thread Julie Tibshirani (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569208#comment-17569208
 ] 

Julie Tibshirani commented on LUCENE-10655:
---

[~sokolov] I ran a bunch of similar experiments when putting together 
https://github.com/apache/lucene/pull/641. I reached the same conclusions.

For the hash set question, I opened 
https://issues.apache.org/jira/browse/LUCENE-10404 -- we've been discussing a 
bit there.

> can we optimize visited bitset usage in HNSW graph search/indexing?
> ---
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/hnsw
>Reporter: Michael Sokolov
>Priority: Major
>
> When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
> the CPU profiler output. I had a few ideas:
>  # In upper graph layers, the occupied nodes are very sparse - maybe 
> {{SparseFixedBitSet}} would be a better fit for those
>  # We are caching these bitsets, but they are only used for a single search 
> (single document insert, during indexing). Should we cache across searches? 
> We would need to pool them though, and they would vary by field since fields 
> can have different numbers of vector nodes. This starts to get complex
>  # Are we sure that clearing a bitset is more efficient than allocating a new 
> one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I 
> think we ought to measure the performance of allocation vs the limited reuse 
> that we currently have.




[GitHub] [lucene] jtibshirani commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-20 Thread GitBox



jtibshirani commented on PR #992:
URL: https://github.com/apache/lucene/pull/992#issuecomment-1190948650

   Thanks for running these new benchmarks. It's good to see that the remapping 
time isn't too high.
   
   It's a bit confusing that the baseline slows down so much from 533s to 654s, 
which is almost 2 minutes slower. Do you have a sense for why this is? I wonder 
if graph building time can vary a lot based on what order the vectors are 
processed.



[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-20 Thread Julie Tibshirani (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569237#comment-17569237
 ] 

Julie Tibshirani commented on LUCENE-10404:
---

As a note, LUCENE-10592 changes the index strategy so that we build the graph 
as each document is added, instead of waiting until 'flush'. In the PR, graph 
building still shares a single FixedBitSet to track the 'visited' set, but it's 
continuously resized since we don't know the full number of docs up-front. So 
maybe switching to a hash set could help even more after that change is merged.

> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted docs, especially if you hit a "pathological" case 
> where the deleted docs happen to be closest to the query vector.




[GitHub] [lucene] jtibshirani commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-20 Thread GitBox



jtibshirani commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r926196192


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##
@@ -203,8 +204,11 @@ private NeighborQueue searchLevel(
 return results;
   }
 
-  private void clearScratchState() {
+  private void clearScratchState(int capacity) {
 candidates.clear();
+if (visited.length() < capacity) {
+  visited = FixedBitSet.ensureCapacity((FixedBitSet) visited, capacity);

Review Comment:
   I just realized that we're doing a cast, which is pretty tricky/fragile. The 
check `visited.length() < capacity` is only true if we are building the graph 
(not searching), and `HnswGraphBuilder` happens to always use `FixedBitSet`.
   
   As a follow-up maybe we should consider LUCENE-10404 or something similar, 
which chooses a better 'visited' data structure and doesn't require us to do 
this cast + resize.
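
One possible shape for such a follow-up, sketched under loud assumptions: `VisitedSet`, `GrowableBitVisitedSet`, and `HashVisitedSet` are hypothetical names, and the sketch uses `java.util.BitSet` and `java.util.HashSet` rather than Lucene's `FixedBitSet` or an hppc-style `IntHashSet`. It only illustrates how an interface could hide growth so the searcher needs neither the cast nor the explicit resize.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical abstraction over the 'visited' structure; not Lucene's API. */
interface VisitedSet {
  /** Marks the node as visited; returns true if it had not been visited before. */
  boolean markVisited(int node);

  /** Resets the set so it can be reused for the next search. */
  void clear();
}

/** Dense variant: java.util.BitSet grows on demand, so callers never resize it. */
final class GrowableBitVisitedSet implements VisitedSet {
  private final BitSet bits = new BitSet();

  public boolean markVisited(int node) {
    boolean fresh = !bits.get(node);
    bits.set(node);
    return fresh;
  }

  public void clear() {
    bits.clear();
  }
}

/** Sparse variant: memory and clearing cost scale with visited nodes, not graph size. */
final class HashVisitedSet implements VisitedSet {
  private final Set<Integer> nodes = new HashSet<>();

  public boolean markVisited(int node) {
    return nodes.add(node);
  }

  public void clear() {
    nodes.clear();
  }
}
```

The boxed `HashSet<Integer>` is used here only to keep the sketch dependency-free; a primitive int hash set would be the more realistic choice for a hot path.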




[GitHub] [lucene] zacharymorn opened a new pull request, #1039: LUCENE-10635: Ensure test coverage for WANDScorer by using a test query

2022-07-20 Thread GitBox



zacharymorn opened a new pull request, #1039:
URL: https://github.com/apache/lucene/pull/1039

   ### Description (or a Jira issue link if you have one)
   
   Ensure test coverage for WANDScorer by using a test query



[GitHub] [lucene] zacharymorn commented on pull request #1039: LUCENE-10635: Ensure test coverage for WANDScorer by using a test query

2022-07-20 Thread GitBox



zacharymorn commented on PR #1039:
URL: https://github.com/apache/lucene/pull/1039#issuecomment-1191043324

   I guess this will go into `10.0.0`, as `9.3` has already been cut and the PR 
is test only?

