[GitHub] [lucene] thecoop commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput
thecoop commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1301893972 This is ready to review. I've removed the threadlocal byte buffer. With testing in elasticsearch, this reduced the duplicated strings in these fields by 90%. For the linked tickets this would reduce the memory usage from these strings from 6GB to 600MB. There is no significant performance difference I can see from running this in lucenebench. In the cases where the new method is not called, this would just add an empty `HashMap` instance to each `DataInput` instance, with the same lifetime as the container. When `getCanonicalString` is called, this added a single hashmap lookup to the call compared to `getString`. It also requires memory for the backing `HashMap`, with the same lifecycle as the container, scaling with the number of distinct strings returned by that method. So this causes slightly more memory when deserializing, at the benefit of using drastically less memory once it is all deserialized and the data input & hashmap have been GCd. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] thecoop commented on issue #11712: Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale [LUCENE-10677]
thecoop commented on issue #11712: URL: https://github.com/apache/lucene/issues/11712#issuecomment-1301896942 @dweiss Please see #11847 for a proposed solution -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on a diff in pull request #11847: Add a method allowing canonical strings to be returned from DataInput
dweiss commented on code in PR #11847: URL: https://github.com/apache/lucene/pull/11847#discussion_r1012761533 ## lucene/core/src/java/org/apache/lucene/store/DataInput.java: ## @@ -280,6 +281,18 @@ public String readString() throws IOException { return new String(bytes, 0, length, StandardCharsets.UTF_8); } + private final Map canonicalStrings = new HashMap<>(); Review Comment: I think this breaks with Cloneable, doesn't it? Just the reference to the same map instance would be cloned. I have no strong opinion about the string cache itself. It's a separate, explicit method so the risk of it breaking existing stuff is small... if it indeed helps the problem then it looks fine to me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] thecoop commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput
thecoop commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1301947013 Another option is to have the `FieldInfosFormat` classes handle string pooling themselves, leaving `DataInput` unchanged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] thecoop commented on a diff in pull request #11847: Add a method allowing canonical strings to be returned from DataInput
thecoop commented on code in PR #11847: URL: https://github.com/apache/lucene/pull/11847#discussion_r1012767732 ## lucene/core/src/java/org/apache/lucene/store/DataInput.java: ## @@ -280,6 +281,18 @@ public String readString() throws IOException { return new String(bytes, 0, length, StandardCharsets.UTF_8); } + private final Map canonicalStrings = new HashMap<>(); Review Comment: Added code to handle cloning. Seeing as the variable is no longer final, I've also made it lazily initialised. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput
rmuir commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1301951432 I still have a problem with leaking memory on purpose. The issue here is elastic's bad design, not lucene. Otherwise you wouldnt have thousands of indexes X thousands of fields: that situation is your own creation. I'm -1 to adding this garbage to datainput. I don't care what it does for your abuse-case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on issue #11896: TestLucene91HnswVectorsFormat reproducible failure
rmuir commented on issue #11896: URL: https://github.com/apache/lucene/issues/11896#issuecomment-1301971180 The build fails in CI with this issue (multiple times), so i captured a seed to reproduce it. i don't think intellij has anything to do with this. If we aren't going to fix the test, I'll open a PR to delete it completely. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase opened a new issue, #11898: Release wizard: Remove obsolete JIRA instructions
iverase opened a new issue, #11898: URL: https://github.com/apache/lucene/issues/11898 ### Description The current release wizard still contains instructions that needs to be performed on JIRA. We are not using it any more so those instructions are obsolete and should be removed / replaced. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11860: GITHUB-11830 Better optimize storage for vector connections
rmuir commented on PR #11860: URL: https://github.com/apache/lucene/pull/11860#issuecomment-1302331407 > For this vector graph file, since we use the same bpv, we can easily find a node's neighbor list just by offseting using `nodeOrdinal * 32 * bpv`. It's important this is efficient because you can load 100s of neighbor lists per search. that's one way to do it, but seems like a self-created problem: you could also just encode "offset" directly which could be a zigzag delta, or use numerous other designs, that wouldn't require the same bpv everywhere. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dsmiley commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput
dsmiley commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1302705957 There is somewhat a matter of taste at stake, so I can appreciate Robert's point of view that DataInput is nice the way it is without adding this to it for an edge (abuse?) case. I've seen Lucene based indexes used in a variety of ways so I'm sympathetic for finding a path forward. If this optimization can be isolated to a small place in FieldInfosFormat and also fix the problem (?), will that address your concern Robert? Maybe we need to see what it'd look like first. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput
rmuir commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1302748754 I think if you get yourself in this situation via oversharding/dynamic fields/etc, then you can add `-XX:+UseStringDeduplication` to your commandline, problem solved. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #11881: Further optimize DrillSideways scoring
gsmiller commented on PR #11881: URL: https://github.com/apache/lucene/pull/11881#issuecomment-1302761082 OK, this should be ready for review now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput
dweiss commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1303040362 > -XX:+UseStringDeduplication to your commandline, problem solved. This is actually a very elegant workaround... I wonder what the benchmarks would show on those affected systems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org