[GitHub] [lucene] thecoop commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox
thecoop commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1301893972 This is ready to review. I've removed the threadlocal byte buffer. With testing in elasticsearch, this reduced the duplicated strings in these fields by 90%. For the linked tickets

[GitHub] [lucene] thecoop commented on issue #11712: Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale [LUCENE-10677]

2022-11-03 Thread GitBox
thecoop commented on issue #11712: URL: https://github.com/apache/lucene/issues/11712#issuecomment-1301896942 @dweiss Please see #11847 for a proposed solution -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL ab

[GitHub] [lucene] dweiss commented on a diff in pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox
dweiss commented on code in PR #11847: URL: https://github.com/apache/lucene/pull/11847#discussion_r1012761533 ## lucene/core/src/java/org/apache/lucene/store/DataInput.java: ## @@ -280,6 +281,18 @@ public String readString() throws IOException { return new String(bytes, 0,

[GitHub] [lucene] thecoop commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox
thecoop commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1301947013 Another option is to have the `FieldInfosFormat` classes handle string pooling themselves, leaving `DataInput` unchanged. -- This is an automated message from the Apache Git Service. T

[GitHub] [lucene] thecoop commented on a diff in pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox
thecoop commented on code in PR #11847: URL: https://github.com/apache/lucene/pull/11847#discussion_r1012767732 ## lucene/core/src/java/org/apache/lucene/store/DataInput.java: ## @@ -280,6 +281,18 @@ public String readString() throws IOException { return new String(bytes, 0

[GitHub] [lucene] rmuir commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox
rmuir commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1301951432 I still have a problem with leaking memory on purpose. The issue here is elastic's bad design, not lucene. Otherwise you wouldnt have thousands of indexes X thousands of fields: that situa

[GitHub] [lucene] rmuir commented on issue #11896: TestLucene91HnswVectorsFormat reproducible failure

2022-11-03 Thread GitBox
rmuir commented on issue #11896: URL: https://github.com/apache/lucene/issues/11896#issuecomment-1301971180 The build fails in CI with this issue (multiple times), so i captured a seed to reproduce it. i don't think intellij has anything to do with this. If we aren't going to

[GitHub] [lucene] iverase opened a new issue, #11898: Release wizard: Remove obsolete JIRA instructions

2022-11-03 Thread GitBox
iverase opened a new issue, #11898: URL: https://github.com/apache/lucene/issues/11898 ### Description The current release wizard still contains instructions that needs to be performed on JIRA. We are not using it any more so those instructions are obsolete and should be removed / re

[GitHub] [lucene] rmuir commented on pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-11-03 Thread GitBox
rmuir commented on PR #11860: URL: https://github.com/apache/lucene/pull/11860#issuecomment-1302331407 > For this vector graph file, since we use the same bpv, we can easily find a node's neighbor list just by offseting using `nodeOrdinal * 32 * bpv`. It's important this is efficient becaus

[GitHub] [lucene] dsmiley commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox
dsmiley commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1302705957 There is somewhat a matter of taste at stake, so I can appreciate Robert's point of view that DataInput is nice the way it is without adding this to it for an edge (abuse?) case. I've s

[GitHub] [lucene] rmuir commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox
rmuir commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1302748754 I think if you get yourself in this situation via oversharding/dynamic fields/etc, then you can add `-XX:+UseStringDeduplication` to your commandline, problem solved. -- This is an aut

[GitHub] [lucene] gsmiller commented on pull request #11881: Further optimize DrillSideways scoring

2022-11-03 Thread GitBox
gsmiller commented on PR #11881: URL: https://github.com/apache/lucene/pull/11881#issuecomment-1302761082 OK, this should be ready for review now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[GitHub] [lucene] dweiss commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox
dweiss commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1303040362 > -XX:+UseStringDeduplication to your commandline, problem solved. This is actually a very elegant workaround... I wonder what the benchmarks would show on those affected systems.