[GitHub] [lucene] thecoop commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox


thecoop commented on PR #11847:
URL: https://github.com/apache/lucene/pull/11847#issuecomment-1301893972

   This is ready to review. I've removed the threadlocal byte buffer.
   
   With testing in elasticsearch, this reduced the duplicated strings in these 
fields by 90%. For the linked tickets this would reduce the memory usage from 
these strings from 6GB to 600MB. There is no significant performance difference 
I can see from running this in lucenebench.
   
   In the cases where the new method is not called, this would just add an 
empty `HashMap` instance to each `DataInput` instance, with the same lifetime 
as the container.
   When `getCanonicalString` is called, this added a single hashmap lookup to 
the call compared to `getString`. It also requires memory for the backing 
`HashMap`, with the same lifecycle as the container, scaling with the number of 
distinct strings returned by that method.
   
   So this causes slightly more memory when deserializing, at the benefit of 
using drastically less memory once it is all deserialized and the data input & 
hashmap have been GCd.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] thecoop commented on issue #11712: Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale [LUCENE-10677]

2022-11-03 Thread GitBox


thecoop commented on issue #11712:
URL: https://github.com/apache/lucene/issues/11712#issuecomment-1301896942

   @dweiss Please see #11847 for a proposed solution


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on a diff in pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox


dweiss commented on code in PR #11847:
URL: https://github.com/apache/lucene/pull/11847#discussion_r1012761533


##
lucene/core/src/java/org/apache/lucene/store/DataInput.java:
##
@@ -280,6 +281,18 @@ public String readString() throws IOException {
 return new String(bytes, 0, length, StandardCharsets.UTF_8);
   }
 
+  private final Map canonicalStrings = new HashMap<>();

Review Comment:
   I think this breaks with Cloneable, doesn't it? Just the reference to the 
same map instance would be cloned. I have no strong opinion about the string 
cache itself. It's a separate, explicit method so the risk of it breaking 
existing stuff is small... if it indeed helps the problem then it looks fine to 
me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] thecoop commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox


thecoop commented on PR #11847:
URL: https://github.com/apache/lucene/pull/11847#issuecomment-1301947013

   Another option is to have the `FieldInfosFormat` classes handle string 
pooling themselves, leaving `DataInput` unchanged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] thecoop commented on a diff in pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox


thecoop commented on code in PR #11847:
URL: https://github.com/apache/lucene/pull/11847#discussion_r1012767732


##
lucene/core/src/java/org/apache/lucene/store/DataInput.java:
##
@@ -280,6 +281,18 @@ public String readString() throws IOException {
 return new String(bytes, 0, length, StandardCharsets.UTF_8);
   }
 
+  private final Map canonicalStrings = new HashMap<>();

Review Comment:
   Added code to handle cloning. Seeing as the variable is no longer final, 
I've also made it lazily initialised.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox


rmuir commented on PR #11847:
URL: https://github.com/apache/lucene/pull/11847#issuecomment-1301951432

   I still have a problem with leaking memory on purpose. The issue here is 
elastic's bad design, not lucene. Otherwise you wouldnt have thousands of 
indexes X thousands of fields: that situation is your own creation.
   
   I'm -1 to adding this garbage to datainput. I don't care what it does for 
your abuse-case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on issue #11896: TestLucene91HnswVectorsFormat reproducible failure

2022-11-03 Thread GitBox


rmuir commented on issue #11896:
URL: https://github.com/apache/lucene/issues/11896#issuecomment-1301971180

   The build fails in CI with this issue (multiple times), so i captured a seed 
to reproduce it.
   
   i don't think intellij has anything to do with this. 
   
   If we aren't going to fix the test, I'll open a PR to delete it completely.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase opened a new issue, #11898: Release wizard: Remove obsolete JIRA instructions

2022-11-03 Thread GitBox


iverase opened a new issue, #11898:
URL: https://github.com/apache/lucene/issues/11898

   ### Description
   
   The current release wizard still contains instructions that needs to be 
performed on JIRA. We are not using it any more so those instructions are 
obsolete and should be removed / replaced.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-11-03 Thread GitBox


rmuir commented on PR #11860:
URL: https://github.com/apache/lucene/pull/11860#issuecomment-1302331407

   > For this vector graph file, since we use the same bpv, we can easily find 
a node's neighbor list just by offseting using `nodeOrdinal * 32 * bpv`. It's 
important this is efficient because you can load 100s of neighbor lists per 
search.
   
   that's one way to do it, but seems like a self-created problem: you could 
also just encode "offset" directly which could be a zigzag delta, or use 
numerous other designs, that wouldn't require the same bpv everywhere.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dsmiley commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox


dsmiley commented on PR #11847:
URL: https://github.com/apache/lucene/pull/11847#issuecomment-1302705957

   There is somewhat a matter of taste at stake, so I can appreciate Robert's 
point of view that DataInput is nice the way it is without adding this to it 
for an edge (abuse?) case.  I've seen Lucene based indexes used in a variety of 
ways so I'm sympathetic for finding a path forward.  If this optimization can 
be isolated to a small place in FieldInfosFormat and also fix the problem (?), 
will that address your concern Robert?  Maybe we need to see what it'd look 
like first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox


rmuir commented on PR #11847:
URL: https://github.com/apache/lucene/pull/11847#issuecomment-1302748754

   I think if you get yourself in this situation via oversharding/dynamic 
fields/etc, then you can add `-XX:+UseStringDeduplication` to your commandline, 
problem solved. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #11881: Further optimize DrillSideways scoring

2022-11-03 Thread GitBox


gsmiller commented on PR #11881:
URL: https://github.com/apache/lucene/pull/11881#issuecomment-1302761082

   OK, this should be ready for review now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-03 Thread GitBox


dweiss commented on PR #11847:
URL: https://github.com/apache/lucene/pull/11847#issuecomment-1303040362

   > -XX:+UseStringDeduplication to your commandline, problem solved.
   
   This is actually a very elegant workaround... I wonder what the benchmarks 
would show on those affected systems.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org