[GitHub] [lucene] shaie commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-18 Thread GitBox


shaie commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1159391170

   > I don't think so? `FSD` is currently about decoding the encoded `byte[]` 
into a `long[]` for `FacetSetMatcher` purposes.
   
   Strike that, you're absolutely right! I think the tests pass only because the 
values aren't actual `float/double` values; I'll change that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] shaie commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-18 Thread GitBox


shaie commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1159392950

   > Also, should we add `double` and `float` methods to `FacetSetDecoder`?
   
   I thought I had answered wrongly, but I hadn't. To clarify: we encode `double` 
values as sortable longs, so using `decodeLongs` works; similarly for `float`. 
Therefore I don't think we should introduce `decodeFloats/Doubles` for matching 
purposes.
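
   The "sortable longs" trick referenced here can be sketched in standalone Java. This mirrors the idea behind Lucene's `NumericUtils.doubleToSortableLong`; the class below is an illustration, not Lucene code:

   ```java
   // Map a double to a long whose *signed* ordering matches the double's
   // natural ordering, so long-based decoding/comparison (decodeLongs)
   // also works for doubles.
   public class SortableLongDemo {
     static long doubleToSortableLong(double value) {
       long bits = Double.doubleToLongBits(value);
       // For negative doubles the raw bit pattern sorts in reverse; flipping
       // all bits except the sign restores ascending order as signed longs.
       return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
     }

     static double sortableLongToDouble(long encoded) {
       // The transform is an involution, so decoding applies it again.
       return Double.longBitsToDouble(encoded ^ ((encoded >> 63) & 0x7fffffffffffffffL));
     }

     public static void main(String[] args) {
       double[] values = {-3.5, -0.25, 0.0, 1.5, 42.0};
       for (int i = 1; i < values.length; i++) {
         long a = doubleToSortableLong(values[i - 1]);
         long b = doubleToSortableLong(values[i]);
         if (a >= b) throw new AssertionError("ordering broken at index " + i);
         if (sortableLongToDouble(b) != values[i]) throw new AssertionError("round trip failed");
       }
       System.out.println("ordering preserved");
     }
   }
   ```

   Since the encoded longs compare exactly like the original doubles, a single long-based decode path covers both types.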





[GitHub] [lucene] tang-hi opened a new pull request, #966: LUCENE-10619: Optimize the writeBytes in TermsHashPerField

2022-06-18 Thread GitBox


tang-hi opened a new pull request, #966:
URL: https://github.com/apache/lucene/pull/966

   ### Description (or a Jira issue link if you have one)
   
https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-10619?filter=allopenissues
   
   when bytes length is large, write a block of bytes each time
   





[jira] [Commented] (LUCENE-10619) Optimize the writeBytes in TermsHashPerField

2022-06-18 Thread tangdh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555882#comment-17555882
 ] 

tangdh commented on LUCENE-10619:
-

hi [~jpountz], I have raised a [PR|https://github.com/apache/lucene/pull/966] :D

> Optimize the writeBytes in TermsHashPerField
> 
>
> Key: LUCENE-10619
> URL: https://issues.apache.org/jira/browse/LUCENE-10619
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tangdh
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Because we don't know the length of a slice, writeBytes will always write bytes
> one after another instead of writing a block of bytes.
> Maybe we could return both the offset and the length in ByteBlockPool#allocSlice:
> 1. BYTE_BLOCK_SIZE is 32768, so the offset takes at most 15 bits.
> 2. The slice size is at most 200, so it fits in 8 bits.
> So we could pack them together into a single int: offset | length.
> There are only two places where this function is used, so the cost of changing
> it is relatively small.
> When allocSlice can return the offset and length of the new slice, we could
> change writeBytes like below:
> {code:java}
> // write a block of bytes each time
> while (remaining > 0) {
>   int offsetAndLength = allocSlice(bytes, offset);
>   length = min(remaining, (offsetAndLength & 0xff) - 1);
>   offset = offsetAndLength >> 8;
>   System.arraycopy(src, srcPos, bytePool.buffer, offset, length);
>   srcPos += length;
>   remaining -= length;
>   offset += length;
> }
> {code}
> If it could work, I'd like to raise a PR.
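
The bit-packing scheme from the description (a 15-bit offset and an 8-bit length in one int) can be sketched as follows; `pack`/`unpackOffset`/`unpackLength` are hypothetical helper names, not Lucene code:

```java
// Pack a slice offset (<= 15 bits, since BYTE_BLOCK_SIZE is 32768) and a
// slice length (<= 8 bits, since the largest slice level is 200) into a
// single int as (offset << 8) | length.
public class OffsetLengthPacking {
  static int pack(int offset, int length) {
    assert offset >= 0 && offset < (1 << 15) : "offset must fit in 15 bits";
    assert length > 0 && length < (1 << 8) : "length must fit in 8 bits";
    return (offset << 8) | length;
  }

  static int unpackOffset(int packed) {
    return packed >>> 8; // unsigned shift discards the length byte
  }

  static int unpackLength(int packed) {
    return packed & 0xff; // low 8 bits
  }

  public static void main(String[] args) {
    int packed = pack(31999, 200);
    System.out.println(unpackOffset(packed) + " " + unpackLength(packed)); // 31999 200
  }
}
```

Because 15 + 8 = 23 bits, both values fit comfortably in one positive int, which is what lets allocSlice return them in a single value.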



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[GitHub] [lucene] shaie commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-18 Thread GitBox


shaie commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1159454008

   I pushed a commit which:
   
   1. Adds demo code for a `TemperatureReading` `FacetSet` and 
`FacetSetMatcher`.
   2. Modifies the existing Exact and Range matcher float/double tests to 
properly test float/double values.
   3. Introduces default implementations of `packValues` and `sizePackedBytes` 
which use `getComparableLongs`. I think this (1) clarifies why one would need 
to implement `getComparableLongs` and (2) provides a simple default for custom 
facet sets that don't care about the encoding.
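
   A default built on "comparable longs" could look roughly like the standalone sketch below. The class and method names are illustrative only and do not claim to match the actual Lucene `FacetSet` API:

   ```java
   import java.nio.ByteBuffer;

   // Sketch: once a facet set can express its dimensions as comparable
   // longs, packing them into bytes has an obvious default -- fixed-width
   // big-endian longs, one per dimension.
   public class ComparableLongPacking {
     static int sizePackedBytes(int dims) {
       return dims * Long.BYTES; // 8 bytes per dimension
     }

     static void packValues(long[] comparableLongs, byte[] dest, int start) {
       ByteBuffer buf = ByteBuffer.wrap(dest, start, comparableLongs.length * Long.BYTES);
       for (long l : comparableLongs) {
         buf.putLong(l); // big-endian by default
       }
     }

     public static void main(String[] args) {
       long[] values = {1L, -2L, 300L};
       byte[] packed = new byte[sizePackedBytes(values.length)];
       packValues(values, packed, 0);
       // Decode back to verify the round trip.
       ByteBuffer buf = ByteBuffer.wrap(packed);
       for (long expected : values) {
         if (buf.getLong() != expected) throw new AssertionError("round trip failed");
       }
       System.out.println("round trip ok");
     }
   }
   ```

   A custom facet set would then only override the packing when it wants a denser encoding than 8 bytes per dimension.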





[GitHub] [lucene] vigyasharma commented on a diff in pull request #966: LUCENE-10619: Optimize the writeBytes in TermsHashPerField

2022-06-18 Thread GitBox


vigyasharma commented on code in PR #966:
URL: https://github.com/apache/lucene/pull/966#discussion_r901031392


##
lucene/core/src/java/org/apache/lucene/index/TermsHashPerField.java:
##
@@ -230,9 +230,29 @@ final void writeByte(int stream, byte b) {
   }
 
   final void writeBytes(int stream, byte[] b, int offset, int len) {
-// TODO: optimize
 final int end = offset + len;
-for (int i = offset; i < end; i++) writeByte(stream, b[i]);
+int streamAddress = streamAddressOffset + stream;
+int upto = termStreamAddressBuffer[streamAddress];
+byte[] slice = bytePool.buffers[upto >> ByteBlockPool.BYTE_BLOCK_SHIFT];
+assert slice != null;
+int sliceOffset = upto & ByteBlockPool.BYTE_BLOCK_MASK;
+
+while (slice[sliceOffset] == 0 && offset < end) {
+  slice[sliceOffset++] = b[offset++];
+  (termStreamAddressBuffer[streamAddress])++;
+}
+
+while (offset < end) {
+  int offsetAndLength = bytePool.allocKnownSizeSlice(slice, sliceOffset);
+  sliceOffset = offsetAndLength >> 8;
+  int sliceLength = offsetAndLength & 0xff;
+  slice = bytePool.buffer;
+  int writeLength = Math.min(sliceLength - 1, end - offset);
+  System.arraycopy(b, offset, slice, sliceOffset, writeLength);
+  sliceOffset += writeLength;
+  offset += writeLength;
+  termStreamAddressBuffer[streamAddress] = sliceOffset + bytePool.byteOffset;
+}

Review Comment:
   Wow, this is an interesting change! Can you help me better understand this 
optimization? 
   
   Is `System.arraycopy()` much faster than manually copying like in the while 
loop above? I would guess it is similar, since there are no extra checks for 
primitive or object types [[ref: fast array copy 
comparison](https://www.javaspecialists.eu/archive/Issue124-Copying-Arrays-Fast.html)],
 but I'm not sure.
   This is probably better than calling `writeByte()` in a loop because we 
don't need to fetch all the pointers every time. 
   
   A follow-up idea: I think `allocSlice()` (or `allocKnownSizeSlice()` here) 
always returns the next higher-sized buffer slice. Maybe `allocKnownSizeSlice` 
could instead return a slice appropriate for the length of pending items we 
need to write `(end - offset)`? We could then avoid having to call alloc 
multiple times for larger writes.
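
   The copy-speed question can be probed with a rough standalone comparison. This is not a rigorous benchmark (no JMH, no warmup control), so the timings are anecdotal only; the correctness check is the reliable part:

   ```java
   import java.util.Arrays;

   // Compare the two copy strategies discussed above: a manual
   // byte-by-byte loop vs System.arraycopy.
   public class CopyComparison {
     static void manualCopy(byte[] src, byte[] dst, int len) {
       for (int i = 0; i < len; i++) dst[i] = src[i];
     }

     public static void main(String[] args) {
       int size = 1 << 16;
       byte[] src = new byte[size];
       for (int i = 0; i < size; i++) src[i] = (byte) i;

       byte[] viaLoop = new byte[size];
       byte[] viaArraycopy = new byte[size];

       long t0 = System.nanoTime();
       for (int r = 0; r < 1_000; r++) manualCopy(src, viaLoop, size);
       long manualNanos = System.nanoTime() - t0;

       t0 = System.nanoTime();
       for (int r = 0; r < 1_000; r++) System.arraycopy(src, 0, viaArraycopy, 0, size);
       long arraycopyNanos = System.nanoTime() - t0;

       // Both strategies must produce identical bytes.
       System.out.println("copies match: "
           + (Arrays.equals(viaLoop, viaArraycopy) && Arrays.equals(src, viaLoop)));
       System.out.println("manual ~" + manualNanos / 1_000 + " ns/copy, arraycopy ~"
           + arraycopyNanos / 1_000 + " ns/copy");
     }
   }
   ```

   Modern JITs can vectorize the manual loop too, so the gap (if any) depends on JVM version and array size.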






[GitHub] [lucene] tang-hi commented on a diff in pull request #966: LUCENE-10619: Optimize the writeBytes in TermsHashPerField

2022-06-18 Thread GitBox


tang-hi commented on code in PR #966:
URL: https://github.com/apache/lucene/pull/966#discussion_r901039033


##
lucene/core/src/java/org/apache/lucene/index/TermsHashPerField.java:
##
@@ -230,9 +230,29 @@ final void writeByte(int stream, byte b) {
   }
 
   final void writeBytes(int stream, byte[] b, int offset, int len) {
-// TODO: optimize
 final int end = offset + len;
-for (int i = offset; i < end; i++) writeByte(stream, b[i]);
+int streamAddress = streamAddressOffset + stream;
+int upto = termStreamAddressBuffer[streamAddress];
+byte[] slice = bytePool.buffers[upto >> ByteBlockPool.BYTE_BLOCK_SHIFT];
+assert slice != null;
+int sliceOffset = upto & ByteBlockPool.BYTE_BLOCK_MASK;
+
+while (slice[sliceOffset] == 0 && offset < end) {
+  slice[sliceOffset++] = b[offset++];
+  (termStreamAddressBuffer[streamAddress])++;
+}
+
+while (offset < end) {
+  int offsetAndLength = bytePool.allocKnownSizeSlice(slice, sliceOffset);
+  sliceOffset = offsetAndLength >> 8;
+  int sliceLength = offsetAndLength & 0xff;
+  slice = bytePool.buffer;
+  int writeLength = Math.min(sliceLength - 1, end - offset);
+  System.arraycopy(b, offset, slice, sliceOffset, writeLength);
+  sliceOffset += writeLength;
+  offset += writeLength;
+  termStreamAddressBuffer[streamAddress] = sliceOffset + bytePool.byteOffset;
+}

Review Comment:
   Thanks for your comment!
   1. Is System.arraycopy() much faster than manually copying like in the 
while loop above?
   I think it's faster than a manual for loop. You may be interested in this 
answer: [Is it better to use System.arraycopy(...) than a for loop for copying 
arrays?](https://stackoverflow.com/a/49269866)
   2. Maybe allocKnownSizeSlice could instead return a slice appropriate for 
the length of pending items we need to write (end - offset)?
   In the beginning, I also wanted to implement it this way. But I found that 
the method `nextSlice` in `ByteSliceReader` always reads slices of size 5, 14, 
20, 30, ..., 200, 200, which matches `LEVEL_SIZE_ARRAY` in `ByteBlockPool`. So 
I have to allocate slices with the next-level size instead of `end - offset`.
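
   The slice-size progression can be reproduced with a small standalone walk. The two arrays below are copied from ByteBlockPool as of this discussion; treat the exact values as illustrative of the level scheme:

   ```java
   // Walk the slice levels the way a reader advancing through slices would:
   // each level's size is looked up, then the level advances via
   // NEXT_LEVEL_ARRAY, saturating at the last level.
   public class SliceLevels {
     static final int[] NEXT_LEVEL_ARRAY = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};
     static final int[] LEVEL_SIZE_ARRAY = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};

     public static void main(String[] args) {
       int level = 0;
       StringBuilder sizes = new StringBuilder();
       for (int i = 0; i < 12; i++) {
         sizes.append(LEVEL_SIZE_ARRAY[level]).append(' ');
         level = NEXT_LEVEL_ARRAY[level];
       }
       // Sizes grow and then saturate at 200, which is why the reader can
       // predict each next slice size without it being stored anywhere.
       System.out.println(sizes.toString().trim());
       // prints: 5 14 20 30 40 40 80 80 120 200 200 200
     }
   }
   ```

   This fixed progression is exactly the coupling that prevents allocating an arbitrary `(end - offset)`-sized slice: the reader assumes it.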






[GitHub] [lucene] tang-hi commented on a diff in pull request #966: LUCENE-10619: Optimize the writeBytes in TermsHashPerField

2022-06-18 Thread GitBox


tang-hi commented on code in PR #966:
URL: https://github.com/apache/lucene/pull/966#discussion_r901039371


##
lucene/core/src/java/org/apache/lucene/index/TermsHashPerField.java:
##
@@ -230,9 +230,29 @@ final void writeByte(int stream, byte b) {
   }
 
   final void writeBytes(int stream, byte[] b, int offset, int len) {
-// TODO: optimize
 final int end = offset + len;
-for (int i = offset; i < end; i++) writeByte(stream, b[i]);
+int streamAddress = streamAddressOffset + stream;
+int upto = termStreamAddressBuffer[streamAddress];
+byte[] slice = bytePool.buffers[upto >> ByteBlockPool.BYTE_BLOCK_SHIFT];
+assert slice != null;
+int sliceOffset = upto & ByteBlockPool.BYTE_BLOCK_MASK;
+
+while (slice[sliceOffset] == 0 && offset < end) {
+  slice[sliceOffset++] = b[offset++];
+  (termStreamAddressBuffer[streamAddress])++;
+}
+
+while (offset < end) {
+  int offsetAndLength = bytePool.allocKnownSizeSlice(slice, sliceOffset);
+  sliceOffset = offsetAndLength >> 8;
+  int sliceLength = offsetAndLength & 0xff;
+  slice = bytePool.buffer;
+  int writeLength = Math.min(sliceLength - 1, end - offset);
+  System.arraycopy(b, offset, slice, sliceOffset, writeLength);
+  sliceOffset += writeLength;
+  offset += writeLength;
+  termStreamAddressBuffer[streamAddress] = sliceOffset + bytePool.byteOffset;
+}

Review Comment:
   > Wow, this is an interesting change! Can you help me better understand this 
optimization?
   > 
   > Is `System.arraycopy()` much faster than manually copying like in your 
while loop above. I would guess it is similar since there are no extra checks 
for primitive or object type [[ref: fast array copy 
comparison](https://www.javaspecialists.eu/archive/Issue124-Copying-Arrays-Fast.html)],
 but I'm not sure. This is probably better than calling `writeByte()` in a loop 
because we don't need to fetch all the pointers every time.
   > 
   > A follow up idea: I think `allocSlice()` (or `allocKnownSizeSlice()` 
here), always returns the next higher sized buffer slice.. Maybe 
`allocKnownSizeSlice` could instead return a slice appropriate for the length 
of pending items we need to write `(end - offset)`? We could then avoid having 
to call alloc multiple times for larger writes.
   
   Thanks for your comment!
   1. Is System.arraycopy() much faster than manually copying like in the 
while loop above?
   I think it's faster than a manual for loop. You may be interested in this 
answer: [Is it better to use System.arraycopy(...) than a for loop for copying 
arrays?](https://stackoverflow.com/a/49269866)
   2. Maybe allocKnownSizeSlice could instead return a slice appropriate for 
the length of pending items we need to write (end - offset)?
   In the beginning, I also wanted to implement it this way. But I found that 
the method `nextSlice` in `ByteSliceReader` always reads slices of size 5, 14, 
20, 30, ..., 200, 200, which matches `LEVEL_SIZE_ARRAY` in `ByteBlockPool`. So 
I have to allocate slices with the next-level size instead of `end - offset`.




[GitHub] [lucene] vigyasharma commented on a diff in pull request #966: LUCENE-10619: Optimize the writeBytes in TermsHashPerField

2022-06-18 Thread GitBox


vigyasharma commented on code in PR #966:
URL: https://github.com/apache/lucene/pull/966#discussion_r901050635


##
lucene/core/src/java/org/apache/lucene/index/TermsHashPerField.java:
##
@@ -230,9 +230,29 @@ final void writeByte(int stream, byte b) {
   }
 
   final void writeBytes(int stream, byte[] b, int offset, int len) {
-// TODO: optimize
 final int end = offset + len;
-for (int i = offset; i < end; i++) writeByte(stream, b[i]);
+int streamAddress = streamAddressOffset + stream;
+int upto = termStreamAddressBuffer[streamAddress];
+byte[] slice = bytePool.buffers[upto >> ByteBlockPool.BYTE_BLOCK_SHIFT];
+assert slice != null;
+int sliceOffset = upto & ByteBlockPool.BYTE_BLOCK_MASK;
+
+while (slice[sliceOffset] == 0 && offset < end) {
+  slice[sliceOffset++] = b[offset++];
+  (termStreamAddressBuffer[streamAddress])++;
+}
+
+while (offset < end) {
+  int offsetAndLength = bytePool.allocKnownSizeSlice(slice, sliceOffset);
+  sliceOffset = offsetAndLength >> 8;
+  int sliceLength = offsetAndLength & 0xff;
+  slice = bytePool.buffer;
+  int writeLength = Math.min(sliceLength - 1, end - offset);
+  System.arraycopy(b, offset, slice, sliceOffset, writeLength);
+  sliceOffset += writeLength;
+  offset += writeLength;
+  termStreamAddressBuffer[streamAddress] = sliceOffset + bytePool.byteOffset;
+}

Review Comment:
   Ah, I see what you're saying. This is some deep coupling between the 
`ByteSliceReader` and the `ByteBlockPool`. 
   
   `ByteSliceReader.nextSlice()` just assumes that the slice is the next higher 
level in `ByteBlockPool.NEXT_LEVEL_ARRAY`. I wonder if we could set 
`ByteSliceReader.level` to the right value before calling `nextSlice()`. The 
buffer seems to contain the next level value [in its last few 
bits](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java#L268-L269).
 `allocKnownSizeSlice()` would then return the next higher slice size for 
`(end - offset)` in `LEVEL_SIZE_ARRAY`.
   
   It might be a more intrusive change, though. The ByteBlockPool docstring 
calls out that it allocates *"slices of increasing lengths"*, so there might be 
other places that hold this expectation. Maybe worth playing around with in a 
separate spin-off PR.






[GitHub] [lucene] tang-hi commented on a diff in pull request #966: LUCENE-10619: Optimize the writeBytes in TermsHashPerField

2022-06-18 Thread GitBox


tang-hi commented on code in PR #966:
URL: https://github.com/apache/lucene/pull/966#discussion_r901054938


##
lucene/core/src/java/org/apache/lucene/index/TermsHashPerField.java:
##
@@ -230,9 +230,29 @@ final void writeByte(int stream, byte b) {
   }
 
   final void writeBytes(int stream, byte[] b, int offset, int len) {
-// TODO: optimize
 final int end = offset + len;
-for (int i = offset; i < end; i++) writeByte(stream, b[i]);
+int streamAddress = streamAddressOffset + stream;
+int upto = termStreamAddressBuffer[streamAddress];
+byte[] slice = bytePool.buffers[upto >> ByteBlockPool.BYTE_BLOCK_SHIFT];
+assert slice != null;
+int sliceOffset = upto & ByteBlockPool.BYTE_BLOCK_MASK;
+
+while (slice[sliceOffset] == 0 && offset < end) {
+  slice[sliceOffset++] = b[offset++];
+  (termStreamAddressBuffer[streamAddress])++;
+}
+
+while (offset < end) {
+  int offsetAndLength = bytePool.allocKnownSizeSlice(slice, sliceOffset);
+  sliceOffset = offsetAndLength >> 8;
+  int sliceLength = offsetAndLength & 0xff;
+  slice = bytePool.buffer;
+  int writeLength = Math.min(sliceLength - 1, end - offset);
+  System.arraycopy(b, offset, slice, sliceOffset, writeLength);
+  sliceOffset += writeLength;
+  offset += writeLength;
+  termStreamAddressBuffer[streamAddress] = sliceOffset + bytePool.byteOffset;
+}

Review Comment:
   If we want to set `ByteSliceReader.level` to the right value, maybe we 
should use one bit in [buffer[byteUpto - 1] 
](https://github.com/apache/lucene/blob/6ba759df866289db485d44fd1f75b3eb00f8d99f/lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java#L268-L269)
 as a flag and the other bits to represent the customized size.
   For example, if the highest bit is the flag, `ByteSliceReader.nextSlice` 
will check the highest bit and decide whether to use the old method or 
calculate 
[newSize](https://github.com/apache/lucene/blob/6ba759df866289db485d44fd1f75b3eb00f8d99f/lucene/core/src/java/org/apache/lucene/index/ByteSliceReader.java#L101)
 from the remaining bits.
   
   But this implementation needs to change a lot and needs to be fully tested. 
I think this is best done in another PR; I might do it in the next PR.
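
   The flag-bit idea can be sketched in isolation. All names below are hypothetical, not Lucene code; note also that 7 payload bits only reach 127, so for slice sizes up to 200 one might store a level index rather than the raw size:

   ```java
   // Sketch: use the high bit of the end-of-slice marker byte as a
   // "customized payload follows" flag, with the remaining 7 bits as payload.
   public class FlagByteSketch {
     static byte encodeLevel(int level) {
       // legacy form: plain level, high bit clear
       return (byte) (level & 0x7f);
     }

     static byte encodeCustomPayload(int payload) {
       // custom form: high bit set, 7 bits of payload (size or level index)
       assert payload >= 0 && payload < 128;
       return (byte) (0x80 | payload);
     }

     static boolean isCustom(byte marker) {
       return (marker & 0x80) != 0;
     }

     static int payload(byte marker) {
       return marker & 0x7f;
     }

     public static void main(String[] args) {
       byte legacy = encodeLevel(3);
       byte custom = encodeCustomPayload(100);
       System.out.println(isCustom(legacy) + " " + payload(legacy));  // false 3
       System.out.println(isCustom(custom) + " " + payload(custom));  // true 100
     }
   }
   ```

   A reader checking the flag could then fall back to the legacy level lookup when the bit is clear, keeping old segments readable.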


