[GitHub] [lucene] LuXugang commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID

2022-05-13 Thread GitBox


LuXugang commented on PR #873:
URL: https://github.com/apache/lucene/pull/873#issuecomment-1125834512

   Sorry for the delay. Whoa, the `bug` label made this PR so noticeable among 
the PR lists.





[GitHub] [lucene] mocobeta commented on pull request #883: LUCENE-10561 Reduce class/member visibility of all normalizer and stemmer classes

2022-05-13 Thread GitBox


mocobeta commented on PR #883:
URL: https://github.com/apache/lucene/pull/883#issuecomment-1125910079

   @shahrs87 Looks good to me. Can you please add a 
[CHANGES](https://github.com/apache/lucene/blob/main/lucene/CHANGES.txt) entry 
to the "API changes" section in 10.0.0? Also, we need a 
[MIGRATE](https://github.com/apache/lucene/blob/main/lucene/MIGRATE.md) entry; 
I'll add it later.
   
   I think we can merge this once the CHANGES and MIGRATE entries are in. I 
will keep it open till tomorrow to give others time for another review.





[GitHub] [lucene] mocobeta commented on pull request #883: LUCENE-10561 Reduce class/member visibility of all normalizer and stemmer classes

2022-05-13 Thread GitBox


mocobeta commented on PR #883:
URL: https://github.com/apache/lucene/pull/883#issuecomment-1125928019

   > Can you please suggest some starter jiras which will give me overview on 
Lucene and will be helpful to community also. 
   
   We have a couple of Jira issues with the `newdev` label for new developers, 
but I don't think the list is well maintained, sorry :/
   
https://issues.apache.org/jira/browse/LUCENE-9303?jql=(project%3DLUCENE)%20AND%20resolution%3DUnresolved%20AND%20labels%3Dnewdev
   
   Apart from the `newdev` label, there are many small (minor) issues that are 
non-controversial; you could browse unresolved, well-described Jira issues and 
pick some of those as starters. 





[GitHub] [lucene] rmuir commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID

2022-05-13 Thread GitBox


rmuir commented on PR #873:
URL: https://github.com/apache/lucene/pull/873#issuecomment-1125930594

   > Whoa, the `bug` label made this PR so noticeable among the PR lists.
   
   Sorry, this was me. I was doing some experimentation with labels and github 
issues/PRs.





[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-05-13 Thread Deepika Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536624#comment-17536624
 ] 

Deepika Sharma commented on LUCENE-10544:
-

I am currently working on wrapping postings/impacts with {{ExitableTermsEnum}} 
in {{{}ExitableDirectoryReader{}}}. I am looking for suggestions on how to unit 
test this.

More context on the approach I have taken:
To wrap postings/impacts with {{{}ExitableTermsEnum{}}}, I tried overriding the 
other methods of {{FilterTermsEnum}} with a timeout check in 
{{ExitableTermsEnum}}, in addition to the already overridden {{next()}} method. 
{{PostingsEnum}} and {{ImpactsEnum}} also need to be wrapped, so I created 
{{ExitableImpactsEnum}} extending {{ImpactsEnum}} and, similarly, 
{{ExitablePostingsEnum}} extending {{PostingsEnum}} in the 
{{ExitableDirectoryReader}} class, where all the methods of the respective 
classes are overridden with timeout checks.
I also modified {{ExitableTermsEnum}} so that when {{#impacts(...)}} is invoked 
we return an instance of {{{}ExitableImpactsEnum{}}}, and when 
{{#postings(...)}} is invoked we return an instance of 
{{{}ExitablePostingsEnum{}}}.
I observed that {{ExitableImpactsEnum}} and {{ExitablePostingsEnum}} are 
invoked through {{#scoreAll}} for {{TermQuery}} and {{PrefixQuery}} 
respectively.
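
A rough sketch of what such a postings wrapper could look like (purely 
illustrative, not the actual patch; the timeout check here is just a plain 
{{BooleanSupplier}} stand-in for whatever callback the real code uses):
{code:java}
import java.io.IOException;
import java.util.function.BooleanSupplier;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.util.BytesRef;

/** Illustrative sketch only: delegate everything, check the timeout whenever we advance. */
class TimeoutCheckingPostingsEnum extends PostingsEnum {
  private final PostingsEnum in;
  private final BooleanSupplier timedOut; // stand-in for the real timeout check

  TimeoutCheckingPostingsEnum(PostingsEnum in, BooleanSupplier timedOut) {
    this.in = in;
    this.timedOut = timedOut;
  }

  private void checkTimeout() {
    if (timedOut.getAsBoolean()) {
      throw new RuntimeException("query timed out while iterating postings");
    }
  }

  @Override public int nextDoc() throws IOException { checkTimeout(); return in.nextDoc(); }
  @Override public int advance(int target) throws IOException { checkTimeout(); return in.advance(target); }
  @Override public int docID() { return in.docID(); }
  @Override public long cost() { return in.cost(); }
  @Override public int freq() throws IOException { return in.freq(); }
  @Override public int nextPosition() throws IOException { return in.nextPosition(); }
  @Override public int startOffset() throws IOException { return in.startOffset(); }
  @Override public int endOffset() throws IOException { return in.endOffset(); }
  @Override public BytesRef getPayload() throws IOException { return in.getPayload(); }
}
{code}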

One problem I have observed: for a given timeout value, the number of results 
is not fixed. For example, I ran the test with a timeout value of 100 and got 
all the results, but on subsequent runs with the same timeout value I received 
0, 3, and 5 results (a different number each time).

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.






[GitHub] [lucene] msokolov commented on a diff in pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID

2022-05-13 Thread GitBox


msokolov commented on code in PR #873:
URL: https://github.com/apache/lucene/pull/873#discussion_r872437351


##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java:
##
@@ -90,26 +92,30 @@ public boolean insertWithOverflow(int newNode, float 
newScore) {
   }
 
   private long encode(int node, float score) {
-return order.apply((((long) NumericUtils.floatToSortableInt(score)) << 32) | node);
+long nodeReverse = reversed ? node : (-1 - node);
+// make sure all high 32 bits were 0 by unsigned right shift

Review Comment:
   perhaps a bitmask would be clearer? `nodeReverse &= 0x`
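   
   For illustration only (not the PR's code): the two options being discussed, 
an unsigned-shift round trip and an explicit mask, both keep just the low 32 
bits of a long:
   ```java
   class Low32Bits {
     static long viaUnsignedShift(long v) {
       return (v << 32) >>> 32; // shift up, then unsigned shift back down
     }

     static long viaMask(long v) {
       return v & 0xFFFFFFFFL; // explicit bitmask, arguably easier to read
     }
   }
   ```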



##
lucene/core/src/java/org/apache/lucene/util/LongHeap.java:
##
@@ -74,9 +74,9 @@ public final long push(long element) {
* @return whether the value was added (unless the heap is full, or the new 
value is less than the
* top value)
*/
-  public boolean insertWithOverflow(long value) {
+  public final boolean insertWithOverflow(long value) {
 if (size >= maxSize) {
-  if (value < heap[1]) {
+  if ((value < heap[1])) {

Review Comment:
   don't need the extra parens






[GitHub] [lucene] shahrs87 commented on pull request #883: LUCENE-10561 Reduce class/member visibility of all normalizer and stemmer classes

2022-05-13 Thread GitBox


shahrs87 commented on PR #883:
URL: https://github.com/apache/lucene/pull/883#issuecomment-1126220894

   > Can you please add a 
[CHANGES](https://github.com/apache/lucene/blob/main/lucene/CHANGES.txt) entry 
to "API changes" section in 10.0.0?
   @mocobeta Done! Thank you for the review.





[jira] [Created] (LUCENE-10570) Monitor Presearcher to reject registration of "ANYTOKEN" queries

2022-05-13 Thread Chris M. Hostetter (Jira)
Chris M. Hostetter created LUCENE-10570:
---

 Summary: Monitor Presearcher to reject registration of  "ANYTOKEN" 
queries
 Key: LUCENE-10570
 URL: https://issues.apache.org/jira/browse/LUCENE-10570
 Project: Lucene - Core
  Issue Type: Improvement
  Components: monitor
Reporter: Chris M. Hostetter


I'm starting to do some work with Monitor, and one of the things I realized I 
was going to want is an easy way to detect Queries that result in {{ANYTOKEN}} 
style QueryIndex documents – requiring a "forward search" test against every 
future {{match(Document)}} call.

The simplest solution I could come up with is a Presearcher "wrapper" – 
something that might be generally useful for other users – but I'm certainly 
open to other approaches.






[jira] [Updated] (LUCENE-10570) Monitor Presearcher to reject registration of "ANYTOKEN" queries

2022-05-13 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated LUCENE-10570:

Attachment: LUCENE-10570.patch
Status: Open  (was: Open)

Attaching patch w/tests for a really simple 
{{RejectUnconstrainedQueriesPresearcherWrapper}}

 

 

> Monitor Presearcher to reject registration of  "ANYTOKEN" queries
> -
>
> Key: LUCENE-10570
> URL: https://issues.apache.org/jira/browse/LUCENE-10570
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: monitor
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: LUCENE-10570.patch
>
>
> I'm starting to do some work with Monitor, and one of the things i realized I 
> was going to want is an easy way to detect Queries that result in 
> {{ANYTOKEN}} style QueryIndex documents – requiring a "forward search" test 
> against every future {{match(Document)}} call.
> The simplest solution I could come up with is a Presearcher "wrapper" – 
> something that might be generally useful for other users – but I'm certainly 
> open to other approaches.






[jira] [Updated] (LUCENE-10570) Monitor Presearcher to reject registration of "ANYTOKEN" queries

2022-05-13 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated LUCENE-10570:

Component/s: modules/monitor
 (was: monitor)

> Monitor Presearcher to reject registration of  "ANYTOKEN" queries
> -
>
> Key: LUCENE-10570
> URL: https://issues.apache.org/jira/browse/LUCENE-10570
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/monitor
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: LUCENE-10570.patch
>
>
> I'm starting to do some work with Monitor, and one of the things i realized I 
> was going to want is an easy way to detect Queries that result in 
> {{ANYTOKEN}} style QueryIndex documents – requiring a "forward search" test 
> against every future {{match(Document)}} call.
> The simplest solution I could come up with is a Presearcher "wrapper" – 
> something that might be generally useful for other users – but I'm certainly 
> open to other approaches.






[jira] [Commented] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress

2022-05-13 Thread Peixin Li (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536763#comment-17536763
 ] 

Peixin Li commented on LUCENE-10551:


Yes, we're following up with the GraalVM team on the root cause. The code 
sample above does stop the bug; alternatively, users can add 
{color:#e01e5a}-XX:-UseJVMCICompiler{color} to disable use of the Graal 
compiler. Those are the current findings; we will update once we have more 
information.

> LowercaseAsciiCompression should return false when it's unable to compress
> --
>
> Key: LUCENE-10551
> URL: https://issues.apache.org/jira/browse/LUCENE-10551
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Lucene version 8.11.1
>Reporter: Peixin Li
>Priority: Major
> Attachments: LUCENE-10551-test.patch
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> {code:java}
>  Failed to commit..
> java.lang.IllegalStateException: 10 <> 5 
> cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion
>  cloud gen2tion instance - dev1tion instance - 
> testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o
>         at 
> org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318)
>         at 
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170)
>         at 
> org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120)
>         at 
> org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267)
>         at 
> org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
>         at 
> org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
>         at 
> org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
>         at 
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
>         at 
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
>         at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
>        {code}
> {code:java}
> key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow,
>  resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, 
> domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1})
> java.lang.IllegalStateException: 29 <> 16 
> analytics-platform-test/koala/cluster-tool:1.0-20220310151438.492,mesh_istio_examples-bookinfo-details-v1:1.16.2mesh_istio_examples-bookinfo-reviews-v3:1.16.2oce-clamav:1.0.219oce-tesseract:1.0.7oce-traefik:2.5.1oci-opensearch:1.2.4.8.103oda-digital-assistant-control-plane-train-pool-workflow-v6:22.02.14oke-coresvcs-k8s-dns-dnsmasq-nanny-amd64@sha256:41aa9160ceeaf712369ddb660d02e5ec06d1679965e6930351967c8cf5ed62d4oke-coresvcs-k8s-dns-kube-dns-amd64@sha256:2cf34b04106974952996c6ef1313f165ce65b4ad68a3051f51b1b8f91ba5f838oke-coresvcs-k8s-dns-sidecar-amd64@sha256:8a82c7288725cb4de9c7cd8d5a78279208e379f35751539b406077f9a3163dcdoke-coresvcs-node-problem-detector@sha256:9d54df11804a862c54276648702a45a6a0027a9d930a86becd69c34cc84bf510oke-coresvcs-oke-fluentd-lumberjack@sha256:5f3f10b187eb804ce4e84bc3672de1cf318c0f793f00dac01cd7da8beea8f269oke-etcd-operator@sha256:4353a2e5ef02bb0f6b046a8d6219b1af359a2c1141c358ff110e395f29d0bfc8oke-oke-hyperkube-amd64@sha256:3c734f46099400507f938090eb9a874338fa25cde425ac9409df4c885759752foke-public-busybox@sha256:4cee1979ba0bf7db9fc5d28fb7b798ca69ae95a47c5fecf46327720df4ff352doke-public-coredns@sha256:86f8cfc74497f04e181ab2e1d26d2fd8bd46c4b33ce24b55620efcdfcb214670oke-public-coredns@sha256:8cd974302f1f6108f6f31312f8181ae723b514e2022089cdcc3db10666c49228oke-public-etcd@sha256:b751e459bc2a8f079f6730dd8462671b253c7c8b0d0eb47c67888d5091c6bb77oke-public-etcd@sha256:d6a76200a6e9103681bc2cf7fefbcada0dd9372d52cf8964178d846b89959d14oke-public-etcd@sha256:fa056479342b45479ac74c58176ddad43687d5fc295375d705808f9dfb4843

[jira] [Comment Edited] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress

2022-05-13 Thread Peixin Li (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536763#comment-17536763
 ] 

Peixin Li edited comment on LUCENE-10551 at 5/13/22 4:47 PM:
-

Yes, we're following up with the GraalVM team on the root cause. The code 
sample above does stop the bug; alternatively, users can add 
{color:#e01e5a}-XX:-UseJVMCICompiler{color} to disable use of the Graal 
compiler. Those are the current findings; we will update once we have more 
information.


was (Author: JIRAUSER285785):
Yes, we're following up with GraalVm team for the root cause. and the code 
sample above do stop the bug. or user can also adding 
{color:#e01e5a}-XX:-UseJVMCICompiler{color}  to disable use of the Graal 
compiler. That are the current findings, will update once we get more infos.

> LowercaseAsciiCompression should return false when it's unable to compress
> --
>
> Key: LUCENE-10551
> URL: https://issues.apache.org/jira/browse/LUCENE-10551
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Lucene version 8.11.1
>Reporter: Peixin Li
>Priority: Major
> Attachments: LUCENE-10551-test.patch
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> {code:java}
>  Failed to commit..
> java.lang.IllegalStateException: 10 <> 5 
> cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion
>  cloud gen2tion instance - dev1tion instance - 
> testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o
>         at 
> org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318)
>         at 
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170)
>         at 
> org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120)
>         at 
> org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267)
>         at 
> org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
>         at 
> org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
>         at 
> org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
>         at 
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
>         at 
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
>         at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
>        {code}
> {code:java}
> key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow,
>  resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, 
> domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1})
> java.lang.IllegalStateException: 29 <> 16 
> analytics-platform-test/koala/cluster-tool:1.0-20220310151438.492,mesh_istio_examples-bookinfo-details-v1:1.16.2mesh_istio_examples-bookinfo-reviews-v3:1.16.2oce-clamav:1.0.219oce-tesseract:1.0.7oce-traefik:2.5.1oci-opensearch:1.2.4.8.103oda-digital-assistant-control-plane-train-pool-workflow-v6:22.02.14oke-coresvcs-k8s-dns-dnsmasq-nanny-amd64@sha256:41aa9160ceeaf712369ddb660d02e5ec06d1679965e6930351967c8cf5ed62d4oke-coresvcs-k8s-dns-kube-dns-amd64@sha256:2cf34b04106974952996c6ef1313f165ce65b4ad68a3051f51b1b8f91ba5f838oke-coresvcs-k8s-dns-sidecar-amd64@sha256:8a82c7288725cb4de9c7cd8d5a78279208e379f35751539b406077f9a3163dcdoke-coresvcs-node-problem-detector@sha256:9d54df11804a862c54276648702a45a6a0027a9d930a86becd69c34cc84bf510oke-coresvcs-oke-fluentd-lumberjack@sha256:5f3f10b187eb804ce4e84bc3672de1cf318c0f793f00dac01cd7da8beea8f269oke-etcd-operator@sha256:4353a2e5ef02bb0f6b046a8d6219b1af359a2c1141c358ff110e395f29d0bfc8oke-oke-hyperkube-amd64@sha256:3c734f46099400507f938090eb9a874338fa25cde425ac9409df4c885759752foke-public-busybox@sha256:4cee1979ba0bf7db9fc5d28fb7b798ca69ae95a47c5fecf46327720df4ff352doke-public-coredns@sha256:86f8cfc74497f04e181ab2e1d26d2fd8bd46c4b33ce24b55620ef

[jira] [Created] (LUCENE-10571) Monitor alternative "TermFilter" Presearcher for sparse filter fields

2022-05-13 Thread Chris M. Hostetter (Jira)
Chris M. Hostetter created LUCENE-10571:
---

 Summary: Monitor alternative "TermFilter" Presearcher for sparse 
filter fields
 Key: LUCENE-10571
 URL: https://issues.apache.org/jira/browse/LUCENE-10571
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/monitor
Reporter: Chris M. Hostetter


One of the things that surprised me the most when looking into how the 
{{TermFilteredPresearcher}} worked was what happens when Queries and/or 
Documents do _NOT_  have a value in a configured filter field.

per the javadocs...
{quote}Filtering by additional fields can be configured by passing a set of 
field names. Documents that contain values in those fields will only be checked 
against \{@link MonitorQuery} instances that have the same fieldname-value 
mapping in their metadata.
{quote}
...which is straightforward and useful in the tested example where every 
registered Query has {{"language"}} metadata, and every Document has a 
{{"language"}} field, but gives unintuitive results when a Query or Document 
does *NOT* have a {{"language"}} value.

A more "intuitive" & useful (in my opinion) implementation would be something 
that could be documented as ...
{quote}Filtering by additional fields can be configured by passing a set of 
field names. Documents that contain values in those fields will only be checked 
against \{@link MonitorQuery} instances
that have the same fieldname-value mapping in their metadata or have no 
mapping for that fieldname.

Documents that do not contain values in those fields will only be checked 
against \{@link MonitorQuery} instances that also have no mapping for that 
fieldname.
{quote}
...i.e.: instead of a straight "filter candidate queries by what we find in 
the filter fields of the documents", we can instead "derive the queries that 
are viable candidates for each document as if we were restricting the set of 
documents by those values during a forward search".






[jira] [Commented] (LUCENE-10266) Move nearest-neighbor search on points to core?

2022-05-13 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536765#comment-17536765
 ] 

Rushabh Shah commented on LUCENE-10266:
---

[~jpountz] I am new to the Lucene project; this will be my 2nd issue. Given 
that this is a minor task, I would like to create a PR for it. Can you please 
elaborate on the steps needed to tackle this issue?

I see that NearestNeighbor is in the sandbox module (maybe the sandbox module 
is for baking new features?) and we now need to move it to the core module? 
Thank you.

> Move nearest-neighbor search on points to core?
> ---
>
> Key: LUCENE-10266
> URL: https://issues.apache.org/jira/browse/LUCENE-10266
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> Now that the Points' public API supports running nearest-nearest neighbor 
> search, should we move it to core via helper methods on {{LatLonPoint}} and 
> {{XYPoint}}?






[jira] [Updated] (LUCENE-10571) Monitor alternative "TermFilter" Presearcher for sparse filter fields

2022-05-13 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated LUCENE-10571:

Attachment: LUCENE-10571.patch
Status: Open  (was: Open)

I'm attaching a patch with a {{HuperDuperTermFilteredPresearcher}} (name just a 
placeholder) that works the way described by introducing a 
{{MISSING_FILTERS_FIELD}} into (Query) documents in the {{QueryIndex}} which we 
then search when a Document doesn't contain any values in a specific filter 
field.
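
Conceptually (this is not the attached patch, and the sentinel token below is 
made up for illustration), the idea maps to indexing a sentinel term for 
queries that have no value for a filter field, and searching that sentinel for 
documents that have no value for it:
{code:java}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

/** Conceptual sketch only: sentinel-based handling of missing filter values. */
class MissingFilterSketch {
  // Hypothetical sentinel token; the actual patch uses its own MISSING_FILTERS_FIELD mechanism.
  static final String MISSING = "__missing__";

  /** Registration side: index the query's filter value, or the sentinel if it has none. */
  static void addFilterTerm(Document queryIndexDoc, String filterField, String valueOrNull) {
    String token = valueOrNull != null ? valueOrNull : MISSING;
    queryIndexDoc.add(new StringField(filterField, token, Field.Store.NO));
  }

  /** Match side: candidates are queries with the same value, or queries with no value at all. */
  static Query candidateFilter(String filterField, String docValueOrNull) {
    BooleanQuery.Builder b = new BooleanQuery.Builder();
    if (docValueOrNull != null) {
      b.add(new TermQuery(new Term(filterField, docValueOrNull)), BooleanClause.Occur.SHOULD);
    }
    b.add(new TermQuery(new Term(filterField, MISSING)), BooleanClause.Occur.SHOULD);
    return b.build();
  }
}
{code}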

The easiest way to see the impact of this compared to 
{{TermFilteredPresearcher}} is to compare the two new 
{{testMissingFieldFiltering}} methods and the different expected results for 
each impl.

At the moment this new class is largely a copy/paste duplication of 
{{TermFilteredPresearcher}} with small additions, because I'm not sure how we 
might want to expose this functionality to users.

Obviously, even if other folks agree that this is a better way to do "term 
filtering" in Monitor than how {{TermFilteredPresearcher}} currently works, 
changing the internals of {{TermFilteredPresearcher}} to "invert" its logic 
like this would be a huge back-compat break -- what I'm not sure about is 
whether it would make more sense to make this behavior "configurable" in 
{{TermFilteredPresearcher}} or to refactor some of the internals to allow this 
new functionality in a new subclass (which would probably be straightforward, 
but would also require _another_ subclass to support "multipass" in 
combination with this alternative filtering).

> Monitor alternative "TermFilter" Presearcher for sparse filter fields
> -
>
> Key: LUCENE-10571
> URL: https://issues.apache.org/jira/browse/LUCENE-10571
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/monitor
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: LUCENE-10571.patch
>
>
> One of the things that surprised me the most when looking into how the 
> {{TermFilteredPresearcher}} worked was what happens when Queries and/or 
> Documents do _NOT_  have a value in a configured filter field.
> per the javadocs...
> {quote}Filtering by additional fields can be configured by passing a set of 
> field names. Documents that contain values in those fields will only be 
> checked against \{@link MonitorQuery} instances that have the same 
> fieldname-value mapping in their metadata.
> {quote}
> ...which is straightforward and useful in the tested example where every 
> registered Query has {{"language"}} metadata, and every Document has a 
> {{"language"}} field, but gives unintuitive results when a Query or Document 
> does *NOT* have a {{"language"}}
> A more "intuitive" & useful (in my opinions) implementation would be 
> something that could be documented as ...
> {quote}Filtering by additional fields can be configured by passing a set of 
> field names. Documents that contain values in those fields will only be 
> checked against \{@link MonitorQuery} instances
> that have the same fieldname-value mapping in their metadata or have no 
> mapping for that fieldname.
> Documents that do not contain values in those fields will only be checked 
> against \{@link MonitorQuery} instances that also have no mapping for that 
> fieldname.
> {quote}
> ...ie: instead of being a straight "filter candidate queries by what we find 
> in the filter fields in the documents" we can instead "derive the queries 
> that are viable candidates for each document if we were restricting the set 
> of documents by those values during a "forward search"






[jira] [Commented] (LUCENE-10392) Handle soft deletes via LiveDocsFormat

2022-05-13 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536778#comment-17536778
 ] 

Rushabh Shah commented on LUCENE-10392:
---

[~jpountz] I am new to the Lucene project; this will be my 2nd issue. Given 
that this is a minor task, I would like to create a PR for it. Can you please 
elaborate on the steps needed to tackle this issue? Also, if you can point me 
to some classes relevant to this patch where I can read more about the 
existing behavior, that would be helpful. Thank you.

> Handle soft deletes via LiveDocsFormat
> --
>
> Key: LUCENE-10392
> URL: https://issues.apache.org/jira/browse/LUCENE-10392
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> We have been using doc values to handle soft deletes until now, but this is a 
> bit of a hack as it:
>  - forces users to reserve a field name for doc values
>  - generally doesn't read directly from doc values, instead docs values help 
> populate bitsets and then reads are performed via these bitsets
> It would also be more natural to have both hard and soft deletes handled by 
> the same file format?






[GitHub] [lucene] msokolov commented on pull request #870: LUCENE-10502: Refactor hnswVectors format

2022-05-13 Thread GitBox


msokolov commented on PR #870:
URL: https://github.com/apache/lucene/pull/870#issuecomment-1126294216

   Things have been moving kind of fast here! Which is great, but I am trying
   to catch up and having trouble reconstructing the changes. Today on main
   lucene92/OffHeapVectorValues.java has only one commit in its git history,
   and I'm trying to find the place where we added the overrides of
   vectorValue() and binaryValue() for the Sparse/Dense subclasses (since they
   are copies, it seems weird). I think this has something to do with working
   around JVM weirdness - I have a vague memory of a discussion about that,
   but I can't find any record of it in git. I tried looking at the old (90 /
   91) readers but I think these changes came after that. I wonder if we lost
   the history while doing some git surgery on this feature branch?
   
   On Tue, May 10, 2022 at 3:17 PM Lu Xugang ***@***.***> wrote:
   
   > Thanks @mayya-sharipova, let's move to #877 to continue this change.





[jira] [Created] (LUCENE-10572) Can we optimize BytesRefHash

2022-05-13 Thread Michael McCandless (Jira)
Michael McCandless created LUCENE-10572:
---

 Summary: Can we optimize BytesRefHash
 Key: LUCENE-10572
 URL: https://issues.apache.org/jira/browse/LUCENE-10572
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless


I was poking around in our nightly benchmarks 
([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
profiling that the hottest method is this:
{noformat}
PERCENT   CPU SAMPLES   STACK
9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
  at org.apache.lucene.util.BytesRefHash#findHash()
  at org.apache.lucene.util.BytesRefHash#add()
  at org.apache.lucene.index.TermsHashPerField#add()
  at 
org.apache.lucene.index.IndexingChain$PerField#invert()
  at 
org.apache.lucene.index.IndexingChain#processField()
  at 
org.apache.lucene.index.IndexingChain#processDocument()
  at 
org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
This is kinda crazy – comparing if the term to be inserted into the inverted 
index hash equals the term already added to {{BytesRefHash}} is the hottest 
method during nightly benchmarks.

Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
questionable things about our current implementation:
 * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the inserted 
term into the hash?  Let's just use two bytes always, since IW limits term 
length to 32 K (< 64K that an unsigned short can cover)


 * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
(BitUtil.VH_BE_SHORT.get)
 * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
aggressive enough?  Or the initial sizing of the hash is too small?

 * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too many 
{{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible "upgrades"?


 * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
({{{}murmurhash3_x86_32{}}})?

 * Are we using the JVM's intrinsics to compare multiple bytes in a single SIMD 
instruction ([~rcmuir] is quite sure we are indeed)?


 * [~jpountz] suggested maybe the hash insert is simply memory bound


 * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total CPU 
cost)

I pulled these observations from a recent (5/6/22) profiler output: 
[https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]

Maybe we can improve our performance on this crazy hotspot?

Or maybe this is a "healthy" hotspot and we should leave it be!
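
For the first bullet above, an illustrative sketch (not Lucene's actual code) 
of the difference between a 1-or-2-byte variable-width length and an 
always-2-byte length:
{code:java}
/** Illustrative only: two ways to write a term length that is known to be < 32K. */
class LengthEncoding {
  /** Variable width: 1 byte for lengths < 128, 2 bytes otherwise (vInt-style, up to 16383 here). */
  static int writeVariable(byte[] buf, int off, int len) {
    if (len < 0x80) {
      buf[off] = (byte) len;
      return 1;
    }
    buf[off] = (byte) (0x80 | (len & 0x7F));
    buf[off + 1] = (byte) (len >>> 7);
    return 2;
  }

  /** Fixed width: always 2 bytes; fine since IndexWriter caps term length well below 64K. */
  static int writeFixed(byte[] buf, int off, int len) {
    buf[off] = (byte) len;
    buf[off + 1] = (byte) (len >>> 8);
    return 2;
  }
}
{code}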






[jira] [Updated] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-10572:

Description: 
I was poking around in our nightly benchmarks 
([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
profiling that the hottest method is this:
{noformat}
PERCENT   CPU SAMPLES   STACK
9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
  at org.apache.lucene.util.BytesRefHash#findHash()
  at org.apache.lucene.util.BytesRefHash#add()
  at org.apache.lucene.index.TermsHashPerField#add()
  at 
org.apache.lucene.index.IndexingChain$PerField#invert()
  at 
org.apache.lucene.index.IndexingChain#processField()
  at 
org.apache.lucene.index.IndexingChain#processDocument()
  at 
org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
This is kinda crazy – comparing if the term to be inserted into the inverted 
index hash equals the term already added to {{BytesRefHash}} is the hottest 
method during nightly benchmarks.

Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
questionable things about our current implementation:
 * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the inserted 
term into the hash?  Let's just use two bytes always, since IW limits term 
length to 32 K (< 64K that an unsigned short can cover)

 * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
(BitUtil.VH_BE_SHORT.get)


 * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
aggressive enough?  Or the initial sizing of the hash is too small?

 * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too many 
{{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible "upgrades"?

 * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
({{{}murmurhash3_x86_32{}}})?

 * Are we using the JVM's intrinsics to compare multiple bytes in a single SIMD 
instruction ([~rcmuir] is quite sure we are indeed)?

 * [~jpountz] suggested maybe the hash insert is simply memory bound

 * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total CPU 
cost)

I pulled these observations from a recent (5/6/22) profiler output: 
[https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]

Maybe we can improve our performance on this crazy hotspot?

Or maybe this is a "healthy" hotspot and we should leave it be!

  was:
I was poking around in our nightly benchmarks 
([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
profiling that the hottest method is this:
{noformat}
PERCENT   CPU SAMPLES   STACK
9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
  at org.apache.lucene.util.BytesRefHash#findHash()
  at org.apache.lucene.util.BytesRefHash#add()
  at org.apache.lucene.index.TermsHashPerField#add()
  at 
org.apache.lucene.index.IndexingChain$PerField#invert()
  at 
org.apache.lucene.index.IndexingChain#processField()
  at 
org.apache.lucene.index.IndexingChain#processDocument()
  at 
org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
This is kinda crazy – comparing if the term to be inserted into the inverted 
index hash equals the term already added to {{BytesRefHash}} is the hottest 
method during nightly benchmarks.

Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
questionable things about our current implementation:
 * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the inserted 
term into the hash?  Let's just use two bytes always, since IW limits term 
length to 32 K (< 64K that an unsigned short can cover)


 * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
(BitUtil.VH_BE_SHORT.get)
 * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
aggressive enough?  Or the initial sizing of the hash is too small?

 * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too many 
{{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible "upgrades"?


 * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
({{{}murmurhash3_x86_32{}}})?

 * Are we using the JVM's intrinsics to compare multiple bytes in a single SIMD 
instruction ([~rcmuir] is quite sure we are indeed)?


 * [~jpountz] suggested maybe the hash insert is simply memory bound


 * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total CPU 
cost)

I pull

[jira] [Updated] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-10572:

Summary: Can we optimize BytesRefHash?  (was: Can we optimize BytesRefHash)

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536855#comment-17536855
 ] 

Uwe Schindler commented on LUCENE-10572:


With the swapping of bytes, I remember the other issue where this was 
discussed. In my personal opinion we should just use ByteOrder.getDefault() 
here to get the varhandle. The byte order does not matter, we just need to get 
2 bytes.

The reason for this decision was that Robert was arguing about 
reproducibility. He did not like platform dependencies (he was arguing that we 
can't test the algorithm on different platforms, so if we use the default byte 
order of the platform we run our code on (e.g. x86), somebody could see bugs 
on ARM).

I don't think that's an issue, just the typical way how Robert argues. In that 
case let's get a varhandle with platform order in the static ctor and not use 
the one from BitUtil.

We may also use a var handle in same way to save the length.
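
A small sketch of that idea (illustrative only, not a patch): obtain a 
short-view {{VarHandle}} over {{byte[]}} in native byte order, analogous to 
{{BitUtil.VH_BE_SHORT}} but without forcing big-endian:
{code:java}
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

/** Illustrative sketch: read 2 bytes in whatever byte order is native to the platform. */
class NativeOrderShortRead {
  private static final VarHandle VH_NATIVE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.nativeOrder());

  static int readTwoBytes(byte[] bytes, int offset) {
    // Here the two bytes only seed a hash, so their order does not matter.
    return (short) VH_NATIVE_SHORT.get(bytes, offset) & 0xFFFF;
  }
}
{code}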

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536855#comment-17536855
 ] 

Uwe Schindler edited comment on LUCENE-10572 at 5/13/22 7:51 PM:
-

With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.nativeOrder() here to get 
the varhandle. The byte order does not matter, we just need to get 2 bytes to 
seed the hash. So to be fast on all platforms use the native order.

The reason for this decision was that Robert was arguing about reproducibility. 
And he did not like platform dependencies (he was arguing that we can't test 
the algorithm with different platforms, so if we use default byte order of the 
platform we run out code (e.g. you), somebody could see bugs on arm.

I don't think that's an issue, just the typical way how Robert argues. In that 
case let's get a varhandle with platform order in the static ctor and not use 
the one from BitUtil.

We may also use a var handle in same way to save the length.


was (Author: thetaphi):
With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.getDefault() here to get 
the varhandle. The byte order does not mapper, we just need to get 2 bytes.

The reason for this decision was that Robert was arguing about reproducibility. 
And he did not like platform dependencies (he was arguing that we can't test 
the algorithm with different platforms, so if we use default byte order of the 
platform we run out code (e.g. you), somebody could see bugs on arm.

I don't think that's an issue, just the typical way how Robert argues. In that 
case let's get a varhandle with platform order in the static ctor and not use 
the one from BitUtil.

We may also use a var handle in same way to save the length.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!




[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536855#comment-17536855
 ] 

Uwe Schindler edited comment on LUCENE-10572 at 5/13/22 7:53 PM:
-

With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.nativeOrder() here to get 
the varhandle. The byte order does not matter, we just need to get 2 bytes to 
seed the hash. So to be fast on all platforms use the native order.

The reason for this decision was that Robert was arguing about reproducibility. 
And he did not like platform dependencies (he was arguing that we can't test 
the algorithm with different platforms, so if we use default byte order of the 
platform we run out code (e.g. you), somebody could see bugs on arm.

I don't think that's an issue, just the typical way how Robert argues. In that 
case let's get a varhandle with platform order in the static ctor and not use 
the one from BitUtil.

We may also use a var handle in same way to save the length, also in platform 
order (as this encoding is private to the class and is never serialized to 
disk).


was (Author: thetaphi):
With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.nativeOrder() here to get 
the varhandle. The byte order does not matter, we just need to get 2 bytes to 
seed the hash. So to be fast on all platforms use the native order.

The reason for this decision was that Robert was arguing about reproducibility. 
And he did not like platform dependencies (he was arguing that we can't test 
the algorithm with different platforms, so if we use default byte order of the 
platform we run out code (e.g. you), somebody could see bugs on arm.

I don't think that's an issue, just the typical way how Robert argues. In that 
case let's get a varhandle with platform order in the static ctor and not use 
the one from BitUtil.

We may also use a var handle in same way to save the length.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!

[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536855#comment-17536855
 ] 

Uwe Schindler edited comment on LUCENE-10572 at 5/13/22 7:55 PM:
-

With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.nativeOrder() here to get 
the varhandle. The byte order does not matter, we just need to get 2 bytes to 
seed the hash. So to be fast on all platforms use the native order.

The reason for this decision was that Robert was arguing about reproducibility.
He did not like platform dependencies (he was arguing that we can't test the
algorithm with different platforms, so if we use the default byte order of the
platform we run our code on (e.g. x86), somebody could see bugs on ARM).

I don't think that's an issue, just the typical way Robert argues. In that
case let's get a varhandle with platform order in the static ctor and not use
the one from BitUtil.

We may also use a var handle in same way to save the length, also in platform 
order (as this encoding is private to the class and is never serialized to 
disk).


was (Author: thetaphi):
With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.nativeOrder() here to get 
the varhandle. The byte order does not matter, we just need to get 2 bytes to 
seed the hash. So to be fast on all platforms use the native order.

The reason for this decision was that Robert was arguing about reproducibility. 
And he did not like platform dependencies (he was arguing that we can't test 
the algorithm with different platforms, so if we use default byte order of the 
platform we run out code (e.g. you), somebody could see bugs on arm.

I don't think that's an issue, just the typical way how Robert argues. In that 
case let's get a varhandle with platform order in the static ctor and not use 
the one from BitUtil.

We may also use a var handle in same way to save the length, also in platform 
order (as this encoding is private to the class and is never serialized to 
disk).

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!

[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536877#comment-17536877
 ] 

Robert Muir commented on LUCENE-10572:
--

{quote}
With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.nativeOrder() here to get 
the varhandle. The byte order does not matter, we just need to get 2 bytes to 
seed the hash. So to be fast on all platforms use the native order.
{quote}

The problem is there are a lot of different places today specifying big-endian
or little-endian all over the place. The byte order does matter, in the sense
that it needs to be consistent with e.g. ByteBlockPool's code and whatnot. We do
encoding/decoding (just to memory), so if we start specifying nativeOrder()
just here in BytesRefHash, but neglect to consistently do it in the same places
in ByteBlockPool, then we are going to have bugs, ones that won't be detected
by any CI today.

So yes, I stand by my assessment: there are enough endian shenanigans happening
that it's dangerous to start mixing nativeOrder() into this stuff unless we
can test on a platform where it's BE.
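
A small self-contained demo of that concern (class and variable names are purely illustrative): if the writer encodes with one order and the reader decodes with another, the value silently changes, and only a BE machine or a randomized test order would expose it:

{code:java}
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class EndianMismatchDemo {
  static final VarHandle BE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.BIG_ENDIAN);
  static final VarHandle LE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);

  public static void main(String[] args) {
    byte[] pool = new byte[2];
    BE_SHORT.set(pool, 0, (short) 0x1234);       // the "writer" side encodes big-endian
    short read = (short) LE_SHORT.get(pool, 0);  // the "reader" side decodes little-endian
    System.out.printf("wrote 0x1234, read 0x%04x%n", read); // prints 0x3412: silent corruption
  }
}
{code}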

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536883#comment-17536883
 ] 

Uwe Schindler commented on LUCENE-10572:


Hey,
I agree with the length encoding. This is indeed used in other places, too.

My argument was meant primarily for the hash seed, where we already use a var
handle. This one is (like the hash algorithm) private to BytesRefHash.

If we want to test all platforms, we could default to the platform's byte order
in the initializer when not in test mode. In test mode we use a random byte
order. This could be controlled by a sysprop in the static initializer of the
class (it cannot be fully dynamic, as the var handle MUST be declared static
final).
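
A rough sketch of that constraint (the property name and class are assumptions for illustration only): the order has to be fixed once in the static initializer, because the var handle must be a static final field for the JIT to optimize the access:

{code:java}
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class HashSeedAccess {
  // Chosen exactly once at class initialization; a static final VarHandle is
  // required so the JIT can fold the access into a plain load.
  static final VarHandle VH_SEED_SHORT;

  static {
    // In test mode (sysprop present) a fixed but non-native order could be
    // picked; otherwise use the platform's native order.
    boolean testMode = System.getProperty("tests.byteorder") != null; // illustrative property
    ByteOrder order = testMode ? ByteOrder.BIG_ENDIAN : ByteOrder.nativeOrder();
    VH_SEED_SHORT = MethodHandles.byteArrayViewVarHandle(short[].class, order);
  }
}
{code}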

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536885#comment-17536885
 ] 

Robert Muir commented on LUCENE-10572:
--

OK, I like your suggestion actually. It solves my issue: it would make this
testable. Then native order could be used freely without this concern.

We just use our own constant, but then we can change it for testing. I would
even ban nativeOrder() in forbidden-apis, too. It is just like Locale or
anything else, same thing. Let's be specific but then test all of them.



> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536905#comment-17536905
 ] 

Uwe Schindler commented on LUCENE-10572:


bq. ok, i like your suggestion actually. it solves my issue, it would make this 
testable. Then native order could be used freely without this concern.

We can do this in the same way as the initialization of StringUtils. There we
check the system property tests.seed and then initialize some randomness. In
BitUtil we could have similar code (or maybe share that in Constants to get the
random seed and save it as a Long value, or null if not given - this would
prevent us from doing the lookup multiple times). In BitUtil we could have
varhandles like {{BitUtil.VH_NATIVE_SHORT}} that are native in production
environments, but randomized in test environments.

I can make a PR as a starting point tomorrow, it's too late now. We can then
improve from there - [~mikemccand]'s other ideas included.
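
A hedged sketch of what that could look like (the seed parsing, the shared-constant holder, and the handle name are assumptions drawn from this discussion, not existing Lucene APIs): the seed is parsed once, shared, and used to randomize the byte order only when tests are running:

{code:java}
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class NativeOrderConstants {
  // Parsed once so several classes can reuse it; null when not running tests.
  static final Long RANDOM_SEED;
  // Native order in production, randomly BE or LE when a test seed is present.
  static final VarHandle VH_NATIVE_SHORT;

  static {
    String prop = System.getProperty("tests.seed");
    Long seed = null;
    if (prop != null) {
      try {
        // assumption: the seed is a hex string, possibly with ":"-separated parts
        seed = Long.parseUnsignedLong(prop.split(":")[0], 16);
      } catch (NumberFormatException ignored) {
        // fall back to native order if the seed cannot be parsed
      }
    }
    RANDOM_SEED = seed;
    ByteOrder order =
        (seed == null)
            ? ByteOrder.nativeOrder()
            : ((seed & 1L) == 0L ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN);
    VH_NATIVE_SHORT = MethodHandles.byteArrayViewVarHandle(short[].class, order);
  }
}
{code}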

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536907#comment-17536907
 ] 

Uwe Schindler commented on LUCENE-10572:


bq. We just use our own constant, but then we can change it for testing. I 
would even ban the nativeOrder() in forbidden apis, too. It is just like Locale 
or anything else, same thing. Let's be specific but then test all of them.

I disagree with that. We also do not forbid {{Locale#getDefault}}, because if
somebody uses that method he explicitly wants the default locale.

Actually, using ByteOrder#nativeOrder() is also an explicit vote to do that.
And the coming MMapDirectory v2 and more Panama features - like locking pages
in mmapdir or using madvise/fadvise based on IOContext - need native order to
talk to those native APIs.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536915#comment-17536915
 ] 

Robert Muir commented on LUCENE-10572:
--

Well, banning it doesn't matter so much to me if we define BitUtil helpers. If
we have the helpers, all will be fine. My concern is this endian-specific
encode/decode stuff happening in memory all over IndexWriter, and those helpers
are what it is going to be using.

I'm not concerned with nativeOrder() being used by MMapDirectory or something
like that; that is different.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536917#comment-17536917
 ] 

Robert Muir commented on LUCENE-10572:
--

And by the way, I'm also fine with just using little-endian here as a simpler
solution. We don't have any real BE machines to test on, I don't know anyone
running Lucene on BE machines (AIX??? what is still out there really supported
by OpenJDK?), and I am concerned about adding complexity around such a crazy
case. If we use LE explicitly, all is safe, because the code will just swap
bytes on the AIX machine and still be correct.

Today it is swapping bytes on the Intel and ARM machines and the AIX machine is
"fast" instead - in quotes because we know it's still not :)

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536916#comment-17536916
 ] 

Uwe Schindler commented on LUCENE-10572:


Hi,
I have a PR almost ready. In my comment above I confused this with the native
order issue we discussed for LZ4. In my patch, I changed it there to be NATIVE,
too.

In BytesRefHash we have one big-endian variant, but in ByteBlockPool we also
have little-endian writes. It's too late for me now; I changed the big-endian
ones in BytesRefHash for now, but what we should check for sure: those blocks
should never be written to disk, so maybe somebody with more knowledge should
look into it.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536918#comment-17536918
 ] 

Robert Muir commented on LUCENE-10572:
--

{quote}
In BytesRefHash we have one big endian variant, but in ByteBlockPool we have 
also little endian writes. It's too late for me now, I changed the Big Endian 
ones in BytesRefHash for now, but what we should for sure check: Those blocks 
should never be written to disk, so maybe somebody with more knowledge should 
look into it.
{quote}

Exactly, these IndexWriter classes are crazy as soon as you start digging into
this issue. That's why I'm worried about using nativeOrder() everywhere and
making it worse.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536921#comment-17536921
 ] 

Uwe Schindler commented on LUCENE-10572:


Yeah exactly, sometimes BE is used so that the terms can be sorted as a byte
sequence.
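
A tiny worked example of why that matters (purely illustrative): encoding values big-endian keeps unsigned lexicographic byte order aligned with numeric order, while little-endian does not:

{code:java}
import java.util.Arrays;

public class SortOrderDemo {
  public static void main(String[] args) {
    // 258 = 0x0102 and 513 = 0x0201
    byte[] be258 = {0x01, 0x02}, be513 = {0x02, 0x01};  // big-endian encodings
    byte[] le258 = {0x02, 0x01}, le513 = {0x01, 0x02};  // little-endian encodings

    // Unsigned byte-wise comparison, as used when sorting terms as raw bytes.
    System.out.println(Arrays.compareUnsigned(be258, be513) < 0); // true: matches 258 < 513
    System.out.println(Arrays.compareUnsigned(le258, le513) < 0); // false: LE breaks the order
  }
}
{code}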

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller merged pull request #777: LUCENE-10488: Optimize Facets#getTopDims in ConcurrentSortedSetDocValuesFacetCounts

2022-05-13 Thread GitBox


gsmiller merged PR #777:
URL: https://github.com/apache/lucene/pull/777


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller merged pull request #779: LUCENE-10488: Optimize Facets#getTopDims in IntTaxonomyFacets

2022-05-13 Thread GitBox


gsmiller merged PR #779:
URL: https://github.com/apache/lucene/pull/779


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller merged pull request #806: LUCENE-10488: Optimize Facets#getTopDims in FloatTaxonomyFacets

2022-05-13 Thread GitBox


gsmiller merged PR #806:
URL: https://github.com/apache/lucene/pull/806


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations

2022-05-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536925#comment-17536925
 ] 

ASF subversion and git services commented on LUCENE-10488:
--

Commit f0ec226167230a42b27a4b946c9c0f74d0f0abfc in lucene's branch 
refs/heads/main from Yuting Gan
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f0ec2261672 ]

LUCENE-10488: Optimize Facets#getTopDims in FloatTaxonomyFacets (#806)



> Optimize Facets#getTopDims across Facets implementations
> 
>
> Key: LUCENE-10488
> URL: https://issues.apache.org/jira/browse/LUCENE-10488
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the 
> number of "top" dimensions they want. The default implementation just 
> delegates to {{getAllDims}} and returns the number of top dims requested, but 
> some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated 
> this in {{SortedSetDocValueFacetCounts}}, but we can take it further. There's 
> at least some opportunity to do better in:
> * {{ConcurrentSortedSetDocValuesFacetCounts}}
> * {{FastTaxonomyFacetCounts}}
> * {{TaxonomyFacetSumFloatAssociations}}
> * {{TaxonomyFacetSumIntAssociations}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations

2022-05-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536924#comment-17536924
 ] 

ASF subversion and git services commented on LUCENE-10488:
--

Commit 57f8cb2fd6bcc0a27f89dc8c36b43bfd812ded46 in lucene's branch 
refs/heads/main from Yuting Gan
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=57f8cb2fd6b ]

LUCENE-10488: Optimize Facets#getTopDims in IntTaxonomyFacets (#779)



> Optimize Facets#getTopDims across Facets implementations
> 
>
> Key: LUCENE-10488
> URL: https://issues.apache.org/jira/browse/LUCENE-10488
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the 
> number of "top" dimensions they want. The default implementation just 
> delegates to {{getAllDims}} and returns the number of top dims requested, but 
> some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated 
> this in {{SortedSetDocValueFacetCounts}}, but we can take it further. There's 
> at least some opportunity to do better in:
> * {{ConcurrentSortedSetDocValuesFacetCounts}}
> * {{FastTaxonomyFacetCounts}}
> * {{TaxonomyFacetSumFloatAssociations}}
> * {{TaxonomyFacetSumIntAssociations}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations

2022-05-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536923#comment-17536923
 ] 

ASF subversion and git services commented on LUCENE-10488:
--

Commit ef43242d77a2f868ba77ea1f186344bfeae3065c in lucene's branch 
refs/heads/main from Yuting Gan
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ef43242d77a ]

LUCENE-10488: Optimized getTopDims in ConcurrentSSDVFacetCounts (#777)



> Optimize Facets#getTopDims across Facets implementations
> 
>
> Key: LUCENE-10488
> URL: https://issues.apache.org/jira/browse/LUCENE-10488
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the 
> number of "top" dimensions they want. The default implementation just 
> delegates to {{getAllDims}} and returns the number of top dims requested, but 
> some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated 
> this in {{SortedSetDocValueFacetCounts}}, but we can take it further. There's 
> at least some opportunity to do better in:
> * {{ConcurrentSortedSetDocValuesFacetCounts}}
> * {{FastTaxonomyFacetCounts}}
> * {{TaxonomyFacetSumFloatAssociations}}
> * {{TaxonomyFacetSumIntAssociations}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler opened a new pull request, #888: LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests)

2022-05-13 Thread GitBox


uschindler opened a new pull request, #888:
URL: https://github.com/apache/lucene/pull/888

   see https://issues.apache.org/jira/browse/LUCENE-10572


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536927#comment-17536927
 ] 

Uwe Schindler commented on LUCENE-10572:


Here is a draft PR about the idea. I just changed LZ4 to use the native order
(as that is explicitly allowed and also documented in the algorithm).

Playing with BytesRefHash and ByteBlockPool broke most tests around doc values
and block terms. So Robert is right: some of those are BE just because the
blocks are sorted as byte arrays (so it must be BE).

If this is not going to work, throw it away. I just started to make it testable.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536927#comment-17536927
 ] 

Uwe Schindler edited comment on LUCENE-10572 at 5/13/22 11:00 PM:
--

Here is a draft PR about the idea: https://github.com/apache/lucene/pull/888

I just changed LZ4 to use the native order (as that is explicitly allowed and
also documented in the algorithm).

Playing with BytesRefHash and ByteBlockPool broke most tests around doc values
and block terms. So Robert is right: some of those are BE just because the
blocks are sorted as byte arrays (so it must be BE).

If this is not going to work, throw it away. I just started to make it testable.


was (Author: thetaphi):
Here is a draft PR about the idea. I just changed LZ4 to use the native order 
(as it is explicitly allowed and also documented in the algorithm).

Playing with BytesRefHash and ByteBlockPool crushed most tests around docvalues 
and blockterms. So Robert is right: Some of those are BE just because of 
sorting the blocks as byte arrays (so it must be BE).

If this is not going to work, throw it away. I just started to make it testable.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Commented] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations

2022-05-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536928#comment-17536928
 ] 

ASF subversion and git services commented on LUCENE-10488:
--

Commit e01b65d28418d9bfc7439b3f3b701ea520700c86 in lucene's branch 
refs/heads/main from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e01b65d2841 ]

CHANGES entry for LUCENE-10488


> Optimize Facets#getTopDims across Facets implementations
> 
>
> Key: LUCENE-10488
> URL: https://issues.apache.org/jira/browse/LUCENE-10488
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the 
> number of "top" dimensions they want. The default implementation just 
> delegates to {{getAllDims}} and returns the number of top dims requested, but 
> some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated 
> this in {{SortedSetDocValueFacetCounts}}, but we can take it further. There's 
> at least some opportunity to do better in:
> * {{ConcurrentSortedSetDocValuesFacetCounts}}
> * {{FastTaxonomyFacetCounts}}
> * {{TaxonomyFacetSumFloatAssociations}}
> * {{TaxonomyFacetSumIntAssociations}}
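
As an illustration of the default behaviour described above, a hedged sketch (equivalent in spirit to, but not copied from, the actual default implementation):

{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;

// Hedged sketch of the unoptimized path: compute every dimension via getAllDims()
// and then keep only the first topNDims results. Optimized subclasses avoid doing
// the full getAllDims() work just to discard most of it.
final class TopDimsSketch {
  static List<FacetResult> topDims(Facets facets, int topNDims, int topNChildren)
      throws IOException {
    List<FacetResult> allDims = facets.getAllDims(topNChildren); // counts all dimensions
    return allDims.size() <= topNDims ? allDims : allDims.subList(0, topNDims);
  }
}
{code}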






[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536931#comment-17536931
 ] 

Uwe Schindler commented on LUCENE-10572:


bq. Today its swapping bytes on the intel and ARM machines and the AIX machine 
is "fast" instead. in quotes because we know its still not 

I know that OpenJDK is heavily tested on big-endian ARM machines (I think they 
can be switched). So wouldn't users of modern ARM Macs be good candidates?

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[GitHub] [lucene] uschindler commented on pull request #888: LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests)

2022-05-13 Thread GitBox


uschindler commented on PR #888:
URL: https://github.com/apache/lucene/pull/888#issuecomment-1126569898

   Anybody is welcome to play with this and commit ideas to the PR. @rmuir @mikemccand 
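
   For readers following along, a hedged sketch of the idea in the PR title: production code reads in the platform's native order, while tests can force either order so both paths stay covered. The property name and selection logic below are made up for illustration, not what this PR actually does.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Illustrative sketch: pick the byte order once, then build the VarHandles from it.
final class ConfigurableByteOrder {
  static ByteOrder pick() {
    String forced = System.getProperty("tests.byteorder"); // hypothetical test-only knob
    if ("BE".equals(forced)) return ByteOrder.BIG_ENDIAN;
    if ("LE".equals(forced)) return ByteOrder.LITTLE_ENDIAN;
    return ByteOrder.nativeOrder();                        // default: no swapping in production
  }

  static final VarHandle SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, pick());
  static final VarHandle INT =
      MethodHandles.byteArrayViewVarHandle(int[].class, pick());
}
```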





[GitHub] [lucene] gsmiller opened a new pull request, #889: LUCENE-10488: Optimized Facets#getTopDims for taxonomy faceting and ConcurrentSSDVFacetCounts

2022-05-13 Thread GitBox


gsmiller opened a new pull request, #889:
URL: https://github.com/apache/lucene/pull/889

   Just using this PR to backport.





[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536933#comment-17536933
 ] 

Uwe Schindler commented on LUCENE-10572:


bq. Playing with BytesRefHash and ByteBlockPool crushed most tests around 
docvalues and blockterms. So Robert is right: Some of those are BE just because 
of sorting the blocks as byte arrays (so it must be BE).

This is because of the vInt-like encoding, not sorting: you need to remove the 
vInt encoding so you don't need the first byte to switch between 1 and 2 bytes. 
Once the vInt-like encoding is removed we can use the native order in 
BytesRefHash and ByteBlockPool, and possibly PagedBytes, too.
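
To make the point concrete, a hedged sketch of the two encodings (simplified; the real ByteBlockPool/BytesRefHash code differs in detail):

{code:java}
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Simplified illustration. The vInt-like form must put the "2 bytes follow" marker in
// the first byte, which bakes big-endian order into the encoding; a plain 2-byte
// length has no marker, so it can be written and read with a native-order VarHandle.
final class LengthPrefixSketch {
  private static final VarHandle BE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.BIG_ENDIAN);
  private static final VarHandle NATIVE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.nativeOrder());

  // Current style: 1 byte for lengths < 128, otherwise a big-endian short with the
  // high bit set as the marker.
  static int writeVIntLike(byte[] buffer, int offset, int length) {
    if (length < 128) {
      buffer[offset] = (byte) length;
      return 1;
    }
    BE_SHORT.set(buffer, offset, (short) (length | 0x8000));
    return 2;
  }

  // Proposed style: always 2 bytes (IndexWriter caps terms well below 64K), so the
  // platform's native order works and no byte swapping is needed on LE hardware.
  static int writeFixedShort(byte[] buffer, int offset, int length) {
    NATIVE_SHORT.set(buffer, offset, (short) length);
    return 2;
  }

  static int readFixedShort(byte[] buffer, int offset) {
    return Short.toUnsignedInt((short) NATIVE_SHORT.get(buffer, offset));
  }
}
{code}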

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536937#comment-17536937
 ] 

Uwe Schindler commented on LUCENE-10572:


I removed the vInt-like encoding in ByteBlockPool and BytesRefHash. After that 
I was able to switch to native shorts.

I did not touch PagedBytes, although the same thing could be done there.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[GitHub] [lucene] uschindler commented on pull request #888: LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests)

2022-05-13 Thread GitBox


uschindler commented on PR #888:
URL: https://github.com/apache/lucene/pull/888#issuecomment-1126578739

   I removed the vInt-like encoding in ByteBlockPool and BytesRefHash. After 
that I was able to switch to native shorts.





[GitHub] [lucene] gsmiller merged pull request #889: LUCENE-10488: Optimized Facets#getTopDims for taxonomy faceting and ConcurrentSSDVFacetCounts

2022-05-13 Thread GitBox


gsmiller merged PR #889:
URL: https://github.com/apache/lucene/pull/889





[jira] [Commented] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations

2022-05-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536938#comment-17536938
 ] 

ASF subversion and git services commented on LUCENE-10488:
--

Commit 87655fd015ceef36c4210a0d20f1071068544fe4 in lucene's branch 
refs/heads/branch_9x from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=87655fd015c ]

LUCENE-10488: Optimized Facets#getTopDims for taxonomy faceting and 
ConcurrentSSDVFacetCounts (#889)

Co-authored-by: Yuting Gan <4710+yut...@users.noreply.github.com>

> Optimize Facets#getTopDims across Facets implementations
> 
>
> Key: LUCENE-10488
> URL: https://issues.apache.org/jira/browse/LUCENE-10488
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the 
> number of "top" dimensions they want. The default implementation just 
> delegates to {{getAllDims}} and returns the number of top dims requested, but 
> some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated 
> this in {{SortedSetDocValueFacetCounts}}, but we can take it further. There's 
> at least some opportunity to do better in:
> * {{ConcurrentSortedSetDocValuesFacetCounts}}
> * {{FastTaxonomyFacetCounts}}
> * {{TaxonomyFacetSumFloatAssociations}}
> * {{TaxonomyFacetSumIntAssociations}}






[jira] [Resolved] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations

2022-05-13 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10488.
--
Fix Version/s: 9.2
   Resolution: Fixed

Merged to {{main}} and {{branch_9x}}. Resolving. Thanks again [~yutinggan]!

> Optimize Facets#getTopDims across Facets implementations
> 
>
> Key: LUCENE-10488
> URL: https://issues.apache.org/jira/browse/LUCENE-10488
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the 
> number of "top" dimensions they want. The default implementation just 
> delegates to {{getAllDims}} and returns the number of top dims requested, but 
> some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated 
> this in {{SortedSetDocValueFacetCounts}}, but we can take it further. There's 
> at least some opportunity to do better in:
> * {{ConcurrentSortedSetDocValuesFacetCounts}}
> * {{FastTaxonomyFacetCounts}}
> * {{TaxonomyFacetSumFloatAssociations}}
> * {{TaxonomyFacetSumIntAssociations}}






[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536942#comment-17536942
 ] 

Robert Muir commented on LUCENE-10572:
--

{quote}
I know that OpenJDK is heavily tested on big-endian ARM machines (I think they 
can be switched). So wouldn't users of modern ARM Macs be good candidates?
{quote}

These can be switched, but the code compiles for LE; upcoming RISC-V is LE, too. 

Look at https://adoptopenjdk.net/releases.html: I'm pretty sure the only BE 
builds are PPC64 (not PPC64LE) and s390x.
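
(For anyone who wants to check what a given JVM reports, a trivial snippet:)

{code:java}
import java.nio.ByteOrder;

// Prints LITTLE_ENDIAN on x86-64 and AArch64 JVMs; BIG_ENDIAN on s390x and the
// non-LE PPC64 builds mentioned above.
public class PrintByteOrder {
  public static void main(String[] args) {
    System.out.println(ByteOrder.nativeOrder());
  }
}
{code}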

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[GitHub] [lucene] mocobeta merged pull request #883: LUCENE-10561 Reduce class/member visibility of all normalizer and stemmer classes

2022-05-13 Thread GitBox


mocobeta merged PR #883:
URL: https://github.com/apache/lucene/pull/883





[jira] [Commented] (LUCENE-10561) Reduce class/member visibility of ArabicStemmer, ArabicNormalizer, and PersianNormalizer

2022-05-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536960#comment-17536960
 ] 

ASF subversion and git services commented on LUCENE-10561:
--

Commit 694d797526ae1d9dbe65c69eaa52d5531824c560 in lucene's branch 
refs/heads/main from Rushabh Shah
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=694d797526a ]

LUCENE-10561 Reduce class/member visibility of all normalizer and stemmer 
classes (#883)

Co-authored-by: Rushabh Shah 
Co-authored-by: Tomoko Uchida 

> Reduce class/member visibility of ArabicStemmer, ArabicNormalizer, and 
> PersianNormalizer
> 
>
> Key: LUCENE-10561
> URL: https://issues.apache.org/jira/browse/LUCENE-10561
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomoko Uchida
>Priority: Minor
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> This is a spin-off of [LUCENE-10312].
> Constants and methods in those classes are exposed to the outside packages; 
> we should be able to limit the visibility to {{private}} or, at least to 
> {{package private}}.
> This change breaks backward compatibility so should be applied to the main 
> branch (10.0) only, and a MIGRATE entry may be needed.
> Also, they seem unchanged since 2008, we could refactor them to embrace newer 
> Java APIs as we did in [https://github.com/apache/lucene/pull/540]. 
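
As an illustration of the kind of change this implies (class and member names below are only a sketch; see the merged commit for the real diff):

{code:java}
// Before (9.x): the stemmer and its constants are visible to every package.
//
//   public class ArabicStemmer {
//     public static final char ALEF = '\u0627';
//     public int stem(char[] s, int len) { ... }
//   }
//
// After (main/10.0): package-private, so only the owning analysis package sees it.
class ArabicStemmer {
  static final char ALEF = '\u0627';

  int stem(char[] s, int len) {
    // stemming logic unchanged; only the visibility of the class and members narrows
    return len;
  }
}
{code}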






[jira] [Updated] (LUCENE-10561) Reduce class/member visibility of all normalizer and stemmer classes

2022-05-13 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-10561:
---
Summary: Reduce class/member visibility of all normalizer and stemmer 
classes  (was: Reduce class/member visibility of ArabicStemmer, 
ArabicNormalizer, and PersianNormalizer)

> Reduce class/member visibility of all normalizer and stemmer classes
> 
>
> Key: LUCENE-10561
> URL: https://issues.apache.org/jira/browse/LUCENE-10561
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomoko Uchida
>Priority: Minor
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> This is a spin-off of [LUCENE-10312].
> Constants and methods in those classes are exposed to the outside packages; 
> we should be able to limit the visibility to {{private}} or, at least to 
> {{package private}}.
> This change breaks backward compatibility so should be applied to the main 
> branch (10.0) only, and a MIGRATE entry may be needed.
> Also, they seem unchanged since 2008, we could refactor them to embrace newer 
> Java APIs as we did in [https://github.com/apache/lucene/pull/540]. 






[jira] [Resolved] (LUCENE-10561) Reduce class/member visibility of all normalizer and stemmer classes

2022-05-13 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-10561.

Fix Version/s: 10.0 (main)
   Resolution: Fixed

> Reduce class/member visibility of all normalizer and stemmer classes
> 
>
> Key: LUCENE-10561
> URL: https://issues.apache.org/jira/browse/LUCENE-10561
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomoko Uchida
>Priority: Minor
> Fix For: 10.0 (main)
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> This is a spin-off of [LUCENE-10312].
> Constants and methods in those classes are exposed to the outside packages; 
> we should be able to limit the visibility to {{private}} or, at least to 
> {{package private}}.
> This change breaks backward compatibility so should be applied to the main 
> branch (10.0) only, and a MIGRATE entry may be needed.
> Also, they seem unchanged since 2008, we could refactor them to embrace newer 
> Java APIs as we did in [https://github.com/apache/lucene/pull/540]. 






[jira] [Commented] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations

2022-05-13 Thread Yuting Gan (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536968#comment-17536968
 ] 

Yuting Gan commented on LUCENE-10488:
-

Thank you so much for reviewing and merging my PRs! I will work on adding 
getTopDims to benchmarks soon.

> Optimize Facets#getTopDims across Facets implementations
> 
>
> Key: LUCENE-10488
> URL: https://issues.apache.org/jira/browse/LUCENE-10488
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the 
> number of "top" dimensions they want. The default implementation just 
> delegates to {{getAllDims}} and returns the number of top dims requested, but 
> some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated 
> this in {{SortedSetDocValueFacetCounts}}, but we can take it further. There's 
> at least some opportunity to do better in:
> * {{ConcurrentSortedSetDocValuesFacetCounts}}
> * {{FastTaxonomyFacetCounts}}
> * {{TaxonomyFacetSumFloatAssociations}}
> * {{TaxonomyFacetSumIntAssociations}}


