[GitHub] [lucene-solr] iverase opened a new pull request #2131: LUCENE-9552: make sure we don't construct Illegal rectangles due to quantization
iverase opened a new pull request #2131: URL: https://github.com/apache/lucene-solr/pull/2131 This commit adds a correction to minLat/maxLat when the rectangle becomes invalid during quantization. CC: @nknize This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
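For illustration only, here is a minimal sketch of the kind of correction described in the PR (not the actual patch): it quantizes a latitude range with Lucene's GeoEncodingUtils and collapses the range if the encoded minimum ends up above the encoded maximum.

```java
import org.apache.lucene.geo.GeoEncodingUtils;

// Illustrative sketch: quantize a latitude range and keep it a valid rectangle edge.
public class QuantizedLatRange {
  public static double[] quantizeLatRange(double minLat, double maxLat) {
    // encode the bounds "inward" so the quantized box does not grow past the original box
    double qMin = GeoEncodingUtils.decodeLatitude(GeoEncodingUtils.encodeLatitudeCeil(minLat));
    double qMax = GeoEncodingUtils.decodeLatitude(GeoEncodingUtils.encodeLatitude(maxLat));
    if (qMin > qMax) {
      // quantization inverted the range; collapse it to a single valid latitude
      qMin = qMax;
    }
    return new double[] {qMin, qMax};
  }
}
```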
[jira] [Commented] (SOLR-15010) Missing jstack warning is alarming, when using bin/solr as client interface to solr
[ https://issues.apache.org/jira/browse/SOLR-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245741#comment-17245741 ] Jan Høydahl commented on SOLR-15010: +1 to fallback to jattach in bin/solr if jstack is not found. See https://github.com/apangin/jattach. > Missing jstack warning is alarming, when using bin/solr as client interface > to solr > --- > > Key: SOLR-15010 > URL: https://issues.apache.org/jira/browse/SOLR-15010 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 8.7 >Reporter: David Eric Pugh >Priority: Minor > > In SOLR-14442 we added a warning if jstack wasn't found. I notice that I > use the bin/solr command a lot as a client, so bin solr zk or bin solr > healthcheck. > For example: > {{docker exec solr1 solr zk cp /security.json zk:security.json -z zoo1:2181}} > All of these emit the message: > The currently defined JAVA_HOME (/usr/local/openjdk-11) refers to a location > where java was found but jstack was not found. Continuing. > This is somewhat alarming, and then becomes annoying. Thoughts on maybe > only conducting this check if you are running {{bin/solr start}} or one of > the other commands that is actually starting Solr as a process? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] romseygeek commented on a change in pull request #2127: LUCENE-9633: Improve match highlighter behavior for degenerate intervals
romseygeek commented on a change in pull request #2127: URL: https://github.com/apache/lucene-solr/pull/2127#discussion_r538157532 ## File path: lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchRegionRetriever.java ## @@ -361,6 +374,41 @@ public void testIntervalQueries() throws IOException { ); } + @Test + public void testDegenerateIntervalsWithPositions() throws IOException { +testDegenerateIntervals(FLD_TEXT_POS); + } + + @Test @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/LUCENE-9634: " + Review comment: So `extend` will widen the bounds of an interval's positions, but leave its offsets untouched (because it has no way of knowing what the offsets actually are). I sort of think that just highlighting the original term is the correct behaviour? But there will be a discrepancy when we generate offsets directly from the token stream by comparing to positions. I see that ExtendedIntervalIterator's javadoc is incorrect regarding prefixes. It says ``` An interval with prefix bounds extended by n will skip over matches that appear in positions lower than n ``` but it actually just readjusts these matches to start at position 0. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] HoustonPutman commented on pull request #2130: Adding Apache Reporter step in Release Wizard.
HoustonPutman commented on pull request #2130: URL: https://github.com/apache/lucene-solr/pull/2130#issuecomment-740486427 Thanks for checking on that, Anshum. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] HoustonPutman merged pull request #2130: Adding Apache Reporter step in Release Wizard.
HoustonPutman merged pull request #2130: URL: https://github.com/apache/lucene-solr/pull/2130 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-8673) o.a.s.search.facet classes not public/extendable
[ https://issues.apache.org/jira/browse/SOLR-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245758#comment-17245758 ] ASF subversion and git services commented on SOLR-8673: --- Commit 6f357af0c10e0dc3d84cbef4a48fe2ba0b566d7d in lucene-solr's branch refs/heads/branch_8x from Mikhail Khludnev [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=6f357af ] SOLR-8673: fix build. > o.a.s.search.facet classes not public/extendable > > > Key: SOLR-8673 > URL: https://issues.apache.org/jira/browse/SOLR-8673 > Project: Solr > Issue Type: Improvement > Components: Facet Module >Affects Versions: 5.4.1 >Reporter: Markus Jelsma >Priority: Major > Fix For: 6.2, 7.0 > > Attachments: SOLR-8673.patch, SOLR-8673.patch > > Time Spent: 1h 20m > Remaining Estimate: 0h > > It is not easy to create a custom JSON facet function. A simple function > based on AvgAgg quickly results in the following compilation failures: > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-compiler-plugin:3.3:compile (default-compile) > on project openindex-solr: Compilation failure: Compilation failure: > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[22,36] > org.apache.solr.search.facet.FacetContext is not public in > org.apache.solr.search.facet; cannot be accessed from outside package > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[23,36] > org.apache.solr.search.facet.FacetDoubleMerger is not public in > org.apache.solr.search.facet; cannot be accessed from outside package > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[40,32] > cannot find symbol > [ERROR] symbol: class FacetContext > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[49,39] > cannot find symbol > [ERROR] symbol: class FacetDoubleMerger > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[54,43] > cannot find symbol > [ERROR] symbol: class Context > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg.Merger > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[41,16] > cannot find symbol > [ERROR] symbol: class AvgSlotAcc > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[46,12] > incompatible types: i.o.s.search.facet.CustomAvgAgg.Merger cannot be > converted to org.apache.solr.search.facet.FacetMerger > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[53,5] > method does not override or implement a method from a supertype > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[60,5] > method does not override or implement a method from a supertype > {code} > It seems lots of classes are tucked away in FacetModule, which we can't reach > from outside. 
> Originates from this thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201602.mbox/%3ccab_8yd9ldbg_0zxm_h1igkfm6bqeypd5ilyy7tty8cztscv...@mail.gmail.com%3E > ( also available at > https://lists.apache.org/thread.html/9fddcad3136ec908ce1c57881f8d3069e5d153f08b71f80f3e18d995%401455019826%40%3Csolr-user.lucene.apache.org%3E > ) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss commented on a change in pull request #2127: LUCENE-9633: Improve match highlighter behavior for degenerate intervals
dweiss commented on a change in pull request #2127: URL: https://github.com/apache/lucene-solr/pull/2127#discussion_r538162712 ## File path: lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchRegionRetriever.java ## @@ -361,6 +374,41 @@ public void testIntervalQueries() throws IOException { ); } + @Test + public void testDegenerateIntervalsWithPositions() throws IOException { +testDegenerateIntervals(FLD_TEXT_POS); + } + + @Test @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/LUCENE-9634: " + Review comment: > I sort of think that just highlighting the original term is the correct behaviour? Hmm... I don't think I agree. When you have a query parser that allows intervals then extend becomes a function just like anything else. The intuitive user expectation for a query extend(foo 2 2) is to actually highlight the matching interval of positions (well, users think of "words") pointed to by that interval. This is particularly important if you're building more complex expressions out of these (left/ right/ extend, etc.) and you wish to see partial fragments as you're building more focused expressions. I'm not saying this has to be fixed (neither do I know how it should) but it's real feedback from people who use those queries intensively (and my gut feeling agrees). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
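For reference, a minimal sketch (not taken from the PR) of the `extend(foo 2 2)` expression mentioned above, expressed against the lucene-queries intervals API; the field name is hypothetical.

```java
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.queries.intervals.IntervalsSource;
import org.apache.lucene.search.Query;

public class ExtendExample {
  public static Query extendFooByTwo() {
    // widen the positions of the "foo" interval by two positions on each side;
    // the offsets reported for the resulting match are the subject of the discussion above
    IntervalsSource extended = Intervals.extend(Intervals.term("foo"), 2, 2);
    return new IntervalQuery("text_field", extended);
  }
}
```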
[GitHub] [lucene-solr] romseygeek commented on a change in pull request #2127: LUCENE-9633: Improve match highlighter behavior for degenerate intervals
romseygeek commented on a change in pull request #2127: URL: https://github.com/apache/lucene-solr/pull/2127#discussion_r538220443 ## File path: lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchRegionRetriever.java ## @@ -361,6 +374,41 @@ public void testIntervalQueries() throws IOException { ); } + @Test + public void testDegenerateIntervalsWithPositions() throws IOException { +testDegenerateIntervals(FLD_TEXT_POS); + } + + @Test @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/LUCENE-9634: " + Review comment: Fair enough! I originally added `extend` to deal with stopwords and to help implement `before` and `after` filters, but if it's being used elsewhere then that's all good. I'm interested in how it's being exposed in query parsers - we don't actually have it as an option in the elasticsearch intervals DSL but maybe we ought to add it? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-15034) CoreAdmin STATUS should also return config set
Andreas Hubold created SOLR-15034: - Summary: CoreAdmin STATUS should also return config set Key: SOLR-15034 URL: https://issues.apache.org/jira/browse/SOLR-15034 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Affects Versions: 8.6.3 Reporter: Andreas Hubold Currently, the CoreAdmin STATUS response does not return the config set of the core. It would be nice if it could be included in the result. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15034) CoreAdmin STATUS should also return config set
[ https://issues.apache.org/jira/browse/SOLR-15034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245836#comment-17245836 ] Andreas Hubold commented on SOLR-15034: --- I've asked on solr-user mailing list: [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/202012.mbox/%3Ca77c6c99-a62b-0b4b-e63d-4dc851814f34%40coremedia.com%3E] > CoreAdmin STATUS should also return config set > -- > > Key: SOLR-15034 > URL: https://issues.apache.org/jira/browse/SOLR-15034 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 8.6.3 >Reporter: Andreas Hubold >Priority: Major > > Currently, the CoreAdmin STATUS response does not return the config set of > the core. It would be nice if it could be included in the result. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
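For context, a minimal SolrJ sketch of reading the CoreAdmin STATUS response; the `configSet` key is the addition proposed by this issue (it is not returned today), and the core name and base URL are placeholders.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;
import org.apache.solr.common.util.NamedList;

public class CoreStatusExample {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      CoreAdminResponse status = CoreAdminRequest.getStatus("mycore", client);
      NamedList<Object> coreStatus = status.getCoreStatus("mycore");
      // "configSet" is the key this issue proposes to add to the STATUS response
      System.out.println("configSet: " + coreStatus.get("configSet"));
    }
  }
}
```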
[jira] [Commented] (SOLR-8673) o.a.s.search.facet classes not public/extendable
[ https://issues.apache.org/jira/browse/SOLR-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245843#comment-17245843 ] Mikhail Khludnev commented on SOLR-8673: [https://builds.apache.org/job/Lucene/job/Lucene-Solr-Tests-8.x/1027/testReport/org.apache.solr.search.function/AggValueSourceTest/] Fixed. > o.a.s.search.facet classes not public/extendable > > > Key: SOLR-8673 > URL: https://issues.apache.org/jira/browse/SOLR-8673 > Project: Solr > Issue Type: Improvement > Components: Facet Module >Affects Versions: 5.4.1 >Reporter: Markus Jelsma >Priority: Major > Fix For: 6.2, 7.0 > > Attachments: SOLR-8673.patch, SOLR-8673.patch > > Time Spent: 1h 20m > Remaining Estimate: 0h > > It is not easy to create a custom JSON facet function. A simple function > based on AvgAgg quickly results in the following compilation failures: > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-compiler-plugin:3.3:compile (default-compile) > on project openindex-solr: Compilation failure: Compilation failure: > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[22,36] > org.apache.solr.search.facet.FacetContext is not public in > org.apache.solr.search.facet; cannot be accessed from outside package > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[23,36] > org.apache.solr.search.facet.FacetDoubleMerger is not public in > org.apache.solr.search.facet; cannot be accessed from outside package > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[40,32] > cannot find symbol > [ERROR] symbol: class FacetContext > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[49,39] > cannot find symbol > [ERROR] symbol: class FacetDoubleMerger > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[54,43] > cannot find symbol > [ERROR] symbol: class Context > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg.Merger > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[41,16] > cannot find symbol > [ERROR] symbol: class AvgSlotAcc > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[46,12] > incompatible types: i.o.s.search.facet.CustomAvgAgg.Merger cannot be > converted to org.apache.solr.search.facet.FacetMerger > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[53,5] > method does not override or implement a method from a supertype > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[60,5] > method does not override or implement a method from a supertype > {code} > It seems lots of classes are tucked away in FacetModule, which we can't reach > from outside. 
> Originates from this thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201602.mbox/%3ccab_8yd9ldbg_0zxm_h1igkfm6bqeypd5ilyy7tty8cztscv...@mail.gmail.com%3E > ( also available at > https://lists.apache.org/thread.html/9fddcad3136ec908ce1c57881f8d3069e5d153f08b71f80f3e18d995%401455019826%40%3Csolr-user.lucene.apache.org%3E > ) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss commented on a change in pull request #2127: LUCENE-9633: Improve match highlighter behavior for degenerate intervals
dweiss commented on a change in pull request #2127: URL: https://github.com/apache/lucene-solr/pull/2127#discussion_r538313908 ## File path: lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchRegionRetriever.java ## @@ -361,6 +374,41 @@ public void testIntervalQueries() throws IOException { ); } + @Test + public void testDegenerateIntervalsWithPositions() throws IOException { +testDegenerateIntervals(FLD_TEXT_POS); + } + + @Test @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/LUCENE-9634: " + Review comment: It is extremely useful to capture and drill down in the context of another query. Let's say apples nearby oranges. Yes, you can achieve a similar thing with other queries but it's pretty useful on its own (because you can first inspect the context you're looking at by running the extend query in isolation). I've modified the flexible query parser and added those functions as a prefix-scoped "language". Looks like this: https://get.carrotsearch.com/lingo4g/1.12.0-SNAPSHOT/doc/#interval-functions And combined with the matches highlighter it really shines. It's at its best when you get multiple overlapping intervals; I don't have an example on this computer (I have a day off on home duties) but I can send you one later on - you can do some really impressive stuff with intervals! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] munendrasn commented on a change in pull request #2123: SOLR-10732: short circuit calls to searcher#numDocs when base is empty
munendrasn commented on a change in pull request #2123: URL: https://github.com/apache/lucene-solr/pull/2123#discussion_r538315749 ## File path: solr/core/src/java/org/apache/solr/search/facet/FacetProcessor.java ## @@ -419,7 +419,7 @@ void fillBucket(SimpleOrderedMap bucket, Query q, DocSet result, boolean } count = result.size(); // don't really need this if we are skipping, but it's free. } else { - if (q == null) { + if (q == null || fcontext.base.size() == 0) { Review comment: This is done https://github.com/apache/lucene-solr/pull/2123/commits/c194e09ca0d2df32acf21875c7625f9e862fdc09 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] munendrasn commented on a change in pull request #2123: SOLR-10732: short circuit calls to searcher#numDocs when base is empty
munendrasn commented on a change in pull request #2123: URL: https://github.com/apache/lucene-solr/pull/2123#discussion_r538316824 ## File path: solr/core/src/java/org/apache/solr/request/SimpleFacets.java ## @@ -903,7 +910,7 @@ public void execute(Runnable r) { private int numDocs(String term, final SchemaField sf, final FieldType ft, final DocSet baseDocset) { try { - return searcher.numDocs(ft.getFieldQuery(null, sf, term), baseDocset); + return baseDocset.size() == 0? 0: searcher.numDocs(ft.getFieldQuery(null, sf, term), baseDocset); Review comment: sorting by count won't be done if the baseDocSet size is 0, but I have kept this check in numDocs so that any future usage can benefit from it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss commented on pull request #2127: LUCENE-9633: Improve match highlighter behavior for degenerate intervals
dweiss commented on pull request #2127: URL: https://github.com/apache/lucene-solr/pull/2127#issuecomment-740595732 I plan to commit it in (with assume-disabled test involving position+offsets) if nobody objects. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] munendrasn commented on a change in pull request #2123: SOLR-10732: short circuit calls to searcher#numDocs when base is empty
munendrasn commented on a change in pull request #2123: URL: https://github.com/apache/lucene-solr/pull/2123#discussion_r538323161 ## File path: solr/core/src/java/org/apache/solr/request/SimpleFacets.java ## @@ -325,6 +329,9 @@ public void getFacetQueryCount(ParsedParams parsed, NamedList res) thro * @see FacetParams#FACET_QUERY */ public int getGroupedFacetQueryCount(Query facetQuery, DocSet docSet) throws IOException { +if (docSet.size() == 0) { + return 0; +} Review comment: Same as above This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14397) Vector Search in Solr
[ https://issues.apache.org/jira/browse/SOLR-14397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245867#comment-17245867 ] Alessandro Benedetti commented on SOLR-14397: - Should we resume this work, now that https://issues.apache.org/jira/browse/LUCENE-9004 has been officially merged to master? I read it superficially and I have not yet explored the code, but the aforementioned contribution seems quite relevant, potentially is now the right time to redefine the design? > Vector Search in Solr > - > > Key: SOLR-14397 > URL: https://issues.apache.org/jira/browse/SOLR-14397 > Project: Solr > Issue Type: Improvement >Reporter: Trey Grainger >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Search engines have traditionally relied upon token-based matching (typically > keywords) on an inverted index, plus relevance ranking based upon keyword > occurrence statistics. This can be viewed as a "sparse vector” match (where > each term is a one-hot encoded dimension in the vector), since only a few > keywords out of all possible keywords are considered in each query. With the > introduction of deep-learning-based transformers over the last few years, > however, the state of the art in relevance has moved to ranking models based > upon dense vectors that encode a latent, semantic understanding of both > language constructs and the underlying domain upon which the model was > trained. These dense vectors are also referred to as “embeddings”. An example > of this kind of embedding would be taking the phrase “chief executive officer > of the tech company” and converting it to [0.03, 1.7, 9.12, 0, 0.3] > . Other similar phrases should encode to vectors with very similar numbers, > so we may expect a query like “CEO of a technology org” to generate a vector > like [0.1, 1.9, 8.9, 0.1, 0.4]. When performing a cosine similarity > calculation between these vectors, we would expect a number closer to 1.0, > whereas a very unrelated text blurb would generate a much smaller cosine > similarity. > This is a proposal for how we should implement these vector search > capabilities in Solr. > h1. Search Process Overview: > In order to implement dense vector search, the following process is typically > followed: > h2. Offline: > An encoder is built. An encoder can take in text (a query, a sentence, a > paragraph, a document, etc.) and return a dense vector representing that > document in a rich semantic space. The semantic space is learned from > training on textual data (usually, though other sources work, too), typically > from the domain of the search engine. > h2. Document Ingestion: > When documents are processed, they are passed to the encoder, and the dense > vector(s) returned are stored as fields on the document. There could be one > or more vectors per-document, as the granularity of the vectors could be > per-document, per field, per paragraph, per-sentence, or even per phrase or > per term. > h2. Query Time: > *Encoding:* The query is translated to a dense vector by passing it to the > encoder > Quantization: The query is quantized. Quantization is the process of taking > a vector with many values and turning it into “terms” in a vector space that > approximates the full vector space of the dense vectors. > *ANN Matching:* A query on the quantized vector tokens is executed as an ANN > (approximate nearest neighbor) search. 
This allows finding most of the best > matching documents (typically up to 95%) with a traditional and efficient > lookup against the inverted index. > _(optional)_ *ANN Ranking*: ranking may be performed based upon the matched > quantized tokens to get a rough, initial ranking of documents based upon the > similarity of the query and document vectors. This allows the next step > (re-ranking) to be performed on a smaller subset of documents. > *Re-Ranking:* Once the initial matching (and optionally ANN ranking) is > performed, a similarity calculation (cosine, dot-product, or any number of > other calculations) is typically performed between the full (non-quantized) > dense vectors for the query and those in the document. This re-ranking will > typically be on the top-N results for performance reasons. > *Return Results:* As with any search, the final step is typically to return > the results in relevance-ranked order. In this case, that would be sorted by > the re-ranking similarity score (i.e. “cosine descending”). > -- > *Variant:* For small document sets, it may be preferable to rank all > documents and skip steps steps 2, 3, and 4. This is because ANN Matching > typically reduces recall (current state of the art is around 95% recall), so > it can be beneficial to rank all documents if performance is not a concern. > In thi
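As a small illustration of the re-ranking step described above, here is a self-contained cosine similarity calculation over the example embeddings given in the issue text (the vector values are the ones from the description).

```java
public class CosineExample {
  static double cosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    double[] doc = {0.03, 1.7, 9.12, 0.0, 0.3}; // "chief executive officer of the tech company"
    double[] query = {0.1, 1.9, 8.9, 0.1, 0.4}; // "CEO of a technology org"
    System.out.println(cosine(query, doc));     // close to 1.0 for semantically similar text
  }
}
```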
[jira] [Commented] (LUCENE-9629) Use computed mask values in ForUtil
[ https://issues.apache.org/jira/browse/LUCENE-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245883#comment-17245883 ] Feng Guo commented on LUCENE-9629: -- [~jpountz] Sorry to bother you! now we have come into a new week, can you please help merge this PR? > Use computed mask values in ForUtil > --- > > Key: LUCENE-9629 > URL: https://issues.apache.org/jira/browse/LUCENE-9629 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > In the class ForUtil, mask values have been computed and stored in static > final variables, but they are recomputed for every encoding, which may be > unnecessary. > another small fix is to change > {code:java} > remainingBitsPerValue > remainingBitsPerLong{code} > to > {code:java} > remainingBitsPerValue >= remainingBitsPerLong{code} > otherwise > {code:java} > if (remainingBitsPerValue == 0) { > idx++; > remainingBitsPerValue = bitsPerValue; > } > {code} > this code will never be reached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
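A minimal sketch (not ForUtil's actual code) of the first point in the issue: the contrast between rebuilding a mask on every call and reusing a precomputed static final, shown here for the 16-bit lane mask that also appears later in this thread.

```java
public class MaskExample {
  // computed once at class initialization; the JIT can treat static finals as constants
  private static final long MASK16_1 = mask16(1); // 0x0001000100010001L

  private static long mask16(int bits) {
    long mask = (1L << bits) - 1;
    return mask | (mask << 16) | (mask << 32) | (mask << 48);
  }

  static long maskedRecomputed(long value) {
    return value & mask16(1); // mask rebuilt on every call
  }

  static long maskedPrecomputed(long value) {
    return value & MASK16_1; // reuses the precomputed constant
  }
}
```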
[jira] [Comment Edited] (LUCENE-9629) Use computed mask values in ForUtil
[ https://issues.apache.org/jira/browse/LUCENE-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245883#comment-17245883 ] Feng Guo edited comment on LUCENE-9629 at 12/8/20, 1:29 PM: [~jpountz] Sorry to bother you! now we have come into a new week, could you please help merge this PR? was (Author: gf2121): [~jpountz] Sorry to bother you! now we have come into a new week, can you please help merge this PR? > Use computed mask values in ForUtil > --- > > Key: LUCENE-9629 > URL: https://issues.apache.org/jira/browse/LUCENE-9629 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > In the class ForUtil, mask values have been computed and stored in static > final variables, but they are recomputed for every encoding, which may be > unnecessary. > another small fix is to change > {code:java} > remainingBitsPerValue > remainingBitsPerLong{code} > to > {code:java} > remainingBitsPerValue >= remainingBitsPerLong{code} > otherwise > {code:java} > if (remainingBitsPerValue == 0) { > idx++; > remainingBitsPerValue = bitsPerValue; > } > {code} > this code will never be reached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-8673) o.a.s.search.facet classes not public/extendable
[ https://issues.apache.org/jira/browse/SOLR-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245843#comment-17245843 ] Mikhail Khludnev edited comment on SOLR-8673 at 12/8/20, 1:49 PM: -- [https://builds.apache.org/job/Lucene/job/Lucene-Solr-Tests-8.x/1027/testReport/org.apache.solr.search.function/AggValueSourceTest/] https://builds.apache.org/job/Lucene/job/Lucene-Solr-NightlyTests-master/lastCompletedBuild/testReport/org.apache.solr.search.function/AggValueSourceTest/ Fixed. was (Author: mkhludnev): [https://builds.apache.org/job/Lucene/job/Lucene-Solr-Tests-8.x/1027/testReport/org.apache.solr.search.function/AggValueSourceTest/] Fixed. > o.a.s.search.facet classes not public/extendable > > > Key: SOLR-8673 > URL: https://issues.apache.org/jira/browse/SOLR-8673 > Project: Solr > Issue Type: Improvement > Components: Facet Module >Affects Versions: 5.4.1 >Reporter: Markus Jelsma >Priority: Major > Fix For: 6.2, 7.0 > > Attachments: SOLR-8673.patch, SOLR-8673.patch > > Time Spent: 1h 20m > Remaining Estimate: 0h > > It is not easy to create a custom JSON facet function. A simple function > based on AvgAgg quickly results in the following compilation failures: > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-compiler-plugin:3.3:compile (default-compile) > on project openindex-solr: Compilation failure: Compilation failure: > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[22,36] > org.apache.solr.search.facet.FacetContext is not public in > org.apache.solr.search.facet; cannot be accessed from outside package > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[23,36] > org.apache.solr.search.facet.FacetDoubleMerger is not public in > org.apache.solr.search.facet; cannot be accessed from outside package > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[40,32] > cannot find symbol > [ERROR] symbol: class FacetContext > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[49,39] > cannot find symbol > [ERROR] symbol: class FacetDoubleMerger > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[54,43] > cannot find symbol > [ERROR] symbol: class Context > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg.Merger > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[41,16] > cannot find symbol > [ERROR] symbol: class AvgSlotAcc > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[46,12] > incompatible types: i.o.s.search.facet.CustomAvgAgg.Merger cannot be > converted to org.apache.solr.search.facet.FacetMerger > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[53,5] > method does not override or implement a method from a supertype > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[60,5] > method does not override or implement a method from a supertype > {code} > It seems lots of classes are tucked away in FacetModule, which we can't reach > from outside. 
> Originates from this thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201602.mbox/%3ccab_8yd9ldbg_0zxm_h1igkfm6bqeypd5ilyy7tty8cztscv...@mail.gmail.com%3E > ( also available at > https://lists.apache.org/thread.html/9fddcad3136ec908ce1c57881f8d3069e5d153f08b71f80f3e18d995%401455019826%40%3Csolr-user.lucene.apache.org%3E > ) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-8673) o.a.s.search.facet classes not public/extendable
[ https://issues.apache.org/jira/browse/SOLR-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Khludnev updated SOLR-8673: --- Fix Version/s: (was: 6.2) (was: 7.0) 8.8 > o.a.s.search.facet classes not public/extendable > > > Key: SOLR-8673 > URL: https://issues.apache.org/jira/browse/SOLR-8673 > Project: Solr > Issue Type: Improvement > Components: Facet Module >Affects Versions: 5.4.1 >Reporter: Markus Jelsma >Priority: Major > Fix For: 8.8 > > Attachments: SOLR-8673.patch, SOLR-8673.patch > > Time Spent: 1h 20m > Remaining Estimate: 0h > > It is not easy to create a custom JSON facet function. A simple function > based on AvgAgg quickly results in the following compilation failures: > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-compiler-plugin:3.3:compile (default-compile) > on project openindex-solr: Compilation failure: Compilation failure: > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[22,36] > org.apache.solr.search.facet.FacetContext is not public in > org.apache.solr.search.facet; cannot be accessed from outside package > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[23,36] > org.apache.solr.search.facet.FacetDoubleMerger is not public in > org.apache.solr.search.facet; cannot be accessed from outside package > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[40,32] > cannot find symbol > [ERROR] symbol: class FacetContext > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[49,39] > cannot find symbol > [ERROR] symbol: class FacetDoubleMerger > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[54,43] > cannot find symbol > [ERROR] symbol: class Context > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg.Merger > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[41,16] > cannot find symbol > [ERROR] symbol: class AvgSlotAcc > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[46,12] > incompatible types: i.o.s.search.facet.CustomAvgAgg.Merger cannot be > converted to org.apache.solr.search.facet.FacetMerger > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[53,5] > method does not override or implement a method from a supertype > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[60,5] > method does not override or implement a method from a supertype > {code} > It seems lots of classes are tucked away in FacetModule, which we can't reach > from outside. 
> Originates from this thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201602.mbox/%3ccab_8yd9ldbg_0zxm_h1igkfm6bqeypd5ilyy7tty8cztscv...@mail.gmail.com%3E > ( also available at > https://lists.apache.org/thread.html/9fddcad3136ec908ce1c57881f8d3069e5d153f08b71f80f3e18d995%401455019826%40%3Csolr-user.lucene.apache.org%3E > ) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-8673) o.a.s.search.facet classes not public/extendable
[ https://issues.apache.org/jira/browse/SOLR-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Khludnev updated SOLR-8673: --- Assignee: Mikhail Khludnev Resolution: Fixed Status: Resolved (was: Patch Available) > o.a.s.search.facet classes not public/extendable > > > Key: SOLR-8673 > URL: https://issues.apache.org/jira/browse/SOLR-8673 > Project: Solr > Issue Type: Improvement > Components: Facet Module >Affects Versions: 5.4.1 >Reporter: Markus Jelsma >Assignee: Mikhail Khludnev >Priority: Major > Fix For: 8.8 > > Attachments: SOLR-8673.patch, SOLR-8673.patch > > Time Spent: 1h 20m > Remaining Estimate: 0h > > It is not easy to create a custom JSON facet function. A simple function > based on AvgAgg quickly results in the following compilation failures: > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-compiler-plugin:3.3:compile (default-compile) > on project openindex-solr: Compilation failure: Compilation failure: > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[22,36] > org.apache.solr.search.facet.FacetContext is not public in > org.apache.solr.search.facet; cannot be accessed from outside package > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[23,36] > org.apache.solr.search.facet.FacetDoubleMerger is not public in > org.apache.solr.search.facet; cannot be accessed from outside package > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[40,32] > cannot find symbol > [ERROR] symbol: class FacetContext > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[49,39] > cannot find symbol > [ERROR] symbol: class FacetDoubleMerger > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[54,43] > cannot find symbol > [ERROR] symbol: class Context > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg.Merger > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[41,16] > cannot find symbol > [ERROR] symbol: class AvgSlotAcc > [ERROR] location: class i.o.s.search.facet.CustomAvgAgg > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[46,12] > incompatible types: i.o.s.search.facet.CustomAvgAgg.Merger cannot be > converted to org.apache.solr.search.facet.FacetMerger > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[53,5] > method does not override or implement a method from a supertype > [ERROR] > /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[60,5] > method does not override or implement a method from a supertype > {code} > It seems lots of classes are tucked away in FacetModule, which we can't reach > from outside. 
> Originates from this thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201602.mbox/%3ccab_8yd9ldbg_0zxm_h1igkfm6bqeypd5ilyy7tty8cztscv...@mail.gmail.com%3E > ( also available at > https://lists.apache.org/thread.html/9fddcad3136ec908ce1c57881f8d3069e5d153f08b71f80f3e18d995%401455019826%40%3Csolr-user.lucene.apache.org%3E > ) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-15035) core.properties different when using ADDREPLICA .vs. when the replica created with CREATE
Erick Erickson created SOLR-15035: - Summary: core.properties different when using ADDREPLICA .vs. when the replica created with CREATE Key: SOLR-15035 URL: https://issues.apache.org/jira/browse/SOLR-15035 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Components: SolrCloud Affects Versions: 8.7 Reporter: Erick Erickson I verified this after seeing it on the user's list. Here are the core.properties files: Note that numShards is missing from the replica created with ADDREPLICA. If anyone picks this up, there are lots of places in TestCollectionAPI that add a replica and could reach out to the core.properties files and check. What's not clear to me is whether numShards _should_ be in core.properties, but whether or not that's the case, we should be consistent. -Core created via CREATE #Written by CorePropertiesLocator #Tue Dec 08 14:01:13 UTC 2020 coreNodeName=core_node3 collection.configName=_default name=blivet_shard1_replica_n1 numShards=2 shard=shard1 collection=blivet replicaType=NRT [branch_8x] ~/apache/solr/solrtest8/solr/example/cloud/node1/solr$ cat blivet_shard1_replica_n5/core.properties -Core created via ADDREPLICA #Written by CorePropertiesLocator #Tue Dec 08 14:01:20 UTC 2020 coreNodeName=core_node6 collection.configName=_default name=blivet_shard1_replica_n5 shard=shard1 collection=blivet replicaType=NRT -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] gf2121 commented on pull request #2113: LUCENE-9629: Use computed masks
gf2121 commented on pull request #2113: URL: https://github.com/apache/lucene-solr/pull/2113#issuecomment-740674712 > I wonder if the difference in performance is observable since final long values would be inlined at compile time (and easily optimized for hotspot) whereas array accesses, even if locally cached, still have to be dynamic (I don't think the compiler is smart enough to detect constant array values?). Hi @dweiss ! Thers days I did some more benchmarks on this issue and get some 'amazing' result... First, i randomly choosed a decode method `decode15`, and try to find out if it will be slower in an array case. Here is the benchmark code based on JMH: ``` @State(Scope.Benchmark) public class MyBenchmark { private static final long MASK16_1 = 0x0001000100010001L; private static final long[] MASKS16_1 = new long[] {MASK16_1}; private static final long[] TMP = new long[128]; private static final long[] ARR = new long[128]; static { for (int i=0;i<128;i++) { TMP[i] = ARR[i] = i; } } public static void main(String[] args) throws RunnerException { Options opt = new OptionsBuilder() .include("MyBenchmark") .build(); new Runner(opt).run(); } @Benchmark @BenchmarkMode({Mode.Throughput}) @Fork(1) @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS) @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS) public static void decode0() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASKS16_1[0]) << 14; l0 |= (TMP[tmpIdx+1] & MASKS16_1[0]) << 13; l0 |= (TMP[tmpIdx+2] & MASKS16_1[0]) << 12; l0 |= (TMP[tmpIdx+3] & MASKS16_1[0]) << 11; l0 |= (TMP[tmpIdx+4] & MASKS16_1[0]) << 10; l0 |= (TMP[tmpIdx+5] & MASKS16_1[0]) << 9; l0 |= (TMP[tmpIdx+6] & MASKS16_1[0]) << 8; l0 |= (TMP[tmpIdx+7] & MASKS16_1[0]) << 7; l0 |= (TMP[tmpIdx+8] & MASKS16_1[0]) << 6; l0 |= (TMP[tmpIdx+9] & MASKS16_1[0]) << 5; l0 |= (TMP[tmpIdx+10] & MASKS16_1[0]) << 4; l0 |= (TMP[tmpIdx+11] & MASKS16_1[0]) << 3; l0 |= (TMP[tmpIdx+12] & MASKS16_1[0]) << 2; l0 |= (TMP[tmpIdx+13] & MASKS16_1[0]) << 1; l0 |= (TMP[tmpIdx+14] & MASKS16_1[0]) << 0; ARR[longsIdx+0] = l0; } } @Benchmark @BenchmarkMode({Mode.Throughput}) @Fork(1) @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS) @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS) public static void decode1() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASK16_1) << 14; l0 |= (TMP[tmpIdx+1] & MASK16_1) << 13; l0 |= (TMP[tmpIdx+2] & MASK16_1) << 12; l0 |= (TMP[tmpIdx+3] & MASK16_1) << 11; l0 |= (TMP[tmpIdx+4] & MASK16_1) << 10; l0 |= (TMP[tmpIdx+5] & MASK16_1) << 9; l0 |= (TMP[tmpIdx+6] & MASK16_1) << 8; l0 |= (TMP[tmpIdx+7] & MASK16_1) << 7; l0 |= (TMP[tmpIdx+8] & MASK16_1) << 6; l0 |= (TMP[tmpIdx+9] & MASK16_1) << 5; l0 |= (TMP[tmpIdx+10] & MASK16_1) << 4; l0 |= (TMP[tmpIdx+11] & MASK16_1) << 3; l0 |= (TMP[tmpIdx+12] & MASK16_1) << 2; l0 |= (TMP[tmpIdx+13] & MASK16_1) << 1; l0 |= (TMP[tmpIdx+14] & MASK16_1) << 0; ARR[longsIdx+0] = l0; } } @Benchmark @BenchmarkMode({Mode.Throughput}) @Fork(1) @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS) @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS) public static void decode2() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & 0x0001000100010001L) << 14; l0 |= (TMP[tmpIdx+1] & 0x0001000100010001L) << 13; l0 |= (TMP[tmpIdx+2] & 
0x0001000100010001L) << 12; l0 |= (TMP[tmpIdx+3] & 0x0001000100010001L) << 11; l0 |= (TMP[tmpIdx+4] & 0x0001000100010001L) << 10; l0 |= (TMP[tmpIdx+5] & 0x0001000100010001L) << 9; l0 |= (TMP[tmpIdx+6] & 0x0001000100010001L) << 8; l0 |= (TMP[tmpIdx+7] & 0x0001000100010001L) << 7; l0 |= (TMP[tmpIdx+8] & 0x0001000100010001L) << 6; l0 |= (TMP[tmpIdx+9] & 0x0001000100010001L) << 5; l0 |= (TMP[tmpIdx+10] & 0x00010
[GitHub] [lucene-solr] gf2121 edited a comment on pull request #2113: LUCENE-9629: Use computed masks
gf2121 edited a comment on pull request #2113: URL: https://github.com/apache/lucene-solr/pull/2113#issuecomment-740674712 > I wonder if the difference in performance is observable since final long values would be inlined at compile time (and easily optimized for hotspot) whereas array accesses, even if locally cached, still have to be dynamic (I don't think the compiler is smart enough to detect constant array values?). Hi @dweiss ! Thers days I did some more benchmarks on this issue and get some 'amazing' result... First, i randomly choosed a decode method `decode15`, and try to find out if it will be slower in an array case. Here is the benchmark code based on JMH: ``` @State(Scope.Benchmark) public class MyBenchmark { private static final long MASK16_1 = 0x0001000100010001L; private static final long[] MASKS16_1 = new long[] {MASK16_1}; private static final long[] TMP = new long[128]; private static final long[] ARR = new long[128]; static { for (int i=0;i<128;i++) { TMP[i] = ARR[i] = i; } } public static void main(String[] args) throws RunnerException { Options opt = new OptionsBuilder() .include("MyBenchmark") .build(); new Runner(opt).run(); } @Benchmark @BenchmarkMode({Mode.Throughput}) @Fork(1) @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS) @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS) public static void decode0() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASKS16_1[0]) << 14; l0 |= (TMP[tmpIdx+1] & MASKS16_1[0]) << 13; l0 |= (TMP[tmpIdx+2] & MASKS16_1[0]) << 12; l0 |= (TMP[tmpIdx+3] & MASKS16_1[0]) << 11; l0 |= (TMP[tmpIdx+4] & MASKS16_1[0]) << 10; l0 |= (TMP[tmpIdx+5] & MASKS16_1[0]) << 9; l0 |= (TMP[tmpIdx+6] & MASKS16_1[0]) << 8; l0 |= (TMP[tmpIdx+7] & MASKS16_1[0]) << 7; l0 |= (TMP[tmpIdx+8] & MASKS16_1[0]) << 6; l0 |= (TMP[tmpIdx+9] & MASKS16_1[0]) << 5; l0 |= (TMP[tmpIdx+10] & MASKS16_1[0]) << 4; l0 |= (TMP[tmpIdx+11] & MASKS16_1[0]) << 3; l0 |= (TMP[tmpIdx+12] & MASKS16_1[0]) << 2; l0 |= (TMP[tmpIdx+13] & MASKS16_1[0]) << 1; l0 |= (TMP[tmpIdx+14] & MASKS16_1[0]) << 0; ARR[longsIdx+0] = l0; } } @Benchmark @BenchmarkMode({Mode.Throughput}) @Fork(1) @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS) @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS) public static void decode1() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASK16_1) << 14; l0 |= (TMP[tmpIdx+1] & MASK16_1) << 13; l0 |= (TMP[tmpIdx+2] & MASK16_1) << 12; l0 |= (TMP[tmpIdx+3] & MASK16_1) << 11; l0 |= (TMP[tmpIdx+4] & MASK16_1) << 10; l0 |= (TMP[tmpIdx+5] & MASK16_1) << 9; l0 |= (TMP[tmpIdx+6] & MASK16_1) << 8; l0 |= (TMP[tmpIdx+7] & MASK16_1) << 7; l0 |= (TMP[tmpIdx+8] & MASK16_1) << 6; l0 |= (TMP[tmpIdx+9] & MASK16_1) << 5; l0 |= (TMP[tmpIdx+10] & MASK16_1) << 4; l0 |= (TMP[tmpIdx+11] & MASK16_1) << 3; l0 |= (TMP[tmpIdx+12] & MASK16_1) << 2; l0 |= (TMP[tmpIdx+13] & MASK16_1) << 1; l0 |= (TMP[tmpIdx+14] & MASK16_1) << 0; ARR[longsIdx+0] = l0; } } @Benchmark @BenchmarkMode({Mode.Throughput}) @Fork(1) @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS) @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS) public static void decode2() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & 0x0001000100010001L) << 14; l0 |= (TMP[tmpIdx+1] & 0x0001000100010001L) << 13; l0 |= (TMP[tmpIdx+2] & 
0x0001000100010001L) << 12; l0 |= (TMP[tmpIdx+3] & 0x0001000100010001L) << 11; l0 |= (TMP[tmpIdx+4] & 0x0001000100010001L) << 10; l0 |= (TMP[tmpIdx+5] & 0x0001000100010001L) << 9; l0 |= (TMP[tmpIdx+6] & 0x0001000100010001L) << 8; l0 |= (TMP[tmpIdx+7] & 0x0001000100010001L) << 7; l0 |= (TMP[tmpIdx+8] & 0x0001000100010001L) << 6; l0 |= (TMP[tmpIdx+9] & 0x0001000100010001L) << 5; l0 |= (TMP[tmpIdx+10] &
[GitHub] [lucene-solr] gf2121 edited a comment on pull request #2113: LUCENE-9629: Use computed masks
gf2121 edited a comment on pull request #2113: URL: https://github.com/apache/lucene-solr/pull/2113#issuecomment-740674712 > I wonder if the difference in performance is observable since final long values would be inlined at compile time (and easily optimized for hotspot) whereas array accesses, even if locally cached, still have to be dynamic (I don't think the compiler is smart enough to detect constant array values?). Hi @dweiss ! Thers days I did some more benchmarks on this issue and get some 'amazing' result... First, i randomly choosed a decode method `decode15`, and try to find out if it will be slower in an array case. Here is the benchmark code based on JMH: ``` @State(Scope.Benchmark) public class MyBenchmark { private static final long MASK16_1 = 0x0001000100010001L; private static final long[] MASKS16_1 = new long[] {MASK16_1}; private static final long[] TMP = new long[128]; private static final long[] ARR = new long[128]; static { for (int i=0;i<128;i++) { TMP[i] = ARR[i] = i; } } public static void main(String[] args) throws RunnerException { Options opt = new OptionsBuilder() .include("MyBenchmark") .build(); new Runner(opt).run(); } @Benchmark @BenchmarkMode({Mode.Throughput}) @Fork(1) @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS) @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS) public static void decode0() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASKS16_1[0]) << 14; l0 |= (TMP[tmpIdx+1] & MASKS16_1[0]) << 13; l0 |= (TMP[tmpIdx+2] & MASKS16_1[0]) << 12; l0 |= (TMP[tmpIdx+3] & MASKS16_1[0]) << 11; l0 |= (TMP[tmpIdx+4] & MASKS16_1[0]) << 10; l0 |= (TMP[tmpIdx+5] & MASKS16_1[0]) << 9; l0 |= (TMP[tmpIdx+6] & MASKS16_1[0]) << 8; l0 |= (TMP[tmpIdx+7] & MASKS16_1[0]) << 7; l0 |= (TMP[tmpIdx+8] & MASKS16_1[0]) << 6; l0 |= (TMP[tmpIdx+9] & MASKS16_1[0]) << 5; l0 |= (TMP[tmpIdx+10] & MASKS16_1[0]) << 4; l0 |= (TMP[tmpIdx+11] & MASKS16_1[0]) << 3; l0 |= (TMP[tmpIdx+12] & MASKS16_1[0]) << 2; l0 |= (TMP[tmpIdx+13] & MASKS16_1[0]) << 1; l0 |= (TMP[tmpIdx+14] & MASKS16_1[0]) << 0; ARR[longsIdx+0] = l0; } } @Benchmark @BenchmarkMode({Mode.Throughput}) @Fork(1) @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS) @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS) public static void decode1() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASK16_1) << 14; l0 |= (TMP[tmpIdx+1] & MASK16_1) << 13; l0 |= (TMP[tmpIdx+2] & MASK16_1) << 12; l0 |= (TMP[tmpIdx+3] & MASK16_1) << 11; l0 |= (TMP[tmpIdx+4] & MASK16_1) << 10; l0 |= (TMP[tmpIdx+5] & MASK16_1) << 9; l0 |= (TMP[tmpIdx+6] & MASK16_1) << 8; l0 |= (TMP[tmpIdx+7] & MASK16_1) << 7; l0 |= (TMP[tmpIdx+8] & MASK16_1) << 6; l0 |= (TMP[tmpIdx+9] & MASK16_1) << 5; l0 |= (TMP[tmpIdx+10] & MASK16_1) << 4; l0 |= (TMP[tmpIdx+11] & MASK16_1) << 3; l0 |= (TMP[tmpIdx+12] & MASK16_1) << 2; l0 |= (TMP[tmpIdx+13] & MASK16_1) << 1; l0 |= (TMP[tmpIdx+14] & MASK16_1) << 0; ARR[longsIdx+0] = l0; } } @Benchmark @BenchmarkMode({Mode.Throughput}) @Fork(1) @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS) @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS) public static void decode2() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & 0x0001000100010001L) << 14; l0 |= (TMP[tmpIdx+1] & 0x0001000100010001L) << 13; l0 |= (TMP[tmpIdx+2] & 
0x0001000100010001L) << 12; l0 |= (TMP[tmpIdx+3] & 0x0001000100010001L) << 11; l0 |= (TMP[tmpIdx+4] & 0x0001000100010001L) << 10; l0 |= (TMP[tmpIdx+5] & 0x0001000100010001L) << 9; l0 |= (TMP[tmpIdx+6] & 0x0001000100010001L) << 8; l0 |= (TMP[tmpIdx+7] & 0x0001000100010001L) << 7; l0 |= (TMP[tmpIdx+8] & 0x0001000100010001L) << 6; l0 |= (TMP[tmpIdx+9] & 0x0001000100010001L) << 5; l0 |= (TMP[tmpIdx+10] &
[GitHub] [lucene-solr] gf2121 edited a comment on pull request #2113: LUCENE-9629: Use computed masks
gf2121 edited a comment on pull request #2113: URL: https://github.com/apache/lucene-solr/pull/2113#issuecomment-740674712
> I wonder if the difference in performance is observable since final long values would be inlined at compile time (and easily optimized for hotspot) whereas array accesses, even if locally cached, still have to be dynamic (I don't think the compiler is smart enough to detect constant array values?).

Hi @dweiss ! These days I did some more benchmarks on this issue and got some 'amazing' results which I want to share with you :) First, I randomly chose a decode method, `decode15`, and tried to find out whether it would be slower in the array case. Here is the benchmark code, based on JMH:
```
@State(Scope.Benchmark)
public class MyBenchmark {

  private static final long MASK16_1 = 0x0001000100010001L;
  private static final long[] MASKS16_1 = new long[] {MASK16_1};
  private static final long[] TMP = new long[128];
  private static final long[] ARR = new long[128];

  static {
    for (int i = 0; i < 128; i++) {
      TMP[i] = ARR[i] = i;
    }
  }

  public static void main(String[] args) throws RunnerException {
    Options opt = new OptionsBuilder()
        .include("MyBenchmark")
        .build();
    new Runner(opt).run();
  }

  @Benchmark
  @BenchmarkMode({Mode.Throughput})
  @Fork(1)
  @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
  @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
  public static void decode0() {
    for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) {
      long l0 = (TMP[tmpIdx+0] & MASKS16_1[0]) << 14;
      l0 |= (TMP[tmpIdx+1] & MASKS16_1[0]) << 13;
      l0 |= (TMP[tmpIdx+2] & MASKS16_1[0]) << 12;
      l0 |= (TMP[tmpIdx+3] & MASKS16_1[0]) << 11;
      l0 |= (TMP[tmpIdx+4] & MASKS16_1[0]) << 10;
      l0 |= (TMP[tmpIdx+5] & MASKS16_1[0]) << 9;
      l0 |= (TMP[tmpIdx+6] & MASKS16_1[0]) << 8;
      l0 |= (TMP[tmpIdx+7] & MASKS16_1[0]) << 7;
      l0 |= (TMP[tmpIdx+8] & MASKS16_1[0]) << 6;
      l0 |= (TMP[tmpIdx+9] & MASKS16_1[0]) << 5;
      l0 |= (TMP[tmpIdx+10] & MASKS16_1[0]) << 4;
      l0 |= (TMP[tmpIdx+11] & MASKS16_1[0]) << 3;
      l0 |= (TMP[tmpIdx+12] & MASKS16_1[0]) << 2;
      l0 |= (TMP[tmpIdx+13] & MASKS16_1[0]) << 1;
      l0 |= (TMP[tmpIdx+14] & MASKS16_1[0]) << 0;
      ARR[longsIdx+0] = l0;
    }
  }

  @Benchmark
  @BenchmarkMode({Mode.Throughput})
  @Fork(1)
  @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
  @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
  public static void decode1() {
    for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) {
      long l0 = (TMP[tmpIdx+0] & MASK16_1) << 14;
      l0 |= (TMP[tmpIdx+1] & MASK16_1) << 13;
      l0 |= (TMP[tmpIdx+2] & MASK16_1) << 12;
      l0 |= (TMP[tmpIdx+3] & MASK16_1) << 11;
      l0 |= (TMP[tmpIdx+4] & MASK16_1) << 10;
      l0 |= (TMP[tmpIdx+5] & MASK16_1) << 9;
      l0 |= (TMP[tmpIdx+6] & MASK16_1) << 8;
      l0 |= (TMP[tmpIdx+7] & MASK16_1) << 7;
      l0 |= (TMP[tmpIdx+8] & MASK16_1) << 6;
      l0 |= (TMP[tmpIdx+9] & MASK16_1) << 5;
      l0 |= (TMP[tmpIdx+10] & MASK16_1) << 4;
      l0 |= (TMP[tmpIdx+11] & MASK16_1) << 3;
      l0 |= (TMP[tmpIdx+12] & MASK16_1) << 2;
      l0 |= (TMP[tmpIdx+13] & MASK16_1) << 1;
      l0 |= (TMP[tmpIdx+14] & MASK16_1) << 0;
      ARR[longsIdx+0] = l0;
    }
  }

  @Benchmark
  @BenchmarkMode({Mode.Throughput})
  @Fork(1)
  @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
  @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
  public static void decode2() {
    for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) {
      long l0 = (TMP[tmpIdx+0] & 0x0001000100010001L) << 14;
      l0 |= (TMP[tmpIdx+1] & 0x0001000100010001L) << 13;
      l0 |= (TMP[tmpIdx+2] & 0x0001000100010001L) << 12;
      l0 |= (TMP[tmpIdx+3] & 0x0001000100010001L) << 11;
      l0 |= (TMP[tmpIdx+4] & 0x0001000100010001L) << 10;
      l0 |= (TMP[tmpIdx+5] & 0x0001000100010001L) << 9;
      l0 |= (TMP[tmpIdx+6] & 0x0001000100010001L) << 8;
      l0 |= (TMP[tmpIdx+7] & 0x0001000100010001L) << 7;
      l0 |= (TMP[tmpIdx+8] & 0x0001000100010001L) << 6;
      l0 |= (TMP[tmpIdx+9] & 0x0001000100010001L) << 5;
```
[jira] [Created] (SOLR-15036) Use plist automatically for executing a facet expression against a collection alias backed by multiple collections
Timothy Potter created SOLR-15036: - Summary: Use plist automatically for executing a facet expression against a collection alias backed by multiple collections Key: SOLR-15036 URL: https://issues.apache.org/jira/browse/SOLR-15036 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: streaming expressions Reporter: Timothy Potter Assignee: Timothy Potter Attachments: relay-approach.patch For analytics use cases, streaming expressions make it possible to compute basic aggregations (count, min, max, sum, and avg) over massive data sets. Moreover, with massive data sets, it is common to use collection aliases over many underlying collections, for instance time-partitioned aliases backed by a set of collections, each covering a specific time range. In some cases, we can end up with many collections (think 50-60) each with 100's of shards. Aliases help insulate client applications from complex collection topologies on the server side. Let's take a basic facet expression that computes some useful aggregation metrics: {code:java} facet( some_alias, q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), min(a_d), max(a_d), count(*) ) {code} Behind the scenes, the {{FacetStream}} sends a JSON facet request to Solr which then expands the alias to a list of collections. For each collection, the top-level distributed query controller gathers a candidate set of replicas to query and then scatters {{distrib=false}} queries to each replica in the list. For instance, if we have 60 collections with 200 shards each, then this results in 12,000 shard requests from the query controller node to the other nodes in the cluster. The requests are sent in an async manner (see {{SearchHandler}} and {{HttpShardHandler}}) In my testing, we’ve seen cases where we hit 18,000 replicas and these queries don’t always come back in a timely manner. Put simply, this also puts a lot of load on the top-level query controller node in terms of open connections and new object creation. Instead, we can use {{plist}} to send the JSON facet query to each collection in the alias in parallel, which reduces the overhead of each top-level distributed query from 12,000 to 200 in my example above. With this approach, you’ll then need to sort the tuples back from each collection and do a rollup, something like: {code:java} select( rollup( sort( plist( select(facet(coll1,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt), select(facet(coll2,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt) ), by="a_i asc" ), over="a_i", sum(the_sum), avg(the_avg), min(the_min), max(the_max), sum(cnt) ), a_i, sum(the_sum) as the_sum, avg(the_avg) as the_avg, min(the_min) as the_min, max(the_max) as the_max, sum(cnt) as cnt ) {code} One thing to point out is that you can’t just avg. the averages back from each collection in the rollup. It needs to be a *weighted avg.* when rolling up the avg. from each facet expression in the plist. 
However, we have the count per collection, so this is doable but will require some changes to the rollup expression to support weighted average. While this plist approach is doable, it’s a pain for users to have to create the rollup / sort over plist expression for collection aliases. After all, aliases are supposed to hide these types of complexities from client applications! The point of this ticket is to investigate the feasibility of auto-wrapping the facet expression with a rollup / sort / plist when the collection argument is an alias with multiple collections; other stream sources will be considered after facet is proven out. Lastly, I also considered an alternative approach of doing a parallel relay on the server side. The idea is similar to {{plist}} but instead of this being driven on the client side, the {{FacetModule}} can create intermediate queries (I called them {{relay}} queries in my impl.) that help distribute the load. In my example above, there would be 60 such relay queries, each sent to a replica for each collection in the alias, which then sends the {{distrib=false}} queries to each replica. The relay query response handler collects the facet responses from each replica before sending back to the top-level query
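A quick illustration of the weighted-average point above, in plain Java rather than a streaming expression (a hypothetical sketch with made-up numbers, not Solr code): merging per-collection averages only works if each average is weighted by its bucket count.
```
// Hypothetical sketch, not Solr code: merging per-collection (count, avg)
// pairs for one facet bucket. The naive mean of the averages is wrong; the
// correct merge weights each average by its count.
public class WeightedAvgSketch {

  static double weightedAvg(long[] counts, double[] avgs) {
    long totalCount = 0;
    double weightedSum = 0.0;
    for (int i = 0; i < counts.length; i++) {
      totalCount += counts[i];
      weightedSum += avgs[i] * counts[i]; // recovers each collection's sum(a_d)
    }
    return totalCount == 0 ? 0.0 : weightedSum / totalCount;
  }

  public static void main(String[] args) {
    // coll1: 10 docs with avg 2.0, coll2: 990 docs with avg 10.0
    System.out.println(weightedAvg(new long[] {10, 990}, new double[] {2.0, 10.0})); // 9.92
    // the naive (2.0 + 10.0) / 2 = 6.0 would badly misrepresent the alias-wide average
  }
}
```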
[jira] [Commented] (SOLR-15036) Use plist automatically for executing a facet expression against a collection alias backed by multiple collections
[ https://issues.apache.org/jira/browse/SOLR-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246017#comment-17246017 ] Atri Sharma commented on SOLR-15036: I havent looked at the patch yet – but why not do a drill expression and wrap it with the aggregate to be computed? Would that not achieve the objective to push down aggregation to shards? > Use plist automatically for executing a facet expression against a collection > alias backed by multiple collections > -- > > Key: SOLR-15036 > URL: https://issues.apache.org/jira/browse/SOLR-15036 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: streaming expressions >Reporter: Timothy Potter >Assignee: Timothy Potter >Priority: Major > Attachments: relay-approach.patch > > > For analytics use cases, streaming expressions make it possible to compute > basic aggregations (count, min, max, sum, and avg) over massive data sets. > Moreover, with massive data sets, it is common to use collection aliases over > many underlying collections, for instance time-partitioned aliases backed by > a set of collections, each covering a specific time range. In some cases, we > can end up with many collections (think 50-60) each with 100's of shards. > Aliases help insulate client applications from complex collection topologies > on the server side. > Let's take a basic facet expression that computes some useful aggregation > metrics: > {code:java} > facet( > some_alias, > q="*:*", > fl="a_i", > sort="a_i asc", > buckets="a_i", > bucketSorts="count(*) asc", > bucketSizeLimit=1, > sum(a_d), avg(a_d), min(a_d), max(a_d), count(*) > ) > {code} > Behind the scenes, the {{FacetStream}} sends a JSON facet request to Solr > which then expands the alias to a list of collections. For each collection, > the top-level distributed query controller gathers a candidate set of > replicas to query and then scatters {{distrib=false}} queries to each replica > in the list. For instance, if we have 60 collections with 200 shards each, > then this results in 12,000 shard requests from the query controller node to > the other nodes in the cluster. The requests are sent in an async manner (see > {{SearchHandler}} and {{HttpShardHandler}}) In my testing, we’ve seen cases > where we hit 18,000 replicas and these queries don’t always come back in a > timely manner. Put simply, this also puts a lot of load on the top-level > query controller node in terms of open connections and new object creation. > Instead, we can use {{plist}} to send the JSON facet query to each collection > in the alias in parallel, which reduces the overhead of each top-level > distributed query from 12,000 to 200 in my example above. 
With this approach, > you’ll then need to sort the tuples back from each collection and do a > rollup, something like: > {code:java} > select( > rollup( > sort( > plist( > select(facet(coll1,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt), > select(facet(coll2,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt) > ), > by="a_i asc" > ), > over="a_i", > sum(the_sum), avg(the_avg), min(the_min), max(the_max), sum(cnt) > ), > a_i, sum(the_sum) as the_sum, avg(the_avg) as the_avg, min(the_min) as > the_min, max(the_max) as the_max, sum(cnt) as cnt > ) > {code} > One thing to point out is that you can’t just avg. the averages back from > each collection in the rollup. It needs to be a *weighted avg.* when rolling > up the avg. from each facet expression in the plist. However, we have the > count per collection, so this is doable but will require some changes to the > rollup expression to support weighted average. > While this plist approach is doable, it’s a pain for users to have to create > the rollup / sort over plist expression for collection aliases. After all, > aliases are supposed to hide these types of complexities from client > applications! > The point of this ticket is to investigate the feasibility of auto-wrapping > the facet expression with a rollup / sort / plist when the collection > argument is an alias with multiple collections; other stream sources will be > consid
[jira] [Commented] (SOLR-15036) Use plist automatically for executing a facet expression against a collection alias backed by multiple collections
[ https://issues.apache.org/jira/browse/SOLR-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246031#comment-17246031 ] Timothy Potter commented on SOLR-15036: --- [~atri] the patch isn't the solution I'm going for, as I explained in the description ... Regarding drill, I'll have to investigate its performance compared to {{plist}} and {{facet}}. However, since it's based on {{/export}} seems like it would be a lot of I/O out each Solr node instead of just relying on the efficient JSON facet implementation? I certainly don't want to {{/export}} 1B rows to count them when I can just facet instead. What would a {{drill}} expression look like that does the same as my example in the description? > Use plist automatically for executing a facet expression against a collection > alias backed by multiple collections > -- > > Key: SOLR-15036 > URL: https://issues.apache.org/jira/browse/SOLR-15036 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: streaming expressions >Reporter: Timothy Potter >Assignee: Timothy Potter >Priority: Major > Attachments: relay-approach.patch > > > For analytics use cases, streaming expressions make it possible to compute > basic aggregations (count, min, max, sum, and avg) over massive data sets. > Moreover, with massive data sets, it is common to use collection aliases over > many underlying collections, for instance time-partitioned aliases backed by > a set of collections, each covering a specific time range. In some cases, we > can end up with many collections (think 50-60) each with 100's of shards. > Aliases help insulate client applications from complex collection topologies > on the server side. > Let's take a basic facet expression that computes some useful aggregation > metrics: > {code:java} > facet( > some_alias, > q="*:*", > fl="a_i", > sort="a_i asc", > buckets="a_i", > bucketSorts="count(*) asc", > bucketSizeLimit=1, > sum(a_d), avg(a_d), min(a_d), max(a_d), count(*) > ) > {code} > Behind the scenes, the {{FacetStream}} sends a JSON facet request to Solr > which then expands the alias to a list of collections. For each collection, > the top-level distributed query controller gathers a candidate set of > replicas to query and then scatters {{distrib=false}} queries to each replica > in the list. For instance, if we have 60 collections with 200 shards each, > then this results in 12,000 shard requests from the query controller node to > the other nodes in the cluster. The requests are sent in an async manner (see > {{SearchHandler}} and {{HttpShardHandler}}) In my testing, we’ve seen cases > where we hit 18,000 replicas and these queries don’t always come back in a > timely manner. Put simply, this also puts a lot of load on the top-level > query controller node in terms of open connections and new object creation. > Instead, we can use {{plist}} to send the JSON facet query to each collection > in the alias in parallel, which reduces the overhead of each top-level > distributed query from 12,000 to 200 in my example above. 
With this approach, > you’ll then need to sort the tuples back from each collection and do a > rollup, something like: > {code:java} > select( > rollup( > sort( > plist( > select(facet(coll1,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt), > select(facet(coll2,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt) > ), > by="a_i asc" > ), > over="a_i", > sum(the_sum), avg(the_avg), min(the_min), max(the_max), sum(cnt) > ), > a_i, sum(the_sum) as the_sum, avg(the_avg) as the_avg, min(the_min) as > the_min, max(the_max) as the_max, sum(cnt) as cnt > ) > {code} > One thing to point out is that you can’t just avg. the averages back from > each collection in the rollup. It needs to be a *weighted avg.* when rolling > up the avg. from each facet expression in the plist. However, we have the > count per collection, so this is doable but will require some changes to the > rollup expression to support weighted average. > While this plist approach is doable, it’s a pain for users to have to create > the rollup / sort over plist expression fo
[jira] [Commented] (SOLR-10732) potential optimizations in callers of SolrIndexSearcher.numDocs when docset is empty
[ https://issues.apache.org/jira/browse/SOLR-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246106#comment-17246106 ] Michael Gibney commented on SOLR-10732: --- I'm curious, [~munendrasn] -- were you able to perceive a performance benefit with these changes? Where these optimizations are located, afaict they optimize edge cases, and the query-building they prevent (if I'm reading right) is generally pretty lightweight (e.g., {{TermQuery}} ...). It seems like it makes most sense to optimize this kind of thing either at the leaf level (i.e., in {{SolrIndexSearcher.numDocs(...)}} -- already done in SOLR-10727) or maybe also higher up in the program logic, to prune as much execution as possible (and when it's clearer how/why we got the point of having an empty domain). The changes here seem to be building in mid-level "shot in the dark" safeguards, where it's relatively unclear what's going on. By way of contrast (wrt complexity/benefit tradeoff), at the leaf level it looks like {{SolrIndexSearcher.getDocSet(Query, DocSet)}} could be optimized in a way analogous to what SOLR-10727 does for {{SolrIndexSearcher.numDocs(Query, DocSet)}}, avoiding filterCache pollution ... > potential optimizations in callers of SolrIndexSearcher.numDocs when docset > is empty > > > Key: SOLR-10732 > URL: https://issues.apache.org/jira/browse/SOLR-10732 > Project: Solr > Issue Type: Improvement >Reporter: Chris M. Hostetter >Priority: Major > Attachments: SOLR-10732.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > spin off of SOLR-10727... > {quote} > ...why not (also) optimize it slightly higher up and completely avoid the > construction of the Query objects? (and in some cases: additional overhead) > for example: the first usage of {{SolrIndexSearcher.numDocs(Query,DocSet)}} i > found was {{RangeFacetProcessor.rangeCount(DocSet subset,...)}} ... if the > first line of that method was {{if (0 == subset.size()) return 0}} then we'd > not only optimize away the SolrIndexSearcher hit, but also fetching the > SchemaField & building the range query (not to mention the much more > expensive {{getGroupedFacetQueryCount}} in the grouping case) > At a glance, most other callers of > {{SolrIndexSearcher.numDocs(Query,DocSet)}} could be trivially optimize this > way as well -- at a minimum to eliminate Query parsing/construction. > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss commented on a change in pull request #2127: LUCENE-9633: Improve match highlighter behavior for degenerate intervals
dweiss commented on a change in pull request #2127: URL: https://github.com/apache/lucene-solr/pull/2127#discussion_r538796480 ## File path: lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchRegionRetriever.java ## @@ -361,6 +374,41 @@ public void testIntervalQueries() throws IOException { ); } + @Test + public void testDegenerateIntervalsWithPositions() throws IOException { +testDegenerateIntervals(FLD_TEXT_POS); + } + + @Test @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/LUCENE-9634: " + Review comment: I may provide a PR with those query parser changes I made if there's interest - they're not that difficult and they make it possible to use intervals from plain text queries. I'll get to it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob commented on a change in pull request #2121: SOLR-10860: Return proper error code for bad input incase of inplace updates
madrob commented on a change in pull request #2121: URL: https://github.com/apache/lucene-solr/pull/2121#discussion_r538804398 ## File path: solr/core/src/java/org/apache/solr/update/processor/AtomicUpdateDocumentMerger.java ## @@ -143,6 +147,15 @@ public SolrInputDocument merge(final SolrInputDocument fromDoc, SolrInputDocumen return toDoc; } + private static String getID(SolrInputDocument doc, IndexSchema schema) { +String id = ""; Review comment: can we default to `(unknown id)` otherwise the error message will look weird I think. ## File path: solr/core/src/java/org/apache/solr/update/processor/AtomicUpdateDocumentMerger.java ## @@ -553,7 +574,15 @@ private Object getNativeFieldValue(String fieldName, Object val) { return val; } SchemaField sf = schema.getField(fieldName); -return sf.getType().toNativeType(val); +try { + return sf.getType().toNativeType(val); +} catch (SolrException ex) { + throw new SolrException(SolrException.ErrorCode.getErrorCode(ex.code()), + "Error converting field '" + sf.getName() + "'='" +val+"' to native type, msg=" + ex.getMessage(), ex); Review comment: I don't think we want `msg` copied since it will be in the cause anyway. ## File path: solr/core/src/test/org/apache/solr/update/TestInPlaceUpdatesStandalone.java ## @@ -121,6 +123,36 @@ public void deleteAllAndCommit() throws Exception { assertU(commit("softCommit", "false")); } + @Test + public void testUpdateBadRequest() throws Exception { +final long version1 = addAndGetVersion(sdoc("id", "1", "title_s", "first", "inplace_updatable_float", 41), null); +assertU(commit()); + +// invalid value with set operation +SolrException e = expectThrows(SolrException.class, +() -> addAndAssertVersion(version1, "id", "1", "inplace_updatable_float", map("set", "NOT_NUMBER"))); +assertEquals(SolrException.ErrorCode.BAD_REQUEST.code, e.code()); +MatcherAssert.assertThat(e.getMessage(), containsString("For input string: \"NOT_NUMBER\"")); + +// invalid value with inc operation +e = expectThrows(SolrException.class, +() -> addAndAssertVersion(version1, "id", "1", "inplace_updatable_float", map("inc", "NOT_NUMBER"))); +assertEquals(SolrException.ErrorCode.BAD_REQUEST.code, e.code()); +MatcherAssert.assertThat(e.getMessage(), containsString("For input string: \"NOT_NUMBER\"")); + +// inc op with null value +e = expectThrows(SolrException.class, +() -> addAndAssertVersion(version1, "id", "1", "inplace_updatable_float", map("inc", null))); +assertEquals(SolrException.ErrorCode.BAD_REQUEST.code, e.code()); +MatcherAssert.assertThat(e.getMessage(), containsString("Invalid input 'null' for field inplace_updatable_float")); + +e = expectThrows(SolrException.class, +() -> addAndAssertVersion(version1, "id", "1", "inplace_updatable_float", Review comment: This surprises me a little bit that we can't increment a float by an integer amount? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob commented on a change in pull request #2118: SOLR-15031: Prevent null being wrapped in a QueryValueSource
madrob commented on a change in pull request #2118: URL: https://github.com/apache/lucene-solr/pull/2118#discussion_r538812418 ## File path: solr/core/src/java/org/apache/solr/search/FunctionQParser.java ## @@ -361,7 +361,9 @@ protected ValueSource parseValueSource(int flags) throws SyntaxError { ((FunctionQParser)subParser).setParseMultipleSources(true); } Query subQuery = subParser.getQuery(); - if (subQuery instanceof FunctionQuery) { + if (subQuery == null) { +valueSource = new DoubleConstValueSource(0.0f); + } else if (subQuery instanceof FunctionQuery) { valueSource = ((FunctionQuery) subQuery).getValueSource(); } else { valueSource = new QueryValueSource(subQuery, 0.0f); Review comment: Should we add a test in QueryValueSource constructor to require non-null? ## File path: solr/core/src/java/org/apache/solr/search/FunctionQParser.java ## @@ -361,7 +361,9 @@ protected ValueSource parseValueSource(int flags) throws SyntaxError { ((FunctionQParser)subParser).setParseMultipleSources(true); } Query subQuery = subParser.getQuery(); - if (subQuery instanceof FunctionQuery) { + if (subQuery == null) { +valueSource = new DoubleConstValueSource(0.0f); Review comment: Why a Double? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
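If a constructor-level guard is added along the lines madrob suggests, it could be as small as a `requireNonNull` check. The sketch below is illustrative only: a stand-in class, not the actual Lucene `QueryValueSource`.
```
import java.util.Objects;

// Illustrative stand-in, not the real Lucene class: fail fast when a null
// query is passed, instead of letting it surface later as an NPE at search time.
public class QueryBackedValueSource {
  private final Object query;  // stands in for org.apache.lucene.search.Query
  private final float defVal;

  public QueryBackedValueSource(Object query, float defVal) {
    this.query = Objects.requireNonNull(query, "query must not be null");
    this.defVal = defVal;
  }

  public static void main(String[] args) {
    new QueryBackedValueSource(new Object(), 0.0f); // ok
    new QueryBackedValueSource(null, 0.0f);         // throws NullPointerException with the message above
  }
}
```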
[GitHub] [lucene-solr] madrob commented on pull request #2118: SOLR-15031: Prevent null being wrapped in a QueryValueSource
madrob commented on pull request #2118: URL: https://github.com/apache/lucene-solr/pull/2118#issuecomment-741045328 Overall the fix is definitely good, and I think it's correct, just a few minor questions about it for completeness. Thank you for opening the PR! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15036) Use plist automatically for executing a facet expression against a collection alias backed by multiple collections
[ https://issues.apache.org/jira/browse/SOLR-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246157#comment-17246157 ] Joel Bernstein commented on SOLR-15036: --- I can comment on the drill vs facet. Facet will always be faster than drill except in the high cardinality use case. Drill really shines in the high cardinality use case though. Rather than sending all tuples to the aggregator node, drill can first aggregate inside of the export handler and compress the result significantly before hitting the network. And drill never runs out of memory. More work is coming that improves the export handler performance by about 300%. But even this improvement doesn't allow drill to match the speed of facet on low cardinality aggregations. > Use plist automatically for executing a facet expression against a collection > alias backed by multiple collections > -- > > Key: SOLR-15036 > URL: https://issues.apache.org/jira/browse/SOLR-15036 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: streaming expressions >Reporter: Timothy Potter >Assignee: Timothy Potter >Priority: Major > Attachments: relay-approach.patch > > > For analytics use cases, streaming expressions make it possible to compute > basic aggregations (count, min, max, sum, and avg) over massive data sets. > Moreover, with massive data sets, it is common to use collection aliases over > many underlying collections, for instance time-partitioned aliases backed by > a set of collections, each covering a specific time range. In some cases, we > can end up with many collections (think 50-60) each with 100's of shards. > Aliases help insulate client applications from complex collection topologies > on the server side. > Let's take a basic facet expression that computes some useful aggregation > metrics: > {code:java} > facet( > some_alias, > q="*:*", > fl="a_i", > sort="a_i asc", > buckets="a_i", > bucketSorts="count(*) asc", > bucketSizeLimit=1, > sum(a_d), avg(a_d), min(a_d), max(a_d), count(*) > ) > {code} > Behind the scenes, the {{FacetStream}} sends a JSON facet request to Solr > which then expands the alias to a list of collections. For each collection, > the top-level distributed query controller gathers a candidate set of > replicas to query and then scatters {{distrib=false}} queries to each replica > in the list. For instance, if we have 60 collections with 200 shards each, > then this results in 12,000 shard requests from the query controller node to > the other nodes in the cluster. The requests are sent in an async manner (see > {{SearchHandler}} and {{HttpShardHandler}}) In my testing, we’ve seen cases > where we hit 18,000 replicas and these queries don’t always come back in a > timely manner. Put simply, this also puts a lot of load on the top-level > query controller node in terms of open connections and new object creation. > Instead, we can use {{plist}} to send the JSON facet query to each collection > in the alias in parallel, which reduces the overhead of each top-level > distributed query from 12,000 to 200 in my example above. 
With this approach, > you’ll then need to sort the tuples back from each collection and do a > rollup, something like: > {code:java} > select( > rollup( > sort( > plist( > select(facet(coll1,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt), > select(facet(coll2,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt) > ), > by="a_i asc" > ), > over="a_i", > sum(the_sum), avg(the_avg), min(the_min), max(the_max), sum(cnt) > ), > a_i, sum(the_sum) as the_sum, avg(the_avg) as the_avg, min(the_min) as > the_min, max(the_max) as the_max, sum(cnt) as cnt > ) > {code} > One thing to point out is that you can’t just avg. the averages back from > each collection in the rollup. It needs to be a *weighted avg.* when rolling > up the avg. from each facet expression in the plist. However, we have the > count per collection, so this is doable but will require some changes to the > rollup expression to support weighted average. > While this plist approach is doable, it’s a pain for users to have to create > the
[GitHub] [lucene-solr] madrob commented on pull request #2113: LUCENE-9629: Use computed masks
madrob commented on pull request #2113: URL: https://github.com/apache/lucene-solr/pull/2113#issuecomment-741047880 You need to either return a value from the benchmark methods or call blackhole.consume, otherwise the JVM will detect that everything is unused outside of the scope and optimize it away. That should get you some different results. Thank you for being thorough! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
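For reference, a minimal sketch of the two options madrob describes, using standard JMH APIs. The decode loop below is a simplified stand-in for the `decode15`-style methods in the benchmark above, not the exact code: either return the computed value (JMH consumes it implicitly) or hand it to a `Blackhole` so the JIT cannot prove the work is dead.
```
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

// Simplified stand-in for the decode benchmark, showing how to keep JMH from
// dead-code-eliminating the decode work.
@State(Scope.Benchmark)
public class DecodeBenchmarkFixed {

  private static final long MASK16_1 = 0x0001000100010001L;
  private final long[] tmp = new long[128];
  private final long[] arr = new long[128];

  // Option 1: return the computed value so JMH consumes it.
  @Benchmark
  @BenchmarkMode(Mode.Throughput)
  @Fork(1)
  public long decodeReturning() {
    long l0 = 0;
    for (int i = 0; i < 15; i++) {
      l0 |= (tmp[i] & MASK16_1) << (14 - i);
    }
    return l0;
  }

  // Option 2: hand the result to a Blackhole so nothing looks unused.
  @Benchmark
  @BenchmarkMode(Mode.Throughput)
  @Fork(1)
  public void decodeConsuming(Blackhole bh) {
    long l0 = 0;
    for (int i = 0; i < 15; i++) {
      l0 |= (tmp[i] & MASK16_1) << (14 - i);
    }
    arr[0] = l0;
    bh.consume(arr);
  }
}
```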
[jira] [Comment Edited] (SOLR-15036) Use plist automatically for executing a facet expression against a collection alias backed by multiple collections
[ https://issues.apache.org/jira/browse/SOLR-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246157#comment-17246157 ] Joel Bernstein edited comment on SOLR-15036 at 12/8/20, 9:18 PM: - I can comment on the drill vs facet question. Facet will always be faster than drill except in the high cardinality use case. Drill really shines in the high cardinality use case though. Rather than sending all tuples to the aggregator node, drill can first aggregate inside of the export handler and compress the result significantly before hitting the network. And drill never runs out of memory. More work is coming that improves the export handler performance by about 300%. But even this improvement doesn't allow drill to match the speed of facet on low cardinality aggregations. was (Author: joel.bernstein): I can comment on the drill vs facet. Facet will always be faster than drill except in the high cardinality use case. Drill really shines in the high cardinality use case though. Rather than sending all tuples to the aggregator node, drill can first aggregate inside of the export handler and compress the result significantly before hitting the network. And drill never runs out of memory. More work is coming that improves the export handler performance by about 300%. But even this improvement doesn't allow drill to match the speed of facet on low cardinality aggregations. > Use plist automatically for executing a facet expression against a collection > alias backed by multiple collections > -- > > Key: SOLR-15036 > URL: https://issues.apache.org/jira/browse/SOLR-15036 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: streaming expressions >Reporter: Timothy Potter >Assignee: Timothy Potter >Priority: Major > Attachments: relay-approach.patch > > > For analytics use cases, streaming expressions make it possible to compute > basic aggregations (count, min, max, sum, and avg) over massive data sets. > Moreover, with massive data sets, it is common to use collection aliases over > many underlying collections, for instance time-partitioned aliases backed by > a set of collections, each covering a specific time range. In some cases, we > can end up with many collections (think 50-60) each with 100's of shards. > Aliases help insulate client applications from complex collection topologies > on the server side. > Let's take a basic facet expression that computes some useful aggregation > metrics: > {code:java} > facet( > some_alias, > q="*:*", > fl="a_i", > sort="a_i asc", > buckets="a_i", > bucketSorts="count(*) asc", > bucketSizeLimit=1, > sum(a_d), avg(a_d), min(a_d), max(a_d), count(*) > ) > {code} > Behind the scenes, the {{FacetStream}} sends a JSON facet request to Solr > which then expands the alias to a list of collections. For each collection, > the top-level distributed query controller gathers a candidate set of > replicas to query and then scatters {{distrib=false}} queries to each replica > in the list. For instance, if we have 60 collections with 200 shards each, > then this results in 12,000 shard requests from the query controller node to > the other nodes in the cluster. The requests are sent in an async manner (see > {{SearchHandler}} and {{HttpShardHandler}}) In my testing, we’ve seen cases > where we hit 18,000 replicas and these queries don’t always come back in a > timely manner. 
Put simply, this also puts a lot of load on the top-level > query controller node in terms of open connections and new object creation. > Instead, we can use {{plist}} to send the JSON facet query to each collection > in the alias in parallel, which reduces the overhead of each top-level > distributed query from 12,000 to 200 in my example above. With this approach, > you’ll then need to sort the tuples back from each collection and do a > rollup, something like: > {code:java} > select( > rollup( > sort( > plist( > select(facet(coll1,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt), > select(facet(coll2,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt) > ), > by="a_i asc" > ), > ove
[jira] [Commented] (SOLR-14688) First party package implementation design
[ https://issues.apache.org/jira/browse/SOLR-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246171#comment-17246171 ] Noble Paul commented on SOLR-14688: --- Yes David, that's a missing piece. When there are multiple versions of a package available, Solr should pick up the compatible version eg: package v1 is compatible with solr 8.5 to 8.8 and package v2 is compatible with solr 8.9 to solr 9.5. if a node is started with solr 8.8, it should use v1 and if a node is started with solr 9, it should pick package v2 > First party package implementation design > - > > Key: SOLR-14688 > URL: https://issues.apache.org/jira/browse/SOLR-14688 > Project: Solr > Issue Type: Improvement >Reporter: Noble Paul >Priority: Major > Labels: package, packagemanager > > Here's the design document for first party packages: > https://docs.google.com/document/d/1n7gB2JAdZhlJKFrCd4Txcw4HDkdk7hlULyAZBS-wXrE/edit?usp=sharing > Put differently, this is about package-ifying our "contribs". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
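A hypothetical sketch of the selection rule described above, with illustrative names only (not the actual package manager code): among the available package versions, choose the newest one whose declared Solr compatibility range contains the node's version.
```
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of picking a compatible package version; Solr versions
// are encoded as ints (85 = 8.5, 90 = 9.0) purely for illustration.
public class PackageVersionPicker {

  record PackageVersion(String version, int minSolr, int maxSolr) {}

  static Optional<PackageVersion> pick(List<PackageVersion> available, int nodeSolrVersion) {
    return available.stream()
        .filter(p -> nodeSolrVersion >= p.minSolr() && nodeSolrVersion <= p.maxSolr())
        .max((a, b) -> a.version().compareTo(b.version()));
  }

  public static void main(String[] args) {
    List<PackageVersion> versions = List.of(
        new PackageVersion("v1", 85, 88),  // compatible with Solr 8.5 - 8.8
        new PackageVersion("v2", 89, 95)); // compatible with Solr 8.9 - 9.5
    System.out.println(pick(versions, 88)); // selects v1
    System.out.println(pick(versions, 90)); // selects v2
  }
}
```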
[jira] [Commented] (SOLR-15036) Use plist automatically for executing a facet expression against a collection alias backed by multiple collections
[ https://issues.apache.org/jira/browse/SOLR-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246172#comment-17246172 ] Michael Gibney commented on SOLR-15036: --- [~thelabdude], you mention "JSON facet implementation", which I gather (according to the refguide) is under the hood of the [facet streaming expression|https://lucene.apache.org/solr/guide/8_7/stream-source-reference.html#facet]. [~jbernste], you imply that "facet" sends "all tuples to the aggregator node". I'm confused here, because that implication contradicts my understanding of what the "JSON facet" implementation does (i.e., shard-level aggregation first, merging on coordinator node, optional shard-level refinement). Perhaps I'm missing something about the specific way in which the {{facet}} streaming expression wraps "JSON facet" functionality? Also, when you say "high cardinality use case", roughly how high is "high", and are you referring to high cardinality wrt DocSet domain size, or number of unique values in a field? > Use plist automatically for executing a facet expression against a collection > alias backed by multiple collections > -- > > Key: SOLR-15036 > URL: https://issues.apache.org/jira/browse/SOLR-15036 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: streaming expressions >Reporter: Timothy Potter >Assignee: Timothy Potter >Priority: Major > Attachments: relay-approach.patch > > > For analytics use cases, streaming expressions make it possible to compute > basic aggregations (count, min, max, sum, and avg) over massive data sets. > Moreover, with massive data sets, it is common to use collection aliases over > many underlying collections, for instance time-partitioned aliases backed by > a set of collections, each covering a specific time range. In some cases, we > can end up with many collections (think 50-60) each with 100's of shards. > Aliases help insulate client applications from complex collection topologies > on the server side. > Let's take a basic facet expression that computes some useful aggregation > metrics: > {code:java} > facet( > some_alias, > q="*:*", > fl="a_i", > sort="a_i asc", > buckets="a_i", > bucketSorts="count(*) asc", > bucketSizeLimit=1, > sum(a_d), avg(a_d), min(a_d), max(a_d), count(*) > ) > {code} > Behind the scenes, the {{FacetStream}} sends a JSON facet request to Solr > which then expands the alias to a list of collections. For each collection, > the top-level distributed query controller gathers a candidate set of > replicas to query and then scatters {{distrib=false}} queries to each replica > in the list. For instance, if we have 60 collections with 200 shards each, > then this results in 12,000 shard requests from the query controller node to > the other nodes in the cluster. The requests are sent in an async manner (see > {{SearchHandler}} and {{HttpShardHandler}}) In my testing, we’ve seen cases > where we hit 18,000 replicas and these queries don’t always come back in a > timely manner. Put simply, this also puts a lot of load on the top-level > query controller node in terms of open connections and new object creation. > Instead, we can use {{plist}} to send the JSON facet query to each collection > in the alias in parallel, which reduces the overhead of each top-level > distributed query from 12,000 to 200 in my example above. 
With this approach, > you’ll then need to sort the tuples back from each collection and do a > rollup, something like: > {code:java} > select( > rollup( > sort( > plist( > select(facet(coll1,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt), > select(facet(coll2,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt) > ), > by="a_i asc" > ), > over="a_i", > sum(the_sum), avg(the_avg), min(the_min), max(the_max), sum(cnt) > ), > a_i, sum(the_sum) as the_sum, avg(the_avg) as the_avg, min(the_min) as > the_min, max(the_max) as the_max, sum(cnt) as cnt > ) > {code} > One thing to point out is that you can’t just avg. the averages back from > each collection in the rollup. It needs to be a *weighted avg.* when rolling > up the avg. from each fa
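To make the shard-level aggregation and coordinator merge referenced above concrete, here is a toy sketch in plain Java (not the JSON facet implementation): each shard reports its own bucket counts and the coordinator sums them per bucket; refinement only matters when shards truncate their per-shard bucket lists and a bucket's count might be missing from some shard.
```
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch, not Solr code: merging per-shard facet bucket counts on the
// coordinator by summing counts for matching bucket values.
public class FacetMergeSketch {

  static Map<String, Long> merge(List<Map<String, Long>> perShardCounts) {
    Map<String, Long> merged = new HashMap<>();
    for (Map<String, Long> shard : perShardCounts) {
      shard.forEach((bucket, cnt) -> merged.merge(bucket, cnt, Long::sum));
    }
    return merged;
  }

  public static void main(String[] args) {
    System.out.println(merge(List.of(
        Map.of("red", 10L, "blue", 3L),
        Map.of("red", 7L, "green", 2L))));
    // red=17, blue=3, green=2 (iteration order may vary)
  }
}
```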
[GitHub] [lucene-solr] tflobbe commented on a change in pull request #2120: SOLR-15029 More gracefully give up shard leadership
tflobbe commented on a change in pull request #2120: URL: https://github.com/apache/lucene-solr/pull/2120#discussion_r538843558 ## File path: solr/core/src/java/org/apache/solr/handler/admin/CollectionsHandler.java ## @@ -1306,7 +1306,7 @@ private static void forceLeaderElection(SolrQueryRequest req, CollectionsHandler try (ZkShardTerms zkShardTerms = new ZkShardTerms(collectionName, slice.getName(), zkController.getZkClient())) { // if an active replica is the leader, then all is fine already Replica leader = slice.getLeader(); - if (leader != null && leader.getState() == State.ACTIVE) { + if (leader != null && leader.getState() == State.ACTIVE && zkShardTerms.getHighestTerm() == zkShardTerms.getTerm(leader.getName())) { Review comment: I know this is not new code, but should we change `leader.getState() == State.ACTIVE` to `leader.isActive(liveNodes)`? ## File path: solr/core/src/java/org/apache/solr/util/TestInjection.java ## @@ -337,6 +342,39 @@ public static boolean injectFailUpdateRequests() { return true; } + + public static boolean injectLeaderTragedy(SolrCore core) { Review comment: What's the point of the return value? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15036) Use plist automatically for executing a facet expression against a collection alias backed by multiple collections
[ https://issues.apache.org/jira/browse/SOLR-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246198#comment-17246198 ] Joel Bernstein commented on SOLR-15036: --- In the high cardinality use case, faceting will eventually run into performance and memory problems, so I don't really consider it a great high cardinality solution. Not because it sends all tuples to aggregator nodes, but because it's an in-memory aggregation. I was comparing Streaming Expressions, prior to drill, when I mentioned sending all tuples to the aggregator nodes. Streaming Expressions, prior to drill, could use the export handler to send all sorted tuples to the aggregator node and accomplish high cardinality aggregation. So, drill improves on previous implementations of Streaming Expressions by first aggregating inside the export handler. > Use plist automatically for executing a facet expression against a collection > alias backed by multiple collections > -- > > Key: SOLR-15036 > URL: https://issues.apache.org/jira/browse/SOLR-15036 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: streaming expressions >Reporter: Timothy Potter >Assignee: Timothy Potter >Priority: Major > Attachments: relay-approach.patch > > > For analytics use cases, streaming expressions make it possible to compute > basic aggregations (count, min, max, sum, and avg) over massive data sets. > Moreover, with massive data sets, it is common to use collection aliases over > many underlying collections, for instance time-partitioned aliases backed by > a set of collections, each covering a specific time range. In some cases, we > can end up with many collections (think 50-60) each with 100's of shards. > Aliases help insulate client applications from complex collection topologies > on the server side. > Let's take a basic facet expression that computes some useful aggregation > metrics: > {code:java} > facet( > some_alias, > q="*:*", > fl="a_i", > sort="a_i asc", > buckets="a_i", > bucketSorts="count(*) asc", > bucketSizeLimit=1, > sum(a_d), avg(a_d), min(a_d), max(a_d), count(*) > ) > {code} > Behind the scenes, the {{FacetStream}} sends a JSON facet request to Solr > which then expands the alias to a list of collections. For each collection, > the top-level distributed query controller gathers a candidate set of > replicas to query and then scatters {{distrib=false}} queries to each replica > in the list. For instance, if we have 60 collections with 200 shards each, > then this results in 12,000 shard requests from the query controller node to > the other nodes in the cluster. The requests are sent in an async manner (see > {{SearchHandler}} and {{HttpShardHandler}}) In my testing, we’ve seen cases > where we hit 18,000 replicas and these queries don’t always come back in a > timely manner. Put simply, this also puts a lot of load on the top-level > query controller node in terms of open connections and new object creation. > Instead, we can use {{plist}} to send the JSON facet query to each collection > in the alias in parallel, which reduces the overhead of each top-level > distributed query from 12,000 to 200 in my example above. 
With this approach, > you’ll then need to sort the tuples back from each collection and do a > rollup, something like: > {code:java} > select( > rollup( > sort( > plist( > select(facet(coll1,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt), > select(facet(coll2,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt) > ), > by="a_i asc" > ), > over="a_i", > sum(the_sum), avg(the_avg), min(the_min), max(the_max), sum(cnt) > ), > a_i, sum(the_sum) as the_sum, avg(the_avg) as the_avg, min(the_min) as > the_min, max(the_max) as the_max, sum(cnt) as cnt > ) > {code} > One thing to point out is that you can’t just avg. the averages back from > each collection in the rollup. It needs to be a *weighted avg.* when rolling > up the avg. from each facet expression in the plist. However, we have the > count per collection, so this is doable but will require some changes to the > rollup expression to support weighted average. >
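A toy sketch of why aggregating over an already-sorted stream (the drill / export-handler approach described above) stays memory-bounded, in plain Java with made-up tuple types: only the current group's running totals are held, no matter how many distinct keys stream past.
```
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Toy sketch, not Solr code: rolling up a stream of (key, value) tuples that
// is already sorted by key, emitting each group as soon as the key changes.
public class SortedRollupSketch {

  static void rollup(Iterator<Map.Entry<String, Double>> sortedTuples) {
    String currentKey = null;
    double sum = 0;
    long count = 0;
    while (sortedTuples.hasNext()) {
      Map.Entry<String, Double> t = sortedTuples.next();
      if (currentKey != null && !currentKey.equals(t.getKey())) {
        System.out.println(currentKey + " -> sum=" + sum + " count=" + count);
        sum = 0;
        count = 0;
      }
      currentKey = t.getKey();
      sum += t.getValue();
      count++;
    }
    if (currentKey != null) {
      System.out.println(currentKey + " -> sum=" + sum + " count=" + count);
    }
  }

  public static void main(String[] args) {
    rollup(List.of(
        Map.entry("a", 1.0), Map.entry("a", 2.0),
        Map.entry("b", 5.0)).iterator());
    // prints: a -> sum=3.0 count=2, then b -> sum=5.0 count=1
  }
}
```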
[jira] [Comment Edited] (SOLR-15036) Use plist automatically for executing a facet expression against a collection alias backed by multiple collections
[ https://issues.apache.org/jira/browse/SOLR-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246157#comment-17246157 ] Joel Bernstein edited comment on SOLR-15036 at 12/9/20, 12:21 AM: -- I can comment on the drill vs facet question. Facet will always be faster than drill except in the high cardinality use case. Drill really shines in the high cardinality use case though. Rather than sending all tuples to the aggregator node, and using the rollup Stream, drill can first aggregate inside of the export handler and compress the result significantly before hitting the network. And drill never runs out of memory, where faceting will eventually run out of memory. More work is coming that improves the export handler performance by about 300%. But even this improvement doesn't allow drill to match the speed of facet on low cardinality aggregations. was (Author: joel.bernstein): I can comment on the drill vs facet question. Facet will always be faster than drill except in the high cardinality use case. Drill really shines in the high cardinality use case though. Rather than sending all tuples to the aggregator node, drill can first aggregate inside of the export handler and compress the result significantly before hitting the network. And drill never runs out of memory. More work is coming that improves the export handler performance by about 300%. But even this improvement doesn't allow drill to match the speed of facet on low cardinality aggregations. > Use plist automatically for executing a facet expression against a collection > alias backed by multiple collections > -- > > Key: SOLR-15036 > URL: https://issues.apache.org/jira/browse/SOLR-15036 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: streaming expressions >Reporter: Timothy Potter >Assignee: Timothy Potter >Priority: Major > Attachments: relay-approach.patch > > > For analytics use cases, streaming expressions make it possible to compute > basic aggregations (count, min, max, sum, and avg) over massive data sets. > Moreover, with massive data sets, it is common to use collection aliases over > many underlying collections, for instance time-partitioned aliases backed by > a set of collections, each covering a specific time range. In some cases, we > can end up with many collections (think 50-60) each with 100's of shards. > Aliases help insulate client applications from complex collection topologies > on the server side. > Let's take a basic facet expression that computes some useful aggregation > metrics: > {code:java} > facet( > some_alias, > q="*:*", > fl="a_i", > sort="a_i asc", > buckets="a_i", > bucketSorts="count(*) asc", > bucketSizeLimit=1, > sum(a_d), avg(a_d), min(a_d), max(a_d), count(*) > ) > {code} > Behind the scenes, the {{FacetStream}} sends a JSON facet request to Solr > which then expands the alias to a list of collections. For each collection, > the top-level distributed query controller gathers a candidate set of > replicas to query and then scatters {{distrib=false}} queries to each replica > in the list. For instance, if we have 60 collections with 200 shards each, > then this results in 12,000 shard requests from the query controller node to > the other nodes in the cluster. The requests are sent in an async manner (see > {{SearchHandler}} and {{HttpShardHandler}}) In my testing, we’ve seen cases > where we hit 18,000 replicas and these queries don’t always come back in a > timely manner. 
Put simply, this also puts a lot of load on the top-level > query controller node in terms of open connections and new object creation. > Instead, we can use {{plist}} to send the JSON facet query to each collection > in the alias in parallel, which reduces the overhead of each top-level > distributed query from 12,000 to 200 in my example above. With this approach, > you’ll then need to sort the tuples back from each collection and do a > rollup, something like: > {code:java} > select( > rollup( > sort( > plist( > select(facet(coll1,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt), > select(facet(coll2,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_mi
[jira] [Comment Edited] (SOLR-15036) Use plist automatically for executing a facet expression against a collection alias backed by multiple collections
[ https://issues.apache.org/jira/browse/SOLR-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246198#comment-17246198 ] Joel Bernstein edited comment on SOLR-15036 at 12/9/20, 12:22 AM: -- In the high cardinality use case, faceting will eventually run into performance and memory problems, so I don't really consider it a great high cardinality solution. Not because it sends all tuples to aggregator nodes, but because it's an in-memory aggregation. I was comparing Streaming Expressions, prior to drill, when I mentioned sending all tuples to the aggregator nodes. Streaming Expressions, prior to drill, could use the export handler to send all sorted tuples to the aggregator node and accomplish high cardinality aggregation. So, drill improves on previous implementations of Streaming Expressions by first aggregating inside the export handler. Just updated my prior comment to make this more clear. was (Author: joel.bernstein): In the high cardinality use case, faceting will eventually run into performance and memory problems, so I don't really consider it a great high cardinality solution. Not because it sends all tuples to aggregator nodes, but because it's an in-memory aggregation. I was comparing Streaming Expressions, prior to drill, when I mentioned sending all tuples to the aggregator nodes. Streaming Expressions, prior to drill, could use the export handler to send all sorted tuples to the aggregator node and accomplish high cardinality aggregation. So, drill improves on previous implementations of Streaming Expressions by first aggregating inside the export handler. > Use plist automatically for executing a facet expression against a collection > alias backed by multiple collections > -- > > Key: SOLR-15036 > URL: https://issues.apache.org/jira/browse/SOLR-15036 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: streaming expressions >Reporter: Timothy Potter >Assignee: Timothy Potter >Priority: Major > Attachments: relay-approach.patch > > > For analytics use cases, streaming expressions make it possible to compute > basic aggregations (count, min, max, sum, and avg) over massive data sets. > Moreover, with massive data sets, it is common to use collection aliases over > many underlying collections, for instance time-partitioned aliases backed by > a set of collections, each covering a specific time range. In some cases, we > can end up with many collections (think 50-60) each with 100's of shards. > Aliases help insulate client applications from complex collection topologies > on the server side. > Let's take a basic facet expression that computes some useful aggregation > metrics: > {code:java} > facet( > some_alias, > q="*:*", > fl="a_i", > sort="a_i asc", > buckets="a_i", > bucketSorts="count(*) asc", > bucketSizeLimit=1, > sum(a_d), avg(a_d), min(a_d), max(a_d), count(*) > ) > {code} > Behind the scenes, the {{FacetStream}} sends a JSON facet request to Solr > which then expands the alias to a list of collections. For each collection, > the top-level distributed query controller gathers a candidate set of > replicas to query and then scatters {{distrib=false}} queries to each replica > in the list. For instance, if we have 60 collections with 200 shards each, > then this results in 12,000 shard requests from the query controller node to > the other nodes in the cluster. 
The requests are sent in an async manner (see > {{SearchHandler}} and {{HttpShardHandler}}) In my testing, we’ve seen cases > where we hit 18,000 replicas and these queries don’t always come back in a > timely manner. Put simply, this also puts a lot of load on the top-level > query controller node in terms of open connections and new object creation. > Instead, we can use {{plist}} to send the JSON facet query to each collection > in the alias in parallel, which reduces the overhead of each top-level > distributed query from 12,000 to 200 in my example above. With this approach, > you’ll then need to sort the tuples back from each collection and do a > rollup, something like: > {code:java} > select( > rollup( > sort( > plist( > select(facet(coll1,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt), > select(facet(coll2,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit
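To make the drill comparison in the comment above concrete, here is a rough sketch of a drill-based aggregation (illustrative only: the collection and field names are assumed, and the exact syntax should be verified against the Streaming Expressions documentation for the Solr version in use). The inner rollup runs inside the export handler on each shard, so only partial aggregates per bucket are streamed to the aggregator node, which combines them with the outer rollup:
{code:java}
rollup(
  drill(
    some_collection,
    q="*:*",
    fl="a_i",
    sort="a_i asc",
    rollup(input(), over="a_i", count(*))
  ),
  over="a_i",
  sum(count(*))
)
{code}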
[jira] [Commented] (SOLR-7964) suggest.highlight=true does not work when using context filter query
[ https://issues.apache.org/jira/browse/SOLR-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246219#comment-17246219 ] Graham Sutton commented on SOLR-7964: - Any progress on getting this incorporated into one of the upcoming official releases? I am still encountering this issue in 8.5.2. > suggest.highlight=true does not work when using context filter query > > > Key: SOLR-7964 > URL: https://issues.apache.org/jira/browse/SOLR-7964 > Project: Solr > Issue Type: Improvement > Components: Suggester >Affects Versions: 5.4 >Reporter: Arcadius Ahouansou >Assignee: David Smiley >Priority: Minor > Labels: suggester > Attachments: SOLR-7964.patch, SOLR_7964.patch, SOLR_7964.patch > > > When using the new suggester context filtering query param > {{suggest.contextFilterQuery}} introduced in SOLR-7888, the param > {{suggest.highlight=true}} has no effect. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thelabdude opened a new pull request #2132: SOLR-15036: auto-select / rollup / sort / plist over facet expression when using a collection alias with multiple collections
thelabdude opened a new pull request #2132: URL: https://github.com/apache/lucene-solr/pull/2132 # Description Quick impl to show the concept discussed in the JIRA, more tests required ... Pretty non-invasive to the existing codebase in my opinion thus far ;-) Also want to try to generalize some of this auto-plist stuff for use with different stream sources. # Solution Please provide a short description of the approach taken to implement your solution. # Tests Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem. # Checklist Please review the following and check all that apply: - [ ] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability. - [ ] I have created a Jira issue and added the issue ID to my pull request title. - [ ] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [ ] I have developed this patch against the `master` branch. - [ ] I have run `./gradlew check`. - [ ] I have added tests for my changes. - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
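As a reader's aid, the expression this PR aims to generate automatically is the plist/rollup/sort/select wrapping sketched in the JIRA description. A compact, hand-written version for an alias backed by just two collections might look roughly like the following (collection names, field names, and limits are illustrative; the actual generated expression may differ):
```
select(
  rollup(
    sort(
      plist(
        select(facet(coll1, q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", bucketSorts="count(*) asc", bucketSizeLimit=100, sum(a_d), count(*)),
               a_i, sum(a_d) as the_sum, count(*) as cnt),
        select(facet(coll2, q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", bucketSorts="count(*) asc", bucketSizeLimit=100, sum(a_d), count(*)),
               a_i, sum(a_d) as the_sum, count(*) as cnt)
      ),
      by="a_i asc"
    ),
    over="a_i",
    sum(the_sum), sum(cnt)
  ),
  a_i, sum(the_sum) as the_sum, sum(cnt) as cnt
)
```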
[jira] [Commented] (SOLR-15036) Use plist automatically for executing a facet expression against a collection alias backed by multiple collections
[ https://issues.apache.org/jira/browse/SOLR-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246228#comment-17246228 ] Michael Gibney commented on SOLR-15036: --- Thanks for the clarification, [~jbernste]. Would you be able to give a rough sense of how high you consider to be high cardinality, and whether you're talking about high cardinality _domain_ (DocSet size) or _field_ (number of unique values)? Apologies (and I hope/trust this isn't off-topic for this issue), but "faceting will eventually run into performance and memory problems ... because it's an in-memory aggregation" -- in a sense all aggregation is an in-memory aggregation, it's just a question of how aggressively the accumulation data structure is pruned (unless {{drill}} is writing to disk?). I'm honestly having a hard time wrapping my head around cases in which {{drill}} would perform better than "JSON facet", esp. considering the fundamental distinction that an exportWriter-based impl would work with BytesRefs (right?), whereas "JSON facet" generally works against term ords (at the shard level). Hence my questions about "how high is high" wrt cardinality, etc. ... hoping that will help me better understand the performance characteristics you're describing. > Use plist automatically for executing a facet expression against a collection > alias backed by multiple collections > -- > > Key: SOLR-15036 > URL: https://issues.apache.org/jira/browse/SOLR-15036 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: streaming expressions >Reporter: Timothy Potter >Assignee: Timothy Potter >Priority: Major > Attachments: relay-approach.patch > > Time Spent: 10m > Remaining Estimate: 0h > > For analytics use cases, streaming expressions make it possible to compute > basic aggregations (count, min, max, sum, and avg) over massive data sets. > Moreover, with massive data sets, it is common to use collection aliases over > many underlying collections, for instance time-partitioned aliases backed by > a set of collections, each covering a specific time range. In some cases, we > can end up with many collections (think 50-60) each with 100's of shards. > Aliases help insulate client applications from complex collection topologies > on the server side. > Let's take a basic facet expression that computes some useful aggregation > metrics: > {code:java} > facet( > some_alias, > q="*:*", > fl="a_i", > sort="a_i asc", > buckets="a_i", > bucketSorts="count(*) asc", > bucketSizeLimit=1, > sum(a_d), avg(a_d), min(a_d), max(a_d), count(*) > ) > {code} > Behind the scenes, the {{FacetStream}} sends a JSON facet request to Solr > which then expands the alias to a list of collections. For each collection, > the top-level distributed query controller gathers a candidate set of > replicas to query and then scatters {{distrib=false}} queries to each replica > in the list. For instance, if we have 60 collections with 200 shards each, > then this results in 12,000 shard requests from the query controller node to > the other nodes in the cluster. The requests are sent in an async manner (see > {{SearchHandler}} and {{HttpShardHandler}}) In my testing, we’ve seen cases > where we hit 18,000 replicas and these queries don’t always come back in a > timely manner. Put simply, this also puts a lot of load on the top-level > query controller node in terms of open connections and new object creation. 
> Instead, we can use {{plist}} to send the JSON facet query to each collection > in the alias in parallel, which reduces the overhead of each top-level > distributed query from 12,000 to 200 in my example above. With this approach, > you’ll then need to sort the tuples back from each collection and do a > rollup, something like: > {code:java} > select( > rollup( > sort( > plist( > select(facet(coll1,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt), > select(facet(coll2,q="*:*", fl="a_i", sort="a_i asc", buckets="a_i", > bucketSorts="count(*) asc", bucketSizeLimit=1, sum(a_d), avg(a_d), > min(a_d), max(a_d), count(*)),a_i,sum(a_d) as the_sum, avg(a_d) as the_avg, > min(a_d) as the_min, max(a_d) as the_max, count(*) as cnt) > ), > by="a_i asc" > ), > over="a_i", > sum(the_sum), avg(the_avg), min(the_min), max(the_max), sum(cnt) > ), > a_i, sum(the_sum) as the
[jira] [Commented] (SOLR-14848) Demonstrate how Solr 8, master, or any previous Solr version pales next to the reference branch.
[ https://issues.apache.org/jira/browse/SOLR-14848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246278#comment-17246278 ] Mark Robert Miller commented on SOLR-14848: --- Whew, okay, this issue is finally queuing up. The Solr ref branch phase 1 will be called “complete” on Friday. Like any milestone, that is really to my definition, but milestones are milestones and it’s an important one for me. This issue is prime phase 2, alongside some Nightly and merge up work that can move along in parallel. > Demonstrate how Solr 8, master, or any version previous Solr version before > pales next to the reference branch. > --- > > Key: SOLR-14848 > URL: https://issues.apache.org/jira/browse/SOLR-14848 > Project: Solr > Issue Type: Sub-task >Reporter: Mark Robert Miller >Priority: Major > > I've got a lot of code here and I have and will be claiming that it's an > order of magnitude better than what has come before. > I've been too busy and will be busy for a bit, so I have not been too > concerned about backing that up really at all. Most people have no clue what > I have here, some people have an inkling, some people are just totally > confused, some people think I maybe have some fast tests, or a slightly more > stable system, or maybe some neato performance changes, or even maybe some > poorly coded speed hacks. Maybe one or two has a more hope filled guess. > Almost everyone will think, "all that new code, mostly done by a single > person? I know a lot of smart and smarter devs, who cares what this guy is up > to. Why would I leave the safety of the branch I know and feel safe with? By > definition, the existing stuff is the battle hardened, tried and true leader, > and how are you going to come in here without disrupting our comfortable > thing?" > Well, fair enough. I won't try to come and disrupt anything. Instead, there > will be benchmarks, stress tests, chaos monkeys, long term endurance tests, > and all sorts of fun competitions. Spy vs Spy. I mean Solr vs Solr. > And while this vanilla version of my previous work has avoided a lot of great > changes and improvements I can make (a "remastered" Solr sensible, initial > mandate that puts a hand or two behind my back) ... > ... The reference branch will trounce previous versions of Solr in benchmark > after benchmark. It will keep pumping through endurance tests and performance > challenges at impressive speed while Solr proper will struggle to finish in a > reasonable time or almost certainly, often enough, simply fail to complete > the task. The reference branch will devour available resources and fly > through work. Solr master will struggle and meander, sometimes in the wrong > direction, while leaving the hardware with gobs of idle cpu to chill with > (unless it's using most of the cpu for garbage collection at some points). > This is not meant to brag or dis previous versions of Solr. I was heavily > involved in building them. This is the result of dedication and time more > than any of my brilliance - the above is simply meant to state the path that > I see coming. As this comparison information and other experiences and > stories start to emerge, that master branch won't look nearly so safe or > comfortable anymore. And it's at that point that we will find out if anyone > is interested in testing our tolerance for disruption by trying to figure out > how to get master into the reference branch as opposed to the other way > around. 
> > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14788) Solr: The Next Big Thing
[ https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246280#comment-17246280 ] Mark Robert Miller commented on SOLR-14788: --- The flywheel defender is in rare form. Next week we will start to see what this hack code can do more concretely. > Solr: The Next Big Thing > > > Key: SOLR-14788 > URL: https://issues.apache.org/jira/browse/SOLR-14788 > Project: Solr > Issue Type: Task >Reporter: Mark Robert Miller >Assignee: Mark Robert Miller >Priority: Critical > Time Spent: 4h > Remaining Estimate: 0h > > h3. > [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The > Policeman is on duty!*{color} > {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and > have some fun. Try to make some progress. Don't stress too much about the > impact of your changes or maintaining stability and performance and > correctness so much. Until the end of phase 1, I've got your back. I have a > variety of tools and contraptions I have been building over the years and I > will continue training them on this branch. I will review your changes and > peer out across the land and course correct where needed. As Mike D will be > thinking, "Sounds like a bottleneck Mark." And indeed it will be to some > extent. Which is why once stage one is completed, I will flip The Policeman > to off duty. When off duty, I'm always* {color:#de350b}*occasionally*{color} > *down for some vigilante justice, but I won't be walking the beat, all that > stuff about sit back and relax goes out the window.*{color}_ > {quote} > > I have stolen this title from Ishan or Noble and Ishan. > This issue is meant to capture the work of a small team that is forming to > push Solr and SolrCloud to the next phase. > I have kicked off the work with an effort to create a very fast and solid > base. That work is not 100% done, but it's ready to join the fight. > Tim Potter has started giving me a tremendous hand in finishing up. Ishan and > Noble have already contributed support and testing and have plans for > additional work to shore up some of our current shortcomings. > Others have expressed an interest in helping and hopefully they will pop up > here as well. > Let's organize and discuss our efforts here and in various sub issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14788) Solr: The Next Big Thing
[ https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246281#comment-17246281 ] Mark Robert Miller commented on SOLR-14788: --- [~markus17] I was planning on shortcutting for a variety of reasons a couple months back, but given the opportunity stack things in my favor more completely before gambling caveats and split priorities, known feedback, I had to take it. If you have the opportunity to take a look again, sometime after this week is going to be the good entry point. > Solr: The Next Big Thing > > > Key: SOLR-14788 > URL: https://issues.apache.org/jira/browse/SOLR-14788 > Project: Solr > Issue Type: Task >Reporter: Mark Robert Miller >Assignee: Mark Robert Miller >Priority: Critical > Time Spent: 4h > Remaining Estimate: 0h > > h3. > [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The > Policeman is on duty!*{color} > {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and > have some fun. Try to make some progress. Don't stress too much about the > impact of your changes or maintaining stability and performance and > correctness so much. Until the end of phase 1, I've got your back. I have a > variety of tools and contraptions I have been building over the years and I > will continue training them on this branch. I will review your changes and > peer out across the land and course correct where needed. As Mike D will be > thinking, "Sounds like a bottleneck Mark." And indeed it will be to some > extent. Which is why once stage one is completed, I will flip The Policeman > to off duty. When off duty, I'm always* {color:#de350b}*occasionally*{color} > *down for some vigilante justice, but I won't be walking the beat, all that > stuff about sit back and relax goes out the window.*{color}_ > {quote} > > I have stolen this title from Ishan or Noble and Ishan. > This issue is meant to capture the work of a small team that is forming to > push Solr and SolrCloud to the next phase. > I have kicked off the work with an effort to create a very fast and solid > base. That work is not 100% done, but it's ready to join the fight. > Tim Potter has started giving me a tremendous hand in finishing up. Ishan and > Noble have already contributed support and testing and have plans for > additional work to shore up some of our current shortcomings. > Others have expressed an interest in helping and hopefully they will pop up > here as well. > Let's organize and discuss our efforts here and in various sub issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-14788) Solr: The Next Big Thing
[ https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246281#comment-17246281 ] Mark Robert Miller edited comment on SOLR-14788 at 12/9/20, 4:28 AM: - [~markus17] I was planning on shortcutting for a variety of reasons a couple months back, but given the opportunity to stack things in my favor more completely before gambling caveats and split priorities, known feedback, etc, I had to take it. If you have the opportunity to take a look again, sometime after this week is going to be the good entry point. was (Author: markrmiller): [~markus17] I was planning on shortcutting for a variety of reasons a couple months back, but given the opportunity stack things in my favor more completely before gambling caveats and split priorities, known feedback, I had to take it. If you have the opportunity to take a look again, sometime after this week is going to be the good entry point. > Solr: The Next Big Thing > > > Key: SOLR-14788 > URL: https://issues.apache.org/jira/browse/SOLR-14788 > Project: Solr > Issue Type: Task >Reporter: Mark Robert Miller >Assignee: Mark Robert Miller >Priority: Critical > Time Spent: 4h > Remaining Estimate: 0h > > h3. > [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The > Policeman is on duty!*{color} > {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and > have some fun. Try to make some progress. Don't stress too much about the > impact of your changes or maintaining stability and performance and > correctness so much. Until the end of phase 1, I've got your back. I have a > variety of tools and contraptions I have been building over the years and I > will continue training them on this branch. I will review your changes and > peer out across the land and course correct where needed. As Mike D will be > thinking, "Sounds like a bottleneck Mark." And indeed it will be to some > extent. Which is why once stage one is completed, I will flip The Policeman > to off duty. When off duty, I'm always* {color:#de350b}*occasionally*{color} > *down for some vigilante justice, but I won't be walking the beat, all that > stuff about sit back and relax goes out the window.*{color}_ > {quote} > > I have stolen this title from Ishan or Noble and Ishan. > This issue is meant to capture the work of a small team that is forming to > push Solr and SolrCloud to the next phase. > I have kicked off the work with an effort to create a very fast and solid > base. That work is not 100% done, but it's ready to join the fight. > Tim Potter has started giving me a tremendous hand in finishing up. Ishan and > Noble have already contributed support and testing and have plans for > additional work to shore up some of our current shortcomings. > Others have expressed an interest in helping and hopefully they will pop up > here as well. > Let's organize and discuss our efforts here and in various sub issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] gf2121 commented on pull request #2113: LUCENE-9629: Use computed masks
gf2121 commented on pull request #2113: URL: https://github.com/apache/lucene-solr/pull/2113#issuecomment-741520801 > You need to either return a value from the benchmark methods or call blackhole.consume, otherwise the JVM will detect that everything is unused outside of the scope and optimize it away. That should get you some different results. Thank you for being thorough! Thank you for the clue! Based on your guidance, I tried some more benchmark, but find array val is alway faster... here are the codes and results (code is used to shows the way that i tried to prevent jvm optimize, so only one method is enough). 1. return an array result ``` public long[] decode0() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASKS16_1[0]) << 14; l0 |= (TMP[tmpIdx+1] & MASKS16_1[0]) << 13; l0 |= (TMP[tmpIdx+2] & MASKS16_1[0]) << 12; l0 |= (TMP[tmpIdx+3] & MASKS16_1[0]) << 11; l0 |= (TMP[tmpIdx+4] & MASKS16_1[0]) << 10; l0 |= (TMP[tmpIdx+5] & MASKS16_1[0]) << 9; l0 |= (TMP[tmpIdx+6] & MASKS16_1[0]) << 8; l0 |= (TMP[tmpIdx+7] & MASKS16_1[0]) << 7; l0 |= (TMP[tmpIdx+8] & MASKS16_1[0]) << 6; l0 |= (TMP[tmpIdx+9] & MASKS16_1[0]) << 5; l0 |= (TMP[tmpIdx+10] & MASKS16_1[0]) << 4; l0 |= (TMP[tmpIdx+11] & MASKS16_1[0]) << 3; l0 |= (TMP[tmpIdx+12] & MASKS16_1[0]) << 2; l0 |= (TMP[tmpIdx+13] & MASKS16_1[0]) << 1; l0 |= (TMP[tmpIdx+14] & MASKS16_1[0]) << 0; ARR[longsIdx+0] = l0; } return ARR; } ``` method | speed (ops/s) | - MyBenchmark.decode0 | 92215691.271 ± 1149229.830 MyBenchmark.decode1 | 62019521.428 ± 4268837.164 MyBenchmark.decode2 | 62595196.347 ± 1434012.058 2. return an long result ``` public long decode0() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASKS16_1[0]) << 14; l0 |= (TMP[tmpIdx+1] & MASKS16_1[0]) << 13; l0 |= (TMP[tmpIdx+2] & MASKS16_1[0]) << 12; l0 |= (TMP[tmpIdx+3] & MASKS16_1[0]) << 11; l0 |= (TMP[tmpIdx+4] & MASKS16_1[0]) << 10; l0 |= (TMP[tmpIdx+5] & MASKS16_1[0]) << 9; l0 |= (TMP[tmpIdx+6] & MASKS16_1[0]) << 8; l0 |= (TMP[tmpIdx+7] & MASKS16_1[0]) << 7; l0 |= (TMP[tmpIdx+8] & MASKS16_1[0]) << 6; l0 |= (TMP[tmpIdx+9] & MASKS16_1[0]) << 5; l0 |= (TMP[tmpIdx+10] & MASKS16_1[0]) << 4; l0 |= (TMP[tmpIdx+11] & MASKS16_1[0]) << 3; l0 |= (TMP[tmpIdx+12] & MASKS16_1[0]) << 2; l0 |= (TMP[tmpIdx+13] & MASKS16_1[0]) << 1; l0 |= (TMP[tmpIdx+14] & MASKS16_1[0]) << 0; ARR[longsIdx+0] = l0; } return ARR[31]; } ``` method | speed (ops/s) | - MyBenchmark.decode0 | 92470935.234 ± 3525240.576 MyBenchmark.decode1 | 62389057.277 ± 567747.489 MyBenchmark.decode2 | 62141559.925 ± 1012364.417 3. 
blackwhole consume last ``` public void decode0(Blackhole blackhole) { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASKS16_1[0]) << 14; l0 |= (TMP[tmpIdx+1] & MASKS16_1[0]) << 13; l0 |= (TMP[tmpIdx+2] & MASKS16_1[0]) << 12; l0 |= (TMP[tmpIdx+3] & MASKS16_1[0]) << 11; l0 |= (TMP[tmpIdx+4] & MASKS16_1[0]) << 10; l0 |= (TMP[tmpIdx+5] & MASKS16_1[0]) << 9; l0 |= (TMP[tmpIdx+6] & MASKS16_1[0]) << 8; l0 |= (TMP[tmpIdx+7] & MASKS16_1[0]) << 7; l0 |= (TMP[tmpIdx+8] & MASKS16_1[0]) << 6; l0 |= (TMP[tmpIdx+9] & MASKS16_1[0]) << 5; l0 |= (TMP[tmpIdx+10] & MASKS16_1[0]) << 4; l0 |= (TMP[tmpIdx+11] & MASKS16_1[0]) << 3; l0 |= (TMP[tmpIdx+12] & MASKS16_1[0]) << 2; l0 |= (TMP[tmpIdx+13] & MASKS16_1[0]) << 1; l0 |= (TMP[tmpIdx+14] & MASKS16_1[0]) << 0; ARR[longsIdx+0] = l0; } blackhole.consume(ARR[30]); blackhole.consume(ARR[31]); } ``` method | speed (ops/s) | - MyBenchmark.decode0 | 79570016.826 ± 1210338.335 MyBenchmark.decode1 | 58225242.201 ± 905039.184 MyBenchmark.decode2 | 58524381.688 ± 585220.494 4. blackwhole consume in loop ``` public void decode0(Blackhole blackhole) { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) {
[GitHub] [lucene-solr] gf2121 edited a comment on pull request #2113: LUCENE-9629: Use computed masks
gf2121 edited a comment on pull request #2113: URL: https://github.com/apache/lucene-solr/pull/2113#issuecomment-741520801 > You need to either return a value from the benchmark methods or call blackhole.consume, otherwise the JVM will detect that everything is unused outside of the scope and optimize it away. That should get you some different results. Thank you for being thorough! Thank you for the clue! Based on your guidance, I tried some more benchmark, but find array val is alway faster... here are the codes and results (code is used to show the way prevent jvm optimize, so only one method is enough here). 1. return an array result ``` public long[] decode0() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASKS16_1[0]) << 14; l0 |= (TMP[tmpIdx+1] & MASKS16_1[0]) << 13; l0 |= (TMP[tmpIdx+2] & MASKS16_1[0]) << 12; l0 |= (TMP[tmpIdx+3] & MASKS16_1[0]) << 11; l0 |= (TMP[tmpIdx+4] & MASKS16_1[0]) << 10; l0 |= (TMP[tmpIdx+5] & MASKS16_1[0]) << 9; l0 |= (TMP[tmpIdx+6] & MASKS16_1[0]) << 8; l0 |= (TMP[tmpIdx+7] & MASKS16_1[0]) << 7; l0 |= (TMP[tmpIdx+8] & MASKS16_1[0]) << 6; l0 |= (TMP[tmpIdx+9] & MASKS16_1[0]) << 5; l0 |= (TMP[tmpIdx+10] & MASKS16_1[0]) << 4; l0 |= (TMP[tmpIdx+11] & MASKS16_1[0]) << 3; l0 |= (TMP[tmpIdx+12] & MASKS16_1[0]) << 2; l0 |= (TMP[tmpIdx+13] & MASKS16_1[0]) << 1; l0 |= (TMP[tmpIdx+14] & MASKS16_1[0]) << 0; ARR[longsIdx+0] = l0; } return ARR; } ``` method | speed (ops/s) | - MyBenchmark.decode0 | 92215691.271 ± 1149229.830 MyBenchmark.decode1 | 62019521.428 ± 4268837.164 MyBenchmark.decode2 | 62595196.347 ± 1434012.058 2. return an long result ``` public long decode0() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASKS16_1[0]) << 14; l0 |= (TMP[tmpIdx+1] & MASKS16_1[0]) << 13; l0 |= (TMP[tmpIdx+2] & MASKS16_1[0]) << 12; l0 |= (TMP[tmpIdx+3] & MASKS16_1[0]) << 11; l0 |= (TMP[tmpIdx+4] & MASKS16_1[0]) << 10; l0 |= (TMP[tmpIdx+5] & MASKS16_1[0]) << 9; l0 |= (TMP[tmpIdx+6] & MASKS16_1[0]) << 8; l0 |= (TMP[tmpIdx+7] & MASKS16_1[0]) << 7; l0 |= (TMP[tmpIdx+8] & MASKS16_1[0]) << 6; l0 |= (TMP[tmpIdx+9] & MASKS16_1[0]) << 5; l0 |= (TMP[tmpIdx+10] & MASKS16_1[0]) << 4; l0 |= (TMP[tmpIdx+11] & MASKS16_1[0]) << 3; l0 |= (TMP[tmpIdx+12] & MASKS16_1[0]) << 2; l0 |= (TMP[tmpIdx+13] & MASKS16_1[0]) << 1; l0 |= (TMP[tmpIdx+14] & MASKS16_1[0]) << 0; ARR[longsIdx+0] = l0; } return ARR[31]; } ``` method | speed (ops/s) | - MyBenchmark.decode0 | 92470935.234 ± 3525240.576 MyBenchmark.decode1 | 62389057.277 ± 567747.489 MyBenchmark.decode2 | 62141559.925 ± 1012364.417 3. 
blackwhole consume last ``` public void decode0(Blackhole blackhole) { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASKS16_1[0]) << 14; l0 |= (TMP[tmpIdx+1] & MASKS16_1[0]) << 13; l0 |= (TMP[tmpIdx+2] & MASKS16_1[0]) << 12; l0 |= (TMP[tmpIdx+3] & MASKS16_1[0]) << 11; l0 |= (TMP[tmpIdx+4] & MASKS16_1[0]) << 10; l0 |= (TMP[tmpIdx+5] & MASKS16_1[0]) << 9; l0 |= (TMP[tmpIdx+6] & MASKS16_1[0]) << 8; l0 |= (TMP[tmpIdx+7] & MASKS16_1[0]) << 7; l0 |= (TMP[tmpIdx+8] & MASKS16_1[0]) << 6; l0 |= (TMP[tmpIdx+9] & MASKS16_1[0]) << 5; l0 |= (TMP[tmpIdx+10] & MASKS16_1[0]) << 4; l0 |= (TMP[tmpIdx+11] & MASKS16_1[0]) << 3; l0 |= (TMP[tmpIdx+12] & MASKS16_1[0]) << 2; l0 |= (TMP[tmpIdx+13] & MASKS16_1[0]) << 1; l0 |= (TMP[tmpIdx+14] & MASKS16_1[0]) << 0; ARR[longsIdx+0] = l0; } blackhole.consume(ARR[30]); blackhole.consume(ARR[31]); } ``` method | speed (ops/s) | - MyBenchmark.decode0 | 79570016.826 ± 1210338.335 MyBenchmark.decode1 | 58225242.201 ± 905039.184 MyBenchmark.decode2 | 58524381.688 ± 585220.494 4. blackwhole consume in loop ``` public void decode0(Blackhole blackhole) { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) {
[GitHub] [lucene-solr] gf2121 edited a comment on pull request #2113: LUCENE-9629: Use computed masks
gf2121 edited a comment on pull request #2113: URL: https://github.com/apache/lucene-solr/pull/2113#issuecomment-741520801 > You need to either return a value from the benchmark methods or call blackhole.consume, otherwise the JVM will detect that everything is unused outside of the scope and optimize it away. That should get you some different results. Thank you for being thorough! Thank you for the clue! Based on your guidance, I tried some more benchmark, but find array val is alway faster... here are the codes and results (code is used to show the way prevent jvm optimize, so only one method is pasted here). 1. return an array result ``` public long[] decode0() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASKS16_1[0]) << 14; l0 |= (TMP[tmpIdx+1] & MASKS16_1[0]) << 13; l0 |= (TMP[tmpIdx+2] & MASKS16_1[0]) << 12; l0 |= (TMP[tmpIdx+3] & MASKS16_1[0]) << 11; l0 |= (TMP[tmpIdx+4] & MASKS16_1[0]) << 10; l0 |= (TMP[tmpIdx+5] & MASKS16_1[0]) << 9; l0 |= (TMP[tmpIdx+6] & MASKS16_1[0]) << 8; l0 |= (TMP[tmpIdx+7] & MASKS16_1[0]) << 7; l0 |= (TMP[tmpIdx+8] & MASKS16_1[0]) << 6; l0 |= (TMP[tmpIdx+9] & MASKS16_1[0]) << 5; l0 |= (TMP[tmpIdx+10] & MASKS16_1[0]) << 4; l0 |= (TMP[tmpIdx+11] & MASKS16_1[0]) << 3; l0 |= (TMP[tmpIdx+12] & MASKS16_1[0]) << 2; l0 |= (TMP[tmpIdx+13] & MASKS16_1[0]) << 1; l0 |= (TMP[tmpIdx+14] & MASKS16_1[0]) << 0; ARR[longsIdx+0] = l0; } return ARR; } ``` method | speed (ops/s) | - MyBenchmark.decode0 | 92215691.271 ± 1149229.830 MyBenchmark.decode1 | 62019521.428 ± 4268837.164 MyBenchmark.decode2 | 62595196.347 ± 1434012.058 2. return an long result ``` public long decode0() { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASKS16_1[0]) << 14; l0 |= (TMP[tmpIdx+1] & MASKS16_1[0]) << 13; l0 |= (TMP[tmpIdx+2] & MASKS16_1[0]) << 12; l0 |= (TMP[tmpIdx+3] & MASKS16_1[0]) << 11; l0 |= (TMP[tmpIdx+4] & MASKS16_1[0]) << 10; l0 |= (TMP[tmpIdx+5] & MASKS16_1[0]) << 9; l0 |= (TMP[tmpIdx+6] & MASKS16_1[0]) << 8; l0 |= (TMP[tmpIdx+7] & MASKS16_1[0]) << 7; l0 |= (TMP[tmpIdx+8] & MASKS16_1[0]) << 6; l0 |= (TMP[tmpIdx+9] & MASKS16_1[0]) << 5; l0 |= (TMP[tmpIdx+10] & MASKS16_1[0]) << 4; l0 |= (TMP[tmpIdx+11] & MASKS16_1[0]) << 3; l0 |= (TMP[tmpIdx+12] & MASKS16_1[0]) << 2; l0 |= (TMP[tmpIdx+13] & MASKS16_1[0]) << 1; l0 |= (TMP[tmpIdx+14] & MASKS16_1[0]) << 0; ARR[longsIdx+0] = l0; } return ARR[31]; } ``` method | speed (ops/s) | - MyBenchmark.decode0 | 92470935.234 ± 3525240.576 MyBenchmark.decode1 | 62389057.277 ± 567747.489 MyBenchmark.decode2 | 62141559.925 ± 1012364.417 3. 
blackwhole consume last ``` public void decode0(Blackhole blackhole) { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) { long l0 = (TMP[tmpIdx+0] & MASKS16_1[0]) << 14; l0 |= (TMP[tmpIdx+1] & MASKS16_1[0]) << 13; l0 |= (TMP[tmpIdx+2] & MASKS16_1[0]) << 12; l0 |= (TMP[tmpIdx+3] & MASKS16_1[0]) << 11; l0 |= (TMP[tmpIdx+4] & MASKS16_1[0]) << 10; l0 |= (TMP[tmpIdx+5] & MASKS16_1[0]) << 9; l0 |= (TMP[tmpIdx+6] & MASKS16_1[0]) << 8; l0 |= (TMP[tmpIdx+7] & MASKS16_1[0]) << 7; l0 |= (TMP[tmpIdx+8] & MASKS16_1[0]) << 6; l0 |= (TMP[tmpIdx+9] & MASKS16_1[0]) << 5; l0 |= (TMP[tmpIdx+10] & MASKS16_1[0]) << 4; l0 |= (TMP[tmpIdx+11] & MASKS16_1[0]) << 3; l0 |= (TMP[tmpIdx+12] & MASKS16_1[0]) << 2; l0 |= (TMP[tmpIdx+13] & MASKS16_1[0]) << 1; l0 |= (TMP[tmpIdx+14] & MASKS16_1[0]) << 0; ARR[longsIdx+0] = l0; } blackhole.consume(ARR[30]); blackhole.consume(ARR[31]); } ``` method | speed (ops/s) | - MyBenchmark.decode0 | 79570016.826 ± 1210338.335 MyBenchmark.decode1 | 58225242.201 ± 905039.184 MyBenchmark.decode2 | 58524381.688 ± 585220.494 4. blackwhole consume in loop ``` public void decode0(Blackhole blackhole) { for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) {
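To summarize the JMH discussion above, here is a minimal, self-contained sketch (a hypothetical class and data set, not the benchmark actually used in this thread) of the two ways mentioned to keep the JIT from eliminating the work as dead code: returning the computed value, or handing it to Blackhole.consume.
```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class MaskBenchmark {

  private static final long MASK = (1L << 1) - 1; // same shape as MASKS16_1[0]
  private final long[] tmp = new long[32];        // stand-in for TMP
  private final long[] arr = new long[32];        // stand-in for ARR

  // Option 1: return the result so the JIT cannot prove it is unused.
  @Benchmark
  public long[] decodeReturn() {
    long l0 = 0;
    for (int i = 0; i < 15; i++) {
      l0 |= (tmp[i] & MASK) << (14 - i);
    }
    arr[30] = l0;
    return arr;
  }

  // Option 2: hand the result to the Blackhole so it is observably consumed.
  @Benchmark
  public void decodeBlackhole(Blackhole blackhole) {
    long l0 = 0;
    for (int i = 0; i < 15; i++) {
      l0 |= (tmp[i] & MASK) << (14 - i);
    }
    arr[30] = l0;
    blackhole.consume(arr[30]);
  }
}
```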
[GitHub] [lucene-solr] munendrasn commented on a change in pull request #2121: SOLR-10860: Return proper error code for bad input incase of inplace updates
munendrasn commented on a change in pull request #2121: URL: https://github.com/apache/lucene-solr/pull/2121#discussion_r539024109 ## File path: solr/core/src/test/org/apache/solr/update/TestInPlaceUpdatesStandalone.java ## @@ -121,6 +123,36 @@ public void deleteAllAndCommit() throws Exception { assertU(commit("softCommit", "false")); } + @Test + public void testUpdateBadRequest() throws Exception { +final long version1 = addAndGetVersion(sdoc("id", "1", "title_s", "first", "inplace_updatable_float", 41), null); +assertU(commit()); + +// invalid value with set operation +SolrException e = expectThrows(SolrException.class, +() -> addAndAssertVersion(version1, "id", "1", "inplace_updatable_float", map("set", "NOT_NUMBER"))); +assertEquals(SolrException.ErrorCode.BAD_REQUEST.code, e.code()); +MatcherAssert.assertThat(e.getMessage(), containsString("For input string: \"NOT_NUMBER\"")); + +// invalid value with inc operation +e = expectThrows(SolrException.class, +() -> addAndAssertVersion(version1, "id", "1", "inplace_updatable_float", map("inc", "NOT_NUMBER"))); +assertEquals(SolrException.ErrorCode.BAD_REQUEST.code, e.code()); +MatcherAssert.assertThat(e.getMessage(), containsString("For input string: \"NOT_NUMBER\"")); + +// inc op with null value +e = expectThrows(SolrException.class, +() -> addAndAssertVersion(version1, "id", "1", "inplace_updatable_float", map("inc", null))); +assertEquals(SolrException.ErrorCode.BAD_REQUEST.code, e.code()); +MatcherAssert.assertThat(e.getMessage(), containsString("Invalid input 'null' for field inplace_updatable_float")); + +e = expectThrows(SolrException.class, +() -> addAndAssertVersion(version1, "id", "1", "inplace_updatable_float", Review comment: We can increment a float by an integer. This particular test input verifies the case where, instead of a single number, a list of numbers is passed. Previously, Solr used to return 500; with the current changes, a Bad Request is returned: `"Invalid input '[123]' for field inplace_updatable_float"` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] munendrasn commented on a change in pull request #2121: SOLR-10860: Return proper error code for bad input incase of inplace updates
munendrasn commented on a change in pull request #2121: URL: https://github.com/apache/lucene-solr/pull/2121#discussion_r539029127 ## File path: solr/core/src/java/org/apache/solr/update/processor/AtomicUpdateDocumentMerger.java ## @@ -143,6 +147,15 @@ public SolrInputDocument merge(final SolrInputDocument fromDoc, SolrInputDocumen return toDoc; } + private static String getID(SolrInputDocument doc, IndexSchema schema) { +String id = ""; Review comment: I'm thinking of rephrasing the above error message to something like the snippet below, so that it is better than the previous message. If the id is not known, then I think it may be better not to send anything related to the id, wdyt? ``` "Error:" + getID(toDoc, schema) + "Unknown operation for the an atomic update : " + key; ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] munendrasn commented on a change in pull request #2121: SOLR-10860: Return proper error code for bad input incase of inplace updates
munendrasn commented on a change in pull request #2121: URL: https://github.com/apache/lucene-solr/pull/2121#discussion_r539029723 ## File path: solr/core/src/java/org/apache/solr/update/processor/AtomicUpdateDocumentMerger.java ## @@ -553,7 +574,15 @@ private Object getNativeFieldValue(String fieldName, Object val) { return val; } SchemaField sf = schema.getField(fieldName); -return sf.getType().toNativeType(val); +try { + return sf.getType().toNativeType(val); +} catch (SolrException ex) { + throw new SolrException(SolrException.ErrorCode.getErrorCode(ex.code()), + "Error converting field '" + sf.getName() + "'='" +val+"' to native type, msg=" + ex.getMessage(), ex); Review comment: The cause gets lost in the metadata section of the response, so I thought this would give simpler insight into the error. Also, I'm trying to follow the same convention as the other error messages in DocumentBuilder. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-10732) potential optimizations in callers of SolrIndexSearcher.numDocs when docset is empty
[ https://issues.apache.org/jira/browse/SOLR-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246305#comment-17246305 ] Munendra S N commented on SOLR-10732: - {quote}I'm curious, Munendra S N – were you able to perceive a performance benefit with these changes? Where these optimizations are located, afaict they optimize edge cases, and the query-building they prevent (if I'm reading right) is generally pretty lightweight (e.g., TermQuery ...).{quote} Changes here are based on this [comment|https://issues.apache.org/jira/browse/SOLR-10727?focusedCommentId=16020247&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16020247]. As you said, this tries to avoid additional object creation and computation for some edge cases and based on my understanding, it helps especially in case facet queries or group facets {quote}By way of contrast (wrt complexity/benefit tradeoff), at the leaf level it looks like SolrIndexSearcher.getDocSet(Query, DocSet) could be optimized in a way analogous to what SOLR-10727 does for SolrIndexSearcher.numDocs(Query, DocSet), avoiding filterCache pollution {quote} +1, If there is possibility to improve/optimize it. We should definitely do it but I think it should be handled in its own issue {quote}or maybe also higher up in the program logic, to prune as much execution as possible (and when it's clearer how/why we got the point of having an empty domain). The changes here seem to be building in mid-level "shot in the dark" safeguards, where it's relatively unclear what's going on.{quote} Initially, planned to make these changes in getFacetCounts which would handle the case for intervalFacet and heatmap but realized changes would be too cluttered so, decided to delegate handling this case respective types. Let me know if this could be simplified and probably handle other facets too > potential optimizations in callers of SolrIndexSearcher.numDocs when docset > is empty > > > Key: SOLR-10732 > URL: https://issues.apache.org/jira/browse/SOLR-10732 > Project: Solr > Issue Type: Improvement >Reporter: Chris M. Hostetter >Priority: Major > Attachments: SOLR-10732.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > spin off of SOLR-10727... > {quote} > ...why not (also) optimize it slightly higher up and completely avoid the > construction of the Query objects? (and in some cases: additional overhead) > for example: the first usage of {{SolrIndexSearcher.numDocs(Query,DocSet)}} i > found was {{RangeFacetProcessor.rangeCount(DocSet subset,...)}} ... if the > first line of that method was {{if (0 == subset.size()) return 0}} then we'd > not only optimize away the SolrIndexSearcher hit, but also fetching the > SchemaField & building the range query (not to mention the much more > expensive {{getGroupedFacetQueryCount}} in the grouping case) > At a glance, most other callers of > {{SolrIndexSearcher.numDocs(Query,DocSet)}} could be trivially optimize this > way as well -- at a minimum to eliminate Query parsing/construction. > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
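For readers skimming this issue, the short-circuit being proposed looks roughly like the following (a sketch only: the method body is simplified and the signature is assumed from the issue description, not copied from the actual patch):
{code:java}
// Sketch of the early return discussed above; skips SchemaField/Query
// construction and the SolrIndexSearcher call when the base DocSet is empty.
protected int rangeCount(DocSet subset, SchemaField sf, String low, String high,
                         boolean iLow, boolean iHigh) throws IOException {
  if (0 == subset.size()) {
    return 0; // empty domain: any intersection count is trivially zero
  }
  Query rangeQ = sf.getType().getRangeQuery(null, sf, low, high, iLow, iHigh);
  return searcher.numDocs(rangeQ, subset);
}
{code}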
[jira] [Commented] (SOLR-14688) First party package implementation design
[ https://issues.apache.org/jira/browse/SOLR-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246314#comment-17246314 ] Tomas Eduardo Fernandez Lobbe commented on SOLR-14688: -- There were some discussions in Slack the last couple days that I'd like to bring here since they are related to this Jira issue. The threads are [this|https://the-asf.slack.com/archives/CNMTSU970/p1582794103004800] and [this|https://the-asf.slack.com/archives/CNMTSU970/p1607019429038000]. While there are different opinions expressed in those threads, the general sentiment is that there are missing pieces to the stories of packages/plugins, that need to be resolved before we proceed here. In particular, compatibility and offline installs. I'm now going to bring a summary of my comments and concerns here. In the current state of things with package manager and if this issue is implemented, someone could technically install a package from a different version, but this can only be achieved by ensuring binary compatibility between core and the packages across those different versions. This is way more than what we guarantee today (think of something in analysis-extras calling {{coreContainer.getCore().getLatestSchema().callSomethingAddedIn8.9()}}, would break in any core version previous to 8.9). Guaranteeing this binary compatibility would require a ton of testing (every package version against every core version that's supported), it would put a lot of burden on us developers, making it very difficult to add/change/deprecate/retire code and may even make major upgrades impossible, or if we just take binary compatibility as a "best effort", we'll make it difficult on the users to figure out which version of what is compatible with other versions. One question that I raised is, why do we want people to install a newer contribs/packages into an older core version? Why don't we instead encourage people to upgrade Solr by making it easier to do? Major version upgrades could be more problematic because of index compatibility, yes, but really, having binary compatibility across major upgrades is going to be very, very hard. There is also great concern about the inability to install packages offline, and how that affects the ability to install/deploy first/third party plugins (a bunch of people expressed this in particular in those Slack threads I mentioned). I believe the root of the problem is the fact that packages *have* to be cluster-wide now. Instead of being able to create the deployable in some build infrastructure, away from production environments, and then move that deployable across your different environments such as "dev", "qa", "prod", or whatever you have, the current implementation only allows one to configure a cluster once it's created and running, doing API calls (forcing to enable package manager AFAIK, even if no code needs to be added dynamically later), and exposing the production environment to either a package repository or even internet. I believe packages (first, or third party) could work better if they could be local to a node (and this doesn't mean there can't be cluster-wide packages, but we need at least the "local" option). 
People could then, for example, create their Docker image like (and these are not real commands, just get the idea): {noformat} FROM official-docker-image-slim:x.y.z ADD /some/build/path/custom-plugin1 /some/location/in/solr/custom-plugin1 RUN /solr/bin/solr install custom-plugin1 /some/location/in/solr/custom-plugin1 RUN /solr/bin/solr install analysis-extra solr.apache.org/packages/analysis-extra/x.y.z #or RUN /solr/bin/solr install analysis-extra /first/party/plugins/location {noformat} (The example is with Docker, but similar things can be done with other deployables, like AMIs in AWS, or I'm sure any container technology.) And then just build it and deploy it. If you are using the Kubernetes Solr operator, it's a single command and the upgrade will start safely and automatically. It's also important to mention that any upgrade could look just the same, regardless if what you changed was Solr core, first party or third party plugin. I'm +1 on making the code more modular and independent, have better, well thought interfaces like the one created for the replica placement framework and much of the work ab has been doing to define higher level interfaces of things currently need rework, but I think with the current state of package manager, with cluster-wide packages, this issue is very dangerous. > First party package implementation design > - > > Key: SOLR-14688 > URL: https://issues.apache.org/jira/browse/SOLR-14688 > Project: Solr > Issue Type: Improvement >Reporter: Noble Paul >Priority: Major >