[jira] [Created] (LUCENE-9636) Extract AND operation to get a SIMD optimization
Feng Guo created LUCENE-9636: Summary: Extract AND operation to get a SIMD optimization Key: LUCENE-9636 URL: https://issues.apache.org/jira/browse/LUCENE-9636 Project: Lucene - Core Issue Type: Improvement Components: core/codecs Reporter: Feng Guo In `decode6()`, `decode7()`, `decode14()`, `decode15()`, and `decode24()`, longs are always `&`ed with the same mask and then shifted. By printing the assembly, I found that the JIT did not optimize these loops with SIMD instructions. But when we extract all the `&` operations and perform them first, the JIT will optimize them with SIMD. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] gf2121 opened a new pull request #2139: LUCENE-9636: Extract AND operation to get a SIMD optimization
gf2121 opened a new pull request #2139: URL: https://github.com/apache/lucene-solr/pull/2139

# Description

In `decode6()`, `decode7()`, `decode14()`, `decode15()`, and `decode24()`, longs are always `&`ed with the same mask and then shifted. By printing the assembly, I found that the JIT did not optimize these loops with SIMD instructions. But when we extract all the `&` operations and perform them first, the JIT uses SIMD to optimize them.

# Tests

Java version:

> java version "11.0.6" 2020-01-14 LTS
> Java(TM) SE Runtime Environment 18.9 (build 11.0.6+8-LTS)
> Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.6+8-LTS, mixed mode)

Using `decode15` as an example, here is a microbenchmark based on JMH:

**code**
```
@Benchmark
@BenchmarkMode({Mode.Throughput})
@Fork(1)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
public void decode15a() {
  for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) {
    long l0 = (TMP[tmpIdx+0] & MASK16_1) << 14;
    l0 |= (TMP[tmpIdx+1] & MASK16_1) << 13;
    l0 |= (TMP[tmpIdx+2] & MASK16_1) << 12;
    l0 |= (TMP[tmpIdx+3] & MASK16_1) << 11;
    l0 |= (TMP[tmpIdx+4] & MASK16_1) << 10;
    l0 |= (TMP[tmpIdx+5] & MASK16_1) << 9;
    l0 |= (TMP[tmpIdx+6] & MASK16_1) << 8;
    l0 |= (TMP[tmpIdx+7] & MASK16_1) << 7;
    l0 |= (TMP[tmpIdx+8] & MASK16_1) << 6;
    l0 |= (TMP[tmpIdx+9] & MASK16_1) << 5;
    l0 |= (TMP[tmpIdx+10] & MASK16_1) << 4;
    l0 |= (TMP[tmpIdx+11] & MASK16_1) << 3;
    l0 |= (TMP[tmpIdx+12] & MASK16_1) << 2;
    l0 |= (TMP[tmpIdx+13] & MASK16_1) << 1;
    l0 |= (TMP[tmpIdx+14] & MASK16_1) << 0;
    ARR[longsIdx+0] = l0;
  }
}

@Benchmark
@BenchmarkMode({Mode.Throughput})
@Fork(1)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
public void decode15b() {
  shiftLongs(TMP, 30, TMP, 0, 0, MASK16_1);
  for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) {
    long l0 = TMP[tmpIdx+0] << 14;
    l0 |= TMP[tmpIdx+1] << 13;
    l0 |= TMP[tmpIdx+2] << 12;
    l0 |= TMP[tmpIdx+3] << 11;
    l0 |= TMP[tmpIdx+4] << 10;
    l0 |= TMP[tmpIdx+5] << 9;
    l0 |= TMP[tmpIdx+6] << 8;
    l0 |= TMP[tmpIdx+7] << 7;
    l0 |= TMP[tmpIdx+8] << 6;
    l0 |= TMP[tmpIdx+9] << 5;
    l0 |= TMP[tmpIdx+10] << 4;
    l0 |= TMP[tmpIdx+11] << 3;
    l0 |= TMP[tmpIdx+12] << 2;
    l0 |= TMP[tmpIdx+13] << 1;
    l0 |= TMP[tmpIdx+14] << 0;
    ARR[longsIdx+0] = l0;
  }
}
```

**Result**
```
Benchmark              Mode  Cnt          Score         Error  Units
MyBenchmark.decode15a  thrpt  10   65234108.600 ± 1336311.970  ops/s
MyBenchmark.decode15b  thrpt  10  106840656.363 ±  448026.092  ops/s
```

And an end-to-end test based on _wikimedium1m_ also looks positive overall:
```
                   Fuzzy1      131.77  (5.4%)      131.75  (4.2%)   -0.0% (  -9% -   10%) 0.990
                MedPhrase      146.41  (4.5%)      146.44  (4.8%)    0.0% (  -8% -    9%) 0.992
               AndHighMed      643.10  (5.4%)      643.95  (5.5%)    0.1% ( -10% -   11%) 0.939
             HighSpanNear      125.99  (5.7%)      126.48  (4.9%)    0.4% (  -9% -   11%) 0.818
                  Respell      164.81  (4.9%)      165.48  (4.5%)    0.4% (  -8% -   10%) 0.783
         HighSloppyPhrase      103.20  (6.2%)      103.65  (5.8%)    0.4% ( -10% -   13%) 0.816
                   IntNRQ      662.80  (5.0%)      665.87  (5.1%)    0.5% (  -9% -   11%) 0.770
                  Prefix3      882.57  (6.8%)      887.18  (8.6%)    0.5% ( -13% -   17%) 0.832
          LowSloppyPhrase       76.17  (5.5%)       76.57  (5.0%)    0.5% (  -9% -   11%) 0.754
              AndHighHigh      236.71  (5.8%)      237.99  (5.2%)    0.5% (  -9% -   12%) 0.756
                   Fuzzy2      100.40  (5.6%)      101.02  (4.7%)    0.6% (  -9% -   11%) 0.708
               OrHighHigh      154.05  (5.4%)      155.08  (5.0%)    0.7% (  -9% -   11%) 0.684
                LowPhrase      327.86  (4.4%)      330.10  (4.9%)    0.7% (  -8% -   10%) 0.641
BrowseDayOfYearSSDVFacets      120.00  (5.1%)      120.88  (4.5%)    0.7% (  -8% -   10%) 0.627
                  MedTerm     2239.68  (6.3%)     2256.94  (5.9%)
```
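For readers unfamiliar with the `shiftLongs` call in `decode15b`: it hoists the shared mask (and shift) into one tight loop over the array, which is exactly the shape HotSpot's auto-vectorizer recognizes. A minimal sketch matching the semantics of the `shiftLongs` helper in Lucene's generated `ForUtil` (treat the exact signature as illustrative):

```java
// Applies (a[i] >>> shift) & mask to `count` longs from `a`, writing the
// results into `b` starting at index `bi`. Each iteration is independent,
// so the JIT can compile the loop down to SIMD shift/and instructions.
private static void shiftLongs(long[] a, int count, long[] b, int bi, int shift, long mask) {
  for (int i = 0; i < count; ++i) {
    b[bi + i] = (a[i] >>> shift) & mask;
  }
}
```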
[jira] [Updated] (LUCENE-9636) Extract AND operation to get a SIMD optimization
[ https://issues.apache.org/jira/browse/LUCENE-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-9636: - Description: In decode6(), decode7(), decode14(), decode15(), and decode24(), longs are always `&`ed with the same mask and then shifted. By printing the assembly, I found that the JIT did not optimize these loops with SIMD instructions. But when we extract all the `&` operations and perform them first, the JIT will optimize them with SIMD. was: In `decode6()`, `decode7()`, `decode14()`, `decode15()`, and `decode24()`, longs are always `&`ed with the same mask and then shifted. By printing the assembly, I found that the JIT did not optimize these loops with SIMD instructions. But when we extract all the `&` operations and perform them first, the JIT will optimize them with SIMD. > Extract AND operation to get a SIMD optimization > -- > > Key: LUCENE-9636 > URL: https://issues.apache.org/jira/browse/LUCENE-9636 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Trivial > Time Spent: 10m > Remaining Estimate: 0h > > In decode6(), decode7(), decode14(), decode15(), and decode24(), longs are always `&`ed > with the same mask and then shifted. By printing the assembly, I found that the JIT > did not optimize these loops with SIMD instructions. But when we extract all the `&` > operations and perform them first, the JIT will optimize them with SIMD. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss commented on pull request #2139: LUCENE-9636: Extract AND operation to get a SIMD optimization
dweiss commented on pull request #2139: URL: https://github.com/apache/lucene-solr/pull/2139#issuecomment-742434844 This is excellent, thank you! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15039) Error in Solr Cell extract when using multipart upload with some documents
[ https://issues.apache.org/jira/browse/SOLR-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247159#comment-17247159 ] sam marshall commented on SOLR-15039: - In case it is helpful for reproducing the problem, here is a complete sequence of commands that will reproduce it starting from a fresh Ubuntu 18.04 installation (I used a Microsoft Azure VM). It uses a fresh Solr 8.7.0 installation with the supplied 'techproducts' sample, which has the extract handler enabled and which I assume is correctly configured. After creating the new VM, I copied the file b364b24b-public into the home directory, and then this is the full sequence of commands I needed to reproduce it (it doesn't quite run as a script; you have to press Y or Q at a couple of points):
{code}
sudo apt install openjdk-11-jdk
wget https://archive.apache.org/dist/lucene/solr/8.7.0/solr-8.7.0.tgz
tar xzf solr-8.7.0.tgz solr-8.7.0/bin/install_solr_service.sh --strip-components=2
sudo bash ./install_solr_service.sh solr-8.7.0.tgz
sudo su - solr -c "/opt/solr/bin/solr create -c testcollection -d sample_techproducts_configs"
curl "http://localhost:8983/solr/testcollection/update/extract?&extractOnly=true" --data-binary '@b364b24b-public' -H 'Content-type:text/html' > nonmultipart-result.txt
curl "http://localhost:8983/solr/testcollection/update/extract?&extractOnly=true" -F 'myfile=@b364b24b-public' -H 'Content-type:text/html' > multipart-result.txt
{code}
After that point you can see the results in the two files, which are of clearly different sizes:
{code}
sam@solr-test-temp:~$ ls -l
total 212648
-rw-r--r-- 1 sam sam  10323956 Dec 10 10:32 b364b24b-public
-rwxr-xr-x 1 sam sam     12694 Oct 28 09:21 install_solr_service.sh
-rw-rw-r-- 1 sam sam   6589425 Dec 10 10:40 multipart-result.txt
-rw-rw-r-- 1 sam sam      9988 Dec 10 10:39 nonmultipart-result.txt
-rw-rw-r-- 1 sam sam 200805960 Oct 29 19:05 solr-8.7.0.tgz
{code}

> Error in Solr Cell extract when using multipart upload with some documents
> --
>
> Key: SOLR-15039
> URL: https://issues.apache.org/jira/browse/SOLR-15039
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Components: contrib - Solr Cell (Tika extraction)
> Affects Versions: 6.6.4, 8.4, 8.6.3, 8.7
> Reporter: sam marshall
> Priority: Major
> Attachments: b364b24b-public
>
> (Note: I asked about this in the IRC channel as prompted, but didn't get a response.)
> When uploading particular documents to /update/extract, you get different (wrong) results if you are using multipart file upload compared to the basic encoded upload, even though both methods are shown on the documentation page ([https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html]).
> The first example in the documentation page uses a multipart POST with a field called 'myfile' set to the file content. Some later examples use a standard POST with the raw data provided.
> Here are these two approaches in the commands I used with my example file (I have replaced the URL, username, password, and collection name for my Solr, which isn't publicly available):
> {code}
> curl --user myuser:mypassword "https://example.org/solr/mycollection/update/extract?&extractOnly=true" --data-binary '@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' -H 'Content-type:text/html' > nonmultipart-result.txt
> curl --user myuser:mypassword "https://example.org/solr/mycollection/update/extract?&extractOnly=true" -F 'myfile=@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' -H 'Content-type:text/html' > multipart-result.txt
> {code}
> The example file is a ~10MB PowerPoint with a few sentences of English text in it (and some pictures).
> The nonmultipart-result.txt file is 9,871 bytes long and JSON-encoded; it includes an XHTML version of the text content of the PowerPoint, and some metadata.
> The multipart-result.txt is 7,352,348 bytes long and contains mainly a large sequence of Chinese characters, or at least, random data being interpreted as Chinese characters.
> This example was running against Solr 8.4 on a Linux server from our cloud Solr supplier. On another Linux (Ubuntu 18) server that I set up myself I got the same results using various other Solr versions. Running against localhost, which is a Windows 10 machine with Solr 8.5, I get slightly different results; the non-multipart works correctly but the multipart-result.txt in that case is a slightly more helpful error 500 message:
> {code}
> 500
> 138
> org.apache.solr.common.SolrException
> java.util.zip.ZipException
> org.apache.tika.exception.TikaException: E
[jira] [Commented] (SOLR-13101) Shared storage support in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247258#comment-17247258 ] David Smiley commented on SOLR-13101: - I would like to close this issue as won't-fix because the substance and feature branch (with linked PRs) pointing to this issue is dead in the water (it will not be merged, or further publicly contributed to). However, the issue title, "Shared storage support" (rather general), is not a "won't-fix"! So with that, I propose I re-title the issue to "Shared storage via new SHARED replica type" because in my mind, that's the most stand-out aspect of this PR compared to other alternatives. WDYT [~ilan]? That said, do not lose hope for a solution to come into being! I've been excitedly working on a new plan, which I've been sharing internally, that solves the contributability problems that the SHARED replica type implementation had. If things go well in the coming weeks... there will end up being a new Jira issue, to be called "BlobDirectory, a shared storage approach", that will link here. > Shared storage support in SolrCloud > --- > > Key: SOLR-13101 > URL: https://issues.apache.org/jira/browse/SOLR-13101 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Yonik Seeley >Priority: Major > Time Spent: 15h 50m > Remaining Estimate: 0h > > Solr should have first-class support for shared storage (blob/object stores > like S3, google cloud storage, etc. and shared filesystems like HDFS, NFS, > etc). > The key component will likely be a new replica type for shared storage. It > would have many of the benefits of the current "pull" replicas (not indexing > on all replicas, all shards identical with no shards getting out-of-sync, > etc), but would have additional benefits: > - Any shard could become leader (the blob store always has the index) > - Better elasticity scaling down >- durability not linked to number of replicas... a single replica could be > common for write workloads >- could drop to 0 replicas for a shard when not needed (blob store always > has index) > - Allow for higher performance write workloads by skipping the transaction > log >- don't pay for what you don't need >- a commit will be necessary to flush to stable storage (blob store) > - A lot of the complexity and failure modes go away > An additional component is a Directory implementation that will work well with > blob stores. We probably want one that treats local disk as a cache since > the latency to remote storage is so large. I think there are still some > "locking" issues to be solved here (ensuring that more than one writer to the > same index won't corrupt it). This should probably be pulled out into a > different JIRA issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
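The "Directory implementation that treats local disk as a cache" mentioned in the description could, in spirit, be a thin wrapper like the sketch below. This is a hypothetical illustration only, not code from the feature branch; `FilterDirectory` and `copyFrom` are real Lucene APIs, while `remote` stands in for an imagined blob-store-backed Directory:

{code:java}
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

/** Hypothetical: serve reads from a fast local Directory, filling it
 *  from a remote (blob-store-backed) Directory on first access. */
class LocalCacheDirectory extends FilterDirectory {
  private final Directory remote; // stand-in for an S3/GCS-backed Directory

  LocalCacheDirectory(Directory localCache, Directory remote) {
    super(localCache); // the inherited `in` is the local cache
    this.remote = remote;
  }

  @Override
  public IndexInput openInput(String name, IOContext context) throws IOException {
    // On a cache miss, pull the file down from the remote store first.
    if (!Arrays.asList(in.listAll()).contains(name)) {
      in.copyFrom(remote, name, name, context);
    }
    return in.openInput(name, context);
  }
}
{code}

The hard parts the issue calls out (locking, more than one concurrent writer) are exactly what a sketch like this glosses over.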
[jira] [Created] (SOLR-15040) Improvements to postlogs timestamp handling
Joel Bernstein created SOLR-15040: - Summary: Improvements to postlogs timestamp handling Key: SOLR-15040 URL: https://issues.apache.org/jira/browse/SOLR-15040 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Reporter: Joel Bernstein This ticket will make some small improvements to how the bin/postlogs program handles timestamps. In particular, it will change the format of the datetime stamp so that it matches the ISO spec more closely. It will also add a few date-truncated string timestamp fields, which make time series analysis easier. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
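As background on the two changes (illustrative java.time mechanics only, with made-up values; this is not the patch):

{code:java}
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class TimestampDemo {
  public static void main(String[] args) {
    Instant ts = Instant.parse("2020-12-10T10:32:15.123Z");
    // Full ISO-8601 instant, as produced by DateTimeFormatter.ISO_INSTANT:
    System.out.println(DateTimeFormatter.ISO_INSTANT.format(ts)); // 2020-12-10T10:32:15.123Z
    // Truncated variants make bucketing by minute or day trivial:
    System.out.println(ts.truncatedTo(ChronoUnit.MINUTES)); // 2020-12-10T10:32:00Z
    System.out.println(ts.truncatedTo(ChronoUnit.DAYS));    // 2020-12-10T00:00:00Z
  }
}
{code}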
[jira] [Assigned] (SOLR-15040) Improvements to postlogs timestamp handling
[ https://issues.apache.org/jira/browse/SOLR-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Bernstein reassigned SOLR-15040: - Assignee: Joel Bernstein > Improvements to postlogs timestamp handling > --- > > Key: SOLR-15040 > URL: https://issues.apache.org/jira/browse/SOLR-15040 > Project: Solr > Issue Type: Improvement > Security Level: Public (Default Security Level. Issues are Public) >Reporter: Joel Bernstein >Assignee: Joel Bernstein >Priority: Minor > > This ticket will make some small improvements to how the bin/postlogs > program handles timestamps. In particular, it will change the format of the > datetime stamp so that it matches the ISO spec more closely. It will also add > a few date-truncated string timestamp fields, which make time > series analysis easier. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-15040) Improvements to postlogs timestamp handling
[ https://issues.apache.org/jira/browse/SOLR-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Bernstein updated SOLR-15040: -- Attachment: SOLR-15040.patch > Improvements to postlogs timestamp handling > --- > > Key: SOLR-15040 > URL: https://issues.apache.org/jira/browse/SOLR-15040 > Project: Solr > Issue Type: Improvement > Security Level: Public (Default Security Level. Issues are Public) >Reporter: Joel Bernstein >Assignee: Joel Bernstein >Priority: Minor > Attachments: SOLR-15040.patch > > > This ticket will make some small improvements to how the bin/postlogs > program handles timestamps. In particular, it will change the format of the > datetime stamp so that it matches the ISO spec more closely. It will also add > a few date-truncated string timestamp fields, which make time > series analysis easier. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15040) Improvements to postlogs timestamp handling
[ https://issues.apache.org/jira/browse/SOLR-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247398#comment-17247398 ] ASF subversion and git services commented on SOLR-15040: Commit 04b9a9806013d98b8ad78a33a905d10dadf3129a in lucene-solr's branch refs/heads/master from Joel Bernstein [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=04b9a98 ] SOLR-15040: Improvements to postlogs timestamp handling > Improvements to postlogs timestamp handling > --- > > Key: SOLR-15040 > URL: https://issues.apache.org/jira/browse/SOLR-15040 > Project: Solr > Issue Type: Improvement > Security Level: Public (Default Security Level. Issues are Public) >Reporter: Joel Bernstein >Assignee: Joel Bernstein >Priority: Minor > Attachments: SOLR-15040.patch > > > This ticket will make some small improvements to how the bin/postlogs > program handles timestamps. In particular, it will change the format of the > datetime stamp so that it matches the ISO spec more closely. It will also add > a few date-truncated string timestamp fields, which make time > series analysis easier. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15040) Improvements to postlogs timestamp handling
[ https://issues.apache.org/jira/browse/SOLR-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247430#comment-17247430 ] ASF subversion and git services commented on SOLR-15040: Commit 3bb4ed24d89e2efab742dde5f666049f7d4fff0c in lucene-solr's branch refs/heads/branch_8x from Joel Bernstein [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3bb4ed2 ] SOLR-15040: Improvements to postlogs timestamp handling > Improvements to postlogs timestamp handling > --- > > Key: SOLR-15040 > URL: https://issues.apache.org/jira/browse/SOLR-15040 > Project: Solr > Issue Type: Improvement > Security Level: Public (Default Security Level. Issues are Public) >Reporter: Joel Bernstein >Assignee: Joel Bernstein >Priority: Minor > Attachments: SOLR-15040.patch > > > This ticket will make some small improvements to how the bin/postlogs > program handles timestamps. In particular, it will change the format of the > datetime stamp so that it matches the ISO spec more closely. It will also add > a few date-truncated string timestamp fields, which make time > series analysis easier. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-15041) CSV update handler can't handle line breaks/new lines together with field split/separators for multivalued fields
Matt Hov created SOLR-15041: --- Summary: CSV update handler can't handle line breaks/new lines together with field split/separators for multivalued fields Key: SOLR-15041 URL: https://issues.apache.org/jira/browse/SOLR-15041 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Components: update Affects Versions: 8.4 Environment: Ubuntu 20.04, 8 CPU, 60GB+ RAM Reporter: Matt Hov

I've been using the /update/csv option to bulk import large amounts of data with great success, but I believe I've found a corner case in the CSV parsing when the field is a multi-valued string field with a new-line character in it. As soon as you specify {{f.[fieldname].split=true&f.[fieldname].separator=[something]}}, the multi-field/split parsing stops at the first line break.

My managed schema:
{code:java}
-- managed schema
{code}

Example POST url; I'm using ! as the split character for test1_strs and test2_strs:
{code:java}
http://[myserver]/solr/[mycore]/update/csv?commitWithin=1000&f.test1_strs.split=true&f.test1_strs.separator=!&f.test2_strs.split=true&f.test2_strs.separator=!
{code}

CSV content (notice the new-lines are included but encapsulated by ""; these new-lines need to be maintained as is):
{code:java}
id,title,test1_strs,test2_strs,test3_str
csv_test,title,"first line
with break!second line","first line!second_line","a line
break"
{code}

Resulting Solr Doc:
{code:java}
{
  "id":"csv_test",
  "title":"title",
  "_version_":1685718010076069888,
  "test1_strs":["first line "],
  "test2_strs":["first line",
    "second_line"],
  "test3_str":"a line\r\nbreak"}]
}
{code}

Note that in the single-value {{test3_str}} the new-line is appropriately maintained as \r\n (or just \n when this is done via code instead of manually). {{test2_strs}} shows that the multi-value split on ! worked correctly. {{test1_strs}} immediately stops processing after the first value's new-line, instead of at the actual separator after the new-line. Expected values should look like:
{code:java}
{
  "id":"csv_test",
  "title":"title",
  "_version_":1685718010076069888,
  "test1_strs":["first line\r\nwith break",
    "second line"],
  "test2_strs":["first line",
    "second_line"],
  "test3_str":"a line\r\nbreak"}]
}
{code}

I've tried pre-escaping line breaks, but all that gives me is the escaped new-line in Solr, which would need to be post-processed on the consuming end to return it to a \r\n (or \n) and would be nontrivial to do. Solr handles \n just fine in all other cases, so I consider this the expected behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
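For clarity, the expected split semantics are: only the configured separator delimits values, and embedded line breaks survive inside each value. A plain-Java illustration of that expectation (this is not Solr's CSV loader, just the string semantics):

{code:java}
public class CsvSplitDemo {
  public static void main(String[] args) {
    // The raw cell value after CSV quote handling, line break preserved:
    String test1 = "first line\r\nwith break!second line";
    // Splitting on '!' alone should keep the \r\n inside the first value.
    for (String v : test1.split("!")) {
      System.out.println("[" + v + "]");
    }
    // Expected output:
    // [first line
    // with break]
    // [second line]
  }
}
{code}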
[GitHub] [lucene-solr] mayya-sharipova commented on pull request #2129: Fix format indent from 4 to 2 spaces
mayya-sharipova commented on pull request #2129: URL: https://github.com/apache/lucene-solr/pull/2129#issuecomment-742759335 @msokolov Thanks for your comment. Indeed, having `gradlew precommit` fail on inconsistent code style would be useful. Thanks for the feedback; I will merge the PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] mayya-sharipova merged pull request #2129: Fix format indent from 4 to 2 spaces
mayya-sharipova merged pull request #2129: URL: https://github.com/apache/lucene-solr/pull/2129 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13101) Shared storage support in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247510#comment-17247510 ] Ilan Ginzburg commented on SOLR-13101: -- I have no issue with "will not fix" for this Jira. From my perspective, the fundamental problem of this approach is not the introduction of a new replica type but the need to commit every batch in order to push segments, and having to wait for the push to complete and succeed before considering the indexing itself successful (there are a few possible optimizations, such as pushing files before the commit happens so they're ready on the blob store by then, but the fundamental issues do not go away). That's a major performance degradation. So yes, please close it. Thanks. Looking forward to seeing a different approach that does not have the problems listed above! (or fewer of them :)) > Shared storage support in SolrCloud > --- > > Key: SOLR-13101 > URL: https://issues.apache.org/jira/browse/SOLR-13101 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Yonik Seeley >Priority: Major > Time Spent: 15h 50m > Remaining Estimate: 0h > > Solr should have first-class support for shared storage (blob/object stores > like S3, google cloud storage, etc. and shared filesystems like HDFS, NFS, > etc). > The key component will likely be a new replica type for shared storage. It > would have many of the benefits of the current "pull" replicas (not indexing > on all replicas, all shards identical with no shards getting out-of-sync, > etc), but would have additional benefits: > - Any shard could become leader (the blob store always has the index) > - Better elasticity scaling down >- durability not linked to number of replicas... a single replica could be > common for write workloads >- could drop to 0 replicas for a shard when not needed (blob store always > has index) > - Allow for higher performance write workloads by skipping the transaction > log >- don't pay for what you don't need >- a commit will be necessary to flush to stable storage (blob store) > - A lot of the complexity and failure modes go away > An additional component is a Directory implementation that will work well with > blob stores. We probably want one that treats local disk as a cache since > the latency to remote storage is so large. I think there are still some > "locking" issues to be solved here (ensuring that more than one writer to the > same index won't corrupt it). This should probably be pulled out into a > different JIRA issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob commented on pull request #2120: SOLR-15029 More gracefully give up shard leadership
madrob commented on pull request #2120: URL: https://github.com/apache/lucene-solr/pull/2120#issuecomment-742815596 Converting back to draft, as the new asserts that I added in the unit test are failing. Further discussion on JIRA. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (SOLR-15026) MiniSolrCloudCluster can inconsistently get confused about when it's using SSL
[ https://issues.apache.org/jira/browse/SOLR-15026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Potter reassigned SOLR-15026: - Assignee: (was: Timothy Potter)

> MiniSolrCloudCluster can inconsistently get confused about when it's using SSL
> --
>
> Key: SOLR-15026
> URL: https://issues.apache.org/jira/browse/SOLR-15026
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Chris M. Hostetter
> Priority: Major
>
> A new test added in SOLR-14934 caused the following reproducible failure to pop up on jenkins...
> {noformat}
> hossman@slate:~/lucene/dev [j11] [master] $ ./gradlew -p solr/test-framework/ test --tests MiniSolrCloudClusterTest.testSolrHomeAndResourceLoaders -Dtests.seed=806A85748BD81F48 -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=ln-CG -Dtests.timezone=Asia/Thimbu -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> Starting a Gradle Daemon (subsequent builds will be faster)
>
> > Task :randomizationInfo
> Running tests with randomization seed: tests.seed=806A85748BD81F48
>
> > Task :solr:test-framework:test
> org.apache.solr.cloud.MiniSolrCloudClusterTest > testSolrHomeAndResourceLoaders FAILED
>     org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: https://127.0.0.1:38681/solr
>         at __randomizedtesting.SeedInfo.seed([806A85748BD81F48:37548FA7602CB5FD]:0)
>         at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:712)
>         at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:269)
>         at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:251)
>         at org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:390)
>         at org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:360)
>         at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1168)
>         at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:931)
>         at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:865)
>         at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:229)
>         at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:246)
>         at org.apache.solr.cloud.MiniSolrCloudClusterTest.testSolrHomeAndResourceLoaders(MiniSolrCloudClusterTest.java:125)
>         ...
>     Caused by:
>     javax.net.ssl.SSLException: Unsupported or unrecognized SSL message
>         at java.base/sun.security.ssl.SSLSocketInputRecord.handleUnknownRecord(SSLSocketInputRecord.java:439)
> {noformat}
> The problem seems to be that even though the MiniSolrCloudCluster being instantiated isn't _intentionally_ using any SSL randomization (it just uses {{JettyConfig.builder().build()}}), the CloudSolrClient returned by {{cluster.getSolrClient()}} is evidently picking up the randomized SSL and trying to use it to talk to the cluster.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (SOLR-14766) Deprecate ManagedResources from Solr
[ https://issues.apache.org/jira/browse/SOLR-14766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Potter reassigned SOLR-14766: - Assignee: (was: Timothy Potter) > Deprecate ManagedResources from Solr > > > Key: SOLR-14766 > URL: https://issues.apache.org/jira/browse/SOLR-14766 > Project: Solr > Issue Type: Task >Reporter: Noble Paul >Priority: Major > Labels: deprecation > Attachments: SOLR-14766.patch > > Time Spent: 1h > Remaining Estimate: 0h > > This feature has the following problems. > * It's insecure because it uses restlet > * Nobody knows that code well enough to even remove the restlet dependency > * The restlet dependency on Solr exists just because of this > We should deprecate this in 8.7 and remove it from master -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (SOLR-9008) Investigate feasibility and impact of using SparseFixedBitSet where Solr is currently using FixedBitSet
[ https://issues.apache.org/jira/browse/SOLR-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Potter reassigned SOLR-9008: Assignee: (was: Timothy Potter) > Investigate feasibility and impact of using SparseFixedBitSet where Solr is > currently using FixedBitSet > -- > > Key: SOLR-9008 > URL: https://issues.apache.org/jira/browse/SOLR-9008 > Project: Solr > Issue Type: Improvement >Reporter: Timothy Potter >Priority: Major > > Found this gem in one of Mike's blog posts: > {quote} > But with 5.0.0, Lucene now supports random-writable and advance-able sparse > bitsets (RoaringDocIdSet and SparseFixedBitSet), so the heap required is in > proportion to how many bits are set, not how many total documents exist in > the index. > {quote} > http://blog.mikemccandless.com/2014/11/apache-lucene-500-is-coming.html > I don't see any uses of either of these classes in Solr code, but from a quick > look it sounds compelling for saving memory, such as when caching fq's. > This ticket is for exploring where Solr can leverage these structures and > whether there's an improvement in performance and/or memory usage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (SOLR-6443) TestManagedResourceStorage fails on Jenkins with SolrCore.getOpenCount()==2
[ https://issues.apache.org/jira/browse/SOLR-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Potter reassigned SOLR-6443: Assignee: (was: Timothy Potter)

> TestManagedResourceStorage fails on Jenkins with SolrCore.getOpenCount()==2
> ---
>
> Key: SOLR-6443
> URL: https://issues.apache.org/jira/browse/SOLR-6443
> Project: Solr
> Issue Type: Bug
> Components: Schema and Analysis
> Reporter: Timothy Potter
> Priority: Major
>
> FAILED: junit.framework.TestSuite.org.apache.solr.rest.TestManagedResourceStorage
> Error Message: SolrCore.getOpenCount()==2
> Stack Trace:
> java.lang.RuntimeException: SolrCore.getOpenCount()==2
>     at __randomizedtesting.SeedInfo.seed([A491D1FD4CEF5EF8]:0)
>     at org.apache.solr.util.TestHarness.close(TestHarness.java:332)
>     at org.apache.solr.SolrTestCaseJ4.deleteCore(SolrTestCaseJ4.java:620)
>     at org.apache.solr.SolrTestCaseJ4.afterClass(SolrTestCaseJ4.java:183)
>     at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:484)

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9564) Format code automatically and enforce it
[ https://issues.apache.org/jira/browse/LUCENE-9564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247581#comment-17247581 ] Erick Erickson commented on LUCENE-9564: I've been in 2-hour meetings early in my career when I was young and unsure of myself arguing about whether the curly braces should be at the end of an "if" (or whatever) statement or on the next line. And then if on the next line, should the curly brace be indented or should it be flush with the "if". And should the first code line be on the same line as the curly brace? If on the next line, should it be flush with the curly brace or indented again? Then had the conversation repeat some time later when the person(s) who didn't get what they wanted brought it up again. Best guy I ever worked for had a method of dealing with this. If the topic was brought up again he'd say "We decided it this way, end of discussion". Later in my career I'd have walked out about 30 seconds into that conversation. So you can see why it's easy to get me to sign on ;) When we reconcile the reference impl, I can help... > Format code automatically and enforce it > > > Key: LUCENE-9564 > URL: https://issues.apache.org/jira/browse/LUCENE-9564 > Project: Lucene - Core > Issue Type: Task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Trivial > Time Spent: 2h 20m > Remaining Estimate: 0h > > This is a trivial change but a bold move. And I'm sure it's not for everyone. > I started using google java format [1] in my projects a while ago and have > never looked back since. It is an oracle-style formatter (doesn't allow > customizations or deviations from the defined 'ideal') - this takes some > getting used to - but it also eliminates *all* the potential differences > between IDEs, configs, etc. And the formatted code typically looks much > better than hand-edited one. It is also verifiable on precommit (so you can't > commit code that deviates from what you'd get from automated formatting > output). > The biggest benefit I see is that refactorings become such a joy and keep the > code neat, everywhere. Before you commit you just reformat everything > automatically, no matter how much you messed it up. > This isn't a change for everyone. I myself love hand-edited, neat code... but > the reality is that with IDE support for automated code changes and so many > people with different styles working on the same codebase keeping it neat is > a big pain. > Checkstyle and other tools are fine for ensuring certain rules but they don't > take the burden of formatting off your shoulders. This tool does. > Like I said - I had *great* reservations about using it at the beginning but > over time got so used to it that I almost can't live without it now. It's > like magic - you play with the code in any way you like, then run formatting > and it's nice and neat. > The downside is that automated formatting does imply potential merge problems > in backward patches (or any currently existing branches). > Like I said, it is a bold move. Just throwing this for your consideration. > -I've added a PR that adds spotless but it's not ready; some files would have > to be excluded as they currently violate header rules.- > A more interesting thing is here where the current code is automatically > reformatted - this branch is for eyeballing only. 
> https://github.com/dweiss/lucene-solr/compare/LUCENE-9564...dweiss:LUCENE-9564-example > [1] https://google.github.io/styleguide/javaguide.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
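For reference, the Spotless Gradle plugin is the usual way to wire google-java-format into a build like this one. A minimal sketch (the plugin version and exact DSL here are assumptions, not what the LUCENE-9564 branch actually contains):

{code}
// build.gradle
plugins {
  id 'com.diffplug.spotless' version '5.8.2'
}

spotless {
  java {
    googleJavaFormat()  // one canonical style, no per-developer deviations
  }
}
{code}

With that wiring, `gradlew spotlessCheck` fails the build on formatting deviations and `gradlew spotlessApply` rewrites the sources, which is what makes the style enforceable at precommit time.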
[GitHub] [lucene-solr] ErickErickson commented on pull request #2129: Fix format indent from 4 to 2 spaces
ErickErickson commented on pull request #2129: URL: https://github.com/apache/lucene-solr/pull/2129#issuecomment-742891680 It already fails on _tabs_ rather than spaces, but failing on too many spaces isn't checked. That said, rather than a one-off for indentation, I'd rather see the effort go into https://issues.apache.org/jira/browse/LUCENE-9564 and SOLR-14920 than into a separate precommit check. BTW, 'gradlew check' does all the precommit tasks as well as running the tests. > On Dec 10, 2020, at 2:55 PM, Mayya Sharipova wrote: > Merged #2129 into master. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-13101) Shared storage via a new SHARED replica type
[ https://issues.apache.org/jira/browse/SOLR-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Smiley updated SOLR-13101: Description: _This issue is closed as Won't-Fix because the particular approach here won't be contributed. Linked issues may appear that approach it differently._ Solr should have first-class support for shared storage (blob/object stores like S3, google cloud storage, etc. and shared filesystems like HDFS, NFS, etc). The key component will likely be a new replica type for shared storage. It would have many of the benefits of the current "pull" replicas (not indexing on all replicas, all shards identical with no shards getting out-of-sync, etc), but would have additional benefits: - Any shard could become leader (the blob store always has the index) - Better elasticity scaling down - durability not linked to number of replicas... a single replica could be common for write workloads - could drop to 0 replicas for a shard when not needed (blob store always has index) - Allow for higher performance write workloads by skipping the transaction log - don't pay for what you don't need - a commit will be necessary to flush to stable storage (blob store) - A lot of the complexity and failure modes go away An additional component is a Directory implementation that will work well with blob stores. We probably want one that treats local disk as a cache since the latency to remote storage is so large. I think there are still some "locking" issues to be solved here (ensuring that more than one writer to the same index won't corrupt it). This should probably be pulled out into a different JIRA issue. was: Solr should have first-class support for shared storage (blob/object stores like S3, google cloud storage, etc. and shared filesystems like HDFS, NFS, etc). The key component will likely be a new replica type for shared storage. It would have many of the benefits of the current "pull" replicas (not indexing on all replicas, all shards identical with no shards getting out-of-sync, etc), but would have additional benefits: - Any shard could become leader (the blob store always has the index) - Better elasticity scaling down - durability not linked to number of replicas... a single replica could be common for write workloads - could drop to 0 replicas for a shard when not needed (blob store always has index) - Allow for higher performance write workloads by skipping the transaction log - don't pay for what you don't need - a commit will be necessary to flush to stable storage (blob store) - A lot of the complexity and failure modes go away An additional component is a Directory implementation that will work well with blob stores. We probably want one that treats local disk as a cache since the latency to remote storage is so large. I think there are still some "locking" issues to be solved here (ensuring that more than one writer to the same index won't corrupt it). This should probably be pulled out into a different JIRA issue. Summary: Shared storage via a new SHARED replica type (was: Shared storage support in SolrCloud) > Shared storage via a new SHARED replica type > > > Key: SOLR-13101 > URL: https://issues.apache.org/jira/browse/SOLR-13101 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Yonik Seeley >Priority: Major > Time Spent: 15h 50m > Remaining Estimate: 0h > > _This issue is closed as Won't-Fix because the particular approach here won't > be contributed. Linked issues may appear that approach it differently._ > > Solr should have first-class support for shared storage (blob/object stores > like S3, google cloud storage, etc. and shared filesystems like HDFS, NFS, > etc). > The key component will likely be a new replica type for shared storage. It > would have many of the benefits of the current "pull" replicas (not indexing > on all replicas, all shards identical with no shards getting out-of-sync, > etc), but would have additional benefits: > - Any shard could become leader (the blob store always has the index) > - Better elasticity scaling down > - durability not linked to number of replicas... a single replica could be > common for write workloads > - could drop to 0 replicas for a shard when not needed (blob store always > has index) > - Allow for higher performance write workloads by skipping the transaction > log > - don't pay for what you don't need > - a commit will be necessary to flush to stable storage (blob store) > - A lot of the complexity and failure modes go away > An additional component is a Directory implementation that will work well with > blob stores. We probably want one that treats local disk as a cache since the > latency to remote storage
[jira] [Resolved] (SOLR-13101) Shared storage via a new SHARED replica type
[ https://issues.apache.org/jira/browse/SOLR-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Smiley resolved SOLR-13101. - Resolution: Won't Fix > Shared storage via a new SHARED replica type > > > Key: SOLR-13101 > URL: https://issues.apache.org/jira/browse/SOLR-13101 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Yonik Seeley >Priority: Major > Time Spent: 15h 50m > Remaining Estimate: 0h > > _This issue is closed as Won't-Fix because the particular approach here won't > be contributed. Linked issues may appear that approach it differently._ > > Solr should have first-class support for shared storage (blob/object stores > like S3, google cloud storage, etc. and shared filesystems like HDFS, NFS, > etc). > The key component will likely be a new replica type for shared storage. It > would have many of the benefits of the current "pull" replicas (not indexing > on all replicas, all shards identical with no shards getting out-of-sync, > etc), but would have additional benefits: > - Any shard could become leader (the blob store always has the index) > - Better elasticity scaling down > - durability not linked to number of replicas... a single replica could be > common for write workloads > - could drop to 0 replicas for a shard when not needed (blob store always > has index) > - Allow for higher performance write workloads by skipping the transaction > log > - don't pay for what you don't need > - a commit will be necessary to flush to stable storage (blob store) > - A lot of the complexity and failure modes go away > An additional component is a Directory implementation that will work well with > blob stores. We probably want one that treats local disk as a cache since the > latency to remote storage is so large. I think there are still some "locking" > issues to be solved here (ensuring that more than one writer to the same > index won't corrupt it). This should probably be pulled out into a different > JIRA issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] zacharymorn opened a new pull request #2140: UCENE-9346: Support minimumNumberShouldMatch in WANDScorer
zacharymorn opened a new pull request #2140: URL: https://github.com/apache/lucene-solr/pull/2140

# Description

Support minimumNumberShouldMatch in WANDScorer. Currently has a few `nocommit` comments to keep track of questions.

# Solution

Similar to `MinShouldMatchSumScorer`, the logic here keeps track of the number of matched scorers for each candidate doc, and compares it with `minShouldMatch` to decide if the minimum number of optional clauses has been matched.

# Tests

Passed existing tests (especially those in `TestBooleanMinShouldMatch` and `TestWANDScorer`), and updated some that check for scores. `./gradlew check` passed with the `nocommit` rule commented out for now.

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `master` branch.
- [x] I have run `./gradlew check`.
- [x] I have added tests for my changes.
- [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] zacharymorn closed pull request #2140: UCENE-9346: Support minimumNumberShouldMatch in WANDScorer
zacharymorn closed pull request #2140: URL: https://github.com/apache/lucene-solr/pull/2140 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] zacharymorn opened a new pull request #2141: LUCENE-9346: Support minimumNumberShouldMatch in WANDScorer
zacharymorn opened a new pull request #2141: URL: https://github.com/apache/lucene-solr/pull/2141

# Description

Support minimumNumberShouldMatch in WANDScorer. Currently has a few `nocommit` comments to keep track of questions.

# Solution

Similar to `MinShouldMatchSumScorer`, the logic here keeps track of the number of matched scorers for each candidate doc, and compares it with `minShouldMatch` to decide if the minimum number of optional clauses has been matched.

# Tests

Passed existing tests (especially those in `TestBooleanMinShouldMatch` and `TestWANDScorer`), and updated some that check for scores. `./gradlew check` passed with the `nocommit` rule commented out for now.

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `master` branch.
- [x] I have run `./gradlew check`.
- [x] I have added tests for my changes.
- [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] noblepaul merged pull request #1963: SOLR-14827: Refactor schema loading to not use XPath
noblepaul merged pull request #1963: URL: https://github.com/apache/lucene-solr/pull/1963 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14827) Refactor schema loading to not use XPath
[ https://issues.apache.org/jira/browse/SOLR-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247625#comment-17247625 ] ASF subversion and git services commented on SOLR-14827: Commit a95ce0d4224539094dc602ba8afa1ff796009a2b in lucene-solr's branch refs/heads/master from Noble Paul [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a95ce0d ] SOLR-14827: Refactor schema loading to not use XPath (#1963) > Refactor schema loading to not use XPath > > > Key: SOLR-14827 > URL: https://issues.apache.org/jira/browse/SOLR-14827 > Project: Solr > Issue Type: Task >Reporter: Noble Paul >Assignee: Noble Paul >Priority: Major > Labels: perfomance > Time Spent: 2h 10m > Remaining Estimate: 0h > > XPath is slower than direct DOM traversal. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
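As a rough illustration of the claim (plain JAXP, nothing Solr-specific): XPath routes every lookup through an expression engine, while direct DOM traversal is a simple walk over nodes already in memory:

{code:java}
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathVsDom {
  public static void main(String[] args) throws Exception {
    String xml = "<schema><fieldType name=\"string\"/><fieldType name=\"plong\"/></schema>";
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

    // XPath: parse/compile the expression, then evaluate it against the tree.
    NodeList viaXPath = (NodeList) XPathFactory.newInstance().newXPath()
        .evaluate("/schema/fieldType", doc, XPathConstants.NODESET);

    // Direct DOM: a plain traversal, no expression engine involved.
    NodeList viaDom = doc.getElementsByTagName("fieldType");

    System.out.println(viaXPath.getLength() + " " + viaDom.getLength()); // prints: 2 2
  }
}
{code}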
[jira] [Commented] (LUCENE-9346) WANDScorer should support minimumNumberShouldMatch
[ https://issues.apache.org/jira/browse/LUCENE-9346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247640#comment-17247640 ] Zach Chen commented on LUCENE-9346: --- hi [~jpountz], I spent some time looking into this and studying the algorithms in *MinShouldMatchSumScorer* and *WANDScorer*, and just finished some initial changes and opened a draft PR. I think I went in a different direction from what you suggested above, by mainly keeping track of the number of scorers matched without changing the *WANDScorer* algorithm (not sure if I understand it enough to make a correct change either :D ), and comparing it with the *minShouldMatch* parameter after *minCompetitiveScore* has been reached. Could you please take a look and let me know if that approach works as well? In the PR, I also put in some nocommit comments to keep track of some questions I have (all the tests are now passing without the nocommit comments, btw): # Currently, *WANDScorer* will only be used for *ScoreMode.TOP_SCORES*. Should it be used for other score modes as well once *MinShouldMatchSumScorer* gets deprecated? Running *WANDScorer* with other ScoreMode values now would fail some tests, I think. # For now, inside *WANDScorer*'s constructor, *WANDScorer.cost* is calculated as the sum of the costs of its individual scorers. But on *MinShouldMatchSumScorer*'s side, the cost is calculated also taking into account the *minShouldMatch* parameter, as it impacts the tail capacity. Should *minShouldMatch* be taken into account in the calculation of *WANDScorer.cost* as well, especially when the current solution in the PR doesn't change the tail capacity of *WANDScorer*? > WANDScorer should support minimumNumberShouldMatch > -- > > Key: LUCENE-9346 > URL: https://issues.apache.org/jira/browse/LUCENE-9346 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > Currently we deoptimize when a minimumNumberShouldMatch is provided and fall > back to a scorer that doesn't dynamically prune hits based on scores. > Given how WANDScorer and MinShouldMatchSumScorer are similar, I wonder if we > could remove MinShouldMatchSumScorer once WANDScorer supports minimumNumberShouldMatch. > Then any improvements we bring to WANDScorer like two-phase support > (LUCENE-8806) would automatically cover more queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
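For readers less familiar with the feature under discussion, `minimumNumberShouldMatch` is set on a `BooleanQuery` as below (standard Lucene API; the field and terms are made up). The point of the PR is that such queries could be scored by *WANDScorer* instead of falling back to *MinShouldMatchSumScorer*:

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class MinShouldMatchDemo {
  public static BooleanQuery build() {
    BooleanQuery.Builder b = new BooleanQuery.Builder();
    b.add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.SHOULD);
    b.add(new TermQuery(new Term("body", "solr")), BooleanClause.Occur.SHOULD);
    b.add(new TermQuery(new Term("body", "search")), BooleanClause.Occur.SHOULD);
    // A document must match at least 2 of the 3 SHOULD clauses.
    b.setMinimumNumberShouldMatch(2);
    return b.build();
  }
}
{code}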
[jira] [Comment Edited] (LUCENE-9346) WANDScorer should support minimumNumberShouldMatch
[ https://issues.apache.org/jira/browse/LUCENE-9346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247640#comment-17247640 ]

Zach Chen edited comment on LUCENE-9346 at 12/11/20, 5:18 AM:
--------------------------------------------------------------

(The edited comment is identical to the comment above except for one wording change: "all the tests are now passing without the nocommit comments" became "all the tests are passing without the nocommit comments".)
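On the cost question raised in point 2 of the comment above, here is a hedged sketch of the difference between the two estimates; the names are illustrative and the exact Lucene internals may differ. Summing every scorer's cost overestimates when minShouldMatch > 1, because the (minShouldMatch - 1) most expensive scorers can never produce a hit on their own; only the (numScorers - minShouldMatch + 1) cheapest scorers bound the cost:

```java
import java.util.PriorityQueue;

// Illustrative only: estimate cost in the spirit of MinShouldMatchSumScorer
// as described above, summing only the (numScorers - minShouldMatch + 1)
// cheapest scorer costs rather than all of them.
final class CostSketch {
  static long costWithMinShouldMatch(long[] costs, int minShouldMatch) {
    int keep = costs.length - minShouldMatch + 1;
    // max-heap capped at `keep` entries, so the most expensive costs get evicted
    PriorityQueue<Long> cheapest = new PriorityQueue<>((a, b) -> Long.compare(b, a));
    for (long cost : costs) {
      cheapest.add(cost);
      if (cheapest.size() > keep) {
        cheapest.poll(); // drop the current maximum
      }
    }
    return cheapest.stream().mapToLong(Long::longValue).sum();
  }
}
```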
[jira] [Commented] (SOLR-15029) Allow Shard Leader to give up leadership gracefully via shard terms
[ https://issues.apache.org/jira/browse/SOLR-15029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247646#comment-17247646 ]

Mike Drob commented on SOLR-15029:
----------------------------------

I think this can be done much more simply than what I was trying to accomplish at first. If we simply trigger a leader election, the current leader will go to the end of the queue and a new leader will come in. If indexing errors continue on the given node, the new leader will increase terms and the previous one will fall behind.

> Allow Shard Leader to give up leadership gracefully via shard terms
> -------------------------------------------------------------------
>
>                 Key: SOLR-15029
>                 URL: https://issues.apache.org/jira/browse/SOLR-15029
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Mike Drob
>            Assignee: Mike Drob
>            Priority: Major
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently we have (via SOLR-12412) that when a leader sees an index-writing
> error during an update, it will give up leadership by deleting the replica
> and adding a new replica. One stated benefit of this was that, because we are
> using the overseer and a known code path, this is done asynchronously and
> very efficiently.
> I would argue that this approach is too heavy-handed.
> In the case of a corrupt-index exception, it makes some sense to completely
> delete the index dir and attempt to sync from a good peer. Even in this case,
> however, it might be better to let fingerprinting and other index-delta
> mechanisms take over and allow for a more efficient data transfer.
> In an alternate case, where the index error arises from a disconnected file
> system (possible with shared file systems, e.g. S3, HDFS, some k8s systems)
> and the required solution is some kind of reconnect, this approach has
> several shortcomings: the core deletions and creations are going to fail,
> leaving dangling replicas. Further, the data is still present, so there is no
> need to make so many extra copies.
> I propose that we bring in a mechanism to give up leadership via the existing
> shard-terms language. I believe we would be able to set all replicas
> currently equal to leader term T to T+1 and then trigger a new leader
> election. The current leader would know it is ineligible, while the other
> replicas that were current before the failed update would be eligible. This
> improvement would entail adding an additional possible operation to the terms
> state machine.
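A hedged sketch of the term-bump idea from the issue description; `ShardTermsSketch` and its fields are illustrative and not Solr's actual shard-terms API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: replicas currently at the leader's term T move to T + 1,
// while the leader that is giving up leadership stays at T and therefore
// becomes ineligible in the election that follows.
final class ShardTermsSketch {
  private final Map<String, Long> terms = new ConcurrentHashMap<>();

  void giveUpLeadership(String leaderReplica) {
    long leaderTerm = terms.getOrDefault(leaderReplica, 0L);
    terms.replaceAll((replica, term) ->
        !replica.equals(leaderReplica) && term == leaderTerm ? term + 1 : term);
    // a new leader election would be triggered here; the old leader is now
    // behind on terms, and the replicas that were current remain eligible
  }
}
```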
[jira] [Updated] (SOLR-15029) More gracefully allow Shard Leader to give up leadership
[ https://issues.apache.org/jira/browse/SOLR-15029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Drob updated SOLR-15029:
-----------------------------
    Summary: More gracefully allow Shard Leader to give up leadership  (was: Allow Shard Leader to give up leadership gracefully via shard terms)
[GitHub] [lucene-solr] madrob commented on a change in pull request #1992: SOLR-14939: JSON range faceting to support cache=false parameter
madrob commented on a change in pull request #1992:
URL: https://github.com/apache/lucene-solr/pull/1992#discussion_r540707440

## File path: solr/core/src/java/org/apache/solr/search/facet/FacetRangeProcessor.java

```
@@ -531,7 +533,20 @@ private SimpleOrderedMap getRangeCountsIndexed() throws IOException {
   private Query[] filters;
   private DocSet[] intersections;
   private void rangeStats(Range range, int slot) throws IOException {
-    Query rangeQ = sf.getType().getRangeQuery(null, sf, range.low == null ? null : calc.formatValue(range.low), range.high == null ? null : calc.formatValue(range.high), range.includeLower, range.includeUpper);
+    final Query rangeQ;
+    {
+      final Query rangeQuery = sf.getType().getRangeQuery(null, sf, range.low == null ? null : calc.formatValue(range.low), range.high == null ? null : calc.formatValue(range.high), range.includeLower, range.includeUpper);
+      if (fcontext.cache) {
+        rangeQ = rangeQuery;
+      } else if (rangeQuery instanceof ExtendedQuery) {
+        ((ExtendedQuery) rangeQuery).setCache(fcontext.cache);
```

Review comment: Here (and in the else) I think I would explicitly do `setCache(false)`, as it feels more readable to me, but I don't have strong opinions on that.

## File path: solr/core/src/test/org/apache/solr/search/facet/TestJsonRangeFacets.java

```
@@ -41,6 +42,7 @@ public static void beforeTests() throws Exception {
     if (Boolean.getBoolean(NUMERIC_POINTS_SYSPROP)) System.setProperty(NUMERIC_DOCVALUES_SYSPROP, "true");
     initCore("solrconfig-tlog.xml", "schema_latest.xml");
+    cache = random().nextBoolean();
```

Review comment: Might as well store the string directly?
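For illustration, here is a hedged sketch of what the suggested change might look like. The final `else` branch was cut off in the diff above, so its shape here (wrapping a plain query in Solr's `WrappedQuery`) is an assumption, as is the surrounding context from the hunk:

```java
// Hedged sketch, not the PR's actual code: fcontext, rangeQ, and rangeQuery
// come from the diff above; the else branch is an assumed reconstruction.
if (fcontext.cache) {
  rangeQ = rangeQuery;
} else if (rangeQuery instanceof ExtendedQuery) {
  ((ExtendedQuery) rangeQuery).setCache(false); // explicit literal, per the review
  rangeQ = rangeQuery;
} else {
  WrappedQuery wrapped = new WrappedQuery(rangeQuery); // assumed wrapper for plain queries
  wrapped.setCache(false); // explicit literal, per the review
  rangeQ = wrapped;
}
```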
[jira] [Commented] (LUCENE-9564) Format code automatically and enforce it
[ https://issues.apache.org/jira/browse/LUCENE-9564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247684#comment-17247684 ]

Houston Putman commented on LUCENE-9564:
----------------------------------------

+1, I think this is a terrific idea! Golang has a formatter built into the language, so many Go projects require the formatting to be correct in order to merge. I have many gripes with the language, but this is something they got 100% right. It is so nice to have consistent code and not have to worry about maintaining it.

> Format code automatically and enforce it
> ----------------------------------------
>
>                 Key: LUCENE-9564
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9564
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Trivial
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> This is a trivial change but a bold move. And I'm sure it's not for everyone.
> I started using google java format [1] in my projects a while ago and have
> never looked back since. It is an oracle-style formatter (it doesn't allow
> customizations or deviations from the defined 'ideal') - this takes some
> getting used to - but it also eliminates *all* the potential differences
> between IDEs, configs, etc. And the formatted code typically looks much
> better than hand-edited code. It is also verifiable on precommit (so you
> can't commit code that deviates from what you'd get from automated formatting
> output).
> The biggest benefit I see is that refactorings become such a joy and keep the
> code neat, everywhere. Before you commit, you just reformat everything
> automatically, no matter how much you messed it up.
> This isn't a change for everyone. I myself love hand-edited, neat code... but
> the reality is that with IDE support for automated code changes and so many
> people with different styles working on the same codebase, keeping it neat is
> a big pain. Checkstyle and other tools are fine for enforcing certain rules,
> but they don't take the burden of formatting off your shoulders. This tool
> does.
> Like I said, I had *great* reservations about using it at the beginning, but
> over time I got so used to it that I almost can't live without it now. It's
> like magic: you play with the code in any way you like, then run formatting,
> and it's nice and neat.
> The downside is that automated formatting does imply potential merge problems
> in backward patches (or any currently existing branches).
> Like I said, it is a bold move. Just throwing this out for your
> consideration.
> -I've added a PR that adds spotless but it's not ready; some files would have
> to be excluded as they currently violate header rules.-
> A more interesting thing is here, where the current code is automatically
> reformatted - this branch is for eyeballing only:
> https://github.com/dweiss/lucene-solr/compare/LUCENE-9564...dweiss:LUCENE-9564-example
> [1] https://google.github.io/styleguide/javaguide.html
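For a feel of what the proposed formatter does, here is a small hedged example against google-java-format's public API (this assumes the google-java-format library is on the classpath and is not part of the proposal itself):

```java
import com.google.googlejavaformat.java.Formatter;
import com.google.googlejavaformat.java.FormatterException;

// Demo: given any syntactically valid source, the formatter returns one
// canonical rendering, so hand-formatting differences disappear entirely.
public class FormatDemo {
  public static void main(String[] args) throws FormatterException {
    String messy = "class Foo{int x ;\n  void  bar( ){x=1;}}";
    // prints the same output no matter how the input was laid out
    System.out.println(new Formatter().formatSource(messy));
  }
}
```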
[jira] [Updated] (SOLR-14788) Solr: The Next Big Thing
[ https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Robert Miller updated SOLR-14788:
--------------------------------------
    Description:

h3. [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E] {color:#00875a}*The Policeman is {color:#de350b}NOW OFF{color} duty!*{color}

{quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and have some fun. Try to make some progress. Don't stress too much about the impact of your changes or about maintaining stability, performance, and correctness. Until the end of phase 1, I've got your back. I have a variety of tools and contraptions I have been building over the years, and I will continue training them on this branch. I will review your changes, peer out across the land, and course-correct where needed. As Mike D will be thinking, "Sounds like a bottleneck, Mark." And indeed it will be, to some extent. Which is why, once stage one is completed, I will flip The Policeman to off duty. When off duty, I'm -always- occasionally down for some vigilante justice, but I won't be walking the beat; all that stuff about sitting back and relaxing goes out the window.*{color}_
{quote}

I have stolen this title from Ishan, or Noble and Ishan. This issue is meant to capture the work of a small team that is forming to push Solr and SolrCloud to the next phase. I have kicked off the work with an effort to create a very fast and solid base. That work is not 100% done, but it is ready to join the fight. Tim Potter has started giving me a tremendous hand in finishing up. Ishan and Noble have already contributed support and testing and have plans for additional work to shore up some of our current shortcomings. Others have expressed an interest in helping, and hopefully they will pop up here as well. Let's organize and discuss our efforts here and in various sub-issues.

was: (the same description, except that the heading read "The Policeman is on duty!")

> Solr: The Next Big Thing
> ------------------------
>
>                 Key: SOLR-14788
>                 URL: https://issues.apache.org/jira/browse/SOLR-14788
>             Project: Solr
>          Issue Type: Task
>            Reporter: Mark Robert Miller
>            Assignee: Mark Robert Miller
>            Priority: Critical
>          Time Spent: 4h
>  Remaining Estimate: 0h