[ https://issues.apache.org/jira/browse/SOLR-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263528#comment-17263528 ]
ASF subversion and git services commented on SOLR-14608:
--------------------------------------------------------

Commit 4f691b8bb4492bec44440c1db65cb45ab83bec1c in lucene-solr's branch refs/heads/jira/SOLR-14608-export-merge from Joel Bernstein
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4f691b8 ]

SOLR-14608: Faster sorting for the /export handler

Squashed commit of the following:

commit 66f85c550691fcb3b4ee1959c7ed1695aed3cb78
Author: Joel Bernstein <jbern...@apache.org>
Date:   Thu Jan 7 11:04:18 2021 -0500

    SOLR-14608: Fix tie-break in SortDoc, TestExportWriter now passing.

commit a183d29ea84dbd4a76517f27a87ef78d56bc935f
Author: Joel Bernstein <jbern...@apache.org>
Date:   Tue Jan 5 16:07:18 2021 -0500

    SOLR-14608: Fix failing TestExportWriter tests

commit d7e81e8197a4b9e06c8d387d34e941a1365ad163
Author: Joel Bernstein <jbern...@apache.org>
Date:   Tue Dec 29 12:34:02 2020 -0500

    SOLR-14608: Tone down debug logging

commit cae61336f86295014a1c373774bc56c5b9f21670
Author: Joel Bernstein <jbern...@apache.org>
Date:   Tue Dec 29 09:40:35 2020 -0500

    SOLR-14608: Fix nanoTime to millis calculation and more code cleanup

commit 4e2cd9aaeaeeeed835af6b638faeefab231d7b9a
Author: Joel Bernstein <jbern...@apache.org>
Date:   Mon Dec 28 15:00:12 2020 -0500

    SOLR-14608: Code clean up

commit 894141b3c9461880917de471285d3b8a96c0a4fe
Author: Joel Bernstein <jbern...@apache.org>
Date:   Sun Dec 27 13:56:25 2020 -0500

    SOLR-14608: Fix bug when caching docvalues objects related to the leafreader ord

commit 0a7ea0ef20d7b280b7d4381ad851024096addf27
Author: Joel Bernstein <jbern...@apache.org>
Date:   Wed Dec 23 16:12:59 2020 -0500

    SOLR-14608: Reuse docvalues when possible

commit 6af848b086c2002b031ea159e485f4b2f30df7c0
Author: Joel Bernstein <jbern...@apache.org>
Date:   Mon Dec 21 14:13:31 2020 -0500

    SOLR-14608: Suppress Broken pipe logging

commit f40001700778bcac4390b2f00c014ad0bf19d091
Author: Joel Bernstein <jbern...@apache.org>
Date:   Sun Dec 6 10:36:01 2020 -0500

    Test commit 2

commit 8373f3a6e1383028a517831bc561ac2f491c6ce3
Author: Joel Bernstein <jbern...@apache.org>
Date:   Sun Dec 6 10:34:26 2020 -0500

    Test commit

commit 8e9a7afddde80080150e3fd078005a8add29ac7f
Author: Andrzej Bialecki <a...@apache.org>
Date:   Thu Jul 30 15:48:28 2020 +0200

    SOLR-14608: More cleanups. Fix a bug in compareTo. Add SortDoc.equals() / hashCode().

commit 536d962d6e016573cafd2f420511e4f7083e0468
Author: Andrzej Bialecki <a...@apache.org>
Date:   Wed Jul 29 13:13:09 2020 +0200

    SOLR-14608: Fix generics / raw types, move around the timer metrics so that they make sense.

commit ebd5bcaab8c917b16164bcd059276558002b09a6
Author: Andrzej Bialecki <a...@apache.org>
Date:   Tue Jul 28 11:38:44 2020 +0200

    SOLR-14608: Fix code formatting.

commit bf8d954ca1289d82eb5334719fb97bbabacacb09
Merge: b610ddae2f8 6bf5f4a87f4
Author: Andrzej Bialecki <a...@apache.org>
Date:   Mon Jul 27 15:42:04 2020 +0200

    Merge branch 'master' into jira/SOLR-14608-export

commit b610ddae2f8a4258f1d6e7c842f480f2b8c46fa9
Author: Joel Bernstein <jbern...@apache.org>
Date:   Fri Jul 17 10:32:27 2020 -0400

    SOLR-14608: Cache output bytesref

commit ee9c3d083c60850c3e48d301356329cf0f017c86
Author: Joel Bernstein <jbern...@apache.org>
Date:   Wed Jul 15 15:55:48 2020 -0400

    SOLR-14608: Works with one string value sort field.

commit 32e92c5025637e4b6cd940628a491fe6b7e7cb13
Author: Joel Bernstein <jbern...@apache.org>
Date:   Mon Jul 13 11:26:56 2020 -0400

    SOLR-14608: Wire-up the MergeIterator part three

commit f747562ca60907dd542cd9bc141a5806f156db1c
Author: Joel Bernstein <jbern...@apache.org>
Date:   Mon Jul 13 10:41:39 2020 -0400

    SOLR-14608: Wire-up the MergeIterator part two

commit 970c6cf4f5abad4248ff5c686c8ec65061dfa949
Author: Joel Bernstein <jbern...@apache.org>
Date:   Mon Jul 13 09:32:52 2020 -0400

    SOLR-14608: Wire-up the MergeIterator

commit 95e706abc425003d79a037500b9887f2c8a7798c
Author: Joel Bernstein <jbern...@apache.org>
Date:   Fri Jul 10 09:53:22 2020 -0400

    SOLR-14608: Size segment level sort queues based on segement maxdoc

commit e0fc38f1b1093cd761da03e561df6395d3a79fc1
Author: Joel Bernstein <jbern...@apache.org>
Date:   Thu Jul 9 16:40:38 2020 -0400

    SOLR-14608: Add method for creating the MergeIterator

commit 9b01320ddd3800607fa0197df6ac66bfd27e148a
Author: Joel Bernstein <jbern...@apache.org>
Date:   Thu Jul 9 14:23:17 2020 -0400

    SOLR-14608: Add skeleton algorithm for segment level iterator

commit bb4ae51c1c3e54c976bd1d449d5264afa3d74ec2
Author: Joel Bernstein <jbern...@apache.org>
Date:   Thu Jul 9 13:04:39 2020 -0400

    SOLR-14608: Add basic top level merge sort iterator


> Faster sorting for the /export handler
> --------------------------------------
>
>                 Key: SOLR-14608
>                 URL: https://issues.apache.org/jira/browse/SOLR-14608
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>            Priority: Major
>
> The largest cost of the export handler is the sorting. This ticket will
> implement an improved sorting algorithm that should greatly increase
> overall throughput for the export handler.
>
> *The current algorithm is as follows:*
>
> Collect a bitset of matching docs. Iterate over that bitset, materialize
> the top level ordinals for the sort fields in each document, and add them
> to a priority queue of size 30,000. Then export the top 30,000 docs, turn
> off their bits in the bitset, and iterate again until all docs are sorted
> and sent.
>
> There are two performance bottlenecks with this approach:
>
> 1) Materializing the top level ordinals adds a huge amount of overhead to
> the sorting process.
>
> 2) The size of the priority queue, 30,000, adds significant overhead to
> sorting operations.
>
> *The new algorithm:*
>
> Has a top level *merge sort iterator* that wraps segment level iterators
> that perform segment level priority queue sorts.
>
> *Segment level:*
>
> The segment level docset will be iterated and the segment level ordinals
> for the sort fields will be materialized and added to a segment level
> priority queue. As the segment level iterator pops docs from the priority
> queue, the top level ordinals for the sort fields are materialized. Because
> the top level ordinals are materialized AFTER the sort, they only need to
> be looked up when the segment level ordinal changes. This takes advantage
> of the sort to limit the lookups into the top level ordinal structures. It
> also eliminates the redundant lookups of top level ordinals that occur
> during the multiple passes over the matching docset.
>
> The segment level priority queues can be kept smaller than 30,000 to
> improve performance of the sorting operations, because the overall batch
> size will still be 30,000 or greater when all the segment priority queue
> sizes are added up. This allows for batch sizes much larger than 30,000
> without using a single large priority queue. The increased batch size means
> fewer iterations over the matching docset, and the decreased priority queue
> size means faster sorting operations.
>
> *Top level:*
>
> A top level iterator does a merge sort over the segment level iterators by
> comparing the top level ordinals materialized when the segment level docs
> are popped from the segment level priority queues. This requires no extra
> memory and will be very performant.
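
The new algorithm described above can be illustrated with a minimal Python sketch. This is not the Solr implementation (which is Java); the `Segment` class, `segment_iterator`, `export_sorted`, and all field names are hypothetical, and the sketch assumes a single ascending sort field whose segment-to-top-level ordinal mapping preserves order. Each segment iterator makes repeated passes over its unsent docs with a bounded queue, looks up the top level ordinal only when the segment level ordinal changes, and a top level merge then combines the segment iterators.

```python
import heapq
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Segment:
    """Hypothetical stand-in for one index segment (not a Lucene/Solr class)."""
    doc_base: int         # offset turning a segment doc id into a global doc id
    seg_ords: List[int]   # segment level ordinal of the sort field, per local doc
    ord_map: List[int]    # segment level ordinal -> top level ordinal (order-preserving)
    matches: List[int]    # local ids of the docs that matched the query


def segment_iterator(seg: Segment, queue_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (top_level_ord, global_doc_id) pairs for one segment in sorted order."""
    remaining = set(seg.matches)
    while remaining:
        # One pass over the unsent docs, keeping only the best `queue_size`
        # of them, sorted by segment level ordinal (doc id breaks ties).
        batch = heapq.nsmallest(
            queue_size, remaining, key=lambda d: (seg.seg_ords[d], d))
        last_seg_ord, top_ord = None, None
        for doc in batch:
            seg_ord = seg.seg_ords[doc]
            if seg_ord != last_seg_ord:
                # The top level ordinal is looked up AFTER the sort, and only
                # when the segment level ordinal changes.
                top_ord = seg.ord_map[seg_ord]
                last_seg_ord = seg_ord
            yield top_ord, seg.doc_base + doc
        # Equivalent of turning off the bits for docs already sent.
        remaining.difference_update(batch)


def export_sorted(segments: List[Segment], queue_size: int) -> List[int]:
    """Merge the segment iterators by comparing top level ordinals."""
    iterators = [segment_iterator(s, queue_size) for s in segments]
    return [doc for _, doc in heapq.merge(*iterators)]
```

In this sketch the queue size changes only how many passes each segment makes over its remaining docs, not the order of the exported docs, which is the property that lets the segment level queues stay small.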