[ https://issues.apache.org/jira/browse/SOLR-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17257783#comment-17257783 ]
Joel Bernstein commented on SOLR-14608: --------------------------------------- Some performance numbers for the knew sorting algorithm: Sorting and exporting 3 million documents. Since we are isolating the performance of the sort we will only include fields in the field list that are being sorted on. Here we sort and export 2 fields using the Solr 8.7 export handler: {code} joelbernstein@Joel-Bernsteins-MacBook-Pro-2 loader % time curl 'http://localhost:10000/solr/testex/export?q=*:*&fl=test_s,stock_s&sort=test_s+asc,stock_s+asc' > /dev/null % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 210M 0 210M 0 0 19.9M 0 --:--:-- 0:00:10 --:--:-- 26.8M curl > /dev/null 0.05s user 0.09s system 1% cpu 10.587 total {code} Here we sort and export 2 fields using the new Solr export handler in this ticket: {code} joelbernstein@Joel-Bernsteins-MacBook-Pro-2 loader % time curl 'http://localhost:8983/solr/testex/export?q=*:*&fl=test_s,stock_s&sort=test_s+asc,stock_s+asc' > /dev/null % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 210M 0 210M 0 0 39.7M 0 --:--:-- 0:00:05 --:--:-- 44.8M {code} > Faster sorting for the /export handler > -------------------------------------- > > Key: SOLR-14608 > URL: https://issues.apache.org/jira/browse/SOLR-14608 > Project: Solr > Issue Type: New Feature > Reporter: Joel Bernstein > Assignee: Joel Bernstein > Priority: Major > > The largest cost of the export handler is the sorting. This ticket will > implement an improved algorithm for sorting that should greatly increase > overall throughput for the export handler. > *The current algorithm is as follows:* > Collect a bitset of matching docs. Iterate over that bitset and materialize > the top level oridinals for the sort fields in the document and add them to > priority queue of size 30000. Then export the top 30000 docs, turn off the > bits in the bit set and iterate again until all docs are sorted and sent. > There are two performance bottlenecks with this approach: > 1) Materializing the top level ordinals adds a huge amount of overhead to the > sorting process. > 2) The size of priority queue, 30,000, adds significant overhead to sorting > operations. > *The new algorithm:* > Has a top level *merge sort iterator* that wraps segment level iterators that > perform segment level priority queue sorts. > *Segment level:* > The segment level docset will be iterated and the segment level ordinals for > the sort fields will be materialized and added to a segment level priority > queue. As the segment level iterator pops docs from the priority queue the > top level ordinals for the sort fields are materialized. Because the top > level ordinals are materialized AFTER the sort, they only need to be looked > up when the segment level ordinal changes. This takes advantage of the sort > to limit the lookups into the top level ordinal structures. This also > eliminates redundant lookups of top level ordinals that occur during the > multiple passes over the matching docset. > The segment level priority queues can be kept smaller than 30,000 to improve > performance of the sorting operations because the overall batch size will > still be 30,000 or greater when all the segment priority queue sizes are > added up. This allows for batch sizes much larger then 30,000 without using a > single large priority queue. The increased batch size means fewer iterations > over the matching docset and the decreased priority queue size means faster > sorting operations. > *Top level:* > A top level iterator does a merge sort over the segment level iterators by > comparing the top level ordinals materialized when the segment level docs are > popped from the segment level priority queues. This requires no extra memory > and will be very performant. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org