[ https://issues.apache.org/jira/browse/SOLR-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294202#comment-17294202 ]
Joel Bernstein edited comment on SOLR-15210 at 3/3/21, 1:23 AM:
----------------------------------------------------------------

Let's have the best of both worlds. We can lazily build up a bitset of documents to ignore for each worker, and then apply this bitset before the sorting stage. Here is the basic idea:

1) In the writer thread, hash each key and decide whether the worker should send the document out.
2) When a worker thread finds a key that shouldn't be sent out, add the docId to an ignore bitset for that specific worker.
3) After each run, combine the run's ignore bitsets with a cached set of ignore bitsets per worker.
4) Before performing the sort, turn off, for each worker, all candidate bits that intersect the ignore bitset.

Basically this lazily builds a set of documents per worker that should NOT be sent out. The cache warms over time, making repeated exports progressively faster. (A rough sketch of this idea follows the quoted issue description below.)


> ParallelStream should execute hashing & filtering directly in ExportWriter
> --------------------------------------------------------------------------
>
>                 Key: SOLR-15210
>                 URL: https://issues.apache.org/jira/browse/SOLR-15210
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>
> Currently ParallelStream uses {{HashQParserPlugin}} to partition the work
> based on a hashed value of {{partitionKeys}}. Unfortunately, this filter has
> a high initial runtime cost because it has to materialize all values of
> {{partitionKeys}} on each worker in order to calculate their hash and decide
> whether a particular doc belongs to the worker's partition.
> The alternative approach would be for the worker to collect and sort all
> documents, and only then, just before they are written out by
> {{ExportWriter}}, retain only the ones that belong to the current partition.
> At this point we have to materialize the fields anyway, and we can also
> benefit from the (minimal) BytesRef caching that the FieldWriters use. On
> the other hand we pay the price of sorting all documents, and we also lose
> the query filter caching that the {{HashQParserPlugin}} uses.
> This tradeoff is not obvious but should be investigated to see if it offers
> better performance.
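A minimal sketch of the lazy ignore-bitset idea from the comment above, assuming Lucene's {{FixedBitSet}}. The class and method names here ({{WorkerIgnoreCache}}, {{shouldExport}}, etc.) are hypothetical, and the hash is a placeholder for whatever hash {{HashQParserPlugin}} actually applies to the partition key:

{code:java}
// Hypothetical sketch only - not an actual Solr API.
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.FixedBitSet;

class WorkerIgnoreCache {
  private final int numWorkers;
  private final FixedBitSet[] ignoreBits;   // one cached "ignore" set per worker

  WorkerIgnoreCache(int numWorkers, int maxDoc) {
    this.numWorkers = numWorkers;
    this.ignoreBits = new FixedBitSet[numWorkers];
    for (int w = 0; w < numWorkers; w++) {
      ignoreBits[w] = new FixedBitSet(maxDoc);
    }
  }

  /** Steps 1+2: hash the partition key; if the doc is owned by another worker,
   *  record it in this worker's ignore set and skip it. */
  boolean shouldExport(int workerId, int docId, BytesRef partitionKey) {
    int hash = partitionKey.hashCode();        // placeholder; real code would reuse
                                               // the HashQParserPlugin hash
    int owner = Math.floorMod(hash, numWorkers);
    if (owner != workerId) {
      ignoreBits[workerId].set(docId);         // lazily grow the ignore set
      return false;
    }
    return true;
  }

  /** Step 3: after a run, fold that run's ignore bits into the cached set. */
  void mergeRun(int workerId, FixedBitSet runIgnoreBits) {
    ignoreBits[workerId].or(runIgnoreBits);
  }

  /** Step 4: before sorting, clear every candidate bit already known to belong
   *  to another worker, so those docs are never sorted or exported again. */
  void applyBeforeSort(int workerId, FixedBitSet candidateDocs) {
    candidateDocs.andNot(ignoreBits[workerId]);
  }
}
{code}

Because the cached ignore set only ever gains bits, later exports skip hashing for documents already known to belong to other workers, which is where the warming effect described in the comment would come from.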