[ 
https://issues.apache.org/jira/browse/SOLR-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294202#comment-17294202
 ] 

Joel Bernstein edited comment on SOLR-15210 at 3/3/21, 1:49 PM:
----------------------------------------------------------------

Let's have the best of both worlds. We can lazily build up a bitset of 
documents to ignore for each worker. We can then apply this bitset before the 
sorting stage.

Here is the basic idea:

1) In the writer thread, hash each key and decide whether the worker should 
send the doc out.
2) When the writer finds a key that shouldn't be sent out, add the docId to an 
ignore bitset for that specific worker.
3) After each run, combine the ignore bitsets with a cached set of ignore 
bitsets per worker. If we make this a segment-based cache, it will stay warm 
even after commits.
4) Before performing the sort, turn off all bits that intersect the worker's 
ignore bitset.

Basically, this lazily builds a set of documents per worker that should NOT be 
sent out. The cache warms over time, making exports progressively faster.
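The steps above can be sketched with `java.util.BitSet`. This is a hypothetical illustration under stated assumptions: all names (`buildIgnoreSet`, `applyIgnoreSet`, `keyHashes`, worker ids) are invented for the sketch, and Solr's ExportWriter would actually work with Lucene bitsets and per-segment doc ids.

```java
import java.util.BitSet;

// Hypothetical sketch of the lazy per-worker ignore bitset; names and
// structure are illustrative, not Solr's actual ExportWriter code.
public class IgnoreBitsets {

    // Steps 1-2: hash each doc's partition key; a doc whose key hashes to a
    // different worker goes into this worker's ignore bitset.
    static BitSet buildIgnoreSet(int workerId, int numWorkers, int[] keyHashes) {
        BitSet ignore = new BitSet(keyHashes.length);
        for (int docId = 0; docId < keyHashes.length; docId++) {
            if (Math.floorMod(keyHashes[docId], numWorkers) != workerId) {
                ignore.set(docId);
            }
        }
        return ignore;
    }

    // Step 4: before sorting, clear every candidate bit that appears in the
    // worker's (possibly cached) ignore bitset.
    static BitSet applyIgnoreSet(BitSet candidates, BitSet ignore) {
        BitSet kept = (BitSet) candidates.clone();
        kept.andNot(ignore);
        return kept;
    }
}
```

Step 3 would then amount to OR-ing each run's ignore bitset into the cached per-worker (ideally per-segment) bitset, e.g. `cachedIgnore.or(ignore)`, so the cache survives commits that leave a segment unchanged.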



> ParallelStream should execute hashing & filtering directly in ExportWriter
> --------------------------------------------------------------------------
>
>                 Key: SOLR-15210
>                 URL: https://issues.apache.org/jira/browse/SOLR-15210
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>
> Currently ParallelStream uses {{HashQParserPlugin}} to partition the work 
> based on a hashed value of {{partitionKeys}}. Unfortunately, this filter has 
> a high initial runtime cost because it has to materialize all values of 
> {{partitionKeys}} on each worker in order to calculate their hash and decide 
> whether a particular doc belongs to the worker's partition.
> The alternative approach would be for the worker to collect and sort all 
> documents and only then filter out the ones that belong to the current 
> partition, just before they are written out by {{ExportWriter}}. At this 
> point we have to materialize the fields anyway, and we can also benefit from 
> the (minimal) BytesRef caching that the FieldWriters use. On the other hand, 
> we pay the price of sorting all documents, and we also lose the query filter 
> caching that the {{HashQParserPlugin}} uses.
> This tradeoff is not obvious but should be investigated to see if it offers 
> better performance.
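For reference, the per-document membership test that a hash partitioning filter effectively applies can be sketched as below. This is an illustrative sketch, not Solr's actual `HashQParserPlugin` code, and `belongsToWorker` is an invented name; the cost described in the issue comes from having to materialize the partition key of every candidate document on every worker just to run such a check.

```java
// Illustrative sketch of hash-based partition membership (not Solr's code):
// each worker keeps only docs whose partition key hashes to its own id.
public class HashPartition {
    static boolean belongsToWorker(String partitionKey, int workerId, int numWorkers) {
        // Materializing partitionKey for every doc is the expensive step
        // that the lazy ignore-bitset cache is meant to amortize.
        return Math.floorMod(partitionKey.hashCode(), numWorkers) == workerId;
    }
}
```

Note that for any key, exactly one of the `numWorkers` workers accepts it, so the workers together emit each document exactly once.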


