It's likely that the SortStream is the issue. With the sort function you need enough memory to sort all the tuples coming from the underlying stream. The sort stream can also be done in parallel so you can split the tuples from col1 across N worker nodes. This will give you faster sorting and apply more memory to the sort.
Can you describe your exact use case? Perhaps we can think about a different Streaming flow that would work for you. Joel Bernstein http://joelsolr.blogspot.com/ On Wed, Jul 20, 2016 at 5:29 PM, tedsolr <tsm...@sciquest.com> wrote: > I am getting an OOM error trying to combine streaming operations. I think > the > sort is the issue. This test was done on a single replica cloud setup of > v6.1 with 4GB heap. col1 has 1M docs. col2 has 10k docs. The search for > each > collection was q=*:*. Using SolrJ: > > CloudSolrStream searchStream = new CloudSolrStream(zkHosts, "col1", map); > // > base search from user > ReducerStream reducer = new ReducerStream(searchStream, comparator, > reducer); // this groups docs by a field list > SortStream idSort = new SortStream(reducer, new FieldComparator("id", > ComparatorOrder.ASCENDING)); // this resorts the results so I can compare > them against data from another collection > CloudSolrStream ruleStream = new CloudSolrStream(zkHosts, "col2", map); > TupleStream stream = new IntersectStream(idSort, ruleStream, new > FieldEqualitor("id")); // this provides a filter > > When I open the stream it works for about 30 seconds then dies. So a single > user search on a very small collection (1M docs) can overwhelm a 4GB heap. > And this chain isn't even done! I still need to merge with yet a third > collection and then resort with the user's specified sort parameter. > > Is there something fundamentally wrong with my approach? > thanks, Ted > > > > Joel Bernstein wrote > > Hi, > > > > The streaming API in Solr 6x has been expanded to supported many > different > > parallel computing workloads. For example the topic stream supports > > pub/sub > > messaging. The gatherNodes stream supports graph traversal. The facet > > stream supports aggregations inside the search engine, while the rollup > > stream supports shuffling map / reduce aggregations. Stored queries and > > large scale alerting is on the way... > > > > The sort stream is designed to be used at scale in parallel mode. It can > > currently sort about 1,000,000 docs per second on a single worker. So if > > you have 20 workers it can sort 20,000,000 docs per second. The plan is > to > > eventually switch to the fork/join merge sort so that you get parallelism > > within the same worker. > > > > > > > > Joel Bernstein > > http://joelsolr.blogspot.com/ > > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Specify-sorting-of-merged-streams-tp4285026p4288083.html > Sent from the Solr - User mailing list archive at Nabble.com. >