Re: Specify sorting of merged streams

2016-07-21 Thread tedsolr
The primary use case seems to require a SortStream. Ignoring the large join for now...
1. search main collection with stream (a)
2. search other collection with stream (b)
3. hash join a & b (c)
4. full sort on c
5. aggregate c with reducer
6. apply user sort criteria with top
It's very likely tha
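
For readers following the thread, the six steps listed in this message can be sketched in plain Java, with in-memory lists standing in for the two Solr streams. This is only an illustration of the data flow, not SolrJ code: the class `PipelineSketch`, the helper names, and the fields `id`, `vendor`, `amount` and `total` are all made up for the example.

```java
import java.util.*;

public class PipelineSketch {
    // A tuple is a field-name -> value map, loosely mirroring a streaming tuple (assumption).
    static Map<String, Object> tuple(Object... kv) {
        Map<String, Object> t = new LinkedHashMap<>();
        for (int i = 0; i < kv.length; i += 2) t.put((String) kv[i], kv[i + 1]);
        return t;
    }

    // Step 3: hash join. Side b is held in memory, keyed by the join field.
    static List<Map<String, Object>> hashJoin(List<Map<String, Object>> a,
                                              List<Map<String, Object>> b, String on) {
        Map<Object, Map<String, Object>> hashed = new HashMap<>();
        for (Map<String, Object> t : b) hashed.put(t.get(on), t);
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> t : a) {
            Map<String, Object> match = hashed.get(t.get(on));
            if (match != null) {
                Map<String, Object> joined = new LinkedHashMap<>(t);
                joined.putAll(match);
                out.add(joined);
            }
        }
        return out;
    }

    // Steps 4-6: full sort on the group key, reduce adjacent tuples, then keep the top n.
    static List<Map<String, Object>> sortReduceTop(List<Map<String, Object>> c, String groupBy,
                                                   String sumField, int n) {
        // Step 4: the sort needs every joined tuple in memory at once.
        List<Map<String, Object>> sorted = new ArrayList<>(c);
        sorted.sort(Comparator.comparing(t -> (String) t.get(groupBy)));

        // Step 5: because the input is sorted, each group is a run of adjacent tuples.
        List<Map<String, Object>> reduced = new ArrayList<>();
        for (Map<String, Object> t : sorted) {
            Map<String, Object> last = reduced.isEmpty() ? null : reduced.get(reduced.size() - 1);
            if (last != null && last.get(groupBy).equals(t.get(groupBy))) {
                int sum = (Integer) last.get("total") + (Integer) t.get(sumField);
                last.put("total", sum);
            } else {
                reduced.add(tuple(groupBy, t.get(groupBy), "total", t.get(sumField)));
            }
        }

        // Step 6: user sort (here: descending total), then keep the first n groups.
        reduced.sort((x, y) -> Integer.compare((Integer) y.get("total"), (Integer) x.get("total")));
        return new ArrayList<>(reduced.subList(0, Math.min(n, reduced.size())));
    }
}
```

In this sketch both the hash table in step 3 and the full sort in step 4 hold their whole input in memory, and the sort's input is the entire joined result.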

Re: Specify sorting of merged streams

2016-07-21 Thread Joel Bernstein
The tricky thing you have is a large join coupled with a reduce() group, which have different sorts. If you have a big enough cluster with enough workers, shards and replicas you can make this work. For example if you partition the large join across 30 workers, the hash join would fit in the avai

Re: Specify sorting of merged streams

2016-07-21 Thread tedsolr
I can see I may need to rethink some things. I have two joins: one is 1 to 1 (very large) and one is 1 to .03. A HashJoin may work on the smaller one. The large join looks like it may not be possible. I could get away with treating it as a filter somehow - I don't need the fields from the documents

Re: Specify sorting of merged streams

2016-07-21 Thread Joel Bernstein
A few other things for you to consider: 1) How big are the joins? 2) How fast do they need to go? 3) How many queries need to run concurrently? #1 and #2 will dictate how many shards, replicas and parallel workers are needed to perform the join. #3 needs to be carefully considered because MapRedu

Re: Specify sorting of merged streams

2016-07-20 Thread Joel Bernstein
One of the things to consider is using a hashJoin on first and second joins. If you have one large table and two smaller tables the hashJoin makes a lot of sense. One possible flow would be:
parallel reduce to do the grouping
hashJoin to the second table
hashJoin to the third table
The hashJoins

Re: Specify sorting of merged streams

2016-07-20 Thread tedsolr
I'm hoping I'm just not using the streaming API correctly. I have about 30M docs (~ 15 collections) in production right now that work well with just 4GB of heap (no streaming). I can't believe streaming would choke on my test data. I guess there are 2 primary requirements. Reindexing an entire col

Re: Specify sorting of merged streams

2016-07-20 Thread Joel Bernstein
It's likely that the SortStream is the issue. With the sort function you need enough memory to sort all the tuples coming from the underlying stream. The sort stream can also be done in parallel so you can split the tuples from col1 across N worker nodes. This will give you faster sorting and apply
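
The idea of splitting a sort across N workers can be sketched in plain Java: partition the tuples by a hash of the key, sort each partition on its own, then merge the sorted partitions. This is a minimal illustration under stated assumptions, not how Solr implements parallel sorting; the class `ParallelSortSketch` and its methods are hypothetical, and plain integers stand in for tuples.

```java
import java.util.*;

public class ParallelSortSketch {
    // Split values across `workers` partitions by hash, so each worker sees only its share.
    static List<List<Integer>> partition(List<Integer> values, int workers) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < workers; i++) parts.add(new ArrayList<>());
        for (Integer v : values) parts.get(Math.floorMod(v.hashCode(), workers)).add(v);
        return parts;
    }

    // Each "worker" sorts its own partition independently of the others.
    static void sortEach(List<List<Integer>> parts) {
        for (List<Integer> p : parts) Collections.sort(p);
    }

    // Merge the sorted partitions by always emitting the smallest head next.
    static List<Integer> merge(List<List<Integer>> parts) {
        // Queue entries are {value, partition index, position within partition}.
        PriorityQueue<int[]> heads = new PriorityQueue<>((x, y) -> Integer.compare(x[0], y[0]));
        for (int p = 0; p < parts.size(); p++) {
            if (!parts.get(p).isEmpty()) heads.add(new int[] {parts.get(p).get(0), p, 0});
        }
        List<Integer> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] h = heads.poll();
            out.add(h[0]);
            int next = h[2] + 1;
            List<Integer> src = parts.get(h[1]);
            if (next < src.size()) heads.add(new int[] {src.get(next), h[1], next});
        }
        return out;
    }
}
```

If the keys hash evenly, each worker holds roughly 1/N of the tuples while sorting, and the merge step only needs one head element per partition in memory at a time.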

Re: Specify sorting of merged streams

2016-07-20 Thread tedsolr
I am getting an OOM error trying to combine streaming operations. I think the sort is the issue. This test was done on a single replica cloud setup of v6.1 with 4GB heap. col1 has 1M docs. col2 has 10k docs. The search for each collection was q=*:*. Using SolrJ: CloudSolrStream searchStream = new

Re: Specify sorting of merged streams

2016-06-30 Thread Joel Bernstein
Hi, The streaming API in Solr 6.x has been expanded to support many different parallel computing workloads. For example the topic stream supports pub/sub messaging. The gatherNodes stream supports graph traversal. The facet stream supports aggregations inside the search engine, while the rollup s

Re: Specify sorting of merged streams

2016-06-30 Thread tedsolr
I've read about the sort stream in v6.1 but it appears to me to break the streaming design. If it has to read all the results into memory then it's not streaming. Sounds like it could be slow and memory intensive for very large result sets. Has anyone had good results with the sort stream when ther