The primary use case seems to require a SortStream. Ignoring the large join
for now...
1. search main collection with stream (a)
2. search other collection with stream (b)
3. hash join a & b (c)
4. full sort on c
5. aggregate c with reducer
6. apply user sort criteria with top
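The six steps above could be sketched as a single streaming expression. This is only a rough sketch; the collection names (mainCollection, otherCollection) and field names (joinKey, groupKey, userField) are placeholders, not from the thread:

```
top(n=10,
    reduce(
        sort(
            hashJoin(
                search(mainCollection, q="*:*", fl="id,joinKey,groupKey,userField",
                       sort="id asc", qt="/export"),
                hashed=search(otherCollection, q="*:*", fl="joinKey,otherField",
                              sort="joinKey asc", qt="/export"),
                on="joinKey"),
            by="groupKey asc"),
        by="groupKey",
        group(sort="userField desc", n="1")),
    sort="userField desc")
```

Note that sort() buffers the entire joined stream in memory before the reduce() can run, which is exactly the SortStream cost discussed below.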
It's very likely tha…
The tricky thing you have is a large join coupled with a reduce() group,
which have different sorts.
If you have a big enough cluster with enough workers, shards and replicas
you can make this work.
For example if you partition the large join across 30 workers, the hash
join would fit in the avai…
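Partitioning the join across workers could look roughly like this (the worker collection, zkHost, and field names here are invented for illustration):

```
parallel(workerCollection,
         hashJoin(
             search(bigCollection, q="*:*", fl="id,joinKey",
                    sort="joinKey asc", partitionKeys="joinKey", qt="/export"),
             hashed=search(smallCollection, q="*:*", fl="joinKey,extra",
                           sort="joinKey asc", qt="/export"),
             on="joinKey"),
         workers="30",
         sort="joinKey asc",
         zkHost="zk1:2181")
```

Each worker reads only its partition of the left-hand stream (via partitionKeys), but every worker builds the full hash table from the hashed stream, so the smaller side must fit in each worker's heap.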
I can see I may need to rethink some things. I have two joins: one is 1 to 1
(very large) and one is 1 to .03. A HashJoin may work on the smaller one.
The large join looks like it may not be possible. I could get away with
treating it as a filter somehow - I don't need the fields from the
documents
A few other things for you to consider:
1) How big are the joins?
2) How fast do they need to go?
3) How many queries need to run concurrently?
#1 and #2 will dictate how many shards, replicas and parallel workers are
needed to perform the join. #3 needs to be carefully considered because
MapRedu…
One of the things to consider is using a hashJoin on first and second
joins. If you have one large table and two smaller tables the hashJoin
makes a lot of sense.
One possible flow would be:
parallel reduce to do the grouping
hashJoin to the second table
hashJoin to the third table
The hashJoins…
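One way the flow above could be expressed as a streaming expression (all collection, field, and zkHost names here are illustrative stand-ins):

```
hashJoin(
    hashJoin(
        parallel(workerCollection,
                 reduce(
                     search(largeTable, q="*:*", fl="id,groupKey,joinKey2,joinKey3",
                            sort="groupKey asc", partitionKeys="groupKey", qt="/export"),
                     by="groupKey",
                     group(sort="groupKey asc", n="10")),
                 workers="8",
                 sort="groupKey asc",
                 zkHost="zk1:2181"),
        hashed=search(secondTable, q="*:*", fl="joinKey2,fieldB",
                      sort="joinKey2 asc", qt="/export"),
        on="joinKey2"),
    hashed=search(thirdTable, q="*:*", fl="joinKey3,fieldC",
                  sort="joinKey3 asc", qt="/export"),
    on="joinKey3")
```

The grouping runs partitioned across the workers, while the two hashJoins only need the second and third tables to fit in memory as hash tables.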
I'm hoping I'm just not using the streaming API correctly. I have about 30M
docs (~ 15 collections) in production right now that work well with just 4GB
of heap (no streaming). I can't believe streaming would choke on my test
data.
I guess there are 2 primary requirements. Reindexing an entire col…
It's likely that the SortStream is the issue. With the sort function you
need enough memory to sort all the tuples coming from the underlying
stream. The sort stream can also be done in parallel so you can split the
tuples from col1 across N worker nodes. This will give you faster sorting
and apply…
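A parallel sort along those lines could be sketched like this (worker collection, field names, and zkHost are made up for the example):

```
parallel(workerCollection,
         sort(
             search(col1, q="*:*", fl="id,fieldA",
                    sort="id asc", partitionKeys="fieldA", qt="/export"),
             by="fieldA asc"),
         workers="4",
         sort="fieldA asc",
         zkHost="localhost:9983")
```

Each worker sorts roughly 1/N of the tuples in its own heap, and the outer parallel stream merges the sorted partitions using the same sort order, so no single node has to hold the full result set.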
I am getting an OOM error trying to combine streaming operations. I think the
sort is the issue. This test was done on a single replica cloud setup of
v6.1 with 4GB heap. col1 has 1M docs. col2 has 10k docs. The search for each
collection was q=*:*. Using SolrJ:
CloudSolrStream searchStream = new …
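Since the SolrJ code is cut off in the archive, here is what the described pipeline would look like as a streaming expression; the join field `key` and value field `val` are stand-ins, not from the original message:

```
sort(
    hashJoin(
        search(col1, q="*:*", fl="id,key", sort="id asc", qt="/export"),
        hashed=search(col2, q="*:*", fl="key,val", sort="key asc", qt="/export"),
        on="key"),
    by="val asc")
```

With this shape, sort() has to buffer all ~1M joined tuples, on top of the 10k-doc hash table, inside the single 4GB heap, which is consistent with the OOM described.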
Hi,
The streaming API in Solr 6.x has been expanded to support many different
parallel computing workloads. For example the topic stream supports pub/sub
messaging. The gatherNodes stream supports graph traversal. The facet
stream supports aggregations inside the search engine, while the rollup
s…
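For a concrete contrast between the two aggregation styles, the first expression below pushes the aggregation into the search engine, while the second aggregates over a sorted export stream (the category and price fields are invented for the example):

```
facet(col1, q="*:*",
      buckets="category",
      bucketSorts="sum(price) desc",
      bucketSizeLimit=25,
      sum(price), count(*))

rollup(search(col1, q="*:*", fl="category,price",
              sort="category asc", qt="/export"),
       over="category",
       sum(price), count(*))
```

The facet stream is typically faster for high-cardinality-tolerant aggregations because nothing is streamed out, while rollup works tuple-by-tuple and only needs one group in memory at a time.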
I've read about the sort stream in v6.1 but it appears to me to break the
streaming design. If it has to read all the results into memory then it's
not streaming. Sounds like it could be slow and memory intensive for very
large result sets. Has anyone had good results with the sort stream when
ther…