We likely have the same laptop :-)
There must be something weird with my schema or usage but even if I had 10x
the throughput I have now, throwing around that many docs for a single join
isn't conducive to desired latency, concurrent requests, network bandwidth,
etc. I feel like I'm not using the
So, with that setup you're getting around 150,000 docs per second
throughput. On my laptop with a similar query I was able to stream around
650,000 docs per second. I have an SSD and 16 Gigs of RAM. Also I did lots
of experimenting with different numbers of workers and tested after warming
the part
Thanks for all this info, Joel. I found that if I artificially limit the
triples stream to 3M and use the /export handler with only 2 workers, I can
get results in about 20 seconds and Solr doesn't tip over. That seems to be
the best config for this local/single-instance setup.
It's also clear I'm not using st
One other thing to keep in mind is how the partitioning is done when you
add the partitionKeys.
Partitioning is done using the HashQParserPlugin, which builds a filter for
each worker. Under the covers this is using the normal filter query
mechanism. So after the filters are built and cached they are e
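As a rough mental model only (plain Python, not Solr's actual HashQParserPlugin hash function), partitioning assigns each doc to exactly one worker by hashing the partition key modulo the number of workers, so the per-worker filters are disjoint and together cover every doc:

```python
# Toy model of worker partitioning: each doc lands in exactly one
# worker's bucket based on hash(partition key) % num_workers.
# This is NOT Solr's real hash; it only illustrates why the worker
# filters are disjoint and together cover the whole result set.

def partition(keys, num_workers):
    buckets = [[] for _ in range(num_workers)]
    for key in keys:
        buckets[hash(key) % num_workers].append(key)
    return buckets

buckets = partition(range(1000), 8)
assert sum(len(b) for b in buckets) == 1000          # every doc assigned
assert all(len(set(b)) == len(b) for b in buckets)   # no doc sent twice
```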
Ah, you also used 4 shards. That means with 8 workers there were 32
concurrent queries against the /select handler each requesting 100,000
rows. That's a really heavy load!
You can still try out the approach from my last email on the 4 shards
setup, as you add workers gradually you'll gradually ra
Hi Ryan,
The rows=100000 on the /select handler is likely going to cause problems
with 8 workers. This is calling the /select handler with 8 concurrent
workers, each retrieving 100,000 rows. The /select handler bogs down as the
number of rows increases. So using the rows parameter with the /select
Hello, I'm running Solr on my laptop with -Xmx8g and gave each collection 4
shards and 2 replicas.
Even grabbing 100k triple documents (like the following) takes 20
seconds to complete and is prone to falling over. I could try this in a
proper cluster with multiple hosts and more sharding, etc. I
Also the hashJoin is going to read the entire entity table into memory. If
that's a large index that could be using lots of memory.
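For reference, a hashJoin of that shape reads its hashed= stream fully into memory before streaming the other side. A hedged sketch of what it might look like (the entity collection name and all of its fields here are illustrative, not taken from the actual schema):

```
hashJoin(
  search(triple, q="*:*", fl="triple_id,subject_id", sort="subject_id asc", qt="/export"),
  hashed=search(entity, q="*:*", fl="entity_id,entity_name", sort="entity_id asc", qt="/export"),
  on="subject_id=entity_id"
)
```

If the entity index is large, swapping this for a sorted innerJoin keeps memory flat, at the cost of requiring both streams to be sorted on the join key.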
25 million docs should be ok to /export from one node, as long as you have
enough memory to load the docValues for the fields for sorting and
exporting.
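Concretely, /export can only sort on and export fields that have docValues enabled, so each field named in fl and sort needs something like this in schema.xml (field types here are assumptions):

```xml
<field name="triple_id"  type="string" indexed="true" stored="true" docValues="true"/>
<field name="subject_id" type="string" indexed="true" stored="true" docValues="true"/>
```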
Breaking dow
Thanks very much for the advice. Yes, I'm running in a very basic single
shard environment. I thought that 25M docs was small enough to not require
anything special but I will try scaling like you suggest and let you know
what happens.
Cheers, Ryan
On Fri, May 13, 2016 at 4:53 PM, Joel Bernstein wrote:
I would try breaking down the second query to see when the problems occur.
1) Start with just a single *:* search from one of the collections.
2) Then test the innerJoin. The innerJoin won't take much memory as it's a
streaming merge join.
3) Then try the full thing.
If you're running a large joi
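The first two steps might look like the following expressions (collection and field names are taken from elsewhere in the thread; the sort fields and the on key are placeholders, since the real join key isn't shown in these excerpts):

```
search(triple, q="*:*", fl="triple_id,subject_id", sort="subject_id asc", qt="/export")

innerJoin(
  search(triple,      q="*:*", fl="triple_id,subject_id", sort="subject_id asc",     qt="/export"),
  search(triple_type, q="*:*", fl="triple_type_id",       sort="triple_type_id asc", qt="/export"),
  on="subject_id=triple_type_id"
)
```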
qt="/export" immediately fixed the query in Question #1. Sorry for missing
that in the docs!
The second query (with /export) crashes the server, so I was going to look
at parallelization if you think that's a good idea. It also seems unwise
to join into 26M docs, so maybe I can reconfigure the
A couple of other things:
1) Your innerJoin can be parallelized across workers to improve performance.
Take a look at the docs on the parallel function for the details.
2) It looks like you might be doing graph operations with joins. You might
want to take a look at the gatherNodes function coming in 6.1.
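A hedged sketch of the parallel wrapping (workerCollection, zkHost, and the join/partition keys are placeholders; the important part is that partitionKeys matches the join key on each side so matching tuples land on the same worker):

```
parallel(workerCollection,
  innerJoin(
    search(triple,      q="*:*", fl="triple_id,subject_id", sort="subject_id asc",
           qt="/export", partitionKeys="subject_id"),
    search(triple_type, q="*:*", fl="triple_type_id", sort="triple_type_id asc",
           qt="/export", partitionKeys="triple_type_id"),
    on="subject_id=triple_type_id"
  ),
  workers="2",
  zkHost="localhost:9983",
  sort="subject_id asc"
)
```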
When doing things that require all the results (like joins) you need to
specify the /export handler in the search function.
qt="/export"
The search function defaults to the /select handler, which is designed to
return the top N results. The /export handler always returns all results
that match the query.
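Applied to the search from Question #1, that would look something like the following (the sort clause here is an assumption; /export requires an explicit sort on docValues fields):

```
search(triple, q="subject_id:1656521", fl="triple_id,subject_id",
       sort="subject_id asc", qt="/export")
```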
Question #1:
triple_type collection has a few hundred docs and triple has 25M docs.
When I search for a particular subject_id in triple, which I know has 14
results, and do not pass in a 'rows' param, it returns 0 results:
innerJoin(
search(triple, q=subject_id:1656521, fl="triple_id,subject_id
14 matches