Re: Solrcloud Batch Indexing

2016-03-09 Thread Bin Wang
Hi Erick, I benchmarked writing directly to SolrCloud, running on my MacBook, using SolrJ. In a nutshell, the best indexing speed was *12K* dps (documents per second) with an optimized batch size. You can find more detail and my source code here
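The batch-size benchmark described above could be sketched roughly as follows. This is not the poster's code, which is linked from the original message. It is a minimal, self-contained illustration in which `BatchSink`, `indexInBatches`, and `docsPerSecond` are hypothetical names, and the no-op sink stands in for a real SolrJ call such as `SolrClient.add(Collection<SolrInputDocument>)`, so the numbers it prints say nothing about Solr itself.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchIndexSketch {

    /**
     * Hypothetical stand-in for the indexing client. With SolrJ, an
     * implementation would call SolrClient.add(batch) and handle its
     * checked exceptions.
     */
    interface BatchSink<T> {
        void send(List<T> batch);
    }

    /** Sends docs in batches of at most batchSize; returns the number of batches sent. */
    static <T> int indexInBatches(Iterable<T> docs, int batchSize, BatchSink<T> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> buffer = new ArrayList<>(batchSize);
        int batches = 0;
        for (T doc : docs) {
            buffer.add(doc);
            if (buffer.size() == batchSize) {
                sink.send(new ArrayList<>(buffer)); // copy, so the sink owns its batch
                buffer.clear();
                batches++;
            }
        }
        if (!buffer.isEmpty()) { // flush the final partial batch
            sink.send(new ArrayList<>(buffer));
            batches++;
        }
        return batches;
    }

    /** Documents per second (dps) for docCount documents indexed in elapsedNanos. */
    static double docsPerSecond(long docCount, long elapsedNanos) {
        return docCount / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        // Synthetic "documents"; a real run would build SolrInputDocument objects.
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            docs.add(i);
        }
        // Try several batch sizes and report throughput for each.
        for (int batchSize : new int[] {100, 1_000, 10_000}) {
            long start = System.nanoTime();
            int batches = indexInBatches(docs, batchSize, batch -> { /* no-op sink */ });
            long elapsed = System.nanoTime() - start;
            System.out.printf("batchSize=%d batches=%d dps=%.0f%n",
                    batchSize, batches, docsPerSecond(docs.size(), elapsed));
        }
    }
}
```

Swapping the no-op sink for one that wraps a real client would let the same loop compare batch sizes against an actual collection.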

Re: Solrcloud Batch Indexing

2016-03-08 Thread Cassandra Targett
There is an open source Hive -> Solr SerDe available that might be worth checking out: https://github.com/lucidworks/hive-solr. I'm not sure how it would work with the source table being rebuilt every day since it uses Hive's external tables, but it might be something you could extend. On Mon, Mar

Re: Solrcloud Batch Indexing

2016-03-07 Thread Erick Erickson
Bin: MRIT/Morphlines only makes sense if you have many more nodes devoted to the M/R jobs than you have Solr shards, since the actual work done to index a given doc is exactly the same whether you use MRIT/Morphlines or send it straight to Solr. A bit of background here. I mentioned that MRIT/M

Re: Solrcloud Batch Indexing

2016-03-07 Thread Bin Wang
Hi Erick, Thanks for your quick response. From the data's perspective, we have 300+ million rows and, believe it or not, the source data comes from a relational database (Hive) that is rebuilt every day (I am as frustrated as most of you who read this, but it is what it is) and potentially n

Re: Solrcloud Batch Indexing

2016-03-07 Thread Erick Erickson
I'm wondering if you need MapReduce at all ;)... The Achilles' heel of M/R vis-à-vis Solr is all the copying around that's done at the end of the cycle. For really large bulk indexing jobs, that's a reasonable price to pay. How many docs and how would you characterize them as far as size, fields, e