Hi Erick,

I have done a benchmark writing directly to SolrCloud running on my MacBook
using SolrJ. In a nutshell, the best indexing speed is *12K* dps (documents
per second) with an optimized batch size. You can find more detail and my
source code here: <http://datafireball.com/2016/03/08/solrcloud-api-benchmark/>.
It is just a laptop, so the computing power is limited.
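The core of it is just a batched SolrJ loop. Here is a simplified sketch, not
the exact code from the post - the ZooKeeper address, collection name, field
names, and batch size are placeholders, and it assumes the SolrJ 5.x-style
CloudSolrClient constructor:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            // ZooKeeper address and collection name are placeholders.
            CloudSolrClient client = new CloudSolrClient("localhost:2181");
            client.setDefaultCollection("benchmark");

            int batchSize = 10000; // throughput is sensitive to this; tune it
            List<SolrInputDocument> batch = new ArrayList<>(batchSize);

            for (int i = 0; i < 1000000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("value_s", "doc-" + i);
                batch.add(doc);

                // Send one batch per request instead of one doc per request.
                if (batch.size() == batchSize) {
                    client.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch); // flush the final partial batch
            }
            client.commit();
            client.close();
        }
    }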
My second step will be to write multiple processes indexing into SolrCloud in
parallel (a rough sketch is at the end of this message). Timothy Potter has
done a benchmark on AWS before
<http://www.slideshare.net/thelabdude/solr-performance> (page 13), reaching an
indexing speed of *121K* dps. You are right: it looks like if we fine-tune the
batch size, we might just be able to index all the data fast enough. Will keep
you posted.

Bin

On Mon, Mar 7, 2016 at 3:40 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Bin:
>
> The MRIT/Morphlines route only makes sense if you have lots more
> nodes devoted to the M/R jobs than you do Solr shards, since the
> actual work done to index a given doc is exactly the same whether you
> use MRIT/Morphlines or just send straight to Solr.
>
> A bit of background here. I mentioned that MRIT/Morphlines uses
> EmbeddedSolrServer. This is exactly Solr as far as the actual indexing
> is concerned. So using --go-live is not buying you anything and, in fact,
> is costing you quite a bit over just using <2> to index directly to Solr,
> since the index has to be copied around. I confess I'm surprised that
> --go-live is taking that long; basically it's just copying your index up
> to Solr, so perhaps there's an I/O problem or some such.
>
> OK, I'm lying a little bit here: _if_ you have more than one replica per
> shard, then indexing straight to Solr will cost you (anecdotally)
> 10-15% in indexing speed. But if this is a single replica per shard (i.e.
> leader-only), then it's near enough to being the exact same.
>
> Anyway, at the end of the day, the index produced is self-contained.
> You could even just copy it to your shards (with Solr down), and then
> bring up your Solr nodes on a non-HDFS-based Solr.
>
> But frankly I'd avoid that and benchmark on <2> first. My expectation
> is that you'll be fine there and see indexing roughly on par with your
> MRIT/Morphlines runs.
>
> Now, all that said, indexing 300M docs in "a few minutes" is a bit
> surprising. I'm really wondering if you're not being fooled by something
> "odd". Have you compared identical runs with and without --go-live?
>
> _Very_ often the bottleneck isn't Solr at all, it's the data acquisition,
> so be careful when measuring that the Solr CPUs are pegged... otherwise
> you're bottlenecking upstream of Solr. A super-simple way to figure that
> out is to comment out the solrServer.add(list, 10000) line in <2>, or just
> run MRIT/Morphlines without the --go-live switch.
>
> BTW, with <2> you could run with as many jobs as you wanted to drive
> the Solr servers flat-out.
>
> FWIW,
> Erick
>
> On Mon, Mar 7, 2016 at 1:14 PM, Bin Wang <binwang...@gmail.com> wrote:
> > Hi Erick,
> >
> > Thanks for your quick response.
> >
> > From the data's perspective, we have 300+ million rows and, believe it
> > or not, the source data is from a relational database (Hive). The
> > database is rebuilt every day (I am as frustrated as most of you who
> > read this, but it is what it is), and we potentially need to store
> > nearly all of the fields. In this case, I have to figure out a solution
> > to index 300+ million rows as fast as I can.
> >
> > I am still at the stage of evaluating all the different solutions, and
> > I am sorry that I haven't really benchmarked the second approach yet.
> > I will find time to run some benchmarks and share the results with the
> > community.
> >
> > Regarding the approach that I suggested - mapreducing Lucene indexes -
> > do you think it is feasible, and is it worth the effort to dive into?
> >
> > Best regards,
> >
> > Bin
> >
> > On Mon, Mar 7, 2016 at 1:57 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> I'm wondering if you need map reduce at all ;)...
> >>
> >> The Achilles heel with M/R vis-a-vis Solr is all the copying around
> >> that's done at the end of the cycle. For really large bulk indexing
> >> jobs, that's a reasonable price to pay.
> >>
> >> How many docs are we talking about, and how would you characterize
> >> them as far as size, fields, etc.? And what are your time
> >> requirements? What kind of docs?
> >>
> >> I'm thinking this may be an "XY problem": you're asking about
> >> a specific solution before explaining the problem.
> >>
> >> Why do you say that Solr is not really optimized for bulk loading?
> >> I took a quick look at <2> and the approach is sound. It batches
> >> up the docs in groups of 1,000 and uses CloudSolrServer as it should.
> >> Have you tried it? At the end of the day, MapReduceIndexerTool does
> >> the same work to index a doc as a regular Solr server would via
> >> EmbeddedSolrServer, so if the number of tasks you have running is
> >> roughly equal to the number of shards, it _should_ be roughly
> >> comparable.
> >>
> >> Still, though, I have to repeat my question about how many docs you're
> >> talking about here. Using M/R inevitably adds complexity; what are you
> >> trying to gain here that you can't get with several threads in a SolrJ
> >> client?
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Mar 7, 2016 at 12:28 PM, Bin Wang <binwang...@gmail.com> wrote:
> >> > Hi there,
> >> >
> >> > I have a fairly big data set that I need to quickly index into
> >> > SolrCloud.
> >> >
> >> > I have done some research, and none of the options looked really
> >> > good to me.
> >> >
> >> > (1) Kite Morphlines: I managed to get it working; the mapreduce job
> >> > finished in a few minutes, which is good. However, it took a really
> >> > long time, like one hour (60 million docs), to merge the indexes
> >> > into SolrCloud - the go-live part.
> >> >
> >> > (2) Mapreduce using a SolrCloud server:
> >> > <http://techuserhadoop.blogspot.com/2014/09/mapreduce-job-for-indexing-documents-to.html>
> >> > This approach is pretty straightforward; however, every document has
> >> > to funnel through the Solr server, which is really not optimized for
> >> > bulk loading.
> >> >
> >> > Here is what I am thinking: is it possible to use Mapreduce to
> >> > create a few Lucene indexes first, for example, using 3 reducers to
> >> > write three indexes, and then create a Solr collection with three
> >> > shards pointing to the generated indexes? Can Solr easily pick up
> >> > generated indexes?
> >> >
> >> > I am really new to Solr and wondering if this is feasible, and if
> >> > there is any work that has already been done. I am not really
> >> > interested in being on the cutting edge, and any existing work would
> >> > be appreciated!
> >> >
> >> > Best regards,
> >> >
> >> > Bin
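P.S. Here is the rough sketch of the parallel indexing I mentioned above -
just the single-process batching loop fanned out over a thread pool. The
thread count, batch size, collection name, and ZooKeeper address are
placeholders, and I haven't benchmarked this yet, so treat it as an outline
rather than the final code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndexer {
        public static void main(String[] args) throws Exception {
            // One CloudSolrClient shared by all threads; it is thread-safe.
            CloudSolrClient client = new CloudSolrClient("localhost:2181");
            client.setDefaultCollection("benchmark");

            int threads = 8;           // placeholder; tune per CPU/network
            int docsPerThread = 1000000;
            int batchSize = 10000;     // sweet spot from the single-process run

            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++) {
                final int offset = t * docsPerThread;
                pool.submit(() -> {
                    try {
                        List<SolrInputDocument> batch = new ArrayList<>(batchSize);
                        for (int i = offset; i < offset + docsPerThread; i++) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", Integer.toString(i));
                            batch.add(doc);
                            if (batch.size() == batchSize) {
                                client.add(batch);
                                batch.clear();
                            }
                        }
                        if (!batch.isEmpty()) {
                            client.add(batch); // flush the final partial batch
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            client.commit(); // single commit at the end, not per batch
            client.close();
        }
    }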