Hi Erick; I think that even if you use Map/Reduce you will not parallelize your indexing any further, because indexing parallelizes only across however many leaders you have in your SolrCloud, doesn't it?
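To make that concrete, here is a minimal SolrJ sketch (assuming Solr 4.x's CloudSolrServer; the ZooKeeper addresses and collection name are placeholders). Every document added this way is hashed onto one of the numShards hash ranges, so the indexing work always lands on that shard's leader, no matter how many client machines send updates:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexViaLeaders {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("text", "sample document " + i);
            // The id is hashed onto one of the numShards hash ranges, so the
            // indexing work for this document is done by that shard's leader.
            server.add(doc);
        }
        server.commit();
        server.shutdown();
    }
}

However many Hadoop reducers (or client threads) run this in parallel, the updates still funnel into the same set of leaders, which is why I think Map/Reduce alone does not buy extra indexing parallelism.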
2013/5/6 Erick Erickson <erickerick...@gmail.com>

> The only problem with using Hadoop (or whatever) is that you
> need to be sure that documents end up on the same shard, which
> means that you have to use the same routing mechanism that
> SolrCloud uses. The custom doc routing may help here....
>
> My very first question, though, would be whether this is necessary.
> It might be sufficient to just throttle the rate of indexing, or just do the
> indexing during off hours or.... Have you measured an indexing
> degradation during your heavy indexing? Indexing has costs, no
> question, but it's worth asking whether the costs are heavy enough
> to be worth the bother.
>
> Best
> Erick
>
> On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> > 1-2) Your aim in using Hadoop is probably Map/Reduce jobs. When you use
> > Map/Reduce jobs you split your workload, process it, and then the reduce
> > step comes into play. Let me explain the new SolrCloud architecture. You
> > start your SolrCloud with a numShards parameter. Let's assume that you
> > have 5 shards. Then you will have 5 leaders in your SolrCloud. These
> > leaders will be responsible for indexing your data. It means that your
> > indexing workload will be divided into 5, so you have parallelized your
> > indexing much like Map/Reduce jobs.
> >
> > Let's assume that you have added 10 new Solr nodes to your SolrCloud.
> > They will be added as replicas for each shard. Then you will have 5
> > shards, 5 leaders for them, and every shard will have 2 replicas. When you
> > send a query to a SolrCloud, every replica helps with searching, and if
> > you add more replicas to your SolrCloud your search performance will
> > improve.
> >
> > 2013/5/6 David Parks <davidpark...@yahoo.com>
> >
> >> I've had trouble figuring out what options exist if I want to perform all
> >> indexing off of the production servers (I'd like to keep them only for
> >> user queries).
> >>
> >> We index data in batches roughly daily; ideally I'd index all solr cloud
> >> shards offline, then move the final index files to the solr cloud
> >> instance that needs them, flip a switch, and have it use the new index.
> >>
> >> Is this possible via either:
> >>
> >> 1. Doing the indexing in Hadoop? (this would be ideal as we have a
> >> significant investment in a hadoop cluster already), or
> >>
> >> 2. Maintaining a separate "master" server that handles indexing and
> >> the nodes that receive user queries update their index from there (I seem
> >> to recall reading about this configuration in 3.x, but now we're using
> >> solr cloud)
> >>
> >> Is there some ideal solution I can use to "protect" the production solr
> >> instances from degraded performance during large index processing
> >> periods?
> >>
> >> Thanks!
> >>
> >> David
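Regarding Erick's point about using the same routing mechanism that SolrCloud uses: with the compositeId router that collections created with numShards use by default (Solr 4.1+), a shard-key prefix on the document id controls the hash, so related documents can be kept on the same shard even when an external job feeds them in. A minimal sketch, assuming the same placeholder ZooKeeper host and collection, and a hypothetical "customerA" shard key:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CompositeIdRouting {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181"); // placeholder ZooKeeper host
        server.setDefaultCollection("collection1");               // placeholder collection

        SolrInputDocument doc = new SolrInputDocument();
        // With the compositeId router, the part before '!' is hashed first,
        // so every "customerA!..." id lands on the same shard.
        doc.addField("id", "customerA!order-42");
        doc.addField("text", "routed by the shard-key prefix");
        server.add(doc);

        server.commit();
        server.shutdown();
    }
}

Whether that is enough to rebuild whole shard indexes offline in Hadoop is a separate question, but it at least makes the document-to-shard mapping predictable.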