Ah, got it. Another general question: is there much of a difference between generating index files in MapReduce and loading them into SolrCloud vs. indexing through the Solr NRT API? Has anyone run a test of that sort?
Thanks a ton,
Nitin

On Thu, Nov 19, 2015 at 3:00 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Sure, you can use Lucene to create indexes for shards
> if (and only if) you deal with the routing issues....
>
> About updates: I'm not talking about atomic updates at all.
> The usual model for Solr is if you have a unique key
> defined, new versions of documents replace old versions
> of documents based on uniqueKey. That process is
> not guaranteed by MRIT is all.
>
> Best,
> Erick
>
> On Thu, Nov 19, 2015 at 12:56 PM, KNitin <nitin.t...@gmail.com> wrote:
>
> > Thanks, Eric. Looks like MRIT uses Embedded solr running per
> > mapper/reducer and uses that to index documents. Is that the recommended
> > model? Can we use raw lucene libraries to generate index and then load them
> > into solrcloud? (Barring the complexities for indexing into right shard and
> > merging them).
> >
> > I am thinking of using this for regular offline indexing which needs to be
> > idempotent. When you mean update do you mean partial updates using _set?
> > If we add and delete every time for a document that should work, right?
> > (since all docs are indexed by doc id which contains all operational
> > history)? Let me know if I am missing something.
> >
> > On Thu, Nov 19, 2015 at 12:09 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Note two things:
> >>
> >> 1> this is running on Hadoop
> >> 2> it is part of the standard Solr release as MapReduceIndexerTool,
> >> look in the contribs...
> >>
> >> If you're trying to do this yourself, you must be very careful to index docs
> >> to the correct shard then merge the correct shards. MRIT does this all
> >> automatically.
> >>
> >> Additionally, it has the cool feature that if (and only if) your Solr
> >> index is running over HDFS, the --go-live option will automatically merge
> >> the indexes into the appropriate running Solr instances.
> >>
> >> One caveat. This tool doesn't handle _updating_ documents. So if you run it
> >> twice on the same data set, you'll have two copies of every doc. It's
> >> designed as a bulk initial-load tool.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Nov 19, 2015 at 11:45 AM, KNitin <nitin.t...@gmail.com> wrote:
> >> > Great. Thanks!
> >> >
> >> > On Thu, Nov 19, 2015 at 11:24 AM, Sameer Maggon <sam...@measuredsearch.com>
> >> > wrote:
> >> >
> >> >> If you are trying to create a large index and want speedups there, you
> >> >> could use the MapReduceTool -
> >> >> https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr. At a
> >> >> high level, it takes your files (csv, json, etc) as input can create either
> >> >> a single or a sharded index that you can either copy it to your Solr
> >> >> Servers. I've used this to create indexes that include hundreds of millions
> >> >> of documents in fairly decent amount of time.
> >> >>
> >> >> Thanks,
> >> >> --
> >> >> *Sameer Maggon*
> >> >> Measured Search
> >> >> www.measuredsearch.com <http://measuredsearch.com/>
> >> >>
> >> >> On Thu, Nov 19, 2015 at 11:17 AM, KNitin <nitin.t...@gmail.com> wrote:
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> > I was wondering if there are existing tools that will generate solr index
> >> >> > offline (in solrcloud mode) that can be later on loaded into solrcloud,
> >> >> > before I decide to implement my own. I found some tools that do only solr
> >> >> > based index loading (non-zk mode). Is there one with zk mode enabled?
> >> >> >
> >> >> > Thanks in advance!
> >> >> > Nitin
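
For anyone skimming the thread, the caveat Erick raises (the usual Solr update path replaces documents by uniqueKey, while re-running a bulk load on the same data leaves two copies of every doc) can be illustrated with a toy model. This is only a sketch: plain Python containers stand in for an index, the function names `index_with_unique_key` and `bulk_merge` are made up for illustration, and nothing here calls Solr, Lucene, or MapReduceIndexerTool.

```python
# Toy model only: a dict and a list stand in for a Solr index.
# The function names are hypothetical; no Solr or Lucene API is used.

def index_with_unique_key(index, docs, key="id"):
    """Model of keyed updates: a doc whose key already exists replaces the old doc."""
    for doc in docs:
        index[doc[key]] = doc
    return index


def bulk_merge(segments, docs):
    """Model of an append-only bulk load: docs are added with no lookup by key."""
    segments.extend(docs)
    return segments


docs = [{"id": "a", "v": 1}, {"id": "b", "v": 1}]

# Load the same data set twice through each path.
keyed = {}
index_with_unique_key(keyed, docs)
index_with_unique_key(keyed, docs)

merged = []
bulk_merge(merged, docs)
bulk_merge(merged, docs)

print(len(keyed), len(merged))
```

In this model the keyed path is idempotent (still two docs after the second run), while the append-only path doubles the doc count. That is why an offline job meant to be re-run would need its own delete or dedupe step, or a fresh index per run.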