Re: Generating Index offline and loading into solrcloud

KNitin Thu, 19 Nov 2015 16:54:00 -0800

Ah got it. Another generic question, is there too much of a difference
between generating files in map reduce and loading into solrcloud vs using
solr NRT api? Has any one run any test of that sort?


Thanks a ton,
Nitin

On Thu, Nov 19, 2015 at 3:00 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Sure, you can use Lucene to create indexes for shards
> if (and only if) you deal with the routing issues....
>
> About updates: I'm not talking about atomic updates at all.
> The usual model for Solr is if you have a unique key
> defined, new versions of documents replace old versions
> of documents based on uniqueKey. That process is
> not guaranteed by MRIT is all.
>
> Best,
> Erick
>
> On Thu, Nov 19, 2015 at 12:56 PM, KNitin <nitin.t...@gmail.com> wrote:
> > Thanks, Eric.  Looks like  MRIT uses Embedded solr running per
> > mapper/reducer and uses that to index documents. Is that the recommended
> > model? Can we use raw lucene libraries to generate index and then load
> them
> > into solrcloud? (Barring the complexities for indexing into right shard
> and
> > merging them).
> >
> > I am thinking of using this for regular offline indexing which needs to
> be
> > idempotent.  When you mean update do you mean partial updates using _set?
> > If we add and delete every time for a document that should work, right?
> > (since all docs are indexed by doc id which contains all operational
> > history)? Let me know if I am missing something.
> >
> > On Thu, Nov 19, 2015 at 12:09 PM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> >> Note two things:
> >>
> >> 1> this is running on Hadoop
> >> 2> it is part of the standard Solr release as MapReduceIndexerTool,
> >> look in the contribs...
> >>
> >> If you're trying to do this yourself, you must be very careful to index
> >> docs
> >> to the correct shard then merge the correct shards. MRIT does this all
> >> automatically.
> >>
> >> Additionally, it has the cool feature that if (and only if) your Solr
> >> index is running over
> >> HDFS, the --go-live option will automatically merge the indexes into
> >> the appropriate
> >> running Solr instances.
> >>
> >> One caveat. This tool doesn't handle _updating_ documents. So if you
> >> run it twice
> >> on the same data set, you'll have two copies of every doc. It's
> >> designed as a bulk
> >> initial-load tool.
> >>
> >> Best,
> >> Erick
> >>
> >>
> >>
> >> On Thu, Nov 19, 2015 at 11:45 AM, KNitin <nitin.t...@gmail.com> wrote:
> >> > Great. Thanks!
> >> >
> >> > On Thu, Nov 19, 2015 at 11:24 AM, Sameer Maggon <
> >> sam...@measuredsearch.com>
> >> > wrote:
> >> >
> >> >> If you are trying to create a large index and want speedups there,
> you
> >> >> could use the MapReduceTool -
> >> >> https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr.
> At
> >> a
> >> >> high level, it takes your files (csv, json, etc) as input can create
> >> either
> >> >> a single or a sharded index that you can either copy it to your Solr
> >> >> Servers. I've used this to create indexes that include hundreds of
> >> millions
> >> >> of documents in fairly decent amount of time.
> >> >>
> >> >> Thanks,
> >> >> --
> >> >> *Sameer Maggon*
> >> >> Measured Search
> >> >> www.measuredsearch.com <http://measuredsearch.com/>
> >> >>
> >> >> On Thu, Nov 19, 2015 at 11:17 AM, KNitin <nitin.t...@gmail.com>
> wrote:
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> >  I was wondering if there are existing tools that will generate
> solr
> >> >> index
> >> >> > offline (in solrcloud mode)  that can be later on loaded into
> >> solrcloud,
> >> >> > before I decide to implement my own. I found some tools that do
> only
> >> solr
> >> >> > based index loading (non-zk mode). Is there one with zk mode
> enabled?
> >> >> >
> >> >> >
> >> >> > Thanks in advance!
> >> >> > Nitin
> >> >> >
> >> >>
> >>
>

Re: Generating Index offline and loading into solrcloud

Reply via email to