Re: solr over hdfs for accessing/ changing indexes outside solr

Erick Erickson Thu, 07 Aug 2014 08:08:49 -0700

If SolrCloud meets your needs, without Hadoop, then
there's no real reason to introduce the added complexity.


There are a bunch of problems that do _not_ work
well with SolrCloud over non-Hadoop file systems. For
those problems, the combination of SolrCloud and Hadoop
makes tackling them possible.

Best,
Erick


On Thu, Aug 7, 2014 at 3:55 AM, Ali Nazemian <alinazem...@gmail.com> wrote:

> Thank you very much. But why should we go for Solr distributed with Hadoop?
> SolrCloud already exists and is well suited to big indexes. Is there any
> advantage to sending indexes over MapReduce that SolrCloud cannot provide?
> Regards.
>
>
> On Wed, Aug 6, 2014 at 9:09 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > bq: Are you aware of Cloudera search? I know they provide an integrated
> > Hadoop ecosystem.
> >
> > What Cloudera Search does via the MapReduceIndexerTool (MRIT) is create N
> > sub-indexes for each shard in the M/R paradigm via EmbeddedSolrServer.
> > Eventually, these sub-indexes for each shard are merged (perhaps through
> > some number of levels) in the reduce phase and maybe merged into a live
> > Solr instance (--go-live). You'll note that this tool requires the
> > address of the ZK ensemble from which it can get the network topology,
> > configuration files, all that rot. If you don't use the --go-live option,
> > the output is still a Solr index, it's just that the index for each shard
> > is left in a specific directory on HDFS. Being on HDFS allows this kind
> > of M/R paradigm for massively parallel indexing operations, and perhaps
> > massively complex analysis.
> >
> > Nowhere is there any low-level non-Solr manipulation of the indexes.
> >
> > The Flume fork just writes directly to the Solr nodes. It knows about the
> > ZooKeeper ensemble and the collection too and communicates via SolrJ I'm
> > pretty sure.
> >
> > As far as integrating with HDFS, you're right, HA is part of the package.
> > As far as using the Solr indexes for analysis, well you can write
> > anything you want to use the Solr indexes from anywhere in the M/R world
> > and have them available from anywhere in the cluster. There's no real
> > need to even have Solr running, you could use the output from MRIT and
> > access the sub-shards with the EmbeddedSolrServer if you wanted, leaving
> > out all the pesky servlet container stuff.
> >
> > bq: So why we go for HDFS in the case of analysis if we want to use SolrJ
> > for this purpose? What is the point?
> >
> > Scale and data access in a nutshell. In the HDFS world, you can scale
> > pretty linearly with the number of nodes you can rack together.
> >
> > Frankly though, if your data set is small enough to fit on a single
> > machine _and_ you can get through your analysis in a reasonable time
> > (reasonable here is up to you), then HDFS is probably not worth the
> > hassle. But in the big data world where we're talking petabyte scale,
> > having HDFS as the underpinning opens up possibilities for working on
> > data that were difficult/impossible with Solr previously.
> >
> > Best,
> > Erick
> >
> >
> >
> > On Tue, Aug 5, 2014 at 9:37 PM, Ali Nazemian <alinazem...@gmail.com>
> > wrote:
> >
> > > Dear Erick,
> > > I remember that some time ago somebody asked what the point is of
> > > modifying Solr to use HDFS for storing indexes. As far as I remember,
> > > somebody told him that integrating Solr with HDFS has two advantages:
> > > 1) having Hadoop replication and HA; 2) using indexes and Solr documents
> > > for other purposes such as analysis. So why would we go for HDFS in the
> > > case of analysis if we want to use SolrJ for this purpose? What is the
> > > point?
> > > Regards.
> > >
> > >
> > > On Wed, Aug 6, 2014 at 8:59 AM, Ali Nazemian <alinazem...@gmail.com>
> > > wrote:
> > >
> > > > Dear Erick,
> > > > Hi,
> > > > Thank you for your reply. Yes, I am aware that SolrJ is my last option.
> > > > I was thinking about raw I/O operations, but according to your reply
> > > > that is probably not applicable. What about the Lily project that
> > > > Michael mentioned? Does that count as using SolrJ too? Are you aware of
> > > > Cloudera Search? I know they provide an integrated Hadoop ecosystem. Do
> > > > you know what their suggestion is?
> > > > Best regards.
> > > >
> > > >
> > > >
> > > > On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson <erickerick...@gmail.com>
> > > > wrote:
> > > >
> > > >> What you haven't told us is what you mean by "modify the
> > > >> index outside Solr". SolrJ? Using raw Lucene? Trying to modify
> > > >> things by writing your own codec? Standard Java I/O operations?
> > > >> Other?
> > > >>
> > > >> You could use SolrJ to connect to an existing Solr server and
> > > > >> both read and modify at will from your M/R jobs. But if you're
> > > >> thinking of trying to write/modify the segment files by raw I/O
> > > >> operations, good luck! I'm 99.99% certain that's going to cause
> > > >> you endless grief.
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > >>
> > > > >> On Tue, Aug 5, 2014 at 9:55 AM, Ali Nazemian <alinazem...@gmail.com>
> > > > >> wrote:
> > > >>
> > > > >> > Actually I am going to do some analysis on the Solr data using
> > > > >> > MapReduce. For this purpose I might need to change some parts of the
> > > > >> > data or add new fields from outside Solr.
> > > >> >
> > > >> >
> > > > >> > On Tue, Aug 5, 2014 at 5:51 PM, Shawn Heisey <s...@elyograg.org>
> > > > >> > wrote:
> > > >> >
> > > >> > > On 8/5/2014 7:04 AM, Ali Nazemian wrote:
> > > > >> > > > I changed Solr 4.9 to write index and data on HDFS. Now I am
> > > > >> > > > going to connect to that data from outside Solr to change some
> > > > >> > > > of the values. Could somebody please tell me how that is
> > > > >> > > > possible? Suppose I am using HBase over HDFS to make these
> > > > >> > > > changes.
> > > >> > >
> > > > >> > > I don't know how you could safely modify the index without a
> > > > >> > > Lucene application or another instance of Solr, but if you do
> > > > >> > > manage to modify the index, simply reloading the core or
> > > > >> > > restarting Solr should cause it to pick up the changes. Either
> > > > >> > > you would need to make sure that Solr never modifies the index,
> > > > >> > > or you would need some way of coordinating updates so that Solr
> > > > >> > > and the other application would never try to modify the index at
> > > > >> > > the same time.
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Shawn
> > > >> > >
> > > >> > >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > A.Nazemian
> > > >> >
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > A.Nazemian
> > > >
> > >
> > >
> > >
> > > --
> > > A.Nazemian
> > >
> >
>
>
>
> --
> A.Nazemian
>

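For the case discussed in this thread (changing values or adding fields
from an M/R job without touching segment files), here is a minimal sketch of
going through a running Solr rather than raw I/O. It is not SolrJ: it builds
an atomic-update request for Solr's JSON update handler over plain HTTP,
which is a swapped-in equivalent that needs only the Python standard library.
The base URL, the collection name "collection1", the document id and the
field name "sentiment_s" are all made up for illustration. Atomic updates
also assume the fields in your schema are stored and that the update log is
enabled, so check that against your own setup.

```python
import json
import urllib.request


def atomic_update(doc_id, set_fields=None, add_fields=None, id_field="id"):
    """Build one atomic-update document for Solr's JSON update handler.

    set_fields: field -> value that replaces the current value.
    add_fields: field -> value appended to a multi-valued field.
    """
    doc = {id_field: doc_id}
    for name, value in (set_fields or {}).items():
        doc[name] = {"set": value}
    for name, value in (add_fields or {}).items():
        doc[name] = {"add": value}
    return doc


def build_request(base_url, docs, commit=False):
    """Return a urllib Request that POSTs docs to <base_url>/update."""
    url = base_url.rstrip("/") + "/update"
    if commit:
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical collection and field names; nothing is sent here.
    docs = [atomic_update("doc-1", set_fields={"sentiment_s": "positive"})]
    req = build_request("http://localhost:8983/solr/collection1", docs)
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

Because every change goes through Solr's own update path, Solr stays the
only writer of the index, which sidesteps the coordination problem Shawn
describes above.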