bq: Are you aware of Cloudera Search? I know they provide an integrated
Hadoop ecosystem.

What Cloudera Search does via the MapReduceIndexerTool (MRIT) is create N
sub-indexes for each shard in the M/R paradigm via EmbeddedSolrServer.
Eventually, these sub-indexes for each shard are merged (perhaps through
some number of levels) in the reduce phase, and maybe merged into a live
Solr instance (--go-live). You'll note that this tool requires the address
of the ZK ensemble, from which it can get the network topology,
configuration files, all that rot. If you don't use the --go-live option,
the output is still a Solr index; it's just that the index for each shard
is left in a specific directory on HDFS. Being on HDFS allows this kind of
M/R paradigm for massively parallel indexing operations, and perhaps
massively complex analysis.
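
For concreteness, here's roughly what driving MRIT looks like through
Hadoop's ToolRunner (the same thing the usual "hadoop jar" invocation
does). Treat it as a sketch: the ZK address, HDFS paths, morphline file,
and collection name are placeholders I made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.solr.hadoop.MapReduceIndexerTool;

public class MritDriver {
  public static void main(String[] args) throws Exception {
    String[] mritArgs = {
        "--morphline-file", "morphline.conf",           // how raw records become Solr docs
        "--output-dir", "hdfs://nn:8020/tmp/outdir",    // per-shard sub-indexes land here
        "--zk-host", "zk1:2181,zk2:2181,zk3:2181/solr", // topology and configs come from ZK
        "--collection", "collection1",
        "--go-live",                                    // merge results into the live cluster
        "hdfs://nn:8020/indir"                          // input files to index
    };
    System.exit(ToolRunner.run(new Configuration(), new MapReduceIndexerTool(), mritArgs));
  }
}

Drop --go-live and the per-shard indexes just stay in the output
directory on HDFS, as described above.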

Nowhere is there any low-level non-Solr manipulation of the indexes.

The Flume fork just writes directly to the Solr nodes. It knows about the
ZooKeeper ensemble and the collection too, and communicates via SolrJ, I'm
pretty sure.
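
In case it helps, "communicates via SolrJ" looks something like this with
the 4.x API. It's a sketch: the ZK address and collection name are made
up, and this is what any client (a Flume sink or your own code) would do
rather than touching index files.

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrJWriter {
  public static void main(String[] args) throws Exception {
    // The client discovers cluster topology from ZooKeeper, much as MRIT does.
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr");
    server.setDefaultCollection("collection1");
    try {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");
      doc.addField("text", "indexed through SolrJ, not by writing segment files");
      server.add(doc);    // routed to the right shard leader over HTTP
      server.commit();
    } finally {
      server.shutdown();
    }
  }
}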

As far as integrating with HDFS, you're right, HA is part of the package.
As far as using the Solr indexes for analysis, well, you can write
anything you want that uses the Solr indexes from anywhere in the M/R
world and have them available from anywhere in the cluster. There's no
real need to even have Solr running; you could take the output from MRIT
and access the sub-shards with EmbeddedSolrServer if you wanted, leaving
out all the pesky servlet container stuff.
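
A minimal sketch of that last idea with the 4.x API, assuming you've set
up a solr home whose core points at (a copy of) one MRIT sub-shard; the
paths and core name here are invented.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.core.CoreContainer;

public class EmbeddedReader {
  public static void main(String[] args) throws Exception {
    // solrHome contains solr.xml plus a core whose dataDir points at the
    // sub-shard's index directory; no servlet container involved.
    CoreContainer container = new CoreContainer("/path/to/solrHome");
    container.load();
    EmbeddedSolrServer server = new EmbeddedSolrServer(container, "collection1");
    try {
      QueryResponse rsp = server.query(new SolrQuery("*:*").setRows(10));
      System.out.println("hits: " + rsp.getResults().getNumFound());
    } finally {
      server.shutdown(); // also shuts down the CoreContainer
    }
  }
}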

bq: So why do we go for HDFS in the case of analysis if we want to use
SolrJ for this purpose? What is the point?

Scale and data access in a nutshell. In the HDFS world, you can scale
pretty linearly
with the number of nodes you can rack together.

Frankly, though, if your data set is small enough to fit on a single
machine _and_ you can get through your analysis in a reasonable time
(reasonable here is up to you), then HDFS is probably not worth the
hassle. But in the big-data world, where we're talking petabyte scale,
having HDFS as the underpinning opens up possibilities for working on
data that were previously difficult or impossible with Solr.

Best,
Erick



On Tue, Aug 5, 2014 at 9:37 PM, Ali Nazemian <alinazem...@gmail.com> wrote:

> Dear Erick,
> I remember that some time ago somebody asked what the point of modifying
> Solr to use HDFS for storing indexes is. As far as I remember, somebody
> told him that integrating Solr with HDFS has two advantages: 1) having
> Hadoop replication and HA, and 2) using indexes and Solr documents for
> other purposes such as analysis. So why do we go for HDFS in the case of
> analysis if we want to use SolrJ for this purpose? What is the point?
> Regards.
>
>
> On Wed, Aug 6, 2014 at 8:59 AM, Ali Nazemian <alinazem...@gmail.com> wrote:
>
> > Dear Erick,
> > Hi,
> > Thank you for your reply. Yeah, I am aware that SolrJ is my last
> > option. I was thinking about raw I/O operations, so according to your
> > reply that is probably not applicable. What about the Lily project that
> > Michael mentioned? Is that considered SolrJ too? Are you aware of
> > Cloudera Search? I know they provide an integrated Hadoop ecosystem. Do
> > you know what their suggestion is?
> > Best regards.
> >
> >
> >
> > On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> What you haven't told us is what you mean by "modify the
> >> index outside Solr". SolrJ? Using raw Lucene? Trying to modify
> >> things by writing your own codec? Standard Java I/O operations?
> >> Other?
> >>
> >> You could use SolrJ to connect to an existing Solr server and
> >> both read and modify at will from your M/R jobs. But if you're
> >> thinking of trying to write/modify the segment files by raw I/O
> >> operations, good luck! I'm 99.99% certain that's going to cause
> >> you endless grief.
> >>
> >> Best,
> >> Erick
> >>
> >>
> >> On Tue, Aug 5, 2014 at 9:55 AM, Ali Nazemian <alinazem...@gmail.com> wrote:
> >>
> >> > Actually, I am going to do some analysis of the Solr data using
> >> > MapReduce. For this purpose it might be necessary to change some
> >> > parts of the data or add new fields from outside Solr.
> >> >
> >> >
> >> > On Tue, Aug 5, 2014 at 5:51 PM, Shawn Heisey <s...@elyograg.org> wrote:
> >> >
> >> > > On 8/5/2014 7:04 AM, Ali Nazemian wrote:
> >> > > > I changed Solr 4.9 to write its index and data on HDFS. Now I
> >> > > > am going to connect to that data from outside Solr to change
> >> > > > some of the values. Could somebody please tell me how that is
> >> > > > possible? Suppose I am using HBase over HDFS to do these
> >> > > > changes.
> >> > >
> >> > > I don't know how you could safely modify the index without a
> >> > > Lucene application or another instance of Solr, but if you do
> >> > > manage to modify the index, simply reloading the core or
> >> > > restarting Solr should cause it to pick up the changes. Either you
> >> > > would need to make sure that Solr never modifies the index, or you
> >> > > would need some way of coordinating updates so that Solr and the
> >> > > other application would never try to modify the index at the same
> >> > > time.
> >> > >
> >> > > Thanks,
> >> > > Shawn
> >> > >
> >> > >
> >> >
> >> >
> >> > --
> >> > A.Nazemian
> >> >
> >>
> >
> >
> >
> > --
> > A.Nazemian
> >
>
>
>
> --
> A.Nazemian
>
