Re: Moving to HDFS, How to merge indices from 8 servers ?

Michael Della Bitta Mon, 15 Sep 2014 14:03:40 -0700

If all you need is better availability, I would start by trying out an
additional replica of each shard on a different box, so each box would be
serving the data for 2 shards and each shard would be available on 2 boxes.
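
A minimal sketch of that layout, assuming the Collections API's ADDREPLICA action is available in your Solr version. The shard names (shard1..shard8), the node names, and the base URL are placeholders I made up for illustration, and the "next box in the ring" placement is just one simple way to get each shard onto 2 boxes:

```python
from urllib.parse import urlencode


def add_replica_urls(solr_base, collection, nodes):
    """Build one ADDREPLICA request URL per shard.

    Assumes shard i+1 currently lives on nodes[i]; its extra replica is
    placed on the next node in the ring, so every node ends up serving
    2 shards and every shard is available on 2 nodes.
    """
    urls = []
    n = len(nodes)
    for i in range(n):
        params = {
            "action": "ADDREPLICA",
            "collection": collection,
            "shard": f"shard{i + 1}",      # hypothetical shard naming
            "node": nodes[(i + 1) % n],    # neighbour box hosts the copy
        }
        urls.append(f"{solr_base}/admin/collections?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    # Hypothetical node names; check the real ones in your cluster state.
    boxes = [f"host{i}:8983_solr" for i in range(1, 9)]
    for url in add_replica_urls("http://host1:8983/solr", "collection1", boxes):
        print(url)
```

This only prints the URLs; you would issue each one with curl or a browser after checking that the shard and node names match your cluster.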


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Mon, Sep 15, 2014 at 1:29 PM, Amey - codeinventory <
ameyjad...@codeinventory.com> wrote:

> Well, I have 8 m1.large EC2 instances, each with 2 cores, 7 GB of RAM, and
> a 1 TB EBS volume attached for the index.
>
> In my case I don't expect the index to be held in RAM, nor do I need quick
> responses, as it's not a real-time application. I just want fault tolerance
> in the application and availability of the full data.
>
>
> Is it good to use HDFS over normal Solr Cloud?
>
> Best,
> Amey
>
> --- Original Message ---
>
> From: "Michael Della Bitta" <michael.della.bi...@appinions.com>
> Sent: September 15, 2014 9:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Moving to HDFS, How to merge indices from 8 servers ?
>
> There's not much about Solr Cloud or HDFS indexes that suggests you should
> only have one logical shard. If your goal is better uptime with a sharded
> index, you should add more replicas.
>
> If your collection is small enough that one machine can serve one query
> with acceptable performance, but you want to scale to many queries, then
> just adding mirrors of a single-sharded collection is fine. But that's a
> big "if."
>
> Switching to HDFS is an option if you have enough RAM for your whole
> collection, and have a lot of existing storage devoted to HDFS, or if you
> want to batch create indexes. It's not really aimed at preserving uptime as
> far as I know.
>
> On Mon, Sep 15, 2014 at 11:23 AM, Amey Jadiye <
> ameyjad...@codeinventory.com>
> wrote:
>
> > Thanks for the reply, Erick.
> > I think I have some confusion about how Solr works with HDFS, and the
> > solution I am thinking of could be reworked by the user community :)
> > Here is the actual situation and the solution I have implemented.
> >
> > *Use case*: I need a Google-like search engine that works in a
> > distributed, fault-tolerant mode. We are collecting health-related URLs
> > from a third-party system in large volume, approx. 1 million/hour, and we
> > want to build an inventory that contains all of their details. I fetch
> > each URL's data, break it up by tags such as H1, P, and Div with the help
> > of the Jsoup library, and put it into Solr as documents with different
> > boosts for different fields.
> > After loading this data, I have a custom program that categorises it.
> > For example, for all the cancer-related pages, I query Solr, fetch all
> > URLs related to cancer with CursorMark, and write them to a file for
> > further use by our system.
> >
> > *Old solution*: For this I have built 8 Solr servers with 3 ZooKeepers
> > on individual AWS EC2 instances, with one collection of 8 shards. The
> > problem with this solution is that whenever any instance goes down I lose
> > that data for a moment. Link to the current setup:
> > http://postimg.org/image/luli3ybtj/
> >
> > *New _OR_ possibly faulty solution*: I am thinking that HDFS, which is
> > virtually a single file system, is better, so that if a server goes down
> > its data is available through another server. Below are the steps I am
> > thinking of:
> > 1 > Merge all 8 servers' indices into one somewhere.
> > 2 > Set up HDFS on the same 8 servers.
> > 3 > Put the merged index folder in HDFS, so it will be physically
> > distributed across the 8 servers.
> > 4 > Restart the 8 servers, pointing each instance to HDFS.
> > 5 > Now I am ready to put data on the 8 servers and fetch it through any
> > one Solr instance; if that one is down, choose another, so it is
> > guaranteed to get all the data.
> >
> > Does this solution sound good, or would you suggest a better one?
> >
> > Regards,
> > Amey
> >
> >
> > > Date: Thu, 11 Sep 2014 14:41:48 -0700
> > > Subject: Re: Moving to HDFS, How to merge indices from 8 servers ?
> > > From: erickerick...@gmail.com
> > > To: solr-user@lucene.apache.org
> > >
> > > Um, I really think this is pretty likely to not be a great solution.
> > > When you say "merge indexes", I'm thinking you want to go from 8
> > > shards to 1 shard. Now, this can be done with the "merge indexes" core
> > > admin API, see:
> > > https://wiki.apache.org/solr/MergingSolrIndexes
> > >
> > > BUT.
> > > 1>  This will break all things SolrCloud-ish assuming you created your
> > > 8 shards under SolrCloud.
> > > 2> Solr is usually limited by memory, so trying to fit enough of your
> > > single huge index into memory may be problematical.
> > >
> > > This feels like an XY problem, _why_ are you asking about this? What
> > > is the use-case you want to handle by this?
> > >
> > > Best,
> > > Erick
> > >
> > > On Thu, Sep 11, 2014 at 7:44 AM, Amey Jadiye
> > > <ameyjad...@codeinventory.com> wrote:
> > > > > FYI, I searched Google for this problem but didn't find any
> > > > > satisfactory answer.
> > > > > Here is the current situation: I have 8 shards in my Solr cloud,
> > > > > backed by 3 ZooKeepers, all set up on AWS EC2 instances. All 8 are
> > > > > leaders with no replicas.
> > > > > I have only 1 collection, say collection1, divided into 8 shards. I
> > > > > have configured the index and tlog folders on each server to point
> > > > > to a 1 TB EBS disk attached to each server. Each of the 8 servers
> > > > > has around 100 GB in its index folder, so the total index size is
> > > > > ~800 GB.
> > > > > Now I want to move all the data to HDFS, so I am going to:
> > > > > 1 > Set up HDFS on all 8 servers.
> > > > > 2 > Merge all the indexes from the 8 servers.
> > > > > 3 > Put them in HDFS.
> > > > > 4 > Stop and start all my Solr servers on HDFS to access that common
> > > > > index data, setting the parameters below and a few more:
> > > > > -Dsolr.directoryFactory=HdfsDirectoryFactory
> > > > > -Dsolr.lock.type=hdfs
> > > > > -Dsolr.data.dir=hdfs://host:port/path
> > > > > -Dsolr.updatelog=hdfs://host:port/path -jar
> > > > > Now could you tell me whether this is the correct approach? If yes,
> > > > > how can I merge all the indices from the 8 servers?
> > > > > Regards,
> > > > > Amey
> >
> >
>
