Dear Shawn,
Hi, and thank you for your reply.
Could you please tell me about the performance and scalability of the
mentioned solutions? Suppose I have a SolrCloud cluster with 4 machines.
Would it scale linearly if I add another 4 machines, i.e. when the number
of documents grows from 10m to 100m?
Regards.


On Mon, May 26, 2014 at 8:30 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> > I was wondering which scenario (or combination of scenarios) would be
> > better for my application in terms of performance, scalability and
> > high availability. Here is my application:
> >
> > Suppose I am going to have more than 10m documents, and the number
> > grows every day (it will probably reach more than 100m docs within a
> > year). I want to use Solr as the tool for indexing these documents,
> > but the problem is that I have some data fields that could change
> > frequently (not too often, but they could change).
>
> Choosing which database software to use to hold your data is a problem
> with many possible solutions.  Everyone will have a different answer for
> you.  Each solution has strengths and weaknesses, and in the end, only
> you can really know what your requirements are.
>
> > Scenarios:
> >
> > 1- Using SolrCloud as the database for all data (even the data that
> > could change).
>
> If you choose to use Solr as a NoSQL database, I strongly recommend that
> you have two Solr installs.  The first install would be purely for data
> storage and would have no indexed fields.  If you can get machines with
> enough RAM, it would also probably be preferable to use a single index
> (or SolrCloud with one shard) for that install.  The other install would
> be for searching.  Sharding would not be an issue on that index.  The
> reason that I make this recommendation is that when you use Solr for
> searching, you have to do a complete reindex if you change your search
> schema.  It's difficult to reindex if the search index is also your
> canonical data source.
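>
> As a rough illustration (not something from my setup), here is a minimal
> Python sketch of that split, assuming two collections named "storage"
> and "search" and Solr's standard JSON update/select HTTP API; the host,
> collection names and field names are only placeholders:
>
>     import requests
>
>     SOLR = "http://localhost:8983/solr"
>
>     def store(doc):
>         # Canonical copy: every field stored, nothing indexed for search.
>         requests.post(f"{SOLR}/storage/update?commit=true", json=[doc])
>
>     def index_for_search(doc):
>         # Search copy: only the fields the search schema actually indexes.
>         slim = {"id": doc["id"], "title": doc["title"], "body": doc["body"]}
>         requests.post(f"{SOLR}/search/update?commit=true", json=[slim])
>
>     def reindex_search():
>         # After a search-schema change, rebuild the search collection
>         # from the storage collection instead of the original data
>         # source (paging and batching omitted in this sketch).
>         docs = requests.get(
>             f"{SOLR}/storage/select",
>             params={"q": "*:*", "rows": 1000, "wt": "json"},
>         ).json()["response"]["docs"]
>         for doc in docs:
>             index_for_search(doc)
>
> The point is simply that the storage collection survives search schema
> changes, so a full reindex never has to go back to the original source.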
>
> > 2- Using SolrCloud as the database for static data and an RDBMS (such
> > as Oracle) for storing dynamic fields.
>
> I don't think it would be a good idea to have two canonical data
> sources.  Pick one.  As already mentioned, Solr is better as a search
> technology, serving up pointers to data in another data source, than as
> a database.
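>
> In practice that pattern usually looks something like the following
> sketch (Python, with a hypothetical "docs" table and "search"
> collection; any DB-API driver would do in place of sqlite3):
>
>     import sqlite3
>     import requests
>
>     SOLR_SELECT = "http://localhost:8983/solr/search/select"
>     db = sqlite3.connect("canonical.db")  # stand-in for the real RDBMS
>
>     def find(query):
>         # Ask Solr only for matching ids, ranked by relevance.
>         resp = requests.get(
>             SOLR_SELECT,
>             params={"q": query, "fl": "id", "rows": 20, "wt": "json"},
>         ).json()
>         ids = [d["id"] for d in resp["response"]["docs"]]
>         if not ids:
>             return []
>         # Fetch the authoritative records from the canonical database.
>         marks = ",".join("?" * len(ids))
>         return db.execute(
>             f"SELECT * FROM docs WHERE id IN ({marks})", ids).fetchall()
>
> Solr does the matching and ranking; the database remains the single
> canonical copy of the data.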
>
> If you want to use RDBMS technology, why would you spend all that money
> on Oracle?  Just use one of the free databases.  Our really large Solr
> index comes from a database.  At one time that database was in Oracle.
> When my employer purchased the company with that database, we thought we
> were obtaining a full Oracle license.  It turns out we weren't.  It
> would have cost about half a million dollars to buy that license, so we
> switched to MySQL.
>
> Since making that move to MySQL, performance is actually *better*.  The
> source table for our data has 96 million rows right now, growing at a
> rate of a few million per year.  This is completely in line with your
> 100 million document requirement.  For the massive table that feeds
> Solr, we might switch to MongoDB, but that has not been decided yet.
>
> Later we switched from EasyAsk to Solr, a move that has *also* given us
> better performance.  Because both MySQL and Solr are free, we've
> achieved a substantial cost savings.
>
> > 3- Using the integration of SolrCloud and Hadoop (HDFS+MapReduce) for
> > all data.
>
> I have no experience with this technology, but I think that if you are
> thinking about a database on HDFS, you're probably actually talking
> about HBase, the Apache implementation of Google's BigTable.
>
> Thanks,
> Shawn
>
>


-- 
A.Nazemian
