Dear Erick,
Thank you for your reply.
Some parts of the documents come from the Nutch crawler, and the other parts
come from processing those documents.
I really need indexing to be as fast as possible; 10 hours of indexing would
not be acceptable for my application.
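
As you suggested, my first step will be to try straight SolrJ batch
indexing. Roughly something like the sketch below (only a sketch: the
ZooKeeper address, the collection name "docs" and the field names are
placeholders for my real setup, where the content would come from the Nutch
output and the processing step):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // "localhost:9983" and "docs" are placeholders for my real
        // ZooKeeper ensemble and collection name.
        CloudSolrServer server = new CloudSolrServer("localhost:9983");
        server.setDefaultCollection("docs");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            // In the real program the content comes from the Nutch output
            // and the processing step, not from a generated string.
            doc.addField("content", "document body " + i);
            batch.add(doc);

            // Send documents in batches instead of one add() per document.
            if (batch.size() == 1000) {
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        // Commit once at the end rather than after every batch.
        server.commit();
        server.shutdown();
    }
}

The idea is to buffer documents and send them to Solr in batches of 1000,
with a single commit at the end, instead of one add and one commit per
document.
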
Regards.


On Mon, May 26, 2014 at 9:25 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> What you haven't told us is where the data comes from. But until
> you put some numbers to it, it's hard to decide.
>
> I tend to prefer storing the data somewhere else (filesystem, whatever)
> and indexing to Solr when the data changes, even if that means re-indexing
> the entire corpus. I don't like going to more complicated solutions until
> that proves untenable.
>
> Backup/restore solutions for filesystems, DBs, whatever, are a very
> mature technology; I rely on that first to store my original source.
>
> Now you can re-index at will.
>
> So let's claim your data comes in from some stream somewhere. I'd
> 1> store it to the file system.
> 2> write a program to pull it off the file system and index it.
> 3> Your comment about MapReduceIndexerTool is germane. You can re-index
> all that data very quickly. And it'll find files on your file system
> for you too!
>
> But I wouldn't even go there until I'd tried indexing your 10M docs
> straight with SolrJ or similar. If you can index your 10M docs in 1 hour
> and, by extrapolation, your 100M docs in 10 hours, is that good enough?
> I don't know, it's your problem space after all ;). And is it acceptable
> not to see changes to the schema until tomorrow morning? If so, there's
> no need to get more complicated....
>
> Best,
> Erick
>
> On Mon, May 26, 2014 at 9:00 AM, Shawn Heisey <s...@elyograg.org> wrote:
> > On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> >> I was wondering which scenario (or a combination of them) would be
> >> better for my application from the standpoint of performance,
> >> scalability and high availability. Here is my application:
> >>
> >> Suppose I am going to have more than 10m documents, and the collection
> >> grows every day (it will probably reach more than 100m docs within a
> >> year). I want to use Solr as the tool for indexing these documents, but
> >> the problem is that I have some data fields that could change frequently
> >> (not too often, but they could change).
> >
> > Choosing which database software to use to hold your data is a problem
> > with many possible solutions.  Everyone will have a different answer for
> > you.  Each solution has strengths and weaknesses, and in the end, only
> > you can really know what your requirements are.
> >
> >> Scenarios:
> >>
> >> 1- Using SolrCloud as the database for all data (even the data that
> >> could change).
> >
> > If you choose to use Solr as a NoSQL database, I would strongly recommend that
> > you have two Solr installs.  The first install would be purely for data
> > storage and would have no indexed fields.  If you can get machines with
> > enough RAM, it would also probably be preferable to use a single index
> > (or SolrCloud with one shard) for that install.  The other install would
> > be for searching.  Sharding would not be an issue on that index.  The
> > reason that I make this recommendation is that when you use Solr for
> > searching, you have to do a complete reindex if you change your search
> > schema.  It's difficult to reindex if the search index is also your
> > canonical data source.
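
(If I understand this correctly, the storage-only install would declare its
fields as stored but not indexed in its schema. A rough sketch of what such
field definitions might look like in schema.xml, with example field names
only, and with the id field left indexed so documents can still be fetched
and updated by key:

  <field name="id"      type="string" indexed="true"  stored="true" required="true"/>
  <field name="title"   type="string" indexed="false" stored="true"/>
  <field name="content" type="string" indexed="false" stored="true"/>

The searching install would then keep its own schema with indexed fields and
could be rebuilt from the storage install whenever that schema changes.)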
> >
> >> 2- Using SolrCloud as the database for static data and using an RDBMS
> >> (such as Oracle) for storing the dynamic fields.
> >
> > I don't think it would be a good idea to have two canonical data
> > sources.  Pick one.  As already mentioned, Solr is better as a search
> > technology, serving up pointers to data in another data source, than as
> > a database.
> >
> > If you want to use RDBMS technology, why would you spend all that money
> > on Oracle?  Just use one of the free databases.  Our really large Solr
> > index comes from a database.  At one time that database was in Oracle.
> > When my employer purchased the company with that database, we thought we
> > were obtaining a full Oracle license.  It turns out we weren't.  It
> > would have cost about half a million dollars to buy that license, so we
> > switched to MySQL.
> >
> > Since making that move to MySQL, performance is actually *better*.  The
> > source table for our data has 96 million rows right now, growing at a
> > rate of a few million per year.  This is completely in line with your
> > 100 million document requirement.  For the massive table that feeds
> > Solr, we might switch to MongoDB, but that has not been decided yet.
> >
> > Later we switched from EasyAsk to Solr, a move that has *also* given us
> > better performance.  Because both MySQL and Solr are free, we've
> > achieved a substantial cost savings.
> >
> >> 3- Using the integration of SolrCloud and Hadoop (HDFS+MapReduce) for
> >> all data.
> >
> > I have no experience with this technology, but I think that if you are
> > thinking about a database on HDFS, you're probably actually talking
> > about HBase, the Apache implementation of Google's BigTable.
> >
> > Thanks,
> > Shawn
> >
>



-- 
A.Nazemian
