Can I ask if you do any faceted or MLT-type searches? Do those even work across shards?
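For faceting, my understanding is that the regular facet parameters do fan
out over a distributed request, so against a sharded setup I'd expect the
query to look something like this (a rough sketch -- host names, ports, and
the facet field are made up for illustration):

    http://idx1:8983/solr/select?q=*:*
        &shards=idx1:8983/solr,idx2:8983/solr,idx3:8983/solr
        &facet=true
        &facet.field=category
        &rows=10

I'm much less sure about MoreLikeThis -- I haven't seen it documented as
working with the shards parameter.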
On Fri, 2011-05-13 at 08:59 -0600, Shawn Heisey wrote:
> Our system, which I am not at liberty to disclose, consists of 55
> million documents, mostly photos and text, but video is starting to
> become prominent. The entire archive is about 80 terabytes, but we only
> index a subset of the metadata, stored in a MySQL database, which is
> about 100GB or so in size.
>
> The Solr index (version 1.4.1) consists of six large shards, each about
> 16GB in size, plus a seventh shard containing the most recent 7 days,
> which is usually less than 1 GB. The entire system is replicated to
> slave servers. Each virtual machine that houses a large shard has 9GB
> of RAM, and there are three large shards on each of the four physical
> hosts. Each physical host is dual quad-core with 32GB of RAM and a
> six-drive SATA RAID10. We went with virtualization (Xen) for cost
> reasons.
>
> Performance is good. Moving to physical machines instead of
> virtualization would be optimal, but I think I'll have to settle for a
> RAM upgrade instead.
>
> The main reason I stuck with distributed search is index rebuild time.
> I can currently rebuild the entire index in 3-4 hours; a single large
> index would take 5-6 times that long.
>
> On 5/13/2011 12:37 AM, Otis Gospodnetic wrote:
> > With that many documents, I think GSA cost might be in the millions
> > of USD. Don't go there.
> >
> > 300 million docs might be called medium these days. Of course, if
> > those documents themselves are huge, then it's more
> > resource-intensive. 10TB sounds like a lot when it comes to search,
> > but it's hard to tell what that represents (e.g. are those docs with
> > lots of photos in them? Presentations very light on text? Plain-text
> > documents with 300 words per page? etc.)
> >
> > Anyhow, yes, Solr is a fine choice for this.
> >
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> > ----- Original Message ----
> >> From: atreyu <[email protected]>
> >> To: [email protected]
> >> Sent: Thu, May 12, 2011 12:59:28 PM
> >> Subject: Support for huge data set?
> >>
> >> Hi,
> >>
> >> I have about 300 million docs (or 10TB of data), which is doubling
> >> every 3 years, give or take. The data mostly consists of Oracle
> >> records, webpage files (HTML/XML, etc.) and office doc files. There
> >> are between two and four dozen concurrent users, typically. The
> >> indexing server has more than 27 GB of RAM, but it still gets
> >> extremely taxed, and this will only get worse.
> >>
> >> Would Solr be able to efficiently deal with a load of this size? I
> >> am trying to avoid the heavy cost of GSA, etc.
> >>
> >> Thanks.
> >>
> >> --
> >> View this message in context:
> >> http://lucene.472066.n3.nabble.com/Support-for-huge-data-set-tp2932652p2932652.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
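Impressive rebuild numbers, by the way: 55 million documents in 3-4 hours
works out to very roughly 3,800-5,100 docs/sec in aggregate, or somewhere
around 640-850 docs/sec per shard if the six large shards rebuild in
parallel (back-of-the-envelope math, assuming the work is spread evenly).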
