Can I ask if you do any faceted or MLT-type searches? Do those even work across shards?
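For faceting, my understanding is that the regular facet parameters do fan
out over a distributed request, so against a sharded setup I'd expect the
query to look something like this (a rough sketch -- host names, ports, and
the facet field are made up for illustration):

    http://idx1:8983/solr/select?q=*:*
        &shards=idx1:8983/solr,idx2:8983/solr,idx3:8983/solr
        &facet=true
        &facet.field=category
        &rows=10

I'm much less sure about MoreLikeThis -- I haven't seen it documented as
working with the shards parameter.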
On Fri, 2011-05-13 at 08:59 -0600, Shawn Heisey wrote:
> Our system, which I am not at liberty to disclose, consists of 55
> million documents, mostly photos and text, but video is starting to
> become prominent. The entire archive is about 80 terabytes, but we only
> index a subset of the metadata, stored in a MySQL database, which is
> about 100GB or so in size.
>
> The Solr index (version 1.4.1) consists of six large shards, each about
> 16GB in size, plus a seventh shard containing the most recent 7 days,
> which is usually less than 1 GB. The entire system is replicated to
> slave servers. Each virtual machine that houses a large shard has 9GB
> of RAM, and there are three large shards on each of the four physical
> hosts. Each physical host is dual quad-core with 32GB of RAM and a
> six-drive SATA RAID10. We went with virtualization (Xen) for cost
> reasons.
>
> Performance is good. Moving to physical machines instead of
> virtualization would be optimal, but I think I'll have to settle for a
> RAM upgrade instead.
>
> The main reason I stuck with distributed search is index rebuild time.
> I can currently rebuild the entire index in 3-4 hours; a single large
> index would take 5-6 times that long.
>
> On 5/13/2011 12:37 AM, Otis Gospodnetic wrote:
> > With that many documents, I think GSA cost might be in the millions
> > of USD. Don't go there.
> >
> > 300 million docs might be called medium these days. Of course, if
> > those documents themselves are huge, then it's more
> > resource-intensive. 10TB sounds like a lot when it comes to search,
> > but it's hard to tell what that represents (e.g. are those docs with
> > lots of photos in them? Presentations very light on text? Plain-text
> > documents with 300 words per page? etc.)
> >
> > Anyhow, yes, Solr is a fine choice for this.
> >
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> > ----- Original Message ----
> >> From: atreyu <[email protected]>
> >> To: [email protected]
> >> Sent: Thu, May 12, 2011 12:59:28 PM
> >> Subject: Support for huge data set?
> >>
> >> Hi,
> >>
> >> I have about 300 million docs (or 10TB of data), which is doubling
> >> every 3 years, give or take. The data mostly consists of Oracle
> >> records, webpage files (HTML/XML, etc.) and office doc files. There
> >> are between two and four dozen concurrent users, typically. The
> >> indexing server has more than 27 GB of RAM, but it still gets
> >> extremely taxed, and this will only get worse.
> >>
> >> Would Solr be able to efficiently deal with a load of this size? I
> >> am trying to avoid the heavy cost of GSA, etc.
> >>
> >> Thanks.
> >>
> >> --
> >> View this message in context:
> >> http://lucene.472066.n3.nabble.com/Support-for-huge-data-set-tp2932652p2932652.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
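Impressive rebuild numbers, by the way: 55 million documents in 3-4 hours
works out to very roughly 3,800-5,100 docs/sec in aggregate, or somewhere
around 640-850 docs/sec per shard if the six large shards rebuild in
parallel (back-of-the-envelope math, assuming the work is spread evenly).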
