Re: Micro-Sharding

2011-12-05 Thread Shawn Heisey
On 12/5/2011 6:57 PM, Jamie Johnson wrote:
> A question which is a bit off topic: you mention your algorithm for
> sharding; how do you handle updates, or do you not have to deal with
> that in your scenario?

I have a long-running program based on SolrJ that handles updates. Once a minute, I run thro

Re: Micro-Sharding

2011-12-05 Thread Jamie Johnson
> ways.  The most recent data (between 3.5 to 7 days, trying to keep it
> below 500,000 records) goes into one shard.  The rest of the data is
> split using the formula crc32(did) % numShards.  The value of numShards
> is currently six.  Each of those large shards has nearly 11 million d
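The routing formula quoted above can be sketched as follows. This is a minimal sketch, not the poster's actual code: it assumes `did` is a numeric document id serialized as its decimal string, and it uses zlib's CRC-32, which may differ from whatever CRC-32 implementation the original system uses.

```python
import zlib

NUM_SHARDS = 6  # current value of numShards mentioned in the thread

def shard_for(did):
    """Route a document id to one of the large shards via crc32(did) % numShards."""
    # zlib.crc32 operates on bytes, so the id is serialized as its decimal
    # string first (an assumption -- the original system may hash raw bytes
    # or a different string form of the id).
    return zlib.crc32(str(did).encode("ascii")) % NUM_SHARDS

print(shard_for(123456789))  # → 2
```

Because the hash is deterministic, a given document id always lands on the same shard, which is what makes targeted deletes and re-adds possible without searching every shard.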

Re: Micro-Sharding

2011-12-05 Thread Ted Dunning
On Mon, Dec 5, 2011 at 3:28 PM, Shawn Heisey wrote:
> On 12/4/2011 12:41 AM, Ted Dunning wrote:
>> Read the papers I referred to. They describe how to search a fairly
>> enormous corpus with an 8GB in-memory index (and no disk cache at all).
>
> They would seem to indicate moving away from

Re: Micro-Sharding

2011-12-05 Thread Shawn Heisey
On 12/4/2011 12:41 AM, Ted Dunning wrote:
> Read the papers I referred to. They describe how to search a fairly
> enormous corpus with an 8GB in-memory index (and no disk cache at all).

They would seem to indicate moving away from Solr. While that would not be entirely out of the question, I don't

Re: Micro-Sharding

2011-12-03 Thread Ted Dunning
On Sat, Dec 3, 2011 at 6:36 PM, Shawn Heisey wrote:
> On 12/3/2011 2:25 PM, Ted Dunning wrote:
>> Things have changed since I last did this sort of thing seriously. My
>> guess is that this is a relatively small amount of memory to devote to
>> search. It used to be that the only way to do this

Re: Micro-Sharding

2011-12-03 Thread Shawn Heisey
On 12/3/2011 2:25 PM, Ted Dunning wrote:
> Things have changed since I last did this sort of thing seriously. My
> guess is that this is a relatively small amount of memory to devote to
> search. It used to be that the only way to do this effectively with
> Lucene based systems was to keep the heap rel

Re: Micro-Sharding

2011-12-03 Thread Ted Dunning
> s is currently six. Each of those large shards has nearly 11 million
> documents in 20GB of disk space.

OK. That is a relatively common arrangement.

> I am already using the concept of micro-sharding, but certainly not on a
> grand scale. One copy of the index is served by two host

Micro-Sharding

2011-12-03 Thread Shawn Heisey
already using the concept of micro-sharding, but certainly not on a grand scale. One copy of the index is served by two hosts with 8 CPU cores, so each host has three of the large shards. Doing some least common multiple calculations, I have determined that 420 shards would allow me to use the
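The "least common multiple calculations" mentioned above can be reproduced with a short sketch. The assumption here (not stated in the truncated snippet) is that the goal is a shard count that divides evenly across any host count up to seven, which is exactly what makes 420 come out: it is the LCM of the integers 1 through 7.

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    """Least common multiple of two integers."""
    return a * b // gcd(a, b)

# Smallest shard count that splits evenly across any cluster size from
# 1 to 7 hosts (an assumed interpretation of the 420-shard figure).
shard_count = reduce(lcm, range(1, 8))
print(shard_count)  # → 420

# Shards per host for each possible cluster size -- always a whole number:
for hosts in range(1, 8):
    print(hosts, shard_count // hosts)
```

The same reasoning explains the current layout on a smaller scale: six large shards split evenly across two hosts at three shards each, since 2 divides 6.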