Thanks, Erick. I appreciate the sanity check.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, May 28, 2015 5:50 PM
To: solr-user@lucene.apache.org
Subject: Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting
Charles:

You raise good points, and I didn't mean to say that co-locating docs by some criterion is never a good idea. That said, it does add administrative complexity that I'd prefer to avoid unless necessary.

I suppose it largely depends on what the load and response SLAs are. If there's 1 query/second peak load, the sharding overhead for queries is probably not noticeable. If there are 1,000 QPS, then it might be worth it. Measure, measure, measure...

I think your composite ID understanding is fine.

Best,
Erick

On Thu, May 28, 2015 at 1:40 PM, Reitzel, Charles <charles.reit...@tiaa-cref.org> wrote:
> We have used a similar sharding strategy for exactly the reasons you say.
> But we are fairly certain that the # of documents per user ID is < 5,000
> and, typically, < 500. Thus, we think the overhead of distributed searches
> clearly outweighs the benefits. Would you agree? We have done some load
> testing (with hundreds of simultaneous users) and performance has been
> good, with data and queries distributed evenly across shards.
>
> In Matteo's case, this model appears to apply well to user types B and C.
> Not sure about user type A, though. At > 100,000 docs per user per year,
> on average, that load seems OK for one node. But is it enough to benefit
> significantly from a parallel search?
>
> With a 2-part composite ID, each part contributes 16 bits to a 32-bit
> hash value, which is then compared to the hash range of each active
> shard. Since the user ID contributes the high-order bits, it dominates
> in matching the target shard(s). But dominance doesn't mean the
> low-order 16 bits are always ignored, does it? I.e., if the original
> shard has been split, perhaps multiple times, isn't it possible that one
> user ID's documents will be spread over multiple shards?
>
> In Matteo's case, it might make sense to allocate fewer bits to the user
> ID for user category A. I.e., what I described above is the default for
> userId!docId.
> But if you use userId/8!docId/24 (8 bits for the user ID and 24 bits for
> the document ID), then couldn't one user's docs be split over multiple
> shards, even without splitting?
>
> I'm just making sure I understand how composite ID sharding works.
> Have I got it right? Has any of this logic changed in 5.x?
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, May 21, 2015 11:30 AM
> To: solr-user@lucene.apache.org
> Subject: Re: optimal shard assignment with low shard key cardinality
> using compositeId to enable shard splitting
>
> I question your base assumption:
>
> bq: So shard by document producer seems a good choice
>
> Because what this _also_ does is force all of the work for a query onto
> one node, and likewise all indexing for a particular producer. And it
> will cause you to manually monitor your shards to see if some of them
> grow out of proportion to the others. And....
>
> I think it would be much less hassle to just let Solr distribute the docs
> as it sees fit based on the uniqueKey and forget about it, unless you
> want, say, to do joins, etc. There will, of course, be some overhead that
> you pay here, but unless you can measure it and it's a pain, I wouldn't
> add the complexity you're talking about, especially at the volumes you're
> talking about.
>
> Best,
> Erick
>
> On Thu, May 21, 2015 at 3:20 AM, Matteo Grolla <matteo.gro...@gmail.com>
> wrote:
>> Hi,
>> I'd like some feedback on how I plan to solve the following sharding
>> problem.
>>
>> I have a collection that will eventually become big.
>>
>> Average document size is 1.5 KB.
>> Every year, 30 million documents will be indexed.
>>
>> Data come from different document producers (a person, owner of his
>> documents), and queries are almost always performed by a document
>> producer, who can only query his own documents.
>> So sharding by document producer seems a good choice.
>>
>> There are 3 types of doc producer:
>> type A: cardinality 105 (there are 105 producers of this type),
>> producing 17M docs/year (the aggregate production of all type A producers)
>> type B: cardinality ~10k, producing 4M docs/year
>> type C: cardinality ~10M, producing 9M docs/year
>>
>> I'm thinking about using compositeId ( solrDocId = producerId!docId )
>> to send all docs of the same producer to the same shard. When a shard
>> becomes too large I can use shard splitting.
>>
>> Problems:
>> - documents from type A producers could be oddly distributed among
>> shards, because hashing doesn't distribute well over small numbers
>> (105); see the Appendix.
>>
>> As a solution, I could do this when a new type A producer (producerA1)
>> arrives:
>>
>> 1) client app: generate a producer code
>> 2) client app: simulate murmur hashing and shard assignment
>> 3) client app: check that the shard assignment is optimal (the producer
>> code is assigned to the shard with the fewest type A producers);
>> otherwise go to 1) and try another code
>>
>> When I add documents or perform searches for producerA1, I use its
>> producer code in the compositeId or in the _route_ parameter,
>> respectively. What do you think?
>>
>> -----------Appendix: murmurhash shard assignment simulation-----------
>>
>> import mmh3
>>
>> hashes = [mmh3.hash(str(i)) >> 16 for i in range(105)]
>>
>> num_shards = 16
>> shards = [0] * num_shards
>>
>> for h in hashes:
>>     idx = h % num_shards
>>     shards[idx] += 1
>>
>> print(shards)
>> print(sum(shards))
>>
>> -------------
>>
>> result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]
>>
>> So with 16 shards and 105 shard keys, I can end up with shards holding
>> 1 key and shards holding 11 keys.
>
> **************************************************************************
> This e-mail may contain confidential or privileged information.
> If you are not the intended recipient, please notify the sender
> immediately and then delete it.
>
> TIAA-CREF
> **************************************************************************
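The bit-allocation question discussed in the thread (userId!docId vs. userId/8!docId/24) can be sketched with a small simulation. This is a toy approximation, not Solr's actual CompositeIdRouter: Solr uses MurmurHash3, while zlib.crc32 serves here as a self-contained stdlib stand-in, so the exact shard numbers differ from Solr's; the `A-<n>` producer-code scheme and the helper names are likewise made up for illustration. The bit arithmetic mirrors the idea, though: the shard-key part supplies the high-order bits of the 32-bit routing hash.

```python
import zlib

def composite_hash(user_id: str, doc_id: str, user_bits: int = 16) -> int:
    """Top `user_bits` of the 32-bit routing hash come from the user ID,
    the remaining bits from the document ID (i.e. userId/bits!docId)."""
    doc_bits = 32 - user_bits
    user_part = zlib.crc32(user_id.encode()) >> doc_bits << doc_bits
    doc_part = zlib.crc32(doc_id.encode()) & ((1 << doc_bits) - 1)
    return user_part | doc_part

def shard_for(h: int, num_shards: int) -> int:
    # Equal-sized shards each own one slice of the 32-bit hash ring, so
    # the shard index is decided by the top log2(num_shards) bits.
    return h * num_shards >> 32

user = "producerA1"
docs = [f"doc{i}" for i in range(1000)]

# userId!docId, 16 shards: the 16 user bits cover the top 4
# shard-choosing bits, so every doc of this user lands on one shard.
print(len({shard_for(composite_hash(user, d), 16) for d in docs}))  # 1

# userId/8!docId/24, 16 shards: 8 user bits still cover the top 4 bits,
# so the user is still pinned to a single shard.
print(len({shard_for(composite_hash(user, d, user_bits=8), 16) for d in docs}))  # 1

# userId/8!docId/24, 512 shards: now the top 9 bits pick the shard, the
# 9th comes from the doc ID, and one user's docs can span two shards.
print(len({shard_for(composite_hash(user, d, user_bits=8), 512) for d in docs}))

# Matteo's client-side balancing idea (steps 1-3): try candidate producer
# codes until one routes to the shard currently holding the fewest
# type-A producers.
def pick_producer_code(shard_counts):
    num_shards = len(shard_counts)
    target = shard_counts.index(min(shard_counts))
    for attempt in range(10_000):  # bound the retries
        code = f"A-{attempt}"
        if shard_for(composite_hash(code, ""), num_shards) == target:
            return code
    raise RuntimeError("no suitable producer code found")

print(pick_producer_code([4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]))
```

Under this model, Charles's "even without splitting?" question comes out as: with equal, power-of-two-aligned shard ranges, a /8 shard key only spreads one user across shards once a shard's range is narrower than the user's 1/256 slice of the hash ring, i.e. after enough splits or with more than 256 shards.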