On 6/16/2011 4:41 PM, Mari Masuda wrote:
One reservation I have is that eventually we would like to be able to type in "Iraq" and 
find records across all of the collections at once instead of having to search each collection 
separately.  Although I don't know anything about it at this stage, I did Google 
"sharding" after reading someone's recent post on this list and it sounds like that may 
be a potential answer to my question.

So this kind of stuff can be tricky, but with that eventual requirement I would NOT put these in separate cores. Sharding isn't (IMO, if someone disagrees, they will hopefully say so!) a good answer to searching across entirely different 'schemas', or to avoiding frequent-commit issues -- sharding is really just for scaling/performance when your index gets very, very large. (Which it doesn't sound like yours will be, but you can deal with that as a separate issue if it becomes one.)

If you're going to want to search across all the collections, put them all in the same core, either in the exact same indexed fields or at least sharing certain common indexed fields -- those common fields are the ones you'll be able to search across all collections on. It's okay if some collections have unique indexed fields too: documents that don't belong to a given collection simply won't have any terms in a field only that collection uses, no problem. (Then you can distribute this single core into shards later if you need to for performance reasons related to number of documents/size of index.)
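To make that concrete, here's a rough sketch of what the shared schema.xml fields might look like -- the field names (collection, title, text, interview_date) are just made-up examples for illustration, not anything specific to your setup:

    <!-- one field saying which collection a doc belongs to -->
    <field name="collection" type="string" indexed="true" stored="true"/>

    <!-- common fields used by every collection; these are searchable across all of them -->
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="text"  type="text" indexed="true" stored="true"/>

    <!-- a field only one collection uses; docs from other collections just omit it -->
    <field name="interview_date" type="date" indexed="true" stored="true"/>

Then a query like q=Iraq (against whatever your default search field is) searches everything at once, and adding a filter query like fq=collection:interviews limits that same search to a single collection when you want that.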

You're right to be thinking about the fact that very frequent commits can be a performance issue in Solr. But separating the collections into different cores is going to create more problems for you (if you want to be able to search across all collections) in an attempt to solve that one. (Among other things, not every Solr feature works in a distributed/sharded environment; it's just a more complicated and somewhat less mature setup for Solr.)

The way I deal with the frequent-commit issue is by NOT doing frequent commits to my production Solr. Instead, I use Solr replication to have a 'master' Solr index that I commit to whenever I want, and a 'slave' Solr index that serves the production searches and only replicates from the master periodically -- not so often that the slave itself ends up with too-frequent commits. That seems to be a fairly common solution, if that use pattern works for you.
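For reference, that setup is just the standard ReplicationHandler config in solrconfig.xml; the host name and the 15-minute poll interval below are made-up examples, adjust to whatever staleness you can live with:

    On the master:
      <requestHandler name="/replication" class="solr.ReplicationHandler">
        <lst name="master">
          <str name="replicateAfter">commit</str>
          <str name="confFiles">schema.xml,stopwords.txt</str>
        </lst>
      </requestHandler>

    On the slave:
      <requestHandler name="/replication" class="solr.ReplicationHandler">
        <lst name="slave">
          <str name="masterUrl">http://your-master-host:8983/solr/replication</str>
          <str name="pollInterval">00:15:00</str>
        </lst>
      </requestHandler>

The slave only sees index changes when it polls and pulls a new snapshot, so commits on the master can be as frequent as you like without the cache-warming churn hitting your production searchers.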

There are also some "near real time" features in more recent versions of Solr that I'm not very familiar with. (Not sure if any are included in the current latest release, or if they are all still only in the repo.) My sense is that they too only work for certain use patterns; they aren't magic bullets for "commit whatever you want as often as you want to Solr". In general Solr isn't so great at very frequent major changes to the index. Depending on exactly what sort of use pattern you are predicting/planning for your commits, maybe people can give you advice on how (or whether) to do it.

But I personally don't think your idea of splitting collections (that you'll eventually want to search across in a single search) into separate shards is a good solution to frequent-commit issues. You'd be complicating your setup and causing other problems for yourself, and not even really addressing the too-frequent-commit issue with that setup.
