Hi all, I'm new here. I've used Solr on a couple of projects before, but didn't need to dive deep into anything until now. These days I'm doing a spike for a "yellow pages" type search server with the following technical requirements:
~10 million listings in the database. A listing has a name, address, description, coordinates and a number of tags / filtering fields; no more than a kilobyte all told, i.e. theoretically the whole thing should fit in RAM without sharding.

A typical query is either "all text matches on name and/or description within a bounding box" or "some combination of tag matches within a bounding box". Bounding boxes are 1 to 50 km wide and contain up to 10^5 unfiltered listings (the average is more like 10^3). More than 50% of all the listings are in the frequently requested bounding boxes, but a vast majority of listings are almost never displayed (because they don't match the other filters).

Data "never changes", i.e. there is a daily batch update; a rebuild of the entire index and a restart of all search servers is feasible, as long as it takes minutes, not hours.

This thing should ideally serve up to 10^3 requests per second on a small (as in "fewer than 10 commodity boxes") cluster. In other words, a typical request should be CPU bound and take ~100-200 msec to process. Because the coordinates are almost never the same, caching of queries makes no sense; from what little I understand about Lucene internals, caching of filters probably doesn't make sense either.

After perusing the documentation and some googling (but almost no source code exploration yet), I understand what the schema and the queries will look like, and now have to figure out a specific configuration that fits the performance/scalability requirements. Here is what I'm thinking:

1. The search server is an internal service that uses embedded Solr for the indexing part, with RAMDirectoryFactory as index storage (a rough sketch of what I have in mind is in the P.S. below).
2. All data lives in some sort of persistent storage on a file system and is loaded into memory when a search server starts up.
3. Data updates are handled as "update the persistent storage, start another cluster, load the world into RAM, flip the load balancer, kill the old cluster".
4. Solr returns IDs with relevance scores; the actual presentations of listings (as JSON documents) are constructed outside of Solr and cached in Memcached as mostly static content with a few templated bits, like <distance><%=DISTANCE_TO(-123.0123, 45.6789) %>.
5. All Solr caching is switched off.

Obviously, we are not the first people to do something like this with Solr, so I'm hoping for some collective wisdom on the following:

- Does this sound like a feasible set of requirements in terms of performance and scalability for Solr?
- Are we on the right path to solving this problem well? If not, what should we be doing instead?
- What nasty technical/architectural gotchas are we probably missing at this stage?

One particular piece of advice I'd be really happy to hear is "you may not need RAMDirectoryFactory if you use <some combination of fast distributed file system and caching> instead". Also, is there a blog, wiki page or mailing list thread where a similar problem is discussed? Yes, we have seen http://www.ibm.com/developerworks/opensource/library/j-spatial; it's a good introduction, but it's outdated and doesn't go into the nasty bits anyway.

Many thanks in advance,
-- Alex Verkhovsky
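
P.S. To make point 1 a bit more concrete, here is roughly what I have in mind for the embedded Solr wrapper. This is a minimal, untested sketch: the bootstrap follows the EmbeddedSolrServer example on the Solr wiki for the version we're currently trying, the field names ("location", "name", "description") and the {!bbox} spatial syntax are assumptions that depend on the schema and spatial support we end up with, and the class itself is just a placeholder.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.core.CoreContainer;

public class ListingSearcher {

    private final SolrServer solr;

    public ListingSearcher(String solrHome) throws Exception {
        // Embedded Solr bootstrap, per the EmbeddedSolrServer wiki example.
        // solrconfig.xml for this core would point <directoryFactory> at
        // solr.RAMDirectoryFactory, so the whole index lives in RAM.
        System.setProperty("solr.solr.home", solrHome);
        CoreContainer container = new CoreContainer.Initializer().initialize();
        this.solr = new EmbeddedSolrServer(container, "");  // default core
    }

    /**
     * Text match on name/description, restricted to a bounding box around a point.
     * Only IDs and scores come back; presentation is assembled outside Solr.
     */
    public QueryResponse search(String text, double lat, double lon, double radiusKm)
            throws Exception {
        SolrQuery q = new SolrQuery();
        q.setQuery("name:(" + text + ") OR description:(" + text + ")");
        // "location" is assumed to be a lat/lon field in the schema; {!bbox}
        // filters on the bounding box enclosing a circle of radius d (in km).
        q.addFilterQuery("{!bbox sfield=location pt=" + lat + "," + lon
                + " d=" + radiusKm + "}");
        q.setFields("id", "score");
        q.setRows(50);
        return solr.query(q);
    }
}

The intent is that Solr only ever hands back (id, score) pairs for the filtered box, and everything presentation-related (point 4) happens outside of it.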