Hi, all,

I'm new here. I've used Solr on a couple of projects before, but never needed to
dive deep into anything until now. These days, I'm doing a spike for a
"yellow pages" type search server with the following technical requirements:

~10 million listings in the database. A listing has a name, address,
description, coordinates, and a number of tags / filtering fields; no more
than a kilobyte all told, i.e. theoretically the whole thing should fit in
RAM without sharding. A typical query is either "all text matches on name
and/or description within a bounding box" or "some combination of tag
matches within a bounding box". Bounding boxes are 1 to 50 km wide and
contain up to 10^5 unfiltered listings (the average is more like 10^3).
More than 50% of all listings are in the frequently requested bounding
boxes; however, the vast majority of listings are almost never displayed
(because they don't match the other filters).
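
To make that concrete, here is roughly what I picture a listing and a typical query looking like in SolrJ. The field names (id, name, description, tags, location) and the {!bbox} filter are just my assumptions at this stage, not a tested schema:

    // Sketch only; field names and types are assumptions, not a working schema.
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.common.SolrInputDocument;

    public class ListingSketch {

        // A listing document, roughly as described above.
        static SolrInputDocument listing() {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "listing-42");
            doc.addField("name", "Joe's Coffee");
            doc.addField("description", "Espresso, pastries, free wifi");
            doc.addField("tags", "coffee");                 // multivalued tag field
            doc.addField("tags", "wifi");
            doc.addField("location", "45.6789,-123.0123");  // "lat,lon" in a spatial field
            return doc;
        }

        // "All text matches on name and/or description within a bounding box".
        static SolrQuery typicalQuery() {
            SolrQuery q = new SolrQuery("name:coffee OR description:coffee");
            // {!bbox} filters to a box around pt; d is roughly the box radius in km.
            q.addFilterQuery("{!bbox sfield=location pt=45.6789,-123.0123 d=25}");
            q.setFields("id", "score"); // we only need IDs and scores back
            return q;
        }
    }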

Data "never changes" (i.e., a daily batch update; rebuild of the entire
index and restart of all search servers is feasible, as long as it takes
minutes, not hours). This thing ideally should serve up to 10^3 requests
per second on a small (as in, "less than 10 commodity boxes") cluster. In
other words, a typical request should be CPU bound and take ~100-200 msec
to process. Because of coordinates (that are almost never the same),
caching of queries makes no sense; from what little I understand about
Lucene internals, caching of filters probably doesn't make sense either.
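
To spell out the caching argument: two users a block apart produce bounding-box filters that differ only in their pt coordinates, and the filterCache is (as far as I can tell) keyed on the filter query, so it would essentially never get a hit. A trivial illustration:

    public class FilterCacheIllustration {
        public static void main(String[] args) {
            // Two requests from users ~100 m apart: same logical filter, different coordinates.
            String fqUserA = "{!bbox sfield=location pt=45.6789,-123.0123 d=10}";
            String fqUserB = "{!bbox sfield=location pt=45.6795,-123.0131 d=10}";
            // Distinct filter queries mean distinct cache entries, i.e. a miss every time.
            System.out.println(fqUserA.equals(fqUserB)); // false
        }
    }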

After perusing the documentation and doing some googling (but almost no
source code exploration yet), I understand what the schema and the queries
will look like, and now have to figure out a specific configuration that
fits the performance/scalability requirements. Here is what I'm thinking:

1. The search server is an internal service that uses embedded Solr for the
indexing part, with RAMDirectoryFactory as the index storage (a rough wiring
sketch follows this list).
2. All data lives in some sort of persistent storage on a file system and is
loaded into memory when a search server starts up.
3. Data updates are handled as "update the persistent storage, start
another cluster, load the world into RAM, flip the load balancer, kill the
old cluster".
4. Solr returns IDs with relevance scores; the actual presentations of
listings (as JSON documents) are constructed outside of Solr and cached in
Memcached as mostly static content with a few templated bits, like
<distance><%=DISTANCE_TO(-123.0123, 45.6789) %>.
5. All Solr caching is switched off.
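
For points 1, 4 and 5, this is roughly the wiring I have in mind. A sketch only: it assumes a core named "listings" whose solrconfig.xml declares <directoryFactory name="DirectoryFactory" class="solr.RAMDirectoryFactory"/>, and it uses the Solr 1.4/3.x style of bootstrapping the CoreContainer (the exact API seems to differ between versions):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.core.CoreContainer;

    public class EmbeddedSearchSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical Solr home containing solr.xml and the "listings" core.
            System.setProperty("solr.solr.home", "/path/to/solr/home");
            CoreContainer.Initializer initializer = new CoreContainer.Initializer();
            CoreContainer cores = initializer.initialize();
            EmbeddedSolrServer solr = new EmbeddedSolrServer(cores, "listings");

            // "Some combination of tag matches within a ~10 km box"; ask for IDs and scores only.
            SolrQuery q = new SolrQuery("tags:coffee");
            q.addFilterQuery("{!bbox sfield=location pt=45.6789,-123.0123 d=10}");
            q.setFields("id", "score");
            q.setRows(100);

            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                String id = (String) doc.getFieldValue("id");
                Float score = (Float) doc.getFieldValue("score");
                // The JSON presentation for this id would be assembled outside Solr
                // and cached in Memcached; here we just print the pair.
                System.out.println(id + " -> " + score);
            }

            cores.shutdown();
        }
    }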

Obviously, we are not the first people to do something like this with Solr,
so I'm hoping for some collective wisdom on the following:

Does this sound like a feasible set of requirements in terms of
performance and scalability for Solr? Are we on the right path to solving
this problem well? If not, what should we be doing instead? What nasty
technical/architectural gotchas are we probably missing at this stage?

One particular piece of advice I'd be really happy to hear is "you may not
need RAMDirectoryFactory if you use <some combination of fast distributed
file system and caching> instead".

Also, is there a blog, wiki page, or mailing list thread where a similar
problem is discussed? Yes, we have seen
http://www.ibm.com/developerworks/opensource/library/j-spatial; it's a good
introduction, but it's outdated and doesn't go into the nasty bits anyway.

Many thanks in advance,
-- Alex Verkhovsky
