On 4-Dec-07, at 11:52 AM, [EMAIL PROTECTED] wrote:

Any suggestions are helpful to me, even general ones. Here is the info from my index:
How big is the index on disk (the most important files are .frq,
and .prx if you do phrase queries)?
- Total index folder size is 30.7 Gb
- .frq is 12.2 Gb
- .prx is 6 Gb

Getting a good chunk of .frq into the OS disk cache is key; .prx too if you are doing phrase queries. With .fdt uncached, you can expect a disk seek per doc retrieved, but this may not be too bad (10 docs * ~10ms = 100ms overhead).

How big and what exactly is a record in your system?
- A record is a document with 100 fields indexed and 10 of them stored. Approximately 60% of the fields contain data.

How big is each record, in terms of # of tokens?

Do you do faceting/sorting?
- Yes, I'm planning to do both.

This is probably the biggest issue. Each sort field will occupy 400 MB+ of RAM. Each value in a multivalued facet field will occupy ~13 MB. Each single-valued facet field will also occupy 400 MB+ (the same 400 MB if it is also used for sorting).
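For what it's worth, those figures are consistent with one 32-bit FieldCache entry per document per sort field, and roughly one bit per document per multivalued facet value. A back-of-envelope sketch (the per-entry sizes are my assumption; string sort fields cost more on top, since the string values themselves are also held in memory):

```python
# Rough FieldCache sizing for a 110M-doc index (assumed costs, not measured):
# - sort field: ~4 bytes (one 32-bit entry) per document
# - multivalued facet value: ~1 bit per document (a bitset over all docs)
num_docs = 110_000_000

per_sort_field_mb = num_docs * 4 / 1024**2   # 4 bytes per doc
per_facet_value_mb = num_docs / 8 / 1024**2  # 1 bit per doc

print(round(per_sort_field_mb))   # ~420 MB per sort field
print(round(per_facet_value_mb))  # ~13 MB per multivalued facet value
```

So with only a handful of sort/facet fields you can burn through several gigabytes of heap before any of the index itself is cached.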

You'll have to keep an eye on the status of the Solr fieldCache to make sure you aren't evicting entries needlessly.

How much memory do you have?
- I do have 8Gb of RAM I could get up to 16Gb

What does a typical query look like?
- I don't know yet. We are in prototype mode, trying everything possible. In general we get results in under a second, but some queries take long, for example TOWN:L*. I know this is a very broad query, and probably the worst one, but we could need such queries, for example to count the towns whose names start with "L". The cache helps a little: after this query, running TOWN:La* returns in milliseconds. What puzzles me is this: if I run a query like TOWN:L* OR STREET:S*, I'd guess it should cache all the data for that set. Yet if I then run just TOWN:L*, which is a subset of the first query, it still takes time to get the result back, as if it's not cached.

OR queries aren't explicitly cached in parts, though you should benefit from the OS disk cache in that scenario.

If you have very broad wildcard queries, it is usually better to factor these out. For instance, you could index a TOWN_FIRSTLETTER field which pre-computes these disjunctions. This would also allow you to get the quantity using TermDocs/LukeRequestHandler.
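As a hypothetical sketch of that pre-computation (the TOWN_FIRSTLETTER field name comes from the suggestion above; the helper function is made up for illustration), you could derive the extra field while building each document, before posting it to Solr:

```python
# Hypothetical index-time enrichment: store the first letter of TOWN
# in a separate field, so broad wildcards like TOWN:L* become cheap
# single-term queries (TOWN_FIRSTLETTER:L).
def add_first_letter(doc):
    town = doc.get("TOWN")
    if town:
        doc["TOWN_FIRSTLETTER"] = town[0].upper()
    return doc

doc = add_first_letter({"TOWN": "Lexington", "STREET": "Main St"})
print(doc["TOWN_FIRSTLETTER"])  # L
```

The same trick applies to STREET or any other field you expect to hit with leading-prefix wildcards.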

Without indexing any other fields, if you just want the _quantity_ of towns that start with L, it is possible to do so using faceting: facet.field=TOWN, f.TOWN.facet.prefix=L, then add up all the individual term counts (or, again, using LukeRequestHandler).
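A minimal sketch of the faceting approach (the request parameters are standard Solr faceting params; the response fragment and its counts are illustrative, and I'm assuming the flat term/count list shape you get from the JSON response writer):

```python
# Count towns starting with "L" via facet prefix counts, without
# running the TOWN:L* wildcard query itself.
params = {
    "q": "*:*",
    "rows": 0,                    # we only want the facet counts
    "facet": "true",
    "facet.field": "TOWN",
    "f.TOWN.facet.prefix": "L",   # restrict facet terms to the L prefix
    "facet.limit": -1,            # return all matching terms
}

# Illustrative facet_counts fragment: terms and counts alternate.
facet_counts = ["Lexington", 1200, "Lincoln", 800, "Lowell", 450]

total = sum(facet_counts[1::2])  # every second element is a count
print(total)  # 2450
```

Note this counts term occurrences across documents; if you want the number of *distinct* towns, the length of the term list (`len(facet_counts) // 2`) is what you're after.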

Ultimately, I think that it will be relatively hard to get sub-second performance on such an index on a single box, but it may be possible if you structure your queries intelligently. Definitely go for 16 gigs.

cheers,
-Mike


----- Original Message ----
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, December 4, 2007 2:33:24 PM
Subject: Re: Cache use

On 4-Dec-07, at 8:43 AM, Evgeniy Strokin wrote:

Hello,...
we have a 110M-record index under Solr. Some queries take a while,
but we need sub-second results. I guess the only solution is caching
(something else?)...
We use the standard LRUCache. The docs say (as far as I understood)
that it loads a view of the index into memory and next time works
with memory instead of the hard drive.
So, my question: hypothetically, we could have the whole index in
memory if we had enough memory, right? In that case the results
should come back very fast. We have very rare updates. So I think
this could be a solution.

How big is the index on disk (the most important files are .frq,
and .prx if you do phrase queries)?  How big and what exactly is a
record in your system?  Do you do faceting/sorting?  How much memory
do you have?  What does a typical query look like?

Performance is a tricky subject.  It is hard to give any kind of
useful answer that applies in general.  The one thing I can say is
that 110M is a _lot_ of docs for one system, especially if these are
normal-sized documents.

regards,
-Mike
