On 4-Dec-07, at 11:52 AM, [EMAIL PROTECTED] wrote:
Any suggestions are helpful to me, even general ones. Here is the info
from my index:
How big is the index on disk (the most important files are .frq,
and .prx if you do phrase queries)?
- Total index folder size is 30.7 GB
- .frq is 12.2 GB
- .prx is 6 GB
Getting a good chunk of .frq in the OS disk cache is key; .prx too
if you are doing phrase queries. With .fdt uncached, you can expect
a disk seek per doc retrieved, but this may not be too bad (10 docs *
~10ms = 100ms overhead).
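A rough back-of-the-envelope (assuming, say, a 2GB JVM heap; adjust
for your actual setup):

  8GB RAM  - 2GB heap = ~6GB  of OS cache -> about half of the 12.2GB .frq
  16GB RAM - 2GB heap = ~14GB of OS cache -> essentially all of .frq plus .prx

so extra memory mostly goes toward keeping .frq/.prx hot.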
How big and what exactly is a record in your system?
- a record is a document with 100 fields indexed and 10 of them
stored. Approximately 60% of the fields contain data.
How big is each record, in terms of # of tokens?
Do you do faceting/sorting?
- Yes, I'm planning to do both.
This is probably the biggest issue. Each sort field will occupy
400MB+ of RAM. Each value in a multi-valued facet field will occupy
~13MB. Each single-valued facet field will also occupy 400MB+ (the
same 400MB if it is also used for sorting).
You'll have to keep an eye on the state of the Solr fieldCache to
make sure you aren't evicting entries needlessly.
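Where those numbers come from, roughly: sorting (and single-valued
faceting) keeps one 32-bit FieldCache entry per document, while
faceting a multi-valued field uses one bitset per term value:

  110M docs * 4 bytes      ~= 440MB per sort / single-valued facet field
  110M docs / 8 bits/byte  ~= 13.75MB per term value of a multi-valued facet field

plus the string values themselves for the sorted field.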
How much memory do you have?
- I have 8GB of RAM; I could get up to 16GB.
What does a typical query look like?
- I don't know yet. We are in prototype mode and trying everything
possible. In general we are able to get results in sub-second time,
but some queries take long, for example TOWN:L*. I know this is a
very broad query, and probably the worst one, but we could need such
queries to get the quantity of towns with names starting with "L",
for example. The cache helps a little: after this query, if I run
TOWN:La* I get the result in milliseconds.
But what puzzles me is this: if I run a query like TOWN:L* OR
STREET:S*, I'm guessing it should cache all the data for this set.
Yet if I then run just TOWN:L*, which is a subset of the first query,
it still takes time to get the result back, as if it weren't cached.
OR queries aren't explicitly cached in parts, though you should
benefit from the OS disk cache in that scenario.
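(One option, if you do want each clause cached on its own, is to move
the clauses into filter queries, e.g.

  q=*:*&fq=TOWN:L*&fq=STREET:S*

since each fq is cached separately in the filterCache and can be
reused by later queries; note, though, that multiple fq parameters
are intersected rather than OR'd, so this only fits some use cases.)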
If you have very broad wildcard queries, it is usually better to
factor these out. For instance, you could index a TOWN_FIRSTLETTER
field which pre-computes these disjunctions. This would also allow
you to get the quantity using TermDocs/LukeRequestHandler.
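A minimal sketch of that idea (hypothetical field name; it would be
filled in by your indexing code): in schema.xml add

  <field name="TOWN_FIRSTLETTER" type="string" indexed="true" stored="false"/>

set it to the first letter of TOWN when building each document, and
then a count-only query such as

  q=TOWN_FIRSTLETTER:L&rows=0

gives you the quantity straight from numFound, with no wildcard
expansion at query time.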
Without indexing any other fields, if you just want the _quantity_ of
towns that start with L, it is possible to do so using faceting:
facet.field=TOWN, f.TOWN.facet.prefix=L, then add up all the
individual term counts (or, again, using LukeRequestHandler).
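Roughly, that request could look like:

  q=*:*&rows=0&facet=true&facet.field=TOWN&f.TOWN.facet.prefix=L&facet.limit=-1

(facet.limit=-1 so no terms are cut off); you get back one count per
TOWN term beginning with "L", which you then add up as described above.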
Ultimately, I think that it will be relatively hard to get sub-second
performance on such an index on a single box, but it may be possible
if you structure your queries intelligently. Definitely go for 16 gigs.
cheers,
-Mike
----- Original Message ----
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, December 4, 2007 2:33:24 PM
Subject: Re: Cache use
On 4-Dec-07, at 8:43 AM, Evgeniy Strokin wrote:
Hello,...
we have a 110M-record index under Solr. Some queries take a while,
but we need sub-second results. I guess the only solution is a cache
(something else?)...
We use the standard LRUCache. The docs say (as far as I understood)
that it loads a view of the index into memory and from then on works
from memory instead of the hard drive.
So, my question: hypothetically, we could have the whole index in
memory if we had enough memory, right? In that case the results
should come back very fast. We have very rare updates, so I think
this could be a solution.
How big is the index on disk (the most important files are .frq,
and .prx if you do phrase queries)? How big and what exactly is a
record in your system? Do you do faceting/sorting? How much memory
do you have? What does a typical query look like?
Performance is a tricky subject. It is hard to give any kind of
useful answer that applies in general. The one thing I can say is
that 110M is a _lot_ of docs for one system, especially if these are
normal-sized documents.
regards,
-Mike