On 4-Dec-07, at 11:52 AM, [EMAIL PROTECTED] wrote:
Any suggestions are helpful to me, even general ones. Here is the info
from my index:
How big is the index on disk (the most important files are .frq,
and .prx if you do phrase queries)?
- Total index folder size is 30.7 GB
- .frq is 12.2 GB
- .prx is 6 GB
Getting a good chunk of .frq in the OS disk cache is key; .prx too
if you are doing phrase queries. With .fdt uncached, you can expect
a disk seek per doc retrieved, but this may not be too bad (10 docs *
~10ms = 100ms overhead).
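A rough back-of-the-envelope (assuming, say, a 2GB JVM heap; adjust
for your actual setup):

  8GB RAM  - 2GB heap = ~6GB  of OS cache -> about half of the 12.2GB .frq
  16GB RAM - 2GB heap = ~14GB of OS cache -> essentially all of .frq plus .prx

so extra memory mostly goes toward keeping .frq/.prx hot.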
How big and what exactly is a record in your system?
- a record is a document with 100 fields indexed and 10 of them
stored. Approximately 60% of the fields contain data.
How big is each record, in terms of # of tokens?
Do you do faceting/sorting?
- Yes, I'm planning to do both.
This is probably the biggest issue. Each sort field will occupy
400MB+ of RAM. Each value in a multi-valued facet field will occupy
~13MB. Each single-valued facet field will also occupy 400MB+ (the
same 400MB if it is also used for sorting).
You'll have to keep an eye on the state of the Solr fieldCache to
make sure you aren't evicting entries needlessly.
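Where those numbers come from, roughly: sorting (and single-valued
faceting) keeps one 32-bit FieldCache entry per document, while
faceting a multi-valued field uses one bitset per term value:

  110M docs * 4 bytes      ~= 440MB per sort / single-valued facet field
  110M docs / 8 bits/byte  ~= 13.75MB per term value of a multi-valued facet field

plus the string values themselves for the sorted field.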
How much memory do you have?
- I have 8GB of RAM; I could get up to 16GB.
What does a typical query look like?
- I don't know yet. We are in prototype mode and trying everything
possible. In general we are able to get results in sub-second time,
but some queries take long, for example TOWN:L*. I know this is a
very broad query, and probably the worst one, but we could need such
queries to get the quantity of towns with names starting with "L",
for example. The cache helps a little: after this query, if I run
TOWN:La* I get the result in milliseconds.
But what puzzles me is this: if I run a query like TOWN:L* OR
STREET:S*, I'm guessing it should cache all the data for this set.
Yet if I then run just TOWN:L*, which is a subset of the first query,
it still takes time to get the result back, as if it weren't cached.
OR queries aren't explicitly cached in parts, though you should
benefit from the OS disk cache in that scenario.
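(One option, if you do want each clause cached on its own, is to move
the clauses into filter queries, e.g.

  q=*:*&fq=TOWN:L*&fq=STREET:S*

since each fq is cached separately in the filterCache and can be
reused by later queries; note, though, that multiple fq parameters
are intersected rather than OR'd, so this only fits some use cases.)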
If you have very broad wildcard queries, it is usually better to
factor these out. For instance, you could index a TOWN_FIRSTLETTER
field which pre-computes these disjunctions. This would also allow
you to get the quantity using TermDocs/LukeRequestHandler.
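A minimal sketch of that idea (hypothetical field name; it would be
filled in by your indexing code): in schema.xml add

  <field name="TOWN_FIRSTLETTER" type="string" indexed="true" stored="false"/>

set it to the first letter of TOWN when building each document, and
then a count-only query such as

  q=TOWN_FIRSTLETTER:L&rows=0

gives you the quantity straight from numFound, with no wildcard
expansion at query time.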
Without indexing any other fields, if you just want the _quantity_ of
towns that start with L, it is possible to do so using faceting:
facet.field=TOWN, f.TOWN.facet.prefix=L, then add up all the
individual term counts (or, again, using LukeRequestHandler).
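Roughly, that request could look like:

  q=*:*&rows=0&facet=true&facet.field=TOWN&f.TOWN.facet.prefix=L&facet.limit=-1

(facet.limit=-1 so no terms are cut off); you get back one count per
TOWN term beginning with "L", which you then add up as described above.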
Ultimately, I think that it will be relatively hard to get sub-second
performance on such an index on a single box, but it may be possible
if you structure your queries intelligently. Definitely go for 16 gigs.
cheers,
-Mike
----- Original Message ----
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, December 4, 2007 2:33:24 PM
Subject: Re: Cache use
On 4-Dec-07, at 8:43 AM, Evgeniy Strokin wrote:
Hello,...
we have a 110M-record index under Solr. Some queries take a while,
but we need sub-second results. I guess the only solution is a cache
(something else?)...
We use the standard LRUCache. The docs say (as far as I understood)
that it loads a view of the index into memory and from then on works
from memory instead of the hard drive.
So, my question: hypothetically, we could have the whole index in
memory if we had enough memory, right? In that case the results
should come back very fast. We have very rare updates, so I think
this could be a solution.
How big is the index on disk (the most important files are .frq,
and .prx if you do phrase queries)? How big and what exactly is a
record in your system? Do you do faceting/sorting? How much memory
do you have? What does a typical query look like?
Performance is a tricky subject. It is hard to give any kind of
useful answer that applies in general. The one thing I can say is
that 110M is a _lot_ of docs for one system, especially if these are
normal-sized documents.
regards,
-Mike