Hi Ganesh,

you might want to use something like this:

fq=access_control:(g1 g2 g5 g99 ...)

Then it's only one fq filter per request. Internally it works like an OR
condition, but in a more condensed form. I have already used this with up to
500 values without significant performance degradation (although in that case
it was the unique id field).

You should think a minute about your filter cache here. Since you only have one
fq filter per request, you won't blow your cache that fast. But whether you
should cache these filters at all depends on your use case. When it's common
for a single user to send several requests within one commit interval, or when
it's likely that several users are in the same groups, then just use it as
shown above. But when it's more likely that each request belongs to a different
user with different security settings, you should consider disabling the cache
for this fq filter so that your filter cache (for other filters you probably
have) isn't polluted:
fq={!cache=false}access_control:(g1 g2 g5 g99 ...). See
http://yonik.com/advanced-filter-caching-in-solr/ for more information.
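To make the idea concrete, here is a minimal Python sketch of assembling that single fq parameter (not from Michael's mail; the group names and the cache flag are illustrative assumptions):

```python
# Sketch: building the single-fq access filter described above.
# Group names are hypothetical; cache=False prepends {!cache=false}.
from urllib.parse import urlencode

def build_access_fq(groups, cache=True):
    # One fq with OR semantics over the groups: access_control:(g1 g2 g5 ...)
    prefix = "" if cache else "{!cache=false}"
    return prefix + "access_control:(" + " ".join(groups) + ")"

params = urlencode({
    "q": "somefield:value",
    "fq": build_access_fq(["g1", "g2", "g5", "g99"], cache=False),
})
# params can now be sent as a POST body to Solr's /select handler
```

The single parenthesized fq keeps it to one cache entry (or none, with cache=false) per request instead of one per group.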

-Michael



On 17.03.2017 at 07:46, Ganesh M wrote:
Hi Shawn / Michael,

Thanks for your replies and I guess you have got my scenarios exactly right.

Initially my document contains information about who has access to the
document, stored as a field like U1_s:true. If 100 users can access a
document, we will have 100 such fields, one for each user.
So when U1 wants to see all these documents, I will query for all
documents where U1_s:true.

If user U5 is added to group G1, then I have to take all the documents of
group G1 and set the information for user U5 in each document, like
U5_s:true. For this, I have to re-index all the documents in that group.

To avoid this, I was trying to keep group information instead of user
information, like G1_s:true, G2_s:true, in the document. And for querying a
user's documents, I will first get all the groups of user U1, and then query
for all documents where G1_s:true OR G2_s:true OR G3_s:true ...  This way we
don't need to re-index all the documents. But while querying I need to
OR together all the groups the user belongs to.

For how many ORs can Solr give the results in less than one second? Can I
pass hundreds of OR conditions in the Solr query? Will that affect the
performance?

Please share your valuable inputs.

On Thu, Mar 16, 2017 at 6:04 PM Shawn Heisey <apa...@elyograg.org> wrote:

On 3/16/2017 6:02 AM, Ganesh M wrote:
We have 1 million documents and would like to query with multiple fq
values.
We have kept the access_control (multi-valued) field which holds
information about which groups each document is accessible to.
Now to get the list of all the documents of a user, we would like to
pass multiple fq values (one for each group the user belongs to):

q:somefiled:value&fq:access_control:g1&fq:access_control:g2&fq:access_control:g3&fq:access_control:g4&fq:access_control:g5...
Like this, there could be 100 groups for a user.
The correct syntax is fq=field:value -- what you have there is not going
to work.

This might not do what you expect.  Filter queries are ANDed together --
*every* filter must match, which means that if a document that you want
has only one of those values in access_control, or has 98 of them but
not all 100, then the query isn't going to match that document.  The
solution is one filter query that can match ANY of them, which also
might run faster.  I can't say whether this is a problem for you or
not.  Your data might be completely correct for matching 100 filters.

Also keep in mind that there is a limit to the size of a URL that you
can send into any webserver, including the container that runs Solr.
That default limit is 8192 bytes, and includes the "GET " or "POST " at
the beginning and the " HTTP/1.1" at the end (note the spaces).  The
filter query information for 100 of the filters you mentioned is going
to be over 2K, which will fit in the default, but if your query has more
complexity than you have mentioned here, the total URL might not fit.
There's a workaround to this -- use a POST request and put the
parameters in the request body.
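As a rough illustration of that limit (a sketch, not part of Shawn's mail; the path and the 100-group list are made up), one can estimate the request-line size before deciding between GET and POST:

```python
# Sketch: estimate whether the full GET request line would exceed the
# typical 8192-byte default of servlet containers, per the description above.
from urllib.parse import urlencode

MAX_REQUEST_LINE = 8192  # common default; check your container's config

def needs_post(path, params):
    query = urlencode(params)
    # "GET <path>?<query> HTTP/1.1" -- the spaces count too, as noted above
    request_line = "GET " + path + "?" + query + " HTTP/1.1"
    return len(request_line.encode("utf-8")) > MAX_REQUEST_LINE

groups = ["g%d" % i for i in range(1, 101)]  # 100 hypothetical groups
params = {"q": "somefield:value",
          "fq": "access_control:(" + " ".join(groups) + ")"}
# With only these parameters the URL fits comfortably; a query with more
# clauses or longer group names might not, in which case POST is the way out.
```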

If we fire a query with 100 values in the fq, what's the penalty on
performance? Can we get the result in less than one second for 1 million
documents?

With one million documents, each internal filter query result is 125000
bytes -- the number of documents divided by eight, since the bitset holds
one bit per document.  That's 12.5 megabytes for 100 of them.  In addition,
every time a filter is run, it must examine every document in the index to
create that 125000-byte structure, which means that filters which *aren't*
found in the filterCache are relatively slow.  If they are found in the
cache, they're lightning fast, because the cache will contain the entire
125000-byte bitset.
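Spelling out that arithmetic as a small sketch (one bit per document, which is how a dense filter bitset is stored):

```python
# Sketch of the bitset arithmetic: one bit per document in the index.
def filter_bitset_bytes(num_docs):
    return num_docs // 8

per_filter = filter_bitset_bytes(1_000_000)   # 125000 bytes per cached filter
hundred_filters = 100 * per_filter            # 12500000 bytes, i.e. 12.5 MB
```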

If you make your filterCache large enough, it's going to consume a LOT
of java heap memory, particularly if the index gets bigger.  The nice
thing about the filterCache is that once the cache entries exist, the
filters are REALLY fast, and if they're all cached, you would DEFINITELY
be able to get results in under one second.  I have no idea whether the
same would happen when filters aren't cached.  It might.  Filters that
do not exist in the cache will be executed in parallel, so the number of
CPUs that you have in the machine, along with the query rate, will have
a big impact on the overall performance of a single query with a lot of
filters.

Also related to the filterCache, keep in mind that every time a commit
is made that opens a new searcher, the filterCache will be autowarmed.
If the autowarmCount value for the filterCache is large, that can make
commits take a very long time, which will cause problems if commits are
happening frequently.  On the other hand, a very small autowarmCount can
cause slow performance after a commit if you use a lot of filters.
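For reference, the filterCache and its autowarmCount are defined in solrconfig.xml; the fragment below is only an illustration with made-up numbers (and FastLRUCache as the class, which is typical for this Solr era), not a recommendation:

```
<!-- solrconfig.xml: illustrative values only; tune size and autowarmCount
     against your commit frequency and the variety of filters you use -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="32"/>
```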

My reply is longer and more dense than I had anticipated.  Apologies if
it's information overload.

Thanks,
Shawn


