Hi Scott,

When intersecting the two sets, Lucene has the advantage that both sets are
sorted. So Lucene can perform a merge join on the two sets in a single
pass. This turns out to be a very fast operation.
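
Just to illustrate the idea, here is a minimal sketch of a merge
intersection over two sorted doc ID lists (not Lucene's actual code):

    // Intersect two sorted doc ID lists in a single pass.
    // Each list is walked at most once; the cursors only move forward.
    static int[] intersectSorted(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;                  // advance whichever cursor is behind
            } else if (a[i] > b[j]) {
                j++;
            } else {
                out[n++] = a[i];      // doc ID present in both lists
                i++;
                j++;
            }
        }
        return java.util.Arrays.copyOf(out, n);
    }

Lucene's real implementation can also skip ahead (via advance()) rather
than stepping one doc at a time, which keeps this fast even when the
posting lists are long.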

Where you'll run into performance issues is when a huge number of
documents match the entire query and need to be scored/ranked. I
wouldn't start worrying about that, though, until you're dealing with
result sets in the millions of documents.

Joel




On Fri, Nov 8, 2013 at 8:43 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Have you tried this and measured or is this theoretical? Because
> filter queries are _designed_ for this kind of use case.
>
> bq: If the user has 100 documents, then finding the intersection
> requires checking each list ~100 times
>
> The cached fq is a bitset. Before checking each document,
> all that has to happen to "check the list" is index into the bitset and
> see if the bit is turned on. If it isn't, the document is bypassed.
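>
> In code terms it's roughly this (using java.util.BitSet just to
> illustrate; Solr/Lucene use their own bitset classes internally, and
> the names and numbers below are made up):
>
>     // Sketch: the cached fq is one bit per document in the index.
>     // maxDoc is the number of documents in the index.
>     static java.util.BitSet buildUserFilter(int maxDoc, int[] userDocIds) {
>         java.util.BitSet bits = new java.util.BitSet(maxDoc);
>         for (int doc : userDocIds) {
>             bits.set(doc);          // one bit per doc owned by this user
>         }
>         return bits;                // this is what the filterCache keeps
>     }
>
>     // While the main query runs, checking a candidate doc is a single
>     // O(1) bit lookup -- no scan over the user's doc list:
>     //   if (cachedBits.get(candidateDocId)) { ... score it ... }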
>
>
> The lots-of-cores solution has this drawback: the first time a query
> comes in for a particular core, the core may have to be loaded, which will
> be noticeably slow, so your users have to be able to tolerate first-time
> searches that take a bit of time. That said, test to see if it's
> "fast enough" before settling on the solution.
>
> But really, I'd bypass this and just try the filter query solution and
> measure. Because I'd be surprised if you really had performance
> issues here, assuming your filter queries are indeed cached and
> re-used.
>
> Best,
> Erick
>
>
> On Thu, Nov 7, 2013 at 7:02 PM, Scott Schneider <scott_schnei...@symantec.com> wrote:
>
> > Digging a bit more, I think I have answered my own questions.  Can someone
> > please say if this sounds right?
> >
> > http://wiki.apache.org/solr/LotsOfCores looks like a pretty good
> > solution.  If I give each user his own shard, each query can be run in only
> > one shard.  The effect of the filter query will basically be to find that
> > shard.  The requirements listed on the wiki suggest that performance will
> > be good.  But in Solr 3.x, this won't scale with the # users/shards.
> >
> > Prepending a user id to indexed keywords using an analyzer will break
> > wildcard search.  If there is a wildcard, the query analyzer doesn't run
> > filters, so it won't prepend the user id.  I could prepend the user id
> > myself before calling Solr, but that seems... bad.
> >
> > Scott
> >
> >
> >
> > > -----Original Message-----
> > > From: Scott Schneider [mailto:scott_schnei...@symantec.com]
> > > Sent: Thursday, November 07, 2013 2:03 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: fq efficiency
> > >
> > > Thanks, that link is very helpful, especially the section, "Leapfrog,
> > > anyone?"  This actually seems quite slow for my use case.  Suppose we
> > > have 10,000 users and 1,000,000 documents.  We search for "hello" for a
> > > particular user and let's assume that the fq set for the user is
> > > cached.  "hello" is a common word and perhaps 10,000 documents will
> > > match.  If the user has 100 documents, then finding the intersection
> > > requires checking each list ~100 times.  If the user has 1,000
> > > documents, we check each list ~1,000 times.  That doesn't scale well.
> > >
> > > My searches are usually in one user's data.  How can I take advantage
> > > of that?  I could have a separate index for each user, but loading so
> > > many indexes at once seems infeasible; and dynamically loading &
> > > unloading indexes is a pain.
> > >
> > > Or I could create a filter that takes tokens and prepends them with the
> > > user id.  That seems like a good solution, since my keyword searches
> > > always include a user id (and usually just 1 user id).  Though I wonder
> > > if there is a downside I haven't thought of.
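> > >
> > > Roughly what I have in mind, just as a sketch (the class name and the
> > > "_" separator are placeholders):
> > >
> > >     import java.io.IOException;
> > >     import org.apache.lucene.analysis.TokenFilter;
> > >     import org.apache.lucene.analysis.TokenStream;
> > >     import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> > >
> > >     /** Rewrites each token "foo" as "<userId>_foo". */
> > >     public final class UserPrefixTokenFilter extends TokenFilter {
> > >         private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
> > >         private final String prefix;
> > >
> > >         public UserPrefixTokenFilter(TokenStream input, String userId) {
> > >             super(input);
> > >             this.prefix = userId + "_";
> > >         }
> > >
> > >         @Override
> > >         public boolean incrementToken() throws IOException {
> > >             if (!input.incrementToken()) {
> > >                 return false;
> > >             }
> > >             String term = termAtt.toString();
> > >             termAtt.setEmpty().append(prefix).append(term);
> > >             return true;
> > >         }
> > >     }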
> > >
> > > Thanks,
> > > Scott
> > >
> > >
> > > > -----Original Message-----
> > > > From: Shawn Heisey [mailto:s...@elyograg.org]
> > > > Sent: Tuesday, November 05, 2013 4:35 PM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: fq efficiency
> > > >
> > > > On 11/5/2013 3:36 PM, Scott Schneider wrote:
> > > > > I'm wondering if filter queries are efficient enough for my use
> > > > cases.  I have lots and lots of users in a big, multi-tenant, sharded
> > > > index.  To run a search, I can use an fq on the user id and pass in the
> > > > search terms.  Does this scale well with the # users?  I suppose that,
> > > > since user id is indexed, generating the filter data (which is cached)
> > > > will be fast.  And looking up search terms is fast, of course.  But if
> > > > the search term is a common one that many users have in their
> > > > documents, then Solr may have to perform an intersection between two
> > > > large sets:  docs from all users with the search term and all of the
> > > > current user's docs.
> > > > >
> > > > Also, how about auto-complete and searching with a trailing wildcard?
> > > > As I understand it, these work well in a single-tenant index because
> > > > keywords are sorted in the index, so it's easy to get all the search
> > > > terms that match "foo*".  In a multi-tenant index, all users' keywords
> > > > are stored together.  So if Lucene were to look at all the keywords
> > > > from "foo" to "foozzzzz" (I'm not sure if it actually does this), it
> > > > would skip over a large majority of keywords that don't belong to this
> > > > user.
> > > >
> > > > From what I understand, there's not really a whole lot of difference
> > > > between queries and filter queries when they are NOT cached, except that
> > > > the main query and the filter queries are executed in parallel, which
> > > > can save time.
> > > >
> > > > When filter queries are found in the filterCache, it's a different
> > > > story.  They get applied *before* the main query, which means that the
> > > > main query won't have to work as hard.  The filterCache stores
> > > > information about which documents in the entire index match the filter.
> > > > By storing it as a bitset, the amount of space required is relatively
> > > > low.  Applying filterCache results is very efficient.
> > > >
> > > > There are also advanced techniques, like assigning a cost to each
> > > > filter and creating postfilters:
> > > >
> > > > http://yonik.com/posts/advanced-filter-caching-in-solr/
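> > > >
> > > > For example, a request might combine a cheap cached filter with an
> > > > expensive post filter (field names, values, and the function here
> > > > are only illustrative):
> > > >
> > > >     q=hello
> > > >     fq=user_id:12345
> > > >     fq={!frange l=1 cache=false cost=200}someExpensiveFunction(price)
> > > >
> > > > The first fq is cached as a bitset and applied up front; the second,
> > > > with cache=false and a cost of 100 or more, runs as a post filter
> > > > against only the documents that already matched everything else.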
> > > >
> > > > Thanks,
> > > > Shawn
> >
> >
>



-- 
Joel Bernstein
Search Engineer at Heliosearch
