Re: Querying multiple collections in SolrCloud

Erick Erickson Wed, 26 Jun 2013 07:02:55 -0700

bq: Would the above setup qualify as "multiple compatible collections"

No. While there may be enough fields in common to form a single query,
the TF/IDF calculations will not be "compatible" and the scores from the
various collections will NOT be comparable. So a simple list of the
top N docs will probably be dominated by docs from a single type.

bq: How does SolrCloud combine the query results from multiple collections?

It doesn't. SolrCloud sorts the results from multiple nodes in the _same_
collection according to whatever sort criteria are specified, defaulting to
score. Say you ask for the top 20 docs. A node from each shard returns the
top 20 docs for that shard. The node processing them just merges all the
returned lists and only keeps the top 20.
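
The merge step described above can be sketched roughly as follows. This is
only an illustration in Python, not Solr's actual code: the function name
merge_top_n, the (score, doc_id) tuples, and the sample data are all made up
for the example, and it assumes each shard returns its hits already sorted
by descending score.

```python
import heapq
from itertools import islice

def merge_top_n(shard_results, n):
    """Merge per-shard hit lists and keep only the overall top n.

    Each element of shard_results is a list of (score, doc_id) tuples
    that is already sorted by descending score (an assumption of this
    sketch).
    """
    # heapq.merge lazily merges sorted inputs; negating the score makes
    # the ascending merge behave like a descending one.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(islice(merged, n))

# Hypothetical hits from two shards of the same collection.
shard1 = [(9.1, "a"), (7.4, "b"), (2.0, "c")]
shard2 = [(8.3, "d"), (7.9, "e"), (1.5, "f")]
top3 = merge_top_n([shard1, shard2], 3)
```

The merge is only meaningful because both lists come from the same
collection, so the sort values are on the same scale.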

I don't think your last two questions are really relevant; SolrCloud isn't
built to query multiple collections and return the results coherently.

The root problem here is that you're trying to compare docs from different
collections for "goodness" to return the top N. This isn't actually hard
_except_ when "goodness" is the score; then it just doesn't work. You can't
even compare scores from different queries on the _same_ collection, much
less different ones. Consider two collections, books and songs. Books consist
of lots and lots of text, and their term frequency and inverse doc freq
(TF/IDF) will be hugely different from those of songs. Not to mention field
length normalization.
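
To make the point concrete, here is a small Python sketch of how the same
term can get very different IDF weights in two collections. It assumes the
classic Lucene TF/IDF formula idf = 1 + ln(numDocs / (docFreq + 1)); the
document counts are invented purely for illustration.

```python
import math

def classic_idf(doc_freq, num_docs):
    """IDF in the style of Lucene's classic TF/IDF similarity (assumed here)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical statistics for the term "love" in two collections of
# 100,000 docs each: common in one, rare in the other.
books_idf = classic_idf(doc_freq=50_000, num_docs=100_000)
songs_idf = classic_idf(doc_freq=20, num_docs=100_000)

# The same term contributes a much larger weight where it is rare, so
# raw scores from the two collections are not on a common scale.
print(f"books idf: {books_idf:.3f}, songs idf: {songs_idf:.3f}")
```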

Now, all that aside, there's an option. Index all the docs in a single
collection and use grouping (aka field collapsing) to get a single response
that has the top N docs from each type (they'll be in different sections of
the original response) and present them to the user however makes sense.
You'll get "hands on" experience in why this isn't something that's easy to
do automatically if you try to sort these into a single list by
relevance <G>...
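
A grouped request along these lines might be built as in the Python sketch
below. The group, group.field and group.limit parameters are Solr's standard
result-grouping parameters; the collection name "all_docs", the field name
"doc_type", and the helper grouped_query_url are assumptions made up for the
example.

```python
from urllib.parse import urlencode

def grouped_query_url(base_url, user_query, type_field, per_type):
    """Build a Solr select URL that returns the top docs per document type."""
    params = {
        "q": user_query,
        "defType": "edismax",       # parser mentioned in the original question
        "group": "true",            # turn on result grouping
        "group.field": type_field,  # one group per document type
        "group.limit": per_type,    # top N docs within each group
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical single collection holding every document type.
url = grouped_query_url(
    "http://localhost:8983/solr/all_docs", "apple pie", "doc_type", 5
)
print(url)
```

Each group in the response then carries its own ranked list, which the
application can lay out per type rather than trying to interleave by score.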

Best
Erick

On Tue, Jun 25, 2013 at 3:35 PM, Chris Toomey <ctoo...@gmail.com> wrote:
> Thanks Jack for the alternatives.  The first is interesting but has the
> downside of requiring multiple queries to get the full matching docs.  The
> second is interesting and very simple, but has the downside of not being
> modular and being difficult to configure field boosting when the
> collections have overlapping field names with different boosts being needed
> for the same field in different document types.
>
> I'd still like to know about the viability of my original approach though
> too.
>
> Chris
>
>
> On Tue, Jun 25, 2013 at 3:19 PM, Jack Krupansky
> <j...@basetechnology.com> wrote:
>
>> One simple scenario to consider: N+1 collections - one collection per
>> document type with detailed fields for that document type, and one common
>> collection that indexes a subset of the fields. The main user query would
>> be an edismax over the common fields in that "main" collection. You can
>> then display summary results from the common collection. You can also then
>> support "drill down" into the type-specific collection based on a "type"
>> field for each document in the main collection.
>>
>> Or, sure, you actually CAN index multiple document types in the same
>> collection - add all the fields to one schema - there is no time or space
>> penalty if most of the fields are empty for most documents.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Chris Toomey
>> Sent: Tuesday, June 25, 2013 6:08 PM
>> To: solr-user@lucene.apache.org
>> Subject: Querying multiple collections in SolrCloud
>>
>>
>> Hi, I'm investigating using SolrCloud for querying documents of different
>> but similar/related types, and have read through docs. on the wiki and done
>> many searches in these archives, but still have some questions.  Thanks in
>> advance for your help.
>>
>> Setup:
>> * Say that I have N distinct types of documents and I want to do queries
>> that return the best matches regardless of document type.  I.e., something
>> akin to a Google search where I'd like to get the best matches from the
>> web, news, images, and maps.
>>
>> * Our main use case is supporting simple user-entered searches, which would
>> just contain terms / phrases and wouldn't specify fields.
>>
>> * The document types will not all have the same fields, though there may be
>> some overlap in the fields.
>>
>> * We plan to use a separate collection for each document type, and to use
>> the eDisMax query parser.  Each collection would have a document-specific
>> schema configuration with appropriate defaults for query fields and boosts,
>> etc.
>>
>> Questions:
>> * Would the above setup qualify as "multiple compatible collections", such
>> that we could search all N collections with a single SolrCloud query, as in
>> the example query
>> "http://localhost:8983/solr/collection1/select?q=apple%20pie&collection=c1,c2,...,cN"?
>> Again, we're not querying against specific fields.
>>
>> * How does SolrCloud combine the query results from multiple collections?
>> Does it re-sort the combined result set, or does it just return the
>> concatenation of the (unmerged) results from each of the collections?
>>
>> * Does SolrCloud impose any restrictions on querying multiple, sharded
>> collections?  I know it supports querying say all 3 shards of a single
>> collection, so want to make sure it would also support say all Nx3 shards
>> of N collections.
>>
>> * When SolrCloud queries multiple shards/collections, it queries them
>> concurrently vs. serially, correct?
>>
>> thanks much,
>> Chris
>>
