Thanks Erick, that's a very helpful answer.

Regarding the grouping option, does that require all the docs to be put
into a single collection, or could it be done across N collections
(assuming each collection had a common "type" field for grouping on)?

Chris


On Wed, Jun 26, 2013 at 7:01 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: Would the above setup qualify as "multiple compatible collections"
>
> No. While there may be enough fields in common to form a single query,
> the TF/IDF calculations will not be "compatible" and the scores from the
> various collections will NOT be comparable. So simply getting the list of
> top N docs will probably be dominated by the docs from a single type.
>
> bq: How does SolrCloud combine the query results from multiple collections?
>
> It doesn't. SolrCloud sorts the results from multiple nodes in the
> _same_ collection according to whatever sort criteria are specified,
> defaulting to score. Say you ask for the top 20 docs. A node from each
> shard returns the top 20 docs for that shard. The node processing them
> just merges all the returned lists and only keeps the top 20.
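
The merge step described above can be sketched roughly as follows. This
is only an illustration in Python, not SolrCloud's actual code: the
function name merge_top_k and the scores and doc ids are made up.

```python
import heapq

def merge_top_k(shard_results, k=20):
    """Merge per-shard lists of (score, doc_id) hits and keep the k best by score."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])

# Hypothetical top hits returned by two shards of the SAME collection.
shard_a = [(9.1, "a1"), (7.4, "a2"), (3.3, "a3")]
shard_b = [(8.2, "b1"), (7.9, "b2"), (1.0, "b3")]

top = merge_top_k([shard_a, shard_b], k=4)
print(top)
```

The merge only makes sense because both shards score documents for the
same query against the same kind of content.
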
>
> I don't think your last two questions are really relevant; SolrCloud
> isn't built to query multiple collections and return the results
> coherently.
>
> The root problem here is that you're trying to compare docs from
> different collections for "goodness" to return the top N. This isn't
> actually hard _except_ when "goodness" is the score; then it just
> doesn't work. You can't even compare scores from different queries on
> the _same_ collection, much less different ones. Consider two
> collections, books and songs. Books consist of lots and lots of text,
> so their term frequency and inverse document frequency (TF/IDF) will be
> hugely different from those of songs. Not to mention field length
> normalization.
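
A small worked example of why the scores diverge. This is a sketch that
assumes the classic Lucene-style idf formula, 1 + ln(numDocs / (docFreq + 1)).
The collection sizes and document frequencies are invented purely for
illustration.

```python
import math

def idf(num_docs, doc_freq):
    """Classic Lucene-style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical statistics for the same term, "apple", in two collections.
books_idf = idf(num_docs=1_000_000, doc_freq=50_000)  # common term in books
songs_idf = idf(num_docs=10_000, doc_freq=20)         # rare term in songs

print(f"books idf: {books_idf:.2f}, songs idf: {songs_idf:.2f}")
```

With these made-up numbers the same term carries a very different weight
in each collection, so a raw score from one says little about a raw
score from the other.
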
>
> Now, all that aside, there's an option. Index all the docs in a single
> collection and use grouping (aka field collapsing) to get a single
> response that has the top N docs from each type (they'll be in
> different sections of the original response) and present them to the
> user however makes sense. You'll get "hands on" experience in why this
> isn't something that's easy to do automatically if you try to sort
> these into a single list by relevance <G>...
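
A grouped request along those lines might be built like this. It is a
sketch only: the host and collection name come from the example URL
further down the thread, "type" is the common field assumed in this
discussion, and the helper name grouped_query_url and the per-group
limit of 5 are made up. The group, group.field and group.limit
parameters are Solr's standard result-grouping parameters.

```python
from urllib.parse import urlencode

def grouped_query_url(base_url, user_query, group_field="type", per_group=5):
    """Build a Solr select URL that returns the top docs per value of group_field."""
    params = {
        "q": user_query,
        "defType": "edismax",        # query parser mentioned in the thread
        "group": "true",             # turn on result grouping
        "group.field": group_field,  # one group per document type
        "group.limit": per_group,    # top N docs kept within each group
    }
    return base_url + "?" + urlencode(params)

url = grouped_query_url("http://localhost:8983/solr/collection1/select", "apple pie")
print(url)
```

Each group in the response would then be rendered as its own section
rather than being merged into one relevance-ranked list.
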
>
> Best
> Erick
>
> On Tue, Jun 25, 2013 at 3:35 PM, Chris Toomey <ctoo...@gmail.com> wrote:
> > Thanks Jack for the alternatives.  The first is interesting but has the
> > downside of requiring multiple queries to get the full matching docs.
> > The second is interesting and very simple, but has the downside of not
> > being modular and being difficult to configure field boosting when the
> > collections have overlapping field names with different boosts being
> > needed for the same field in different document types.
> >
> > I'd still like to know about the viability of my original approach
> > though.
> >
> > Chris
> >
> >
> > On Tue, Jun 25, 2013 at 3:19 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> >
> >> One simple scenario to consider: N+1 collections - one collection per
> >> document type with detailed fields for that document type, and one
> >> common collection that indexes a subset of the fields. The main user
> >> query would be an edismax over the common fields in that "main"
> >> collection. You can then display summary results from the common
> >> collection. You can also then support "drill down" into the
> >> type-specific collection based on a "type" field for each document in
> >> the main collection.
> >>
> >> Or, sure, you actually CAN index multiple document types in the same
> >> collection - add all the fields to one schema - there is no time or
> >> space penalty if most of the fields are empty for most documents.
> >>
> >> -- Jack Krupansky
> >>
> >> -----Original Message----- From: Chris Toomey
> >> Sent: Tuesday, June 25, 2013 6:08 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Querying multiple collections in SolrCloud
> >>
> >>
> >> Hi, I'm investigating using SolrCloud for querying documents of
> >> different but similar/related types, and have read through the docs on
> >> the wiki and done many searches in these archives, but still have some
> >> questions.  Thanks in advance for your help.
> >>
> >> Setup:
> >> * Say that I have N distinct types of documents and I want to do queries
> >> that return the best matches regardless of document type.  I.e.,
> >> something akin to a Google search where I'd like to get the best
> >> matches from the web, news, images, and maps.
> >>
> >> * Our main use case is supporting simple user-entered searches, which
> >> would just contain terms / phrases and wouldn't specify fields.
> >>
> >> * The document types will not all have the same fields, though there
> >> may be some overlap in the fields.
> >>
> >> * We plan to use a separate collection for each document type, and to
> >> use the eDisMax query parser.  Each collection would have a
> >> document-specific schema configuration with appropriate defaults for
> >> query fields and boosts, etc.
> >>
> >> Questions:
> >> * Would the above setup qualify as "multiple compatible collections",
> >> such that we could search all N collections with a single SolrCloud
> >> query, as in the example query
> >> "http://localhost:8983/solr/collection1/select?q=apple%20pie&collection=c1,c2,...,cN"?
> >> Again, we're not querying against specific fields.
> >>
> >> * How does SolrCloud combine the query results from multiple
> >> collections? Does it re-sort the combined result set, or does it just
> >> return the concatenation of the (unmerged) results from each of the
> >> collections?
> >>
> >> * Does SolrCloud impose any restrictions on querying multiple, sharded
> >> collections?  I know it supports querying say all 3 shards of a single
> >> collection, so want to make sure it would also support say all Nx3
> >> shards of N collections.
> >>
> >> * When SolrCloud queries multiple shards/collections, it queries them
> >> concurrently vs. serially, correct?
> >>
> >> thanks much,
> >> Chris
> >>
>
