Re: Querying multiple collections in SolrCloud

Erick Erickson Thu, 27 Jun 2013 10:46:36 -0700

I'd _guess_ that this is unsupported across collections if
for no other reason than scores really aren't comparable
across collections and the default ordering within groups
is score. This is really a "federated search" type problem.
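
To make that concrete, here's a toy sketch (Python, purely illustrative).
The collection names, doc ids and scores are all invented, and
merge_by_score is a hypothetical helper, not anything Solr provides. It
just shows what a naive merge by raw score does when one collection's
scores happen to sit on a larger scale:

```python
import heapq

# Invented per-collection results; the scores are made up and only
# meaningful relative to other docs from the same query and collection.
SAMPLE = {
    "books": [{"id": "b1", "score": 0.9}, {"id": "b2", "score": 0.8}],
    "songs": [{"id": "s1", "score": 7.2}, {"id": "s2", "score": 6.5}],
}


def merge_by_score(results_by_collection, n):
    """Naively keep the n highest raw scores across all collections."""
    tagged = (
        (doc["score"], name, doc["id"])
        for name, docs in results_by_collection.items()
        for doc in docs
    )
    return heapq.nlargest(n, tagged)


if __name__ == "__main__":
    # With these invented numbers every slot goes to "songs", even though
    # nothing says those docs are better matches than the books.
    for score, name, doc_id in merge_by_score(SAMPLE, 2):
        print(name, doc_id, score)
```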


But if it makes sense to use N collections for other reasons,
it's functionally the same thing as grouping: you just
send a separate request to each collection and combine
the results of those N requests rather than from N
groups in a single query. If the collections are hosted on
different machines, for instance, you might get a quicker
overall response by firing off parallel queries.
It Depends (tm)...
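
A minimal client-side sketch of the "one request per collection" idea, in
Python. The base URL, the collection names and the helper names
(query_collection, federated_search) are assumptions for illustration,
and it relies only on the standard /select handler with wt=json. Results
are deliberately kept per collection, like groups, rather than merged by
score:

```python
from concurrent.futures import ThreadPoolExecutor
import json
import urllib.parse
import urllib.request

SOLR_BASE = "http://localhost:8983/solr"  # assumed; adjust for your cluster


def query_collection(collection, q, rows=10):
    """Send one /select request to a single collection; return its docs."""
    params = urllib.parse.urlencode({"q": q, "rows": rows, "wt": "json"})
    url = f"{SOLR_BASE}/{collection}/select?{params}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["response"]["docs"]


def federated_search(collections, q, rows=10, fetch=query_collection):
    """Query each collection in parallel; keep the results separate."""
    if not collections:
        return {}
    with ThreadPoolExecutor(max_workers=len(collections)) as pool:
        futures = {
            name: pool.submit(fetch, name, q, rows) for name in collections
        }
        # Each entry holds the top `rows` docs for one collection, which
        # is roughly what one group would hold in a grouped response.
        return {name: fut.result() for name, fut in futures.items()}
```

Something like federated_search(["books", "songs"], "apple pie", rows=5)
would then give you one list per collection to lay out however suits the
UI. The fetch argument is only there so the HTTP call can be swapped out
(for testing, say).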

Best
Erick


On Wed, Jun 26, 2013 at 1:46 PM, Chris Toomey <ctoo...@gmail.com> wrote:

> Thanks Erick, that's a very helpful answer.
>
> Regarding the grouping option, does that require all the docs to be put
> into a single collection, or could it be done across N collections
> (assuming each collection had a common "type" field for grouping on)?
>
> Chris
>
>
> On Wed, Jun 26, 2013 at 7:01 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > bq: Would the above setup qualify as "multiple compatible collections"
> >
> > No. While there may be enough fields in common to form a single query,
> > the TF/IDF calculations will not be "compatible" and the scores from the
> > various collections will NOT be comparable. So simply getting the list of
> > top N docs will probably be dominated by the docs from a single type.
> >
> > bq: How does SolrCloud combine the query results from multiple collections?
> >
> > It doesn't. SolrCloud sorts the results from multiple nodes in the
> > _same_ collection according to whatever sort criteria are specified,
> > defaulting to score. Say you ask for the top 20 docs. A node from each
> > shard returns the top 20 docs for that shard. The node processing them
> > just merges all the returned lists and only keeps the top 20.
> >
> > I don't think your last two questions are really relevant: SolrCloud
> > isn't built to query multiple collections and return the results
> > coherently.
> >
> > The root problem here is that you're trying to compare docs from
> > different collections for "goodness" to return the top N. This isn't
> > actually hard _except_ when "goodness" is the score; then it just
> > doesn't work. You can't even compare scores from different queries on
> > the _same_ collection, much less different ones. Consider two
> > collections, books and songs. Books consist of lots and lots of text,
> > and the term frequency and inverse doc frequency (TF/IDF) will be
> > hugely different from those for songs. Not to mention field length
> > normalization.
> >
> > Now, all that aside, there's an option. Index all the docs in a single
> > collection and use grouping (aka field collapsing) to get a single
> > response that has the top N docs from each type (they'll be in
> > different sections of the original response) and present them to the
> > user however makes sense. You'll get "hands on" experience in why this
> > isn't something that's easy to do automatically if you try to sort
> > these into a single list by relevance <G>...
> >
> > Best
> > Erick
> >
> > On Tue, Jun 25, 2013 at 3:35 PM, Chris Toomey <ctoo...@gmail.com> wrote:
> > > Thanks Jack for the alternatives. The first is interesting but has the
> > > downside of requiring multiple queries to get the full matching docs.
> > > The second is interesting and very simple, but has the downside of not
> > > being modular and being difficult to configure field boosting when the
> > > collections have overlapping field names with different boosts being
> > > needed for the same field in different document types.
> > >
> > > I'd still like to know about the viability of my original approach
> > > though too.
> > >
> > > Chris
> > >
> > >
> > > On Tue, Jun 25, 2013 at 3:19 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> > >
> > >> One simple scenario to consider: N+1 collections - one collection per
> > >> document type with detailed fields for that document type, and one
> > >> common collection that indexes a subset of the fields. The main user
> > >> query would be an edismax over the common fields in that "main"
> > >> collection. You can then display summary results from the common
> > >> collection. You can also then support "drill down" into the
> > >> type-specific collection based on a "type" field for each document in
> > >> the main collection.
> > >>
> > >> Or, sure, you actually CAN index multiple document types in the same
> > >> collection - add all the fields to one schema - there is no time or
> > >> space penalty if most of the fields are empty for most documents.
> > >>
> > >> -- Jack Krupansky
> > >>
> > >> -----Original Message----- From: Chris Toomey
> > >> Sent: Tuesday, June 25, 2013 6:08 PM
> > >> To: solr-user@lucene.apache.org
> > >> Subject: Querying multiple collections in SolrCloud
> > >>
> > >>
> > >> Hi, I'm investigating using SolrCloud for querying documents of
> > >> different but similar/related types, and have read through docs. on
> > >> the wiki and done many searches in these archives, but still have some
> > >> questions. Thanks in advance for your help.
> > >>
> > >> Setup:
> > >> * Say that I have N distinct types of documents and I want to do
> > >> queries that return the best matches regardless of document type.
> > >> I.e., something akin to a Google search where I'd like to get the
> > >> best matches from the web, news, images, and maps.
> > >>
> > >> * Our main use case is supporting simple user-entered searches, which
> > >> would just contain terms / phrases and wouldn't specify fields.
> > >>
> > >> * The document types will not all have the same fields, though there
> > >> may be some overlap in the fields.
> > >>
> > >> * We plan to use a separate collection for each document type, and to
> > >> use the eDisMax query parser. Each collection would have a
> > >> document-specific schema configuration with appropriate defaults for
> > >> query fields and boosts, etc.
> > >>
> > >> Questions:
> > >> * Would the above setup qualify as "multiple compatible collections",
> > >> such that we could search all N collections with a single SolrCloud
> > >> query, as in the example query
> > >> "http://localhost:8983/solr/collection1/select?q=apple%20pie&collection=c1,c2,...,cN"?
> > >> Again, we're not querying against specific fields.
> > >>
> > >> * How does SolrCloud combine the query results from multiple
> > >> collections? Does it re-sort the combined result set, or does it just
> > >> return the concatenation of the (unmerged) results from each of the
> > >> collections?
> > >>
> > >> * Does SolrCloud impose any restrictions on querying multiple, sharded
> > >> collections? I know it supports querying say all 3 shards of a single
> > >> collection, so want to make sure it would also support say all Nx3
> > >> shards of N collections.
> > >>
> > >> * When SolrCloud queries multiple shards/collections, it queries them
> > >> concurrently vs. serially, correct?
> > >>
> > >> thanks much,
> > >> Chris
> > >>
> >
>
