Thanks Erick, that's a very helpful answer. Regarding the grouping option, does that require all the docs to be put into a single collection, or could it be done across N collections (assuming each collection had a common "type" field for grouping on)?
Chris

On Wed, Jun 26, 2013 at 7:01 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> bq: Would the above setup qualify as "multiple compatible collections"
>
> No. While there may be enough fields in common to form a single query,
> the TF/IDF calculations will not be "compatible" and the scores from the
> various collections will NOT be comparable. So simply getting the list of
> top N docs will probably be dominated by the docs from a single type.
>
> bq: How does SolrCloud combine the query results from multiple collections?
>
> It doesn't. SolrCloud sorts the results from multiple nodes in the
> _same_ collection according to whatever sort criteria are specified,
> defaulting to score. Say you ask for the top 20 docs. A node from each
> shard returns the top 20 docs for that shard. The node processing them
> just merges all the returned lists and only keeps the top 20.
>
> I don't think your last two questions are really relevant, SolrCloud
> isn't built to query multiple collections and return the results
> coherently.
>
> The root problem here is that you're trying to compare docs from
> different collections for "goodness" to return the top N. This isn't
> actually hard _except_ when "goodness" is the score, then it just
> doesn't work. You can't even compare scores from different queries on
> the _same_ collection, much less different ones. Consider two
> collections, books and songs. One consists of lots and lots of text and
> the term frequency and inverse doc freq (TF/IDF) will be hugely
> different than songs. Not to mention field length normalization.
>
> Now, all that aside there's an option. Index all the docs in a single
> collection and use grouping (aka field collapsing) to get a single
> response that has the top N docs from each type (they'll be in different
> sections of the original response) and present them to the user however
> makes sense.
> You'll get "hands on" experience in why this isn't something that's
> easy to do automatically if you try to sort these into a single list
> by relevance <G>...
>
> Best
> Erick
>
> On Tue, Jun 25, 2013 at 3:35 PM, Chris Toomey <ctoo...@gmail.com> wrote:
> > Thanks Jack for the alternatives. The first is interesting but has the
> > downside of requiring multiple queries to get the full matching docs.
> > The second is interesting and very simple, but has the downside of not
> > being modular and being difficult to configure field boosting when the
> > collections have overlapping field names with different boosts being
> > needed for the same field in different document types.
> >
> > I'd still like to know about the viability of my original approach
> > though too.
> >
> > Chris
> >
> >
> > On Tue, Jun 25, 2013 at 3:19 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> >
> >> One simple scenario to consider: N+1 collections - one collection per
> >> document type with detailed fields for that document type, and one
> >> common collection that indexes a subset of the fields. The main user
> >> query would be an edismax over the common fields in that "main"
> >> collection. You can then display summary results from the common
> >> collection. You can also then support "drill down" into the
> >> type-specific collection based on a "type" field for each document in
> >> the main collection.
> >>
> >> Or, sure, you actually CAN index multiple document types in the same
> >> collection - add all the fields to one schema - there is no time or
> >> space penalty if most of the fields are empty for most documents.
> >>
> >> -- Jack Krupansky
> >>
> >> -----Original Message----- From: Chris Toomey
> >> Sent: Tuesday, June 25, 2013 6:08 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Querying multiple collections in SolrCloud
> >>
> >>
> >> Hi, I'm investigating using SolrCloud for querying documents of
> >> different but similar/related types, and have read through docs. on
> >> the wiki and done many searches in these archives, but still have some
> >> questions. Thanks in advance for your help.
> >>
> >> Setup:
> >> * Say that I have N distinct types of documents and I want to do queries
> >> that return the best matches regardless of document type. I.e.,
> >> something akin to a Google search where I'd like to get the best
> >> matches from the web, news, images, and maps.
> >>
> >> * Our main use case is supporting simple user-entered searches, which
> >> would just contain terms / phrases and wouldn't specify fields.
> >>
> >> * The document types will not all have the same fields, though there
> >> may be some overlap in the fields.
> >>
> >> * We plan to use a separate collection for each document type, and to
> >> use the eDisMax query parser. Each collection would have a
> >> document-specific schema configuration with appropriate defaults for
> >> query fields and boosts, etc.
> >>
> >> Questions:
> >> * Would the above setup qualify as "multiple compatible collections",
> >> such that we could search all N collections with a single SolrCloud
> >> query, as in the example query
> >> "http://localhost:8983/solr/collection1/select?q=apple%20pie&collection=c1,c2,...,cN"?
> >> Again, we're not querying against specific fields.
> >>
> >> * How does SolrCloud combine the query results from multiple
> >> collections?
> >> Does it re-sort the combined result set, or does it just return the
> >> concatenation of the (unmerged) results from each of the collections?
> >>
> >> * Does SolrCloud impose any restrictions on querying multiple, sharded
> >> collections? I know it supports querying say all 3 shards of a single
> >> collection, so want to make sure it would also support say all Nx3
> >> shards of N collections.
> >>
> >> * When SolrCloud queries multiple shards/collections, it queries them
> >> concurrently vs. serially, correct?
> >>
> >> thanks much,
> >> Chris
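
For anyone reading this thread later, the per-shard merge Erick describes (each shard returns its own top N, the coordinating node merges the lists and keeps only the top N) can be sketched roughly as below. This is only an illustration in Python, not Solr's actual code: the function name merge_top_n, the (score, doc_id) tuples, and all scores and ids are made up for the example.

```python
import heapq
from itertools import islice


def merge_top_n(shard_results, n):
    """Merge per-shard hit lists and keep the n best overall.

    Each list holds (score, doc_id) tuples already sorted by descending
    score, which is the order a shard would return them in.
    """
    # heapq.merge expects ascending inputs, so sort on the negated score.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(islice(merged, n))


# Hypothetical responses from three shards of the SAME collection.
shard1 = [(9.1, "a"), (4.0, "b"), (1.2, "c")]
shard2 = [(7.5, "d"), (6.3, "e"), (0.4, "f")]
shard3 = [(8.8, "g"), (2.2, "h")]

top = merge_top_n([shard1, shard2, shard3], n=4)
print(top)
```

The merge is only meaningful because all three lists come from one collection with one set of scoring statistics. Feeding it lists from different collections would run just as happily, but, per Erick's point about TF/IDF, the resulting order would not mean much.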