In the library subject heading context, I wonder if a layered approach would bring performance into the acceptable range. Since Library of Congress Subject Headings break into standard parts, you could have first-tier facets representing the main heading, second-tier facets combining the main heading and first subdivision, and so on. To extract the subject headings from a given result set, you'd first test all the first-tier facets like "Body, Human", then, where warranted, test the associated second-tier facets like "Body, Human--Social aspects". If the first-tier facets represent a small enough subset of the full set of subject headings, that might be enough to reduce the total number of facet tests.
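A minimal sketch of that pruning idea (the function, field names, and in-memory data are all made up for illustration -- in practice count_fn would be a real facet query against Solr):

```python
# Hypothetical sketch of the layered facet idea: count first-tier headings,
# and descend into a heading's second-tier subdivisions only when the
# parent matched the result set at all.

def layered_facet_counts(result_set, tier1, tier2_by_parent, count_fn):
    """Return facet counts, pruning second-tier tests under unmatched parents."""
    counts = {}
    for heading in tier1:
        n = count_fn(result_set, heading)
        if n == 0:
            continue  # prune every "Heading--Subdivision" under this parent
        counts[heading] = n
        for child in tier2_by_parent.get(heading, []):
            m = count_fn(result_set, child)
            if m:
                counts[child] = m
    return counts

# Toy stand-in: each document is just its set of subject headings.
def count_matches(result_set, heading):
    return sum(1 for doc in result_set if heading in doc)
```

With tens of thousands of first-tier headings but only a handful matching any given result set, most of the second-tier tests never run.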
I'm told by our metadata librarian, by the way, that there are 280,000 subject headings defined in LCSH at the moment (including cross-references), so that provides a rough upper limit on distinct values...

Peter

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 07, 2007 2:02 PM
To: solr-user@lucene.apache.org
Subject: Re: facet optimizing

: Andrew, I haven't yet found a successful way to implement the SOLR
: faceting for library catalog data. I developed my own system, so for

Just to clarify: the "out of the box" faceting support Solr has at the moment is very deliberately referred to as "SimpleFacets" ... it's intended to solve simple problems where you want facets based on all of the values in a field, or on one specific hardcoded query. It was primarily written as a demonstration of what is possible when writing a custom SolrRequestHandler.

When you start talking about really large data sets, with an extremely large volume of unique field values in the fields you want to facet on, then "generic" solutions stop being very feasible, and you have to start looking at solutions more tailored to your dataset.

At CNET, when dealing with product data, we don't make any attempt to use the SimpleFacets support Solr provides to facet on things like Manufacturer or Operating System, because enumerating through every manufacturer in the catalog on every query would be too expensive -- instead we have structured metadata that drives the logic: only compute the constraint counts for this subset of manufacturers when looking at the Desktops category, only look at the Operating System facet when in these categories, etc. Rules like these need to be defined based on your user experience, and it can be easy to build them using the metadata in your index -- but they really need to be precomputed, not calculated on the fly every time.
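The metadata-driven rules described above might look something like this sketch (the rule table, category names, and field names are invented, not CNET's actual configuration): a per-category table decides which facet fields, and which candidate values, are worth counting at all, instead of enumerating every unique value in the index.

```python
# Hypothetical per-category facet rules: compute constraint counts only
# for the fields/values the rules allow; unknown categories get no facets.

FACET_RULES = {
    "Desktops": {
        "manufacturer": ["Dell", "HP", "Apple"],
        "operating_system": ["Windows", "Linux"],
    },
    # other categories would expose other facets...
}

def constrained_facet_counts(docs, category, rules=FACET_RULES):
    """Count only the rule-approved values for this category's facet fields."""
    counts = {}
    for field, values in rules.get(category, {}).items():
        for value in values:
            n = sum(1 for d in docs if d.get(field) == value)
            if n:
                counts.setdefault(field, {})[value] = n
    return counts
```

Note that a manufacturer outside the approved subset simply never gets counted -- that is the point: the rule table, not the data, bounds the per-query work.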
For something like a library system, where you might want to facet on Author but have way too many authors for that to be practical, a system that either requires a category to be picked first (allowing you to constrain the list of authors you need to worry about) or precomputes the top 1000 authors for initial display (when the user hasn't provided any other constraints) would be an example of the type of thing a RequestHandler Solr plugin might do -- but the logic involved would probably be domain specific.

-Hoss
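The precomputed-top-authors idea sketched above is essentially a one-time frequency count (the "author" field name is an assumption here): run it at index time, and the unconstrained landing page reads the stored list instead of enumerating every distinct author per query.

```python
from collections import Counter

# Hypothetical sketch: precompute the N most frequent authors once,
# e.g. whenever the index is rebuilt.

def precompute_top_authors(docs, n=1000):
    """Return the n most frequent authors across the whole collection."""
    return Counter(d["author"] for d in docs if "author" in d).most_common(n)
```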