At 5:06 AM -0500 1/12/07, Erik Hatcher wrote:
>What the user-interface needs is a way to ask Solr for terms that begin with a
>specified prefix, as the user types. Paging via start/rows is necessary, and
>also sorting by frequency given some specified constraints. I like the
>start/end term idea also, though I can't think of a scenario in my application
>where this would be different than having a prefix parameter. If I want all
>the 1860's, prefix=186&field=year, for example.

I also have exactly this requirement: paging through the terms (and getting the document count for each term), optionally limited to those matching a supplied prefix. There can be thousands of terms for a prefix, so start/rows is absolutely necessary even when prefixing. Choosing whether terms are sorted in index order or in document-count order would be a plus.

I would love to have this provided by an extension to the faceting logic, as suggested by Yonik and Hoss, incorporating the non-query pathway raised by Erik:

- Assemble the list of term/frequency pairs for a field, either by tallying the term references found in a DocList or by using the term frequency information found in the index (an optimization for the non-query case)
- Apply a criterion to filter the terms, either during assembly or post-facto (RegExp-based would obviously be most flexible -- no need for full Lucene query syntax -- but prefix-only might be an optimization that could be applied in the non-query case)
- Apply the faceting criteria (e.g. facet.zeros, though facet.mincount would have been a more flexible option in all cases)
- Optionally pass through the BoundedTreeSet/PriorityQueue mechanism to sort by frequency, and in that case optionally keep only the top facet.limit terms
- Cache the results with the query (including a special key for the non-query case) so paging can be done without any requerying, retallying, or resorting
- Return in the response a subrange of the list
- Naturally, allow the full complement of response encodings
- (Am I missing anything?)

While a commendable endeavor, this is a fair bit of work, and it may take a while before someone (perhaps me even) steps up to the plate, for performance if not functional considerations. So IMHO it would also be worthwhile to craft a simpler index-only version.

>I would be thrilled if this just magically appeared in Solr's codebase before
>I have a chance to build it. :)

Well, after my current deadline (next week) passes, this functionality is on my task list for my next milestone... so I'd be equally elated if I didn't have to write it myself. :-)

And adding 2 cents to the other topic in this thread...

>As for Hoss's suggestion of a Stats handler - I still hold the opinion that
>all of the admin JSPs really ought to be first class request handlers that go
>through the whole ResponseWriter stuff, so I can get all of that great
>capability in Ruby format instead of XML.

Agreed in principle, though I'm an XML-person.

>As it is, to build a Ruby API to Solr and provide access to the stats, there
>has to be two different parsing mechanisms. I know he meant index stats, not
>Solr admin stats, but it reminded me of the XML pain I'm going to feel in
>solrb to add Solr stats :)

I am happy to merely be a spectator of the Rubification of SOLR!

Also,

>On Jan 11, 2007, at 3:13 PM, Yonik Seeley wrote:
>>> Attempting to enumerating
>>>all of the values for a field could be dangerous
>>
>>We do it for faceting :-) But we don't drag it all into memory at once...

Not entirely true: the FieldCache pathway of faceting single-valued fields does just that. In some cases I've set multiValued=true even when it's not accurate, in order to force the cached-filter pathway.

- J.J.
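
P.S. To make the filter / threshold / sort / page steps listed above concrete, here is a rough sketch in Python. It is an illustration only, not Solr code: the function name page_terms, its parameters, and the in-memory dict of term counts are all hypothetical stand-ins, plain string sorting stands in for index order, and the tallying and caching steps are left out entirely.

```python
import re
from typing import Dict, List, Optional, Tuple


def page_terms(
    term_counts: Dict[str, int],
    pattern: Optional[str] = None,
    mincount: int = 0,
    sort_by_count: bool = False,
    limit: Optional[int] = None,
    start: int = 0,
    rows: int = 10,
) -> List[Tuple[str, int]]:
    """Hypothetical helper: filter, threshold, sort, and page term/count pairs."""
    # Filter step: the pattern is a regular expression matched at the start
    # of each term, so a plain prefix such as "186" also works.
    if pattern is not None:
        matcher = re.compile(pattern)
        pairs = [(t, c) for t, c in term_counts.items() if matcher.match(t)]
    else:
        pairs = list(term_counts.items())

    # Threshold step: a mincount-style cutoff (mincount=0 keeps zero-count terms).
    pairs = [(t, c) for t, c in pairs if c >= mincount]

    if sort_by_count:
        # Highest count first; ties broken by term so paging is stable.
        pairs.sort(key=lambda p: (-p[1], p[0]))
        if limit is not None:
            pairs = pairs[:limit]  # keep only the top `limit` terms
    else:
        # Plain string order, standing in for index order.
        pairs.sort(key=lambda p: p[0])

    # Paging step: return only the requested subrange.
    return pairs[start:start + rows]


# Made-up counts for a "year" field.
counts = {"1859": 3, "1860": 12, "1861": 7, "1862": 0, "1870": 5}
print(page_terms(counts, pattern="186", mincount=1, sort_by_count=True, rows=2))
```

With the made-up counts above, the call prints [('1860', 12), ('1861', 7)]. A real implementation would presumably walk the index's term enumeration rather than build a dict, and would cache the sorted list so that later pages need no resorting.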