Re: listing/enumerating field information

J.J. Larrea Sat, 13 Jan 2007 14:28:53 -0800

At 5:06 AM -0500 1/12/07, Erik Hatcher wrote:
>What the user-interface needs is a way to ask Solr for terms that begin with a 
>specified prefix, as the user types.   Paging via start/rows is necessary, and 
>also sorting by frequency given some specified constraints.  I like the 
>start/end term idea also, though I can't think of a scenario in my application 
>where this would be different than having a prefix parameter.  If I want all 
>the 1860's, prefix=186&field=year, for example.

I also have exactly this requirement: paging through the terms (and getting the 
document count for each term), optionally limited to those matching a supplied 
prefix (there can be thousands of terms for a prefix, so start/rows is 
absolutely necessary even when prefixing). Being able to choose whether terms 
are sorted in index order or document-count order would be a plus.
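
A minimal sketch of that index-only listing, in Python for brevity. This is not Solr code: the function name list_terms, its parameters, and the plain dict standing in for the index's term dictionary are all assumptions made for illustration.

```python
from bisect import bisect_left

def list_terms(term_counts, prefix="", start=0, rows=10, sort="index"):
    """Page through (term, doc_count) pairs whose term starts with prefix.

    term_counts is a hypothetical stand-in for the index's term dictionary:
    a dict mapping each term of one field to its document count.
    """
    terms = sorted(term_counts)          # index order is lexicographic
    lo = bisect_left(terms, prefix)      # seek to the first candidate term
    matched = []
    for term in terms[lo:]:
        if not term.startswith(prefix):  # terms are sorted, so we can stop
            break
        matched.append((term, term_counts[term]))
    if sort == "count":
        # Highest count first; ties fall back to index order.
        matched.sort(key=lambda tc: (-tc[1], tc[0]))
    return matched[start:start + rows]

years = {"1859": 3, "1860": 5, "1861": 2, "1865": 9, "1870": 1}
page = list_terms(years, prefix="186", sort="count", start=0, rows=2)
```

Here page holds the two most frequent years of the 1860s; asking for start=2 would return the next page without changing anything else in the call.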

I would love to have this be provided by an extension to the Faceting logic, as 
suggested by Yonik and Hoss, incorporating the non-query pathway raised by Erik:
  - Assemble the list of term/frequency pairs for a field either by tallying 
the term references found in a DocList, or by using the term frequency 
information found in the index (optimization for non-query case)
  - Apply a criterion (RegExp based would obviously be most flexible -- no need 
for full Lucene query syntax -- but prefix-only might be an optimization that 
could be applied in the non-query case) to filter the terms, either during 
assembly or post-facto.
  - Apply the faceting criteria (e.g. facet.zeros, though facet.mincount would 
have been a more flexible option in all cases)
  - Optionally pass through the BoundedTreeSet/PriorityQueue mechanism to sort 
by frequency and in that case optionally keep only the top facet.limit terms
  - Cache the results with the query (including a special key for the non-query 
case) so paging could be done without any requerying, retallying, or resorting
  - Return in the response a subrange of the list
  - Naturally allow the full complement of response encodings
  - (Am I missing anything?)
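
The steps above can be sketched roughly as follows, in Python for brevity. This is only a toy model under stated assumptions, not Solr's faceting code: facet_terms and all of its parameters are hypothetical names, field_values (a dict of document id to list of terms) stands in for the index, mincount plays the role of the facet.mincount idea, limit the role of facet.limit, and a plain dict stands in for the query-keyed cache.

```python
import heapq
import re

def facet_terms(field_values, doc_ids=None, pattern=None, mincount=1,
                sort_by_count=True, limit=None, start=0, rows=10, cache=None):
    """Return one page of (term, count) pairs for a single field."""
    # Cache key; doc_ids=None is the special key for the non-query case.
    key = (None if doc_ids is None else frozenset(doc_ids),
           pattern, mincount, sort_by_count, limit)
    ranked = cache.get(key) if cache is not None else None
    if ranked is None:
        # Assemble term/frequency pairs. Every term of the field starts
        # at zero so that mincount=0 can report terms with no matches.
        counts = {t: 0 for terms in field_values.values() for t in terms}
        docs = field_values if doc_ids is None else doc_ids
        for d in docs:
            for t in set(field_values.get(d, ())):
                counts[t] += 1
        # Filter by regular expression and by minimum count.
        rx = re.compile(pattern) if pattern else None
        pairs = [(t, c) for t, c in counts.items()
                 if c >= mincount and (rx is None or rx.search(t))]
        # Sort: index order, or by frequency with an optional top-N cut.
        by_count = lambda p: (-p[1], p[0])
        if not sort_by_count:
            ranked = sorted(pairs)
        elif limit is not None:
            ranked = heapq.nsmallest(limit, pairs, key=by_count)
        else:
            ranked = sorted(pairs, key=by_count)
        if cache is not None:
            cache[key] = ranked
    # Paging is a slice of the cached list: no retallying or resorting.
    return ranked[start:start + rows]

docs = {1: ["red", "blue"], 2: ["red"], 3: ["green", "red"], 4: ["blue"]}
cache = {}
first_page = facet_terms(docs, rows=2, cache=cache)
second_page = facet_terms(docs, start=2, rows=2, cache=cache)
```

Both calls share one cache entry because start and rows are deliberately left out of the key, which is what lets paging reuse the tallied and sorted list.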

While a commendable endeavor, this is a fair bit of work, and it may take a 
while before someone (perhaps even me) steps up to the plate, for performance 
if not functional considerations.  So IMHO it would also be worthwhile to craft 
a simpler index-only version.

>I would be thrilled if this just magically appeared in Solr's codebase before 
>I have a chance to build it. :)

Well, after my current deadline (next week) passes, this functionality is on my 
task list for my next milestone... so I'd be equally elated if I didn't have 
to write it myself. :-)

And adding 2 cents to the other topic in this thread...

>As for Hoss's suggestion of a Stats handler - I still hold the opinion that 
>all of the admin JSPs really ought to be first class request handlers that go 
>through the whole ResponseWriter stuff, so I can get all of that great 
>capability in Ruby format instead of XML. 

Agreed in principle, though I'm an XML-person.

>As it is, to build a Ruby API to Solr and provide access to the stats, there 
>has to be two different parsing mechanisms.  I know he meant index stats, not 
>Solr admin stats, but it reminded me of the XML pain I'm going to feel in 
>solrb to add Solr stats :)

I am happy to merely be a spectator of the Rubification of Solr!

Also,

>On Jan 11, 2007, at 3:13 PM, Yonik Seeley wrote:
>>> Attempting to enumerate
>>>all of the values for a field could be dangerous
>>
>>We do it for faceting :-)  But we don't drag it all into memory at once...

Not entirely true: The FieldCache pathway of faceting single-valued fields does 
just that.  In some cases I've set multiValued=true even when it's not accurate 
in order to force the cached-filter pathway.
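
To illustrate the difference, here is a toy model of the two pathways in Python. It is a simplification based only on the description above, not Solr's actual implementation; both function names and both data structures are assumptions.

```python
def facet_by_field_cache(doc_to_term, base_docs):
    """FieldCache-style: doc_to_term holds one entry for every document in
    the index, so the whole array sits in memory however small base_docs is."""
    counts = {}
    for d in base_docs:
        term = doc_to_term[d]
        if term is not None:
            counts[term] = counts.get(term, 0) + 1
    return counts

def facet_by_filters(term_to_docs, base_docs):
    """Filter-style: walk the terms and intersect each term's (cacheable)
    document set with the base set, one term at a time."""
    base = set(base_docs)
    counts = {}
    for term, docs in term_to_docs.items():
        n = len(base & docs)
        if n:
            counts[term] = n
    return counts

# Six documents with a single-valued field; document 3 has no value.
doc_to_term = ["a", "b", "a", None, "b", "a"]
term_to_docs = {"a": {0, 2, 5}, "b": {1, 4}}
base_docs = [0, 2, 4]
```

Both functions produce the same counts for base_docs; they differ in what has to be resident in memory while doing so.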

- J.J.
