What you're describing is implemented with graph aggregations in this ticket, using tf-idf. Other scoring methods can be implemented as well.

https://issues.apache.org/jira/browse/SOLR-9193

I'll update this thread later today with a description of how this can be used with the facet() streaming expression as well as with graph queries.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 3, 2016 at 8:18 AM, <heuw...@uni-hildesheim.de> wrote:

> Dear everybody,
>
> as the JSON-API now makes configuration of facets and sub-facets easier,
> there appears to be a lot of potential to enable instant calculation of
> facet-recommendations for a query, that is, to sort facets by their
> relative importance/interestingness/significance for a current query
> relative to the complete collection or relative to a result set defined
> by a different query.
>
> An example would be to show the most typical terms which are used in
> descriptions of horror-movies, in contrast to the most popular ones for
> this query, as these may include terms that occur as often in other genres.
>
> This feature has been discussed earlier in the context of Solr:
> * http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
> * http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html
>
> In Elasticsearch, the specific feature that I am looking for is called
> Significant Terms Aggregation:
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation
>
> As of now, I have two questions:
>
> a) Are there workarounds in the current Solr implementation or known
> patches that implement such a sort-option for fields with a large number of
> possible values, e.g. text-fields? (For smaller vocabularies it is easy to
> do this client-side with two queries.)
> b) Are there plans to implement this in facet.pivot or in the
> facet.json-API?
>
> The first step could be to define "interestingness" as a sort-option for
> facets and to define interestingness as facet-count in the result-set as
> compared to the complete collection:
> documentfrequency_termX(bucket) * inverse_documentfrequency_termX(collection)
>
> As an extension, the JSON-API could be used to change the domain used as
> base for the comparison. Another interesting option would be to compare
> facet-counts against a current parent-facet for nested facets, e.g. the 5
> most interesting terms by genre for a query on 70s movies, returning the
> terms specific to horror, comedy, action etc. compared to all terminology
> at the time (i.e. in the parent-query).
>
> A callback function could be used to define other measures of
> interestingness such as the log-likelihood-ratio (
> http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html). Most
> measures need at least the following 4 values: document-frequency for a
> term in the result-set, document-frequency for the result-set,
> document-frequency for a term in the index (or base-domain),
> document-frequency in the index (or base-domain).
>
> I guess this feature might be of interest for those who want to do some
> small-scale term-analysis in addition to search, e.g. as in my case in
> digital humanities projects. But it might also be an interesting navigation
> device, e.g. when searching on job-offers to show the skills that are most
> distinctive for a category.
>
> It would be great to know if others are interested in this feature. If
> there are any implementations out there or if anybody else is working on
> this, a pointer would be a great start. In the absence of existing
> solutions: perhaps somebody has some idea on where and how to start
> implementing this?
>
> Best regards,
>
> Ben
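
For illustration only, here is a minimal client-side sketch of the scoring proposed in the quoted message: document frequency of a term in the result set multiplied by its inverse document frequency in the collection. It is not the SOLR-9193 implementation. The function names, the smoothed log form of the IDF, and the sample counts are all assumptions made up for this example; the quoted message does not fix a formula.

```python
import math
from typing import Dict, List, Tuple


def interestingness(fg_df: int, bg_df: int, bg_size: int) -> float:
    """Score one term: foreground doc frequency times background IDF.

    fg_df:   documents in the result set (bucket) containing the term
    bg_df:   documents in the whole collection containing the term
    bg_size: number of documents in the whole collection

    The smoothed log IDF used here is an assumption, not Solr's formula.
    """
    idf = math.log((1 + bg_size) / (1 + bg_df)) + 1.0
    return fg_df * idf


def rank_terms(
    fg_counts: Dict[str, int],
    bg_counts: Dict[str, int],
    bg_size: int,
    limit: int = 5,
) -> List[Tuple[str, float]]:
    """Return up to `limit` (term, score) pairs, most interesting first."""
    scored = [
        (term, interestingness(df, bg_counts[term], bg_size))
        for term, df in fg_counts.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Hypothetical facet counts: a "horror" result set versus a
# 10,000-document collection.
fg_counts = {"the": 90, "blood": 40, "haunted": 25}
bg_counts = {"the": 9900, "blood": 120, "haunted": 30}
bg_size = 10_000

for term, score in rank_terms(fg_counts, bg_counts, bg_size):
    print(f"{term}\t{score:.1f}")
```

With these made-up counts, sorting by plain facet count would put "the" first, while the score above ranks "blood" and "haunted" ahead of it, which is the effect the quoted message asks for. This only works client-side when both sets of counts can be fetched, i.e. the small-vocabulary case mentioned in question a).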