What you're describing is implemented with graph aggregations, scored with
tf-idf, in this ticket. Other scoring methods can be implemented as well.

https://issues.apache.org/jira/browse/SOLR-9193

I'll update this thread with a description of how this can be used with the
facet() streaming expression as well as with graph queries later today.



Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 3, 2016 at 8:18 AM, <heuw...@uni-hildesheim.de> wrote:

> Dear everybody,
>
> as the JSON-API now makes configuration of facets and sub-facets easier,
> there appears to be a lot of potential to enable instant calculation of
> facet-recommendations for a query, that is, to sort facets by their
> relative importance/interestingness/significance for a current query relative
> to the complete collection or relative to a result set defined by a
> different query.
>
> An example would be to show the most typical terms which are used in
> descriptions of horror-movies, in contrast to the most popular ones for
> this query, as these may include terms that occur as often in other genres.
>
> This feature has been discussed earlier in the context of Solr:
> * http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
> * http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html
>
> In Elasticsearch, the specific feature that I am looking for is called
> Significant Terms Aggregation:
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation
>
> As of now, I have two questions:
>
> a) Are there workarounds in the current Solr implementation, or known
> patches, that implement such a sort option for fields with a large number of
> possible values, e.g. text fields? (For smaller vocabularies it is easy to
> do this client-side with two queries.)
> b) Are there plans to implement this in facet.pivot or in the JSON Facet
> API?
>
> The first step could be to define "interestingness" as a sort option for
> facets, computed from the facet count in the result set weighted against
> the complete collection:
>   documentfrequency_termX(bucket) * inverse_documentfrequency_termX(collection)
>
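A minimal client-side sketch of the score proposed above, written in Python. It assumes the facet counts for the bucket and for the whole collection have already been fetched (for example with the two queries mentioned earlier). The function name, the sample counts, and the choice of log(N / df) as the inverse document frequency are assumptions made for illustration; the original message does not fix an IDF variant.

```python
import math


def rank_by_interestingness(bucket_df, collection_df, collection_size, top_n=5):
    """Rank terms by df(bucket) * idf(collection).

    bucket_df:       term -> document frequency inside the result set
    collection_df:   term -> document frequency in the whole collection
    collection_size: number of documents in the whole collection
    """
    scored = []
    for term, df_bucket in bucket_df.items():
        # Assumed IDF variant: natural log of N / df.
        idf = math.log(collection_size / collection_df[term])
        scored.append((term, df_bucket * idf))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical counts for a "horror" result set in a 10,000-document index.
horror = {"the": 900, "blood": 300, "love": 200}
everything = {"the": 9900, "blood": 400, "love": 3000}

ranked = rank_by_interestingness(horror, everything, collection_size=10000)
for term, score in ranked:
    print(f"{term}: {score:.1f}")
```

With these made-up counts, "the" has the highest raw facet count but the lowest score, because it is nearly as common in the collection as a whole.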
> As an extension, the JSON-API could be used to change the domain used as
> base for the comparison. Another interesting option would be to compare
> facet-counts against a current parent-facet for nested facets, e.g. the 5
> most interesting terms by genre for a query on 70s movies, returning the
> terms specific to horror, comedy, action etc. compared to all terminology
> at the time (i.e. in the parent-query).
>
> A callback function could be used to define other measures of
> interestingness such as the log-likelihood ratio (
> http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html). Most
> measures need at least the following four values: the document frequency of
> a term in the result set, the number of documents in the result set, the
> document frequency of the term in the index (or base domain), and the
> number of documents in the index (or base domain).
>
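To illustrate how the four values listed above feed into such a measure, here is a sketch of the log-likelihood ratio in its entropy formulation, written in Python. It assumes the result set is a subset of the index, so that the four values can be turned into a 2x2 contingency table. The function and parameter names are hypothetical and the counts are made up.

```python
import math


def _x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return x * math.log(x) if x > 0 else 0.0


def _entropy(*counts):
    # Unnormalized entropy of a list of counts.
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)


def log_likelihood_ratio(df_term_result, result_size, df_term_index, index_size):
    """LLR for a term, assuming the result set is a subset of the index."""
    k11 = df_term_result                   # in result set, has term
    k12 = df_term_index - df_term_result   # outside result set, has term
    k21 = result_size - df_term_result     # in result set, lacks term
    k22 = index_size - result_size - k12   # outside result set, lacks term
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


# Hypothetical counts: 1,000 horror documents in a 10,000-document index.
common = log_likelihood_ratio(990, 1000, 9900, 10000)   # e.g. "the"
typical = log_likelihood_ratio(300, 1000, 400, 10000)   # e.g. "blood"
print(common, typical)
```

Note that the ratio is two-sided: a term that is strongly under-represented in the result set also gets a high score, so a facet sort would probably want to check the direction of the difference as well.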
> I guess this feature might be of interest for those who want to do some
> small-scale term-analysis in addition to search, e.g. as in my case in
> digital humanities projects. But it might also be an interesting navigation
> device, e.g. when searching on job-offers to show the skills that are most
> distinctive for a category.
>
> It would be great to know if others are interested in this feature. If
> there are any implementations out there, or if anybody else is working on
> this, a pointer would be a great start. In the absence of existing
> solutions, perhaps somebody has an idea of where and how to start
> implementing this?
>
> Best regards,
>
> Ben
>
>
>
