Re: Facet full-text

Shawn Heisey Tue, 06 Mar 2018 19:35:22 -0800

On 3/6/2018 10:16 AM, Moncif Aidi wrote:

I am using Solr to power faceting features for our application.


I know that SOLR can do free text search but what is the best practice for
faceting on common terms inside SOLR text fields?

Based on everything below, there might be a little bit of confusion about exactly what faceting can offer you. It is an enormously powerful feature, and generally has impressive performance. But there are limitations, and sometimes performance is not what people expect.

As your other reply mentioned, configuring docValues on a field is recommended for performance and other reasons with faceting. But when you're dealing with a field set up for full-text search, that recommendation generally has to be ignored, because you can't configure docValues on a field using the TextField class.

For example, we have a large blob of text (a description of a property)
which contains useful text to facet on like 'city', 'formation', 'year',
'school', 'skill', ... dozens more like these.

When you have a "large blob of text" there are generally two choices for the information in a facet.

One is the entirety of the blob, which usually means that every single document has a unique value. In that case, facets are pretty much useless and will have terrible performance. They are useless because every entry in the facet will probably have a count of "1", since only one document has each value.

The other is the individual terms (usually words) in the text. This is also generally useless for facets, and usually has terrible performance. Knowing that there are 100 million documents that have "the" in the field somewhere is not very useful.
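
To make those two choices concrete, here is a rough sketch in plain Python rather than Solr. The three sample descriptions are made up for illustration, and the counting only imitates what a facet would report; it is not how Solr computes facets internally.

```python
from collections import Counter

# Hypothetical sample documents, invented for this illustration.
docs = [
    "The house is near the school in Austin",
    "The flat is close to the park in Boston",
    "The cottage is by the lake in Austin",
]

# Choice 1: facet on the whole blob. Every document has a unique value,
# so every facet entry ends up with a count of 1.
whole_blob_counts = Counter(docs)

# Choice 2: facet on individual terms. Each term is counted once per
# document, so very common words such as "the" get the highest counts.
term_counts = Counter()
for text in docs:
    term_counts.update(set(text.lower().split()))

print(whole_blob_counts.most_common(3))
print(term_counts.most_common(5))
```

In this toy data, "the" appears in all three documents while "austin" appears in two, so the interesting value is crowded out by words nobody wants to facet on.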

One obvious solution is to pre-process the data, parse the text, and create
the facets for each of these key phrases with a boolean yes/no value.

I'd ideally like to automate this, so I imagine the SOLR free text search
engine might allow this? e.g. Can I use the free text search engine to
remove stop words and collect counts of common phrases which we can then
present to the user?

And now you've mentioned that what you want is *phrases*. How do you suggest Solr obtain this information? There are no filters included with Solr that can figure out that one section of a few words is NOT a phrase that people will be interested in, but another IS.

To get document counts that include a phrase, you have to have something that can extract phrases from the big blob of text and add them to another field, usually of type "string" -- using class StrField. This probably has to happen in your indexing pipeline, not in Solr.
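
One way that extraction step might look in an indexing pipeline is sketched below. It assumes you maintain your own list of phrases worth faceting on; the phrase list, the field names "description" and "key_phrases", and both function names are hypothetical, and a simple vocabulary match is only one possible approach.

```python
import re

# Hypothetical controlled vocabulary: the phrases you have decided are
# worth faceting on. Solr does not produce this list for you.
KEY_PHRASES = ["primary school", "city centre", "built in 1990"]


def extract_key_phrases(text, phrases=KEY_PHRASES):
    """Return the vocabulary phrases found in text, ignoring case."""
    found = []
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(phrase)
    return found


def build_solr_doc(doc_id, description):
    """Build a document dict to send to Solr from the pipeline.

    "description" is assumed to be the full-text field and "key_phrases"
    an assumed multi-valued string (StrField) field used only for facets.
    """
    return {
        "id": doc_id,
        "description": description,
        "key_phrases": extract_key_phrases(description),
    }


doc = build_solr_doc("1", "Close to a Primary School and the city centre.")
print(doc["key_phrases"])
```

The full text still goes into the TextField for searching; only the extracted phrases go into the separate string field.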

Then when you facet on that field, Solr will count the documents for each value and give you that information.
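
As a sketch of what such a facet request could look like, the snippet below builds a select URL using the standard facet parameters (facet, facet.field, facet.mincount). The host, the collection name "properties", and the field name "key_phrases" are assumptions for illustration. The helper at the end assumes the default JSON response layout, where each facet field comes back as a flat list of alternating values and counts.

```python
from urllib.parse import urlencode


def facet_url(base_url, field):
    """Build a facet-only query URL: match all documents, return no rows."""
    params = {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,
    }
    return base_url + "/select?" + urlencode(params)


def pairs_from_flat(flat):
    """Turn [value, count, value, count, ...] into a {value: count} dict."""
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical host and collection name.
url = facet_url("http://localhost:8983/solr/properties", "key_phrases")
print(url)

# Hand-written example of the flat list shape, not real Solr output.
print(pairs_from_flat(["primary school", 2, "city centre", 1]))
```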


Thanks,
Shawn
