On 3/6/2018 10:16 AM, Moncif Aidi wrote:
I am using Solr to power faceting features for our application.

I know that SOLR can do free text search but what is the best practice for
faceting on common terms inside SOLR text fields?

Based on everything below, there might be a little bit of confusion about exactly what faceting can offer you.  It is an enormously powerful feature, and generally has impressive performance.  But there are limitations, and sometimes performance is not what people expect.

As your other reply mentioned, configuring docValues on a field is recommended for performance and other reasons with faceting.  But when you're dealing with a field set up for full-text search, that recommendation generally has to be ignored, because you can't configure docValues on a field using the TextField class.

For example, we have a large blob of text (a description of a property)
which contains useful text to facet on like 'city', 'formation', 'year',
'school', 'skill', ... dozens more like these.

When you have a "large blob of text" there are generally two choices for the information in a facet.

One is the entirety of the blob, which usually means that every single document has a unique value.  In that case facets are pretty much useless and will have terrible performance: nearly every entry in the facet is going to have "1" for the count, because only one document has each value.

The other is the individual terms (usually words) in the text.  This is also generally useless for facets, and usually has terrible performance.  Knowing that there are 100 million documents that have "the" in the field somewhere is not very useful.
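
To make those two cases concrete, here is a small Python sketch that imitates facet counting outside of Solr.  The three sample descriptions are made up, and a plain whitespace split stands in for a real analyzer, so treat it as an illustration only, not as what Solr does internally.

```python
from collections import Counter

# Made-up sample data; each string stands for the text field of one document.
docs = [
    "The house is near the school in Paris",
    "The flat is in the centre of Lyon",
    "The studio is near the station in Paris",
]

def facet_counts(values_per_doc):
    """Count how many documents contain each value, as a facet does."""
    counts = Counter()
    for values in values_per_doc:
        counts.update(set(values))  # a document counts once per value
    return counts

# Choice 1: the whole blob is the facet value.
whole_blob = facet_counts([[d] for d in docs])

# Choice 2: each lowercased word is a facet value.
per_term = facet_counts([d.lower().split() for d in docs])

print(whole_blob.most_common(3))
print(per_term.most_common(3))
```

Every whole-blob entry comes back with a count of 1, and the top per-term entries are words like "the", "is" and "in", which tell the user nothing.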

One obvious solution is to pre-process the data, parse the text, and create
the facets for each of these key phrases with a boolean yes/no value.

I'd ideally like to automate this, so I imagine the SOLR free text search
engine might allow this? e.g. Can I use the free text search engine to
remove stop words and collect counts of common phrases which we can then
present to the user?

And now you've mentioned that what you want is *phrases*. How do you suggest Solr obtain this information?  There are no filters included with Solr that can figure out that one section of a few words is NOT a phrase that people will be interested in, but another IS.

To get counts of the documents that contain a phrase, you have to have something that can extract phrases from the big blob of text and add them to another field, usually of type "string" -- using class StrField.  This probably has to happen in your indexing pipeline, not in Solr.
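
A minimal sketch of what such a pipeline step might look like, in Python.  Everything named here is an assumption for illustration: the KNOWN_PHRASES list, the "description" field, and the "phrases_ss" field (assumed to be a multi-valued StrField in your schema).  It only matches phrases from a list you supply; deciding which phrases belong on that list is still up to you.

```python
import re

# Hypothetical vocabulary of phrases worth faceting on; in practice this
# would come from your own domain knowledge or a separate extraction step.
KNOWN_PHRASES = ["primary school", "city centre", "project management"]

def extract_phrases(text, phrases=KNOWN_PHRASES):
    """Return the known phrases that occur in text as whole words."""
    # Lowercase and collapse runs of whitespace so matching is forgiving.
    normalized = re.sub(r"\s+", " ", text.lower())
    found = []
    for phrase in phrases:
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
            found.append(phrase)
    return found

def build_solr_doc(doc_id, description):
    """Build the document to send to Solr.

    "description" and "phrases_ss" are assumed field names: the first a
    TextField for searching, the second a multi-valued StrField for facets.
    """
    return {
        "id": doc_id,
        "description": description,
        "phrases_ss": extract_phrases(description),
    }

doc = build_solr_doc(
    "1", "Bright flat near a primary  school, close to the city centre."
)
print(doc["phrases_ss"])  # ['primary school', 'city centre']
```

The resulting dictionary would then be sent to Solr by whatever client your pipeline already uses.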

Then when you facet on that field, Solr will count the documents for each value and give you that information.
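
As a rough example, the request for those counts could be built like this in Python.  The collection name "properties" and the field name "phrases_ss" are placeholders, not something from your setup; the parameters themselves (facet, facet.field, facet.mincount, facet.limit, rows) are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Placeholder names: replace the collection and field with your own.
collection = "properties"
params = {
    "q": "*:*",                   # match all documents
    "rows": 0,                    # only the facet counts are wanted
    "facet": "true",
    "facet.field": "phrases_ss",  # the string field filled at index time
    "facet.mincount": 1,          # skip values that match no documents
    "facet.limit": 20,            # return at most 20 values
}
query_string = urlencode(params)
print("/solr/" + collection + "/select?" + query_string)
```

Setting rows to 0 keeps the response small when all you need is the list of values and their counts.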

Thanks,
Shawn
