Re: shingles + stop words

Emir Arnautović Mon, 10 Dec 2018 07:10:29 -0800

Hi David,
As you already observed, shingles concatenate tokens based on their positions, 
so a removed stopword leaves a gap that shows up as an empty string (you can 
configure it to be something else with the fillerToken option).
You can do the following:
1. If you do not have too many stopwords, you could use 
PatternReplaceCharFilter to remove stopwords before the text hits the tokenizer. 
That way stopwords will not increase positions and it’ll result in the expected 
shingles. This way you will lose the managed part of stopwords and will have to 
reload cores in order to change stopwords.
2. Customise the stopword filter not to increment positions when it finds a stopword.
3. Customise the shingle filter to be able to add the desired flag.
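
For option 1, a minimal sketch of the field type might look like the one below. 
The stopword list in the pattern (the, of, and) is only an illustration and not 
your managed list, and the case-insensitive match is my assumption. The pattern 
has to be kept in sync with your real stopwords by hand.

```xml
<fieldType name="skw2" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Illustrative stopword list only: replace (the|of|and) with your own words.
         Stopwords are stripped from the raw text before tokenizing, so they never
         occupy a token position and the shingle filter sees no gaps. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?i)\b(the|of|and)\b" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- ManagedStopFilterFactory is dropped here: the char filter takes over its job. -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="2" outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

With a definition along these lines, "nonstopword1 the of nonstopword2" should 
tokenize to two adjacent tokens and give the single shingle you were expecting.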


HTH,
Emir 
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 Dec 2018, at 15:18, David Hastings <hastings.recurs...@gmail.com> wrote:
> 
> Hey there, I have a field type defined as such:
> <fieldType name="skw2" class="solr.TextField" positionIncrementGap="100">
>    <analyzer>
>      <tokenizer class="solr.StandardTokenizerFactory"/>
>      <filter class="solr.ManagedStopFilterFactory" managed="english"/>
>      <filter class="solr.ShingleFilterFactory" minShingleSize="2"
> outputUnigrams="false" fillerToken="" maxShingleSize="2"/>
>    </analyzer>
>  </fieldType>
> 
> but what's happening is that the shingles being returned are oftentimes
> " nonstopword"
> with the space being defined as the filter token.  I was hoping that the
> ManagedStopFilterFactory would have removed the stop words completely
> before going to the shingle factory, and would have returned "nonstopword1
> nonstopword2" with an indexed value of
> "nonstopword1 stopword1 stopword2 nonstopword2", but obviously that isn't the
> case.  Is there a way to force it as such?
> 
> Thanks, David
