Hi David,

As you already observed, shingles are built by concatenating tokens based on their positions. A removed stopword still leaves a position gap, and the shingle filter fills that gap with the filler token, which is an empty string in your config (you can set it to something else with the fillerToken option).

You can do one of the following:

1. If you do not have too many stopwords, you could use PatternReplaceCharFilter to remove them before the text hits the tokenizer. That way stopwords will not increase positions, and you get the expected shingles. The downside is that you lose the managed part of the stopwords and have to reload cores in order to change them.
2. Customise the stopword filter so that it does not increment positions when it finds a stopword.
3. Customise the shingle filter to add a flag for the desired behaviour.
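
For option 1, here is a rough sketch of how the field type might look. The stopword list in the pattern (the, of, and) is only a placeholder I made up for illustration, not your managed "english" set, and the pattern as written is case-sensitive. The managed stop filter is dropped from the chain, since the char filter has already removed the words by the time the tokenizer runs.

```xml
<!-- Sketch only: (the|of|and) is a placeholder list, not the managed "english" set -->
<fieldType name="skw2" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer, so removed words never get a position -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b(the|of|and)\b\s*" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            outputUnigrams="false" maxShingleSize="2"/>
  </analyzer>
</fieldType>
```

With this chain, an input like "nonstopword1 the of nonstopword2" should reach the tokenizer as "nonstopword1 nonstopword2", so the only shingle is built from the two adjacent tokens and no filler is needed. Any change to the list means editing the pattern and reloading the core.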

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 7 Dec 2018, at 15:18, David Hastings <hastings.recurs...@gmail.com> wrote:
>
> Hey there, I have a field type defined as such:
> <fieldType name="skw2" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ManagedStopFilterFactory" managed="english"/>
>     <filter class="solr.ShingleFilterFactory" minShingleSize="2"
>             outputUnigrams="false" fillerToken="" maxShingleSize="2"/>
>   </analyzer>
> </fieldType>
>
> but whats happening is the shingles being returned are often times "
> nonstopword"
> with the space being defined as the filter token. I was hoping that the
> ManagedStopFilterFactory would have removed the stop words completely
> before going to the shingle factory, and would have returned "nonstopword1
> nonstopword2" with an indexed value of
> "nonstopword1 stopword1 stopword2 nonstopword2" but obviously isnt the
> case. is there a way to force it as such?
>
> Thanks, David