Hi Jessy,

Several ShingleFilter(Factory) improvements, including the ability to specify minShingleSize, were introduced in Solr/Lucene 3.x, and so are not available in Solr 1.4.X/Lucene 2.9.X. (This is your #1 issue.)
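To make the #1 issue concrete, here is a rough Python sketch (not Lucene code) of how the lower bound changes the output. It assumes that the 1.4.X filter does not recognize minShingleSize and therefore emits every shingle size from 2 up to maxShingleSize; the `shingles` function, its parameters, and the sample sentence are made up for illustration.

```python
def shingles(tokens, min_size, max_size):
    """Return word n-grams of every size from min_size to max_size, in token order."""
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


tokens = ["the", "speaker", "of", "the", "house"]

# Assumed 1.4.X behavior: minShingleSize is ignored, so the lower bound stays at 2
# and bigrams and trigrams show up next to the 4-grams.
old = shingles(tokens, 2, 4)

# 3.x behavior with minShingleSize="4" and maxShingleSize="4": only 4-grams.
new = shingles(tokens, 4, 4)

print(old)
print(new)  # ['the speaker of the', 'speaker of the house']
```

Under that assumption, a field meant to hold only 4-grams would also collect the smaller shingles, which matches the symptom described in the original message.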
For details about the changes and when they were introduced:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory

The "_"-only tokens you're seeing, which are likely the result of placeholder tokens where stopwords used to be, are also fixed under Solr/Lucene 3.x, so that only shingles with at least one "real" token are output. (This is your #2 issue.)

Steve

> -----Original Message-----
> From: Jessy Kate [mailto:jessy.cowansh...@gmail.com]
> Sent: Monday, November 22, 2010 3:33 PM
> To: solr-user@lucene.apache.org
> Subject: Shingles and Delimiter Help
>
> Hello Solr community,
>
> I'm using Solr for an app to index documents, with shingles to index
> n-grams (right now 2- 3- and 4-grams). this is solr 1.4.1 with lucene
> 2.9.3. i'm having two challenges:
>
> 1. the shingles configuration is not respecting the lower limit set in the
> config file:
>
> <filter class="solr.ShingleFilterFactory"
>         minShingleSize="3"
>         maxShingleSize="3"
>         outputUnigrams="false"
> />
>
> I still see bi-grams and tri-grams in the 4-gram results, for example.
> This install was assembled a few months ago-- so perhaps it was a bug
> that's been fixed? (I looked then and did not find anything, but know
> it was a relatively new feature).
>
>
> 2. the second is that for some reason the delimiters appear to be
> getting indexed with my n-gram tokens (except unigrams), so that i get
> a lot of search results for ____ xxxxx, where xxxxx is a real word in
> my documents. i'm sure this is just a misunderstanding of the docs on
> my part, but i just can't seem to figure out how to do this right.
> Here is the configuration stanza for bigrams (it is equivalent for
> tri-grams and 4-grams):
>
>
> <fieldType name="bigrams" class="solr.TextField" positionIncrementGap="1">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1"
>             generateNumberParts="1" catenateWords="0"
>             catenateNumbers="0"
>             catenateAll="0" splitOnCaseChange="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory"
>             ignoreCase="true"
>             words="stopwords.txt"
>             enablePositionIncrements="true"
>     />
>     <filter class="solr.ShingleFilterFactory"
>             minShingleSize="2"
>             maxShingleSize="2"
>             outputUnigrams="false"
>     />
>   </analyzer>
> </fieldType>
>
>
>
> an example output for bigrams:
>
>
> facet_counts: {
>   facet_queries: { }
>   facet_fields: {
>     bigrams: [
>       "_ _", 67567,
>       "_ speaker", 18932,
>       "speaker _", 16186,
>       "_ bill", 14513,
>       "_ house", 14058,
>       "bill _", 13205,
>       "_ time", 13021,
>       "time _", 12239,
>       "house _", 10704,
>       "today _", 10577
>     ]
>   }
>
>
>
> the "positionIncrementGap" for the copyField i use to store the main
> searchable fields in, is actually set to 100, so i thought that might
> be it, but i tried modifying that and it didn't solve the problem.
>
>
> any help on either issue would be greatly appreciated. happy to
> provide any other details. the full config file is available at:
>
> https://github.com/sunlightlabs/Capitol-Words/blob/master/solr/schema.xml
>
>
> thank you in advance!
>
> jessy
>
>
> --
> Jessy Cowan-Sharp
> http://jessykate.com