fantastic, thanks! i'll update the release and keep my fingers crossed. many thanks for the speedy response.
jessy

On Mon, Nov 22, 2010 at 4:53 PM, Steven A Rowe <sar...@syr.edu> wrote:

> Hi Jessy,
>
> Several ShingleFilter(Factory) improvements, including the ability to
> specify minShingleSize, were introduced on the Solr/Lucene 3.x, and so are
> not available in Solr 1.4.X/Lucene 2.9.X. (This is your #1 issue.)
>
> For details about the changes and when they were introduced:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory
>
> The "_"-only tokens you're seeing, which are likely the result of
> placeholder tokens where stopwords used to be, is also fixed under
> Solr/Lucene 3.x, so that only shingles with at least one "real" token are
> output. (This is your #2 issue.)
>
> Steve
>
> > -----Original Message-----
> > From: Jessy Kate [mailto:jessy.cowansh...@gmail.com]
> > Sent: Monday, November 22, 2010 3:33 PM
> > To: solr-user@lucene.apache.org
> > Subject: Shingles and Delimiter Help
> >
> > Hello Solr community,
> >
> > I'm using Solr for an app to index documents, with shingles to index
> > n-grams (right now 2- 3- and 4-grams). this is solr 1.4.1 with lucene
> > 2.9.3. i'm having two challenges:
> >
> > 1. the shingles configuration is not respecting the lower limit set in
> > the config file:
> >
> > <filter class="solr.ShingleFilterFactory"
> >         minShingleSize="3"
> >         maxShingleSize="3"
> >         outputUnigrams="false"
> > />
> >
> > I still see bi-grams and tri-grams in the 4-gram results, for example.
> > This install was assembled a few months ago-- so perhaps it was a bug
> > that's been fixed? (I looked then and did not find anything, but know
> > it was a relatively new feature).
> >
> > 2. the second is that for some reason the delimiters appear to be
> > getting indexed with my n-gram tokens (except unigrams), so that i get
> > a lot of search results for ____ xxxxx, where xxxxx is a real word in
> > my documents. i'm sure this is just a misunderstanding of the docs on
> > my part, but i just can't seem to figure out how to do this right.
> > Here is the configuration stanza for bigrams (it is equivalent for
> > tri-grams and 4-grams):
> >
> > <fieldType name="bigrams" class="solr.TextField" positionIncrementGap="1">
> >   <analyzer>
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.WordDelimiterFilterFactory"
> >             generateWordParts="1"
> >             generateNumberParts="1" catenateWords="0"
> >             catenateNumbers="0"
> >             catenateAll="0" splitOnCaseChange="0"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.StopFilterFactory"
> >             ignoreCase="true"
> >             words="stopwords.txt"
> >             enablePositionIncrements="true"
> >     />
> >     <filter class="solr.ShingleFilterFactory"
> >             minShingleSize="2"
> >             maxShingleSize="2"
> >             outputUnigrams="false"
> >     />
> >   </analyzer>
> > </fieldType>
> >
> > an example output for bigrams:
> >
> > facet_counts: {
> >   facet_queries: { }
> >   facet_fields: {
> >     bigrams: [
> >       "_ _", 67567,
> >       "_ speaker", 18932,
> >       "speaker _", 16186,
> >       "_ bill", 14513,
> >       "_ house", 14058,
> >       "bill _", 13205,
> >       "_ time", 13021,
> >       "time _", 12239,
> >       "house _", 10704,
> >       "today _", 10577
> >     ]
> >   }
> >
> > the "positionIncrementGap" for the copyField i use to store the main
> > searchable fields in, is actually set to 100, so i thought that might
> > be it, but i tried modifying that and it didn't solve the problem.
> >
> > any help on either issue would be greatly appreciated. happy to
> > provide any other details. the full config file is available at:
> >
> > https://github.com/sunlightlabs/Capitol-Words/blob/master/solr/schema.xml
> >
> > thank you in advance!
> >
> > jessy
> >
> > --
> > Jessy Cowan-Sharp
> > http://jessykate.com

--
Jessy Cowan-Sharp
http://jessykate.com
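
For anyone reading this thread later, here is a toy Python model of the behaviour Steve describes: stopwords removed with enablePositionIncrements="true" leave position gaps, and the shingle filter fills each gap with a "_" placeholder. It is only a sketch under assumptions, not Lucene's actual implementation: the function names, the sample sentence, and the stopword set are all made up for illustration, and the `require_real_token` flag merely stands in for the 3.x rule that a shingle must contain at least one real token.

```python
FILLER = "_"  # placeholder shown where a stopword used to be


def remove_stopwords(tokens, stopwords):
    """Lowercase tokens and replace stopwords with None.

    None models the position gap that is preserved when
    enablePositionIncrements="true" (assumption: toy model only).
    """
    return [None if t.lower() in stopwords else t.lower() for t in tokens]


def shingles(slots, size, require_real_token):
    """Build fixed-size shingles over the token slots.

    require_real_token=False models the behaviour reported in the thread
    (filler-only shingles such as "_ _" are emitted).
    require_real_token=True models the fix Steve describes for 3.x:
    a shingle is kept only if it has at least one real token.
    """
    out = []
    for i in range(len(slots) - size + 1):
        window = slots[i:i + size]
        if require_real_token and all(t is None for t in window):
            continue  # skip shingles made only of placeholders
        out.append(" ".join(FILLER if t is None else t for t in window))
    return out


# Hypothetical sample input and stopword list, chosen for illustration.
sample = "Mr Speaker I rise in support of the bill".split()
stopwords = {"mr", "i", "in", "of", "the"}
slots = remove_stopwords(sample, stopwords)

old_style = shingles(slots, 2, require_real_token=False)
new_style = shingles(slots, 2, require_real_token=True)

print(old_style)
print(new_style)
```

In this model the "_ _" bigram disappears once filler-only shingles are dropped, while mixed shingles such as "_ speaker" remain, which matches Steve's wording that only shingles with at least one "real" token are output.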