Hi Jessy,

Several ShingleFilter(Factory) improvements, including the ability to specify 
minShingleSize, were introduced in Solr/Lucene 3.x, and so are not 
available in Solr 1.4.X/Lucene 2.9.X. (This is your #1 issue.)

For details about the changes and when they were introduced: 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory

The "_"-only tokens you're seeing are likely the result of placeholder 
tokens inserted where stopwords used to be. This is also fixed in Solr/Lucene 
3.x, so that only shingles with at least one "real" token are output. (This is 
your #2 issue.)
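
To make both points concrete, here is a rough Python sketch of the behaviour
described above. It is a simplified model, not Lucene's actual ShingleFilter
code: the names FILLER, remove_stopwords, fill_gaps, shingles and
require_real_token are made up for illustration, and the sample sentence and
stopword set are invented. It assumes that stopword removal leaves position
gaps and that each gap is padded with a "_" placeholder before shingling.

```python
FILLER = "_"  # placeholder string, matching the "_" seen in the facet output


def remove_stopwords(words, stopwords):
    """Drop stopwords but keep each surviving token's original position."""
    return [(w, i) for i, w in enumerate(words) if w not in stopwords]


def fill_gaps(tokens):
    """Insert one FILLER for every position skipped before a token."""
    filled, expected = [], 0
    for word, pos in tokens:
        filled.extend([FILLER] * (pos - expected))
        filled.append(word)
        expected = pos + 1
    return filled


def shingles(stream, min_size, max_size, require_real_token):
    """Build word n-grams of min_size..max_size tokens from the stream.

    With require_real_token=True, shingles made only of FILLER are skipped
    (a model of the 3.x behaviour described above).
    """
    out = []
    for start in range(len(stream)):
        for size in range(min_size, max_size + 1):
            window = stream[start:start + size]
            if len(window) < size:
                break  # not enough tokens left for this or any larger size
            if require_real_token and all(t == FILLER for t in window):
                continue
            out.append(" ".join(window))
    return out


# Invented sample input: "the" and "of" play the role of stopwords.
words = "the speaker of the house".split()
stream = fill_gaps(remove_stopwords(words, {"the", "of"}))

# Modelled 1.4.x behaviour: filler-only shingles such as "_ _" are emitted.
old_bigrams = shingles(stream, 2, 2, require_real_token=False)

# Modelled 3.x behaviour: every shingle has at least one real token.
new_bigrams = shingles(stream, 2, 2, require_real_token=True)

print(old_bigrams)
print(new_bigrams)
```

In this model, calling shingles(stream, 2, 3, False) roughly corresponds to
what you see on 1.4.x when you ask for trigrams only: without minShingleSize
support, the smaller shingles come along too.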

Steve

> -----Original Message-----
> From: Jessy Kate [mailto:jessy.cowansh...@gmail.com]
> Sent: Monday, November 22, 2010 3:33 PM
> To: solr-user@lucene.apache.org
> Subject: Shingles and Delimiter Help
> 
> Hello Solr community,
> 
> I'm using Solr for an app to index documents, with shingles to index
> n-grams (right now 2-, 3- and 4-grams). this is solr 1.4.1 with lucene
> 2.9.3. i'm having two challenges:
> 
> 1. the shingles configuration is not respecting the lower limit set in the
> config file:
> 
> <filter class="solr.ShingleFilterFactory"
>                 minShingleSize="3"
>                 maxShingleSize="3"
>                 outputUnigrams="false"
>             />
> 
> I still see bi-grams and tri-grams in the 4-gram results, for example.
> This install was assembled a few months ago-- so perhaps it was a bug
> that's been fixed? (I looked then and did not find anything, but know
> it was a relatively new feature).
> 
> 
> 2. the second is that for some reason the delimiters appear to be
> getting indexed with my n-gram tokens (except unigrams), so that i get
> a lot of search results for ____ xxxxx, where xxxxx is a real word in
> my documents. i'm sure this is just a misunderstanding of the docs on
> my part, but i just can't seem to figure out how to do this right.
> Here is the configuration stanza for bigrams (it is equivalent for
> tri-grams and 4-grams):
> 
> 
> <fieldType name="bigrams" class="solr.TextField" positionIncrementGap="1">
>         <analyzer>
>             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>             <filter class="solr.WordDelimiterFilterFactory"
>                     generateWordParts="1" generateNumberParts="1"
>                     catenateWords="0" catenateNumbers="0"
>                     catenateAll="0" splitOnCaseChange="0"/>
>             <filter class="solr.LowerCaseFilterFactory"/>
>             <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="stopwords.txt"
>                 enablePositionIncrements="true"
>                 />
>              <filter class="solr.ShingleFilterFactory"
>                 minShingleSize="2"
>                 maxShingleSize="2"
>                 outputUnigrams="false"
>             />
>        </analyzer>
>     </fieldType>
> 
> 
> 
> an example output for bigrams:
> 
> 
> facet_counts: {
> 
>    - facet_queries: { }
>    - -
>    facet_fields: {
>       - -
>       bigrams: [
>          - "_ _"
>          - 67567
>          - "_ speaker"
>          - 18932
>          - "speaker _"
>          - 16186
>          - "_ bill"
>          - 14513
>          - "_ house"
>          - 14058
>          - "bill _"
>          - 13205
>          - "_ time"
>          - 13021
>          - "time _"
>          - 12239
>          - "house _"
>          - 10704
>          - "today _"
>          - 10577
>       ]
>    }
> 
> 
> 
> the "positionIncrementGap" for the copyField i use to store the main
> searchable fields in, is actually set to 100, so i thought that might
> be it, but i tried modifying that and it didn't solve the problem.
> 
> 
> any help on either issue would be greatly appreciated. happy to
> provide any other details. the full config file is available at:
> 
> https://github.com/sunlightlabs/Capitol-Words/blob/master/solr/schema.xml
> 
> 
> thank you in advance!
> 
> jessy
> 
> 
> --
> Jessy Cowan-Sharp
> http://jessykate.com
