Shingles and Delimiter Help

Jessy Kate Mon, 22 Nov 2010 12:33:00 -0800

Hello Solr community,

I'm using Solr in an app to index documents, with shingles to index n-grams
(right now 2-, 3-, and 4-grams). This is Solr 1.4.1 with Lucene 2.9.3. I'm
having two challenges:


1. The shingle configuration is not respecting the lower limit set in the
config file:

<filter class="solr.ShingleFilterFactory"
        minShingleSize="3"
        maxShingleSize="3"
        outputUnigrams="false"/>

I still see bigrams and trigrams in the 4-gram results, for example.
This install was assembled a few months ago, so perhaps it was a bug
that has since been fixed? (I looked at the time and did not find
anything, but I know it was a relatively new feature.)


2. For some reason the delimiters appear to be getting indexed with my
n-gram tokens (all sizes except unigrams), so I get a lot of search
results for "_ xxxxx", where xxxxx is a real word in my documents. I'm
sure this is just a misunderstanding of the docs on my part, but I
can't figure out how to do this right. Here is the configuration
stanza for bigrams (it is equivalent for trigrams and 4-grams):


<fieldType name="bigrams" class="solr.TextField" positionIncrementGap="1">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.ShingleFilterFactory"
                minShingleSize="2"
                maxShingleSize="2"
                outputUnigrams="false"/>
    </analyzer>
</fieldType>



An example of the facet output for bigrams:


facet_counts: {
   facet_queries: { }
   facet_fields: {
      bigrams: [
         "_ _", 67567,
         "_ speaker", 18932,
         "speaker _", 16186,
         "_ bill", 14513,
         "_ house", 14058,
         "bill _", 13205,
         "_ time", 13021,
         "time _", 12239,
         "house _", 10704,
         "today _", 10577
      ]
   }



The "positionIncrementGap" for the copyField I use to store the main
searchable fields is actually set to 100, so I thought that might be
the cause, but I tried modifying it and that didn't solve the problem.
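
One possible source of the "_" tokens, offered as a guess rather than a
diagnosis: Lucene's ShingleFilter emits a "_" filler token for each empty
position it sees, and StopFilterFactory with enablePositionIncrements="true"
leaves such empty positions where stopwords were removed. The Python sketch
below is a simplified model of that interaction, not the Lucene
implementation; the function name, the sample sentence, and the stopword set
are all made up for illustration, and the real filter may treat gaps at the
end of the stream differently.

```python
FILLER = "_"  # stands in for the filler a shingle filter emits for an empty position


def shingles(tokens, stopwords, size):
    """Simplified model: each removed stopword leaves a gap, each gap becomes
    FILLER, and a fixed-size window then slides over the resulting slots."""
    slots = [FILLER if t.lower() in stopwords else t.lower() for t in tokens]
    return [" ".join(slots[i:i + size]) for i in range(len(slots) - size + 1)]


if __name__ == "__main__":
    stop = {"the", "of"}  # hypothetical stopword list
    for shingle in shingles("the speaker of the house".split(), stop, 2):
        print(shingle)
```

In this model every bigram that touches a removed stopword contains "_",
which matches the shape of the facet output above. If that is what is
happening, setting enablePositionIncrements="false" on the StopFilterFactory
would be one thing to try, since the shingle filter would then see no gaps;
this has not been verified against Solr 1.4.1.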


Any help on either issue would be greatly appreciated. I'm happy to
provide any other details. The full config file is available at:

https://github.com/sunlightlabs/Capitol-Words/blob/master/solr/schema.xml


Thank you in advance!

jessy


-- 
Jessy Cowan-Sharp
http://jessykate.com
