Re: Shingles and Delimiter Help

Jessy Kate Mon, 22 Nov 2010 14:12:13 -0800

fantastic, thanks! i'll update the release and keep my fingers crossed. many
thanks for the speedy response.


jessy

On Mon, Nov 22, 2010 at 4:53 PM, Steven A Rowe <sar...@syr.edu> wrote:

> Hi Jessy,
>
> Several ShingleFilter(Factory) improvements, including the ability to
> specify minShingleSize, were introduced in Solr/Lucene 3.x, and so are
> not available in Solr 1.4.X/Lucene 2.9.X. (This is your #1 issue.)
>
> For details about the changes and when they were introduced:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory
>
> The "_"-only tokens you're seeing, which are likely the result of
> placeholder tokens where stopwords used to be, are also fixed under
> Solr/Lucene 3.x, so that only shingles with at least one "real" token are
> output.  (This is your #2 issue.)
>
> Steve
>
> > -----Original Message-----
> > From: Jessy Kate [mailto:jessy.cowansh...@gmail.com]
> > Sent: Monday, November 22, 2010 3:33 PM
> > To: solr-user@lucene.apache.org
> > Subject: Shingles and Delimiter Help
> >
> > Hello Solr community,
> >
> > I'm using Solr for an app to index documents, with shingles to index
> > n-grams (right now 2- 3- and 4-grams). this is solr 1.4.1 with lucene
> > 2.9.3. i'm having two challenges:
> >
> > 1. the shingles configuration is not respecting the lower limit set in
> > the config file:
> >
> > <filter class="solr.ShingleFilterFactory"
> >                 minShingleSize="3"
> >                 maxShingleSize="3"
> >                 outputUnigrams="false"
> >             />
> >
> > I still see bi-grams and tri-grams in the 4-gram results, for example.
> > This install was assembled a few months ago-- so perhaps it was a bug
> > that's been fixed? (I looked then and did not find anything, but know
> > it was a relatively new feature).
> >
> >
> > 2. the second is that for some reason the delimiters appear to be
> > getting indexed with my n-gram tokens (except unigrams), so that i get
> > a lot of search results for ____ xxxxx, where xxxxx is a real word in
> > my documents. i'm sure this is just a misunderstanding of the docs on
> > my part, but i just can't seem to figure out how to do this right.
> > Here is the configuration stanza for bigrams (it is equivalent for
> > tri-grams and 4-grams):
> >
> >
> > <fieldType name="bigrams" class="solr.TextField" positionIncrementGap="1">
> >         <analyzer>
> >             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >             <filter class="solr.WordDelimiterFilterFactory"
> >                     generateWordParts="1" generateNumberParts="1"
> >                     catenateWords="0" catenateNumbers="0"
> >                     catenateAll="0" splitOnCaseChange="0"/>
> >             <filter class="solr.LowerCaseFilterFactory"/>
> >             <filter class="solr.StopFilterFactory"
> >                 ignoreCase="true"
> >                 words="stopwords.txt"
> >                 enablePositionIncrements="true"
> >                 />
> >              <filter class="solr.ShingleFilterFactory"
> >                 minShingleSize="2"
> >                 maxShingleSize="2"
> >                 outputUnigrams="false"
> >             />
> >        </analyzer>
> >     </fieldType>
> >
> >
> >
> > an example output for bigrams:
> >
> >
> > facet_counts: {
> >    facet_queries: { }
> >    facet_fields: {
> >       bigrams: [
> >          "_ _", 67567,
> >          "_ speaker", 18932,
> >          "speaker _", 16186,
> >          "_ bill", 14513,
> >          "_ house", 14058,
> >          "bill _", 13205,
> >          "_ time", 13021,
> >          "time _", 12239,
> >          "house _", 10704,
> >          "today _", 10577
> >       ]
> >    }
> >
> >
> >
> > the "positionIncrementGap" for the copyField i use to store the main
> > searchable fields in, is actually set to 100, so i thought that might
> > be it, but i tried modifying that and it didn't solve the problem.
> >
> >
> > any help on either issue would be greatly appreciated. happy to
> > provide any other details. the full config file is available at:
> >
> >
> > https://github.com/sunlightlabs/Capitol-Words/blob/master/solr/schema.xml
> >
> >
> > thank you in advance!
> >
> > jessy
> >
> >
> > --
> > Jessy Cowan-Sharp
> > http://jessykate.com
>



-- 
Jessy Cowan-Sharp
http://jessykate.com
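
For anyone reading this thread later, here is a rough sketch of why the "_"
tokens show up. It is a simplified Python model written for illustration, not
the actual Lucene ShingleFilter code: the function names, the sample sentence,
and the stopword set are all made up, and the real filter handles more cases.
It assumes stopword removal leaves position gaps (as with
enablePositionIncrements="true") and that each gap is filled with a "_"
placeholder before shingles are built.

```python
FILLER = "_"  # placeholder emitted where a stopword used to be


def remove_stopwords(tokens, stopwords):
    """Drop stopwords but remember the gap as a position increment.

    Returns (term, position_increment) pairs; an increment above 1 means
    that many minus one stopwords were removed just before the term.
    """
    stream = []
    gap = 0
    for token in tokens:
        if token in stopwords:
            gap += 1
        else:
            stream.append((token, gap + 1))
            gap = 0
    return stream


def shingles(stream, size, skip_filler_only=False):
    """Build fixed-size shingles, filling position gaps with FILLER.

    skip_filler_only=True imitates the fix described in the reply above:
    a shingle is kept only if it has at least one real token.
    """
    expanded = []
    for term, increment in stream:
        expanded.extend([FILLER] * (increment - 1))
        expanded.append(term)

    result = []
    for start in range(len(expanded) - size + 1):
        window = expanded[start:start + size]
        if skip_filler_only and all(t == FILLER for t in window):
            continue
        result.append(" ".join(window))
    return result


# Hypothetical input: "the" and "of" are treated as stopwords.
stream = remove_stopwords("the speaker of the house".split(), {"the", "of"})
print(shingles(stream, 2))
print(shingles(stream, 2, skip_filler_only=True))
```

In this model the first call produces "_ speaker", "speaker _", "_ _" and
"_ house", which matches the shape of the facet output quoted above. The
second call drops only the "_ _" shingle; shingles that mix a placeholder
with a real word are still produced.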
