You'll probably have to index them in separate fields to get what you want. The question is always whether it's worth it: is the use-case really well served by having a variant that keeps dots and the like? But that's more a question for your product manager....
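To make the "separate fields" idea concrete, a minimal schema sketch might look like the following (the field and type names here are my own invention, not from the thread; adjust to your schema). One field splits aggressively as you do now, while the copy keeps the original token alongside the split parts via preserveOriginal="1", so ".net" survives as a single term:

```xml
<!-- Hypothetical schema fragment: index the same text two ways. -->
<field name="body"       type="text_split"    indexed="true" stored="true"/>
<field name="body_exact" type="text_preserve" indexed="true" stored="false"/>
<copyField source="body" dest="body_exact"/>

<fieldType name="text_preserve" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal="1" emits ".net" as well as "net" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateAll="1"
            preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

At query time you would then search both fields (e.g. with edismax and qf="body body_exact") so that either variant can match.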
Best
Erick

On Wed, Feb 8, 2012 at 9:23 AM, Robert Brown <r...@intelcompute.com> wrote:
> Thanks Erick,
>
> I didn't get confused with multiple tokens vs multiValued :)
>
> Before I go ahead and re-index 4m docs, and believe me I'm using the
> analysis page like a mad-man!
>
> What do I need to configure to have the following both indexed with and
> without the dots...
>
> .net
> sales manager.
> £12.50
>
> Currently...
>
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.WordDelimiterFilterFactory"
>     generateWordParts="1"
>     generateNumberParts="1"
>     catenateWords="1"
>     catenateNumbers="1"
>     catenateAll="1"
>     splitOnCaseChange="1"
>     splitOnNumerics="1"
>     types="wdftypes.txt"
> />
>
> with nothing specific in wdftypes.txt for full-stops.
>
> Should there also be any difference when quoting my searches?
>
> The analysis page seems to just drop the quotes, but surely actual
> calls don't do this?
>
> ---
>
> IntelCompute
> Web Design & Local Online Marketing
>
> http://www.intelcompute.com
>
> On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
> <erickerick...@gmail.com> wrote:
>> Yes, WDDF creates multiple tokens. But that has
>> nothing to do with the multiValued suggestion.
>>
>> You can get exactly what you want by
>> 1> setting multiValued="true" in your schema file and re-indexing. Say
>>    positionIncrementGap is set to 100
>> 2> when you index, add the field for each sentence, so your doc
>>    looks something like:
>>    <doc>
>>      <field name="sentences">i am a sales-manager in here</field>
>>      <field name="sentences">using asp.net and .net daily</field>
>>      .....
>>    </doc>
>> 3> search like "sales manager"~100
>>
>> Best
>> Erick
>>
>> On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown <r...@intelcompute.com> wrote:
>>> Apologies if things were a little vague.
>>>
>>> Given the example snippet to index (numbered to show searches needed to
>>> match)...
>>>
>>> 1: i am a sales-manager in here
>>> 2: using asp.net and .net daily
>>> 3: working in design.
>>> 4: using something called sage 200. and i'm fluent
>>> 5: german sausages.
>>> 6: busy A&E dept earning £10,000 annually
>>>
>>> ... all with newlines in place.
>>>
>>> able to match...
>>>
>>> 1. sales
>>> 1. "sales manager"
>>> 1. sales-manager
>>> 1. "sales-manager"
>>> 2. .net
>>> 2. asp.net
>>> 3. design
>>> 4. sage 200
>>> 6. A&E
>>> 6. £10,000
>>>
>>> But do NOT match "fluent german" from 4 + 5, since there's a newline
>>> between them when indexed, but not when searched.
>>>
>>> Do the filters (WDF in this case) not create multiple tokens? So
>>> splitting on the period in "asp.net" would create tokens for all of
>>> "asp", "asp.", "asp.net", ".net", "net".
>>>
>>> Cheers,
>>> Rob
>>>
>>> --
>>>
>>> IntelCompute
>>> Web Design and Online Marketing
>>>
>>> http://www.intelcompute.com
>>>
>>> -----Original Message-----
>>> From: Chris Hostetter <hossman_luc...@fucit.org>
>>> Reply-to: solr-user@lucene.apache.org
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Which Tokeniser (and/or filter)
>>> Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)
>>>
>>> : This all seems a bit too much work for such a real-world scenario?
>>>
>>> You haven't really told us what your scenario is.
>>>
>>> You said you want to split tokens on whitespace, full-stop (aka:
>>> period) and comma only, but then in response to some suggestions you
>>> added comments about other things that you never mentioned previously...
>>>
>>> 1) evidently you don't want the "." in foo.net to cause a split in tokens?
>>> 2) evidently you not only want token splits on newlines, but also
>>> position gaps to prevent phrases matching across newlines.
>>>
>>> ...these are kind of important details that affect suggestions people
>>> might give you.
>>>
>>> Can you please provide some concrete examples of the types of data you
>>> have, the types of queries you want them to match, and the types of
>>> queries you *don't* want to match?
>>>
>>> -Hoss
>
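For reference, the multiValued approach Erick describes earlier in the thread might be sketched as follows (the field name, id, and gap of 100 are illustrative, not from an actual schema). Each sentence goes in as a separate value, and positionIncrementGap inserts 100 phantom positions between values:

```xml
<!-- Schema sketch: multiValued field whose fieldType declares
     positionIncrementGap="100" -->
<field name="sentences" type="text_general" indexed="true" stored="true"
       multiValued="true"/>

<!-- Update document: one value per sentence/line of the source text -->
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="sentences">using something called sage 200. and i'm fluent</field>
    <field name="sentences">german sausages.</field>
  </doc>
</add>
```

With this setup, a sloppy phrase query whose slop is smaller than the gap, e.g. sentences:"fluent german"~50, cannot match across the value boundary, while phrases within a single value still match.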