You'll probably have to index them in separate fields to get what you want. The question is always whether it's worth it: is the use-case really well served by having a variant that keeps dots and the like? But that's more a question for your product manager....
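To make the "separate fields" idea concrete, a minimal schema sketch might look like the following (the field and type names here are my own invention, not from the thread; adjust to your schema). One field splits aggressively as you do now, while the copy keeps the original token alongside the split parts via preserveOriginal="1", so ".net" survives as a single term:

```xml
<!-- Hypothetical schema fragment: index the same text two ways. -->
<field name="body"       type="text_split"    indexed="true" stored="true"/>
<field name="body_exact" type="text_preserve" indexed="true" stored="false"/>
<copyField source="body" dest="body_exact"/>

<fieldType name="text_preserve" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal="1" emits ".net" as well as "net" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateAll="1"
            preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

At query time you would then search both fields (e.g. with edismax and qf="body body_exact") so that either variant can match.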
Best
Erick

On Wed, Feb 8, 2012 at 9:23 AM, Robert Brown <r...@intelcompute.com> wrote:
> Thanks Erick,
>
> I didn't get confused with multiple tokens vs multiValued :)
>
> Before I go ahead and re-index 4m docs, and believe me I'm using the
> analysis page like a mad-man!
>
> What do I need to configure to have the following both indexed with and
> without the dots...
>
> .net
> sales manager.
> £12.50
>
> Currently...
>
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.WordDelimiterFilterFactory"
>     generateWordParts="1"
>     generateNumberParts="1"
>     catenateWords="1"
>     catenateNumbers="1"
>     catenateAll="1"
>     splitOnCaseChange="1"
>     splitOnNumerics="1"
>     types="wdftypes.txt"
> />
>
> with nothing specific in wdftypes.txt for full-stops.
>
> Should there also be any difference when quoting my searches?
>
> The analysis page seems to just drop the quotes, but surely actual
> calls don't do this?
>
> ---
>
> IntelCompute
> Web Design & Local Online Marketing
>
> http://www.intelcompute.com
>
> On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
> <erickerick...@gmail.com> wrote:
>> Yes, WDDF creates multiple tokens. But that has
>> nothing to do with the multiValued suggestion.
>>
>> You can get exactly what you want by
>> 1> setting multiValued="true" in your schema file and re-indexing. Say
>>    positionIncrementGap is set to 100
>> 2> when you index, add the field for each sentence, so your doc
>>    looks something like:
>>    <doc>
>>      <field name="sentences">i am a sales-manager in here</field>
>>      <field name="sentences">using asp.net and .net daily</field>
>>      .....
>>    </doc>
>> 3> search like "sales manager"~100
>>
>> Best
>> Erick
>>
>> On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown <r...@intelcompute.com> wrote:
>>> Apologies if things were a little vague.
>>>
>>> Given the example snippet to index (numbered to show searches needed to
>>> match)...
>>>
>>> 1: i am a sales-manager in here
>>> 2: using asp.net and .net daily
>>> 3: working in design.
>>> 4: using something called sage 200. and i'm fluent
>>> 5: german sausages.
>>> 6: busy A&E dept earning £10,000 annually
>>>
>>> ... all with newlines in place.
>>>
>>> able to match...
>>>
>>> 1. sales
>>> 1. "sales manager"
>>> 1. sales-manager
>>> 1. "sales-manager"
>>> 2. .net
>>> 2. asp.net
>>> 3. design
>>> 4. sage 200
>>> 6. A&E
>>> 6. £10,000
>>>
>>> But do NOT match "fluent german" from 4 + 5, since there's a newline
>>> between them when indexed, but not when searched.
>>>
>>> Do the filters (WDF in this case) not create multiple tokens? So
>>> splitting on the period in "asp.net" would create tokens for all of
>>> "asp", "asp.", "asp.net", ".net", "net".
>>>
>>> Cheers,
>>> Rob
>>>
>>> --
>>>
>>> IntelCompute
>>> Web Design and Online Marketing
>>>
>>> http://www.intelcompute.com
>>>
>>> -----Original Message-----
>>> From: Chris Hostetter <hossman_luc...@fucit.org>
>>> Reply-to: solr-user@lucene.apache.org
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Which Tokeniser (and/or filter)
>>> Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)
>>>
>>> : This all seems a bit too much work for such a real-world scenario?
>>>
>>> You haven't really told us what your scenario is.
>>>
>>> You said you want to split tokens on whitespace, full-stop (aka:
>>> period) and comma only, but then in response to some suggestions you
>>> added comments about other things that you never mentioned previously...
>>>
>>> 1) evidently you don't want the "." in foo.net to cause a split in tokens?
>>> 2) evidently you not only want token splits on newlines, but also
>>> position gaps to prevent phrases matching across newlines.
>>>
>>> ...these are kind of important details that affect suggestions people
>>> might give you.
>>>
>>> Can you please provide some concrete examples of the types of data you
>>> have, the types of queries you want them to match, and the types of
>>> queries you *don't* want to match?
>>>
>>> -Hoss
>
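For reference, the multiValued approach Erick describes earlier in the thread might be sketched as follows (the field name, id, and gap of 100 are illustrative, not from an actual schema). Each sentence goes in as a separate value, and positionIncrementGap inserts 100 phantom positions between values:

```xml
<!-- Schema sketch: multiValued field whose fieldType declares
     positionIncrementGap="100" -->
<field name="sentences" type="text_general" indexed="true" stored="true"
       multiValued="true"/>

<!-- Update document: one value per sentence/line of the source text -->
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="sentences">using something called sage 200. and i'm fluent</field>
    <field name="sentences">german sausages.</field>
  </doc>
</add>
```

With this setup, a sloppy phrase query whose slop is smaller than the gap, e.g. sentences:"fluent german"~50, cannot match across the value boundary, while phrases within a single value still match.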