Re: Analyzing CSV phrase fields

Yonik Seeley Tue, 25 Nov 2008 06:48:57 -0800

The easiest solution would be to create the documents you send to Solr
with multiple keywords fields, one per phrase... the values will be
separated by a positionIncrement (set by positionIncrementGap on the
field type), so a phrase query won't see yankees adjacent to cleveland.
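
To make the position gap concrete, here is a rough Python sketch. It is a
toy model, not Solr or Lucene code: index_positions, phrase_match and the
gap of 100 are assumptions for illustration only, and the real gap is
whatever positionIncrementGap your field type declares.

```python
# Toy model of token positions in a multi-valued field; not Solr code.
POSITION_GAP = 100  # assumed gap between values (positionIncrementGap)


def index_positions(values, gap=POSITION_GAP):
    """Map each token to the list of positions where it occurs."""
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # extra jump between consecutive field values
        for token in value.lower().split():
            pos += 1
            positions.setdefault(token, []).append(pos)
    return positions


def phrase_match(positions, phrase):
    """True if the phrase's tokens occur at consecutive positions."""
    tokens = phrase.lower().split()
    if not tokens:
        return False
    return any(
        all(start + i in positions.get(tok, []) for i, tok in enumerate(tokens))
        for start in positions.get(tokens[0], [])
    )


doc2 = index_positions(
    ["philadelphia phillies", "new york yankees", "cleveland indians"]
)
print(phrase_match(doc2, "new york"))           # True
print(phrase_match(doc2, "yankees cleveland"))  # False
```

With gap=0 the second query would match, since yankees and cleveland would
then sit at adjacent positions.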


If you can't do that, then perhaps patch PatternTokenizer to put a
larger positionIncrement between groups.  You would then need to follow
it with another filter that tokenizes on whitespace or some other
regex (which we currently don't have).
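
Roughly, the two stages would behave like the Python sketch below. This
only models the positions such a patched tokenizer plus a follow-up filter
might produce; tokenize_csv and GROUP_GAP are hypothetical names, and none
of this is an existing Solr API.

```python
import re

GROUP_GAP = 100  # hypothetical extra increment between comma groups


def tokenize_csv(text, group_gap=GROUP_GAP):
    """Return (token, position) pairs for a comma-separated field."""
    tokens = []
    pos = -1
    # Stage 1: split on commas, like pattern="\s*,\s*" with group="-1".
    groups = [g for g in re.split(r"\s*,\s*", text.strip()) if g]
    for i, group in enumerate(groups):
        if i > 0:
            pos += group_gap  # the larger increment a patched tokenizer would add
        # Stage 2: split each group on whitespace (the filter we don't have yet).
        for word in group.split():
            pos += 1
            # Lowercasing stands in for a later LowerCaseFilter.
            tokens.append((word.lower(), pos))
    return tokens


for token, position in tokenize_csv(
    "philadelphia eagles, cleveland browns, new york jets"
):
    print(position, token)
```

Here eagles lands at position 1 and cleveland at 102, so a phrase query
for "eagles cleveland" finds nothing, while "new york" (204, 205) still
matches.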

-Yonik

On Tue, Nov 25, 2008 at 2:10 AM, Neal Richter <[EMAIL PROTECTED]> wrote:
> Hey all,
>
> Very basic question.. I want to index fields of comma separated values:
>
> Example document:
> id: 1
> title: Football Teams
> keywords: philadelphia eagles, cleveland browns, new york jets
>
> id: 2
> title: Baseball Teams
> keywords:"philadelphia phillies", "new york yankees", "cleveland indians"
>
> A query of 'new york' should return the obvious documents, but a quoted
> phrase query of "yankees cleveland" should return nothing... meaning that
> comma breaks phrases without fail.
>
> I've created a textCSV type in the schema.xml file and used the
> PatternTokenizerFactory to split on commas, and from there analysis can
> proceed as normal via StopFilterFactory, LowerCaseFilter,
> RemoveDuplicatesTokenFilter
>
> <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*" group="-1"/>
>
> Has anyone done this before?  Can I somehow use an existing (or combination
> of) Analyzer?  It seems as though I need to create a PhraseDelimiterFilter
> from the WordDelimiterFilter.. though I am sure there is a way to make an
> existing analyzer to break things up the way I want.
>
> Thanks - Neal Richter
>
