RE: Index an entire Phrase and not it's constituent parts?

Christopher Ball Sat, 13 Mar 2010 04:47:35 -0800

OK, let me try explaining what I am hoping to achieve at a higher level:
I want to aggressively remove stop words to reduce the size of my index, but
there are certain domain-specific multiword phrases that include stop words
and that I need to retain in the index.


So I want to stop out words such as: "the", "a", "as", "in", "of", etc . . .
but I need to index phrases such as "as much as" and "in the amount of".

So during analysis, how can I ensure that I match and index domain-specific
multiword phrases like "in the amount of", while preventing stop words from
being indexed when they are not part of such a phrase? For example, I would
not want any of the stop words (in single quotes) in the following sentence
to be indexed: "We were not given a notification 'of' 'the' group's
formation 'as' we had expected 'in' December."

Let me know if that clarifies or not.

Most grateful,

Christopher

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, March 09, 2010 7:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Index an entire Phrase and not it's constituent parts?

P.S. although phrase queries with fields that do NOT
have stopwords removed feels kinda like what you're
hinting at.

Erick

On Tue, Mar 9, 2010 at 6:49 PM, Erick Erickson
<erickerick...@gmail.com> wrote:

> I think you need to back up and tell us what you're
> trying to accomplish from a higher level.
> See Hossman's apache page:
>
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue.  Perhaps the best solution doesn't involve "Y" at all?
>
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
> Erick
>
>
> On Tue, Mar 9, 2010 at 6:16 PM, Christopher Ball <
> christopher.b...@metaheuristica.com> wrote:
>
>> Unfortunately, I don't see how the KeywordTokenizerFactory could work given
>> the field in question is delimited text (paragraphs) and the
>> KeywordTokenizerFactory essentially does nothing to the inbound content.
>>
>>
>>
>> Feel like I must be missing something . . . but can't figure out what.
>>
>>
>>
>> Do I really need to write a custom analyzer for this?
>>
>>
>>
>>  _____
>>
>> From: Erick Erickson <erickerick...@gmail.com>
>> Subject: Re: Index an entire Phrase and not it's constituent parts?
>> Date: Thu, 04 Mar 2010 19:55:58 GMT
>>
>> Try KeywordTokenizerFactory. This page is very useful:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>>
>> HTH
>> Erick
>>
>> On Thu, Mar 4, 2010 at 2:31 PM, Christopher Ball <
>> christopher.b...@metaheuristica.com> wrote:
>>
>> > How can I index an entire phrase and not its constituent parts?
>> >
>> >
>> >
>> > I want to index collations as a single term in the index, and not as the
>> > multiple terms that comprise the phrase, for example, I want to index:
>> > "as much as" but not the independent parts: "as", "much", "as".
>> >
>> >
>> >
>> > Any guidance appreciated,
>> >
>> >
>> >
>> > Christopher
>>
>>
>>
>>
>>
>
