Question about solr.WordDelimiterFilterFactory

2012-04-11 Thread Jian Xu
Hello,

I am new to Solr/Lucene. I am tasked with indexing a large number of documents.
Some of these documents contain decimal points. I am looking for a way to index
these documents so that adjacent numeric characters (such as [0-9.,]) are
treated as a single token. For example,

12.34 => "12.34"
12,345 => "12,345"

However, "," and "." should be treated as usual when around non-digital 
characters. For example,

ab,cd => "ab" "cd".

The goal is that searching for "12.34" will match "12.34" but not "12 34",
while searching for "ab.cd" should match both "ab.cd" and "ab cd".

After doing some research on Solr, it seems that there is a built-in filter
called solr.WordDelimiterFilterFactory that supports a "types" attribute, which
maps special characters to different delimiter types. However, it isn't exactly
what I want: it doesn't provide a context check, such as requiring that "," or
"." be surrounded by digit characters.
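
For reference, this is roughly what such a configuration looks like (just a
sketch; the field type name and the types file name are placeholders I made up):

  <fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- wdftypes.txt (placeholder name) would contain lines such as:
             . => DIGIT
             \u002C => DIGIT
           which re-type "." and "," as digit characters everywhere,
           not only when they sit between digits. -->
      <filter class="solr.WordDelimiterFilterFactory" types="wdftypes.txt"
              generateWordParts="1" generateNumberParts="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>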

Does anyone have any experience configuring Solr to meet these requirements? Is
writing my own plugin necessary for something this simple?

Thanks in advance!

-Jian

Re: Question about solr.WordDelimiterFilterFactory

2012-04-12 Thread Jian Xu
Erick,

Thank you for your response! 

The problem with this approach is that searching for "12:34" will also match
"12.34", which is not what I want.



 From: Erick Erickson 
To: solr-user@lucene.apache.org; Jian Xu  
Sent: Thursday, April 12, 2012 8:01 AM
Subject: Re: Question about solr.WordDelimiterFilterFactory
 
WordDelimiterFilterFactory will _almost_ do what you want
by setting things like catenateWords=0 and catenateNumbers=1,
_except_ that the punctuation will be removed. So
12.34 -> 1234
ab,cd -> ab cd

is that "close enough"?

Otherwise, writing a simple Filter is probably the way to go.

Best
Erick
