Sounds like a possible application of solr.PatternTokenizerFactory:

http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternTokenizerFactory.html

You could use copyField to copy the entire string to a separate field (or set
of fields) whose analyzer tokenizes it by pattern.
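
To get a feel for the kind of pattern you might hand to the tokenizer (via its
"pattern" and "group" attributes), here is a rough sketch in Python. The regex
and the helper names (HASHTAG, extract_hashtags, variations_of) are my own
assumptions for illustration, not anything Solr ships. Solr uses Java regular
expressions, but this pattern only uses syntax common to both, so treat it as
a starting point to test against your own tweets.

```python
import re

# Assumed pattern: '#', one or more ASCII alphanumerics, then any run of
# trailing '!' or '?' characters. Adjust to the hashtags you actually see.
HASHTAG = re.compile(r"#([A-Za-z0-9]+)([!?]*)")


def extract_hashtags(text):
    """Return (full_tag, lowercased_base, punctuation_suffix) per hashtag."""
    return [
        (m.group(0), m.group(1).lower(), m.group(2))
        for m in HASHTAG.finditer(text)
    ]


def variations_of(base, tags):
    """Return the full tags whose lowercased base equals `base`."""
    wanted = base.lower()
    return [full for full, tag_base, _ in tags if tag_base == wanted]


tags = extract_hashtags("jane said #jane then #Jane! and #jane!?!??")
print(tags)
print(variations_of("jane", tags))
```

Keeping the full tag, the base, and the suffix apart is what lets you tell
#jane from #jane! in one field, match all variations of #jane in another, and
leave the plain word jane out of both, since it never matches without the #.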

JRJ

-----Original Message-----
From: Memory Makers [mailto:memmakers...@gmail.com] 
Sent: Tuesday, October 25, 2011 9:27 AM
To: solr-user@lucene.apache.org
Subject: Points to processing hastags

Greetings,

I am trying to index hashtags from Twitter, that is, tokens that start with a
# symbol followed by any number of alphanumeric characters.

Examples:
1. #jane
2. #Jane
3. #Jane!

At a high level I'd like to be able to:
1. differentiate between say #jane and #jane!
2. differentiate between a hashtag such as #jane and a regular text token
jane
3. ask for variations of #jane (by this I mean that #jane? #jane!!! #jane!?!??
are all variations of #jane)

I'd appreciate pointers on what I should consider when I attempt to do the
above.

Thanks,

MM.
