On May 30, 2008, at 10:22 AM, Dallan Quass wrote:

This may sound a bit too KISS, but another approach could be
based on synonyms. If the number of abbreviations is limited
and well defined ("all US states"), you can simply define the
complete state name for each abbreviation. That way,
"Chicago, IL" will be "translated" (...) into "Chicago,
Illinois" during indexing and/or querying. This may depend
on the Tokenizer you use and on how your index is defined,
though (does a search for "Chicago, Illinois" on one field
give you a doc with "Chicago, Cook, Illinois" in some
other, or the same, field?).
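
To make the synonym idea concrete, here is a minimal sketch in plain Python. It is not Solr or Lucene code. The table `STATE_SYNONYMS` and the helpers `expand_abbreviations`, `tokens`, and `matches` are hypothetical names invented for illustration, and the token-overlap check is only a rough stand-in for how an analyzed field might match.

```python
import re

# Hypothetical, deliberately tiny table; a real one would list every state.
STATE_SYNONYMS = {
    "IL": "Illinois",
    "NY": "New York",
    "CA": "California",
}

# A "word" here is simply a run of ASCII letters (an assumption of this sketch).
_WORD = re.compile(r"[A-Za-z]+")


def expand_abbreviations(text, synonyms=STATE_SYNONYMS):
    """Replace each whole-word, upper-case abbreviation with its full name."""
    def replace(match):
        word = match.group(0)
        return synonyms.get(word, word)  # unknown words pass through unchanged
    return _WORD.sub(replace, text)


def tokens(text):
    """Lower-cased word tokens, roughly what a simple tokenizer would emit."""
    return [t.lower() for t in _WORD.findall(text)]


def matches(query, field_value):
    """True if every expanded query token occurs in the expanded field value."""
    field_tokens = set(tokens(expand_abbreviations(field_value)))
    return all(t in field_tokens for t in tokens(expand_abbreviations(query)))


print(expand_abbreviations("Chicago, IL"))
print(matches("Chicago, IL", "Chicago, Cook, Illinois"))
```

Applying the same expansion to both the query and the stored value is what lets "Chicago, IL" find "Chicago, Cook, Illinois". Abbreviations that collide with ordinary words (for example "IN" or "OR") would need extra care, such as only expanding the last component of a place name.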

Thanks for the suggestion! The problem is there are over 1M places (it's a database of historic places worldwide), most with multiple variations in the way that they're written. A complete synonym file would be pretty large.
Issuing queries before indexing the docs would be preferable to a ~100-megabyte synonym file, especially because it's a wiki and people can add new places anytime, so I'd have to re-build the synonym file on a regular basis.


Can you describe your indexing process a bit more? Do you just have one or two tokens that you have to "translate", or are you going to query on every token in your text? I just don't see how looking up every token in some index will perform at all, so maybe if we have some more info, something more obvious will arise.



I sure wish I could figure out how to access the Solr core object in my
token filter class though.

-dallan


--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






