On May 30, 2008, at 10:22 AM, Dallan Quass wrote:
This may sound a bit too KISS, but another approach could be based
on synonyms: if the number of abbreviations is limited and well
defined (e.g. "all US states"), you can simply define the complete
state name for each abbreviation. That way "Chicago, IL" is
"translated" (...) into "Chicago, Illinois" during indexing and/or
querying. This may depend on the Tokenizer you use and on how your
index is defined, though: does a search for "Chicago, Illinois" on
one field give you a doc with "Chicago, Cook, Illinois" in some
other (or the same) field?
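
A minimal sketch of the synonym idea above, in plain Python rather
than as a Solr filter. The table STATE_NAMES and the function
expand_state_abbreviations are hypothetical names invented for this
illustration, and the table is deliberately incomplete.

```python
# Hypothetical, deliberately tiny abbreviation table; a real one would
# list every US state.
STATE_NAMES = {
    "IL": "Illinois",
    "NY": "New York",
    "CA": "California",
}


def expand_state_abbreviations(place: str) -> str:
    """Expand state abbreviations in a comma-separated place name.

    Only parts that match a table key exactly (upper case) are
    replaced, so ordinary words are left untouched.
    """
    parts = [part.strip() for part in place.split(",")]
    expanded = [STATE_NAMES.get(part, part) for part in parts]
    return ", ".join(expanded)


if __name__ == "__main__":
    print(expand_state_abbreviations("Chicago, IL"))
```

Applying the same function to both the indexed text and the query
text keeps the two sides consistent, which is the point of doing the
translation "during indexing and/or querying".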
Thanks for the suggestion! The problem is there are over 1M places
(it's a database of historic places worldwide), most with multiple
variations in the way that they're written. A complete synonym file
would be pretty large.

Issuing queries before indexing the docs would be preferable to a
~100-megabyte synonym file, especially because it's a wiki and people
can add new places anytime, so I'd have to re-build the synonym file
on a regular basis.
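
One way to picture "issuing queries before indexing" is sketched
below, under loud assumptions: PLACE_VARIANTS is a hypothetical
in-memory stand-in for the places index (the real lookup would be a
query against it), and lookup_place, prepare_doc and the place_std
field are names made up for this example. The cache only avoids
repeating the lookup for place strings that recur across documents.

```python
from functools import lru_cache

# Hypothetical stand-in for the places index; in a real setup this
# would be a query against the places database, not a dict.
PLACE_VARIANTS = {
    "chicago, il": "Chicago, Cook, Illinois",
    "chicago, illinois": "Chicago, Cook, Illinois",
}


@lru_cache(maxsize=10_000)
def lookup_place(raw: str) -> str:
    """Return the standard form of a place, or the input if unknown."""
    # Lower-case and collapse whitespace so trivial variations match.
    key = " ".join(raw.lower().split())
    return PLACE_VARIANTS.get(key, raw)


def prepare_doc(doc: dict) -> dict:
    """Return a copy of doc with a standardized place field added."""
    out = dict(doc)
    out["place_std"] = lookup_place(doc["place"])
    return out


if __name__ == "__main__":
    print(prepare_doc({"id": 1, "place": "Chicago, IL"}))
```

Because the lookup runs once per document field rather than once per
token, new places added to the wiki are picked up without
regenerating any file.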
Can you describe your indexing process a bit more? Do you just have
one or two tokens that you have to "translate", or are you going to
query on every token in your text? I just don't see how looking up
every token in some index will perform at all, so maybe if we have
some more info, something more obvious will arise.
I sure wish I could figure out how to access the Solr core object in
my token filter class though.
-dallan
--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ