Dallan, got money to spend on solving this problem? I believe this is something that tools like LingPipe can solve through language-model training and named-entity extraction.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Dallan Quass <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 30, 2008 4:22:37 PM
> Subject: RE: Issuing queries during analysis?
>
> > this may sound a bit too KISS - but another approach could be
> > based on synonyms, i.e. if the number of abbreviations is
> > limited and defined ("all US states"), you can simply define
> > the complete state name for each abbreviation; this way
> > "Chicago, IL" will be "translated" (...) into "Chicago,
> > Illinois" during indexing and/or querying... but this may
> > depend on the Tokenizer you use and how your index is defined
> > (does a search for "Chicago, Illinois" on a field give you a
> > doc with "Chicago, Cook, Illinois" in some (other/same) field?)
>
> Thanks for the suggestion! The problem is that there are over 1M places (it's a
> database of historic places worldwide), most with multiple variations in the
> way they're written. A complete synonym file would be pretty large.
> Issuing queries before indexing the docs would be preferable to a
> ~100-megabyte synonym file, especially because it's a wiki and people can
> add new places at any time, so I'd have to rebuild the synonym file on a
> regular basis.
>
> I sure wish I could figure out how to access the Solr core object in my
> token filter class, though.
>
> -dallan
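For anyone following the thread: the synonym approach suggested above can be sketched roughly as follows. This is a minimal illustration, not a drop-in config -- the field type name, tokenizer choice, and synonym entries are hypothetical, and only the `solr.SynonymFilterFactory` wiring reflects Solr's stock analysis chain.

```
# synonyms.txt -- one explicit mapping per state abbreviation,
# so "Chicago, IL" tokenizes to "Chicago" + "Illinois"
IL => Illinois
NY => New York

<!-- schema.xml (hypothetical field type): apply the mapping
     at index time, query time, or both -->
<fieldType name="text_place" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

As the reply below notes, this scales poorly when the synonym list is generated from a million-place, user-editable database: the file gets very large and must be regenerated and reloaded whenever places change.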