On Oct 27, 2008, at 6:10 PM, Grant Ingersoll wrote:
Warning: shameless plug: Tom Morton and I have a chapter on NER and
OpenNLP (and Solr, for that matter) in our book "Taming
Text" (Manning) and the code will be open once we have a place to
put it (hopefully soon). In fact, you'll see us doing a lot of this
kind of stuff w/ Solr and it should all be coming back to Solr/
Lucene/Mahout at some point (for instance, see
https://issues.apache.org/jira/browse/SOLR-769, as I'm sure FAST told
you they can do clustering, too!)
--end shameless plug ---
That's great!
I just got the MEAP copy; it looks really good.
http://www.manning.com/ingersoll/
As for Mahout, NER is a classification problem, and there are some
tools in Mahout to do classification, but nothing specifically
targeted at NER at the moment. Mahout, like Nutch, also takes
advantage of Hadoop for scaling. The combination of Mahout in Solr
makes a lot of sense, IMO.
Perhaps this is more appropriate to ask on the Mahout list, but...
when you say "Mahout, like Nutch, also takes advantage of Hadoop for
scaling", does that mean that much of Mahout requires Hadoop? Is it
possible to run smaller-scale problems on a simple setup and only
invoke Hadoop when required?
ryan