On Oct 27, 2008, at 8:53 PM, Ryan McKinley wrote:
On Oct 27, 2008, at 6:10 PM, Grant Ingersoll wrote:
Warning: shameless plug: Tom Morton and I have a chapter on NER
and OpenNLP (and Solr, for that matter) in our book "Taming
Text" (Manning) and the code will be open once we have a place to
put it (hopefully soon). In fact, you'll see us doing a lot of
this kind of stuff w/ Solr and it should all be coming back to
Solr/Lucene/Mahout at some point (for instance, see
https://issues.apache.org/jira/browse/SOLR-769, as I'm sure FAST told
you they can do clustering, too!)
--end shameless plug ---
That's great!
I just got the MEAP copy; it looks really good.
http://www.manning.com/ingersoll/
Thanks!
As for Mahout, NER is a classification problem, and there are some
tools in Mahout to do classification, but nothing specifically
targeted at NER at the moment. Mahout, like Nutch, also takes
advantage of Hadoop for scaling. The combination of Mahout in Solr
makes a lot of sense, IMO.
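
To illustrate the "NER is a classification problem" framing: each token is turned into a set of features, and a classifier assigns it a label such as B-PER (begins a person name), I-PER (continues one), or O (outside any entity). The following is only a minimal Python sketch. The feature names, the honorific list, and the hand-written rules standing in for a trained model are all hypothetical, and none of this is Mahout or OpenNLP code.

```python
# Toy illustration: NER framed as per-token classification with BIO labels.
# The features and the rule-based "classifier" below are hypothetical
# stand-ins for a trained model; this is not Mahout or OpenNLP code.

HONORIFICS = {"Mr.", "Mrs.", "Ms.", "Dr."}


def token_features(tokens, i):
    """Build a small feature dict for the token at position i."""
    return {
        "is_capitalized": tokens[i][:1].isupper(),
        "prev_is_honorific": i > 0 and tokens[i - 1] in HONORIFICS,
    }


def classify(features, prev_label):
    """Stand-in for a learned classifier: map features to a BIO label."""
    if features["prev_is_honorific"] and features["is_capitalized"]:
        return "B-PER"
    if prev_label in ("B-PER", "I-PER") and features["is_capitalized"]:
        return "I-PER"
    return "O"


def tag(tokens):
    """Label each token in order, feeding the previous label forward."""
    labels = []
    prev = "O"
    for i in range(len(tokens)):
        prev = classify(token_features(tokens, i), prev)
        labels.append(prev)
    return labels


tokens = ["Dr.", "Jane", "Smith", "wrote", "a", "book"]
print(list(zip(tokens, tag(tokens))))
```

In a real system the classify step would be a model trained on labeled text, which is where a general-purpose classification toolkit would come in.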
Perhaps this is more appropriate to ask on the Mahout list, but...
when you say "Mahout, like Nutch, also takes advantage of Hadoop for
scaling", does that mean that much of Mahout requires Hadoop? Is it
possible to do smaller scale problems on a simple setup and only
invoke Hadoop when required?
Yes, probably better asked on Mahout, but to answer your question,
yes, most of the implementations require Hadoop so far, but it is not
a strict requirement. That being said, it is fairly easy to run them
on a simple setup (i.e. single node).