Warning, shameless plug: Tom Morton and I have a chapter on NER and
OpenNLP (and Solr, for that matter) in our book "Taming Text" (Manning),
and the code will be open once we have a place to put it (hopefully
soon). In fact, you'll see us doing a lot of this kind of stuff with
Solr, and it should all be coming back to Solr/Lucene/Mahout at some
point (for instance, see https://issues.apache.org/jira/browse/SOLR-769;
as I'm sure FAST told you, they can do clustering, too!)
--- end shameless plug ---
As for Mahout, NER is a classification problem, and there are some
tools in Mahout to do classification, but nothing specifically
targeted at NER at the moment. Mahout, like Nutch, also takes
advantage of Hadoop for scaling. Combining Mahout with Solr
makes a lot of sense, IMO.
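
To make the "NER is a classification problem" point concrete, here is a minimal sketch, not Mahout or OpenNLP code. It assumes a classifier has already assigned one label per token using the common B-/I-/O tagging convention (my assumption for illustration; nothing above prescribes it), and the function name bio_to_entities is hypothetical. All it does is group the per-token labels back into entity spans.

```python
from typing import List, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group per-token B-/I-/O labels into (entity_text, entity_type) pairs.

    The tagging scheme is an assumption for illustration: "B-X" opens an
    entity of type X, "I-X" continues it, and "O" marks a non-entity token.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    def flush() -> None:
        # Close the entity being built (if any) and reset the state.
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens = []
        current_type = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity;
            # a stray I- token is dropped in this sketch.
            flush()
    flush()
    return entities
```

The classifier itself (maxent, naive Bayes, whatever) is the part a library would supply; the decoding step above stays the same regardless of which one produces the tags.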
On Oct 25, 2008, at 11:25 PM, Vaijanath N. Rao wrote:
Hi,
One can use the OpenNLP maximum entropy library to create their own
named-entity extractor.
I used it in one of my Solr projects.
It is easy to integrate most NLP libraries with Solr. In our case,
though, named-entity extraction was embedded in our crawler, which
populated a field called entities in the database; we then
ingested that into Solr as just another field.
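
A rough sketch of the "ingest as yet another field" step, assuming the entities have already been extracted upstream. It builds a Solr XML update message in which each entity becomes one value of the entities field. The id and text field names and the function name build_solr_add_xml are hypothetical, and the schema is assumed to declare entities as multiValued.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def build_solr_add_xml(doc_id: str, text: str, entities: Iterable[str]) -> str:
    """Build a Solr <add><doc>...</doc></add> update message.

    Field names "id" and "text" are illustrative assumptions; "entities"
    is assumed to be a multiValued field in the Solr schema.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    for name, value in [("id", doc_id), ("text", text)]:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value

    # One <field name="entities"> element per extracted entity.
    for entity in entities:
        field = ET.SubElement(doc, "field", name="entities")
        field.text = entity

    # ElementTree escapes characters such as & and < in field values.
    return ET.tostring(add, encoding="unicode")
```

The resulting string would be POSTed to the Solr update handler, followed by a commit; that part is left out here since the URL depends on the installation.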
--Thanks and Regards
Vaijanath N. Rao
Julien Nioche wrote:
Hi,
Open source NLP platforms like GATE (http://gate.ac.uk) or Apache UIMA are
typically used for these types of tasks. GATE in particular comes with an
application called ANNIE which does Named Entity Recognition. OpenCalais
does that as well and should be easy to embed, but unlike UIMA- or
GATE-based applications it can't be tuned to do more specific things.
Depending on the architecture you have in mind, it could be worth
investigating Nutch and adding NER as a custom plugin; since NLP is often a
CPU-intensive task, you could leverage the scalability of Hadoop in Nutch.
There is a patch which allows indexing to be delegated to Solr. As someone
else already said, these named entities could then be used as facets.
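
As a sketch of the faceting idea, assuming the entities were indexed into a field named entities: the function below only assembles a query URL using Solr's standard facet parameters (facet, facet.field, facet.limit, facet.mincount). The base URL, the /select handler path and the function name entity_facet_query are assumptions for illustration.

```python
from urllib.parse import urlencode


def entity_facet_query(base_url: str, query: str = "*:*", limit: int = 10) -> str:
    """Return a Solr select URL that facets on the (assumed) entities field."""
    params = [
        ("q", query),
        ("rows", 0),                   # only the facet counts are wanted
        ("facet", "true"),
        ("facet.field", "entities"),   # field name is an assumption
        ("facet.limit", limit),
        ("facet.mincount", 1),         # skip entities with no matching docs
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical local Solr instance.
    print(entity_facet_query("http://localhost:8983/solr"))
```

Each facet value that comes back is an entity string with the number of matching documents, which is what makes entities usable as navigation links.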
HTH
Julien
--------------------------
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ