Hi all, I recently posted parts 1 & 2 of a series on extracting text features for machine learning…
http://www.scaleunlimited.com/2013/07/10/text-feature-selection-for-machine-learning-part-1/
http://www.scaleunlimited.com/2013/07/21/text-feature-selection-for-machine-learning-part-2/

It uses Solr to generate terms from mailing list text, and then does analysis to extract good features for things like classification, similarity and clustering.

The last part will cover using Solr to implement a real-time similarity engine, and maybe a recommendation engine as well.

It undoubtedly has some things that are unclear or even incorrect, so please comment :)

Regards,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr