Hey all, I'm in the process of implementing a system to do 'text classification' with Solr. The basic idea is to take an ontology/taxonomy like dmoz, with entries of the form {label: "X", tags: "a,b,c,d,e"}, index it, and then classify documents into the taxonomy by pushing each parsed document into the Solr search API as a query. Why? Lucene/Solr's ability to do weighted term boosting at both search and index time has lots of obvious uses here.
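To make the idea concrete, here is a rough sketch of the query side: it turns a parsed document into a boosted query against the 'tags' field, so the top-scoring taxonomy entries become the candidate labels. The 'label' and 'tags' field names come from the example above; the core URL, the stopword list, and the choice of raw term frequency as the boost are assumptions for illustration only, not a tested design.

```python
import re
from collections import Counter
from urllib.parse import urlencode

# Assumptions for illustration: a hypothetical core named "taxonomy"
# and a tiny stopword list. Neither comes from a real deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/taxonomy/select"
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}


def build_classification_query(text, max_terms=10):
    """Build a boosted query over the 'tags' field from document text."""
    tokens = [
        t for t in re.findall(r"[a-z0-9]+", text.lower())
        if t not in STOPWORDS
    ]
    counts = Counter(tokens).most_common(max_terms)
    if not counts:
        raise ValueError("document has no usable terms")
    # Boost each term by its frequency in the document (term^boost).
    clauses = " ".join(f"{term}^{freq}" for term, freq in counts)
    return f"tags:({clauses})"


def build_request_url(text, rows=5):
    """Build the /select URL that returns the best-matching labels."""
    params = {
        "q": build_classification_query(text),
        "fl": "label,score",
        "rows": rows,
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"
```

Fetching the URL returned by build_request_url(doc_text) and reading the 'label' and 'score' of the top hits would give the proposed classification. Nothing is written to the index, which matters given how write-intensive the documents are.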
Has anyone worked on this or a similar project yet? I've seen some talk on the list about this area, but it's pretty thin (see the December thread "Taxonomy Support on Solr"). I'm assuming Grant Ingersoll is looking at similar things with his 'taming text' project.

I store the 'documents' in another repository, and they are far too dynamic (write intensive) for direct indexing in Solr, so the previously suggested procedure of 1) store document, 2) execute more-like-this, and 3) delete document would be too slow.

If people are interested, I could start a JIRA issue on this (I do not see anything there at the moment).

Thanks - Neal Richter http://aicoder.blogspot.com