>>Instead of indexing documents about 'sports' and searching for hits
>>based upon 'basketball', 'football' etc.. I simply want to index the
>>taxonomy and classify documents into it. This is an ancient
>>AI/Data-Mining discipline.. but the standard methods of 'indexing' the
>>taxonomy are/were primitive compared to what one /could/ do with
>>something like Lucene.

Yes, I know this approach. The challenge with this method is calculating the score and choosing the thresholds.
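To make the scoring and threshold question concrete, here is a minimal sketch in plain Python. It is not the Lucene API, only a stand-in for the idea: keep the taxonomy in memory, turn the text to classify into a weighted term query, and keep the categories whose score clears a cutoff. The taxonomy entries, the TF-IDF/cosine scoring, and the 0.2 threshold are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy taxonomy: category -> descriptive terms (illustrative data only).
TAXONOMY = {
    "sports/basketball": "basketball hoop dunk court nba player team",
    "sports/football": "football touchdown quarterback nfl field team",
    "science/astronomy": "telescope planet star galaxy orbit",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(taxonomy):
    """Build one length-normalized TF-IDF vector per taxonomy node, all in RAM."""
    docs = {cat: Counter(tokenize(text)) for cat, text in taxonomy.items()}
    n = len(docs)
    # Document frequency: in how many taxonomy nodes each term occurs.
    df = Counter(term for tf in docs.values() for term in tf)
    idf = {term: math.log(1 + n / count) for term, count in df.items()}
    vectors = {}
    for cat, tf in docs.items():
        vec = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[cat] = {term: w / norm for term, w in vec.items()}
    return vectors, idf


def classify(text, vectors, idf, threshold=0.2):
    """Score text against every node; return (category, score) pairs >= threshold."""
    # The document is never indexed; its known terms only form the query.
    tf = Counter(t for t in tokenize(text) if t in idf)
    query = {term: count * idf[term] for term, count in tf.items()}
    qnorm = math.sqrt(sum(w * w for w in query.values()))
    if qnorm == 0:
        return []  # no term of the text occurs anywhere in the taxonomy
    scores = {}
    for cat, vec in vectors.items():
        # Cosine similarity between the query and the node vector.
        score = sum(w * vec.get(term, 0.0) for term, w in query.items()) / qnorm
        if score >= threshold:
            scores[cat] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


vectors, idf = build_index(TAXONOMY)
print(classify("The quarterback threw a touchdown on the football field", vectors, idf))
```

The threshold is the part that needs tuning against real data: set it too low and every document lands in several unrelated categories, too high and short documents match nothing.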
Is it really necessary to use Solr for this? Things go much faster with the low-level Lucene API, and faster still if you load the classification corpus into RAM.

On Mon, Jan 26, 2009 at 7:24 PM, Neal Richter <nrich...@gmail.com> wrote:
> Thanks for the link Shalin... played with that a while back.. It's
> possibly got some indirect possibilities.
>
> On Mon, Jan 26, 2009 at 10:46 AM, Hannes Carl Meyer <m...@hcmeyer.com> wrote:
> > I didn't understand, is the corpus of documents you want to use to
> > classify fixed?
>
> Assume the 'documents' are not stored in the same index and I want to
> only store the taxonomy or ontology in this index.
>
> Instead of indexing documents about 'sports' and searching for hits
> based upon 'basketball', 'football' etc.. I simply want to index the
> taxonomy and classify documents into it. This is an ancient
> AI/Data-Mining discipline.. but the standard methods of 'indexing' the
> taxonomy are/were primitive compared to what one /could/ do with
> something like Lucene.
>
> Here's a 2007 research paper that used Lucene directly for
> classification, but doing the inverse of what I described:
> http://www.cs.ucl.ac.uk/staff/R.Hirsch/papers/gecco_HHS.pdf
>
> >>>previously suggested procedure of 1) store document 2) execute
> >>>more-like-this and 3) delete document would be too slow.
> > Do you mean the document to classify?
> > Why do you then want to put it into the index (very expensive), you just
> > need the contents of it to build a query!
>
> Exactly.. in the December Taxonomy thread Walter Underwood outlined a
> store/classify/delete procedure. Too slow if you have no need to
> index the document itself.
>
> - Neal