Hello, I am working on the implementation of a system to categorize URLs/web pages.
I would have categories like ...

Adult
Health
Business
Arts
Home
Science

I am looking at how Lucene/Solr could help me achieve this. I came across links that mention MoreLikeThis (MLT) could help. I found LucidWorks Search helpful, as it installs Jetty and Solr in a few clicks. Importing data and querying were also straightforward.

My question is:

- I have a pre-defined list of categories, and for each category I would have web pages and documents that could be stored in the Solr index with that category assigned.
- Each page would go through input processors such as:
  - Text extractor (from HTML, PDF, Office formats)
  - Text language detection
  - Standard text processors: stemming, stopword removal, lowercasing, etc.
  - Title extractor
  - Summary extractor
  - Field mapping
  - Header and footer remover
- All these documents would be processed and stored in the Solr index with a known category.
- When a new request comes in, I need to form an MLT or Solr query based on the content of the web page and get similar documents. Based on the results, I could reply with the top 3 categories.

Please let me know whether using Solr for this problem is the correct approach. If yes, how should I go about forming the query based on the web page contents?

Thanks,
Rajesh
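
P.S. To make the last step concrete, here is a rough sketch of what I have in mind, not a tested implementation. It assumes a MoreLikeThis request handler is registered at /mlt in solrconfig.xml and accepts the page text through the stream.body parameter (which may need content streaming enabled), that the indexed fields are named "content" and "category" (hypothetical names), and that the matches come back under response.docs in the JSON output. The function names and the base URL are made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlencode


def build_mlt_url(base_url, page_text, rows=20):
    """Build a MoreLikeThis request URL for the extracted text of a new page.

    Assumptions: an /mlt handler exists, and "content" / "category" are the
    (hypothetical) field names used at index time.
    """
    params = {
        "stream.body": page_text,    # text of the incoming page
        "mlt.fl": "content",         # field(s) to mine for interesting terms
        "mlt.mintf": 1,              # minimum term frequency in the input text
        "mlt.mindf": 1,              # minimum document frequency in the index
        "fl": "id,category,score",   # only return what the voting step needs
        "rows": rows,
        "wt": "json",
    }
    return base_url.rstrip("/") + "/mlt?" + urlencode(params)


def top_categories(docs, k=3):
    """Sum the similarity scores per category and return the k best names."""
    totals = defaultdict(float)
    for doc in docs:
        totals[doc["category"]] += doc.get("score", 0.0)
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


if __name__ == "__main__":
    # Hand-written stand-in for the documents a real MLT response would return.
    sample_docs = [
        {"id": "1", "category": "Health", "score": 2.0},
        {"id": "2", "category": "Science", "score": 1.5},
        {"id": "3", "category": "Health", "score": 0.5},
        {"id": "4", "category": "Business", "score": 1.0},
        {"id": "5", "category": "Arts", "score": 0.2},
    ]
    print(build_mlt_url("http://localhost:8983/solr", "vitamin d and bone health"))
    print(top_categories(sample_docs))
```

Summing scores per category is just one possible voting rule; counting documents per category or taking the best score per category would work the same way. For long pages the text would probably have to be sent as a POST body rather than in the URL.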