using solr for web page classification

Rajesh Nikam Mon, 27 May 2013 05:58:56 -0700

Hello,

I am working on implementing a system to categorize URLs/web pages.


I would have categories like ...

Adult          Health         Business
Arts           Home           Science

I am looking at how Lucene/Solr could help me achieve this.
I came across links that mention MoreLikeThis (MLT) could help.

I found LucidWorks Search helpful, as it installed Jetty and Solr in a
few clicks.

Importing data and querying were also straightforward.

 My question is:

 - I have a pre-defined list of categories, and for each one I would have
web pages + documents that could be stored in the Solr index along with
their assigned category

 - Run input processors such as these on each page:

         Text extractor (from HTML, PDF, Office formats)
         Text language detection
         Standard text processors - stemming, stopword removal, lowercasing,
         etc.
         Title extractor
         Summary extractor
         Field mapping
         Header and footer remover

 - All these documents could be processed and stored in the Solr index with
their known category

 - When a new request comes in, I need to form an MLT or Solr query based
on the content of the web page and get similar documents.
 Based on the results I could reply with the top 3 categories.
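
 To make the last step concrete, here is a minimal sketch of how I imagine
turning the similar documents into the top 3 categories. It assumes each
returned document carries a "category" field and a relevance "score"; both
field names and the function name top_categories are placeholders of mine,
not anything Solr defines for this purpose.

```python
from collections import defaultdict


def top_categories(similar_docs, k=3):
    """Sum relevance scores per category and return the k best categories.

    similar_docs is a list of dicts with the (assumed) keys
    "category" and "score", e.g. parsed from a search response.
    """
    totals = defaultdict(float)
    for doc in similar_docs:
        totals[doc["category"]] += doc["score"]
    # Highest total score first; category name breaks ties deterministically.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ranked[:k]]


if __name__ == "__main__":
    docs = [
        {"category": "Health", "score": 2.0},
        {"category": "Science", "score": 1.5},
        {"category": "Health", "score": 1.0},
        {"category": "Arts", "score": 0.5},
    ]
    print(top_categories(docs))
```

 Summing scores is only one possible voting rule; counting documents per
category would be another.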


 Please let me know whether using Solr for this problem is the right
approach.
 If yes, how should I go about forming the query based on the web page
contents?
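
 For reference, this is roughly the kind of query I am thinking of. It is
only a sketch and rests on several assumptions: that a MoreLikeThisHandler
is registered at /mlt in solrconfig.xml, that the core is reachable at the
URL below, and that the extracted text is indexed in a field called
"content" with the label in "category". The URL, the field names and the
function name build_mlt_url are all placeholders.

```python
from urllib.parse import urlencode


def build_mlt_url(page_text,
                  base_url="http://localhost:8983/solr/collection1/mlt",
                  rows=10):
    """Build a MoreLikeThis request URL for text that is not in the index."""
    params = {
        "stream.body": page_text,   # raw text of the incoming page
        "mlt.fl": "content",        # assumed field to compare against
        "mlt.mintf": 1,             # minimum term frequency in the input
        "mlt.mindf": 1,             # minimum document frequency in the index
        "fl": "category,score",     # only return what the voting step needs
        "rows": rows,
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_mlt_url("cheap flights & hotels", rows=5))
```

 For a full web page the text would probably be too long for a GET URL, so
I assume the same parameters would have to be sent in a POST request
instead.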

Thanks
Rajesh
