Solr on Amazon EC2

2013-05-06 Thread Rajesh Nikam
Hello,

I am looking into document classification for categorizing HTML documents. I
see that Solr/Lucene + MoreLikeThis is suited to finding documents similar to
a given document.

I am able to do classification using the Lucene + MoreLikeThis example.
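
For reference, Solr also exposes MoreLikeThis over HTTP. A minimal sketch
against a local instance, assuming a MoreLikeThisHandler registered at /mlt
in solrconfig.xml and an indexed field named content (both are assumptions,
not a tested setup):

  # ask Solr for the indexed documents most similar to the supplied text
  $ curl "http://localhost:8983/solr/mlt" \
      --data-urlencode "stream.body=text of the document to classify" \
      --data-urlencode "mlt.fl=content" \
      --data-urlencode "mlt.mintf=1" \
      --data-urlencode "mlt.mindf=1" \
      --data-urlencode "fl=id,score" \
      --data-urlencode "rows=5"

The handler treats stream.body as the source document and returns the most
similar documents from the index.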

Then I looked into how to host Solr on Amazon EC2. I see Bitnami provides
AMI images for this, but there are 4000+ AMI IDs to choose from and I am not
sure which to use.

Could you please let me know which image is the correct one to use in this
case, or how to create a new image with Tomcat + Solr and save it for future
use?
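
(For reference, a minimal sketch of the second option on a stock Ubuntu
instance; the instance id is a placeholder and solr-tomcat is the Ubuntu
package name — assumptions rather than a tested recipe:)

  $ sudo apt-get update
  $ sudo apt-get install tomcat6 solr-tomcat   # Solr deployed into Tomcat
  # then, from a machine with the AWS CLI configured, snapshot the
  # configured instance as a reusable AMI:
  $ aws ec2 create-image --instance-id i-0abc1234 \
      --name "solr-tomcat-base" \
      --description "Tomcat + Solr, ready to clone"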

Thanks,
Rajesh


Install Solr on EC2

2013-05-10 Thread Rajesh Nikam
Hi All,

I am looking for steps to run Solr 3.6.2, or the latest stable version, on
Amazon EC2.

I want to save the image once it is created.

Could you please help with the steps that need to be followed?

I have tried steps from
https://github.com/sunspot/sunspot/wiki/Configure-Solr-on-Ubuntu,-the-quickest-way

however, running

$ sudo apt-get install solr-tomcat

gives the following error:

The following packages have unmet dependencies:
 solr-tomcat : Depends: solr-common (= 3.6.0+dfsg-1) but it is not going to
be installed
E: Broken packages
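
(One workaround when the packaged solr-tomcat is stuck on a version pin like
this: skip the distro package and run the stock Apache tarball with its
bundled Jetty. A sketch, assuming Solr 3.6.2 and OpenJDK 6:)

  $ sudo apt-get install openjdk-6-jre-headless
  $ wget https://archive.apache.org/dist/lucene/solr/3.6.2/apache-solr-3.6.2.tgz
  $ tar xzf apache-solr-3.6.2.tgz
  $ cd apache-solr-3.6.2/example
  $ java -jar start.jar   # example Solr comes up on http://localhost:8983/solr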


Thanks,
Rajesh


Re: [ANNOUNCE] Web Crawler

2013-05-22 Thread Rajesh Nikam
Hi,

Crawl-Anywhere seems to be using old versions of Java, Tomcat, etc.:

http://www.crawl-anywhere.com/installation-v300/

Will it work with newer versions of this required software?

Is there an updated installation guide available?

Thanks
Rajesh





On Wed, May 22, 2013 at 6:48 PM, Dominique Bejean  wrote:

> Hi,
>
> Crawl-Anywhere is now open-source - https://github.com/bejean/crawl-anywhere
>
> Best regards.
>
>
> Le 02/03/11 10:02, findbestopensource a écrit :
>
>> Hello Dominique Bejean,
>>
>> Good job.
>>
>> We have identified around 8 open source web crawlers:
>> http://www.findbestopensource.com/tagged/webcrawler
>> I don't know how yours differs from the rest.
>>
>> Your license states that it is not open source, but it is free for
>> personal use.
>>
>> Regards
>> Aditya
>> www.findbestopensource.com
>>
>>
>> On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean
>> <dominique.bej...@eolya.fr> wrote:
>>
>> Hi,
>>
>> I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java
>> web crawler. It includes:
>>
>>   * a crawler
>>   * a document processing pipeline
>>   * a solr indexer
>>
>> The crawler has a web administration interface for managing the web
>> sites to be crawled. Each web site crawl is configured with many
>> possible parameters (not all mandatory):
>>
>>   * number of simultaneous items crawled by site
>>   * recrawl period rules based on item type (html, PDF, …)
>>   * item type inclusion / exclusion rules
>>   * item path inclusion / exclusion / strategy rules
>>   * max depth
>>   * web site authentication
>>   * language
>>   * country
>>   * tags
>>   * collections
>>   * ...
>>
>> The pipeline includes various ready-to-use stages (text extraction,
>> language detection, a Solr-ready XML index writer, ...).
>>
>> Everything is very configurable and extensible, either by scripting
>> or Java coding.
>>
>> With scripting, you can help the crawler handle JavaScript links, or
>> help the pipeline extract a relevant title and clean up the HTML
>> pages (remove menus, headers, footers, ...).
>>
>> With Java coding, you can develop your own pipeline stages.
>>
>> The Crawl Anywhere web site provides good explanations and
>> screenshots. Everything is documented in a wiki.
>>
>> The current version is 1.1.4. You can download and try it out from
>> here: www.crawl-anywhere.com
>>
>>
>> Regards
>>
>> Dominique
>>
>>
>>
> --
> Dominique Béjean
> +33 6 08 46 12 43
> skype: dbejean
> www.eolya.fr
> www.crawl-anywhere.com
> www.mysolrserver.com
>
>


using solr for web page classification

2013-05-27 Thread Rajesh Nikam
Hello,

I am working on implementing a system to categorize URLs/web pages.

I would have categories like:

Adult, Arts, Business, Health, Home, Science

I am looking at how Lucene/Solr could help me achieve this. I came across
links mentioning that MoreLikeThis could help.

I found LucidWorks Search helpful, as it sets up Jetty and Solr in a few
clicks.

Importing data and querying were also straightforward.

 My questions are:

 - I have a pre-defined list of categories, and for each category a set of
web pages + documents that could be stored in the Solr index with the
category assigned.

 - I have input processors to run on each page:

 Text extraction (from HTML, PDF, Office formats)
 Text language detection
 Standard text processing - stemming, stopword removal, lowercasing, etc.
 Title extraction
 Summary extraction
 Field mapping
 Header and footer removal

 - All these documents could be processed and stored in the Solr index with
a known category.

 - When a new request comes in, I need to form an MLT or Solr query based on
the content of the web page and get similar documents. Based on the results,
I could reply with the top 3 categories (see the sketch after these
questions).


 Please let me know if using Solr this way is the correct approach for this
problem. If yes, how should I go about forming the query based on the web
page contents?
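
(A hedged sketch of both halves of this flow, assuming fields named id,
category, and content, and a MoreLikeThisHandler registered at /mlt; all
names are illustrative, not from a specific setup:)

  # 1) index a labelled page: extracted text plus its known category
  $ curl "http://localhost:8983/solr/update?commit=true" \
      -H "Content-Type: text/xml" \
      --data-binary '<add><doc>
        <field name="id">doc-001</field>
        <field name="category">Science</field>
        <field name="content">extracted page text ...</field>
      </doc></add>'

  # 2) classify a new page: send its extracted text to MLT and
  #    tally the category field over the top hits
  $ curl "http://localhost:8983/solr/mlt" \
      --data-urlencode "stream.body=extracted text of the new page" \
      --data-urlencode "mlt.fl=content" \
      --data-urlencode "fl=id,category,score" \
      --data-urlencode "rows=10"

Counting the category values of the returned documents (optionally weighted
by score) yields the top 3 categories to report.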

Thanks
Rajesh


Re: [blog post] Automatically Acquiring Synonym Knowledge from Wikipedia

2013-05-27 Thread Rajesh Nikam
Hello Koji,

This seems like a pretty useful post on how to create a synonyms file.
Thanks a lot for sharing it!

Have you shared the source code / jar for this so that it can be used?

Thanks,
Rajesh



On Mon, May 27, 2013 at 8:44 PM, Koji Sekiguchi  wrote:

> Hello,
>
> Sorry for the cross post. I just wanted to announce that I've written a
> blog post on how to create a synonyms.txt file automatically from
> Wikipedia:
>
>
> http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html
>
> Hope that the article gives someone a good experience!
>
> koji
> --
>
> http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html
>
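
(For anyone trying the result out: the generated synonyms.txt is plain
comma-separated groups, one per line, wired into the analyzer chain with
solr.SynonymFilterFactory. Illustrative entries, not taken from the post
itself:)

  $ head synonyms.txt
  usa, united states, united states of america
  tv, television

  # referenced from the field type in schema.xml with something like:
  # <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
  #         ignoreCase="true" expand="true"/>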


Re: [blog post] Automatically Acquiring Synonym Knowledge from Wikipedia

2013-05-28 Thread Rajesh Nikam
Hi Koji,

Great news! I am looking forward to this NLP toolkit.

Thanks a lot !
Rajesh



On Wed, May 29, 2013 at 4:12 AM, Koji Sekiguchi  wrote:

> Hi Rajesh,
>
> Thanks!
> I'm planning to open-source an NLP toolkit for Lucene, and the toolkit
> will include this synonym library.
>
> koji
>
>
> (13/05/28 14:12), Rajesh Nikam wrote:
>
>> Hello Koji,
>>
>> This seems like a pretty useful post on how to create a synonyms file.
>> Thanks a lot for sharing it!
>>
>> Have you shared the source code / jar for this so that it can be used?
>>
>> Thanks,
>> Rajesh
>>
>>
>>
>> On Mon, May 27, 2013 at 8:44 PM, Koji Sekiguchi wrote:
>>
>>> Hello,
>>>
>>> Sorry for the cross post. I just wanted to announce that I've written a
>>> blog post on how to create a synonyms.txt file automatically from
>>> Wikipedia:
>>>
>>> http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html
>>>
>>> Hope that the article gives someone a good experience!
>>>
>>> koji
>>> --
>>> http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html
>>>
>>>
>>
>
> --
> http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html
>