Re: Website (crawler for) indexing

2012-09-10 Thread Bernd Fehling
Some months ago I tested YaCy, and it works pretty well. http://yacy.net/en/ You can install it stand-alone and set up your own crawler (single or cluster). It has a very nice admin and control interface. After installation, disable the internal database and enable the feed to Solr; that's it. Regards,

Re: Website (crawler for) indexing

2012-09-07 Thread Dominique Bejean
Maybe you can take a look at Crawl-Anywhere, which has an administration web interface, a Solr indexer, and a search web application. www.crawl-anywhere.com Regards. Dominique On 05/09/12 17:05, Lochschmied, Alexander wrote: This may be a bit off topic: How do you index an existing website and c

RE: Website (crawler for) indexing

2012-09-06 Thread Markus Jelsma
-Original message- > From:Lochschmied, Alexander > Sent: Thu 06-Sep-2012 16:04 > To: solr-user@lucene.apache.org > Subject: AW: Website (crawler for) indexing > > Thanks Rafał and Markus for your comments. > > I think Droids has a serious problem with URL param

Re: AW: Website (crawler for) indexing

2012-09-06 Thread Rafał Kuć
Hello! I think that really depends on what you want to achieve and what parts of your current system you would like to reuse. If it is only HTML processing I would let Nutch and Solr do that. Of course you can extend Nutch (it has a plugin API) and implement the custom logic you need as a Nutch pl

AW: Website (crawler for) indexing

2012-09-06 Thread Lochschmied, Alexander
Thanks Rafał and Markus for your comments. I think Droids has a serious problem with URL parameters in the current version (0.2.0) from Maven Central: https://issues.apache.org/jira/browse/DROIDS-144 I knew about Nutch, but I haven't been able to implement a crawler with it. Have you done that or

Re: Website (crawler for) indexing

2012-09-05 Thread Rafał Kuć
Hello! You can implement your own crawler using Droids (http://incubator.apache.org/droids/) or use Apache Nutch (http://nutch.apache.org/), which is very easy to integrate with Solr and is a very powerful crawler. -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -

RE: Website (crawler for) indexing

2012-09-05 Thread Markus Jelsma
Please take a look at the Apache Nutch project. http://nutch.apache.org/ -Original message- > From:Lochschmied, Alexander > Sent: Wed 05-Sep-2012 17:09 > To: solr-user@lucene.apache.org > Subject: Website (crawler for) indexing > > This may be a bit off topic: H

Website (crawler for) indexing

2012-09-05 Thread Lochschmied, Alexander
This may be a bit off topic: How do you index an existing website and control the data going into the index? We already have Java code to process the HTML (or XHTML) and turn it into a SolrJ Document (removing tags and other things we do not want in the index). We use SolrJ for indexing. So I guess