Thanks Rafał and Markus for your comments. I think Droids has a serious problem with URL parameters in the current version (0.2.0) from Maven Central: https://issues.apache.org/jira/browse/DROIDS-144
I knew about Nutch, but I haven't been able to implement a crawler with it. Have you done that, or seen an example application? It is probably easy to call the Nutch jar and make it index a website, and maybe I will have to do that. But since we already have a Java implementation to index other sources, it would be nice if we could integrate the crawling part too.

Regards,
Alexander

------------------------------------

Hello!

You can implement your own crawler using Droids (http://incubator.apache.org/droids/) or use Apache Nutch (http://nutch.apache.org/), which is very easy to integrate with Solr and is a very powerful crawler.

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> This may be a bit off topic: How do you index an existing website and
> control the data going into the index?
> We already have Java code to process the HTML (or XHTML) and turn it
> into a SolrJ Document (removing tags and other things we do not want
> in the index). We use SolrJ for indexing.
> So I guess the question is essentially which Java crawler could be useful.
> We used to use wget on the command line in our publishing process, but we
> no longer want to do that.
> Thanks,
> Alexander
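For reference, here is a minimal SolrJ sketch of the pipeline Alexander describes in the quoted message: fetch a page, strip the markup, and index the result. The fetch() and stripTags() helpers are crude hypothetical stand-ins for the existing HTML-processing code, and the Solr URL and field names are assumptions that must match your schema. It targets the HttpSolrClient API of recent SolrJ releases; older SolrJ versions used CommonsHttpSolrServer instead.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PageIndexer {
    public static void main(String[] args) throws Exception {
        // Assumed Solr core name and location; adjust to your setup.
        SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/website").build();

        String url = "http://www.example.com/index.html";
        String html = fetch(url);

        // Build the SolrJ document; the field names here are assumptions.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", url);
        doc.addField("url", url);
        doc.addField("content", stripTags(html));

        solr.add(doc);
        solr.commit();
        solr.close();
    }

    // Crude page fetch; a real crawler would handle redirects,
    // character encodings, robots.txt, politeness delays, and so on.
    private static String fetch(String url) throws Exception {
        try (java.util.Scanner s =
                 new java.util.Scanner(new java.net.URL(url).openStream(), "UTF-8")) {
            s.useDelimiter("\\A");
            return s.hasNext() ? s.next() : "";
        }
    }

    // Crude placeholder for the real tag-removal code; use a proper
    // HTML parser rather than regexes in production.
    private static String stripTags(String html) {
        return html.replaceAll("(?is)<script.*?</script>", " ")
                   .replaceAll("<[^>]+>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }
}

A crawler like Nutch or Droids would replace the single hard-coded URL with a link-following fetch loop; the SolrInputDocument construction and the add/commit calls stay the same.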