Hi Dominique,

This looks nice. In the past, I've been interested in (semi-)automatically inducing a scheme/wrapper from a set of example web pages (often called 'wrapper induction' in the scientific field). This would allow for fast scheme creation, which could be used as a basis for extraction.
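To make that concrete, here is a minimal, hypothetical sketch of the simplest flavour of wrapper induction (left/right delimiter learning, in the spirit of Kushmerick's LR wrappers): given a few example pages where the target value is already known, it infers the delimiter strings surrounding the value and reuses them to extract the same field from unseen pages. The class and the data are purely illustrative and not tied to any existing crawler's API.

// Minimal LR wrapper induction sketch: learn left/right delimiters from
// labeled example pages, then reuse them on an unseen page.
import java.util.Arrays;
import java.util.List;

public class LrWrapperSketch {

    // Induce the longest left/right delimiters shared by all examples.
    static String[] induce(List<String> pages, List<String> targets) {
        String left = null, right = null;
        for (int i = 0; i < pages.size(); i++) {
            String page = pages.get(i);
            int pos = page.indexOf(targets.get(i));
            String l = page.substring(0, pos);
            String r = page.substring(pos + targets.get(i).length());
            left = (left == null) ? l : commonSuffix(left, l);
            right = (right == null) ? r : commonPrefix(right, r);
        }
        return new String[] { left, right };
    }

    // Apply the induced wrapper to a new page.
    static String extract(String page, String left, String right) {
        int start = page.indexOf(left);
        if (start < 0) return null;
        start += left.length();
        int end = page.indexOf(right, start);
        return end < 0 ? null : page.substring(start, end);
    }

    static String commonSuffix(String a, String b) {
        int n = 0;
        while (n < a.length() && n < b.length()
                && a.charAt(a.length() - 1 - n) == b.charAt(b.length() - 1 - n)) n++;
        return a.substring(a.length() - n);
    }

    static String commonPrefix(String a, String b) {
        int n = 0;
        while (n < a.length() && n < b.length() && a.charAt(n) == b.charAt(n)) n++;
        return a.substring(0, n);
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
            "<li>Price: <b>12.50</b> EUR</li>",
            "<li>Price: <b>7.99</b> EUR</li>");
        List<String> targets = Arrays.asList("12.50", "7.99");
        String[] lr = induce(pages, targets);
        System.out.println(extract("<li>Price: <b>3.10</b> EUR</li>", lr[0], lr[1])); // prints 3.10
    }
}

In practice the delimiters would be learned per field and per site, and a DOM/XPath-based representation is usually more robust than raw string delimiters, but the induction loop stays essentially the same.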
Lately I've been looking for crawlers that incorporate this technology, but without success. Any plans on incorporating this?

Cheers,
Geert-Jan

2011/3/2 Dominique Bejean <dominique.bej...@eolya.fr>

> Rosa,
>
> In the pipeline, there is a stage that extracts the text from the original
> document (PDF, HTML, ...).
> It is possible to plug in scripts (Java 6 compliant) in order to keep only
> the relevant parts of the document.
> See
> http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage
>
> Dominique
>
> On 02/03/11 09:36, Rosa (Anuncios) wrote:
>
>> Nice job!
>>
>> It would be good to be able to extract specific data from a given page via
>> XPATH though.
>>
>> Regards,
>>
>> On 02/03/2011 01:25, Dominique Bejean wrote:
>>
>>> Hi,
>>>
>>> I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java web
>>> crawler. It includes:
>>>
>>> * a crawler
>>> * a document processing pipeline
>>> * a Solr indexer
>>>
>>> The crawler has a web administration interface for managing the web sites to be
>>> crawled. Each web site crawl is configured with a lot of possible parameters
>>> (not all mandatory):
>>>
>>> * number of simultaneous items crawled per site
>>> * recrawl period rules based on item type (HTML, PDF, …)
>>> * item type inclusion / exclusion rules
>>> * item path inclusion / exclusion / strategy rules
>>> * max depth
>>> * web site authentication
>>> * language
>>> * country
>>> * tags
>>> * collections
>>> * ...
>>>
>>> The pipeline includes various ready-to-use stages (text extraction,
>>> language detection, Solr ready-to-index XML writer, ...).
>>>
>>> Everything is very configurable and extensible, either by scripting or Java
>>> coding.
>>>
>>> With scripting, you can help the crawler handle JavaScript
>>> links, or help the pipeline extract the relevant title and clean up the HTML
>>> pages (remove menus, headers, footers, ...).
>>>
>>> With Java coding, you can develop your own pipeline stages.
>>>
>>> The Crawl Anywhere web site provides good explanations and screenshots.
>>> Everything is documented in a wiki.
>>>
>>> The current version is 1.1.4. You can download and try it out from here:
>>> www.crawl-anywhere.com
>>>
>>> Regards,
>>>
>>> Dominique
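Regarding the XPath request above: independently of Crawl Anywhere's own scripting or stage API (which I have not verified), the extraction step itself would look roughly like the following generic Java snippet using the JDK's javax.xml.xpath package. It assumes the fetched page is already well-formed XHTML; real-world HTML would first need an HTML-aware parser to produce a clean DOM. The sample document, the id='content' convention, and the class name are purely illustrative.

// Generic sketch of XPath-based extraction, not Crawl Anywhere's actual API:
// parse a well-formed XHTML page and keep only the nodes matching a
// configured expression.
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathExtractSketch {
    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div id='menu'>Home | About</div>"
                + "<div id='content'><h1>Title</h1><p>Relevant text.</p></div>"
                + "</body></html>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Keep only the main content block, ignoring menus/headers/footers.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//div[@id='content']//text()", doc, XPathConstants.NODESET);

        StringBuilder extracted = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            extracted.append(nodes.item(i).getNodeValue()).append(' ');
        }
        System.out.println(extracted.toString().trim()); // Title Relevant text.
    }
}

The same pattern also covers the cleanup mentioned earlier in the thread (removing menus, headers, footers): instead of keeping the matching nodes, a script can select the unwanted nodes and drop them from the DOM before text extraction.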