Hi.

On Saturday, 7 November 2009 at 08:55 -0500, Kingsley Idehen wrote:
> Olivier Berger wrote:
> > Hello.
> >
> > Is there any example of use of Virtuoso Sponger cartridges to do "web
> > scraping" of (old) HTML pages of a web app to produce RDF?
> >
> > I'm particularly interested in analysis of the content of HTML pages for
> > apps that just have old HTML, i.e. no microformats and such, where the
> > scraping would consist of identifying values in tables for instance
> > (good old regexes and such)?
> >
> > It seems to me that current examples only deal with XML or other XHTML
> > and structured content like RDFa...
> >
> > Thanks in advance.
> >
> > Best regards,
> >
> The sponger cartridges for HTML at the very least require Plain Old
> Semantic HTML (POSH) in place. Otherwise, the Meta Cartridges contribute
> most of the Triples by looking up related data from across Web via a
> plethora of services e.g. Yahoo!, Google, Bing!, Linked Data Cloud
> Cache, DBpedia, Sindice, and 30 or so other places (typically Web 2.0
> style Web Services). The net effect is at the very least a model that
> shows where a Page has been referenced elsewhere. Of course, we also
> make triples for the outbound links in the HTML page.
>
> To conclude, as long as an old page has been referenced somewhere and/or
> it contains outbound links, we have data for Linked Data graph
> generation :-)
>
OK, so by default, nothing shipped with the sponger demonstrates parsing
of the HTML content itself (apart from the <a/> elements and their hrefs)?

So I suppose that by writing some code I can create a new cartridge
that would do it...

Regards,
-- 
Olivier BERGER <olivier.ber...@it-sudparis.eu>
http://www-public.it-sudparis.eu/~berger_o/ - OpenPGP-Id: 1024D/6B829EEC
Research Engineer - INF Dept
Institut TELECOM, SudParis (http://www.it-sudparis.eu/), Evry (France)
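
P.S. To make the idea concrete, here is a rough sketch of the extraction step such a cartridge would need: reading the values out of a plain HTML table and emitting one triple per cell. It is standalone Python and does not use the Sponger cartridge API at all; the vocabulary namespace (http://example.org/vocab#), the "#rowN" subject scheme and the names TableRows and table_to_ntriples are assumptions made up for the illustration. It uses the standard library's tolerant HTML parser rather than regexes, since old pages often leave <td> and <tr> unclosed.

```python
import re
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect cell text grouped by row, tolerating unclosed <td>/<tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        if self._row is not None:
            self._close_cell()
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # an unclosed previous row ends here
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # an unclosed previous cell ends here
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def finish(self):
        """Flush anything still open at end of input."""
        self.close()
        self._close_row()


def escape_literal(text):
    """Escape a string for use as an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def table_to_ntriples(html, page_url, vocab="http://example.org/vocab#"):
    """Treat the first row as column headers; emit one triple per cell.

    The vocab namespace and the '#rowN' subjects are hypothetical.
    """
    parser = TableRows()
    parser.feed(html)
    parser.finish()
    if not parser.rows:
        return []
    header, *body = parser.rows
    triples = []
    for i, row in enumerate(body, start=1):
        subject = "<%s#row%d>" % (page_url, i)
        for name, value in zip(header, row):
            slug = re.sub(r"\W+", "_", name.strip().lower())
            triples.append('%s <%s%s> "%s" .'
                           % (subject, vocab, slug, escape_literal(value)))
    return triples


if __name__ == "__main__":
    page = ("<table><tr><th>Bug ID<th>Status"
            "<tr><td>42<td>Open</table>")
    for line in table_to_ntriples(page, "http://example.org/bugs"):
        print(line)
```

In an actual cartridge the same logic would have to be hooked into the Sponger's own extraction mechanism, and the column-to-predicate mapping would come from a real ontology rather than from slugs of the header text.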