Olivier Berger wrote:
Hello.

Is there any example of use of Virtuoso Sponger cartridges to do "web
scraping" of (old) HTML pages of a web app to produce RDF ?

I'm particularly interested in analysis of the content of HTML pages for
apps that just have old HTML, i.e. no microformats and such, where the
scraping would consist of identifying values in tables for instance
(good old regexes and such) ?

It seems to me that current examples only deal with XML or other XHTML
and structured content like RDFa...

Thanks in advance.

Best regards,

The Sponger cartridges for HTML require, at the very least, Plain Old Semantic HTML (POSH) to be in place. Otherwise, the Meta Cartridges contribute most of the triples by looking up related data from across the Web via a plethora of services, e.g. Yahoo!, Google, Bing, the Linked Data Cloud Cache, DBpedia, Sindice, and 30 or so other places (typically Web 2.0 style Web Services). The net effect is, at the very least, a model that shows where a page has been referenced elsewhere. Of course, we also make triples for the outbound links in the HTML page.

To conclude, as long as an old page has been referenced somewhere and/or it contains outbound links, we have data for Linked Data graph generation :-)
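
As a rough illustration of the outbound-link point only, here is a minimal Python sketch that pulls the links out of a plain old HTML page and writes one N-Triples statement per link. It is not how the Sponger is implemented; the names LinkCollector and outbound_link_triples are made up for this example, and dcterms:references is just an assumed stand-in for whatever predicate a real cartridge would use. It also does not tell same-site links apart from truly outbound ones.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: dcterms:references is a placeholder predicate chosen for this
# sketch, not the predicate an actual Sponger cartridge emits.
LINK_PREDICATE = "http://purl.org/dc/terms/references"


class LinkCollector(HTMLParser):
    """Collects the href of every <a> element, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def outbound_link_triples(page_url, html):
    """Return one N-Triples line per link found in the given HTML text."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    collector.close()
    return [
        f"<{page_url}> <{LINK_PREDICATE}> <{target}> ."
        for target in collector.links
    ]


if __name__ == "__main__":
    sample = '<table><tr><td><a href="/bugs/42">Bug 42</a></td></tr></table>'
    for line in outbound_link_triples("http://example.com/app/index.html", sample):
        print(line)
```

The same parser-based approach could be extended to table cells, which is closer to the scraping described in the question, but the mapping from cell values to RDF properties would be specific to each application.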

--


Regards,

Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO OpenLink Software Web: http://www.openlinksw.com




