Re: [Virtuoso-users] Web scraping with virtuoso sponger ?

Kingsley Idehen Sat, 07 Nov 2009 13:55:19 +0000

Olivier Berger wrote:

Hello.


Is there any example of use of Virtuoso Sponger cartridges to do "web
scraping" of (old) HTML pages of a web app to produce RDF ?

I'm particularly interested in analysis of the content of HTML pages for
apps that just have old HTML, i.e. no microformats and such, where the
scraping would consist of identifying values in tables, for instance
(good old regexes and such) ?

It seems to me that current examples only deal with XML or other XHTML
and structured content like RDFa...

Thanks in advance.

Best regards,

The Sponger cartridges for HTML at the very least require Plain Old Semantic HTML (POSH) in place. Otherwise, the Meta Cartridges contribute most of the triples by looking up related data from across the Web via a plethora of services, e.g. Yahoo!, Google, Bing!, Linked Data Cloud Cache, DBpedia, Sindice, and 30 or so other places (typically Web 2.0 style Web Services). The net effect is at the very least a model that shows where a page has been referenced elsewhere. Of course, we also make triples for the outbound links in the HTML page.

To conclude, as long as an old page has been referenced somewhere and/or it contains outbound links, we have data for Linked Data graph generation :-)
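
To make the outbound-links point concrete, here is a minimal Python sketch of the general idea: collect the `<a href>` targets of a plain old HTML page and write one triple per link in N-Triples form. This is not the Sponger's code or its cartridge API. The names `LinkCollector` and `outbound_link_triples` are made up for illustration, and the choice of `dcterms:references` as the predicate is an assumption, since nothing above says which vocabulary the Sponger actually uses.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Assumption: dcterms:references is only an illustrative predicate here;
# the vocabulary the Sponger really emits is not stated in this thread.
REFERENCES = "http://purl.org/dc/terms/references"

# Characters that may not appear unescaped inside an N-Triples IRI.
FORBIDDEN = set(' <>"{}|^`\\')


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers absolute http(s) link targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop fragments.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if not absolute.startswith(("http://", "https://")):
            return  # skip mailto:, javascript:, and similar schemes
        if FORBIDDEN & set(absolute):
            return  # skip targets that would need escaping first
        if absolute != self.base_url and absolute not in self.links:
            self.links.append(absolute)


def outbound_link_triples(page_url, html):
    """Return one N-Triples line per distinct link found in the page."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    collector.close()
    return [
        f"<{page_url}> <{REFERENCES}> <{target}> ."
        for target in collector.links
    ]
```

Calling `outbound_link_triples("http://example.com/app/index.html", html)` on a page whose table cells contain links would give one line per distinct target, ready to load into a graph. Pulling actual values out of table cells (the regex part of the question) would still need page-specific logic on top of this.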


--


Regards,

Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen

President & CEO
OpenLink Software     Web: http://www.openlinksw.com
