Re: [Virtuoso-users] Web scraping with virtuoso sponger ?

Olivier Berger Sat, 07 Nov 2009 15:14:25 +0000

Hi.

On Saturday 07 November 2009 at 08:55 -0500, Kingsley Idehen wrote:
> Olivier Berger wrote:
> > Hello.
> >
> > Is there any example of use of Virtuoso Sponger cartridges to do "web
> > scraping" of (old) HTML pages of a web app to produce RDF ?
> >
> > I'm particularly interested in analysing the content of HTML pages for
> > apps that just have old HTML, i.e. no microformats and such, where the
> > scraping would consist of identifying values in tables, for instance
> > (good old regexes and such)?
> >
> > It seems to me that current examples only deal with XML or other XHTML
> > and structured content like RDFa...
> >
> > Thanks in advance.
> >
> > Best regards,
> >   
> The sponger cartridges for HTML at the very least require Plain Old 
> Semantic HTML (POSH) in place. Otherwise, the Meta Cartridges contribute 
> most of the Triples by looking up related data from across the Web via a 
> plethora of services e.g.  Yahoo!, Google, Bing!, Linked Data Cloud 
> Cache, DBpedia, Sindice, and 30 or so other places (typically Web 2.0 
> style Web Services).  The net effect is at the very least a model that 
> shows where a Page has been referenced elsewhere. Of course, we also 
> make triples for the outbound links in the HTML page.
> 
> To conclude, as long as an old page has been referenced somewhere and/or 
> it contains outbound links, we have data for Linked Data graph 
> generation :-)
>


OK, so by default nothing shipped with the Sponger demonstrates parsing
of the HTML content itself (apart from the <a> elements and their
hrefs)?

So I suppose that by writing some code I can create a new cartridge
that'd do it...
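
To make the goal concrete, here is a rough sketch of the kind of table
scraping I have in mind. It is plain Python, run outside Virtuoso, and is
not a Sponger cartridge: the function names, the example.org URIs and the
assumption that the first table row holds the column names are all made
up for illustration. It uses the standard library HTML parser rather than
regexes, since old pages often omit closing tags.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell being parsed

    def _flush_cell(self):
        # Close the open cell, if any (old HTML often omits </td>).
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _flush_row(self):
        # Close the open row, if any (old HTML often omits </tr>).
        self._flush_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()


def table_to_ntriples(html, base_uri, vocab_uri):
    """Turn table rows into N-Triples lines (hypothetical mapping).

    Assumes the first row holds column names; each later row becomes
    one subject, each cell one plain literal.
    """
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    triples = []
    for i, row in enumerate(body, start=1):
        subject = f"<{base_uri}#row{i}>"
        for name, value in zip(header, row):
            predicate = f"<{vocab_uri}{name.lower().replace(' ', '_')}>"
            # Escape the characters N-Triples does not allow raw in literals.
            literal = (value.replace("\\", "\\\\").replace('"', '\\"')
                            .replace("\n", "\\n").replace("\r", "\\r"))
            triples.append(f'{subject} {predicate} "{literal}" .')
    return triples
```

Calling table_to_ntriples(page_html, "http://example.org/bugs",
"http://example.org/vocab#") would give one triple per cell, which could
then be loaded into a graph by whatever means. A real cartridge would
presumably do the equivalent inside Virtuoso.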

Regards,
-- 
Olivier BERGER <olivier.ber...@it-sudparis.eu>
http://www-public.it-sudparis.eu/~berger_o/ - OpenPGP-Id: 1024D/6B829EEC
Research Engineer - Dept INF
Institut TELECOM, SudParis (http://www.it-sudparis.eu/), Evry (France)
