Thanks Kingsley. In response to your question, "Did you enable the "store metadata option" and then select relevant cartridge?"
Yes, I selected store metadata (xHTML option) with the following extractor cartridge settings:

add-html-meta=yes
get-feeds=no
preview-length=512
fallback-mode=no
rdfa=yes
reify_html5md=0
reify_rdfa=0
reify_jsonld=0
reify_all_grddl=0
reify_html=0
passthrough_mode=yes
loose=yes
reify_html_misc=no
reify_turtle=no

Should I change any of these?

> The faceted search interface seems to indicate so as I did this with
> the following graph IRI, http://adlnet.gov/expapi/verbs
>
> http://54.152.125.100:8890/describe/?url=http%3A%2F%2Fadlnet.gov%2Fexpapi%2Fverbs&sid=4
>
> I tried to set up this IRI as a crawl job and it never populated
> virtuoso's data store.

Did you enable the "store metadata option" and then select relevant cartridge?

----------------------------------------------------------------------

Message: 1
Date: Thu, 12 Nov 2015 17:39:37 -0500
From: Kingsley Idehen <kide...@openlinksw.com>
Subject: Re: [Virtuoso-users] Mapper Options in Conductor (Question about Sponging)
To: virtuoso-users@lists.sourceforge.net
Message-ID: <56451529.5030...@openlinksw.com>
Content-Type: text/plain; charset="windows-1252"

On 11/12/15 5:16 PM, Haag, Jason wrote:
> Hi All,
>
> I have been trying to understand how virtuoso's crawler content import
> and sponging features work. I'm currently evaluating virtuoso using
> 07.20.3214 VOS.
>
> I set up three crawl jobs for three different HTML/RDFa files and
> received no errors.
>
> When I attempt to use the sparql interface to query the data it
> doesn't show up:
>
> For example, http://w3id.org/xapi/adb/verbs/ is the target URL of a
> crawl job I set up in conductor under content imports.
> I am using the
> xhtml/HTM5 variants cartridge with the following options:
>
> fallback-mode=no
> rdfa=yes
> reify_html5md=0
> reify_rdfa=1
> reify_jsonld=0
> reify_all_grddl=0
> reify_html=0
> passthrough_mode=yes
> loose=yes
> reify_html_misc=no
> reify_turtle=no
>
> If I go to http://54.152.125.100:8890/sparql and use the following
> sparql query it returns no results:
>
> #Query all Verb IRIs
> PREFIX xapi: <https://w3id.org/xapi/ontology#>
>
> SELECT DISTINCT ?Verb
>
> WHERE {
> ?Verb a xapi:Verb .
>
> }
>
>
> However, the data does start to show up in this query if I
> subsequently add http://w3id.org/xapi/adb/verbs/ as the default data
> set name / graph IRI in the sparql interface and also select the
> sponging option to download all RDF resources.
>
> Is this sponging option from the sparql interface actually
> adding/download the triples?

Yes, "sponging" is our colloquialism for "importing data" from some URL.

> Wouldn't this allow anyone to add triples that has access to the
> sparql interface?

Yes, if you don't apply access controls to your SPARQL endpoint [1]

> The faceted search interface seems to indicate so as I did this with
> the following graph IRI, http://adlnet.gov/expapi/verbs
>
> http://54.152.125.100:8890/describe/?url=http%3A%2F%2Fadlnet.gov%2Fexpapi%2Fverbs&sid=4
>
> I tried to set up this IRI as a crawl job and it never populated
> virtuoso's data store.

Did you enable the "store metadata option" and then select relevant cartridge?

> But as soon as I add it as a graph IRI using the sparql interface and
> sponging it shows up. Is this the expected behavior / by design for
> this sparql sponging option?

Yes, you can also import data via SPARQL. You can even automatically convert CSV data to 5-Star Linked Data via SPARQL integration with the Sponger [2][3]

> I thought graphs and triples could only be added with special SPARQL
> permissions and using INSERT.
>
> I still don't think the crawler feature is working for HTML/RDFa.
> It
> appears to be processing and storing the HTML file in the
> repository/locally in virtuoso, but it doesn't seem to actually add
> the graph or triples to the database.
>
> Thanks in advance for your patience and help!
>
> J Haag

Links:

[1] http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtSPARQLEndpointProtection -- Protecting Your SPARQL Endpoint
[2] http://virtuoso.openlinksw.com/tutorials/sparql/SPARQL_Tutorials_Part_7/SPARQL_Tutorials_Part_7.html#(1) -- Sponger Pragmas and Web Crawling via SPARQL
[3] http://kidehen.blogspot.com/2015/11/generating-linked-data-from-open-data.html -- Generating 5-Star Linked Data from Open Data.

Kingsley

-------------------------------------------------------
+1.850.266.7100 (office)
+1.850.471.1300 (mobile)
jhaag75 (skype)
http://jasonhaag.com (Web)
http://twitter.com/mobilejson (Twitter)
http://linkedin.com/in/jasonhaag (LinkedIn)

On Thu, Nov 12, 2015 at 4:21 PM, Haag, Jason <jhaa...@gmail.com> wrote:

> Oh I forgot to include the error message in the database. <DB NULL>
>
> Query result:
>
> VQ_HOST (VARCHAR): xapi.vocab.pub
> VQ_TS (DATETIME): 2015-10-27 23:25:38.613142
> VQ_URL (VARCHAR): /datasets/adl/verbs/
> VQ_ROOT (VARCHAR): home/dba/rdf_sink/adl/index.html
> VQ_STAT (VARCHAR): retrieved
> VQ_OTHER (VARCHAR): <DB NULL>
> VQ_ERROR (LONG VARCHAR): <DB NULL>
> VQ_LEVEL (INTEGER): 0
> VQ_VIA_SITEMAP (INTEGER): 0
> VQ_DT (TIMESTAMP): 2015-11-12 21:08:17.809787
> VQ_ORIGIN (IRI_ID): <DB NULL>
>
> On Thu, Nov 12, 2015 at 4:16 PM, Haag, Jason <jhaa...@gmail.com> wrote:
>
>> Hi All,
>>
>> I have been trying to understand how virtuoso's crawler content import
>> and sponging features work. I'm currently evaluating virtuoso using
>> 07.20.3214 VOS.
>>
>> I set up three crawl jobs for three different HTML/RDFa files and
>> received no errors.
>>
>> When I attempt to use the sparql interface to query the data it doesn't
>> show up:
>>
>> For example, http://w3id.org/xapi/adb/verbs/ is the target URL of a
>> crawl job I set up in conductor under content imports. I am using the
>> xhtml/HTM5 variants cartridge with the following options:
>>
>> fallback-mode=no
>> rdfa=yes
>> reify_html5md=0
>> reify_rdfa=1
>> reify_jsonld=0
>> reify_all_grddl=0
>> reify_html=0
>> passthrough_mode=yes
>> loose=yes
>> reify_html_misc=no
>> reify_turtle=no
>>
>> If I go to http://54.152.125.100:8890/sparql and use the following
>> sparql query it returns no results:
>>
>> #Query all Verb IRIs
>> PREFIX xapi: <https://w3id.org/xapi/ontology#>
>>
>> SELECT DISTINCT ?Verb
>>
>> WHERE {
>> ?Verb a xapi:Verb .
>>
>> }
>>
>>
>> However, the data does start to show up in this query if I subsequently
>> add http://w3id.org/xapi/adb/verbs/ as the default data set name / graph
>> IRI in the sparql interface and also select the sponging option to download
>> all RDF resources.
>>
>> Is this sponging option from the sparql interface actually
>> adding/download the triples? Wouldn't this allow anyone to add triples that
>> has access to the sparql interface? The faceted search interface seems to
>> indicate so as I did this with
>> the following graph IRI, http://adlnet.gov/expapi/verbs
>>
>>
>> http://54.152.125.100:8890/describe/?url=http%3A%2F%2Fadlnet.gov%2Fexpapi%2Fverbs&sid=4
>>
>> I tried to set up this IRI as a crawl job and it never populated
>> virtuoso's data store. But as soon as I add it as a graph IRI using the
>> sparql interface and sponging it shows up. Is this the expected behavior /
>> by design for this sparql sponging option? I thought graphs and triples
>> could only be added with special SPARQL permissions and using INSERT.
>>
>> I still don't think the crawler feature is working for HTML/RDFa.
>> It
>> appears to be processing and storing the HTML file in the
>> repository/locally in virtuoso, but it doesn't seem to actually add the
>> graph or triples to the database.
>>
>> Thanks in advance for your patience and help!
>>
>> J Haag
>>
>> -------------------------------------------------------
>>
>> On Wed, Oct 28, 2015 at 5:17 AM, Tim Haynes <thay...@openlinksw.com>
>> wrote:
>>
>>>
>>> On 27 October 2015 at 20:49, Haag, Jason <jhaa...@gmail.com> wrote:
>>>
>>>> I think I know the answer to my last two questions. I had additional
>>>> html files below the /verbs/ directory. I believe that is where the
>>>> duplicates came from. I'm guessing sponger also looks for any html files at
>>>> the specified path, not just the "index.html" file that was specified as a
>>>> target URL. Can anyone verify this?
>>>
>>>
>>> Hi,
>>>
>>> It's unlikely - I don't know of anything in the Sponger that implements
>>> directory browsing, but it may well be following e.g. <link
>>> rel="alternate" href="...." /> to RSS/Atom feeds, etc.
>>>
>>> As Kingsley says, Faceted Browser will show you what graphs the triples
>>> appear in.
>>>
>>> When a page is sponged, its URL becomes 1:1 the graph IRI in which data
>>> from/about/in that resource is stored. Multiple graphs implies multiple
>>> sponging events.
>>>
>>> HTH,
>>>
>>> ~Tim
>>> --
>>> Tim Haynes
>>> Product Development Consultant
>>> OpenLink Software
>>> <http://www.openlinksw.com/>
>>> <http://twitter.com/openlink>
>>>
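
A note for anyone following this thread: one way to check whether a crawl job actually stored triples, and in which graph, is to ask the endpoint which named graphs contain xapi:Verb instances. As Tim says above, a sponged page's URL becomes its graph IRI, so that IRI should appear in the result if the import worked. The Python sketch below only builds the request URL, using the standard SPARQL Protocol parameters "query" and "default-graph-uri". The helper name build_query_url is hypothetical, the endpoint is simply the one quoted in the thread, and nothing here contacts the server or triggers sponging.

```python
from urllib.parse import urlencode

# Endpoint quoted in the thread; replace with your own Virtuoso instance.
ENDPOINT = "http://54.152.125.100:8890/sparql"

# Lists every named graph that holds at least one xapi:Verb instance.
GRAPHS_WITH_VERBS = """
PREFIX xapi: <https://w3id.org/xapi/ontology#>
SELECT DISTINCT ?g
WHERE { GRAPH ?g { ?Verb a xapi:Verb . } }
"""


def build_query_url(endpoint, query, default_graph=None):
    """Build a SPARQL Protocol GET URL (hypothetical helper).

    Uses only the standard `query` and `default-graph-uri` parameters.
    """
    params = {"query": query.strip()}
    if default_graph is not None:
        # Restricts the default graph of the query to the given graph IRI.
        params["default-graph-uri"] = default_graph
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    # Paste the printed URL into a browser or pass it to curl.
    print(build_query_url(ENDPOINT, GRAPHS_WITH_VERBS))
```

If the crawl target (for example http://w3id.org/xapi/adb/verbs/) is missing from the returned graphs, the crawler stored the file without loading its triples, which matches the symptom described above.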