Re: [Virtuoso-users] Mapper Options in Conductor (Question about Sponging)

Haag, Jason Fri, 13 Nov 2015 12:15:12 -0800

Thanks Kingsley. In response to your question, "Did you enable the "store
metadata option" and then select relevant cartridge?"


Yes, I selected store metadata (xHTML option) with the following extractor
cartridge settings:

add-html-meta=yes
get-feeds=no
preview-length=512
fallback-mode=no
rdfa=yes
reify_html5md=0
reify_rdfa=0
reify_jsonld=0
reify_all_grddl=0
reify_html=0
passthrough_mode=yes
loose=yes
reify_html_misc=no
reify_turtle=no

Should I change any of these?




> The faceted search interface seems to indicate so as I did this with
> the following graph IRI, http://adlnet.gov/expapi/verbs
>
>
> http://54.152.125.100:8890/describe/?url=http%3A%2F%2Fadlnet.gov%2Fexpapi%2Fverbs&sid=4
>
> I tried to set up this IRI as a crawl job and it never populated
> virtuoso's data store.

Did you enable the "store metadata option" and then select relevant
cartridge?


----------------------------------------------------------------------

Message: 1
Date: Thu, 12 Nov 2015 17:39:37 -0500
From: Kingsley Idehen <kide...@openlinksw.com>
Subject: Re: [Virtuoso-users] Mapper Options in Conductor (Question
        about Sponging)
To: virtuoso-users@lists.sourceforge.net
Message-ID: <56451529.5030...@openlinksw.com>
Content-Type: text/plain; charset="windows-1252"

On 11/12/15 5:16 PM, Haag, Jason wrote:
> Hi All,
>
> I have been trying to understand how virtuoso's crawler content import
> and sponging features work. I'm currently evaluating virtuoso using
> 07.20.3214 VOS.
>
> I set up three crawl jobs for three different HTML/RDFa files and
> received no errors.
>
> When I attempt to use the sparql interface to query the data it
> doesn't show up:
>
> For example, http://w3id.org/xapi/adb/verbs/ is the target URL of a
> crawl job I set up in conductor under content imports. I am using the
> xhtml/HTML5 variants cartridge with the following options:
>
> fallback-mode=no
> rdfa=yes
> reify_html5md=0
> reify_rdfa=1
> reify_jsonld=0
> reify_all_grddl=0
> reify_html=0
> passthrough_mode=yes
> loose=yes
> reify_html_misc=no
> reify_turtle=no
>
> If I go to http://54.152.125.100:8890/sparql and use the following
> sparql query it returns no results:
>
> #Query all Verb IRIs
> PREFIX xapi: <https://w3id.org/xapi/ontology#>
>
> SELECT DISTINCT ?Verb
>
> WHERE {
>    ?Verb a xapi:Verb .
>
> }
>
>
> However, the data does start to show up in this query if I
> subsequently add http://w3id.org/xapi/adb/verbs/ as the default data
> set name / graph IRI in the sparql interface and also select the
> sponging option to download all RDF resources.
>
> Is this sponging option from the sparql interface actually
> adding/downloading the triples?

Yes, "sponging" is our colloquialism for "importing data" from some URL.
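
As a rough illustration of what the sponging option on the /sparql page amounts to, the sketch below builds (but does not send) a request URL for the endpoint mentioned in this thread. It assumes the form submits the standard `default-graph-uri` and `query` parameters plus a `should-sponge` parameter; the parameter name and its "soft" value are assumptions to check against your Virtuoso version, and `sponge_request_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint and graph IRI are the ones quoted in this thread.
ENDPOINT = "http://54.152.125.100:8890/sparql"
GRAPH_IRI = "http://w3id.org/xapi/adb/verbs/"

QUERY = """PREFIX xapi: <https://w3id.org/xapi/ontology#>
SELECT DISTINCT ?Verb
WHERE { ?Verb a xapi:Verb . }"""


def sponge_request_url(endpoint, graph_iri, query, sponge="soft"):
    """Build a GET URL for a SPARQL request that also asks for sponging.

    `should-sponge` is an ASSUMED parameter name (the one the /sparql
    form is believed to submit); verify it on your own instance.
    """
    params = {
        "default-graph-uri": graph_iri,  # graph to query (and to sponge into)
        "query": query,
        "should-sponge": sponge,
    }
    return endpoint + "?" + urlencode(params)


url = sponge_request_url(ENDPOINT, GRAPH_IRI, QUERY)
print(url)
```

Anyone who can issue such a request against an unprotected endpoint can trigger the import, which is why the access controls discussed next matter.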

> Wouldn't this allow anyone who has access to the sparql interface
> to add triples?

Yes, if you don't apply access controls to your SPARQL endpoint [1].

> The faceted search interface seems to indicate so as I did this with
> the following graph IRI, http://adlnet.gov/expapi/verbs
>
>
> http://54.152.125.100:8890/describe/?url=http%3A%2F%2Fadlnet.gov%2Fexpapi%2Fverbs&sid=4
>
> I tried to set up this IRI as a crawl job and it never populated
> virtuoso's data store.

Did you enable the "store metadata option" and then select relevant
cartridge?

> But as soon as I add it as a graph IRI using the sparql interface and
> sponging it shows up. Is this the expected behavior / by design for
> this sparql sponging option?

Yes, you can also import data via SPARQL. You can even automatically
convert CSV data to 5-Star Linked Data via SPARQL integration with the
Sponger [2][3].
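
A minimal sketch of importing via SPARQL itself, assuming the Sponger pragma `define get:soft` described in [2] is placed in front of the query text. `with_sponge_pragma` is a hypothetical helper that only assembles the query string; running it still requires an endpoint whose SPARQL user is permitted to sponge.

```python
def with_sponge_pragma(graph_iri, mode="soft"):
    """Return the thread's Verb query prefixed with a Sponger pragma.

    mode "soft" asks for a fetch only if the graph is not already
    loaded; "replace" asks for a re-fetch. Other values are rejected
    here simply to keep the sketch small.
    """
    if mode not in ("soft", "replace"):
        raise ValueError("mode must be 'soft' or 'replace'")
    return (
        f'define get:soft "{mode}"\n'
        "PREFIX xapi: <https://w3id.org/xapi/ontology#>\n"
        "SELECT DISTINCT ?Verb\n"
        f"FROM <{graph_iri}>\n"
        "WHERE { ?Verb a xapi:Verb . }"
    )


query = with_sponge_pragma("http://w3id.org/xapi/adb/verbs/")
print(query)
```

The FROM clause names the URL to fetch, and the same URL becomes the graph IRI under which the imported triples are stored.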

> I thought graphs and triples could only be added with special SPARQL
> permissions and using INSERT.
>
> I still don't think the crawler feature is working for HTML/RDFa. It
> appears to be processing and storing the HTML file in the
> repository/locally in virtuoso, but it doesn't seem to actually add
> the graph or triples to the database.
>
> Thanks in advance for your patience and help!
>
> J Haag

Links:

[1]
http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtSPARQLEndpointProtection
-- Protecting Your SPARQL Endpoint

[2]
http://virtuoso.openlinksw.com/tutorials/sparql/SPARQL_Tutorials_Part_7/SPARQL_Tutorials_Part_7.html#(1)
-- Sponger Pragmas and Web Crawling via SPARQL

[3]
http://kidehen.blogspot.com/2015/11/generating-linked-data-from-open-data.html
-- Generating 5-Star Linked Data from Open Data.

Kingsley

-------------------------------------------------------
+1.850.266.7100(office)
+1.850.471.1300 (mobile)
jhaag75 (skype)
http://jasonhaag.com (Web)
http://twitter.com/mobilejson (Twitter)
http://linkedin.com/in/jasonhaag (LinkedIn)


On Thu, Nov 12, 2015 at 4:21 PM, Haag, Jason <jhaa...@gmail.com> wrote:

> Oh I forgot to include the error message in the database. <DB NULL>
>
> Query result:
>
> Column          Type          Value
> VQ_HOST         VARCHAR       xapi.vocab.pub
> VQ_TS           DATETIME      2015-10-27 23:25:38.613142
> VQ_URL          VARCHAR       /datasets/adl/verbs/
> VQ_ROOT         VARCHAR       home/dba/rdf_sink/adl/index.html
> VQ_STAT         VARCHAR       retrieved
> VQ_OTHER        VARCHAR       <DB NULL>
> VQ_ERROR        LONG VARCHAR  <DB NULL>
> VQ_LEVEL        INTEGER       0
> VQ_VIA_SITEMAP  INTEGER       0
> VQ_DT           TIMESTAMP     2015-11-12 21:08:17.809787
> VQ_ORIGIN       IRI_ID        <DB NULL>
>
>
>
> On Thu, Nov 12, 2015 at 4:16 PM, Haag, Jason <jhaa...@gmail.com> wrote:
>
>> Hi All,
>>
>> I have been trying to understand how virtuoso's crawler content import
>> and sponging features work. I'm currently evaluating virtuoso using
>> 07.20.3214 VOS.
>>
>> I set up three crawl jobs for three different HTML/RDFa files and
>> received no errors.
>>
>> When I attempt to use the sparql interface to query the data it doesn't
>> show up:
>>
>> For example, http://w3id.org/xapi/adb/verbs/ is the target URL of a
>> crawl job I set up in conductor under content imports. I am using the
>> xhtml/HTML5 variants cartridge with the following options:
>>
>> fallback-mode=no
>> rdfa=yes
>> reify_html5md=0
>> reify_rdfa=1
>> reify_jsonld=0
>> reify_all_grddl=0
>> reify_html=0
>> passthrough_mode=yes
>> loose=yes
>> reify_html_misc=no
>> reify_turtle=no
>>
>> If I go to http://54.152.125.100:8890/sparql and use the following
>> sparql query it returns no results:
>>
>> #Query all Verb IRIs
>> PREFIX xapi: <https://w3id.org/xapi/ontology#>
>>
>> SELECT DISTINCT ?Verb
>>
>> WHERE {
>>    ?Verb a xapi:Verb .
>>
>> }
>>
>>
>> However, the data does start to show up in this query if I subsequently
>> add http://w3id.org/xapi/adb/verbs/ as the default data set name / graph
>> IRI in the sparql interface and also select the sponging option to download
>> all RDF resources.
>>
>> Is this sponging option from the sparql interface actually
>> adding/downloading the triples? Wouldn't this allow anyone who has access
>> to the sparql interface to add triples? The faceted search interface seems to
>> indicate so as I did this with
>> the following graph IRI, http://adlnet.gov/expapi/verbs
>>
>>
>> http://54.152.125.100:8890/describe/?url=http%3A%2F%2Fadlnet.gov%2Fexpapi%2Fverbs&sid=4
>>
>> I tried to set up this IRI as a crawl job and it never populated
>> virtuoso's data store. But as soon as I add it as a graph IRI using the
>> sparql interface and sponging it shows up. Is this the expected behavior /
>> by design for this sparql sponging option? I thought graphs and triples
>> could only be added with special SPARQL permissions and using INSERT.
>>
>> I still don't think the crawler feature is working for HTML/RDFa. It
>> appears to be processing and storing the HTML file in the
>> repository/locally in virtuoso, but it doesn't seem to actually add the
>> graph or triples to the database.
>>
>> Thanks in advance for your patience and help!
>>
>> J Haag
>>
>> -------------------------------------------------------
>>
>>
>>
>> On Wed, Oct 28, 2015 at 5:17 AM, Tim Haynes <thay...@openlinksw.com>
>> wrote:
>>
>>>
>>> On 27 October 2015 at 20:49, Haag, Jason <jhaa...@gmail.com> wrote:
>>>
>>>> I think I know the answer to my last two questions. I had additional
>>>> html files below the /verbs/ directory. I believe that is where the
>>>> duplicates came from. I'm guessing sponger also looks for any html files at
>>>> the specified path, not just the "index.html" file that was specified as a
>>>> target URL. Can anyone verify this?
>>>
>>>
>>> Hi,
>>>
>>> It's unlikely - I don't know of anything in the Sponger that implements
>>> directory browsing, but it may well be following e.g. <link
>>> rel="alternate" href="...." /> to RSS/Atom feeds, etc.
>>>
>>> As Kingsley says, Faceted Browser will show you what graphs the triples
>>> appear in.
>>>
>>> When a page is sponged, its URL becomes 1:1 the graph IRI in which data
>>> from/about/in that resource is stored. Multiple graphs implies multiple
>>> sponging events.
>>>
>>> HTH,
>>>
>>> ~Tim
>>> --
>>> Tim Haynes
>>> Product Development Consultant
>>> OpenLink Software
>>> <http://www.openlinksw.com/>
>>> <http://twitter.com/openlink>
>>>
>>
>>
>

------------------------------------------------------------------------------

_______________________________________________
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users
