Hi All,

I'm posting these questions to the users group so that other Virtuoso
users interested in importing RDFa-based content into Virtuoso might also
benefit from the responses. Please let me know if any of these questions
should instead be submitted as issues or feature requests on GitHub. The
current documentation on importing RDFa documents is a little dated and
does not accurately match the Conductor interface for content imports. The
Conductor interface also doesn't explain what the various fields and
options mean. Some of them are not obvious to a new user like me and might
lead to bad assumptions or even cause conflicts in the system. I've been a
little confused by what some of the various options mean, and I have also
been running an older version (7.2.1). From what I have been told, many of
the HTML/RDFa cartridges have been improved since that older version of
VOS. Therefore, I would like to ask a few questions and determine
exactly what these fields will do (or not do) before I make any mistakes or
assumptions. Thank you to Hugh, Tim, and Kingsley for all of the excellent
advice so far. I truly appreciate your support and patience with all of my
questions.

I'm currently running a new build of Virtuoso Open Source, Version:
07.20.3214, develop/7 branch on GitHub, Build: Oct 7 2015, on Debian+Ubuntu.

For my use case, we will have several (potentially 50 or more) HTML5 / RDFa
1.1 (core) pages available on one external server/domain and would like to
regularly "sponge" or "crawl" these URIs (as these datasets expressed in
HTML/RDFa may be updated or even grow from time to time). They will also
become more decentralized and available on multiple external servers, so
Virtuoso seems like the perfect solution for automatically crawling all of
these external sources of RDFa controlled vocabulary datasets (and for
many of our other future RDF objectives).
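
As a quick sanity check before building a full crawl job, I have been
sponging single pages from isql using the get:soft pragma (the URL below is
just a placeholder for one of our vocabulary pages, and I am assuming this
is still the intended way to trigger the sponger ad hoc -- please correct
me if not):

    -- placeholder URL; get:soft "replace" should make the endpoint fetch
    -- and sponge the page before answering the query
    SPARQL
    DEFINE get:soft "replace"
    SELECT COUNT(*)
    FROM <http://example.org/vocab/dataset1.html>
    WHERE { ?s ?p ?o };

If that returns a non-zero count, I take it the RDFa cartridge handled the
page correctly.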

Here are my questions (perhaps some of the answers can be used for
FAQs, etc.):

*1) Does Virtuoso support crawling external domains and servers for the
Target URL if the target is HTML5/RDFa or must they be imported into DAV
first?*
*2) Am I always required to specify a local DAV collection for sponging and
crawling RDFa even if I don't want to store the RDFa/HTML locally? *
*3) If yes to #2, when I use dav (or dba) as the owner and the rdf_sink
folder to store the crawled RDFa/HTML, are there any special permissions
or configuration changes required on the rdf_sink folder? Here are the
default configuration settings for rdf_sink:*

*Main Tab:*
- Folder Name: (rdf_sink)
- Folder Type: Linked Data Import
- Owner: dav (or dba)
- Permissions: rw-rw----
- Full Text Search: Recursively
- Default Permissions: Off
- Metadata Retrieval: Recursively
- Apply changes to all subfolders and resources: unchecked
- Expiration Date: 0
- WebDAV Properties: No properties

*Sharing Tab:*
- ODS users/groups: No Security
- WebID users: No WebID Security

*Linked Data Import Tab:*
- Graph name: urn:dav:home:dav:rdf_sink
- Base URI: http://host:8890/DAV/home/dba/rdf_sink/
- Use special graph security (on/off): unchecked
- Sponger (on/off): checked
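
*Side note:* after a crawl runs, I assume the simplest way to confirm that
sponged triples actually landed in the sink graph named above is a query
like the following from isql (please correct me if the graph name is
derived differently):

    -- graph name taken from the Linked Data Import tab settings above
    SPARQL
    SELECT ?s ?p ?o
    FROM <urn:dav:home:dav:rdf_sink>
    WHERE { ?s ?p ?o }
    LIMIT 10;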

*4) When importing content using the crawler + sponger feature I navigate
to "Conductor > Web Application Server > Content Imports" and click the
"New Target" button. *

Which of the following fields should I use to specify an external
HTML5/RDFa 1.1 URL for crawling, and what does each of these fields mean?
*Note:* For the fields that are obvious (or are adequately addressed in the
VOS documentation) I have already entered values below. I would greatly
appreciate more information on the fields marked with an *asterisk* and a
question in (parentheses).

- *Target description:* This is obvious. Name of content import / crawl
job, etc.
- *Target URL:* http://domain/path/html file (*does this URL prefer an XML
sitemap for RDFa, or can it explicitly point directly to an HTML file with
RDFa? I also have content negotiation set up on the external server where
the RDFa/HTML is hosted, as it also serves JSON-LD, RDF/XML, and Turtle
serializations, but I would prefer to only regularly crawl/update based on
the HTML/RDFa data for now. I might have Virtuoso generate the alternate
serializations in the future*)
- *Login name on target:* (*if target URL is an external server, does this
need to be blank?*)
- *Login password on target:  *(*if target URL is an external server, does
this need to be blank?*)
- *Copy to Local DAV collection:* (*what does this mean? It seems to imply
that specifying a local DAV collection is required to create a crawl job,
but another option implies that you don't have to store the data. The two
options seem conflicting and confusing. From a user experience perspective,
it seems I would either want to store it or not; if I don't, then why do I
have to specify a local DAV collection?*)
- *Single page download:* (*what does this mean?*)
- *Local resources owner:* dav
- *Download only newer than:* 1900-01-01 00-00-00
- *Follow links matching (delimited with ;):* (*what does this do? what
types of "links" are examined?*)
- *Do not follow links matching (delimited with ;):* (*what does this do?
what types of "links" are examined?*)
- *Custom HTTP headers:* (*is this required for RDFa? If so, what is the
expected syntax and delimiters? "Accept: text/html"? See also the isql
check I include after this list*)
- *Number of HTTP redirects to follow: *(*I currently have a 303 redirect
in place for content negotiation, but what if this is unknown or changes in
the future? Will it break the crawler job?*)
- *XPath expression for links extraction: *(*is this applicable for
importing RDFa?*)
- *Crawling depth limit:* unlimited
- *Update Interval (minutes):* 0
- *Number of threads:* (*is this applicable for importing RDFa?*)
- *Crawl delay (sec):* 0.00
- *Store Function: *(*is this applicable for importing RDFa?*)
- *Extract Function: *(*is this applicable for importing RDFa?*)
- *Semantic Web Crawling:* (*what does this do exactly?*)
- *If Graph IRI is unassigned use this Data Source URL:* (*what is the
purpose of this? The content can't be imported if a Target is not
specified, right?*)
- *Follow URLs outside of the target host:* (*what does this do exactly?*)
- *Follow HTML meta link:* (*is this only for HTML/RDFa that specifies an
alternate serialization via the <link> element in the <head>?*)
- *Follow RDF properties (one IRI per row):* (*what does this do?*)
- *Download images:*
- *Use WebDAV methods:* (*what does this mean?*)
- *Delete if remove on remote detected: *(*what does this mean?*)
- *Store documents locally: *(*does this only apply to storing the content
in DAV?*)
- *Convert Links:* (*is this related to another option/field*?)
- *Run Sponger:* (*does this force the crawler to use only the sponger for
reading RDFa and populating the DB with the triples?*)
- *Accept RDF:* (*is this option only for slash-based URIs that return
RDF/XML via content negotiation?*)
- *Store metadata:* (*what does this mean?*)
- *Cartridges:* (*I recommend improving the usability of this. At first I
thought perhaps my cartridges were not installed because the content area
below the "Cartridges" tab was empty. I realized the cartridges only appear
when you click/toggle the "Cartridges" tab. I suggest they should all be
listed by default. Turning their visibility off by default may prevent
users from realizing they are there, especially based on the old
documentation*)
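
Regarding the Custom HTTP headers and redirect questions above: to rule out
problems on our side, I have been checking what our server actually returns
for an HTML Accept header directly from isql. I am assuming http_client()
with named parameters is the right tool here (the URL is a placeholder, and
I am not certain of the exact parameter names, so please treat this as a
sketch):

    -- placeholder URL; I am assuming http_client() accepts these named
    -- parameters (url, http_method, http_headers)
    SELECT http_client (url=>'http://example.org/vocab/dataset1',
                        http_method=>'GET',
                        http_headers=>'Accept: text/html');

If the body that comes back is the HTML/RDFa page (after the 303), I would
expect the crawler to see the same thing.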

*5) What do the following cartridge options do? I only listed the ones that
seem most applicable to running a crawler/sponger import job for an
externally hosted HTML5/RDFa URL.*

- RDF cartridge (*what types of RDF? what does this one do?*)
- RDFa cartridge (*which versions of RDFa are supported? RDFa 1.1 Core?
RDFa 1.0? RDFa 1.1 Lite?*)
- WebDAV Metadata
- xHTML
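
On that note, my understanding is that the cartridges are registered in the
DB.DBA.SYS_RDF_MAPPERS table, so I have been listing the HTML/RDFa hooks
with something like the query below (the column names are from my reading
of the 7.2 schema, so treat them as an assumption):

    -- list sponger cartridges whose description mentions HTML or RDFa;
    -- RM_ENABLED should show whether each hook is active
    SELECT RM_TYPE, RM_PATTERN, RM_ENABLED, RM_DESCRIPTION
      FROM DB.DBA.SYS_RDF_MAPPERS
     WHERE RM_DESCRIPTION LIKE '%RDFa%'
        OR RM_DESCRIPTION LIKE '%HTML%';

Is that a reasonable way to verify which cartridges will fire, or should I
rely only on the Conductor UI?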