Just touching base on this post... I suspect this one was TL;DR.

-------------------------------------------------------
Advanced Distributed Learning Initiative
+1.850.266.7100(office)
+1.850.471.1300 (mobile)
jhaag75 (skype)
http://linkedin.com/in/jasonhaag

On Thu, Oct 8, 2015 at 10:39 AM, Haag, Jason <jason.haag....@adlnet.gov>
wrote:

> Hi All,
>
> I'm posting these questions to the users group so that other Virtuoso
> users interested in importing RDFa-based content into Virtuoso might also
> benefit from the responses. Please let me know if any of these questions
> should instead be submitted as a GitHub issue or feature request. The
> current documentation on importing RDFa documents is a little dated and
> does not accurately match the Conductor interface for content imports. The
> Conductor interface also doesn't explain what the various fields and
> options mean. Some of them are not obvious to a new user like myself and
> might lead to bad assumptions or even cause conflicts in the system. I've
> been a little confused by what some of the various options mean, and I
> have also been running an older version (7.2.1). From what I have been
> told, many of the HTML/RDFa cartridges have been improved since that older
> version of VOS. Therefore, I would like to ask a few questions and
> determine exactly what these fields will do (or not do) before I make any
> mistakes or assumptions. Thank you to Hugh, Tim, and Kingsley for all of
> the excellent advice so far. I truly appreciate your support and patience
> with all of my questions.
>
> I'm currently running a new build of Virtuoso Open Source, Version:
> 07.20.3214, develop/7 branch on GitHub, Build: Oct 7 2015, on Debian + Ubuntu.
>
> For my use case, we will have several (potentially 50 or more) HTML5 /
> RDFa 1.1 (Core) pages available on one external server/domain and would
> like to regularly "sponge" or "crawl" these URIs (as these datasets
> expressed in HTML/RDFa may be updated or even grow from time to time).
> They will also become more decentralized and available on multiple
> external servers, so Virtuoso seems like the perfect solution for
> automatically crawling all of these external sources of RDFa
> controlled-vocabulary datasets (and a good fit for many other future
> objectives we have for RDF).
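>
> For reference, a single RDFa page can be sponged on demand from iSQL
> using Virtuoso's get:soft pragma, which is a quick way to confirm the
> sponger and RDFa cartridge fire for one URL before setting up a recurring
> crawl job (a minimal sketch; the URL below is illustrative, not one of
> our real hosts):
>
> ```
> SPARQL
> DEFINE get:soft "replace"
> SELECT ?s ?p ?o
> FROM <http://example.org/vocab/terms.html>
> WHERE { ?s ?p ?o }
> LIMIT 10;
> ```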
>
> Here are my questions (perhaps some of the answers can be used for
> FAQs, etc.):
>
> *1) Does Virtuoso support crawling external domains and servers for the
> Target URL if the target is HTML5/RDFa or must they be imported into DAV
> first?*
> *2) Am I always required to specify a local DAV collection for sponging
> and crawling RDFa even if I don't want to store the RDFa/HTML locally? *
> *3) If yes to #2, when I use dav(or dba) as the owner and the rdf_sink
> folder to store the crawled RDFa/HTML, are there any special permissions or
> configurations required to be made on the rdf_sink folder? Here are the
> default configuration settings for rdf_sink:*
>
> *Main Tab:*
> - Folder Name: (rdf_sink)
> - Folder Type: Linked Data Import
> - Owner: dav (or dba)
> - Permissions: rw-rw----
> - Full Text Search: Recursively
> - Default Permissions: Off
> - Metadata Retrieval: Recursively
> - Apply changes to all subfolders and resources: unchecked
> - Expiration Date: 0
> - WebDAV Properties: No properties
>
> *Sharing Tab:*
> - ODS users/groups: No Security
> - WebID users: No WebID Security
>
> *Linked Data Import Tab:*
> - Graph name: urn:dav:home:dav:rdf_sink
> - Base URI: http://host:8890/DAV/home/dba/rdf_sink/
> - Use special graph security (on/off): unchecked
> - Sponger (on/off): checked
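>
> A quick way to check whether a crawl actually landed triples in the
> target graph is a count query against the graph name shown in the Linked
> Data Import tab above (a sketch, assuming the default graph naming):
>
> ```
> SPARQL
> SELECT COUNT(*)
> FROM <urn:dav:home:dav:rdf_sink>
> WHERE { ?s ?p ?o };
> ```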
>
> *4) When importing content using the crawler + sponger feature I navigate
> to "Conductor > Web Application Server > Content Imports" and click the
> "New Target" button. *
>
> Which of the following fields should I use to specify an external
> HTML5/RDFa 1.1 URL for crawling, and what does each of these fields mean?
> *Note:* For the fields that are obvious (or are adequately addressed in
> the VOS documentation) I have already entered values below. I would
> greatly appreciate more information about those fields that have an
> *asterisk* with a question in (parentheses).
>
> - *Target description:* This is obvious. Name of content import / crawl
> job, etc.
> - *Target URL:* http://domain/path/html file (*does this URL prefer an
> XML sitemap for RDFa, or can it explicitly point directly to an HTML file
> for RDFa? I also have content negotiation set up on the external server
> where the RDFa/HTML is hosted, as it also serves JSON-LD, RDF/XML, and
> Turtle serializations, but I would prefer to only regularly crawl/update
> based on the HTML/RDFa data for now. I might have Virtuoso generate the
> alternate serializations in the future.*)
> - *Login name on target:* (*if target URL is an external server, does
> this need to be blank?*)
> - *Login password on target:  *(*if target URL is an external server,
> does this need to be blank?*)
> - *Copy to Local DAV collection:* (*what does this mean? It seems to
> imply that specifying a local DAV collection is required to create a
> crawl job, but another option implies that you don't have to store the
> data. The two options seem to conflict and are confusing. From a user
> experience perspective, it seems I would either want to store it or not;
> if I don't, then why do I have to specify a local DAV collection?*)
> - *Single page download:* (*what does this mean?*)
> - *Local resources owner:* dav
> - *Download only newer than:* 1900-01-01 00-00-00
> - *Follow links matching (delimited with ;):* (*what does this do? What
> types of "links" are examined?*)
> - *Do not follow links matching (delimited with ;):* (*what does this
> do? What types of "links" are examined?*)
> - *Custom HTTP headers:* (*is this required for RDFa? If so, what is the
> expected syntax and delimiters? "Accept: text/html"?*)
> - *Number of HTTP redirects to follow: *(*I currently have a 303 redirect
> in place for content negotiation, but what if this is unknown or changes in
> the future? Will it break the crawler job?*)
> - *XPath expression for links extraction: *(*is this applicable for
> importing RDFa?*)
> - *Crawling depth limit:* unlimited
> - *Update Interval (minutes):* 0
> - *Number of threads:* (*is this applicable for importing RDFa?*)
> - *Crawl delay (sec):* 0.00
> - *Store Function: *(*is this applicable for importing RDFa?*)
> - *Extract Function: *(*is this applicable for importing RDFa?*)
> - *Semantic Web Crawling:* (*what does this do exactly?*)
> - *If Graph IRI is unassigned use this Data Source URL:* (*what is the
> purpose of this? The content can't be imported if a Target is not
> specified, right?*)
> -* Follow URLs outside of the target host:*  (*what does this do
> exactly?*)
> - *Follow HTML meta link:* (*is this only for HTML/RDFa that specifies an
> alternate serialization via the <link> element in the <head>?*)
> - *Follow RDF properties (one IRI per row):* (*what does this do?*)
> - *Download images:*
> - *Use WebDAV methods:* (*what does this mean?*)
> - *Delete if remove on remote detected: *(*what does this mean?*)
> - *Store documents locally: *(*does this only apply to storing the
> content in DAV?*)
> - *Convert Links:* (*is this related to another option/field*?)
> - *Run Sponger: *(*does this force to only use the sponger for reading
> RDFa and populate the DB with the triples?*)
> - *Accept RDF:* (*is this option only for slash-based URIs that return
> XML/RDF via content negotiation?*)
> - *Store metadata:* (*what does this mean?*)
> - *Cartridges: *(* I recommend improving the usability on this. At first
> I thought perhaps my cartridges were not installed because the content area
> below the "Cartridges" tab was empty. I realized the cartridges only appear
> when you click/toggle the "Cartridges" tab. I suggest they should all be
> listed by default. Turning their visibility off by default may prevent
> users from realizing they are there, especially based on the old
> documentation*)
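>
> To make the "Follow HTML meta link" question concrete, here is a minimal
> (hypothetical) HTML5/RDFa 1.1 page of the kind we host, with an alternate
> Turtle serialization advertised via a <link> element in the <head> (URLs
> and terms are illustrative):
>
> ```
> <!DOCTYPE html>
> <html lang="en">
> <head>
>   <!-- alternate serialization that "Follow HTML meta link" might pick up -->
>   <link rel="alternate" type="text/turtle"
>         href="http://example.org/vocab/terms.ttl">
> </head>
> <body prefix="skos: http://www.w3.org/2004/02/skos/core#">
>   <div typeof="skos:Concept" about="http://example.org/vocab/terms#term1">
>     <span property="skos:prefLabel">Example term</span>
>   </div>
> </body>
> </html>
> ```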
>
> *5) What do the following cartridge options do? I only listed the ones
> that seem most applicable to running a crawler/sponger import job for an
> externally hosted HTML5/RDFa URL.*
>
> - RDF cartridge (*what types of RDF? What does this one do?*)
> - RDFa cartridge (*which versions of RDFa are supported? RDFa 1.1 Core?
> RDFa 1.0? RDFa 1.1 Lite?*)
> - WebDAV Metadata
> - xHTML
>
>
------------------------------------------------------------------------------
_______________________________________________
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users
