Hi All,

I'm posting these questions to the users group so that other Virtuoso users interested in importing RDFa-based content into Virtuoso might also benefit from the responses. Please let me know if any of these questions would be better submitted as a GitHub issue or as a feature request instead.

The current documentation on importing RDFa documents is a little dated and does not accurately match the Conductor interface for content imports. The Conductor interface also doesn't explain what the various fields and options mean. Some of them are not obvious to a new user like myself and might lead to bad assumptions or even cause conflicts in the system. I've been confused by what some of the options mean, and until recently I was running an older version (7.2.1); from what I've been told, many of the HTML/RDFa cartridges have been improved since that older version of VOS. I would therefore like to pin down exactly what these fields do (or don't do) before I make any mistakes or assumptions.

Thank you to Hugh, Tim, and Kingsley for all of the excellent advice so far. I truly appreciate your support and patience with all of my questions.
I'm currently running a new build of Virtuoso Open Source, Version 07.20.3214 (develop/7 branch on GitHub), Build: Oct 7 2015, on Debian/Ubuntu.

For my use case, we will have several (potentially 50 or more) HTML5 / RDFa 1.1 (Core) pages available on one external server/domain, and we would like to regularly "sponge" or "crawl" these URIs, since the datasets expressed in this HTML/RDFa may be updated or even grow over time. They will also become more decentralized and available on multiple external servers, so Virtuoso seems like the perfect solution for automatically crawling all of these external sources of RDFa controlled-vocabulary datasets (and for many of our other future RDF objectives as well).

Here are my questions (perhaps some of the answers could be used for FAQs, etc.):

*1) Does Virtuoso support crawling external domains and servers for the Target URL if the target is HTML5/RDFa, or must the documents be imported into DAV first?*

*2) Am I always required to specify a local DAV collection for sponging and crawling RDFa, even if I don't want to store the RDFa/HTML locally? (For reference, I've included the isql approach I've been testing after question 3 below.)*

*3) If yes to #2: when I use dav (or dba) as the owner and the rdf_sink folder to store the crawled RDFa/HTML, are any special permissions or configurations required on the rdf_sink folder? Here are the default configuration settings for rdf_sink:*

*Main Tab:*
- Folder Name: (rdf_sink)
- Folder Type: Linked Data Import
- Owner: dav (or dba)
- Permissions: rw-rw----
- Full Text Search: Recursively
- Default Permissions: Off
- Metadata Retrieval: Recursively
- Apply changes to all subfolders and resources: unchecked
- Expiration Date: 0
- WebDAV Properties: No properties

*Sharing Tab:*
- ODS users/groups: No Security
- WebID users: No WebID Security

*Linked Data Import Tab:*
- Graph name: urn:dav:home:dav:rdf_sink
- Base URI: http://host:8890/DAV/home/dba/rdf_sink/
- Use special graph security (on/off): unchecked
- Sponger (on/off): checked
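Regarding questions 1 and 2: here is the direct-sponging test I've been running from isql. It's a minimal sketch; the URL is just a placeholder for one of our RDFa pages, and I'm assuming the get:soft pragma behaves in 7.2 as documented, i.e. the sponger dereferences the IRI and loads the extracted triples into a graph named after the source URL, with no DAV collection involved:

    -- force a (re)fetch of the page through the sponger and count the triples
    SPARQL
      DEFINE get:soft "replace"
      SELECT COUNT(*)
      FROM <http://example.org/vocabs/dataset.html>
      WHERE { ?s ?p ?o };

Since this works without specifying any DAV collection, I'm unsure why the Content Imports UI appears to require one.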
*4) When importing content using the crawler + sponger feature, I navigate to "Conductor > Web Application Server > Content Imports" and click the "New Target" button. Which of the following fields should I use to specify an external HTML5/RDFa 1.1 URL for crawling, and what does each of these fields mean?*

*Note:* For the fields that are obvious (or adequately addressed in the VOS documentation) I have already entered values below. I would greatly appreciate more information on the fields that have an *asterisk* with a question in (parentheses).

- *Target description:* This is obvious: name of the content import / crawl job, etc.
- *Target URL:* http://domain/path/html file (*does this URL expect an XML sitemap for RDFa, or can it point directly to an HTML file containing RDFa? I also have content negotiation set up on the external server where the RDFa/HTML is hosted, as it also serves JSON-LD, RDF/XML, and Turtle serializations, but I would prefer to crawl/update based only on the HTML/RDFa data for now. I might have Virtuoso generate the alternate serializations in the future.*)
- *Login name on target:* (*if the target URL is an external server, should this be blank?*)
- *Login password on target:* (*if the target URL is an external server, should this be blank?*)
- *Copy to local DAV collection:* (*what does this mean? It seems to imply that specifying a local DAV collection is required to create a crawl job, but another option implies that you don't have to store the data. The two options are conflicting and confusing. From a user-experience perspective, it seems I would either want to store the content or not; if I don't, why do I have to specify a local DAV collection?*)
- *Single page download:* (*what does this mean?*)
- *Local resources owner:* dav
- *Download only newer than:* 1900-01-01 00:00:00
- *Follow links matching (delimited with ;):* (*what does this do? What types of "links" are examined?*)
- *Do not follow links matching (delimited with ;):* (*what does this do? What types of "links" are examined?*)
- *Custom HTTP headers:* (*is this required for RDFa? If so, what are the expected syntax and delimiters: "Accept: text/html"?*)
- *Number of HTTP redirects to follow:* (*I currently have a 303 redirect in place for content negotiation, but what if this is unknown or changes in the future? Will it break the crawler job?*)
- *XPath expression for links extraction:* (*is this applicable to importing RDFa?*)
- *Crawling depth limit:* unlimited
- *Update Interval (minutes):* 0
- *Number of threads:* (*is this applicable to importing RDFa?*)
- *Crawl delay (sec):* 0.00
- *Store Function:* (*is this applicable to importing RDFa?*)
- *Extract Function:* (*is this applicable to importing RDFa?*)
- *Semantic Web Crawling:* (*what does this do, exactly?*)
- *If Graph IRI is unassigned use this Data Source URL:* (*what is the purpose of this? The content can't be imported if a Target is not specified, right?*)
- *Follow URLs outside of the target host:* (*what does this do, exactly?*)
- *Follow HTML meta link:* (*is this only for HTML/RDFa that specifies an alternate serialization via the <link> element in the <head>?*)
- *Follow RDF properties (one IRI per row):* (*what does this do?*)
- *Download images:*
- *Use WebDAV methods:* (*what does this mean?*)
- *Delete if remove on remote detected:* (*what does this mean?*)
- *Store documents locally:* (*does this only apply to storing the content in DAV?*)
- *Convert Links:* (*is this related to another option/field?*)
- *Run Sponger:* (*does this force the sponger to be the only mechanism reading the RDFa and populating the DB with the triples?*)
- *Accept RDF:* (*is this option only for slash-based URIs that return RDF/XML via content negotiation?*)
- *Store metadata:* (*what does this mean?*)
- *Cartridges:* (*I recommend improving the usability here. At first I thought my cartridges were not installed, because the content area below the "Cartridges" tab was empty; I then realized the cartridges only appear when you click/toggle the "Cartridges" tab. I suggest they all be listed by default: hiding them by default may prevent users from realizing they are there, especially given the old documentation. As a workaround, I list them from isql; see the query after this list.*)

*5) What do the following cartridge options do? I only listed the ones that seem most applicable to running a crawler/sponger import job for an externally hosted HTML5/RDFa URL.*

- RDF cartridge (*what types of RDF? What does this one do?*)
- RDFa cartridge (*which versions of RDFa are supported? RDFa 1.1 Core? RDFa 1.0? RDFa 1.1 Lite?*)
- WebDAV Metadata
- xHTML
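As mentioned under the Cartridges item above, the workaround I've been using is to list the cartridges from isql rather than the Conductor tab. A sketch, assuming the sponger cartridge registry is DB.DBA.SYS_RDF_MAPPERS with the column names I see in my build (they may differ in other versions):

    -- list installed sponger cartridges and whether each is enabled
    SELECT RM_TYPE, RM_PATTERN, RM_DESCRIPTION, RM_ENABLED
      FROM DB.DBA.SYS_RDF_MAPPERS
     ORDER BY RM_ID;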
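Finally, once a crawl job has run, this is the spot-check I plan to use to confirm that the RDFa triples actually arrived, using the default rdf_sink graph name from question 3 (swap in whatever graph the job actually targets):

    -- spot-check triples imported into the rdf_sink graph
    SPARQL
      SELECT ?s ?p ?o
      FROM <urn:dav:home:dav:rdf_sink>
      WHERE { ?s ?p ?o }
      LIMIT 10;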