Just touching base on this post... I suspect this one was TL;DR.

-------------------------------------------------------
Advanced Distributed Learning Initiative
+1.850.266.7100 (office)
+1.850.471.1300 (mobile)
jhaag75 (skype)
http://linkedin.com/in/jasonhaag
On Thu, Oct 8, 2015 at 10:39 AM, Haag, Jason <jason.haag....@adlnet.gov> wrote:
> Hi All,
>
> I'm posting these questions to the users group so that other Virtuoso users
> interested in importing RDFa-based content into Virtuoso might also benefit
> from the responses. Please let me know if any of these questions should be
> submitted as an issue on GitHub or as feature requests instead. The current
> documentation on importing RDFa documents is a little dated and does not
> accurately match the Conductor interface for content imports. The Conductor
> interface also doesn't explain what the various fields and options mean.
> Some of them are not obvious to a new user like me and might lead to bad
> assumptions or even cause conflicts in the system. I've been a little
> confused by what some of the various options mean, and I have also been
> running an older version (7.2.1). From what I have been told, many of the
> HTML/RDFa cartridges have been improved since that older version of VOS.
> Therefore, I would like to ask a few questions and determine exactly what
> these fields will do (or not do) before I make any mistakes or assumptions.
> Thank you to Hugh, Tim, and Kingsley for all of the excellent advice so
> far. I truly appreciate your support and patience with all of my questions.
>
> I'm currently running a new build of Virtuoso Open Source, Version:
> 07.20.3214, develop/7 branch on GitHub, Build: Oct 7 2015, on Debian+Ubuntu.
>
> For my use case, we will have several (potentially 50 or more) HTML5 /
> RDFa 1.1 (core) pages available on one external server/domain and would
> like to regularly "sponge" or "crawl" these URIs (as these datasets
> expressed in HTML/RDFa may be updated or even grow from time to time).
> They will also become more decentralized and available on multiple external
> servers, so Virtuoso seems like the perfect solution for automatically
> crawling all of these external sources of RDFa controlled-vocabulary
> datasets (as well as for many other future objectives we have for RDF).
>
> Here are my questions (perhaps some of the answers can be used for
> FAQs, etc.):
>
> *1) Does Virtuoso support crawling external domains and servers for the
> Target URL if the target is HTML5/RDFa, or must the documents be imported
> into DAV first?*
>
> *2) Am I always required to specify a local DAV collection for sponging
> and crawling RDFa even if I don't want to store the RDFa/HTML locally?*
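(For reference on questions 1 and 2: independent of the Conductor crawl-job UI, the sponger can also be invoked directly from SPARQL against an external URL, which does not require copying the page into a DAV collection first. The following is only a minimal sketch using a placeholder URL and the documented get:soft pragma; exact behaviour may vary between versions.)

    -- Run from isql (or the Conductor SQL interface). The get:soft pragma
    -- asks the sponger to fetch the remote document and load its extracted
    -- triples into the quad store before the query is evaluated.
    SPARQL
    DEFINE get:soft "replace"
    SELECT ?s ?p ?o
    FROM <http://example.org/vocab/terms.html>   # placeholder external RDFa page
    WHERE { ?s ?p ?o }
    LIMIT 10;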
>
> *3) If yes to #2, when I use dav (or dba) as the owner and the rdf_sink
> folder to store the crawled RDFa/HTML, are there any special permissions or
> configurations required on the rdf_sink folder? Here are the default
> configuration settings for rdf_sink:*
>
> *Main Tab:*
> - Folder Name: (rdf_sink)
> - Folder Type: Linked Data Import
> - Owner: dav (or dba)
> - Permissions: rw-rw----
> - Full Text Search: Recursively
> - Default Permissions: Off
> - Metadata Retrieval: Recursively
> - Apply changes to all subfolders and resources: unchecked
> - Expiration Date: 0
> - WebDAV Properties: No properties
>
> *Sharing Tab:*
> - ODS users/groups: No Security
> - WebID users: No WebID Security
>
> *Linked Data Import Tab:*
> - Graph name: urn:dav:home:dav:rdf_sink
> - Base URI: http://host:8890/DAV/home/dba/rdf_sink/
> - Use special graph security (on/off): unchecked
> - Sponger (on/off): checked
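(A side note on the Linked Data Import settings above: once a crawl has run, a quick way to check whether anything was extracted is to query the graph named on that tab. A minimal sketch using the graph name shown above; adjust it if your collection is configured with a different graph IRI.)

    -- Sanity check after an import: see whether triples landed in the
    -- graph configured on the Linked Data Import tab.
    SPARQL
    SELECT ?s ?p ?o
    FROM <urn:dav:home:dav:rdf_sink>
    WHERE { ?s ?p ?o }
    LIMIT 25;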
>
> *4) When importing content using the crawler + sponger feature, I navigate
> to "Conductor > Web Application Server > Content Imports" and click the
> "New Target" button.*
>
> Which of the following fields should I use to specify an external
> HTML5/RDFa 1.1 URL for crawling, and what does each of these fields mean?
> *Note:* For the fields that are obvious (or are adequately addressed in
> the VOS documentation) I have already entered values below. I would greatly
> appreciate more information on the fields that have an *asterisk* with a
> question in (parentheses).
>
> - *Target description:* This is obvious. Name of the content import /
>   crawl job, etc.
> - *Target URL:* http://domain/path/html file (*does this URL prefer an
>   XML sitemap for RDFa, or can it point directly to an HTML file containing
>   RDFa? I also have content negotiation set up on the external server where
>   the RDFa/HTML is hosted, as it also serves JSON-LD, RDF/XML, and Turtle
>   serializations, but I would prefer to only regularly crawl/update based
>   on the HTML/RDFa data for now. I might have Virtuoso generate the
>   alternate serializations in the future.*)
> - *Login name on target:* (*if the target URL is an external server, does
>   this need to be blank?*)
> - *Login password on target:* (*if the target URL is an external server,
>   does this need to be blank?*)
> - *Copy to local DAV collection:* (*what does this mean? It seems to imply
>   that specifying a local DAV collection is required to create a crawl job,
>   but another option implies that you don't have to store the data. The two
>   options are conflicting and confusing. From a user experience
>   perspective, it seems I would either want to store it or not. If I don't,
>   then why do I have to specify a local DAV collection?*)
> - *Single page download:* (*what does this mean?*)
> - *Local resources owner:* dav
> - *Download only newer than:* 1900-01-01 00-00-00
> - *Follow links matching (delimited with ;):* (*what does this do? What
>   types of "links" are examined?*)
> - *Do not follow links matching (delimited with ;):* (*what does this do?
>   What types of "links" are examined?*)
> - *Custom HTTP headers:* (*is this required for RDFa? If so, what is the
>   expected syntax and what are the delimiters? "Accept: text/html"?*)
> - *Number of HTTP redirects to follow:* (*I currently have a 303 redirect
>   in place for content negotiation, but what if this is unknown or changes
>   in the future? Will it break the crawler job?*)
> - *XPath expression for links extraction:* (*is this applicable for
>   importing RDFa?*)
> - *Crawling depth limit:* unlimited
> - *Update Interval (minutes):* 0
> - *Number of threads:* (*is this applicable for importing RDFa?*)
> - *Crawl delay (sec):* 0.00
> - *Store Function:* (*is this applicable for importing RDFa?*)
> - *Extract Function:* (*is this applicable for importing RDFa?*)
> - *Semantic Web Crawling:* (*what does this do exactly?*)
> - *If Graph IRI is unassigned use this Data Source URL:* (*what is the
>   purpose of this? The content can't be imported if a Target URL is not
>   specified, right?*)
> - *Follow URLs outside of the target host:* (*what does this do exactly?*)
> - *Follow HTML meta link:* (*is this only for HTML/RDFa that specifies an
>   alternate serialization via the <link> element in the <head>?*)
> - *Follow RDF properties (one IRI per row):* (*what does this do?*)
> - *Download images:*
> - *Use WebDAV methods:* (*what does this mean?*)
> - *Delete if remove on remote detected:* (*what does this mean?*)
> - *Store documents locally:* (*does this only apply to storing the content
>   in DAV?*)
> - *Convert Links:* (*is this related to another option/field?*)
> - *Run Sponger:* (*does this force the import to only use the sponger for
>   reading RDFa and populating the DB with the triples?*)
> - *Accept RDF:* (*is this option only for slash-based URIs that return
>   RDF/XML via content negotiation?*)
> - *Store metadata:* (*what does this mean?*)
> - *Cartridges:* (*I recommend improving the usability of this. At first I
>   thought perhaps my cartridges were not installed, because the content
>   area below the "Cartridges" tab was empty. I then realized the cartridges
>   only appear when you click/toggle the "Cartridges" tab. I suggest they
>   should all be listed by default. Turning their visibility off by default
>   may prevent users from realizing they are there, especially given the old
>   documentation.*)
>
> *5) What do the following cartridge options do? I only listed the ones
> that seem most applicable to running a crawler/sponger import job for an
> externally hosted HTML5/RDFa URL.*
>
> - RDF cartridge (*what types of RDF? What does this one do?*)
> - RDFa cartridge (*which versions of RDFa are supported? RDFa 1.1 Core?
>   RDFa 1.0? RDFa 1.1 Lite?*)
> - WebDAV Metadata
> - xHTML
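(On question 5: one way to see which cartridges an instance actually has, independent of the Conductor "Cartridges" tab, is to look at the sponger's cartridge registry. This is only a sketch, assuming the registry table is DB.DBA.SYS_RDF_MAPPERS as described in the sponger documentation; the column layout may differ between versions.)

    -- List the sponger cartridges registered on this instance; the RDFa and
    -- xHTML handlers should show up here if they are installed and enabled.
    SELECT * FROM DB.DBA.SYS_RDF_MAPPERS;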
------------------------------------------------------------------------------
_______________________________________________
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users