On 1/14/16 11:17 AM, Haag, Jason wrote:
> Hi All,
>
> I'm back again evaluating Virtuoso for the HTML5/RDFa crawling
> capability. We are considering moving to the Universal Server from VOS
> if I can ever prove to my team that it will be a good choice for
> sponging and crawling HTML5/RDFa files.
It most certainly is.

> I have been testing this feature periodically over the past several
> months with no luck. I appreciate the support and feedback so far, but
> I haven't made any progress. Some previous posts/inquiries I made on
> this topic are available here:
> http://sourceforge.net/p/virtuoso/mailman/message/34507072/
> and here:
> http://sourceforge.net/p/virtuoso/mailman/virtuoso-users/thread/CAHjqjnLo7-hiA30neYBsbGm93HeXe%3DHrda5rZPGS%3Dwm%2B08ZvBw%40mail.gmail.com/#msg34525370
>
> I would really like to use the conductor interface to regularly
> schedule the import of several graph IRIs that contain RDFa and check
> the triples for any additions on a daily basis. I recently upgraded the
> installation to VOS 7.2.3 and still can't seem to get the RDFa data to
> populate the data store.

Why don't you approach this matter as follows:

[1] Use the live instance at http://linkeddata.uriburner.com to import your target data sources
[2] Compare that with what's happening on your local instance.

> After I run the import from the queue, any time I query the virtuoso
> database there is no data from my RDFa datasets that I have imported
> through conductor. I must be doing something wrong or missing an
> important step somewhere. However, if I load these same exact RDFa IRIs
> using the isql-v function (DB.DBA.RDF_LOAD_RDFA) the triples load
> successfully.

Yes, so there is something amiss in your setup. Your import/crawl jobs should include directives for invoking the sponger cartridge for HTML docs.
>
> Here's a summary of what I've done and discovered so far:
>
> 1) Installed VOS 7.2.3 successfully
> 2) Read some of the newly updated documentation, which is excellent by
> the way
> 3) Checked/updated sponger privileges per this guidance for securing
> the endpoint:
> http://docs.openlinksw.com/virtuoso/rdfsparql.html#rdfsupportedprotocolendpointuri
> 4) Installed cartridges_dav.vad from commercial version (for sponger
> cartridges):
> http://opldownload.s3.amazonaws.com/uda/vad-packages/7.2/cartridges_dav.vad
> 5) Checked and configured xHTML / aka HTML5 (and variants) cartridge
> under “extractor cartridges” with the following settings (per advice
> from the mailing list/forums):
>
> Pattern: (application/xhtml.xml)|(text|application)/.*(html|xml)
> fallback-mode=no
> rdfa=yes
> reify_html5md=1
> reify_rdfa=0

For now (while you are troubleshooting), also use: reify_rdfa=1

> reify_jsonld=0
> reify_all_grddl=0
> passthrough_mode=yes
> loose=yes
> reify_html=0
> reify_html_misc=0
> reify_turtle=no
>
> I also tried this basic configuration as well:
> add-html-meta=no
> get-feeds=no
> rdfa=yes
> fallback-mode=no
> reify_html=no
> reify_html_misc=no
> reify_html5md=no
> reify_rdfa=no
> reify_jsonld=no
> reify_turtle=no
> reify_all_grddl=no
> passthrough_mode=no
> loose=no
>
> 6) Created a content import for the HTML5/RDFa document using
> conductor with the following options:
>
> Target URL: http://xapi.vocab.pub/datasets/adl/verbs
> login owner: dba
> checked the following
>
> * store documents locally
> * run sponger
> * store metadata (selected xHTML aka HTML5 and variants)
>

Go to https://www.pinterest.com/pin/389561436498376210/ -- this shows your content via the lenses of our OSDS browser extension <http://osds.openlinksw.com/>.
>
> 7) Ran the import; 0/1 pages/sites were retrieved. I looked up
> the error, which was: "XM003: XML parser detected an error: ERROR : Tag
> nesting error: name 'head' of end tag does not match the name 'link'
> of start tag at line 19 column 108 at line 20 column 9 of source text
> </head> -------^"
> 8) This appears to be a validation error looking for closing tags of
> the <meta> and <link> elements. It appears the content import isn't
> checking my doctype declaration. HTML5 doesn't need to close the
> <meta> or <link> elements whereas xhtml does.
> 9) Updated the HTML5 to close the meta and link tags to work around
> this to see if the error would go away. It did!
> 10) Created a new import targeting the updated HTML5 with closing tags.
> This time, no errors and 1 site was retrieved successfully
> (http://xapi.vocab.pub/datasets/adl/verbs)
> 11) Checked to see if the named graph and triples populated the
> database. Nothing there.
> SPARQL SELECT DISTINCT ?g WHERE {GRAPH ?g { ?s ?p ?o . }}
>
> Here are some strange things I noticed that could be causing issues.
> Not sure if anyone can explain what's happening here.
>
> * Even though the content type is text/html and explicitly defined
>   as such in the HTML metatag, the file is being stored in webdav as
>   the "application/xhtml+xml" content type
>

That's fine. Virtuoso is doing that.

> * Even though I assigned dba as content owner, it is assigning dav
>   as content owner
>

Yes, since 'dav' is the super-user in the Web Content Storage aspect of Virtuoso.

> * After the import queue is run, two files are created and stored in
>   DAV/home/dba/rdf_sink even though I select the option to store a
>   single file: (verbs and urn_dav_home_dba_rdf_sink.RDF). If I
>   access the verbs file in webdav it renders the html that was
>   imported. If I click the urn_dav_home_dba_rdf_sink.RDF it is not
>   available. Note: the verbs file is being stored
>   as "application/xhtml+xml" content type and
>   the urn_dav_home_dba_rdf_sink.RDF is being stored as text/xml in
>   webdav.
>

These folders shouldn't have anything to do with your import job, certainly not at this stage.

>
> After all of this I decided to check and see if I could load the
> HTML5/RDFa document using isql-v:
>
> SQL> DB.DBA.RDF_LOAD_RDFA
> (http_get('http://xapi.vocab.pub/datasets/adl/verbs/'),
> 'http://xapi.vocab.pub/datasets/adl/verbs/#',
> 'http://xapi.vocab.pub/datasets/adl/verbs');
>
> This worked and the graph and triples are in the database. However,
> for team collaboration it would be helpful for others to see the
> stored imports and crawler jobs in the conductor interface rather than
> strictly relying on isql-v to populate the data store.

Yes. You can even make a DET folder type that's mapped to a target named graph with the option to invoke the HTML sponger cartridge. Net effect: a so-called Data Lake of documents (variety of content formats) imported from the Web (or any internal HTTP network) and also passed through the sponger, which deposits its output into the designated named graph. You can access these files via your browser or any WebDAV client (i.e., mount to any native OS via its WebDAV support). You can share URIs via copy & paste or the "share feature" that exists in your browser, etc.

> Am I doing something wrong in trying to get HTML5/RDFa content to
> import using conductor? I feel like I might be missing an important
> step that is preventing it from working. Thanks in advance.

Somewhere something has gone wrong or this is an undiscovered VOS edition quirk. We are leaning towards removing these features from VOS as it's best suited as a dedicated store for data represented as SQL Tables or RDF Property Graphs. Thus, you are really going to be much better off using the commercial edition.

Kingsley

> Regards,
>
> J Haag
>
> ------------------------------------------------------------------------------
> Site24x7 APM Insight: Get Deep Visibility into Application Performance
> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> Monitor end-to-end web transactions and take corrective actions now
> Troubleshoot faster and improve end-user experience. Signup Now!
> http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
>
> _______________________________________________
> Virtuoso-users mailing list
> Virtuoso-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/virtuoso-users

--
Regards,

Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
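
Two points from the quoted message (the cartridge Pattern in step 5 and the XM003 tag nesting error in steps 7-9) can be checked offline before a crawl job is scheduled. The following is only a rough sketch, not how Virtuoso itself works: it assumes Python's `re` module treats the pattern closely enough to Virtuoso's regex engine, and it uses Python's bundled XML parser as a stand-in for the XML parser named in the error. The function names and the two sample documents are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Pattern copied from step 5 of the quoted message. Assumption: Python's `re`
# behaves closely enough to Virtuoso's regex engine for a rough check.
CARTRIDGE_PATTERN = r"(application/xhtml.xml)|(text|application)/.*(html|xml)"


def pattern_matches(content_type):
    """Return True if the cartridge pattern matches the content type."""
    return re.search(CARTRIDGE_PATTERN, content_type) is not None


def xml_wellformedness_error(markup):
    """Return None if the markup parses as XML, else the parser's message."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Hypothetical sample: HTML5 allows <meta> and <link> without closing tags.
HTML5_STYLE = (
    "<html><head><meta charset='utf-8'>"
    "<link rel='stylesheet' href='style.css'></head><body></body></html>"
)
# Hypothetical sample: the step 9 workaround, closing both elements XML-style.
XHTML_STYLE = (
    "<html><head><meta charset='utf-8'/>"
    "<link rel='stylesheet' href='style.css'/></head><body></body></html>"
)

if __name__ == "__main__":
    for ctype in ("text/html", "application/xhtml+xml"):
        print(ctype, "matches pattern:", pattern_matches(ctype))
    print("HTML5-style error:", xml_wellformedness_error(HTML5_STYLE))
    print("XHTML-style error:", xml_wellformedness_error(XHTML_STYLE))
```

If the second check reports an error for a page, an import that parses the document as XML will likely fail in the way step 7 describes, and closing the void elements (step 9) is one workaround.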