Hi All,

I"m back again evaluating Virtuoso for the HTML5/RDFa crawling capability.
We are considering moving to the Universal Server from VOS if I can ever
prove to my team that it will be a good choice for sponging and crawling
HTML5/RDFa files. I have been testing this feature periodically over the
past several months with no luck. I appreciate the support and feedback so
far, but I haven't made any progress. Some previous posts/inquiries I made
on this topic are available here:
http://sourceforge.net/p/virtuoso/mailman/message/34507072/ and here:
http://sourceforge.net/p/virtuoso/mailman/virtuoso-users/thread/CAHjqjnLo7-hiA30neYBsbGm93HeXe%3DHrda5rZPGS%3Dwm%2B08ZvBw%40mail.gmail.com/#msg34525370

I would really like to use the conductor interface to regularly schedule
the import of several graph IRIs that contain RDFa and check the triples
for any additions on a daily basis. I recently upgraded the installation
to VOS 7.2.3 and still can't seem to get the RDFa data to populate the
data store. After I run the import from the queue, whenever I query the
Virtuoso database there is no data from the RDFa datasets that I imported
through conductor. I must be doing something wrong or missing an important
step somewhere. However, if I load these same exact RDFa IRIs via isql-v
using DB.DBA.RDF_LOAD_RDFA, the triples load successfully.
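
In case it helps narrow things down, another test I can run from isql-v,
which as I understand it exercises the same extractor cartridge chain the
crawler uses (a minimal sketch, assuming the get:soft sponger pragma
behaves in 7.2.3 as documented):

-- sponge the IRI through the cartridges, then count what landed in the
-- graph named after the source URL ("replace" forces a fresh fetch)
SPARQL define get:soft "replace"
SELECT COUNT(*) FROM <http://xapi.vocab.pub/datasets/adl/verbs>
WHERE { ?s ?p ?o };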

Here's a summary of what I've done and discovered so far:

1) Installed VOS 7.2.3 successfully
2) Read some of the newly updated documentation, which is excellent by the
way
3) Checked/updated sponger privileges per this guidance for securing the
endpoint:
http://docs.openlinksw.com/virtuoso/rdfsparql.html#rdfsupportedprotocolendpointuri
4) Installed cartridges_dav.vad from the commercial version (for the
sponger cartridges):
http://opldownload.s3.amazonaws.com/uda/vad-packages/7.2/cartridges_dav.vad
5) Checked and configured the xHTML / HTML5 (and variants) cartridge under
“extractor cartridges” with the following settings (per advice from the
mailing list/forums):

Pattern: (application/xhtml.xml)|(text|application)/.*(html|xml)
fallback-mode=no
rdfa=yes
reify_html5md=1
reify_rdfa=0
reify_jsonld=0
reify_all_grddl=0
passthrough_mode=yes
loose=yes
reify_html=0
reify_html_misc=0
reify_turtle=no


I also tried this basic configuration:
add-html-meta=no
get-feeds=no
rdfa=yes
fallback-mode=no
reify_html=no
reify_html_misc=no
reify_html5md=no
reify_rdfa=no
reify_jsonld=no
reify_turtle=no
reify_all_grddl=no
passthrough_mode=no
loose=no

6) Created a content import for the HTML5/RDFa document using conductor
with the following options:

Target URL: http://xapi.vocab.pub/datasets/adl/verbs
login owner: dba
checked the following options:

   - store documents locally
   - run sponger
   - store metadata (selected xHTML aka HTML5 and variants)


7) Ran the import; 0/1 pages/sites were retrieved. I looked up the error,
which was: "XM003: XML parser detected an error: ERROR : Tag nesting
error: name 'head' of end tag does not match the name 'link' of start tag
at line 19 column 108 at line 20 column 9 of source text </head> -------^"
8) This appears to be a validation error: the parser expects closing tags
for the <meta> and <link> elements. It seems the content import isn't
honoring my doctype declaration and is parsing the document as XHTML.
HTML5 doesn't require the <meta> or <link> elements to be closed, whereas
XHTML does.
9) Updated the HTML5 to close the <meta> and <link> tags (e.g. writing
<link ...> as <link ... />) to see if the error would go away. It did!
10) Created a new import targeting the updated HTML5 with closing tags.
This time there were no errors and 1 site was retrieved successfully
(http://xapi.vocab.pub/datasets/adl/verbs).
11) Checked to see whether the named graph and triples had populated the
database. Nothing there:
SPARQL SELECT DISTINCT ?g WHERE {GRAPH ?g { ?s ?p ?o . }}
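
Since I'm not sure which stage is failing, these are the checks I run from
isql-v after a crawl (a sketch; I'm assuming the crawler queue table is
WS.WS.VFS_QUEUE and the cartridge registry is DB.DBA.SYS_RDF_MAPPERS, as
in VOS 7.x):

-- did the crawler itself mark the page as retrieved?
SELECT VQ_URL, VQ_STAT FROM WS.WS.VFS_QUEUE;

-- is the HTML cartridge enabled, and did my option edits stick?
SELECT RM_PATTERN, RM_ENABLED, RM_OPTIONS
  FROM DB.DBA.SYS_RDF_MAPPERS
 WHERE RM_PATTERN LIKE '%html%';

-- did any triples land in the graph named after the source URL?
SPARQL SELECT COUNT(*) FROM <http://xapi.vocab.pub/datasets/adl/verbs>
WHERE { ?s ?p ?o };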

Here are some strange things I noticed that could be causing issues; I'm
not sure if anyone can explain what's happening here (the query I use to
inspect the stored files follows this list):

   - Even though the content type is text/html and explicitly declared as
   such in the HTML meta tag, the file is being stored in webdav with the
   "application/xhtml+xml" content type
   - Even though I assigned dba as the content owner, it is assigning dav
   as the content owner
   - After the import queue is run, two files are created and stored in
   DAV/home/dba/rdf_sink even though I selected the option to store a single
   file: verbs and urn_dav_home_dba_rdf_sink.RDF. If I access the verbs file
   in webdav, it renders the HTML that was imported. If I click
   urn_dav_home_dba_rdf_sink.RDF, it is not available. Note: the verbs file
   is stored with the "application/xhtml+xml" content type, and
   urn_dav_home_dba_rdf_sink.RDF is stored as text/xml in webdav.
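
For reference, I inspect the type and ownership of those stored resources
with the query below (a sketch, assuming the DAV metadata table is
WS.WS.SYS_DAV_RES with RES_TYPE/RES_OWNER columns, as in VOS 7.x):

-- RES_TYPE is the stored MIME type; RES_OWNER is the numeric id of the
-- owning user
SELECT RES_FULL_PATH, RES_TYPE, RES_OWNER
  FROM WS.WS.SYS_DAV_RES
 WHERE RES_FULL_PATH LIKE '/DAV/home/dba/rdf_sink/%';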


After all of this, I decided to check whether I could load the HTML5/RDFa
document using isql-v:

SQL> DB.DBA.RDF_LOAD_RDFA (http_get ('http://xapi.vocab.pub/datasets/adl/verbs/'),
         'http://xapi.vocab.pub/datasets/adl/verbs/#',
         'http://xapi.vocab.pub/datasets/adl/verbs');

This worked, and the graph and triples are now in the database. However, for
team collaboration it would be helpful for others to see the stored imports
and crawler jobs in the conductor interface rather than strictly relying on
isql-v to populate the data store. Am I doing something wrong in trying to
get HTML5/RDFa content to import using conductor? I feel like I might be
missing an important step that is preventing it from working. Thanks in
advance.

Regards,

J Haag
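
P.S. Once the conductor import works, my plan for the daily refresh is a
scheduled event along the following lines (a rough sketch, assuming
DB.DBA.SYS_SCHEDULED_EVENT with a minute-based SE_INTERVAL behaves as
documented; the event name 'daily_rdfa_refresh' is just a placeholder):

-- SE_INTERVAL is in minutes, so 1440 = once a day
INSERT INTO DB.DBA.SYS_SCHEDULED_EVENT (SE_NAME, SE_START, SE_SQL, SE_INTERVAL)
  VALUES ('daily_rdfa_refresh', now(),
          'DB.DBA.RDF_LOAD_RDFA (http_get (''http://xapi.vocab.pub/datasets/adl/verbs/''),
                                 ''http://xapi.vocab.pub/datasets/adl/verbs/#'',
                                 ''http://xapi.vocab.pub/datasets/adl/verbs'')',
          1440);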