I was working intensively with Virtuoso over the summer, sometimes trying to
get "result sets" that included 4 million rows. I tweaked a number of
variables but was never able to get back result sets that big. Sometimes
I'd wedge the server; other times I'd get back incomplete results.

This is not just a Virtuoso problem: if you try to get a big enough result
set from any database system, you will run into trouble. If the system were
"scalable first" (i.e. Hadoop-like) you might beat that, but for
prescalable software this is a reality.

Fortunately there is an old trick used by batch job pros, which is
to write something like

    SELECT ?id ?label {
      ?id rdfs:label ?label
      FILTER(?id>=LOWER_BOUND)
      FILTER(?id<UPPER_BOUND)
    }

choosing the bounds so that none of the result sets is oversized.
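
To make the windowing concrete, here is a minimal sketch in Python of how the per-window queries could be generated. The function name `windowed_queries` and the idea of passing a pre-sorted list of bounds are my own assumptions for illustration; the bounds are treated as opaque strings that are already valid SPARQL terms your endpoint can compare against ?id.

```python
from typing import Iterator, Sequence

# The query from above, with the two placeholders left in place.
QUERY_TEMPLATE = """SELECT ?id ?label {
  ?id rdfs:label ?label
  FILTER(?id>=LOWER_BOUND)
  FILTER(?id<UPPER_BOUND)
}"""


def windowed_queries(bounds: Sequence[str]) -> Iterator[str]:
    """Yield one query per half-open window [bounds[i], bounds[i+1]).

    `bounds` is a hypothetical, caller-supplied list of SPARQL terms in
    ascending order. Each value is used once as an exclusive upper bound
    and once as an inclusive lower bound, so adjacent windows neither
    overlap nor leave gaps.
    """
    if len(bounds) < 2:
        raise ValueError("need at least two bounds to form one window")
    for lower, upper in zip(bounds, bounds[1:]):
        yield (QUERY_TEMPLATE
               .replace("LOWER_BOUND", lower)
               .replace("UPPER_BOUND", upper))
```

Calling `list(windowed_queries(['"a"', '"m"', '"z"']))` gives two queries, one for each window. Picking the bounds is still up to you: if a window turns out to be too big, split it and run the halves.
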
On Mon, Jan 5, 2015 at 2:57 PM, Kingsley Idehen <kide...@openlinksw.com>
wrote:
> On 1/5/15 1:33 PM, Jörn Hees wrote:
>
>> Hi,
>>
>> TLDR:
>> i'm trying to get node degree counts and neighbourhood subgraphs for a
>> lot of nodes from a local virtuoso-opensource endpoint.
>> My problem is that i seem to get partial / wrong results (similar to
>> https://github.com/openlink/virtuoso-opensource/issues/112 ).
>>
>> Now as this is a local endpoint i'd like to ask if there is any way i can
>> configure it to only return complete results.
>> If not: Which of the virtuoso.ini settings allow me to reduce the problem?
>>
>> Also: is the ResultSetMaxRows parameter applied before aggregation in the
>> counts?
>>
>>
>> Details:
>>
>> We have a local Virtuoso endpoint set up as described here:
>> https://joernhees.de/blog/2014/11/10/setting-up-a-local-dbpedia-2014-mirror-with-virtuoso-7-1-0/
>> configured with:
>> ResultSetMaxRows = 1000000
>> MaxQueryCostEstimationTime = 600 ; in seconds
>> MaxQueryExecutionTime = 1200 ; in seconds
>>
>>
>> I first run queries like these to get node degrees:
>>
>> SELECT ?node count(*) as ?degree
>> WHERE {
>> {
>> ?node ?p ?o .
>> } UNION {
>> ?s ?p ?node .
>> }
>> VALUES (?node) { %(nodes)s }
>> }
>>
>> I do this in chunks of up to n nodes at once in order to cut down the
>> number of queries i'm doing.
>> (At the moment n = 32, but i'd like to increase it to ~ 1024 if this
>> works somehow.)
>>
>> Depending on the degree counts of the nodes some of them might be
>> expanded (if they're not too big) and this is where my problem starts:
>>
>>
>> skos:Concept is reported to have a ?degree = 4, so it's expanded...
>> Turns out i have 4 triples of the form { skos:Concept ?p ?o } but 1396211
>> [1] of the form { ?s ?p skos:Concept }.
>>
>> Unlike mentioned in
>> https://github.com/openlink/virtuoso-opensource/issues/112 there is no
>> HTTP header like X-SQL-State: S1TAT in the original count responses
>> indicating anything went wrong ;(
>>
>>
>> If ResultSetMaxRows was somehow applied before aggregation, how did i get
>> 1396211 as a result for "select count(*) where { ?s ?p skos:Concept }" ?
>>
>
> Ensure your query timeout is large enough. Or set the timeout to 0.
>
> As a control mechanism for comparison, using iSQL, you can execute:
>
> sparql
>
> select count(*) where { ?s ?p skos:Concept } ;
>
>
>
>> Is there a maximum value for ResultSetMaxRows?
>>
> 2 Million.
>
>> Also are there any other settings which could be the reason for these
>> partial results?
>>
>
> No.
>
>>
>>
>> Cheers,
>> Jörn
>>
>>
>> ------------------------------------------------------------------------------
>> Dive into the World of Parallel Programming! The Go Parallel Website,
>> sponsored by Intel and developed in partnership with Slashdot Media, is your
>> hub for all things parallel software development, from weekly thought
>> leadership blogs to news, videos, case studies, tutorials and more. Take a
>> look and join the conversation now. http://goparallel.sourceforge.net
>> _______________________________________________
>> Virtuoso-users mailing list
>> Virtuoso-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/virtuoso-users
>>
>>
>>
>
> --
> Regards,
>
> Kingsley Idehen
> Founder & CEO
> OpenLink Software
> Company Web: http://www.openlinksw.com
> Personal Weblog 1: http://kidehen.blogspot.com
> Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
> Twitter Profile: https://twitter.com/kidehen
> Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
> LinkedIn Profile: http://www.linkedin.com/in/kidehen
> Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
>
>
--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254 paul.houle on Skype ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup