Re: [Virtuoso-users] performance toll for using variables as properties

Olivier Dameron Mon, 08 Aug 2016 09:01:05 -0700

Dear Ivan,
   Thank you for the very detailed answer!
We will investigate whether solution #3 would work for us. 500x
performance improvement sure would be nice:-)


Kind regards
Olivier Dameron

On 07/25/2016 06:58 AM, Ivan Mikhailov wrote:
> Hello Oliver,
> 
> The cost of "variable P" heavily depends on the sample query.
> 
> 1. In good case the overhead is zero. It's possible when graphs (G) are
> known or almost all data of the dataset are in very numerous small
> graphs, so it's cheaper to read whole graphs than to try to use
> additional indices that needs known P first.
> 
> 2. If the value of P can be found before search, as if
> :display :startRelation ?startRel .
> is matched before
> ?gene ?startRel ?start .
> then it may cost zero if statistics of the desired predicate (or
> predicates) are close to average statistics across all predicates of the
> database. In that case the optimizer will build an execution plan using
> "average" statistics but the plan will match one built using statistics
> specific for a specific predicate. If the desired predicate is very
> frequent or very infrequent then the optimizer can miss, resulting in
> poor plan; the cost of the planning error can be big for single-process
> installations and blocking for clusters. Note that rdf:type is the most
> common example of predicate with far-from-average statistics.
> 
> 3. In some cases, known G can eliminate the extra cost of unknown P. As
> a common rule, if you know G or if you can easily calculate G at the
> beginning of the query then the pattern
> <calculate ?g here>
> GRAPH ?g { <interesting part of query> }
> is worth trying. Some customer reported 500x performance on single
> server.
> 
> 4. If the query is executed by an ODBC client or the like and P is a
> parameter passed from outside then it may be practical to inline P into
> the text of the query rather than using ?:p parameter passing notation.
> At least, it's practical to have a separate variant of the query for
> parameter set to rdf:type
> 
> Best Regards,
> 
> Ivan Mikhailov
> OpenLink Software
> http://virtuoso.openlinksw.com
> 
> On Fri, 2016-07-22 at 17:09 +0200, Olivier Dameron wrote:
>> Dear Virtuosans,
>>    I noticed that queries can be much slower when using variables as
>> properties (even if the variable can only have one value) e.g.:
>>
>> DATA (extract):
>> :display :startRelation :posStart.
>> :gene0 rdf:type :Gene.
>> :gene0 :posStart "53416"^^xsd:numeric.
>> :gene1 rdf:type :Gene.
>> :gene1 :posStart "29513"^^xsd:numeric.
>> ...
>>
>> SIMPLE QUERY:
>> ?gene :posStart ?start.
>>
>> SLOWER QUERY:
>> :display :startRelation ?startRel .
>> ?gene ?startRel ?start.
>>
>>    I assume that the query engine first tries to match the "?gene
>> ?startRel ?start" constraint, whereas begining by ":display
>> :startRelation ?startRel." would define the value of ?startRel which
>> would be used to find the start positions of the genes.
>>    I can live with the simple query, but the second one would make the
>> development of our application easier. Is there anything we could do to
>> improve the performance of the second query?
>>
>> Thank you!
>> kind regards
>> Olivier Dameron
>>
>> NB: for the record, below is the script I used for generating the
>> dataset, as well as the two queries
>>
>> ===== generateDataSet.py
>> #! /usr/bin/env python
>>
>> import random
>>
>> nbGenes = 50000
>> geneLengthMax = 100
>> chromosomeLength = 100000
>>
>> with open("debugDataset.ttl", "w") as dataFile:
>>      dataFile.write("""
>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
>> @prefix : <http://www.univ-rennes1.fr/odameron/debugVirtuoso/>.
>>
>> :display :startRelation :posStart.
>> :display :stopRelation :posStop.
>>
>> """)
>>      for i in range(nbGenes):
>>              geneIdent = ":gene" + str(i)
>>              dataFile.write("\n" + geneIdent + " rdf:type :Gene.\n")
>>              posStart = random.randint(1, chromosomeLength-geneLengthMax)
>>              posStop = posStart + random.randint(1, geneLengthMax)
>>              dataFile.write(geneIdent + " :posStart \"" + str(posStart) +
>> "\"^^xsd:numeric.\n")
>>              dataFile.write(geneIdent + " :posStop \"" + str(posStop) +
>> "\"^^xsd:numeric.\n")
>>
>>
>> ===== getOverlap.sparql
>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>> PREFIX : <http://www.univ-rennes1.fr/odameron/debugVirtuoso/>
>>
>> SELECT (count(*) as ?nbOverlap)
>>
>> WHERE {
>>   ?gene1 a :Gene;
>>     :posStart ?start1;
>>     :posStop ?stop1.
>>
>>   ?gene2 a :Gene;
>>     :posStart ?start2;
>>     :posStop ?stop2.
>>   FILTER (?start1 < ?start2 && ?start1 < ?stop2 && ?start2 < ?stop1)
>> }
>>
>>
>> ===== getOverlapSLOW.sparql
>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>> PREFIX : <http://www.univ-rennes1.fr/odameron/debugVirtuoso/>
>>
>> SELECT (count(*) as ?nbOverlap)
>>
>> WHERE {
>>   :display :startRelation ?startRel.
>>   :display :stopRelation ?stopRel.
>>
>>   ?gene1 a :Gene;
>>     ?startRel ?start1;
>>     ?stopRel ?stop1.
>>
>>   ?gene2 a :Gene;
>>     ?startRel ?start2;
>>     ?stopRel ?stop2.
>>   FILTER (?start1 < ?start2 && ?start1 < ?stop2 && ?start2 < ?stop1)
>> }
>>
>>
>>
>> ------------------------------------------------------------------------------
>> What NetFlow Analyzer can do for you? Monitors network bandwidth and traffic
>> patterns at an interface-level. Reveals which users, apps, and protocols are 
>> consuming the most bandwidth. Provides multi-vendor support for NetFlow, 
>> J-Flow, sFlow and other flows. Make informed decisions using capacity 
>> planning
>> reports.http://sdm.link/zohodev2dev
>> _______________________________________________
>> Virtuoso-users mailing list
>> Virtuoso-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/virtuoso-users
> 
> 

------------------------------------------------------------------------------
What NetFlow Analyzer can do for you? Monitors network bandwidth and traffic
patterns at an interface-level. Reveals which users, apps, and protocols are 
consuming the most bandwidth. Provides multi-vendor support for NetFlow, 
J-Flow, sFlow and other flows. Make informed decisions using capacity 
planning reports. http://sdm.link/zohodev2dev
_______________________________________________
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users

Re: [Virtuoso-users] performance toll for using variables as properties

Reply via email to