[Virtuoso-users] Virtuoso optimization for long queries

Davide Alocci Tue, 17 Feb 2015 06:57:13 -0800

Dear Virtuoso Users,

I am Davide Alocci, a Ph.D. student at the Swiss Institute of Bioinformatics. We are currently working on software for substructure search in a database of glycan structures. You can find more information about glycans here (http://en.wikipedia.org/wiki/Glycan), but what really matters is that a glycan is a tree in which every node and edge can carry different information. Our goal is to design software that can retrieve all the structures in the database that contain a specific motif.

From the beginning we decided to translate every structure into triples and to use Virtuoso for the search. In our model each node becomes an entity, and we encode the edges linking the entities as predicates. Because an edge has several properties, we have multiple triples with different predicates for the same edge.

Moreover, we use self-loops to specify a node's properties.

In the end every structure is a long list of triples. Here is an example:


<http://mzjava.expasy.org/structureConnection/A>
    <http://mzjava.expasy.org/predicate/has_components>
        <http://mzjava.expasy.org/component/A/4> ,
        <http://mzjava.expasy.org/component/A/3> ,
        <http://mzjava.expasy.org/component/A/2> ,
        <http://mzjava.expasy.org/component/A/1> ,
        <http://mzjava.expasy.org/component/A/0> .

<http://mzjava.expasy.org/component/A/0>
    <http://mzjava.expasy.org/predicate/is_GlycosidicLinkage>
        <http://mzjava.expasy.org/component/A/3> ,
        <http://mzjava.expasy.org/component/A/2> ;
    <http://mzjava.expasy.org/predicate/is_SubstituentLinkage>
        <http://mzjava.expasy.org/component/A/1> ;
    <http://mzjava.expasy.org/predicate/is_a_Glc>
        <http://mzjava.expasy.org/component/A/0> ;
    <http://mzjava.expasy.org/predicate/is_connected>
        <http://mzjava.expasy.org/component/A/3> ,
        <http://mzjava.expasy.org/component/A/2> ,
        <http://mzjava.expasy.org/component/A/1> ;
    <http://mzjava.expasy.org/predicate/is_monosaccharide>
        <http://mzjava.expasy.org/component/A/0> .

<http://mzjava.expasy.org/component/A/1>
    <http://mzjava.expasy.org/predicate/is_a_NAcetyl>
        <http://mzjava.expasy.org/component/A/1> ;
    <http://mzjava.expasy.org/predicate/is_substituent>
        <http://mzjava.expasy.org/component/A/1> .

<http://mzjava.expasy.org/component/A/2>
    <http://mzjava.expasy.org/predicate/is_a_Gal>
        <http://mzjava.expasy.org/component/A/2> ;
    <http://mzjava.expasy.org/predicate/is_monosaccharide>
        <http://mzjava.expasy.org/component/A/2> .

<http://mzjava.expasy.org/component/A/4>
    <http://mzjava.expasy.org/predicate/is_a_Fuc>
        <http://mzjava.expasy.org/component/A/4> ;
    <http://mzjava.expasy.org/predicate/is_monosaccharide>
        <http://mzjava.expasy.org/component/A/4> .
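
To make the encoding easier to follow, here is a minimal Python sketch of how a tree could be turned into triples of this shape. It is only an illustration, not the code of our software: the `Node` class, the `encode` function and the depth-first component numbering are hypothetical, and only the base IRI and the predicate names are taken from the example above.

```python
from dataclasses import dataclass, field

BASE = "http://mzjava.expasy.org"  # base IRI as it appears in the example triples


@dataclass
class Node:
    """Hypothetical in-memory tree node (not a class from our software)."""
    kind: str       # e.g. "Glc", "Gal", "NAcetyl"
    category: str   # "monosaccharide" or "substituent"
    # (linkage type, child) pairs, e.g. ("GlycosidicLinkage", Node(...))
    children: list = field(default_factory=list)


def encode(structure_id, root):
    """Return (subject, predicate, object) IRI triples for one structure."""
    triples = []
    numbers = {}  # id(node) -> component number, assigned on first use

    def comp(node):
        if id(node) not in numbers:
            numbers[id(node)] = len(numbers)
        return f"{BASE}/component/{structure_id}/{numbers[id(node)]}"

    def pred(name):
        return f"{BASE}/predicate/{name}"

    structure = f"{BASE}/structureConnection/{structure_id}"

    def visit(node):
        me = comp(node)
        triples.append((structure, pred("has_components"), me))
        # Self-loops carry the node's own properties.
        triples.append((me, pred(f"is_a_{node.kind}"), me))
        triples.append((me, pred(f"is_{node.category}"), me))
        for linkage, child in node.children:
            other = comp(child)
            # One edge becomes several triples, one per edge property.
            triples.append((me, pred("is_connected"), other))
            triples.append((me, pred(f"is_{linkage}"), other))
            visit(child)

    visit(root)
    return triples


# A Glc carrying an NAcetyl substituent and linked to a Gal.
glc = Node("Glc", "monosaccharide", [
    ("SubstituentLinkage", Node("NAcetyl", "substituent")),
    ("GlycosidicLinkage", Node("Gal", "monosaccharide")),
])
triples = encode("A", glc)
```

Further edge properties (linked carbon, anomer configuration and so on) would be added as extra predicates in the same loop.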

At the moment our endpoint contains around 30,000 structures and has a size of 200 MB. To query the endpoint we always use more or less the same strategy: we first translate the substructure into a SPARQL query, and then we retrieve the IDs of the structures that contain it.

Here is an example query:

PREFIX predicate: <http://mzjava.expasy.org/predicate/>

SELECT DISTINCT ?structureConnection
WHERE {
    ?structureConnection predicate:has_components ?component0 .
    {
        SELECT * WHERE {
            ?component0 predicate:is_a_Glc ?component0 .
            ?component1 predicate:is_a_NAcetyl ?component1 .
            ?component0 predicate:is_connected ?component1 .
            ?component0 predicate:is_SubstituentLinkage ?component1 .
            ?component0 predicate:has_linkedCarbon_2 ?component1 .

            ?component2 predicate:is_a_Glc ?component2 .
            ?component0 predicate:is_connected ?component2 .
            ?component0 predicate:is_GlycosidicLinkage ?component2 .
            ?component0 predicate:has_anomerConnection_beta ?component2 .
            ?component0 predicate:has_linkedCarbon_4 ?component2 .
            ?component0 predicate:has_anomerCarbon_1 ?component2 .
            ?component3 predicate:is_a_NAcetyl ?component3 .
            ?component2 predicate:is_connected ?component3 .
            ?component2 predicate:is_SubstituentLinkage ?component3 .
            ?component2 predicate:has_linkedCarbon_2 ?component3 .

            ?component4 predicate:is_a_Man ?component4 .
            ?component2 predicate:is_connected ?component4 .
            ?component2 predicate:is_GlycosidicLinkage ?component4 .
            ?component2 predicate:has_anomerConnection_beta ?component4 .
            ?component2 predicate:has_linkedCarbon_4 ?component4 .
            ?component2 predicate:has_anomerCarbon_1 ?component4 .

            ?component5 predicate:is_a_Man ?component5 .
            ?component4 predicate:is_connected ?component5 .
            ?component4 predicate:is_GlycosidicLinkage ?component5 .
            ?component4 predicate:has_anomerConnection_alpha ?component5 .
            ?component4 predicate:has_linkedCarbon_3 ?component5 .
            ?component4 predicate:has_anomerCarbon_1 ?component5 .

            ?component6 predicate:is_a_Man ?component6 .
            ?component4 predicate:is_connected ?component6 .
            ?component4 predicate:is_GlycosidicLinkage ?component6 .
            ?component4 predicate:has_anomerConnection_alpha ?component6 .
            ?component4 predicate:has_linkedCarbon_6 ?component6 .
            ?component4 predicate:has_anomerCarbon_1 ?component6 .

            ?component7 predicate:is_a_Glc ?component7 .
            ?component5 predicate:is_connected ?component7 .
            ?component5 predicate:is_GlycosidicLinkage ?component7 .
            ?component5 predicate:has_anomerConnection_beta ?component7 .
            ?component5 predicate:has_anomerCarbon_1 ?component7 .
            ?component5 predicate:has_linkedCarbon_2 ?component7 .
            ?component8 predicate:is_a_NAcetyl ?component8 .
            ?component7 predicate:is_connected ?component8 .
            ?component7 predicate:is_SubstituentLinkage ?component8 .
            ?component7 predicate:has_linkedCarbon_2 ?component8 .

            ?component9 predicate:is_a_Glc ?component9 .
            ?component6 predicate:is_connected ?component9 .
            ?component6 predicate:is_GlycosidicLinkage ?component9 .
            ?component6 predicate:has_anomerConnection_beta ?component9 .
            ?component6 predicate:has_anomerCarbon_1 ?component9 .
            ?component6 predicate:has_linkedCarbon_2 ?component9 .
            ?component10 predicate:is_a_NAcetyl ?component10 .
            ?component9 predicate:is_connected ?component10 .
            ?component9 predicate:is_SubstituentLinkage ?component10 .
            ?component9 predicate:has_linkedCarbon_2 ?component10 .
        }
    }
}

As you can see, the length of the query grows with the size of the substructure. Substructures can have 30 components, which means more than 200 triple patterns in the query. So we are facing the problem of extremely long queries, which is probably not a common one.

We are trying to optimize Virtuoso for our goal. So far the problem does not seem to be related to RAM or CPU; it seems more connected with the size of the query and the time needed to parse it. Switching from version 7.1 to 7.2 we saw a good performance improvement: in our tests the new version is twice as fast as 7.1 (great job :) ). We even tested a graph database, Neo4j, but the performance was really poor.

At the moment, for a substructure with 25 components we have a query time of 29 seconds, whereas with few components the query time is under one second. I am keen to share our little database and some test queries, because I think this is not a very common use case for Virtuoso.
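
To show how the query size follows from the substructure, here is a minimal Python sketch of the translation step. It is a simplified illustration under assumptions, not our actual implementation: the `build_query` function and its input format are hypothetical, and the `PREFIX` IRI is inferred from the predicate IRIs in the example triples.

```python
def build_query(kinds, edges):
    """Build a substructure query and count its triple patterns.

    kinds: residue kinds, where the list index is the component number.
    edges: (parent_index, child_index, [edge predicate names]) tuples.
    Returns (query_text, number_of_triple_patterns).
    """
    patterns = []
    # One self-loop pattern per component for its kind.
    for i, kind in enumerate(kinds):
        patterns.append(f"?component{i} predicate:is_a_{kind} ?component{i} .")
    # One pattern per property of every edge.
    for parent, child, preds in edges:
        for p in preds:
            patterns.append(
                f"?component{parent} predicate:{p} ?component{child} ."
            )
    body = "\n".join("            " + p for p in patterns)
    query = (
        # Assumed prefix, derived from the IRIs in the example data.
        "PREFIX predicate: <http://mzjava.expasy.org/predicate/>\n"
        "SELECT DISTINCT ?structureConnection\n"
        "WHERE {\n"
        "    ?structureConnection predicate:has_components ?component0 .\n"
        "    {\n"
        "        SELECT * WHERE {\n"
        f"{body}\n"
        "        }\n"
        "    }\n"
        "}\n"
    )
    # +1 for the has_components pattern in the outer WHERE clause.
    return query, len(patterns) + 1


# First three components of the example query above.
kinds = ["Glc", "NAcetyl", "Glc"]
edges = [
    (0, 1, ["is_connected", "is_SubstituentLinkage", "has_linkedCarbon_2"]),
    (0, 2, ["is_connected", "is_GlycosidicLinkage",
            "has_anomerConnection_beta", "has_linkedCarbon_4",
            "has_anomerCarbon_1"]),
]
query, n_patterns = build_query(kinds, edges)
print(n_patterns)
```

In this sketch every component adds one self-loop pattern, every substituent edge adds three patterns and every glycosidic edge adds five, so the pattern count grows linearly with the number of components.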

Any ideas for optimizing our model or our queries are welcome.

Best regards,
Davide

_______________________________________________
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users
