Dear Virtuoso Users,

I am Davide Alocci, a Ph.D. student at the Swiss Institute of Bioinformatics. We are currently working on software for substructure search in a database of glycan structures. More information about glycans is available at http://en.wikipedia.org/wiki/Glycan, but the important point is that a glycan is a tree in which every node and edge can carry different information. Our goal is to design software that retrieves all the structures in the database that contain a specific motif.

From the beginning we decided to translate every structure into triples and use Virtuoso for the search. In our model each node becomes an entity, and we encode the edges linking the entities as predicates. Because an edge has several properties, each edge is represented by multiple triples with different predicates.
Moreover, we use self-loops to specify a node's properties.

In the end every structure is a long list of triples. Here is an example:

<http://mzjava.expasy.org/structureConnection/A>
<http://mzjava.expasy.org/predicate/has_components>
<http://mzjava.expasy.org/component/A/4> , <http://mzjava.expasy.org/component/A/3> , <http://mzjava.expasy.org/component/A/2> , <http://mzjava.expasy.org/component/A/1> ,
<http://mzjava.expasy.org/component/A/0> .

<http://mzjava.expasy.org/component/A/0>
<http://mzjava.expasy.org/predicate/is_GlycosidicLinkage>
<http://mzjava.expasy.org/component/A/3> , <http://mzjava.expasy.org/component/A/2> ;
<http://mzjava.expasy.org/predicate/is_SubstituentLinkage>
<http://mzjava.expasy.org/component/A/1> ;
<http://mzjava.expasy.org/predicate/is_a_Glc>
<http://mzjava.expasy.org/component/A/0> ;
<http://mzjava.expasy.org/predicate/is_connected>
<http://mzjava.expasy.org/component/A/3> , <http://mzjava.expasy.org/component/A/2> , <http://mzjava.expasy.org/component/A/1> ;
<http://mzjava.expasy.org/predicate/is_monosaccharide>
<http://mzjava.expasy.org/component/A/0> .


<http://mzjava.expasy.org/component/A/1>
<http://mzjava.expasy.org/predicate/is_a_NAcetyl>
<http://mzjava.expasy.org/component/A/1> ;
<http://mzjava.expasy.org/predicate/is_substituent>
<http://mzjava.expasy.org/component/A/1> .

<http://mzjava.expasy.org/component/A/2>
<http://mzjava.expasy.org/predicate/is_a_Gal>
<http://mzjava.expasy.org/component/A/2> ;
<http://mzjava.expasy.org/predicate/is_monosaccharide>
<http://mzjava.expasy.org/component/A/2> .

<http://mzjava.expasy.org/component/A/4>
<http://mzjava.expasy.org/predicate/is_a_Fuc>
<http://mzjava.expasy.org/component/A/4> ;
<http://mzjava.expasy.org/predicate/is_monosaccharide>
<http://mzjava.expasy.org/component/A/4> .
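
To make the encoding above concrete, here is a minimal Python sketch of how a small tree could be turned into such triples. It is only an illustration: the function names (component_iri, predicate_iri, encode_structure) and the input layout (a dict of node properties plus a list of edges) are assumptions made for this example and are not our actual code. Node properties become self-loops and each edge property becomes one triple from parent to child.

```python
BASE = "http://mzjava.expasy.org"  # IRI base taken from the example above


def component_iri(structure_id, index):
    """IRI of one component (node) of a structure."""
    return f"<{BASE}/component/{structure_id}/{index}>"


def predicate_iri(name):
    """IRI of a predicate such as is_connected or is_a_Glc."""
    return f"<{BASE}/predicate/{name}>"


def encode_structure(structure_id, nodes, edges):
    """Return one triple per line for a structure.

    nodes: dict mapping component index -> list of node property names
    edges: list of (parent_index, child_index, [edge property names])
    (both layouts are hypothetical, chosen for this sketch)
    """
    structure = f"<{BASE}/structureConnection/{structure_id}>"
    triples = []
    for index in sorted(nodes):
        node = component_iri(structure_id, index)
        triples.append((structure, predicate_iri("has_components"), node))
        # Node properties are encoded as self-loops on the node.
        for prop in nodes[index]:
            triples.append((node, predicate_iri(prop), node))
    for parent, child, properties in edges:
        # Every edge property becomes its own triple parent -> child.
        for prop in properties:
            triples.append((
                component_iri(structure_id, parent),
                predicate_iri(prop),
                component_iri(structure_id, child),
            ))
    return [f"{s} {p} {o} ." for s, p, o in triples]


# A three-component fragment of structure A from the example above.
nodes = {
    0: ["is_a_Glc", "is_monosaccharide"],
    1: ["is_a_NAcetyl", "is_substituent"],
    2: ["is_a_Gal", "is_monosaccharide"],
}
edges = [
    (0, 1, ["is_connected", "is_SubstituentLinkage"]),
    (0, 2, ["is_connected", "is_GlycosidicLinkage"]),
]
for line in encode_structure("A", nodes, edges):
    print(line)
```

The sketch writes one full triple per line instead of the abbreviated form with ";" and ",", but it describes the same kind of graph.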


At the moment our endpoint contains around 30,000 structures and is about 200 MB in size. To query the endpoint we always use more or less the same strategy: we first translate the substructure into a SPARQL query, then retrieve the IDs of the structures that contain it.
Here is an example query:

PREFIX predicate: <http://mzjava.expasy.org/predicate/>
SELECT DISTINCT ?structureConnection
    WHERE {
        ?structureConnection predicate:has_components ?component0 . {
                    SELECT * WHERE {
                            ?component0 predicate:is_a_Glc ?component0 .
?component1 predicate:is_a_NAcetyl ?component1 . ?component0 predicate:is_connected ?component1 . ?component0 predicate:is_SubstituentLinkage ?component1 . ?component0 predicate:has_linkedCarbon_2 ?component1 .
                            ?component2 predicate:is_a_Glc ?component2 .
?component0 predicate:is_connected ?component2 . ?component0 predicate:is_GlycosidicLinkage ?component2 . ?component0 predicate:has_anomerConnection_beta ?component2 . ?component0 predicate:has_linkedCarbon_4 ?component2 . ?component0 predicate:has_anomerCarbon_1 ?component2 . ?component3 predicate:is_a_NAcetyl ?component3 . ?component2 predicate:is_connected ?component3 . ?component2 predicate:is_SubstituentLinkage ?component3 . ?component2 predicate:has_linkedCarbon_2 ?component3 .
                            ?component4 predicate:is_a_Man ?component4 .
?component2 predicate:is_connected ?component4 . ?component2 predicate:is_GlycosidicLinkage ?component4 . ?component2 predicate:has_anomerConnection_beta ?component4 . ?component2 predicate:has_linkedCarbon_4 ?component4 . ?component2 predicate:has_anomerCarbon_1 ?component4 .
                            ?component5 predicate:is_a_Man ?component5 .
?component4 predicate:is_connected ?component5 . ?component4 predicate:is_GlycosidicLinkage ?component5 . ?component4 predicate:has_anomerConnection_alpha ?component5 . ?component4 predicate:has_linkedCarbon_3 ?component5 . ?component4 predicate:has_anomerCarbon_1 ?component5 .
                            ?component6 predicate:is_a_Man ?component6 .
?component4 predicate:is_connected ?component6 . ?component4 predicate:is_GlycosidicLinkage ?component6 . ?component4 predicate:has_anomerConnection_alpha ?component6 . ?component4 predicate:has_linkedCarbon_6 ?component6 . ?component4 predicate:has_anomerCarbon_1 ?component6 .
                            ?component7 predicate:is_a_Glc ?component7 .
?component5 predicate:is_connected ?component7 . ?component5 predicate:is_GlycosidicLinkage ?component7 . ?component5 predicate:has_anomerConnection_beta ?component7 . ?component5 predicate:has_anomerCarbon_1 ?component7 . ?component5 predicate:has_linkedCarbon_2 ?component7 . ?component8 predicate:is_a_NAcetyl ?component8 . ?component7 predicate:is_connected ?component8 . ?component7 predicate:is_SubstituentLinkage ?component8 . ?component7 predicate:has_linkedCarbon_2 ?component8 .
                            ?component9 predicate:is_a_Glc ?component9 .
?component6 predicate:is_connected ?component9 . ?component6 predicate:is_GlycosidicLinkage ?component9 . ?component6 predicate:has_anomerConnection_beta ?component9 . ?component6 predicate:has_anomerCarbon_1 ?component9 . ?component6 predicate:has_linkedCarbon_2 ?component9 . ?component10 predicate:is_a_NAcetyl ?component10 . ?component9 predicate:is_connected ?component10 . ?component9 predicate:is_SubstituentLinkage ?component10 . ?component9 predicate:has_linkedCarbon_2 ?component10 .
                            }
                }
}
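
The translation from a substructure to a query like the one above can be sketched in a few lines of Python. This is a simplified illustration, not our real generator: build_query and its input layout are hypothetical names chosen for this example, and the sketch puts all patterns in a single WHERE block instead of the nested SELECT used above. Each component contributes one self-loop pattern for its type and each edge contributes one pattern per edge property.

```python
# Prefix IRI taken from the predicate IRIs in the data example.
PREFIX = "PREFIX predicate: <http://mzjava.expasy.org/predicate/>"


def build_query(node_types, edges):
    """Build a SPARQL query string for a substructure (motif).

    node_types: list of type predicates; position i describes ?component<i>
    edges: list of (parent_index, child_index, [edge property names])
    (both layouts are assumptions made for this sketch)
    """
    patterns = []
    for i, type_name in enumerate(node_types):
        # Node type is a self-loop, exactly as in the stored triples.
        patterns.append(f"?component{i} predicate:{type_name} ?component{i} .")
    for parent, child, properties in edges:
        for prop in properties:
            patterns.append(
                f"?component{parent} predicate:{prop} ?component{child} ."
            )
    body = "\n".join("    " + pattern for pattern in patterns)
    return (
        f"{PREFIX}\n"
        "SELECT DISTINCT ?structureConnection\n"
        "WHERE {\n"
        "    ?structureConnection predicate:has_components ?component0 .\n"
        f"{body}\n"
        "}"
    )


# The first two components of the example query: a Glc carrying an NAcetyl.
query = build_query(
    ["is_a_Glc", "is_a_NAcetyl"],
    [(0, 1, ["is_connected", "is_SubstituentLinkage", "has_linkedCarbon_2"])],
)
print(query)
```

Every additional component adds one type pattern plus several edge patterns, which is why the generated query text grows with the size of the motif.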

As you can see, the length of the query grows with the size of the substructure. Substructures can have 30 components, which means more than 200 triples in the query. We therefore end up with extremely long queries, which is probably not a common situation.

We are trying to optimize Virtuoso for this use case. So far the bottleneck does not seem to be RAM or CPU; it seems to be related to the size of the query and the time needed to parse it. Switching from version 7.1 to 7.2 gave a good performance improvement: in our tests the new version is twice as fast as 7.1 (great job :) ). We also tested graph databases such as Neo4j, but their performance was really poor.

At the moment a substructure with 25 components takes 29 seconds to query, whereas with only a few components the query time is under one second. I am keen to share our small database and some test queries, because I think this is not a very common use case for Virtuoso.
Any ideas for optimizing our model or our queries are welcome.

Best regards,
Davide




