Dear Virtuoso Users,
I am Davide Alocci, a Ph.D. student at the Swiss Institute of
Bioinformatics.
We are currently working on software for substructure search in a
database of glycan structures. More information about glycans is
available at http://en.wikipedia.org/wiki/Glycan, but the important
point is that a glycan is a tree in which every node and edge can carry
different information. Our goal is to design software that can retrieve
all the structures in the database that contain a specific motif.
From the beginning we decided to translate every structure into triples
and use Virtuoso for the search.
In our model each node becomes an entity, and we encode the edges
linking the entities as predicates.
Because an edge has several properties, each edge produces multiple
triples with different predicates.
In addition, we use self-loops to specify a node's properties.
In the end every structure is a long list of triples. Here is an
example:
<http://mzjava.expasy.org/structureConnection/A>
    <http://mzjava.expasy.org/predicate/has_components>
        <http://mzjava.expasy.org/component/A/4> ,
        <http://mzjava.expasy.org/component/A/3> ,
        <http://mzjava.expasy.org/component/A/2> ,
        <http://mzjava.expasy.org/component/A/1> ,
        <http://mzjava.expasy.org/component/A/0> .

<http://mzjava.expasy.org/component/A/0>
    <http://mzjava.expasy.org/predicate/is_GlycosidicLinkage>
        <http://mzjava.expasy.org/component/A/3> ,
        <http://mzjava.expasy.org/component/A/2> ;
    <http://mzjava.expasy.org/predicate/is_SubstituentLinkage>
        <http://mzjava.expasy.org/component/A/1> ;
    <http://mzjava.expasy.org/predicate/is_a_Glc>
        <http://mzjava.expasy.org/component/A/0> ;
    <http://mzjava.expasy.org/predicate/is_connected>
        <http://mzjava.expasy.org/component/A/3> ,
        <http://mzjava.expasy.org/component/A/2> ,
        <http://mzjava.expasy.org/component/A/1> ;
    <http://mzjava.expasy.org/predicate/is_monosaccharide>
        <http://mzjava.expasy.org/component/A/0> .

<http://mzjava.expasy.org/component/A/1>
    <http://mzjava.expasy.org/predicate/is_a_NAcetyl>
        <http://mzjava.expasy.org/component/A/1> ;
    <http://mzjava.expasy.org/predicate/is_substituent>
        <http://mzjava.expasy.org/component/A/1> .

<http://mzjava.expasy.org/component/A/2>
    <http://mzjava.expasy.org/predicate/is_a_Gal>
        <http://mzjava.expasy.org/component/A/2> ;
    <http://mzjava.expasy.org/predicate/is_monosaccharide>
        <http://mzjava.expasy.org/component/A/2> .

<http://mzjava.expasy.org/component/A/4>
    <http://mzjava.expasy.org/predicate/is_a_Fuc>
        <http://mzjava.expasy.org/component/A/4> ;
    <http://mzjava.expasy.org/predicate/is_monosaccharide>
        <http://mzjava.expasy.org/component/A/4> .
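To make the encoding rules concrete, here is a minimal Python sketch of
how one structure could be turned into triples: one has_components
triple per node, one self-loop per node property, and one triple per
edge property. The function names and the input layout (a dict of nodes
plus a list of edges) are assumptions made for illustration only; they
are not part of our actual software.

```python
BASE = "http://mzjava.expasy.org"


def component(structure_id, index):
    """IRI of one node (component) of a structure."""
    return f"<{BASE}/component/{structure_id}/{index}>"


def predicate(name):
    """IRI of a predicate in the namespace used by the example above."""
    return f"<{BASE}/predicate/{name}>"


def encode_structure(structure_id, nodes, edges):
    """Return a list of (subject, predicate, object) triples.

    nodes: {index: [node property names]}   (hypothetical input layout)
    edges: [(parent, child, [edge property names])]
    """
    structure = f"<{BASE}/structureConnection/{structure_id}>"
    triples = []
    for index in sorted(nodes):
        node = component(structure_id, index)
        # The structure lists every node as one of its components.
        triples.append((structure, predicate("has_components"), node))
        # Node properties are self-loops: subject and object are the same node.
        for name in nodes[index]:
            triples.append((node, predicate(name), node))
    # Every property of an edge becomes its own triple between the two nodes.
    for parent, child, names in edges:
        for name in names:
            triples.append((component(structure_id, parent),
                            predicate(name),
                            component(structure_id, child)))
    return triples


# Two nodes of structure A from the example: the Glc and its NAcetyl.
nodes = {0: ["is_a_Glc", "is_monosaccharide"],
         1: ["is_a_NAcetyl", "is_substituent"]}
edges = [(0, 1, ["is_SubstituentLinkage", "is_connected"])]
triples = encode_structure("A", nodes, edges)
for s, p, o in triples:
    print(s, p, o, ".")
```

Each printed line is one N-Triples statement, so the output can be
compared directly with the Turtle listing above.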
At the moment our endpoint contains around 30000 structures and has a
size of 200 MB.
For querying the endpoint we use more or less the same strategy: we
first translate the substructure into a SPARQL query and then retrieve
the IDs of the structures that contain it.
Here is an example query:
PREFIX predicate: <http://mzjava.expasy.org/predicate/>

SELECT DISTINCT ?structureConnection
WHERE {
  ?structureConnection predicate:has_components ?component0 .
  {
    SELECT * WHERE {
      ?component0 predicate:is_a_Glc ?component0 .
      ?component1 predicate:is_a_NAcetyl ?component1 .
      ?component0 predicate:is_connected ?component1 .
      ?component0 predicate:is_SubstituentLinkage ?component1 .
      ?component0 predicate:has_linkedCarbon_2 ?component1 .
      ?component2 predicate:is_a_Glc ?component2 .
      ?component0 predicate:is_connected ?component2 .
      ?component0 predicate:is_GlycosidicLinkage ?component2 .
      ?component0 predicate:has_anomerConnection_beta ?component2 .
      ?component0 predicate:has_linkedCarbon_4 ?component2 .
      ?component0 predicate:has_anomerCarbon_1 ?component2 .
      ?component3 predicate:is_a_NAcetyl ?component3 .
      ?component2 predicate:is_connected ?component3 .
      ?component2 predicate:is_SubstituentLinkage ?component3 .
      ?component2 predicate:has_linkedCarbon_2 ?component3 .
      ?component4 predicate:is_a_Man ?component4 .
      ?component2 predicate:is_connected ?component4 .
      ?component2 predicate:is_GlycosidicLinkage ?component4 .
      ?component2 predicate:has_anomerConnection_beta ?component4 .
      ?component2 predicate:has_linkedCarbon_4 ?component4 .
      ?component2 predicate:has_anomerCarbon_1 ?component4 .
      ?component5 predicate:is_a_Man ?component5 .
      ?component4 predicate:is_connected ?component5 .
      ?component4 predicate:is_GlycosidicLinkage ?component5 .
      ?component4 predicate:has_anomerConnection_alpha ?component5 .
      ?component4 predicate:has_linkedCarbon_3 ?component5 .
      ?component4 predicate:has_anomerCarbon_1 ?component5 .
      ?component6 predicate:is_a_Man ?component6 .
      ?component4 predicate:is_connected ?component6 .
      ?component4 predicate:is_GlycosidicLinkage ?component6 .
      ?component4 predicate:has_anomerConnection_alpha ?component6 .
      ?component4 predicate:has_linkedCarbon_6 ?component6 .
      ?component4 predicate:has_anomerCarbon_1 ?component6 .
      ?component7 predicate:is_a_Glc ?component7 .
      ?component5 predicate:is_connected ?component7 .
      ?component5 predicate:is_GlycosidicLinkage ?component7 .
      ?component5 predicate:has_anomerConnection_beta ?component7 .
      ?component5 predicate:has_anomerCarbon_1 ?component7 .
      ?component5 predicate:has_linkedCarbon_2 ?component7 .
      ?component8 predicate:is_a_NAcetyl ?component8 .
      ?component7 predicate:is_connected ?component8 .
      ?component7 predicate:is_SubstituentLinkage ?component8 .
      ?component7 predicate:has_linkedCarbon_2 ?component8 .
      ?component9 predicate:is_a_Glc ?component9 .
      ?component6 predicate:is_connected ?component9 .
      ?component6 predicate:is_GlycosidicLinkage ?component9 .
      ?component6 predicate:has_anomerConnection_beta ?component9 .
      ?component6 predicate:has_anomerCarbon_1 ?component9 .
      ?component6 predicate:has_linkedCarbon_2 ?component9 .
      ?component10 predicate:is_a_NAcetyl ?component10 .
      ?component9 predicate:is_connected ?component10 .
      ?component9 predicate:is_SubstituentLinkage ?component10 .
      ?component9 predicate:has_linkedCarbon_2 ?component10 .
    }
  }
}
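The translation from a substructure to a query can be sketched roughly
as follows. This is an illustrative Python sketch and not our actual
code: the function name build_query and the input layout are
assumptions, the predicate: prefix is assumed to expand to the namespace
used in the Turtle example, and the generated query is written as one
flat group of triple patterns instead of the nested SELECT used above.

```python
PREFIX = "PREFIX predicate: <http://mzjava.expasy.org/predicate/>"


def var(index):
    """SPARQL variable standing for one node of the substructure."""
    return f"?component{index}"


def build_query(nodes, edges):
    """Build a SPARQL query for a motif and count its triple patterns.

    nodes: {index: [node property names]}   (hypothetical input layout)
    edges: [(parent, child, [edge property names])]
    """
    root = min(nodes)
    # Tie the motif to a structure through its root component.
    patterns = [f"?structureConnection predicate:has_components {var(root)} ."]
    # One self-loop pattern per node property.
    for index in sorted(nodes):
        for name in nodes[index]:
            patterns.append(f"{var(index)} predicate:{name} {var(index)} .")
    # One pattern per edge property.
    for parent, child, names in edges:
        for name in names:
            patterns.append(f"{var(parent)} predicate:{name} {var(child)} .")
    body = "\n".join("  " + p for p in patterns)
    query = (f"{PREFIX}\nSELECT DISTINCT ?structureConnection\n"
             f"WHERE {{\n{body}\n}}")
    return query, len(patterns)


# The first three components of the example query above.
motif_nodes = {0: ["is_a_Glc"], 1: ["is_a_NAcetyl"], 2: ["is_a_Glc"]}
motif_edges = [
    (0, 1, ["is_connected", "is_SubstituentLinkage", "has_linkedCarbon_2"]),
    (0, 2, ["is_connected", "is_GlycosidicLinkage",
            "has_anomerConnection_beta", "has_linkedCarbon_4",
            "has_anomerCarbon_1"]),
]
query, pattern_count = build_query(motif_nodes, motif_edges)
print(query)
print("triple patterns:", pattern_count)
```

Every node contributes one pattern per property and every edge one
pattern per property, so the number of patterns grows linearly with the
number of components in the motif.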
As you can see, the length of the query grows with the size of the
substructure.
Substructures can have 30 components, which means more than 200 triple
patterns in the query.
At the moment we are facing the problem of having extremely long
queries, which is possibly not a common problem.
We are trying to optimize Virtuoso for our goal, and so far the problem
does not seem to be related to RAM or CPU; it seems more connected with
the size of the query and the time needed to parse it.
Switching from version 7.1 to 7.2 we saw a good improvement in
performance: in our tests the new version is twice as fast as 7.1
(great job :) ).
We also tested graph databases such as Neo4j, but the performance was
really poor.
At the moment a substructure with 25 components has a query time of
29 seconds, whereas with few components the query time is under one
second.
I am keen to share our little database and some test queries, because I
think this is not a very common use case for Virtuoso.
Any ideas for optimizing our model or our queries are welcome.
Best regards,
Davide
_______________________________________________
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users