Hello all,
I have an issue with SynonymGraphFilter expanding the wrong synonyms
for a phrase term at query time.
I have a dictionary with the following lines:
P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 5'-nucleotidase II
A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, acid 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA,cDNA\, FLJ93688\, Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA
and two documents
{"body":"8. The method of claim 6 wherein said method inhibits at least one
5′-nucleotidase chosen from cytosolic 5′-nucleotidase II (cN-II),
cytosolic 5′-nucleotidase IA (cN-IA), cytosolic 5′-nucleotidase IB
(cN-IB), cytosolic 5′-nucleotidase IMA (cN-IIIA), cytosolic
5′-nucleotidase NIB (cN-IIIB), ecto-5′-nucleotidase (eN, CD73),
cytosolic 5′(3′)-deoxynucleotidase (cdN) and mitochondrial
5′(3′)-deoxynucleotidase (mdN)."}
{"body":"Trichomonosis caused by the flagellate protozoan Trichomonas vaginalis
represents the most prevalent nonviral sexually transmitted disease
worldwide (WHO-DRHR 2012). In women, the symptoms are cyclic and often
worsen around the menstruation period. In men, trichomonosis is largely
asymptomatic and these men are considered to be carriers of T. vaginalis
(Petrin et al. 1998). This infection has been associated with birth
outcomes (Klebanoff et al. 2001), infertility (Grodstein et al. 1993),
cervical and prostate cancer (Viikki et al. 2000, Sutcliffe et al. 2012)
and pelvic inflammatory disease (Cherpes et al. 2006). Importantly, T.
vaginalis is a co-factor in human immunodeficiency virus transmission
and acquisition (Sorvillo et al. 2001, Van Der Pol et al. 2008).
Therefore, it is important to study the host-parasite relationship to
understand T. vaginalis infection and pathogenesis. Colonisation of the
mucosa by T. vaginalis is a complex multi-step process that involves
distinct mechanisms (Alderete et al. 2004). The parasite interacts with
mucin (Lehker & Sweeney 1999), adheres to vaginal epithelial cells
(VECs) in a process mediated by adhesion proteins (AP120, AP65, AP51,
AP33 and AP23) and undergoes dramatic morphological changes from a
pyriform to an amoeboid form (Engbring & Alderete 1998, Kucknoor et al.
2005, Moreno-Brito et al. 2005). After adhesion to VECs, the synthesis
and gene expression of adhesins are increased (Kucknoor et al. 2005).
These mechanisms must be tightly regulated and iron plays a pivotal role
in this regulation. Iron is an essential element for all living
organisms, from the most primitive to the most complex, as a component
of haeme, iron-sulphur clusters and a variety of proteins. Iron is known
to contribute to biological functions such as DNA and RNA synthesis,
oxygen transport and metabolic reactions. T. vaginalis has developed
multiple iron uptake systems such as receptors for hololactoferrin,
haemoglobin (HB), haemin (HM) and haeme binding as well as adhesins to
erythrocytes and epithelial cells (Moreno-Brito et al. 2005, Ardalan et
al. 2009). Iron plays a crucial role in the pathogenesis of
trichomonosis by increasing cytoadherence and modulating resistance to
complement lyses, ligation to the extracellular matrix and the
expression of proteases (Figueroa-Angulo et al. 2012). In agreement with
this role, the symptoms of trichomonosis worsen after menstruation. In
addition, iron also influences nucleotide hydrolysis in T. vaginalis
(Tasca et al. 2005, de Jesus et al. 2006). The extracellular
concentrations of ATP and adenosine can markedly increase under several
conditions such as inflammation and hypoxia as well as in the presence
of pathogens (Robson et al. 2006, Sansom 2012). In the extracellular
medium, these nucleotides can act as immunomodulators by triggering
immunological effects. Extracellular ATP acts as a proinflammatory
immune-mediator by triggering multiple immunological effects on cell
types such as neutrophils, macrophages, dendritic cells and lymphocytes
(Bours et al. 2006). In this sense, ATP and adenosine concentrations in
the extracellular compartment are controlled by ectoenzymes, including
those of the nucleoside triphosphate diphosphohydrolase (NTPDase) (EC:
3.1.4.1) family, which hydrolyze tri and diphosphates and
ecto-5’-nucleotidase (EC: 3.1.3.5), which hydrolyses monophosphates
(Zimmermann 2001). Considering that de novo nucleotide synthesis is
absent in T. vaginalis (Heyworth et al. 1982, 1984), this enzyme cascade
is important as a source of the precursor adenosine for purine synthesis
in the parasite (Munagala & Wang 2003). Extracellular nucleotide
metabolism has been characterised in several parasite species such as
Toxoplasma gondii, Schistosoma mansoni, Leishmania spp, Trypanosoma
cruzi, Acanthamoeba, Entamoeba histolytica, Giardia lamblia and fungi,
Saccharomyces cerevisiae, Cryptococcus neoformans, Candida parapsilosis
and Candida albicans (Sansom 2012). In T. vaginalis , NTPDase and
ecto-5’-nucleotidase activities have been characterised and they are
involved in host-parasite interactions by controlling ATP and adenosine
levels (Matos et al. 2001, d, de Jesus et al. 2002, Tasca et al. 2003).
Considering that (i) iron plays a crucial role in the pathogenesis of
trichomonosis, (ii) ATP exerts a proinflammatory effect in inflammation,
(iii) adenosine is important to T. vaginalis growth and acts as an
antiinflammatory factor (Frasson et al. 2012) and (iv) ectonucleotidases
modulate the nucleotide levels at infection sites (such as those
observed in trichomonosis), the aim of this study was to investigate the
effect of iron on the extracellular nucleotide hydrolysis and gene
expression of T . vaginalis."}
The "body" field has the type "text_en", configured as follows:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
    />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
The two dictionary lines are in the file "synonyms.txt".
If I run the following query on a Solr instance configured this way and
containing those documents
(body:"Cytosolic 5'-nucleotidase II" OR body:"EC 3.1.3.5")
both documents are returned.
Surprisingly, if I run the query
(body:"Cytosolic 5'-nucleotidase II")
the second one is not returned.
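To make the two cases easier to reproduce, the requests can be built with debugQuery=true so that the parsed query is included in the response. The following is only a minimal Python sketch that constructs the URLs without sending them. The base URL `http://localhost:8983/solr` and the collection name `mycollection` are assumptions for illustration; `q`, `debugQuery` and `wt` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumed values for illustration only; adjust to the actual installation.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "mycollection"  # hypothetical collection name

# The two queries from this post.
PHRASE_OR_EC = '(body:"Cytosolic 5\'-nucleotidase II" OR body:"EC 3.1.3.5")'
PHRASE_ONLY = '(body:"Cytosolic 5\'-nucleotidase II")'


def debug_query_url(q: str) -> str:
    """Build a /select URL that asks Solr to include debug output."""
    params = {"q": q, "debugQuery": "true", "wt": "json"}
    return f"{SOLR_BASE}/{COLLECTION}/select?{urlencode(params)}"


if __name__ == "__main__":
    for query in (PHRASE_OR_EC, PHRASE_ONLY):
        print(debug_query_url(query))
```

The "parsedquery" entry in the debug section of each response shows how the phrase was expanded.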
If I set debugQuery=true, I see that the second dictionary line is expanded:
A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, acid 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA,cDNA\, FLJ93688\, Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA
instead of the first:
P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 5'-nucleotidase II
The parsed query (as reported by debugQuery) is:
"parsedquery":"SpanNearQuery(spanNear([spanOr([body:a8k9n1,
spanNear([body:glucosidase,, body:beta,, body:acid, body:3], 0,true),
spanNear([body:cytosolic,, body:isoform, body:cra_b], 0,true), spanNear([body:cdna,
body:flj78196,, body:highli, body:similar, body:to, body:homo, body:sapien, body:glucosidase,,
body:beta,, body:acid, body:3], 0,true), body:cytosol, spanNear([body:gba3,, body:mrna],
0,true), spanNear([body:cdna,, body:flj93688,, body:homo, body:sapien, body:glucosidase,,
body:beta,, body:acid, body:3], 0,true), body:cytosol]), body:5, body:nucleotidas, body:ii],
0,true))
If I remove the second line from the dictionary, no synonym is expanded:
"parsedquery":"PhraseQuery(body_unnamed:\"cytosol 5 nucleotidas ii\")",
I think this is related to the word "cytosolic", which appears as a
standalone synonym in the second line. If I remove "cytosolic" as a
synonym from the second line, then again no synonym is expanded.
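To see which entries each dictionary line actually contains, the lines can be split on unescaped commas. This is a minimal Python sketch, not Solr's own parser: the helper names `split_synonym_line` and `single_word_entries` are hypothetical, it assumes that `\,` is the only escape in use, and it assumes the wrapped second line joins with single spaces.

```python
import re

LINE_1 = (
    r"P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,"
    r"Cytosolic 5'-nucleotidase II"
)
LINE_2 = (
    r"A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, acid 3,"
    r"Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to Homo sapiens "
    r"glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA,cDNA\, FLJ93688\, Homo "
    r"sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA"
)


def split_synonym_line(line: str) -> list[str]:
    """Split a synonyms line on commas that are not preceded by a backslash."""
    parts = re.split(r"(?<!\\),", line)
    # Turn the escaped "\," back into a literal comma and trim whitespace.
    return [part.replace("\\,", ",").strip() for part in parts]


def single_word_entries(line: str) -> list[str]:
    """Return the entries that consist of a single whitespace-separated word."""
    return [e for e in split_synonym_line(line) if len(e.split()) == 1]


if __name__ == "__main__":
    for name, line in (("line 1", LINE_1), ("line 2", LINE_2)):
        print(name, "entries:", split_synonym_line(line))
        print(name, "single-word entries:", single_word_entries(line))
```

Under these assumptions, the second line contains "Cytosolic" as a one-word entry of its own, while in the first line the word only occurs inside multi-word entries.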
Can you tell me why this happens? I thought that the first line should
be expanded, since it contains a multi-word synonym that exactly matches
the phrase query.
Thank you
--
Danilo Tomasoni
COSBI