This is the right answer. I could go into more depth, but in short: you will
have better luck if you extract significant phrases rather than terms, using
shingles with a word length of three, no stop words, and a minimum existence
of about 4. It's a fun experiment.
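
A minimal Python sketch of that preprocessing, outside Solr. It assumes "no stop words" means stop words are dropped before shingling and "minimum existence" means a minimum document count; the stop word list and the function names are illustrative, not part of any Solr API.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real setup would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}

def shingles(text, size=3):
    """Return shingles of `size` consecutive tokens after dropping stop words."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def frequent_phrases(docs, size=3, min_docs=4):
    """Keep the phrases that occur in at least `min_docs` documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(shingles(doc, size)))  # count each phrase once per document
    return {phrase: n for phrase, n in doc_freq.items() if n >= min_docs}
```

The surviving phrases are the candidates you would then score against a foreground and a background set.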

> On Jun 22, 2022, at 9:11 AM, Joel Bernstein <[email protected]> wrote:
> 
> For an experiment you can test out the significantTerms Streaming
> Expression, which uses the foreground count and background count to score
> terms.
> 
> https://solr.apache.org/guide/8_9/search-sample.html#significantterms
> https://solr.apache.org/guide/8_9/stream-source-reference.html#significantterms-parameters
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> 
>> On Wed, Jun 22, 2022 at 2:37 AM Danilo Tomasoni <[email protected]> wrote:
>> 
>> Hello Dave, first of all thank you for your answer.
>> 
>> I need to clarify that I used separate (and quite good) NER algorithms
>> offline, and the results were imported into Solr.
>>
>> Unfortunately, the approach you suggest using the MoreLikeThis (MLT)
>> functionality is not suitable for my needs, since I need to discover
>> statistically significant relations between NER entities, while MLT will
>> give me NER entities "similar" to the ones I'm looking for, as far as I
>> understand.
>>
>> Does anyone know why the relatedness is high even if the foreground (and
>> even background) popularity is 0?
>> 
>> Danilo Tomasoni
>> 
>> Fondazione The Microsoft Research - University of Trento Centre for
>> Computational and Systems Biology (COSBI)
>> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
>> [email protected]
>> http://www.cosbi.eu
>> 
>> As for the European General Data Protection Regulation 2016/679 on the
>> protection of natural persons with regard to the processing of personal
>> data, we inform you that all the data we possess are object of treatment in
>> the respect of the normative provided for by the cited GDPR.
>> It is your right to be informed on which of your data are used and how;
>> you may ask for their correction, cancellation or you may oppose to their
>> use by written request sent by recorded delivery to The Microsoft Research
>> – University of Trento Centre for Computational and Systems Biology Scarl,
>> Piazza Manifattura 1, 38068 Rovereto (TN), Italy.
>> Please don't print this e-mail unless you really need to
>> ________________________________
>> From: Dave <[email protected]>
>> Sent: Tuesday, 21 June 2022 19:51
>> To: [email protected] <[email protected]>
>> Subject: Re: Semantic Knowledge Graph theoric question
>> 
>> Two hints: the NER from Solr isn't very good, and the relatedness function
>> is iffy at best.
>>
>> I would take a different approach. Take the NER data as you have it now and
>> use shingles to build a better-formed, complete index using stop words, then
>> try the MLT mechanism to see if the results are better. If they are, great;
>> if not, it's just an idea.
>> 
>> 
>>>> On Jun 21, 2022, at 12:02 PM, Danilo Tomasoni <[email protected]> wrote:
>>> 
>>> Hello all,
>>> I'm experimenting with the SKG features available through the json.facet
>>> API in Solr 8.11 to discover semantic relations in medical text
>>> pre-annotated with NER algorithms.
>>> I store the NER annotations, annotation IDs, spans, etc. in separate Solr
>>> fields to keep the text clean.
>>>
>>> The first results look promising, but I found a behaviour that surprises
>>> me.
>>> To give a bit of context: I'm looking for COVID-related papers with a
>>> standard query (q parameter).
>>> Then I set my foreground query to a set of mitochondria-related keywords
>>> combined with OR, and the background query to *.
>>>
>>> Then the json.facet parameters look like this:
>>> 
>>> "json.facet": {
>>>   "gene": {
>>>     "type": "terms",
>>>     "field": "abstracts_gene_pubtator_annotation_ids",
>>>     "sort": { "r1": "desc" },
>>>     "limit": 3,
>>>     "facet": {
>>>       "r1": "relatedness($fore,$back)"
>>>     }
>>>   }
>>> }
>>> This should return the genes stored in
>>> abstracts_gene_pubtator_annotation_ids that are more likely to occur in
>>> mitochondrial papers.
>>> Running a test query gives me this surprising result:
>>> 
>>> ...
>>>       "gene": {
>>>         "buckets": [
>>>           {
>>>             "val": "3091",
>>>             "count": 1,
>>>             "rtitles1": {
>>>               "relatedness": 0.55649,
>>>               "foreground_popularity": 0,
>>>               "background_popularity": 0.00018
>>>             }
>>>           },
>>> ...
>>> or, for a similar query, even bigger relatedness values:
>>> ...
>>>   "buckets": [
>>>     {
>>>       "val": "MESH:D028361",
>>>       "count": 1,
>>>       "rabstract_conclusions0": {
>>>         "relatedness": 0.91506,
>>>         "foreground_popularity": 5e-05,
>>>         "background_popularity": 5e-05
>>>       },
>>> 
>>> ...
>>> 
>>> But if I recall the z-score formula
>>> 
>>> countFG("3091") - totalFG * probBG
>>> ------------------------------------------------
>>> sqrt( totalFG * (1-probBG)*probBG )
>>> 
>>> and set countFG("3091") to 1, the relatedness should be negative (or at
>>> most 0) whenever totalFG * probBG >= 1, while here I find a clearly
>>> positive relatedness.
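
The reasoning above can be checked numerically with a small Python sketch of the z-score exactly as quoted. The foreground sizes in the loop are made up for illustration (the thread does not give totalFG), and the sketch says nothing about any further scaling Solr may apply before reporting `relatedness`.

```python
import math

def z_score(count_fg, total_fg, prob_bg):
    """z-score as quoted: (countFG - totalFG*probBG) / sqrt(totalFG*(1-probBG)*probBG)."""
    expected = total_fg * prob_bg  # expected foreground count under the background rate
    return (count_fg - expected) / math.sqrt(total_fg * (1 - prob_bg) * prob_bg)

# countFG = 1 and probBG = 0.00018 (background_popularity of bucket "3091");
# the totalFG values are hypothetical.
for total_fg in (1000, 10000, 50000):
    print(total_fg, z_score(1, total_fg, 0.00018))
```

With a single foreground hit, the sign flips from positive to negative once totalFG * probBG exceeds 1, which matches the argument in the message.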
>>> Maybe this can be controlled with min_popularity, but I don't understand
>>> how to use it in conjunction with type=terms and
>>> field=abstracts_gene_pubtator_annotation_ids.
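
As far as I understand the Solr Reference Guide, `min_popularity` is passed through the extended `"type": "func"` form of the aggregation rather than the plain string form, and buckets below the threshold get a relatedness of -Infinity. A hedged Python sketch of the request parameters follows; the `q` and `fore` query strings, the `*:*` background, and the 0.001 threshold are placeholders, not the values used in this thread.

```python
import json

# Request parameters mirroring the facet in the message, with relatedness()
# written in the extended "func" form so that min_popularity can be set.
params = {
    "q": "covid",                      # placeholder main query
    "fore": "abstract:mitochondria",   # placeholder foreground query (hypothetical field)
    "back": "*:*",                     # match-all background query
    "rows": 0,
    "json.facet": json.dumps({
        "gene": {
            "type": "terms",
            "field": "abstracts_gene_pubtator_annotation_ids",
            "sort": {"r1": "desc"},
            "limit": 3,
            "facet": {
                "r1": {
                    "type": "func",
                    "func": "relatedness($fore,$back)",
                    "min_popularity": 0.001,  # illustrative threshold
                }
            },
        }
    }),
}

print(params["json.facet"])
```

The dictionary can be sent as query parameters to the collection's select handler with any HTTP client.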
>>> 
>>> Can you please tell me the correct syntax, and whether my reasoning is
>>> correct?
>>> Thank you
>>> Danilo
>>> 
