This is the right answer. I could go into more depth, but if you extract significant phrases rather than single terms, using shingles, you will have better luck with a shingle length of three words, no stop words, and a minimum occurrence count of about 4. It's a fun experiment.
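A minimal sketch of the shingle idea above, in plain Python rather than inside Solr's analysis chain. The stop-word list, the function names, and the reading of "minimum existence" as "appears in at least 4 documents" are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; a real index would use its own.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}


def shingles(text, size=3):
    """Return word shingles of `size` words, built after dropping stop words."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in STOP_WORDS]
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def frequent_shingles(docs, size=3, min_docs=4):
    """Keep shingles that appear in at least `min_docs` documents (assumed meaning)."""
    doc_freq = Counter()
    for doc in docs:
        # Count each shingle once per document.
        doc_freq.update(set(shingles(doc, size)))
    return {s: n for s, n in doc_freq.items() if n >= min_docs}
```

The surviving phrases could then be indexed or scored as candidate "significant phrases" instead of single terms.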
> On Jun 22, 2022, at 9:11 AM, Joel Bernstein <[email protected]> wrote:
>
> For an experiment you can test out the significantTerms Streaming
> Expression, which uses the foreground count and background count to score
> terms.
>
> https://solr.apache.org/guide/8_9/search-sample.html#significantterms
> https://solr.apache.org/guide/8_9/stream-source-reference.html#significantterms-parameters
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>> On Wed, Jun 22, 2022 at 2:37 AM Danilo Tomasoni <[email protected]> wrote:
>>
>> Hello Dave, first of all thank you for your answer.
>>
>> I need to clarify that I've used separate (and quite good) NER algorithms
>> offline and the results were imported into Solr.
>>
>> Unfortunately the approach that you suggest, using the MoreLikeThis
>> functionality, is not suitable for my needs, since I need to discover
>> statistically significant relations between NER entities, while MLT will
>> give me NER entities "similar" to the ones I'm looking for, as far as I
>> understand.
>>
>> Does anyone know why the relatedness is high even if the foreground (and even
>> background) popularity is 0?
>>
>> Danilo Tomasoni
>>
>> Fondazione The Microsoft Research - University of Trento Centre for
>> Computational and Systems Biology (COSBI)
>> Piazza Manifattura 1, 38068 Rovereto (TN), Italy
>> [email protected]
>> http://www.cosbi.eu
>>
>> As for the European General Data Protection Regulation 2016/679 on the
>> protection of natural persons with regard to the processing of personal
>> data, we inform you that all the data we possess are object of treatment in
>> the respect of the normative provided for by the cited GDPR.
>> It is your right to be informed on which of your data are used and how;
>> you may ask for their correction, cancellation or you may oppose to their
>> use by written request sent by recorded delivery to The Microsoft Research
>> – University of Trento Centre for Computational and Systems Biology Scarl,
>> Piazza Manifattura 1, 38068 Rovereto (TN), Italy.
>> Please don't print this e-mail unless you really need to
>> ________________________________
>> From: Dave <[email protected]>
>> Sent: Tuesday, June 21, 2022 19:51
>> To: [email protected] <[email protected]>
>> Subject: Re: Semantic Knowledge Graph theoric question
>>
>> [CAUTION: EXTERNAL SENDER]
>> [Please check correspondence between Sender Display Name and Sender Email
>> Address before clicking on any link or opening attachments]
>>
>> Two hints. The NER from Solr isn't very good, and the relatedness function
>> is iffy at best.
>>
>> I would take a different approach. Get the NER data as you have it now and
>> use shingles to make a better-formed, complete index using stop words, then
>> use the MLT mechanism to see if it's better. If it is, great; if not, it's
>> just an idea.
>>
>>>> On Jun 21, 2022, at 12:02 PM, Danilo Tomasoni <[email protected]> wrote:
>>>
>>> Hello all,
>>> I'm experimenting with the SKG features available through the json.facet
>>> API in Solr 8.11 to discover semantic relations between medical texts
>>> pre-annotated with NER algorithms.
>>> I store the NER annotations, annotation id, span, etc. in separate Solr
>>> fields, to keep the text clean.
>>>
>>> The first results look promising, but I found a behaviour that surprises
>>> me.
>>> To give a bit of context, I'm looking for covid-related papers with a
>>> standard query (q parameter).
>>> Then I set my foreground query to be a set of keywords, combined with OR,
>>> related to mitochondria, and the background query is set to *.
>>>
>>> Then the json.facet parameters are like
>>>
>>> "json.facet": {
>>>   "gene": {
>>>     "type": "terms",
>>>     "field": "abstracts_gene_pubtator_annotation_ids",
>>>     "sort": { "r1": "desc" },
>>>     "limit": 3,
>>>     "facet": {
>>>       "r1": "relatedness($fore,$back)"
>>>     }
>>>   }
>>> }
>>> This should give the genes stored in abstracts_gene_pubtator_annotation_ids
>>> that are more likely to occur in mitochondrial papers.
>>> Running a test query gives me this surprising result
>>>
>>> ...
>>> "gene": {
>>>   "buckets": [
>>>     {
>>>       "val": "3091",
>>>       "count": 1,
>>>       "rtitles1": {
>>>         "relatedness": 0.55649,
>>>         "foreground_popularity": 0,
>>>         "background_popularity": 0.00018
>>>       }
>>>     },
>>> ...
>>> or, for a similar query, even bigger relatedness values
>>> ...
>>> "buckets": [
>>>   {
>>>     "val": "MESH:D028361",
>>>     "count": 1,
>>>     "rabstract_conclusions0": {
>>>       "relatedness": 0.91506,
>>>       "foreground_popularity": 5e-05,
>>>       "background_popularity": 5e-05
>>>     },
>>>
>>> ...
>>>
>>> But if I recall the z-score formula
>>>
>>>        countFG("3091") - totalFG * probBG
>>>   ------------------------------------------------
>>>        sqrt( totalFG * (1-probBG)*probBG )
>>>
>>> and set countFG("3091") to 1, this means that the relatedness should be
>>> negative (or at most 0) if totalFG * probBG >= 1, while here I find a quite
>>> positive relatedness.
>>> Maybe this can be controlled with min_popularity, but I don't understand
>>> how to use it in conjunction with type=terms and
>>> field=abstracts_gene_pubtator_annotation_ids
>>>
>>> Can you please tell me the correct syntax, and if my reasoning is
>>> correct?
>>> Thank you
>>> Danilo
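
The z-score formula quoted in the original question can be checked numerically with a small sketch. It implements only the formula as written in that email; whether Solr's relatedness() applies further normalization on top of it is not covered here. The totalFG values in the usage note are hypothetical, and treating the reported background_popularity as probBG is an assumption.

```python
import math


def z_score(count_fg, total_fg, prob_bg):
    """z-score as quoted in the thread:
    (countFG - totalFG * probBG) / sqrt(totalFG * (1 - probBG) * probBG)
    """
    expected = total_fg * prob_bg
    variance = total_fg * (1.0 - prob_bg) * prob_bg
    if variance <= 0:
        # Degenerate case (empty foreground, or probBG of 0 or 1): no signal.
        return 0.0
    return (count_fg - expected) / math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical foreground sizes; prob_bg borrowed from the example bucket.
    for total_fg in (1000, 10000):
        print(total_fg, z_score(1, total_fg, 0.00018))
```

With countFG fixed at 1, the sign flips exactly where totalFG * probBG crosses 1, which matches the reasoning in the question: a small foreground set gives a positive score for a single hit, a large one gives a negative score.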
