EDIT: "the equivalent terms are separated by commas (as they should be)" => "the equivalent terms are _not_ separated by commas (as they should be)"
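For anyone converting a list in the `ACRONYM=>ACRONYM long form` shape, here is a minimal Python sketch of the rewrite into comma-separated, lower-cased lines. It is only an illustration: the function name `to_equivalent_line` is hypothetical, the sample lines are taken from the list quoted further down, and it assumes that no long form itself contains a comma.

```python
def to_equivalent_line(line: str) -> str:
    """Rewrite 'ACR=>ACR long form' as 'acr, long form' (lower-cased).

    Hypothetical helper for illustration; assumes the long form has no commas.
    """
    acronym, arrow, rhs = line.strip().partition("=>")
    if not arrow:
        # No '=>' on this line: only normalize the case.
        return line.strip().lower()
    acronym = acronym.strip()
    rhs = rhs.strip()
    # Drop the acronym that is repeated at the start of the right-hand side.
    if rhs.startswith(acronym + " "):
        rhs = rhs[len(acronym) + 1:].strip()
    # The comma is what separates the acronym from its multi-word long form.
    return f"{acronym}, {rhs}".lower()


sample = [
    "SRN=>SRN Stroke Research Network",
    "T2w=>T2w T2-weighted",
    "SSE=>SSE sum squared error",
]
converted = [to_equivalent_line(s) for s in sample]
for row in converted:
    print(row)
```

The resulting lines are meant for a synonyms.txt that is applied after lower-casing in the analysis chain, in line with the advice quoted below; whether that resolves the odd matches would still need checking against the actual index.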
On Fri, Jan 15, 2021 at 10:09 AM Michael Gibney <mich...@michaelgibney.net> wrote:

> Shaun,
>
> I'm not 100% sure, but don't give up on this just yet:
>
> > For example if I enter diabetes it finds the acronym DM for diabetes mellitus
>
> I think the behavior you're observing may simply be a side-effect of a
> misconfiguration of synonyms.txt. In the example you posted, the equivalent
> terms are separated by commas (as they should be), which would lead to
> treating line `DM diabetes mellitus` as effectively "DM == diabetes ==
> mellitus", which as you point out is clearly not what you want. Do you see
> similar results for `DM, diabetes mellitus` (which should be parsed as
> meaning "DM == 'diabetes mellitus'", which iiuc _is_ what you want)?
>
> (see the note about ensuring proper comma-separation in my earlier
> response)
>
> Michael
>
>
> On Fri, Jan 15, 2021 at 9:52 AM Shaun Campbell <campbell.sh...@gmail.com>
> wrote:
>
>> Hi Michael
>>
>> Thanks for that, I'll have a study later. It's just reminded me of the
>> expand option which I meant to have a look at.
>>
>> Thanks
>> Shaun
>>
>> On Fri, 15 Jan 2021 at 14:33, Michael Gibney <mich...@michaelgibney.net>
>> wrote:
>>
>> > The equivalent terms on the right-hand side of the `=>` operator in the
>> > example you sent should be separated by a comma. You mention you already
>> > tried only-comma-separated (e.g. one line: `SRN,Stroke Research Network`)
>> > and that that yielded unexpected results as well. I would recommend
>> > pre-case-normalizing all the terms in synonyms.txt (i.e., lower-case), and
>> > applying the synonym filter _after_ case normalization in the analysis
>> > chain (there are other ways you could do this, but the key point being that
>> > you need to pay attention to case and how it interacts with the order in
>> > which filters are applied).
>> >
>> > Re: Charlie's recommendation to apply these at index-time, a word of
>> > caution (and it's possible that this is in fact the underlying cause of
>> > some of the unexpected behavior you're observing?): be careful if you're
>> > using term _expansion_ at index-time (i.e., mapping single terms to
>> > multiple terms, which I note appears to be what you're trying to do in the
>> > example lines you provided). Multi-term index-time synonyms can lead to
>> > unexpected results for positional queries (either explicit phrase queries,
>> > or implicit, e.g. as configured by `pf` param in edismax). I'm aware of at
>> > least two good overviews of this topic, one by Mike McCandless focusing on
>> > Elasticsearch [1], one by Steve Rowe focusing on Solr [2]. The underlying
>> > issue is related to LUCENE-4312 [3], so both posts (ES- & Solr-related) are
>> > relevant.
>> >
>> > One way to work around this is to "collapse" (rather than expand) synonyms,
>> > at both index and query time. Another option would be to apply synonym
>> > expansion only at query-time. It's also worth noting that increasing phrase
>> > slop (`ps` param, etc.) can cause the issues with index-time synonym
>> > expansion to "fly under the radar" a little, wrt the most blatant "false
>> > negative" manifestations of index-time synonym issues for phrase queries.
>> >
>> > [1] https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
>> > [2] https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
>> > [3] https://issues.apache.org/jira/browse/LUCENE-4312
>> >
>> > On Fri, Jan 15, 2021 at 6:18 AM Charlie Hull <ch...@opensourceconnections.com> wrote:
>> >
>> > > I'm wondering if you should be using these acronyms at index time, not
>> > > search time.
>> > > It will make your index bigger and you'll have to re-index
>> > > to add new synonyms (as they may apply to old documents) but this could
>> > > be an occasional task, and in the meantime you could use query-time
>> > > synonyms for the new ones.
>> > >
>> > > Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to me.
>> > >
>> > > Cheers
>> > >
>> > > Charlie
>> > >
>> > > On 15/01/2021 09:48, Shaun Campbell wrote:
>> > > > I have a medical journals search application and I've a list of some
>> > > > 9,000 acronyms like this:
>> > > >
>> > > > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
>> > > > SRN=>SRN Stroke Research Network
>> > > > IGBP=>IGBP isolated gastric bypass
>> > > > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for Obstructive sleep apnoea–hypopnoea
>> > > > SRM=>SRM standardised response mean
>> > > > SRT=>SRT substrate reduction therapy
>> > > > SRS=>SRS Sexual Rating Scale
>> > > > SRU=>SRU stroke rehabilitation unit
>> > > > T2w=>T2w T2-weighted
>> > > > Ab-P=>Ab-P Aberdeen participation restriction subscale
>> > > > MSOA=>MSOA middle-layer super output area
>> > > > SSA=>SSA site-specific assessment
>> > > > SSC=>SSC Study Steering Committee
>> > > > SSB=>SSB short-stretch bandage
>> > > > SSE=>SSE sum squared error
>> > > > SSD=>SSD social services department
>> > > > NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument
>> > > >
>> > > > I tried to put them in a synonyms file, either just with a comma between,
>> > > > or with an arrow in between and the acronym repeated on the right like
>> > > > above, and no matter what I try I'm getting really strange search results.
>> > > > It's like words in one acronym are matching with the same word in another
>> > > > acronym and then searching with that acronym which is completely unrelated.
>> > > >
>> > > > I don't think Solr can handle this, but does anyone know of any crafty
>> > > > tricks in Solr to handle this situation where I can either search by the
>> > > > acronym or by the text?
>> > > >
>> > > > Shaun
>> > > >
>> > >
>> > > --
>> > > Charlie Hull - Managing Consultant at OpenSource Connections Limited
>> > > <www.o19s.com>
>> > > Founding member of The Search Network <https://thesearchnetwork.com/>
>> > > and co-author of Searching the Enterprise
>> > > <https://opensourceconnections.com/about-us/books-resources/>
>> > > tel/fax: +44 (0)8700 118334
>> > > mobile: +44 (0)7767 825828
>> > >
>> >