Re: Handling acronyms

Michael Gibney Fri, 15 Jan 2021 07:33:50 -0800

EDIT: "the equivalent terms are separated by commas (as they should be)" =>
"the equivalent terms are _not_ separated by commas (as they should be)"

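To make the comma rule concrete, here is a minimal sketch in Python. It is not Solr's actual parser: `parse_synonym_line` is a hypothetical helper, it ignores comments and backslash escapes, and it only assumes what is discussed in this thread, namely that `=>` separates inputs from outputs and that commas (not whitespace) separate terms. The lower-casing mirrors the advice to pre-case-normalize synonyms.txt.

```python
def parse_synonym_line(line):
    """Split one synonyms.txt-style line into (inputs, outputs) term lists.

    Illustrative only: no comment handling, no backslash escapes.
    """
    def terms(part):
        # Commas separate terms; whitespace inside a term is preserved.
        return [t.strip().lower() for t in part.split(",") if t.strip()]

    if "=>" in line:
        # Explicit mapping: left-hand terms are replaced by right-hand terms.
        left, right = line.split("=>", 1)
        return terms(left), terms(right)

    # No arrow: every term on the line is equivalent to every other.
    equivalents = terms(line)
    return equivalents, equivalents


# Without a comma, the right-hand side is ONE four-word term.
print(parse_synonym_line("SRN=>SRN Stroke Research Network"))
# (['srn'], ['srn stroke research network'])

# With a comma, the acronym and its expansion are two separate terms.
print(parse_synonym_line("SRN=>SRN,Stroke Research Network"))
# (['srn'], ['srn', 'stroke research network'])
```

Under these assumptions, the only difference between the two lines is the comma, and it decides whether the acronym and its expansion are treated as two alternatives or as one long phrase.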

On Fri, Jan 15, 2021 at 10:09 AM Michael Gibney <mich...@michaelgibney.net>
wrote:

> Shaun,
>
> I'm not 100% sure, but don't give up on this just yet:
>
> > For example if I enter diabetes it finds the acronym DM for diabetes
> > mellitus
>
> I think the behavior you're observing may simply be a side-effect of a
> misconfiguration of synonyms.txt. In the example you posted, the equivalent
> terms are separated by commas (as they should be), which would lead to
> treating line `DM diabetes mellitus` as effectively "DM == diabetes ==
> mellitus", which as you point out is clearly not what you want. Do you see
> similar results for `DM, diabetes mellitus` (which should be parsed as
> meaning "DM == 'diabetes mellitus'", which iiuc _is_ what you want)?
>
> (see the note about ensuring proper comma-separation in my earlier
> response)
>
> Michael
>
>
> On Fri, Jan 15, 2021 at 9:52 AM Shaun Campbell <campbell.sh...@gmail.com>
> wrote:
>
>> Hi Michael
>>
>> Thanks for that I'll have a study later.  It's just reminded me of the
>> expand option which I meant to have a look at.
>>
>> Thanks
>> Shaun
>>
>> On Fri, 15 Jan 2021 at 14:33, Michael Gibney <mich...@michaelgibney.net>
>> wrote:
>>
>> > The equivalent terms on the right-hand side of the `=>` operator in the
>> > example you sent should be separated by a comma. You mention you already
>> > tried only-comma-separated (e.g. one line: `SRN,Stroke Research Network`)
>> > and that that yielded unexpected results as well. I would recommend
>> > pre-case-normalizing all the terms in synonyms.txt (i.e., lower-case), and
>> > applying the synonym filter _after_ case normalization in the analysis
>> > chain (there are other ways you could do this, but the key point is that
>> > you need to pay attention to case and how it interacts with the order in
>> > which filters are applied).
>> >
>> > Re: Charlie's recommendation to apply these at index-time, a word of
>> > caution (and it's possible that this is in fact the underlying cause of
>> > some of the unexpected behavior you're observing?): be careful if you're
>> > using term _expansion_ at index-time (i.e., mapping single terms to
>> > multiple terms, which I note appears to be what you're trying to do in
>> > the example lines you provided). Multi-term index-time synonyms can lead
>> > to unexpected results for positional queries (either explicit phrase
>> > queries, or implicit, e.g. as configured by `pf` param in edismax). I'm
>> > aware of at least two good overviews of this topic, one by Mike
>> > McCandless focusing on Elasticsearch [1], one by Steve Rowe focusing on
>> > Solr [2]. The underlying issue is related to LUCENE-4312 [3], so both
>> > posts (ES- & Solr-related) are relevant.
>> >
>> > One way to work around this is to "collapse" (rather than expand)
>> > synonyms, at both index and query time. Another option would be to apply
>> > synonym expansion only at query-time. It's also worth noting that
>> > increasing phrase slop (`ps` param, etc.) can cause the issues with
>> > index-time synonym expansion to "fly under the radar" a little, wrt the
>> > most blatant "false negative" manifestations of index-time synonym issues
>> > for phrase queries.
>> >
>> > [1] https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
>> > [2] https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
>> > [3] https://issues.apache.org/jira/browse/LUCENE-4312
>> >
>> > On Fri, Jan 15, 2021 at 6:18 AM Charlie Hull <
>> > ch...@opensourceconnections.com> wrote:
>> >
>> > > I'm wondering if you should be using these acronyms at index time, not
>> > > search time. It will make your index bigger and you'll have to re-index
>> > > to add new synonyms (as they may apply to old documents) but this could
>> > > be an occasional task, and in the meantime you could use query-time
>> > > synonyms for the new ones.
>> > >
>> > > Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to
>> > > me.
>> > >
>> > > Cheers
>> > >
>> > > Charlie
>> > >
>> > > On 15/01/2021 09:48, Shaun Campbell wrote:
>> > > > I have a medical journals search application and I've a list of some
>> > > > 9,000 acronyms like this:
>> > > >
>> > > > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
>> > > > SRN=>SRN Stroke Research Network
>> > > > IGBP=>IGBP isolated gastric bypass
>> > > > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for Obstructive sleep apnoea–hypopnoea
>> > > > SRM=>SRM standardised response mean
>> > > > SRT=>SRT substrate reduction therapy
>> > > > SRS=>SRS Sexual Rating Scale
>> > > > SRU=>SRU stroke rehabilitation unit
>> > > > T2w=>T2w T2-weighted
>> > > > Ab-P=>Ab-P Aberdeen participation restriction subscale
>> > > > MSOA=>MSOA middle-layer super output area
>> > > > SSA=>SSA site-specific assessment
>> > > > SSC=>SSC Study Steering Committee
>> > > > SSB=>SSB short-stretch bandage
>> > > > SSE=>SSE sum squared error
>> > > > SSD=>SSD social services department
>> > > > NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument
>> > > >
>> > > > I tried to put them in a synonyms file, either just with a comma
>> > > > between, or with an arrow in between and the acronym repeated on the
>> > > > right like above, and no matter what I try I'm getting really strange
>> > > > search results. It's like words in one acronym are matching with the
>> > > > same word in another acronym and then searching with that acronym which
>> > > > is completely unrelated.
>> > > >
>> > > > I don't think Solr can handle this, but does anyone know of any crafty
>> > > > tricks in Solr to handle this situation where I can either search by
>> > > > the acronym or by the text?
>> > > >
>> > > > Shaun
>> > > >
>> > >
>> > > --
>> > > Charlie Hull - Managing Consultant at OpenSource Connections Limited
>> > > <www.o19s.com>
>> > > Founding member of The Search Network <https://thesearchnetwork.com/>
>> > > and co-author of Searching the Enterprise
>> > > <https://opensourceconnections.com/about-us/books-resources/>
>> > > tel/fax: +44 (0)8700 118334
>> > > mobile: +44 (0)7767 825828
>> > >
>> >
>>
>
