This is a common misunderstanding of RemoveDuplicatesTokenFilter. It removes tokens _introduced_ by certain other filters, not duplicates that were part of the original. This is the relevant part of the docs: "if they have the same text and position values". An input of "hey hey hey" has a different position for each "hey"...
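To make the "same text and position" rule concrete, here is a rough sketch in plain Python. It is a simplified model, not Lucene code: tokens are assumed to be (text, position increment) pairs, and the function name is made up for illustration. An increment of 0 means the token is stacked on the same position as the previous one, which is what synonym-style filters typically produce.

```python
from typing import Iterable, List, Tuple

# Hypothetical token model: (text, position_increment).
Token = Tuple[str, int]


def remove_same_position_duplicates(tokens: Iterable[Token]) -> List[Token]:
    """Drop a token only if the same text was already seen at the same position."""
    seen_at_position = set()
    kept = []
    for text, pos_inc in tokens:
        if pos_inc > 0:
            # The stream advanced to a new position, so start over.
            seen_at_position.clear()
        elif text in seen_at_position:
            # Same text stacked on the same position: a true duplicate.
            continue
        seen_at_position.add(text)
        kept.append((text, pos_inc))
    return kept


# "hey hey hey": every token advances the position, so nothing is removed.
typed_query = [("hey", 1), ("hey", 1), ("hey", 1)]

# A duplicate stacked on one position (increment 0) is removed.
stacked = [("hey", 1), ("hey", 0), ("hi", 0)]

print(remove_same_position_duplicates(typed_query))
print(remove_same_position_duplicates(stacked))
```

The repeated "hey" tokens survive because each one sits at its own position; only the stacked copy in the second stream is dropped.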

Best,
Erick

On Thu, Feb 9, 2017 at 10:52 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Yeah, what does that do anyway, omit both, but not one in particular, and
> where was omitTermFreq all this time, does it make sense?
>
> Not to me at least, so i never tried it and just overridden the similarity in
> place.
>
> M.
>
> -----Original message-----
>> From:Alexandre Rafalovitch <arafa...@gmail.com>
>> Sent: Thursday 9th February 2017 18:00
>> To: solr-user <solr-user@lucene.apache.org>
>> Subject: Re: Removing duplicate terms from query
>>
>> Would omitTermFreqAndPositions help here? Though that's probably an
>> overkill as that disables phrase searches too. I am not sure if it is
>> possible to do omitTermFreqAndPositions=true omitPositions=false to
>> just skip frequencies.
>>
>> Regards,
>> Alex.
>> ----
>> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>>
>>
>> On 9 February 2017 at 11:37, Walter Underwood <wun...@wunderwood.org> wrote:
>> > 1. I don’t think this is a good idea. It means that a search for “hey hey
>> > hey” won’t score that document higher.
>> >
>> > 2. Maybe you want to change how tf is calculated. Ignore multiple
>> > occurrences of a word.
>> >
>> > I ran into this with the movie title “New York, New York” at Netflix. It
>> > isn’t twice as much about New York, but it needs to be the best match for
>> > the query “new york new york”.
>> >
>> > wunder
>> > Walter Underwood
>> > wun...@wunderwood.org
>> > http://observer.wunderwood.org/ (my blog)
>> >
>> >
>> >> On Feb 9, 2017, at 5:18 AM, Ere Maijala <ere.maij...@helsinki.fi> wrote:
>> >>
>> >> Thanks Emir.
>> >>
>> >> I was thinking of something very simple like doing what
>> >> RemoveDuplicatesTokenFilter does but ignoring positions. It would of
>> >> course still be possible to have the same term multiple times, but at
>> >> least the adjacent ones could be deduplicated. The reason I'm not too
>> >> eager to do it in a query preprocessor is that I'd have to essentially
>> >> duplicate functionality of the query analysis chain that contains
>> >> ICUTokenizerFactory, WordDelimiterFilterFactory and whatnot.
>> >>
>> >> Regards,
>> >> Ere
>> >>
>> >> 9.2.2017, 14.52, Emir Arnautovic wrote:
>> >>> Hi Ere,
>> >>>
>> >>> I don't think that there is such filter. Implementing such filter would
>> >>> require looking backward which violates streaming approach of token
>> >>> filters and unpredictable memory usage.
>> >>>
>> >>> I would do it as part of query preprocessor and not necessarily as part
>> >>> of Solr.
>> >>>
>> >>> HTH,
>> >>> Emir
>> >>>
>> >>>
>> >>> On 09.02.2017 12:24, Ere Maijala wrote:
>> >>>> Hi,
>> >>>>
>> >>>> I just noticed that while we use RemoveDuplicatesTokenFilter during
>> >>>> query time, it will consider term positions and not really do anything
>> >>>> e.g. if query is 'term term term'. As far as I can see the term
>> >>>> positions make no difference in a simple non-phrase search. Is there a
>> >>>> built-in way to deal with this? I know I can write a filter to do
>> >>>> this, but I feel like this would be something quite basic to do for
>> >>>> the query. And I don't think it's even anything too weird for normal
>> >>>> users to do. Just consider e.g. searching for music by title:
>> >>>>
>> >>>> Hey, hey, hey ; Shivers of pleasure
>> >>>>
>> >>>> I also verified that at least according to debugQuery=true and
>> >>>> anecdotal evidence the search really slows down if you repeat the same
>> >>>> term enough.
>> >>>>
>> >>>> --Ere
>> >>>
>> >>
>> >> --
>> >> Ere Maijala
>> >> Kansalliskirjasto / The National Library of Finland
>> >
>>
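
For the narrower idea raised in the thread, deduplicating only adjacent terms while ignoring positions, the look-back is limited to the single previous term, so memory stays constant. The sketch below shows this over plain strings in Python. It is not a Lucene TokenFilter; the function name is hypothetical, and the terms are assumed to have been analyzed already (the whitespace split stands in for the real analysis chain and does not replicate ICUTokenizerFactory or WordDelimiterFilterFactory).

```python
from typing import Iterable, Iterator


def drop_adjacent_repeats(terms: Iterable[str]) -> Iterator[str]:
    """Yield each term unless it equals the term immediately before it."""
    previous = None
    for term in terms:
        if term != previous:
            yield term
        # Only the last term is remembered, so the stream is never buffered.
        previous = term


# Stand-in for analyzed query terms (assumption: lowercased, punctuation removed).
title_query = "hey hey hey shivers of pleasure".split()
print(list(drop_adjacent_repeats(title_query)))

# Non-adjacent repeats are kept, as noted in the thread.
print(list(drop_adjacent_repeats("new york new york".split())))
```

Walter's caveat still applies: collapsing the repeats changes how such a query scores against a title that really does contain the repeated term.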