Is this for a phrase search? If yes then the position of the token would
matter too and not sure which token would you want to remove. "eg
"tshirt hat tshirt".
Also, are you looking to save space and want this at index time? Or just
want to remove duplicates from the search string?

If this is at search time AND is not a phrase search, there are a couple
approaches I could think of :

1) You could either handle this in the application layer to only pass the
deduplicated string before it hits solr
2) You can write a custom search component and configure it in the
 <first-components> list to process the search string and remove duplicates
before it hits the default search components. See here (
https://lucene.apache.org/solr/guide/7_7/requesthandlers-and-searchcomponents-in-solrconfig.html#first-components-and-last-components
).

However if for search, I would still evaluate if writing those extra lines
of code is worth the investment. I say so since my assumption is that for
duplicated tokens in search string, lucene would have the intelligence to
not fetch the doc ids again, so you should not be worried about spending
computation resources to reevaluate the same tokens (Someone correct me if
I am wrong!)

-Rahul

On Thu, Sep 17, 2020 at 2:56 PM Rajdeep Sahoo <rajdeepsahoo2...@gmail.com>
wrote:

> If someone is searching with " tshirt tshirt tshirt tshirt tshirt tshirt"
> we need to remove the duplicates and search with tshirt.
>
>
> On Fri, 18 Sep, 2020, 12:19 am Alexandre Rafalovitch, <arafa...@gmail.com>
> wrote:
>
> > This is not quite enough information.
> > There is
> >
> https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#remove-duplicates-token-filter
> > but it has specific limitations.
> >
> > What is the problem that you are trying to solve that you feel is due
> > to duplicate tokens? Why are they duplicates? Is it about storage or
> > relevancy?
> >
> > Regards,
> >    Alex.
> >
> > On Thu, 17 Sep 2020 at 14:35, Rajdeep Sahoo <rajdeepsahoo2...@gmail.com>
> > wrote:
> > >
> > > Hi team,
> > >  Is there any way to remove duplicate tokens from solr. Is there any
> > filter
> > > for this.
> >
>

Reply via email to