Re: How to remove duplicate tokens from solr

Rajdeep Sahoo Fri, 18 Sep 2020 00:33:46 -0700

Hi all,
 I found the details below on Stack Overflow, but I am not sure how to
include the JAR. Can anyone help with this?



I've created a new filter class based on "FilteringTokenFilter". The task is
pretty simple: check whether a token has already been seen before adding it
to the list.

I have created a simple plugin, "Eliminate duplicate words":
<https://github.com/volkan/lucene-solr-filter-eliminateduplicate>

To load the plugin, place the JAR files (including EliminateDuplicate-*.jar,
which can be built by running the mvn package command or downloaded from
https://github.com/volkan/lucene-solr-filter-eliminateduplicate/tree/master/solr/lib)
in a lib directory in the Solr Home directory. The lib directory sits next to
the solr.xml file.
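
If the duplicates only need to be removed from the search string, and the
query is not a phrase search, the application-layer approach suggested in
the quoted replies below avoids the plugin altogether. The following is a
minimal sketch of that idea in plain Java, not part of Solr or of the plugin
above. The class name QueryDeduplicator and the method dedupe are
hypothetical, and it assumes whitespace-separated terms and a case-sensitive
comparison.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper (not a Solr API): removes repeated terms from a
// search string before the application sends the query to Solr.
final class QueryDeduplicator {
    private QueryDeduplicator() {}

    // Returns the query with duplicate whitespace-separated terms removed,
    // keeping the first occurrence of each term in its original order.
    // Assumption: comparison is case-sensitive, so "Tshirt" and "tshirt"
    // are treated as different terms.
    static String dedupe(String query) {
        if (query == null) {
            return "";
        }
        String trimmed = query.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        // LinkedHashSet drops duplicates while preserving insertion order.
        Set<String> seen = new LinkedHashSet<>();
        for (String term : trimmed.split("\\s+")) {
            seen.add(term);
        }
        return String.join(" ", seen);
    }
}
```

With this sketch, QueryDeduplicator.dedupe("tshirt tshirt tshirt") yields
"tshirt". It changes token positions, so it is not suitable for phrase
searches such as "tshirt hat tshirt".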

On Fri, 18 Sep, 2020, 1:04 am Rajdeep Sahoo, <rajdeepsahoo2...@gmail.com>
wrote:

> But I am not sure why this type of search string is causing high CPU
> utilization.
>
> On Fri, 18 Sep, 2020, 12:49 am Rahul Goswami, <rahul196...@gmail.com>
> wrote:
>
>> Is this for a phrase search? If yes, then the position of the token would
>> matter too, and I am not sure which token you would want to remove, e.g.
>> "tshirt hat tshirt".
>> Also, are you looking to save space and want this at index time? Or do you
>> just want to remove duplicates from the search string?
>>
>> If this is at search time AND is not a phrase search, there are a couple of
>> approaches I can think of:
>>
>> 1) You could handle this in the application layer and pass only the
>> deduplicated string to Solr.
>> 2) You could write a custom search component and configure it in the
>> <first-components> list to process the search string and remove
>> duplicates before it hits the default search components. See:
>> https://lucene.apache.org/solr/guide/7_7/requesthandlers-and-searchcomponents-in-solrconfig.html#first-components-and-last-components
>>
>> However, if this is for search, I would still evaluate whether writing
>> those extra lines of code is worth the investment. I say so because my
>> assumption is that, for duplicated tokens in a search string, Lucene is
>> smart enough not to fetch the doc IDs again, so you should not need to
>> worry about spending computation resources re-evaluating the same tokens.
>> (Someone correct me if I am wrong!)
>>
>> -Rahul
>>
>> On Thu, Sep 17, 2020 at 2:56 PM Rajdeep Sahoo <rajdeepsahoo2...@gmail.com>
>> wrote:
>>
>> > If someone searches with "tshirt tshirt tshirt tshirt tshirt tshirt",
>> > we need to remove the duplicates and search with just "tshirt".
>> >
>> >
>> > On Fri, 18 Sep, 2020, 12:19 am Alexandre Rafalovitch, <arafa...@gmail.com>
>> > wrote:
>> >
>> > > This is not quite enough information.
>> > > There is
>> > > https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#remove-duplicates-token-filter
>> > > but it has specific limitations.
>> > >
>> > > What is the problem that you are trying to solve that you feel is due
>> > > to duplicate tokens? Why are they duplicates? Is it about storage or
>> > > relevancy?
>> > >
>> > > Regards,
>> > >    Alex.
>> > >
>> > > On Thu, 17 Sep 2020 at 14:35, Rajdeep Sahoo <rajdeepsahoo2...@gmail.com>
>> > > wrote:
>> > > >
>> > > > Hi team,
>> > > > Is there any way to remove duplicate tokens from Solr? Is there a
>> > > > filter for this?
>> > >
>> >
>>
>
