Hi all, I found the details below on Stack Overflow, but I am not sure how to include the JAR. Can anyone help with this?

I've created a new filter class derived from "FilteringTokenFilter". The task is pretty simple: I check each token before adding it to the list. I have created a simple plugin, Eliminate duplicate words: <https://github.com/volkan/lucene-solr-filter-eliminateduplicate>

To load the plugin, place the JAR files (including EliminateDuplicate-*.jar, which can be built by running the mvn package command or taken from https://github.com/volkan/lucene-solr-filter-eliminateduplicate/tree/master/solr/lib) in a lib directory in the Solr Home directory. The lib directory sits next to the solr.xml file.

On Fri, 18 Sep, 2020, 1:04 am Rajdeep Sahoo, <rajdeepsahoo2...@gmail.com> wrote:

> But not sure why these type of search string is causing high cpu
> utilization.
>
> On Fri, 18 Sep, 2020, 12:49 am Rahul Goswami, <rahul196...@gmail.com>
> wrote:
>
>> Is this for a phrase search? If yes then the position of the token would
>> matter too and not sure which token would you want to remove, e.g.
>> "tshirt hat tshirt".
>> Also, are you looking to save space and want this at index time? Or just
>> want to remove duplicates from the search string?
>>
>> If this is at search time AND is not a phrase search, there are a couple
>> approaches I could think of:
>>
>> 1) You could either handle this in the application layer to only pass the
>> deduplicated string before it hits solr
>> 2) You can write a custom search component and configure it in the
>> <first-components> list to process the search string and remove
>> duplicates before it hits the default search components. See here (
>> https://lucene.apache.org/solr/guide/7_7/requesthandlers-and-searchcomponents-in-solrconfig.html#first-components-and-last-components
>> ).
>>
>> However if for search, I would still evaluate if writing those extra lines
>> of code is worth the investment. I say so since my assumption is that for
>> duplicated tokens in search string, lucene would have the intelligence to
>> not fetch the doc ids again, so you should not be worried about spending
>> computation resources to reevaluate the same tokens (Someone correct me if
>> I am wrong!)
>>
>> -Rahul
>>
>> On Thu, Sep 17, 2020 at 2:56 PM Rajdeep Sahoo <rajdeepsahoo2...@gmail.com>
>> wrote:
>>
>> > If someone is searching with " tshirt tshirt tshirt tshirt tshirt tshirt"
>> > we need to remove the duplicates and search with tshirt.
>> >
>> > On Fri, 18 Sep, 2020, 12:19 am Alexandre Rafalovitch, <arafa...@gmail.com>
>> > wrote:
>> >
>> > > This is not quite enough information.
>> > > There is
>> > > https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#remove-duplicates-token-filter
>> > > but it has specific limitations.
>> > >
>> > > What is the problem that you are trying to solve that you feel is due
>> > > to duplicate tokens? Why are they duplicates? Is it about storage or
>> > > relevancy?
>> > >
>> > > Regards,
>> > > Alex.
>> > >
>> > > On Thu, 17 Sep 2020 at 14:35, Rajdeep Sahoo <rajdeepsahoo2...@gmail.com>
>> > > wrote:
>> > > >
>> > > > Hi team,
>> > > > Is there any way to remove duplicate tokens from solr. Is there any
>> > > > filter for this.
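
For reference, option 1 from Rahul's reply (deduplicating the search string in the application layer before it reaches Solr) might look roughly like the sketch below. It uses only the Java standard library. The class name QueryDeduplicator and the method dedupe are made up for illustration and are not part of Solr or of the plugin linked above. It assumes terms are separated by whitespace, compares them case-sensitively, and is not meant for phrase queries, where token positions matter.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Solr or the EliminateDuplicate plugin.
public class QueryDeduplicator {

    // Returns the query with repeated whitespace-separated terms removed,
    // keeping the first occurrence of each term in its original order.
    // Assumption: comparison is case-sensitive ("Tshirt" != "tshirt").
    public static String dedupe(String query) {
        Set<String> seen = new LinkedHashSet<>();
        for (String term : query.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                seen.add(term); // a LinkedHashSet ignores repeats and keeps insertion order
            }
        }
        return String.join(" ", seen);
    }
}
```

With this, QueryDeduplicator.dedupe(" tshirt tshirt tshirt tshirt tshirt tshirt") yields "tshirt", and the result can be passed as the q parameter instead of the raw user input.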