Re: Cursor mark page duplicates

2019-11-28 Thread Dwane Hall
M To: solr-user@lucene.apache.org Subject: Re: Cursor mark page duplicates On 11/28/2019 1:30 AM, Dwane Hall wrote: > I asked a question on the forum a couple of weeks ago regarding cursorMark > duplicates. I initially thought it may be due to HDFSCaching because I was > unable repl

Re: Cursor mark page duplicates

2019-11-28 Thread Shawn Heisey
On 11/28/2019 1:30 AM, Dwane Hall wrote: I asked a question on the forum a couple of weeks ago regarding cursorMark duplicates. I initially thought it may be due to HDFSCaching because I was unable to replicate the issue on local indexes but unfortunately the dreaded duplicates have returned

Re: Cursor mark page duplicates

2019-11-28 Thread Dwane Hall
Hey guys, I asked a question on the forum a couple of weeks ago regarding cursorMark duplicates. I initially thought it may be due to HDFSCaching because I was unable to replicate the issue on local indexes but unfortunately the dreaded duplicates have returned!! For a refresher I was seeing

Re: Cursor mark page duplicates

2019-11-11 Thread Dwane Hall
monitor our new index configuration and if I notice any similar behaviour I'll make the community aware of my findings. Once again, Thanks for your input Dwane From: Chris Hostetter Sent: Friday, 8 November 2019 9:58 AM To: solr-user@lucene.apache.org

Re: Cursor mark page duplicates

2019-11-07 Thread Chris Hostetter
: I'm using Solr's cursor mark feature and noticing duplicates when paging : through results. The duplicate records happen intermittently and appear : at the end of one page, and the beginning of the next (but not on all : pages through the results). So if rows=20 the duplicate rec

Re: Cursor mark page duplicates

2019-11-07 Thread Erick Erickson
f all the IDs are all the same on both replicas, I haven’t a clue….. Best, Erick > On Nov 7, 2019, at 5:34 AM, Dwane Hall wrote: > > Hey Solr community, > > I'm using Solr's cursor mark feature and noticing duplicates when paging > through results. The duplicate re

Cursor mark page duplicates

2019-11-07 Thread Dwane Hall
Hey Solr community, I'm using Solr's cursor mark feature and noticing duplicates when paging through results. The duplicate records happen intermittently and appear at the end of one page, and the beginning of the next (but not on all pages through the results). So if rows=20 the
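For readers reproducing this, a minimal Python sketch of a cursor-paging loop with a check for ids that repeat across pages. It relies on documented cursorMark behaviour: the first request sends `cursorMark=*`, the sort must include the uniqueKey field as a tie-breaker, and paging is finished when the returned `nextCursorMark` equals the mark that was sent. The names `iterate_cursor`, `find_repeated_ids`, and `make_fake_fetch` are hypothetical, and the fake fetcher is an in-memory stand-in for a real Solr request, not Solr's implementation.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Page = Tuple[List[Dict], str]  # (docs, nextCursorMark)


def iterate_cursor(fetch_page: Callable[[str], Page]) -> Iterator[Dict]:
    """Yield every document, following cursor marks until they stop changing."""
    cursor = "*"  # documented starting value for cursorMark
    while True:
        docs, next_cursor = fetch_page(cursor)
        yield from docs
        if next_cursor == cursor:  # an unchanged mark means no more results
            break
        cursor = next_cursor


def find_repeated_ids(docs: Iterator[Dict]) -> List[str]:
    """Return ids seen more than once, in order of first repetition."""
    seen, repeated = set(), []
    for doc in docs:
        if doc["id"] in seen and doc["id"] not in repeated:
            repeated.append(doc["id"])
        seen.add(doc["id"])
    return repeated


def make_fake_fetch(ids: List[str], rows: int) -> Callable[[str], Page]:
    """Hypothetical stand-in for a Solr request; the mark is just an offset."""
    def fetch(cursor: str) -> Page:
        start = 0 if cursor == "*" else int(cursor)
        page = [{"id": i} for i in ids[start:start + rows]]
        next_cursor = str(start + len(page)) if page else cursor
        return page, next_cursor
    return fetch
```

Running `find_repeated_ids(iterate_cursor(fetch))` against a real fetcher would list the ids that appear at a page boundary, which may help narrow down whether the repeats come from the index or from the client.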

Shards, delete duplicates ?

2017-04-14 Thread Bruno Mannina
documents. It’s normal I know. Is there a method, a parameter, or anything else that lets me tell Solr to compare ID in C1 with ID12 in C2 to delete duplicates? Many thanks for your help, Bruno Mannina <http://www.matheo-software.com> www.matheo-software.com

Re: Protect against duplicates with the Migrate statement

2015-12-03 Thread Shalin Shekhar Mangar
cuments . New > documents will only be written to HotDocuments and every night I will migrate > a chunk of documents into ColdDocuments. > > > In the test environment, I have the Collection API migrate statement working > fine. I know this won't handle duplicates ending up

Re: Protect against duplicates with the Migrate statement

2015-12-02 Thread philippa griggs
same unique id and signature. I ended up with duplicates in the cold collection. Thanks for your help, Philippa From: Zheng Lin Edwin Yeo Sent: 03 December 2015 02:30:31 To: solr-user@lucene.apache.org Subject: Re: Protect against duplicates with the

Re: Protect against duplicates with the Migrate statement

2015-12-02 Thread Zheng Lin Edwin Yeo
;m implementing two collections - HotDocuments and ColdDocuments . New > documents will only be written to HotDocuments and every night I will > migrate a chunk of documents into ColdDocuments. > > > In the test environment, I have the Collection API migrate statement > working fine.

Protect against duplicates with the Migrate statement

2015-12-02 Thread philippa griggs
ection API migrate statement working fine. I know this won't handle duplicates ending up in the ColdDocuments collection and I don't expect to have duplicate documents but I would like to protect against it- just in case. We have a unique key and I've tried to impleme

Re: Find duplicates

2014-12-02 Thread Alexandre Rafalovitch
And if I am correct, enabling docValues will do this kind of grouping as part of the indexing with docValues data structure (per segment). So, all one has to do is to get it back (through faceting). Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newslett

RE: Find duplicates

2014-12-02 Thread Gonzalo Rodriguez
@lucene.apache.org Subject: Find duplicates Hi Is it possible to formulate a Solr query which finds all documents which have the same value in a particular field? Note, I don't know what the value is, I just want to find all documents with duplicate values. For example, I have 5 documents: Doc1:

Re: Find duplicates

2014-12-02 Thread Erik Hatcher
Sort of… if you indexed the full value of the field (and you’re looking for truly exact matches) as a string field type you could facet on that field with facet.mincount=2 and the facets returned would be the ones with duplicate values. You’d have to drill down on each of the facets returned to
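A small Python sketch of the faceting approach described above: facet on the string field and keep only values that occur at least twice. The function name `duplicate_value_query` is hypothetical; `Name` is the field from the original question, and `facet.limit=-1` (return all values) is an added assumption that may be too expensive on large indexes.

```python
from urllib.parse import parse_qs, urlencode


def duplicate_value_query(field: str) -> str:
    """Build a query string whose facet counts list values shared by 2+ docs."""
    params = [
        ("q", "*:*"),             # match every document
        ("rows", "0"),            # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", field),   # must be an exact-match (string) field
        ("facet.mincount", "2"),  # hide values that occur only once
        ("facet.limit", "-1"),    # assumption: return all values, not the top N
    ]
    return urlencode(params)


if __name__ == "__main__":
    # Decode the query string again to show the parameters that would be sent.
    print(parse_qs(duplicate_value_query("Name")))
```

Each facet value returned is a duplicated value; a follow-up filter query per value is still needed to list the documents that share it.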

Find duplicates

2014-12-02 Thread Peter Kirk
Hi Is it possible to formulate a Solr query which finds all documents which have the same value in a particular field? Note, I don't know what the value is, I just want to find all documents with duplicate values. For example, I have 5 documents: Doc1: field Name = Peter Doc2: field Name = Jac

Re: Ignoring Duplicates in Multivalue Field

2014-11-03 Thread Matthew Nigl
ssuming that > Solr applies them in the order specified, to remove any existing value and > then add it to the end. > > See: > https://cwiki.apache.org/confluence/display/solr/ > Updating+Parts+of+Documents > > -- Jack Krupansky > > -Original Message- From

Re: Ignoring Duplicates in Multivalue Field

2014-11-03 Thread Jack Krupansky
cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents -- Jack Krupansky -Original Message- From: Tomer Levi Sent: Monday, November 3, 2014 4:19 AM To: solr-user@lucene.apache.org ; Ahmet Arslan Subject: RE: Ignoring Duplicates in Multivalue Field Hi Ahmet, When I add the RunU

RE: Ignoring Duplicates in Multivalue Field

2014-11-03 Thread Tomer Levi
Hi Ahmet, When I add the RunUpdateProcessorFactory Solr didn't remove any duplications. Any other idea? -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] Sent: Monday, November 03, 2014 1:35 AM To: solr-user@lucene.apache.org Subject: Re: Ignoring Duplicat

Re: Ignoring Duplicates in Multivalue Field

2014-11-02 Thread Ahmet Arslan
Hi Tomer, What happens when you addto your chain? Ahmet On Sunday, November 2, 2014 1:22 PM, Tomer Levi wrote: Hi, I’m trying to make my “update” request handler ignore multivalue duplications in updates. To make my use case clear, let’s assume my index already contains a document li

Ignoring Duplicates in Multivalue Field

2014-11-02 Thread Tomer Levi
Hi, I'm trying to make my "update" request handler ignore multivalue duplications in updates. To make my use case clear, let's assume my index already contains a document like: { id:"100", "myMultValueField": ["1","2","3"] } Later I would like to send an update like: { id:"100"," myMul

Re: mark solr documents as duplicates on hashing the combination of some fields

2014-10-22 Thread Alexandre Rafalovitch
This is the "dark art" knowledge. I've updated the Reference Guide comment with the request to have this text included, but it would also be nice to have it as part of the Javadoc for the Factory or the URP itself. Maybe WIKI as well. I can see not getting this part causing somebody a lot of headac

Re: mark solr documents as duplicates on hashing the combination of some fields

2014-10-22 Thread Chris Hostetter
: I meant signature will be broken. For example suppose the destination of : hash function for signature fields are "sig". After each partial update it : becomes: "00"! details please. how are you configuring your update processor chain? what does your schema look like? what types of at

Re: mark solr documents as duplicates on hashing the combination of some fields

2014-10-22 Thread Ali Nazemian
; > > just don't configure the signatureField to be the same as your > uniqueKey > > > field. > > > > > > configure some other fieldname (ie "signature") instead. > > > > > > > > > : Date: Tue, 14 Oct 2014 12:08:26 +

Re: mark solr documents as duplicates on hashing the combination of some fields

2014-10-22 Thread Alexandre Rafalovitch
tureField to be the same as your uniqueKey > > field. > > > > configure some other fieldname (ie "signature") instead. > > > > > > : Date: Tue, 14 Oct 2014 12:08:26 +0330 > > : From: Ali Nazemian > > : Reply-To: solr-user@lucene.apache.

Re: mark solr documents as duplicates on hashing the combination of some fields

2014-10-22 Thread Ali Nazemian
: solr-user@lucene.apache.org > : To: "solr-user@lucene.apache.org" > : Subject: mark solr documents as duplicates on hashing the combination of > some > : fields > : > : Dear all, > : Hi, > : I was wondering how can I mark some documents as duplicate (just marking > : for f

Re: mark solr documents as duplicates on hashing the combination of some fields

2014-10-21 Thread Chris Hostetter
y-To: solr-user@lucene.apache.org : To: "solr-user@lucene.apache.org" : Subject: mark solr documents as duplicates on hashing the combination of some : fields : : Dear all, : Hi, : I was wondering how can I mark some documents as duplicate (just marking : for future usage not de

mark solr documents as duplicates on hashing the combination of some fields

2014-10-14 Thread Ali Nazemian
Dear all, Hi, I was wondering how can I mark some documents as duplicate (just marking for future usage not deleting) based on the hash combination of some fields? Suppose I have 2 fields name "url" and "title" I want to create hash based on url+title and send it to another field name "signature".
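To make the idea concrete, a client-side Python sketch that hashes `url` + `title` into a `signature` field so that duplicates are marked rather than deleted. This only illustrates the concept; within Solr the same effect is normally obtained server-side with the SignatureUpdateProcessorFactory mentioned elsewhere in these threads. The function name `add_signature`, the choice of MD5, and the separator byte are assumptions made for this example.

```python
import hashlib
from typing import Dict, Sequence


def add_signature(doc: Dict, fields: Sequence[str] = ("url", "title"),
                  target: str = "signature") -> Dict:
    """Return a copy of doc with a hex digest of the chosen fields added."""
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing identically.
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    out = dict(doc)
    out[target] = hashlib.md5(joined.encode("utf-8")).hexdigest()
    return out
```

Two documents with different ids but the same url and title receive the same `signature`, so they can later be found by faceting or grouping on that field. As noted in the replies, the signature field should not be the same as the uniqueKey field if the duplicates are meant to be kept.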

Odd extra character duplicates in spell checking

2014-04-15 Thread Ed Smiley
Hi, I am going to make this question pretty short, so I don’t overwhelm with technical details until the end. I suspect that some folks may be seeing this issue without the particular configuration we are using. What our problem is: 1. Correctly spelled words are returning as not spelled co

Re: Checking for similar text (duplicates)

2014-01-09 Thread Mikhail Khludnev
to see if > "could be" some duplicates of some text ? > > that wiki mention special signature field which is added to documents, try to search for it. > 2. As far as I seen the deduplication has some bottlenecks when comparing > extremely similar items (eg just one charact

Re: Checking for similar text (duplicates)

2014-01-09 Thread Cristian Bichis
Hi Mikhail, I have seen the deduplication part as well, but I have some concerns: 1. Is deduplication supposed to work for a check-only request (one that does not actually try to add a new record to the index)? That is, if I just check whether there "could be" some duplicates of some text? 2. As far as

Re: Checking for similar text (duplicates)

2014-01-09 Thread Mikhail Khludnev
res is to detect /if there are/ similar > records into index comparing with a potential new record and /which are > these records/. In other words to check for duplicates (which are not > necessary identical but would be very close to original). The comparison is > made checking on a des

Checking for similar text (duplicates)

2014-01-09 Thread Cristian Bichis
currently). I am not quite familiar with Solr at this point, I am into early checking stage. One of the current app features is to detect /if there are/ similar records into index comparing with a potential new record and /which are these records/. In other words to check for duplicates (which are

Re: core swap duplicates core entries in solr.xml

2013-11-09 Thread Alan Woodward
Hi Jeremy, Could you open a JIRA ticket for this? Thanks, Alan Woodward www.flax.co.uk On 8 Nov 2013, at 21:16, Branham, Jeremy [HR] wrote: > When performing a core swap in SOLR 4.5.1 with persistence on, the two core > entries that were swapped are duplicated. > > Solr.xml > > > > >

core swap duplicates core entries in solr.xml

2013-11-08 Thread Branham, Jeremy [HR]
When performing a core swap in SOLR 4.5.1 with persistence on, the two core entries that were swapped are duplicated. Solr.xml Performed swap -

Re: Removing duplicates during a query

2013-08-22 Thread Dan Davis
OK - I see that this can be done with Field Collapsing/Grouping. I also see the mentions in the Wiki for avoiding duplicates using a 16-byte hash. So, question withdrawn... On Thu, Aug 22, 2013 at 10:21 PM, Dan Davis wrote: > Suppose I have two documents with different id, and there

Removing duplicates during a query

2013-08-22 Thread Dan Davis
Suppose I have two documents with different id, and there is another field, for instance "content-hash" which is something like a 16-byte hash of the content. Can Solr be configured to return just one copy, and drop the other if both are relevant? If Solr does drop one result, do you get any indi

Re: removing duplicates

2013-08-21 Thread Liu
@lucene.apache.org Subject: removing duplicates hello, We have documents that are duplicates, i.e. the ID is different, but the rest of the fields are the same. Is there a query that can remove duplicates, and just leave one copy of the document on solr? There is one numeric field that we can key off to find duplicates

RE: removing duplicates

2013-08-21 Thread Petersen, Robert
k@gmail.com] Sent: Wednesday, August 21, 2013 2:34 PM To: solr-user@lucene.apache.org Subject: Re: removing duplicates Thanks Aloke and Robert. Can you please give me code/query snippets? (newbie here) On Wed, Aug 21, 2013 at 2:31 PM, Aloke Ghoshal wrote: > Hi, > > Facet by on

Re: removing duplicates

2013-08-21 Thread Aloke Ghoshal
Hi, This will help you identify the duplicates: q=*:*&fl=id&facet=true&facet.mincount=2&rows=0&facet.field= To actually remove them from Solr, you will have to do something like Robert suggested. Write an application that uses the results to build a delete by id query ( ht

Re: removing duplicates

2013-08-21 Thread Ali, Saqib
, > Aloke > > > On Thu, Aug 22, 2013 at 2:44 AM, Ali, Saqib wrote: > > > hello, > > > > We have documents that are duplicates i.e. the ID is different, but rest > of > > the fields are same. Is there a query that can remove duplicate, and just > > leave

RE: removing duplicates

2013-08-21 Thread Petersen, Robert
Hi Perhaps you could query for all documents asking for the id field to be returned and then facet on the field you say you can key off of for duplicates. Set the facet mincount to 2, then you would have to filter on each facet value and page through all doc IDs (except skip the first
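A Python sketch of the clean-up step outlined above: for each duplicated value, keep the first document id and collect the rest for deletion. The function names are hypothetical, the grouping of ids per value is assumed to have been gathered already (for example by paging through a filter query per facet value), and the `{"delete": [...]}` body is the list-of-ids form of Solr's JSON update format as I understand it, so check it against the documentation for your version before use.

```python
import json
from typing import Dict, List


def ids_to_delete(groups: Dict[str, List[str]]) -> List[str]:
    """Keep the first id in each group of duplicates and collect the rest."""
    doomed: List[str] = []
    for ids in groups.values():
        doomed.extend(ids[1:])  # skip the first id so one copy survives
    return doomed


def delete_payload(ids: List[str]) -> str:
    """Serialize a delete-by-id request body (assumed JSON update format)."""
    return json.dumps({"delete": ids})
```

Which copy counts as "first" depends on the sort order used when collecting the ids, so it is worth sorting deliberately (for example by date) before deleting anything.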

Re: removing duplicates

2013-08-21 Thread Aloke Ghoshal
Hi, Facet by one of the duplicate fields (probably by the numeric field that you mentioned) and set facet.mincount=2. Regards, Aloke On Thu, Aug 22, 2013 at 2:44 AM, Ali, Saqib wrote: > hello, > > We have documents that are duplicates i.e. the ID is different, but rest of > th

removing duplicates

2013-08-21 Thread Ali, Saqib
hello, We have documents that are duplicates, i.e. the ID is different, but the rest of the fields are the same. Is there a query that can remove duplicates, and just leave one copy of the document on solr? There is one numeric field that we can key off to find duplicates. Please advise. Thanks

Re: copyField multiValued duplicates

2012-11-23 Thread Erick Erickson
a million docs). Is there any way we could remove > duplicates emitted via copyField while re-indexing ? Also is there a way to > query multiValued fields to give only docs that have duplicated value ?? > > The fields having issue are declared as follows > > sortMi

Duplicates in the suggester.

2012-09-05 Thread sharath jagannath
Not sure whether it is a duplicate question. Did try to browse through the archive and did not find anything specific to what I was looking for. I see duplicates in the dictionary if I update the document concurrently. I am using Solr 3.6.1 with the following configurations for suggester: Solr

Re: Duplicates in Facets

2012-04-04 Thread Jamie Johnson
ily in Luke. > > On Wed, 2012-04-04 at 23:35 -0400, Jamie Johnson wrote: >> I am currently indexing some information and am wondering why I am >> getting duplicates in facets.  From what I can tell they are the same, >> but is there any case that could cause this that I may no

Re: Duplicates in Facets

2012-04-04 Thread Darren Govoni
Try using Luke to look at your index and see if there are multiple similar TFV's. You can browse them easily in Luke. On Wed, 2012-04-04 at 23:35 -0400, Jamie Johnson wrote: > I am currently indexing some information and am wondering why I am > getting duplicates in facets. From what

Duplicates in Facets

2012-04-04 Thread Jamie Johnson
I am currently indexing some information and am wondering why I am getting duplicates in facets. From what I can tell they are the same, but is there any case that could cause this that I may not be thinking of? Could this be some non-printable character making its way into the index? S

Re: how to avoid duplicates in search results?

2011-10-04 Thread Chris Hostetter
: There is also a Document Duplicate Detection at index time: : http://wiki.apache.org/solr/Deduplication Or just setting "url" as your UniqueKey field would solve this simple use case, but it's not entirely clear what else you consider "duplicates"

Re: how to avoid duplicates in search results?

2011-10-04 Thread Edoardo Tosca
> - > testing group > testing group > name="url">http://abc.xyz.com/groups/testing-group/discussions/62 > > > > > > i need to remove the duplicte results > > can anyone give me suggestions > > -- > View this message in contex

how to avoid duplicates in search results?

2011-10-04 Thread nagarjuna
-group/discussions/62 i need to remove the duplicate results can anyone give me suggestions -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-avoid-duplicates-in-search-results-tp3392524p3392524.html Sent from the Solr - User mailing list archive at

Re: Removing duplicates

2011-02-19 Thread Ahmet Arslan
> I know that I can use the > SignatureUpdateProcessorFactory to remove duplicates but I > would like the duplicates in the index but remove them > conditionally at query time. > > Is there any easy way I could accomplish this? Closest thing can be group documents by sig

Removing duplicates

2011-02-18 Thread Mark
I know that I can use the SignatureUpdateProcessorFactory to remove duplicates but I would like the duplicates in the index but remove them conditionally at query time. Is there any easy way I could accomplish this?

Re: De-duplication not working as I expected - duplicates still getting into the index

2010-12-14 Thread Markus Jelsma
Check this setting: false On Tuesday 14 December 2010 14:26:21 Jason Brown wrote: > I have configured de-duplication according to the Wiki.. > > My signature field is defined thus... > > multiValued="false" /> > > and my updateRequestProcessor as follows > > > class="

De-duplication not working as I expected - duplicates still getting into the index

2010-12-14 Thread Jason Brown
I have configured de-duplication according to the Wiki.. My signature field is defined thus... and my updateRequestProcessor as follows true false signature content org.apache.solr.update.processor.Lookup3Signature I am using

Re: using score to find high confidence duplicates

2010-10-13 Thread Matt Mitchell
No this isn't the MLT, just the standard query parser for now. I did try the heuristic approach and I might stick with that actually. I ran the process on known duplicates and created a collection of all scores. I was then able to see how well the query worked. The scores seemed focused t

Re: using score to find high confidence duplicates

2010-10-13 Thread Peter Karich
Hi, are you using moreLikeThis for that feature? I have no suggestion for a reliable threshold, I think this depends on the domain you are operating and is IMO only solvable with a heuristic. It also depends on fields, boosts, ... It could be that there is a 'score gap' between dupl

using score to find high confidence duplicates

2010-10-13 Thread Matt Mitchell
I have a solr index full of documents that contain lots of duplicates. The duplicates are not exact duplicates though. Each may vary slightly in content. After indexing, I have a bit of code that loops through the entire index just to get what I'm calling "target" documents.

Re: Duplicates

2010-07-23 Thread Pavel Minchenkov
s matched files (without files). > > > > Query: > > prop_1:val1 OR prop_2:val2 > > > > I need results (document ids): > > 1, 9 > > or > > 0, 8 > > > > 2010/7/23 Peter Karich > > > > > >> Hi Pavel! > >> >

Re: Duplicates

2010-07-23 Thread Peter Karich
1.4. >> The performance is ok, but for some situations it could be worse than >> without the patch. >> For us it works good, but others reported some exceptions >> (see the patch site: https://issues.apache.org/jira/browse/SOLR-236) >> >> >>> I n

Re: Duplicates

2010-07-23 Thread Pavel Minchenkov
eported some exceptions > (see the patch site: https://issues.apache.org/jira/browse/SOLR-236) > > > I need only to delete duplicates > > Could you give us an example what you exactly need? > (Maybe you could index each master document of the 'unique' documents > w

Re: Duplicates

2010-07-23 Thread Peter Karich
Hi Pavel! The patch can be applied to 1.4. The performance is ok, but for some situations it could be worse than without the patch. For us it works good, but others reported some exceptions (see the patch site: https://issues.apache.org/jira/browse/SOLR-236) > I need only to delete duplica

Re: Duplicates

2010-07-23 Thread Pavel Minchenkov
Thanks. Does it work with Solr 1.4 (Solr 4.0 is mentioned in the article)? What about performance? I only need to delete duplicates (I don't need a count of duplicates or to select a certain duplicate). 2010/7/23 Peter Karich > Another possibility could be the well known 'field collapse&

Re: Duplicates

2010-07-23 Thread Peter Karich
Another possibility could be the well known 'field collapse' ;-) http://wiki.apache.org/solr/FieldCollapsing Regards, Peter. > Thanks. > > If I set uniqueKey on the field, then I can save duplicates? > I need to remove duplicates only from search results. The ability to

Re: Duplicates

2010-07-23 Thread Pavel Minchenkov
Thanks. If I set uniqueKey on the field, can I still save duplicates? I need to remove duplicates only from search results. The ability to save duplicates should remain. 2010/7/23 Erick Erickson > If the field is a single token, just define the uniqueKey on it in your > schema. > &g

Re: Duplicates

2010-07-22 Thread Erick Erickson
it possible to remove duplicates in search results by a given field? > > Thanks. > > -- > Pavel Minchenkov >

Duplicates

2010-07-22 Thread Pavel Minchenkov
Hi, Is it possible to remove duplicates in search results by a given field? Thanks. -- Pavel Minchenkov

Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Neeb
Thanks guys. I will try this with some test documents, fingers crossed. And by the way, I got the minTokenLen parameter from one of the thread replies (from Erik). Cheerz, Ali -- View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using

Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Andrew Clegg
Markus Jelsma wrote: > > Well, it got me too! KMail didn't properly order this thread. Can't seem > to > find Hatcher's reply anywhere. ??!!? > Whole thread here: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tt479039

Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Markus Jelsma
Well, it got me too! KMail didn't properly order this thread. Can't seem to find Hatcher's reply anywhere. ??!!? On Tuesday 08 June 2010 22:00:06 Andrew Clegg wrote: > Andrew Clegg wrote: > > Re. your config, I don't see a minTokenLength in the wiki page for > > deduplication, is this a recent a

Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Markus Jelsma
Here's my config for the updateProcessor. It now uses another signature method but I've used TextProfileSignature as well and it works - sort of. true sig true content org.apache.solr.update.processor.Lookup3Signature Of course, you must

Re: Filtering near-duplicates using TextProfileSignature

2010-06-08 Thread Andrew Clegg
-- View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p880385.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Filtering near-duplicates using TextProfileSignature

2010-06-08 Thread Andrew Clegg
lection manually and see if it finds them? Andrew. -- View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p880379.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Filtering near-duplicates using TextProfileSignature

2010-06-08 Thread Neeb
true title,author,abstract org.apache.solr.update.processor.TextProfileSignature 3 -- Thanks in advance, -Ali -- View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p880044.html Sent

Re: Skipping duplicates in DataImportHandler based on uniqueKey

2010-05-03 Thread Andrew Clegg
Marc Sturlese wrote: > > You can use deduplication to do that. Create the signature based on the > unique field or any field you want. > Cool, thanks, I hadn't thought of that. -- View this message in context: http://lucene.472066.n3.nabble.com/Skipping-duplicates-in-DataIm

Re: Skipping duplicates in DataImportHandler based on uniqueKey

2010-05-03 Thread Marc Sturlese
You can use deduplication to do that. Create the signature based on the unique field or any field you want. -- View this message in context: http://lucene.472066.n3.nabble.com/Skipping-duplicates-in-DataImportHandler-based-on-uniqueKey-tp771559p772768.html Sent from the Solr - User mailing list

Skipping duplicates in DataImportHandler based on uniqueKey

2010-05-02 Thread Andrew Clegg
be made to behave this way? If not, would it be an easy patch? This is using the XPathEntityProcessor by the way. Thanks, Andrew. -- :: http://biotext.org.uk/ :: -- View this message in context: http://lucene.472066.n3.nabble.com/Skipping-duplicates-in-DataImportHandler-based-on-uniqueKey

Re: Solr duplicates detection!!

2010-01-29 Thread Wangsheng Mei
Sorry for sending the wrong message, this should have gone to my own mailbox :( 2010/1/30 Wangsheng Mei > Document Duplication Detection > > [image: ] Solr1.4 > > Contents > >1. Document Duplication > Detection<#1267b655a97b48f5_Document_Duplication_Detection> >2. Overview <#1267b

Solr duplicates detection!!

2010-01-29 Thread Wangsheng Mei
Document Duplication Detection [image: ] Solr1.4 Contents 1. Document Duplication Detection <#Document_Duplication_Detection> 2. Overview <#Overview> 1. Goals <#Goals> 2. Design <#Design> 3. Notes <#Notes> 4. Configuration <#Configuration> 1. solrconfig.xml <#solrconfig.

Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Andrew Clegg
just going off the source code, not > trying it out for real. > > Sure -- it won't be til next week at the earliest though. Cheers, Andrew. -- View this message in context: http://old.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp27127151p27128493.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Erik Hatcher
On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote: Thanks Erik, but I'm still a little confused as to exactly where in the Solr config I set these parameters. You'd configure them within the element, something like this: 5 The example on the wiki page uses Lookup3Signature which (p

Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Andrew Clegg
MD5 hash calculation.*/ > > There are two parameters this implementation takes: > > quantRate = params.getFloat("quantRate", 0.01f); > minTokenLen = params.getInt("minTokenLen", 2); > > Hope that helps. > > Erik > > > > * > htt

Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Erik Hatcher
On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote: I'm interested in near-dupe removal as mentioned (briefly) here: http://wiki.apache.org/solr/Deduplication However the link for TextProfileSignature hasn't been filled in yet. Does anyone have an example of using TextProfileSignature that dem

Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Andrew Clegg
ntioned in the wiki? Thanks! Andrew. -- View this message in context: http://old.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp27127151p27127151.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Finding near duplicates which searching Documents

2009-09-24 Thread Grant Ingersoll
On Sep 23, 2009, at 2:55 PM, Jason Rutherglen wrote: I don't think this handles near duplicates, which would require some of the methods mentioned recently on the Mahout list. It's pluggable and I believe the TextProfileSignature is a fuzzy implementation in Solr that was brought

Re: Finding near duplicates which searching Documents

2009-09-23 Thread Jason Rutherglen
I don't think this handles near duplicates, which would require some of the methods mentioned recently on the Mahout list. On Wed, Sep 23, 2009 at 2:59 AM, Shalin Shekhar Mangar wrote: > On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut wrote: > >> Hi, >> When we have news co

Re: Finding near duplicates which searching Documents

2009-09-23 Thread Shalin Shekhar Mangar
On Wed, Sep 23, 2009 at 3:50 PM, Ninad Raut wrote: > Is this feature included in SOLR 1.4?? > Yep. -- Regards, Shalin Shekhar Mangar.

Re: Finding near duplicates which searching Documents

2009-09-23 Thread Ninad Raut
Is this feature included in SOLR 1.4?? On Wed, Sep 23, 2009 at 3:29 PM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut >wrote: > > > Hi, > > When we have news content crawled we face a problem of same content being > > repeated in many docume

Re: Finding near duplicates which searching Documents

2009-09-23 Thread Shalin Shekhar Mangar
On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut wrote: > Hi, > When we have news content crawled we face a problem of same content being > repeated in many documents. We want to add a near duplicate document > filter > to detect such documents. Is there a way to do that in SOLR? > Look at http://wik

Finding near duplicates which searching Documents

2009-09-23 Thread Ninad Raut
Hi, When we crawl news content we face the problem of the same content being repeated in many documents. We want to add a near-duplicate document filter to detect such documents. Is there a way to do that in SOLR? Regards, Ninad Raut.

Re: dealing with duplicates

2009-08-10 Thread Avlesh Singh
D is_dup = 0 >) > ) > ) > ORDER BY views > LIMIT 10 > > can a similar query be written in lucene or do i need to structure my > index differently to be able to do such a query? > > thx much > > --joe > > > On Sat, Aug 1, 2009 at 9:15 AM, Joe Calderon

Re: dealing with duplicates

2009-08-10 Thread Joe Calderon
at, Aug 1, 2009 at 9:15 AM, Joe Calderon wrote: > hello, thanks for the response, i did take a look at that document but > in my application i actually want the duplicates, as i mentioned, the > matching text could be very different among cluster members, what > joins them together is a s

Re: dealing with duplicates

2009-08-01 Thread Joe Calderon
hello, thanks for the response, i did take a look at that document but in my application i actually want the duplicates, as i mentioned, the matching text could be very different among cluster members, what joins them together is a similar set of numeric features. currently i do a query with fq

Re: dealing with duplicates

2009-07-31 Thread Otis Gospodnetic
Joe, Maybe we can take a step back first. Would it be better if your index was cleaner and didn't have flagged duplicates in the first place? If so, have you tried using http://wiki.apache.org/solr/Deduplication ? Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls L

dealing with duplicates

2009-07-31 Thread Joe Calderon
hello all, i have a collection of a few million documents; i have many duplicates in this collection. they have been clustered with a simple algorithm, i have a field called 'duplicate' which is 0 or 1 and fields called 'description, tags, meta', documents are clustered on di

Re: Avoid duplicates in MoreLikeThis using field collapsing

2009-06-02 Thread Marc Sturlese
With the DeDuplication patch I create a signature field to control duplicates, which is an MD5 of 3 different fields: hashField = hash (fieldA + fieldB + fieldC) With MoreLikeThis I want to show fieldA There are documents that DeDuplication will not consider duplicates because fieldC was different for

Re: Avoid duplicates in MoreLikeThis using field collapsing

2009-06-02 Thread Otis Gospodnetic
But why does MLT return duplicates in the first place? That seems strange to me. If there are no duplicates in your index, how does MLT manage to return dupes? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Marc Sturlese > To

Avoid duplicates in MoreLikeThis using field collapsing

2009-05-29 Thread Marc Sturlese
Hey there, I am testing the MoreLikeThis feature (with MoreLikeThis component and with MoreLikeThis handler) and I am getting lots of duplicates. I have noticed that lots of the similar documents returned are duplicates. To avoid that I have tried to use the field collapsing patch but it's not t

Re: Duplicates results when using a non optimized index

2008-05-15 Thread Mike Klaas
On 15-May-08, at 12:50 AM, Tim Mahy wrote: Hi, yep it is a very strange problem that we never encountered before. We are uploading all the documents again to see if that solves the problem (hoping that the delete will delete also the multiple document instances) If you are re-adding ever
