Re: deduplication of suggester results are not enough

2020-03-26 Thread Michal Hlavac
Hi Roland, I wrote an AnalyzingInfixSuggester that deduplicates data on several levels at index time. I will publish it on GitHub in a few days. I'll write to this thread when done. m. On štvrtok 26. marca 2020 16:01:57 CET Szűcs Roland wrote: > Hi All, > > I follow the discussion of the suggester

Re: Deduplication

2015-05-20 Thread Shalin Shekhar Mangar
On Wed, May 20, 2015 at 12:59 PM, Bram Van Dam wrote: > >> Write a custom update processor and include it in your update chain. > >> You will then have the ability to do anything you want with the entire > >> input document before it hits the code to actually do the indexing. > > This sounded lik

Re: Deduplication

2015-05-20 Thread Alessandro Benedetti
What the Solr de-duplication offers you is to calculate, for each input document, a hash (based on a set of fields). You can then select between two options: - index everything; documents with the same signature will be equal - avoid the overwriting of duplicates. How the similarity hash is calculated is
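The two options Alessandro describes map onto the `overwriteDupes` flag of `SignatureUpdateProcessorFactory` in `solrconfig.xml`. A minimal sketch of such a chain (the field names `name,features,cat` and the chain name are illustrative, not from this thread):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- field that receives the computed hash -->
    <str name="signatureField">signature</str>
    <!-- true: later docs with the same signature overwrite earlier ones;
         false: duplicates are indexed side by side -->
    <bool name="overwriteDupes">false</bool>
    <!-- example fields the hash is computed over -->
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

`MD5Signature` and `TextProfileSignature` are alternative `signatureClass` choices, trading exactness for fuzzy matching.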

Re: Deduplication

2015-05-20 Thread Bram Van Dam
On 19/05/15 14:47, Alessandro Benedetti wrote: > Hi Bram, > what do you mean with : > " I > would like it to provide the unique value myself, without having the > deduplicator create a hash of field values " . > > This is not deduplication, but simple document filtering based on a > constraint. >

Re: Deduplication

2015-05-20 Thread Bram Van Dam
>> Write a custom update processor and include it in your update chain. >> You will then have the ability to do anything you want with the entire >> input document before it hits the code to actually do the indexing. This sounded like the perfect option ... until I read Jack's comment: > > My und
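The "reject, not overwrite" behavior such a custom processor would implement can be modeled as below. This is a toy sketch of the logic only, not Solr's actual `UpdateRequestProcessor` API: the processor would remember each unique-field value and refuse documents whose value was already seen.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of "reject, not overwrite": remember each unique-field value
// and refuse documents whose value has already been indexed. A real
// implementation would live inside a custom UpdateRequestProcessor.
public class RejectDuplicates {
    private final Set<String> seen = new HashSet<>();

    // returns true if the document was accepted, false if rejected as a dup
    public boolean add(String uniqueValue) {
        return seen.add(uniqueValue); // Set.add is false when already present
    }

    public static void main(String[] args) {
        RejectDuplicates index = new RejectDuplicates();
        System.out.println(index.add("doc-1")); // accepted
        System.out.println(index.add("doc-1")); // rejected: duplicate value
    }
}
```

As Jack points out further down the thread, in SolrCloud this kind of in-memory state is unreliable, because the processor may run on a different node than the one the document is finally routed to.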

Re: Deduplication

2015-05-19 Thread Jack Krupansky
Shawn, I was going to say the same thing, but... then I was thinking about SolrCloud and the fact that update processors are invoked before the document is sent to its target node, so there wouldn't be a reliable way to tell if the input document field value exists on the target rather than current

Re: Deduplication

2015-05-19 Thread Shawn Heisey
On 5/19/2015 3:02 AM, Bram Van Dam wrote: > I'm looking for a way to have Solr reject documents if a certain field > value is duplicated (reject, not overwrite). There doesn't seem to be > any kind of unique option in schema fields. > > The de-duplication feature seems to make this (somewhat) poss

Re: Deduplication

2015-05-19 Thread Alessandro Benedetti
Hi Bram, what do you mean with: " I would like it to provide the unique value myself, without having the deduplicator create a hash of field values " . This is not deduplication, but simple document filtering based on a constraint. In the case you want de-duplication ( which seemed from your ver

Re: Deduplication in SolrCloud

2012-07-27 Thread Lance Norskog
Should the old Signature code be removed? Given that the goal is to have everyone use SolrCloud, maybe this kind of landmine should be removed? On Fri, Jul 27, 2012 at 8:43 AM, Markus Jelsma wrote: > This issue doesn't really describe your problem but a more general problem of > distributed dedu

RE: Deduplication in SolrCloud

2012-07-27 Thread Markus Jelsma
This issue doesn't really describe your problem but a more general problem of distributed deduplication: https://issues.apache.org/jira/browse/SOLR-3473 -Original message- > From:Daniel Brügge > Sent: Fri 27-Jul-2012 17:38 > To: solr-user@lucene.apache.org > Subject: Deduplication in

Re: Deduplication questions

2011-04-11 Thread Chris Hostetter
: Q1. Is it possible to pass *analyzed* content to the : : public abstract class Signature { No, analysis happens as the documents are being written to the Lucene index, well after the UpdateProcessors have had a chance to interact with the values. : Q2. Method calculate() is using concatenat
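The point about pre-analysis values and concatenation can be illustrated with a stand-alone sketch. This is NOT Solr's exact `MD5Signature` code, only the idea: the raw (un-analyzed) field values are concatenated and the result is hashed.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

// Toy illustration of a Signature.calculate(): concatenate the raw,
// pre-analysis field values and hash the concatenation with MD5.
public class SignatureSketch {
    public static String calculate(List<String> fieldValues) {
        try {
            StringBuilder sb = new StringBuilder();
            for (String v : fieldValues) {
                sb.append(v); // values arrive un-analyzed, as Hoss notes
            }
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(sb.toString().getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest).toString(16);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // identical raw values always produce the identical signature
        System.out.println(calculate(List.of("Solr", "dedup"))
                .equals(calculate(List.of("Solr", "dedup"))));
    }
}
```

One consequence of plain concatenation is that the value lists `["ab", "c"]` and `["a", "bc"]` hash identically; a real implementation may want a separator between values.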

Re: Deduplication

2010-05-19 Thread Ahmet Arslan
> TermsComponent maybe? > > or faceting? > q=*:*&facet=true&facet.field=signatureField&defType=lucene&rows=0&start=0 > > if you append &facet.mincount=1 to the above URL you can > see your duplications > After re-reading your message: sometimes you want to show duplicates, sometimes you don't wan
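The faceting trick above is essentially a GROUP BY on the signature field with a count filter. A rough stand-alone illustration of what the facet counts give you (hypothetical signature values, not a Solr API):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Emulates facet.field=signatureField: group documents by signature
// value and keep only the values that occur more than once, i.e. the
// duplicated signatures.
public class FacetSketch {
    public static Map<String, Long> duplicates(List<String> signatures) {
        return signatures.stream()
                .collect(Collectors.groupingBy(s -> s, Collectors.counting()))
                .entrySet().stream()
                .filter(e -> e.getValue() > 1) // keep only duplicated values
                .collect(Collectors.toMap(Map.Entry::getKey,
                                          Map.Entry::getValue));
    }

    public static void main(String[] args) {
        // "a1" appears twice, so only it is reported as a duplicate
        System.out.println(duplicates(List.of("a1", "b2", "a1", "c3")));
    }
}
```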

Re: Deduplication

2010-05-19 Thread Ahmet Arslan
> Basically for some use cases I would like to show > duplicates, for others I > want them ignored. > > If I have overwriteDupes=false and I just create the dedup > hash, how can I > query for only unique hash values... i.e. something like a > SQL GROUP BY. TermsComponent maybe? or faceting? q

Re: Deduplication in 1.4

2009-11-26 Thread Martijn v Groningen
Message > >> From: Martijn v Groningen >> To: solr-user@lucene.apache.org >> Sent: Thu, November 26, 2009 3:19:40 AM >> Subject: Re: Deduplication in 1.4 >> >> Field collapsing has been used by many in their production >> environment. > > Got any po

Re: Deduplication in 1.4

2009-11-26 Thread Otis Gospodnetic
Hi Martijn, - Original Message > From: Martijn v Groningen > To: solr-user@lucene.apache.org > Sent: Thu, November 26, 2009 3:19:40 AM > Subject: Re: Deduplication in 1.4 > > Field collapsing has been used by many in their production > environment. Got any poi

Re: Deduplication in 1.4

2009-11-26 Thread Martijn v Groningen
Field collapsing has been used by many in their production environment. Over the last few months the stability of the patch grew as quite some bugs were fixed. The only big feature missing currently is caching of the collapsing algorithm. I'm currently working on that and I will put it in a new patch in

Re: Deduplication in 1.4

2009-11-25 Thread KaktuChakarabati
Hey Otis, Yep, I realized this myself after playing some with the dedupe feature yesterday. So it does look like Field collapsing is what I need pretty much. Any idea on how close it is to being production-ready? Thanks, -Chak Otis Gospodnetic wrote: > > Hi, > > As far as I know, the point of

Re: Deduplication in 1.4

2009-11-24 Thread Otis Gospodnetic
Hi, As far as I know, the point of deduplication in Solr ( http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document before indexing it in order to avoid duplicates in the index in the first place. What you are describing is closer to field collapsing patch in SOLR-236. Ot

Re: Deduplication patch not working in nightly build

2009-01-10 Thread Grant Ingersoll
I've seen similar errors when large background merges happen while looping in a result set. See http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/ On Jan 9, 2009, at 12:50 PM, Mark Miller wrote: You're basically writing segments more often now, and somehow

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese
Hey Mark, Sorry I was not specific enough; I meant that I have, and always had, autoCommit=false. I will do some more traces and tests. Will post if I have any new important thing to mention. Thanks. Marc Sturlese wrote: > > Hey Shalin, > > In the beginning (when the error was appeari

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Mark Miller
You're basically writing segments more often now, and somehow avoiding a longer merge, I think. Also, deduplication is likely adding enough extra data to your index to hit a sweet spot where a merge is too long. Or something to that effect - MySQL is especially sensitive to timeouts when

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese
Hey Shalin, In the beginning (when the error was appearing) I had 32 and no maxBufferedDocs set. Now I have: 32 50 I think that setting maxBufferedDocs to 50 I am forcing more disk writing than I would like... but at least it works fine (but a bit slower, obviously). I keep saying that the most

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Shalin Shekhar Mangar
On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese wrote: > > hey there, > I didn't have autoCommit set to true, but I have it sorted! The error stopped > appearing after setting the property maxBufferedDocs in solrconfig.xml. I > can't exactly understand why, but it just worked. > Anyway, maxBufferedDocs

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese
hey there, I didn't have autoCommit set to true, but I have it sorted! The error stopped appearing after setting the property maxBufferedDocs in solrconfig.xml. I can't exactly understand why, but it just worked. Anyway, maxBufferedDocs is deprecated; would ramBufferSizeMB do the same? Thanks Marc
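For reference, ramBufferSizeMB is the size-based replacement for the deprecated count-based maxBufferedDocs trigger. In the solrconfig.xml of that era the setting sat under the indexDefaults section, roughly like this (the value is illustrative):

```xml
<indexDefaults>
  <!-- flush buffered updates to disk once they reach this many MB of RAM;
       replaces the deprecated maxBufferedDocs document-count trigger -->
  <ramBufferSizeMB>32</ramBufferSizeMB>
</indexDefaults>
```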

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Mark Miller
I can't imagine why dedupe would have anything to do with this, other than what was said; it perhaps is taking a bit longer to get a document to the db, and it times out (maybe a long signature calculation?). Have you tried changing your MySQL settings to allow for a longer timeout? (sorry, I'm
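On the client side, MySQL Connector/J accepts timeout-related properties directly in the JDBC URL, so one hedged way to follow Mark's suggestion in a DIH dataSource would be (values are illustrative, in milliseconds; batchSize="-1" is the usual DIH setting to make MySQL stream results instead of buffering them):

```xml
<dataSource driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/my_db?connectTimeout=60000&amp;socketTimeout=600000"
            batchSize="-1"/>
```

Server-side variables such as wait_timeout and net_write_timeout may also need raising if the connection is dropped by MySQL itself.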

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese
Hey there, I have been stuck on this problem for 3 days and have no idea how to sort it. I am using the nightly from a week ago, MySQL, and this driver and url: driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/my_db" I can use the deduplication patch with indexes of 200,000 docs and no problem. Whe

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese
Thanks, I will have a look at my JdbcDataSource. Anyway it's weird because using the 1.3 release I don't have that problem... Shalin Shekhar Mangar wrote: > > Yes, initially I figured that we are accidentally re-using a closed data > source. But Noble has pinned it right. I guess you can try look

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Shalin Shekhar Mangar
Yes, initially I figured that we are accidentally re-using a closed data source. But Noble has pinned it right. I guess you can try looking into your JDBC driver's documentation for a setting which increases the connection alive-ness. On Mon, Jan 5, 2009 at 5:29 PM, Noble Paul നോബിള്‍ नोब्ळ् < nob

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Noble Paul നോബിള്‍ नोब्ळ्
I guess the indexing of a doc is taking too long (may be because of the de-dup patch) and the resultset gets closed automatically (timed out) --Noble On Mon, Jan 5, 2009 at 5:14 PM, Marc Sturlese wrote: > > Doing this fix I get the same error :( > > I am going to try to set up the last nightly b

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese
Doing this fix I get the same error :( I am going to try to set up the last nightly build... let's see if I have better luck. The thing is it stops indexing at doc number 150,000 approx... and gives me that MySQL exception error... Without the DeDuplication patch I can index 2 million docs without prob

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Shalin Shekhar Mangar
Yes, I meant the 05/01/2008 build. The fix is a one-line change: add the following as the last line of DataConfig.Entity.clearCache(): dataSrc = null; On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese wrote: > > Shalin you mean I should test the 05/01/2008 nightly? maybe with this one > works? If the

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese
Shalin, you mean I should test the 05/01/2008 nightly? Maybe with this one it works? If the fix you did is not really big, can you tell me where in the source it is and what it is for? (I have been debugging and tracing the dataimporthandler source a lot and I would like to know what the improvement is ab

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese
Yeah, looks like, but... if I don't use the DeDuplication patch everything works perfectly. I can create my indexes using full import and delta import without problems. The JdbcDataSource of the nightly is pretty similar to the 1.3 release... The DeDuplication patch doesn't touch the dataimporthandl

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Shalin Shekhar Mangar
Marc, I've just committed a fix for what may have caused the bug. Can you use svn trunk (or the next nightly build) and confirm? On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् < noble.p...@gmail.com> wrote: > looks like a bug w/ DIH with the recent fixes. > --Noble > > On Mon, Jan 5, 2009

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Noble Paul നോബിള്‍ नोब्ळ्
looks like a bug w/ DIH with the recent fixes. --Noble On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese wrote: > > Hey there, > I was using the Deduplication patch with Solr 1.3 release and everything was > working perfectly. Now I upgraded to a nigthly build (20th december) to be > able to use new