Hi Roland,
I wrote an AnalyzingInfixSuggester that deduplicates data on several levels at
index time.
I will publish it in a few days on GitHub. I'll write to this thread when done.
m.
On Thursday, 26 March 2020 at 16:01:57 CET, Szűcs Roland wrote:
> Hi All,
>
> I follow the discussion of the suggester
On Wed, May 20, 2015 at 12:59 PM, Bram Van Dam wrote:
> >> Write a custom update processor and include it in your update chain.
> >> You will then have the ability to do anything you want with the entire
> >> input document before it hits the code to actually do the indexing.
>
> This sounded lik
What the Solr de-duplication offers you is to calculate, for each input
document, a hash (based on a set of fields).
You can then select two options:
- index everything; documents with the same signature will be equal
- avoid the overwriting of duplicates.
How the similarity hash is calculated is
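To make the hashing idea concrete, here is a hypothetical, minimal Java sketch of such a signature computed outside Solr: it concatenates the values of a chosen set of fields and hashes them with MD5, roughly what the stock MD5Signature does (the field names, the Map-based document, and the helper method are made up for illustration).

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;
import java.util.Map;

public class SignatureSketch {

    // Hypothetical helper: build a hex MD5 signature from the values of the
    // fields listed in signatureFields, concatenated in order.
    static String signature(Map<String, String> doc, List<String> signatureFields) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (String field : signatureFields) {
            String value = doc.get(field);
            if (value != null) {
                sb.append(value);
            }
        }
        byte[] hash = MessageDigest.getInstance("MD5")
                .digest(sb.toString().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> doc = Map.of("title", "Solr rocks", "body", "Deduplication example");
        System.out.println(signature(doc, List.of("title", "body")));
    }
}

With overwriteDupes=true the signature becomes the overwrite key, so a later document with the same hash replaces the earlier one; with overwriteDupes=false the signature is only stored in its field and duplicates remain in the index.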
On 19/05/15 14:47, Alessandro Benedetti wrote:
> Hi Bram,
> what do you mean with :
> " I
> would like it to provide the unique value myself, without having the
> deduplicator create a hash of field values " .
>
> This is not deduplication, but simple document filtering based on a
> constraint.
>
>> Write a custom update processor and include it in your update chain.
>> You will then have the ability to do anything you want with the entire
>> input document before it hits the code to actually do the indexing.
This sounded like the perfect option ... until I read Jack's comment:
>
> My und
Shawn, I was going to say the same thing, but... then I was thinking about
SolrCloud and the fact that update processors are invoked before the
document is sent to its target node, so there wouldn't be a reliable way to
tell if the input document field value exists on the target rather than
current
On 5/19/2015 3:02 AM, Bram Van Dam wrote:
> I'm looking for a way to have Solr reject documents if a certain field
> value is duplicated (reject, not overwrite). There doesn't seem to be
> any kind of unique option in schema fields.
>
> The de-duplication feature seems to make this (somewhat) poss
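For the single-node case, a minimal sketch of the "reject, not overwrite" processor suggested above could look like the following (the class name, the uniq_key field and the error message are made up; it consults only the local, already-committed index via SolrIndexSearcher.getFirstMatch, so it neither sees uncommitted documents nor addresses the SolrCloud routing concern raised earlier):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class RejectDuplicateProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object value = doc.getFieldValue("uniq_key");   // hypothetical field name
        // Look the value up in the local index; getFirstMatch returns -1 if absent.
        if (value != null
            && req.getSearcher().getFirstMatch(new Term("uniq_key", value.toString())) != -1) {
          throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
              "duplicate value for uniq_key: " + value);
        }
        super.processAdd(cmd);  // no duplicate found, pass the doc down the chain
      }
    };
  }
}

The factory would be registered in an updateRequestProcessorChain in solrconfig.xml ahead of RunUpdateProcessorFactory.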
Hi Bram,
what do you mean with :
" I
would like it to provide the unique value myself, without having the
deduplicator create a hash of field values " .
This is not deduplication, but simple document filtering based on a
constraint.
In the case you want de-duplication ( which seemed from your ver
Should the old Signature code be removed? Given that the goal is to
have everyone use SolrCloud, maybe this kind of landmine should be
removed?
On Fri, Jul 27, 2012 at 8:43 AM, Markus Jelsma
wrote:
> This issue doesn't really describe your problem but a more general problem of
> distributed dedu
This issue doesn't really describe your problem but a more general problem of
distributed deduplication:
https://issues.apache.org/jira/browse/SOLR-3473
-Original message-
> From:Daniel Brügge
> Sent: Fri 27-Jul-2012 17:38
> To: solr-user@lucene.apache.org
> Subject: Deduplication in
: Q1. Is it possible to pass *analyzed* content to the
:
: public abstract class Signature {
No, analysis happens as the documents are being written to the lucene
index, well after the UpdateProcessors have had a chance to interact with
the values.
: Q2. Method calculate() is using concatenat
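Since a Signature only ever sees the raw field values, any normalisation has to happen inside the Signature itself. Below is a hypothetical sketch written against the add()/getSignature() methods of recent Solr versions (older versions exposed calculate() instead, as in Q2 above); it lower-cases and collapses whitespace as a crude stand-in for analysis before hashing:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.solr.update.processor.Signature;

// Hypothetical Signature that applies a poor man's "analysis"
// (lower-casing and whitespace collapsing) before hashing, since the
// real analyzers only run later, when the document hits the index.
public class NormalizingMd5Signature extends Signature {

  private final MessageDigest digest;

  public NormalizingMd5Signature() {
    try {
      digest = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void add(String content) {
    String normalized = content.toLowerCase().replaceAll("\\s+", " ").trim();
    digest.update(normalized.getBytes(StandardCharsets.UTF_8));
  }

  @Override
  public byte[] getSignature() {
    return digest.digest();
  }
}

It would then be plugged in through the signatureClass parameter of SignatureUpdateProcessorFactory.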
> TermsComponent maybe?
>
> or faceting?
> q=*:*&facet=true&facet.field=signatureField&defType=lucene&rows=0&start=0
>
> if you append &facet.mincount=1 to the above URL you can
> see your duplications
>
After re-reading your message: sometimes you want to show duplicates, sometimes
you don't wan
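A SolrJ version of the facet query quoted above, purely as an illustration (the collection name, base URL and signatureField are assumed); using facet.mincount=2 instead of 1 narrows the list to signature values that actually occur more than once:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FindDuplicateSignatures {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(0);                       // only the facet counts matter
      q.setFacet(true);
      q.addFacetField("signatureField");  // assumed name of the dedup hash field
      q.setFacetMinCount(2);              // only values shared by 2+ documents
      q.setFacetLimit(-1);                // no cap on the number of returned values

      QueryResponse rsp = client.query("mycollection", q);
      FacetField dupes = rsp.getFacetField("signatureField");
      for (FacetField.Count c : dupes.getValues()) {
        System.out.println(c.getName() + " occurs " + c.getCount() + " times");
      }
    }
  }
}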
> Basically for some uses cases I would like to show
> duplicates for other I
> wanted them ignored.
>
> If I have overwriteDupes=false and I just create the dedup
> hash how can I
> query for only unique hash values... i.e. something like a
> SQL group by.
TermsComponent maybe?
or faceting?
q
- Original Message -
>
>> From: Martijn v Groningen
>> To: solr-user@lucene.apache.org
>> Sent: Thu, November 26, 2009 3:19:40 AM
>> Subject: Re: Deduplication in 1.4
>>
>> Field collapsing has been used by many in their production
>> environment.
>
> Got any po
Hi Martijn,
- Original Message
> From: Martijn v Groningen
> To: solr-user@lucene.apache.org
> Sent: Thu, November 26, 2009 3:19:40 AM
> Subject: Re: Deduplication in 1.4
>
> Field collapsing has been used by many in their production
> environment.
Got any poi
Field collapsing has been used by many in their production
environment. Over the last few months the stability of the patch has grown as
quite a few bugs were fixed. The only big feature missing currently is
caching of the collapsing algorithm. I'm currently working on that and
I will put it in a new patch in
Hey Otis,
Yep, I realized this myself after playing some with the dedupe feature
yesterday.
So it does look like field collapsing is pretty much what I need.
Any idea on how close it is to being production-ready?
Thanks,
-Chak
Otis Gospodnetic wrote:
>
> Hi,
>
> As far as I know, the point of
Hi,
As far as I know, the point of deduplication in Solr (
http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document
before indexing it in order to avoid duplicates in the index in the first place.
What you are describing is closer to field collapsing patch in SOLR-236.
Ot
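For the query-time route Otis describes, the SOLR-236 field collapsing patch later evolved into result grouping and the CollapsingQParserPlugin that ship with Solr; a hedged SolrJ sketch collapsing on an assumed signatureField (the field must be single-valued) looks like this:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CollapseOnSignature {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      SolrQuery q = new SolrQuery("*:*");
      // Keep only one document per signature value at query time,
      // leaving all duplicates in the index itself.
      q.addFilterQuery("{!collapse field=signatureField}");
      QueryResponse rsp = client.query(q);
      System.out.println("distinct signatures: " + rsp.getResults().getNumFound());
    }
  }
}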
I've seen similar errors when large background merges happen while
looping in a result set. See http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/
On Jan 9, 2009, at 12:50 PM, Mark Miller wrote:
You're basically writing segments more often now, and somehow
Hey Mark,
Sorry I was not specific enough; I meant that I have, and always have had,
autoCommit=false.
I will do some more traces and tests. I will post if I have anything new and
important to mention.
Thanks.
Marc Sturlese wrote:
>
> Hey Shalin,
>
> In the beginning (when the error was appeari
You're basically writing segments more often now, and somehow avoiding a
longer merge, I think. Also, deduplication is likely adding
enough extra data to your index to hit a sweet spot where a merge takes too
long. Or something to that effect - MySQL is especially sensitive to
timeouts when
Hey Shalin,
In the beginning (when the error was appearing) I had
32
and no maxBufferedDocs set.
Now I have:
32
50
I think that by setting maxBufferedDocs to 50 I am forcing more disk writing
than I would like... but at least it works fine (but a bit slower, obviously).
I keep saying that the most
On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese wrote:
>
> hey there,
> I hadn't autoCommit set to true but I have it sorted! The error stopped
> appearing after setting the property maxBufferedDocs in solrconfig.xml. I
> can't exactly understand why but it just worked.
> Anyway, maxBufferedDocs
hey there,
I hadn't autoCommit set to true but I have it sorted! The error stopped
appearing after setting the property maxBufferedDocs in solrconfig.xml. I
can't exactly understand why but it just worked.
Anyway, maxBufferedDocs is deprecated; would ramBufferSizeMB do the same?
Thanks
Marc
I can't imagine why dedupe would have anything to do with this, other
than what was said: it perhaps is taking a bit longer to get a document
to the db, and it times out (maybe a long signature calculation?). Have
you tried changing your MySQL settings to allow for a longer timeout?
(sorry, I'm
Hey there,
I have been stuck on this problem for 3 days and have no idea how to sort it.
I am using the nightly from a week ago, MySQL, and this driver and URL:
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/my_db"
I can use the deduplication patch with indexes of 200,000 docs with no problem.
Whe
Thanks, I will have a look at my JdbcDataSource. Anyway, it's weird because
using the 1.3 release I don't have that problem...
Shalin Shekhar Mangar wrote:
>
> Yes, initially I figured that we are accidentally re-using a closed data
> source. But Noble has pinned it right. I guess you can try look
Yes, initially I figured that we are accidentally re-using a closed data
source. But Noble has pinned it right. I guess you can try looking into your
JDBC driver's documentation for a setting which increases the connection
alive-ness.
On Mon, Jan 5, 2009 at 5:29 PM, Noble Paul നോബിള് नोब्ळ् <
nob
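Purely as an illustration of the kind of driver setting meant here (untested for this particular failure): MySQL Connector/J has, among others, the netTimeoutForStreamingResults and socketTimeout URL parameters, which could be appended to the JDBC URL already used in this thread; with DIH the same parameters would go on the url attribute of the dataSource.

import java.sql.Connection;
import java.sql.DriverManager;

public class MySqlTimeoutExample {
  public static void main(String[] args) throws Exception {
    // Illustrative URL only: raise the server-side net_write_timeout applied to
    // streaming result sets (value in seconds) and disable the client-side
    // socket read timeout (0 = no timeout).
    String url = "jdbc:mysql://localhost/my_db"
        + "?netTimeoutForStreamingResults=3600"
        + "&socketTimeout=0";
    try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
      System.out.println("connected, closed = " + conn.isClosed());
    }
  }
}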
I guess the indexing of a doc is taking too long (maybe because of
the de-dup patch) and the resultset gets closed automatically (timed
out).
--Noble
On Mon, Jan 5, 2009 at 5:14 PM, Marc Sturlese wrote:
>
> Doing this fix I get the same error :(
>
> I am going to try to set up the last nightly b
Doing this fix I get the same error :(
I am going to try to set up the last nightly build... let's see if I have
better luck.
The thing is it stops indexing at around doc number 150,000... and gives me
that MySQL exception error... Without the DeDuplication patch I can index 2
million docs without prob
Yes, I meant the 05/01/2008 build. The fix is a one-line change.
Add the following as the last line of DataConfig.Entity.clearCache():
dataSrc = null;
On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese wrote:
>
> Shalin, you mean I should test the 05/01/2008 nightly? Maybe it works with
> this one? If the
Shalin, you mean I should test the 05/01/2008 nightly? Maybe it works with
this one? If the fix you did is not really big, can you tell me where in the
source it is and what it is for? (I have been debugging and tracing the
dataimporthandler source a lot and I would like to know what the improvement is
ab
Yeah, it looks like it, but... if I don't use the DeDuplication patch everything
works perfectly. I can create my indexes using full import and delta import
without problems. The JdbcDataSource of the nightly is pretty similar to the
1.3 release...
The DeDuplication patch doesn't touch the dataimporthandl
Marc, I've just committed a fix for what may have caused the bug. Can you use
svn trunk (or the next nightly build) and confirm?
On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള് नोब्ळ् <
noble.p...@gmail.com> wrote:
> looks like a bug w/ DIH with the recent fixes.
> --Noble
>
> On Mon, Jan 5, 2009
looks like a bug w/ DIH with the recent fixes.
--Noble
On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese wrote:
>
> Hey there,
> I was using the Deduplication patch with Solr 1.3 release and everything was
> working perfectly. Now I upgraded to a nightly build (20th December) to be
> able to use new