On 6/9/2018 1:15 AM, S G wrote:
That means if I send {"color":"red", "size":"L"} once,
UUIDUpdateProcessorFactory will generate an "id" X, and if I send the same
document {"color":"red", "size":"L"} again, UUIDUpdateProcessorFactory will
not know that it's the same document and will generate an "id" Y.
We do not want to generate the "id" ourselves and hence were looking for
something that would generate the "id" automatically.
The UUIDUpdateProcessorFactory documentation says nothing about the
automatic "id" generation process identifying whether the document received
is the same as an existing document or not.
First, your assumption is correct. It would be A Bad Thing if two
identical UUIDs were generated.
Is this SolrCloud? If so, then the deduplication idea won't work. The
problem is that the uuid is used for routing, and there is a decent (1
- 1/numShards) chance that the two "identical" docs would end up on
different shards.
Hi,
Suppose the id field is the UUID-linked field in the configuration: if it is
missing from a document coming in to be indexed, then the processor will
generate a UUID and set it in the id field. However, if the id field is
present with some value, it will not.
Kindly refer
http://lucene.apache.org/solr/5_5_0/solr-co
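For reference, the behavior described maps to an update chain along these lines in solrconfig.xml; this is a minimal sketch, assuming the uniqueKey field is named "id":

  <updateRequestProcessorChain name="uuid">
    <!-- generates a UUID into "id" only when the incoming doc has no "id" value -->
    <processor class="solr.UUIDUpdateProcessorFactory">
      <str name="fieldName">id</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>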
Hi,
Is it correct to assume that UUIDUpdateProcessorFactory will produce 2
documents if the same document is indexed twice without the "id" field?
And to avoid such a thing, we can use the technique mentioned in
https://wiki.apache.org/solr/Deduplication ?
Thanks
SG
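For reference, the Deduplication wiki page linked above configures a SignatureUpdateProcessorFactory roughly as below; the field names here are illustrative, chosen to match the {"color","size"} example from this thread, and the SolrCloud routing caveat raised earlier still applies:

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <!-- field that receives the computed hash; using the uniqueKey collapses repeats -->
      <str name="signatureField">id</str>
      <bool name="overwriteDupes">true</bool>
      <!-- illustrative: hash the fields that define document identity -->
      <str name="fields">color,size</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>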
Hi there,
We are in solr 6.0.1, here is our solr schema and config:
_unique_key
solr.StrField
32766
[^\w-\.]
_
When having the above configuration and doing the following operations, we
will see duplicate documents (two documents have the same _unique_key):
1. Add document: ...
I'm using solr 4.10.0. I'm using the "id" field as the unique key - it is passed
in with the document when ingesting the documents into solr. When querying
on different shards, I get duplicate documents with different "_version_".
Out of approx. millions of these documents...
Sounds a lot like multi-tenancy, where you don't want the document
frequencies of one tenant to influence the query relevancy scores for other
tenants.
No ready solution.
Although, I have thought of a simplified document scoring using just tf and
leaving out df/idf. Not as good as tf*idf or BM25 scoring...
Hey Solr people:
Suppose that we did not want to break up our document set into separate
indexes, but had certain cases where many versions of a document were not
relevant for certain searches.
I guess this could be thought of as an "authorization" class of problem,
however it is not that...
Thanks. Okay, I have done what you suggested, i.e. removed the overwrite=true,
which should fall back to Solr's default value. I've also tried a re-index
and left it to run for a few days; so far so good, nothing indicating
duplicates, so as you say, it could just be a bug in my code.
Will continue to monitor.
Unfortunately, it has never changed. The issue can take some time
to show itself, although I think there were logic issues with the way I
update documents in my index.
I first do a full purge and reindex of all items without issue.
Over time, I only index items that have changed/are new since the initial indexing...
OK, this makes no sense whatsoever, so I'm missing something.
commitWithin shouldn't matter at all; there's code to handle multiple
updates between commits.
I'm _really_ shooting in the dark here, but... did you perhaps change the
definition from the default "id" to "key" without blowing away the index?
Thanks for the suggestions. No, not using MERGEINDEXES nor
MapReduceIndexerTool.
I've pasted the XML in case there is something broken there (cut
down for brevity, i.e. the "..."):
123456789/3 | Test Submission | Test Submission | 1 | 1 | Test Collection |
test collection|||Test Collection | Test Collection | young, ha...
I'm wondering if the commitWithin is causing issues.
Are you by any chance using the MERGEINDEXES
core admin call? Or using MapReduceIndexerTool?
Neither of those deletes duplicates.
This is a fundamental part of Solr though, so it's
virtually certain that there's some innocent-seeming
thing you're doing that's causing this...
Best,
Erick
At query time, you could externally roll up the dups when they have the
same signature.
If you define your use case, it might be easier...
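One hypothetical way to roll up dups at query time, assuming an indexed, single-valued field named "signature" holds the hash: Solr result grouping can keep one document per signature value, e.g.

  /select?q=*:*&group=true&group.field=signature&group.main=true

group.main=true flattens the groups back into an ordinary result list, so the client sees one representative per signature.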
On 9/11/2015 9:10 AM, Mr Havercamp wrote:
> fieldType def: ... sortMissingLast="true" />
> It is not SolrCloud.
As long as it's not a distributed index, I can't think of any problem
those field/type definitions might cause. Even if it were distributed
and you had the same document...
Hi Shawn,
Thanks for your response.
fieldType def: ... sortMissingLast="true" />
It is not SolrCloud.
Cheers
Hayden
On 9/11/2015 8:25 AM, Mr Havercamp wrote:
> Running 4.8.1. I am experiencing the same problem where I get duplicates on
> index update despite using overwrite=true when adding existing documents.
> My duplicate ratio is a lot higher, with maybe 25-50% of records having
> duplicates (and as the index...
It looks like updating documents is causing it sporadically. Going to try
deleting the document and then updating.
I'm using solr 4.10.2. I'm using the "id" field as the unique key - it is passed in
with the document when ingesting the documents into solr. When querying I get
duplicate documents with different "_version_". Out of approx. 25K unique
documents ingested into solr...
select?q=id:%22mongo.com-e25a2-11e3-8a73-0026b9414f30%22&wt=xml&shards.info=true
Response (shards.info, per shard):
numFound=*1*  maxScore=17.853292  time=3
numFound=*1*  maxScore=17.850622  time=2
numFound=0  maxScore=0.0  time=3
numFound=0  maxScore=0.0  time=4
numFound=0  maxScore=0.0  time=19
Hmmm, with that setup you should _not_ be getting
duplicate documents.
So, when you see duplicate documents, you're seeing
the exact same UUID on two shards, correct? My best
guess is that you've done something innocent-seeming
(that perhaps you forgot!) that resulted in this. Other...
...ID from mongodb, and the ID generated is the same while we are doing an
update as well, using the same code.
We are unable to guess the root cause for having duplicate documents in
multiple shards. Also, it looks like reindexing is the only solution for
removing the duplicates.
...range, and the documents will be assigned
to a shard based on the key range their hash key falls into.
Reitzel,
The uuid is generated during update, and it is unique and not a new id for
the document. Also, having a shard-specific routekey[env] is not possible in our
case.
Thanks,
Senthil
...the distribution of data.
From: Reitzel, Charles
Subject: RE: Solr Cloud: Duplicate documents in multiple shards
When are you generating the UUID exactly? If you set the unique ID field on
an "update...
Unable to delete by passing distrib=false as well. Also, it is difficult to
identify those duplicate documents among the 130 million.
Is there a way we can see the generated hash keys and map them to the
specific shards?
...would have gone to multiple shards. Do you have any
suggestions for fixing this, or do we need to completely rebuild the index?
When the routing key is compositeId, should we explicitly set "!" with the
shard key?
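For illustration only: with the compositeId router, the "!" separator goes inside the document id itself, and Solr hashes the prefix to pick the shard, so ids sharing a prefix land on the same shard. A hypothetical id:

  tenantA!doc42    (the "tenantA" prefix is hashed for routing; "doc42" is the per-tenant id)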
...setup, and the routing key is set as "compositeId".
Senthil
You need to store the color field as a multivalued stored field. You have to
do pagination manually. If you are worried about that, then use a database: have
a table with Product Name and Color, and you could retrieve data with pagination.
Still, if you want to achieve it via Solr, have a separate record for every
product...
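A schema.xml sketch of the multivalued stored field being suggested; the field and type names are assumed, not taken from the thread:

  <!-- assumption: a "string" fieldType (solr.StrField) is already defined -->
  <field name="color" type="string" indexed="true" stored="true" multiValued="true"/>

Each product document then carries all of its colors, and the application iterates over them when rendering.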
Look for the presentations online. You are not the first store to use Solr;
there are some explanations around. Try the one from Gilt, but I think there
were more.
You will want to store data at the lowest meaningful level of search
granularity. So, in your case, it might be ProductVariation (shoes+color...
I was hoping to do this from within Solr; that way I don't have to manually
mess around with pagination. The number of items on each page would be
nondeterministic.
Have a multivalued stored 'color' field and just iterate over it outside of
Solr.
How would I go about doing something like this? Not sure if this is something
that can be accomplished on the index side or if it's something that should be
done in our application.
Say we are an online store for shoes and we are selling Product A in red, blue
and green. Is there a way, when we search...
Hi folks,
We have a use case where I have 2 solr indexes with the same schema but
different fields populated, for example:
Common schema: ... // Unique key
Now I have one index which stores the information about products (the first 5
fields). This index is built every 2 days.
I have a 2nd index...
...stay away from a tokenized (text) key.
You could also get duplicates by merging cores, or if your "add" has
allowDups="true" or overwrite="false".
-- Jack Krupansky
Your unique key field should be of type "string", not a tokenized type.
Erik
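Concretely, that advice amounts to something like the following in schema.xml; a sketch assuming the field name from this thread and a standard solr.StrField type:

  <!-- "string" here must map to solr.StrField, i.e. no tokenization -->
  <field name="uniquekey" type="string" indexed="true" stored="true"/>
  <uniqueKey>uniquekey</uniqueKey>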
I have a uniquekey set in my schema; however, I am still getting duplicated
documents added. Can anyone provide any insight into why this may be happening?
This is in my schema.xml:
<uniqueKey>uniquekey</uniqueKey>
On startup I get this message in catalina.out:
INFO: unique key field: uniquekey
However, you...
You're probably talking about a custom update handler here. That
way you can do a document ID lookup - that is, just see if the
incoming document ID is in the index already and throw
the document away if you find one. This should be very
efficient, much more efficient than making a separate query
for each...
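For readers hitting this today: Solr 6.4 and later ship an update processor implementing exactly this skip-if-exists behavior, so no custom handler is needed there. A minimal chain sketch (chain name illustrative):

  <updateRequestProcessorChain name="skipexisting">
    <processor class="solr.SkipExistingDocumentsProcessorFactory">
      <!-- leave existing documents alone; only brand-new ids get indexed -->
      <bool name="skipInsertIfExists">true</bool>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>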
Man,
Does overwrite=false work for you?
http://wiki.apache.org/solr/UpdateXmlMessages#add.2BAC8-replace_documents
Regards
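For reference, overwrite is an attribute of the XML <add> message documented at the link above; a minimal sketch with invented field values. One caution: overwrite=false only skips the deletion of the old copy, so re-sending an existing id yields two documents rather than an ignored add.

  <add overwrite="false">
    <doc>
      <!-- illustrative document; with overwrite=false this id is not checked against the index -->
      <field name="id">42</field>
      <field name="title">example</field>
    </doc>
  </add>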
People,
I am asking for your help with solr.
When a document is sent to solr and such a document already exists in its
index (by its ID), then the new doc replaces the old one.
But I don't want to automatically replace documents - just ignore it and
proceed to the next one. How can I configure solr to do so?
Omri Cohen wrote:
What you need to do is to calculate some HASH (using any message digest
algorithm you want, md5, sha-1 and so on), then do some reading on solr
field collapse capabilities. Should n...
Would you care to even index the duplicate documents? Finding duplication in
content fields would not be as easy as in some untokenized/keyword field.
Maybe you could do this filtering at indexing time, before sending the
document to SOLR. Then the question comes: which one document should go (from a...
From: Pranav Prakash
Date: Thu, Jun 23, 2011 at 12:26 PM
Subject: Removing duplicate documents from search results
How can I remove very similar documents from search results?
My scenario is that there are documents in the index which are almost
similar (people s...
...the top N results, quite frequently, the same document comes up multiple
times. I want to remove those duplicate (or possible duplicate) documents.
Very similar to what Google does when they say "In order to show you the most
relevant results, duplicates have been removed". How can I achieve this?
It would be nice if the documentation mentioned this. :)
/Tim
The StreamingUpdateSolrServer does not support binary format,
unfortunately.
Erik
I'm using StreamingUpdateSolrServer to index a document.

StreamingUpdateSolrServer server =
    new StreamingUpdateSolrServer("http://localhost:8983/solr/core0", 20, 4);
server.setRequestWriter(new BinaryRequestWriter());
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "12121212");
On 21-Nov-07, at 12:29 AM, climbingrose wrote:
The problem with this approach is that an MD5 hash is very sensitive: a one-letter
difference will generate a completely different hash. You probably have
to roll your own near-duplication detection algorithm.
My advice is to have a look at the existing literature on near dup detection,
so you should be able to get one for free!
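Worth noting for later readers: the deduplication support that eventually landed in Solr includes a fuzzier signature class ported from Nutch, TextProfileSignature, which tolerates small edits, unlike MD5. In a SignatureUpdateProcessorFactory chain (such as the dedupe sketch earlier in this page) it is selected with:

  <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>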
Otis,
Thanks for your response.
I just gave a quick look at the Nutch forum and found that there is an
implementation to obtain de-duplicated documents/pages, but none for near-
duplicate documents. Can you guide me a little further as to where exactly
under Nutch I should be concentrating, regarding near duplicate documents?
Regards,
Rishabh
On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> Is there any idea of implementing that feature in the upcoming releases?
Not currently. Feel free to contribute something if you find a good
solution.
-Mike
Eswar K wrote:
We have a scenario where we want to find documents which are similar in
content. To elaborate a little more on what we mean here, let's take an
example.
The example of this email chain, in which we are interacting, can best be
used to illustrate the concept of near dupes...
Is there any idea of implementing that feature in the upcoming releases?
Regards,
Eswar
...to search for other similar documents based on the results of
another query.
ryan
Can anyone help me?
Rishabh
Hi,
I am evaluating "Solr 1.2" for my project and wanted to know if it can
return near-duplicate documents (near dups) and how I would go about it. I am
not sure, but is "MoreLikeThisHandler" the implementation for near dups?
Rishabh
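A minimal MoreLikeThis request of the kind being asked about, as a sketch; the field name "content" and the document id are assumed, and note that mlt finds similar documents rather than filtering duplicates out:

  /select?q=id:123&mlt=true&mlt.fl=content&mlt.mintf=1&mlt.mindf=1

mlt.fl names the field to compare on, and the mintf/mindf thresholds control which terms count as "interesting" for the similarity match.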
On Nov 7, 2007 12:30 PM, realw5 <[EMAIL PROTECTED]> wrote:
> We did have Tomcat crash once (JVM OutOfMem) during an indexing process,
> could that be a possible source of the issue?
Yes.
Deletes are buffered and carried out in a different phase.
-Yonik
We did have Tomcat crash once (JVM OutOfMem) during an indexing process -
could that be a possible source of the issue?
Dan
: Hey all, I have a fairly odd case of duplicate documents in our solr index
: (See attached xml sample). The index is roughly 35k documents. The only
How did you index those documents?
Any chance you inadvertently set the "allowDups=true" attribute when
sending them to Solr?
Have you edited schema.xml since building a full index from scratch? If
so, try rebuilding the index.
People often get the behavior you describe if the 'id' is a 'text' field.
ryan
Hey all, I have a fairly odd case of duplicate documents in our solr index
(see attached xml sample). The index is roughly 35k documents. The only
way I've found to fix the problem is to run a delete statement by id, which
deletes both; I can then re-index that one document. This happens...