Unfortunately I have a lot of duplicates, and given that searching might
suffer, I will try implementing an update processor.

But your idea is interesting and I will consider it, thanks.

Best Regards
Alexander Aristov


On 28 December 2011 19:12, Tanguy Moal <tanguy.m...@gmail.com> wrote:

> Hello Alexander,
>
> I don't know much about your requirements in terms of size and
> performance, but I've had a similar use case and found a pretty simple
> workaround.
> If your duplicate rate is not too high, you can have the
> SignatureProcessor generate fingerprints of documents (you already did
> that).
>
> Simply turn off overwriting of duplicates; you can then rely on solr's
> grouping / field collapsing to group your search results by fingerprints.
> You'll then have one document group per "real" document. You can use
> group.sort to sort your groups by indexing date ascending, and
> group.limit=1 to keep only the oldest one.
> You can even use group.format=simple to serve results as if no
> collapsing occurred, and use group.ngroups (/!\ could be expensive /!\) to
> get the real number of deduplicated documents.
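>
> For illustration, a minimal SolrJ sketch of such a grouped query (the
> field names "signature" and "index_date" and the server URL are
> assumptions to adapt to your schema):
>
> import org.apache.solr.client.solrj.SolrQuery;
> import org.apache.solr.client.solrj.SolrServer;
> import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
>
> public class GroupedDedupeQuery {
>   public static void main(String[] args) throws Exception {
>     SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
>     SolrQuery query = new SolrQuery("*:*");
>     query.set("group", true);
>     query.set("group.field", "signature");      // the dedupe fingerprint field
>     query.set("group.sort", "index_date asc");  // oldest document sorts first
>     query.set("group.limit", 1);                // keep only the oldest per group
>     query.set("group.format", "simple");        // flat results, as if no collapsing
>     query.set("group.ngroups", true);           // /!\ could be expensive /!\
>     System.out.println(server.query(query));
>   }
> }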
>
> Of course the index will be larger; as I said, I made no assumptions
> regarding your operating requirements. And search can be a bit slower,
> depending on the average rate of duplicated documents.
> But you've got your issue addressed by configuration tuning only...
> Depending on your project's sizing, it could be time-saving.
>
> The advantage is that you have the precious information of what content is
> duplicated from where :-)
>
> Hope this helps,
>
> --
> Tanguy
>
> On 28/12/2011 15:45, Alexander Aristov wrote:
>
>> Thanks Erick,
>>
>> it gives me direction. I will write a new plugin and will get back to the
>> dev forum with results, and then we will decide next steps.
>>
>> Best Regards
>> Alexander Aristov
>>
>>
>> On 28 December 2011 18:08, Erick Erickson <erickerickson@gmail.com> wrote:
>>
>>> Well, the short answer is that nobody else has
>>> 1>  had a similar requirement
>>> AND
>>> 2>  not found a suitable work around
>>> AND
>>> 3>  implemented the change and contributed it back.
>>>
>>> So, if you'd like to volunteer<G>.....
>>>
>>> Seriously. If you think this would be valuable and are
>>> willing to work on it, hop on over to the dev list and
>>> discuss it, open a JIRA and make it work. I'd start
>>> by opening a discussion on the dev list before
>>> opening a JIRA, just to get a sense of where the
>>> snags would be to changing the Solr code, but that's
>>> optional.
>>>
>>> That said, writing your own update request processor
>>> that detects this case isn't very difficult:
>>> extend UpdateRequestProcessorFactory/UpdateRequestProcessor
>>> and use it as a plugin.
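>>>
>>> A rough sketch of the idea (untested; the class name and the "id"
>>> uniqueKey field are illustrative assumptions):
>>>
>>> import java.io.IOException;
>>> import org.apache.lucene.index.Term;
>>> import org.apache.solr.request.SolrQueryRequest;
>>> import org.apache.solr.response.SolrQueryResponse;
>>> import org.apache.solr.update.AddUpdateCommand;
>>> import org.apache.solr.update.processor.UpdateRequestProcessor;
>>> import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
>>>
>>> public class SkipExistingFactory extends UpdateRequestProcessorFactory {
>>>   @Override
>>>   public UpdateRequestProcessor getInstance(final SolrQueryRequest req,
>>>       SolrQueryResponse rsp, UpdateRequestProcessor next) {
>>>     return new UpdateRequestProcessor(next) {
>>>       @Override
>>>       public void processAdd(AddUpdateCommand cmd) throws IOException {
>>>         String id = (String) cmd.getSolrInputDocument().getFieldValue("id");
>>>         // Bypass the rest of the chain if this id is already indexed.
>>>         // Caveat: adds from the same batch that aren't committed yet
>>>         // won't be seen by this check.
>>>         if (id != null
>>>             && req.getSearcher().getFirstMatch(new Term("id", id)) != -1) {
>>>           return;
>>>         }
>>>         super.processAdd(cmd);
>>>       }
>>>     };
>>>   }
>>> }
>>>
>>> You'd then register the factory in an updateRequestProcessorChain in
>>> solrconfig.xml.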
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
>>> <alexander.aris...@gmail.com>  wrote:
>>>
>>>> the problem with dedupe (SignatureUpdateProcessor) is that it REPLACES
>>>> old docs. I have tried it already.
>>>>
>>>> Best Regards
>>>> Alexander Aristov
>>>>
>>>>
>>>> On 28 December 2011 13:04, Lance Norskog <goks...@gmail.com> wrote:
>>>>
>>>>> The SignatureUpdateProcessor is for exactly this problem:
>>>>>
>>>>> http://wiki.apache.org/solr/Deduplication
>>>
>>>>> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov <alexander.aris...@gmail.com> wrote:
>>>>>
>>>>>> I get docs from external sources and the only place I keep them is the
>>>>>> solr index. I have no database or other means to track indexed docs (my
>>>>>> personal opinion is that it might be a huge headache).
>>>>>>
>>>>>> Some docs might change slightly in their original sources, but I don't
>>>>>> need those changes. In fact I need the original data only.
>>>>>>
>>>>>> So I have no other way but to either check whether a document is
>>>>>> already in the index before I put it into the solrj array (read: query
>>>>>> solr), or develop my own update chain processor, implement an ID check
>>>>>> there, and skip such docs.
>>>
>>>>>> Maybe it's the wrong place to argue, and probably it's been discussed
>>>>>> before, but I wonder why simply using the overwrite parameter doesn't
>>>>>> work here.
>>>>>>
>>>>>> In my opinion it suits perfectly here. In combination with the unique ID
>>>>>> it can cover all possible variants.
>>>>>>
>>>>>> cases:
>>>>>>
>>>>>> 1. overwrite=true and uniqueID exists: the newer doc should overwrite
>>>>>> the old one.
>>>>>>
>>>>>> 2. overwrite=false and uniqueID exists: the newer doc must be skipped
>>>>>> since the old one exists.
>>>>>>
>>>>>> 3. uniqueID doesn't exist: the newer doc just gets added, regardless of
>>>>>> whether an old one exists or not.
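>>>>>>
>>>>>> For reference, a minimal solrj sketch of how I pass overwrite=false
>>>>>> (assuming the UpdateRequest.setParam API; the document is just an
>>>>>> example). Today this yields a duplicate instead of the case 2 skip:
>>>>>>
>>>>>> import org.apache.solr.client.solrj.SolrServer;
>>>>>> import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
>>>>>> import org.apache.solr.client.solrj.request.UpdateRequest;
>>>>>> import org.apache.solr.common.SolrInputDocument;
>>>>>> import org.apache.solr.common.params.UpdateParams;
>>>>>>
>>>>>> public class OverwriteFalseAdd {
>>>>>>   public static void main(String[] args) throws Exception {
>>>>>>     SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
>>>>>>     SolrInputDocument doc = new SolrInputDocument();
>>>>>>     doc.addField("id", "42");
>>>>>>     UpdateRequest update = new UpdateRequest();
>>>>>>     update.setParam(UpdateParams.OVERWRITE, "false"); // skip delete of old doc
>>>>>>     update.add(doc);
>>>>>>     update.process(server);
>>>>>>     // Adding id 42 again now creates a duplicate rather than skipping it.
>>>>>>   }
>>>>>> }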
>>>>>>
>>>>>>
>>>>>> Best Regards
>>>>>> Alexander Aristov
>>>>>>
>>>>>>
>>>>>> On 27 December 2011 22:53, Erick Erickson <erickerickson@gmail.com> wrote:
>>>>>
>>>>>>> Mikhail is right as far as I know: the assumption built into Solr is
>>>>>>> that duplicate IDs (when <uniqueKey> is defined) should trigger the old
>>>>>>> document to be replaced.
>>>>>>>
>>>>>>> What is your system-of-record? By that I mean, what does your SolrJ
>>>>>>> program do to send data to Solr? Is there any way you could just
>>>>>>> *not* send documents that are already in the Solr index based on,
>>>>>>> for instance, any timestamp associated with your system-of-record
>>>>>>> and the last time you did an incremental index?
>>>>>>>
>>>>>>> Best
>>>>>>> Erick
>>>>>>>
>>>>>>> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
>>>>>>> <alexander.aris...@gmail.com>  wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I am not using a database. All needed data is in the solr index;
>>>>>>>> that's why I want to skip excessive checks.
>>>>>>>>
>>>>>>>> I will check DIH, but I'm not sure it helps.
>>>>>>>>
>>>>>>>> I am fluent in Java and it's not a problem for me to write a class or
>>>>>>>> so, but I want to check first whether there are any workarounds to
>>>>>>>> make it work without coding, just by playing around with configuration
>>>>>>>> and params. I don't want to go away from the default solr
>>>>>>>> implementation.
>>>>>>>>
>>>>>>>> Best Regards
>>>>>>>> Alexander Aristov
>>>>>>>>
>>>>>>>>
>>>>>>>> On 27 December 2011 09:33, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <alexander.aris...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi people,
>>>>>>>>>>
>>>>>>>>>> I urgently need your help!
>>>>>>>>>>
>>>>>>>>>> I have solr 3.3 configured and running. I do incremental indexing 4
>>>>>>>>>> times a day using bulk updates. Some documents are identical to some
>>>>>>>>>> extent, and I wish to skip them, not index them.
>>>>>>>>>> But here is the problem: I could not find a way to tell solr to
>>>>>>>>>> ignore new duplicate docs and keep the old indexed docs. I don't
>>>>>>>>>> care that it's new. Just determine by ID that such a document is
>>>>>>>>>> already in the index, and that's it.
>>>>>>>
>>>>>>>>>> I use solrj for indexing. I have tried setting overwrite=false and
>>>>>>>>>> the dedupe approach, but nothing helped me. Either a newer doc
>>>>>>>>>> overwrites the old one or I get a duplicate.
>>>>>>>>>>
>>>>>>>>>> I think it's a very simple and basic feature and it must exist.
>>>>>>>>>> What did I do wrong, or fail to do?
>>>>>>>>>>
>>>>>>>>> I guess because the mainstream approach is delta-import, where you
>>>>>>>>> have "updated" timestamps in your DB and a "last-import" timestamp
>>>>>>>>> stored somewhere. You can check how it works in DIH.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Tried google, but I couldn't find a solution there although many
>>>>>>>>>> people have encountered this problem.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> It can definitely be done by overriding
>>>>>>>>> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I
>>>>>>>>> suggest starting by implementing your own
>>>>>>>>> http://wiki.apache.org/solr/UpdateRequestProcessor - search for the
>>>>>>>>> PK, and bypass the chain call if it's found. Then, if you meet
>>>>>>>>> performance issues querying your PKs one by one (but only after
>>>>>>>>> that), you can batch your searches; there are a couple of
>>>>>>>>> optimization techniques for huge disjunction queries like
>>>>>>>>> PK:(2 OR 4 OR 5 OR 6).
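>>>>>>>>>
>>>>>>>>> A hedged sketch of such a batched check (assuming "id" is the PK
>>>>>>>>> field; ClientUtils escaping guards the query against special
>>>>>>>>> characters):
>>>>>>>>>
>>>>>>>>> import java.util.Arrays;
>>>>>>>>> import java.util.HashSet;
>>>>>>>>> import java.util.Iterator;
>>>>>>>>> import java.util.List;
>>>>>>>>> import java.util.Set;
>>>>>>>>> import org.apache.solr.client.solrj.SolrQuery;
>>>>>>>>> import org.apache.solr.client.solrj.SolrServer;
>>>>>>>>> import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
>>>>>>>>> import org.apache.solr.client.solrj.util.ClientUtils;
>>>>>>>>> import org.apache.solr.common.SolrDocument;
>>>>>>>>>
>>>>>>>>> public class BatchedPkCheck {
>>>>>>>>>   public static void main(String[] args) throws Exception {
>>>>>>>>>     SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
>>>>>>>>>     List<String> batchIds = Arrays.asList("2", "4", "5", "6");
>>>>>>>>>
>>>>>>>>>     // One disjunction query for the whole batch: id:(2 OR 4 OR 5 OR 6)
>>>>>>>>>     StringBuilder q = new StringBuilder("id:(");
>>>>>>>>>     for (Iterator<String> it = batchIds.iterator(); it.hasNext();) {
>>>>>>>>>       q.append(ClientUtils.escapeQueryChars(it.next()));
>>>>>>>>>       if (it.hasNext()) q.append(" OR ");
>>>>>>>>>     }
>>>>>>>>>     q.append(")");
>>>>>>>>>
>>>>>>>>>     SolrQuery query = new SolrQuery(q.toString());
>>>>>>>>>     query.setFields("id");
>>>>>>>>>     query.setRows(batchIds.size());
>>>>>>>>>
>>>>>>>>>     Set<String> existing = new HashSet<String>();
>>>>>>>>>     for (SolrDocument d : server.query(query).getResults()) {
>>>>>>>>>       existing.add((String) d.getFieldValue("id"));
>>>>>>>>>     }
>>>>>>>>>     // Only documents whose id is not in 'existing' need indexing.
>>>>>>>>>   }
>>>>>>>>> }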
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I am starting to consider that I must query the index to check
>>>>>>>>>> whether a doc to be added is already in the index, and not add it to
>>>>>>>>>> the array, but I have so many docs that I am afraid it's not a good
>>>>>>>>>> solution.
>>>>>>>>>>
>>>>>>>>>> Best Regards
>>>>>>>>>> Alexander Aristov
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sincerely yours
>>>>>>>>> Mikhail Khludnev
>>>>>>>>> Lucid Certified
>>>>>>>>> Apache Lucene/Solr Developer
>>>>>>>>> Grid Dynamics
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>> --
>>>>> Lance Norskog
>>>>> goks...@gmail.com
>>>>>
>>>>>
>
